Several common questions come up when reporting basic numbers for the final list. This notebook explores ways to answer those questions using our intermediate SGCN summary, which includes the results of taxonomic authority consultation. Pandas grouping is particularly useful in this context.

In [1]:
import pandas as pd

In [2]:
sgcn_summary = pd.read_csv('sgcn_taxonomy_check.csv', low_memory=False)

Based on the taxonomic lookup process, we end up with final identified taxa at various levels of the taxonomic hierarchy. We record that detail in a taxonomic_rank property retrieved from the matching document in ITIS or WoRMS. In many cases, we want to report only on taxa identified at the species level, which we do in subsequent steps, but we should look at the distribution of the data across ranks first.

In [3]:
for rank, group in sgcn_summary.groupby("taxonomic_rank"):
    print(rank, len(group))

Class 4
Family 199
Form 1
Genus 312
Order 31
Phylum 4
Species 15560
Subclass 5
Subfamily 5
Suborder 3
Subspecies 1525
Variety 526
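
The same tally can also be produced without an explicit loop. The following is a minimal sketch using `value_counts`; `toy_summary` is a small made-up stand-in for `sgcn_summary`, and its values are illustrative only, not real SGCN data.

```python
import pandas as pd

# Hypothetical stand-in for sgcn_summary; values are illustrative, not real data
toy_summary = pd.DataFrame({
    "taxonomic_rank": ["Species", "Species", "Genus", "Subspecies", "Species", None],
})

# value_counts() tallies rows per rank, sorted by count (descending) by default;
# sort_index() puts the ranks back in the alphabetical order that groupby produces
rank_counts = toy_summary["taxonomic_rank"].value_counts().sort_index()
print(rank_counts)

# Like groupby, value_counts() skips missing ranks unless dropna=False is passed
rank_counts_with_missing = toy_summary["taxonomic_rank"].value_counts(dropna=False)
print(rank_counts_with_missing.sum())
```

Passing `dropna=False` is a quick way to check whether any records lack a rank and would otherwise be silently left out of the tally.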


We may also want to limit our exploration to species included in the latest reporting period, 2015. This code block sets up a new dataframe filtered to species-rank records that have a value in the 2015 column.

In [4]:
matched_species = sgcn_summary.loc[(sgcn_summary["taxonomic_rank"] == "Species") & (sgcn_summary["2015"].notnull())]
print(len(matched_species))

12202
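
To see how much each condition contributes to the filter, the two boolean masks can be built separately before combining them. This is a sketch on made-up data: `toy_summary` and the `"Y"` placeholder values in its `2015` column are assumptions, since the actual contents of that column are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for sgcn_summary; the "Y" markers are placeholders only
toy_summary = pd.DataFrame({
    "lookup_name": ["Aus bus", "Cus dus", "Eus", "Fus gus"],
    "taxonomic_rank": ["Species", "Species", "Genus", "Species"],
    "2015": ["Y", None, "Y", "Y"],
})

# Build each condition as its own boolean Series
is_species = toy_summary["taxonomic_rank"] == "Species"
in_2015 = toy_summary["2015"].notnull()

# Rows kept by each condition alone, then by both together
print(is_species.sum(), in_2015.sum())
toy_matched = toy_summary.loc[is_species & in_2015]
print(len(toy_matched))
```

Naming the masks makes it easy to report how many records were dropped by rank versus by reporting period.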


Now we can look at how the species that were successfully aligned with taxonomic authorities (aka the National List) are distributed across the high-level taxonomic groups, which are assigned by mapping logical groups to higher-level taxonomy.

In [5]:
for tax_group, group in matched_species.groupby("taxonomic_group"):
    print(tax_group, len(group))

Amphibians 289
Birds 772
Fish 1195
Mammals 414
Mollusks 1447
Other 19
Other Invertebrates 3932
Plants 3812
Reptiles 322
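
For reporting, it can help to show each group's share alongside its raw count. The sketch below assumes a made-up `toy_matched` dataframe in place of `matched_species`; the `report` name and its column labels are illustrative choices.

```python
import pandas as pd

# Hypothetical stand-in for matched_species; values are illustrative, not real data
toy_matched = pd.DataFrame({
    "taxonomic_group": ["Birds", "Plants", "Plants", "Fish",
                        "Plants", "Birds", "Fish", "Plants"],
})

# Raw counts per group
group_counts = toy_matched["taxonomic_group"].value_counts()

# normalize=True returns fractions of the total; convert to rounded percentages
group_share = (
    toy_matched["taxonomic_group"].value_counts(normalize=True).mul(100).round(1)
)

# Combine into one table; the two Series align on the group name
report = pd.DataFrame({"count": group_counts, "percent": group_share})
print(report)
```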


We might also want to look further at what happened in the taxonomic matching process. We generated a field in the processing metadata that captures the overall method used in matching a submitted name string to a taxon identifier.

* Exact Match - the submitted name matched exactly one valid ("accepted" in the case of ITIS plants) taxon
* Fuzzy Match - the original submitted name was misspelled in some way, but we were able to find it with a fuzzy search
* Followed Accepted TSN or Followed Valid AphiaID - the original submitted name string matched a taxon that is no longer considered valid, and our process followed the taxonomic reference to retrieve a valid taxon for use
* Found multiple matches - our search on the submitted name string found multiple matches for the name (often homonyms), but only a single valid taxon was available to give us an acceptable match

In [6]:
for match_method, group in matched_species.groupby("match_method"):
    print(match_method, len(group))

Exact Match 10823
Followed Accepted TSN 855
Followed Valid AphiaID 74
Found multiple matches 59
Fuzzy Match 391
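
The match methods can also be broken down by taxonomic group in a single table. This is a sketch using `pd.crosstab` on a made-up `toy_matched` dataframe; the rows are illustrative and do not reflect the real distribution.

```python
import pandas as pd

# Hypothetical stand-in for matched_species; values are illustrative, not real data
toy_matched = pd.DataFrame({
    "taxonomic_group": ["Birds", "Birds", "Plants", "Plants", "Fish"],
    "match_method": ["Exact Match", "Fuzzy Match", "Exact Match",
                     "Followed Accepted TSN", "Exact Match"],
})

# Rows are taxonomic groups, columns are match methods, cells are record counts;
# combinations that never occur are filled with 0
method_by_group = pd.crosstab(
    toy_matched["taxonomic_group"], toy_matched["match_method"]
)
print(method_by_group)
```

A table like this would show, for example, whether fuzzy matches are concentrated in particular groups.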


If we really want to dig into the details, we can pull just those cases where the submitted name string differs from the final valid scientific name we matched in the taxonomic authority. This code block outputs a subset dataframe with just the pertinent columns.

In [7]:
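# Flag rows where the submitted name differs from the matched valid name
name_changed = matched_species["lookup_name"] != matched_species["valid_scientific_name"]
matched_species.loc[name_changed, ["lookup_name", "valid_scientific_name", "match_method"]]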

index,lookup_name,valid_scientific_name,match_method
1,Abacion tessalatum,Abacion tesselatum,Fuzzy Match
24,Acabaria bicolor,Melithaea bicolor,Followed Valid AphiaID
31,Acalypta lillianus,Acalypta lillianis,Fuzzy Match
108,Acipenser oxyrhynchus,Acipenser oxyrinchus,Followed Accepted TSN
131,Acris crepitans blanchardi,Acris blanchardi,Followed Accepted TSN
...,...,...,...
19325,Zoanthus kealakekuaensi,Zoanthus kealakekuaensis,Fuzzy Match
19352,Zygonopus krekeleri,Trichopetalum krekeleri,Followed Valid AphiaID
19353,Zygonopus packardi,Trichopetalum packardi,Followed Valid AphiaID
19354,Zygonopus weyeriensis,Trichopetalum weyeriensis,Followed Valid AphiaID
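
A quick follow-up is to count how many of these name changes came from each match method. The sketch below uses a made-up `toy_matched` dataframe: three rows borrow names from the output above, and the unchanged "Exact Match" row is an assumption added for contrast.

```python
import pandas as pd

# Hypothetical stand-in for matched_species; the third row is invented for contrast
toy_matched = pd.DataFrame({
    "lookup_name": ["Abacion tessalatum", "Acabaria bicolor",
                    "Acris blanchardi", "Zygonopus packardi"],
    "valid_scientific_name": ["Abacion tesselatum", "Melithaea bicolor",
                              "Acris blanchardi", "Trichopetalum packardi"],
    "match_method": ["Fuzzy Match", "Followed Valid AphiaID",
                     "Exact Match", "Followed Valid AphiaID"],
})

# Flag rows where the submitted name differs from the matched valid name
name_changed = toy_matched["lookup_name"] != toy_matched["valid_scientific_name"]

# Tally only the changed rows by the method that produced the match
changed_by_method = toy_matched.loc[name_changed, "match_method"].value_counts()
print(changed_by_method)
```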
