A major focus of the SGCN species effort is to assemble a single synthesized National List: the species that we can determine are held in common across the States and Territories. We determine commonality by aligning the submitted taxon names with taxonomic authorities and identifying the valid or accepted taxon identifiers from that matching process. With that information assembled by the ITIS and WoRMS processes documented in other notebooks, we can now combine everything into a full list of unique names and start working with the data in various ways.

This notebook combines the taxonomic lookup information from ITIS and WoRMS, summarizes the submitted source data, and merges everything into one simplified table. We include both matched and unmatched names so that we can examine both cases. For each unique name, the summarization creates lists of submitting states for each reporting period (2005 and 2015) along with lists of the submitted scientific and common names.

In [1]:
import pandas as pd
import json
from sciencebasepy import SbSession
import requests
from IPython.display import display

The original submitted lists generally included some type of higher-level taxonomic group or guild categorization to help in reviewing the lists. For the final National List, we derive this grouping from the taxonomic hierarchy, using a mapping that is housed in a file for the SGCN collection. Here, we retrieve that mapping, set it up for use, and provide a function that returns the group for a given hierarchy.

In [2]:
sb = SbSession()

sgcn_base_item = sb.get_item('56d720ece4b015c306f442d5')

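tax_group_mappings_file = next((f["url"] for f in sgcn_base_item["files"] if f["title"] == "Taxonomic Group Mappings"), None)
# Default to an empty mapping so tax_group() does not raise a NameError
# when the mappings file is missing from the ScienceBase item
tax_group_mappings = list()
if tax_group_mappings_file is not None:
    tax_group_mappings = json.loads(requests.get(tax_group_mappings_file).content)
available_group_ranks = list(set([i["rank"] for i in tax_group_mappings]))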

def tax_group(hierarchy):
    for name in hierarchy:
        group_name = next((n["sgcntaxonomicgroup"] for n in tax_group_mappings if n["name"] == name), None)
        if group_name is not None:
            return group_name
    return None
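
To illustrate how the group lookup behaves, here is a minimal sketch that runs the same first-match logic against a small set of mapping records. The field names ("rank", "name", "sgcntaxonomicgroup") come from the code above, but the records themselves and the helper name `tax_group_from` are made up for illustration and are not the contents of the real mappings file.

```python
# Hypothetical mapping records; the real file's contents are not shown here.
example_mappings = [
    {"rank": "Class", "name": "Aves", "sgcntaxonomicgroup": "Birds"},
    {"rank": "Class", "name": "Diplopoda", "sgcntaxonomicgroup": "Other Invertebrates"},
]

def tax_group_from(hierarchy, mappings):
    """Return the group for the first name in the hierarchy found in the mappings."""
    # Build a name -> group dictionary once instead of scanning the list per name
    lookup = {m["name"]: m["sgcntaxonomicgroup"] for m in mappings}
    for name in hierarchy:
        if name in lookup:
            return lookup[name]
    return None

example_group = tax_group_from(["Animalia", "Chordata", "Aves"], example_mappings)
unmapped_group = tax_group_from(["Plantae"], example_mappings)
print(example_group, unmapped_group)
```

A hierarchy with no mapped name returns None, which is why some matched records can end up without a taxonomic group.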

Here we get the previously built files containing the results of looking up names in ITIS and WoRMS. Ultimately, these local cached files will be replaced with a data structure maintained in our cyberinfrastructure that is being continuously updated through checking for changes in source systems and processing new taxon names.

In [3]:
with open('itis.json', 'r') as f:
    itis_data = json.load(f)

with open('worms.json', 'r') as f:
    worms_data = json.load(f)

In this codeblock, we work through all of the cached taxonomic authority data to assemble the valid documents from either ITIS or WoRMS where we were able to make a match. ITIS takes precedence: a WoRMS record is used only when the same lookup name has no valid or accepted ITIS match. Given how we are building the data, the result is essentially the National List, because these are exactly the taxa from our original submissions that we were able to find in ITIS or WoRMS.

In [4]:
taxed_spp = list()
for record in [i for i in itis_data if "data" in i.keys()]:
    itis_doc = next((i for i in record["data"] if i["usage"] in ["valid", "accepted"]), None)
    if itis_doc is not None:
        t_info = {
            "lookup_name": record["parameters"]["Scientific Name"],
            "taxonomic_identifier": f"itis:{itis_doc['tsn']}",
            "taxonomic_rank": itis_doc["rank"],
            "taxonomic_group": tax_group([t["name"] for t in itis_doc["biological_taxonomy"] if t["rank"] in available_group_ranks]),
            "date_processed": record["processing_metadata"]["date_processed"],
            "match_method": record["processing_metadata"]["status_message"],
            "valid_scientific_name": itis_doc["nameWInd"]
        }
        for t in itis_doc["biological_taxonomy"]:
            t_info[t["rank"].lower()] = t["name"]
        taxed_spp.append(t_info)
        
for record in [i for i in worms_data if "data" in i.keys()]:
    check_itis = next((i for i in taxed_spp if i["lookup_name"] == record["parameters"]["Scientific Name"]), None)
    if check_itis is None:
        worms_doc = next((i for i in record["data"] if i["status"] == "accepted"), None)
        if worms_doc is not None:
            t_info = {
                "lookup_name": record["parameters"]["Scientific Name"],
                "taxonomic_identifier": f"worms:{worms_doc['AphiaID']}",
                "taxonomic_rank": worms_doc["rank"],
                "taxonomic_group": tax_group([t["name"] for t in worms_doc["biological_taxonomy"] if t["rank"].capitalize() in available_group_ranks]),
                "date_processed": record["processing_metadata"]["date_processed"],
                "match_method": record["processing_metadata"]["status_message"],
                "valid_scientific_name": worms_doc["scientificname"]
            }
            for t in worms_doc["biological_taxonomy"]:
                t_info[t["rank"].lower()] = t["name"]
            taxed_spp.append(t_info)

d_taxed_spp = pd.DataFrame(taxed_spp)
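
The ITIS-first precedence used above can be sketched in isolation. This is an illustrative simplification, not the notebook's code: it uses a set of already-matched names for the membership check rather than scanning the list for every WoRMS record, and the record contents (including the placeholder identifier `worms:EXAMPLE`) are assumptions made for the example.

```python
# Toy records standing in for matched ITIS results
itis_matches = [
    {"lookup_name": "Abacion tessalatum", "taxonomic_identifier": "itis:570281"},
]

# Toy WoRMS candidates; the first duplicates a name ITIS already matched
worms_candidates = [
    {"lookup_name": "Abacion tessalatum", "taxonomic_identifier": "worms:EXAMPLE"},
    {"lookup_name": "Abacion wilhelminae", "taxonomic_identifier": "worms:944219"},
]

# Names already resolved by ITIS; set membership is a constant-time check
already_matched = {r["lookup_name"] for r in itis_matches}

# Keep WoRMS records only for names that ITIS did not resolve
combined = itis_matches + [
    r for r in worms_candidates if r["lookup_name"] not in already_matched
]

combined_ids = [r["taxonomic_identifier"] for r in combined]
print(combined_ids)
```

Each lookup name ends up with exactly one identifier, so the later merge on "lookup_name" does not duplicate rows.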

In this codeblock, we get our original source data and summarize it according to the unique names that we cleaned up previously for lookup purposes. This gives us a full list of all submitted names that we can merge together with those taxa that we were able to find in ITIS or WoRMS. We summarize fields for the submitted scientific and common names as well as the States and Territories that submitted those names for each reporting period (2005 and 2015).

In [5]:
sgcn_species = pd.read_csv('sgcn_source_data.csv')

sgcn_list = list()
for name, group in sgcn_species.groupby("clean_scientific_name"):
    sgcn_record = {
        "lookup_name": name,
        "submitted_scientific_names": ",".join(list(set(group["scientific name"].to_list()))),
        # Keep only string values so that missing (NaN) common names are skipped
        "submitted_common_names": ",".join(list(set(group[group['common name'].apply(lambda x: isinstance(x, str))]["common name"].to_list())))
    }
    for year, year_group in group.groupby("year"):
        sgcn_record[year] = ",".join(year_group["state"].to_list())
    sgcn_list.append(sgcn_record)

d_sgcn_list = pd.DataFrame(sgcn_list)
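
As a small illustration of the summarization, the sketch below applies the same group-by-name, then group-by-year pattern to a few toy rows. The column names match the source data used above, the rows are illustrative only, and the helper `join_unique` is an assumption of this example. Unlike the cell above, it de-duplicates and sorts the state names.

```python
import pandas as pd

# Toy rows using the same column names as the source data
toy = pd.DataFrame({
    "clean_scientific_name": ["Abacion tessalatum", "Abacion tessalatum", "Abagrotis barnesi"],
    "scientific name": ["Abacion tessalatum", "Abacion tessalatum", "Abagrotis barnesi"],
    "common name": ["A millipede", None, "A noctuid moth"],
    "state": ["Virginia", "Virginia", "New York"],
    "year": [2005, 2015, 2005],
})

def join_unique(values):
    """Join the distinct string values in sorted order, skipping missing values."""
    return ",".join(sorted({v for v in values if isinstance(v, str)}))

rows = []
for name, group in toy.groupby("clean_scientific_name"):
    row = {
        "lookup_name": name,
        "submitted_common_names": join_unique(group["common name"]),
    }
    # One column per reporting period, holding the submitting states
    for year, year_group in group.groupby("year"):
        row[year] = join_unique(year_group["state"])
    rows.append(row)

summary = pd.DataFrame(rows)
print(summary)
```

A name that was not submitted in a given reporting period has no key for that year, so pandas fills that column with a missing value.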

Now that we have everything together from the taxonomic lookup process and the original submissions all keyed on a "lookup_name" property (the cleaned scientific name string from the original processing step), we can merge the data together into a single dataframe for further use. This dataframe will have all unique names submitted, with fields from the original submissions summarized in a simple way for ease of use.

In [6]:
final_data = pd.merge(
    left=d_sgcn_list, 
    right=d_taxed_spp, 
    on="lookup_name", 
    left_index=False, 
    right_index=False,
    how="outer"
)
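
One way to see how many names were matched is to ask pandas to record where each merged row came from. The sketch below uses the `indicator=True` option of `pd.merge` on two toy dataframes; the dataframe names and rows are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the summarized submissions and the matched taxa
submitted = pd.DataFrame({"lookup_name": ["A noctuid moth", "Abacion tessalatum"]})
matched = pd.DataFrame({
    "lookup_name": ["Abacion tessalatum"],
    "taxonomic_identifier": ["itis:570281"],
})

# indicator=True adds a "_merge" column: "left_only", "right_only", or "both"
checked = pd.merge(submitted, matched, on="lookup_name", how="outer", indicator=True)

# "both" means a taxonomic match was found; "left_only" means unmatched
match_counts = checked["_merge"].value_counts().to_dict()
print(match_counts)
```

Because every matched taxon originates from a submitted name, a nonzero "right_only" count would suggest a mismatch in the lookup keys.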

We can take a look at the final data to see what's included.

In [7]:
final_data

Unnamed: 0,lookup_name,submitted_scientific_names,submitted_common_names,2015,2005,taxonomic_identifier,taxonomic_rank,taxonomic_group,date_processed,match_method,...,suborder,infraorder,variety,infraphylum,superclass,subspecies,subgenus,section,subsection,form
0,A noctuid moth,A Noctuid Moth,Zale sp. 1 nr. lunifera,New Hampshire,,,,,,,...,,,,,,,,,,
1,Abacion tessalatum,Abacion tessalatum,A millipede,Virginia,Virginia,itis:570281,Species,Other Invertebrates,2019-09-18T03:14:25.343291,Fuzzy Match,...,,,,,,,,,,
2,Abacion wilhelminae,Abacion wilhelminae,"millipede,Millipede",Arkansas,Arkansas,worms:944219,Species,Other Invertebrates,2019-09-18T03:42:48.063554,Exact Match,...,,,,,,,,,,
3,Abagrotis barnesi,Abagrotis barnesi,A noctuid moth,,New York,itis:771360,Species,Other Invertebrates,2019-09-18T03:14:25.494657,Followed Accepted TSN,...,,,,,,,,,,
4,Abagrotis brunneipennis,Abagrotis brunneipennis,Yankee Dart,Pennsylvania,,itis:771341,Species,Other Invertebrates,2019-09-18T03:14:25.492554,Exact Match,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19351,Zygonemertes virescens,Zygonemertes virescens,No common name,,South Carolina,itis:57554,Species,Other Invertebrates,2019-09-18T03:32:02.668774,Exact Match,...,Monostilifera,,,,,,,,,
19352,Zygonopus krekeleri,Zygonopus krekeleri,West Virginia Blind Cave Millipede,West Virginia,,worms:945049,Species,Other Invertebrates,2019-09-18T03:45:39.409706,Followed Valid AphiaID,...,,,,,,,,,,
19353,Zygonopus packardi,Zygonopus packardi,Packard's Blind Cave Millipede,West Virginia,,worms:945048,Species,Other Invertebrates,2019-09-18T03:45:39.416734,Followed Valid AphiaID,...,,,,,,,,,,
19354,Zygonopus weyeriensis,Zygonopus weyeriensis,Grand Caverns Blind Cave Millipede,West Virginia,,worms:945050,Species,Other Invertebrates,2019-09-18T03:45:39.454942,Followed Valid AphiaID,...,,,,,,,,,,


It's useful to output two of the dataframes to CSV for future reference. Summarizing properties for each unique name from the original source submissions takes some time, so we save that result here in case we want to skip the step in the future. The final set of summarized submissions with matched taxonomy is the core dataset that we can now explore in various ways.

In [8]:
d_sgcn_list.to_csv('summarized_sgcn_species.csv', index=False)
final_data.to_csv('sgcn_taxonomy_check.csv', index=False)