This notebook works with the Taxonomic Information Registry (TIR). It picks up scientific names (currently only those submitted by the SGCN process), queries the ITIS Solr service for matches, and caches a few specific properties in a key/value store. Most of the functions that were originally defined inline here have been moved into modules in the bis package. The notes below still describe what the process does, but they now refer to those modularized functions.

### Clean the scientific name string for use in searches (bis.bis.cleanScientificName(name string))
Note: I moved this function into the bis.bis module.

This is one of the trickier areas of the process. People encode a lot of different signals into scientific names. If we clean too much out of the name string, we run the risk of not finding the taxon that they intended to provide. If we clean too little, the search won't find anything. So far, for the SGCN case, we've decided to do the following in this function for the purposes of finding the taxon in ITIS:

* Ignore population designations
* Ignore strings after an "spp." designation
* Set case for what appear to be species name strings so that the genus is capitalized and everything else is lowercase
* Ignore text in between parentheses and brackets; these are often synonyms or alternate names that should be picked up from the ITIS record if we find a match

One thing that I deliberately did not do here was change cases where the name string includes signals like "sp." or "sp. 4". These seem to indicate that the genus is known but the species is not yet determined. Rather than strip this text and run the query, potentially resulting in a genus match, I opted to leave those strings in place, likely resulting in no match with ITIS. We may end up making a different design decision for the SGCN case and allow for matching to genus.
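
As a rough illustration of the rules listed above, here is a minimal sketch. It is not the code in bis.bis: the function name `clean_scientific_name_sketch` is hypothetical, and the exact patterns (for example, treating "pop. 1" as a population designation and dropping the "spp." token itself) are my assumptions.

```python
import re

def clean_scientific_name_sketch(name):
    """Hypothetical illustration of the cleaning rules; not the bis.bis code."""
    # Drop text inside parentheses and brackets (often synonyms or alternate names)
    cleaned = re.sub(r"\(.*?\)|\[.*?\]", " ", name)
    # Drop population designations such as "pop. 1" (assumed pattern)
    cleaned = re.sub(r"\bpop\.\s*\S*", " ", cleaned)
    # Ignore everything from an "spp." designation onward (assumed to include the token)
    cleaned = re.split(r"\bspp\.", cleaned)[0]
    # Collapse whitespace and bail out if nothing is left
    words = cleaned.split()
    if not words:
        return ""
    # Capitalize the genus and lowercase everything else
    return " ".join([words[0].capitalize()] + [w.lower() for w in words[1:]])

print(clean_scientific_name_sketch("Hyla wrightorum (Huachuca-Canelo Hills DPS)"))
print(clean_scientific_name_sketch("Arsapnia [=Capnia] arapahoe"))
```

Note that a string like "Catostomus sp." passes through this sketch unchanged, in line with the design decision described above.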

### Package up the specific attributes we want to cache from ITIS (bis.itis.packageITISPairs(ITIS Solr JSON Doc))
Note: I moved this and other ITIS functions into the bis.itis module.

This function takes the data coming from the ITIS service as JSON and pairs up the attributes and values we want to cache and use. The date/time stamp here for when the information is cached is vital metadata for determining usability. As soon as the information comes out of ITIS, it is potentially stale. The information we collect and use from ITIS through this process includes the following:
* Discovered and accepted TSNs for the taxon
* Taxonomic rank of the discovered taxon
* Names with and without indicators for the discovered taxon
* Taxonomic hierarchy with ranks (in the ITIS Solr service, this is always the accepted taxonomic hierarchy)
* Vernacular names for the discovered taxon
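
To make the shape of the cached value concrete, here is a minimal sketch of building the `"key"=>"value"` pair string. It is not the bis.itis implementation: the function name `package_pairs_sketch`, the `now` parameter, and the simplified input dictionary (with `hierarchy` and `vernacular` already split into tuples) are assumptions for illustration. The real ITIS Solr document would need its own parsing first.

```python
from datetime import datetime

def package_pairs_sketch(match_method, doc=None, now=None):
    """Hypothetical sketch: build a '"key"=>"value",...' string for caching."""
    # The cache date is recorded first because it determines how usable the rest is
    cache_date = (now or datetime.now()).isoformat()
    pairs = [("cacheDate", cache_date), ("itisMatchMethod", match_method)]
    if doc:
        # Simple attributes copied straight across when present
        for key in ("tsn", "rank", "nameWInd", "nameWOInd", "usage"):
            if key in doc:
                pairs.append((key, doc[key]))
        # Assumed pre-parsed inputs: [(rank, name), ...] and [(language, name), ...]
        pairs.extend(doc.get("hierarchy", []))
        pairs.extend(("vernacular:" + lang, name) for lang, name in doc.get("vernacular", []))
    # A list of pairs (not a dict) keeps repeated keys such as two English vernaculars
    return ",".join('"{}"=>"{}"'.format(k, str(v).replace('"', '\\"')) for k, v in pairs)

example = {
    "tsn": "207283", "rank": "Species", "usage": "valid",
    "hierarchy": [("Genus", "Hyla"), ("Species", "Hyla wrightorum")],
    "vernacular": [("English", "Arizona Treefrog"), ("English", "Mountain Treefrog")],
}
print(package_pairs_sketch("Exact Match:", example))
```

Passing no document (or `0`, as the main loop does for its default) yields only the cache date and match method, which is what gets stored for unmatched names.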

### Run the process for all supplied names
The main process run below should eventually be the substance of a microservice on name matching. I set this up to create a local data structure (dictionary) for each record. The main point here is to set up the search, execute the search and package ITIS results, and then submit those for the record back to the Taxonomic Information Registry.

One of the major things we still need to work out here is what to do with updates over time. This script writes a record into the ITIS pairs whether we find a match or not. The query that gets names to run from the registration property looks for cases where the ITIS information is null (mostly because I needed to pick up where I left off when a few remaining issues caused the script to fail). We can then use cases where the "matchMethod" is "Not Matched" to go back and see if we can find name matches. This will particularly be the case where we find more than one match on a fuzzy search, which I still haven't dealt with.

We also need to figure out what to do when we want to update the information over time. With ITIS, once we have a matched TSN, we can then use that TSN to grab updates as they occur, including changes in taxonomy. But we need to figure out if we should change the structure of the TIR cache to keep all the old versions of what we found over time so that it can always be referred back to.
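
One possible shape for keeping old versions, sketched under my own assumptions rather than as a decided design: hold a list of pair strings per record and append a new one only when something other than the cache date has changed. The names `parse_pairs` and `append_if_changed` are hypothetical, and the parsing assumes values contain no embedded double quotes.

```python
import re

def parse_pairs(pairs_string):
    """Split a '"key"=>"value",...' string into a list of (key, value) tuples."""
    return re.findall(r'"([^"]*)"=>"([^"]*)"', pairs_string)

def append_if_changed(history, new_pairs):
    """Append new_pairs unless it matches the latest entry apart from cacheDate."""
    def content(pairs_string):
        return [(k, v) for k, v in parse_pairs(pairs_string) if k != "cacheDate"]
    if not history or content(history[-1]) != content(new_pairs):
        history.append(new_pairs)
    return history

history = []
append_if_changed(history, '"cacheDate"=>"2017-06-22T14:10:58.350254","itisMatchMethod"=>"Not Matched"')
append_if_changed(history, '"cacheDate"=>"2017-07-01T00:00:00","itisMatchMethod"=>"Not Matched"')
print(len(history))
```

The trade-off is that a re-check that finds nothing new leaves no trace of when it ran, so a separate "last checked" date might be needed alongside the history.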

In [5]:
import re
import requests
from IPython.display import display
from bis import bis
from bis import itis
from bis import tir
from bis2 import gc2

In [6]:
# Set up the actions/targets for this particular instance
thisRun = {}
thisRun["instance"] = "DataDistillery"
thisRun["db"] = "BCB"
thisRun["baseURL"] = gc2.sqlAPI(thisRun["instance"],thisRun["db"])

### Get data to process
Right now, this just gets everything registered in the TIR where there is no ITIS information. Eventually, we need to figure out a time threshold after which we should re-run ITIS processing, and we'll then need to figure out how to deal with older information that other things may have come to depend on.
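
A time threshold check could look something like the sketch below, which reads the `cacheDate` out of a cached pair string and compares its age to a limit. The function name `is_stale`, the 90-day default, and the `now` parameter are assumptions for illustration; no threshold has actually been chosen.

```python
import re
from datetime import datetime, timedelta

def is_stale(itis_pairs, max_age_days=90, now=None):
    """Hypothetical check: True when the cached ITIS pairs are missing or too old."""
    if not itis_pairs:
        return True
    match = re.search(r'"cacheDate"=>"([^"]+)"', itis_pairs)
    if match is None:
        return True
    # cacheDate is written in ISO format, e.g. 2017-06-22T14:10:58.350254
    cached = datetime.fromisoformat(match.group(1))
    return (now or datetime.now()) - cached > timedelta(days=max_age_days)

pairs = '"cacheDate"=>"2017-06-22T14:10:58.350254","itisMatchMethod"=>"Not Matched"'
print(is_stale(pairs, now=datetime(2017, 7, 1)))
print(is_stale(pairs, now=datetime(2018, 1, 1)))
```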

In [9]:
q_tirRecords = "SELECT id, \
    registration->'source' AS source, \
    registration->'followTaxonomy' AS followtaxonomy, \
    registration->'taxonomicLookupProperty' AS taxonomiclookupproperty, \
    registration->'scientificname' AS scientificname, \
    registration->'tsn' AS tsn \
    FROM tir.tir LIMIT 10"
#    WHERE tir.itis IS NULL"
tirRecords  = requests.get(thisRun["baseURL"]+"&q="+q_tirRecords).json()

In [10]:
# fuzzyLevel is not defined in any cell shown here, so the fuzzy search below would raise a
# NameError that the except clause hides. The value is an assumption (a Solr fuzzy-search
# suffix appended to the query); adjust it to the tolerance you want.
fuzzyLevel = "~0.5"

for feature in tirRecords["features"]:
    # Set up a local data structure for storage and processing
    thisRecord = {}
    
    # Set data from query results
    thisRecord["id"] = feature["properties"]["id"]
    thisRecord["source"] = feature["properties"]["source"]
    thisRecord["followTaxonomy"] = feature["properties"]["followtaxonomy"]
    thisRecord["taxonomicLookupProperty"] = feature["properties"]["taxonomiclookupproperty"]
    thisRecord["tsn"] = feature["properties"]["tsn"]
    thisRecord["scientificname"] = feature["properties"]["scientificname"]
    thisRecord["scientificname_search"] = bis.cleanScientificName(thisRecord["scientificname"])
    
    # Set defaults for thisRecord
    thisRecord["matchMethod"] = "Not Matched"
    thisRecord["matchString"] = thisRecord["scientificname_search"]
    thisRecord["itisPairs"] = itis.packageITISPairs(thisRecord["matchMethod"],0)
    
    if thisRecord["taxonomicLookupProperty"] == "scientificname" and len(thisRecord["scientificname_search"]) != 0:
        
        # The ITIS Solr service does not fail in an elegant way, and so we need to try this whole section and except it out if the query fails
        try:
            thisRecord["itisSearchURL"] = itis.getITISSearchURL(thisRecord["scientificname_search"])

            # Try an exact match search
            itisSearchResults = requests.get(thisRecord["itisSearchURL"]).json()
            thisRecord["numResults"] = len(itisSearchResults["response"]["docs"])

            # If we got only a single match on an exact match search, set the method and proceed
            if thisRecord["numResults"] == 1:
                thisRecord["matchMethod"] = "Exact Match:"

            # If we found nothing on an exact match search, try a fuzzy match
            elif thisRecord["numResults"] == 0:
                itisSearchResults = requests.get(thisRecord["itisSearchURL"]+fuzzyLevel).json()
                thisRecord["numResults"] = len(itisSearchResults["response"]["docs"])
                if thisRecord["numResults"] == 1:
                    thisRecord["matchMethod"] = "Fuzzy Match"

            # If there are results from exact or fuzzy match search, package the ITIS properties we want
            if thisRecord["numResults"] == 1:
                thisRecord["itisPairs"] = itis.packageITISPairs(thisRecord["matchMethod"],itisSearchResults["response"]["docs"][0])

            # Handle cases where discovered TSN usage is invalid by following accepted TSN.
            # Usage must be neither "valid" nor "accepted" (and, not or), and followTaxonomy
            # comes back from the registry as the string 'true', so compare case-insensitively.
            if thisRecord["itisPairs"].find('"usage"=>"valid"') == -1 and thisRecord["itisPairs"].find('"usage"=>"accepted"') == -1 and thisRecord["itisPairs"].find('"acceptedTSN"') > 0 and str(thisRecord["followTaxonomy"]).lower() == "true":
                thisRecord["acceptedTSN"] = re.search(r'"acceptedTSN"=>"(.+?)"', thisRecord["itisPairs"]).group(1)
                thisRecord["discoveredTSN"] = re.search(r'"tsn"=>"(.+?)"', thisRecord["itisPairs"]).group(1)

                thisRecord["itisSearchURL"] = itis.getITISSearchURL(thisRecord["acceptedTSN"])
                itisSearchResults = requests.get(thisRecord["itisSearchURL"]).json()
                thisRecord["numResults"] = len(itisSearchResults["response"]["docs"])
                if thisRecord["numResults"] == 1:
                    thisRecord["matchMethod"] = "Followed Accepted TSN"
                    thisRecord["itisPairs"] = itis.packageITISPairs(thisRecord["matchMethod"],itisSearchResults["response"]["docs"][0])
                    thisRecord["itisPairs"] = thisRecord["itisPairs"]+',"discoveredTSN"=>"'+thisRecord["discoveredTSN"]+'"'
            
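        except Exception as err:
            # Keep the "Not Matched" defaults, but record why the lookup failed instead of hiding it
            thisRecord["error"] = str(err)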
    
    elif thisRecord["taxonomicLookupProperty"] == "tsn" and thisRecord["tsn"] is not None:
        
        thisRecord["itisSearchURL"] = itis.getITISSearchURL(thisRecord["tsn"])

    display(thisRecord)
    print(tir.cacheToTIR(thisRun["baseURL"],thisRecord["id"],"itis",thisRecord["itisPairs"]))

        

{'followTaxonomy': 'true',
 'id': 944,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:10:57.434422","itisMatchMethod"=>"Exact Match:","createDate"=>"1996-06-13 14:51:08","updateDate"=>"2005-09-28 00:00:00","tsn"=>"177878","rank"=>"Species","nameWInd"=>"Otus flammeolus","nameWOInd"=>"Otus flammeolus","usage"=>"valid","Kingdom"=>"Animalia","Subkingdom"=>"Bilateria","Infrakingdom"=>"Deuterostomia","Phylum"=>"Chordata","Subphylum"=>"Vertebrata","Infraphylum"=>"Gnathostomata","Superclass"=>"Tetrapoda","Class"=>"Aves","Order"=>"Strigiformes","Family"=>"Strigidae","Subfamily"=>"Striginae","Genus"=>"Otus","Species"=>"Otus flammeolus","vernacular:English"=>"Flammulated Owl","vernacular:Spanish"=>"Tecolote ojo oscuro"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Otus\\%20flammeolus',
 'matchMethod': 'Exact Match:',
 'matchString': 'Otus flammeolus',
 'numResults': 1,
 'scientificname': 'Otus flammeolus',
 'scientificname_searc

{'followTaxonomy': 'true',
 'id': 649,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:10:57.902132","itisMatchMethod"=>"Exact Match:","createDate"=>"1996-06-13 14:51:08","updateDate"=>"2010-10-05 00:00:00","tsn"=>"163893","rank"=>"Genus","nameWInd"=>"Catostomus","nameWOInd"=>"Catostomus","usage"=>"valid","Kingdom"=>"Animalia","Subkingdom"=>"Bilateria","Infrakingdom"=>"Deuterostomia","Phylum"=>"Chordata","Subphylum"=>"Vertebrata","Infraphylum"=>"Gnathostomata","Superclass"=>"Actinopterygii","Class"=>"Teleostei","Superorder"=>"Ostariophysi","Order"=>"Cypriniformes","Superfamily"=>"Cobitoidea","Family"=>"Catostomidae","Subfamily"=>"Catostominae","Tribe"=>"Catostomini","Genus"=>"Catostomus","vernacular:English"=>"common suckers"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Catostomus',
 'matchMethod': 'Exact Match:',
 'matchString': 'Catostomus',
 'numResults': 1,
 'scientificname': 'Catostomus sp.',
 'scientificname_sea

{'followTaxonomy': 'true',
 'id': 1141,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:10:58.348265","itisMatchMethod"=>"Exact Match:","createDate"=>"1996-06-13 14:51:08","updateDate"=>"2009-09-01 00:00:00","tsn"=>"206991","rank"=>"Species","nameWInd"=>"Spea intermontana","nameWOInd"=>"Spea intermontana","usage"=>"valid","Kingdom"=>"Animalia","Subkingdom"=>"Bilateria","Infrakingdom"=>"Deuterostomia","Phylum"=>"Chordata","Subphylum"=>"Vertebrata","Infraphylum"=>"Gnathostomata","Superclass"=>"Tetrapoda","Class"=>"Amphibia","Order"=>"Anura","Family"=>"Scaphiopodidae","Genus"=>"Spea","Species"=>"Spea intermontana","vernacular:English"=>"Great Basin Spadefoot"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Spea\\%20intermontana',
 'matchMethod': 'Exact Match:',
 'matchString': 'Spea intermontana',
 'numResults': 1,
 'scientificname': 'Spea intermontana',
 'scientificname_search': 'Spea intermontana',
 'source': 'SGCN',
 'ta

{'followTaxonomy': 'true',
 'id': 1039,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:10:58.350254","itisMatchMethod"=>"Not Matched"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Rana\\%20pipiens',
 'matchMethod': 'Not Matched',
 'matchString': 'Rana pipiens',
 'numResults': 0,
 'scientificname': 'Rana pipiens',
 'scientificname_search': 'Rana pipiens',
 'source': 'SGCN',
 'taxonomicLookupProperty': 'scientificname',
 'tsn': None}

{'followTaxonomy': 'true',
 'id': 865,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:10:59.149627","itisMatchMethod"=>"Exact Match:","createDate"=>"1996-06-13 14:51:08","updateDate"=>"2015-10-28 00:00:00","tsn"=>"179223","rank"=>"Species","nameWInd"=>"Leucosticte australis","nameWOInd"=>"Leucosticte australis","usage"=>"valid","Kingdom"=>"Animalia","Subkingdom"=>"Bilateria","Infrakingdom"=>"Deuterostomia","Phylum"=>"Chordata","Subphylum"=>"Vertebrata","Infraphylum"=>"Gnathostomata","Superclass"=>"Tetrapoda","Class"=>"Aves","Order"=>"Passeriformes","Family"=>"Fringillidae","Genus"=>"Leucosticte","Species"=>"Leucosticte australis","vernacular:English"=>"Brown-capped Rosy Finch","vernacular:English"=>"Brown-capped Rosy-Finch"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Leucosticte\\%20australis',
 'matchMethod': 'Exact Match:',
 'matchString': 'Leucosticte australis',
 'numResults': 1,
 'scientificname': 'Leucosticte 

{'followTaxonomy': 'true',
 'id': 868,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:10:59.773288","itisMatchMethod"=>"Exact Match:","createDate"=>"2009-09-01 16:05:19","updateDate"=>"2009-09-01 00:00:00","tsn"=>"775108","rank"=>"Species","nameWInd"=>"Lithobates pipiens","nameWOInd"=>"Lithobates pipiens","usage"=>"valid","Kingdom"=>"Animalia","Subkingdom"=>"Bilateria","Infrakingdom"=>"Deuterostomia","Phylum"=>"Chordata","Subphylum"=>"Vertebrata","Infraphylum"=>"Gnathostomata","Superclass"=>"Tetrapoda","Class"=>"Amphibia","Order"=>"Anura","Family"=>"Ranidae","Genus"=>"Lithobates","Species"=>"Lithobates pipiens","vernacular:English"=>"Northern Leopard Frog"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Lithobates\\%20pipiens',
 'matchMethod': 'Exact Match:',
 'matchString': 'Lithobates pipiens',
 'numResults': 1,
 'scientificname': 'Lithobates pipiens',
 'scientificname_search': 'Lithobates pipiens',
 'source': 'SGCN',

{'followTaxonomy': 'true',
 'id': 827,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:11:00.613755","itisMatchMethod"=>"Exact Match:","createDate"=>"1996-06-13 14:51:08","updateDate"=>"2009-09-01 00:00:00","tsn"=>"207283","rank"=>"Species","nameWInd"=>"Hyla wrightorum","nameWOInd"=>"Hyla wrightorum","usage"=>"valid","Kingdom"=>"Animalia","Subkingdom"=>"Bilateria","Infrakingdom"=>"Deuterostomia","Phylum"=>"Chordata","Subphylum"=>"Vertebrata","Infraphylum"=>"Gnathostomata","Superclass"=>"Tetrapoda","Class"=>"Amphibia","Order"=>"Anura","Family"=>"Hylidae","Subfamily"=>"Hylinae","Genus"=>"Hyla","Species"=>"Hyla wrightorum","vernacular:English"=>"Arizona Treefrog","vernacular:English"=>"Mountain Treefrog"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Hyla\\%20wrightorum',
 'matchMethod': 'Exact Match:',
 'matchString': 'Hyla wrightorum',
 'numResults': 1,
 'scientificname': 'Hyla wrightorum (Huachuca-Canelo Hills DPS)',
 '

{'followTaxonomy': 'true',
 'id': 574,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:11:00.616131","itisMatchMethod"=>"Not Matched"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Arsapnia\\%20arapahoe',
 'matchMethod': 'Not Matched',
 'matchString': 'Arsapnia arapahoe',
 'numResults': 0,
 'scientificname': 'Arsapnia [=Capnia] arapahoe',
 'scientificname_search': 'Arsapnia arapahoe',
 'source': 'SGCN',
 'taxonomicLookupProperty': 'scientificname',
 'tsn': None}

{'followTaxonomy': 'true',
 'id': 998,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:11:01.220065","itisMatchMethod"=>"Not Matched"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Plestiodon\\%20gilberti\\%20arizonensis',
 'matchMethod': 'Not Matched',
 'matchString': 'Plestiodon gilberti arizonensis',
 'numResults': 0,
 'scientificname': 'Plestiodon gilberti arizonensis',
 'scientificname_search': 'Plestiodon gilberti arizonensis',
 'source': 'SGCN',
 'taxonomicLookupProperty': 'scientificname',
 'tsn': None}

{'followTaxonomy': 'true',
 'id': 585,
 'itisPairs': '"cacheDate"=>"2017-06-22T14:11:01.605050","itisMatchMethod"=>"Not Matched"',
 'itisSearchURL': 'http://services.itis.gov/?wt=json&rows=10&q=(usage:accepted%20OR%20usage:valid)%20AND%20nameWOInd:Aspidoscelis\\%20burti',
 'matchMethod': 'Not Matched',
 'matchString': 'Aspidoscelis burti',
 'numResults': 0,
 'scientificname': 'Aspidoscelis burti',
 'scientificname_search': 'Aspidoscelis burti',
 'source': 'SGCN',
 'taxonomicLookupProperty': 'scientificname',
 'tsn': None}