I'm experimenting here with a new, smaller-increment method of working through the TIR registrations for WoRMS matching. A while loop keeps working on WoRMS matching for as long as the query returns a record to work on, so the process runs until nothing is left. I also threw in a safeguard, a cap on the total number of records to process, and included it in the loop check so the loop doesn't run away on us.

This might be a little less efficient than grabbing a whole batch of records that meet some criteria and then processing all of them. Each pass through the loop has to execute the following three API interactions:

1) Get a single TIR registration that does not currently have any WoRMS data and has been processed by ITIS (so we have an alternative name to work against)
2) Check the TIR registered name against the WoRMS REST service
3) Insert the resulting WoRMS information into the TIR (the code below also writes a "Not Matched" result, so the same record is not selected again)

However, that is only a single additional API interaction per record, and it allows us to kick off this process at any time and let it run until there's nothing left to do. That seems like it might be conducive to processing on the Kafka/microservices architecture.
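
The one-record-at-a-time pattern described above can be sketched independently of the TIR and WoRMS services. This is only an illustration under assumptions: `process_until_done`, `fetch_next`, `process`, and `store` are hypothetical names that do not exist in the `bis` packages, and the in-memory list stands in for the database table.

```python
from typing import Callable, Optional


def process_until_done(
    fetch_next: Callable[[], Optional[dict]],
    process: Callable[[dict], dict],
    store: Callable[[dict], None],
    max_records: int,
) -> int:
    """Handle one record per pass until none remain or max_records is reached."""
    processed = 0
    while processed < max_records:
        record = fetch_next()      # interaction 1: get a single unprocessed record
        if record is None:
            break                  # nothing left to work on
        result = process(record)   # interaction 2: look the name up
        store(result)              # interaction 3: write the result back
        processed += 1
    return processed


# Usage with an in-memory stand-in for the table (hypothetical data).
pending = [{"id": 1}, {"id": 2}, {"id": 3}]
stored = []


def fetch_next() -> Optional[dict]:
    return pending[0] if pending else None


def store(result: dict) -> None:
    stored.append(result)
    pending.pop(0)  # storing a result takes the record out of the pending pool


count = process_until_done(
    fetch_next,
    lambda record: {**record, "matchMethod": "Not Matched"},
    store,
    max_records=2,
)
```

Storing a result for every record, matched or not, is what lets the next fetch return a different record; without it the loop would keep selecting the same row until the cap stopped it.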

In [1]:
import requests,re
from IPython.display import display
from bis import worms
from bis import bis
from bis import tir
from bis2 import gc2

In [2]:
# Set up the actions/targets for this particular instance
thisRun = {}
thisRun["instance"] = "DataDistillery"
thisRun["db"] = "BCB"
thisRun["baseURL"] = gc2.sqlAPI(thisRun["instance"],thisRun["db"])
thisRun["commitToDB"] = True
thisRun["totalRecordsToProcess"] = 5
thisRun["totalRecordsProcessed"] = 0

In [3]:
numberWithoutTIRData = 1

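# Note: with <= the loop runs once more after the counter reaches totalRecordsToProcess,
# so up to totalRecordsToProcess + 1 records are handled (6 with the setting of 5 above)
while numberWithoutTIRData == 1 and thisRun["totalRecordsProcessed"] <= thisRun["totalRecordsToProcess"]: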
    q_recordToSearch = "SELECT id, \
        registration->'scientificname' AS scientificname, \
        itis->'nameWInd' AS itisNameWInd \
        FROM tir.tir \
        WHERE worms IS NULL \
        AND itis IS NOT NULL \
        LIMIT 1"
    recordToSearch = requests.get(thisRun["baseURL"]+"&q="+q_recordToSearch).json()
    
    numberWithoutTIRData = len(recordToSearch["features"])

    if numberWithoutTIRData == 1:
        tirRecord = recordToSearch["features"][0]

        # Set up a local data structure for storage and processing
        thisRecord = {}
        
        # Set data from query results
        thisRecord["id"] = tirRecord["properties"]["id"]
        thisRecord["scientificname_submitted"] = tirRecord["properties"]["scientificname"]
        thisRecord["scientificname_search"] = bis.cleanScientificName(thisRecord["scientificname_submitted"])
        thisRecord["itisNameWInd"] = tirRecord["properties"]["itisnamewind"]

        # Set defaults for thisRecord
        thisRecord["matchMethod"] = "Not Matched"
        thisRecord["matchString"] = thisRecord["scientificname_search"]
        wormsData = 0

        # Handle the cases where there is enough interesting stuff in the scientific name string that it comes back blank from the cleaners
        if len(thisRecord["scientificname_search"]) != 0:
            try:
                wormsSearchResults = requests.get("http://www.marinespecies.org/rest/AphiaRecordsByName/"+thisRecord["scientificname_search"]+"?like=false&marine_only=false&offset=1").json()
                thisRecord["matchMethod"] = "Exact Match"
                wormsData = wormsSearchResults[0]
            except Exception:
                try:
                    wormsSearchResults = requests.get("http://www.marinespecies.org/rest/AphiaRecordsByName/"+thisRecord["scientificname_search"]+"?like=true&marine_only=false&offset=1").json()
                    thisRecord["matchMethod"] = "Fuzzy Match"
                    wormsData = wormsSearchResults[0]
                except Exception:
                    if thisRecord["itisNameWInd"] is not None and thisRecord["itisNameWInd"] != thisRecord["scientificname_search"]:
                        try:
                            wormsSearchResults = requests.get("http://www.marinespecies.org/rest/AphiaRecordsByName/"+thisRecord["itisNameWInd"]+"?like=false&marine_only=false&offset=1").json()
                            thisRecord["matchMethod"] = "ITIS Name Match"
                            wormsData = wormsSearchResults[0]
                        except Exception:
                            pass

        thisRecord["wormsPairs"] = worms.packageWoRMSPairs(thisRecord["matchMethod"],wormsData)
        display (thisRecord)
        if thisRun["commitToDB"]:
            print (tir.cacheToTIR(thisRun["baseURL"],thisRecord["id"],"worms",thisRecord["wormsPairs"]))
        thisRun["totalRecordsProcessed"] = thisRun["totalRecordsProcessed"] + 1


{'id': 26041,
 'itisNameWInd': 'Erigeron strigosus var. calcicola',
 'matchMethod': 'Not Matched',
 'matchString': 'Erigeron strigosus var. calcicola',
 'scientificname_search': 'Erigeron strigosus var. calcicola',
 'scientificname_submitted': 'Erigeron strigosus var. calcicola',
 'wormsPairs': '"cacheDate"=>"2017-06-27T13:57:09.826488","wormsMatchMethod"=>"Not Matched"'}

{'success': True, 'auth_check': {'session': None, 'success': True, 'auth_level': None}, '_execution_time': 0.079, 'affected_rows': 1}


{'id': 28617,
 'itisNameWInd': None,
 'matchMethod': 'Not Matched',
 'matchString': 'Laevicephalus vannus',
 'scientificname_search': 'Laevicephalus vannus',
 'scientificname_submitted': 'Laevicephalus vannus',
 'wormsPairs': '"cacheDate"=>"2017-06-27T13:57:11.700979","wormsMatchMethod"=>"Not Matched"'}

{'success': True, 'auth_check': {'session': None, 'success': True, 'auth_level': None}, '_execution_time': 0.065, 'affected_rows': 1}


{'id': 26042,
 'itisNameWInd': 'Erigeron subtrinervis',
 'matchMethod': 'Not Matched',
 'matchString': 'Erigeron subtrinervis',
 'scientificname_search': 'Erigeron subtrinervis',
 'scientificname_submitted': 'Erigeron subtrinervis',
 'wormsPairs': '"cacheDate"=>"2017-06-27T13:57:13.376283","wormsMatchMethod"=>"Not Matched"'}

{'success': True, 'auth_check': {'session': None, 'success': True, 'auth_level': None}, '_execution_time': 0.065, 'affected_rows': 1}


{'id': 28640,
 'itisNameWInd': None,
 'matchMethod': 'Not Matched',
 'matchString': 'Lampetra kessleri',
 'scientificname_search': 'Lampetra kessleri',
 'scientificname_submitted': 'Lampetra kessleri',
 'wormsPairs': '"cacheDate"=>"2017-06-27T13:57:15.055133","wormsMatchMethod"=>"Not Matched"'}

{'success': True, 'auth_check': {'session': None, 'success': True, 'auth_level': None}, '_execution_time': 0.066, 'affected_rows': 1}


{'id': 26043,
 'itisNameWInd': 'Erigeron vetensis',
 'matchMethod': 'Not Matched',
 'matchString': 'Erigeron vetensis',
 'scientificname_search': 'Erigeron vetensis',
 'scientificname_submitted': 'Erigeron vetensis',
 'wormsPairs': '"cacheDate"=>"2017-06-27T13:57:16.674628","wormsMatchMethod"=>"Not Matched"'}

{'success': True, 'auth_check': {'session': None, 'success': True, 'auth_level': None}, '_execution_time': 0.08, 'affected_rows': 1}


{'id': 26477,
 'itisNameWInd': 'Eupseudomorpha brillians',
 'matchMethod': 'Not Matched',
 'matchString': 'Eupseudomorpha brillians',
 'scientificname_search': 'Eupseudomorpha brillians',
 'scientificname_submitted': 'Eupseudomorpha brillians',
 'wormsPairs': '"cacheDate"=>"2017-06-27T13:57:18.464775","wormsMatchMethod"=>"Not Matched"'}

{'success': True, 'auth_check': {'session': None, 'success': True, 'auth_level': None}, '_execution_time': 0.064, 'affected_rows': 1}
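
The nested try/except cascade in cell In [3] (exact match, then fuzzy match, then ITIS name) could also be written as a flat list of attempts. The sketch below is one possible arrangement, not the code that produced the output above. `build_worms_url`, `match_name`, and the `lookup` callable are hypothetical names; `lookup` is assumed to take a URL and return the parsed JSON list, or `None` when there is no usable response. The sketch returns `None` for an unmatched record instead of the `0` placeholder used in the cell, and it URL-encodes the name with `urllib.parse.quote`.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import quote

WORMS_BASE = "http://www.marinespecies.org/rest/AphiaRecordsByName/"


def build_worms_url(name: str, like: bool) -> str:
    """Build the AphiaRecordsByName URL used in the notebook, with the name encoded."""
    like_flag = "true" if like else "false"
    return f"{WORMS_BASE}{quote(name)}?like={like_flag}&marine_only=false&offset=1"


def match_name(
    search_name: str,
    itis_name: Optional[str],
    lookup: Callable[[str], Optional[List[dict]]],
) -> Tuple[str, Optional[dict]]:
    """Try exact, fuzzy, then ITIS-name lookups; return (matchMethod, first record)."""
    if not search_name:
        # Mirrors the notebook: a blank cleaned name skips every lookup
        return "Not Matched", None

    attempts = [
        ("Exact Match", search_name, False),
        ("Fuzzy Match", search_name, True),
    ]
    if itis_name is not None and itis_name != search_name:
        attempts.append(("ITIS Name Match", itis_name, False))

    for method, name, like in attempts:
        results = lookup(build_worms_url(name, like))
        if results:
            return method, results[0]
    return "Not Matched", None


# Usage with a fake lookup (made-up response) that only answers fuzzy queries.
def fake_lookup(url: str) -> Optional[List[dict]]:
    return [{"example": True}] if "like=true" in url else None


method, record = match_name("Lampetra kessleri", None, fake_lookup)
```

Passing the lookup in as a function keeps the cascade testable without network access; a real lookup would wrap `requests.get(url)` and decide how to treat empty or non-JSON responses.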
