# Get ITIS information and cache it in tircache

This notebook prototypes a microservice approach to retrieving information from the Integrated Taxonomic Information System (ITIS) and storing it in the Taxonomic Information Registry Cache (tircache). Eventually, this needs to be wrapped as a set of microservices on the Kafka queue. I tried to break things up into logical functions that should translate fairly closely to the microservices architecture we are working to build out.

In [22]:
import configparser
import datetime
import requests

In [23]:
# Set defaults
targetData = "sgcn"
dt = datetime.datetime.utcnow().isoformat()
itisGetCommonNamesJSONBaseURL = "https://www.itis.gov/ITISWebService/jsonservice/getCommonNamesFromTSN?tsn="

In [24]:
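# Get API keys and any other config details from a file that is external to the code.
config = configparser.RawConfigParser()
# Open the config file in a with block so the file handle is closed after reading
with open(r'/Users/sky/Documents/configs/stuff.py') as configFile:
    config.read_file(configFile)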

In [25]:
# Build base URL with API key using input from the external config.
def getBaseURL():
    gc2APIKey = config.get('apiKeys','apiKey_GC2_BCB').replace('"','')
    apiBaseURL = "https://gc2.mapcentia.com/api/v1/sql/bcb?key="+gc2APIKey
    return apiBaseURL

In [26]:
# Basic function to insert subject ID, property, and value into tircache
def insertTupleInTirCache(subjectid,prop,value):
    # Build query string (callers must escape any single quotes in the values)
    insertSQL = "INSERT INTO tircache (subjectid,property,value) VALUES ('"+subjectid+"','"+prop+"','"+value+"')"
    # Execute query; passing the SQL through params lets requests URL-encode it
    r = requests.get(getBaseURL(), params={"q": insertSQL})
    return r
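
Because the INSERT statement is assembled by string concatenation, a single quote in any value would break the query. The following is a minimal sketch of one way to centralize the escaping. `sqlQuote` and `buildInsertSQL` are hypothetical helper names that do not exist in this notebook, and doubling single quotes is only the standard SQL string-literal escape, not a replacement for parameterized queries.

```python
def sqlQuote(text):
    """Return text as a single-quoted SQL string literal, doubling embedded quotes."""
    return "'" + str(text).replace("'", "''") + "'"


def buildInsertSQL(subjectid, prop, value):
    """Assemble the tircache INSERT statement from unescaped input values."""
    values = ",".join(sqlQuote(v) for v in (subjectid, prop, value))
    return "INSERT INTO tircache (subjectid,property,value) VALUES (" + values + ")"


# Made-up values for illustration only
exampleSQL = buildInsertSQL("http://services.itis.gov/?q=tsn:12345",
                            "ITIS:BiologicalSpecies:CommonName:English",
                            "example's name")
```

With a helper like this, callers would pass unescaped values, so the manual `replace("'", "''")` step applied to common names later in the notebook would have to be dropped to avoid escaping twice.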

The function below retrieves a batch of data to process. Eventually, this should probably be a periodic or triggered maintenance script that is kicked off in the message queue. Right now, it only checks whether the tircache (the data table containing properties and values tied to identifiers) already has a record check for the IDs it is about to process. In the future, we should build a checkpoint into the system that also looks at the datetime stamp recorded when the data was cached and reruns the process to refresh stale entries. This could be triggered by another process that checks the source for changes.

The query currently retrieves at most 1,000 records per run (the `LIMIT` clause). I capped the number for demonstration purposes here in the notebook, and we'll see about setting this up as a microservice somewhere to run in the background.
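
The checkpoint idea described above could start from a simple age test on the cached timestamp. The sketch below assumes the cached value is the ISO 8601 string that `datetime.datetime.utcnow().isoformat()` produces in this notebook, and that Python 3.7 or later is available for `fromisoformat`. The function name `needsRecache` and the 30-day default are hypothetical choices for illustration.

```python
import datetime


def needsRecache(cachedTimestamp, now=None, maxAgeDays=30):
    """Return True when a cached ISO 8601 timestamp is older than maxAgeDays."""
    if now is None:
        # Match the naive UTC timestamps written by the notebook
        now = datetime.datetime.utcnow()
    cachedAt = datetime.datetime.fromisoformat(cachedTimestamp)
    return (now - cachedAt) > datetime.timedelta(days=maxAgeDays)
```

A maintenance process could apply this test to the value stored with each `ITIS:RecordCheck` property and requeue only the subject IDs for which it returns `True`.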

In [27]:
# Retrieve target data (only SGCN ITIS IDs at this point)
def getSubjectIDs(targetData):
    if targetData == 'sgcn':
        # Build query string to retrieve ITIS IDs that have no record check in tircache yet
        targetDataSQL = "SELECT DISTINCT sgcn.taxonomicauthorityid_accepted AS subjectid \
            FROM sgcn \
            WHERE sgcn.taxonomicauthorityid_accepted LIKE '%itis%' AND \
            sgcn.taxonomicauthorityid_accepted NOT IN \
            (SELECT DISTINCT tircache.subjectid FROM tircache \
            WHERE tircache.subjectid LIKE '%itis%' AND tircache.property LIKE 'ITIS:RecordCheck%') \
            LIMIT 1000"
        # Get data; passing the SQL through params lets requests URL-encode it
        return requests.get(getBaseURL(), params={"q": targetDataSQL}).json()
    # Fail loudly on targets that are not supported yet instead of returning None
    raise ValueError("Unsupported target data: "+targetData)

In [28]:
# Extract the TSN from the longer "URI-like" string we used for ITIS in the SGCN data
def extractTSN(subjectid):
    tsn = subjectid.replace('http://services.itis.gov/?q=tsn:','')
    return tsn
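
The output of the final run in this notebook shows a subject ID with nothing after the `tsn:` prefix, which `extractTSN` turns into an empty string before the ITIS service is called. A stricter variant might reject such IDs up front. This is only a sketch: `extractValidTSN` and `ITIS_PREFIX` are hypothetical names, and the check assumes that a TSN is always a string of digits.

```python
ITIS_PREFIX = 'http://services.itis.gov/?q=tsn:'


def extractValidTSN(subjectid):
    """Return the TSN as a string of digits, or None if the ID does not carry one."""
    if not subjectid.startswith(ITIS_PREFIX):
        return None
    tsn = subjectid[len(ITIS_PREFIX):].strip()
    # Assumption: a usable TSN consists of digits only
    return tsn if tsn.isdigit() else None
```

A caller could skip the service request, or record a distinct property in tircache, whenever the function returns `None`.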

In [29]:
# Loop through the target data, extract Subject IDs (only ITIS at this point), retrieve data from service, and insert results in tircache
def getITISCommonNamesFromTSN(targetData):
    for feature in targetData['features']:
        # Set the subjectid for query and recording
        strSubjectId = feature['properties']['subjectid']

        try:
            # Set the URL path to get the common names from TSN via one of the ITIS JSON service end points, get the response JSON, and pull out the common names structure
            itisCommonNameURL = itisGetCommonNamesJSONBaseURL+extractTSN(strSubjectId)
            itisCommonNameJSON = requests.get(itisCommonNameURL).json()
            commonNamesJSON = itisCommonNameJSON['commonNames']

            # Check the response and insert activity plus common names when present
            if commonNamesJSON[0] is None:
                insertTupleInTirCache(strSubjectId,"ITIS:RecordCheck:CommonName:NegativeResponse",dt)
            else:
                insertTupleInTirCache(strSubjectId,"ITIS:RecordCheck:CommonName:PositiveResponse",dt)
                for x in commonNamesJSON:
                    # Escape the single quotes often found in common names
                    strCommonName = x['commonName'].replace("'", "''")
                    # Insert common name with language qualifier in the property
                    insertTupleInTirCache(strSubjectId,"ITIS:BiologicalSpecies:CommonName:"+x['language'],strCommonName)
        except Exception:
            # Report the failing subject ID and continue with the next one
            print("Problem with "+strSubjectId)
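
The response handling above can be separated from the network and database calls, which may make it easier to test and to move into its own microservice later. The sketch below is one possible shape, not the notebook's implementation: `commonNameTuples` is a hypothetical name, it returns unescaped values, and treating an empty list as a negative response is an assumption.

```python
def commonNameTuples(subjectid, commonNames, timestamp):
    """Translate an ITIS commonNames list into (subjectid, property, value) tuples."""
    # Assumption: an empty list is treated like the [None] negative response
    if not commonNames or commonNames[0] is None:
        return [(subjectid, "ITIS:RecordCheck:CommonName:NegativeResponse", timestamp)]
    tuples = [(subjectid, "ITIS:RecordCheck:CommonName:PositiveResponse", timestamp)]
    for entry in commonNames:
        # Keep the language qualifier in the property name, as the loop above does
        tuples.append((subjectid,
                       "ITIS:BiologicalSpecies:CommonName:" + entry['language'],
                       entry['commonName']))
    return tuples
```

Each returned tuple maps directly onto the three arguments of `insertTupleInTirCache`, so the insert step could iterate over the list after escaping the values.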

In [31]:
# Run the process by firing functions
getITISCommonNamesFromTSN(getSubjectIDs("sgcn"))

Problem with http://services.itis.gov/?q=tsn:


TO DO:
* Spend some time working up a vocabulary for properties used in this context. Examine whether or not the overall property registry work for the DataDistillery project could apply. Current property names are kind of a hack.
* Build out the building block functions for tircache into a module.
* Investigate wrapping the building block functions for execution as microservices from the Kafka queue.