### This notebook explores methods for extracting locations related to mentions of taxa.

Status: In Development 

Last Updated: 201904

Summary: Using output from the eXtract Dark Data (xDD) database (previously named GeoDeepDive), we are exploring ways to extract information about species/taxa of interest from the literature. This work uses a list of taxa studied by the USGS Nonindigenous Aquatic Species (NAS) Program, but should be applicable to any list of taxonomic names.

Inputs:

* Taxa information (url='https://nas.er.usgs.gov/api/v1/species')
* xDD processed data, output from https://github.com/dwief-usgs/app-template-nas

Contact: Daniel Wieferich (dwieferich@usgs.gov)

In [4]:
#Import needed packages
import pandas as pd
import requests

#Import Functions
def get_species_list(url='https://nas.er.usgs.gov/api/v1/species'):
    """Return taxa information for NAS species of interest.

    Parameters
    ----------
    url : str
        API endpoint that returns JSON results of NAS species taxonomy.
    """
    r = requests.get(url)
    #fail loudly if the API does not return a successful response
    if r.status_code != 200:
        raise Exception('NAS API URL returning: {}'.format(r.status_code))
    return r.json()

#Keeps long text columns from being truncated when displayed
#(None means no limit; -1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)

#### Step 1
--------------
* Import source datasets, including the list of taxa names (from the NAS API) and literature passages from xDD


#### Progress
--------------
- Currently using a set of passages from a dam removal exercise for testing while taxa information is being processed by xDD staff.

- Need to rethink the logic for which taxonomic names to process, based on conversations with the NAS team. For example, species 3118 returns a common name of "mussel". This name is currently being processed but should not be.
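
One possible way to keep overly generic names such as "mussel" out of the matching step is to screen common names before they are added to the taxa list. The sketch below is only an illustration: the `GENERIC_COMMON_NAMES` stoplist, the `is_specific_common_name` helper, and the sample records are all hypothetical, and the actual rule would need to come from the NAS team.

```python
# Hypothetical stoplist of common names too generic to match on; the real
# list would need to be agreed on with the NAS team.
GENERIC_COMMON_NAMES = {'mussel', 'crayfish', 'snail'}


def is_specific_common_name(common_name):
    """Return True if a common name looks specific enough to search for."""
    if not common_name:
        return False
    name = common_name.strip().lower()
    return name != '' and name not in GENERIC_COMMON_NAMES


# Made-up records shaped like the NAS API results used in this notebook
sample_taxa = [
    {'speciesID': 1, 'common_name': 'mussel'},
    {'speciesID': 2, 'common_name': 'zebra mussel'},
    {'speciesID': 3, 'common_name': ''},
]
kept = [t for t in sample_taxa if is_specific_common_name(t['common_name'])]
print(kept)
```

A stoplist is easy to review with domain experts, but it only catches names someone has thought to list; a rule such as requiring at least two words could be tried as an alternative.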

In [3]:
#Import example passage output from xDD
#This will be updated with taxa mentions coming from

xdd_export = 'dam_year_river_22h33m_06Nov2018_a4c1766/river-cand-df.csv'
xdd_df = pd.read_csv(xdd_export, encoding='utf-8')

In [8]:
#This is a big file, so let's make it smaller (first 1,000 records) for testing purposes
xdd_df.shape
xdd_df_sub = xdd_df[:1000]

In [6]:
#Run function to return NAS taxa information as JSON response
taxa_r = get_species_list()

In [7]:
taxa_list = []
for taxa in taxa_r['results']:
    #captures a hybrid based on x of species, only return common name 
    if ' x ' in taxa['species']:
        taxa_list.append({'speciesID': taxa['speciesID'], 'common_name': taxa['common_name']})
    #for taxa with species = sp., return genus and common name
    elif 'sp.' in taxa['species']:
        taxa_list.append({'speciesID': taxa['speciesID'], 'genus': taxa['genus'], 'common_name': taxa['common_name']})
    #for everything else return scientific name (including subspecies and variety as available) and common name
    else:
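        #join only the non-empty name parts so a missing subspecies or variety does not leave extra spaces
        name_parts = [taxa['genus'], taxa['species'], taxa['subspecies'], taxa['variety']]
        sci_name = ' '.join(part.strip() for part in name_parts if part and part.strip())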
        taxa_list.append({'speciesID': taxa['speciesID'], 'sci_name': sci_name, 'common_name': taxa['common_name']})

taxa_df = pd.DataFrame(taxa_list)

#### Step 2
--------------
* For each passage, identify mentions of species and explore ways to extract location information


#### Progress
--------------
- Starting with basic use of NER tags within close proximity.

- We have efforts in progress to create NER tags specific to rivers (using spaCy), to better understand and extract river mentions.

- A first pass at running this with the full 2 million records and the full taxa list did not complete in a full 8-hour work day, so we need to incorporate a way of processing in batches.
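
One option for batching is to read the xDD export in chunks with the `chunksize` argument of `pd.read_csv` and collect matches as each chunk finishes. The sketch below assumes the `passage` column holds a stringified list of tokens (as in the test export) and uses a hypothetical `find_mentions_in_batches` helper with made-up demo rows. Matching all names with a single compiled regular expression is a swapped-in technique to avoid looping over every taxon for every passage; it is not what the cells in this notebook do.

```python
import ast
import io
import re

import pandas as pd


def find_mentions_in_batches(csv_source, sci_names, chunksize=1000):
    """Yield one DataFrame of taxa mentions per chunk of the xDD export."""
    # Longest names first so a subspecies wins over its parent species
    names = sorted(set(sci_names), key=len, reverse=True)
    if not names:
        return
    pattern = re.compile('|'.join(re.escape(n) for n in names))
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        rows = []
        for row in chunk.itertuples():
            # Assumes passage is a stringified list of tokens
            text = ' '.join(ast.literal_eval(row.passage))
            for match in pattern.finditer(text):
                rows.append({'taxa': match.group(0), 'docid': row.docid,
                             'sentid': row.sentid})
        yield pd.DataFrame(rows, columns=['taxa', 'docid', 'sentid'])


# Tiny in-memory stand-in for the xDD CSV export (made-up rows)
demo_csv = io.StringIO('''docid,sentid,passage
d1,1,"['The', 'eel', 'Anguilla', 'australis', 'migrates']"
d2,5,"['No', 'taxa', 'here']"
''')
batches = [b for b in find_mentions_in_batches(demo_csv, ['Anguilla australis'],
                                               chunksize=1) if not b.empty]
result = pd.concat(batches, ignore_index=True)
print(result)
```

Because each chunk is yielded separately, results could also be appended to a CSV per chunk, so an interrupted run would not lose the work already done.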

In [10]:
import ast

mention = []
for row_xdd in xdd_df_sub.itertuples():
    #parse the passage once per row; it is assumed to be a stringified list of tokens
    passage_str = ' '.join(ast.literal_eval(row_xdd.passage))
    for row_taxa in taxa_df.itertuples():
        speciesID = row_taxa.speciesID
        #match against the joined passage text so multi-word scientific names can be found
        if pd.notna(row_taxa.sci_name) and row_taxa.sci_name in passage_str:
            #record speciesID, taxa name, passage and xDD identifiers
            mention.append({'species_id': speciesID, 'taxa':row_taxa.sci_name, 'passage': row_xdd.passage, 'docid':row_xdd.docid, 'ner': row_xdd.ner, 'sentid':row_xdd.sentid})
        #genus and common name matching are disabled for now
        #if pd.notna(row_taxa.genus) and row_taxa.genus in passage_str:
        #    mention.append({'species_id': speciesID, 'taxa':row_taxa.genus, 'passage': row_xdd.passage, 'docid':row_xdd.docid, 'ner': row_xdd.ner, 'sentid':row_xdd.sentid})
        #if pd.notna(row_taxa.common_name) and row_taxa.common_name != '' and row_taxa.common_name in passage_str:
        #    mention.append({'species_id': speciesID, 'taxa':row_taxa.common_name, 'passage': row_xdd.passage, 'docid':row_xdd.docid, 'ner': row_xdd.ner, 'sentid':row_xdd.sentid})

mention_df = pd.DataFrame(mention)
mention_df.to_csv("./mention_df_sciname.csv", sep=',', index=False)


In [None]:
mention_df.tail()
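
For the proximity idea noted under Step 2, each taxon mention could be paired with the nearest run of `LOCATION` tokens in the same passage. The sketch below is an illustration under stated assumptions: `location_spans` and `nearest_location` are hypothetical helpers, the tokens and tags are made up, tags are assumed to be exactly `'LOCATION'`, and token distance is only a rough proxy for whether a location actually relates to the taxon.

```python
def location_spans(ner_tags):
    """Return [start, end] index pairs for runs of consecutive LOCATION tags."""
    spans = []
    for i, tag in enumerate(ner_tags):
        if tag != 'LOCATION':
            continue
        if spans and spans[-1][1] == i - 1:
            spans[-1][1] = i      # extend the current run
        else:
            spans.append([i, i])  # start a new run
    return spans


def nearest_location(tokens, ner_tags, taxon_index):
    """Return (location_text, token_distance) for the closest LOCATION run.

    Returns None when the passage has no LOCATION tags.
    """
    best = None
    for start, end in location_spans(ner_tags):
        if start <= taxon_index <= end:
            distance = 0
        else:
            distance = min(abs(taxon_index - start), abs(taxon_index - end))
        if best is None or distance < best[1]:
            best = (' '.join(tokens[start:end + 1]), distance)
    return best


# Made-up passage tokens and NER tags for illustration only
tokens = ['Eels', 'in', 'the', 'Murray', 'River', 'include',
          'Anguilla', 'australis', 'near', 'Adelaide']
ner = ['O', 'O', 'O', 'LOCATION', 'LOCATION', 'O',
       'O', 'O', 'O', 'LOCATION']
taxon_index = tokens.index('Anguilla')
print(nearest_location(tokens, ner, taxon_index))
```

Returning the distance alongside the location text leaves room to apply a cutoff later, for example ignoring locations more than some number of tokens away.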

In [19]:
#List Pairs of Species / Locations using NER tags

import ast
import itertools

def intervals_extract(iterable):
    """Yield [start, end] pairs for each run of consecutive integers."""
    iterable = sorted(set(iterable))
    #consecutive values share the same (value - position) difference, so groupby splits the runs
    for key, group in itertools.groupby(enumerate(iterable), lambda t: t[1] - t[0]):
        group = list(group)
        yield [group[0][1], group[-1][1]]


for row_xdd in xdd_df_sub.itertuples():
    passage = list(ast.literal_eval(row_xdd.passage))
    passage_str = ' '.join(passage)
    ner = list(ast.literal_eval(row_xdd.ner))
    docid = row_xdd.docid
    sentid = row_xdd.sentid
    if 'Anguilla' in passage_str:
        print(passage_str)
        #token positions tagged as LOCATION, collapsed into [start, end] runs
        index_locations = [i for i, s in enumerate(ner) if 'LOCATION' in s]
        location_intervals = list(intervals_extract(index_locations))
        #index_taxa = list([i for i,s in enumerate(passage) if 'Anguilla' in s])
        print(location_intervals)
        #for i in index_locations:
        #    p = passage[i]
        #    n = ner[i]
        #    print (sentid + ': ' + str(i)+ ': '+ p)
        #    print (i)

Generally , there is no gradient in salinity between the lower lakes and the Coorong ; instead , there is an abrupt transition between fresh and brackish/marine salinities . The impact that changes in such physiochemical signals have on the upstream movements of these species is uncertain . In the Murray-Darling Basin , connectivity between the Southern Ocean , estuary and the freshwater environments of the lower lakes and Murray River is imperative for at least ﬁve species of diadromous ﬁshes , namely anadromous Short-headed and Pouched Lamprey -LRB- Mordacia mordax and Geotria australis -RRB- and catadromous Common Galaxias -LRB- Galaxias maculatus -RRB- , Congolli and Short-ﬁnned Eel -LRB- Anguilla australis -RRB- .
[[14, 14], [50, 51], [56, 57], [69, 70], [92, 92], [109, 109]]
The impact that changes in such physiochemical signals have on the upstream movements of these species is uncertain . In the Murray-Darling Basin , connectivity between the Southern Ocean , estuary and the fr