## < DRAFT > Exploring Species of Interest to WLCI in the Literature using GeoDeepDive

#### Daniel Wieferich 

Purpose: This effort helps us understand which species are being studied in Wyoming Landscape Conservation Initiative (WLCI) work by tracking species references in the literature.  It uses GeoDeepDive (GDD), a literature database that USGS is partnering with the University of Wisconsin-Madison on.

The current code does the following. 
    #1. Access WLCI Publications as defined by WLCI Publications ScienceBase folders
    #2. Register WLCI publications into Research Reference Library
    #3. Identify Publications already in GeoDeepDive
    #4. Query WLCI publications that are in GeoDeepDive for referenced Taxa
    #5. Query publications in GDD that reference 'Wyoming Landscape Conservation Initiative'

Summary of findings:
    * GDD's ITIS dictionary does not currently include common names
    * A large number of WLCI publications are currently not represented in the WLCI publication folders or in GDD
    * 134 publications and reports are documented in the WLCI Publication folders in ScienceBase (I added 5 known to be in GDD)
    * Of those, 56 are in cross_ref and have DOIs; only 11 were found in GeoDeepDive
    * 114 taxa are referenced (uncleaned at this point)
    * GDD queries do not currently filter out Literature Cited sections, but UW is looking into this
    * 19 WLCI publications were found when searching GDD; manual verification shows that many of these reference WLCI in the acknowledgements
    

### Requirements

##### Python version 3.x

##### Required Packages
    * requests
    * os
    * pymongo
    * pandas
    * pybis (a package we developed to manage methods for the Biogeographic Information System)

#### 1. Access WLCI Publications as defined by WLCI Publications ScienceBase folders
        The following ScienceBase folders were used:
            * https://www.sciencebase.gov/catalog/item/4f4e476fe4b07f02db47e19f : WLCI Publications
            * https://www.sciencebase.gov/catalog/item/537e7253e4b05ed6215c121e : USGS Publications Produced for WLCI
        

In [1]:
import requests

#Create empty list to add WLCI publication information to
item_pub_info = []

#Loop through ScienceBase folders known to have WLCI publications
for parent_id in ['4f4e476fe4b07f02db47e19f','537e7253e4b05ed6215c121e']:

    #sciencebase limits the number of items returned at a time so we use next link and a while loop to ensure we return all relevant items
    next_link = 'https://www.sciencebase.gov/catalog/items?q=&max=100&fields=dates,title,contacts,identifiers&format=json&parentId='+ parent_id

    while next_link is not None:
        sb_results = requests.get(next_link).json()

        if 'nextlink' in sb_results.keys():
            next_link = sb_results['nextlink']['url']
        else:
            next_link = None


        # if the list of items is not empty then loop through the items adding publication data to item_pub_info variable
        if len(sb_results['items']) != 0:
            for item in sb_results['items']:
                this_sb_item = {}
                this_sb_item['id'] = item['id']
                this_sb_item['title'] = item['title']

                try:
                    #ScienceBase may return multiple types of dates, here we check to see if a publication date is available
                    #Note: similar to list comprehension but only expects one value.  Once that value is collected it moves on.  If no value is collected it gets a default of None
                    pub_date=next((i['dateString'] for i in item['dates'] if i['label']=='Publication Date'), None)
                except (KeyError, TypeError):
                    #The item has no usable 'dates' entry
                    pub_date=''
                this_sb_item['date'] = pub_date

                try:
                    #ScienceBase may return multiple types of contacts, here we check and grab only Authors
                    authors = [i['name'] for i in item['contacts'] if i['type']=='Author']
                except (KeyError, TypeError):
                    #The item has no usable 'contacts' entry
                    authors = []
                this_sb_item['authors'] = authors
                
                #Appends information about the item to the item list.  Includes: ScienceBase id, title, publication date, authors
                item_pub_info.append(this_sb_item)
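
The date and author extraction in the loop above can also be written without `try`/`except`, using `dict.get` with defaults. The sketch below assumes the item layout used in the loop (`dates` entries with `label` and `dateString`, `contacts` entries with `type` and `name`). The helper name `extract_pub_fields` and the sample item are hypothetical, and unlike the cell above it returns `None` rather than `''` when no dates are present.

```python
def extract_pub_fields(item):
    """Return id, title, publication date and authors for one ScienceBase item."""
    # 'dates' and 'contacts' are optional, so fall back to empty lists
    dates = item.get('dates', [])
    contacts = item.get('contacts', [])

    # next() returns the first matching value, or None when nothing matches
    pub_date = next(
        (d.get('dateString') for d in dates if d.get('label') == 'Publication Date'),
        None,
    )
    authors = [c.get('name') for c in contacts if c.get('type') == 'Author']

    return {
        'id': item['id'],
        'title': item['title'],
        'date': pub_date,
        'authors': authors,
    }


# Made-up item for illustration only
example_item = {
    'id': 'abc123',
    'title': 'Example report',
    'dates': [{'label': 'Publication Date', 'dateString': '2005'}],
    'contacts': [{'type': 'Author', 'name': 'A. Author'},
                 {'type': 'Editor', 'name': 'E. Editor'}],
}
print(extract_pub_fields(example_item))
```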

In [2]:
#Display pub information as a pandas dataframe
import pandas as pd
pd.DataFrame(item_pub_info)

| | authors | date | id | title |
|---|---|---|---|---|
| 0 | [] | 2007 | 53ab3a0ae4b065055fab183a | Strategic Habitat Plan Annual Report - 2007 |
| 1 | [Matthew J Holloran] | 2005 | 4f4e4783e4b07f02db4835f0 | Greater sage-grouse (Centrocercus urophasianus... |
| 2 | [Mike Mackey] | 1997-11-01 | 4f4e4a34e4b07f02db619cfa | Black gold : patterns in the development of Wy... |
| 3 | [Timothy W Clark, Thomas M Campbell III] | 1981-04 | 4f4e4783e4b07f02db48398f | Colony Characteristics and Vertebrate Associat... |
| 4 | [Douglas A Keinath, Brian E Smith] | 2005-01 | 4f4e4b15e4b07f02db6a4c21 | Species Assessment For The Northern Leopard Fr... |
| 5 | [Glenn Giroir] | 2004 | 4f4e4b32e4b07f02db6b4a4a | Addendum to Monitoring Wyoming's Birds, 2002-2... |
| 6 | [Gary P Beauvais, Amber Travsky] | 2004-10 | 4f4e49a0e4b07f02db5bdafd | Species Assessment for the Midget Faded Rattle... |
| 7 | [Aaron Kern, Rob Keith] | 2007-03 | 4f4e4b32e4b07f02db6b4a56 | 2006 Green River Watershed Native Non-Game Fis... |
| 8 | [E Bergquiest, Thomas J Stohlgren, N Alley, Pa... | 2006-10 | 4f4e4b32e4b07f02db6b4a42 | Invasive species and coal bed methane developm... |
| 9 | [Rebecca Buseck, Douglas A Keinath] | 2004-08 | 4f4e499fe4b07f02db5bcdfc | Species Assessment for Western Long-Eared Myot... |


#### 2. Register WLCI publications into Research Reference Library
        * This step creates a citation string, creates an RRL record, then searches Crossref for additional (standardized) metadata to add to the record
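
`rrl.lookup_crossref` is part of pybis and its internals are not shown in this notebook. As a rough illustration of what a citation lookup against Crossref can look like, the sketch below builds a query for the public Crossref REST API (`/works` with `query.bibliographic`) and reads the DOI from the first hit. The function names, the sample citation, and the response fragment are hypothetical; this is an assumption about one way to use that API, not a description of how pybis does it.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = 'https://api.crossref.org/works'


def build_crossref_url(citation, rows=1):
    """Build a Crossref bibliographic search URL for a free-text citation."""
    return CROSSREF_WORKS + '?' + urlencode({'query.bibliographic': citation, 'rows': rows})


def first_doi(crossref_json):
    """Return the DOI of the first item in a Crossref /works response, or None."""
    items = crossref_json.get('message', {}).get('items', [])
    return items[0].get('DOI') if items else None


# Shortened, illustrative citation string
url = build_crossref_url('Holloran, 2005, Greater sage-grouse')
print(url)

# Hand-written fragment standing in for requests.get(url).json()
fake_response = {'message': {'items': [{'DOI': '10.0000/example'}]}}
print(first_doi(fake_response))
```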

In [3]:
#Register WLCI pubs into research reference library

#Import needed packages
from pybis.db import Db as db
from pybis.rrl import ResearchReferenceLibrary as rrl
import os
from pymongo import MongoClient

#Connect to local mongo instance, this requires pybis and local environments to be set
mongo_rrl= db.connect_mongodb('rrl')

#Loop through publications to build a citation string using info we have from ScienceBase see step 1.
#This step then registers the information to the research reference library (BIS database) and then tries linking to cross_ref api to pull back standard metadata on the pub
for pub in item_pub_info:
    
    #Join the list of authors into a single string
    author_str = ' , '.join(pub['authors'])
        
    #create citation string
    short_citation = author_str + ', ' + str(pub['date']) + ' , ' + str(pub['title'])
    
    #Set needed variables for the rrl registration
    pub_url = 'https://www.sciencebase.gov/catalog/item/' + str(pub['id'])
    source = 'Wyoming Landscape Conservation Initiative'
    
    #register pub in rrl
    reg_result, hash_id = rrl.register_citation(mongo_rrl, short_citation, source, pub_url)
    mongo_rrl.update_one({'_id': hash_id}, {'$set': {'title':pub['title'], 'authors':author_str, 'pub_date':pub['date']}}, upsert=False)
    
    
    #look up pub in cross_ref system
    crossref_results = rrl.lookup_crossref(short_citation)
    crossref_entry = {"cross_ref": [crossref_results]}

    #Update current mongo record with crossref info  
    mongo_rrl.update_one({'_id': hash_id}, {'$set': crossref_entry}, upsert=False)
    
    
    

#### 3. Identify Publications already in GeoDeepDive

In [4]:
#Check registered pubs to see if they are in GDD
#If in GeoDeepDive create RRL Annotations   _ID SAME_AS GDD_ID


gdd_url = 'https://geodeepdive.org/api/'
doi_count = 0
cross_success_ct = 0
no_cross_ct = 0
rrl_gdd = []

#Select documents (records) in RRL that are WLCI pubs and loop through them trying to make linkage to GeoDeepDive
for document in mongo_rrl.find({"Sources":{'$elemMatch':{'source':"Wyoming Landscape Conservation Initiative"}}}):
    #First check to see if there is cross_ref metadata, we will use this when it is available
    if document['cross_ref'][0]['Success']== True:
        cross_success_ct += 1
        try:
            #It is best to link across these systems using DOI when possible, first check the DOI
            doi = (document['cross_ref'][0]['Record']['DOI'])
            doi_count += 1
            #Send request to GDD API, searching for article by DOI
            gdd_doi_query = gdd_url + 'articles?doi=' + doi
            gdd_response = requests.get(gdd_doi_query).json()
            
            try:
                #try accessing the geodeepdive id from the response 
                gdd_id = gdd_response['success']['data'][0]['_gddid']
                #if geodeepdive id is available append into list of relationships btwn gdd and rrl
                rrl_gdd.append({'rrl_id': document['_id'], 'relation': 'doi_match', 'gdd_id':gdd_id})
            except:
                continue
        except:
            try:
                #If no DOI is available (or the DOI request fails) try using the title
                title = (document['cross_ref'][0]['Record']['title'])
                gdd_title_query = gdd_url + 'articles?title=' + title
                #Send request to GDD API, searching for article by title
                gdd_response = requests.get(gdd_title_query).json()
                try:
                    gdd_id = gdd_response['success']['data'][0]['_gddid']
                    rrl_gdd.append({'rrl_id': document['_id'], 'relation': 'title_match', 'gdd_id':gdd_id})
                except:
                    continue
            except:
                continue
    else:
        no_cross_ct += 1
        try:
            title = (document['title'])
            gdd_title_query = gdd_url + 'articles?title=' + title
            gdd_response = requests.get(gdd_title_query).json()
            try:
                #print ('\n' + 'title')
                gdd_id = gdd_response['success']['data'][0]['_gddid']
                rrl_gdd.append({'rrl_id': document['_id'], 'relation': 'title_match', 'gdd_id':gdd_id})
            except:
                continue
        except:
            continue
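
In the cell above, the title lookup inside the crossref branch runs only when reading the DOI or sending the DOI request raises an exception; a DOI that simply returns no GDD record skips the document. A possible alternative that always tries the DOI first and then falls back to the title is sketched below. The function names are hypothetical, and `fetch` is injected so the logic can be exercised without calling the API.

```python
def first_gdd_id(gdd_json):
    """Return the _gddid of the first article in a GDD response, or None."""
    data = gdd_json.get('success', {}).get('data', [])
    return data[0].get('_gddid') if data else None


def match_in_gdd(doi, title, fetch):
    """Try a DOI lookup first, then fall back to a title lookup.

    fetch(param, value) is any callable returning the parsed JSON for
    articles?<param>=<value>.
    """
    for relation, param, value in (('doi_match', 'doi', doi),
                                   ('title_match', 'title', title)):
        if not value:
            continue
        gdd_id = first_gdd_id(fetch(param, value))
        if gdd_id is not None:
            return {'relation': relation, 'gdd_id': gdd_id}
    return None


# Stand-in for the real API: only the title lookup finds something
def fake_fetch(param, value):
    if param == 'title':
        return {'success': {'data': [{'_gddid': 'id-from-title'}]}}
    return {'success': {'data': []}}


print(match_in_gdd('10.0000/example', 'Example title', fake_fetch))
```

With `requests`, `fetch` could be something like `lambda param, value: requests.get(gdd_url + 'articles', params={param: value}).json()`, which also takes care of URL-encoding titles.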
  
            

In [5]:
#Summarize number of articles having linkage to crossref, those with DOIs, those that have GDD match
print ('Have crossref DOI: ' + str(doi_count))
print ('Have crossref: ' + str(cross_success_ct))
print ('No cross_ref information: ' + str(no_cross_ct))
print ('Successful match to GDD: ' + str(len(rrl_gdd)))

Have crossref DOI: 56
Have crossref: 56
No cross_ref information: 77
Successful match to GDD: 11
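
The printed counts can be turned into rough match rates. The sketch below is plain arithmetic on the numbers above, with illustrative variable names. The two categories sum to 133, one fewer than the 134 items mentioned in the summary; the reason for the difference is not shown here.

```python
have_crossref_doi = 56
no_crossref = 77
gdd_matches = 11

total_registered = have_crossref_doi + no_crossref

# Share of all registered publications that were found in GDD
share_of_all = gdd_matches / total_registered
# Upper bound on the match rate for publications with a DOI,
# because some of the 11 matches came from title lookups
share_of_doi = gdd_matches / have_crossref_doi

print(f'{total_registered} registered, {share_of_all:.1%} found in GDD')
print(f'At most {share_of_doi:.1%} of publications with a DOI were found in GDD')
```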


In [6]:
#Display linkage of identifiers between RRL and GDD
pd.DataFrame(rrl_gdd)

| | gdd_id | relation | rrl_id |
|---|---|---|---|
| 0 | 558f05afe13823109f3ee710 | title_match | b078a8c94b933c011618dad33222980b |
| 1 | 572ca4a4cf58f11a661f414b | doi_match | 048250af41552bdc1c59010d5fdc972b |
| 2 | 571bd57acf58f15e6bbda2ca | doi_match | fd7608e143625cd30b0ce0b546f0fd40 |
| 3 | 57bf4d1bcf58f160e280dc1d | doi_match | 8f4c2f2dad4e3011c5d84667245c6a47 |
| 4 | 58976134cf58f1a1de231c7a | doi_match | 380ab328148d2506c736b23f7bdfc356 |
| 5 | 557ce591e138239225f86a2f | title_match | e3a76200d7cba0e5d97bfac872ec66b9 |
| 6 | 583da612cf58f107cb09840d | doi_match | 3293eeee0a448e1f5de652beafc3eaad |
| 7 | 5ad055c3cf58f1a9152a8fd5 | doi_match | 31ee99601082da365b66a5f9b973740f |
| 8 | 5acfde1ecf58f17c7517b27d | doi_match | 2370bf384e52881b853e5f5cb7f6dea1 |
| 9 | 579f4458cf58f123c56623f8 | doi_match | 65f884bd1ab020c865e940ad8023606a |


#### 4. Query WLCI publications that are in GeoDeepDive for referenced Taxa

In [7]:
#Example of searching GDD for terms within a dictionary and within one document
#https://geodeepdive.org/api/terms?docid=571bd57acf58f15e6bbda2ca&dict_id=44

#Example where gdd returned a hit for Macropus fuliginosus in literature cited
#https://www.jstor.org/stable/pdf/2387146.pdf?refreqid=excelsior%3Af39d56d3df051f6638911a9dc7b92a50
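
Query URLs like the `terms` example above can be assembled with `urllib.parse` so that multi-word values are encoded consistently. This is a minimal sketch; the helper name `gdd_url_for` is hypothetical, and only routes and parameters that already appear in this notebook are used.

```python
from urllib.parse import quote, urlencode

GDD_API = 'https://geodeepdive.org/api/'


def gdd_url_for(route, **params):
    """Return a GDD API URL with percent-encoded query parameters."""
    return GDD_API + route + '?' + urlencode(params, quote_via=quote)


# Terms from dictionary 44 found in one document
print(gdd_url_for('terms', docid='571bd57acf58f15e6bbda2ca', dict_id=44))

# Spaces in multi-word search terms become %20
print(gdd_url_for('snippets', term='Brachylagus idahoensis'))
```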

In [8]:
#Search GeoDeepDive for species referenced in each publication

#Create two empty lists, one used to capture both taxa names and the associated publication (taxa_doc), the other to just list taxa
taxa_doc = []
taxa = []

#Loop through WLCI publications that we found in GDD searching for ITIS taxonomic names referenced in the documents
for pub in rrl_gdd:
    gdd_id = pub['gdd_id']
    gdd_api = 'https://geodeepdive.org/api/'
    #Named terms_url so the base gdd_url defined in step 3 is not overwritten
    terms_url = gdd_api + 'terms?dict_id=44&docid=' + gdd_id
    gdd_request = requests.get(terms_url).json()
    
    #create list of referenced species
    try:
        for term in gdd_request['success']['data']:
            taxa_doc.append({ 'taxa': term['term'], 'gdd_id': gdd_id })
            taxa.append(term['term'])
    except:
        continue
    

In [9]:
#Count of complete list of Taxa Referenced without Duplicates
len(list(set(taxa)))

114

In [12]:
#Visualize complete list of Taxa Referenced without Duplicates
list(set(taxa))

['Here',
 'Dendroctonus',
 'Fringilla',
 'Fabaceae',
 'Chrysothamnus',
 'Argentina',
 'Florida',
 'Pinus contorta',
 'Arizona',
 'Dendroctonus pseudotsugae',
 'Arctostaphylos',
 'Canis',
 'Artemisia tridentata ssp. wyomingensis',
 'Macropus',
 'Purshia tridentata',
 'Antilocapra',
 'Procapra gutturosa',
 'Aquila',
 'Strix occidentalis',
 'Arctostaphylos patula',
 'Phlox',
 'Pseudotsuga menziesii',
 'Morales',
 'Centrocercus',
 'Jones',
 'Artemisia tridentata',
 'Cervus',
 'Sphagnum',
 'Procapra',
 'Hesperostipa',
 'Deschampsia caespitosa',
 'Cervus elaphus',
 'Otis',
 'Festuca',
 'Buteo',
 'Odocoileus',
 'Elymus lanceolatus',
 'Odocoileus hemionus',
 'Branta leucopsis',
 'Krascheninnikovia lanata',
 'Bromus tectorum',
 'Nevada',
 'Saiga',
 'Deschampsia',
 'Pinus',
 'Cyanocitta stelleri',
 'Connochaetes',
 'Cyanocitta',
 'Krascheninnikovia',
 'Juncus',
 'Equus burchelli',
 'Rangifer',
 'Aquila chrysaetos',
 'Pseudoroegneria spicata',
 'Delta',
 'Fringilla montifringilla',
 'Ardea',
 'El
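
The uncleaned list mixes species-level names with single-word entries such as 'Here', 'Florida' and 'Jones', which are probably ordinary words, place names or surnames that happen to match a dictionary term. One possible first cleaning pass, sketched below, keeps multi-word names and sets single-word names aside for manual review. The split rule and the function name are assumptions, not part of the current workflow, and valid genus-level names such as 'Canis' would still need to be confirmed by hand.

```python
def split_taxa_for_review(taxa):
    """Separate multi-word names from single-word names that need checking."""
    unique = sorted(set(taxa))
    multi_word = [t for t in unique if len(t.split()) > 1]
    single_word = [t for t in unique if len(t.split()) == 1]
    return multi_word, single_word


# A few entries taken from the list above, with one duplicate
sample = ['Here', 'Pinus contorta', 'Florida', 'Canis',
          'Artemisia tridentata ssp. wyomingensis', 'Pinus contorta']
species_level, needs_review = split_taxa_for_review(sample)
print(species_level)
print(needs_review)
```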

#### 5. Query publications in GDD that reference 'Wyoming Landscape Conservation Initiative'

In [13]:
url = 'https://geodeepdive.org/api/articles?term=Wyoming%20Landscape%20Conservation%20Initiative'
gdd_request = requests.get(url).json()
print ('Number of publications referencing Wyoming Landscape Conservation Initiative: ' + str(len(gdd_request['success']['data'])))

Number of publications referencing Wyoming Landscape Conservation Initiative: 19


I took a look at some of these 19 documents, and many of them reference WLCI in the acknowledgements.  These should be included in the WLCI publication list.  I added a few to the publication list, and each one I added had connections to both Crossref and GeoDeepDive when I reran steps 1-3.
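
One way to list which of these documents are not yet linked to a registered WLCI publication is to compare their GDD identifiers with those collected in `rrl_gdd`. The sketch below uses made-up identifiers and assumes each article in the response carries a `_gddid` field, as in step 3. The function name is hypothetical.

```python
def unregistered_gdd_ids(term_search_articles, rrl_gdd_links):
    """Return GDD ids found by the term search that have no RRL link yet."""
    found = {article['_gddid'] for article in term_search_articles}
    linked = {link['gdd_id'] for link in rrl_gdd_links}
    return sorted(found - linked)


# Made-up identifiers for illustration
articles = [{'_gddid': 'a1'}, {'_gddid': 'b2'}, {'_gddid': 'c3'}]
links = [{'rrl_id': 'r1', 'relation': 'doi_match', 'gdd_id': 'b2'}]
print(unregistered_gdd_ids(articles, links))
```

In the notebook this could be called as `unregistered_gdd_ids(gdd_request['success']['data'], rrl_gdd)`.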

####  Exploration of the GDD API before MongoDB was set up locally.  This was initial exploration based on a csv file I developed by manually identifying species references within WLCI publications.

As an example, the current code shows the number of times literature currently in GeoDeepDive references thirty species that are relevant to WLCI.
Specific efforts that we plan to address in this code are listed below.
#1. Link the WLCI literature stored in SB to GeoDeepDive
#2. Develop a comprehensive list of species that WLCI literature has referenced.
#3. Store information about species references within WLCI literature within GeoDeepDive.
#4. Explore where and how often WLCI species are referenced within the entire GeoDeepDive corpus. (current code gives an example of this)

In [14]:
#Import needed packages
import pandas as pd
import requests
import pybis

# the data folder contains a csv of species that Daniel found mentioned in WLCI literature. It is currently used as an example and will be replaced by the entire list of species referenced in WLCI literature
#Note: DataFrame.from_csv has been removed from pandas; read_csv with index_col=0 reads the same file
species = pd.read_csv('data/sp_list.csv', index_col=0)
species


| n | wlci_sp | scientific_name | reference |
|---|---|---|---|
| 1 | pygmy rabbit | Brachylagus idahoensis | |
| 2 | mule deer | Odocoileus heminonus | |
| 3 | elk | Cervus elaphus | |
| 4 | pronghorn | Antilocapra americana | |
| 5 | greater sage-grouse | Centrocercus urophasianus | https://onlinelibrary.wiley.com/doi/pdf/10.100... |
| 6 | boreal toad | Bufo boreas boreas | https://www.sciencebase.gov/catalog/item/4f4e4... |
| 7 | boreal chorus frog | Pseudacris maculata | https://www.sciencebase.gov/catalog/item/4f4e4... |
| 8 | tiger salamander | Ambystoma tigrinum | https://www.sciencebase.gov/catalog/item/4f4e4... |
| 9 | columbia spotted frog | Rana luteiventris | https://www.sciencebase.gov/catalog/item/4f4e4... |
| 10 | flannelmouth sucker | Catostomus latipinnus | https://www.sciencebase.gov/catalog/item/4f4e4... |


In [15]:
#API call that returns number of references of all species names that are referenced in GeoDeepDive
gdd_itis_api = 'https://geodeepdive.org/api/dictionaries?dict=ITIS&show_terms=TRUE'
r = requests.get(gdd_itis_api).json()

In [16]:
#Loops through results from GeoDeepDive API call and collects information only on WLCI species
sci_gdd_hits = []
for row in species.itertuples():
    common_name = row.wlci_sp
    sci_name = row.scientific_name
    for term in r['success']['data'][0]['term_hits']:
        if term.lower() == str(sci_name).lower():
            sci_name_count = r['success']['data'][0]['term_hits'][term]
            sci_gdd_hits.append({'common_name':common_name, 'sci_name':sci_name, 'sci_name_ct': sci_name_count})
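
The loop above only records species whose scientific name matches a dictionary term exactly (ignoring case), so a name with a spelling variant drops out without any message. The sketch below does the same lookup through a lowercase index built once, and also returns the names that found no match so they can be checked. It assumes `term_hits` maps each term to its count, as the loop above implies. The function name and the counts are made up.

```python
def match_species(species_names, term_hits):
    """Look up each scientific name in term_hits, ignoring case.

    Returns (matched, unmatched): a dict of name -> count and a list of
    names with no dictionary entry.
    """
    # Build the lowercase index once instead of scanning term_hits per species
    hits_by_lower = {term.lower(): count for term, count in term_hits.items()}

    matched, unmatched = {}, []
    for name in species_names:
        count = hits_by_lower.get(str(name).lower())
        if count is None:
            unmatched.append(name)
        else:
            matched[name] = count
    return matched, unmatched


# Made-up counts standing in for r['success']['data'][0]['term_hits']
fake_term_hits = {'Cervus elaphus': 10, 'Odocoileus hemionus': 7}
matched, unmatched = match_species(['cervus elaphus', 'Odocoileus heminonus'],
                                   fake_term_hits)
print(matched)
print(unmatched)
```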

In [17]:
#Sums number of references of WLCI species within all GeoDeepDive literature

sn = pd.DataFrame(sci_gdd_hits)
#sn.loc[sn['sci_name_ct'].idxmin()]
sn

print ('Total references of WLCI species in GeoDeepDive: ' + str(sn['sci_name_ct'].sum()))

Total references of WLCI species in GeoDeepDive: 81846


In [18]:
#Show all species and number of times they were referenced in GeoDeepDive
sn

| | common_name | sci_name | sci_name_ct |
|---|---|---|---|
| 0 | pygmy rabbit | Brachylagus idahoensis | 439 |
| 1 | elk | Cervus elaphus | 17148 |
| 2 | pronghorn | Antilocapra americana | 1417 |
| 3 | greater sage-grouse | Centrocercus urophasianus | 1403 |
| 4 | boreal toad | Bufo boreas boreas | 63 |
| 5 | boreal chorus frog | Pseudacris maculata | 149 |
| 6 | tiger salamander | Ambystoma tigrinum | 3188 |
| 7 | columbia spotted frog | Rana luteiventris | 526 |
| 8 | bluehead sucker | Catostomus discobolus | 114 |
| 9 | roundtail chub | Gila robusta | 212 |


In [19]:
pygmy_rabbit_snippets='https://geodeepdive.org/api/snippets?term=Brachylagus%20idahoensis'
pygmy_rabbit_r = requests.get(pygmy_rabbit_snippets).json()    

In [20]:
#This is a use case to show what can be done.  Print some information about one of the articles that referenced Pygmy Rabbits
pr = (pygmy_rabbit_r['success']['data'][4])
print ('Article title:  ' + pr['title'])
print ('Publisher:  ' + pr['pubname'])
print ('Authors:  ' + pr['authors'])
print ('This article references Pygmy Rabbits:  ' + str(pr['hits']) + ' times.')
print ('')
print ('Some references of Pygmy Rabbits in the article include: ' + str(pr['highlight']))

Article title:  A Method for Capturing Pygmy Rabbits in Summer
Publisher:  Journal of Wildlife Management
Authors:  LARRUCEA, EVELINE S.; BRUSSARD, PETER F.
This article references Pygmy Rabbits:  5 times.

Some references of Pygmy Rabbits in the article include: [' the pygmy rabbit (<em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>) as threatened or endangered under the Endangered Species Act', '-186 KEY WORDS box trap, <em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>, California, drift fence, Havaharte, Nevada, noose', ', pygmy rabbit, sagebrush, trapping.  Pygmy rabbits (<em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>) are the smallest members', '. Pygmy rabbit petition: a petition for rules to list the pygmy rabbit <em class="hl">Brachylagus</em> <em class="hl">idahoensis</em> occurring', '. Murrelet 60:112–113. Katzner, T. E. 1994. Winter ecology of the pygmy rabbit (<em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>']
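
The snippets come back with `<em class="hl">` markers around each hit. If plain text is needed, one local option is to strip those markers with a regular expression, as sketched below. The helper name is hypothetical and the pattern only targets `em` tags.

```python
import re

# Matches opening <em ...> and closing </em> tags only
HIGHLIGHT_TAG = re.compile(r'</?em\b[^>]*>')


def strip_highlight(snippet):
    """Remove <em ...> and </em> markers and trim surrounding whitespace."""
    return HIGHLIGHT_TAG.sub('', snippet).strip()


raw = ' the pygmy rabbit (<em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>) as threatened'
print(strip_highlight(raw))
```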


In [21]:
print(pr)

{'pubname': 'Journal of Wildlife Management', 'publisher': 'Wiley', '_gddid': '572c6a04cf58f10692780c55', 'title': 'A Method for Capturing Pygmy Rabbits in Summer', 'coverDate': 'May 2007', 'URL': 'http://www.bioone.org/doi/abs/10.2193/2006-186', 'authors': 'LARRUCEA, EVELINE S.; BRUSSARD, PETER F.', 'hits': 5, 'highlight': [' the pygmy rabbit (<em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>) as threatened or endangered under the Endangered Species Act', '-186 KEY WORDS box trap, <em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>, California, drift fence, Havaharte, Nevada, noose', ', pygmy rabbit, sagebrush, trapping.  Pygmy rabbits (<em class="hl">Brachylagus</em> <em class="hl">idahoensis</em>) are the smallest members', '. Pygmy rabbit petition: a petition for rules to list the pygmy rabbit <em class="hl">Brachylagus</em> <em class="hl">idahoensis</em> occurring', '. Murrelet 60:112–113. Katzner, T. E. 1994. Winter ecology of the pygmy rabbit (<em class="

In [None]:
'https://geodeepdive.org/api/snippets?docid=572c6a04cf58f10692780c55&term=Brachylagus idahoensis&clean&fragment_limit=100'