### This notebook explores methods for relating articles of interest to literature databases

Status: In Development 

Last Updated: 201904

Summary: Our use cases often start with literature citation information,
which can vary in completeness.  This notebook explores options for
connecting a CSV-based source of citation information to Crossref and the 
University of Wisconsin-Madison xDD database.
    
Inputs:  *CSV of citation information 
    This uses data from USGS NAS with varying completeness for the following
    variables: (internal id, authors, year, title, journal and DOI)

Contact: Daniel Wieferich (dwieferich@usgs.gov)

In [6]:
#Run this cell to ensure all packages and functions are available for 
#code that follows

import requests
import pandas as pd

def gdd_api(route, params):
    """Query an xDD API route and return the list of matching records
    Parameters : see https://geodeepdive.org/api for more detail
    ----------
    route : str, one of the available API routes for xDD (e.g. 'articles')
    params : str of key=value parameter pairs separated by &
    """
    base_url = 'https://geodeepdive.org/api'
    search = (base_url + '/' + route + '?' + str(params))
    #print (search)
    r = requests.get(search)
    if r.status_code != 200:
        raise Exception('xDD API returning: {}'.format(r.status_code))
    json_r = r.json()
    #A 200 response without a 'success' key is treated as no results
    if 'success' in json_r:
        return json_r['success']['data']
    return []
        
#DOI assignment
def clean_doi(doi_str):
    """Return DOI in common format without prefix of DOI:
    ----------
    Example output format: 10.1080/01411598708240781
    ----------
    Parameters
    ----------
    doi_str : input DOI (NaN when the source record has no DOI)
    """
    if str(doi_str)!='nan':
        #Use the argument itself rather than the global loop variable `row`
        temp_doi = str(doi_str).lower()
        temp_doi = temp_doi.replace('doi:', '')
        temp_doi = temp_doi.replace('doi ', '')
        #Values still starting with 'doi' or given as a URL are treated as unusable
        if not temp_doi.startswith('doi') and not temp_doi.startswith('https:'):
            return temp_doi
    return ''
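
`clean_doi` returns an empty string when the DOI is given as a URL. A possible extension, sketched here only as an illustration, is to pull the DOI out of whatever text surrounds it. The helper name `extract_doi` and the regular expression are assumptions and are not used elsewhere in this notebook. The pattern is a common heuristic rather than an official DOI grammar, and it can keep trailing punctuation.

```python
import re

# Hypothetical helper, not part of the notebook's workflow.
# Heuristic pattern: '10.', a 4-9 digit registrant code, '/', then non-space characters.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/\S+')

def extract_doi(raw):
    """Return the first DOI-like substring in lowercase, or '' if none is found."""
    if raw is None or str(raw) == 'nan':
        return ''
    match = DOI_PATTERN.search(str(raw).strip().lower())
    return match.group(0) if match else ''

print(extract_doi('DOI:10.1080/01411598708240781'))                 # 10.1080/01411598708240781
print(extract_doi('https://doi.org/10.1371/journal.pone.0048233'))  # 10.1371/journal.pone.0048233
print(repr(extract_doi(float('nan'))))                              # ''
```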

##### Test, out of curiosity, to see the full list of journals included in xDD

In [3]:
journals = gdd_api('journals', 'all')

In [5]:
#List of journal names in xDD
journal = []
for j in journals:
    journal.append(j['journal'])
journal = list(set(journal))

print (str(len(journal)) + ' journals currently included')
journal

5816 journals currently included


['',
 'International Review of Psychiatry',
 'Southeast European and Black Sea Studies',
 'Journal of Free Radicals in Biology & Medicine',
 'Annals of Oncology',
 'Cultural and Social History',
 'New Directions for Teaching and Learning',
 'Journal of African Diaspora Archaeology and Heritage',
 'Health Policy and Education',
 'New Zealand Journal of Geology and Geophysics',
 'Deep Sea Research Part B. Oceanographic Literature Review',
 'Open-File ReportBibliography of reports resulting from U.S. Geological Survey scientific and technical cooperation with other countries, 1975 to June 1980',
 'Journal of Geophysical Research: Oceans',
 'Politics & Policy',
 'Open-File ReportGeologic map of the Thoreau quadrangle, McKinley County, New Mexico',
 'Evolutionary Anthropology: Issues, News, and Reviews',
 'Flora oder Allgemeine Botanische Zeitung',
 'Healthcare',
 'EMC - Kinesiterapia - Medicina Física',
 'Learning, Media and Technology',
 'Sleep Medicine Clinics',
 'EXPLORE: The Journal of

In [7]:
#Import data into a pandas dataframe and print field names
file_name = '20190326_NAS_journal_articles.csv'
df = pd.read_csv(file_name, sep=',')
df.columns

Index(['refnum', 'author', 'publication_year', 'title', 'journal_name',
       'volume', 'issue', 'pages', 'specimen_data_entered', 'key_words',
       'DOI'],
      dtype='object')

Initial testing of Crossref suggests using a score threshold of 93 to 
ensure correct matches; more testing with the NAS team is being conducted.
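
One way to apply that threshold is sketched below, assuming results are held as a list of dicts with a `score` key, as `crossref_info` is in this notebook. The function name `split_by_score` is hypothetical. The first two records reuse refnum and score values from the `crossref_df.tail(5)` output later in the notebook; the third record is made up to show a sub-threshold score.

```python
# Minimum Crossref score treated as a trustworthy match (from the initial testing noted above).
SCORE_THRESHOLD = 93

def split_by_score(records, threshold=SCORE_THRESHOLD):
    """Split records into (accepted, needs_review) by their Crossref 'score'."""
    accepted, needs_review = [], []
    for record in records:
        score = record.get('score')
        # Missing, empty, or NaN scores fail the comparison and go to review.
        if isinstance(score, (int, float)) and score >= threshold:
            accepted.append(record)
        else:
            needs_review.append(record)
    return accepted, needs_review

records = [
    {'refnum': 32622, 'score': 104.05482},
    {'refnum': 32630, 'score': ''},        # no Crossref match, so no score
    {'refnum': 'example', 'score': 80.0},  # hypothetical low-scoring record
]

accepted, needs_review = split_by_score(records)
print([r['refnum'] for r in accepted])      # [32622]
print([r['refnum'] for r in needs_review])  # [32630, 'example']
```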

Notes: This cell took a few hours to run on the 10,000-article database. It 
was intentionally throttled to stay within Crossref's suggested 
load (0.15 second sleep between iterations).
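
At 0.15 seconds per iteration, the sleep itself adds about 25 minutes over 10,000 iterations (10,000 × 0.15 s = 1,500 s), so most of the run time presumably comes from the requests themselves. A minimal sketch of the throttling pattern follows; the generator name `throttled` is hypothetical and the notebook itself calls `time.sleep` inline.

```python
import time

def throttled(items, delay=0.15):
    """Yield items one at a time, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index:
            # No pause before the first item, only between items.
            time.sleep(delay)
        yield item

for refnum in throttled([101, 102, 103]):
    # A Crossref lookup for this refnum would go here.
    print(refnum)
```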

This effort will eventually feed into the Biogeographic Information System's Research Reference Library efforts and database.

In [None]:
#using the pybis package from https://github.com/usgs-bis/pybis
from pybis.rrl import ResearchReferenceLibrary as rrl
import time

crossref_info = []
for row in df.itertuples():
    refnum = row.refnum
    title_nas = row.title
    doi_nas = clean_doi(row.DOI)
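    #Keep a space between the journal name and the DOI so they are not fused together
    short_citation = (str(row.author)+'. '+str(row.publication_year)+'. '+str(title_nas)+' '+str(row.journal_name)+' '+str(doi_nas))
    #Reset per-row values so a record missing a field never inherits the previous row's value
    doi, title, score = '', '', ''
    try: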
        crossref_results = rrl.lookup_crossref(short_citation, threshold=70)
        if crossref_results['Success'] is True:
            if 'DOI' in crossref_results['Record']:
                doi = crossref_results['Record']['DOI']
            if 'title' in crossref_results['Record']:
                title = crossref_results['Record']['title'][0]
            if 'Score' in crossref_results:
                score = crossref_results['Score']
            crossref_info.append({'refnum': refnum, 'short_citation': short_citation, 'cr_title':title, 'title_nas':title_nas, 'cr_doi':doi, 'doi_nas': doi_nas, 'score':score, 'crossref':crossref_results})
        else:
            crossref_info.append({'refnum': refnum, 'short_citation': short_citation, 'cr_title':'', 'title_nas':title_nas, 'cr_doi':'', 'doi_nas': doi_nas, 'score':'', 'crossref':'False'})

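    #Catch Exception rather than everything so the long-running loop can still be interrupted
    except Exception: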
        crossref_info.append({'refnum': refnum, 'short_citation': short_citation, 'cr_title':'', 'title_nas':title_nas, 'cr_doi':'', 'doi_nas': doi_nas, 'score':'', 'crossref':'failed'})
        continue
    print (refnum)
    
    time.sleep(0.15)
    
cross_ref_df=pd.DataFrame(crossref_info)
cross_ref_df.to_csv("./crossref_nas_df.csv", sep=',', index=False)

In [9]:
#Until database is in place, use this line so we don't have to 
#rerun crossref cell above
crossref_df = pd.read_csv("crossref_nas_df.csv", sep=',')

In [10]:
#Temporary fix to code, remove cr_title where crossref=='False' and crossref=='failed'
crossref_df.loc[crossref_df.crossref =='False', 'cr_title']= ''
crossref_df.loc[crossref_df.crossref =='failed', 'cr_title']= ''

In [11]:
crossref_df.tail(5)

Unnamed: 0,cr_doi,cr_title,crossref,doi_nas,refnum,score,short_citation,title_nas
10654,10.1080/02705060.2003.9664007,The Introduction of an Invasive Snail (Melanoi...,"{'Success': True, 'Date Checked': '2019-04-03T...",,32622,104.05482,"Rader, R.B., M.C. Belk, and M.J. Keleher. 2003...",The Introduction of an Invasive Snail (Melanoi...
10655,10.1371/journal.pone.0048233,Phylogeny and Evolutionary Patterns in the Dwa...,"{'Success': True, 'Date Checked': '2019-04-03T...",10.1371/journal.pone.0048233,32623,126.86783,"Pedraza-Lara, C., I. Doadrio, J.W. Breinholt, ...",Phylogeny and evolutionary patterns in the dwa...
10656,10.1371/journal.pone.0079516,Sequencing and De Novo Assembly of the Asian C...,"{'Success': True, 'Date Checked': '2019-04-03T...",,32629,121.16053,"Chen, H., J. Zha, X. Liang, J. Bu, M. Wang, an...",Sequencing and De Novo Assembly of the Asian C...
10657,,,False,,32630,,"Ladd, H.L.A., and D.L. Rogowski. 2012. Egg pre...",Egg predation and parasite prevalence in the i...
10658,,,False,,32632,,"Ituarte, C.F.. 1994. <em>Corbicula</em> and <e...",<em>Corbicula</em> and <em>Neocorbicula</em> (...


In [None]:
#Grab subset for testing of xDD next steps
cr_25_df = crossref_df.iloc[-25:]
cr_25_df.head(25)

#### Initial test of linking article information to xDD ids (when article is present in xDD)

Logic behind linkage to xDD

1) If DOI from Crossref (cr_doi) is available use to search/relate to xDD
    
    a) no doi relation, try matching title
    
    b) no title match, try title_like parameter
    
2) If cr_doi is not available try using source data DOI
    
    a) no doi relation, try matching title
    
    b) no title match, try title_like parameter
    
3) If DOI is not available try using title to search/relate to xDD
    
    a) no title match, try title_like parameter
    
In this exercise, title_like matching produced false matches more often than not. We need to look into this further to see whether it is worth keeping in the future.
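
The fallback order above can be sketched as an ordered list of attempts that stops at the first hit. This is an illustration under stated assumptions, not the notebook's implementation. `link_to_xdd` and `fake_search` are hypothetical names, `search(key, value)` stands in for a call to the xDD `articles` route, missing values are assumed to be `''` or `None` rather than NaN, and the DOI, title, and id values are placeholders.

```python
def link_to_xdd(record, search, min_score=93):
    """Return (relation, gdd_id) for one record, following the fallback order above."""
    attempts = []
    score = record.get('score')
    # 1) Prefer the Crossref DOI when the match score is high enough.
    if record.get('cr_doi') and isinstance(score, (int, float)) and score >= min_score:
        attempts.append(('doi_match', 'doi', record['cr_doi']))
    # 2) Otherwise fall back to the DOI from the source data.
    elif record.get('doi_nas'):
        attempts.append(('doi_match', 'doi', record['doi_nas']))
    # 3) Exact title, then title_like, are tried after any DOI attempt.
    title = record.get('title_nas')
    if title:
        attempts.append(('title_match', 'title', title))
        attempts.append(('title_like', 'title_like', title))

    for relation, key, value in attempts:
        docs = search(key, value)
        if docs:
            return relation, docs[0]['_gddid']
    return 'no_match', ''

def fake_search(key, value):
    """Offline stand-in for the xDD API: only an exact title lookup succeeds."""
    if key == 'title' and value == 'Example title':
        return [{'_gddid': 'placeholder-id'}]
    return []

record = {'cr_doi': '', 'score': '', 'doi_nas': '10.0000/example', 'title_nas': 'Example title'}
print(link_to_xdd(record, fake_search))  # ('title_match', 'placeholder-id')
```

Keeping the attempts in one list means the title fallbacks are written once instead of once per DOI branch.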

In [7]:
import requests

rrl_gdd = []
for row in crossref_df.itertuples():
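    #Reset per-row values so a no-match or failed row never reuses the previous row's id or query
    gdd_id = ''
    param = ''
    try:
        refnum = row.refnum
        title_nas = row.title_nas
        
        #remove html tags/formatting
        title_nas = title_nas.replace("</em>", "")
        title_nas = title_nas.replace("<em>","")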
        
        #print (refnum)
        route = 'articles'
        #NAS has DOI, yes/no
        if row.score and row.cr_doi and row.score>=93 and str(row.cr_doi)!='nan':
            param = 'max=1&doi='+str(row.cr_doi)
            gdd_data = gdd_api(route, param)
            if gdd_data:
                gdd_id = gdd_data[0]['_gddid']
                rrl_gdd.append({'refnum': refnum, 'relation': 'doi_match', 'gdd_id':gdd_id, 'param':param})
            elif row.title_nas and str(row.title_nas) != '' and not title_nas.startswith("(") :
                param = 'max=1&title='+ title_nas
                gdd_data = gdd_api(route, param)
                #Does title return successful match
                if gdd_data:
                    gdd_id = gdd_data[0]['_gddid']
                    rrl_gdd.append({'refnum': refnum, 'relation': 'title_match', 'gdd_id':gdd_id, 'param':param})
                else:
                    param = 'max=1&title_like=' + title_nas
                    gdd_data = gdd_api(route, param)
                    if gdd_data:
                        gdd_id = gdd_data[0]['_gddid']
                        rrl_gdd.append({'refnum': refnum, 'relation': 'title_like', 'gdd_id':gdd_id, 'param':param})
                    else:
                        rrl_gdd.append({'refnum': refnum, 'relation': 'no_match', 'gdd_id':gdd_id, 'param':''})
        elif row.doi_nas and str(row.doi_nas)!='nan':
            param = 'max=1&doi='+str(row.doi_nas)
            gdd_data = gdd_api(route, param)
            if gdd_data:
                gdd_id = gdd_data[0]['_gddid']
                rrl_gdd.append({'refnum': refnum, 'relation': 'doi_match', 'gdd_id':gdd_id, 'param':param})
            elif row.title_nas and str(row.title_nas) != '' and not title_nas.startswith("(") :
                param = 'max=1&title='+ title_nas
                gdd_data = gdd_api(route, param)
                #Does title return successful match
                if gdd_data:
                    gdd_id = gdd_data[0]['_gddid']
                    rrl_gdd.append({'refnum': refnum, 'relation': 'title_match', 'gdd_id':gdd_id, 'param':param})
                else:
                    param = 'max=1&title_like=' + title_nas
                    gdd_data = gdd_api(route, param)
                    if gdd_data:
                        gdd_id = gdd_data[0]['_gddid']
                        rrl_gdd.append({'refnum': refnum, 'relation': 'title_like', 'gdd_id':gdd_id, 'param':param})
                    else:
                        rrl_gdd.append({'refnum': refnum, 'relation': 'no_match', 'gdd_id':gdd_id, 'param':''})
        elif row.title_nas and str(row.title_nas) != ''and not title_nas.startswith("("):
            param = 'max=1&title='+ title_nas
            gdd_data = gdd_api(route, param)
            #Does title return successful match
            if gdd_data:
                gdd_id = gdd_data[0]['_gddid']
                rrl_gdd.append({'refnum': refnum, 'relation': 'title_match', 'gdd_id':gdd_id, 'param':param})
            else:
                param = 'max=1&title_like=' + title_nas
                gdd_data = gdd_api(route, param)
                if gdd_data:
                    gdd_id = gdd_data[0]['_gddid']
                    rrl_gdd.append({'refnum': refnum, 'relation': 'title_like', 'gdd_id':gdd_id, 'param':param})
                else:
                    rrl_gdd.append({'refnum': refnum, 'relation': 'no_match', 'gdd_id':gdd_id, 'param':''})
        else:
            rrl_gdd.append({'refnum': refnum, 'relation': 'no_match', 'gdd_id':gdd_id, 'param':''})
            #print ('else:' + str(refnum))
    except Exception:
        #Record the failure, log the refnum and query, and move on to the next row
        rrl_gdd.append({'refnum': refnum, 'relation': 'failed exception', 'gdd_id':''})
        print ('try:' + str(refnum)+':  '+param)
        continue
        
rrl_xdd_df = pd.DataFrame(rrl_gdd)
rrl_xdd_df.to_csv("./rrl_xdd_df.csv", sep=',', index=False)       

try:12249:  max=1&title_like=Accumulation of trace elements, pesticides, and polychlorinated biphenyls in sediments and the clam Corbicula manilensis of the Apalachicola River, Florida.
try:12250:  max=1&title=Genetic studies of Asiatic clams, Corbicula, in Thailand: allozymes of 21 nominal species are identical.
try:12251:  max=1&title=Comparative shell microstructure of North American Corbicula (Bivalvia: Sphaeriacea).
try:12252:  max=1&doi=10.2307/3226519
try:12253:  max=1&title=Cipangopaludina chinensis (Gastropoda: Viviparidae) in North America, review and update.
try:12254:  max=1&title=Thousands of living European snails sold as fish bait in state of Ohio.
try:12255:  max=1&title=Snails on migratory birds.
try:12256:  max=1&title=Corbicula fluminea (M&uuml;ller) in Louisiana.
try:13003:  max=1&doi=10.1093/icb/37.6.621
try:13004:  max=1&title=Invasion pressure to a ballast-flooded estuary and assessment of inoculant survival.
try:13216:  max=1&title=The effects of temperature, bo

In [13]:
rrl_xdd_df = pd.read_csv("./rrl_xdd_df.csv", sep=',')
rrl_xdd_df.tail(25)

Unnamed: 0,gdd_id,param,refnum,relation
10634,5ac253b2cf58f132fd5c93da,max=1&title_like=A successful crayfish invader...,32589,title_like
10635,56b98663cf58f14c238d4b94,max=1&title_like=An updated classification of ...,32590,title_like
10636,57b17ebbcf58f1442ca068cf,max=1&doi=10.1139/f08-162,32591,doi_match
10637,58c9abe3cf58f128a153cf46,"max=1&title_like=The spiny-cheek crayfish, Orc...",32592,title_like
10638,571bcb24cf58f15b2a782a1f,max=1&title_like=Invasive crayfish and crayfis...,32594,title_like
10639,57a5b2cecf58f17c163722fc,max=1&title_like=Invasive non-indigenous crayf...,32595,title_like
10640,57a22a2dcf58f13b8c118fc9,max=1&title_like=Crayfish Plague agent detecte...,32596,title_like
10641,56cf9bbccf58f1b14db25bdf,max=1&title_like=The diet of the spiny-cheek c...,32598,title_like
10642,572cc56bcf58f12aa3017605,max=1&title_like=Understanding invasion succes...,32599,title_like
10643,57b17ebbcf58f1442ca068cf,max=1&title_like=Aggressive interactions and c...,32600,title_like


In [16]:
#Need to alter this but gives basic stats on number matched each way
group_df = rrl_xdd_df.groupby(['relation']).count()
#group_df.reset_index(inplace=True)

In [17]:
group_df

relation,gdd_id,param,refnum
doi_match,1751,1751,1751
failed exception,23,0,23
no_match,169,169,169
title_like,8094,8094,8094
title_match,622,622,622
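
To put those counts in proportion, the short sketch below converts them to shares of the total. The counts are copied from the grouped output above; the dictionary name `relation_counts` is only for illustration.

```python
# Counts per relation, copied from the grouped output above.
relation_counts = {
    'doi_match': 1751,
    'failed exception': 23,
    'no_match': 169,
    'title_like': 8094,
    'title_match': 622,
}

total = sum(relation_counts.values())
print('total records: {}'.format(total))

# List relations from most to least common with their share of all records.
for relation, count in sorted(relation_counts.items(), key=lambda kv: kv[1], reverse=True):
    print('{:<18}{:>6}  {:.1%}'.format(relation, count, count / total))
```

Roughly three quarters of the records were linked only through title_like, the relation noted earlier as producing many false matches.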
