To keep up with current literature on species of interest, this notebook accesses eXtract Dark Data (xDD), formerly known as GeoDeepDive. The xDD database is continuously updated with published scientific literature from many large publishers, including Elsevier, Taylor & Francis, GSA, and USGS. We are currently working with the xDD team on tools and techniques for a) identifying literature potentially applicable to species-based research and b) using natural language processing to pull specific data from those sources. This is an ongoing effort that will result in improved production capabilities over time.

In the near term, we take advantage of some basic and enhanced search functionality to identify potential articles of interest in the xDD library of millions of documents, which grows daily. The xdd module in the bispy package contains search and packaging functionality that interfaces with the xDD REST API.
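As a rough illustration of what a search and its packaging involve, the sketch below builds a snippets URL of the form recorded in the 'Search URL' field of the cached results, and wraps a list of returned documents in the 'Processing Metadata' / 'Data' record layout seen in that output. The helper names `build_snippets_url` and `package_snippets` are hypothetical, and this is not the bispy implementation, only a minimal sketch under those assumptions.

```python
from datetime import datetime

# Base URL and query flags copied from the 'Search URL' field in the cached output.
SNIPPETS_URL = "https://geodeepdive.org/api/snippets?full_results&clean&term="


def build_snippets_url(term):
    """Return the snippets search URL for one search term (hypothetical helper)."""
    return SNIPPETS_URL + term


def package_snippets(term, docs):
    """Wrap a list of document dicts in the record layout used by the cache."""
    return {
        "Processing Metadata": {
            "Status": "Success",
            "Date Processed": datetime.now().isoformat(),
            "Search URL": build_snippets_url(term),
            "Search Term": term,
            "Number Documents": len(docs),
        },
        "Data": docs,
    }
```

For example, `package_snippets("Centrocercus minimus", docs)` would yield a record whose 'Number Documents' equals `len(docs)`.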

In [3]:
import json
import random
import warnings

import requests
import bispy
from IPython.display import display
from joblib import Parallel, delayed

xdd = bispy.xdd.Xdd()

warnings.filterwarnings('ignore')

In [4]:
# Open the source WLCI species list created by build-specie-list.ipynb
with open("sources/wlci_xdd_specie_list.txt", "r") as f:
    sp_list = [line.strip() for line in f]

In [5]:
# Use joblib to run multiple requests to xDD in parallel, one per scientific name
xdd_results = Parallel(n_jobs=8)(delayed(xdd.snippets)(name) for name in sp_list)
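If a single request fails, the whole parallel batch may fail with it. One possible guard, sketched here with the standard library's `ThreadPoolExecutor` swapped in for joblib, is to wrap each search so that an exception becomes an error record instead. The names `safe_search` and `search_all` and the 'Error' and 'Message' fields are hypothetical and not part of bispy. Because the error record carries no 'Number Documents' key, it would be dropped by the cache filter in the next cell.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_search(search, name):
    """Run search(name); on failure return an error record instead of raising."""
    try:
        return search(name)
    except Exception as exc:  # keep the batch going if one request fails
        return {
            "Processing Metadata": {
                "Status": "Error",        # hypothetical status value
                "Search Term": name,
                "Message": str(exc),      # hypothetical field
            }
        }


def search_all(search, names, max_workers=8):
    """Search every name concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda n: safe_search(search, n), names))
```

Under these assumptions, `search_all(xdd.snippets, sp_list)` would stand in for the joblib call, provided `xdd.snippets` can safely be called from several threads.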


In [6]:
# Dump the records we discovered and packaged to a cache file, keeping only searches that found documents
# I need to revisit this once I get some things cleared up with taxonomic matching to hopefully find more records
xdd_hits = [x for x in xdd_results if x["Processing Metadata"].get("Number Documents", 0) > 0]
with open("cache/xdd.json", "w") as f:
    json.dump(xdd_hits, f, indent=4)
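The filter applied before caching can also be written as a small named predicate, which makes it easier to reuse, for instance to see how many documents each species name returned. This is a minimal sketch that assumes only the record layout visible in the cached output; `has_documents` and `documents_per_term` are hypothetical helper names.

```python
def has_documents(record):
    """True when a result record reports at least one matching document."""
    return record.get("Processing Metadata", {}).get("Number Documents", 0) > 0


def documents_per_term(records):
    """Map each search term to its document count, for records with hits."""
    return {
        r["Processing Metadata"]["Search Term"]: r["Processing Metadata"]["Number Documents"]
        for r in records
        if has_documents(r)
    }
```

Calling `documents_per_term(xdd_results)` would give a quick view of which names matched literature and which did not, something that may help when revisiting taxonomic matching.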

In [7]:
# Open the file back up and verify by showing the record count and one random record
with open("cache/xdd.json", "r") as f:
    xdd_cache = json.load(f)

print(len(xdd_cache))
display(random.choice(xdd_cache))

58


{'Processing Metadata': {'Status': 'Success',
  'Date Processed': '2019-07-03T15:28:49.990156',
  'Search URL': 'https://geodeepdive.org/api/snippets?full_results&clean&term=Centrocercus minimus',
  'Search Term': 'Centrocercus minimus',
  'Number Documents': 41},
 'Data': [{'pubname': 'Biological Conservation',
   'publisher': 'Elsevier',
   '_gddid': '57a9de64cf58f138b8812975',
   'title': 'Polygyny and female breeding failure reduce effective population size in the lekking Gunnison sage-grouse',
   'doi': '10.1016/j.biocon.2007.10.018',
   'coverDate': 'February 2008',
   'URL': 'http://www.sciencedirect.com/science/article/pii/S0006320707004223',
   'authors': 'Stiver, Julie R.; Apa, Anthony D.; Remington, Thomas E.; Gibson, Robert M.',
   'highlight': ['of a small population of the Gunnison sage-grouse (Centrocercus minimus), a ground',
    'Gunnison sage-grouse (Centrocercus minimus), a ground nesting bird with a lek mating',
    'al., 1999; Kelly, 2001). The Gunnison sage-grouse