The FWS Environmental Conservation Online System (ECOS) contains information about threatened and endangered (T&E) species, among other things. USGS EMA staff (or possibly others) went through the work plan species list, determined the appropriate links to ECOS species web pages, and recorded those links in one of the tables provided in the source inventory. In reviewing the various systems and access points, we found that some information on the ECOS species web pages is not accessible through the ECOS TESS web services. We also found that the identifiers used on the ECOS species web pages do not appear in any of the other accessible TESS interfaces. We therefore decided, as a first step, to run a rudimentary web scraping tool that gathers a few usable pieces of information from the linked ECOS pages and caches them in a file for use in later work.

One of the main items extracted here is the FWS's own determination of the appropriate ITIS species to link to. Whenever that determination is available, we use it in preference to running a species search, as one avenue for establishing a linkage and retrieving information for later use.
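
The preference described above can be sketched roughly as follows. This is only an illustration: it assumes each cached record is a dictionary carrying the 'ITIS TSN' and 'Scientific Name' keys seen in the example output at the end of this section, and `choose_itis_lookup` is a hypothetical helper name, not a bispy function.

```python
def choose_itis_lookup(record):
    """Decide how to look a species up in ITIS for one cached ECOS record.

    Hypothetical helper: prefers the TSN that FWS recorded on the ECOS page
    and falls back to a scientific name search only when no TSN is present.
    """
    tsn = str(record.get("ITIS TSN") or "").strip()
    if tsn.isdigit():
        # FWS supplied a usable TSN, so no name search is needed
        return ("tsn", tsn)

    name = str(record.get("Scientific Name") or "").strip()
    if name:
        # No TSN on the page; fall back to searching by name
        return ("name_search", name)

    # Nothing usable was scraped for this record
    return ("unresolved", None)
```

Records that come back as "unresolved" would need manual review or another source before they can be linked.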

In writing the ECOS web scraper, we found that the ECOS pages are difficult to parse. They appear to be assembled dynamically from various sources, and where and how the information is written into the HTML/JavaScript on each page is somewhat inconsistent. This first scraper is crude, and we will revisit it as needed.

# Data prep
The ECOS links were contained in the "FWS 7 Year Workplan Species" worksheet from the original "Prelisting Science USGS Master_19Mar2018" spreadsheet used as source material for this exercise. The links were embedded as hyperlinks on the species "Scientific Name" field using Excel's proprietary hyperlink mechanism. As a result, we had to use a simple VBA script to extract the links into their own field. We did this by copying the scientific name fields to another Excel file, running the VBA macro there, and then including that file as an intermediary for processing.

In [1]:
import pandas as pd
import bispy
from IPython.display import display
import json
from joblib import Parallel, delayed
import random

ecos = bispy.tess.Ecos()

In [2]:
# Read the extracted ECOS links and scientific names from the intermediary Excel file
spp_ecos_links = pd.read_excel(
    "sources/AssitionalSourceData.xlsx",
    sheet_name="Extracted Species ECOS Links"
)
# Put just the links into a list for processing
ecos_link_list = spp_ecos_links[spp_ecos_links["ECOS Link"].notnull()]["ECOS Link"].tolist()

In [3]:
# Use joblib to run multiple requests for ECOS documents in parallel
ecos_cache = Parallel(n_jobs=8)(delayed(ecos.scrape_ecos)(url) for url in ecos_link_list)
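
# The notebook does not show how scrape_ecos behaves when a page fails to load.
# If it can raise, one bad URL would abort the whole parallel batch. A minimal
# sketch of a defensive wrapper follows. It is an assumption-laden illustration:
# it uses the standard library's ThreadPoolExecutor instead of joblib, the names
# safe_scrape and scrape_all are hypothetical, and the "Error: ..." status text
# is invented here rather than taken from bispy.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def safe_scrape(scrape_func, url):
    """Call scrape_func(url); on failure return a stub record instead of raising."""
    try:
        return scrape_func(url)
    except Exception as exc:
        # Mirror the 'Processing Metadata' keys seen in the cached records;
        # the status wording is an assumption made for this sketch
        return {
            "Processing Metadata": {
                "Date Processed": datetime.now().isoformat(),
                "Search URL": url,
                "Status": f"Error: {exc}",
            }
        }


def scrape_all(scrape_func, urls, max_workers=8):
    """Scrape every URL in parallel threads, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: safe_scrape(scrape_func, u), urls))
```

# Under those assumptions, scrape_all(ecos.scrape_ecos, ecos_link_list) would
# return one record per link, with failures recorded rather than raised.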

In [4]:
# Dump the cache of ECOS data to a JSON file for later use
with open("cache/ecos.json", "w") as f:
    json.dump(ecos_cache, f, indent=4)

In [5]:
# Reload the JSON file to confirm it reads back correctly, then show the
# number of cached records and one randomly chosen example
with open("cache/ecos.json", "r") as f:
    cached_ecos_data = json.load(f)

print(len(cached_ecos_data))
display(random.choice(cached_ecos_data))

342


{'Common Name': 'Yellow banded Bumble bee',
 'Federal Register Documents': [{'Citation Page': '81 FR 14058 14072',
   'Date': '2016-03-16',
   'Title': '90-Day Findings on 29 Petitions; Notice of petition findings and initiation of status reviews',
   'Title_link': 'https://www.govinfo.gov/link/fr/81/14058?link-type=pdf'}],
 'ITIS TSN': '714843',
 'Processing Metadata': {'Date Processed': '2019-07-01T16:47:35.650051',
  'Search URL': 'https://ecos.fws.gov/ecp/species/10403',
  'Status': 'Page Successfully Retrieved'},
 'Scientific Name': 'Bombus terricola',
 'Status Summary': [{'Date Listed': '',
   'Lead Region': 'Northeast Region (Region 5)',
   'Lead Region_link': 'http://www.fws.gov/northeast/',
   'Status': 'Under Review',
   'Where Listed': 'Wherever found'}]}
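
Beyond spot-checking a single record, a quick summary of the whole cache can show how many records carry an FWS-supplied ITIS TSN and which listing statuses appear. The sketch below assumes every record follows the structure of the example above (an 'ITIS TSN' string and a 'Status Summary' list of dictionaries with a 'Status' key); `summarize_cache` is a hypothetical helper, not part of bispy.

```python
from collections import Counter


def summarize_cache(records):
    """Count cached records, those with an ITIS TSN, and listing statuses."""
    # A record counts as having a TSN only if the value is non-empty
    with_tsn = sum(1 for r in records if str(r.get("ITIS TSN") or "").strip())

    # Tally every status entry across all records; missing lists are skipped
    statuses = Counter(
        entry.get("Status", "")
        for r in records
        for entry in (r.get("Status Summary") or [])
    )

    return {
        "records": len(records),
        "with_itis_tsn": with_tsn,
        "statuses": dict(statuses),
    }
```

Calling `summarize_cache(cached_ecos_data)` would then give a small dictionary that can be displayed or compared between scraper runs.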