After talking about the model catalog today, I got curious about the DOIs minted in the USGS ID space that claim to be models. This notebook takes a look. A quick look at the DataCite works API shows that quite a few DOIs have been minted, with the majority coming from work I did to register all those GAP species habitat models. But there seem to be some other interesting things in there. I put in a process to pull them all together by paginating through the API results and then stash them in a Pickle file for future reference.

In [1]:
import requests
import pandas as pd

recordset = list()

In [2]:
# Function to create lookup URL for USGS DOIs where resource type is model
def get_url(page_number):
    return f"https://api.datacite.org/works?data-center-id=usgs.prod&resource-type-id=model&page[number]={page_number}"


In [3]:
%%time
# Get the first page of results and set up a structure to contain everything
page_number = 1
result = requests.get(get_url(page_number)).json()

CPU times: user 26.7 ms, sys: 7.96 ms, total: 34.6 ms
Wall time: 1.69 s


In [4]:
%%time
# Loop through and build out the full result set (the lazy way)
while len(result["data"]) > 0:
    recordset.extend(result["data"])
    page_number += 1
    result = requests.get(get_url(page_number)).json()

CPU times: user 1.33 s, sys: 104 ms, total: 1.44 s
Wall time: 5min 57s
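
The loop above stops on the first empty page, which works but will hang or crash if the API misbehaves partway through. Here is a minimal sketch of the same pagination idea with a page cap. The names `collect_records`, `fetch_page`, and `max_pages` are my own placeholders rather than anything from the DataCite API, and the fake pages only stand in for real responses so the sketch runs offline.

```python
def collect_records(fetch_page, max_pages=500):
    """Collect the "data" list from each page until an empty page comes back.

    fetch_page is any callable that takes a page number and returns the parsed
    JSON for that page as a dict. max_pages is a hypothetical safety cap so a
    misbehaving API cannot keep the loop running forever.
    """
    records = []
    for page_number in range(1, max_pages + 1):
        data = fetch_page(page_number).get("data", [])
        if not data:
            break
        records.extend(data)
    return records


# Stand-in for the API so the sketch runs offline: two pages, then nothing
fake_pages = {1: {"data": [{"id": "a"}, {"id": "b"}]}, 2: {"data": [{"id": "c"}]}}
demo_records = collect_records(lambda n: fake_pages.get(n, {"data": []}))
print(len(demo_records))
```

In this notebook the call would presumably look something like `collect_records(lambda n: requests.get(get_url(n), timeout=30).json())`, reusing the `get_url` helper defined earlier.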


If we ran the process to get the latest set of model type DOIs, then we build a dataframe from the attributes structure in each record. If we didn't run it, then we pull in the dataframe from the stashed Pickle file.

In [7]:
if len(recordset) == 0:
    df_models = pd.read_pickle("ModelsStash")
else:
    df_models = pd.DataFrame([i["attributes"] for i in recordset])
    df_models.to_pickle("ModelsStash")


In [8]:
# See what the data look like in a dataframe
df_models.head(5)

Unnamed: 0,doi,identifier,url,author,title,container-title,description,resource-type-subtype,data-center-id,member-id,...,view-count,views-over-time,download-count,downloads-over-time,published,registered,checked,updated,media,xml
0,10.5066/p9hiyvg2,https://doi.org/10.5066/p9hiyvg2,https://water.usgs.gov/GIS/metadata/usgswrd/XM...,"[{'given': 'Tracie R', 'family': 'Jackson'}]",MODFLOW-2005 and PEST models used to character...,U.S. Geological Survey,,Model,usgs.prod,usgs,...,0,[],0,[],2020,2020-03-05T18:30:50.000Z,,2020-03-05T18:30:50.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
1,10.5066/f76t0jpb,https://doi.org/10.5066/f76t0jpb,https://ca.water.usgs.gov/projects/reg_hydro/b...,"[{'given': 'Lorraine E.', 'family': 'Flint'}, ...",California Basin Characterization Model: A Dat...,U.S. Geological Survey,"The Basin Characterization Model (BCM), can tr...",Model,usgs.prod,usgs,...,0,[],0,[],2014,2014-09-17T00:57:18.000Z,,2020-03-02T21:34:34.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
2,10.5066/f798853m,https://doi.org/10.5066/f798853m,http://water.usgs.gov/GIS/metadata/usgswrd/XML...,"[{'given': 'J Hal', 'family': 'Davis'}, {'give...",MODFLOW 2000 and MT3DMS models of potentiometr...,U.S. Geological Survey,This model is a preliminary characterization o...,Model,usgs.prod,usgs,...,0,[],0,[],2016,2016-05-05T18:15:15.000Z,,2020-03-02T21:18:55.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
3,10.5066/f76w98jb,https://doi.org/10.5066/f76w98jb,https://water.usgs.gov/GIS/metadata/usgswrd/XM...,"[{'given': 'Alex R', 'family': 'Fiore'}, {'giv...",MODFLOW-2005 model used to evaluate the potent...,U.S. Geological Survey,"A three-dimensional groundwater flow model, MO...",Model,usgs.prod,usgs,...,0,[],0,[],2018,2018-03-13T14:32:08.000Z,,2020-03-02T21:18:05.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...
4,10.5066/f7h99392,https://doi.org/10.5066/f7h99392,http://water.usgs.gov/GIS/metadata/usgswrd/XML...,"[{'given': 'Stephen J', 'family': 'Cauller'}, ...",MODFLOW2005 model used to simulate the effects...,U.S. Geological Survey,A three-dimensional groundwater flow model was...,Model,usgs.prod,usgs,...,0,[],0,[],2016,2016-07-26T16:40:24.000Z,,2020-03-02T21:16:07.000Z,[],PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLT...


In [9]:
# Set the option to be able to look at everything in columns for convenience
pd.set_option('display.max_colwidth', None)

One of the things I was interested in is where these DOIs actually point in terms of their dereferencing URLs. To examine this, I create a new column with just the domain part of the URL so that I can group and look at things.

In [10]:
df_models['url_domain'] = df_models.apply(lambda row: row.url.split("/")[2], axis = 1)
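
Splitting on "/" works for the URLs in this set, but it raises an error on a missing value and returns junk for anything without a scheme. A slightly more defensive sketch, assuming we are happy to get `None` back for values that don't parse to a host, uses the standard library's `urlparse`. The helper name `url_domain` is my own.

```python
from urllib.parse import urlparse


def url_domain(url):
    """Return the host part of a URL, or None when the value is missing or has no host."""
    if not isinstance(url, str):
        return None
    # netloc is the empty string when the URL has no scheme, so fall back to None
    return urlparse(url).netloc or None


print(url_domain("https://code.usgs.gov/coawstmodel/COAWST"))
print(url_domain("http://usgs.gov"))
print(url_domain(None))
```

It could be applied to the dataframe with something like `df_models["url_domain"] = df_models["url"].map(url_domain)`.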

In [11]:
df_models[["doi","url_domain"]].groupby('url_domain').count()

Unnamed: 0_level_0,doi
url_domain,Unnamed: 1_level_1
alaska.usgs.gov,1
axiomdatascience.com,1
ca.water.usgs.gov,1
coastal.er.usgs.gov,1
code.usgs.gov,2
dx.doi.org,1
github.com,1
my.usgs.gov,3
nrtwq.usgs.gov,3
regclim.coas.oregonstate.edu,1


That's kind of an interesting spread. Almost all of the 1761 "models" in ScienceBase are the GAP habitat maps, which are a model output of sorts from a habitat affinity-based species distribution modeling method that involves human intervention. But it looks like we have some other interesting things to look at and think about in terms of what they actually represent for a future model catalog. In the sections below, I output particular records, take a look to see what's on the other end, and provide some notes.

In [21]:
# Helper function to filter the dataframe for a particular URL domain
def get_domain_items(domain):
    return df_models.loc[df_models['url_domain'] == domain][["doi","url","title","description"]]

In [18]:
get_domain_items("usgs.gov")

Unnamed: 0,doi,url,title,description
1874,10.5066/f78050nh,http://usgs.gov,Title,test111222
1875,10.5066/f7cr5rd0,https://usgs.gov,Title,description of the dataset


Well, those are disappointing. Both look like test records with placeholder titles and a bare site root for a URL. Looks like someone needs to do some cleanup of what actually got registered with DataCite, if that's even possible.
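
Records like these two could probably be caught with a crude screen before or after registration. This is only a sketch: the placeholder titles in `PLACEHOLDER_TITLES` and the "bare site root" rule are my own guesses based on the two records above, not any DataCite or USGS validation rule, and a real check would need tuning against the full set.

```python
from urllib.parse import urlparse

# Hypothetical list of titles that suggest a form was submitted with defaults
PLACEHOLDER_TITLES = {"", "title", "test"}


def looks_like_placeholder(record):
    """Flag a record whose title is a known placeholder or whose URL is just a site root."""
    title = record.get("title")
    title = title.strip().lower() if isinstance(title, str) else ""
    parsed = urlparse(record.get("url") or "")
    bare_site_root = parsed.path in ("", "/") and not parsed.query
    return title in PLACEHOLDER_TITLES or bare_site_root


# Three records copied from the outputs in this notebook
sample = [
    {"doi": "10.5066/f78050nh", "url": "http://usgs.gov", "title": "Title"},
    {"doi": "10.5066/f7cr5rd0", "url": "https://usgs.gov", "title": "Title"},
    {"doi": "10.5066/p9nquaow", "url": "https://code.usgs.gov/coawstmodel/COAWST",
     "title": "COAWST Modeling System v3.4"},
]
flagged = [r["doi"] for r in sample if looks_like_placeholder(r)]
print(flagged)
```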

In [19]:
get_domain_items("axiomdatascience.com")

Unnamed: 0,doi,url,title,description
113,10.5066/f72b8w3t,http://axiomdatascience.com/maps/integrated/?portal_id=51#,"Wave and Wind projections for United States Coasts; Mainland, Pacific Islands, and United States-Affiliated Pacific Islands","Coastal managers and ocean engineers rely heavily on projected average and extreme wave conditions for planning and design purposes, but when working on a local or regional scale, are faced with much uncertainty as changes in the global climate impart spatially-varying trends. Future storm conditions are likely to evolve in a fashion that is unlike past conditions and is ultimately dependent on the complicated interaction between the Earth?s atmosphere and ocean systems. Despite a lack of available data and tools to address future impacts, consideration of climate change is increasingly becoming a requirement for organizations considering future nearshore and coastal vulnerabilities. To address this need, the USGS used winds from four different atmosphere-ocean coupled general circulation models (AOGCMs) or Global Climate Models (GCMs) and the WaveWatchIII numerical wave model to compute historical and future wave conditions under the influence of two climate scenarios. The GCMs respond to specified, time-varying concentrations of various atmospheric constituents (such as greenhouse gases) and include an interactive representation of the atmosphere, ocean, land, and sea ice. The two climate scenarios are derived from the Coupled Model Inter-Comparison Project, Phase 5 (CMIP5; World Climate Research Programme, 2013) and represent one medium-emission mitigation scenario (Representative Concentration Pathways (RCP4.5) and one high-emissions scenario (RCP8.5). The historical time-period spans the years 1976 through 2005, whereas the two future time-periods encompass the mid (years 2026 through 2045) and end of the 21st century (years 2081 through 2099/2100). 
Continuous time-series of dynamically-downscaled hourly wave parameters (significant wave heights, peak wave periods, and wave directions) and three-hourly winds (wind speed and wind direction) are available for download at discrete deep-water locations along four U.S. coastal regions: ? Pacific Islands [this should be hyperlinked] ? West Coast [this should be hyperlinked] ? East Coast [this should be hyperlinked] ? Alaska Coasts [this should be hyperlinked] The data and cursory overviews of changing conditions along the coasts are summarized in (make these into links.. or have them as a box on the right-hand-side as is currently the case.. if so, then change the text here to read ?? along the coasts are summarized in the documents provided in the right-hand-side box.?) Storlazzi, C.D., Shope, J.B., Erikson, L.H., Hegermiller, C.A., and Barnard, P.L., 2015. Future wave and wind projections for United States and United States-affiliated Pacific Islands: U.S. Geological Survey Open-File Report 2015?1001, 426 p., http://dx.doi.org/10.3133/ofr20151001. Erikson, L.H., Hegermiller, C.E., Barnard, P.L., and Storlazzi, C. 2016. Wave projections for United States mainland coasts. U.S. Geological Survey pamphlet to accompany data set, http://dx.doi.org/10.5066/F7D798GR The time-series data cursory overviews provide information on trends and variability of geophysical variables that are expected to respond to changes in global-scale forcing. The data are being used for and are made available for further evaluation of trends and variability in offshore conditions, and as boundary conditions for regiona-l and local-scale coastal hazard models. Because winds and waves are the key processes driving extreme water levels and wave-driven flooding, the data are expected to be crucial for projecting future transient sea-level extremes on coasts and for defining areas that might be vulnerable to changing wind and wave conditions."


This one from a USGS partner in Alaska that I know pretty well looks interesting. The DOI dereferences to a web application run by Axiom Data Science that provides an interactive system to visualize and download the cached results (model output data) for a wave height and wind speed model that looks at both historic conditions and projected futures under several scenarios for coastal zones. The abstract and the web site list other interesting parts of the picture, including a USGS Open File Report describing the methods and what is possibly an "off the books" web page that purportedly accompanied a data release but is the kind of thing that might otherwise have ended up in a USGS Data Series or something similar with a journal. This is a pretty interesting case that has just about all the components for a model catalog, except that I couldn't find the source code, at least with a cursory look.

In [20]:
get_domain_items("code.usgs.gov")

Unnamed: 0,doi,url,title,description
344,10.5066/p9nquaow,https://code.usgs.gov/coawstmodel/COAWST,COAWST Modeling System v3.4,Coupled ocean atmosphere wave sediment transport modeling system
481,10.5066/p9nqnh41,https://code.usgs.gov/ecosystems/FRESC/sagebrush_hurdle_model,sagebrush_hurdle_model,


I know a little bit about the COAWST model coupling system as I looked at it as an example for EarthMAP. The DOI reference here points to a fine enough source repository, but it does present some challenges in terms of interfacing toward the model catalog idea. You really have to dig into the user manual in a Word document to get at deeper-level information about the model, which means writing code to crawl and put pieces together here would be pretty challenging.

I'd heard about the Sagebrush Hurdle Model but hadn't looked at the source code before. This is an interesting case where there is some useful metadata embedded in the README, namely a reference to a journal article that the model was used in. The repo also has a CSV file that contains the input data for the model, which probably happens a fair bit depending on the size of the necessary data. It's not a bad practice, in some ways, to stash input data with the model. However, in this case, there's no real documentation in or with the dataset (species occurrence points and basic environmental conditions) to understand where it came from without some sleuthing. So, it gets points for transparency, but some demerits for reusability.

In [22]:
get_domain_items("github.com")

Unnamed: 0,doi,url,title,description
484,10.5066/p9yjbmbq,https://github.com/rerickson-usgs/eDNA_field_study,Code to assist the USFWS with eDNA field sampling designs for eDNA,


This is an interesting case of a minted DOI pointing to a provisional code repo sitting in a personal space in GitHub. I guess that's okay under policy, but it opens a few questions. I would tend not to mint a DOI for something that was not eventually going to be production code. Richie Erickson does provide a nice disclaimer up front in the README indicating that he doesn't expect someone to build on it. There's a title reference to an article that's now [published](https://www.ncbi.nlm.nih.gov/pubmed/31188494), and it would be good to have some guidance about following up to put links like that into some type of code metadata that supports the reference.

In [23]:
get_domain_items("my.usgs.gov")

Unnamed: 0,doi,url,title,description
370,10.5066/p9t9j3ju,https://my.usgs.gov/bitbucket/users/rerickson_usgs.gov/repos/networknodeipm/browse,Spatially explicit integral projection model,
1849,10.5066/f7dj5d4h,https://my.usgs.gov/bitbucket/projects/UMESC/repos/migratorypathwaysourcesink/browse,carpIPM R code,"Fish grow continuously. Integral projection models (IPM) model size continuously. We developed an IPM for fish. We specifically developed the model for grass carp, however, the model could also be applied to species with similar life-histories such as other carp species. We included YY-males within the model because releasing YY-males has been proposed a control method for the species. YY-males are fish that have 2 male chromosomes compared to a XY-male. When YY-males mate, they only produce male (XY) offspring. This decreases the female proportion of the population and can, in theory, eradicate local populations by biasing the sex-ratio. We created our model as an R package and the model has been documented by both a TRACE Document and peer-reviewed manuscript."
1852,10.5066/f7416v7z,https://my.usgs.gov/bitbucket/projects/UMESC/repos/migratorypathwaysourcesink/,Migratory Pathway Source Sink,This is the R code to support the manuscript ``Defining and classifying migratory habitats as sources and sinks: the migratory pathway approach''.\r\nThe are three files besides this file (the README file):\r\nA source file of function: functionsUploaded2016.R\r\nA script file that was used for the manuscript and is an R Markdown file: SubmittedCode.Rmd; and\r\nand a LICENSE file.


These three cases point to the Atlassian Bitbucket instance that we've now decommissioned on myUSGS. Someone should probably follow up to see if these repos got moved somewhere else and then change the dereferencing URL in the DOI system. The lack of a description in the DataCite metadata here (and elsewhere) should also be a flag.
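
Finding every DOI that still points at the retired instance is a simple filter on host and path. A rough sketch follows, assuming that everything under `my.usgs.gov/bitbucket/` lived on that instance; the `RETIRED_LOCATIONS` list and the function name are mine, and the list would need to be maintained by hand as systems get shut down.

```python
from urllib.parse import urlparse

# Assumption: anything under this host and path prefix lived on the retired Bitbucket instance
RETIRED_LOCATIONS = [("my.usgs.gov", "/bitbucket/")]


def points_to_retired_host(url):
    """Return True when the URL sits under one of the retired host/path prefixes."""
    if not isinstance(url, str):
        return False
    parsed = urlparse(url)
    return any(
        parsed.netloc.lower() == host and parsed.path.startswith(prefix)
        for host, prefix in RETIRED_LOCATIONS
    )


print(points_to_retired_host(
    "https://my.usgs.gov/bitbucket/projects/UMESC/repos/migratorypathwaysourcesink/"
))
print(points_to_retired_host("https://code.usgs.gov/coawstmodel/COAWST"))
```

With the dataframe in hand, `df_models[df_models["url"].map(points_to_retired_host)]` should list the DOIs that need a new target.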

In [24]:
get_domain_items("regclim.coas.oregonstate.edu")

Unnamed: 0,doi,url,title,description
283,10.5066/f7r78dhk,http://regclim.coas.oregonstate.edu/data-access/index.html,Dynamically downscaled climate simulations over North America,


This is a web site that provides access to data distribution services for NetCDF files from a climate model downscaling process. It's essentially model output data or perhaps more precisely, model derivative data. This process has been done in a number of different places using a variety of downscaling methods appropriate to different uses. The most notable framework in USGS for distributing these data with some value-added services for statistical summarization is through the GeoDataPortal. In this case, the DOI might actually be miscategorized as a model but it's related to modeling, indicating that some guidance and parameters are probably needed in USGS policy.

In [25]:
get_domain_items("dx.doi.org")

Unnamed: 0,doi,url,title,description
1853,10.5066/f7639nn9,http://dx.doi.org/10.5066/F7639NN9,Capture Map Bias Hypothetical Model Archive,"A MODFLOW model of a hypothetical stream-aquifer system is presented for the evaluation and characterization of capture map bias. The hypothetical model is a single-layer model constructed with 30 rows and 100 columns. The MODFLOW model includes a stream (represented with the MODFLOW SFR or CHD Packages), mountain block recharge (represented with the MODFLOW RCH Package), and evapotransipration (represented with the MODFLOW EVT Package). The hypothetical model is used to create capture maps and capture difference maps. Map regions with large capture and depletion fraction differences are evaluated with new methods to compute capture map bias. The hypothetical stream-aquifer system model is used for sensitivity analyses to characterize capture map bias."


This is a curious one: the registered URL is itself a DOI link (it looks like the same DOI, just in upper case), and it is not active at the moment. Something weird happened with this. It seems like we should have some safeguards in our toolset to keep this kind of thing from happening.
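
One safeguard of this kind would be a check, at registration time, that the target URL is not just another DOI resolver link. A minimal sketch is below. The set of resolver hosts and both function names are my assumptions, and I'm relying on DOIs being case-insensitive when comparing the target against the DOI itself.

```python
from urllib.parse import urlparse

# Assumed set of DOI resolver hosts; extend as needed
DOI_RESOLVER_HOSTS = {"doi.org", "dx.doi.org", "www.doi.org"}


def targets_doi_resolver(url):
    """Return True when the target URL is itself a DOI resolver link."""
    return urlparse(url or "").netloc.lower() in DOI_RESOLVER_HOSTS


def is_self_referential(doi, url):
    """Return True when the target URL resolves back to the very DOI being registered."""
    parsed = urlparse(url or "")
    same_doi = parsed.path.strip("/").lower() == doi.lower()
    return parsed.netloc.lower() in DOI_RESOLVER_HOSTS and same_doi


print(is_self_referential("10.5066/f7639nn9", "http://dx.doi.org/10.5066/F7639NN9"))
```

For the record above, the second check comes back true, which would explain why the link goes nowhere.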

The description does present an interesting class of model "stuff" that will end up in our catalog somewhere - inputs and outputs from hypothetical model runs made to critically examine aspects of model performance or provide some level of calibration. These are important artifacts in a modeling lifecycle that we should come up with a way to handle, but they should be somehow distinct from other types of model "packages."

In [30]:
# Provide an index on a key substring identifying GAP habitat maps and then show the rest
sciencebase_items = get_domain_items("www.sciencebase.gov")
sciencebase_items["hab_map"] = sciencebase_items["title"].str.find("_2001v1",2)
sciencebase_items.loc[sciencebase_items["hab_map"] < 0]

Unnamed: 0,doi,url,title,description,hab_map
7,10.5066/p9qrd7a3,https://www.sciencebase.gov/catalog/item/5c5ddfa2e4b0fe48cb32e717,"A Soil-Water-Balance model and precipitation data used for HEC/HMS modelling at the Glacial Ridge National Wildlife Refuge area, northwestern Minnesota, 2002-15.","A soil-water balance model (SWB) was developed to estimate evapotranspiration in six ditch basins of the Glacial Ridge National Wildlife Refuge area, northwestern Minnesota, during 2002-2015. The model was used to estimate evapotranspiration in water balances in six ditch basins as part of the associated report, U.S. Geological Survey Scientific Investigations Report 2019-5041 (http://dx.doi.org/10.3133/SIRXXXX). This SWB model was derived from the statewide Minnesota SWB potential recharge model, described, calibrated, and documented as part of U.S. Geological Survey Scientific Investigations Report 2015-5038 (http://dx.doi.org/10.3133/sir20155038). The data sets and calibrations from the Minnesota statewide model were used without modification except for the more detailed precipitation, water capacity, and land use input data. In this model, precipitation data were interpolated from local raingages. Water capacity data were taken from the gSSURGO soils data base. Land-use data were compiled from three sources using the most detailed data: the National Land Cover Database, the Cropland Data Layer and data from the local Natural Resources Conservation Service office. Details of the procedures used to produce these three detailed data sets can be found in U.S. Geological Survey Scientific Investigations Report 2019-XXXX (http://dx.doi.org/10.3133/SIRXXXX). This model was not recalibrated. All calibrated parameters remain the same as those in the statewide Minnesota SWB model. The areal resolution of this model was increased to a 60-meter square grid and the temporal period was extended through 2015 relative to the statewide SWB model. 
Daymet (version 2) daily surface temperature data necessary to run this SWB model are available upon request through the following link: https://doi.org/10.3334/ORNLDAAC/1219. Also included in this data archive is a file of selected hourly precipitation totals for six ditch basins used in HEC/HMS ditch-flow modelling described in the associated report.",-1
9,10.5066/p90jy18d,https://www.sciencebase.gov/catalog/item/5e594162e4b01d50924ab17d,Assessment of uncontained Zequanox applications in a Midwestern lake code,,-1
10,10.5066/p94owfsm,https://www.sciencebase.gov/catalog/item/5e593589e4b01d50924a99ed,R code Evidence for a growing population of eastern migratory monarch butterflies is currently insufficient,,-1
16,10.5066/p995smvw,https://www.sciencebase.gov/catalog/item/5e3af9b8e4b0edb47bddac83,R Code to analyze data from sediment incubation experiments (Fox and Duck Rivermouths; 2016),,-1
30,10.5066/f7nk3c59,https://www.sciencebase.gov/catalog/item/57db0908e4b090824ffc3324,"Physics-based numerical circulation model outputs of ocean surface circulation during the 2010-2013 summer coral-spawning seasons in Maui Nui, Hawaii, USA","Here we present surface current results from a physics-based, 3-dimensional coupled ocean-atmosphere numerical model that was generated to understand coral larval dispersal patterns in Maui Nui, Hawaii, USA. The model was used to simulate coral larval dispersal patterns from a number of existing State-managed reefs and large tracks of reefs with high coral coverage that might be good candidates for marine-protected areas (MPAs) during 8 spawning events during 2010-2013. The goal of this effort is to provide geophysical data to help provide guidance to sustain coral health in Maui Nui, Hawaii, USA. Each model output run is available as a netCDF file with self-contained attribute information. Each file name is appended with the model-simulation date in YYYYMMDD format; the file name denotes the beginning of simulation portion of the model run, with the model starting and spinning up over two days before the model-simulation date in the file name.",-1
38,10.5066/f7kd1x68,https://www.sciencebase.gov/catalog/item/5df12287e4b02caea0f635ca,Data Retrieval and Graphing Using the LTRM Fish Catch GeoJSON Data Service,"UMESC hosts a web service for the retrieval of LTRM fish catch data using the GeoJSON data format. By using this data service, the public can automate data access to LTRM fisheries data. UMESC has written a series of example Python scripts that illustrate data retrieval and plotting. These example scripts focus on plotting fish catch, but there are limitless geo-spatial, tabular, and plotting products that can be generated using this data interface. The GeoJSON data format can be readily consumed by JavaScript and JavaScript components (e.g. the open-source libraries of Leaflet) to create web-based or mobile mapping applications.",-1
39,10.5066/f7ft8k8k,https://www.sciencebase.gov/catalog/item/5df11bc2e4b02caea0f5f70c,Population Objectives Regional Planning Tool for Grassland Birds,,-1
40,10.5066/f7qn662b,https://www.sciencebase.gov/catalog/item/5df11b10e4b02caea0f5f147,Use of Alternating and Pulse Direct Current Electrified Fields for Zebra Mussel Control Code,,-1
41,10.5066/f76972tt,https://www.sciencebase.gov/catalog/item/5df124b1e4b02caea0f6479a,SAS code for analyzing water temperature data,"This code may be used to fit linear models with multivariate random effects and heterogeneous measurement-level residual variances. The code as written may be used to estimate associations between water temperature ('temp') and continuous year ('yearctr'), study reach (or field station; 'fs'), log-transformed mean July water discharge (in 1000 cms units; 'logmeanJulycms1000'), number of days from a central sampling date (for a given year, days from a standard month and day; 'jdatectr'), time of sampling (in fractions of hours from noon on the given date; 'timectrhr') and interactions thereof.",-1
42,10.5066/f7zc8239,https://www.sciencebase.gov/catalog/item/5df121a8e4b02caea0f62f2d,Composite Raster and Divergence Tool,"ArcGIS ArcMap add-in, when using this tool, a user can: 1) create a date prioritized composite raster from a collection of raster layers 2) create a lookup raster for the composite raster that identifies which input layers were used to create the composite raster 3) create a divergence raster where pixel values represent the divergence from a user specified value",-1


I already know about the GAP habitat maps, but I was curious about what other "models" are distributed via ScienceBase. It looks like there's some cool stuff online. The bunch of missing descriptions is problematic and something that should be easy to fix in the DOI system. Unfortunately, ScienceBase seems to be taking some much-needed time off this evening, so I can't pull up any of these. Several of the descriptions read like model outputs, and I would guess that the ScienceBase Items might actually be classed as data releases (or at least they could have been). The nice thing about using ScienceBase as the backend landing system for these DOIs would be the reasonably simple structured metadata capability. I'd be looking for things like links to source code, links to associated publications, and links or relationships to input data that might have been structured into the items, making the production of a linked catalog more feasible.
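
To get a sense of how widespread the missing descriptions are, a quick tally by landing domain would help decide where to start the cleanup. This is just a sketch over plain dictionaries with `url` and `description` keys (the same fields shown in the outputs above); the function name is mine, and it treats empty strings and non-string values such as NaN as missing.

```python
from collections import Counter
from urllib.parse import urlparse


def missing_descriptions_by_domain(records):
    """Count records with no usable description, grouped by the host of their URL."""
    counts = Counter()
    for record in records:
        description = record.get("description")
        if not isinstance(description, str) or not description.strip():
            counts[urlparse(record.get("url") or "").netloc] += 1
    return counts


# A few records copied from the outputs in this notebook
sample = [
    {"url": "https://github.com/rerickson-usgs/eDNA_field_study", "description": None},
    {"url": "https://code.usgs.gov/ecosystems/FRESC/sagebrush_hurdle_model", "description": ""},
    {"url": "https://code.usgs.gov/coawstmodel/COAWST",
     "description": "Coupled ocean atmosphere wave sediment transport modeling system"},
]
missing = missing_descriptions_by_domain(sample)
print(missing.most_common())
```

The same function could take `df_models.to_dict("records")` to run across the full set.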