This notebook provides an example of making a batch of DOI updates based on records in ScienceBase. In this case, we had previously reserved DOIs, all pointing at ScienceBase Items as their de-referencing URLs, and we now need to finalize the DOIs and make them public. It works with Brandon Serna's usgs_datatools package, whose latest version works with the new DOI REST API. Most of the pieces here should be fairly easy for others to reuse when they need to do something similar.

The first few blocks here do all the same things Brandon has shown in other examples, setting up a session with the DOI tool to do work.

In [1]:
import os
import json
import requests
import getpass
from IPython.display import display

from usgs_datatools import doi

In [2]:
username = 'sbristol@usgs.gov'
password = getpass.getpass('USGS AD Password: ')

USGS AD Password: ········


In [3]:
doi_session = doi.DoiSession(env='production')

In [4]:
doi_session.doi_authenticate(username, password)

<usgs_datatools.doi.DoiSession at 0x107ba47b8>

In this case, I'm starting from the existing ScienceBase Items, where we previously recorded the reserved DOIs as one of the identifiers. I need to validate that the critical information associated with each DOI matches what's in the ScienceBase Item, resetting a couple of attributes along the way and making the DOI public. I could run it all in one process that goes through each ScienceBase Item, connects to the DOI tool, and does the work, but since we are running this in a Notebook environment, we can simply build a stripped-down data object in memory and then run against that.

We're dealing with two different ScienceBase collections with 1,719 items in each one, so we need to paginate through ScienceBase search results to build out what we need. You could also do this with pysb and would need to use something like that if dealing with restricted items. Because we just finished the data release process on these and made the collections public, we can simply access the ScienceBase REST API with requests.

In [5]:
sbItems = []
for collectionID in ['527d0a83e4b0850ea0518326', '5951527de4b062508e3b1e79']:

    nextLink = 'https://www.sciencebase.gov/catalog/items?max=100&format=json&fields=title,identifiers&parentId='+collectionID

    while nextLink is not None:
        sbResult = requests.get(nextLink).json()

        # Follow the 'nextlink' URL until ScienceBase stops returning one
        if 'nextlink' in sbResult:
            nextLink = sbResult['nextlink']['url']
        else:
            nextLink = None

        for item in sbResult['items']:
            # Keep only the ID, title, and DOI (None if no DOI identifier is recorded)
            thisSBItem = {'id': item['id']}
            thisSBItem['title'] = item['title']
            thisSBItem['doi'] = next((i['key'] for i in item.get('identifiers', []) if i['type'] == 'doi'), None)
            sbItems.append(thisSBItem)
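
For reuse elsewhere, the pagination pattern in the cell above could be factored into a generator. This is only a sketch: the names `iter_sb_items`, `fetch_json`, and `fake_pages` are made up for illustration, and the only thing assumed about ScienceBase is what the cell above already relies on (an `items` list and an optional `nextlink` entry with a `url`). Passing the fetch function in makes it possible to try the loop against canned pages without touching the network.

```python
def fetch_json(url):
    """Default fetcher: a plain unauthenticated GET, as in the cell above."""
    import requests  # imported lazily so the generator can be exercised offline
    return requests.get(url).json()


def iter_sb_items(first_url, fetch=fetch_json):
    """Yield items from every page of a search result, following 'nextlink'."""
    url = first_url
    while url is not None:
        page = fetch(url)
        yield from page.get('items', [])
        # A 'nextlink' entry is assumed to be present only when more pages remain
        url = page.get('nextlink', {}).get('url')


# Offline check with two hypothetical pages keyed by made-up URLs
fake_pages = {
    'page1': {'items': [{'id': 'a'}, {'id': 'b'}], 'nextlink': {'url': 'page2'}},
    'page2': {'items': [{'id': 'c'}]},
}
ids = [item['id'] for item in iter_sb_items('page1', fetch=fake_pages.get)]
print(ids)
```

With the real API, the same generator would be called with the search URL built in the cell above and the default fetcher.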


In [6]:
# See if we pulled the expected number of items - 3,438
print(len(sbItems))

3438
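
The count confirms that every item was retrieved, but not that every item actually carries a DOI identifier. Since the extraction stores `None` when no identifier of type `doi` is found, a quick check before running any updates might look like the sketch below. The helper name `items_missing_doi` and the sample records are hypothetical; in the notebook you would pass `sbItems` instead.

```python
def items_missing_doi(items):
    """Return the items whose 'doi' value is missing, None, or empty."""
    return [item for item in items if not item.get('doi')]


# Made-up sample records in the same shape as the sbItems entries
sample_items = [
    {'id': 'x1', 'title': 'Has a DOI', 'doi': 'doi:10.5066/EXAMPLE'},
    {'id': 'x2', 'title': 'No DOI recorded', 'doi': None},
]

missing = items_missing_doi(sample_items)
print(len(missing), [item['id'] for item in missing])
```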


In [14]:
%%time

for index,sbItem in enumerate(sbItems):
    # Put in a break for testing
    #if index > 5:
    #    break

    # Get the current record from the DOI database
    thisDOIRecord = doi_session.get_doi(sbItem['doi'])

    # Convert the returned 'message' to a dict for comparing current attributes
    thisDOIDict = json.loads(thisDOIRecord['message'])
    
    # The process failed partway through, so I'm adding a status check to skip DOIs that are already public on a rerun
    if thisDOIDict['status'] == 'public':
        continue
    
    # Create a dictionary to conduct the update containing all the information it's supposed to have
    thisDoi = sbItem.copy()
    # Get rid of the ScienceBase ID for the doi update packet (though it doesn't seem to break anything if left in)
    thisDoi.pop('id')

    # Set some hard-coded parameters, including making the DOI public
    thisDoi['status'] = 'public'
    thisDoi['pubDate'] = '2018'
    thisDoi['ipdsNumbers'] = [{'ipdsNumber': '082267', 'ipdsType': 'DATA_RELEASE'}]
    thisDoi['dataSourceId'] = 59507
    thisDoi['dataSourceName'] = 'Core Science Analytics, Synthesis and Libraries'
    # Despite including the IPDS number, I'm getting an error returned when trying to make public
    # I'm setting this flag to see if I can get the process done, since there seems to be a checking error
    thisDoi['noPublicationIdAvailable'] = True
    
    # Check to make sure the DOI URL matches the ScienceBase Item
    # If not, add it to the update packet
    if thisDOIDict['url'].split('/')[-1] != sbItem['id']:
        thisDoi['url'] = 'https://www.sciencebase.gov/catalog/item/'+sbItem['id']
    
    # This will be the DOI update package plus the parameter to make it public
    display(thisDoi)
    
    # Update the DOI with the update packet, including setting to public from reserved
    upd_r = doi_session.doi_update(thisDoi)
    display(upd_r)
    print(upd_r['doi'], upd_r['status'])
        

{'dataSourceId': 59507,
 'dataSourceName': 'Core Science Analytics, Synthesis and Libraries',
 'doi': 'doi:10.5066/F7028PW8',
 'ipdsNumbers': [{'ipdsNumber': '082267', 'ipdsType': 'DATA_RELEASE'}],
 'noPublicationIdAvailable': True,
 'pubDate': '2018',
 'status': 'public',
 'title': 'Scarlet Tanager (Piranga olivacea) bSCTAx_CONUS_2001v1 Habitat Map'}

{'error': 400,
 'message': 'An error occurred when attempting to process your request: Missing date parameter'}

KeyError: 'doi'
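
The `KeyError` comes from the final `print`: the error response shown above has only `error` and `message` keys, so indexing it with `'doi'` fails and stops the loop. One way to keep the run going and still see what went wrong is to inspect the response before printing. The sketch below assumes that error responses always carry an `error` key, as in the output above, and that successful ones carry `doi` and `status`, as the original `print` expects. The helper name `summarize_update` is made up for illustration.

```python
def summarize_update(response):
    """Return (ok, text) describing a doi_update response dict."""
    if 'error' in response:
        # Error shape taken from the output above: {'error': ..., 'message': ...}
        return False, '{}: {}'.format(response['error'], response.get('message', ''))
    # Assumed success shape: the keys the original print statement reads
    return True, '{} {}'.format(response.get('doi'), response.get('status'))


# The error response from the run above
error_response = {
    'error': 400,
    'message': 'An error occurred when attempting to process your request: Missing date parameter',
}
ok, text = summarize_update(error_response)
print(ok, text)
```

Inside the loop, the `(ok, text)` pair could be appended to a list of failures alongside the DOI, so that a single bad record does not halt the remaining updates.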