<a href="https://colab.research.google.com/github/skybristol/GeoArchiveSummer2021/blob/main/Citation_strings_from_GeoArchive_items.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The purpose of this notebook is to explore methods to generate standardized USGS citation strings from the ScienceBase Items where we are documenting and repositing NI 43-101 Technical Reports within the GeoArchive.

Researchers using these reports as background and reference materials need to cite them for other reports and articles. USGS uses a particular citation string format laid out in a guide for authors of USGS reports. All of the information elements needed to generate these citations are included in the ScienceBase Item metadata, but authors need a quick way to generate citation strings they can drop into a report.

The Zotero tool for managing bibliographic reference materials is one option for managing references. It has a built-in method for generating the specific string format used in USGS guidelines (along with many other standard formats). There are quite a number of USGS publishing authors who use Zotero already, and it is a pretty accessible option for anyone needing to work with these materials.

We can make citation information for the GeoArchive materials available to Zotero clients in several ways. One is to read ScienceBase information through its API, write it out as a file in one of the many structured bibliographic formats Zotero can read, and have users download those files and import them into a local Zotero. A more robust method is to use Zotero's online system with a shared group (potentially mimicking the whole GeoArchive structure of multiple "folders"), sync reference items programmatically to the group via the Zotero API, and then have any client connect and sync the group into their local instance. This could, conceivably, allow us to develop a dynamic system that checks ScienceBase for updates and syncs any changes on a regular schedule.

This notebook explores the latter option, though simply completing the basic mapping steps needed to align ScienceBase Item metadata with the Zotero model will accomplish the hard part of getting to any reasonable citation format output.
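
The first option described above, writing a structured bibliographic file that any Zotero client can import, might look something like the following minimal sketch. It writes RIS, one of the formats Zotero can import. The `record` dictionary and its keys are hypothetical stand-ins for values pulled from a ScienceBase Item, not the field names used later in this notebook.

```python
def to_ris(record):
    """Render one report-like record as an RIS entry (a block of text)."""
    lines = ["TY  - RPRT"]  # RPRT is the RIS reference type for reports
    for author in record.get("authors", []):
        lines.append(f"AU  - {author}")
    # Simple one-to-one tags; the dictionary keys are hypothetical
    for key, tag in (("title", "TI"), ("year", "PY"),
                     ("publisher", "PB"), ("url", "UR")):
        if record.get(key):
            lines.append(f"{tag}  - {record[key]}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines)


# Made-up example values, for illustration only
example = {
    "title": "Example Technical Report",
    "authors": ["Doe, J.", "Roe, R."],
    "year": "2020",
    "publisher": "Example Mining Co.",
    "url": "https://example.org/report.pdf",
}
ris_text = to_ris(example)
print(ris_text)
```

Writing one such block per item into a single `.ris` file would give users something to import, but every update would mean a new download and import, which is part of why the group-library route is explored here instead.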

Two specific packages are needed to work through this notebook.
* sciencebasepy is required because an authenticated session is necessary to connect to the one GeoArchive collection we've built so far with currently protected items
* pyzotero is a Python abstraction on the Zotero API 

In [1]:
!pip install sciencebasepy
!pip install pyzotero

Collecting sciencebasepy
  Downloading https://files.pythonhosted.org/packages/1a/8b/3aead3f9d3fa3ea29fdb20b563772f82088c96e4522d9b1980871c862fde/sciencebasepy-1.6.9-py3-none-any.whl
Installing collected packages: sciencebasepy
Successfully installed sciencebasepy-1.6.9
Collecting pyzotero
  Downloading https://files.pythonhosted.org/packages/e9/44/da1cacf283d0cedac32dc2e22cb88ac8462e9fd58eb5a84483a77481e4f3/Pyzotero-1.4.24-py2.py3-none-any.whl
Collecting feedparser<6,>5.1.0
  Downloading https://files.pythonhosted.org/packages/91/d8/7d37fec71ff7c9dbcdd80d2b48bcdd86d6af502156fc93846fb0102cb2c4/feedparser-5.2.1.tar.bz2 (192kB)
Collecting bibtexparser
  Downloading https://files.pythonhosted.org/packages/7c/c3/c184a4460ba2f4877e3389e2d63479f642d0d3bdffeeffee0723d3b0156d/bibtexparser-1.2.0.tar.gz (46kB)
Building wheels for collected packages: feedparser, bibtexpar

In [2]:
from sciencebasepy import SbSession
from pyzotero import zotero
from getpass import getpass

A Zotero client connection needs to be established to work with the system. This can either point at a user or group library ("Library Type") and needs to supply the ID for the Library and an API Key that has to be generated for whatever application is making the connection. We put all of these into prompts to allow someone running this notebook to use their own specific information and settings.

In [3]:
zot = zotero.Zotero(input("Library ID "), input("Library Type "), getpass(prompt="API Key "))

Library ID 4373054
Library Type group
API Key ··········


One of the first things to figure out, if we're going to push items into Zotero, is what item type we would use. There's a function to output those from the API that we can look over. For our immediate use case, there is a "report" type that seems like it would fit our need for the NI 43-101 Technical Reports.

In [4]:
zot.item_types()

[{'itemType': 'artwork', 'localized': 'Artwork'},
 {'itemType': 'audioRecording', 'localized': 'Audio Recording'},
 {'itemType': 'bill', 'localized': 'Bill'},
 {'itemType': 'blogPost', 'localized': 'Blog Post'},
 {'itemType': 'book', 'localized': 'Book'},
 {'itemType': 'bookSection', 'localized': 'Book Section'},
 {'itemType': 'case', 'localized': 'Case'},
 {'itemType': 'computerProgram', 'localized': 'Computer Program'},
 {'itemType': 'conferencePaper', 'localized': 'Conference Paper'},
 {'itemType': 'dictionaryEntry', 'localized': 'Dictionary Entry'},
 {'itemType': 'document', 'localized': 'Document'},
 {'itemType': 'email', 'localized': 'E-mail'},
 {'itemType': 'encyclopediaArticle', 'localized': 'Encyclopedia Article'},
 {'itemType': 'film', 'localized': 'Film'},
 {'itemType': 'forumPost', 'localized': 'Forum Post'},
 {'itemType': 'hearing', 'localized': 'Hearing'},
 {'itemType': 'instantMessage', 'localized': 'Instant Message'},
 {'itemType': 'interview', 'localized': 'Interview'}

The bulk of our work here will be figuring out which ScienceBase Item metadata fields should go into which Zotero fields. We want both to create a complete citation and to carry over as much information as possible from what we are developing in ScienceBase, so that a user can navigate the collection and find what they need from the group library in Zotero itself. This would include links back into ScienceBase to retrieve the actual PDF file for an item as well as descriptive information we've put into tags.

The following code block calls a function that lists every field available across all Zotero item types.

In [5]:
zot.item_fields()

[{'field': 'numPages', 'localized': '# of Pages'},
 {'field': 'numberOfVolumes', 'localized': '# of Volumes'},
 {'field': 'abstractNote', 'localized': 'Abstract'},
 {'field': 'accessDate', 'localized': 'Accessed'},
 {'field': 'applicationNumber', 'localized': 'Application Number'},
 {'field': 'archive', 'localized': 'Archive'},
 {'field': 'artworkSize', 'localized': 'Artwork Size'},
 {'field': 'assignee', 'localized': 'Assignee'},
 {'field': 'billNumber', 'localized': 'Bill Number'},
 {'field': 'blogTitle', 'localized': 'Blog Title'},
 {'field': 'bookTitle', 'localized': 'Book Title'},
 {'field': 'callNumber', 'localized': 'Call Number'},
 {'field': 'caseName', 'localized': 'Case Name'},
 {'field': 'code', 'localized': 'Code'},
 {'field': 'codeNumber', 'localized': 'Code Number'},
 {'field': 'codePages', 'localized': 'Code Pages'},
 {'field': 'codeVolume', 'localized': 'Code Volume'},
 {'field': 'committee', 'localized': 'Committee'},
 {'field': 'company', 'localized': 'Company'},
 {'f

Zotero uses a model where only specific fields are valid for each item type. If we want to home in on the "report" item type, then we need to use the fields that are valid for that type. The following code block calls a function that outputs a template (a "blank" Python dictionary) for the report item type, giving us the fields we can map from ScienceBase and fill in.

In [6]:
zot.item_template('report')

{'abstractNote': '',
 'accessDate': '',
 'archive': '',
 'archiveLocation': '',
 'callNumber': '',
 'collections': [],
 'creators': [{'creatorType': 'author', 'firstName': '', 'lastName': ''}],
 'date': '',
 'extra': '',
 'institution': '',
 'itemType': 'report',
 'language': '',
 'libraryCatalog': '',
 'pages': '',
 'place': '',
 'relations': {},
 'reportNumber': '',
 'reportType': '',
 'rights': '',
 'seriesTitle': '',
 'shortTitle': '',
 'tags': [],
 'title': '',
 'url': ''}
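
Because only the keys in this template are valid for a report, one way to guard the mapping step is to fill a copy of the template and refuse anything else. The sketch below hard-codes a subset of the template so that it runs offline; in the notebook, `zot.item_template('report')` would supply the real one. The helper name `fill_template` and the sample values are illustrative assumptions.

```python
import copy


def fill_template(template, values):
    """Return a copy of a Zotero item template with `values` applied.

    Raises KeyError for any field the template does not define, so that
    mapping mistakes surface before anything is sent to the API.
    """
    unknown = sorted(set(values) - set(template))
    if unknown:
        raise KeyError(f"Not in the template for this item type: {unknown}")
    item = copy.deepcopy(template)  # leave the caller's template untouched
    item.update(values)
    return item


# Subset of the 'report' template shown above, hard-coded for illustration
report_template = {
    "itemType": "report",
    "title": "",
    "creators": [{"creatorType": "author", "firstName": "", "lastName": ""}],
    "date": "",
    "institution": "",
    "pages": "",
    "url": "",
    "tags": [],
}

# Made-up values, for illustration only
report = fill_template(report_template, {
    "title": "Example Technical Report",
    "creators": [{"creatorType": "author", "name": "J. Doe"}],
    "date": "2020",
    "pages": "120",
})
print(report["title"], report["itemType"])
```

A filled dictionary like `report` is the kind of payload that pyzotero's `zot.create_items([report])` is meant to accept.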

So far, I've taken a single report and set up a more complete template in the GeoArchive group library. This is shown here to start working out specifics on how we will map what we have from the original inventory spreadsheet and now in ScienceBase to the Zotero report item type model. There is quite a bit of content needed to create a full citation that is missing and needs to be extracted in some way (human or algorithm) from the reports themselves (e.g., actual title of the report, author names, dates). We may have to start with what we have and then work at filling in the blanks.

Going the Zotero route for direct management of the archive would mean that we'd have a stable, accessible platform for both people and software to act against. Group members can be added to use the Zotero interface (web or desktop clients) to manage items, and we can use the API to feed back extracted data.

Here is what a USGS formatted citation string would look like for this particular item.

F. Ghazanfari, B. T. Hennessey, L. Pignatari, T.R. Raponi, I. Dymov, P. C. Rodriguez, and A. Wheeler, 2020, Updated Feasibility Study Technical Report (NI 43-101) for the Almas Gold Project, Almas Municipality, Tocantins, Brazil: Aura Minerals, 360 Mining feasibility study, 459 p., accessed at https://www.sciencebase.gov/catalog/file/get/60d20afad34e86b938ada670?f=__disk__31%2F22%2F7f%2F31227f86e158c963d891341a44f78d136aebd0c2.


In [8]:
zot.top(limit=5)

[{'data': {'abstractNote': '',
   'accessDate': '',
   'archive': 'ScienceBase',
   'archiveLocation': 'https://www.sciencebase.gov/catalog/item/60d20afad34e86b938ada670',
   'callNumber': '',
   'collections': ['L56BFTCE'],
   'creators': [{'creatorType': 'author', 'name': 'F. Ghazanfari'},
    {'creatorType': 'author', 'name': 'B. T. Hennessey'},
    {'creatorType': 'author', 'name': 'L. Pignatari'},
    {'creatorType': 'author', 'name': 'T.R. Raponi'},
    {'creatorType': 'author', 'name': 'I. Dymov'},
    {'creatorType': 'author', 'name': 'P. C. Rodriguez'},
    {'creatorType': 'author', 'name': 'A. Wheeler'}],
   'date': '12/31/2020',
   'dateAdded': '2021-07-16T14:09:49Z',
   'dateModified': '2021-07-16T14:33:15Z',
   'extra': '',
   'institution': 'Aura Minerals, 360 Mining',
   'itemType': 'report',
   'key': 'QPXCGSSE',
   'language': 'en',
   'libraryCatalog': '',
   'pages': '459',
   'place': 'Almas Mincipality, Tocantins, Brazil',
   'relations': {},
   'reportNumber': '',
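
As a cross-check on the mapping, the citation string shown earlier can be approximated directly from a record like this one. The sketch below only imitates the pattern visible in that single example; it is not an implementation of the USGS style guide, and Zotero's own citation styles remain the more reliable route. The helpers `creator_name` and `format_citation` are hypothetical, and the sample record uses made-up values rather than the real item, since several of its fields are cut off in the output above.

```python
def creator_name(creator):
    """Return a display name for single-field or two-field creators."""
    if creator.get("name"):
        return creator["name"]
    return f"{creator.get('firstName', '')} {creator.get('lastName', '')}".strip()


def format_citation(data):
    """Build a citation string from the 'data' part of a Zotero report item."""
    names = [creator_name(c) for c in data.get("creators", [])]
    if len(names) > 1:
        authors = ", ".join(names[:-1]) + ", and " + names[-1]
    else:
        authors = "".join(names)
    # Assumes the date string ends in a four-digit year, e.g. '12/31/2020'
    year = data.get("date", "")[-4:]
    source = " ".join(
        part for part in (data.get("institution"), data.get("reportType")) if part
    )
    parts = [f"{authors}, {year}, {data.get('title', '')}: {source}"]
    if data.get("pages"):
        parts.append(f"{data['pages']} p.")
    if data.get("url"):
        parts.append(f"accessed at {data['url']}")
    text = ", ".join(parts)
    return text if text.endswith(".") else text + "."


# Made-up record, for illustration only
sample = {
    "creators": [
        {"creatorType": "author", "name": "J. Doe"},
        {"creatorType": "author", "name": "R. Roe"},
        {"creatorType": "author", "name": "A. Poe"},
    ],
    "date": "12/31/2020",
    "title": "Example Technical Report",
    "institution": "Example Mining Co.",
    "reportType": "feasibility study",
    "pages": "120",
    "url": "https://example.org/report.pdf",
}
citation = format_citation(sample)
print(citation)
```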

Now that we have a basic idea of what we're shooting for to send items into Zotero, we need to look at our source material and figure out a mapping. We need to figure out if we've captured everything for our ScienceBase Items that we need to build a full citation to the reports in Zotero and provide some additional things like tags that will help people navigate the Zotero library and find what they are looking for.

Because our one collection in the GeoArchive is currently restricted, we need to log in to ScienceBase using sciencebasepy and establish a secure connection to the ScienceBase API. We can then start with the ID for the open and accessible top-level collection and get the one child ID that we want to work against. (A more robust method will be needed when we have multiple virtual collections operating.)

In [14]:
sb = SbSession()
sb.loginc(input("User Name:"))

User Name:sbristol@usgs.gov
··········
Invalid password, try again
··········


<sciencebasepy.SbSession.SbSession at 0x7fc333aca610>

In [15]:
geoarchive_item = "607ef112d34e8564d6809e58"
disclosure_reports_collection = sb.get_child_ids(geoarchive_item)[0]

Now we can get all of the items within the one parent collection, along with the fields on those items where we have stored any information so far. We can then work our way through those items, create a logical mapping to the Zotero template we've chosen, and see how things line up. We should be able to send in a sampling of items via the API and see how they work for our purposes.

In [18]:
disclosure_reports = []
# Request only the fields we need to evaluate for the Zotero mapping
items = sb.find_items(
    {
        'parentId': disclosure_reports_collection,
        'max': 1000,
        'fields': 'title,subtitle,contacts,dates,extensions,files,tags'
    }
)
# Page through the results until no further page of items comes back
while items and 'items' in items:
    disclosure_reports.extend(items["items"])
    items = sb.next(items)
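
A first pass at the mapping itself could be as small as the following sketch. It assumes the ScienceBase item dictionaries carry `title`, `link`, `files`, `contacts`, and `tags` in the shapes used below. It puts the item page in `archiveLocation` and the first file's `url` in `url`, mirroring the hand-built example in the group library. The function name `sb_item_to_zotero_report`, the sample item, and the choice to treat every contact as an author are all assumptions to be revisited against the real records.

```python
def sb_item_to_zotero_report(sb_item):
    """Map a ScienceBase item dict onto Zotero 'report' fields (partial)."""
    files = sb_item.get("files") or []
    return {
        "itemType": "report",
        "title": sb_item.get("title", ""),
        "archive": "ScienceBase",
        # Landing page for the item in ScienceBase
        "archiveLocation": sb_item.get("link", {}).get("url", ""),
        # Direct link to the first attached file, assumed to be the report PDF
        "url": files[0].get("url", "") if files else "",
        # Assumption: each contact is an author, stored as a single-field name
        "creators": [
            {"creatorType": "author", "name": c["name"]}
            for c in sb_item.get("contacts", []) if c.get("name")
        ],
        # Zotero tags are dictionaries with a 'tag' key
        "tags": [
            {"tag": t["name"]}
            for t in sb_item.get("tags", []) if t.get("name")
        ],
    }


# Hypothetical, abbreviated item for illustration only
sample_item = {
    "id": "EXAMPLEID",
    "title": "Example Technical Report",
    "link": {"rel": "self",
             "url": "https://www.sciencebase.gov/catalog/item/EXAMPLEID"},
    "files": [{"name": "example.pdf",
               "url": "https://example.org/example.pdf"}],
    "contacts": [{"name": "J. Doe", "type": "Author"}],
    "tags": [{"type": "Commodity", "name": "gold"}],
}
mapped = sb_item_to_zotero_report(sample_item)
print(mapped["archiveLocation"])
```

Fields such as `date`, `pages`, and `institution` are left out on purpose, since they still have to be extracted from the reports themselves.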

In [19]:
disclosure_reports[0]

{'dates': [{'dateString': '2021-06-22T10:06:49.442-06:00',
   'label': 'Date Created',
   'type': 'dateCreated'},
  {'dateString': '2021-07-01T07:47:58.405-06:00',
   'label': 'Last Updated',
   'type': 'lastUpdated'}],
 'files': [{'contentType': 'application/pdf',
   'dateUploaded': '2021-07-01T13:47:58.401Z',
   'downloadUri': 'https://www.sciencebase.gov/catalog/file/get/60d20a99d34e86b938ada3a3?f=__disk__5e%2F7a%2Fad%2F5e7aad545e458b3688a942ef31519b63cf525a17',
   'name': 'Rozino Au 8-2020.pdf',
   'pathOnDisk': '__disk__5e/7a/ad/5e7aad545e458b3688a942ef31519b63cf525a17',
   'size': 21932906,
   'uploadedBy': 'sbristol@usgs.gov',
   'url': 'https://www.sciencebase.gov/catalog/file/get/60d20a99d34e86b938ada3a3?f=__disk__5e%2F7a%2Fad%2F5e7aad545e458b3688a942ef31519b63cf525a17'}],
 'id': '60d20a99d34e86b938ada3a3',
 'link': {'rel': 'self',
  'url': 'https://www.sciencebase.gov/catalog/item/60d20a99d34e86b938ada3a3'},
 'relatedItems': {'link': {'rel': 'related',
   'url': 'https://www.