This notebook continues the process I started of working with a given library through the Zotero API. In the first step, outlined in "Pull References via Zotero API", I established a linkage from a Zotero library item to an xDD item. From that point, we can run any number of processes to extract value from xDD for the associated articles. That value could come from an NLP process run on the xDD system, but in this case we're looking at the dictionary routes, which have already extracted specific terms from a given dictionary and associated them with articles (ITIS scientific names in this case).

One of the interesting dynamics with the Zotero platform is that we can create reference item attachments in a hierarchical structure (I don't yet know if there's a limit to hierarchy depth). Looking at the Zotero attachment schema, it seems reasonable to continue using the linked_url linkMode to record the route to the dictionary API that we use and then the tags construct to capture the specific ITIS scientific names for the given article. This puts everything into a logical structure that should make sense for both human users and subsequent software processing. We can again use the title property to give our algorithm-generated xDD dictionary attachment a logical name for classification.
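As a rough sketch of the structure described above, the attachment can be pictured as a plain dictionary. The helper name build_terms_attachment, the parent key "ABCD2345", and the docid in the URL are made up for illustration. The field names follow the Zotero attachment template as I understand it, so check them against a real template before relying on them.

```python
def build_terms_attachment(parent_key, dictionary, api_url, names):
    """Return a dict shaped like a Zotero linked_url attachment (illustrative only)."""
    return {
        "itemType": "attachment",
        "linkMode": "linked_url",
        "parentItem": parent_key,
        # The title gives the algorithm-generated attachment a predictable name
        "title": f"xDD terms from document for the {dictionary} dictionary",
        # The url records the dictionary API route that produced the terms
        "url": api_url,
        # Zotero stores tags as a list of {"tag": ...} objects
        "tags": [{"tag": name} for name in names],
    }


# Hypothetical parent key and docid, used only to show the shape
attachment = build_terms_attachment(
    "ABCD2345",
    "ITIS",
    "https://geodeepdive.org/api/terms?docid=example-docid&dictionary=ITIS",
    ["Artemisia tridentata"],
)
```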

In [1]:
from pyzotero import zotero
import os
import requests
import json
from IPython.display import display

In [2]:
wlci_library_group_id = "2341914"
wlci_lib = zotero.Zotero(wlci_library_group_id, "group", os.environ["ZOTERO_API_KEY"])

It ended up making the best sense (to me) to work this entire process into a single combined function that retrieves and stores xDD dictionary terms for a given xDD derivative associated with a given article or report. I found that I could not nest an attachment item under another attachment item, so these end up going in directly under the parent item. There is a relations concept in the attachment data model that could be explored further to link the attachments that contain the xDD article association to the attachments where we stash terms, but that isn't necessary at this point because the terms attachments carry everything needed to get back to the xDD article.

After fiddling with some different ways of structuring the data into these attachments, I ended up dumping an array of objects to JSON, each containing the dictionary name (from xDD), the term, and its hit count, and storing that in the attachment note, then putting just the terms themselves into tags. Zotero treats the tags as facets within the context of the library itself, so we immediately start adding a bunch of potentially useful tags to help filter through a library. However, we might want to revisit this if we determine that secondary processing filters out tags that we retrieve from the xDD process that are not actually relevant for our case.
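To make the note and tag structure concrete, here is a minimal sketch that runs against a made-up response shaped like the one the function reads (a "success" object holding a "data" list of "term" and "n_hits" entries). The helper name summarize_terms and the sample terms and counts are invented for illustration. The multi-word filter mirrors the one used in the function.

```python
import json


def summarize_terms(response_json, dictionary):
    """Build the JSON note string and the tag list from an xDD-style response."""
    data = response_json.get("success", {}).get("data", [])
    records = [
        {"dictionary": dictionary, "term": d["term"], "hit_count": d["n_hits"]}
        for d in data
        # Keep only multi-word terms, as the notebook function does
        if len(d["term"].split()) > 1
    ]
    tags = [r["term"] for r in records]
    return json.dumps(records), tags


# Made-up sample data, not real xDD output
sample = {
    "success": {
        "data": [
            {"term": "Artemisia tridentata", "n_hits": 12},
            {"term": "Artemisia", "n_hits": 30},
            {"term": "Centrocercus urophasianus", "n_hits": 4},
        ]
    }
}

note, tags = summarize_terms(sample, "ITIS")
```

The note keeps the hit counts for later processing, while the tags carry only the names so they work as facets in the library.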

The function will either insert a new attachment or update an existing one with the fresh haul from the xDD dictionary lookup. We could refine this further by checking whether anything in the xDD response has changed before updating an item. One of the interesting things I found in this process was the fairly nice way that Zotero versions everything. It looks like there is also a way to get back to older versions of any given item through the API, which we might explore at some point.
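The refinement mentioned above could look something like the following sketch, which compares the JSON already stored in the note with the freshly retrieved records and reports whether an update is worth sending. The helper name terms_changed is hypothetical, and the comparison assumes the note holds the same list of dictionary/term/hit_count objects that the function writes.

```python
import json


def terms_changed(existing_note, new_records):
    """Return True when the stored note differs from the new records."""
    try:
        old_records = json.loads(existing_note) if existing_note else []
    except json.JSONDecodeError:
        # An unreadable note is treated as changed so it gets overwritten
        return True

    def as_set(records):
        # Order does not matter, so compare as sets of tuples
        return {(r["dictionary"], r["term"], r["hit_count"]) for r in records}

    try:
        return as_set(old_records) != as_set(new_records)
    except (KeyError, TypeError):
        # The note held JSON of an unexpected shape
        return True


stored = json.dumps([{"dictionary": "ITIS", "term": "Artemisia tridentata", "hit_count": 12}])
same = [{"dictionary": "ITIS", "term": "Artemisia tridentata", "hit_count": 12}]
bumped = [{"dictionary": "ITIS", "term": "Artemisia tridentata", "hit_count": 13}]
```

In the function, a check like this would sit just before the update call, so that unchanged items are skipped and their version numbers are left alone.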

In [3]:
def xdd_terms_to_zotero_attachment(parent_id, xdd_id, dictionary):
    search_url = f"https://geodeepdive.org/api/terms?docid={xdd_id}&dictionary={dictionary}"
    r = requests.get(search_url, headers={"Accept": "application/json"})
    
    if r.status_code == 200 and 'success' in r.json():
        data = r.json()['success']['data']
        # Prepare a list of dicts with scientific name and hit count to store in the note;
        # only multi-word terms are kept (single-word terms are dropped)
        terms_and_hits = [{"dictionary": dictionary, "term":i['term'], "hit_count":i['n_hits']} for i in data if len(i['term'].split()) > 1]

        if len(terms_and_hits) == 0:
            return None
        
        # Prepare just the list of scientific names to use as tags
        tag_list = [i["term"] for i in terms_and_hits]
        
        # Check to see if the top level item already has an attachment for this information
        child_items = wlci_lib.children(parent_id)
        current_xdd_doc_attachment = next((i for i in child_items if i["data"]["title"] == f"xDD terms from document for the {dictionary} dictionary"), None)
        
        # If not create it
        if current_xdd_doc_attachment is None:
            template = wlci_lib.item_template("attachment", linkmode="linked_url")
            template["parentItem"] = parent_id
            template["title"] = f"xDD terms from document for the {dictionary} dictionary"
            template["url"] = search_url
            template["accessDate"] = "CURRENT_TIMESTAMP"
            template["note"] = json.dumps(terms_and_hits)

            create_response = wlci_lib.create_items([template])
            if create_response["successful"]:
                current_xdd_doc_attachment = create_response["successful"]["0"]
            else:
                return create_response

        # If the doc already exists update the note
        else:
            current_xdd_doc_attachment["data"]["note"] = json.dumps(terms_and_hits)
            wlci_lib.update_item(current_xdd_doc_attachment)
            # Have to retrieve the document we just updated so we have the current version
            current_xdd_doc_attachment = wlci_lib.item(current_xdd_doc_attachment["key"])
        
        # We have to add the tags as a separate API call to the attachment that was either just created or updated
        if current_xdd_doc_attachment is not None:
            wlci_lib.add_tags(current_xdd_doc_attachment, *tag_list)

        return True
    else:
        return None


The final step here runs through everything in the library with the "xdd_doc_link" tag. I went back and added this tag to the process of establishing the linkage to an xDD document so that we have an easy way to query for all of those established connections, which we might want to exploit for various purposes over time. The output saved in a snapshot from the Jupyter process might show some interim results that I was displaying in the function to keep track of what was going on. A True response means that we successfully updated the item with its tags. A None response indicates that we didn't get anything usable from the xDD API: either the request did not succeed or no multi-word terms came back.
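The loop pulls the xDD document identifier out of the stored link by splitting on "=", which works as long as the identifier is the last query parameter. A slightly more defensive sketch parses the query string instead. The helper name docid_from_url is hypothetical, and both the "docid" parameter name on the stored link and the example URLs are assumptions based on the terms route used in the function.

```python
from urllib.parse import urlparse, parse_qs


def docid_from_url(url):
    """Return the docid query parameter from a URL, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("docid")
    return values[0] if values else None


# Hypothetical URLs; the docid values are placeholders
simple = docid_from_url("https://geodeepdive.org/api/articles?docid=abc123")
with_extra = docid_from_url("https://geodeepdive.org/api/articles?docid=abc123&extra=1")
missing = docid_from_url("https://geodeepdive.org/api/articles")
```

With this approach, an extra parameter after the identifier no longer changes the result, and a link without an identifier yields None rather than a stray fragment of the URL.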

In [5]:
for xdd_record in wlci_lib.items(tag="xdd_doc_link"):
    display(
        xdd_terms_to_zotero_attachment(
            xdd_record["data"]["parentItem"], 
            xdd_record["data"]["url"].split("=")[-1],
            "ITIS"
        )
    )


True

True

None

None

None

None

None

None

None

True

None

None

True

True

True

True

True

None

None

True

True

True

True

True

True

PreConditionFailed: 
Code: 412
URL: https://api.zotero.org/groups/2341914/items/BSGI73AF
Method: PATCH
Response: Item has been modified since specified version (expected 3259, found 3392)
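
The 412 response above says the item changed between the time its version was read and the time the PATCH was sent. One common way to cope with that kind of conflict is to re-fetch the item and try again a limited number of times. The sketch below is generic and does not call Zotero. The helper name update_with_retry and the fake objects are made up for illustration. With pyzotero, fetch might wrap wlci_lib.item, push might be wlci_lib.update_item, and conflict_error would be the PreConditionFailed exception named in the traceback (assumed to be importable from pyzotero.zotero_errors).

```python
def update_with_retry(fetch, apply_change, push, conflict_error, attempts=3):
    """Fetch a fresh copy, apply a change, and push it, retrying on conflicts."""
    for attempt in range(attempts):
        item = fetch()          # always start from the current version
        apply_change(item)      # mutate the fresh copy in place
        try:
            return push(item)
        except conflict_error:
            # Give up after the final attempt and let the caller see the error
            if attempt == attempts - 1:
                raise


# Fake stand-ins so the sketch runs without a Zotero library
class FakeConflict(Exception):
    pass


calls = {"push": 0}


def fake_fetch():
    return {"version": 1, "data": {"note": ""}}


def fake_change(item):
    item["data"]["note"] = "updated"


def fake_push(item):
    calls["push"] += 1
    if calls["push"] == 1:
        raise FakeConflict()    # simulate one version conflict
    return True


result = update_with_retry(fake_fetch, fake_change, fake_push, FakeConflict)
```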