This notebook walks through pulling metadata from the DataCite API and building content for the Hugo site. A more elegant solution might be to dump the DataCite metadata in its original JSON form (possibly JSON-LD) and drive everything from templates that read it as data. However, I've found that digesting the content into markdown files with YAML front matter works well and keeps some ready options open, so I'm trying not to overload this part of the process.

In [1]:
import requests
import json
import yaml
import os
import shutil
from datetime import datetime


# Get DataCite Items
This part of the process is really up to the individual use case. Any approach is fine here as long as it returns a set of items in DataCite's native JSON format. It could also be adapted to work with JSON-LD or any other output format. The predominant use case is likely to be working through a set of items from one or more DataCite repositories (DOI prefixes) that do not otherwise have landing pages in a primary source repository, but this can also be used to spin up a particular context of assets that need to be presented in a particular way.

In [2]:
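datacite_api = "https://api.datacite.org/dois?prefix=10.5066&page[size]=100"
response = requests.get(datacite_api, timeout=30)
response.raise_for_status()  # stop on an HTTP error; note that only the first page of results is read
items = response.json()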
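
A single request returns only one page of results (up to 100 records with the page size used here). The following is a minimal sketch of collecting every page, assuming each response carries a `links.next` URL while more pages remain, which is how I understand DataCite's paginated responses to work. The names `fetch_all` and `get_json` are hypothetical, and the fetcher is passed in so the loop can be tried without network access.

```python
def fetch_all(first_url, get_json, max_pages=50):
    """Follow `links.next` from page to page and collect every record.

    `get_json` is any callable that takes a URL and returns the decoded JSON
    body, e.g. `lambda url: requests.get(url, timeout=30).json()`.
    `max_pages` is a safety cap (an assumption, not an API limit).
    """
    records = []
    url = first_url
    pages = 0
    while url and pages < max_pages:
        page = get_json(url)
        records.extend(page.get("data", []))
        # Assumed: the last page has no `next` link.
        url = page.get("links", {}).get("next")
        pages += 1
    return records


# Offline demonstration with two made-up pages standing in for API responses.
fake_pages = {
    "page-1": {"data": [{"id": "10.5066/a"}, {"id": "10.5066/b"}], "links": {"next": "page-2"}},
    "page-2": {"data": [{"id": "10.5066/c"}], "links": {}},
}
all_records = fetch_all("page-1", fake_pages.__getitem__)
print([r["id"] for r in all_records])
```

With `requests`, wrapping the result as `{'data': fetch_all(datacite_api, lambda url: requests.get(url, timeout=30).json())}` would keep the `items['data']` shape used in the rest of the notebook.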

# Functions
Eventually, I may need to run this as an automated process in a pipeline (e.g., in GitHub Actions to load site content before deploying). The following functions can be pulled out into that kind of pipeline environment.

In [13]:
def datacite_repositories(documents):
    repositories = [i['id'].split('/')[0] for i in documents]
    return list(set(repositories))

def build_sections(repositories, content_path='../content'):
    for prefix in repositories:
        folder_path = os.path.join(content_path, prefix)
        os.makedirs(folder_path, exist_ok=True)
        if not os.path.exists(os.path.join(folder_path, '_index.md')):
            with open(os.path.join(folder_path, '_index.md'), 'w') as f:
                # Quote the prefix so YAML reads it as a string, not a float (e.g. 10.5066).
                f.write(f'---\ntitle: "{prefix}"\ndate: {datetime.utcnow().isoformat()}\n---\n')

def datacite_categories(doc):
    categories = [doc['attributes']['types']['resourceTypeGeneral']]
    return categories

def datacite_tags(doc):
    tags = []
    for subject in doc['attributes']['subjects']:
        if subject['subject']:
            if ',' in subject['subject']:
                tags.extend([i.strip() for i in subject['subject'].split(',')])
            else:
                tags.append(subject['subject'])
    return tags

def datacite_publishers(doc):
    publishers = [doc['attributes']['publisher']]
    return publishers

def datacite_creators(doc):
    authors = []
    affiliations = []
    for creator in doc['attributes']['creators']:
        if creator.get('affiliation'):
            affiliations.extend(creator['affiliation'])
        name_string = creator['name']
        # nameType is optional in DataCite metadata, so read it defensively.
        if creator.get('nameType') == 'Personal' and 'givenName' in creator and 'familyName' in creator:
            name_string = f"{creator['givenName']} {creator['familyName']}"
        authors.append(name_string)
            
    return authors, list(set(affiliations))

def datacite_funders(doc):
    funders = []
    for funder in doc['attributes']['fundingReferences']:
        if funder['funderName']:
            funders.append(funder['funderName'])
    return list(set(funders))

def datacite_orcids(doc, orcid_mapping=None):
    # Avoid a shared mutable default; start a fresh list when none is passed in.
    if orcid_mapping is None:
        orcid_mapping = []
    for creator in doc['attributes']['creators']:
        orcid = next((i['nameIdentifier'].split('/')[-1] for i in creator.get('nameIdentifiers', []) if i.get('nameIdentifierScheme') == 'ORCID'), None)
        if not orcid:
            # Skip this creator but keep checking the remaining ones.
            continue

        name_string = creator['name']
        if creator.get('nameType') == 'Personal' and 'givenName' in creator and 'familyName' in creator:
            name_string = f"{creator['givenName']} {creator['familyName']}"
        orcid_mapping.append((name_string, orcid))

    return orcid_mapping

def datacite_meta(doc):
    meta_content = ['---']
    title = doc['attributes']['titles'][0]['title']
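    # json.dumps yields a double-quoted string with inner quotes escaped, which is also valid YAML.
    meta_content.append(f'title: {json.dumps(title)}')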
    meta_content.append(f"doi: {doc['id']}")
    meta_content.append(f"date: {doc['attributes']['updated']}")
    # meta_content.append(f"date: {doc['attributes']['publicationYear']}")
    meta_content.append(f"categories: {datacite_categories(doc)}")
    meta_content.append(f"tags: {datacite_tags(doc)}")
    meta_content.append(f"publishers: {datacite_publishers(doc)}")
    authors, affiliations = datacite_creators(doc)
    if authors:
        meta_content.append(f"author: {authors}")
    if affiliations:
        meta_content.append(f"affiliations: {affiliations}")
    meta_content.append(f"funders: {datacite_funders(doc)}")
    meta_content.append("---")
    return '\n'.join(meta_content)

def datacite_md(doc):
    md = datacite_meta(doc)
    
    abstract = next((i['description'] for i in doc['attributes']['descriptions'] if i['descriptionType'] == 'Abstract'), None)
    if abstract:
        md+= '\n\n# Abstract'
        md+= f'\n{abstract}'
    other_descriptions = [i for i in doc['attributes']['descriptions'] if i['descriptionType'] != 'Abstract']
    if other_descriptions:
        for desc in other_descriptions:
            md+= f'\n\n## {desc["descriptionType"]}'
            md+= f'\n{desc["description"].lstrip("#").strip()}'

    if doc['attributes']['url']:
        md+= f'\n\n# Access Points\n{doc["attributes"]["url"]}'
    
    return md
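
Values written into front matter need to survive YAML parsing, for example titles or subjects that contain quotes or colons. One hedged option is to JSON-encode every value, relying on JSON strings and arrays also being valid YAML flow syntax. The helper `front_matter` below is hypothetical and the sample values are made up; it is a sketch of the idea, not a replacement for `datacite_meta`.

```python
import json


def front_matter(fields):
    """Render a dict as YAML front matter, JSON-encoding every value.

    JSON strings and arrays are also valid YAML (flow style), so quotes,
    colons, and non-ASCII characters in the values are escaped safely.
    Empty values are skipped.
    """
    lines = ["---"]
    for key, value in fields.items():
        if value in (None, "", []):
            continue
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines)


# Made-up values for illustration only.
sample = {
    "title": 'Streamflow: a "quoted" example',
    "doi": "10.5066/example",
    "tags": ["water", "hydrology"],
    "funders": [],
}
print(front_matter(sample))
```

Skipping empty values keeps empty taxonomy lists such as `funders: []` out of the generated pages; whether that is desirable depends on how the templates treat missing keys.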

# Build DataCite Repository Sections
We want the DOI prefixes in our collection from DataCite to act as sections within the Hugo site. This means setting up root folders within /content/ for each DOI prefix in the recordset returned from the DataCite API. We'll then write markdown files into these folders, named with the remainder of the DOI identifier, so that the logical paths at the root of our site match the DOI. We also write _index.md files into each DOI prefix folder so that it is treated as a section in Hugo's architecture. This will also provide a listing of items at that path, depending on the template used.

In [8]:
build_sections(datacite_repositories(items['data']))
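
To make the layout concrete, here is a small sketch of how a DOI maps onto a content path under this scheme. `doi_to_content_path` is a hypothetical helper and the DOI is made up; the split is on the first slash only, on the assumption that a suffix may contain further slashes.

```python
import os


def doi_to_content_path(doi, content_path="../content"):
    """Return (section_folder, markdown_file) for a DOI such as '10.5066/abc'."""
    prefix, suffix = doi.split("/", 1)
    section = os.path.join(content_path, prefix)
    return section, os.path.join(section, suffix + ".md")


section, page = doi_to_content_path("10.5066/example123", content_path="content")
print(section)  # the section folder that receives an _index.md
print(page)     # the page file for this record
```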

# Set Up Taxonomy

In [9]:
names = ['affiliations', 'author', 'funders', 'publishers']
layouts_folder = '../layouts'
themes_folder = '../themes/PaperMod/layouts/_default'

for name in names:
    folder_path = os.path.join(layouts_folder, name)
    os.makedirs(folder_path, exist_ok=True)
    
    list_copy_path = os.path.join(folder_path, 'list.html')
    terms_copy_path = os.path.join(folder_path, 'terms.html')
    
    list_source_path = os.path.join(themes_folder, 'list.html')
    terms_source_path = os.path.join(themes_folder, 'terms.html')
    
    if not os.path.exists(list_copy_path):
        shutil.copy(list_source_path, list_copy_path)
    
    if not os.path.exists(terms_copy_path):
        shutil.copy(terms_source_path, terms_copy_path)


# Process DataCite Records

In [10]:
orcid_mapping = []
for document in items['data']:
    # Split on the first slash only; a DOI suffix may itself contain slashes.
    doi_prefix, doi_suffix = document['id'].split('/', 1)
    file_path = os.path.join('../content', doi_prefix, doi_suffix + '.md')
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, 'w') as f:
        f.write(datacite_md(document))

    orcid_mapping = datacite_orcids(document, orcid_mapping)

# Deduplicate (name, ORCID) pairs and write the name-to-ORCID lookup, closing the file when done.
with open('../data/orcid_mapping.json', 'w') as f:
    json.dump({'authors': dict(set(orcid_mapping))}, f, indent=2)

# Organize Additional Data

In [82]:
authors = []
for document in items['data']:
    if document['attributes']['creators']:
        for creator in document['attributes']['creators']:
            if "nameIdentifiers" in creator:
                orcid_url = next((i['nameIdentifier'] for i in creator['nameIdentifiers'] if i['nameIdentifierScheme'] == 'ORCID'), None)
                if orcid_url:
                    # Prefer "given family" for people; fall back to the full name otherwise,
                    # including when nameType is missing or a name part is absent.
                    if creator.get('nameType') == 'Personal' and 'givenName' in creator and 'familyName' in creator:
                        name = f"{creator['givenName']} {creator['familyName']}"
                    else:
                        name = creator['name']
                    authors.append({
                        'title': name,
                        'orcid': orcid_url,
                        'url': "/authors/" + orcid_url.split('/')[-1]
                    })
unique_authors = list(set(tuple(sorted(author.items())) for author in authors))
unique_authors = [dict(author) for author in unique_authors]

for author in unique_authors:
    orcid = author['orcid'].split('/')[-1]
    os.makedirs(os.path.join('../content/authors', orcid), exist_ok=True)
    file_path = os.path.join('../content/authors', orcid, '_index.md')
    yaml_content = "---\n" + yaml.dump(author, default_flow_style=False) + "\n---"
    with open(file_path, 'w') as f:
        f.write(yaml_content)
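
The cell above deduplicates whole author records, so the same ORCID recorded under two name spellings yields two entries that both write to the same `_index.md`. As a hedged alternative (a swapped-in technique, not what the cell does), the sketch below deduplicates by ORCID and keeps the first record seen. `unique_by_orcid` is a hypothetical name and the records and ORCID values are made up.

```python
def unique_by_orcid(authors):
    """Keep the first record seen for each ORCID, preserving input order."""
    seen = {}
    for author in authors:
        seen.setdefault(author["orcid"], author)
    return list(seen.values())


# Made-up records: the same (fictional) ORCID appears with two name spellings.
authors = [
    {"title": "Jane Doe", "orcid": "https://orcid.org/0000-0000-0000-0001"},
    {"title": "J. Doe", "orcid": "https://orcid.org/0000-0000-0000-0001"},
    {"title": "Sam Roe", "orcid": "https://orcid.org/0000-0000-0000-0002"},
]
print([a["title"] for a in unique_by_orcid(authors)])
```

Keeping the first spelling is an arbitrary choice; a different rule, such as preferring the longest name, could be substituted inside the loop.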