## Bulk upload OA paper metadata records from Dimensions JSON export
This notebook formats the JSON export from Dimensions for upload to Figshare. It uses the create private article API endpoint: https://docs.figshare.com/#private_article_create


All records are created as linked file records.

Here are the steps:
1. Open or create a JSON file
2. Pull out the relevant fields and give them the proper keys (account for partial dates, author formatting, and missing abstracts)
3. Iterate through and upload the records
 - convert the JSON record to a string with double quotes
 - upload the record
 - log the API response details if it fails
 - update the author list of the new record (removes the admin account as an author). The create record response returns the API endpoint for updating.
 - add the existing DOI as a linked file
4. This can upload to a specific group with specific custom metadata. You can change the API key to upload to different accounts.
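
Step 2 mentions partial dates, while the formatting cell in this notebook only uses the publication year. The following is a minimal sketch of one way to pad a partial date to a full `YYYY-MM-DD` string. It assumes dates arrive as `YYYY`, `YYYY-MM`, or `YYYY-MM-DD`, which is an assumption about the export rather than something it guarantees, and `pad_partial_date` is a hypothetical helper name.

```python
def pad_partial_date(value):
    """Pad a partial date ('YYYY' or 'YYYY-MM') to a full 'YYYY-MM-DD' string.

    Missing month or day parts default to '01'. The accepted input formats
    are an assumption about the export, not something it guarantees.
    """
    parts = str(value).strip().split("-")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"Unrecognized date: {value!r}")
    year, month, day = (parts + ["01", "01"])[:3]
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"


print(pad_partial_date(2023))          # 2023-01-01
print(pad_partial_date("2023-5"))      # 2023-05-01
print(pad_partial_date("2023-05-17"))  # 2023-05-17
```

Raising on unrecognized input, rather than silently defaulting, makes it easier to spot records that need manual attention before upload.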

## Import libraries

In [3]:
import json
import requests
import pandas as pd

## Set token and descriptor

In [4]:
#Set the token in the header and base URL

with open("../../../zambia-token.txt", "r") as text_file:
    TOKEN = text_file.read().strip() #strip() returns a new string, so assign the result; removes any hidden spaces

#TOKEN = str(ENTER TOKEN HERE WITH QUOTES)

api_call_headers = {'Authorization': 'token ' + TOKEN}

#Set the base URL
BASE_URL = 'https://api.figsh.com/v2'
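
As an alternative to a token file, the token could be read from an environment variable so that it never appears in the notebook. This is only a sketch: the variable name `FIGSHARE_TOKEN`, the helper `build_headers`, and the placeholder value are all assumptions for illustration.

```python
import os


def build_headers(token):
    """Return the Authorization header dict for a personal token."""
    token = token.strip()  # strip() returns a new string, so keep the result
    if not token:
        raise ValueError("The API token is empty")
    return {"Authorization": "token " + token}


# FIGSHARE_TOKEN is a hypothetical variable name; a placeholder is used if it is unset
example_headers = build_headers(os.environ.get("FIGSHARE_TOKEN", "placeholder-token"))
print(sorted(example_headers))  # ['Authorization']
```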

## Load the json file

In [8]:
#Open a file if you have one
with open("zambia-pubs.json", "r", encoding='utf8') as read_file: #Replace this with the filename of your choice
    jsonfile = json.load(read_file)

In [9]:
jsonfile[0]

{'abstract': '<b>Background:</b> Tuberculosis (TB) remains a major challenge in many domains including diagnosis, pathogenesis, prevention, treatment, drug resistance and long-term protection of the public health by vaccination. A controlled human infection model (CHIM) could potentially facilitate breakthroughs in each of these domains but has so far been considered impossible owing to technical and safety concerns. <b>Methods:</b> A systematic review of mycobacterial human challenge studies was carried out to evaluate progress to date, best possible ways forward and challenges to be overcome. We searched MEDLINE (1946 to current) and CINAHL (1984 to current) databases; and Google Scholar to search citations in selected manuscripts. The final search was conducted 3 <sup>rd</sup> February 2022. Inclusion criteria: adults ≥18 years old; administration of live mycobacteria; and interventional trials or cohort studies with immune and/or microbiological endpoints. Exclusion criteria: anima

## Format for upload

In [12]:
#Format records for upload. Customize the Custom field section for your group.

result = []
doi_list = []
for item in jsonfile:
    my_dict={}
    my_dict['title']=item.get('title')
    if 'abstract' in item: #abstract isn't always present
        my_dict['description']=item.get('abstract')
    else:
        my_dict['description']="No description available"
    authors = [] #format authors
    for name in item['authors']:
        authorname = {"name" : name['first_name'] + " " + name['last_name']}
        authors.append(authorname)
    my_dict['authors']= authors
    my_dict['defined_type'] = 'journal contribution'
    my_dict['doi']= item.get('doi')
    my_dict['resource_doi']= item.get('doi')
    my_dict['resource_title']=item.get('title')
    #my_dict['references'] = [item.get('URL')]
    my_dict['timeline'] =  {"firstOnline" : str(item['year']) + "-01-01"} #year only 
    my_dict['group'] = 50585
    cats = []
    keywords =[]
    for cat in item['category_for_2020']:
        catcode = cat['name'].split(" ",1)[0] #Split on first space
        keyname = cat['name'].split(" ",1)[1]
        cats.append(str(catcode))
        keywords.append(keyname)
    my_dict['categories_by_source_id'] = cats
    my_dict['keywords'] = keywords
    #my_dict['is_metadata_record'] = True #Use these if you want a metadata only record
    #my_dict['metadata_reason'] = 'See publisher version'
    result.append(my_dict)
    doi_list.append(item['doi'])


print(len(result),"records are ready for upload.")


15 records are ready for upload.
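
The category loop above relies on every `category_for_2020` name starting with a numeric code followed by a space, for example `3202 Clinical Sciences`. A name without a space would raise an `IndexError` at the second `split`. As a sketch, `str.partition` does the same split without that failure mode; `split_category` is a hypothetical helper name.

```python
def split_category(name):
    """Split 'CODE Label' into (code, label) on the first space."""
    code, _, label = name.strip().partition(" ")
    return code, label.strip()


print(split_category("3202 Clinical Sciences"))  # ('3202', 'Clinical Sciences')
print(split_category("32"))                      # ('32', '')
```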


## Validate metadata - example

In [5]:
test = result[1] #validate() expects the Python dict itself, not a JSON string

In [6]:
from jsonschema import validate

In [7]:
#I copied the schema from https://github.com/figshare/user_documentation/tree/master/swagger_documentation/documentation/models
#and added the $id and $schema info
with open('create-item.json') as schema_file:
    base = json.load(schema_file)

In [9]:
#If there is no output, the validation is successful
validate(test, schema=base)

## Upload the records with a link as the file


In [14]:
#Upload the records

record_fails = []
partial_record_ids = []
created_record_ids = [] #Use this to delete all the draft records if needed 
success_count = 0
count = 0 #This just tracks what index value the loop is on and is used to connect the metadata with the DOI update

for index, item in enumerate(result):
    jsonresult = json.dumps(item) #Takes one record and makes it a json string (double quotes)
    r = requests.post(BASE_URL + '/account/articles', headers=api_call_headers, data = jsonresult)
    if r.status_code != 201:
        record_fails.append(str(index) + ":" + str(r.content[0:75])) #Add failed index to list with partial description
        count += 1
    else:
        count += 1 #increment here otherwise have to do it at each if statement below
        #Remove the admin account as an author by updating the record just created
        #This uses the article url returned by the API response (r)
        
        # Get the location URL and item id
        response_json = json.loads(r.content)
        new_url = response_json['location']
        item_id = response_json['entity_id']
        created_record_ids.append(item_id)
        
        
        #Format and update authors
        authordict = {}
        authordict['authors'] = item['authors']
        authorjson = json.dumps(authordict) #formats everything with double quotes
        s = requests.put(new_url, headers=api_call_headers, data = authorjson) 
        if not s.ok: #check the author update response (s), not the create response (r)
            record_fails.append(str(index) + "failed at author update:" + str(s.content[0:75])) #Add failed index to list with partial description         
            partial_record_ids.append(item_id)
        else:
            #Upload a link as a file
            link = '{"link":"https://doi.org/'+ str(doi_list[count-1]) + '"}' #count-1 because already incremented value to next index
            t = requests.post(new_url +'/files', headers=api_call_headers, data = link)
            if not t.ok: #check the link upload response (t), not the create response (r)
                record_fails.append(str(index) + "failed at doi update:" + str(t.content[0:75])) #Add failed index to list with partial description
                partial_record_ids.append(item_id)
            else:
                success_count += 1

        
print(success_count,"records created and updated. There were",len(result)-len(created_record_ids),"records that were not created at all.")
print('There were',len(partial_record_ids),'records created but with a failed DOI link.')
print("Failed record descriptions:",record_fails)


15 records created and updated. There were 0 records that were not created at all.
There were 0 records created but with a failed DOI link.
Failed record descriptions: []
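
The three requests in the loop above (create, author update, link upload) can also be wrapped in one function that checks each response separately. This is a sketch, not the notebook's method: `upload_record` is a hypothetical helper, and `session` is assumed to be any object with requests-style `post()` and `put()` methods whose responses expose `.ok` and `.json()`. The `location` and `entity_id` keys are the ones the create response is read for above.

```python
import json


def upload_record(session, base_url, record, doi):
    """Create a draft record, reset its authors, and attach the DOI as a link.

    `session` is any object with requests-style post() and put() methods,
    for example a requests.Session with the Authorization header set.
    Returns (item_id, failed_step); failed_step is None on full success.
    """
    created = session.post(base_url + "/account/articles", data=json.dumps(record))
    if not created.ok:
        return None, "create"
    body = created.json()
    item_url, item_id = body["location"], body["entity_id"]

    # Each step checks its own response object, not the create response
    updated = session.put(item_url, data=json.dumps({"authors": record["authors"]}))
    if not updated.ok:
        return item_id, "author update"

    linked = session.post(item_url + "/files",
                          data=json.dumps({"link": "https://doi.org/" + doi}))
    if not linked.ok:
        return item_id, "link upload"
    return item_id, None
```

With `requests`, a session could be prepared with `session = requests.Session()` followed by `session.headers.update(api_call_headers)`, and the function called once per record and DOI. Returning the failed step alongside the item id keeps partially created records easy to find and delete.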


## Create item id list for batch publish

In [61]:
ids = []
ids.extend(created_record_ids) #extend keeps ids a flat list of item ids rather than a list inside a list

## For testing purposes, use this to delete all the records you just created

In [17]:
delete_record_fails = []

for item in created_record_ids:
    r = requests.delete(BASE_URL + '/account/articles/' + str(item), headers=api_call_headers)
    if r.status_code != 204:
        delete_record_fails.append(str(item) + ":" + str(r.content[0:75])) #Add failed item id to list with partial description
    else:
        print("Record deleted")
        

Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted
Record deleted


## Create a batch upload CSV
The categories cause problems in this format.

After saving, you also need to open the file, replace all single quotes with double quotes, and then reformat the dates.
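
The single quotes come from Python's default string form of list and dict values when they are written to CSV. One possible way to avoid the manual find-and-replace is to serialize those cells with `json.dumps` before writing, which produces double quotes. The sketch below uses only the standard library; `jsonify_cells` is a hypothetical helper, and whether the batch upload accepts this exact cell format has not been tested here.

```python
import csv
import io
import json


def jsonify_cells(record):
    """Return a copy of record with list and dict values as JSON strings."""
    return {key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in record.items()}


rows = [jsonify_cells({"title": "Example", "keywords": ["Public Health", "Health Sciences"]})]

# Write to an in-memory buffer here; a real file handle works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "keywords"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

With a DataFrame, the same idea would be to apply `json.dumps` to the list-valued columns before calling `to_csv`.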

In [16]:
df = pd.DataFrame(result)
df.head()

Unnamed: 0,title,description,authors,defined_type,doi,resource_doi,resource_title,timeline,group,categories_by_source_id,keywords
0,Practical considerations for a TB controlled h...,<b>Background:</b> Tuberculosis (TB) remains a...,"[{'name': 'Stephen B Gordon'}, {'name': 'Simon...",journal contribution,10.12688/wellcomeopenres.18767.2,10.12688/wellcomeopenres.18767.2,Practical considerations for a TB controlled h...,{'firstOnline': '2023-01-01'},50585,"[32, 3202]","[Biomedical and Clinical Sciences, Clinical Sc..."
1,Cross-sectional study to assess depression amo...,OBJECTIVES: We sought to assess depression amo...,"[{'name': 'Sandra Simbeza'}, {'name': 'Jacob M...",journal contribution,10.1136/bmjopen-2022-069257,10.1136/bmjopen-2022-069257,Cross-sectional study to assess depression amo...,{'firstOnline': '2023-01-01'},50585,"[32, 4203, 4206, 42, 3202]","[Biomedical and Clinical Sciences, Health Serv..."
2,The epidemiology of human Taenia solium infect...,BACKGROUND: Taenia solium is a tapeworm that c...,"[{'name': 'Gideon Zulu'}, {'name': 'Dominik St...",journal contribution,10.1371/journal.pntd.0011042,10.1371/journal.pntd.0011042,The epidemiology of human Taenia solium infect...,{'firstOnline': '2023-01-01'},50585,"[32, 3202]","[Biomedical and Clinical Sciences, Clinical Sc..."
3,Practical considerations for a TB controlled h...,Background: Tuberculosis (TB) remains a major...,"[{'name': 'Stephen B. Gordon'}, {'name': 'Simo...",journal contribution,10.12688/wellcomeopenres.18767.1,10.12688/wellcomeopenres.18767.1,Practical considerations for a TB controlled h...,{'firstOnline': '2023-01-01'},50585,"[32, 3202]","[Biomedical and Clinical Sciences, Clinical Sc..."
4,Zambian Parents’ Perspectives on Early-Infant ...,Despite increasing interest in Early-Infant an...,"[{'name': 'Violeta J. Rodriguez'}, {'name': 'S...",journal contribution,10.1007/s10461-022-03912-1,10.1007/s10461-022-03912-1,Zambian Parents’ Perspectives on Early-Infant ...,{'firstOnline': '2023-01-01'},50585,"[4206, 42]","[Public Health, Health Sciences]"


In [20]:
#The dates are all contained within one column called 'timeline'. 
#Use the JSON to create a better format and then merge with the dataframe
#with the proper article id in a new dataframe

temp_date_list = []

for item in result:
    dateitem = dict(item['timeline']) #copy so that adding 'doi' does not alter the timeline stored in result
    dateitem['doi'] = item['doi']
    temp_date_list.append(dateitem)

df_dates_items = pd.json_normalize(
    temp_date_list 
)

#Merge the date dataframe with the metadata dataframe
df_formatted = df.merge(df_dates_items, how='outer', on='doi')

print("Dates split out and merged")

Dates split out and merged


In [21]:
df_formatted = df_formatted.drop(columns=['timeline', 'categories_by_source_id'])
df_formatted['categories'] = df_formatted.loc[:, 'keywords']
df_formatted = df_formatted.rename(columns={"firstOnline": "first_online_date"})

In [22]:
df_formatted.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   title              15 non-null     object
 1   description        15 non-null     object
 2   authors            15 non-null     object
 3   defined_type       15 non-null     object
 4   doi                15 non-null     object
 5   resource_doi       15 non-null     object
 6   resource_title     15 non-null     object
 7   group              15 non-null     int64 
 8   keywords           15 non-null     object
 9   first_online_date  15 non-null     object
 10  categories         15 non-null     object
dtypes: int64(1), object(10)
memory usage: 1.4+ KB


In [25]:
#Save as CSV
df_formatted.to_csv('zambia-publications.csv', mode = 'w', index=False)

In [40]:
query

'{"page_size":20,"institution":"figshare","item_type":3,"search_for":":description: health AND :description: Zambia OR :title: Zambia"}'

## Figshare.com datasets
It is not yet clear how to limit the search to figshare.com.

In [48]:
#Retrieve list of metadata
#SET THE PAGE SIZE to make sure you get all the records

#Gather basic metadata for items (articles) that meet your search criteria

query = '{"page_size":20,"item_type":3,"search_for":":description: health AND :description: Zambia OR :title: Zambia"}' #Set up string
y = json.loads(query) #Convert the string to a dictionary (JSON)

#The Figshare API requires JSON parameters
r=requests.post('https://api.figshare.com/v2/articles/search', params=y)
articles = json.loads(r.text) 

if r.status_code != 200:
    print('Something is wrong:',r.content)
else:
    print('Collected',len(articles),'metadata records')

Collected 20 metadata records
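
The search above returns exactly `page_size` records, so more matches may exist on later pages. A hedged sketch of paging until a short page comes back is shown below. It assumes the search accepts a `page` parameter alongside `page_size`; `collect_all_pages` and `fetch_page` are hypothetical names, and the fake data stands in for the API so the example runs offline.

```python
def collect_all_pages(fetch_page, page_size=20, max_pages=50):
    """Call fetch_page(page, page_size) until a page comes back short.

    fetch_page is a hypothetical callable returning a list of records;
    max_pages is a safety cap so the loop always terminates.
    """
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, page_size)
        records.extend(batch)
        if len(batch) < page_size:
            break
    return records


# Stand-in for the API: 45 fake records served in slices
fake_results = [{"id": n} for n in range(45)]


def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return fake_results[start:start + page_size]


print(len(collect_all_pages(fake_fetch)))  # 45
```

Against the real API, `fetch_page` could add the page number to the query dictionary and return the parsed response of the search request.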


In [52]:
full_articles = []
for item in articles: 
    s=requests.get('https://api.figshare.com/v2/articles/' + str(item['id']))
    metadata=json.loads(s.text)
    full_articles.append(metadata)
print(len(full_articles),'collected')

20 collected


In [59]:
#Format records for upload.

result_datasets = []
doi_list = []
for item in full_articles:
    my_dict={}
    my_dict['title']=item.get('title')
    my_dict['description']=item.get('description')
    authors = [] #format authors
    for name in item['authors']:
        authorname = {"name" : name['full_name']}
        authors.append(authorname)
    my_dict['authors']= authors
    my_dict['defined_type'] = 'dataset'
    my_dict['doi']= item.get('doi')
    my_dict['resource_doi']= item.get('doi')
    my_dict['resource_title']=item.get('title')
    #my_dict['references'] = [item.get('URL')]
    my_dict['timeline'] =  {"firstOnline" : item['timeline']['firstOnline']} #full date copied from the source record
    my_dict['group'] = 50585
    my_dict['keywords'] = item['tags']
    cats = []
    for cat in item['categories']:
        catcode = cat['id']
        cats.append(catcode)
    my_dict['categories'] = cats
    #my_dict['is_metadata_record'] = True #Use these if you want a metadata only record
    #my_dict['metadata_reason'] = 'See publisher version'
    result_datasets.append(my_dict)
    doi_list.append(item['doi'])


print(len(result_datasets),"records are ready for upload.")

20 records are ready for upload.


In [None]:
#Upload the records

record_fails = []
partial_record_ids = []
created_record_ids2 = [] #Use this to delete all the draft records if needed 
success_count = 0
count = 0 #This just tracks what index value the loop is on and is used to connect the metadata with the DOI update

for index, item in enumerate(result_datasets): #upload the dataset records formatted above, not the paper records in result
    jsonresult = json.dumps(item) #Takes one record and makes it a json string (double quotes)
    r = requests.post(BASE_URL + '/account/articles', headers=api_call_headers, data = jsonresult)
    if r.status_code != 201:
        record_fails.append(str(index) + ":" + str(r.content[0:75])) #Add failed index to list with partial description
        count += 1
    else:
        count += 1 #increment here otherwise have to do it at each if statement below
        #Remove the admin account as an author by updating the record just created
        #This uses the article url returned by the API response (r)

        # Get the location URL and item id
        response_json = json.loads(r.content)
        new_url = response_json['location']
        item_id = response_json['entity_id']
        created_record_ids2.append(item_id)


        #Format and update authors
        authordict = {}
        authordict['authors'] = item['authors']
        authorjson = json.dumps(authordict) #formats everything with double quotes
        s = requests.put(new_url, headers=api_call_headers, data = authorjson) 
        if not s.ok: #check the author update response (s), not the create response (r)
            record_fails.append(str(index) + "failed at author update:" + str(s.content[0:75])) #Add failed index to list with partial description         
            partial_record_ids.append(item_id)
        else:
            #Upload a link as a file
            link = '{"link":"https://doi.org/'+ str(doi_list[count-1]) + '"}' #count-1 because already incremented value to next index
            t = requests.post(new_url +'/files', headers=api_call_headers, data = link)
            if not t.ok: #check the link upload response (t), not the create response (r)
                record_fails.append(str(index) + "failed at doi update:" + str(t.content[0:75])) #Add failed index to list with partial description
                partial_record_ids.append(item_id)
            else:
                success_count += 1


print(success_count,"records created and updated. There were",len(result_datasets)-len(created_record_ids2),"records that were not created at all.")
print('There were',len(partial_record_ids),'records created but with a failed DOI link.')
print("Failed record descriptions:",record_fails)

## Add ids to batch publish list

In [None]:
ids = created_record_ids + created_record_ids2 #flat list of item ids from both uploads

In [None]:
df_ids = pd.DataFrame(ids)
#Save as CSV
df_ids.to_csv('items-to-publish.csv', mode = 'w', index=False)