## Retrieve all metadata, including for private or fully embargoed items, in all accounts
This notebook is for demonstration only and would need slight tweaking to give you exactly the metadata you want.
This does not retrieve metadata for collections or projects.

The end result is a spreadsheet of metadata with several things added or modified:
1. The item owner's name and email are added
2. The group the item belongs to is added
3. The author names are formatted to be more readable and ORCID is included
4. The dates are split out into their own columns
5. Any custom fields are separated out into their own columns


## Import libraries

In [1]:
import json
import requests
import pandas as pd
import csv
import datetime

## Set token

In [2]:
#Set the base URL
BASE_URL = 'https://api.figshare.com/v2'
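
The cell above sets only the base URL, and the requests later in the notebook are sent without credentials. A minimal sketch of attaching a personal token follows. It assumes the token is stored in an environment variable named `FIGSHARE_TOKEN` (a hypothetical name) and is sent in the `Authorization: token ...` header; `auth_headers` and `HEADERS` are illustrative names, not part of the original notebook.

```python
import os

def auth_headers(token):
    """Return request headers carrying a Figshare personal token."""
    if not token:
        raise ValueError("No API token supplied")
    return {"Authorization": "token " + token}

# FIGSHARE_TOKEN is a hypothetical environment variable name (assumption)
TOKEN = os.environ.get("FIGSHARE_TOKEN", "")
HEADERS = auth_headers(TOKEN) if TOKEN else {}

# Later calls could then pass the headers along, for example:
# requests.get(BASE_URL + '/articles/123', headers=HEADERS)
```

Reading the token from the environment keeps it out of the saved notebook.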

## Retrieve Metadata

In [3]:
#Gather basic metadata for items (articles) that meet your search criteria
#Advanced search syntax can be used by adding a 'search_for' key to the query
etd_records = []
for i in range(1,3): #Pages 1 and 2, 10 records per page
    query = {"institution": 231, "group": 18614, "page_size": 10, "page": i}
    response = requests.post(BASE_URL + '/articles/search', params=query)
    records = json.loads(response.content)
    etd_records.extend(records)

print('Gathered',len(etd_records),'etd records')

Gathered 20 etd records
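
The loop above requests a fixed two pages. A hedged sketch of paging until the results run out is shown below. `fetch_all_pages` and `figshare_search_page` are hypothetical helper names, and the stopping rule (a page shorter than `page_size` means the last page) is an assumption about how the search endpoint paginates.

```python
def fetch_all_pages(fetch_page, page_size=10, max_pages=100):
    """Collect results page by page until a short or empty page is returned.

    fetch_page is any callable taking (page, page_size) and returning a list.
    max_pages is a safety cap so the loop always terminates.
    """
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, page_size)
        results.extend(batch)
        if len(batch) < page_size:  # short page: assume nothing is left
            break
    return results


def figshare_search_page(page, page_size):
    """Hypothetical wrapper around the search call used in the cell above."""
    import requests
    query = {"institution": 231, "group": 18614,
             "page_size": page_size, "page": page}
    response = requests.post("https://api.figshare.com/v2/articles/search",
                             params=query)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

# etd_records = fetch_all_pages(figshare_search_page)
```

Passing the page fetcher in as a callable lets the paging logic be tried out with a stand-in function before any request is sent.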


## Collect record metadata


In [4]:
#For each id in the list, retrieve all the metadata for the article by visiting the Figshare article API endpoint 
#Admin token does not need to impersonate
#This may take a while if there are a lot of items. ~1.5 seconds per item


#Gather and format each record:
full_etd_records = []
for item in etd_records: 
    s=requests.get(BASE_URL + '/articles/' + str(item['id']))
    metadata=json.loads(s.text)
    full_etd_records.append(metadata)

print('Full metadata for',len(full_etd_records),'ETD records retrieved')

Full metadata for 20 ETD records retrieved


In [72]:
#OPTIONAL: save the json. Change the file name to represent the list of ids you used.
with open('full_etd_records-'+str(datetime.datetime.now().strftime("%Y-%m-%d"))+'.json', 'w') as f:
    json.dump(full_etd_records, f)

In [5]:
#Create a dataframe from the JSON formatted data
df = pd.DataFrame(full_etd_records)

## Gather Stats

In [18]:
#Create a list of all the article ids
article_ids = [item['id'] for item in full_etd_records]           

#Gather views and downloads
stat_file = []
for article_id in article_ids:
    views = requests.get('https://stats.figshare.com/total/views/article/' + str(article_id)).json()
    downloads = requests.get('https://stats.figshare.com/total/downloads/article/' + str(article_id)).json()
    #Build each row as a dict; a missing 'totals' becomes None instead of breaking JSON parsing
    stat_file.append({'id': article_id, 'views': views.get('totals'), 'downloads': downloads.get('totals')})

#Create a dataframe from the JSON formatted data
dfstats = pd.DataFrame(stat_file)

df = df.merge(dfstats, how='inner', on='id')

print('The resulting dataframe has',len(dfstats),'rows and they were merged to the metadata dataframe')

The resulting dataframe has 4 rows and they were merged to the metadata dataframe
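
An inner merge keeps only the ids present in both dataframes, so any item without a stats row is dropped from `df`. The sketch below uses small invented dataframes (not real Figshare data) to show one way of keeping every metadata row and listing the ids that had no stats, using the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

# Sample data, invented for illustration
df_meta = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
df_stats = pd.DataFrame({"id": [1, 3], "views": [10, 7], "downloads": [2, 0]})

# A left merge keeps every metadata row; indicator=True adds a '_merge' column
merged = df_meta.merge(df_stats, how="left", on="id", indicator=True)

# Rows flagged 'left_only' had no matching stats row
missing_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print("Items without stats:", missing_ids)
```

Whether a left or an inner merge is the right choice depends on whether items without stats should appear in the final spreadsheet.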


## Format the spreadsheet

### Split out the dates

In [20]:
#The dates are all contained within one column called 'timeline'. Flatten that column and associate the values
#with the proper article id in a new dataframe

temp_date_list = []

for item in full_etd_records:
    dateitem = dict(item['timeline']) #Copy so the original record is not modified
    dateitem['id'] = item['id']
    temp_date_list.append(dateitem)

df_dates = pd.json_normalize(
    temp_date_list 
)

#Merge the dataframes
df_formatted = df.merge(df_dates, how='outer', on='id')

print("Dates split out and merged")

Dates split out and merged
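
To make the flattening step concrete, here is a small sketch on invented records. The `timeline` keys shown (`posted`, `firstOnline`) are assumed examples; real records may carry a different set, and a key missing from one record simply becomes an empty cell.

```python
import pandas as pd

# Sample records, invented for illustration; real records carry many more keys
sample_records = [
    {"id": 1, "timeline": {"posted": "2021-05-01T10:00:00",
                           "firstOnline": "2021-05-01T10:00:00"}},
    {"id": 2, "timeline": {"posted": "2022-01-15T08:30:00"}},
]

# One flat dict per record: the id plus every key found in 'timeline'
date_rows = [{"id": rec["id"], **(rec.get("timeline") or {})}
             for rec in sample_records]
df_dates_sample = pd.DataFrame(date_rows)

print(df_dates_sample.columns.tolist())
```

Each date type ends up in its own column, keyed by `id`, which is what makes the later merge on `id` possible.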


### Split out custom fields
This creates a new column for each custom field.

If different groups have different custom metadata, check the output carefully to make sure the fields mapped properly.

In [20]:
#The custom fields are all contained within one column called 'custom_fields'. Flatten that column and associate the values
#with the proper article id in a new dataframe
custom = pd.json_normalize(
    full_etd_records, 
    record_path =['custom_fields'], 
    meta=['id']
)
#This reshapes the data so that metadata field names are columns and each row is an id.
custom = custom.pivot(index="id", columns="name", values="value")

#Merge the dataframes so that all the custom fields are visible along with all the other metadata
df_formatted = df_formatted.merge(custom, how='outer', on='id') #Outer merge keeps records that have no custom metadata.

print("Custom fields split out and merged")

Custom fields split out and merged
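
The reshape above can be checked on a tiny invented sample. The field names `Degree` and `Department` are made up for illustration. Note that a record with no custom fields produces no row in the pivoted table, which is why the outer merge matters, and that `pivot` raises an error if one record contains two fields with the same name.

```python
import pandas as pd

# Sample records, invented for illustration
sample_records = [
    {"id": 1, "custom_fields": [{"name": "Degree", "value": "PhD"},
                                {"name": "Department", "value": "Physics"}]},
    {"id": 2, "custom_fields": [{"name": "Degree", "value": "MS"}]},
    {"id": 3, "custom_fields": []},
]

# Long form: one row per (id, field name, field value)
long_form = pd.json_normalize(sample_records,
                              record_path=["custom_fields"], meta=["id"])

# Wide form: one row per id, one column per field name
wide_form = long_form.pivot(index="id", columns="name", values="value")
print(wide_form)
```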


## Save the spreadsheet

In [21]:
#Save a CSV file of all the metadata. Change the file name if necessary to match dates.
#to_csv writes the file and returns None when given a path, so the result is not stored
df_formatted.to_csv('etd-records-'+datetime.datetime.now().strftime("%Y-%m-%d")+'.csv',encoding='utf-8')

In [16]:
#Or save an Excel file of all the metadata. Change the file name if necessary to match dates.
#to_excel writes the file and returns None, so the result is not stored
df_formatted.to_excel('etd-records-'+datetime.datetime.now().strftime("%Y-%m-%d")+'.xlsx')
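
Both save cells build the same dated file name. A small sketch of factoring that into a helper is shown below; `dated_filename` is a hypothetical name, not part of the original notebook.

```python
import datetime

def dated_filename(prefix, extension, when=None):
    """Return a name like 'etd-records-2024-03-01.csv' for the given (or current) date."""
    when = when or datetime.datetime.now()
    return prefix + '-' + when.strftime("%Y-%m-%d") + '.' + extension

print(dated_filename('etd-records', 'csv'))

# Possible use with the dataframe from the cells above:
# df_formatted.to_csv(dated_filename('etd-records', 'csv'), encoding='utf-8', index=False)
```

Passing `index=False` leaves the pandas row index out of the file, which may or may not be wanted here.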