## Retrieve metadata for all items, including private or fully embargoed items, in all accounts
This notebook is for demonstration only and will need slight tweaking to give you exactly the metadata you want.
This does not retrieve metadata for collections or projects.

The end result is a spreadsheet of metadata with several things added or modified:
1. The item owner's name and email are added
2. The group the item belongs to is added
3. The author names are formatted to be more readable and ORCID is included
4. The dates are split out into their own columns
5. Any custom fields are separated out into their own columns


## Import libraries

In [1]:
import json
import requests
import pandas as pd
import csv
import datetime

## Set token, admin id, and base URL

In [2]:
#Set the token in the header.
api_call_headers = {'Authorization': 'token ENTER TOKEN'} #example: {'Authorization': 'token dkd8rskjdkfiwi49hgkw...'}

#You don't want to impersonate the admin account you are using, so put that account's id here. Retrieve it from this
#  endpoint (put the token in the upper left box): https://docs.figsh.com/#private_institution_accounts_list
token_user_id = ENTER ID #example: token_user_id = 2938474

#Set the base URL
BASE_URL = 'https://api.figsh.com/v2'

## Retrieve Metadata
1. Get a list of basic metadata for all private records
2. Select only published records (includes embargoed records)
3. For each record get the full metadata and add in the owner name and email  
4. Format dates
5. Add in the name of the Group the record is part of
6. Split out the custom metadata
7. Save the dataframe to CSV or Excel

In [3]:
#Gather all private records (make sure your token is for a top level admin account)
private_records = []
for i in range(1,2):
    records = json.loads(requests.get(BASE_URL + '/account/institution/articles?page_size=1000&page={}'.format(i), headers=api_call_headers).content)
    private_records.extend(records)

print('Gathered',len(private_records),'private records')

Gathered 42 private records
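
The loop above requests a single page (`range(1,2)`), which is enough only while there are fewer than 1000 records. A minimal sketch of paging until an empty page comes back is shown here. The names `fetch_all_pages` and `fetch_page` are hypothetical helpers, not part of the Figshare API, and the stand-in fetcher only simulates paged responses.

```python
from typing import Callable, List


def fetch_all_pages(fetch_page: Callable[[int], List[dict]], max_pages: int = 1000) -> List[dict]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    results: List[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page means there is nothing left to fetch
            break
        results.extend(batch)
    return results


# Stand-in for the real request, used here only to show the control flow.
_fake_pages = {1: [{'id': 1}, {'id': 2}], 2: [{'id': 3}]}


def fake_fetch_page(page: int) -> List[dict]:
    return _fake_pages.get(page, [])


all_records = fetch_all_pages(fake_fetch_page)
print('Gathered', len(all_records), 'records')
```

In the notebook, `fetch_page` could wrap the existing `requests.get` call for page `i` and return the decoded JSON list.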


In [6]:
#Keep records that are either public or fully embargoed
published_records = []
for item in private_records:
    if item['published_date'] is not None: #if a record has a published date
        published_records.append(item)
            
print(len(published_records), "records kept,",len(private_records) - len(published_records),"records removed")

32 records kept, 10 records removed


## Collect full metadata
This step starts from the list of basic metadata gathered above, which includes the owner id and owner name.


In [10]:
#For each id in the list, retrieve all the metadata for the article by visiting the Figshare article API endpoint 
#Admin token does not need to impersonate
#This may take a while if there are a lot of items. ~1.5 seconds per item

#First get a list of all the users. Email and name will be extracted based on account id later
users = []
for i in range(1,2):
    usr = json.loads(requests.get(BASE_URL + '/account/institution/accounts?page_size=1000&page={}'.format(i), headers=api_call_headers).content)
    users.extend(usr)

#Then gather and format each record:
full_records = []
for item in published_records: 
    s=requests.get(BASE_URL + '/account/articles/' + str(item['id']), headers=api_call_headers)
    metadata=json.loads(s.text)
    #Format the author list to be readable; a missing ORCID is left blank instead of raising an error
    author_list = ' | '.join(
        name['full_name'] + ' (ORCID: ' + (name.get('orcid_id') or '') + ')'
        for name in metadata['authors']
    )
    metadata['author_readable'] = author_list
    
    for person in users:
        if person['id'] == metadata['account_id']: 
            metadata['record_owner_name'] = person['first_name'] + ' ' + person['last_name'] #add in user name
            metadata['owner_email'] = person['email'] #add user email

    full_records.append(metadata)

print('Full metadata for',len(full_records),'records retrieved')

Full metadata for 32 records retrieved
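
The cell above scans the whole user list once per record. With many users and records, a dictionary keyed by account id may be quicker. This is a sketch under that assumption; `index_users_by_id` and `add_owner` are hypothetical helper names, and the sample user is made up.

```python
def index_users_by_id(users):
    """Build a dict mapping account id -> user record for direct lookups."""
    return {person['id']: person for person in users}


def add_owner(metadata, users_by_id):
    """Add owner name and email to a record when its account id is known."""
    person = users_by_id.get(metadata.get('account_id'))
    if person is not None:
        metadata['record_owner_name'] = person['first_name'] + ' ' + person['last_name']
        metadata['owner_email'] = person['email']
    return metadata


# Made-up sample data for illustration only.
sample_users = [
    {'id': 10, 'first_name': 'Ada', 'last_name': 'Lovelace', 'email': 'ada@example.org'},
]
users_by_id = index_users_by_id(sample_users)
sample_record = add_owner({'id': 1, 'account_id': 10}, users_by_id)
print(sample_record['record_owner_name'], sample_record['owner_email'])
```

Records whose owner is not in the user list are returned unchanged, which matches the behavior of the nested loop.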


In [72]:
#OPTIONAL: save the json. Change the file name to represent the list of ids you used.
with open('full_records-'+str(datetime.datetime.now().strftime("%Y-%m-%d"))+'.json', 'w') as f:
    json.dump(full_records, f)

In [11]:
#Create a dataframe from the JSON formatted data
df = pd.DataFrame(full_records)

### Open a previous JSON file if you need to; otherwise skip to the Formatting section

In [9]:
#If needed, open up the same file for reading. Replace the file titles as needed.
with open("full_records-DATE.json", "r", encoding='utf8') as read_file: #Replace this with the filename of your choice
    full_records = json.load(read_file)
    
#Create a dataframe from the JSON formatted data
df = pd.DataFrame(full_records)

print(len(full_records),"records")

33 records


## Format the spreadsheet

### Split out the dates

In [12]:
#The dates are all contained within one column called 'timeline'. Flatten that column and associate the values
#with the proper article id in a new dataframe

temp_date_list = []

for item in full_records:
    dateitem = item['timeline']
    dateitem['id'] = item['id']
    temp_date_list.append(dateitem)

df_dates = pd.json_normalize(
    temp_date_list 
)

#Merge the dataframes
df_formatted = df.merge(df_dates, how='outer', on='id')

print("Dates split out and merged")

Dates split out and merged
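
To make the flattening step concrete, here is a small sketch of the rows that get handed to `pd.json_normalize`: one flat dictionary per record holding its id and every date from `timeline`. Unlike the cell above, it copies each timeline instead of adding `id` to the original record. `timeline_rows` is a hypothetical helper name and the sample dates are made up.

```python
def timeline_rows(records):
    """Return one flat dict per record: its id plus every timeline date."""
    rows = []
    for record in records:
        row = dict(record.get('timeline') or {})  # copy, so the record is untouched
        row['id'] = record['id']
        rows.append(row)
    return rows


# Made-up sample records for illustration only.
sample_records = [
    {'id': 1, 'timeline': {'posted': '2021-03-04T10:00:00', 'firstOnline': '2021-03-04T10:00:00'}},
    {'id': 2, 'timeline': {}},
]
rows = timeline_rows(sample_records)
print(rows)
```

Each key in a row becomes a date column after normalizing, and records with no dates still keep a row so the merge on `id` does not drop them.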


### Add Group names
This retrieves a list of groups and then formats the dataframe so that each group has the id of its parent group. The top-level group has itself as the parent. The group names are then added to the main dataframe.

In [13]:
#Get list of groups. 
s=requests.get(BASE_URL + '/account/institution/groups', headers=api_call_headers)
groups=json.loads(s.text)

#Create a dataframe of groups
df_groups = pd.json_normalize(groups)

df_groups_parent = df_groups[['id','name']] #Create reference dataframe
df_groups = df_groups.rename(columns={'id': 'group_id','name': 'group_name'}) #Rename id col in main dataframe
df_groups_parent = df_groups_parent.rename(columns={'name': 'parent_group_name'}) #Rename name col in reference dataframe

df_groups = df_groups.sort_values(by=['parent_id'])
top_group_id = df_groups.iloc[0]['group_id'] #Store the group id for top group 

df_groups.loc[df_groups['parent_id'] == 0, 'parent_id'] = top_group_id #For top level group, replace the zero value parent id with top level group id

df_groups = df_groups.merge(df_groups_parent, how='inner',left_on=['parent_id'], right_on=['id']) #Add parent group name

df_groups = df_groups[['group_id','group_name','parent_group_name']] #Pare down to needed columns


#Merge the dataframes 
df_formatted = df_formatted.merge(df_groups, how='inner', on='group_id') #If you use 'outer' it will include a blank record for each group with no records

print("Names for",len(df_groups),"different groups were added to the metadata records")

Names for 11 different groups were added to the metadata records
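
The parent lookup done with dataframe merges above can also be pictured with plain dictionaries. This sketch assumes, as the cell does, that a `parent_id` of 0 marks the top-level group. `add_parent_names` is a hypothetical helper name and the sample groups are made up.

```python
def add_parent_names(groups):
    """Return one dict per group with its own name and its parent's name."""
    names_by_id = {group['id']: group['name'] for group in groups}
    rows = []
    for group in groups:
        # A parent_id of 0 is treated as "top level", so the group is its own parent.
        parent_id = group['parent_id'] or group['id']
        rows.append({
            'group_id': group['id'],
            'group_name': group['name'],
            'parent_group_name': names_by_id.get(parent_id),
        })
    return rows


# Made-up sample groups for illustration only.
sample_groups = [
    {'id': 100, 'name': 'University', 'parent_id': 0},
    {'id': 101, 'name': 'Library', 'parent_id': 100},
]
group_rows = add_parent_names(sample_groups)
for row in group_rows:
    print(row)
```

A group whose parent is missing from the list gets `None` as its parent name here, whereas the inner merge in the cell above would drop that group.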


### Split out custom fields
This creates new columns for each custom field.

If different groups have different custom metadata, check the output carefully to make sure things mapped properly.

In [14]:
#The custom fields are all contained within one column called 'custom_fields'. Flatten that column and associate the values
#with the proper article id in a new dataframe
custom = pd.json_normalize(
    full_records, 
    record_path =['custom_fields'], 
    meta=['id']
)
#This reshapes the data so that metadata field names are columns and each row is an id.
custom = custom.pivot(index="id", columns="name", values="value")

#Merge the dataframes so that all the custom fields are visible along with all the other metadata
df_formatted = df_formatted.merge(custom, how='outer', on='id') #Outer merge keeps records that have no custom metadata.

print("Custom fields split out and merged")

Custom fields split out and merged
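
When checking whether the custom fields mapped properly, it can help to see what a single record contributes. The sketch below builds the expected row for one record in plain Python. It assumes multi-value fields arrive as lists and joins them into one string; `custom_fields_row` is a hypothetical helper name and the sample record is made up.

```python
def custom_fields_row(record):
    """Return {'id': ..., field name: field value, ...} for one record."""
    row = {'id': record['id']}
    for field in record.get('custom_fields', []):
        value = field.get('value')
        if isinstance(value, list):  # assumption: multi-value fields arrive as lists
            value = '; '.join(str(v) for v in value)
        row[field['name']] = value
    return row


# Made-up sample record for illustration only.
sample_item = {
    'id': 1,
    'custom_fields': [
        {'name': 'Funder', 'value': 'Example Fund'},
        {'name': 'Keywords', 'value': ['alpha', 'beta']},
    ],
}
custom_row = custom_fields_row(sample_item)
print(custom_row)
```

Comparing a few of these rows against the matching rows in `df_formatted` is one way to spot fields that landed in the wrong column.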


## Save the spreadsheet

In [54]:
#Save a CSV file of all the metadata. Change the file name if necessary to match dates.
df_formatted.to_csv('all-records-'+datetime.datetime.now().strftime("%Y-%m-%d")+'.csv',encoding='utf-8')

In [16]:
#Or save an Excel file of all the metadata. Change the file name if necessary to match dates.
df_formatted.to_excel('all-records-'+datetime.datetime.now().strftime("%Y-%m-%d")+'.xlsx')