# Reading and analysing the ORCID public profiles (activities)
This notebook describes the process of extracting and analyising data from the 2017 public data release. The analysis uses the activities extract of the profiles in JSON format (https://doi.org/10.6084/m9.figshare.5479792.v1).

The method is based on the one used by Bohannon (2017, https://doi.org.10.1126/science.aal1189) for which dataset and scripts can be found here: http://dx.doi.org/10.5061/dryad.48s16.

I am using only the "activities" file as it is unlikely to establish affiliations using other profile data more accurately. Options such as searching for email domains in the public dataset are unlikely to yield additional detail as these are most likely set to private.


In [None]:
# python tarfile module is too memory expensive for reading the uncompressed archive. 
# Use command line to extract the archive onto an external hard drive

#tar -xzvf public_profiles_API2.0activities_2017_10_json.tar.gz -C ~/destination

## Setup
Load a couple of profilse to adapt the functions to the new ORCID message schema:

In [71]:
#load a profile with content to adjust functions
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/9/0000-0003-4965-2969_activities.json"))


{'educations': {'education-summary': [{'created-date': {'value': 1499540565501},
    'department-name': 'Computer Science',
    'end-date': None,
    'last-modified-date': {'value': 1499540565501},
    'organization': {'address': {'city': 'St Andrews',
      'country': 'GB',
      'region': None},
     'disambiguated-organization': None,
     'name': 'University of St Andrews'},
    'path': '/0000-0003-4965-2969/education/4229471',
    'put-code': 4229471,
    'role-title': 'Management and Information Technology',
    'source': {'source-client-id': None,
     'source-name': {'value': 'Eva Borger'},
     'source-orcid': {'host': 'orcid.org',
      'path': '0000-0003-4965-2969',
      'uri': 'http://orcid.org/0000-0003-4965-2969'}},
    'start-date': {'day': {'value': '12'},
     'month': {'value': '09'},
     'year': {'value': '2016'}},
    'visibility': 'public'},
   {'created-date': {'value': 1499540208691},
    'department-name': 'Medicine',
    'end-date': {'day': {'value': '28'},
 

In [3]:
#load a profile without content to adjust functions
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/x/0000-0003-2914-115X_activities.json"))


{'educations': {'education-summary': None,
  'last-modified-date': None,
  'path': '/0000-0003-2914-115X/educations'},
 'employments': {'employment-summary': None,
  'last-modified-date': None,
  'path': '/0000-0003-2914-115X/employments'},
 'fundings': {'group': None,
  'last-modified-date': None,
  'path': '/0000-0003-2914-115X/fundings'},
 'last-modified-date': None,
 'path': '/0000-0003-2914-115X/activities',
 'peer-reviews': {'group': None,
  'last-modified-date': None,
  'path': '/0000-0003-2914-115X/peer-reviews'},
 'works': {'group': None,
  'last-modified-date': None,
  'path': '/0000-0003-2914-115X/works'}}

## The functions needed to load the profiles

In [217]:
import json, os, sys
import pandas as pd

#the original file generator enumerated each file. Needed a workaround as we are iterating through subfolders. 
#running just the for-loop results in the same strucutre.
def file_generator(json_dir):
    ''' Using a generator allows pausing and restarting
    without having to figure out where you left off. '''
    n = 0
    for root, directories, files in os.walk(json_dir):
            item = None
            for filename in files:
                m = n
                item = m, os.path.join(root, filename)
                n += 1
                yield (item)
        
def get_profiles(data, json_files, stop = None):
    ''' Iterate over JSON files and process them '''
    for n, filepath in json_files:
        # keep track of progress
        sys.stdout.flush()
        sys.stdout.write('\r{}'.format(filepath))
        # terminate if stop is specified and reached
        if stop is not None and n >= stop:
            return
        # process this JSON file and harvest the data
        if filepath.endswith(".json"):
            with open(filepath) as f:
                profile = json.load(f)
                for row in get_affiliations(profile):
                    data.append(row)

def has_affiliation(profile):
    ''' This tests whether the profile has any affiliations '''
    try:
        if profile["educations"]["education-summary"] != None:
            return True
        if profile["employments"]["employment-summary"] != None:
            return True
    except:
        return False

def get_affiliations(profile):
    ''' For each profile, extract all affiliations and metadata '''
    profile_data = []
    orcid_id = None
    if has_affiliation(profile):
        orcid_id = profile["educations"]["path"][1:20]
        if profile["educations"]["education-summary"] != None:
            for edu in profile["educations"]["education-summary"]:
                row = [orcid_id]
                row.append(edu["organization"]["address"]["country"])
                try:
                    row.append(edu["organization"]["name"])
                except:
                    row.append(None)
                try:
                    row.append(edu["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                    row.append(None)
                try:
                    row.append(afeduf["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(edu["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(aff["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
        if profile["employments"]["employment-summary"] != None:
            for empl in profile["employments"]["employment-summary"]:
                row = [orcid_id]
                row.append(empl["organization"]["address"]["country"])
                try:
                    row.append(empl["organization"]["name"])
                except:
                    row.append(None)
                try:
                     row.append(empl["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                     row.append(None)
                try:
                    row.append(afeduf["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
    return profile_data

### Testing

In [218]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0"
json_files = file_generator(json_dir)

In [219]:
data = []

In [226]:
%%time
get_profiles(data, json_files, stop=500)

/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0/0000-0003-2129-9710_activities.jsonCPU times: user 353 ms, sys: 90.5 ms, total: 443 ms
Wall time: 697 ms


In [227]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "Ringgold_id", "start_year", "end_year", "affiliation_role"])
df.tail()

Unnamed: 0,orcid_id,country,organization_name,Ringgold_id,start_year,end_year,affiliation_role
345,0000-0003-4564-4400,KR,"College of Medicine, Korea University",http://dx.doi.org/10.13039/501100006468,,,Professor
346,0000-0003-2129-9120,US,University of California Davis,8789,,2015.0,
347,0000-0003-2129-9120,US,HP Labs,96953,,2014.0,Research Associate Intern
348,0000-0003-2129-9120,US,eBay Inc,260665,,2015.0,PhD Intern
349,0000-0003-2129-9120,US,NVIDIA Corp,196328,,2013.0,Software Engineer


In [228]:
df.orcid_id.nunique(), len(df)

(137, 350)

### Reading in all data
After successful testing of the setup, the code can now be run with all data files

In [229]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json"
json_files = file_generator(json_dir)

In [230]:
data = []

In [None]:
%%time
get_profiles(data, json_files)