# Reading and analysing the ORCID public profiles (activities)
This notebook describes the process of extracting and analyising data from the 2017 public data release. The analysis uses the activities extract of the profiles in JSON format (https://doi.org/10.6084/m9.figshare.5479792.v1).

The method is based on the one used by Bohannon (2017, https://doi.org.10.1126/science.aal1189) for which dataset and scripts can be found here: http://dx.doi.org/10.5061/dryad.48s16.

I am using only the "activities" file as it is unlikely to establish affiliations using other profile data more accurately. Options such as searching for email domains in the public dataset are unlikely to yield additional detail as these are most likely set to private.


In [None]:
# python tarfile module is too memory expensive for reading the uncompressed archive. 
# Use command line to extract the archive onto an external hard drive

#tar -xzvf public_profiles_API2.0activities_2017_10_json.tar.gz -C ~/destination

## Setup
Load a couple of profilse to adapt the functions to the new ORCID message schema:

In [274]:
#load my own ORCID profile to check contents
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/9/0000-0003-4965-2969_activities.json"))


In [275]:
#load an empty ORCID profile to check contents
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/x/0000-0003-2914-115X_activities.json"))


## The functions needed to load the profiles

In [217]:
import json, os, sys
import pandas as pd

#the original file generator enumerated each file. Needed a workaround as we are iterating through subfolders. 
#running just the for-loop results in the same strucutre.
def file_generator(json_dir):
    ''' Using a generator allows pausing and restarting
    without having to figure out where you left off. '''
    n = 0
    for root, directories, files in os.walk(json_dir):
            item = None
            for filename in files:
                m = n
                item = m, os.path.join(root, filename)
                n += 1
                yield (item)
        
def get_profiles(data, json_files, stop = None):
    ''' Iterate over JSON files and process them '''
    for n, filepath in json_files:
        # keep track of progress
        sys.stdout.flush()
        sys.stdout.write('\r{}'.format(filepath))
        # terminate if stop is specified and reached
        if stop is not None and n >= stop:
            return
        # process this JSON file and harvest the data
        if filepath.endswith(".json"):
            with open(filepath) as f:
                profile = json.load(f)
                for row in get_affiliations(profile):
                    data.append(row)

def has_affiliation(profile):
    ''' This tests whether the profile has any affiliations '''
    try:
        if profile["educations"]["education-summary"] != None:
            return True
        if profile["employments"]["employment-summary"] != None:
            return True
    except:
        return False

def get_affiliations(profile):
    ''' For each profile, extract all affiliations and metadata '''
    profile_data = []
    orcid_id = None
    if has_affiliation(profile):
        orcid_id = profile["educations"]["path"][1:20]
        if profile["educations"]["education-summary"] != None:
            for edu in profile["educations"]["education-summary"]:
                row = [orcid_id]
                row.append(edu["organization"]["address"]["country"])
                try:
                    row.append(edu["organization"]["name"])
                except:
                    row.append(None)
                try:
                    row.append(edu["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                    row.append(None)
                try:
                    row.append(afeduf["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(edu["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(aff["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
        if profile["employments"]["employment-summary"] != None:
            for empl in profile["employments"]["employment-summary"]:
                row = [orcid_id]
                row.append(empl["organization"]["address"]["country"])
                try:
                    row.append(empl["organization"]["name"])
                except:
                    row.append(None)
                try:
                     row.append(empl["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                     row.append(None)
                try:
                    row.append(afeduf["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
    return profile_data

### Testing

In [218]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0"
json_files = file_generator(json_dir)

In [219]:
data = []

In [226]:
%%time
get_profiles(data, json_files, stop=500)

/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0/0000-0003-2129-9710_activities.jsonCPU times: user 353 ms, sys: 90.5 ms, total: 443 ms
Wall time: 697 ms


In [227]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "Ringgold_id", "start_year", "end_year", "affiliation_role"])
df.tail()

Unnamed: 0,orcid_id,country,organization_name,Ringgold_id,start_year,end_year,affiliation_role
345,0000-0003-4564-4400,KR,"College of Medicine, Korea University",http://dx.doi.org/10.13039/501100006468,,,Professor
346,0000-0003-2129-9120,US,University of California Davis,8789,,2015.0,
347,0000-0003-2129-9120,US,HP Labs,96953,,2014.0,Research Associate Intern
348,0000-0003-2129-9120,US,eBay Inc,260665,,2015.0,PhD Intern
349,0000-0003-2129-9120,US,NVIDIA Corp,196328,,2013.0,Software Engineer


In [228]:
df.orcid_id.nunique(), len(df)

(137, 350)

### Reading in all data
After successful testing of the setup, the code can now be run with all data files

In [229]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json"
json_files = file_generator(json_dir)

In [230]:
#data = [] #commenting this out, so we don't accidentally reset the data frame!

In [231]:
%%time
get_profiles(data, json_files)

/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/x/0000-0003-2914-115X_activities.jsonCPU times: user 3h 50min 38s, sys: 58min 6s, total: 4h 48min 44s
Wall time: 1d 15h 7min 3s


In [317]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "Ringgold_id", "start_year", "end_year", "affiliation_role"])
df.head()

Unnamed: 0,orcid_id,country,organization_name,Ringgold_id,start_year,end_year,affiliation_role
0,0000-0001-5000-1640,KR,Sogang University Graduate School of Internati...,92200.0,,2016.0,
1,0000-0001-5000-1640,KR,Citizens' Alliance for North Korean Human Rights,,,,Deputy Director General
2,0000-0001-5000-2520,GB,University College London,4919.0,,,
3,0000-0001-5000-4390,IN,University of Delhi,28742.0,,1986.0,
4,0000-0001-5000-4390,IN,University of Delhi,28742.0,,1981.0,


In [268]:
len(df), df.orcid_id.nunique()

(3040444, 1111585)

There are 1,111,585 profiles with an education of employment affiliation. In total just over 3 million affiliations have been identified.

In [238]:
affiliation_without_dates = df[((df["start_year"].isnull()) & (df["end_year"].isnull()))]
len(affiliation_without_dates), affiliation_without_dates.orcid_id.nunique()

(1235569, 950407)

Of all the affiliations identified, 1,235,569 do not have dates. That's 950,407 of the 1.1 million ORCI profiles with affiliations in the dataset.

Identifying ORCID records with an ongoing affiliation is not trivial:
* users might not have added a start date to their affiliation
* CRIS or other local systems might not have added any dates to the asserted affiliation. 
    * This is for example the case with the information pushed from Pure to ORCID in the case of St Andrews: No start or end date is provided. For current affiliation it says "present". However, this is not a value that is part of the metadata export.  
    * Reasons for this might include privacy concerns, so this is unlikely to change and might be the case for many other systems.

Do CRIS systems add an end date to the affiliation when a researcher leaves?
* At least one case would indicate that it doesn't in the case of St Andrews: the employment information source is University of St Andrews CRIS but no end date is provided even though the researcher is no longer affiliated with the University. 



In [335]:
UStA_all = df[(df.organization_name == "University of St Andrews")]
len(UStA_all), UStA_all.orcid_id.nunique()
UStA_all[UStA_all.end_year == "2017"]

Unnamed: 0,orcid_id,country,organization_name,Ringgold_id,start_year,end_year,affiliation_role
124118,0000-0003-3429-4230,GB,University of St Andrews,,,2017,
533044,0000-0002-9168-4721,GB,University of St Andrews,,,2017,
549991,0000-0001-6744-5061,GB,University of St Andrews,,,2017,
578652,0000-0002-3705-9802,GB,University of St Andrews,,,2017,
650352,0000-0001-6139-7732,GB,University of St Andrews,,,2017,
819142,0000-0002-1727-0862,GB,University of St Andrews,http://dx.doi.org/10.13039/501100000740,,2017,
929461,0000-0002-8864-1333,GB,University of St Andrews,,,2017,
1071004,0000-0002-4331-7863,GB,University of St Andrews,http://dx.doi.org/10.13039/501100000740,,2017,Postdoctoral Research Fellow
1256526,0000-0002-0704-4714,GB,University of St Andrews,,,2017,
1270666,0000-0001-5113-4904,GB,University of St Andrews,,,2017,


There are 1136 records in the 2017 dataset with an affiliation at the University of St Andrews. Including records where the affiliation might no longer be current. 

In [323]:
UoE_all = df[(df.organization_name == "University of Edinburgh")]
len(UoE_all), UoE_all.orcid_id.nunique()

(4706, 3562)

In [287]:
#Goettingen = ongoing[((ongoing.organization_name == "Georg-August-Universität Göttingen") | (ongoing.organization_name == "Georg August University Göttingen") | 
#                      (ongoing.organization_name == "Georg August University of Göttingen") | (ongoing.organization_name == "University of Göttingen") | 
#                      (ongoing.organization_name == "Georg August University") | (ongoing.organization_name == "University of Goettingen") |
#                     (ongoing.organization_name == "University Medical Center Goettingen") | (ongoing.organization_name == "University Medical Center Göttingen") |
#                     (ongoing.organization_name == "Universitätsmedizin Göttingen"))] 
#len(Goettingen), Goettingen.orcid_id.nunique()

(816, 623)

In [327]:
Goettingen_all = df[((df.organization_name.str.contains("Göttingen")) | 
                      (df.organization_name.str.contains("Goettingen")))]
len(Goettingen_all), Goettingen_all.orcid_id.nunique()

(1539, 1149)

There is great varation in the way "Georg-August-Universität Göttingen" is referred to in ORCID profiles. The easiest way is looking for the name of the city in German and English spelling, which identifies 788 profiles with a total of 1063 affiliations.

In [328]:
# Ruhr-Universität Bochum, University of Bochum, Ruhr University Bochum...
RUB_all = df[(df.organization_name.str.contains("Bochum"))]
len(RUB_all), RUB_all.orcid_id.nunique()

(1262, 888)

In [319]:
df.dtypes

orcid_id             object
country              object
organization_name    object
Ringgold_id          object
start_year           object
end_year             object
affiliation_role     object
dtype: object

In [336]:
df[df.orcid_id =="0000-0003-4965-2969"]

Unnamed: 0,orcid_id,country,organization_name,Ringgold_id,start_year,end_year,affiliation_role
2702368,0000-0003-4965-2969,GB,University of St Andrews,,,,
2702369,0000-0003-4965-2969,DE,Philipps-Universität Marburg,9377.0,,2007.0,
2702370,0000-0003-4965-2969,GB,University of St Andrews,,,2012.0,
2702371,0000-0003-4965-2969,GB,University of St Andrews,,,2014.0,Postdoctoral Researcher
2702372,0000-0003-4965-2969,GB,The University of St Andrews,,,,
2702373,0000-0003-4965-2969,DE,Philipps-Universität Marburg,9377.0,,2008.0,Teaching Fellow
2702374,0000-0003-4965-2969,GB,University of Edinburgh,3124.0,,2016.0,UKRMP Postdoctoral Researcher
