# Reading and analysing the ORCID public profiles (activities)
This notebook describes the process of extracting and analyising data from the 2017 public data release. The analysis uses the activities extract of the profiles in JSON format (https://doi.org/10.6084/m9.figshare.5479792.v1).

The method is based on the one used by Bohannon (2017, https://doi.org.10.1126/science.aal1189) for which dataset and scripts can be found here: http://dx.doi.org/10.5061/dryad.48s16.

I am initially using only the "activities" file as the additional data contained in the person-section of the ORCID metadata is less likely to enrich the data significantly. For example, email addresses, which would be useful in identifying affiliations based on email domains, are most likely goint to be set to private. 
However, useful information might be provided by researcher-urls (links) or alternative identifiers, where available, e.g. a link to an institutional profile. Although using education and employment affiliations is likely to provide a more complete dataset, it will be useful to test if links could be used to fill the gap for records without affiliations or to assert if an affiliation is still current (the link resolves).


In [None]:
# Use command line/ terminal to extract the archive onto an external hard drive
#tar -xzvf public_profiles_API2.0activities_2017_10_json.tar.gz -C ~/destination

## Setup
Load a couple of profiles to adapt the functions to the new ORCID message schema:

In [None]:
#load my own ORCID profile to check contents
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/9/0000-0003-4965-2969_activities.json"))

In [275]:
#load an empty ORCID profile to check contents
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/x/0000-0003-2914-115X_activities.json"))

## The functions needed to load the profiles

In [2]:
import json, os, sys
import pandas as pd

#the original file generator enumerated each file. Needed a workaround as we are iterating through subfolders. 
#running just the for-loop results in the same strucutre.
def file_generator(json_dir):
    ''' Using a generator allows pausing and restarting
    without having to figure out where you left off. '''
    n = 0
    for root, directories, files in os.walk(json_dir):
            item = None
            for filename in files:
                m = n
                item = m, os.path.join(root, filename)
                n += 1
                yield (item)
        
def get_profiles(data, json_files, stop = None):
    ''' Iterate over JSON files and process them '''
    for n, filepath in json_files:
        # keep track of progress
        sys.stdout.flush()
        sys.stdout.write('\r{}'.format(filepath))
        # terminate if stop is specified and reached
        if stop is not None and n >= stop:
            return
        # process this JSON file and harvest the data
        if filepath.endswith(".json"):
            with open(filepath) as f:
                profile = json.load(f)
                for row in get_affiliations(profile):
                    data.append(row)

def has_affiliation(profile):
    ''' This tests whether the profile has any affiliations '''
    try:
        if profile["educations"]["education-summary"] != None:
            return True
        if profile["employments"]["employment-summary"] != None:
            return True
    except:
        return False

def get_affiliations(profile):
    ''' For each profile, extract all affiliations and metadata '''
    profile_data = []
    orcid_id = None
    if has_affiliation(profile):
        orcid_id = profile["educations"]["path"][1:20]
        if profile["educations"]["education-summary"] != None:
            for edu in profile["educations"]["education-summary"]:
                row = [orcid_id]
                row.append(edu["organization"]["address"]["country"])
                try:
                    row.append(edu["organization"]["name"])
                except:
                    row.append(None)
                try:
                    row.append(edu["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                    row.append(None)
                try:
                    row.append(edu["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(edu["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(aff["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
        if profile["employments"]["employment-summary"] != None:
            for empl in profile["employments"]["employment-summary"]:
                row = [orcid_id]
                row.append(empl["organization"]["address"]["country"])
                try:
                    row.append(empl["organization"]["name"])
                except:
                    row.append(None)
                try:
                     row.append(empl["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                     row.append(None)
                try:
                    row.append(edu["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
    return profile_data

### Testing

In [6]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0"
json_files = file_generator(json_dir)

In [7]:
data = []

In [8]:
%%time
get_profiles(data, json_files, stop=25)

/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0/0000-0002-5497-9790_activities.jsonCPU times: user 988 ms, sys: 242 ms, total: 1.23 s
Wall time: 5.96 s


In [None]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "oganization_identifier", "start_year", "end_year", "affiliation_role"])
df.tail()

In [10]:
df.orcid_id.nunique(), len(df)

(4, 8)

### Reading in all data
After successful testing of the setup, the code can now be run with all data files

In [229]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json"
json_files = file_generator(json_dir)

In [230]:
#data = [] #commenting this out, so we don't accidentally reset the data frame!

In [231]:
%%time
get_profiles(data, json_files)

/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/x/0000-0003-2914-115X_activities.jsonCPU times: user 3h 50min 38s, sys: 58min 6s, total: 4h 48min 44s
Wall time: 1d 15h 7min 3s


In [None]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "oganization_identifier", "start_year", "end_year", "affiliation_role"])
df.head()

In [540]:
len(df), df.orcid_id.nunique()

(3040444, 1111585)

There are 1,111,585 profiles with an education or employment affiliation. In total just over 3 million affiliations have been identified.

### Affiliation dates, estimating 'active' affiliations

In [533]:
#affiliation_without_dates = df[(df["start_year"].isnull()) & (df["end_year"].isnull())]
#start year can't be used because of a mistake in function reading in the data meant all these fields are None
affiliation_without_end_year = df[(df["end_year"].isnull())]
print ("Total number of affiliations without end date:", len(affiliation_without_end_year),"unique records:",affiliation_without_end_year.orcid_id.nunique())

Total number of affiliations without end date: 1235569 unique records: 950407


Of all the affiliations identified, 1,235,569 in 950,407 ORCID records do not an end date. That's around 40% of the affiliations and represents 85% of ORCID records with an affilitation.

**Identifying ORCID records with an ongoing affiliation, as done by Bohannon (2017) is not trivial:**
* users might not have added a start date to their affiliation
* CRIS or other local systems might not have added any dates to the asserted affiliation. 
    * This is for example the case with the information pushed from Pure to ORCID in the case of St Andrews: No start or end date is provided. For current affiliation it says "present". However, this is not a value that is part of the metadata export.
    * Spot checks would indicate that CRIS system to also not necessarily add end dates to affiliations. Cases were found where, e.g. the employment information source is a CRIS but no end date is provided even though the researcher is no longer affiliated with the university.
    * Reasons for this might include privacy concerns, so this is unlikely to change and might be the case for many other systems.
* At the time of writing, a mistake in the 'get_affiliations()' function meant that all start_year fields are empty.
    * The ability to use the start year as well would have allowed identifying records where _any_ date has been added to the affiliation, increasing the confidence that those entries show current affiliations. 

Are there other routes which might enable us to estimate if an affiliation is current?
* One possibility would be to use an email address (if made public) or url added in the person section of the record. This will be tested separately as it requires further development of the functions and could include additional tests.

## Some institutional data

In [None]:
UStA_all = df[(df.organization_name.str.contains("University of St Andrews"))]
UStA_all_orgID = df[(df.organization_name.str.contains("University of St Andrews")) & (df.oganization_identifier.notnull())]
UStA_current = df[(df.organization_name.str.contains("University of St Andrews")) & (df.end_year.isnull())]

print ("University of St Andrews affililiations:", len(UStA_all), ", unique profiles: ", UStA_all.orcid_id.nunique())
print ("University of St Andrews affiliations with entry in organizational identifier field: ", UStA_all_RGID.orcid_id.nunique())
print ("Unique University of St Andrews affiliations without end date: ", len(UStA_current), ", unique profiles:", UStA_current.orcid_id.nunique())

#pd.unique(UStA_all[['oganization_identifier']].values.ravel('K')) # check how many different organizational identifiers are associated with the institution.

### Observations so far
* Looking at affiliations without an end date might still provide a reasonable estimate of how many affiliations are still 'active'
    * Spot checks for records without an end date mostly revealed profiles with an ongoing affiliation.
    * The number of false negatives might not be significant (at this time)
    * Using the information source provided in the metadata would help clarifying this further  
* Adding the start_year to the analyiss would provide greater confidence that the affiliations are in fact current, as opposed to affiliations where no dates have been added at all.
* Comparison with the number of records with urls to institutinoal profiles will still be interesting.

For institutions where a local system is able to write to member's ORCID records, it would be interesting to see how many of the affiliations have been added by it. This would require to include the source of the education or employment information.

Additional insights might also be gained from having information about the type of affiliation (education or employment), as this might not always be available from the role description. A column with this descirption could be added to the data.