# Reading and analysing the ORCID public profiles (activities)
This notebook describes the process of extracting and analyising data from the 2017 public data release. The analysis uses the activities extract of the profiles in JSON format (https://doi.org/10.6084/m9.figshare.5479792.v1).

The method is based on the one used by Bohannon (2017, https://doi.org.10.1126/science.aal1189) for which dataset and scripts can be found here: http://dx.doi.org/10.5061/dryad.48s16.

I am initially using only the "activities" file as the additional data contained in the person-section of the ORCID metadata is less likely to enrich the data significantly. For example, email addresses, which would be useful in identifying affiliations based on email domains, are most likely goint to be set to private. 
However, useful information might be provided by researcher-urls (links) or alternative identifiers, where available, e.g. a link to an institutional profile. Although using education and employment affiliations is likely to provide a more complete dataset, it will be useful to test if links could be used to fill the gap for records without affiliations or to assert if an affiliation is still current (the link resolves).


In [None]:
# Use command line/ terminal to extract the archive onto an external hard drive
#tar -xzvf public_profiles_API2.0activities_2017_10_json.tar.gz -C ~/destination

## Setup
Load a couple of profiles to adapt the functions to the new ORCID message schema:

In [536]:
#load my own ORCID profile to check contents
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/9/0000-0003-4965-2969_activities.json"))

{'educations': {'education-summary': [{'created-date': {'value': 1499540565501},
    'department-name': 'Computer Science',
    'end-date': None,
    'last-modified-date': {'value': 1499540565501},
    'organization': {'address': {'city': 'St Andrews',
      'country': 'GB',
      'region': None},
     'disambiguated-organization': None,
     'name': 'University of St Andrews'},
    'path': '/0000-0003-4965-2969/education/4229471',
    'put-code': 4229471,
    'role-title': 'Management and Information Technology',
    'source': {'source-client-id': None,
     'source-name': {'value': 'Eva Borger'},
     'source-orcid': {'host': 'orcid.org',
      'path': '0000-0003-4965-2969',
      'uri': 'http://orcid.org/0000-0003-4965-2969'}},
    'start-date': {'day': {'value': '12'},
     'month': {'value': '09'},
     'year': {'value': '2016'}},
    'visibility': 'public'},
   {'created-date': {'value': 1499540208691},
    'department-name': 'Medicine',
    'end-date': {'day': {'value': '28'},
 

In [275]:
#load an empty ORCID profile to check contents
json.load(open("/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/x/0000-0003-2914-115X_activities.json"))

## The functions needed to load the profiles

In [217]:
import json, os, sys
import pandas as pd

#the original file generator enumerated each file. Needed a workaround as we are iterating through subfolders. 
#running just the for-loop results in the same strucutre.
def file_generator(json_dir):
    ''' Using a generator allows pausing and restarting
    without having to figure out where you left off. '''
    n = 0
    for root, directories, files in os.walk(json_dir):
            item = None
            for filename in files:
                m = n
                item = m, os.path.join(root, filename)
                n += 1
                yield (item)
        
def get_profiles(data, json_files, stop = None):
    ''' Iterate over JSON files and process them '''
    for n, filepath in json_files:
        # keep track of progress
        sys.stdout.flush()
        sys.stdout.write('\r{}'.format(filepath))
        # terminate if stop is specified and reached
        if stop is not None and n >= stop:
            return
        # process this JSON file and harvest the data
        if filepath.endswith(".json"):
            with open(filepath) as f:
                profile = json.load(f)
                for row in get_affiliations(profile):
                    data.append(row)

def has_affiliation(profile):
    ''' This tests whether the profile has any affiliations '''
    try:
        if profile["educations"]["education-summary"] != None:
            return True
        if profile["employments"]["employment-summary"] != None:
            return True
    except:
        return False

def get_affiliations(profile):
    ''' For each profile, extract all affiliations and metadata '''
    profile_data = []
    orcid_id = None
    if has_affiliation(profile):
        orcid_id = profile["educations"]["path"][1:20]
        if profile["educations"]["education-summary"] != None:
            for edu in profile["educations"]["education-summary"]:
                row = [orcid_id]
                row.append(edu["organization"]["address"]["country"])
                try:
                    row.append(edu["organization"]["name"])
                except:
                    row.append(None)
                try:
                    row.append(edu["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                    row.append(None)
                try:
                    row.append(edu["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(edu["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(aff["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
        if profile["employments"]["employment-summary"] != None:
            for empl in profile["employments"]["employment-summary"]:
                row = [orcid_id]
                row.append(empl["organization"]["address"]["country"])
                try:
                    row.append(empl["organization"]["name"])
                except:
                    row.append(None)
                try:
                     row.append(empl["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
                except:
                     row.append(None)
                try:
                    row.append(edu["start-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["end-date"]["year"]["value"])
                except:
                    row.append(None)
                try:
                    row.append(empl["role-title"])
                except:
                    row.append(None)
                profile_data.append(row)
    return profile_data

### Testing

In [218]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0"
json_files = file_generator(json_dir)

In [219]:
data = []

In [226]:
%%time
get_profiles(data, json_files, stop=500)

/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/0/0000-0003-2129-9710_activities.jsonCPU times: user 353 ms, sys: 90.5 ms, total: 443 ms
Wall time: 697 ms


In [227]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "oganization_identifier", "start_year", "end_year", "affiliation_role"])
df.tail()

Unnamed: 0,orcid_id,country,organization_name,Ringgold_id,start_year,end_year,affiliation_role
345,0000-0003-4564-4400,KR,"College of Medicine, Korea University",http://dx.doi.org/10.13039/501100006468,,,Professor
346,0000-0003-2129-9120,US,University of California Davis,8789,,2015.0,
347,0000-0003-2129-9120,US,HP Labs,96953,,2014.0,Research Associate Intern
348,0000-0003-2129-9120,US,eBay Inc,260665,,2015.0,PhD Intern
349,0000-0003-2129-9120,US,NVIDIA Corp,196328,,2013.0,Software Engineer


In [228]:
df.orcid_id.nunique(), len(df)

(137, 350)

### Reading in all data
After successful testing of the setup, the code can now be run with all data files

In [229]:
json_dir = "/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json"
json_files = file_generator(json_dir)

In [230]:
#data = [] #commenting this out, so we don't accidentally reset the data frame!

In [231]:
%%time
get_profiles(data, json_files)

/media/eva/Eva-passport/ORCIDpubData2017/public_profiles_API-2.0-activities_2017_10_json/x/0000-0003-2914-115X_activities.jsonCPU times: user 3h 50min 38s, sys: 58min 6s, total: 4h 48min 44s
Wall time: 1d 15h 7min 3s


In [539]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "oganization_identifier", "start_year", "end_year", "affiliation_role"])
df.head()

Unnamed: 0,orcid_id,country,organization_name,oganization_identifier,start_year,end_year,affiliation_role
0,0000-0001-5000-1640,KR,Sogang University Graduate School of Internati...,92200.0,,2016.0,
1,0000-0001-5000-1640,KR,Citizens' Alliance for North Korean Human Rights,,,,Deputy Director General
2,0000-0001-5000-2520,GB,University College London,4919.0,,,
3,0000-0001-5000-4390,IN,University of Delhi,28742.0,,1986.0,
4,0000-0001-5000-4390,IN,University of Delhi,28742.0,,1981.0,


In [540]:
len(df), df.orcid_id.nunique()

(3040444, 1111585)

There are 1,111,585 profiles with an education or employment affiliation. In total just over 3 million affiliations have been identified.

### Affiliation dates, estimating 'active' affiliations

In [533]:
#affiliation_without_dates = df[(df["start_year"].isnull()) & (df["end_year"].isnull())]
#start year can't be used because of a mistake in function reading in the data meant all these fields are None
affiliation_without_end_year = df[(df["end_year"].isnull())]
print ("Total number of affiliations without end date:", len(affiliation_without_end_year),"unique records:",affiliation_without_end_year.orcid_id.nunique())

Total number of affiliations without end date: 1235569 unique records: 950407


Of all the affiliations identified, 1,235,569 in 950,407 ORCID records do not an end date. That's around 40% of the affiliations and represents 85% of ORCID records with an affilitation.

**Identifying ORCID records with an ongoing affiliation, as done by Bohannon (2017) is not trivial:**
* users might not have added a start date to their affiliation
* CRIS or other local systems might not have added any dates to the asserted affiliation. 
    * This is for example the case with the information pushed from Pure to ORCID in the case of St Andrews: No start or end date is provided. For current affiliation it says "present". However, this is not a value that is part of the metadata export.
    * Spot checks would indicate that CRIS system to also not necessarily add end dates to affiliations. Cases were found where, e.g. the employment information source is a CRIS but no end date is provided even though the researcher is no longer affiliated with the university.
    * Reasons for this might include privacy concerns, so this is unlikely to change and might be the case for many other systems.
* At the time of writing, a mistake in the 'get_affiliations()' function meant that all start_year fields are empty.
    * The ability to use the start year as well would have allowed identifying records where _any_ date has been added to the affiliation, increasing the confidence that those entries show current affiliations. 

Are there other routes which might enable us to estimate if an affiliation is current?
* One possibility would be to use an email address (if made public) or url added in the person section of the record. This will be tested separately as it requires further development of the functions and could include additional tests.

## Some institutional data

In [541]:
UStA_all = df[(df.organization_name.str.contains("University of St Andrews"))]
UStA_all_orgID = df[(df.organization_name.str.contains("University of St Andrews")) & (df.oganization_identifier.notnull())]
UStA_current = df[(df.organization_name.str.contains("University of St Andrews")) & (df.end_year.isnull())]

print ("University of St Andrews affililiations:", len(UStA_all), ", unique profiles: ", UStA_all.orcid_id.nunique())
print ("University of St Andrews affiliations with entry in organizational identifier field: ", UStA_all_RGID.orcid_id.nunique())
print ("Unique University of St Andrews affiliations without end date: ", len(UStA_current), ", unique profiles:", UStA_current.orcid_id.nunique())

#pd.unique(UStA_all[['oganization_identifier']].values.ravel('K')) # check how many different organizational identifiers are associated with the institution.

University of St Andrews affililiations: 1553 , unique profiles:  1225
University of St Andrews affiliations with entry in organizational identifier field:  217
Unique University of St Andrews affiliations without end date:  492 , unique profiles: 430


There are 1225 records in the 2017 dataset with an affiliation at the University of St Andrews, including records where the affiliation might no longer be current. 217 of the profiles contain an identifier, however, this is a funder DOI, not an orgnisational identifier. 

492 of the affiliations have an end date associated with them, suggesting that these affiliations are no longer active. Among these are 430 unique records, which would be those records with an active affiliation or where no end date has been added by the CRIS/ the user.

_Note:_
Affiliations provided by the University's CRIS system or those chosen from the drop-down menu in the user interface more recently do not have a oganizational identifier id or grid identifier associated with them. (see: [info about institutional identifiers on ORCID pages](https://members.orcid.org/api/resources/orgids-in-orcid#fundref))

In [543]:
UoE = df[(df.organization_name.str.contains("University of Edinburgh")) & (df.oganization_identifier.notnull())]
UoE_orgID = df[(df.organization_name.str.contains("University of Edinburgh")) & (df.oganization_identifier.notnull())]
UoE_current = df[(df.organization_name.str.contains("University of Edinburgh")) & (df.end_year.isnull())]

print ("University of Edinburgh affililiations", len(UoE), ", unique profiles:", UoE.orcid_id.nunique())
print ("University of Edinburgh affiliations with organizational identifier: ", UoE_orgID.orcid_id.nunique())
print ("University of Edinburgh affiliations without end date: ", len(UoE_current), ", unique profiles:", UoE_current.orcid_id.nunique())

#pd.unique(UoE_orgID[['oganization_identifier']].values.ravel('K'))

University of Edinburgh affililiations 4354 , unique profiles: 3324
University of Edinburgh affiliations with organizational identifier:  3324
University of Edinburgh affiliations without end date:  1322 , unique profiles: 1237


There are 3875 unique ORCID profiles which have an affiliation with the University of Edinburgh, including records where the affiliation might no longer be current. A oganizational identifier is present in most of these, as 3324 of records with University of Edinburgh affiliation have an identifier. 

There are only two identifiers associated with University of Edinburgh affiliations, one for the University and one for the Roslin Institute.

Of the 5146 affiliatinos, 1322 affiliations in 1237 profiles have no end date associated with them.

In [544]:
Goettingen = df[((df.organization_name.str.contains("Georg-August-Universität Göttingen")) | (df.organization_name.str.contains("Georg August University Göttingen")) | 
                      (df.organization_name.str.contains("Georg August University of Göttingen")) | (df.organization_name.str.contains("University of Göttingen")) | 
                      (df.organization_name.str.contains("Georg August University")) | (df.organization_name.str.contains("University of Goettingen")) |
                     (df.organization_name.str.contains("University Medical Center Goettingen")) | (df.organization_name.str.contains("University Medical Center Göttingen")) |
                     (df.organization_name.str.contains("Universitätsmedizin Göttingen")))] 
#Goettingen_all = df[(df.organization_name.str.contains("Göttingen")) | (df.organization_name.str.contains("Goettingen"))]
Goettingen_orgID = Goettingen[(Goettingen.oganization_identifier.notnull())]
Goettingen_current = Goettingen[(df.end_year.isnull())]

print ("University of Goettingen affililiations", len(Goettingen))
print ("Unique University of Goettingen profiles: ", Goettingen.orcid_id.nunique())
print ("Unique University of Goettingen profiles with oganizational identifier: ", Goettingen_orgID.orcid_id.nunique())
print ("University of Goettingen affiliations without end date: ", len(Goettingen_current), ", unique profiles:", Goettingen_current.orcid_id.nunique())

#Goettingen.head()
#pd.unique(Goettingen_orgID[['organization_name']].values.ravel('K')) # check which organization names exist in affiliation with the institution.
#pd.unique(Goettingen_orgID[['oganization_identifier']].values.ravel('K')) # check how many different oganizational identifier iDs are associated with the institution.

  


University of Goettingen affililiations 1371
Unique University of Goettingen profiles:  1041
Unique University of Goettingen profiles with oganizational identifier:  790
University of Goettingen affiliations without end date:  409 , unique profiles: 373


There is significant variation in how affiliation Georg-August-Universität Göttingen is referred to in ORCID profiles. In addition, there are units which are part of the University but whose name does not include the Univerity's full name (e.g. the University's Medical Center) and which use separate oganizational identifier identifiers. One option to capture all these variants might be to use just the name of the city, possibly using matches to german and english spelling for the search, however there are a number of institutions which include the city's name, but are not associated with the University. 

The most accurate way is therefore to use matches to the most common variants of the University's name and of its Medical Center. This reveals 1041 profiles with current or former affiliations to the University, out of which 790 have a oganization identifier identifier. The funder DOIs for both the University and the Max-Planck Gesellschaft are also listed in the oganization_identifier field.

Of the 1371 affiliations, 409 affiliations in 373 records do not have an end date.

In [545]:
# Ruhr-Universität Bochum, University of Bochum, Ruhr University Bochum...
RUB_all = df[(df.organization_name.str.contains("Ruhr-Universität Bochum")) | (df.organization_name.str.contains("University of Bochum")) | (df.organization_name.str.contains("University Bochum")) | (df.organization_name.str.contains("Ruhr-University Bochum"))]
RUB_all_orgID = df[((df.organization_name.str.contains("Ruhr-Universität Bochum")) | (df.organization_name.str.contains("University of Bochum")) | (df.organization_name.str.contains("University Bochum")) | (df.organization_name.str.contains("Ruhr-University Bochum"))) & (df.oganization_identifier.notnull())]
RUB_current = df[(df.organization_name.str.contains("Ruhr-Universität Bochum")) | (df.organization_name.str.contains("University of Bochum")) | (df.organization_name.str.contains("University Bochum")) | (df.organization_name.str.contains("Ruhr-University Bochum")) & (df.end_year.isnull())]

print ("University of Bochum affililiations", len(RUB_all))
print ("Unique University of Bochum profiles: ", RUB_all.orcid_id.nunique())
print ("Unique University of Bochum profiles with oganizational identifier: ", RUB_all_orgID.orcid_id.nunique())
print ("Unique University of Bochum profiles without end date: ", len(RUB_current), ", unique profiles:", RUB_current.orcid_id.nunique())

#pd.unique(df[(df.organization_name.str.contains("Bochum"))].organization_name.values.ravel('K'))
#pd.unique(RUB_all[['oganization_identifier']].values.ravel('K'))

University of Bochum affililiations 1044
Unique University of Bochum profiles:  749
Unique University of Bochum profiles with oganizational identifier:  696
Unique University of Bochum profiles without end date:  1044 , unique profiles: 749


There are 749 unique ORCID records with an affiliation containing "Bochum" in its name, out of a total of 1044 affiliations. When using just "Bochum" for selecting the organization, a number of other units which don't belong to the University are included (teacher training center, hospital), so more specific terms are used for the search. The majority of the records, 825, also have the a Ringgold or funder ID associated with them although variation is again present where affiliations are units belonging to the University, rather than the parent organisation.

Of the 1044 affiliations, 516 affiliations in 473 records do not have an end date associated with them.

In [546]:
BIE =  df[(df.organization_name.str.contains("Universität Bielefeld")) | (df.organization_name.str.contains("University of Bielefeld")) |(df.organization_name.str.contains("University Bielefeld")) |(df.organization_name.str.contains("Bielefeld University"))]
BIE_orgID = df[(df.organization_name.str.contains("Universität Bielefeld")) | (df.organization_name.str.contains("University of Bielefeld")) |(df.organization_name.str.contains("University Bielefeld")) |(df.organization_name.str.contains("Bielefeld University")) & (df.oganization_identifier.notnull())]
BIE_current = df[((df.organization_name.str.contains("Universität Bielefeld")) | (df.organization_name.str.contains("University of Bielefeld")) |(df.organization_name.str.contains("University Bielefeld")) |(df.organization_name.str.contains("Bielefeld University"))) & (df.end_year.isnull())]

print ("University of Bielefeld affililiations", len(BIE))
print ("Unique University of Bielefeld profiles: ", BIE.orcid_id.nunique())
print ("Unique University of Bielefeld profiles with oganizational identifier: ", BIE_orgID.orcid_id.nunique())
print ("Unique University of Bielefeld profiles without end date: ", len(BIE_current), ", unique profiles:", BIE_current.orcid_id.nunique())

#pd.unique(BIE[["organization_name"]].organization_name.values.ravel('K'))
#pd.unique(RUB_all[['oganization_identifier']].values.ravel('K'))
BIE_current.head()

University of Bielefeld affililiations 454
Unique University of Bielefeld profiles:  324
Unique University of Bielefeld profiles with oganizational identifier:  304
Unique University of Bielefeld profiles without end date:  126 , unique profiles: 120


Unnamed: 0,orcid_id,country,organization_name,oganization_identifier,start_year,end_year,affiliation_role
16602,0000-0003-3586-2930,DE,Universität Bielefeld,235712,,,Postdoc researcher
34753,0000-0002-8005-6420,DE,Universität Bielefeld Fakultät für Biologie,98894,,,
47023,0000-0001-8466-6480,DE,Universität Bielefeld,235712,,,
47024,0000-0001-8466-6480,DE,Universität Bielefeld,235712,,,wissenschaftliche mitarbeiterin
75772,0000-0002-5834-3730,DE,Universität Bielefeld Lehrstuhl für Gentechnol...,210424,,,


There are 454 affiliations mentioned in 324 ORCID records for the University of Bielefeld. Of these, the majority, 126 (in 120 profiles) do not have an end date. 

_Note that using just the search-term "Bielefeld" results in 533 affiliations in 377 records. However, these additional affiliations appear to mostly refer to the local hospitals and University of Applied Sciences. _

In [547]:
CORNELL_all = df[(df.organization_name.str.contains("Cornell"))]
CORNELL_all_orgID = df[(df.organization_name.str.contains("Cornell")) & (df.oganization_identifier.notnull())]
CORNELL_current = df[(df.organization_name.str.contains("Cornell")) & (df.end_year.isnull())]

print ("Cornell University affililiations", len(CORNELL_all))
print ("Unique Cornell University profiles: ", CORNELL_all.orcid_id.nunique())
print ("Unique Cornell University profiles with oganizational identifier ID: ", CORNELL_all_orgID.orcid_id.nunique())
print ("Unique Cornell University profiles without end date: ", len(CORNELL_current), ", unique profiles:", CORNELL_current.orcid_id.nunique())

#pd.unique(CORNELL_all[['organization_name']].values.ravel('K'))
#pd.unique(CORNELL_all_RGID[['oganization_identifier']].values.ravel('K'))

Cornell University affililiations 6244
Unique Cornell University profiles:  5255
Unique Cornell University profiles with oganizational identifier ID:  4892
Unique Cornell University profiles without end date:  1750 , unique profiles: 1637


There are 5255 ORCID profiles with 'Cornell' affiliation. 4482 records are identified when limiting the search to "Cornell University", but this excludes some of the centers associated with the University, especially the Medical College. Most of the records have a Ringgold ID assoicated with them (4892).

Of the 6244 affiliatinos, 1750 affiliations in 1637 profiles do not have an end date.

In [548]:
SYR_all = df[(df.organization_name.str.contains("Syracuse"))]
SYR_all_orgID = df[(df.organization_name.str.contains("Syracuse")) & (df.oganization_identifier.notnull())]
SYR_current = df[(df.organization_name.str.contains("Syracuse")) & (df.end_year.isnull())]

print ("Syracuse University affililiations", len(SYR_all))
print ("Unique Syracuse University profiles: ", SYR_all.orcid_id.nunique())
print ("Unique Syracuse University profiles with oganizational identifier ID: ", SYR_all_orgID.orcid_id.nunique())
print ("Unique Syracuse University profiles without end date: ", len(SYR_current), ", unique profiles:", SYR_current.orcid_id.nunique())

#pd.unique(SYR_all[['organization_name']].values.ravel('K'))SYR_all

Syracuse University affililiations 1046
Unique Syracuse University profiles:  860
Unique Syracuse University profiles with oganizational identifier ID:  800
Unique Syracuse University profiles without end date:  278 , unique profiles: 263


There are 860 unique ORCID records mentioning affiliation with Syracuse, with a total of 1046 affiliations. 800 of these records also have a Ringgold identifier.

Of the 1046 affiliations, 278 affiliations in 263 profiles do not have an end date associated with them.

In [549]:
HARV = df[(df.organization_name.str.contains("Harvard University"))]
HARV_orgID = df[(df.organization_name.str.contains("Harvard University")) & (df.oganization_identifier.notnull())]
HARV_current = df[(df.organization_name.str.contains("Harvard University")) & (df.end_year.isnull())]

print ("Harvard University affililiations", len(HARV))
print ("Unique Harvard University profiles: ", HARV.orcid_id.nunique())
print ("Unique Harvard University profiles with Ringgold ID: ", HARV_orgID.orcid_id.nunique())
print ("Unique Harvard University profiles without end date: ",len(HARV_current), ", unique profiles:", HARV_current.orcid_id.nunique())

#pd.unique(SYR_all[['organization_name']].values.ravel('K'))

Harvard University affililiations 5304
Unique Harvard University profiles:  4641
Unique Harvard University profiles with Ringgold ID:  4342
Unique Harvard University profiles without end date:  1154 , unique profiles: 1096


There are 5304 Harvard University affiliations in the 2017 public adata file, belonging to 4641 profiles. Of these, most, 4342, also have a Ringgold identifier.

Of the 5304 Harvard affiliations, 1154 in 1096 profiles do not have an end date.

### Observations so far
* Looking at affiliations without an end date might still provide a reasonable estimate of how many affiliations are still 'active'
    * Spot checks for records without an end date mostly revealed profiles with an ongoing affiliation.
    * The number of false negatives might not be significant (at this time)
    * Using the information source provided in the metadata would help clarifying this further  
* Adding the start_year to the analyiss would provide greater confidence that the affiliations are in fact current, as opposed to affiliations where no dates have been added at all.
* Comparison with the number of records with urls to institutinoal profiles will still be interesting.

For institutions where a local system is able to write to member's ORCID records, it would be interesting to see how many of the affiliations have been added by it. This would require to include the source of the education or employment information.

Additional insights might also be gained from having information about the type of affiliation (education or employment), as this might not always be available from the role description. A column with this descirption could be added to the data.

There's a pattern: 
* For most institutions looked at so far, 20-30% of all affiliations (and records) do not have an end date. 
* Instead, for Bochum a much higher proportion of affilaitions (nearly half) do not have an end date. _What is the difference there? Has this to do with technical implementation at the institution? Is user behaviour different?_