Modified version of jpbarrett13's OpenAlex institution ID prediction service that predicts both OpenAlex institution IDs and ROR IDs from affiliation strings.
This code uses v2 of OpenAlex's institutional classification models to predict institution IDs and then map them to their corresponding ROR IDs. For more details on the original work, see OpenAlex's paper and jpbarrett13's notebooks for model training and inference.
-
Install the required packages:
pip install -r requirements.txt
-
Download the OpenAlex v2 model artifacts as described in the notes section of the OpenAlex institution parsing repo.
aws s3 cp s3://openalex-institution-tagger-model-artifacts/ . --recursive
-
Run the download_and_parse_openalex_institutions.py script to create the pkl file that maps OpenAlex institution IDs to ROR IDs, and place that file inside the institution_tagger_v2_artifacts directory. This step requires the AWS command-line tool to be installed. Additional details are provided in the repo for the download script.
There are two main scripts:
- openalex_ror_predictor.py: A command-line tool for batch prediction of affiliation strings.
- institution_tagger.py: The core class, InstitutionTagger, which handles the prediction logic.
To use the command-line tool:
python openalex_ror_predictor.py -i <input_file.csv> -o <output_file.csv>
The input CSV file should contain an 'affiliation_string' column with the affiliation strings to be processed.
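As a minimal sketch of preparing such an input file, the helpers below write and read a single-column CSV using Python's standard csv module. The function names write_affiliation_csv and read_affiliation_csv are hypothetical and not part of this repo; only the 'affiliation_string' column name comes from the description above.

```python
import csv

COLUMN = "affiliation_string"  # column name the command-line tool expects


def write_affiliation_csv(path, affiliations):
    """Write affiliation strings to a one-column CSV (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[COLUMN])
        writer.writeheader()
        for text in affiliations:
            # The csv module quotes values containing commas automatically.
            writer.writerow({COLUMN: text})


def read_affiliation_csv(path):
    """Read the affiliation column back, failing early if it is missing."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if COLUMN not in (reader.fieldnames or []):
            raise ValueError(f"input CSV has no '{COLUMN}' column")
        return [row[COLUMN] for row in reader]
```

Calling write_affiliation_csv("input_file.csv", [...]) with a list of affiliation strings should produce a file that can be passed to the -i option. Affiliation strings usually contain commas, so writing them through a CSV library rather than by hand avoids broken rows.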
To use the InstitutionTagger class in your code:
from institution_tagger import InstitutionTagger
tagger = InstitutionTagger(model_path="institution_tagger_v2_artifacts")
affiliation_strings = ['University of Michigan, Ann Arbor, USA', 'Getty Conservation Institute, Los Angeles']
predictions = tagger.predict(affiliation_strings)
The predictor returns a list of dictionaries, where each dictionary corresponds to an input affiliation string and contains the following keys:
- affiliation_string: The original input affiliation string.
- institution_id: A list of predicted OpenAlex institution IDs.
- ror_id: A list of corresponding ROR IDs (where available).
- score: A list of prediction scores corresponding to each institution ID.
- category: A list of prediction categories corresponding to each institution ID.
Example output:
[
    {
        'affiliation_string': 'University of Michigan, Ann Arbor, USA; Getty Conservation Institute, Los Angeles',
        'institution_id': [27837315, 200193707],
        'ror_id': ['https://ror.org/00jmfr291', 'https://ror.org/019496w77'],
        'score': [0.95, 0.85],
        'category': ['model_match', 'string_match']
    },
    {
        'affiliation_string': 'Getty Conservation Institute, Los Angeles',
        'institution_id': [200193707],
        'ror_id': ['https://ror.org/019496w77'],
        'score': [0.90],
        'category': ['basic_thresh']
    }
]
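Because each prediction holds several lists, it can be convenient to flatten the output into one row per matched institution. The sketch below assumes the four lists in each dictionary are parallel (same length, same order), as they are in the example output; if a ROR ID can be missing in practice, that assumption would need adjusting. The function name flatten_predictions and the min_score parameter are hypothetical and not part of this repo.

```python
def flatten_predictions(predictions, min_score=0.0):
    """Return one dict per (affiliation string, institution) pair.

    Assumes institution_id, ror_id, score and category are parallel lists.
    """
    keys = ("institution_id", "ror_id", "score", "category")
    rows = []
    for pred in predictions:
        lengths = {len(pred[key]) for key in keys}
        if len(lengths) > 1:
            # Refuse to guess an alignment if the lists differ in length.
            raise ValueError(
                f"misaligned lists for {pred['affiliation_string']!r}"
            )
        for values in zip(*(pred[key] for key in keys)):
            row = dict(zip(keys, values))
            if row["score"] >= min_score:
                row["affiliation_string"] = pred["affiliation_string"]
                rows.append(row)
    return rows
```

Passing the output of tagger.predict(...) to flatten_predictions yields rows that are easy to write to a CSV or load into a data frame. Setting min_score keeps only matches at or above that score; whether a given threshold is appropriate depends on how the scores are calibrated, which this README does not specify.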