adambuttrick/ror-predictor-fasttext

ror-predictor-fasttext

ROR prediction service, trained with fastText

setup

Install fastText:

git clone https://github.com/facebookresearch/fastText.git
cd fastText
sudo pip install .

Install the Python dependencies listed in requirements.txt:

pip install -r requirements.txt

Download the model files from Hugging Face and place them in a directory. Pass the path to this directory when creating a Predictor instance, e.g.:

PREDICTOR = Predictor('path_to/model_files_dir/')

usage

See test.py for an example and test_data for sample datasets. Create an instance of the Predictor class and pass it an affiliation string and a minimum prediction confidence. In testing, a threshold of 0.85 gave a good balance between coverage and accuracy: 75-80% of affiliation strings received a prediction, and 85-90% of those predictions were correct.
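The thresholding described above can be sketched roughly as follows. This is a minimal illustration, not the repository's Predictor implementation: `predict_fn`, `predict_ror_id`, and `coverage_and_accuracy` are hypothetical names, and `predict_fn` is assumed to return `(labels, scores)` with the best match first, in the style of fastText's `predict()`. The `__label__` prefix is fastText's default. Whether the model's labels are bare ROR IDs is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: any callable mapping an affiliation string to
# (labels, scores), best match first, in the style of fastText's predict().
PredictFn = Callable[[str], Tuple[Sequence[str], Sequence[float]]]

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def predict_ror_id(
    predict_fn: PredictFn, affiliation: str, min_confidence: float = 0.85
) -> Optional[Tuple[str, float]]:
    """Return (label, score) for the top prediction, or None if it is too weak."""
    labels, scores = predict_fn(affiliation)
    if len(labels) == 0:
        return None
    label, score = labels[0], float(scores[0])
    if score < min_confidence:
        return None
    # Assumption: the label is the ROR ID with fastText's prefix in front.
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, score


def coverage_and_accuracy(
    predict_fn: PredictFn,
    labelled: Sequence[Tuple[str, str]],
    min_confidence: float = 0.85,
) -> Tuple[float, float]:
    """Share of strings that got a prediction, and share of those that were right."""
    predicted = 0
    correct = 0
    for affiliation, expected in labelled:
        result = predict_ror_id(predict_fn, affiliation, min_confidence)
        if result is None:
            continue
        predicted += 1
        if result[0] == expected:
            correct += 1
    coverage = predicted / len(labelled) if labelled else 0.0
    accuracy = correct / predicted if predicted else 0.0
    return coverage, accuracy
```

Raising `min_confidence` would typically trade coverage for accuracy, so running `coverage_and_accuracy` over a labelled sample at a few thresholds is one way to check whether 0.85 suits a particular dataset.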

training

The prediction service was trained on a subset of affiliation strings from OpenAlex that had ROR IDs whose assignments could be validated. See the OpenAlex documentation for downloading their works dataset. See parse-openalex-works for extracting the training data. See validate-ror-id-assignments for the validation logic. See the training directory for training on the validated assignments.
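For orientation, fastText's supervised mode expects one example per line, with the label prefixed by `__label__` and followed by the text. The sketch below shows one way validated (affiliation, ROR ID) pairs could be written in that format. The function names and the whitespace normalization are assumptions made for illustration. The preprocessing actually used is in the training directory and may differ.

```python
import re
from typing import Iterable, Iterator, Tuple


def to_fasttext_line(affiliation: str, ror_id: str) -> str:
    """Format one training example as '__label__<ror_id> <affiliation text>'."""
    # Collapse newlines and repeated spaces so each example stays on one line.
    text = re.sub(r"\s+", " ", affiliation).strip()
    return f"__label__{ror_id} {text}"


def to_fasttext_lines(pairs: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield training lines, skipping pairs with an empty string or empty ID."""
    for affiliation, ror_id in pairs:
        if affiliation.strip() and ror_id.strip():
            yield to_fasttext_line(affiliation, ror_id.strip())
```

A file written this way is the kind of input that `fasttext.train_supervised(input=...)` accepts.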

limitations

Training data that could be validated was only available for 64,656 ROR IDs (~63% of total ROR IDs) in the OpenAlex works dataset. See model_ids.txt for a complete list of the IDs that can be predicted. Predictions cannot be made for ROR IDs on which the service was not trained. Use the affiliation service in the ROR API for more general matching (but please run it locally using the Docker image if you're trying to match a large volume of affiliation data).
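When evaluating against known ROR IDs, it can help to check first whether an ID is one the model was trained on. The sketch below assumes model_ids.txt holds one ROR ID per line, either bare or as a full `https://ror.org/` URL. That file layout and the helper names are assumptions, not something the repository documents here.

```python
from typing import Set

ROR_URL_PREFIX = "https://ror.org/"


def normalize_ror_id(value: str) -> str:
    """Lowercase an ID and strip the ror.org URL prefix if present."""
    value = value.strip().lower()
    if value.startswith(ROR_URL_PREFIX):
        value = value[len(ROR_URL_PREFIX):]
    return value


def load_supported_ids(path: str) -> Set[str]:
    """Read the supported IDs (assumed one per line), ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {normalize_ror_id(line) for line in handle if line.strip()}


def is_supported(ror_id: str, supported: Set[str]) -> bool:
    """True if the model was trained on this ROR ID."""
    return normalize_ror_id(ror_id) in supported
```

Records whose expected ID is not in the set could then be routed to the ROR API affiliation service instead.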
