
The Semantic Scholar Search Reranker

The code in this repo is for when you have a plain-text query and a set of academic documents, and your goal is to score how good a match each document is for the query. The standard search pipeline involves a first-stage ranker (like Elasticsearch) and a reranker. The model included with this repository is for the reranking stage only, but you may have few enough documents that a first-stage ranker is not necessary. Both the model and the featurization are fast.
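The two-stage pipeline can be sketched as follows. Both stages here are illustrative stand-ins, not part of this package: the keyword filter stands in for a first-stage ranker like Elasticsearch, and the scoring stub stands in for the reranker model in this repo.

```python
# A hedged sketch of the standard two-stage search pipeline.
def first_stage(query, docs, k=100):
    # cheap candidate retrieval: keep docs sharing at least one query term
    terms = set(query.lower().split())
    hits = [d for d in docs if terms & set(d['title'].lower().split())]
    return hits[:k]

def rerank_stub(query, docs):
    # stand-in for the reranker: more shared terms -> higher score
    terms = set(query.lower().split())
    return [len(terms & set(d['title'].lower().split())) for d in docs]

docs = [
    {'title': 'Graph Neural Networks'},
    {'title': 'Neural Machine Translation'},
    {'title': 'Bayesian Optimization'},
]
candidates = first_stage('neural networks', docs)
scores = rerank_stub('neural networks', candidates)
```

In the real pipeline, the reranker only ever sees the small candidate set returned by the first stage; with few enough documents you can skip the first stage and rerank everything.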


To install this package, run the following:

git clone
cd s2search
conda create -y --name s2search python==3.7
conda activate s2search
python setup.py develop
pip install

To obtain the necessary data, run this command after the package is installed:

aws s3 cp --no-sign-request s3://ai2-s2-research-public/ .

Then unzip the file. Inside the zip is a folder named s2search/ that contains all of the artifacts you'll need to get predictions.

Warning: this zip file is 10G compressed and 17G uncompressed.


Warning: you will need more than 17G of RAM because the large KenLM models must be loaded into memory.

An example of how to use this repo:

from s2search.rank import S2Ranker

# point to the artifacts downloaded from s3
data_dir = 's2search/'

# the data is a list of dictionaries
papers = [
    {
        'title': 'Neural Networks are Great',
        'abstract': 'Neural networks are known to be really great models. You should use them.',
        'venue': 'Deep Learning Notions',
        'authors': ['Sergey Feldman', 'Gottfried W. Leibniz'],
        'year': 2019,
        'n_citations': 100,
        'n_key_citations': 10
    },
    {
        'title': 'Neural Networks are Terrible',
        'abstract': 'Neural networks have only barely worked and we should stop working on them.',
        'venue': 'JMLR',
        'authors': ['Isaac Newton', 'Sergey Feldman'],
        'year': 2009,
        'n_citations': 5000  # we don't have n_key_citations here and that's OK
    }
]

# only do this once because we have to load the giant language models into memory
s2ranker = S2Ranker(data_dir)

# higher scores are better
print(s2ranker.score('neural networks', papers))
print(s2ranker.score('feldman newton', papers))
print(s2ranker.score('jmlr', papers))
print(s2ranker.score('missing', papers))
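Since higher scores are better, you can pair the scores with the papers and sort to get a ranked result list. A minimal sketch, using made-up score values standing in for real reranker output:

```python
# Hypothetical papers and scores; `scores` stands in for the output of
# s2ranker.score(query, papers), one float per paper.
papers = [
    {'title': 'Neural Networks are Great'},
    {'title': 'Neural Networks are Terrible'},
]
scores = [-2.1, 0.7]  # made-up values; higher is better

# pair each paper with its score and sort descending by score
ranked = [p for _, p in sorted(zip(scores, papers), key=lambda t: t[0], reverse=True)]
print([p['title'] for p in ranked])
```

Sorting on `t[0]` (the score) avoids Python trying to compare the paper dicts themselves when scores tie.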

Note that n_key_citations is a Semantic Scholar feature. If you don't have it, just leave that key out of the data dictionary. The other paper fields are required.
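Since missing required fields are an easy mistake to make, here is a small helper to check a paper dict before scoring. This helper is not part of the package; the field list is an assumption based on the example above.

```python
# Required keys inferred from the usage example; n_key_citations is optional.
REQUIRED_FIELDS = {'title', 'abstract', 'venue', 'authors', 'year', 'n_citations'}

def missing_fields(paper):
    """Return the required keys that are absent from a paper dict."""
    return REQUIRED_FIELDS - paper.keys()

incomplete = {'title': 'Neural Networks are Great', 'year': 2019}
print(sorted(missing_fields(incomplete)))
```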

