Repository for NAACL 2019 paper on Citation Intent prediction



This repository contains datasets and code for classifying citation intents in academic papers.
For details on the model and data refer to our NAACL 2019 paper: "Structural Scaffolds for Citation Intent Classification in Scientific Publications".


We introduce SciCite, a new large dataset of citation intents. Download it from the following link:

scicite.tar.gz (22.1 MB)

The data is in the Jsonlines format (each line is a JSON object).
The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with a context key. Example entry:

 'string': 'In chacma baboons, male-infant relationships can be linked to both
    formation of friendships and paternity success [30,31].',
 'sectionName': 'Introduction',
 'label': 'background',
 'citingPaperId': '7a6b2d4b405439',
 'citedPaperId': '9d1abadc55b5e0',
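
As a minimal sketch of reading this format, the snippet below parses a Jsonlines stream and counts the intent labels. It assumes UTF-8 text with one JSON object per non-empty line and the label key shown in the example entry. The helper names read_jsonl and label_counts and the two sample lines are illustrative and not part of this repository.

```python
import json
from collections import Counter
from io import StringIO


def read_jsonl(handle):
    """Yield one dict per non-empty line of a Jsonlines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def label_counts(records):
    """Count how often each citation intent label occurs."""
    return Counter(record["label"] for record in records)


# Two made-up lines standing in for a real data file.
sample = StringIO(
    '{"string": "Prior work [1] studied this.", "label": "background"}\n'
    '{"string": "We use the tool of [2].", "label": "method"}\n'
)
records = list(read_jsonl(sample))
counts = label_counts(records)
print(counts)
```

For a file on disk, pass open(path, encoding="utf-8") in place of the StringIO object.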

You may obtain the full information about the paper using the provided paper ids with the Semantic Scholar API.
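
The lookup could be sketched as follows. This only builds the request URL and sends nothing. The endpoint (the Semantic Scholar Graph API paper lookup) and the field names title and year are assumptions that should be checked against the current API documentation. The id is copied from the example entry above and may be abbreviated there, so it may not resolve as written.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint: Semantic Scholar Graph API paper lookup.
API_BASE = "https://api.semanticscholar.org/graph/v1/paper/"


def paper_url(paper_id, fields=("title", "year")):
    """Build a lookup URL for one paper id (no request is sent here)."""
    query = urlencode({"fields": ",".join(fields)})
    return f"{API_BASE}{quote(paper_id, safe='')}?{query}"


# Id as printed in the example entry; it may be abbreviated there.
url = paper_url("7a6b2d4b405439")
print(url)
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen, and the response parsed with json.load.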

We also run experiments on a pre-existing dataset of citation intents in the computational linguistics domain (ACL-ARC), introduced by Jurgens et al. (2018). The preprocessed dataset is available at ACL-ARC data.


The project needs Python 3.6 and is based on the AllenNLP library.

Set up an environment manually

Use pip to install the dependencies in your desired Python environment:

pip install -r [requirements-file] -c constraints.txt

Running a pre-trained model on your own data

Download one of the pre-trained models and run the following command:

allennlp predict [path-to-model.tar.gz] [path-to-data.jsonl] \
--predictor [predictor-type] \
--include-package scicite \
--overrides "{'model':{'data_format':''}}"


  • [path-to-data.jsonl] contains the data in the same format as the training data.
  • [path-to-model.tar.gz] is the path to the pretrained model.
  • [predictor-type] is one of predictor_scicite (for the SciCite dataset format) or predictor_aclarc (for the ACL-ARC dataset format).
  • --output-file [out-path.jsonl] is an optional argument giving the path to the output file. If you omit it, the output is printed to stdout.
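
If you prefer to launch the command above from Python, a minimal sketch is shown below. It only assembles the argument list from the flags documented above. The function name build_predict_command and the file names in the example call are hypothetical.

```python
import shlex

# The two predictor types named above.
PREDICTORS = ("predictor_scicite", "predictor_aclarc")


def build_predict_command(model_path, data_path, predictor, output_file=None):
    """Assemble the allennlp predict invocation as an argument list."""
    if predictor not in PREDICTORS:
        raise ValueError(f"predictor must be one of {PREDICTORS}")
    command = [
        "allennlp", "predict", model_path, data_path,
        "--predictor", predictor,
        "--include-package", "scicite",
        "--overrides", "{'model':{'data_format':''}}",
    ]
    if output_file is not None:
        # Optional: write predictions to a file instead of stdout.
        command += ["--output-file", output_file]
    return command


# Hypothetical file names, for illustration only.
command = build_predict_command(
    "model.tar.gz", "data.jsonl", "predictor_scicite", "out.jsonl"
)
print(" ".join(shlex.quote(part) for part in command))
```

The list can then be passed to subprocess.run(command, check=True) in an environment where the dependencies are installed.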

If you are using your own data, you first need to convert it to the SciCite data format.
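
A minimal conversion sketch follows. It assumes your data is available as (text, section name, citing id, cited id) tuples and that the keys from the example entry in the data section are the ones to emit. Which keys the predictor strictly requires is not stated here, so treat the key set as an assumption. The helper names and the sample row are illustrative.

```python
import json


def to_scicite_record(text, section_name, citing_id, cited_id, label=None):
    """Map one citation context onto the keys used in the example entry."""
    record = {
        "string": text,
        "sectionName": section_name,
        "citingPaperId": citing_id,
        "citedPaperId": cited_id,
    }
    if label is not None:
        # Only include a label when one is known.
        record["label"] = label
    return record


def dumps_jsonl(records):
    """Serialize records as Jsonlines text: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# A made-up row standing in for your own data.
rows = [("We follow the setup of [3].", "Methods", "paper-a", "paper-b")]
jsonl_text = dumps_jsonl(to_scicite_record(*row) for row in rows)
print(jsonl_text, end="")
```

Writing jsonl_text to a file gives an input that can be passed as [path-to-data.jsonl].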

Pretrained models

We also release our pretrained models; download from the following path:

Training your own models

First you need a config file for your training configuration. Check the experiment_configs/ directory for example configurations. Important options (you can specify them with environment variables) are:

  "train_data_path":  # path to training data,
  "validation_data_path":  #path to development data,
  "test_data_path":  # path to test data,
  "train_data_path_aux": # path to the data for section title scaffold,
  "train_data_path_aux2": # path to the data for citation worthiness scaffold,
  "mixing_ratio": # parameter \lambda_2 in the paper (sensitivity of loss to the first scaffold)
  "mixing_ratio2": # parameter \lambda_3 in the paper (sensitivity of loss to the second scaffold)

After downloading the data, edit the configuration file with the correct paths. You also need to pass in an environment variable specifying whether to use ELMo contextualized embeddings.

export elmo=true

Note that training is significantly slower with ELMo.

After making sure you have the correct configuration file, start training the model.

python scripts/ train_multitask_2 [path-to-config-file.json] \
-s [path-to-serialization-dir/] \
--include-package scicite

The model output and logs will be stored in [path-to-serialization-dir/].


If you find our dataset or code useful, please cite Structural Scaffolds for Citation Intent Classification in Scientific Publications.

  author={Arman Cohan and Waleed Ammar and Madeleine Van Zuylen and Field Cady},
  title={Structural Scaffolds for Citation Intent Classification in Scientific Publications},