Data and code for the NAACL 2018 paper "A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications" by Kang et al.


Data and code for "A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications" by Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy and Roy Schwartz, NAACL 2018

The PeerRead dataset

PeerRead is a dataset of scientific peer reviews made available to help researchers study this important artifact. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.

We structured the dataset into sections, each corresponding to a venue or an arXiv category, e.g., ./data/acl_2017 and ./data/arxiv.cs.cl_2007-2017. Each section is further divided into train/dev/test splits (the same splits used in the paper). Due to licensing constraints, for some sections (e.g., ./data/nips_2013-2017/) we provide instructions for downloading the data instead of including it in this repository.
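
As a rough illustration of the layout described above, the following sketch walks a data directory and reports how many files sit under each split of each section. It assumes that the splits are subdirectories named train, dev and test inside each section directory; that naming, and the helper name summarize_sections, are assumptions made for this example and not part of the released code. It makes no assumption about the file layout inside a split.

```python
from pathlib import Path

# Assumed split subdirectory names; adjust if the actual layout differs.
SPLITS = ("train", "dev", "test")


def summarize_sections(data_root):
    """Map each section directory to the number of files under each split.

    Hypothetical helper for illustration only.
    """
    root = Path(data_root)
    if not root.is_dir():
        return {}
    summary = {}
    for section in sorted(p for p in root.iterdir() if p.is_dir()):
        counts = {}
        for split in SPLITS:
            split_dir = section / split
            if split_dir.is_dir():
                # Count regular files at any depth below the split directory.
                counts[split] = sum(1 for f in split_dir.rglob("*") if f.is_file())
        summary[section.name] = counts
    return summary
```

Calling summarize_sections("./data") from the root of the repository would return a dictionary keyed by section name. Sections whose data has not been downloaded yet would show up with missing splits or zero counts.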


To experiment with (and hopefully improve) our models for aspect prediction and for predicting whether a paper will be accepted, see ./code/.

Setup Configuration

Run ./ at the root of this repository to install dependencies and download some of the larger data files not included in this repo.


Citation

  title = {A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications},
  author = {Dongyeop Kang and Waleed Ammar and Bhavana Dalvi and Madeleine van Zuylen and Sebastian Kohlmeier and Eduard Hovy and Roy Schwartz},
  booktitle = {Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  address = {New Orleans, USA},
  month = {June},
  url = {},
  year = {2018}


Acknowledgements

  • We use some of the code in CanaanShen for web crawling.
  • We use some of the code in jiegzhan for our aspect prediction experiments.
  • This work would not have been possible without the efforts of Rich Gerber and Paolo Gai (developers of the conference management system), Stefan Riezler and Yoav Goldberg (chairs of CoNLL 2016), and Min-Yen Kan and Regina Barzilay (chairs of ACL 2017), who allowed authors and reviewers to opt in to this dataset during the official review process.
  • We thank the, and teams for their commitment to promoting transparency and openness in scientific communication.