Ground Truth for Grammatical Error Correction Metrics

This repository contains a Python implementation of the GLEU metric (General Language Evaluation Understanding), which can be used for any monolingual "translation" task. It also contains human rankings of the CoNLL-14 Shared Task system output, as well as scripts that evaluate those rankings to extract an absolute system ranking.

These results were described in the ACL 2015 paper:

Ground Truth for Grammatical Error Correction Metrics by Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault, and Matt Post

Please cite this work when using this data or the GLEU metric.

  author    = {Napoles, Courtney  and  Sakaguchi, Keisuke  and  Post, Matt  and  Tetreault, Joel},
  title     = {Ground Truth for Grammatical Error Correction Metrics},
  booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
  month     = {July},
  year      = {2015},
  address   = {Beijing, China},
  publisher = {Association for Computational Linguistics},
  pages     = {588--593},
  url       = {}

GLEU Update

As of May 2, 2016, we have identified a problem with the GLEU metric as the number of references increases. To resolve this issue, we made a minor adjustment to the metric so that it no longer has a tunable weight and is reliable using any number of reference sets. This update to GLEU is reflected in scripts/compute_gleu and scripts/ The original GLEU scripts can be found in scripts/original_gleu/. We do not recommend using the original GLEU code. The new GLEU should be used instead.

The changes to GLEU and the updated results for our ACL 2015 paper are described in the eprint, GLEU Without Tuning. The citation for the updated metric is:

  author    = {Napoles, Courtney  and  Sakaguchi, Keisuke  and  Post, Matt  and  Tetreault, Joel},
  title     = {{GLEU} Without Tuning},
  journal   = {eprint arXiv:1605.02592 [cs.CL]},
  year      = {2016},
  url       = {}


1. Obtain the raw system output

The rankings found in the gec-ranking-data correspond to the 12 system outputs from the CoNLL-14 Shared Task on Grammatical Error Correction, which can be downloaded from

Human judgments are located in gec-ranking/data.

2. Run TrueSkill

To get the human rankings, run TrueSkill (which can be downloaded from on all_judgments.csv, following the instructions in the TrueSkill readme.

3. Calculate metric scores

GLEU is included in gec-ranking/scripts. To obtain the GLEU scores for system output, run the following command:

./compute_gleu -s source_sentences -r reference [reference ...] \
        -o system_output [system_output ...] -n 4 -l 0.0

where each file contains one sentence per line. GLEU can be run with multiple references. To get the GLEU scores of multiple outputs, include the path to each system output file. GLEU was developed using Python 2.7.
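Because every input file holds one sentence per line, a quick sanity check before scoring can catch misaligned files. The following is a minimal sketch, not part of this repository: it assumes the source, reference, and system output files are meant to be line-aligned, and the helper names (`count_lines`, `check_parallel`, `build_command`) are hypothetical. It only mirrors the flags shown in the command above.

```python
import io


def count_lines(path):
    """Return the number of lines (sentences) in a one-sentence-per-line file."""
    with io.open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source, references, outputs):
    """Raise ValueError if any file's line count differs from the source file's.

    Assumption: all files are expected to be aligned line by line.
    """
    expected = count_lines(source)
    for path in list(references) + list(outputs):
        found = count_lines(path)
        if found != expected:
            raise ValueError(
                "%s has %d lines, expected %d" % (path, found, expected)
            )
    return expected


def build_command(source, references, outputs):
    """Assemble the argument list for the compute_gleu command shown above."""
    return (
        ["./compute_gleu", "-s", source, "-r"]
        + list(references)
        + ["-o"]
        + list(outputs)
        + ["-n", "4", "-l", "0.0"]
    )
```

If the check passes, the list returned by `build_command` could be handed to `subprocess.call` from within gec-ranking/scripts; the file paths are whatever you supply.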

I-measure scores were taken from Felice and Briscoe's 2015 NAACL paper, Towards a standard evaluation method for grammatical error detection and correction. The I-measure scorer can be downloaded from

M2 scores were calculated using the official scorer (3.2) of the CoNLL-2014 Shared Task (


There was an error in the calculation of the GLEU denominator, which was corrected in the 10 March 2016 commit.

Please contact Courtney Napoles (courtneyn[at]jhu[dot]edu) or Keisuke Sakaguchi (keisuke[at]cs[dot]jhu[dot]edu) with any questions.

Last updated 10 May 2016


Data and code used in the 2015 ACL paper, "Ground Truth for Grammatical Error Correction Metrics"





