Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Lite2-3Pyramid (EMNLP 2021)

This repository contains the code and data for the following paper:

Finding a Balanced Degree of Automation for Summary Evaluation

  title={Finding a Balanced Degree of Automation for Summary Evaluation},
  author={Zhang, Shiyue and Bansal, Mohit},
  booktitle={The 2021 Conference on Empirical Methods in Natural Language Processing},

Note that this repo is still work-in-progress.


  • Python 3
  • requirements.txt

Quick Start


Run the following command to get the Lite2Pyramid score for abstractive BART summaries for 100 CNN/DM examples from REALSumm.

python --unit data/REALSumm/SCUs.txt --summary data/REALSumm/summaries/abs_bart_out.summary --label data/REALSumm/labels/abs_bart_out.label --device 0

expected output: {'p2c': 0.5014365067350771, 'l2c': 0.5159722360972361, 'p3c': 0.43105412152833117, 'l3c': 0.4964144744144744, 'human': 0.48349483849483854, 'model_type': 'shiyue/roberta-large-tac08'}

To get its Lite3Pyramid score:

python --unit data/REALSumm/STUs.txt --summary data/REALSumm/summaries/abs_bart_out.summary --device 0

expected output: {'p2c': 0.4535250155726269, 'l2c': 0.48114382512911924, 'p3c': 0.38368004398714206, 'l3c': 0.45765291326320745, 'human': None, 'model': 'shiyue/roberta-large-tac08'}

Usually, "p2c" should be taken as the final score. When using the zero-shot NLI model (i.e., ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli), "l3c" should be used.

To extract STUs:

python --extract_stus --reference data/REALSumm/references.txt --doc_id data/REALSumm/ids.txt --output_dir data/REALSumm --use_coref --device 0

Then, the extracted STUs for REALSumm references will be saved in "data/REALSumm/STUs.txt". Besides, two intermediate files (ref_srls.pkl and ref_corefs.pkl) will also be saved under "data/REALSumm".

To get its Lite2.5Pyramid score (using the regressor trained on TAC2008):

python --unit data/REALSumm/STUs_SCUs_percentage50.txt --summary data/REALSumm/summaries/abs_bart_out.summary --device 0

expected output: {'p2c': 0.47894588174027, 'l2c': 0.4990391414141414, 'p3c': 0.40794726118628005, 'l3c': 0.478387445887446, 'human': None, 'model': 'shiyue/roberta-large-tac08'}

To mix STUs and SCUs, SCUs need to be first obtained. Then, the percentage of STUs need to be specified, e.g., if it is 50, then there will be 50% STUs + 50% SCUs. Besides, we also need to specify the regressor used for mixing STUs and SCUs.

python --mix_stus_and_scus --stu_percentage 50 --scus_file data/REALSumm/SCUs.txt --regressor regressors/TAC08/all_xgb.json --reference data/REALSumm/references.txt --doc_id data/REALSumm/ids.txt --output_dir data/REALSumm --use_coref --device 0

If you have intermediate SRL and Coreference results saved, you can use the following command to save time.

python --mix_stus_and_scus --stu_percentage 50 --scus_file data/REALSumm/SCUs.txt --regressor regressors/TAC08/all_xgb.json --reference data/REALSumm/references.txt --srl_file data/REALSumm/ref_srls.pkl --coref_file data/REALSumm/ref_corefs.pkl --doc_id data/REALSumm/ids.txt --output_dir data/REALSumm --use_coref --device 0


Similar to REALSumm, run the following command to get the Lite2Pyramid score for abstractive T5-large summaries for 100 XSum examples.

python --unit data/PyrXSum/SCUs.txt --summary data/PyrXSum/summaries/t5-large.summary --label data/PyrXSum/labels/t5-large.label --device 0

expected output: {'p2c': 0.354199199620129, 'l2c': 0.37358008658008657, 'p3c': 0.27956780085637617, 'l3c': 0.32769913419913416, 'human': 0.29117532467532464, 'model': 'shiyue/roberta-large-tac08'}

When set "--unit data/PyrXSum/STUs.txt", it will give the Lite3Pyramid score.

Similar to REALSumm (except, do not use --use_coref), To extract STUs:

python --extract_stus --reference data/PyrXSum/references.txt --doc_id data/PyrXSum/ids.txt --output_dir data/PyrXSum/ --device 0

One intermediate file (ref_srls.pkl) will also be saved under "data/PyrXSum".

To mix STUs and SCUs,

python --mix_stus_and_scus --stu_percentage 50 --scus_file data/PyrXSum/SCUs.txt --regressor regressors/TAC08/all_xgb.json --reference data/PyrXSum/references.txt --srl_file data/PyrXSum/ref_srls.pkl --doc_id data/PyrXSum/ids.txt --output_dir data/PyrXSum/ --device 0

Pretrained NLI Models

We provide the zero-shot NLI model and 4 other NLI models finetuned on the 4 meta-evaluation sets respectively.

We suggest using X-finetuned NLI model on X dataset. When evaluating on a new dataset, we suggest using TAC08-finetuned model (i.e., shiyue/roberta-large-tac08) by default.

pretrained or finutuned on Huggingface hub name
SNLI+MNLI+FEVER+ANLI ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli
TAC2008 shiyue/roberta-large-tac08
TAC2009 shiyue/roberta-large-tac09
REALSumm shiyue/roberta-large-realsumm
PyrXSum shiyue/roberta-large-pyrxsum


See reproduce


  • reproduction code for cross-validation experiments on TAC08/09/PyrXSum
  • reproduction code for out-of-the-box experiments
  • provide version control via pypi package
  • provide support through sacrerouge


No description, website, or topics provided.






No releases published


No packages published