On the Summarization and Evaluation of Long Documents

This repository contains the code for my Imperial College London Master's thesis on text summarization. This file documents the steps required to finetune a model, generate summaries, and then run evaluation on these predictions.

Prerequisites

Environment

Python >= 3.6, a GPU with >= 12GB of memory; tested on Linux only.

  1. Create and activate a new virtual environment
  2. cd to the root of this directory
  3. Run the following command to install packages: sh install_packages.sh

Data

  1. CNN/DailyMail: Download instructions here
  2. PubMed & arXiv: Download instructions here
  3. Quora Question Pairs: Download instructions here
  4. Annotated CNN/DailyMail dataset: Download instructions here

Check install has worked correctly

A demo script is provided in example. cd into this directory and run sh run_example.sh to run this pipeline. Instructions are included within the script.

Evaluation Analysis

Eval metrics

In this project we test the relative merits of a set of evaluation metrics. These will hereafter be referred to as the metrics.
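
As a concrete illustration, the sketch below scores a single candidate summary with one of these metrics (ROUGE-1 F). It assumes the rouge_score package is available; the repo's own metric wrappers (used by scripts/benchmarking/benchmark.py) may differ.

    # Minimal sketch: score one candidate summary against its reference with
    # ROUGE-1 F. Assumes the rouge_score package; not the repo's exact wrapper.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    reference = "The cat sat on the mat."
    candidate = "A cat was sitting on the mat."

    scores = scorer.score(reference, candidate)
    print(scores["rouge1"].fmeasure)  # ROUGE-1 F-score in [0, 1]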

Human-Metric Correlations

The relevant file is scripts/benchmarking/get_corrs.py, which is based heavily on this file. The objective is to compute the correlation between human judgement and the metrics' scores, using 2.5K human judgements across 500 CNN/DailyMail summaries. The data for this task is in scripts/benchmarking/learned_eval/data. Example usage:

  • cd scripts/benchmarking/
  • python get_corrs.py -m ROUGE-1-F

The results in section 6.1.2 can be replicated by performing this for each evaluation metric.
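
For reference, the correlation computation itself reduces to something like the following sketch using scipy; the toy arrays stand in for the per-summary human judgements and metric scores and are not the project's data.

    # Sketch of the human-metric correlation step: given paired human
    # judgements and metric scores for the same summaries, report Pearson,
    # Spearman and Kendall correlations. The values below are placeholders.
    from scipy.stats import kendalltau, pearsonr, spearmanr

    human_scores = [3.0, 4.5, 2.0, 5.0, 1.5]        # human judgement per summary
    metric_scores = [0.41, 0.55, 0.30, 0.62, 0.28]  # e.g. ROUGE-1-F per summary

    print("pearson ", pearsonr(human_scores, metric_scores)[0])
    print("spearman", spearmanr(human_scores, metric_scores)[0])
    print("kendall ", kendalltau(human_scores, metric_scores)[0])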

Quora Question Pairs Analysis

The relevant file is scripts/benchmarking/qqp_corrs.py, which is also based heavily on this file. The objective is to distinguish whether two sentences are semantically equivalent or not. The path to the data can be set in scripts/benchmarking/resources.py. Example usage:

  • cd scripts/benchmarking/
  • python qqp_corrs.py
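
One way to picture this analysis: score each question pair with a metric and check how well that score separates duplicate from non-duplicate pairs, for example via ROC AUC. The sketch below is illustrative only; the scoring function, data layout and separation measure are assumptions rather than the exact implementation in qqp_corrs.py.

    # Illustrative sketch: does a metric's score separate semantically
    # equivalent QQP pairs from non-equivalent ones? ROC AUC is used as the
    # separation measure; rouge1_f stands in for any of the eval metrics.
    from rouge_score import rouge_scorer
    from sklearn.metrics import roc_auc_score

    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    def rouge1_f(q1, q2):
        return scorer.score(q1, q2)["rouge1"].fmeasure

    pairs = [  # (question 1, question 2, is_duplicate)
        ("How do I learn Python?", "What is the best way to learn Python?", 1),
        ("How do I learn Python?", "How old is the Eiffel Tower?", 0),
    ]

    labels = [dup for _, _, dup in pairs]
    scores = [rouge1_f(q1, q2) for q1, q2, _ in pairs]
    print("ROC AUC:", roc_auc_score(labels, scores))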

Adversarial Analysis

The relevant file is scripts/benchmarking/adversarial.py. The objective is to test which of the eval metrics are most sensitive to artificial corruption when applied to summaries. We performed this analysis using summaries generated by PEGASUS, as these have been shown to be of human quality. This process involves 2 steps:

  1. Corrupt the summaries using 3 different sources of noise (a simplified sketch of these corruptions follows the step 1) instructions below):
  • Random word dropping
  • Random word permutation
  • BERT mask-filling (mask random words, then use BERT to in-fill them)
  2. Perform evaluation on PEGASUS's original and corrupted summaries, using the (human-produced) target summaries as the ground truth.

To perform step 1):

  • cd scripts/benchmarking/
  • Open adversarial.py and set the SUMMS_PATH (the path of the summaries to corrupt) and OUT_DIR (where the corrupted summaries should be saved)
  • Run python adversarial.py
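
For intuition, the three corruption types amount to something like the sketch below. The probabilities, the number of swaps and the BERT checkpoint are illustrative choices, not the settings used in adversarial.py.

    # Simplified sketches of the three corruption types applied in step 1).
    import random

    from transformers import pipeline

    def drop_words(summary, p=0.1):
        # Randomly drop a fraction p of the words.
        words = summary.split()
        return " ".join(w for w in words if random.random() > p)

    def permute_words(summary, n_swaps=3):
        # Randomly swap pairs of word positions.
        words = summary.split()
        if len(words) < 2:
            return summary
        for _ in range(n_swaps):
            i, j = random.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
        return " ".join(words)

    filler = pipeline("fill-mask", model="bert-base-uncased")

    def mask_fill(summary, p=0.1):
        # Mask a fraction p of the words, then let BERT in-fill them one at a time.
        words = summary.split()
        for i in range(len(words)):
            if random.random() < p:
                masked = " ".join(words[:i] + ["[MASK]"] + words[i + 1:])
                words[i] = filler(masked)[0]["token_str"]
        return " ".join(words)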

To perform step 2):

  • Follow the instructions in Step 3 (Running evaluation) below to score each summary.
  • Use the AdversarialAnalyzer class in adversarial.py to compute the accuracy of each metric, as sketched below. You will have to manually add the paths to the original and corrupted summaries in this class (lines 145-158).
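
The accuracy can be read roughly as follows (an assumed definition for illustration; AdversarialAnalyzer's exact logic may differ): a metric is counted as correct for a summary if it scores the original at least as highly as the corrupted version, with the human-written target as the reference.

    # Sketch of the sensitivity check: count how often a metric prefers the
    # original summary over its corrupted counterpart. rouge1_f is a stand-in
    # for whichever metric is being benchmarked.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    def rouge1_f(reference, candidate):
        return scorer.score(reference, candidate)["rouge1"].fmeasure

    def adversarial_accuracy(metric_fn, targets, originals, corrupted):
        correct = sum(
            metric_fn(t, o) >= metric_fn(t, c)
            for t, o, c in zip(targets, originals, corrupted)
        )
        return correct / len(targets)

    # Toy example:
    targets = ["the cat sat on the mat"]
    originals = ["a cat sat on the mat"]
    corrupted = ["a mat on sat the cat quickly"]
    print(adversarial_accuracy(rouge1_f, targets, originals, corrupted))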

Architectures Analysis

Replicating LED Results

This is a multi-stage pipeline, as each step is GPU-intensive. The steps in the pipeline are as follows:

  1. Finetune a model
  2. Generate the test set predictions (summaries)
  3. Evaluate the test summaries

A demo script is provided in example. cd into this directory and run sh run_example.sh to run this pipeline.

Step 1. Finetune the LED

The main file is scripts/models/finetune.py. This is based heavily on the Huggingface script, which is a useful resource if you have any difficulties here. There are a number of shell scripts in sh_scripts/ which are configured for many of the tasks you will be interested in with this repo. To finetune the LED and replicate the results from the thesis, use the following steps:

  1. cd to scripts/models # TODO: check path
  2. Open sh_scripts/finetune_led.sh
  3. Change DATA_DIR and OUT_DIR to point to where the data is stored and where you want results to be saved to
  4. Run sh_scripts/finetune_led.sh

This script takes approximately 40 hours to run per epoch. This is because we used batch size = 1 to fit onto a 12GB GPU; if you have larger hardware this can be sped up considerably. The best version of the model will be saved in $OUT_DIR/best_tfmr, and this will be used to generate the test set summaries in the next step.

Step 2. Generate test set summaries

Here you also use scripts/models/finetune.py. This step should be run after finetuning, as it requires a saved model in $OUT_DIR/best_tfmr.

  1. cd to scripts/models # TODO: check path
  2. Open sh_scripts/predict.sh
  3. Change DATA_DIR and OUT_DIR to point to where the data is stored and where you want results to be saved to. $OUT_DIR should point to the location of the saved model you want to generate predictions using
  4. Run sh_scripts/predict.sh

This will generate 2 files within $OUT_DIR: test_generations.txt and test_targets.txt. These contain the generated summaries for the test set and the target summaries respectively, and will be used in the evaluation stage.
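
Under the hood this is a standard Huggingface generation step; a minimal sketch of producing a single summary from the saved checkpoint is shown below. The generation settings are illustrative only, and finetune.py itself handles batching, truncation and beam settings.

    # Minimal sketch: generate one summary from the checkpoint saved in
    # $OUT_DIR/best_tfmr. Settings here are illustrative, not the repo's.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    ckpt = "path/to/OUT_DIR/best_tfmr"  # the directory saved in Step 1
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt).eval()

    document = "..."  # a long source document from the test set
    inputs = tokenizer(document, return_tensors="pt", truncation=True)

    # For the LED, global attention is typically placed on the first token.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
            num_beams=4,
            max_length=256,
        )

    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))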

Step 3. Running evaluation

Here you use scripts/benchmarking/benchmark.py. The purpose of this step is to generate evaluation scores using each of the eval metrics.

Process:

  1. cd to scripts/benchmarking
  2. Open sh_scripts/run_eval.sh
  3. Change SUMS_DIR and OUTDIR appropriately. OUTDIR is where the eval output will be saved. SUMS_DIR is the directory containing the generated summaries along with the targets; these should be named according to Step 2 above.
  4. Run sh_scripts/run_eval.sh

This will save the output to $OUTDIR/eval_output_raw.txt and $OUTDIR/eval_output.txt. The first contains the scores for each metric per summary; the second contains only the means and standard deviations. These are saved as a single line per directory; if you perform eval for a different model, its scores will be appended on a new line.
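
As a rough sketch of the aggregation behind eval_output.txt (the exact layout of the raw file is not assumed here, only the reduction to a mean and standard deviation per metric):

    # Per-summary scores for one metric reduced to a mean and standard
    # deviation, as reported in eval_output.txt. Toy values only.
    import statistics

    rouge1_f_per_summary = [0.41, 0.38, 0.47, 0.52, 0.35]

    mean = statistics.mean(rouge1_f_per_summary)
    std = statistics.stdev(rouge1_f_per_summary)  # sample standard deviation
    print(f"ROUGE-1-F: {mean:.4f} +/- {std:.4f}")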
