This repository contains the code to reproduce the results from the paper 🔗 Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data. Our main result is that metrics from the 🔗 HT-SR theory can predict the generalization of NLP models. Moreover, unlike existing generalization metrics that focus on the "generalization gap", HT-SR metrics can predict model quality directly, e.g., as measured by test-time BLEU scores when the NLP task is neural machine translation.
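For a quick feel for these metrics, here is a minimal sketch that computes HT-SR layer metrics (such as the power-law exponent alpha) with the open-source weightwatcher package; the exact analysis pipeline in this repository may differ, and the toy model below stands in for a real Transformer checkpoint.

import torch.nn as nn
import weightwatcher as ww

# Toy stand-in for a trained model; in practice, load a Transformer checkpoint.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()                    # per-layer eigenvalue (ESD) analysis
print(details[['layer_id', 'alpha']])          # alpha: fitted power-law exponent per layer
print(watcher.get_summary(details)['alpha'])   # average alpha; no training/test data needed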
We mainly study Transformers in this paper. For Transformer training, we follow 🔗 Vaswani et al. We develop our implementation based on an 🔗 online repository, which reproduces the results of Vaswani et al. with more easily configurable Transformer architectures. In addition to the HT-SR theory, we also evaluate generalization metrics from 🔗 Dziugaite et al. 2020 and 🔗 Jiang et al. 2019.
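To illustrate what "easily configurable Transformer architectures" means, here is a hedged sketch using PyTorch's built-in nn.Transformer; the hyperparameter values below are the "base" settings from Vaswani et al., not necessarily this repository's actual configuration format.

import torch.nn as nn

# Architecture knobs collected in one place; values follow the "base" model
# of Vaswani et al. and are illustrative only.
cfg = dict(d_model=512, nhead=8, num_encoder_layers=6,
           num_decoder_layers=6, dim_feedforward=2048, dropout=0.1)
model = nn.Transformer(**cfg)
print(sum(p.numel() for p in model.parameters()))  # parameter count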
Step 1. Create a conda environment.
conda env create
Activate the environment.
conda activate NLP_metrics
Step 2. Download data and pretrained results.
./download_data.sh
python create_experiment.py --CKPT_DIR <your_checkpoint_directory>
For example, on my machine, the checkpoint directory is /data/yyaoqing/Generalization_metrics_for_NLP/checkpoint/.
You can check examples of PL and E-TPL fittings in visualization/Visualize_example_WW_layers.ipynb.
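If you want to experiment with such fits outside the notebook, below is a hedged sketch of a plain PL fit on the eigenvalue spectrum (ESD) of a random weight matrix, using the powerlaw package; the notebook's fitting code, and the E-TPL fit in particular, may differ in its details.

import numpy as np
import powerlaw

# Stand-in for a trained layer's weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 2048))

# Eigenvalues of the correlation matrix X = W W^T / N (the ESD).
evals = np.linalg.eigvalsh(W @ W.T) / W.shape[1]

fit = powerlaw.Fit(evals)              # selects xmin and fits the power-law tail
print(fit.power_law.alpha, fit.xmin)   # alpha is the PL exponent used by HT-SR metrics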
Then, you can reproduce the scatter plots that compare the generalization metrics with the BLEU scores; see visualization/reproduce_scatterplot.ipynb.
You can also reproduce the box plots that rank the generalization metrics considered in the paper.
First, use the following commands to generate the time-wise correlations. The argument --bleu_type selects whether metrics are correlated with the test BLEU scores or with the generalization gap.
python time_wise_correlation.py --bleu_type test
python time_wise_correlation.py --bleu_type gap
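Under the hood, a time-wise correlation is just a rank correlation between a metric's trajectory over the checkpoints of one training run and the corresponding BLEU trajectory. A minimal sketch with hypothetical numbers:

from scipy.stats import spearmanr

# Hypothetical values at five checkpoints of a single training run.
alpha = [6.1, 5.4, 4.8, 4.2, 3.9]        # HT-SR alpha over training
bleu  = [10.2, 15.8, 19.3, 22.1, 23.5]   # test BLEU over training

rho, _ = spearmanr(alpha, bleu)
print(rho)  # -1.0 here: decreasing alpha tracks increasing BLEU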
Second, generate the correlation results obtained when a single hyperparameter is varied.
python aggregate_hyperparameter_correlation.py
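The idea behind the hyperparameter-wise correlation: group the trained models so that only one hyperparameter varies within each group, then compute the rank correlation inside each group. A hedged sketch with hypothetical column names and numbers:

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical results table: one row per trained model.
df = pd.DataFrame({
    'lr':    [1e-3, 1e-3, 1e-3, 5e-4, 5e-4, 5e-4],
    'depth': [2, 4, 6, 2, 4, 6],
    'alpha': [5.0, 4.3, 3.8, 5.5, 4.9, 4.1],
    'bleu':  [18.0, 21.0, 23.0, 16.0, 19.5, 22.0],
})

# Vary `depth` while holding everything else (here, `lr`) fixed.
for lr, group in df.groupby('lr'):
    rho, _ = spearmanr(group['alpha'], group['bleu'])
    print(lr, rho)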
Now, you should have all the results. Check visualization/calculate_rank_correlation_with_colored_groups.ipynb to see the box plots.
Fully reproducing our results requires 🔗 slurm and about 6 TB of storage.
Step 1. Generate slurm configuration files. Use scripts/generate_script.ipynb to generate the training and evaluation slurm configurations.
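For reference, here is a hedged sketch of what generating such a slurm file can look like from Python, in the spirit of that notebook; every directive, path, and flag below is hypothetical, not the repository's actual output.

# Illustrative only: all directives, paths, and flags are hypothetical.
slurm_script = """#!/bin/bash
#SBATCH --job-name=train_transformer
#SBATCH --output=slurm_logs/%j.out
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

python train.py --lr 1e-3 --num-layers 6
"""
with open('scripts/slurm_train_models.sh', 'w') as f:
    f.write(slurm_script)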
Step 2. Submit the slurm files. Remember to change the directories in the slurm file and make a slurm log folder.
mkdir slurm_logs
For training, do the following.
sbatch ./scripts/slurm_train_models.sh
For evaluation, use the following bash files.
sbatch ./scripts/slurm_eval_bleu.sh
sbatch ./scripts/slurm_compute_ww.sh
sbatch ./scripts/slurm_robust_measures.sh
Notice that we evaluate PL, E-TPL, and EXP fittings. To select the distribution, change L23-33 in the file slurm_compute_ww.sh.
Step 3. After generating all the evaluation files, you will have json and pickle files similar to those in checkpoint.zip. Then, you can draw the scatter plots and calculate the rank correlations using the following commands.
./scripts/run_plot_scatterplot.sh
./scripts/run_hyperparameter_correlation.sh
After that, you will get all the plots and rank correlation results, similar to plots.zip and results.zip.
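You can also inspect the raw evaluation artifacts programmatically; a minimal sketch, with illustrative file names (see checkpoint.zip for the actual layout):

import json
import pickle

# File names below are illustrative; check checkpoint.zip for the real layout.
with open('checkpoint/example_run/ww_metrics.json') as f:
    metrics = json.load(f)
with open('checkpoint/example_run/robust_measures.pkl', 'rb') as f:
    measures = pickle.load(f)

print(sorted(metrics)[:5], type(measures))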
If you find this repository useful for your work, please consider citing the following paper:
@TECHREPORT{yang2022evaluating,
  author = {Yang, Yaoqing and Theisen, Ryan and Hodgkinson, Liam and Gonzalez, Joseph E and Ramchandran, Kannan and Martin, Charles H and Mahoney, Michael W},
  title = {Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data},
  number = {Preprint: arXiv:2202.02842},
  year = {2022},
}
This repository is released under the MIT license.