This repo contains the code for our ICML 2022 paper *Co-training Improves Prompt-based Learning for Large Language Models*, along with updates and extensions, including tuning based on T-Few.
This code is useful for:
- boosting the zero-shot and few-shot performance of large language models
- distilling large models like GPT-3 and T0 into smaller task-specific models.
Large parts of the repo are built on top of the excellent T-Few repository.
If you find this code useful, please consider citing our paper:
```bibtex
@inproceedings{lang2022co,
  title={Co-training improves prompt-based learning for large language models},
  author={Lang, Hunter and Agrawal, Monica N and Kim, Yoon and Sontag, David},
  booktitle={International Conference on Machine Learning},
  pages={11985--12003},
  year={2022},
  organization={PMLR}
}
```
To set up the environment:

```bash
conda create -n cotrain
conda activate cotrain
pip install -r requirements.txt -f https://download.pytorch.org/whl/cu113/torch_stable.html
```
Since the publication of our ICML paper, T-Few has emerged as a better technique for fine-tuning T0 than soft prompt tuning. We have included code for co-training T0 (using T-Few) with BERT (using regular head tuning).
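Schematically, each co-training round trains one model on a confidently pseudo-labeled subset produced by the other, starting from T0's zero-shot predictions. The sketch below is only an illustration of that loop; the model wrappers, `fit`/`predict` methods, and `select_confident` helper are placeholders rather than the repo's API (the actual logic lives in `src/cotrain.py` and uses the cut statistic for data selection):

```python
# Illustrative co-training loop (placeholder API, not the repo's actual code).
def cotrain_loop(t0, bert, unlabeled, select_confident, rounds=5):
    """`t0` and `bert` are assumed to expose fit()/predict(); `select_confident`
    picks a high-quality pseudo-labeled subset, e.g. via the cut statistic."""
    pseudo = t0.predict(unlabeled)       # round 0: zero-shot T0 pseudo-labels
    for _ in range(rounds):
        subset = select_confident(unlabeled, pseudo)
        bert.fit(subset)                 # head-tune BERT on T0's pseudo-labels
        subset = select_confident(unlabeled, bert.predict(unlabeled))
        t0.fit(subset)                   # (IA)^3-tune T0 on BERT's pseudo-labels
        pseudo = t0.predict(unlabeled)
    return t0, bert
```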
Accuracy (standard deviation across seeds in parentheses):

Method/Model | RTE | CB |
---|---|---|
T0-3B (no training) | 62.1 | 51.8 |
T0-3B + co-training | 86.1 (0.6) | 78.9 (9.5) |
DeBERTa-large + co-training | 87.1 (0.3) | 79.3 (9.4) |
The median performances of co-trained T0 and BERT on CB are 82.1 and 85.7, respectively. The large standard deviation arises because 2 of 5 seeds get stuck at 67% accuracy for both models; even these low-performing seeds are close to the best performance of co-training with soft prompt tuning.
To run co-training for all seeds and datasets:
```bash
CUDA_VISIBLE_DEVICES=0 ./bin/cotrain_tfew.sh
```
Once this is finished, we can look in `dev_scores.json.<model>` to get the mean performance across seeds after 5 iterations:
```bash
cat exp_out/cotrain_ia3_rte_seed*_round5/dev_scores.json.bert | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_ia3_rte_seed*_round5/dev_scores.json.t0 | awk 'NR % 2 == 0' | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_ia3_cb_seed*_round5/dev_scores.json.bert | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_ia3_cb_seed*_round5/dev_scores.json.t0 | awk 'NR % 2 == 0' | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
```
To just do one run:
```bash
CUDA_VISIBLE_DEVICES=0 dataset=rte seed=0 python -m src.cotrain -c t03b.json+${dataset}.json+ia3.json -k exp_name=cotrain_ia3_${dataset}_seed${seed} seed=${seed} few_shot=False allow_skip_exp=True train_template_idx=0 eval_template_idx=0 bert_name=microsoft/deberta-large-mnli bert_epochs=40 eval_epoch_interval=1
```
Note: performance is sensitive to the choice of prompt for T0 (`train_template_idx` / `eval_template_idx`), since this dictates the quality of the initial pseudo-labeled data used for co-training. By default the code uses the first template.
This is the original method from our ICML paper, which used soft prompt tuning since T-Few had not yet been released.
Method/Model | RTE | CB |
---|---|---|
T0-3B + co-training | 84.8 (0.8) | 64.6 (2.4) |
DeBERTa-large + co-training | 86.4 (0.7) | 72.9 (1.3) |
To run all seeds and datasets:
```bash
CUDA_VISIBLE_DEVICES=0 ./bin/cotrain_spt.sh
```
Once this is finished, we can look in `dev_scores.json.<model>` to get the mean performance across seeds after 5 iterations:
```bash
cat exp_out/cotrain_spt_rte_seed*_round5/dev_scores.json.bert | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_spt_rte_seed*_round5/dev_scores.json.t0 | awk 'NR % 2 == 0' | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_spt_cb_seed*_round5/dev_scores.json.bert | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_spt_cb_seed*_round5/dev_scores.json.t0 | awk 'NR % 2 == 0' | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
```
To just do one run:
```bash
CUDA_VISIBLE_DEVICES=0 dataset=rte seed=0 python -m src.cotrain -c t03b.json+${dataset}.json+prompt_tuning-10_prompts.json -k exp_name=cotrain_spt_${dataset}_seed${seed} seed=${seed} few_shot=False allow_skip_exp=True train_template_idx=0 eval_template_idx=0 bert_name=microsoft/deberta-large-mnli bert_epochs=40 eval_epoch_interval=1 prompt_tuning_num_prefix_emb=20 prompt_tuning_decoder=False num_steps=30000 prompt_tuning_init_with_pad=True cotrain_load_best=True batch_size=16 grad_accum_factor=2
```
The large number of steps for soft prompt tuning here is key to obtaining good performance. Replacing SPT with T-Few thus maintains (or improves) the performance while being much more efficient due to requiring fewer steps.
To use your own dataset with T0 + BERT co-training:

- Create a dataset reader for your dataset in `src/data/dataset_readers.py`, inheriting from `BaseDatasetReader`. Your reader should set `self.templates` with appropriate templates to use with T0; `HSwagReader` is a good example to follow (see also the sketch after this list). Note: the code uses `validation` as the name of the test split because, following other work, we report test performance on the public SuperGLUE validation sets. Make sure your test split is called `validation`; the co-training code already samples a separate validation set for you.
- Add your reader to the `get_dataset_reader` function in `src/data/dataset_readers.py`.
- Add a config file in `configs/<your-dataset-name>.json`. `configs/rte.json` is a good one to copy.
- Tell BERT how to tokenize your data by adding an entry in `task_text_field_map` (`"task_name": ["<input_column_name>", "<target_column_name>"]`) for your task in `src/data/dataset_module.py`.
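As a rough illustration of the first step, a minimal reader might look like the following. This is modeled on T-Few-style readers; the `BaseDatasetReader` constructor signature, the use of promptsource's `DatasetTemplates`, and the `read_orig_dataset` override are assumptions, so follow `HSwagReader` in `src/data/dataset_readers.py` for the exact API:

```python
# Hypothetical reader sketch; check HSwagReader for the repo's actual API.
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

from src.data.dataset_readers import BaseDatasetReader


class BoolQReader(BaseDatasetReader):
    def __init__(self, config):
        # (dataset name, subset) identifying the Hugging Face dataset.
        super().__init__(config, dataset_stash=("super_glue", "boolq"))
        # Templates T0 uses to turn examples into prompts / pseudo-labels.
        self.templates = DatasetTemplates("super_glue", "boolq")

    def read_orig_dataset(self, split):
        # Remember: the test split must be exposed under the name "validation".
        return load_dataset("super_glue", "boolq", split=split)
```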
This code is useful for distilling the outputs of GPT-3 into a smaller, performant model.
Method/Model | RTE | CB | TREC |
---|---|---|---|
Label model (no cotrain) | 62.8 | 76.8 | 77.2 |
Label model + cotrain | 67.2 (1.3) | 82.1 (2.3) | 79.2 (1.8) |
DeBERTa-large + cotrain | 80.1 (4.2) | 84.6 (1.4) | 81.6 (1.6) |
These results differ from Table 1 in the paper because we replaced the more sensitive confidence-based data selection for the label model with the cut statistic computed on the BERT representations in each iteration. This selects higher-quality pseudo-labeled training data from the label model's pseudo-labels and removes the need to constrain the minimum label frequency.
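Intuitively, the cut statistic keeps the examples whose neighborhoods in representation space agree most with their pseudo-labels. The sketch below only illustrates that idea with a uniform-weight k-NN graph; the function name and parameters are hypothetical, and the repo's implementation may differ in graph construction and weighting:

```python
# Illustrative cut-statistic selection (not the repo's exact implementation).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cut_statistic_select(features, pseudo_labels, frac=0.5, k=20):
    """Keep the `frac` fraction of examples whose k-NN neighborhoods agree most
    with their pseudo-labels (lowest z-scored count of "cut" edges)."""
    pseudo_labels = np.asarray(pseudo_labels)
    n = len(pseudo_labels)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(features).kneighbors(features)
    neighbors = idx[:, 1:]                      # drop each point itself
    # Class frequencies under the null hypothesis of i.i.d. pseudo-labels.
    classes, counts = np.unique(pseudo_labels, return_counts=True)
    prior = dict(zip(classes, counts / n))
    z = np.empty(n)
    for v in range(n):
        p = prior[pseudo_labels[v]]
        cut = np.sum(pseudo_labels[neighbors[v]] != pseudo_labels[v])
        mu = k * (1 - p)                        # expected cut edges under the null
        sigma = np.sqrt(k * p * (1 - p))        # and their standard deviation
        z[v] = (cut - mu) / sigma
    return np.argsort(z)[: int(frac * n)]       # most label-consistent examples
```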
To run all seeds and datasets:
```bash
CUDA_VISIBLE_DEVICES=0 ./bin/cotrain_gpt.sh
```
Once this is finished, we can look in `dev_scores.json.<model>` to get the mean performance across seeds after 5 iterations:
```bash
cat exp_out/cotrain_gpt_rte_seed*_round5/dev_scores.json.bert | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_gpt_rte_seed*_round5/dev_scores.json.lm | awk 'NR % 2 == 0' | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_gpt_cb_seed*_round5/dev_scores.json.bert | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_gpt_cb_seed*_round5/dev_scores.json.lm | awk 'NR % 2 == 0' | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_gpt_trec_seed*_round5/dev_scores.json.bert | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
cat exp_out/cotrain_gpt_trec_seed*_round5/dev_scores.json.lm | awk 'NR % 2 == 0' | cut -d':' -f 2 | cut -d ',' -f 1 | jq -s add/length;
```
To distill GPT-3 (or another LLM) on your own dataset:

- Get GPT-3 (or other LLM) probabilities for each output token in your desired vocabulary (i.e., the set of tokens whose probabilities you want to use as features for the label model). For each input example, you should have a `num_prompts x num_tokens` matrix. Turn this into a vector with `.reshape(-1)` and add it as a new column to your Hugging Face dataset (see the sketch after this list). Note: make sure the initial verbalizer tokens are the first columns (see Figure 2 of the paper).
- Obtain the calibrate-before-use output matrix for each prompt and add it to `CBU_MAT` in `src/cotrain_gpt.py`. This should be `num_prompts x num_initial_verbalizer_tokens`; each row corresponds to the diagonal of the initial calibration matrix for that prompt.
- Add a config file for your dataset at `configs/gpt-<your-dataset-name>.json`. You can copy `gpt-trec`, but update the config with the number of prompts you used.
- Add your dataset to `get_dataset_reader` in `src/data/dataset_readers.py` (map it to `GPTReader`).
- Tell BERT how to tokenize your data by setting `task_text_field_map` for your task in `src/data/dataset_modules.py`.
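As a rough sketch of the first step, the snippet below attaches a flattened `num_prompts x num_tokens` probability matrix to each example of a Hugging Face dataset. The column name `gpt_probs` and the `query_llm` helper (a random stub here) are hypothetical; only the matrix shape, the `.reshape(-1)` flattening, and the column-ordering constraint come from the steps above:

```python
# Hypothetical sketch: adding flattened LLM token probabilities to a dataset.
import numpy as np
from datasets import load_dataset

NUM_PROMPTS, NUM_TOKENS = 3, 6   # e.g. 3 prompts, 6 output tokens in the vocabulary

def query_llm(example):
    # Stub standing in for your GPT-3 / LLM call; it should return a
    # (NUM_PROMPTS, NUM_TOKENS) matrix of output-token probabilities, with the
    # initial verbalizer tokens in the first columns.
    p = np.random.rand(NUM_PROMPTS, NUM_TOKENS)
    return p / p.sum(axis=1, keepdims=True)

def add_llm_probs(example):
    probs = query_llm(example)                 # shape (NUM_PROMPTS, NUM_TOKENS)
    example["gpt_probs"] = probs.reshape(-1)   # flatten to a single feature vector
    return example

dataset = load_dataset("trec", split="train").map(add_llm_probs)
```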