Identifying the Limits of Cross-Domain Knowledge Transfer for Pretrained Models

Toolkit for a systematic exploration of how much transfer occurs when models are denied any information about word identity via random scrambling. Codebase for Identifying the Limits of Cross-Domain Knowledge Transfer for Pretrained Models.

Contents

- Citation
- Quick start
- Model Training Options
- Other Model Training
- Making Our TensorBoard Public
- License

Citation

Zhengxuan Wu, Nelson F. Liu, Christopher Potts. 2021. Identifying the Limits of Cross-Domain Knowledge Transfer for Pretrained Models. Ms., Stanford University.

@article{wu-etal-2021-identify,
  title={Identifying the Limits of Cross-Domain Knowledge Transfer for Pretrained Models},
  author={Wu, Zhengxuan and Liu, Nelson F. and Potts, Christopher},
  journal={arXiv preprint arXiv:2104.08410},
  url={https://arxiv.org/abs/2104.08410},
  year={2021}
}

Quick start

Install Requirements

This codebase builds on HuggingFace Transformers, so you need to install Transformers and all of its dependencies before running the scripts below. In the future, we will add an auto-install script here so you will not have to worry about this (in most cases).
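
As a rough sketch (the repository does not pin exact versions, so treat this as a starting point rather than a definitive requirements list):

# core training dependencies (versions are not pinned in this repo; adjust to your environment)
pip install torch transformers datasets
# for the scrambling notebooks and the Weights & Biases logging used by --is_tensorboard
pip install jupyter wandb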

Scramble Inputs

One important experiment we ran was to fine-tune BERT on systematically scrambled English sentences. The scrambling process is documented in a notebook, which you can open by running:

cd code/
jupyter notebook

The vocab_mismatch.ipynb notebook walks you through how we generate the scrambled datasets.
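
If you prefer to run it non-interactively, you can also execute the notebook end to end with nbconvert (assuming it requires no interactive input):

jupyter nbconvert --to notebook --execute vocab_mismatch.ipynb --output vocab_mismatch_out.ipynb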

BERT Model Training

We rely largely on HuggingFace Transformers for model training to ensure reproducibility. Two main scripts handle sequence classification and sequence labeling tasks, each with a number of useful options.

Here is an example command to run sequence classification:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_sentence_classification.py \
--run_name sst-tenary \
--task_name sst3 \
--inoculation_data_path ../data-files/sst-tenary \
--model_type bert-base-uncased \
--output_dir ../sst-tenary-result/ \
--max_seq_length 128 \
--learning_rate 2e-5 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--metric_for_best_model Macro-F1 \
--greater_is_better \
--is_tensorboard \
--logging_steps 10 \
--eval_steps 10 \
--seed 42 \
--load_best_model_at_end \
--inoculation_step_sample_size 1.00 \
--num_train_epochs 3 \
--inoculation_patience_count 5 \
--save_total_limit 3
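
Since this command passes --is_tensorboard, metrics are logged to Weights & Biases (see Model Training Options below), so you may need to authenticate with W&B once before launching:

wandb login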

Here is an example command to run sequence labeling:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_token_classification.py \
--run_name conll2003 \
--task_name conll2003 \
--inoculation_data_path conll2003 \
--token_type ner_tags \
--model_type bert-base-uncased \
--output_dir ../conll2003-result/ \
--max_seq_length 128 \
--learning_rate 2e-5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 128 \
--metric_for_best_model f1 \
--greater_is_better \
--is_tensorboard \
--logging_steps 10 \
--eval_steps 10 \
--seed 42 \
--load_best_model_at_end \
--inoculation_step_sample_size 1.00 \
--num_train_epochs 3 \
--inoculation_patience_count 5 \
--save_total_limit 3 \
--n_layer_to_finetune 1

Here is an example command to run evaluation:

CUDA_VISIBLE_DEVICES=0,1,2,3 python evaluate.py \
--task_name classification \
--model_path ../saved-models/sst-tenary-result/pytorch_model.bin \
--model_type bert-base-uncased \
--cache_dir ../tmp/ \
--max_seq_length 128 \
--per_device_eval_batch_size 64 \
--data_path ../data-files/sst-tenary/sst-tenary-test.tsv

Model Training Options

As you can see, we developed our own wrapper around the HuggingFace scripts, with a number of options for setting up training:

- --inoculation_step_sample_size controls the proportion of the training data used to train the model.
- --scramble_proportion lets you study ordering effects, i.e., how the model performs when fine-tuning on scrambled datasets.
- --is_tensorboard logs results to Weights & Biases.
- --inoculation_patience_count controls the number of patience steps used during training.
- --n_layer_to_finetune controls how many layers are fine-tuned.
- --no_pretrain skips loading the pretrained BERT weights.
- --model_type and --model_path load a model from HuggingFace or from a local drive, respectively.

For the full set of options, look inside the scripts. Since HuggingFace is constantly updating its scripts, you may need to adapt the codebase. An illustrative command combining several of these flags follows.
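
For instance (the run name, output directory, and specific flag values here are illustrative placeholders, not settings from the paper):

# illustrative flag combination; see the script for exact argument semantics
CUDA_VISIBLE_DEVICES=0 python run_sentence_classification.py \
--run_name sst-tenary-scrambled \
--task_name sst3 \
--inoculation_data_path ../data-files/sst-tenary \
--model_type bert-base-uncased \
--output_dir ../sst-tenary-scrambled-result/ \
--max_seq_length 128 \
--learning_rate 2e-5 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--num_train_epochs 3 \
--seed 42 \
--inoculation_step_sample_size 0.50 \
--scramble_proportion 1.00 \
--n_layer_to_finetune 2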

Other Model Training

Beyond BERT, we also provide training code for other models.

BoW, CRFs and Random Classifiers

The BoW, CRF, and random classifiers are implemented in notebooks, since they are easy to train on a CPU: see run_bow_classifier.ipynb and run_crf_classifier.ipynb.

Pretraining BERT from Scratch with Scrambled Data

Yes, we support this as well. We ran these experiments, though they are not included in the paper. To pretrain a BERT model, you can use our wrapper:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 python run_pretrain_bert.py \
--model_type bert \
--train_file ../data_files/wikitext-15M \
--do_train \
--do_eval \
--output_dir ../256-bert-base-uncased-wikitext-15M-results/ \
--cache_dir ../.huggingface_cache/ \
--num_train_epochs 40 \
--seed 42 \
--max_seq_length 256 \
--run_name 256-mlm-bert-base-uncased-wikitext-15M-results \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 24 \
--learning_rate 2e-5 \
--evaluation_strategy steps \
--eval_steps 500 \
--logging_steps 50 \
--line_by_line \
--pad_to_max_length

You can point --train_file at a scrambled dataset file instead, and pretrain your MLMs on scrambled text!
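
For example, assuming the scrambling notebook above produced a scrambled copy of the corpus (the path below is a hypothetical placeholder), the only line that changes in the command is:

# hypothetical scrambled corpus path
--train_file ../data_files/wikitext-15M-scrambled \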

LSTM + GloVe

We use an open-source training script, LSTM-GloVe. To train an LSTM model, you can use:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--model_type LSTMSequenceLabeling \
--eval_test \
--do_lower_case \
--max_seq_length 128 \
--train_batch_size 512 \
--eval_batch_size 512 \
--learning_rate 1e-3 \
--num_train_epochs 200 \
--seed 123 \
--task_name CONLL_2003 \
--data_dir ../../pretrain-data-distribution/data-files/conll2003-corrupted-matched/ \
--vocab_file ../models/LSTM/vocab.txt \
--output_dir ../results/CONLL_2003-LSTM-scratch-matched/

Two additional options are worth noting: --embed_file lets you load pretrained embeddings for the vocabulary file, and --model_type LSTM switches the model to sequence classification tasks, as sketched below.
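
For example, to initialize the CoNLL-2003 LSTM above with pretrained GloVe vectors, you would add --embed_file (the embedding path and output directory below are hypothetical placeholders):

# hypothetical example: only --embed_file and --output_dir differ from the command above
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
--model_type LSTMSequenceLabeling \
--embed_file ../models/LSTM/glove.6B.300d.txt \
--eval_test \
--do_lower_case \
--max_seq_length 128 \
--train_batch_size 512 \
--eval_batch_size 512 \
--learning_rate 1e-3 \
--num_train_epochs 200 \
--seed 123 \
--task_name CONLL_2003 \
--data_dir ../../pretrain-data-distribution/data-files/conll2003-corrupted-matched/ \
--vocab_file ../models/LSTM/vocab.txt \
--output_dir ../results/CONLL_2003-LSTM-GloVe-matched/

For sequence classification tasks, swap in --model_type LSTM and point --data_dir at the corresponding classification dataset.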

Making Our TensorBoard Public

We are hoping to release a set of TensorBoard logs to the public, to further support reproducibility! Here is an example:

SST-3 BERT Training Results

Stay tuned; we still need to go through the logs and name them correctly.

License

This repo is released under a Creative Commons Attribution 4.0 International License.
