CKD - Distilling Linguistic Context for Language Model Compression

This repository is the official PyTorch implementation of "Distilling Linguistic Context for Language Model Compression" by GeondoPark, Gyeongman Kim and Eunho Yang.
Currently, it only contains code for task-specific distillation.

Setup

Create a conda environment and install the dependencies:

conda create -n ckd python=3.7
conda activate ckd
pip install -r requirement.txt

Prepare the datasets

GLUE dataset

Before running, download the GLUE data using this repository and unpack it to the directory $datas/glue.
In the $datas/glue directory, check that a folder exists for each task: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI. (We rename the CoLA folder to COLA.)
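
As a quick check after unpacking, you can list the folders from the shell (a sketch; adjust the path if your $datas directory lives elsewhere):

ls ./datas/glue
# Expected: one folder per task, e.g.
# COLA  MNLI  MRPC  QNLI  QQP  RTE  SST-2  STS-B  WNLI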

SQuAD v1.1 dataset

Before running, download the SQuAD v1.1 data from this website and put it in the directory $datas/squad.
In the $datas/squad directory, check that the files train-v1.1.json and dev-v1.1.json exist.
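
For reference, here is a minimal download sketch; it assumes the standard SQuAD-explorer URLs rather than any script shipped with this repository:

mkdir -p ./datas/squad
wget -P ./datas/squad https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget -P ./datas/squad https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json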

Prepare the pre-trained smaller BERT provided by Turc et al., 2019

Before task-specific distillation, download the pre-trained teacher and student models from this repository.
For example, download the bert-base-uncased model to use as the teacher and the BERT 6/768 model to use as the student, and unpack them to the directory $download_models.
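
After unpacking, a layout along these lines is expected. The directory names are illustrative and only need to match the --model_path, --teacher_path, and --student_path arguments used below; the file names assume a typical PyTorch (Hugging Face-style) BERT checkpoint:

ls ./download_models
# bert-base-uncased  bert-6-768
ls ./download_models/bert-base-uncased
# config.json  pytorch_model.bin  vocab.txt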

Training the teacher model

Most training hyperparameters are set by default. For task-specific hyperparameters such as the number of epochs and the learning rate, please refer to our paper and the original BERT paper.

GLUE Training

python main_glue.py --exp_name teachers_glue \
                    --do_train \
                    --model_type bert \
                    --model_path ./download_models/bert-base-uncased \
                    --do_lower_case \
                    --task_name [TASK_NAME] \
                    --data_dir ./datas/glue \
                    --per_gpu_batch_size 32 \
                    --num_train_epochs [#EPOCHS] \
                    --learning_rate [LR]
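
For example, a teacher run on RTE might look as follows. The epoch count and learning rate here are the common fine-tuning defaults from the original BERT paper (3 epochs, 2e-5), not the tuned values from our paper, and the task-name casing should match what main_glue.py expects:

python main_glue.py --exp_name teachers_glue \
                    --do_train \
                    --model_type bert \
                    --model_path ./download_models/bert-base-uncased \
                    --do_lower_case \
                    --task_name RTE \
                    --data_dir ./datas/glue \
                    --per_gpu_batch_size 32 \
                    --num_train_epochs 3 \
                    --learning_rate 2e-5

Presumably the --exp_name chosen here determines where the teacher checkpoint is written, since the distillation command below passes --teacher_path ./teachers_glue.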

Training the student model with task-specific distillation

Perform task-specific distillation with CKD.

GLUE distillation training

We include the task-specific CKD hyperparameters for the BERT 6/768 student model setting.
For other smaller BERTs or for hyperparameter tuning, comment out 'args = change_args(args)' in main_glue_distill.py.

python main_glue_distill.py --exp_name distill_student \
                    --distil_loss kl+wrdist+wrangle_window+ltrdist+ltrangle \
                    --do_lower_case \
                    --do_train \
                    --temperature [TEMPERATURE] \
                    --task_name [TASK_NAME] \
                    --matching_strategy_wr emb+skip \
                    --matching_strategy_ltr emb+skip \
                    --wrdist [COEFFICIENT FOR WRDIST] \
                    --wrangle [COEFFICIENT FOR WRANGLE] \
                    --ltrdist_w [COEFFICIENT FOR LTRDIST] \
                    --ltrangle_w [COEFFICIENT FOR LTRANGLE] \
                    --num_train_epoch [#EPOCHS] \
                    --learning_rate [#LR] \
                    --window_size 21 \
                    --distance cos \
                    --lossfunc_d kldiv \
                    --lossfunc_a l2loss \
                    --teacher_type bert \
                    --teacher_path ./teachers_glue \
                    --student_type bert_small \
                    --student_path ./download_models/bert-6-768 \
                    --logging_step 500 \
                    --eval_steps 500 \
                    --seed [SEED]
 

Citation

@inproceedings{park2021distilling,
  title={Distilling Linguistic Context for Language Model Compression},
  author={Park, Geondo and Kim, Gyeongman and Yang, Eunho},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={364--378},
  year={2021}
}

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grants (2018R1A5A1059921, 2019R1C1C1009192) and Institute of Information & Communications Technology Planning & Evaluation (IITP) grants (No. 2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence; No. 2019-0-01371, Development of brain-inspired AI with human-like intelligence; No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) funded by the Korea government (MSIT).
