Reproducibility Challenge 2020: Replication of ELECTRA
This repository contains a reimplementation in PyTorch of ELECTRA for the Reproducibility Challenge 2020. This project was undertaken as part of the course IFT6268 Self Supervised Representation Learning at Mila / University of Montreal.
Preprocessing is embedded in, and cached by, the command lines for pretraining and the downstream tasks.
A model pretrained for 1M steps is also available via this link.
This work leverages HuggingFace libraries (Transformers, Datasets, Tokenizers) and PyTorch (1.7.1).
For more information, please refer to the associated paper.
My results are similar to those of the original ELECTRA implementation (Clark et al. [2020]), despite minor differences from the original paper in both implementations. With only 14M parameters, ELECTRA outperforms, in absolute performance, pretraining approaches from some previous SOTA, such as GPT, as well as efficient alternatives based on knowledge distillation, such as DistilBERT. Once compute cost is taken into account, ELECTRA clearly outperforms all compared approaches, including BERT and TinyBERT. This work therefore supports the claim that ELECTRA achieves a high level of performance in low-resource settings, in terms of compute cost. Furthermore, with a generator capacity larger than recommended by Clark et al. [2020], the discriminator can collapse, becoming unable to distinguish fake inputs from real ones. Thus, while ELECTRA is easier to train than a GAN (Goodfellow et al. [2014]), it appears sensitive to the capacity allocation between generator and discriminator.
More details are available on WandB.
Model | CoLA (MCC) | SST-2 (Acc) | MRPC (Acc) | STS-B (Spearman) | QQP (Acc) | MNLI (Acc) | QNLI (Acc) | RTE (Acc) | AVG | GLUE |
---|---|---|---|---|---|---|---|---|---|---|
Original | 56.8 | 88.3 | 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5 | 80.4 | |
Mine | 53.5 | 88.7 | 87.6 | 85.2 | 86.1 | 80.2 | 87.5 | 61.5 | 79.2 | 76.7 |
Model | # Params | Training time + hardware | pfs-days | AVG | GLUE | pfs-days per AVG | pfs-days per GLUE |
---|---|---|---|---|---|---|---|
GPT | 110M | 30d on 8 P600 | 0.95 | 77.9 | 75.4 | 0.17 | 0.18 |
DistilBERT | 67M | 90h on 8 V100 | 0.16 | 77.0 | | 0.21 | |
ELECTRA-Original | 14M | 4d on 1 V100 | 0.02 | 80.4 | | 0.03 | |
ELECTRA-Mine | 14M | 3.75d on 1 RTX 3090 | 0.03 | 79.2 | 76.7 | 0.05 | 0.06 |
- Preprocessing. This reimplementation caches the tokenization step and dynamically picks a random segment during training; segmentation is therefore dynamic rather than static as in the original implementation. Furthermore, this reimplementation fully handles downloading the pretraining datasets through the HuggingFace Datasets library.
- Fine-tuning. The original implementation has a discrepancy with the paper for the layer-wise learning rate decay; see GitHub.
- Task-specific data augmentation. The original implementation uses a technique called `double_unordered` to double the dataset size for MRPC and STS. This reimplementation does not use any task-specific data augmentation; see GitHub.
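The dynamic segmentation described in the preprocessing bullet amounts to slicing a random window out of a cached tokenized document at each epoch. A minimal sketch of the idea, with illustrative names rather than this repository's actual API:

```python
import random

def pick_segment(cached_token_ids, max_length=128):
    """Pick a random contiguous segment from a cached tokenized document.

    Sketch of dynamic segmentation: because the slice is re-drawn at
    training time, the same cached tokenization yields different
    segments across epochs (unlike static, precomputed segments).
    """
    if len(cached_token_ids) <= max_length:
        return cached_token_ids  # document already fits in one segment
    start = random.randrange(len(cached_token_ids) - max_length + 1)
    return cached_token_ids[start:start + max_length]
```

Because only the tokenization is cached, this costs one list slice per example while still varying the training data across epochs.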
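The `double_unordered` augmentation mentioned above simply emits each sentence pair a second time in swapped order, doubling the MRPC/STS training data. A minimal sketch (illustrative only; this reimplementation deliberately omits the augmentation):

```python
def double_unordered(pairs):
    """Duplicate each (a, b) sentence pair as (b, a).

    Sketch of the `double_unordered` idea from the original ELECTRA
    code: for order-insensitive pair tasks, both orderings are valid
    training examples, so the dataset size doubles for free.
    """
    out = []
    for a, b in pairs:
        out.append((a, b))
        out.append((b, a))
    return out
```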
For more information, please refer to the [paper (under review)](To be added later).
@misc{
mercier2021efficient,
title={Efficient transfer learning for {NLP} with {ELECTRA}},
author={Fran{\c{c}}ois MERCIER},
year={2021},
url={https://openreview.net/forum?id=Or5sv1Pj6od}
}
In this experiment, we use a maximum sequence length of 128, as in ELECTRA-Small, and the masking strategy selects 15% of the input tokens, replacing each selected token with probability 85%.
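The two probabilities map directly onto the `--mlm_probability` and `--mlm_replacement_probability` flags below. A minimal sketch of that masking strategy (function names and the mask token id are illustrative, not this repository's actual code):

```python
import random

MASK_ID = 103  # illustrative [MASK] token id (assumption, not the repo's config)

def mask_tokens(input_ids, mlm_probability=0.15,
                replacement_probability=0.85, rng=random):
    """Select each token with probability `mlm_probability` (15%);
    replace a selected token with [MASK] with probability
    `replacement_probability` (85%), otherwise keep it unchanged.
    Unselected positions get label -100, i.e. ignored by the MLM loss."""
    labels = [-100] * len(input_ids)
    masked = list(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() < mlm_probability:
            labels[i] = tok  # the generator must predict the original token
            if rng.random() < replacement_probability:
                masked[i] = MASK_ID
    return masked, labels
```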
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.25
python run_glue.py --experiment_name "electra_replication_6-25" --pretrain_path pretrained_model/checkpoint-1000000
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.125
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.5
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.75
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 1.0
python train_tokenizer.py --output_dir models
conda install pytorch=1.7.1 torchvision torchaudio -c pytorch
pip install -r requirements.txt