Reproducibility Challenge 2020: Replication of ELECTRA
This repository contains a reimplementation in PyTorch of ELECTRA for the Reproducibility Challenge 2020. This project was undertaken as part of the course IFT6268 Self Supervised Representation Learning at Mila / University of Montreal.
Preprocessing is embedded in, and cached by, the command lines for pretraining and the downstream tasks.
A model pretrained for 1M steps is also available via this link.
This work leverages HuggingFace libraries (Transformers, Datasets, Tokenizers) and PyTorch (1.7.1).
For more information, please refer to the associated paper.
My results are similar to those of the original ELECTRA implementation (Clark et al. [2020]), despite minor differences from the original paper in both implementations. With only 14M parameters, ELECTRA outperforms, in absolute performance, pretraining approaches from some previous SOTA, such as GPT, as well as efficient alternatives based on knowledge distillation, such as DistilBERT. Once compute cost is taken into account, ELECTRA clearly outperforms all compared approaches, including BERT and TinyBERT. This work therefore supports the claim that ELECTRA achieves a high level of performance in low-resource settings, in terms of compute cost. Furthermore, with a generator capacity larger than recommended by Clark et al. [2020], the discriminator can collapse, becoming unable to distinguish fake inputs from real ones. Thus, while ELECTRA is easier to train than a GAN (Goodfellow et al. [2014]), it appears sensitive to the capacity allocation between generator and discriminator.
More details are available on WandB.
Model | CoLA (MCC) | SST-2 (Acc) | MRPC (Acc) | STS-B (Spearman) | QQP (Acc) | MNLI (Acc) | QNLI (Acc) | RTE (Acc) | AVG | GLUE |
---|---|---|---|---|---|---|---|---|---|---|
Original | 56.8 | 88.3 | 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5 | 80.4 | |
Mine | 53.5 | 88.7 | 87.6 | 85.2 | 86.1 | 80.2 | 87.5 | 61.5 | 79.2 | 76.7 |
Model | # Params | Training time + hardware | pfs-days | AVG | GLUE | pfs-days per AVG | pfs-days per GLUE |
---|---|---|---|---|---|---|---|
GPT | 110M | 30d on 8 P600 | 0.95 | 77.9 | 75.4 | 0.17 | 0.18 |
DistilBERT | 67M | 90h on 8 V100 | 0.16 | 77.0 | | 0.21 | |
ELECTRA-Original | 14M | 4d on 1 V100 | 0.02 | 80.4 | | 0.03 | |
ELECTRA-Mine | 14M | 3.75d on 1 RTX 3090 | 0.03 | 79.2 | 76.7 | 0.05 | 0.06 |
- Preprocessing. This reimplementation caches the tokenization step and dynamically picks a random segment during training; segmentation is therefore dynamic rather than static as in the original implementation. Furthermore, this reimplementation fully handles downloading the pretraining datasets through the HuggingFace Datasets library.
- Fine-tuning. The original implementation has a discrepancy with the paper for the layer-wise learning rate decay; see GitHub.
- Task-specific data augmentation. The original implementation uses a technique called `double_unordered` to double the dataset size for MRPC and STS. This reimplementation does not use any task-specific data augmentation; see GitHub.
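The dynamic segmentation described in the preprocessing bullet amounts to slicing a random window out of a cached tokenized document at each epoch. A minimal sketch of the idea, with illustrative names rather than this repository's actual API:

```python
import random

def pick_segment(cached_token_ids, max_length=128):
    """Pick a random contiguous segment from a cached tokenized document.

    Sketch of dynamic segmentation: because the slice is re-drawn at
    training time, the same cached tokenization yields different
    segments across epochs (unlike static, precomputed segments).
    """
    if len(cached_token_ids) <= max_length:
        return cached_token_ids  # document already fits in one segment
    start = random.randrange(len(cached_token_ids) - max_length + 1)
    return cached_token_ids[start:start + max_length]
```

Because only the tokenization is cached, this costs one list slice per example while still varying the training data across epochs.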
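The `double_unordered` augmentation mentioned above simply emits each sentence pair a second time in swapped order, doubling the MRPC/STS training data. A minimal sketch (illustrative only; this reimplementation deliberately omits the augmentation):

```python
def double_unordered(pairs):
    """Duplicate each (a, b) sentence pair as (b, a).

    Sketch of the `double_unordered` idea from the original ELECTRA
    code: for order-insensitive pair tasks, both orderings are valid
    training examples, so the dataset size doubles for free.
    """
    out = []
    for a, b in pairs:
        out.append((a, b))
        out.append((b, a))
    return out
```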
For more information, please refer to the [paper (under review)](To be added later).
@misc{
mercier2021efficient,
title={Efficient transfer learning for {NLP} with {ELECTRA}},
author={Fran{\c{c}}ois MERCIER},
year={2021},
url={https://openreview.net/forum?id=Or5sv1Pj6od}
}
In this experiment, we use a maximum sequence length of 128, as in ELECTRA-Small, and the masking strategy selects 15% of the input tokens, replacing each selected token with probability 85%.
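The two probabilities map directly onto the `--mlm_probability` and `--mlm_replacement_probability` flags below. A minimal sketch of that masking strategy (function names and the mask token id are illustrative, not this repository's actual code):

```python
import random

MASK_ID = 103  # illustrative [MASK] token id (assumption, not the repo's config)

def mask_tokens(input_ids, mlm_probability=0.15,
                replacement_probability=0.85, rng=random):
    """Select each token with probability `mlm_probability` (15%);
    replace a selected token with [MASK] with probability
    `replacement_probability` (85%), otherwise keep it unchanged.
    Unselected positions get label -100, i.e. ignored by the MLM loss."""
    labels = [-100] * len(input_ids)
    masked = list(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() < mlm_probability:
            labels[i] = tok  # the generator must predict the original token
            if rng.random() < replacement_probability:
                masked[i] = MASK_ID
    return masked, labels
```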
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.25
python run_glue.py --experiment_name "electra_replication_6-25" --pretrain_path pretrained_model/checkpoint-1000000
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.125
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.5
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 0.75
python run_pretraining.py --mlm_probability 0.15 --mlm_replacement_probability 0.85 --max_length 128 --per_device_train_batch_size 128 --gradient_accumulation_steps 1 --logging_steps 3840 --eval_steps 12800 --save_steps 1280000 --experiment_name pretraining_OWT --generator_layer_size 1.0 --generator_size 1.0
python train_tokenizer.py --output_dir models
conda install pytorch=1.7.1 torchvision torchaudio -c pytorch
pip install -r requirements.txt