<a href="https://colab.research.google.com/github/finardi/tutos/blob/master/Electra_FromScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Introduction

At ICLR 2020, [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB) introduced a new method for self-supervised language representation learning. ELECTRA is another member of the family of Transformer pre-training methods, whose earlier members, such as BERT, GPT-2, and RoBERTa, have achieved many state-of-the-art results on Natural Language Processing benchmarks.

Unlike other masked language modeling methods, ELECTRA uses a more sample-efficient pre-training task called replaced token detection. At a small scale, ELECTRA-small can be trained on a single GPU for 4 days to outperform [GPT (Radford et al., 2018)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) (trained using 30x more compute) on the GLUE benchmark. At a large scale, ELECTRA-large outperforms ALBERT (Lan et al., 2019) on GLUE and sets a new state of the art for SQuAD 2.0.

![](https://github.com/chriskhanhtran/spanish-bert/blob/master/img/electra-performance.JPG?raw=true)
*ELECTRA consistently outperforms masked language model pre-training approaches.*





## 2. Method

Masked language modeling pre-training methods such as [BERT (Devlin et al., 2019)](https://arxiv.org/abs/1810.04805) corrupt the input by replacing some tokens (typically 15% of the input) with `[MASK]` and then train a model to reconstruct the original tokens.

Instead of masking, ELECTRA corrupts the input by replacing some tokens with samples from the output of a small masked language model (the generator). A discriminative model is then trained to predict whether each token is an original or a replacement. After pre-training, the generator is thrown out and the discriminator is fine-tuned on downstream tasks.

![](https://github.com/chriskhanhtran/spanish-bert/blob/master/img/electra-overview.JPG?raw=true)
*An overview of ELECTRA.*

Although ELECTRA has a generator and a discriminator like a GAN, it is not adversarial: the generator that produces the corrupted tokens is trained with maximum likelihood rather than being trained to fool the discriminator.
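The corruption and labeling step described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the authors' implementation: the hypothetical `corrupt_tokens` function stands in for the generator by picking uniformly from a made-up `candidates` list, and `replace_prob` and `seed` are illustrative parameters. Following the paper's convention, a sampled token that happens to equal the original is labeled as original.

```python
import random


def corrupt_tokens(tokens, candidates, replace_prob=0.15, seed=0):
    """Toy replaced-token-detection data: returns (corrupted, labels)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < replace_prob:
            # Stand-in for sampling from the generator's output distribution.
            sampled = rng.choice(candidates)
            corrupted.append(sampled)
            # A sample identical to the original token counts as "original" (0).
            labels.append(int(sampled != token))
        else:
            corrupted.append(token)
            labels.append(0)
    return corrupted, labels


tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = corrupt_tokens(tokens, candidates=["ate", "the", "a"], replace_prob=0.5)
print(corrupted)  # input seen by the discriminator
print(labels)     # 1 = replaced, 0 = original; one label per token
```

The discriminator is trained on `corrupted` as input and `labels` as targets, so every position carries a training signal.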

**Why is ELECTRA so efficient?**

With its new training objective, ELECTRA achieves performance comparable to strong models such as [RoBERTa (Liu et al., 2019)](https://arxiv.org/abs/1907.11692), which has more parameters and needs 4x more compute for training. The paper includes an analysis of what really contributes to ELECTRA's efficiency. The key findings are:

- ELECTRA benefits greatly from having a loss defined over all input tokens rather than just a subset. More specifically, ELECTRA's discriminator makes a prediction for every token in the input, whereas BERT only predicts the 15% of input tokens that were masked.
- BERT's performance is slightly harmed because the model sees `[MASK]` tokens during pre-training but never during fine-tuning.
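To make the first finding concrete, the sketch below counts how many positions per sequence contribute to the loss under each objective. The function name `supervised_positions` is hypothetical, and the sequence length of 128 is just an example value.

```python
def supervised_positions(seq_len, masked_fraction=0.15):
    """Return (masked LM positions, replaced-token-detection positions)."""
    # Masked language modeling only predicts the masked subset of tokens.
    masked_lm = round(seq_len * masked_fraction)
    # Replaced token detection makes a prediction for every input token.
    replaced_token_detection = seq_len
    return masked_lm, replaced_token_detection


mlm, rtd = supervised_positions(128)
print(mlm, rtd)  # 19 128
```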

![](https://github.com/chriskhanhtran/spanish-bert/blob/master/img/electra-vs-bert.JPG?raw=true)
*ELECTRA vs. BERT*

## 3. Pre-train ELECTRA

In this section, we will train ELECTRA from scratch with TensorFlow using scripts provided by ELECTRA's authors in [google-research/electra](https://github.com/google-research/electra). Then we will convert the model to PyTorch's checkpoint, which can be easily fine-tuned on downstream tasks using Hugging Face's `transformers` library.

### Data

We will pre-train ELECTRA on a Spanish movie subtitle dataset retrieved from OpenSubtitles. The full dataset is 5.4 GB; for demonstration purposes we will train on a small subset of about 30 MB.

### Setup

In [1]:
!pip install -q tensorflow==1.15
!pip install -q transformers==2.8.0
!git clone https://github.com/google-research/electra.git

  Building wheel for gast (setup.py) ... done
ERROR: tensorflow-probability 0.11.0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.
  Building wheel for sacremoses (setup.py) ... done
Cloning into 'electra'...
remote: Enumerating objects: 104, done.
remote: Total 104 (delta 0),

In [2]:
import os
import json
from tokenizers import BertWordPieceTokenizer
from tokenizers.normalizers import Lowercase, NFKC, Sequence

In [3]:
# %%time
# tokenizer = BertWordPieceTokenizer(
#     clean_text=True,
#     handle_chinese_chars=False,
#     strip_accents=True,
#     lowercase=False,
# )

# tokenizer.normalizer = Sequence([NFKC()])

# path = ['/content/drive/My Drive/Colab Notebooks/from_scratch/artigo2020/Data e Dataprep/wikidump_clean_shuffle_merge.txt']

# tokenizer.train(files=path, 
#                 vocab_size=30000,
#                 min_frequency=2, 
#                 show_progress=True,
#                 special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
#                 limit_alphabet=2000,
#                 wordpieces_prefix="##",
#                 )

# tokenizer.save('/content/drive/My Drive/Colab Notebooks/from_scratch/')

CPU times: user 11min 9s, sys: 8.3 s, total: 11min 17s
Wall time: 11min 18s


In [3]:
# Spaces are backslash-escaped because this value is interpolated into shell commands below.
DATA_DIR = r"/content/drive/My\ Drive/Colab\ Notebooks/from_scratch"
MODEL_NAME = "electra_MALS"

Before building the pre-training dataset, we should make sure the corpus has the following format:
- each line is a sentence
- a blank line separates two documents
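A quick way to check a corpus against this format is to split it into documents and sentences. The helper below is a minimal sketch with a hypothetical name (`split_documents`); it is not part of the ELECTRA scripts.

```python
def split_documents(text):
    """Split raw text into documents (lists of sentences) at blank lines."""
    documents, current = [], []
    for line in text.splitlines():
        sentence = line.strip()
        if sentence:
            # Each non-empty line is one sentence of the current document.
            current.append(sentence)
        elif current:
            # A blank line closes the current document.
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


sample = "First sentence.\nSecond sentence.\n\nA new document starts here.\n"
print(split_documents(sample))
```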

### Build Pretraining Dataset

We tokenize the corpus with the WordPiece vocabulary stored at `$DATA_DIR/vocab.txt`, which is passed to the script through `--vocab-file`.

We use `build_pretraining_dataset.py` to create a pre-training dataset from a dump of raw text.

In [5]:
# !python3 electra/build_pretraining_dataset.py \
#   --corpus-dir $DATA_DIR \
#   --vocab-file $DATA_DIR/vocab.txt \
#   --output-dir $DATA_DIR/pretrain_tfrecords \
#   --max-seq-length 128 \
#   --blanks-separate-docs False \
#   --no-lower-case \
#   --num-processes 5

Job 0: Creating example writer
Job 1: Creating example writer
Job 2: Creating example writer
Job 3: Creating example writer
Job 4: Creating example writer
Job 3: Writing tf examples
Job 4: Writing tf examples
Job 0: Writing tf examples
Job 3: Done!
Job 0: Done!
Job 1: Writing tf examples
Job 1: Done!
Job 2: Writing tf examples
Job 4: Done!
Job 2: Done!


### Start Training

We use `run_pretraining.py` to pre-train an ELECTRA model.

To train a small ELECTRA model for 1 million steps, run:

```
python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small
```

This takes slightly over 4 days on a Tesla V100 GPU. However, the model should achieve decent results after 200k steps (10 hours of training on the V100 GPU).

To customize the training, create a `.json` file containing the hyperparameters. Please refer to [`configure_pretraining.py`](https://github.com/google-research/electra/blob/master/configure_pretraining.py) for the default values of all hyperparameters.

Below, we set the hyperparameters to train the model for only 10,000 steps, saving a checkpoint every 100 steps.

In [9]:
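hparams = {
    # Real booleans: json.dump writes them as JSON true/false. A string such
    # as "false" is non-empty, so it would count as true in a Python `if`.
    "do_train": True,
    "do_eval": False,
    "model_size": "small",
    "do_lower_case": False,
    "vocab_size": 30000,
    "num_train_steps": 10_000,
    "save_checkpoints_steps": 100,
    "train_batch_size": 32,
}
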
with open("/content/drive/My Drive/Colab Notebooks/from_scratch/hparams.json", "w") as f:
    json.dump(hparams, f)
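As a rough sanity check on these settings, the sketch below estimates how much data the run will see. The function `training_budget` is a hypothetical helper, the sequence length of 128 is taken from the `--max-seq-length` value used when building the dataset, and the token count is an upper bound because shorter sequences are padded.

```python
def training_budget(num_train_steps, train_batch_size, max_seq_length, save_checkpoints_steps):
    """Estimate the data volume and number of save intervals for a run."""
    sequences = num_train_steps * train_batch_size
    return {
        "sequences": sequences,
        # Upper bound: every sequence is assumed to be completely filled.
        "max_tokens": sequences * max_seq_length,
        "save_intervals": num_train_steps // save_checkpoints_steps,
    }


budget = training_budget(
    num_train_steps=10_000,
    train_batch_size=32,
    max_seq_length=128,
    save_checkpoints_steps=100,
)
print(budget)
# {'sequences': 320000, 'max_tokens': 40960000, 'save_intervals': 100}
```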

Let's start training:

In [10]:
!python3 electra/run_pretraining.py \
  --data-dir $DATA_DIR \
  --model-name $MODEL_NAME \
  --hparams "/content/drive/My Drive/Colab Notebooks/from_scratch/hparams.json"

Streaming output truncated to the last 5000 lines.
5138/10000 = 51.4%, SPS: 3.8, ELAP: 18:00, ETA: 21:09 - loss: 24.4225
5139/10000 = 51.4%, SPS: 3.8, ELAP: 18:01, ETA: 21:09 - loss: 24.6090
5140/10000 = 51.4%, SPS: 3.8, ELAP: 18:01, ETA: 21:09 - loss: 24.0571
5141/10000 = 51.4%, SPS: 3.8, ELAP: 18:01, ETA: 21:08 - loss: 24.0268
5142/10000 = 51.4%, SPS: 3.8, ELAP: 18:01, ETA: 21:08 - loss: 25.5911
5143/10000 = 51.4%, SPS: 3.8, ELAP: 18:02, ETA: 21:08 - loss: 24.7074
5144/10000 = 51.4%, SPS: 3.8, ELAP: 18:02, ETA: 21:08 - loss: 24.1777
5145/10000 = 51.5%, SPS: 3.8, ELAP: 18:02, ETA: 21:07 - loss: 24.9475
5146/10000 = 51.5%, SPS: 3.8, ELAP: 18:02, ETA: 21:07 - loss: 23.5475
5147/10000 = 51.5%, SPS: 3.8, ELAP: 18:03, ETA: 21:07 - loss: 24.6130
5148/10000 = 51.5%, SPS: 3.8, ELAP: 18:03, ETA: 21:07 - loss: 24.6527
5149/10000 = 51.5%, SPS: 3.8, ELAP: 18:03, ETA: 21:06 - loss: 24.6164
5150/10000 = 51.5%, SPS: 3.8, ELAP: 18:03, ETA: 21:06 - loss: 24.7649
5151/10000 = 51.5
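The progress lines above follow a regular pattern, so the loss can be extracted for plotting or monitoring. The parser below is a minimal sketch written against the lines shown in this output only; `parse_log_line` is a hypothetical helper, and lines that do not match (such as a truncated one) return `None`.

```python
import re

# Pattern derived from the progress lines printed above.
LOG_LINE = re.compile(
    r"^(?P<step>\d+)/(?P<total>\d+) = (?P<pct>[\d.]+)%, "
    r"SPS: (?P<sps>[\d.]+), ELAP: (?P<elapsed>[\d:]+), ETA: (?P<eta>[\d:]+)"
    r" - loss: (?P<loss>[\d.]+)$"
)


def parse_log_line(line):
    """Return step, total, steps per second, and loss, or None if no match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "total": int(match["total"]),
        "sps": float(match["sps"]),
        "loss": float(match["loss"]),
    }


line = "5138/10000 = 51.4%, SPS: 3.8, ELAP: 18:00, ETA: 21:09 - loss: 24.4225"
print(parse_log_line(line))
# {'step': 5138, 'total': 10000, 'sps': 3.8, 'loss': 24.4225}
```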

## 4. Convert TensorFlow checkpoints to PyTorch format

Hugging Face has [a tool](https://huggingface.co/transformers/converting_tensorflow_models.html) to convert TensorFlow checkpoints to PyTorch. However, this tool has not yet been updated for ELECTRA. Fortunately, I found a GitHub repo by @lonePatient that can help us with this task.

In [19]:
!git clone https://github.com/lonePatient/electra_pytorch.git

Cloning into 'electra_pytorch'...
remote: Enumerating objects: 196, done.
remote: Counting objects: 100% (196/196), done.
remote: Compressing objects: 100% (146/146), done.
remote: Total 196 (delta 89), reused 124 (delta 46), pack-reused 0
Receiving objects: 100% (196/196), 404.99 KiB | 643.00 KiB/s, done.
Resolving deltas: 100% (89/89), done.


In [26]:
MODEL_DIR = "/content/drive/My Drive/Colab Notebooks/from_scratch/models/electra_MALS/"

config = {
  "vocab_size": 30_000,
  "embedding_size": 128,
  "hidden_size": 256,
  "num_hidden_layers": 12,
  "num_attention_heads": 4,
  "intermediate_size": 1024,
  "generator_size":"0.25",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "max_position_embeddings": 512,
  "type_vocab_size": 2,
  "initializer_range": 0.02
}

with open(MODEL_DIR + "config.json", "w") as f:
    json.dump(config, f)
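Before running the conversion, it can help to check that the configuration is internally consistent. The sketch below is an optional sanity check under the usual Transformer assumption that the hidden size is split evenly across attention heads; `check_config` is a hypothetical helper, and the dictionary repeats only the relevant values from the config above.

```python
def check_config(config):
    """Check head divisibility and report the derived sizes."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError(
            f"hidden_size {hidden} is not divisible by num_attention_heads {heads}"
        )
    return {
        "head_size": hidden // heads,
        "ffn_ratio": config["intermediate_size"] / hidden,
    }


small_config = {
    "hidden_size": 256,
    "num_attention_heads": 4,
    "intermediate_size": 1024,
}
print(check_config(small_config))  # {'head_size': 64, 'ffn_ratio': 4.0}
```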

In [30]:
!python electra_pytorch/convert_electra_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=/content/drive/My\ Drive/Colab\ Notebooks/from_scratch/models/electra_MALS/ \
    --electra_config_file=/content/drive/My\ Drive/Colab\ Notebooks/from_scratch/models/electra_MALS/config.json \
    --pytorch_dump_path=/content/drive/My\ Drive/Colab\ Notebooks/from_scratch/models/electra_MALS/pytorch_model.bin

INFO:model.configuration_utils:loading configuration file /content/drive/My Drive/Colab Notebooks/from_scratch/models/electra_MALS/config.json
INFO:model.configuration_utils:Model config {
  "attention_probs_dropout_prob": 0.1,
  "embedding_size": 128,
  "finetuning_task": null,
  "generator_size": "0.25",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 30000
}

INFO:model.modeling_electra:Converting TensorFlow checkpoint from /content/drive/My Drive/Colab Notebooks/from_scratch/models/electra_MALS
INFO:model.modeling_electra:Loading TF weight discriminator_predictions/dense/bias with shape [256]
INFO:model.modeling_electr

**Use ELECTRA with `transformers`**

After converting the model checkpoint to PyTorch format, we can start to use our pre-trained ELECTRA model on downstream tasks with the `transformers` library.

In [11]:
from transformers import ElectraTokenizer, ElectraModel

In [33]:
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained(MODEL_DIR)
tokenizer = ElectraTokenizerFast.from_pretrained('/content/drive/My Drive/Colab Notebooks/from_scratch', do_lower_case=False)

In [34]:
fake_sentence = "Isso é um exemplo!"

fake_tokens = tokenizer.tokenize(fake_sentence, add_special_tokens=True)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
# A positive logit means the discriminator flags the token as replaced.
# flatten() yields one value per token even if a batch dimension is present.
predictions = (discriminator_outputs[0] > 0).flatten()

for token in fake_tokens:
    print("%7s" % token, end="")
print("\n")
for prediction in predictions.tolist():
    print("%7s" % int(prediction), end="")

  [CLS]   Isso      e     umexemplo      !  [SEP]

      1      0      0      0      0      0      1
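In the output above, `um` and `exemplo` run together because `exemplo` fills the entire 7-character column. A small formatting helper that sizes the columns from the longest token avoids this. The function `align_predictions` is hypothetical, and the token and prediction lists are copied from the output above so the sketch runs without the model.

```python
def align_predictions(tokens, predictions):
    """Return two aligned rows: the tokens and their 0/1 predictions."""
    if len(tokens) != len(predictions):
        raise ValueError("tokens and predictions must have the same length")
    # Column width: the longest token plus two spaces of padding.
    width = max(len(token) for token in tokens) + 2
    token_row = "".join(token.rjust(width) for token in tokens)
    prediction_row = "".join(str(int(p)).rjust(width) for p in predictions)
    return token_row, prediction_row


tokens = ["[CLS]", "Isso", "e", "um", "exemplo", "!", "[SEP]"]
predictions = [1, 0, 0, 0, 0, 0, 1]
for row in align_predictions(tokens, predictions):
    print(row)
```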

Our model was trained for far fewer steps than the 1 million used for a full ELECTRA-small run, so its predictions are not accurate. The fully trained ELECTRA-small for Spanish can be loaded as follows:

```python
discriminator = ElectraForPreTraining.from_pretrained("skimai/electra-small-spanish")
tokenizer = ElectraTokenizerFast.from_pretrained("skimai/electra-small-spanish", do_lower_case=False)
```


## 5. Conclusion

In this article, we walked through the ELECTRA paper to understand why ELECTRA is the most efficient Transformer pre-training approach at the moment. At a small scale, ELECTRA-small can be trained on one GPU for 4 days to outperform GPT on the GLUE benchmark. At a large scale, ELECTRA-large sets a new state of the art for SQuAD 2.0.

We then trained an ELECTRA model on Spanish texts, converted the TensorFlow checkpoint to PyTorch format, and used the model with the `transformers` library.

## References
- [1] [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB)
- [2] [google-research/electra](https://github.com/google-research/electra) - the official GitHub repository of the original paper
- [3] [electra_pytorch](https://github.com/lonePatient/electra_pytorch) - a PyTorch implementation of ELECTRA