# Train RoBERTa from Scratch

This notebook is based on a [tutorial](https://huggingface.co/blog/how-to-train) from the Hugging Face Transformers Library.

## Train a tokenizer

We choose to train a byte-level byte-pair encoding (BPE) tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. We arbitrarily set its vocabulary size to 52,000.

We recommend training a byte-level BPE tokenizer (rather than, say, a WordPiece tokenizer like BERT's) because it builds its vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens (no more `<unk>` tokens!).
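
To make the "alphabet of single bytes" point concrete, here is a minimal sketch using only the standard library. The helper name `to_byte_symbols` and the sample strings are illustrative assumptions, not part of the tutorial. It shows that any string, in any script, reduces to symbols from the same 256-entry base alphabet, which is why an unknown-token fallback is never needed.

```python
def to_byte_symbols(text):
    """Split text into its UTF-8 bytes, the base symbols of a byte-level vocabulary."""
    return list(text.encode("utf-8"))


samples = ["fireman", "naïve", "東京", "🎉"]
for s in samples:
    symbols = to_byte_symbols(s)
    # Every symbol is an integer in 0..255, so it is always in the base alphabet.
    assert all(0 <= b < 256 for b in symbols)
    print(f"{s!r}: {len(s)} characters -> {len(symbols)} byte symbols")
```

BPE training then merges frequent pairs of these symbols into longer tokens until the vocabulary reaches the requested size.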


In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/

Mounted at /content/drive
/content/drive/MyDrive


In [2]:
# We won't need TensorFlow here
# !pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
!pip install textacy
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-2kwmka7j
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-2kwmka7j
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
     |████████████████████████████████| 2.9MB 5.2MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 46.5MB/s 
Building wheels for collected packages: transformers
  Building wheel for t

In [None]:
# load wikipedia file and make corpus. This cell takes many hours to run because of the size of the wikipedia dump.
# for smaller wikipedia dumps, look here: https://dumps.wikimedia.org/enwiki/latest/
import sys
from gensim.corpora import WikiCorpus

def make_corpus(in_f, out_f):
    """Convert a Wikipedia XML dump file to a plain-text corpus, one article per line."""
    print('starting corpus')
    wiki = WikiCorpus(in_f)
    print('loaded corpus')

    # The context manager closes the output file even if processing fails midway.
    with open(out_f, 'w', encoding='utf-8') as output:
        for i, text in enumerate(wiki.get_texts(), start=1):
            # Each article arrives as a list of tokens; write it as a single line.
            output.write(' '.join(text) + '\n')
            if i % 100 == 0:
                print('Processed ' + str(i) + ' articles')
    print('Processing complete!')
 
# !wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# make_corpus('enwiki-latest-pages-articles.xml.bz2', 'wiki_en_full.txt')

# !wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p4045403p5399366.bz2
# make_corpus('enwiki-latest-pages-articles10.xml-p4045403p5399366.bz2', 'wiki_en_10.txt')

In [None]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = ['wiki_en_10.txt']

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 7min 47s, sys: 7.45 s, total: 7min 54s
Wall time: 2min 11s


In [None]:
!mkdir wikipedia_model_3e
tokenizer.save_model("wikipedia_model_3e")

mkdir: cannot create directory ‘wikipedia_model_3e’: File exists


['wikipedia_model_3e/vocab.json', 'wikipedia_model_3e/merges.txt']

In [3]:
%cd lm_training_experiments/

/content/drive/MyDrive/lm_training_experiments


In [4]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./wikipedia_model_3e/vocab.json",
    "./wikipedia_model_3e/merges.txt",
)

In [5]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [6]:
tokenizer.encode("I am a fireman.").tokens

['<s>', 'I', 'Ġam', 'Ġa', 'Ġfireman', '.', '</s>']
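
The `Ġ` prefix in the output above is how byte-level BPE renders a leading space: each of the 256 byte values is mapped to a printable character so that tokens can be stored as ordinary strings. The sketch below re-implements that mapping in the style of GPT-2's `bytes_to_unicode` helper. It is an illustration written for this notebook, not code taken from the `tokenizers` library.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_encoder = bytes_to_unicode()
rendered = "".join(byte_encoder[b] for b in " am".encode("utf-8"))
print(rendered)  # Ġam
```

The space byte (32) is the 33rd non-printable byte, so it lands on code point 256 + 32 = U+0120, which is `Ġ`.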

## Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on the task of *masked language modeling*, i.e. predicting how to fill in arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
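
As a rough illustration of the masking step, the sketch below hides each token with a fixed probability and keeps the original token as the prediction target. It is a simplification: the function name `mask_tokens` and its `seed` parameter are assumptions made for this example, and the collator in `transformers` works on token ids, skips special tokens, and replaces only 80% of the selected positions with the mask token (10% get a random token and 10% stay unchanged).

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mlm_probability=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(mask_token)
            labels.append(tok)    # the model is trained to predict this token
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels


tokens = "the sun is the star at the center of the solar system".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Only positions whose label is not `None` contribute to the training loss.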


In [None]:
# Check that we have a GPU
!nvidia-smi

Tue Dec 29 01:18:41 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    24W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [7]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [8]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./wikipedia_model_3e", max_len=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [9]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [10]:
model.num_parameters()
# => 84 million parameters

83504416
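
The parameter count can be reproduced by hand from the config. The arithmetic below assumes the `RobertaConfig` defaults that the cell above does not override (`hidden_size=768`, `intermediate_size=3072`), and that the LM head's output projection shares its weight matrix with the word embeddings, so only its bias is counted. The helper `linear` is defined here purely for the calculation.

```python
vocab_size, max_positions, type_vocab, num_layers = 52_000, 514, 1, 6
hidden, intermediate = 768, 3072  # assumed RobertaConfig defaults


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embedding tables, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + layer_norm
)

# Dense + LayerNorm + output bias; the output weight is tied to the word embeddings.
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + num_layers * encoder_layer + lm_head
print(total)  # 83504416
```

Nearly half of the parameters (about 40 million) sit in the embedding tables, which is a direct consequence of the 52,000-token vocabulary.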

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.
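
Conceptually, a line-by-line dataset turns each non-blank line of the file into one training example and truncates it to `block_size` tokens. The sketch below imitates that behaviour with a hypothetical whitespace tokenizer (`toy_tokenize`) and a temporary file. It is an approximation for illustration, not the library's implementation.

```python
import os
import tempfile


def toy_tokenize(line):
    """Hypothetical stand-in for the real tokenizer: one token per whitespace-separated word."""
    return line.split()


def line_by_line_examples(file_path, tokenize, block_size):
    """Yield one truncated token list per non-blank line of the file."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield tokenize(line)[:block_size]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("anarchism is a political philosophy\n\nthe sun is a star\n")
    examples = list(line_by_line_examples(path, toy_tokenize, block_size=4))

print(examples)
# [['anarchism', 'is', 'a', 'political'], ['the', 'sun', 'is', 'a']]
```

Because the corpus file holds one article per line, anything beyond the first `block_size` tokens of an article is discarded with this approach.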

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./wiki_en_10.txt",
    block_size=128,
)



Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.
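
The batching half of that job can be sketched in a few lines: sequences of different lengths are padded to a common length and paired with an attention mask. The token ids below are made up, and `pad_id=1` assumes that `<pad>` received id 1 because it was listed second among the special tokens when the tokenizer was trained. The real collator additionally applies the random masking and returns tensors.

```python
def pad_batch(sequences, pad_id):
    """Pad token-id lists to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Made-up token ids; 0 and 2 stand for <s> and </s>.
batch = [[0, 45, 310, 2], [0, 45, 2], [0, 45, 310, 78, 12, 2]]
input_ids, attention_mask = pad_batch(batch, pad_id=1)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Positions where the attention mask is 0 are padding and are ignored by the model.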

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

### Finally, we are all set to initialize our Trainer

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./wikipedia_model_3e",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    save_steps=10_000
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset  
)

### Start training

In [None]:
%%time
trainer.train()

Step,Training Loss
500,7.880714
1000,7.322539
1500,7.120128
2000,6.984567
2500,6.899908
3000,6.826775
3500,6.777635
4000,6.726898
4500,6.669105
5000,6.635973


CPU times: user 29min 6s, sys: 19min 26s, total: 48min 33s
Wall time: 48min 28s


TrainOutput(global_step=7371, training_loss=6.847868527474647, metrics={'train_runtime': 2908.7329, 'train_samples_per_second': 2.534, 'total_flos': 30249366474276864, 'epoch': 3.0})
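
The reported `global_step` also gives a rough size for the training set. The arithmetic below assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept, so that the number of steps per epoch is the example count divided by the batch size, rounded up.

```python
global_step, num_epochs, batch_size = 7371, 3, 64

steps_per_epoch = global_step // num_epochs
assert steps_per_epoch * num_epochs == global_step

# steps_per_epoch = ceil(num_examples / batch_size), which bounds num_examples:
min_examples = (steps_per_epoch - 1) * batch_size + 1
max_examples = steps_per_epoch * batch_size

print(steps_per_epoch)             # 2457
print(min_examples, max_examples)  # 157185 157248
```

Under those assumptions the dataset holds roughly 157,000 examples, i.e. about that many non-blank lines in `wiki_en_10.txt`.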

#### 🎉 Save final model (+ tokenizer + config) to disk

In [16]:
trainer.save_model("./wikipedia_model_3e")
# %cd lm_training_experiments

# If the model has already been trained and saved, load it directly with the following line
# (`from_pretrained` is a class method, so no config-initialized instance is needed first):
# model = RobertaForMaskedLM.from_pretrained('./wikipedia_model_3e')

## Check that the LM actually trained

In [11]:
from transformers import pipeline

unmasker = pipeline(
    "fill-mask",
    model="./wikipedia_model_3e",
    tokenizer="./wikipedia_model_3e"
)

Some weights of RobertaModel were not initialized from the model checkpoint at ./wikipedia_model_3e and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Run Knowledge Graph Experiments

In [None]:
# The sun <mask>.
# =>
from run_training_kg_experiments import *

run_experiments(tokenizer, model, unmasker, "Roberta3e")