In [0]:
#@title
%%html
<div style="background-color: pink;">
  Notebook written in collaboration with <a href="https://github.com/aditya-malte">Aditya Malte</a>.
  <br>
  The Notebook is on GitHub, so contributions are more than welcome.
</div>
<br>
<div style="background-color: yellow;">
  Aditya wrote another notebook with a slightly different use case and methodology, please check it out.
  <br>
  <a target="_blank" href="https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b">
    https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b
  </a>
</div>


# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (see the accompanying [blog post](https://huggingface.co/blog/how-to-train))


Over the past few weeks, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it way easier to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299 MB, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which comprises text from diverse sources like news, literature, and Wikipedia.

The final training corpus has a size of 3 GB, which is still small. For your own model, the more data you can pretrain on, the better your results will be.



In [0]:
# In this notebook we only fetch a single OSCAR file, for the sake of simplicity and performance.
# The Esperanto download is kept commented out; the active command fetches the Norwegian file (no.txt.gz).
#!wget -c https://s3.amazonaws.com/datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
!wget -c https://traces1.inria.fr/oscar/files/compressed-orig/no.txt.gz

--2020-05-02 21:01:10--  https://traces1.inria.fr/oscar/files/compressed-orig/no.txt.gz
Resolving traces1.inria.fr (traces1.inria.fr)... 128.93.193.43
Connecting to traces1.inria.fr (traces1.inria.fr)|128.93.193.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3106497292 (2.9G) [application/gzip]
Saving to: ‘no.txt.gz’


2020-05-02 21:05:34 (11.3 MB/s) - ‘no.txt.gz’ saved [3106497292/3106497292]



In [0]:
!gunzip no.txt.gz

In [0]:
!ls -l

total 8384168
-rw-r--r-- 1 root root 8585378971 Apr  8  2019 no.txt
drwxr-xr-x 1 root root       4096 Apr  3 16:24 sample_data


## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
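
To make the "no more `<unk>`" point concrete, here is a minimal sketch in plain Python (it does not use `tokenizers`). It only illustrates that any string, whatever its script, breaks down into UTF-8 bytes, and that there are just 256 possible byte values to start a vocabulary from. The helper name `to_byte_symbols` and the sample words are illustrative.

```python
def to_byte_symbols(text):
    """Return the UTF-8 byte values that a byte-level alphabet would start from."""
    return list(text.encode("utf-8"))

# Words from different scripts all reduce to values in range(256),
# so there is always a fallback representation for unseen characters.
for word in ["første", "ĉirkaŭ", "日本語"]:
    symbols = to_byte_symbols(word)
    assert all(0 <= b < 256 for b in symbols)
    print(word, "->", symbols)
```

Characters outside ASCII simply take more than one byte, which is why frequent multi-byte sequences are good candidates for early merges.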


In [0]:
# Install dependencies
!pip uninstall -y tensorflow
!pip install transformers==2.8.0
# transformers version at notebook creation --- 2.5.1
# tokenizers version at notebook creation --- 0.5.2

Uninstalling tensorflow-2.2.0rc3:
  Successfully uninstalled tensorflow-2.2.0rc3
Collecting transformers==2.8.0
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/99/50/93509f906a40bffd7d175f97fd75ea328ad9bd91f48f59c4bd084c94a25e/sacremoses-0.0.41.tar.gz (883kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Collecting tokenizers==0.5.2
  Downloading https://files.pythonhosted.org/packages/d1/3f/73c881ea4723e43c1e9acf317cf407fab3a278daab3a69c98d

In [0]:
## Workaround for latest bug: https://github.com/huggingface/transformers/issues/3893
#!git clone https://github.com/huggingface/transformers
#!pip install ./transformers

In [0]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
model_name = 'NorBERTa'
path = "/content/drive/My Drive/{name}".format(name=model_name) 

Now let's save the tokenizer files to disk.

In [0]:
from pathlib import Path
Path(path).mkdir(parents=True, exist_ok=True)
tokenizer.save(path)

In [0]:
!ls '/content/drive/My Drive/NorBERTa'
! cat '/content/drive/My Drive/NorBERTa/merges.txt'

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
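
To illustrate how the two files work together, here is a toy sketch of applying a ranked merge list to a sequence of symbols. It is a simplified re-implementation for explanation only, not the code used inside `tokenizers`. The function name `apply_merges` is hypothetical, and the merge list is just the first few entries shown above.

```python
def apply_merges(symbols, merges):
    """Greedily apply BPE merges; earlier entries in `merges` have higher priority."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# First entries of the merges.txt excerpt above
merges = [("l", "a"), ("Ġ", "k"), ("o", "n"), ("Ġ", "la")]
print(apply_merges(["Ġ", "l", "a"], merges))
```

Each resulting symbol is then looked up in `vocab.json` to obtain its integer id.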

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.
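
A comparison like this can be reproduced with a few lines of Python. The sketch below is a hedged illustration: `average_length` and `relative_saving` are hypothetical helpers, and the two encoders are deliberately trivial stand-ins (character-level and whitespace-level) rather than real tokenizers.

```python
def average_length(encode, lines):
    """Mean number of tokens per non-empty line for a given `encode` callable."""
    lengths = [len(encode(line)) for line in lines if line.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0

def relative_saving(baseline_avg, new_avg):
    """Fraction by which `new_avg` is shorter than `baseline_avg`."""
    return 1.0 - new_avg / baseline_avg

# Stand-in encoders, for illustration only
char_level = list
word_level = str.split

lines = ["Dette er første testen.", "", "En setning til."]
baseline = average_length(char_level, lines)
candidate = average_length(word_level, lines)
print(baseline, candidate, relative_saving(baseline, candidate))
```

With real tokenizers you would pass something like `lambda s: tokenizer.encode(s).ids` as `encode`, once for each tokenizer being compared.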

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [0]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "{path}/vocab.json".format(path=path),
    "{path}/merges.txt".format(path=path),
)

In [0]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
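
Conceptually, the post-processor and the truncation setting combine as in the following plain-Python sketch. It is an approximation for explanation, assuming truncation leaves room for the two special tokens; `add_special_tokens` is a hypothetical helper, and the ids `0` and `2` are the `<s>` and `</s>` ids from the `vocab.json` excerpt above.

```python
def add_special_tokens(ids, bos_id, eos_id, max_length=512):
    """Wrap `ids` in <s> ... </s>, truncating so the total fits in `max_length`."""
    room = max_length - 2  # reserve two positions for the special tokens
    return [bos_id] + list(ids[:room]) + [eos_id]

short = add_special_tokens([5, 6, 7], bos_id=0, eos_id=2)
long = add_special_tokens(range(1000), bos_id=0, eos_id=2)
print(short, len(long))
```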

In [0]:
tokenizer.encode("Dette er første testen.")

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])

In [0]:
tokenizer.encode("Dette er første testen.").tokens

['<s>', 'Dette', 'Ġer', 'ĠfÃ¸rste', 'Ġtesten', '.', '</s>']
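
The token `ĠfÃ¸rste` is not an encoding error. Byte-level BPE displays every byte as a printable character, so the leading space shows up as `Ġ` and the two UTF-8 bytes of `ø` show up as `Ã¸`. The sketch below re-implements the byte-to-character table in the style used by GPT-2. It is written for illustration, and the assumption is that the table inside `tokenizers` matches it.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character...
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # ...all other bytes (space, control characters, ...) are shifted to U+0100 and above.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table

BYTE_TABLE = bytes_to_unicode()

def to_visible(text):
    """Show `text` the way a byte-level BPE vocabulary would spell it."""
    return "".join(BYTE_TABLE[b] for b in text.encode("utf-8"))

print(to_visible(" første"))
```

Decoding reverses the table, so the original text is recovered exactly.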

## 3. Train a language model from scratch

We will now train our language model using the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) script from `transformers` (newly renamed from `run_lm_finetuning.py` as it now supports training from scratch more seamlessly). Just remember to leave `--model_name_or_path` unset (`None`) to train from scratch rather than from an existing model or checkpoint.

> We’ll train a RoBERTa-like model, which is a BERT-like model with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. to predict how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
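
For intuition, here is a simplified, dependency-free sketch of the masking step. The 15% masking rate and the 80/10/10 split (mask token, random token, unchanged) follow the usual BERT recipe and are assumptions here, not values read from the example script. `-100` is used for positions to ignore because it is the default `ignore_index` of PyTorch's cross-entropy loss. The function name `mask_tokens` and its parameters are illustrative.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = random.Random(seed)
    inputs = list(ids)
    labels = [-100] * len(ids)  # -100 = position ignored by the loss
    for i, token in enumerate(ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue  # never mask special tokens; skip most ordinary tokens
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in place
    return inputs, labels

inputs, labels = mask_tokens(
    [0, 10, 11, 12, 2], mask_id=4, vocab_size=52_000, special_ids={0, 1, 2}, seed=0
)
print(inputs, labels)
```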


In [0]:
# Check that we have a GPU
!nvidia-smi

Fri May  1 20:21:26 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

Here, as we only have one text file, we don't even need to customize the script's `LineByLineTextDataset`. We'll just run the `run_language_modeling.py` script out of the box.
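
As a rough picture of what a line-by-line dataset does, the sketch below reads a text file, skips blank lines, and turns each remaining line into one truncated list of token ids. It is an illustrative stand-in, not the script's own class: `read_line_examples` and `toy_encode` are hypothetical names, and the encoder simply maps characters to code points.

```python
import os
import tempfile

def read_line_examples(file_path, encode, block_size=512):
    """Return one token-id list per non-empty line, truncated to `block_size`."""
    examples = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(encode(line)[:block_size])
    return examples

def toy_encode(text):
    """Stand-in encoder: one 'id' per character."""
    return [ord(c) for c in text]

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Dette er en test.\n\nEn linje til.\n")
    examples = read_line_examples(sample, toy_encode, block_size=8)

print(len(examples), [len(e) for e in examples])
```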

In [0]:
# Get the example script.
# The latest version is buggy (https://github.com/huggingface/transformers/issues/3893),
# so we fetch a pinned earlier version instead.
!wget -c https://raw.githubusercontent.com/huggingface/transformers/v2.8.0/examples/run_language_modeling.py
#!wget -c https://raw.githubusercontent.com/huggingface/transformers/b1ff0b2ae7d368b7db3a8a8472a29cc195d278d8/examples/run_language_modeling.py

--2020-05-02 21:13:29--  https://raw.githubusercontent.com/huggingface/transformers/v2.8.0/examples/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34328 (34K) [text/plain]
Saving to: ‘run_language_modeling.py’


2020-05-02 21:13:29 (2.65 MB/s) - ‘run_language_modeling.py’ saved [34328/34328]



### We'll define the following config for the model

In [0]:
import json
config = {
	"architectures": [
		"RobertaForMaskedLM"
	],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "roberta",
	"num_attention_heads": 12,
	"num_hidden_layers": 6,
	"type_vocab_size": 1,
	"vocab_size": 52000
}
with open("{path}/config.json".format(path=path), 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {
	"max_len": 512
}
with open("{path}/tokenizer_config.json".format(path=path), 'w') as fp:
    json.dump(tokenizer_config, fp)
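
As a sanity check on the "84 M parameters" figure, the count can be worked out from the config by hand. The sketch below assumes a standard BERT/RoBERTa layer layout (biases on every linear layer, two-parameter LayerNorms), a masked-LM head whose output weights are tied to the input embeddings, and a pooler layer that is included in the count. These are assumptions about the implementation rather than facts taken from this notebook, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(cfg, include_pooler=True):
    """Estimate the parameter count of a RoBERTa-style masked LM from its config."""
    h = cfg["hidden_size"]
    v = cfg["vocab_size"]
    inter = cfg["intermediate_size"]

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weights + bias

    layer_norm = 2 * h  # scale + shift

    embeddings = (
        v * h                                # token embeddings
        + cfg["max_position_embeddings"] * h  # position embeddings
        + cfg["type_vocab_size"] * h          # token-type embeddings
        + layer_norm
    )
    layer = (
        4 * linear(h, h)    # query, key, value, attention output
        + layer_norm
        + linear(h, inter)  # feed-forward in
        + linear(inter, h)  # feed-forward out
        + layer_norm
    )
    pooler = linear(h, h) if include_pooler else 0
    lm_head = linear(h, h) + layer_norm + v  # dense + LayerNorm + output bias (weights tied)
    return embeddings + cfg["num_hidden_layers"] * layer + pooler + lm_head

example_config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "num_hidden_layers": 6,
    "type_vocab_size": 1,
    "vocab_size": 52000,
}
total = count_parameters(example_config)
print(f"{total:,} parameters (~{total / 1e6:.0f} M)")
```

Note that almost half of the total comes from the 52,000 x 768 token-embedding matrix, so the vocabulary size matters a lot for a model this small.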

Let's run our script with the following options:

In [0]:
from pathlib import Path
output_dir = "{path}/NorBERTa-small-v2".format(path=path)
Path(output_dir).mkdir(parents=True, exist_ok=True)

In [0]:
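# Keep only the first 100,000 lines; `>` overwrites the file, so re-running this cell does not duplicate data
!head -n 100000 no.txt > no_small.txt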
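
If you prefer to stay in Python, taking the first lines of a large file can be sketched as follows. `write_head` is a hypothetical helper. It streams the file rather than loading it into memory, and it overwrites the target so that running it twice gives the same result. The demo uses a small temporary file instead of the real corpus.

```python
import os
import tempfile
from itertools import islice

def write_head(src_path, dst_path, n_lines):
    """Copy the first `n_lines` lines of `src_path` into `dst_path`, overwriting it."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(islice(src, n_lines))

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "full.txt")
    dst = os.path.join(tmp, "small.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.writelines(f"line {i}\n" for i in range(10))
    write_head(src, dst, 3)
    write_head(src, dst, 3)  # running twice still leaves exactly 3 lines
    with open(dst, encoding="utf-8") as f:
        head_lines = f.read().splitlines()

print(head_lines)
```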

In [0]:
# --should_continue resumes from the latest checkpoint in output_dir;
# remove it for the first run, when no checkpoint exists yet.
cmd = """
    python run_language_modeling.py
    --train_data_file ./no_small.txt
    --output_dir '{output_dir}'
    --model_type roberta
    --mlm
    --config_name '{path}'
    --tokenizer_name '{path}'
    --do_train
    --line_by_line
    --should_continue
    --learning_rate 1e-4
    --num_train_epochs 5
    --save_total_limit 2
    --save_steps 2000
    --per_gpu_train_batch_size 8
    --seed 42
""".format(path=path, output_dir=output_dir).replace("\n", " ")

In [0]:
print(cmd)
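
Because the Drive path contains a space (`My Drive`), quoting matters when the command is assembled. One alternative way to build such a command is sketched below with the standard-library `shlex` module. `build_command` is a hypothetical helper, and the option values are examples modelled on the ones used in this notebook.

```python
import shlex

def build_command(script, options):
    """Build a shell-safe command line; a value of None marks a flag without argument."""
    parts = ["python", script]
    for flag, value in options.items():
        parts.append(flag)
        if value is not None:
            parts.append(str(value))
    return " ".join(shlex.quote(part) for part in parts)

cmd_example = build_command("run_language_modeling.py", {
    "--train_data_file": "./no_small.txt",
    "--output_dir": "/content/drive/My Drive/NorBERTa/NorBERTa-small-v2",
    "--model_type": "roberta",
    "--mlm": None,
    "--do_train": None,
})
print(cmd_example)
```

Only arguments that need it are quoted, and `shlex.split` recovers the original argument list from the resulting string.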

In [0]:
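%%time
# IPython expands {cmd} only if `cmd` is defined in the current session (run the cells above first);
# otherwise the literal text "{cmd}" is handed to the shell, as in the output below.
!{cmd}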

/bin/bash: {cmd}: command not found
CPU times: user 42.7 ms, sys: 12.3 ms, total: 54.9 ms
Wall time: 15.1 s
