<a href="https://colab.research.google.com/github/huggingface/blog/blob/notebook_update_may15/notebooks/01_how_to_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (see the accompanying [blog post](https://huggingface.co/blog/how-to-train)). Last update May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299M, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which comprises text from diverse sources like news, literature, and Wikipedia.

The final training corpus has a size of 3 GB, which is still small – for your model, the more data you can pretrain on, the better your results will be.
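
As a rough illustration of the concatenation step, the sketch below merges several plain-text files into a single training file. The file names and sample sentences are hypothetical placeholders, not the actual corpus files.

```python
from pathlib import Path
import tempfile

def concatenate_corpora(sources, target):
    """Append every source file to `target`, one after another."""
    with open(target, "w", encoding="utf-8") as out:
        for src in sources:
            text = Path(src).read_text(encoding="utf-8")
            out.write(text)
            # Make sure the next file starts on a fresh line.
            if text and not text.endswith("\n"):
                out.write("\n")

# Tiny demo with throwaway files (hypothetical names and contents).
workdir = Path(tempfile.mkdtemp())
(workdir / "oscar.txt").write_text("Saluton, mondo!\n", encoding="utf-8")
(workdir / "leipzig.txt").write_text("La suno brilas.", encoding="utf-8")
combined = workdir / "corpus.txt"
concatenate_corpora([workdir / "oscar.txt", workdir / "leipzig.txt"], combined)
merged_lines = combined.read_text(encoding="utf-8").splitlines()
```

Keeping one sentence or document per line matters later on, because the dataset class used in this notebook treats every line as one training example.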



In [1]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
#!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

In [2]:
# import transformers
# transformers.AutoModelForCausalLM

## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
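
To see why a byte-level alphabet never needs `<unk>`, here is a minimal sketch in plain Python (it does not use the `tokenizers` library). It only shows the starting alphabet: any string, including one with non-ASCII characters, breaks down into UTF-8 bytes drawn from just 256 possible values.

```python
def to_byte_tokens(text):
    """Split text into its UTF-8 bytes, the base alphabet of a byte-level BPE."""
    return list(text.encode("utf-8"))

word = "ĉevalo"  # Esperanto for "horse"; 'ĉ' is outside ASCII
byte_tokens = to_byte_tokens(word)

# Every symbol is one of only 256 possible values, so none is ever unknown.
all_known = all(0 <= b < 256 for b in byte_tokens)

# The decomposition is lossless: the bytes decode back to the original word.
restored = bytes(byte_tokens).decode("utf-8")
```

A trained tokenizer then merges frequent byte sequences into longer tokens, but it can always fall back to these single bytes.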


In [3]:
# We won't need TensorFlow here
#!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-4sqk6m0t
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-4sqk6m0t
Building wheels for collected packages: transformers
  Building wheel for transformers (setup.py) ... done
  Created wheel for transformers: filename=transformers-3.0.2-py3-none-any.whl size=810726 sha256=91b44485b333ef849f4f35eff34c286b158e543a897bf95df7caf72167a14877
  Stored in directory: /tmp/pip-ephem-wheel-cache-5sgdnm9z/wheels/35/2e/a7/d819e3310040329f0f47e57c9e3e7a7338aa5e74c49acfe522
Successfully built transformers
tokenizers                         0.8.1rc2           
transformers                       3.0.2              


In [4]:
import pandas as pd
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /home/taylor/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
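stop = stopwords.words('english')
# Custom fragments to strip, in addition to the NLTK English stop words
Remove = ['and ', 'the ', 'is ', 'this ', 'a ', '\n', '[', ']', "'", "\\n", '"', 'he']
Remove = Remove + stop

# `error_bad_lines=False` skips malformed rows
# (newer pandas versions use `on_bad_lines='skip'` instead)
data = pd.read_csv('reddit.csv', error_bad_lines=False)
# We only need the `text` column from the data

data_text = []
for key, value in data[['text']].T.to_dict().items():
    value['text'] = value['text'].lower()

    # Caution: str.replace removes every matching substring, not whole words.
    # Single-letter stop words such as 'i', 'a', 's' and 't' are therefore
    # deleted from inside other words, which is why the printed text looks garbled.
    for rep in Remove:
        value['text'] = value['text'].replace(rep, "")

    # Keep only texts of at most 512 characters
    if len(value['text']) <= 512:
        data_text.append(value['text'])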

# for element in data_text:
#     for thing in Remove:
#         element = element.replace(thing,"")



# data = data_text..tolist()
# data = data.replace("\n"," ")
# data = data.replace("["," ")
# 
# data = data.replace("]"," ")
# data = data.replace("]"," ")
# data = data.apply(lambda x: [item for item in x if item not in stop])
# data = data.apply(lambda x: [item for item in x if item not in Remove])
# data = data.replace(Remove,"")

# print (data)
print(data_text)





['1', ' ner, hub f cvl cu., (/r/plc/wk/nex#wk__cvl)n generl,  ceu  r., ebe/cu/rgue r  e, n ck peple., perl nul, h  r ccun, he peech, **n** vcng  whng eh/phcl hr, r rule vln cn ul n pernen bn., f  ee cn n vln   rule, plee p ., f    quen grng n ule ng pe  hub, plee clck (hp://www..c/r/plc/wk/whel)  w  el   whel le cr., ****  b, hcn  perf uc., plee cc er  hub(/ge/cpe/?=/r/plc) f   n quen  ccern., *', 'l v  ge  fr.', 'en  el!, ..   funn!', 'elee', 'rup 100% l h   h w blck   ben', 'rcn(hp://hp.rpp.c/hg-pr.3.zw.c/ge/pen-elec-nl-rup-n-kne--n-ger-new-ph-1586966279.jpg?ze=768:*)', 'ee lughng ju lke ee lug wn rup fr nunce', '’ gb?,  wer wn ng blck  f ben?', 'r ler wn   rup  publc un., ju pr  hlbu lee.', 'ler gg  g    w   f h.', 'r  nee!', ' he her.', 'kne   g w ’ ju cl', 'h., gne kne  pen..’ ke h  rup.', 'cue  fuck   hp., l fuck h krhn.', '  ju le 40 er l whe hper   ben.', 'nl egne  rw blck  ng  w  ben  lp rup.', 'n  ju    vng f r-n-cf., kne  gnl   f nl rup n 2020:  kw   vng (hp://www.bunener.c/
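
The printed text has lost most of its letters because chained `str.replace` calls delete stop words as substrings. The sketch below contrasts that behavior with whole-word removal. It uses a hypothetical four-word stop list and a regular expression instead of the NLTK list, so treat it as one possible alternative rather than the notebook's method.

```python
import re

def strip_substrings(text, stop_words):
    """What chained str.replace does: delete every occurrence, even inside words."""
    for word in stop_words:
        text = text.replace(word, "")
    return text

def strip_whole_words(text, stop_words):
    """Delete a stop word only when it stands alone as a token."""
    pattern = r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b"
    # Collapse the whitespace left behind by the removed words.
    return " ".join(re.sub(pattern, " ", text).split())

stop_words = ["i", "a", "to", "the"]  # hypothetical mini stop list
sentence = "i want to vote in the primary"

as_substrings = strip_substrings(sentence, stop_words)  # letters vanish inside words
as_words = strip_whole_words(sentence, stop_words)      # only standalone words vanish
```

For language-model pretraining it is also worth asking whether stop words should be removed at all, since the model needs them to learn grammar.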

In [6]:


# data['text']
# temp = [] 
# for key, value in data.T.to_dict().items():
# #     print(element)
#     if len(value['text']) <= 512:
#         temp.append(value)
    
# temp

with open("RedditText.txt", "w") as output:
    for element in data_text:
        output.write(element + "\n")


In [7]:
%%bash 
# ls
cat RedditText.txt

1
 ner, hub f cvl cu., (/r/plc/wk/nex#wk__cvl)n generl,  ceu  r., ebe/cu/rgue r  e, n ck peple., perl nul, h  r ccun, he peech, **n** vcng  whng eh/phcl hr, r rule vln cn ul n pernen bn., f  ee cn n vln   rule, plee p ., f    quen grng n ule ng pe  hub, plee clck (hp://www..c/r/plc/wk/whel)  w  el   whel le cr., ****  b, hcn  perf uc., plee cc er  hub(/ge/cpe/?=/r/plc) f   n quen  ccern., *
l v  ge  fr.
en  el!, ..   funn!
elee
rup 100% l h   h w blck   ben
rcn(hp://hp.rpp.c/hg-pr.3.zw.c/ge/pen-elec-nl-rup-n-kne--n-ger-new-ph-1586966279.jpg?ze=768:*)
ee lughng ju lke ee lug wn rup fr nunce
’ gb?,  wer wn ng blck  f ben?
r ler wn   rup  publc un., ju pr  hlbu lee.
ler gg  g    w   f h.
r  nee!
 he her.
kne   g w ’ ju cl
h., gne kne  pen..’ ke h  rup.
cue  fuck   hp., l fuck h krhn.
  ju le 40 er l whe hper   ben.
nl egne  rw blck  ng  w  ben  lp rup.
n  ju    vng f r-n-cf., kne  gnl   f nl rup n 2020:  kw   vng (hp://www.bunener.c/kne--gnl---f-nl-rup-n-2020-2020-4)
cncl pl n hpe  pung n

In [8]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

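# Collect every .txt file under the working directory, recursively;
# this includes the RedditText.txt file written above
paths = [str(x) for x in Path(".").glob("**/*.txt")]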

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 1.74 s, sys: 486 ms, total: 2.22 s
Wall time: 116 ms


Now let's save files to disk

In [9]:
!rm -rf RedditText
!mkdir RedditText
tokenizer.save_model("RedditText")

['RedditText/vocab.json', 'RedditText/merges.txt']

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.
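
To make the role of `merges.txt` concrete, here is a toy sketch of how an ordered list of merge rules turns a word into tokens. The three rules are hypothetical and the sketch works on characters rather than bytes, so it illustrates the idea and is not the `tokenizers` implementation.

```python
def apply_merges(word, merges):
    """Repeatedly apply the highest-priority merge rule until none applies."""
    symbols = list(word)
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no learned rule matches any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, earliest rule = highest priority.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
tokens = apply_merges("lower", merges)
```

Each resulting token is then looked up in `vocab.json` to obtain its integer id.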

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [11]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# Reload the tokenizer from the files saved above
tokenizer = ByteLevelBPETokenizer(
    "./RedditText/vocab.json",
    "./RedditText/merges.txt",
)

In [12]:
# Wrap each encoded sequence in <s> ... </s> and truncate to 512 tokens
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. predicting how to fill in tokens that we randomly mask in the dataset. This is taken care of by the example script.
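
The masking step can be sketched in plain Python as follows. This is a simplified illustration, not the library code: it assumes the common BERT-style recipe (of the selected tokens, 80% become the mask token, 10% a random token, 10% stay unchanged), uses `-100` as the "ignore this position" label, and the token ids in the demo are made up.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, special_ids, probability=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = random.Random(seed)
    inputs = list(ids)
    labels = [-100] * len(ids)  # -100 marks positions that do not count in the loss
    for i, token in enumerate(ids):
        # Never mask special tokens; select other tokens with the given probability.
        if token in special_ids or rng.random() >= probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise leave the token unchanged
    return inputs, labels

# Hypothetical ids: 0 and 2 stand for <s> and </s>, 4 for <mask>.
ids = [0, 11, 12, 13, 14, 15, 16, 17, 2]
inputs, labels = mask_tokens(
    ids, mask_id=4, vocab_size=52_000, special_ids={0, 1, 2}
)
```

Only the selected positions carry a label, so the loss is computed on roughly 15% of the tokens in each sequence.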


In [13]:
# Check that we have a GPU
!nvidia-smi

Sat Aug  1 23:01:14 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   62C    P0    57W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:07:00.0 Off |                    0 |
| N/A   45C    P0    73W / 149W |      0MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                            

In [14]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [15]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [16]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./RedditText", max_len=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [17]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [18]:
model.num_parameters()
# => 84 million parameters

84095008
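
The figure above can be reproduced by hand. The sketch below adds up weights and biases block by block, assuming the default hidden size of 768 and feed-forward size of 3072, a pooler layer on top of the encoder, and output-layer weights shared with the token embeddings. These are assumptions about the model layout, chosen because they match the printed count.

```python
def dense(n_in, n_out):
    """Parameters of a linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

def roberta_mlm_parameter_count(vocab, positions, type_vocab, hidden, ffn, layers):
    """Count weights and biases of a RoBERTa-style masked LM, block by block."""
    layer_norm = 2 * hidden  # scale and shift vectors

    embeddings = (vocab + positions + type_vocab) * hidden + layer_norm
    attention = 4 * dense(hidden, hidden) + layer_norm  # query, key, value, output
    feed_forward = dense(hidden, ffn) + dense(ffn, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)
    pooler = dense(hidden, hidden)
    # LM head: dense layer, layer norm, and an output bias per vocabulary entry;
    # the output projection weights are assumed tied to the token embeddings.
    lm_head = dense(hidden, hidden) + layer_norm + vocab
    return embeddings + encoder + pooler + lm_head

total = roberta_mlm_parameter_count(
    vocab=52_000, positions=514, type_vocab=1, hidden=768, ffn=3072, layers=6
)
```

Almost half of the total (about 40 M parameters) sits in the token embedding matrix, which is why the vocabulary size has such a large effect on the size of a small model.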

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.

In [19]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="RedditText.txt",
    block_size=64,
)
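
Conceptually, a line-by-line dataset turns every non-blank line of the file into one example and truncates it to `block_size` tokens. The sketch below mimics that behavior in plain Python; `toy_encode` is a hypothetical stand-in for a real tokenizer, and special tokens are ignored for simplicity.

```python
import tempfile
from pathlib import Path

def line_by_line_examples(file_path, encode, block_size):
    """One example per non-blank line, truncated to block_size tokens."""
    examples = []
    for line in Path(file_path).read_text(encoding="utf-8").splitlines():
        if not line or line.isspace():
            continue  # skip empty and whitespace-only lines
        examples.append(encode(line)[:block_size])
    return examples

def toy_encode(line):
    # Hypothetical tokenizer stand-in: one "id" per word (its length).
    return [len(word) for word in line.split()]

demo_file = Path(tempfile.mkdtemp()) / "demo.txt"
demo_file.write_text(
    "a short line\n\n   \none two three four five\n", encoding="utf-8"
)
examples = line_by_line_examples(demo_file, toy_encode, block_size=4)
```

With `block_size=64` as in the cell above, any line longer than 64 tokens is cut off, so very long lines lose their tail.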

CPU times: user 207 ms, sys: 33.6 ms, total: 240 ms
Wall time: 18.8 ms


Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a `data_collator`.

This is just a small helper that batches different samples of the dataset together into tensors that PyTorch can run backprop on.

In [20]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
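
Besides masking, a collator has to turn examples of different lengths into one rectangular batch. A minimal sketch of that padding step, in plain Python lists rather than tensors, could look like this. The pad id of 1 is an assumption based on `<pad>` being second in the special-token list used when training the tokenizer.

```python
def pad_batch(examples, pad_id):
    """Right-pad every example to the length of the longest one in the batch."""
    longest = max(len(example) for example in examples)
    return [example + [pad_id] * (longest - len(example)) for example in examples]

# Hypothetical token ids; 0 and 2 stand for <s> and </s>.
batch = pad_batch([[0, 5, 6, 2], [0, 7, 2]], pad_id=1)
```

Padding only up to the longest sequence in each batch, rather than to a fixed maximum, keeps batches of short lines small.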

### Finally, we are all set to initialize our Trainer

In [21]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=40,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

### Start training

In [22]:
%%time
trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


HBox(children=(FloatProgress(value=0.0, description='Epoch', max=40.0, style=ProgressStyle(description_width='…

HBox(children=(FloatProgress(value=0.0, description='Iteration', max=4.0, style=ProgressStyle(description_widt…







CPU times: user 3min 45s, sys: 2min 4s, total: 5min 49s
Wall time: 3min 10s


TrainOutput(global_step=160, training_loss=6.817177161574364)
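
The `global_step=160` value is consistent with the progress bars, which report 40 epochs of 4 iterations each. The short calculation below spells this out; the dataset size of 500 and the effective batch size of 128 are hypothetical values chosen only to show how 4 iterations per epoch can arise.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a full run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# From the log: 4 iterations per epoch, 40 epochs.
logged_steps = 4 * 40

# Hypothetical sizes: any example count that needs 4 batches gives the same total.
example_steps = total_steps(num_examples=500, batch_size=128, epochs=40)

# With save_steps=10_000, no intermediate checkpoint is reached in 160 steps.
save_steps = 10_000
checkpoints_during_training = logged_steps // save_steps
```

A run this short sees very little data, which is worth keeping in mind when judging the final training loss.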

#### 🎉 Save final model (+ tokenizer + config) to disk

In [23]:
trainer.save_model("./EsperBERTo")

In [24]:
tokenizer.save_pretrained("./EsperBERTo")

('./EsperBERTo/vocab.json',
 './EsperBERTo/merges.txt',
 './EsperBERTo/special_tokens_map.json',
 './EsperBERTo/added_tokens.json')

## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
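
Under the hood, the scores come from a softmax over the model's output at the masked position, followed by a top-k selection. Here is a minimal sketch of that last step in plain Python; the four-word vocabulary and the logits are made up for illustration.

```python
import math

def top_k_fill(logits, vocab, k=3):
    """Convert logits at the masked position into the k most probable tokens."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and logits for a prompt like "The sun <mask>."
vocab = ["sun", "moon", "shines", "sleeps"]
logits = [0.1, 0.3, 2.0, 1.0]
predictions = top_k_fill(logits, vocab, k=2)
```

Each returned pair corresponds to one entry of the pipeline output: the filled-in token and its probability.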



In [25]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)



In [26]:
# Ask the model to fill in the masked token
fill_mask("I will vote for <mask>.")

[{'sequence': '<s>I will vote for.</s>',
  'score': 0.1515866369009018,
  'token': 225,
  'token_str': 'Ġ'},
 {'sequence': '<s>I will vote for n.</s>',
  'score': 0.03599032387137413,
  'token': 268,
  'token_str': 'Ġn'},
 {'sequence': '<s>I will vote for .</s>',
  'score': 0.03144724294543266,
  'token': 274,
  'token_str': 'ĠĠ'},
 {'sequence': '<s>I will vote for.,.</s>',
  'score': 0.03131469711661339,
  'token': 286,
  'token_str': '.,'},
 {'sequence': '<s>I will vote for f.</s>',
  'score': 0.029994426295161247,
  'token': 275,
  'token_str': 'Ġf'}]

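The top predictions here are whitespace, punctuation, and single-letter fragments, so the model has not picked up usable syntax from this small, heavily stripped corpus. Let’s try a slightly more interesting prompt: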



In [27]:
fill_mask("<mask> is the president")

[{'sequence': '<s>  is the president</s>',
  'score': 0.12730830907821655,
  'token': 225,
  'token_str': 'Ġ'},
 {'sequence': '<s> n is the president</s>',
  'score': 0.04005299508571625,
  'token': 268,
  'token_str': 'Ġn'},
 {'sequence': '<s>   is the president</s>',
  'score': 0.037378400564193726,
  'token': 274,
  'token_str': 'ĠĠ'},
 {'sequence': '<s>., is the president</s>',
  'score': 0.020558584481477737,
  'token': 286,
  'token_str': '.,'},
 {'sequence': '<s> f is the president</s>',
  'score': 0.019908567890524864,
  'token': 275,
  'token_str': 'Ġf'}]

## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters),
    - evaluation results,
    - intended uses & limitations,
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on https://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)


If you want to take a look at models in different languages, check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)
