# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (see the accompanying [blog post](https://huggingface.co/blog/how-to-train)). Last update: May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

In [None]:
from tokenizers.processors import BertProcessing
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer('./vocab.txt',
                                    strip_accents=False,
                                    lowercase=False)


In [None]:
!mkdir gil-tokenizer
tokenizer.save_model("gil-tokenizer")

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ("[CLS]", tokenizer.token_to_id("[CLS]")),
)
tokenizer.enable_truncation(max_length=512)

In [None]:
tokenizer.encode("Olá, como está você.")

In [None]:
tokenizer.encode("Olá, como está você.").tokens

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on the task of *masked language modeling*, i.e. predicting how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
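To make the masking step concrete, here is a minimal pure-Python sketch of BERT-style masking for a single sequence. It assumes the usual 80/10/10 split (mask / random token / unchanged) and a label of `-100` for positions the loss should skip; the helper name `mask_tokens` and the toy token ids are hypothetical and only serve the illustration.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence using BERT-style 80/10/10 masking."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Never select special tokens; skip the other tokens with probability 1 - mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Toy ids: 101 = [CLS], 102 = [SEP], 103 = [MASK] (hypothetical values).
sample = [101, 7, 8, 9, 10, 11, 12, 102]
masked_inputs, masked_labels = mask_tokens(
    sample, mask_id=103, vocab_size=1000, special_ids={101, 102},
    mlm_probability=0.5, rng=random.Random(0),
)
print(masked_inputs)
print(masked_labels)
```

The demo uses a masking probability of 0.5 only so that a short sequence shows some masked positions; the notebook itself uses 0.15.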


In [None]:
# Check that we have a GPU
!nvidia-smi

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

### We'll define the following config for the model

In [None]:
from transformers import BertConfig

# from_json_file is a classmethod, so there is no need to instantiate BertConfig first
config = BertConfig.from_json_file("./config.json")
config

Now let's re-create our tokenizer in transformers

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    "./gil-tokenizer",
    model_max_length=512,  # named `max_len` in older transformers releases
    do_lower_case=False,
)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [None]:
from transformers import BertForMaskedLM

model = BertForMaskedLM(config=config)

In [None]:
model.num_parameters()
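As a sanity check on the number printed above, the parameter count of a BERT-style masked LM can be estimated by hand. The sketch below is only an approximation under stated assumptions: it uses the 6 layers and 768 hidden size quoted in the introduction, and it assumes a vocabulary of 52,000 tokens, 514 position embeddings, a single token type, a 3072-wide feed-forward layer, a pooler, and output weights tied to the input embeddings. None of those assumed values come from this notebook's `config.json`, so your own count will differ if your config does.

```python
def estimate_mlm_parameters(vocab_size, hidden=768, layers=6, intermediate=3072,
                            max_positions=514, type_vocab=1, with_pooler=True):
    """Rough parameter count for a BERT/RoBERTa-style masked LM with tied embeddings."""
    layer_norm = 2 * hidden  # scale + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden if with_pooler else 0
    # LM head: dense + layer norm + output bias (decoder weights are tied to the embeddings).
    lm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + layers * per_layer + pooler + lm_head


# vocab_size=52_000 is an assumed value, not read from this notebook's config.
total = estimate_mlm_parameters(vocab_size=52_000)
print(f"{total / 1e6:.1f}M parameters")  # 84.1M parameters
```

Note that the number of attention heads does not appear in the formula: splitting the hidden size across heads changes how the projections are used, not how many weights they hold.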

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

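Here, as we only have one text file, the out-of-the-box `LineByLineTextDataset` would normally be enough. Because that class tokenizes the whole file up front and our corpus has tens of millions of lines, we use a small streaming variant built on `IterableDataset` instead.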

In [None]:
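%%time
import glob
import os
import shutil

# Concatenate every .txt file under ./Data into a single training file (done only once).
if not os.path.isfile("training_data.txt"):
    folder_path = './Data'
    file_paths = glob.glob(os.path.join(folder_path, "*.txt"))

    with open("training_data.txt", "wb") as outfile:
        for filename in file_paths:
            with open(filename, "rb") as infile:
                # Copy in 1 MiB chunks so large files never have to fit in memory
                shutil.copyfileobj(infile, outfile, length=1024*1024)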


In [None]:
import torch
from torch.utils.data import IterableDataset

class IterableLineByLineTextDataset(IterableDataset):
    """Stream a text file line by line, tokenizing each line on the fly."""

    # Hard-coded line counts for the two files used in this notebook.
    KNOWN_LENGTHS = {"./Validation.txt": 7851663, "./training_data.txt": 64640252}

    def __init__(self, tokenizer, file_path: str, block_size: int):
        self.tokenizer = tokenizer
        self.file_path = file_path
        self.block_size = block_size

    def __iter__(self):
        # Open the file on every pass so the dataset can be iterated more than once.
        with open(self.file_path, 'r', encoding='utf-8') as f:
            for line in f:
                # Truncate to block_size; padding is left to the data collator.
                batch_encoding = self.tokenizer(line, add_special_tokens=True, truncation=True,
                                                max_length=self.block_size)
                yield {"input_ids": torch.tensor(batch_encoding["input_ids"], dtype=torch.long)}

    def __len__(self):
        # __len__ must return an int, so fail loudly for files without a known count.
        if self.file_path not in self.KNOWN_LENGTHS:
            raise TypeError(f"Unknown length for {self.file_path}")
        return self.KNOWN_LENGTHS[self.file_path]

train_data = IterableLineByLineTextDataset(file_path="./training_data.txt", tokenizer=tokenizer, block_size=128)

In [None]:
validation = IterableLineByLineTextDataset(tokenizer=tokenizer, file_path="./Validation.txt", block_size=128)
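The dataset above relies on line counts that are written directly into `__len__`. If you swap in your own files, one way to obtain such a count without loading the file into memory is to stream it once. The helper below is a hypothetical sketch (the name `count_non_empty_lines` is not part of any library), and it assumes you only want to count lines that contain non-whitespace text; drop the `line.strip()` filter to count every line.

```python
import os
import tempfile


def count_non_empty_lines(path, encoding="utf-8"):
    """Count lines that contain something other than whitespace, streaming the file."""
    with open(path, "r", encoding=encoding) as handle:
        return sum(1 for line in handle if line.strip())


# Small self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("Olá, como está você.\n\n   \nO dia está lindo.\n")
    demo_count = count_non_empty_lines(demo_path)

print(demo_count)  # 2
```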

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, we need to define a `data_collator`.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.
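The batching part of that job can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library's implementation: sequences of different lengths are right-padded to the longest one, and an attention mask records which positions are real tokens. The helper name `pad_batch` and the pad id of `0` are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length id lists to a rectangle and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The real collator additionally applies the random masking needed for masked language modeling and returns tensors rather than lists.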

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

### Finally, we are all set to initialize our Trainer

In [None]:
import math

from transformers import Trainer, TrainingArguments
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup


# total_steps = 72_000

training_args = TrainingArguments(
    output_dir="./gilBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    # max_steps=total_steps,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    save_steps=8_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=5e-5,
)

optimizer = AdamW(model.parameters(), lr=training_args.learning_rate, eps=1e-8)
# The scheduler counts optimizer steps, not samples: one step consumes one batch
# (this assumes a single device and no gradient accumulation).
steps_per_epoch = math.ceil(len(train_data) / training_args.per_device_train_batch_size)
total_steps = int(training_args.num_train_epochs * steps_per_epoch)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=validation,
    optimizers=(optimizer, scheduler)
)
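The scheduler passed to the `Trainer` scales the base learning rate over the course of training. As a rough sketch of a linear schedule with warmup: the multiplier rises linearly from 0 to 1 during the warmup steps, then decays linearly to 0 at the last training step. The function name `linear_schedule_factor` is hypothetical; the step count assumes one device, no gradient accumulation, and the batch size and dataset length used in this notebook.

```python
import math


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# One optimizer step consumes one batch, so steps per epoch = ceil(samples / batch size).
num_samples = 64_640_252  # length reported by the training dataset above
batch_size = 64
steps_per_epoch = math.ceil(num_samples / batch_size)
print(steps_per_epoch)  # 1010004

base_lr = 5e-5
for step in (0, steps_per_epoch // 2, steps_per_epoch):
    print(step, base_lr * linear_schedule_factor(step, 0, steps_per_epoch))
```

With zero warmup steps, as configured here, the learning rate starts at its full value and reaches zero exactly at the final step, which is why the total must be expressed in steps rather than in samples.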

### Start training

In [None]:
%%time
import torch

trainer.train()

**Resume training if necessary** 

In [None]:
from transformers import Trainer

# Reuses `model`, `training_args`, `data_collator`, `train_data`, `validation`,
# `optimizer` and `scheduler` from the cells above; re-run those cells first
# if the kernel was restarted.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=validation,
    optimizers=(optimizer, scheduler)
)

# Resume from the most recent checkpoint-* folder inside training_args.output_dir.
# This assumes a transformers release whose Trainer.train accepts `resume_from_checkpoint`;
# you can also pass an explicit checkpoint folder path instead of True.
trainer.train(resume_from_checkpoint=True)

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
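trainer.save_model("./gilBERTo-model")
# The Trainer was not given the tokenizer, so save it alongside the model explicitly
tokenizer.save_pretrained("./gilBERTo-model")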

## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `[MASK]`) and return a list of the most probable filled sequences, with their probabilities.
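Conceptually, the last step of that pipeline turns the model's scores at the masked position into probabilities and keeps the best few candidates. The sketch below shows only that step, on a hypothetical four-word vocabulary with made-up logits; it is not the pipeline's actual code, and the name `top_k_fills` is introduced purely for illustration.

```python
import math


def top_k_fills(logits, vocab, k=3):
    """Softmax the logits at the masked position and return the k most likely tokens."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits for the masked position.
toy_vocab = ["um", "o", "muito", "idioma"]
toy_logits = [4.0, 2.0, 1.0, -1.0]
for token, prob in top_k_fills(toy_logits, toy_vocab, k=2):
    print(f"{token}: {prob:.3f}")
```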



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./gilBERTo-model",
    tokenizer=tokenizer
)

In [None]:
# "Portuguese is [MASK] language."
fill_mask("O português é [MASK] idioma.")

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:



In [None]:
# "The day is [MASK] beautiful."
fill_mask("O dia está [MASK] lindo.")

## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters), 
    - evaluation results,
    - intended uses & limitations
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)


If you want to take a look at models in different languages, check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)
