# Train a language model (Masked Language Modelling) from scratch using Huggingface Transformers and a custom tokenizer

### Extracted from the notebook accompanying this [blog post](https://huggingface.co/blog/how-to-train).


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” RoBERTa model (6 layers, 768 hidden size, 12 attention heads), which has the same number of layers and heads as DistilBERT, on the language of a clothing shop: product names and descriptions. We’ll then fine-tune the model on a downstream text-generation task.


# Loading the libraries

In [None]:
import os
import pandas as pd
import tqdm

# Loading the datasets

In [None]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Set the variables to the data folders:

In [None]:
#Set the path to the data folder, datafile and output folder and files
root_folder = '/content/drive/My Drive/'
data_folder = os.path.abspath(os.path.join(root_folder, 'datasets/text_gen_product_names'))
model_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/RoBERTaML'))
output_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names'))
tokenizer_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/TokRoBERTa'))

test_filename='cl_test_descriptions.csv'
datafile= 'product_names_desc_cl_train.csv'
outputfile = 'submission.csv'

datafile_path = os.path.abspath(os.path.join(data_folder,datafile))
testfile_path = os.path.abspath(os.path.join(data_folder,test_filename))
outputfile_path = os.path.abspath(os.path.join(output_folder,outputfile))

Load the train datafile with the product descriptions and names:

In [None]:
# Load the training dataset: product name and product description
train_df=pd.read_csv(datafile_path, header=0, usecols=[0,1])
print('Num Examples: ',len(train_df))
print('Null Values\n', train_df.isna().sum())
train_df.dropna(inplace=True)
print('Num Examples: ',len(train_df))

Num Examples:  31593
Null Values
 name           44
description     1
dtype: int64
Num Examples:  31548


Then, we read the test dataset:

In [None]:
# Load the test dataset: product descriptions only
test_df=pd.read_csv(testfile_path, header=0)
print('Num Examples: ',len(test_df))
print('Null Values\n', test_df.isna().sum())

Num Examples:  1441
Null Values
 description    0
dtype: int64


To train our Tokenizer we need to save every text example in our dataset to a txt file, including both the train and test dataset:

In [None]:
txt_files_dir = "./text_split"
!rm -rf {txt_files_dir}
!mkdir {txt_files_dir}

In [None]:
# Write each value of a dataframe column (a Series) to its own text file
def column_to_files(column, prefix, txt_files_dir):
    i = prefix
    # One file per value, named by a running index
    for row in column.to_list():
        file_name = os.path.join(txt_files_dir, str(i) + '.txt')
        try:
            with open(file_name, 'wb') as f:
                f.write(row.encode('utf-8'))
        except Exception as e:  # e.g. empty rows that are not strings
            print(row, e)
        i += 1

    # Return the next free index so a later call can continue the numbering
    return i


In [None]:
data = train_df["description"]
data = data.str.replace("\n", " ", regex=False)  # Series.replace would only match whole values
prefix=0
prefix = column_to_files(data, prefix, txt_files_dir)
print(prefix)

31548


In [None]:
data = test_df["description"]
data = data.str.replace("\n", " ", regex=False)  # Series.replace would only match whole values
print(len(data))
prefix = column_to_files(data, prefix, txt_files_dir)
print(prefix)

1441
32989


In [None]:
data = train_df["name"]
data = data.str.replace("\n", " ", regex=False)  # Series.replace would only match whole values
print(len(data))
prefix = column_to_files(data, prefix, txt_files_dir)
print(prefix)

31548
64537


# 3. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s pick its size to be 8,192 because our specific vocabulary is very limited and simple.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
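To illustrate why a byte-level vocabulary never needs `<unk>`, here is a minimal pure-Python sketch of BPE training. It is a toy under stated assumptions, not the `tokenizers` implementation: the function name `train_bpe`, the whitespace pre-splitting, and the tie-breaking rule are all choices made for illustration.

```python
from collections import Counter

def train_bpe(texts, num_merges):
    """Learn up to `num_merges` merges, starting from an alphabet of single bytes."""
    # Each word is a tuple of symbols; the initial symbols are single bytes.
    words = Counter()
    for text in texts:
        for word in text.split():
            words[tuple(bytes([b]) for b in word.encode("utf-8"))] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself (illustrative choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

corpus = ["dress dress dress", "dressing gown"]
merges = train_bpe(corpus, num_merges=4)
print(merges)
```

Because every string can be written as a sequence of bytes, any word that the merges do not cover still decomposes into known single-byte symbols.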


In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1
!pip install datasets==1.0.2

Uninstalling tensorflow-2.4.1:
  Successfully uninstalled tensorflow-2.4.1
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-sveetl9v
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-sveetl9v
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 8.0MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 50

In [None]:
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

from datasets import Dataset

Now we can train our tokenizer on the text files containing our vocabulary:

In [None]:
%%time 
paths = [str(x) for x in Path(".").glob("text_split/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer(lowercase=True)

# Customize training
tokenizer.train(files=paths, vocab_size=8192, min_frequency=1,
                show_progress=True,
                special_tokens=[
                                "<s>",
                                "<pad>",
                                "</s>",
                                "<unk>",
                                "<mask>",
])

CPU times: user 3.77 s, sys: 3.32 s, total: 7.1 s
Wall time: 2.19 s


In [None]:
tokenizer

Tokenizer(vocabulary_size=8192, model=ByteLevelBPE, add_prefix_space=False, lowercase=True, dropout=None, unicode_normalizer=None, continuing_subword_prefix=None, end_of_word_suffix=None, trim_offsets=False)

Now let's save the tokenizer to disk. We will use it later to train the language model:

In [None]:
#Save the Tokenizer to disk
tokenizer.save_model(tokenizer_folder)

['/content/drive/My Drive/Projects/SpainAI NLP/TokRoBERTa/vocab.json',
 '/content/drive/My Drive/Projects/SpainAI NLP/TokRoBERTa/merges.txt']

The count of samples is small and the tokenizer trains very fast.

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```

What is great is that our tokenizer is optimized for our very specific vocabulary. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. 
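The way `merges.txt` is used at encoding time can be sketched as follows: each line is one merge, and earlier lines take priority. This is a simplified illustration that works on characters rather than bytes; `apply_merges` and `toy_merges` are hypothetical names and data, not part of the `tokenizers` API.

```python
def apply_merges(word, merges):
    """Tokenize `word` by repeatedly applying the highest-priority merge available."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect every adjacent pair that has a learned merge, with its priority.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank = earliest line in merges.txt
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge list, in priority order (made up for this example).
toy_merges = [("s", "s"), ("r", "e"), ("re", "ss"), ("d", "ress")]
known = apply_merges("dress", toy_merges)
unseen = apply_merges("dresses", toy_merges)
print(known)   # ['dress']
print(unseen)  # ['dress', 'e', 's']
```

A word seen often during training collapses into a single token, while an unseen word falls back to smaller pieces instead of an unknown token.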

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens. Of course, you’ll also be able to use it directly from `transformers`.


In [None]:
tokenizer = ByteLevelBPETokenizer(
    os.path.abspath(os.path.join(tokenizer_folder,'vocab.json')),
    os.path.abspath(os.path.join(tokenizer_folder,'merges.txt'))
)

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

Let's show some examples:

In [None]:
tokenizer.encode("knit midi dress with vneckline straps.")

Encoding(num_tokens=9, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [None]:
tokenizer.encode("knit midi dress with vneckline straps.").tokens

['<s>',
 'knit',
 'Ġmidi',
 'Ġdress',
 'Ġwith',
 'Ġvneckline',
 'Ġstraps',
 '.',
 '</s>']
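The two things visible in this output, the `<s>`/`</s>` wrapping and the `Ġ` marker that stands for a leading space, can be sketched in plain Python. The helpers `postprocess` and `detokenize` are hypothetical names, and the sketch assumes that the truncation limit counts the two special tokens.

```python
def postprocess(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Truncate to leave room for the two special tokens, then wrap the sequence."""
    return [bos] + tokens[: max_length - 2] + [eos]

def detokenize(tokens, specials=("<s>", "</s>")):
    """Drop special tokens and turn the byte-level space marker back into a space."""
    return "".join(t for t in tokens if t not in specials).replace("Ġ", " ")

pieces = ["knit", "Ġmidi", "Ġdress", "Ġwith", "Ġvneckline", "Ġstraps", "."]
wrapped = postprocess(pieces)
print(wrapped)
print(detokenize(wrapped))  # knit midi dress with vneckline straps.

# With a tight limit, the text is cut but the special tokens are kept.
short = postprocess(pieces, max_length=4)
print(short)  # ['<s>', 'knit', 'Ġmidi', '</s>']
```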

# 4. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. to predict how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.


In [None]:
TRAIN_BATCH_SIZE = 16    # input batch size for training (default: 64)
VALID_BATCH_SIZE = 8    # input batch size for testing (default: 1000)
TRAIN_EPOCHS = 25        # number of epochs to train (default: 10)
LEARNING_RATE = 1e-4    # learning rate (default: 0.01)
SEED = 42               # random seed (default: 42)
MAX_LEN = 100
SUMMARY_LEN = 7

In [None]:
# Check that we have a GPU
!nvidia-smi

Sat Feb 27 09:00:07 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [None]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=8192,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_folder, model_max_length=128)  # `max_len` is the deprecated name of this argument

In [None]:
tokenizer

PreTrainedTokenizerFast(name_or_path='/content/drive/My Drive/Projects/SpainAI NLP/TokRoBERTa', vocab_size=8192, model_max_len=128, is_fast=True, padding_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [None]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [None]:
model.num_parameters()
# => 49 million parameters

49816064
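As a sanity check, the parameter count can be reproduced by hand from the config. This sketch assumes the `RobertaConfig` defaults that are not set above (hidden size 768, feed-forward size 3072), no pooling layer in the masked-LM model, and an output projection that shares its weights with the word embeddings.

```python
vocab_size, hidden, layers = 8192, 768, 6
max_positions, type_vocab = 514, 1
intermediate = 3072  # assumed RobertaConfig default for the feed-forward size

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    3 * linear(hidden, hidden)      # query, key and value projections
    + linear(hidden, hidden)        # attention output projection
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)

# LM head: dense + LayerNorm + output bias (the output matrix is tied to the embeddings)
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * encoder_layer + lm_head
print(total)  # 49816064
```

Roughly 42.5 million of the parameters sit in the six encoder layers; with a vocabulary this small, the embeddings account for only about 6.7 million.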

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.

In [None]:
train_files_dir = "./train_files"

Recreate the folder to contain the text files:

In [None]:
!rm -rf {train_files_dir}
!mkdir {train_files_dir}

In [None]:
train_split = 1
train_data_size = int(len(train_df)*train_split)
test_data_size = int(len(test_df)*train_split)
print('Len Train data: ', str(train_data_size),' Len Test data: ', str(test_data_size))

with open(os.path.join(train_files_dir,'train.txt') , 'w') as f:
    for item in train_df[:train_data_size].name.to_list():
        f.write("%s\n" % item)
    # Optionally, the test descriptions could also be added to the language-model training data
    #for item in test_df[:test_data_size].description.to_list():
    #    f.write("%s\n" % item)

Len Train data:  31548  Len Test data:  1441


### Create our dataset from the text file `train.txt`

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=os.path.join(train_files_dir,'train.txt'),
    block_size=32,
)



CPU times: user 1.03 s, sys: 30.7 ms, total: 1.06 s
Wall time: 618 ms
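Conceptually, a line-by-line dataset treats every non-blank line of the file as one training example and truncates it to `block_size` tokens. The sketch below shows that idea with a stand-in tokenizer; `line_by_line` and `toy_encode` are hypothetical helpers, and special tokens are left out for brevity.

```python
def line_by_line(lines, encode, block_size):
    """Encode every non-blank line separately and truncate it to `block_size` ids."""
    return [encode(line)[:block_size] for line in lines if line.strip()]

# Hypothetical stand-in tokenizer: one id per whitespace-separated word.
toy_vocab = {}

def toy_encode(text):
    return [toy_vocab.setdefault(word, len(toy_vocab)) for word in text.split()]

lines = ["knit midi dress", "", "round neck tshirt with short sleeves"]
examples = line_by_line(lines, toy_encode, block_size=4)
print(examples)  # [[0, 1, 2], [3, 4, 5, 6]]
```

The examples keep their own lengths at this stage; padding to a common length happens later, when a batch is assembled.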


## TO DO: **CREATE A DATASET TO TRAIN INSTEAD OF FROM FILES**

https://ryanong.co.uk/2020/06/11/day-163-how-to-build-a-language-model-from-scratch-implementation/

In [None]:
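# `Dataset` imported earlier is the Hugging Face `datasets.Dataset`; a PyTorch map-style dataset subclasses torch.utils.data.Dataset
class EsperantoDataset(torch.utils.data.Dataset):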
    def __init__(self, evaluate: bool = False):
        tokenizer = ByteLevelBPETokenizer(
            "./esperberto/vocab.json",
            "./esperberto/merges.txt",
        )
        tokenizer._tokenizer.post_processor = BertProcessing(
            ("</s>", tokenizer.token_to_id("</s>")),
            ("<s>", tokenizer.token_to_id("<s>")),
        )
        tokenizer.enable_truncation(max_length=512)
        # or use the RobertaTokenizer from `transformers` directly.

        self.examples = []

        src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("final_data.txt")
        for src_file in src_files:
            print("🔥", src_file)
            lines = src_file.read_text(encoding="utf-8").splitlines()
            self.examples += [x.ids for x in tokenizer.encode_batch(lines)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return torch.tensor(self.examples[i])

In [None]:
# Concatenate the train dataset and the test dataset for language modelling
df=pd.concat([train_df['description'], test_df['description']], axis=0)
print(len(df), len(train_df), len(test_df))
print(df.head(5))

32989 31548 1441
0       towel with border with lines metallic thread .
1    printed bermuda shorts made technical fabric ....
2    bodysuit with shapewear effect . this bodysuit...
3    vneck with thin adjustable straps.height model...
4    puritan collar dress featuring long sleeves wi...
Name: description, dtype: object


In [None]:
df[:100]

0        towel with border with lines metallic thread .
1     printed bermuda shorts made technical fabric ....
2     bodysuit with shapewear effect . this bodysuit...
3     vneck with thin adjustable straps.height model...
4     puritan collar dress featuring long sleeves wi...
                            ...                        
95                            bracelet with appliqués .
96    teardropshaped dangle earrings with rhinestone...
97    round neck tshirt with short sleeves . contras...
98                 pack pairs plain socks.one size only
99    flat mules caramel . tortoiseshelleffect vinyl...
Name: description, Length: 100, dtype: object

In [None]:
dataset=Dataset.from_pandas(df.to_frame())
dataset.remove_columns_(['__index_level_0__'])
print(dataset)

Dataset(features: {'description': Value(dtype='string', id=None)}, num_rows: 32989)


In [None]:
dataset.set_format(type='torch', columns=['description'])
dataset.remove_columns_(['__index_level_0__'])
print(dataset)

Dataset(features: {'description': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None)}, num_rows: 32989)


In [None]:
dataset

<transformers.data.datasets.language_modeling.LineByLineTextDataset at 0x7fe8d19de4d0>

## Define the Data Collator for masking our language

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
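To make the role of `mlm_probability` concrete, here is a minimal pure-Python sketch of the masking rule introduced with BERT: select about 15% of the tokens, then replace 80% of those with `<mask>`, 10% with a random token, and leave 10% unchanged. The function `mask_tokens` and its arguments are hypothetical names for illustration; the real collator applies the same idea to padded tensor batches.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=random):
    """Return (inputs, labels) for one sequence, following the 80/10/10 rule."""
    inputs = list(ids)
    labels = [-100] * len(ids)  # -100 marks positions that the loss ignores
    for i, token in enumerate(ids):
        # Never mask special tokens; otherwise select with probability mlm_probability.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

rng = random.Random(42)
ids = [0] + list(range(5, 25)) + [2]  # <s> ... </s>, using the ids from vocab.json
inputs, labels = mask_tokens(
    ids, mask_id=4, vocab_size=8192, special_ids={0, 1, 2, 3, 4}, rng=rng
)
print(inputs)
print(labels)
```

Only the selected positions carry a label, so the loss is computed on roughly 15% of the tokens in each batch.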

### Finally, we are all set to initialize our Trainer

In [None]:
model_folder = os.path.abspath(os.path.join(root_folder, 'Projects/SpainAI NLP/DecRoBERTaML'))
print(model_folder)

/content/drive/My Drive/Projects/SpainAI NLP/DecRoBERTaML


Set the training arguments for our model:

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=model_folder,
    overwrite_output_dir=True,
    num_train_epochs=15,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    save_steps=8192,
    eval_steps=4096,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    #prediction_loss_only=True,
)

### Start training

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 7.2501 |
| 1000 | 6.3738 |
| 1500 | 6.0786 |
| 2000 | 5.8728 |
| 2500 | 5.6212 |
| 3000 | 5.494  |
| 3500 | 5.3428 |
| 4000 | 5.2581 |
| 4500 | 5.031  |
| 5000 | 4.9632 |


TrainOutput(global_step=29580, training_loss=4.028654971067933, metrics={'train_runtime': 1177.1714, 'train_samples_per_second': 25.128, 'total_flos': 1187514138046464.0, 'epoch': 15.0, 'init_mem_cpu_alloc_delta': 47230, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 326984, 'train_mem_gpu_alloc_delta': 599612928, 'train_mem_cpu_peaked_delta': 52578005, 'train_mem_gpu_peaked_delta': 200973312})
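The `global_step` value can be cross-checked with a little arithmetic, assuming a single device and no gradient accumulation (neither is set in the training arguments above):

```python
import math

num_examples = 31548  # lines written to train.txt
batch_size = 16       # TRAIN_BATCH_SIZE
epochs = 15           # num_train_epochs
save_steps = 8192

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs
checkpoints_written = total_steps // save_steps

print(steps_per_epoch, total_steps, checkpoints_written)  # 1972 29580 3
```

With `save_total_limit=1`, only the most recent of those checkpoints is kept on disk.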

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
trainer.save_model(model_folder)

## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model_folder,   # folder written by trainer.save_model()
    tokenizer=tokenizer   # the RobertaTokenizerFast loaded earlier
)

Some weights of RobertaModel were not initialized from the model checkpoint at /content/drive/My Drive/Projects/SpainAI NLP/TokRoBERTa and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# knit midi dress with vneckline
# =>
fill_mask("midi <mask> with vneckline.")

[{'score': 0.37433964014053345,
  'sequence': 'midi dress with vneckline.',
  'token': 482,
  'token_str': ' dress'},
 {'score': 0.33222395181655884,
  'sequence': 'midi skirt with vneckline.',
  'token': 769,
  'token_str': ' skirt'},
 {'score': 0.035536717623472214,
  'sequence': 'midi crop with vneckline.',
  'token': 1693,
  'token_str': ' crop'},
 {'score': 0.023702150210738182,
  'sequence': 'midi sleeve with vneckline.',
  'token': 469,
  'token_str': ' sleeve'},
 {'score': 0.0199593435972929,
  'sequence': 'midi vest with vneckline.',
  'token': 2315,
  'token_str': ' vest'}]
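The scores are probabilities: the model produces one logit per vocabulary entry at the `<mask>` position, a softmax turns the logits into a distribution, and the pipeline reports the top entries. A minimal sketch with a made-up four-token vocabulary and made-up logits (`top_k_fill` and `toy_tokens` are hypothetical, for illustration only):

```python
import math

def top_k_fill(logits, id_to_token, k=5):
    """Softmax the mask-position logits and return the k most probable tokens."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": id_to_token[i]} for i in best]

# Hypothetical vocabulary and logits, not taken from the trained model.
toy_tokens = {0: " dress", 1: " skirt", 2: " crop", 3: " vest"}
results = top_k_fill([2.0, 1.9, -0.3, -0.9], toy_tokens, k=2)
print(results)
```

Two candidates with close logits, like ` dress` and ` skirt` in the real output above, end up with similar scores, which is what an uncertain but sensible model looks like.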

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:



In [None]:
fill_mask("Jen la komenco de bela <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

[{'score': 0.01814725436270237,
  'sequence': '<s> Jen la komenco de bela urbo.</s>',
  'token': 871},
 {'score': 0.015888698399066925,
  'sequence': '<s> Jen la komenco de bela vivo.</s>',
  'token': 1160},
 {'score': 0.015662025660276413,
  'sequence': '<s> Jen la komenco de bela tempo.</s>',
  'token': 1021},
 {'score': 0.015555007383227348,
  'sequence': '<s> Jen la komenco de bela mondo.</s>',
  'token': 945},
 {'score': 0.01412549614906311,
  'sequence': '<s> Jen la komenco de bela tago.</s>',
  'token': 1633}]

## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters), 
    - evaluation results,
    - intended uses & limitations
    - whatever else is helpful! 🤓