### Code adapted from huggingface.co

Check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)


## Train a tokenizer

In [2]:
!pip install git+https://github.com/huggingface/transformers
!pip show transformers tokenizers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to c:\users\david\appdata\local\temp\pip-req-build-d37_a15u
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Collecting sacremoses
  Downloading sacremoses-0.0.43.tar.gz (883 kB)
Collecting tokenizers==0.9.4
  Downloading tokenizers-0.9.4-cp38-cp38-win_amd64.whl (1.9 MB)
Building wheels for collected packages: transformers, sacremoses
  Building wheel for transformers (PEP 517): started
  Building wheel for transformers (PEP 517): finished with status 'done'
  Created wheel for transformers: filename=transformers-4.2.0.dev0-py3-none-any.whl size=1531023 sha256=19ee441ec121ce07bedd718f5fd58b3dd7c9d29dbb89dc1ed280072ab


In [1]:
import tokenizers
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
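
The imports above are not used anywhere else in the notebook, and the later cells load a tokenizer from a `tokenizer` directory, so the cell that trains it appears to be missing. A minimal sketch of what such a cell might look like, assuming the corpus is the same `datasets/all_seqs_text.txt` file used for the training dataset, a vocabulary of 5000 to match the model config, and RoBERTa's usual special tokens. The helper name `train_tokenizer` and `min_frequency=2` are illustrative choices, not taken from the original:

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer


def train_tokenizer(corpus_path, out_dir="tokenizer", vocab_size=5000):
    """Train a byte-level BPE tokenizer and write vocab.json / merges.txt to out_dir."""
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[str(corpus_path)],
        vocab_size=vocab_size,
        min_frequency=2,  # assumption: skip pairs seen fewer than twice
        # Order matters: it gives <s>=0, <pad>=1, </s>=2, the ids RoBERTa expects by default.
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    # save_model needs an existing directory.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    tokenizer.save_model(str(out_dir))
    return tokenizer
```

Calling `train_tokenizer("datasets/all_seqs_text.txt")` would then produce the directory that `RobertaTokenizerFast.from_pretrained("tokenizer", ...)` reads.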

In [2]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=5000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=1024,
    max_position_embeddings=128,
    type_vocab_size=1,
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3
)

Now let's re-create our tokenizer in transformers, loading its files from the `tokenizer` directory.

In [3]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer", max_len=64)

Finally, let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [4]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [5]:
model.num_parameters()

6123400
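
The figure of 6,123,400 can be reproduced by hand from the config, which is a quick check that the model has the intended shape. The sketch below assumes the standard RoBERTa layout: learned word, position and token-type embeddings, identical encoder layers, and a masked-LM head whose output projection shares its weights with the word embeddings. `count_roberta_mlm_params` is an illustrative helper, not part of transformers.

```python
def count_roberta_mlm_params(vocab, hidden, layers, intermediate, max_pos, type_vocab):
    """Count trainable parameters of a RoBERTa masked-LM model from its config sizes."""
    layer_norm = 2 * hidden  # weight + bias

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias

    # Word, position and token-type tables, followed by one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Self-attention: Q, K, V and output projections, then LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    # Feed-forward: expand to intermediate size, project back, then LayerNorm.
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)

    # LM head: dense + LayerNorm + output bias. The decoder weight is tied to the
    # word embeddings, so it adds no parameters of its own.
    lm_head = linear(hidden, hidden) + layer_norm + vocab

    return embeddings + encoder + lm_head


total = count_roberta_mlm_params(
    vocab=5000, hidden=256, layers=6, intermediate=1024, max_pos=128, type_vocab=1
)
print(total)  # 6123400, matching model.num_parameters()
```

The number of attention heads does not appear in the count: the heads split the 256-dimensional projections between them without adding weights. About a fifth of the total (1.28M) sits in the word-embedding matrix.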

### Now let's build our training Dataset

In [6]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="datasets/all_seqs_text.txt",
    block_size=128,
)



Wall time: 1min 7s
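
One thing worth checking here is how `block_size=128` interacts with `max_position_embeddings=128` in the config. RoBERTa numbers positions starting at `padding_idx + 1` (2 with the default pad id of 1), so a position table with 128 rows can only index sequences of up to 126 tokens. Whether that matters depends on how long the lines in the corpus are, which the notebook does not show. A pure-Python sketch of the position numbering, with `roberta_position_ids` as an illustrative name:

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Mimic RoBERTa's position numbering: non-pad tokens count up from padding_idx + 1."""
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)  # all pad tokens share the padding position
        else:
            seen += 1
            position_ids.append(padding_idx + seen)
    return position_ids


max_position_embeddings = 128  # value from the RobertaConfig in this notebook

for length in (126, 128):
    ids = [0] + [10] * (length - 2) + [2]  # <s> ... </s> with made-up token ids, no padding
    highest = max(roberta_position_ids(ids))
    fits = highest < max_position_embeddings  # valid embedding rows are 0..127
    print(length, highest, fits)
```

If the corpus does contain lines that tokenize to more than 126 tokens, lowering `block_size` to 126 or raising `max_position_embeddings` to 130 would keep every position in range.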


Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a `data_collator`.

This is a small helper that batches samples of the dataset into tensors the model can train on. With `mlm=True` it also masks a random subset of the tokens (15% here) and builds the labels the model is trained to predict.

In [9]:
import transformers

In [10]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
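
With `mlm=True`, the collator picks about 15% of the non-special tokens in each batch as prediction targets, replaces 80% of those with `<mask>`, 10% with a random token, and leaves 10% unchanged. Every other position gets the label -100 so the loss ignores it. A simplified pure-Python sketch of that rule for a single sequence follows; `mask_tokens`, the token ids and the fixed seed are illustrative, and the real collator works on padded tensors rather than lists.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence using the 80/10/10 MLM rule."""
    rng = rng or random.Random()
    inputs, labels = list(input_ids), []
    for i, token_id in enumerate(input_ids):
        # Special tokens are never chosen; other tokens are chosen with mlm_probability.
        if token_id in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # ignored by the cross-entropy loss
            continue
        labels.append(token_id)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


ids = [0] + list(range(10, 40)) + [2]  # <s>, 30 made-up token ids, </s>
masked, labels = mask_tokens(
    ids, mask_id=4, vocab_size=5000, special_ids={0, 1, 2}, rng=random.Random(0)
)
print(masked)
print(labels)
```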

### Finally, we are all set to initialize our Trainer

In [16]:
from transformers import Trainer, TrainingArguments

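training_args = TrainingArguments(
    seed=0,
    output_dir='roberta',
    overwrite_output_dir=True,
    num_train_epochs=6,
    per_device_train_batch_size=256,
    save_steps=50,          # write a checkpoint every 50 steps
    logging_steps=50,
    save_total_limit=5,     # keep only the 5 most recent checkpoints
    tpu_num_cores=8,        # only relevant when launching on TPU
    prediction_loss_only=True,
    warmup_steps=500,
    model_parallel=False,   # accepted by the 4.2.0.dev0 build installed above; other versions may reject it
)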

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
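
`warmup_steps=500` matters because, by default, the Trainer uses a linear schedule: the learning rate climbs from 0 to its peak over the warmup steps and then decays linearly to 0 at the last step. A small sketch of that curve, assuming the default peak learning rate of 5e-5 (none is set above) and a made-up total of 3000 optimizer steps, since the real total depends on the dataset size:

```python
def linear_schedule_lr(step, total_steps, warmup_steps=500, peak_lr=5e-5):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical total. On a single device without gradient accumulation it would be
# num_train_epochs * ceil(len(dataset) / per_device_train_batch_size).
total_steps = 3000
for step in (0, 250, 500, 1750, 3000):
    print(step, linear_schedule_lr(step, total_steps))
```

If the whole run is shorter than 500 steps, training ends while the rate is still ramping up, so it is worth comparing `warmup_steps` against the expected step count.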

In [14]:
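import os
# Set these before anything initializes CUDA, otherwise they have no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs by PCI bus ID, as nvidia-smi does
os.environ["CUDA_VISIBLE_DEVICES"] = "0"        # expose only GPU 0 to this process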

### Start training

In [None]:
trainer.train()

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
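trainer.save_model('existing_roberta')
# The Trainer above was not given a tokenizer, so save it explicitly next to the model.
tokenizer.save_pretrained('existing_roberta')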