<a href="https://colab.research.google.com/github/arnaudmkonan/Transformers-text-classification/blob/master/notebooks/01_how_to_train_from_scratch_LM_ft_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title
%%html
<div style="background-color: pink;">
  Notebook written in collaboration with <a href="https://github.com/aditya-malte">Aditya Malte</a>.
  <br>
  The Notebook is on GitHub, so contributions are more than welcome.
</div>
<br>
<div style="background-color: yellow;">
  Aditya wrote another notebook with a slightly different use case and methodology, please check it out.
  <br>
  <a target="_blank" href="https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b">
    https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b
  </a>
</div>


# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition of the [blog post](https://huggingface.co/blog/how-to-train). Last update: May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299 MB, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which comprises text from diverse sources like news, literature, and Wikipedia.

The final training corpus has a size of 3 GB, which is still small: the more data you can pretrain on, the better your model's results will be.



In [7]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2022-10-12 05:33:23--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 18.64.174.50, 18.64.174.35, 18.64.174.56, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|18.64.174.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 312733741 (298M) [text/plain]
Saving to: ‘oscar.eo.txt’


2022-10-12 05:33:30 (43.2 MB/s) - ‘oscar.eo.txt’ saved [312733741/312733741]



## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
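To make the "no unknown tokens" point concrete, here is a minimal sketch that only uses Python's built-in UTF-8 encoding, not the `tokenizers` library. The helper name `to_byte_symbols` and the sample word are illustrative assumptions. It shows that any string, including Esperanto diacritics, breaks down into byte values between 0 and 255, which is the base alphabet a byte-level tokenizer starts from.

```python
# Illustration only: every string decomposes into UTF-8 bytes (values 0-255),
# so a base alphabet of 256 symbols can represent any input text.
def to_byte_symbols(text: str) -> list:
    """Return the UTF-8 byte values that make up `text`."""
    return list(text.encode("utf-8"))


esperanto_word = "ĉirkaŭ"  # sample word with two non-ASCII characters
byte_symbols = to_byte_symbols(esperanto_word)

# Non-ASCII characters simply expand to several bytes each
print(len(esperanto_word), "characters ->", len(byte_symbols), "bytes")
```

Since the merges are learned on top of these bytes, a rare word is at worst split into many small tokens rather than replaced by `<unk>`.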


In [2]:
 !pip install deepparse compress-pickle~=2.0.1 datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepparse
  Downloading deepparse-0.9.2-py3-none-any.whl (204 kB)
Collecting compress-pickle~=2.0.1
  Downloading compress_pickle-2.0.1-py3-none-any.whl (24 kB)
Collecting pymagnitude-light
  Downloading pymagnitude_light-0.1.147-py3-none-any.whl (35 kB)
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
Collecting bpemb
  Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Collecting poutyne
  Downloading Poutyne-1.12.1-py3-none-any.whl (211 kB)
Collecting gensim>=4.0.0
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1

In [5]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Found existing installation: tensorflow 2.8.2+zzzcolab20220929150707
Uninstalling tensorflow-2.8.2+zzzcolab20220929150707:
  Successfully uninstalled tensorflow-2.8.2+zzzcolab20220929150707
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-4b29cuw9
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-4b29cuw9
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163

In [21]:
import os
import compress_pickle
import pickle
from deepparse import download_from_public_repository
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser
import shutil
from poutyne import set_seeds
import poutyne
import timeit


# ------
import pandas as pd
from tqdm import tqdm_notebook
import glob
from pathlib import Path

seed = 42
set_seeds(seed)

In [4]:
os.makedirs("dataset", exist_ok=True)  # do not fail when the cell is re-run
download_from_public_repository("dataset/data", "", file_extension="zip")

In [6]:
# First, let's decompress the archive
archive_root_path = os.path.join("dataset")
archive_path = os.path.join(archive_root_path, "data.zip")

# Unzip the archive
shutil.unpack_archive(archive_path, archive_root_path)

# Delete the archive
os.remove(archive_path)

In [9]:
# The script's functions, with minor modifications to take regular arguments
# instead of CLI-parsed arguments

# Function to handle the file paths
def absolute_file_paths(directory):
    """
    Yield the absolute path of every ``.lzma`` file found under a directory.
    """
    for dir_path, _, filenames in os.walk(directory):
        for f in filenames:
            if f.endswith(".lzma"):
                yield os.path.abspath(os.path.join(dir_path, f))


# Function to LZMA decompress the files_directory into the path_to_save directory
def lzma_decompress(files_directory, root_path_to_save) -> None:
    """
    Decompress the dataset from LZMA-compressed files into plain pickle files.
    """
    paths = absolute_file_paths(files_directory)

    for path in paths:
        pickled_data = compress_pickle.load(path, compression="lzma")
        # gb.lzma -> gb.p
        filename = os.path.basename(path).replace(".lzma", ".p")
        # Keep the three parent directories (e.g. data/clean_data/train)
        file_path = os.path.join(*path.split(os.path.sep)[-4:-1])
        path_to_save = os.path.join(root_path_to_save, file_path)
        os.makedirs(path_to_save, exist_ok=True)
        with open(os.path.join(path_to_save, filename), "wb") as file:
            pickle.dump(pickled_data, file)
        # Remove the compressed file once its pickle has been written
        os.remove(path)
In [10]:
# Define the dataset directories. The decompression itself runs in the next cell and takes several minutes.

root_dir = os.path.join("dataset", "data")
clean_root_dir = os.path.join(root_dir, "clean_data")
clean_train_directory = os.path.join(clean_root_dir, "train")
clean_test_directory = os.path.join(clean_root_dir, "test")

In [11]:
# Decompress the entire dataset
lzma_decompress(root_dir, "dataset")

In [13]:
clean_root_dir = os.path.join(root_dir, "clean_data")
clean_train_directory = os.path.join(clean_root_dir, "train")
clean_test_directory = os.path.join(clean_root_dir, "test")

In [17]:
uk_training_data_path = os.path.join(clean_train_directory, "gb.p")
uk_test_data_path = os.path.join(clean_test_directory, "gb.p")

uk_training_container = PickleDatasetContainer(uk_training_data_path)
uk_test_container = PickleDatasetContainer(uk_test_data_path)

# USA 

us_training_data_path = os.path.join(clean_train_directory, "us.p")
us_test_data_path = os.path.join(clean_test_directory, "us.p")

us_training_container = PickleDatasetContainer(us_training_data_path)
us_test_container = PickleDatasetContainer(us_test_data_path)

In [14]:
Labels = ['Province',
          'Municipality',
          'PostalCode',
          'StreetNumber',
          'StreetName',
          'Orientation',
          'Unit',
          'BuildingNumber',
          'Floor',
          'Room',
#           'GeneralDelivery',
          'AddressOther']

In [15]:
# Map each label name to a 1-based integer id (0 is kept for unknown labels)
label_mapper = {label: idx for idx, label in enumerate(Labels, start=1)}
label_mapper

{'Province': 1,
 'Municipality': 2,
 'PostalCode': 3,
 'StreetNumber': 4,
 'StreetName': 5,
 'Orientation': 6,
 'Unit': 7,
 'BuildingNumber': 8,
 'Floor': 9,
 'Room': 10,
 'AddressOther': 11}

In [16]:
from typing import Dict, Optional, List, Mapping

def iob_mapper(address: str, labels: List, maps: Optional[Dict]=None, id: Optional[str]=None) -> Mapping:
    """
    Convert a whitespace-tokenized address and its per-token labels into a
    record with integer tags. Labels missing from the mapper get tag 0.

    The dataset uses the following labels:
    - ``"StreetNumber"``: for the street number,
    - ``"StreetName"``: for the name of the street,
    - ``"Unit"``: for the unit (such as an apartment),
    - ``"Municipality"``: for the municipality or city,
    - ``"Province"``: for the province or local region or state,
    - ``"PostalCode"``: for the postal code,
    - ``"Orientation"``: for the street orientation (e.g. west, east),
    - ``"GeneralDelivery"``: for other delivery information,
    - ``"EOS"``: (End Of Sequence) since we use an EOS during training, sometimes the models return an EOS tag.
    """
    tokens = address.split()
    assert len(tokens) == len(labels), (
        "Some labels are missing, please check that "
        "you have as many labels as address components"
    )

    # Fall back to a record id of 0 when none is given
    rec_id = id if id else 0
    # If no label mapper is provided, assume the full label mapper
    mapper = maps if maps else label_mapper

    dct = {"id": rec_id,
           "ner_tag": [mapper.get(key, 0) for key in labels],
           "tokens": tokens
          }
    return dct

In [19]:
iob_mapper(us_training_container.data[0][0],us_training_container.data[0][1])

{'id': 0,
 'ner_tag': [4, 5, 5, 2, 2, 1, 3],
 'tokens': ['3572',
  'silverwood',
  'roa',
  'w',
  'sacramento',
  'california',
  '95691']}
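When inspecting records like the one above, it can help to turn the integer tags back into label names. The following is a small sketch of that reverse lookup. The helper `decode_tags` and the fallback name `"O"` for tag 0 are assumptions made for this illustration, and only the first five labels are listed to keep it short.

```python
# First five label ids as built earlier in this notebook (ids start at 1)
labels = ["Province", "Municipality", "PostalCode", "StreetNumber", "StreetName"]
id_to_label = {idx: label for idx, label in enumerate(labels, start=1)}


def decode_tags(record: dict) -> list:
    """Pair each token with its label name; unknown ids map to 'O'."""
    return [
        (token, id_to_label.get(tag, "O"))
        for token, tag in zip(record["tokens"], record["ner_tag"])
    ]


# The record printed by the cell above
record = {
    "id": 0,
    "ner_tag": [4, 5, 5, 2, 2, 1, 3],
    "tokens": ["3572", "silverwood", "roa", "w", "sacramento", "california", "95691"],
}
decoded = decode_tags(record)
for token, label in decoded:
    print(f"{token:12s} {label}")
```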

In [22]:
def read_pickle_data(iso2: str) -> tuple:
    """
    Load the train and test containers for a country, given its
    two-letter code (e.g. "us" or "gb").
    """
    root_dir = os.path.join("dataset", "data")
    clean_root_dir = os.path.join(root_dir, "clean_data")

    clean_train_directory = os.path.join(clean_root_dir, "train")
    clean_test_directory = os.path.join(clean_root_dir, "test")

    training_data_path = os.path.join(clean_train_directory, f"{iso2}.p")
    test_data_path = os.path.join(clean_test_directory, f"{iso2}.p")

    training_container = PickleDatasetContainer(training_data_path)
    test_container = PickleDatasetContainer(test_data_path)
    return training_container, test_container

def extrac_text_address(iso2: str) -> List:
    """Load the raw training records for a country."""
    list_of_addresses = pd.read_pickle(f'dataset/data/clean_data/train/{iso2}.p')
    return list_of_addresses

def collect_to_txt(path):
    """Write the address text of every record in a pickle file to training_text/, one address per line."""
    save_path = "training_text/" + str(path).replace("/", "-").replace('.p', '.txt')
    _pkl = pd.read_pickle(path)
    # Each record is an (address, labels) pair; keep only the address text
    pd.DataFrame([record[0] for record in _pkl]).to_csv(save_path, header=None, index=None)

def get_all_text_data(path: str) -> pd.DataFrame:
    # NOTE: unfinished stub; it gathers the pickle paths but does not build a DataFrame yet
    paths = [str(x) for x in Path(".").glob("**/*.p")]
    return

In [23]:
!mkdir -p training_text


In [27]:

# Convert every pickle file into a plain-text file of addresses
for pickle_path in Path(".").glob("**/*.p"):
    collect_to_txt(pickle_path)

In [28]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

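# Only pick up the address files written to training_text/ (not other .txt files such as oscar.eo.txt)
paths = [str(x) for x in Path("training_text").glob("*.txt")]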

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 18min 6s, sys: 5.17 s, total: 18min 12s
Wall time: 4min 56s


Now let's save files to disk

In [29]:
!mkdir -p AddressBERTo
tokenizer.save_model("AddressBERTo")

['AddressBERTo/vocab.json', 'AddressBERTo/merges.txt']

In [31]:
!head -n 10 AddressBERTo/merges.txt

#version: 0.2 - Trained by `huggingface/tokenizers`
Ġ c
Ġ s
0 0
a n
e n
e r
Ġ m
Ġ d
Ġ 1


In [30]:
!head -n 10 AddressBERTo/vocab.json

{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":127,"¿":128,"À":129,"Á":130,"Â":131,"Ã":132,"Ä":133,"Å":134,"Æ":135,"

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency and mapped to their ids, and a `merges.txt`, which lists the learned merges in the order they are applied.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
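To illustrate how a `merges.txt` list is used, here is a simplified sketch of applying merge rules to a sequence of symbols. It is a toy version written for this explanation, not the `tokenizers` implementation: the function `apply_merges` is hypothetical, and the rules are the first four merges from the excerpt above.

```python
def apply_merges(symbols, merges):
    """Greedily apply merge rules in rank order (earlier rules have priority)."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# First four merges from the excerpt above ("Ġ" marks a leading space)
toy_merges = [("l", "a"), ("Ġ", "k"), ("o", "n"), ("Ġ", "la")]
tokens = apply_merges(["Ġ", "l", "a"], toy_merges)
print(tokens)
```

Here `l` and `a` merge first, and the result then merges with the leading-space marker, so the word " la" ends up as a single token.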

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [32]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./AddressBERTo/vocab.json",
    "./AddressBERTo/merges.txt",
)

In [33]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [34]:
tokenizer.encode("центральная ул дом 1 пустынка татарстан 423538.")

Encoding(num_tokens=17, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [35]:
tokenizer.encode("центральная ул дом 1 пустынка татарстан 423538.").tokens

['<s>',
 'ÑĨÐµÐ½ÑĤÑĢÐ°Ð»ÑĮÐ½Ð°Ñı',
 'ĠÑĥÐ»',
 'ĠÐ´Ð¾Ð¼',
 'Ġ1',
 'ĠÐ¿',
 'Ñĥ',
 'ÑģÑĤ',
 'Ñĭ',
 'Ð½',
 'ÐºÐ°',
 'ĠÑĤÐ°ÑĤÐ°ÑĢÑģÑĤÐ°Ð½',
 'Ġ42',
 '35',
 '38',
 '.',
 '</s>']

In [36]:
tokenizer.encode("1060 gilbert stree southeast apt 302 atlanta georgia 30316").tokens

['<s>',
 '1060',
 'Ġgilbert',
 'Ġstree',
 'Ġsoutheast',
 'Ġapt',
 'Ġ302',
 'Ġatlanta',
 'Ġgeorgia',
 'Ġ30316',
 '</s>']

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like model with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. predicting how to fill in tokens that we randomly mask in the dataset. This is taken care of by the example script.
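As a rough picture of what masking means in practice, here is a simplified pure-Python sketch. It is not the library's implementation: the function `mask_tokens` is hypothetical, it ignores special tokens, and it assumes the usual BERT-style recipe in which a selected token is replaced by `<mask>` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The label `-100` marks positions that do not contribute to the loss.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for one sequence; labels are -100 where no loss applies."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position left untouched, no loss computed here
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


token_ids = list(range(100, 140))  # made-up token ids
inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=52_000)
```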


In [37]:
# Check that we have a GPU
!nvidia-smi

Wed Oct 12 06:05:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [38]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [39]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [41]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./AddressBERTo", model_max_length=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [42]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [43]:
model.num_parameters()
# => 84 million parameters

83504416
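The parameter count can be checked by hand. The sketch below redoes the arithmetic under a few assumptions that are not set explicitly in the config above: a hidden size of 768 and a feed-forward size of 3072 (taken here to be the defaults), no pooling layer, and output decoder weights shared with the input embeddings so that only the decoder bias adds parameters.

```python
vocab_size, hidden, max_positions, type_vocab = 52_000, 768, 514, 1
layers, intermediate = 6, 3072  # intermediate size assumed to be the default


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value and attention output
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + layer_norm
)

# LM head: dense + LayerNorm + decoder bias (decoder weights assumed tied)
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * encoder_layer + lm_head
print(f"{total:,}")
```

Under these assumptions, roughly half of the parameters sit in the embedding matrix, which is why the vocabulary size matters so much for a small model.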

### Now let's build our training Dataset

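We'll build our dataset by applying our tokenizer to our text files.

With a single text file we could use `LineByLineTextDataset` out-of-the-box. Here we have many files, and `LineByLineTextDataset` expects one file path, so passing a glob pattern such as `**/*.txt` raises a `ValueError`. We therefore load the files with `load_dataset` from the `datasets` library instead.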

In [52]:
# !pip install datasets
from datasets import load_dataset

In [54]:
from transformers import TextDataset

In [66]:
tokenizer.from_pretrained

<bound method PreTrainedTokenizerBase.from_pretrained of <class 'transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast'>>

In [48]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="**/*.txt",
    block_size=128,
)



ValueError: ignored

In [77]:
# load_dataset('txt', path='training_text')
dataset = load_dataset("text", data_files={"train": paths})
dataset

Resolving data files:   0%|          | 0/121 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 43834509
    })
})

In [None]:
# https://huggingface.co/docs/datasets/nlp_process
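# Tokenize in batches; padding is left to the data collator, so no tensors are built here
dataset = dataset.map(
    lambda examples: tokenizer(examples["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)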
dataset['train'][0]

  0%|          | 0/43835 [00:00<?, ?ba/s]

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.
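To show the batching part of that job in isolation, here is a minimal pure-Python sketch of padding variable-length sequences to a common length. The helper `pad_batch` is hypothetical and leaves out the masking step. The pad id of 1 matches the `<pad>` entry in the `vocab.json` shown earlier, and the token ids are made up.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths
batch = pad_batch([[0, 20, 21, 2], [0, 22, 2]], pad_id=1)
print(batch)
```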

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

### Finally, we are all set to initialize our Trainer

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./AddressBERTo",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=64,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset["train"],  # pass the train split, not the whole DatasetDict
)

### Start training

In [None]:
%%time
trainer.train()

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
trainer.save_model("./AddressBERTo")

## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
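Conceptually, the ranking step boils down to a softmax over the model's scores at the masked position, followed by picking the highest-probability tokens. Here is a toy sketch of that step. The function `top_k_fills`, the four-token vocabulary, and the scores are all made up for illustration; they are not outputs of the trained model.

```python
import math


def top_k_fills(logits, id_to_token, k=3):
    """Softmax the mask-position logits and return the k most probable tokens."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id_to_token[i], probs[i]) for i in ranked]


toy_vocab = ["Ġatlanta", "Ġgeorgia", "Ġapt", "Ġ302"]  # made-up four-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0]                     # made-up scores for the <mask> slot
predictions = top_k_fills(toy_logits, toy_vocab, k=2)
for token, prob in predictions:
    print(token, round(prob, 3))
```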



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./AddressBERTo",
    tokenizer="./AddressBERTo"
)

In [None]:
# Ask the model to fill in the token that follows a partial address

fill_mask("4595 n picadilly ct <mask>.")

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:



In [None]:
fill_mask("1234 main st <mask>.")


## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters),
    - evaluation results,
    - intended uses & limitations,
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)


If you want to take a look at models in different languages, check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)


## 6. Fine-tune your LM on a downstream task

In [None]:
from transformers import TokenClassificationPipeline, pipeline


MODEL_PATH = "./models/EsperBERTo-small-pos/"

nlp = pipeline(
    "ner",
    model=MODEL_PATH,
    tokenizer=MODEL_PATH,
)