To build a complete NLP input pipeline with Hugging Face Transformers models you must:
* select and use the appropriate ***tokenizer*** for your model;
* generate or pre-process ***training*** and ***test*** sets;
* ***fine-tune*** your model on custom datasets;
* prepare your fine-tuned model for ***inference***.

In [2]:
!pip3 install transformers torch


## Data processing (i.e. tokenization)

*Tokenization* is the task of processing input text, whether from the training, validation or test set, so that it can be fed to the neural model. It splits each input sentence into words or groups of characters (*tokens*), adds special tokens and masks, separates punctuation, and so on.

It then converts the split text into numbers according to a specific, uniquely defined *vocab*; those numbers are fed to the input layer of the neural net. It follows that each pre-trained NLP model has exactly one associated *tokenizer*, with its own *token-to-index* conversion rule (i.e. the *vocab*).

To run inference with (or otherwise use) a pre-trained model, you must use its associated *tokenizer*.
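
Why the tokenizer has to match the model can be seen with a toy example. The sketch below is not how a real tokenizer works (real ones use sub-word algorithms and far larger vocabs); ```build_vocab```, ```encode``` and the two vocabularies are hypothetical names invented for illustration. It only shows that the same sentence maps to different ids under different vocabs, so ids produced with the wrong vocab would make the model look up the wrong entries.

```python
# Toy illustration only: whitespace splitting and a tiny hand-built vocab.

def build_vocab(tokens):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {"[UNK]": 0}  # id reserved for tokens missing from the vocab
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text, vocab):
    """Split on whitespace and look each token up in the vocab."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in text.split()]


# Two hypothetical models whose vocabs list the same words in a different order.
vocab_a = build_vocab("che bella giornata".split())
vocab_b = build_vocab("giornata che bella".split())

sentence = "che bella giornata"
ids_a = encode(sentence, vocab_a)
ids_b = encode(sentence, vocab_b)
print(ids_a)  # [1, 2, 3]
print(ids_b)  # [2, 3, 1]
```

The two id lists differ even though the text is identical, which is the reason a model and its tokenizer are always loaded as a pair.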

In [3]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("neuraly/bert-base-italian-cased-sentiment")


If we pass an input sentence through a pre-trained *tokenizer* we get an ```input_ids``` array containing the (integer) indices associated with each *token* of our input sequence.

In [7]:
batch = ["Non credo affatto che mi andrà male", "Che bella giornata per non passare Analisi 1"]

idx=tokenizer(batch)
print(idx)

{'input_ids': [[102, 313, 2079, 7320, 158, 318, 10968, 1690, 103], [102, 666, 2708, 3695, 156, 212, 3624, 24162, 202, 103]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


To actually see the *tokens* we can use the ```decode``` method.

In [11]:
for seq_ids in idx["input_ids"]:
  print(tokenizer.decode(seq_ids))

[CLS] Non credo affatto che mi andrà male [SEP]
[CLS] Che bella giornata per non passare Analisi 1 [SEP]


The output shows the final product of the *tokenization* process, i.e. after the special *tokens* have been inserted; if we instead use the ```tokenize()``` method we retrieve only the plain tokens, here with both sentences returned in a single flat list:

In [13]:
tks = tokenizer.tokenize(batch)
print(tks)

['Non', 'credo', 'affatto', 'che', 'mi', 'andrà', 'male', 'Che', 'bella', 'giornata', 'per', 'non', 'passare', 'Analisi', '1']


We can then proceed with the ```convert_tokens_to_ids()``` method to obtain the array of indices explained above:

In [14]:
ids = tokenizer.convert_tokens_to_ids(tks)
print(ids)

[313, 2079, 7320, 158, 318, 10968, 1690, 666, 2708, 3695, 156, 212, 3624, 24162, 202]


**Please note:** unlike the ```idx``` arrays obtained above, this time the leading ```102``` and trailing ```103``` entries are missing. They correspond to the special *tokens* ```[CLS]``` and ```[SEP]``` respectively, which were not added here.
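
The relationship between the two outputs can be sketched in plain Python. This is only an illustration, assuming the ids ```102``` and ```103``` seen in the printed output; ```add_special_tokens``` is a hypothetical helper, not a library function.

```python
CLS_ID = 102  # id of [CLS], taken from the output printed earlier
SEP_ID = 103  # id of [SEP], taken from the output printed earlier


def add_special_tokens(token_ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap the ids of a single sentence in [CLS] ... [SEP]."""
    return [cls_id] + list(token_ids) + [sep_id]


# Ids of the first sentence only (the first seven entries of `ids`).
first_sentence_ids = [313, 2079, 7320, 158, 318, 10968, 1690]
with_special = add_special_tokens(first_sentence_ids)
print(with_special)
# [102, 313, 2079, 7320, 158, 318, 10968, 1690, 103]
```

The result matches the first entry of ```idx["input_ids"]```.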

## Preprocess custom datasets

[This tutorial](https://huggingface.co/transformers/custom_datasets.html) explains how to work with your own data.

The raw custom dataset containing all the input text examples must then be converted into data our pre-trained model can digest, i.e. we have to pass it through our ```tokenizer```. This can be done in one shot using the ```map()``` method:

In [16]:
!pip3 install datasets
from datasets import load_dataset


In [17]:
dataset = load_dataset("imdb")

def tok_txt(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tok_dataset = dataset.map(tok_txt, batched=True)
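
What ```padding="max_length"``` and ```truncation=True``` do to each example can be sketched in plain Python. This is a simplified illustration, not the library's implementation: ```pad_or_truncate``` is a hypothetical helper, the pad id ```0``` is an assumption (the real value is whatever the tokenizer defines), and real truncation also keeps the closing ```[SEP]``` token, which this naive cut does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])      # truncation: drop anything beyond max_length
    n_pad = max_length - len(kept)           # number of padding positions to add
    input_ids = kept + [pad_id] * n_pad      # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids = [102, 313, 2079, 103]

padded_ids, padded_mask = pad_or_truncate(short_ids, max_length=6)
print(padded_ids)   # [102, 313, 2079, 103, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 0, 0]

cut_ids, cut_mask = pad_or_truncate(short_ids, max_length=3)
print(cut_ids)      # [102, 313, 2079]
print(cut_mask)     # [1, 1, 1]
```

After this step every example has the same length, which is what allows examples to be stacked into fixed-size batches.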


In [18]:
# Small shuffled subsets (1000 examples each) for quick experiments
small_train_dataset = tok_dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tok_dataset["test"].shuffle(seed=42).select(range(1000))
# Full tokenized splits
full_train_dataset = tok_dataset["train"]
full_eval_dataset = tok_dataset["test"]

## Fine-tuning pre-trained models
In TensorFlow, fine-tuning can be done directly via Keras's ```fit``` method. PyTorch provides no such generic training loop, so it would have to be written by hand. To avoid that, the Hugging Face team put together a dedicated ```Trainer``` API.

[This guide](https://huggingface.co/transformers/training.html) will show you how to use it.
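
To make the phrase "training loop" concrete, here is a minimal sketch of the steps such a loop has to perform: iterate over epochs, shuffle, split into mini-batches, run a forward pass, compute the loss, compute gradients, and update the parameters. It deliberately uses plain Python and a one-feature logistic regression on made-up data instead of a Transformer, so every name and number in it is illustrative. In PyTorch the gradient and update steps would be handled by ```loss.backward()``` and ```optimizer.step()```.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=100, lr=0.5, batch_size=2, seed=0):
    """Mini-batch gradient descent for a one-feature logistic regression."""
    rng = random.Random(seed)
    examples = list(examples)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0            # model parameters
    epoch_losses = []
    for _ in range(epochs):
        rng.shuffle(examples)
        total_loss = 0.0
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                p = sigmoid(w * x + b)  # forward pass
                total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
                grad_w += (p - y) * x   # gradients, computed by hand
                grad_b += p - y
            w -= lr * grad_w / len(batch)  # parameter update
            b -= lr * grad_b / len(batch)
        epoch_losses.append(total_loss / len(examples))
    return w, b, epoch_losses


# Made-up, linearly separable data: negative x -> label 0, positive x -> label 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, losses = train(data)
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

The ```Trainer``` API wraps this same structure (plus batching, evaluation and checkpointing) so that it does not have to be rewritten for every model.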

## Real-time text inference