<a href="https://colab.research.google.com/github/benoitmialet/Statistical-and-data-analysis-using-R-/blob/main/blank_versions/lab_nlp_01_understanding_encoder_models_blank_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB NLP 01 Understanding encoder models
# -Version with blanks-

## Intro

In this lab session, we are going to run our first transformer model locally!

Understanding a model's structure and how it works is required before trying to use it for inference or training. This is the objective of this lab. In the next one, we will work on some Natural Language Processing tasks, where transformer models have shown outstanding results for several years. In these labs, we will first focus on Encoder-only models. Then, we will tackle Encoder-Decoder models and Decoder-only models.

Throughout the labs, we will make heavy use of the Hugging Face Hub and the Hugging Face libraries to try out different transformer models.
* **Hugging Face Hub** is a huge repository of open-source models: https://huggingface.co/models. Anyone can upload a model, for public or private usage. Filters can be activated to explore models: tasks, languages, licenses, etc. The Tasks page provides a nice overview: https://huggingface.co/tasks
* **Hugging Face libraries** offer a unified implementation that lets you use any model from the Hub, provided your hardware can handle it: https://huggingface.co/docs. With a fairly simple syntax, you will be able to perform any NLP task (or speech-to-text, computer vision, etc.). We will mainly use the `Transformers` and `Datasets` libraries for this purpose.

Note: at any moment, if you face an "out of memory" error, just restart the notebook kernel. It is also advised to restart the kernel each time you load a new model, to avoid running out of memory.

## Discovering BERT

BERT (Google, 2018) is the best-known example of an NLP encoder model, so we start with it. As a reminder, **the BERT encoder has been trained for the masked language modeling (MLM)** task on English text, which made it good at feature extraction, that is to say good at extracting the semantic meaning of a text and the information it contains (in English). It has also been trained for Next Sentence Prediction (NSP), but studies have shown that this objective is less useful than MLM.


The first thing to do before using a transformer model is to understand how it works. Go to the BERT model page and read the text: https://huggingface.co/google-bert/bert-base-uncased/. See how well documented this one is! This is not the case for all models.


Next, you have to check the main model files on the "Files and versions" tab and understand their purpose. **To use NLP models, you will basically need these files:**

* **config.json** shows model parameters configuration.

* **tokenizer_config.json** indicates the maximum length of the inputs you can feed the model with. It is one of the most important pieces of information. So, keep in mind "**512** tokens".

* **tokenizer.json** is the most important file. It contains the whole vocabulary and the mapping between tokens and their ids. It also contains all the special tokens. Read them carefully.

* **vocab.txt** references all the tokens of the vocabulary.

* At least 1 model file. Most of the time, several model file types are available in the model repository. Each of them contains all the weights of the model. You don't need to download all of them: pick one. The most common types are:
  * **pytorch_model.bin** is the standard PyTorch binary format (pickle). Many repositories, especially older ones, provide it.
  * **model.safetensors** is a format created by Hugging Face. It is faster to load, especially on CPU, and, unlike pickle, loading it cannot execute arbitrary code. Another nice feature: clicking the little icon next to the file on the Hub displays all the model layers.
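
To make the role of `config.json` more concrete, here is a minimal sketch that parses a short excerpt written in the style of a BERT configuration file. The excerpt is typed by hand and is an assumption: the sizes match the architecture printed later in this lab, but the number of attention heads is quoted from memory, so check the real file on the Hub before relying on it.

```python
import json

# Hand-written excerpt in the style of a BERT `config.json`.
# Assumption: values are for illustration, verify them against the real file.
config_text = """
{
  "model_type": "bert",
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "max_position_embeddings": 512,
  "vocab_size": 30522
}
"""

config = json.loads(config_text)

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

print(config["model_type"], "with", config["num_hidden_layers"], "layers")
print("dimensions per attention head:", head_dim)
```

The same keys are what you will find again later when reading the model configuration from Python.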



## Loading the model on device

Everything imported for this lab is in the next cell. If a name goes missing (for instance after a kernel restart), re-run the cell.

In [None]:
from transformers import (
    AutoConfig, BertConfig,
    AutoTokenizer,
    AutoModel, AutoModelForMaskedLM, BertForMaskedLM,
    pipeline
)
import torch
import os
import pandas as pd


# this line of code will be useful in any of your projects, to check GPU availability
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

The `AutoModel` class can load any model if a model name or path is provided. It automatically identifies the model type and loads it, thanks to the `from_pretrained` method. This method has plenty of optional parameters that can be found in the documentation, or with `help(AutoModel.from_pretrained)`.

In [None]:
# help(AutoModel.from_pretrained)

Most of the time, this high-level class works perfectly.

* `use_safetensors` lets you load the model from safetensors files. You can also load a TensorFlow version, etc.
* the `to()` method moves the data to a specific device (CPU, GPU). Be careful: **model and input data MUST be on the same device**. The tokenizer stays on the CPU.

In [3]:
model_id = "google-bert/bert-base-uncased"

In [None]:
# Load model with the `from_pretrained` method




WARNING: During model loading, if a warning message states that some layers were randomly initialized, the class you used expects layers whose weights are missing from the checkpoint. Unless you did it on purpose (for instance to initialize a new layer before training it), you will need to use a more specific import class.

Looking at the BERT architecture, we observe:
* The **embedding** part, which outputs token embeddings.
* The **"body"**, which updates token embeddings (feature extraction) with 12 blocks ("layers"), each containing several attention heads.
* The **"head"**, which is trained for a specific task. Here, it is trained to condense the input sequence into one single vector, with only 1 pooler layer.

Note: the model body can be seen as a drill, and the head as a drill bit. For the same body, an endless variety of heads can be trained. **The body does the feature extraction and the head does a specific task**. The task can be pooling, pooling + classification, or anything else.

In [None]:
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

More details, like the number of attention heads per encoder block, can be found by loading the model config file. There are different ways to do it:
* by checking the `config.json` file.
* by using the `model.config` attribute.
* by instantiating an `AutoConfig` object. `AutoConfig` is a class that holds the `config.json` information. It can automatically identify the model type, just like `AutoModel` and `AutoTokenizer`. `BertConfig`, however, only works for BERT. For a BERT model, both give the same result.

In [1]:
# Load Bert Autoconfig object.

# bert_config =


## Using a Tokenizer

Tokenization is the first step of processing input data (text), before converting it into a numeric format.

Each model family (e.g. BERT) has its own tokenizer. For each model, you will have to identify which tokenizer it uses and how that tokenizer works.

3 main tokenization techniques exist:
* Character tokenization: splitting text into characters
* Word tokenization: splitting text into words
* **Subword tokenization**: splitting text into words or smaller units. It combines the best aspects of word and character tokenization.

N.B.: Word and subword tokenizers are themselves trained, before the model, generally on the same text corpus.
N.B.: Today, subword tokenization is the main technique used in most NLP models.
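
The trade-off between the first two techniques can be sketched with plain Python, without any library. The word vocabulary below is a hypothetical toy set, chosen only to show what happens when a word is missing from it.

```python
text = "tokenizing text is a core task in NLP."

# Character tokenization: every character (spaces included) is a token.
char_tokens = list(text)

# Naive word tokenization: split on whitespace only.
word_tokens = text.split()

# Hypothetical tiny word vocabulary, only for illustration.
word_vocab = {"text", "is", "a", "core", "task", "in"}
unknown_words = [w for w in word_tokens if w not in word_vocab]

print(len(char_tokens), "character tokens")
print(len(word_tokens), "word tokens:", word_tokens)
print("words missing from the vocabulary:", unknown_words)
```

Character tokenization never meets an unknown token but produces long sequences, while word tokenization produces short sequences but cannot represent words outside its vocabulary. Subword tokenization sits between the two.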

--

The tokenizer is an object in its own right and must be loaded separately from the model, to prepare the model input data. There are two options to load files from the Hub:
1. Download the model directory locally using `git clone`, `wget`... and provide the directory path to the import method.
2. Provide the `model_id` to the same import method. `model_id` is simply the model name displayed on the Hub ("google-bert/bert-base-uncased"). The model will be downloaded to a cache, then loaded. That's what we are going to do now.

`AutoTokenizer` automatically detects the type of tokenizer a model uses, just from its path or `model_id`.
If we call `print()` on the tokenizer object, we get all the important details. Let's have a look:


### WordPiece Tokenizer

The WordPiece tokenizer is generally used by models focusing on 1 language, mostly English, but it can be another European language too. Examples: BERT, DistilBERT

In [10]:
# load bert tokenizer

# bert_tokenizer =

Some characteristics appear, like the vocabulary size. Plenty of additional information can be found using `help()`. That's where the WordPiece type of tokenizer appears.

In [None]:
# help(bert_tokenizer)

#### Special tokens

Special tokens are particular tokens, not always related to words. You need to know them to understand tokenization. Special tokens differ from one model to another. Here we take a WordPiece tokenizer as an example. Its special tokens are:

* `[CLS]` is the classification token. It is put at the beginning of the input sequence and integrates the global context of the sequence. Its embedding is the one used if the model task is classification (the rest of the sequence is discarded).
* `[SEP]` is the separator token. It marks where each sentence ends in a sequence, and therefore also the end of the input sequence.
* `[UNK]` is the unknown token. It is used to handle words or subwords that are not in the model's vocabulary.
* `[PAD]` is the padding token. When a batch of multiple samples is fed to the model, all the tensors must have the same size. A maximum length is set (by default or manually), and all sequences shorter than this are completed with this token.
* `[MASK]` is the mask token. It randomly replaces some tokens of a sequence when the model is being trained for Masked Language Modeling.
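
As a rough illustration of where `[CLS]`, `[SEP]` and `[PAD]` end up, the sketch below frames and pads hand-written token lists. The helper names `frame` and `pad_batch` are hypothetical and the tokens are typed by hand. A real tokenizer does all of this for you.

```python
def frame(tokens_a, tokens_b=None):
    """Add [CLS] and [SEP] around one sentence or a pair of sentences."""
    framed = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        framed += tokens_b + ["[SEP]"]
    return framed


def pad_batch(batch):
    """Right-pad every sequence with [PAD] up to the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + ["[PAD]"] * (max_len - len(seq)) for seq in batch]


# Hypothetical, already-split tokens (a real tokenizer would produce these).
single = frame(["hello", "world"])
pair = frame(["i", "learn"], ["it", "is", "hard"])

padded = pad_batch([single, pair])
for seq in padded:
    print(seq)
```

After padding, both sequences have the same length, which is what allows them to be stacked into one tensor.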


Let's tokenize a sentence:

In [26]:
# tokenize the following sentence:
text = "tokenizing text is a core task in NLP."

# tokens =


We can see `[CLS]`, `[SEP]` and some tokens beginning with `##`, which indicates that they are subword tokens attached to the previous token.
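
To see where the `##` pieces come from, here is a simplified sketch of greedy longest-match-first splitting, the strategy WordPiece tokenizers apply to each word at inference time. The function name `wordpiece_split` and the toy vocabulary are assumptions made for illustration. This is not the real BERT vocabulary nor the library implementation.

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real BERT vocabulary.
toy_vocab = {"token", "##izing", "##s", "text", "core"}

print(wordpiece_split("tokenizing", toy_vocab))  # ['token', '##izing']
print(wordpiece_split("tokens", toy_vocab))      # ['token', '##s']
print(wordpiece_split("xyz", toy_vocab))         # ['[UNK]']
```

The longest known piece is taken first, then the remainder of the word is matched against pieces carrying the `##` prefix.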

We can also observe that each token has its own mapped id in the tokenizer vocabulary:

In [27]:
# display token_id for each encoded token, using a list comprehension, or a pandas DataFrame





### SentencePiece tokenizer (XLM-RoBERTa)

The SentencePiece tokenizer is generally (not exclusively) used by models that can handle multiple languages, including non-European languages. Examples: XLM-RoBERTa, mT5

In [None]:
model_id = "FacebookAI/xlm-roberta-base"
xlmroberta_tokenizer = AutoTokenizer.from_pretrained(model_id)
xlmroberta_tokenizer

XLMRobertaTokenizerFast(name_or_path='FacebookAI/xlm-roberta-base', vocab_size=250002, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	250001: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

Here, the special tokens look different, but thanks to the `special_tokens` dictionary, we can map them to the roles seen above. For instance, the BOS and CLS tokens are now both `<s>`, and the EOS and SEP tokens are both `</s>`.

In [28]:
# tokenize then print tokens for the following sentence:

text = "tokenizing text is a core task in NLP."

# tokens =




Now it's different from WordPiece. Tokens corresponding to the beginning of a word start with `▁` (U+2581, which looks like an underscore), and those attached to the previous token don't. The SentencePiece tokenizer is agnostic to accents and punctuation, and is better suited to languages without whitespace, like Japanese.
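
The two marking conventions can be compared by rebuilding text from tokens. This is a small sketch with hand-written token lists and hypothetical helper names. The word-start marker used by SentencePiece is the character `▁` (U+2581), which resembles an underscore.

```python
def join_wordpiece(tokens):
    """Rebuild text from WordPiece tokens: '##' glues a piece to the previous one."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


def join_sentencepiece(tokens):
    """Rebuild text from SentencePiece tokens: U+2581 marks the start of a word."""
    return "".join(tokens).replace("\u2581", " ").strip()


# Hypothetical token lists, written by hand for illustration.
wordpiece_tokens = ["token", "##izing", "text"]
sentencepiece_tokens = ["\u2581token", "izing", "\u2581text"]

print(join_wordpiece(wordpiece_tokens))          # tokenizing text
print(join_sentencepiece(sentencepiece_tokens))  # tokenizing text
```

WordPiece marks the pieces that continue a word, whereas SentencePiece marks the pieces that start one, so whitespace is treated as part of the token stream.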

## Understanding BERT model input and output

Let's use the famous BERT model and try to understand the output data.

### Tokenization (input)

Tokenized data can be returned in different formats. We will use the PyTorch one.

Don't forget to move the tokenized data, which will be the model input data, to the device.

In [38]:
text1 = "I learn NLP."
text2 = "It's a pain everyday. It is so hard!"
tokens = bert_tokenizer(text1, text2, return_tensors='pt').to("cpu")
tokens

{'input_ids': tensor([[  101,  1045,  4553, 17953,  2361,  1012,   102,  2009,  1005,  1055,
          1037,  3255, 10126,  1012,  2009,  2003,  2061,  2524,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

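`token_type_ids` and `attention_mask` are extra model inputs. We will not use them in the lab. For your information:
* `token_type_ids` identifies which sequence each token belongs to when several sequences have been tokenized at once.
* `attention_mask` indicates which tokens the model must attend to (`1`) and which ones are padding to ignore (`0`). It is unrelated to the `[MASK]` token used for MLM training.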
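
The sketch below rebuilds both lists by hand for a sentence pair, following the convention of Hugging Face BERT tokenizers, where `attention_mask` is `1` for real tokens and `0` for padding. The helper `build_inputs` and its token lists are hypothetical, and it does no truncation.

```python
def build_inputs(tokens_a, tokens_b, max_length):
    """Return tokens, token_type_ids and attention_mask for a sentence pair."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    second = tokens_b + ["[SEP]"]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(tokens)

    # Right-pad up to max_length; padded positions get mask 0.
    n_pad = max(0, max_length - len(tokens))
    tokens = tokens + ["[PAD]"] * n_pad
    token_type_ids = token_type_ids + [0] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    return tokens, token_type_ids, attention_mask


# Hypothetical, already-split tokens.
tokens_demo, type_ids_demo, mask_demo = build_inputs(
    ["i", "learn"], ["it", "is", "hard"], max_length=10
)
print(tokens_demo)
print(type_ids_demo)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(mask_demo)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

In the real output shown above, no padding was needed, which is why every value of `attention_mask` is `1`.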

We see some special tokens (`101`, `102`) among them. Let's decode them:

In [41]:
# decode `101` and `102` special tokens without explicitly using "101" and "102"




### Model output

Before using a model for inference, putting the code in a `torch.no_grad()` context avoids computing gradients for nothing. However, it does not make a big difference here.

In [None]:
%%timeit -n 20 -r 100
output = bert_model(tokens.input_ids)

12.7 ms ± 3.34 ms per loop (mean ± std. dev. of 100 runs, 20 loops each)


In [None]:
%%timeit -n 20 -r 100
with torch.no_grad():
    output = bert_model(tokens.input_ids)

11.7 ms ± 1.93 ms per loop (mean ± std. dev. of 100 runs, 20 loops each)


Reminder: encoder models are built in two parts:
* A **"body"** for feature extraction, with several blocks, each containing several attention heads
* A **"head"** for pooling, classification (or anything else)

So, the model output provides both body and head outputs.
* body output is given by the `last_hidden_state` attribute
* head output is given by the `pooler_output` attribute

In [None]:
output = bert_model(tokens.input_ids)
output.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [42]:
# display `last_hidden_state` dimension lengths and explain them




`last_hidden_state` corresponds to the extracted text context.

Dimensions are:
* ?
* ?
* ?

In [43]:
# display `pooler_output` dimension lengths and explain them



`pooler_output` shape is fixed. This layer is trained to "pool" hidden state matrix into one single vector. This vector carries the text context and can be used for other purposes such as sentence similarity or text classification.

Dimensions are:
* ?
* ?

Historically, the embedding vector of the classification token (`[CLS]` here) was the one to use for classification. `pooler_output` is a linear transformation followed by a Tanh activation of this `[CLS]` vector. It is supposed to be less "noisy" and is preferred for classification.
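
The "linear then Tanh" step can be sketched in pure Python on a tiny vector. The weights, bias and hidden size of 3 below are toy values chosen for readability, not the trained BERT parameters, and `pooler` is a hypothetical helper name.

```python
import math


def pooler(cls_vector, weight, bias):
    """Dense layer followed by tanh, applied to the [CLS] hidden vector."""
    pooled = []
    for row, b in zip(weight, bias):
        z = sum(w * x for w, x in zip(row, cls_vector)) + b
        pooled.append(math.tanh(z))
    return pooled


# Hypothetical toy values with hidden size 3 (BERT base uses 768).
cls_vector = [2.0, -1.0, 0.5]
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]  # identity matrix, to keep the result easy to read
bias = [0.0, 0.0, 0.0]

pooled = pooler(cls_vector, weight, bias)
print([round(v, 3) for v in pooled])  # [0.964, -0.762, 0.462]
```

Whatever the input, the Tanh keeps every component of the pooled vector strictly between -1 and 1, and the output keeps the same size as the hidden vector.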

In [44]:
# Try to extract only the first embedding vector of each sequence of the batch, using tensor slicing []
# print its shape




Hugging Face Transformers can also provide the outputs of all encoder blocks and their attention matrices:

In [None]:
output = bert_model(
    tokens.input_ids,
    output_hidden_states=True,
    output_attentions=True,
)

In [None]:
print(len(output.attentions), len(output.hidden_states))

12 13

There is one set of attention matrices per encoder block (12), but 13 hidden states: the output of the embedding layer comes first, followed by the output of each of the 12 blocks.
