# Using Transformers

## Behind the pipeline 

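1. Preprocessing: the tokenizer converts raw text into numbers
2. Going through the model: the model turns those numbers into logits
3. Postprocessing: the logits are converted into probabilities and labels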

### Preprocessing with a tokenizer

- Transforming text into tokens, then into numbers the model can process
- Tokenization:
  - Split the text into tokens (e.g. words, subwords, symbols)
  - Map each token to its corresponding ID
  - Add an attention mask that marks which positions hold real tokens and which are padding
- The same tokenizer must be used at inference as was used during pre-training
- Load it with `tokenizer = AutoTokenizer.from_pretrained(checkpoint)`; calling it returns:
  - `input_ids`: the sequence of token IDs
  - `attention_mask`: 1 for real tokens, 0 for padding that the model should ignore
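
The splitting and ID lookup steps listed above can be sketched in plain Python. This is only a toy illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, the vocabulary is invented, and real tokenizers for checkpoints like the one below use subword algorithms rather than this simple word split.

```python
import re

# Hypothetical toy vocabulary; real checkpoints ship tens of thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Wrap the tokens in [CLS]/[SEP] and map each one to its ID ([UNK] if unknown)."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]


tokens = toy_tokenize("I hate this so much!")
ids = toy_encode("I hate this so much!")
print(tokens)
print(ids)
```

A word missing from the vocabulary, such as "love" here, falls back to the `[UNK]` ID. Subword tokenizers avoid most of these fallbacks by breaking rare words into smaller known pieces.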

In [1]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [2]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


### Going through the model

<figure>
  <img src="../images/transformer_and_head-dark.svg" alt="Pipeline flow" width="500"/>
  <figcaption>Available at: https://huggingface.co/learn/llm-course/chapter2/2</figcaption>
</figure>

- The model converts token IDs into an internal representation
- High-dimensional vectors: hidden states
  - After passing the tensors to the model, we get the hidden states
  - Their dimensions are:
    - Batch size
    - Sequence length
    - Hidden size
  - For example: `torch.Size([2, 16, 768])`
- Model heads: transform the hidden states into useful outputs
  - The base model produces a high-dimensional vector for each token
  - The head takes these vectors and projects them into the form the task needs (classification, generation, etc.)
  - Examples of heads used in Transformers models:
    - ForCausalLM: predict the next token
    - ForMaskedLM: predict the masked token
    - ForSequenceClassification: predict the label of the whole sequence
    - ForTokenClassification: predict the label for each token
    - ForQuestionAnswering: predict the start and end of the answer span
  - For sentiment analysis we use `AutoModelForSequenceClassification`
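
To make the shapes concrete, here is a minimal sketch of what a sequence classification head does, written in plain Python with random numbers standing in for real hidden states. Everything in it is an assumption for illustration: the hidden size is shrunk to 8, the weights are random, and the head is a single linear layer on the first token's vector, whereas heads in the library may include extra layers.

```python
import random

random.seed(0)  # make the toy values reproducible

# Toy dimensions; the real model in these notes uses hidden_size=768.
batch_size, seq_len, hidden_size, num_labels = 2, 16, 8, 2

# Stand-in for the base model output: one vector per token.
hidden_states = [
    [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(seq_len)]
    for _ in range(batch_size)
]

# Hypothetical head parameters: a (num_labels x hidden_size) matrix plus a bias.
weights = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels


def classification_head(states):
    """Turn (batch, seq_len, hidden) states into (batch, num_labels) logits."""
    logits = []
    for sequence in states:
        first_token = sequence[0]  # vector at the first position (the [CLS] token)
        logits.append(
            [sum(w * x for w, x in zip(row, first_token)) + b for row, b in zip(weights, bias)]
        )
    return logits


logits = classification_head(hidden_states)
print(len(hidden_states), len(hidden_states[0]), len(hidden_states[0][0]))  # 2 16 8
print(len(logits), len(logits[0]))  # 2 2
```

The point of the sketch is the change in shape: the head collapses one vector per token into one row of scores per sentence, with one score per label.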

### Postprocessing the output

- The model outputs logits: raw, unnormalized scores, not probabilities
- Applying a softmax converts them into probabilities

```python
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
```

In [5]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [9]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
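
As a sanity check, the softmax step can be reproduced without any library. The sketch below copies the logits and the `id2label` mapping from the outputs above; the `softmax` helper is a hand-written stand-in for `torch.nn.functional.softmax`, not the library's implementation.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Logits and label mapping copied from the outputs above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    results.append((id2label[best], probs[best]))
    print(id2label[best], round(probs[best], 4))
```

The first sentence comes out as POSITIVE with probability of about 0.96 and the second as NEGATIVE with about 0.999, in line with the tensor printed by the notebook.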

# Creating a Transformer

- Instantiate a Transformer model with `AutoModel`
  - `bert-base-cased`: the checkpoint, which determines both the architecture and the pretrained weights
  - "Auto": selects the correct architecture based on the checkpoint

In [1]:
from transformers import AutoModel

model = AutoModel.from_pretrained('bert-base-cased')


# Models

## Encoding Text

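- Tokenization returns:
  - `input_ids`: the sequence of token IDs
  - `token_type_ids`: the segment each token belongs to (0 for the first sentence, 1 for the second)
  - `attention_mask`: 1 for real tokens, 0 for padding
- Special tokens added:
  - `[CLS]`: classification token, placed at the start of the sequence
  - `[SEP]`: separator token, placed at the end of each sentence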

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
encoded_input

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [7]:
tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I ' m a single sentence! [SEP]"

In [8]:
# Encode a pair of sentences (pass a list of strings instead to encode a batch)

encoded_input = tokenizer("How are you?", "I'm fine, thank you!")
print(encoded_input)

{'input_ids': [101, 1731, 1132, 1128, 136, 102, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
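
The layout of that sentence-pair output can be reconstructed by hand. The sketch below assumes the special-token IDs seen in the printed output (101 for `[CLS]`, 102 for `[SEP]`) and uses a hypothetical helper, `build_pair`, to show where the special tokens go and how `token_type_ids` splits the two segments. It mimics the result, not the tokenizer's internal code.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs as they appear in the output above


def build_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token_type_ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


# Token IDs copied from the printed output, without the special tokens.
how_are_you = [1731, 1132, 1128, 136]
im_fine = [146, 112, 182, 2503, 117, 6243, 1128, 106]

input_ids, token_type_ids = build_pair(how_are_you, im_fine)
print(input_ids)
print(token_type_ids)
```

Both printed lists match the `input_ids` and `token_type_ids` returned by the tokenizer for the pair.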


## Padding inputs

- Pad the inputs so that every sequence in the batch has the same length

In [10]:
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)

{'input_ids': tensor([[ 101, 1731, 1132, 1128,  136,  102,    0,    0,    0,    0],
        [ 101,  146,  112,  182, 2503,  117, 6243, 1128,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
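
What `padding=True` does to the two sequences can be sketched in a few lines. The example assumes right-padding with ID 0, as the output above shows for this checkpoint, and `pad_batch` is a hypothetical helper rather than a library function.

```python
PAD_ID = 0  # padding ID as it appears in the output above


def pad_batch(batch):
    """Right-pad every sequence to the longest one and build the attention mask."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Unpadded token IDs of the two sentences, copied from the output above.
batch = [
    [101, 1731, 1132, 1128, 136, 102],
    [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The shorter sentence gains four padding IDs, and its mask has zeros in exactly those positions so the model can ignore them.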


## Truncate inputs

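- Truncate sequences that are too long for the model; with `truncation=True` alone the limit is the model's maximum length, and `max_length` sets a shorter one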

In [11]:
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)
print(encoded_input["input_ids"])

[101, 1188, 1110, 170, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1263, 5650, 119, 102]


In [12]:
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

{'input_ids': tensor([[ 101, 1731, 1132, 1128,  102],
        [ 101,  146,  112,  182,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}
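
The effect of `max_length=5` in the output above can be imitated with a small helper. This is a simplified sketch: `truncate` is a hypothetical function that cuts an already-encoded sequence and keeps its final special token, which reproduces the printed result but is not how the tokenizer is implemented internally.

```python
def truncate(ids, max_length):
    """Cut a sequence to max_length while keeping its final special token."""
    if len(ids) <= max_length:
        return ids
    # Keep the first max_length - 1 IDs, then re-attach the closing token ([SEP]).
    return ids[: max_length - 1] + [ids[-1]]


# Full encodings of the two sentences, copied from the padding example.
batch = [
    [101, 1731, 1132, 1128, 136, 102],
    [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102],
]
truncated = [truncate(seq, max_length=5) for seq in batch]
print(truncated)
```

Both sequences end up with exactly five IDs, so no padding is needed and every attention mask entry is 1, as in the notebook output.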


## Adding special tokens

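- Special tokens matter for models such as BERT that were pretrained with them
- They mark sentence boundaries: `[CLS]` opens the sequence and `[SEP]` closes each sentence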