In [None]:
!python --version

Python 3.10.12


In [None]:
# NLP Purpose
!pip install "transformers[sentencepiece]"



In [None]:
import transformers
from transformers import pipeline

import warnings
warnings.filterwarnings('ignore')

## **Transformers**
The **main features**:
- **Ease of use**: Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code.
- **Flexibility**: At their core, all models are simple PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes and can be handled like any other models in their respective machine learning (ML) frameworks.
- **Simplicity**: Hardly any abstractions are made across the library. The **“All in one file”** is a core concept: a model’s forward pass is entirely defined in a single file, so that the code itself is understandable and hackable.

This notebook covers:
- Model API ➡ the model and configuration classes that `pipeline()` relies on
- Tokenizer API ➡ takes care of the first and last processing steps: converting text to numerical inputs, and converting the outputs back to text
- Handling multiple sentences
- The high-level `tokenizer()` function

## **Behind the Pipeline**


In [None]:
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

This pipeline groups together three steps: **preprocessing**, **passing the inputs through the model**, and **postprocessing**:

**TOKENIZER ➡ MODEL ➡ POST PROCESSING**

[Raw Text ➡ input IDs ➡ Logits ➡ Prediction]

### **Preprocessing with a Tokenizer**
We first need to convert the text inputs into numbers that the model can make sense of. To do this we use a *tokenizer*, which is responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model

All this preprocessing **needs to be done in exactly the same way as when the model was pretrained**.
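
The three responsibilities listed above can be sketched in plain Python. This is only a toy illustration, not how the library tokenizer is implemented: the `toy_encode` function and the tiny `vocab` dictionary are hypothetical, and the IDs are simply the ones that appear in this notebook's outputs for the uncased DistilBERT checkpoint.

```python
import re

# Hypothetical mini-vocabulary; IDs copied from the outputs shown in this notebook.
vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1045, "hate": 5223,
         "this": 2023, "so": 2061, "much": 2172, "!": 999}

def toy_encode(text):
    # Step 1: lowercase, then split into words and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: map each token to its integer ID.
    ids = [vocab[token] for token in tokens]
    # Step 3: add the extra inputs the model expects (here, the special tokens).
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

encoded = toy_encode("I hate this so much!")
print(encoded)
```

A real tokenizer uses a much larger vocabulary and falls back to subwords for unknown words, which is why the splitting rules have to match the ones used at pretraining time.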

We take the ***checkpoint*** name of our model and load the matching tokenizer with `AutoTokenizer` and its `from_pretrained()` method:

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can directly pass our sentences to it and we’ll get back a dictionary that’s ready to feed to our model!

The only thing left to do is to convert the list of input IDs to tensors.

**Transformer models only accept `tensors` as input.**

To specify the type of tensors we want to get back, we use the `return_tensors` argument:

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# without return_tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True)
print(inputs)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}


In [None]:
# with return tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`.

`input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. `attention_mask` has the same shape and marks real tokens with 1 and padding tokens with 0.

### **Going Through the Model**
We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

➡ This downloaded the same **checkpoint** we used in our **pipeline** before (it should actually have been cached already) and **instantiated a model with it**.

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call *hidden states*, also known as *features*. For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

#### **High-Dimensionality Vector**
The **vector output by the Transformer module** is usually large. It generally has three dimensions:

- **Batch size**: The number of sequences processed at a time (2 in our example).
- **Sequence length**: The length of the numerical representation of the sequence (16 in our example).
- **Hidden size**: The vector dimension of each model input.

It is said to be “high dimensional” because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).
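
To make the three dimensions concrete, here is a small pure-Python sketch that builds a nested list with the same layout and reads its shape back. The `nested_shape` helper is hypothetical and only stands in for the `.shape` attribute of a real tensor.

```python
def nested_shape(data):
    # Walk down the first element at each nesting level and record its length.
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)

batch_size, sequence_length, hidden_size = 2, 16, 768

# A placeholder filled with zeros, laid out as [batch][token][hidden feature].
hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

shape = nested_shape(hidden_states)
total_values = batch_size * sequence_length * hidden_size
print(shape, total_values)
```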

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


The outputs of 🤗 Transformers models behave like `namedtuple`s or *dictionaries*: you can access their elements by attribute (as above), by key, or by index.

In [None]:
# outputs["last_hidden_state"]

#### **Model Heads: Making sense out of numbers** (*architecture*)
The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension.

The **output of the Transformer model** is **sent directly to the model head to be processed**.

The model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each **input ID** in the tokenized input **into a vector that represents the associated token**. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.
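
As a rough sketch of what "projecting onto a different dimension" means, the snippet below applies a single linear layer to one hidden vector. Everything here is an assumption made for illustration: the `linear_head` function, the hidden size of 4, and the weight values are hypothetical, and a real classification head usually contains more than one layer.

```python
def linear_head(hidden_vector, weights, bias):
    # One output score per row of `weights`: score = (w . h) + b
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]

hidden_vector = [0.5, -1.0, 2.0, 0.0]      # hypothetical hidden state of size 4
weights = [
    [1.0, 0.0, -1.0, 0.0],                 # hypothetical weights for label 0
    [0.0, 1.0, 1.0, 0.0],                  # hypothetical weights for label 1
]
bias = [0.0, 0.5]

logits = linear_head(hidden_vector, weights, bias)
print(logits)
```

The input has 4 values and the output has 2, one per label, which mirrors going from a hidden size of 768 to the two sentiment classes.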

There are *many different architectures* available in 🤗 Transformers, with each one designed around tackling a specific task.

- `*Model` (retrieve the hidden states)
- `*ForCausalLM`
- `*ForMultipleChoice`
- `*ForQuestionAnswering`
- `*ForSequenceClassification`
- `*ForTokenClassification`
- and others 🤗

For our example, we will **need a model with a sequence classification head** (*to be able to classify the sentences as positive or negative*). So, we won’t actually use the `AutoModel` class, but `AutoModelForSequenceClassification`:

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [None]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

### **Post-processing the output**


The values we get as output from our model don’t necessarily make sense by themselves.

In [None]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


Our model predicted `[-1.5607, 1.6123]` for the first sentence and `[ 4.1692, -3.3464]` for the second one. Those are not probabilities but *logits*, the raw, unnormalized scores outputted by the last layer of the model.

To be converted to probabilities, they need to go through a **SoftMax** layer. All 🤗 Transformers models output the logits, because the loss function used for training generally fuses the last activation function (such as SoftMax) with the actual loss function (such as cross entropy).



In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted `[0.0402, 0.9598]` for the first sentence and `[0.9995, 0.0005]` for the second one. These are recognizable probability scores.
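
The SoftMax step can also be reproduced by hand, which helps to see where the probabilities come from. The sketch below is a plain-Python version applied to the logits printed earlier; the `softmax` function is written here for illustration and is not the PyTorch implementation.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the values agree with the tensor above up to rounding.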

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config.

In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
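
Putting the last two steps together, the following sketch picks the most likely class for each sentence and looks up its label, giving results in the same shape as the `pipeline` output at the start of this notebook. The `to_predictions` helper is hypothetical, and the probabilities are the rounded values quoted above.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]

def to_predictions(probabilities, id2label):
    results = []
    for row in probabilities:
        # Index of the highest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results

predictions = to_predictions(probabilities, id2label)
print(predictions)
```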

## **Models**
Let’s take a closer look at creating and using a model.

The **`AutoModel`** class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. It’s a clever wrapper as it **can automatically guess the appropriate model architecture for your checkpoint**, and then **instantiates a model with this architecture**.

Let’s take a look at how this works with a **BERT** model.

### **Creating a Transformer**

The first thing we’ll need to do to initialize a BERT model is **load a configuration object**:

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config) ## -- Model is randomly initialized

The **configuration** *contains* **many attributes** that are used to build the model:

In [None]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.42.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



#### **Different Loading Methods**
Creating a model from the default configuration initializes it with random values:

In [None]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

## -- Model is randomly initialized

In [None]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

The model can be used in this state, but it will output gibberish; it needs to be trained first. But this would require a long time and a lot of data, and it would have a non-negligible environmental impact.

To avoid unnecessary and duplicated effort, it’s imperative to be able to **share and reuse models that have already been trained**.

In [None]:
from transformers import BertModel

# Load Transformer model that is already trained
model = BertModel.from_pretrained("bert-base-cased")

In [None]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In the code sample above we didn’t use `BertConfig`, and instead loaded a pretrained model via the `bert-base-cased` identifier. This is a model checkpoint that was trained by the authors of **BERT** themselves.

This **`model`** is now *initialized with all the weights of the checkpoint*. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. **By training with pretrained weights rather than from scratch, we can quickly achieve good results**.

The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is compatible with the BERT architecture.

#### **Saving Methods**
Saving a model is as easy as loading one: we use the `save_pretrained()` method, which is analogous to the `from_pretrained()` method:

In [None]:
# model.save_pretrained("directory_on_my_computer")
model.save_pretrained("config")

### **Transformer model for Inference**
Transformer models **can only process numbers — numbers that the tokenizer generates**.

But before we discuss tokenizers, let’s explore **what inputs the model accepts**.

Tokenizers can take care of casting the inputs to the appropriate framework’s tensors.

Here is a simple example of what tokenizers and tensors do:

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]

The tokenizer converts these to vocabulary indices which are typically called input IDs. Each sequence is now a list of numbers! The resulting output is:

In [None]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

**This is a list of encoded sequences: a list of lists.**

**Tensors** only accept *rectangular shapes* (think matrices). This “array” is already of rectangular shape, so converting it to a tensor is easy:

In [None]:
import torch

model_inputs = torch.tensor(encoded_sequences)

In [None]:
model_inputs

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])
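
The "rectangular shape" requirement mentioned above can be checked before building a tensor. This is a minimal sketch with a hypothetical `is_rectangular` helper; the second list is a made-up ragged example to show the failing case.

```python
def is_rectangular(rows):
    # A list of lists is rectangular when every row has the same length.
    return len({len(row) for row in rows}) <= 1

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]
ragged_sequences = [
    [101, 7592, 999, 102],
    [101, 102],             # shorter row: cannot be stacked into a tensor as-is
]

print(is_rectangular(encoded_sequences))
print(is_rectangular(ragged_sequences))
```

Ragged batches like the second one are the reason padding is needed later in this notebook.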

**Using the tensors as inputs to the model**

In [None]:
output = model(model_inputs)

In [None]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9393e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1972e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0110e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1085e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2

While the model accepts a lot of different arguments, only the input IDs are necessary.

## **Tokenizers**
**Tokenizers** are one of the core components of the NLP pipeline. They serve one purpose: **to translate text into data that can be processed by the model**. Models can only process numbers, so tokenizers need to **convert our text inputs to numerical data**.

🥅 The goal is to find **the most meaningful representation** — that is, **the one that makes the most sense to the model** — and, if possible, **the smallest representation**.

### **Word-based**
The goal is to split the raw text into words and find a numerical representation for each of them

- `split()` the sentence on whitespace:

In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


With this kind of tokenizer, we can end up with some pretty large *“vocabularies,”* where a **vocabulary** is **defined by the total number of independent tokens that we have in our corpus.**

Each word gets assigned an **ID**, *starting from 0 and going up to the size of the vocabulary*. **The model uses these IDs to identify each word**.

Finally, we need a **custom token** to represent words that are not in our vocabulary. This is known as the *“unknown”* token, often represented as ”`[UNK]`” or ”`<unk>`”. If the tokenizer produces many of these tokens, it wasn’t able to retrieve a sensible representation of those words and you’re losing information along the way.
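
A word-based vocabulary with an unknown token can be sketched in a few lines. The `build_vocab` and `encode` functions below are hypothetical helpers written for this illustration, and reserving ID 0 for `[UNK]` is an assumption of the sketch.

```python
def build_vocab(corpus, unk_token="[UNK]"):
    # Reserve ID 0 for the unknown token, then number words in order of appearance.
    vocab = {unk_token: 0}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, unk_token="[UNK]"):
    # Words missing from the vocabulary all collapse to the same unknown ID.
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]

vocab = build_vocab(["Jim Henson was a puppeteer"])
known = encode("Jim was a puppeteer", vocab)
unknown = encode("Jim was a musician", vocab)

print(vocab)
print(known)
print(unknown)
```

The word "musician" is not in the vocabulary, so it maps to the same ID as any other unseen word, which is exactly the information loss described above.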

### **Character-based**
**Goal: reduce the number of unknown tokens.**

➡ **Character-based** tokenizers **split the text into characters, rather than words**.

This approach isn’t perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it’s less meaningful: **each character doesn’t mean a lot on its own, whereas a word does**.

Another thing to consider is that we’ll end up with a very large amount of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
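
The trade-off can be seen by counting tokens for the same sentence under both schemes. This is a small sketch using only built-in string operations.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()       # one token per whitespace-separated word
char_tokens = list(text)         # one token per character, spaces included
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The character vocabulary stays tiny, but the sequence the model has to process becomes several times longer.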

### **Subword tokenization**
**Get the best of both worlds** (word - character).

➡ Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
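
One common way to implement this idea is a greedy longest-match split, similar in spirit to WordPiece. The sketch below is a simplified illustration under stated assumptions: the `wordpiece` function and its four-entry `vocab` are hypothetical, and the `##` prefix for word-internal pieces follows the convention visible in the BERT tokenizer outputs in this notebook.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece inside a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {"annoying", "##ly", "hugging", "##face"}  # hypothetical vocabulary

print(wordpiece("annoyingly", vocab))
print(wordpiece("huggingface", vocab))
print(wordpiece("xyz", vocab))
```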

### **Loading and Saving**
Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: `from_pretrained()` and `save_pretrained()`. These methods will load or save the algorithm used by the tokenizer (a bit like the *architecture* of the model) as well as its *vocabulary* (a bit like the weights of the model).

Loading the **BERT** tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the `BertTokenizer` class:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to `AutoModel`, the `AutoTokenizer` class will grab the proper **tokenizer** class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer works the same way as saving a model:

In [None]:
# tokenizer.save_pretrained("directory_on_my_computer")
tokenizer.save_pretrained("config")

('config/tokenizer_config.json',
 'config/special_tokens_map.json',
 'config/vocab.txt',
 'config/added_tokens.json',
 'config/tokenizer.json')

First, let’s see how the `input_ids` are generated. To do this, we’ll need to look at the intermediate methods of the tokenizer.

### **Encoding**
Translating text to numbers is known as *encoding*. Encoding is done in a two-step process: the **tokenization**, followed by the **conversion to input IDs**.

**Steps**:
1. **Split the text into words, subwords, or symbols** (*tokens*) ➡ [*There are multiple rules that can govern this process, which is why we need to **instantiate the tokenizer using the name of the model**, to make sure we **use the same rules** that were used when the model was **pretrained**.*]
2. Convert the **tokens into numbers** ➡ [*so that we can **build a `tensor` out of them and feed them to the model**. To do this, the tokenizer has a **vocabulary**, which is the part we download when we instantiate it with the `from_pretrained()` method.*]

#### **Tokenization**
The tokenization process is done by the `tokenize()` method of the tokenizer

In [None]:
from transformers import AutoTokenizer

# tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

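sequence = [
    "I’ve been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Note: tokenize() is meant for a single string; with these two sentences
# the result below is one flat list of tokens covering both of them.
tokens = tokenizer.tokenize(sequence)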

print(tokens)

['i', '’', 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.', 'i', 'hate', 'this', 'so', 'much', '!']


This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That is the case here with `HuggingFace`, which is split into `hugging` and `##face`.

#### **Tokens to Input IDs**
The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method


In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1045, 1521, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 1045, 5223, 2023, 2061, 2172, 999]


### **Decoding**
**From vocabulary indices**, we want **to get a string**. This can be done with the `decode()` method

In [None]:
decoded_string = tokenizer.decode([1045, 5223, 2023, 2061, 2172, 999])

print(decoded_string)

i hate this so much!


Note that the `decode()` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence.
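
The grouping step can be sketched as follows. The `join_wordpieces` function is a hypothetical helper that only handles the `##` continuation marker; the real `decode()` method does more, such as cleaning up spaces around punctuation.

```python
def join_wordpieces(tokens):
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

tokens = ["hugging", "##face", "course"]
text = join_wordpieces(tokens)
print(text)
```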

## **Handling Multiple Sequences**
Some questions emerge:
- How do we **handle multiple sequences**?
- How do we handle multiple sequences of **different lengths**?
- Are **vocabulary indices** the only inputs that allow a model to work well?
- Is there such a thing as **too long a sequence**?

### **Models expect a batch of inputs**
Let’s convert this list of numbers to a tensor and send it to the model:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

## Same checkpoint for Tokenizer and Model
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids) # the problem is here

# This line will fail
model(input_ids)

IndexError: too many indices for tensor of dimension 1

**FAIL**.

The problem is that we sent a single sequence to the model, whereas 🤗 **Transformers models expect multiple sentences by default**.

Here we tried to do everything the tokenizer did behind the scenes when we applied it to a `sequence`. But if you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor, it added a *dimension* on top of it

In [None]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


Try again and add a new dimension:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids]) # <- here's the difference: an extra batch dimension
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


**`Batching`** is the act of **sending multiple sentences through the model, all at once**.

If you only have **one sentence**, you can just build a batch with a single sequence. Here we go one step further and build a ***batch of two identical sequences***:

In [None]:
batched_ids = [ids, ids]

In [None]:
# Batches into tensor
batch_input_ids = torch.tensor(batched_ids) # <-- here's the difference: two rows in the batch
print("Input IDs:", batch_input_ids)

batch_output = model(batch_input_ids)
print("Logits:", batch_output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


### **Padding the Inputs**
Padding solves the problem of sentences of different lengths within a batch.

🥅 **Give every sequence the same length by adding a *padding token*.**

The padding token ID can be found in `tokenizer.pad_token_id`.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


There’s something wrong with the logits in our batched predictions: *the second row should be the same as the logits for the second sentence*, but we’ve got completely different values!

This is because the key feature of Transformer models is attention layers that *contextualize* each token.

✅ To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to **tell those attention layers to ignore the padding tokens**. This is done by using an `attention mask`.

### **Attention Masks**
*Attention masks* are tensors with the **exact same shape as the input IDs tensor**, filled with 0s and 1s: ***1s indicate the corresponding tokens should be attended to***, and ***0s indicate the corresponding tokens should not be attended to*** (i.e., they should be ignored by the attention layers of the model).
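
To give an intuition for how a mask makes the attention layers ignore a position, here is one common approach sketched in plain Python: masked positions get a score of negative infinity before the softmax, so their weight becomes exactly zero. The `masked_softmax` function and the score values are hypothetical, and actual model implementations may differ in detail.

```python
import math

def masked_softmax(scores, mask):
    # Positions with mask 0 are set to -inf, so exp() turns them into 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0]   # hypothetical raw attention scores for one token
mask = [1, 1, 0]           # the last position is padding

weights = masked_softmax(scores, mask)
print(weights)
```

The padded position would have received the largest weight without the mask; with it, the remaining weights are renormalized over the real tokens only.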

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


Now we get the same logits for the second sentence in the batch.
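
Writing the padded batch and its mask by hand gets tedious, so the two can be built together. The `pad_and_mask` helper below is a hypothetical sketch of what the tokenizer does for us with `padding=True`; it assumes a padding ID of 0, which is the value visible in the padded outputs earlier in this notebook.

```python
def pad_and_mask(batch, pad_token_id):
    longest = max(len(ids) for ids in batch)
    padded, mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)   # fill up with padding IDs
        mask.append([1] * len(ids) + [0] * n_pad)     # 1 = real token, 0 = padding
    return padded, mask

pad_token_id = 0  # assumed value of tokenizer.pad_token_id for this checkpoint
batched_ids, attention_mask = pad_and_mask([[200, 200, 200], [200, 200]], pad_token_id)

print(batched_ids)
print(attention_mask)
```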

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

padd = tokenizer.pad_token_id

batched_sequence = [
    [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
    [  101,  1045,  5223,  2023,  2061,  2172,   999,   102, padd, padd, padd, padd, padd, padd, padd, padd],
]

attention_mask_sequence = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]

output_sequence = model(torch.tensor(batched_sequence), attention_mask=torch.tensor(attention_mask_sequence))
print(output_sequence.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


### **Longer Sequences**
With Transformer models, there is a **limit to the lengths of the sequences we can pass to the models**. Most models handle sequences of up to `512 or 1024 tokens`, and **will crash when asked to process longer sequences**.

**Two Solutions:**
- Use a model with a longer supported sequence length.
- Truncate your sequences.

If switching to a model with a longer supported sequence length is not an option, we recommend you **truncate** your sequences to a chosen `max_sequence_length`:

In [None]:
# Maximum number of tokens to keep (512 is the limit for BERT and DistilBERT)
max_sequence_length = 512

# `ids` is the list of token IDs built earlier; slicing keeps only the first tokens
ids = ids[:max_sequence_length]

## **Putting it all Together**
With the 🤗 Transformers API, when you call your **`tokenizer`** directly on a sentence, you get back inputs that are ready to pass through your model:

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [None]:
model_inputs

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here, the `model_inputs` variable contains everything that’s necessary for a model to operate well.

For **DistilBERT**, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the `tokenizer` object.

**What the `tokenizer` can do:**

First, it can tokenize a single sequence:

In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [None]:
model_inputs

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Second, it can `tokenize` multiple sequences at a time:

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

In [None]:
model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

Third, it can `pad` according to several objectives

In [None]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

In [None]:
model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0]]}
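
To make the padded output above concrete, here is a minimal pure-Python sketch of what right-padding to a target length amounts to. The helper `pad_batch` is hypothetical (it is not part of the Transformers API), and the pad ID of 0 is simply read off the output above for this checkpoint; other tokenizers may use a different pad ID or pad on the left.

```python
def pad_batch(batch, target_len=None, pad_id=0):
    """Hypothetical helper: right-pad each ID list and build matching attention masks."""
    if target_len is None:
        # Default to the longest sequence in the batch (like padding="longest")
        target_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        # Sequences already longer than target_len are left untouched (no truncation)
        n_pad = max(0, target_len - len(ids))
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer outputs shown earlier
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102],
]

padded = pad_batch(batch, target_len=8)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.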

Fourth, it can also `truncate` sequences

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

In [None]:
model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
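
Note that the truncated first sequence above still ends with ID 102, which suggests the text tokens are shortened and the special tokens are kept at both ends. A rough sketch of that behavior, assuming one special token on each side (IDs 101 and 102, as seen in the outputs above); the helper `truncate_with_specials` is hypothetical and is not the library's implementation:

```python
CLS_ID, SEP_ID = 101, 102  # read off the outputs above; specific to this checkpoint


def truncate_with_specials(content_ids, max_length):
    """Hypothetical helper: return at most max_length IDs, including both special tokens."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    kept = content_ids[: max_length - 2]  # reserve two slots for the special tokens
    return [CLS_ID] + kept + [SEP_ID]


# IDs of the first sentence without its special tokens
content = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

truncated = truncate_with_specials(content, max_length=8)
print(truncated)
```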

The `tokenizer` object can handle the conversion to specific framework tensors through its `return_tensors` argument, so the result can be sent directly to the model:
- PyTorch ➡ `"pt"`
- TensorFlow ➡ `"tf"`
- NumPy arrays ➡ `"np"`

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

In [None]:
model_inputs

{'input_ids': array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102],
       [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
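
The calls above pass `padding=True` together with `return_tensors` because a tensor or array has to be rectangular: every row needs the same length. A minimal pure-Python sketch of that constraint, where `batch_shape` is a hypothetical helper and not part of Transformers:

```python
def batch_shape(batch):
    """Hypothetical helper: return (rows, cols), or raise if the rows differ in length."""
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 2061, 2031, 1045, 999, 102]

ragged = [first, second]                                        # unpadded: 16 vs 6 tokens
padded = [first, second + [0] * (len(first) - len(second))]     # padded with ID 0

print(batch_shape(padded))
try:
    batch_shape(ragged)
except ValueError as err:
    print("cannot build a tensor:", err)
```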

### **Special Tokens**


In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


One token ID was added at the beginning, and one at the end.

In [None]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


The `tokenizer` added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end.

This is because the model was pretrained with those, so to get the same results for inference we need to add them as well.

Note that some models don’t add special words, or add different ones; models may also add these special words only at the beginning, or only at the end.

🎯 **In any case, the `tokenizer` knows which ones are expected and will deal with this for you.**
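
As a quick check of the two ID lists printed above, the sketch below confirms that they differ only by the first and last ID. The `SPECIAL` mapping is an assumption inferred from the decoded output (101 for `[CLS]`, 102 for `[SEP]`) and applies to this checkpoint only.

```python
# ID lists copied from the two print statements above
with_specials = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
without_specials = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

# Assumed mapping, inferred from the decoded strings above
SPECIAL = {101: "[CLS]", 102: "[SEP]"}

# Dropping the first and last ID recovers the plain token IDs
inner = with_specials[1:-1]
print(inner == without_specials)

# The two extra IDs are exactly the special tokens
added = [SPECIAL[with_specials[0]], SPECIAL[with_specials[-1]]]
print(added)
```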

### **Wrapping Up: Tokenizer to Model**
Let’s see one final time how the `tokenizer` can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

In [None]:
output.logits

tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>)
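
These logits are raw scores, not probabilities. A minimal sketch of the usual post-processing step, a softmax over each row, written in plain Python so it runs without the model. The label order in `id2label` is an assumption for this checkpoint; with the real model it would be read from `model.config.id2label`.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above
logits = [[-1.5607, 1.6123], [-3.6183, 3.9137]]

# Assumed label order for this checkpoint
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    predictions.append((id2label[best], probs[best]))
    print(id2label[best], round(probs[best], 4))
```

For the first sentence, the resulting probability should land close to the 0.9598 score that `pipeline("sentiment-analysis")` reported at the start of this section.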