# Must-read articles

1. https://huggingface.co/docs/transformers/en/philosophy
1. https://huggingface.co/blog/transformers-design-philosophy

# **Introduction**

As you saw in `chapter_1`, Transformer models are usually very large. With millions to tens of *billions* of parameters, training and deploying these models is a complicated undertaking. Furthermore, with new models being released on a near-daily basis and each having its own implementation, trying them all out is no easy task.

The ü§ó Transformers library was created to solve this problem. Its goal is to provide a single API through which any Transformer model can be loaded, trainied, and saved. The library's main features are: 

* **Ease of use**: Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code.
* **Flexibility**: At their core, all models are simple PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes and can be handled like any other models in their respective machine learning (ML) frameworks.
* **Simplicity**: Hardly any abstractions are made across the library. "All in one file" is a core concept: a model's forward pass is entirely defined in a single file, so that the code itself is understandable and hackable.

This last feature makes 🤗 Transformers quite different from other ML libraries. The models are not built on modules that are shared across files; instead, each model has its own layers. In addition to making the models more approachable and understandable, this allows you to easily experiment on one model without affecting others.

This chapter will begin with an end-to-end example where we use a model and a tokenizer together to replicate the `pipeline()` function introduced in `Chapter 1`. Next, we'll discuss the model API: we'll dive into the model and configuration classes, and show you how to load a model and how it processes numerical inputs to output predictions.

Then we'll look at the tokenizer API, which is the other main component of the `pipeline()` function. Tokenizers take care of the first and last processing steps, handling the conversion from text to numerical inputs for the neural network, and the conversion back to text when it is needed. Finally, we'll show you how to handle sending multiple sentences through a model in a prepared batch, then wrap it all up with a closer look at the high-level `tokenizer()` function.

# **Behind the pipeline**

Let's start with a complete example, taking a look at what happened behind the scenes when we executed the following code in *Chapter 1*:

3-stage pipeline

0. Dataset
1. Language tokenizer -> maps each word to its unique number
2. `pretrained_model` -> maps each word ID to its meaning vector (the embedding vector, whose length is the hidden layer size)
3. Model output

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for you.",
        "I hate this so much!"
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9972347617149353},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

As we saw in *Chapter 1*, this pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

Let's quickly go over each of these.

## **Preprocessing with a tokenizer**

Like other neural networks, Transformer models can't process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a *tokenizer*, which will be responsible for:

* Splitting the input into words, subwords, or symbols (like punctuation) that are called *tokens*.
* Mapping each token to an integer.
* Adding additional inputs that may be useful to the model.

All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the [Model Hub](https://huggingface.co/models). To do this, we use the *`AutoTokenizer`* class and its *`from_pretrained()`* method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model's tokenizer and cache it (so it's only downloaded the first time you run the code below).

Since the default checkpoint of the *`sentiment-analysis`* pipeline is *`distilbert-base-uncased-finetuned-sst-2-english`* (you can see its model card [here](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)), we run the following:

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  ## GPT or BERT.  2 foundational approaches in llm
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  

Once we have the tokenizer, we can directly pass our sentences to it and we'll get back a dictionary that's ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

You can use 🤗 Transformers without having to worry about which ML framework is used as a backend; it might be PyTorch or TensorFlow, or Flax for some models. However, Transformer models only accept *tensors* as input. If this is your first time hearing about tensors, you can think of them as NumPy arrays instead. A NumPy array can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions. It's effectively a tensor; other ML frameworks' tensors behave similarly, and are usually as simple to instantiate as NumPy arrays.

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument:

``` python
raw_input = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_input, padding=True, truncation=True, return_tensors='pt')
print(inputs)
```
Don't worry about padding and truncation just yet; we'll explain those later. The main things to remember here are that you can pass one sentence or a list of sentences, as well as specify the type of tensors you want to get back (if no type is passed, you will get a list of lists as a result).

``` python
output:> {'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
                               [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,    0,     0,     0,     0,     0,     0]]), 
        'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                                  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```

The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. `input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence.

In [15]:
raw_input = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_input, padding=True, truncation=True, return_tensors='pt') # Numeric ids => as PyTorch tensors
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [19]:
print(f"tokenizer returns multiple things => {inputs.keys()}")

# Show each raw sentence next to its input IDs and attention mask
for index, sentence in enumerate(raw_input):
    print(f"Sentence Number => {index + 1}, Sentence = {sentence.split()}")
    print(f"\t input_ids \t => {inputs['input_ids'][index]}, \n\t attention_mask => {inputs['attention_mask'][index]}")

tokenizer returns multiple things => dict_keys(['input_ids', 'attention_mask'])
Sentence Number => 1, Sentence = ["I've", 'been', 'waiting', 'for', 'a', 'HuggingFace', 'course', 'my', 'whole', 'life.']
	 input_ids 	 => tensor([  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
         2607,  2026,  2878,  2166,  1012,   102]), 
	 attention_mask => tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Sentence Number => 2, Sentence = ['I', 'hate', 'this', 'so', 'much!']
	 input_ids 	 => tensor([ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0,    0,    0,
           0,    0,    0,    0]), 
	 attention_mask => tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])


## **Going through the model**

We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:

```python
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
```

In this code snippet, we have downloaded the same checkpoint we used in our pipeline before (it should actually have been cached already) and instantiated a model with it. 

This architecture contains only the base Transformer module; given some inputs, it outputs what we'll call *hidden states*, also known as *features*. For each model input, we'll retrieve a high-dimensional vector representing the **contextual understanding of that input by the Transformer model**.

If this doesn't make sense, don't worry about it.

While these hidden states can be useful on their own, they're usually inputs to another part of the model, known as the *head*. Different tasks can be performed with the same architecture, but each of these tasks will have a different head associated with it.

In [10]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

### **A high-dimensional vector?**

The vector output by the Transformer module is usually large. It generally has three dimensions:

* **Batch size**: The number of sequences processed at a time (2 in our example).
* **Sequence length**: The length of the numerical representation of the sequence (16 in our example).
* **Hidden size**: The vector dimension of each model input.

It is said to be "high dimensional" because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model: 
```python 
outputs = model(**inputs)
print(outputs.last_hidden_state.shape) 
```

Note that the outputs of 🤗 Transformers models behave like `namedtuple`s or dictionaries. You can access the elements by attributes (like we did) or by key (`outputs["last_hidden_state"]`), or even by index if you know exactly where the thing you are looking for is (`outputs[0]`).

In [11]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])
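
As a rough illustration of where those three dimensions come from, here is a minimal pure-Python sketch of an embedding-style lookup. The helpers `toy_embed` and `shape_of`, the tiny hidden size of 4, and the made-up vector values are assumptions for illustration only; the real model uses learned weights and a hidden size of 768.

```python
def toy_embed(batch_ids, hidden_size=4):
    """Map every token ID to a made-up vector of length `hidden_size`."""
    return [
        [[float(token_id % 7) + 0.1 * k for k in range(hidden_size)] for token_id in ids]
        for ids in batch_ids
    ]


def shape_of(nested):
    """Return the shape of a non-empty rectangular nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


# Two already-padded sequences of three token IDs each (IDs chosen arbitrarily)
batch_ids = [[101, 2023, 102], [101, 102, 0]]
hidden_states = toy_embed(batch_ids)

# (batch size, sequence length, hidden size)
print(shape_of(hidden_states))
```

Each token ID becomes one vector, so the result always gains a trailing "hidden size" dimension on top of the batch and sequence dimensions.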


### **Model heads: Making sense out of numbers**

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg)

The output of the Transformer model is sent directly to the model head to be processed. 

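In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.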
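
To make the idea of a head "projecting onto a different dimension" more concrete, here is a minimal sketch of a single linear layer written in plain Python. The function name `linear_head`, the 4-dimensional hidden vector, and all weight and bias values are made up for illustration; a real head has learned parameters and may stack several layers.

```python
def linear_head(hidden_vector, weights, bias):
    """Project a hidden vector onto len(bias) output scores (logits)."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical 4-dimensional hidden state for one sequence
hidden_vector = [0.5, -1.0, 2.0, 0.0]

# One row of weights per output class (2 classes, values made up)
weights = [
    [1.0, 0.0, -1.0, 0.5],
    [-1.0, 0.5, 1.0, 0.0],
]
bias = [0.1, -0.1]

logits = linear_head(hidden_vector, weights, bias)
print(logits)  # two scores, one per class
```

The input has 4 values and the output has 2: the head reduces the hidden size to the number of classes.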

There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task.
Here is a non-exhaustive list:

* Model (retrieve the hidden states)
* ForCausalLM
* ForMaskedLM
* ForMultipleChoice
* ForQuestionAnswering
* ForSequenceClassification
* ForTokenClassification
* and others 🤗

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won't actually use the `AutoModel` class, but `AutoModelForSequenceClassification`:

In [14]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([2, 2])


## **Postprocessing the output**

The values we get as output from our model don't necessarily make sense by themselves. Let's take a look:

In [20]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


Our model predicted [`-1.5607`, `1.6123`] for the first sentence and [`4.1692`, `-3.3464`] for the second one. Those are not probabilities but logits, the raw, unnormalized scores outputted by the last layer of the model. To be converted to probabilities, they need to go through a `SoftMax` layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):


In [22]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim = -1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted [`0.0402`, `0.9598`] for the first sentence and [`0.9995`, `0.0005`] for the second one. These are recognizable probability scores.
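
For readers who want to see what the softmax step computes, here is a small sketch in plain Python using the logits printed earlier. The helper name `softmax` is our own; subtracting the maximum before exponentiating is a common numerical-stability trick and not something the text above prescribes.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output shown earlier
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

for row in logits:
    print([round(p, 4) for p in softmax(row)])
```

The rounded values should agree with the probabilities reported by `torch.nn.functional.softmax`, up to the rounding of the logits.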

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config (more on this in the next section):

In [23]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

* First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
* Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005 
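
The last step, turning probabilities into a label and a score, can be sketched as follows. This is only an approximation of what the pipeline returns, not its actual implementation; the helper name `to_labels` is hypothetical, and the probabilities are the rounded values from above.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_labels(probabilities, id2label):
    """Pick the most likely class for each row and attach its label."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append({"label": id2label[best], "score": row[best]})
    return results


# Rounded probabilities from the softmax output above
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]
print(to_labels(probabilities, id2label))
```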

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing!

## **Complete Pipeline**

In [29]:
import torch
import transformers

checkpoint = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)
model = transformers.AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [33]:
raw_inputs = [
    "We are very happy to see you",  # 7 words
    "I am happy",                    # 3 words
    "screw you",                     # 2 words
]
numeric_ids = tokenizer(raw_inputs, padding=True, return_tensors="pt")  # Numeric ids => as PyTorch tensors

outputs = model(**numeric_ids)

print(outputs.logits)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

print(model.config.id2label)

tensor([[-4.2508,  4.6017],
        [-4.3335,  4.6962],
        [ 3.7575, -3.0165]], grad_fn=<AddmmBackward0>)
tensor([[1.4301e-04, 9.9986e-01],
        [1.1978e-04, 9.9988e-01],
        [9.9886e-01, 1.1419e-03]], grad_fn=<SoftmaxBackward0>)
{0: 'NEGATIVE', 1: 'POSITIVE'}


# **Tokenizer**

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we'll explore exactly what happens in the tokenization pipeline.

In NLP tasks, the data that is generally processed is raw text. Here's an example of such text: 

```
Jim Henson was a puppeteer
```

However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That's what the tokenizers do, and there are a lot of ways to go about this. The goal is to find the most meaningful representation (that is, the one that makes the most sense to the model) and, if possible, the smallest representation.

Let's take a look at some examples of tokenization algorithms, and try to answer some of the questions you may have about tokenization.

### **Word-based**

The first type of tokenizer that comes to mind is word-based. It's generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to split the raw text into words and find a numerical representation for each of them:

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/word_based_tokenization-dark.svg)

There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python's `split()` function:



In [1]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large "vocabularies," where a vocabulary is defined by the total number of independent tokens that we have in our corpus.

Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.

If we want to completely cover a language with a word-based tokenizer, we'll need to have an identifier for each word in the language, which will generate a huge amount of tokens. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we'd need to keep track of that many IDs. Furthermore, words like "dog" are represented differently from words like "dogs", and the model will initially have no way of knowing that "dog" and "dogs" are similar: it will identify the two words as unrelated. The same applies to other similar words, like "run" and "running", which the model will not see as being similar initially.

Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the "unknown" token, often represented as `"[UNK]"` or `"<unk>"`. It's generally a bad sign *if you see that the tokenizer is producing a lot of these tokens, as it wasn't able to retrieve a sensible representation of a word and you're losing information along the way*. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
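
The word-to-ID mapping and the role of the unknown token can be sketched in a few lines of plain Python. The helpers `build_vocab` and `encode`, the two-sentence corpus, and the choice of giving `[UNK]` the ID 0 are assumptions made for this illustration; real tokenizers build their vocabularies from much larger corpora.

```python
def build_vocab(corpus, unk_token="[UNK]"):
    """Assign an ID to every whitespace-separated word; ID 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab, unk_token="[UNK]"):
    """Replace each word with its ID, falling back to the unknown token's ID."""
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]


corpus = ["Jim Henson was a puppeteer", "a dog was here"]
vocab = build_vocab(corpus)

print(encode("Jim was a dog", vocab))  # every word is in the vocabulary
print(encode("Jim had dogs", vocab))   # "had" and "dogs" map to the unknown ID
```

Note how "dogs" falls back to the unknown token even though "dog" is in the vocabulary: a word-based tokenizer has no notion that the two are related.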

One way to reduce the amount of unknown tokens is to go one level deeper, using a ***character-based*** tokenizer.

### **Character-based**

Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:

* The vocabulary is much smaller.
* There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

But here too some questions arise concerning spaces and punctuation:
![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/character_based_tokenization-dark.svg)

This approach isn't perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it's less meaningful: each character doesn't mean a lot on its own, whereas that is the case with words. However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin language.

Another thing to consider is that we'll end up with a very large amount of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
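
A quick way to see this trade-off is to count tokens for the same sentence under both schemes. This is a minimal sketch that treats every character, including spaces, as a token; how a real character-based tokenizer handles spaces and punctuation is a design choice not covered here.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()  # word-based: split on whitespace
char_tokens = list(text)    # character-based: one token per character

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(set(char_tokens)), "distinct characters")
```

The sequence gets several times longer, while the number of distinct symbols needed to cover the text stays small.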

To get the best of both worlds, we can use a third technique that combines the two approaches: `subword tokenization`.

### **Subword tokenization**

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly".

Here is an example showing how a subword tokenization algorithm would tokenize the sequence "Let's do tokenization!":

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/bpe_subword-dark.svg)

These subwords end up providing a lot of semantic meaning: for instance, in the example above "tokenization" was split into "token" and "ization", two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.
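
One simple way to produce such splits is a greedy longest-match-first search over a fixed vocabulary, in the spirit of how WordPiece-style tokenizers segment a word at inference time. The sketch below is an illustration under stated assumptions, not the algorithm behind the figure above: the function `subword_tokenize` and the five-entry vocabulary are hypothetical, and the `##` prefix for word-internal pieces simply mirrors the convention used by BERT tokenizers.

```python
def subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no known piece fits: give up on the word
        tokens.append(piece)
        start = end
    return tokens


# Tiny hypothetical vocabulary
vocab = {"token", "##ization", "annoying", "##ly", "do"}

print(subword_tokenize("tokenization", vocab))
print(subword_tokenize("annoyingly", vocab))
print(subword_tokenize("puppeteer", vocab))
```

With this vocabulary, the first two words are split into two pieces each, while the third cannot be covered and falls back to the unknown token.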

This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

**And more!**

Unsurprisingly, there are many more techniques out there. To name a few:

* Byte-level BPE, as used in GPT-2
* WordPiece, as used in BERT
* SentencePiece or Unigram, as used in several multilingual models

You should now have sufficient knowledge of how tokenizers work to get started with the API.

**Loading and saving**

Loading and saving tokenizers is as simple as it is with models. Actually, it's based on the same two methods: **from_pretrained()** and **save_pretrained()**. These methods will load or save the algorithm used by the tokenizer (a bit like the `architecture` of the model) as well as its vocabulary (a bit like the `weights` of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the ***BertTokenizer*** class:



In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to **AutoModel**, the **AutoTokenizer** class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [4]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
# tokenizer.save_pretrained("directory_on_my_computer")

### **Encoding**

Translating text to numbers is known as *encoding*. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

As we've seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called *tokens*. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a *vocabulary*, which is the part we download when we instantiate it with the `from_pretrained()` method. Again, we need to use the same vocabulary used when the model was pretrained.

To get a better understanding of the two steps, we'll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs (as shown in section 2).

***Tokenization***

The tokenization process is done by the *tokenize()* method of the tokenizer:

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That's the case here with `Transformer`, which is split into two tokens: `Trans` and `##former`.

### From tokens to input IDs

The conversion to input IDs is handled by the *`convert_tokens_to_ids()`* tokenizer method:

In [8]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


### **Decoding**

***Decoding*** is going the other way around: from vocabulary indices, we want to get a string. This can be done with the *`decode()`* method as follows:

In [None]:
decoded_string = tokenizer.decode([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(decoded_string)

Using a Transformer network is simple


Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
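
The grouping step can be sketched in plain Python for the `##` convention seen in the tokens above. The helper `merge_wordpieces` is hypothetical and covers only the merging of word pieces; the real `decode()` method also handles special tokens and other cleanup.

```python
def merge_wordpieces(tokens):
    """Glue '##'-prefixed pieces onto the previous token, then join with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation of the previous word
        else:
            words.append(token)
    return " ".join(words)


tokens = ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
print(merge_wordpieces(tokens))
```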

By now you should understand the atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string. However, we've just scraped the tip of the iceberg.

# **Handling multiple sequences**

* How do we handle multiple sequences?
* How do we handle multiple sequences *of different* lengths?
* Are vocabulary indices the only inputs that allow a model to work well?
* Is there such a thing as too long a sequence? 

Let's see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API.


### **Models expect a batch of inputs**

In the previous exercise you saw how sequences get translated into lists of numbers. Let's convert this list of numbers to a tensor and send it to the model:

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)


sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)

# This line will fail: the tensor has no batch dimension
model(input_ids)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


IndexError: too many indices for tensor of dimension 1

Oh no! Why did this fail? We followed the steps from the pipeline.

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a **sequence**. But if you look closely, you'll see that the tokenizer didn't just convert the list of input IDs into a tensor, it added a dimension on top of it:

In [4]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
tokenized_inputs["input_ids"]

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# Correct tokenization approach
inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True)

# Move model and data to MPS (for M1 Mac support)
device = torch.device("mps") if torch.backends.mps.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

print("Input IDs:", inputs['input_ids'])

# Forward pass
with torch.no_grad():
    output = model(**inputs)

print("Logits:", output.logits)


Input IDs: tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]], device='mps:0')
Logits: tensor([[-1.5607,  1.6123]], device='mps:0')


`Batching` is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence: 
```python
batched_ids = [ids, ids]
```

This is a batch of two identical sequences!
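
The difference between a bare sequence and a batch is just one extra level of nesting. The sketch below counts nesting levels with a hypothetical helper `ndim`; the IDs are the ones shown earlier for "I hate this so much!", without the special tokens.

```python
def ndim(nested):
    """Count how many list levels wrap the values (assumes non-empty lists)."""
    depth = 0
    while isinstance(nested, list):
        depth += 1
        nested = nested[0]
    return depth


ids = [1045, 5223, 2023, 2061, 2172, 999]

single = ids             # what we sent to the model when it failed
batch_of_one = [ids]     # a batch containing a single sequence
batched_ids = [ids, ids] # a batch of two identical sequences

print(ndim(single), ndim(batch_of_one), ndim(batched_ids))
print(len(batched_ids), len(batched_ids[0]))  # batch size, sequence length
```

Only the last two have the two dimensions (batch, sequence) that the model expects.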

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There's a second issue, though. When you're trying to batch together two (or more) sentences, they might be of different lengths. If you've ever worked with tensors before, you know that they need to be of rectangular shape, so you won't be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually `pad` the inputs.

### **Padding the inputs**

The following list of lists cannot be converted to a tensor:

```python
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
```

In order to work around this, we'll use *`padding`* to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special word called the *`padding token`* to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:

```python
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
```
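
The padding step itself is simple enough to sketch by hand. The helper `pad_batch` is hypothetical and pads on the right up to the longest sequence in the batch, which is an assumption of this sketch; in practice the tokenizer does this for you when called with `padding=True`.

```python
def pad_batch(batch, padding_id):
    """Right-pad every sequence with `padding_id` up to the longest length."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [padding_id] * (max_len - len(seq)) for seq in batch]


padding_id = 100
batched_ids = [
    [200, 200, 200],
    [200, 200],
]

print(pad_batch(batched_ids, padding_id))
```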



In [2]:
# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Sequences in raw text form
sequences = ["I love HuggingFace!", "Transformers are amazing."]

# Tokenize the sequences with padding and attention mask
inputs = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    output = model(**inputs)

# Print logits for all sequences
print(output.logits)

tensor([[-4.1373,  4.4470],
        [-4.3598,  4.6863]])


In [3]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # Example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = torch.tensor([[200, 200, 200]], dtype=torch.long)
sequence2_ids = torch.tensor([[200, 200, tokenizer.pad_token_id]], dtype=torch.long)

# Padding to ensure uniform shape
batched_ids = torch.tensor([
    [200, 200, 200], 
    [200, 200, tokenizer.pad_token_id]  # Ensuring same shape
], dtype=torch.long)

# Forward pass
print(model(sequence1_ids).logits)
print(model(sequence2_ids).logits)
print(model(batched_ids).logits)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 0.1238, -0.0997]], grad_fn=<AddmmBackward0>)
tensor([[ 0.1131, -0.0852]], grad_fn=<AddmmBackward0>)
tensor([[ 0.1238, -0.0997],
        [ 0.1131, -0.0852]], grad_fn=<AddmmBackward0>)


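In the run above, `sequence2_ids` already ends with the padding token, so its logits are identical to the second row of the batch. When the second sequence is instead passed on its own without padding (`[[200, 200]]`), the expected output looks like this (it was obtained with different weights, so the absolute values differ from the run above):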
```python
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
```
There's something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we've got completely different values!

This is because the key feature of Transformer models is attention layers that *`contextualize`* each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
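
To make this concrete, here is a toy sketch of how ignoring a position changes attention weights. It is not the library's implementation: `attention_weights` is a hypothetical helper, the scores are made-up numbers, and it assumes at least one position is left unmasked.

```python
import math


def attention_weights(scores, mask=None):
    """Softmax over raw attention scores; positions with mask 0 get weight 0."""
    if mask is None:
        mask = [1] * len(scores)
    # Masked positions are set to -inf so that exp(-inf) == 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for one query token against [token, token, padding].
scores = [2.0, 1.0, 0.5]

with_padding = attention_weights(scores)                  # padding takes some weight
with_mask = attention_weights(scores, mask=[1, 1, 0])     # padding gets zero weight
without_padding = attention_weights(scores[:2])           # the unpadded sequence

print(with_padding)
print(with_mask)
print(without_padding)
```

With the mask applied, the weights on the real tokens match those of the unpadded sequence, which is why the padded and unpadded inputs can produce the same result.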

### **Attention masks**

*`Attention masks`* are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

Let's complete the previous example with an attention mask:

In [9]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 0.1238, -0.0997],
        [ 0.2877, -0.0793]], grad_fn=<AddmmBackward0>)


Now the second row holds the logits we would get by passing the second sentence through the model on its own, without padding.

Notice how the last value of the second sequence is a padding ID, which is a 0 value in the attention mask.
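
The relationship between padded IDs and the mask can be sketched as follows. `build_attention_mask` is a hypothetical helper, and the sketch assumes the padding ID (here `0`, as in the BERT outputs in this chapter) never occurs as a real token; the tokenizer itself produces the mask for you.

```python
def build_attention_mask(batched_ids, pad_id):
    """Return 1 for real tokens and 0 for padding, with the same shape as the IDs."""
    return [[0 if token_id == pad_id else 1 for token_id in seq]
            for seq in batched_ids]


pad_id = 0  # assumed padding ID
batched_ids = [
    [200, 200, 200],
    [200, 200, pad_id],
]

attention_mask = build_attention_mask(batched_ids, pad_id)
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```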

In [16]:
# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Sequences in raw text form
sequences = ["I have been waiting for a HuggingFace course my whole life.", "I hate this so mush!"]

# Tokenize the sequences with padding and attention mask
inputs = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
print(inputs)

# Run inference
with torch.no_grad():
    output = model(**inputs)

# Print logits for all sequences
print(output.logits)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': tensor([[  101,  1045,  2031,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061, 14163,  4095,   999,   102,     0,
             0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}
tensor([[ 0.1148, -0.0106],
        [ 0.1838, -0.0007]])


### **Longer sequences**

With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:
* Use a model with a longer supported sequence length.
* Truncate your sequences. 

Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you're working on a task that requires very long sequences, we recommend you take a look at those models.

Otherwise, we recommend you truncate your sequences by specifying the `max_sequence_length` parameter:

```python
sequence = sequence[:max_sequence_length]
```
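
Applied to a whole batch, the same slicing might look like the sketch below. `truncate_batch` is a hypothetical helper and the limit of `4` is only for illustration; real limits depend on the model.

```python
def truncate_batch(batch, max_sequence_length):
    """Keep at most `max_sequence_length` IDs from the start of each sequence."""
    return [seq[:max_sequence_length] for seq in batch]


max_sequence_length = 4  # tiny illustrative limit
batch = [[11, 12, 13, 14, 15, 16], [21, 22, 23]]

truncated = truncate_batch(batch, max_sequence_length)
print(truncated)  # [[11, 12, 13, 14], [21, 22, 23]]
```

Slicing keeps the beginning of each sequence and discards the end; sequences already within the limit are left unchanged.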

### **Putting it all together**

In the last few sections, we've been trying our best to do most of the work by hand. We've explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in section 2, the 🤗 Transformers API can handle all of this for us with a high-level function that we'll dive into here. When you call your `tokenizer` directly on the sentence, you get back inputs that are ready to pass through your model:

In [20]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [28]:
model_inputs['input_ids'], model_inputs['attention_mask']

([101,
  1045,
  1005,
  2310,
  2042,
  3403,
  2005,
  1037,
  17662,
  12172,
  2607,
  2026,
  2878,
  2166,
  1012,
  102],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Here, the `model_inputs` variable contains everything that's necessary for a model to operate well. For *`DistilBERT`*, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the *`tokenizer`* object.

As we'll see in some examples below, this method is very powerful. First, it can tokenize a single sequence:

```python
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
model_inputs
# Output -> {'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

It also handles multiple sequences at a time, with no change in the API:

In [53]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)
print(model_inputs)
print(len(model_inputs['input_ids'][1]))

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
6


In [58]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
print(model_inputs)
print(len(model_inputs['input_ids'][1]))

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
16


In [55]:
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

len(model_inputs['input_ids'][1])

512

In [57]:
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
print(model_inputs) 
len(model_inputs['input_ids'][1])

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0]]}


8

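The lengths printed above (16, 512, and 8) follow from the padding strategy requested in each call. Unpadded, the sentence *`'So have I!'`* is **6** tokens long, a count that includes the special IDs `101` and `102` added at the start and end. Note that in the last call the first sequence keeps its 16 tokens: `max_length=8` only pads shorter sequences, because truncation was not requested.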
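
The three padding behaviors can be mimicked with a small sketch. `pad_like_tokenizer` is a hypothetical helper written only to reproduce the lengths seen above; it is not how the tokenizer is implemented, and the model limit of 512 and padding ID of 0 are assumptions taken from the outputs in this section.

```python
MODEL_MAX_LENGTH = 512  # assumed model limit (BERT / DistilBERT in this section)
PAD_ID = 0              # padding ID seen in the outputs above


def pad_like_tokenizer(batch, padding, max_length=None):
    """Mimic the three padding modes shown above (no truncation is applied)."""
    if padding == "longest":
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        target = max_length if max_length is not None else MODEL_MAX_LENGTH
    else:
        raise ValueError(f"unsupported padding mode: {padding!r}")
    # Sequences already longer than `target` are left untouched.
    return [seq + [PAD_ID] * max(0, target - len(seq)) for seq in batch]


long_ids = list(range(1, 17))  # stand-in for the 16-token first sentence
short_ids = list(range(1, 7))  # stand-in for the 6-token "So have I!"
batch = [long_ids, short_ids]

lengths_longest = [len(s) for s in pad_like_tokenizer(batch, "longest")]
lengths_model_max = [len(s) for s in pad_like_tokenizer(batch, "max_length")]
lengths_eight = [len(s) for s in pad_like_tokenizer(batch, "max_length", max_length=8)]

print(lengths_longest)    # [16, 16]
print(lengths_model_max)  # [512, 512]
print(lengths_eight)      # [16, 8]
```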


It can also truncate sequences:

In [59]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
print(model_inputs)
print("//////")

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
print(model_inputs)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
//////
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
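
Notice that the truncated first sequence still starts with `101` and ends with `102`. One way to picture this is that two slots are reserved for those IDs before the remaining tokens are cut. The sketch below reproduces the output above under that assumption; `encode_with_truncation` is a hypothetical helper, and it assumes `max_length` is at least 2.

```python
CLS_ID, SEP_ID = 101, 102  # the IDs that wrap every sequence in the outputs above


def encode_with_truncation(content_ids, max_length):
    """Truncate the content so that the wrapped sequence fits in `max_length`."""
    budget = max_length - 2  # reserve two slots for the wrapping IDs
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


# Token IDs of the two sentences without the wrapping IDs.
first = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662,
         12172, 2607, 2026, 2878, 2166, 1012]
second = [2061, 2031, 1045, 999]

truncated_first = encode_with_truncation(first, max_length=8)
truncated_second = encode_with_truncation(second, max_length=8)

print(truncated_first)   # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
print(truncated_second)  # [101, 2061, 2031, 1045, 999, 102]
```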


The tokenizer object can handle the conversion to specific framework tensors, which can then be directly sent to the model. For example, in the following code sample we are prompting the tokenizer to return tensors from the different frameworks. "`pt`" returns PyTorch tensors, "`tf`" returns TensorFlow tensors, and "`np`" returns NumPy arrays:

In [66]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(model_inputs)

# Returns TensorFlow tensors    
# model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")  # requires TensorFlow to be installed
# print(model_inputs)

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(model_inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
{'input_ids': array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102],
       [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}


### **Special Tokens**

If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

In [67]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


One token ID was added at the beginning, and one at the end. Let's decode the two sequences of IDs above to see what this is about:

In [68]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


The tokenizer added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don't add special words, or add different ones; models may also add those special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
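
As a rough sketch of the idea, wrapping the plain IDs with a model-specific prefix and suffix reproduces what the tokenizer returned. `add_special_tokens` is a hypothetical helper; `101` and `102` are the IDs seen above, while the "end-only" layout with ID `2` is invented purely to illustrate a different convention.

```python
def add_special_tokens(ids, prefix=(), suffix=()):
    """Wrap plain token IDs with whatever special IDs a model expects."""
    return list(prefix) + list(ids) + list(suffix)


ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662,
       12172, 2607, 2026, 2878, 2166, 1012]

# Layout seen above: one ID at the beginning and one at the end.
bert_style = add_special_tokens(ids, prefix=[101], suffix=[102])

# Invented layout: a single ID (2) added only at the end.
end_only = add_special_tokens(ids, suffix=[2])

print(bert_style)
print(end_only)
```

In practice you never need to do this by hand: calling the tokenizer directly applies the layout the checkpoint was trained with.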

### **Wrapping up: From tokenizer to model** 

Now that we've seen all the individual steps the *`tokenizer`* object uses when applied on texts, let's see one final time how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

In [76]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequence, padding=True, truncation=True, return_tensors='pt')
output= model(**tokens)

In [77]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
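
The logits above are raw scores, not probabilities. A minimal sketch of turning them into probabilities with a softmax is shown below, using plain Python and the values printed above. Which index corresponds to which label depends on the checkpoint's configuration, so no label names are assumed here.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above, one row per input sentence.
logits = [[-1.5607, 1.6123], [-3.6183, 3.9137]]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

For both sentences, almost all of the probability mass lands on the second class.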

Great job following the course up to here! To recap, in this chapter you:

* Learned the basic building blocks of a Transformer model.
* Learned what makes up a tokenization pipeline.
* Saw how to use a Transformer model in practice.
* Learned how to leverage a tokenizer to convert text to tensors that are understandable by the model.
* Set up a tokenizer and a model together to get from text to predictions.
* Learned the limitations of input IDs, and learned about attention masks.
* Played around with versatile and configurable tokenizer methods.

From now on, you should be able to freely navigate the 🤗 Transformers docs: the vocabulary will sound familiar, and you've already seen the methods that you'll use the majority of the time.


In [78]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
result = tokenizer.tokenize("Hello!")

In [80]:
result

['Hello', '!']