# Introduction To HuggingFace Transformers 

In [1]:
# Built-in library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
from pprint import pprint
import pandas as pd

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black
# auto reload imports
%load_ext autoreload
%autoreload 2

In [2]:
from transformers import pipeline


classifier = pipeline(task="sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598051905632019},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### Preprocessing With a Tokenizer

```text
- Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. 
- To do this we use a tokenizer, which will be responsible for:
  - Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
  - Mapping each token to an integer
  - Adding additional inputs that may be useful to the model

- All this preprocessing needs to be done in exactly the same way as when the model was pretrained.
- The `AutoTokenizer` class and its `from_pretrained()` method are used to download and cache the data associated with the model's tokenizer. 
- This is done automatically using the checkpoint name of the model. 
- The data is only downloaded the first time the code is run.
```
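The three responsibilities listed above can be sketched with a toy tokenizer in plain Python. This is only an illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, and the vocabulary is hand-written rather than learned, unlike the one a real checkpoint ships with.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer downloads a learned one.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    """Step 1: lowercase, then split into words and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Steps 2 and 3: map tokens to integers, then add extra model inputs."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("I hate this so much!")
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

A real tokenizer typically splits into subwords rather than whole words, but the overall flow (split, look up IDs, add extra inputs) is the same.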

In [3]:
from transformers import AutoTokenizer


# The default checkpoint of the sentiment-analysis pipeline is:
# distilbert-base-uncased-finetuned-sst-2-english
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [4]:
# Transformer models only accept tensors as input.
# To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), use the return_tensors argument:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
pprint(inputs)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])}
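The zeros in the output above come from `padding=True`: the shorter sentence is padded to the length of the longer one, and `attention_mask` marks which positions are real tokens. A minimal sketch of that idea in plain Python follows. `pad_batch` is a hypothetical helper, and the padding ID of 0 is an assumption read off the output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the output above, with the padding removed.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
         17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

padded = pad_batch([first, second])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```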


### Going Through The Model

```text
- Download the pretrained model the same way as the tokenizer.
- 🤗 Transformers provides an AutoModel class which also has a from_pretrained() method:
```

In [5]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

### High-dimensional Vector

```text
- The vector output by the Transformer module is usually large. 
- It generally has three dimensions:
  - Batch size: The number of sequences processed at a time (2 in our example).
  - Sequence length: The length of the numerical representation of the sequence (16 in our example).
  - Hidden size: The vector dimension of each model input.
  
- It's said to be “high dimensional” because of the last value. 
- The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

```

In [6]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


### [Model Heads](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt)

```text

- The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. 
- They are usually composed of one or a few linear layers.
- The output of the Transformer model is sent directly to the model head to be processed.
```
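To make the projection concrete, here is a rough sketch of a single linear layer that maps one 768-dimensional hidden vector to two scores. It assumes a head made of just one layer with randomly initialised weights; `linear_head` is a hypothetical helper, and the head of an actual checkpoint may contain more layers.

```python
import random

random.seed(0)

HIDDEN_SIZE = 768  # hidden size reported by the base model above
NUM_LABELS = 2     # e.g. NEGATIVE / POSITIVE


def linear_head(hidden_vector, weights, bias):
    """Project one hidden-state vector onto one raw score (logit) per label."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


# Random stand-ins for a trained head and for one sequence's hidden state.
weights = [[random.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS
hidden_vector = [random.gauss(0, 1) for _ in range(HIDDEN_SIZE)]

logits = linear_head(hidden_vector, weights, bias)
print(len(hidden_vector), "->", len(logits))
```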

In [7]:
from transformers import AutoModelForSequenceClassification


# In this example, we'll need a model with a sequence classification head (to be able to classify the sentences as positive or negative).
# So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [8]:
# Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.
print(outputs.logits.shape)

torch.Size([2, 2])


### Postprocessing the output

```text
- The model provided logits [-1.5607, 1.6123] for the first sentence and [4.1692, -3.3464] for the second sentence. 
- Logits are raw scores that require conversion to probabilities using a SoftMax layer.
```

In [9]:
# The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [10]:
import torch


# The model predicted [-1.5607, 1.6123] for the first sentence and [4.1692, -3.3464] for the second one.
# These are logits, not probabilities. To convert them to probabilities, they need to go through a SoftMax layer.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [11]:
# The model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second one. These are probability scores.
# The labels corresponding to each position can be found by inspecting the id2label attribute of the model config.
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
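The whole postprocessing step (softmax, pick the most likely class, look up its label) can also be written by hand. The sketch below uses only the standard library; the logits and the label mapping are copied from the outputs above, and `softmax` is a local helper rather than the PyTorch function.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # copied from model.config.id2label


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # logits printed earlier

for logits in batch_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    print({"label": ID2LABEL[best], "score": round(probs[best], 4)})
```

Up to rounding of the printed logits, the labels and scores should agree with what the `pipeline` returned at the start of the notebook.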

<br><hr>

### Ex 1:

```text
✏️ Choose two (or more) texts of your own and run them through the sentiment-analysis pipeline. 
- Replicate the steps you saw here yourself and check that you obtain the same results!
```

In [12]:
# Method 1: Using the pipeline
raw_inputs = [
    "Yesterday's football match was not the greatest.",
    "I'm looking forward to starting my consultancy firm.",
    "One of my favourite quoutes is 'All you have is all you need!'",
    "Another favourite of mine is 'If you have never failed at anything, you have never tried anything new!'",
    "We've been making very poor choices as a nation for a long time.",
]
task = "sentiment-analysis"
clf = pipeline(task=task)
clf(raw_inputs)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9985768795013428},
 {'label': 'POSITIVE', 'score': 0.9977651834487915},
 {'label': 'POSITIVE', 'score': 0.9936624765396118},
 {'label': 'POSITIVE', 'score': 0.9965687990188599},
 {'label': 'NEGATIVE', 'score': 0.9996651411056519}]

In [13]:
tokenizer??

Signature:
tokenizer(
    text: Union[str, List[str], List[List[str]]] = None,
    text_pair: Union[str, List[str], List[List[str]], NoneType] = None,
    text_target: Union[str, List[str], List[List[str]]] = None,


In [14]:
from rich import print


# Method 2: Manual Approach
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
input = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(input)

In [15]:
clf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
output = clf_model(**input)
print(output)

In [16]:
import torch.nn.functional as F

logits_ = output.logits
prob = F.softmax(logits_, dim=1)
print(prob)

In [17]:
# Label names
clf_model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

## Models

```text

- The `AutoModel` class is a simple wrapper over the wide variety of models available in the library. 
- It can automatically guess the appropriate model architecture for your checkpoint, and then instantiates a model with this architecture.
- If you know the type of model you want to use, you can use the class that defines its architecture directly. 
- For example, to use a BERT model, you can use the BertModel class.
```

<br>

### Creating a Transformer

```text

The first thing we’ll need to do to initialize a BERT model is load a configuration object:
```

In [18]:
from transformers import BertConfig, BertModel


# Building the config
config = BertConfig()

# Building the model from the config
# Weights are initialized randomly
model = BertModel(config)

In [19]:
# The configuration contains many attributes that are used to build the model:
print(config)

### Different loading methods
```text

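Creating a model from the default configuration initializes it with random values, so it needs training before it is useful. To load a model that is already trained, use the `from_pretrained()` method instead: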
```

In [20]:
from transformers import BertModel


# Load the pretrained model weights
model = BertModel.from_pretrained("bert-base-cased")

- The AutoModel class is a checkpoint-agnostic wrapper over the wide variety of models available in the library. This means that if your code works for one checkpoint, it should work seamlessly with another, even if the architecture is different.


- In the code sample above we didn’t use BertConfig, and instead loaded a pretrained model via the bert-base-cased identifier. This is a model checkpoint that was trained by the authors of BERT themselves; you can find more details about it in its [model card](https://huggingface.co/bert-base-cased).

- This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

<br>

### Saving methods

```text
- The `save_pretrained()` method saves a model's configuration and weights to a directory:
```
<br>

```python
model.save_pretrained("directory_on_my_computer")
```

<br>

### Using a Transformer Model For Inference

In [21]:
sequences = ["Hello!", "Cool.", "Nice!"]

encoded_sequences = tokenizer(sequences).get("input_ids")
print(encoded_sequences)

#### Using the tensors as inputs to the model

```text

- Making use of the tensors with the model is extremely simple.
- We just call the model with the inputs:
```

In [22]:
input = torch.tensor(encoded_sequences)
output = model(input)

print(output)

## Tokenizers

```text
- Tokenizers are one of the core components of the NLP pipeline. 
- They're used to translate text into data that can be processed by the model. 
- Since models can only process numbers, tokenizers are required to convert the text inputs to numerical data. 

- In NLP tasks, the data that is generally processed is raw text. Here’s an example of such text: "Jim Henson was a puppeteer".
```

<br>

#### Word-based Tokenizers

[![image.png](https://i.postimg.cc/hvHrwMbk/image.png)](https://postimg.cc/YLzYL6PR)

In [23]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

```text

- Word tokenizers can have additional rules for handling punctuation, resulting in larger vocabularies. 
- A vocabulary is the set of distinct tokens in a corpus; its size is the number of those tokens.

- Word-based tokenization assigns a unique ID to each word, starting from 0 and going up to the size of the vocabulary. 
- However, covering an entire language with word-based tokenization requires a massive number of tokens. 
- For example, the English language has over 500,000 words, so a full word-level vocabulary would need that many IDs.
- Moreover, related forms such as "dog" and "dogs" or "run" and "running" receive unrelated IDs, so the model does not initially know that they are similar.

- To handle words not in the vocabulary, a custom token called the "unknown" token is used, typically denoted as "[UNK]" or "<unk>".
- If the tokenizer generates many unknown tokens, it indicates a loss of information and a suboptimal representation of words.
- The aim when creating the vocabulary is to minimize the number of words tokenized into the unknown token. 
- One approach to achieve this is by using a character-based tokenizer, which delves deeper into the structure of words.
```
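A small sketch of a word-based vocabulary with an unknown token is shown below. `build_vocab` and `encode` are hypothetical helpers and the two-sentence corpus is made up for illustration; reserving ID 0 for `[UNK]` is an arbitrary choice.

```python
def build_vocab(corpus):
    """Assign an ID to every distinct whitespace-separated word; ID 0 is [UNK]."""
    vocab = {"[UNK]": 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Map each word to its ID, falling back to the [UNK] ID for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]


vocab = build_vocab(["the dog runs", "the dogs run"])
print(vocab)
print(encode("the dog is running", vocab))
```

Note that "dog" and "dogs" end up with unrelated IDs, and that "is" and "running" both collapse to the unknown ID, which is exactly the information loss described above.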

#### Character-based

```text
- Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:
  - The vocabulary is much smaller.
  - There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

Limitations
-----------
- Using a character-based tokenizer has its limitations. 
- While it captures more information in languages like Chinese, where characters carry meaning, it may be less meaningful in languages using Latin characters. 
- Additionally, this approach results in a larger number of tokens for the model to process. 
- A single word token in a word-based tokenizer can transform into 10 or more tokens in a character-based tokenizer.
```
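The trade-off can be seen on the example sentence with a few lines of plain Python. This is only a sketch: spaces are dropped to keep it simple, whereas a real character-based tokenizer would need some way to represent them.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text.replace(" ", ""))  # spaces dropped to keep the sketch simple

char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters")
```

The character-level vocabulary stays tiny, but the model has to process several times as many tokens for the same sentence.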

<br>

[![image.png](https://i.postimg.cc/HLd9MrLC/image.png)](https://postimg.cc/68bZJ5RH)

<br>

#### Subword tokenization

```text
- Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
- For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. 
- These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
```
<br>

> Here is an example showing how a subword tokenization algorithm would tokenize the sequence `Let’s do tokenization!`:


```text
Let’s</w> | do | token | ization |  !</w>
```

<br>

```text
- Subword tokenization provides semantic meaning by splitting words into smaller units, enabling efficient representation and good coverage with minimal unknown tokens.
- E.g. in the example above “tokenization” was split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word).
- This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
```
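One simple way to produce such splits is greedy longest-match-first lookup against a fixed subword vocabulary, with a `##` prefix marking continuation pieces (the convention BERT-style tokenizers use). The sketch below assumes a tiny hand-written vocabulary; `subword_tokenize` and `toy_vocab` are hypothetical, and real tokenizers learn their vocabulary from data.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary chosen for illustration only.
toy_vocab = {"token", "##ization", "annoying", "##ly", "do"}

print(subword_tokenize("tokenization", toy_vocab))
print(subword_tokenize("annoyingly", toy_vocab))
print(subword_tokenize("do", toy_vocab))
```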


<br>

#### Other Types of Subword Tokenizers

```text

- There are many more techniques out there. To name a few:
  - Byte-level BPE, as used in GPT-2
  - WordPiece, as used in BERT
  - SentencePiece or Unigram, as used in several multilingual models
  ```

### Loading and Saving

```text
Loading and saving tokenizers is based on the same two methods: `from_pretrained()` and `save_pretrained()`. 
- These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).
- Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:
```

In [24]:
from transformers import BertTokenizer


tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [25]:
from transformers import AutoTokenizer


# Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based
# on the checkpoint name, and can be used directly with any checkpoint:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

- Saving a tokenizer is identical to saving a model:
  
```python
tokenizer.save_pretrained("directory_on_my_computer")
```

<br><hr>

### Encoding

```text
- Translating text to numbers is known as encoding. 
- Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

- As we’ve seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. 
- There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

- The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. 
- Again, we need to use the same vocabulary used when the model was pretrained.
```

<br>

[![image.png](https://i.postimg.cc/t456dZBQ/image.png)](https://postimg.cc/k22Dq4rf)

<br>

### Tokenization

```text
- The tokenization process is done by the tokenize() method of the tokenizer:
```

In [26]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

In [27]:
# This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that
# can be represented by its vocabulary. That’s the case here with transformer, which
# is split into two tokens: transform and ##er.

["Using", "a", "transform", "##er", "network", "is", "simple"]

['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

#### From tokens to input IDs

```text
The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:
```



In [28]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

### Ex 2:

```text 
- Replicate the two last steps (tokenization and conversion to input IDs) on the input sentences we used in section 2. 
- i.e. (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). 
- Check that you get the same input IDs we got earlier!
```

In [29]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
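# tokenize() expects a single string, so tokenize each sentence separately
tokens = [tokenizer.tokenize(text) for text in raw_inputs]

print(tokens)

In [30]:
ids = [tokenizer.convert_tokens_to_ids(sentence_tokens) for sentence_tokens in tokens]

# Compared with the previous result, these IDs lack the start/end special tokens and the padding.
print(ids)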

Previous result:

```python

{
  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]),
  'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])
}
```

<hr><br>

### Decoding

```text
- Decoding is going the other way around: from vocabulary indices, we want to get a string. 
- This can be done with the decode() method as follows:
```

In [31]:
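# These IDs come from the bert-base-cased vocabulary, so reload that tokenizer first
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)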

#### Note 

```text
- Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. 
- This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).

```
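The grouping behaviour can be sketched in a few lines, assuming BERT-style `##` continuation markers. `join_wordpieces` is a hypothetical helper; the real `decode()` method does more than this, such as handling special tokens.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces back into whole words and join with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


print(join_wordpieces(["Using", "a", "transform", "##er", "network", "is", "simple"]))
```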

## [Handling Multiple Sequences](https://huggingface.co/learn/nlp-course/chapter2/5?fw=pt)

```text
- In the previous section, we explored the simplest of use cases: doing inference on a single sequence of a small length. 
- However, some questions emerge already:

  - How do we handle multiple sequences?
  - How do we handle multiple sequences of different lengths?
  - Are vocabulary indices the only inputs that allow a model to work well?
  - Is there such a thing as too long a sequence?
```

### Models expect a batch of inputs


```text
- In the previous exercise you saw how sequences get translated into lists of numbers. 
- Let’s convert this list of numbers to a tensor and send it to the model:
```



In [32]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
# Convert to a 2D Tensor
input_ids = torch.tensor(ids).view(1, -1)

output = model(input_ids)

print("Logits:", output.logits)

In [33]:
# Batching is the act of sending multiple sentences through the model, all at once.
# Even with a single sentence you can build a batch; here the same sequence is repeated twice:
batched_ids = [ids, ids]
input = torch.tensor(batched_ids)
print(input.shape)

model(input)

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### Longer sequences

```text
- With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. 
- Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. 
- There are two solutions to this problem:
  - Use a model with a longer supported sequence length.
  - Truncate your sequences.

- Models have different supported sequence lengths, and some specialize in handling very long sequences. 
- Longformer is one example, and another is LED. 
- If you’re working on a task that requires very long sequences, it's recommended that you take a look at those models.
- Otherwise, truncate your sequences by specifying the max_sequence_length parameter:
```


```python
sequence = sequence[:max_sequence_length]
```
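When the sequence is a list of token IDs, a plain slice would also cut off the closing special token. A hedged sketch of truncation that keeps both special tokens is given below. `truncate_with_specials` is a hypothetical helper, and the IDs 101 and 102 are taken from the tokenizer output shown earlier for this checkpoint.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the tokenizer output earlier


def truncate_with_specials(token_ids, max_length):
    """Trim the content tokens so that [CLS] + content + [SEP] fits in max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for both special tokens")
    content = token_ids[: max_length - 2]
    return [CLS_ID] + content + [SEP_ID]


# Content IDs of the first example sentence, without special tokens.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

truncated = truncate_with_specials(ids, max_length=8)
print(truncated)
print(len(truncated))
```

In practice, passing `truncation=True` and `max_length` to the tokenizer, as shown later in this notebook, takes care of this.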
<br>

- [Longformer](https://huggingface.co/transformers/model_doc/longformer.html)
- [LED](https://huggingface.co/transformers/model_doc/led.html)

## Putting It All Together

In [35]:
from transformers import AutoTokenizer


checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [36]:
# For a single sequence
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [37]:
# Multiple sequences
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

In [38]:
# Add padding
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

In [39]:
# Truncate
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

In [40]:
# Return Tensors for different frameworks
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

### Special tokens
```text
If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:
```


In [41]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

In [42]:
# One token ID was added at the beginning, and one at the end.
# Let’s decode the two sequences of IDs above to see what this is about:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

#### Note

```text
- The tokenizer adds [CLS] at the start and [SEP] at the end to match the model's pretraining. 
- Different models may use different special tokens, but the tokenizer handles this automatically.

```

<br><hr>

### Wrapping Up: From Tokenizer To Model


In [44]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output)