# Putting it all together
We’ve explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks. The tokenizer can handle all of these steps for us when we call it directly on a sentence:

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Here, the model_inputs variable contains everything the model needs as input. For DistilBERT, that means the input IDs and the attention mask. For models that accept additional inputs, the tokenizer returns those as well.
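
To make that structure concrete, here is a rough sketch of what the result looks like for our sentence. It uses a plain dict as a stand-in (an assumption for illustration; the real return value is a dict-like `BatchEncoding` object), filled with the input IDs the tokenizer returns for this sentence and an attention mask of all ones, since a single unpadded sequence has nothing to mask out.

```python
# Stand-in for the tokenizer output; the real object is a dict-like BatchEncoding.
input_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
             17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

model_inputs = {
    "input_ids": input_ids,
    # One mask entry per token; 1 means "attend to this position".
    "attention_mask": [1] * len(input_ids),
}

for name, values in model_inputs.items():
    print(name, len(values))
```

Both entries always have the same length, which is what lets the model line up each token with its mask value.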

In [6]:
# First, it can tokenize a single sequence:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)

In [7]:
# It also handles multiple sequences at a time, with no change in the API:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)

In [None]:
# It can pad according to several strategies

# Pad to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Pad to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
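
The difference between these strategies is easier to see on small lists. The following is a minimal sketch, not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and it assumes right-side padding with pad ID 0 (the `[PAD]` ID in the uncased BERT vocabulary). Unlike the real tokenizer, it does not fall back to a model max length when `max_length` is omitted.

```python
PAD_ID = 0  # assumed pad ID ([PAD] in the uncased BERT/DistilBERT vocabulary)

def pad_batch(batch, padding="longest", max_length=None, pad_id=PAD_ID):
    """Hypothetical helper: right-pad lists of token IDs and build attention masks."""
    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(0, target - len(ids))  # padding never shortens a sequence
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy IDs made up for illustration; 101 and 102 play the roles of [CLS] and [SEP].
batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 8, 102]]

longest = pad_batch(batch, padding="longest")
fixed = pad_batch(batch, padding="max_length", max_length=5)
print(longest["input_ids"])
print(fixed["input_ids"])
```

Note that padding to a `max_length` shorter than a sequence leaves that sequence untouched, so the batch can still be ragged. Shortening sequences is the job of truncation.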

In [None]:
# It can also truncate

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Truncate the sequences longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Truncate the sequences longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
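
As a rough illustration of what `max_length=8` does to our first sentence, the sketch below shortens its input IDs to eight entries. `truncate_keep_end_token` is a hypothetical helper, and it assumes right-side truncation with the final special token kept in place, which mimics the visible effect rather than the library's internal steps.

```python
def truncate_keep_end_token(ids, max_length):
    """Hypothetical helper: cut a [CLS] ... [SEP] sequence down to max_length IDs."""
    if max_length < 2:
        raise ValueError("need room for both special tokens")
    if len(ids) <= max_length:
        return list(ids)
    # Keep the first max_length - 1 IDs, then re-attach the final special token.
    return ids[: max_length - 1] + [ids[-1]]

# Input IDs of the first sentence, including the special tokens 101 and 102.
full = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
        17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

short = truncate_keep_end_token(full, max_length=8)
print(short)
```

Sequences that already fit, such as the second sentence, pass through unchanged.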

The tokenizer object can also convert its output to framework-specific tensors, which can then be sent directly to the model. In the following code sample we ask the tokenizer to return tensors for each supported framework: "pt" returns PyTorch tensors, "tf" returns TensorFlow tensors, and "np" returns NumPy arrays. Each option requires the corresponding framework to be installed:

In [8]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
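
Each call above passes padding=True because a tensor has to be rectangular: every row needs the same number of columns. The sketch below illustrates that constraint with plain lists. `batch_shape` is a hypothetical helper and the token IDs are made up.

```python
def batch_shape(batch):
    """Hypothetical helper: return (rows, columns), or raise if rows differ in length."""
    lengths = {len(row) for row in batch}
    if len(lengths) > 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop() if lengths else 0)

# Toy IDs made up for illustration; 0 is used as the pad ID.
unpadded = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded = [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]

print(batch_shape(padded))
try:
    batch_shape(unpadded)
except ValueError as err:
    print(err)
```

The padded batch has a well-defined shape, while the unpadded one cannot be turned into a single tensor.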




### Special tokens
If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

In [9]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


One token ID was added at the beginning, and one at the end. Let’s decode the two sequences of IDs above to see what this is about:

In [10]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


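The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so we need to add them at inference time as well to give the model the kind of input it saw during training.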
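
The relationship between the two ID lists printed above can be sketched in a few lines. `add_special_tokens` is a hypothetical helper for a single sequence, using the IDs 101 and 102 that decode to [CLS] and [SEP] in the output above. It is only a sketch of the effect, not how the tokenizer is implemented.

```python
CLS_ID, SEP_ID = 101, 102  # IDs that decode to [CLS] and [SEP] for this checkpoint

def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Hypothetical helper for a single sequence: [CLS] + ids + [SEP]."""
    return [cls_id] + list(ids) + [sep_id]

# IDs obtained from tokenize() followed by convert_tokens_to_ids().
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

with_special = add_special_tokens(ids)
print(with_special)
```

The result matches the input IDs returned by calling the tokenizer directly on the sentence.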

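### Wrapping up: From tokenizer to model

Now that we have seen the individual steps the tokenizer applies to text, here is how its main API handles several sequences at once (padding) and overly long sequences (truncation), with the result passed straight to the model: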

In [12]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "So have I!",
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
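
The model returns raw scores (logits), one row per input sequence, and getting from those to predictions takes a softmax. The sketch below shows that step in plain Python. The logits are made up for illustration, and the label order is the one we assume for this SST-2 checkpoint.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label order for this checkpoint; the logits are made up for illustration.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
fake_logits = [[-1.5, 1.5], [-2.0, 2.0]]

for row in fake_logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(id2label[best], round(probs[best], 3))
```

With the real model, the scores live in `output.logits`, `torch.softmax(output.logits, dim=-1)` performs the same computation on the whole batch, and `model.config.id2label` holds the label names.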

## What we have learned so far

- Learned the basic building blocks of a Transformer model.
- Learned what makes up a tokenization pipeline.
- Saw how to use a Transformer model in practice.
- Learned how to leverage a tokenizer to convert text to tensors that the model can understand.
- Set up a tokenizer and a model together to get from text to predictions.
- Learned the limitations of input IDs, and learned about attention masks.
- Played around with versatile and configurable tokenizer methods.