# Putting it all together (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [35]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face:

In [22]:
from huggingface_hub import notebook_login

notebook_login()


In the last few sections, we've been trying our best to do <font color='blue'>most</font> of the <font color='blue'>work by hand</font>. We've explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in section 2, the 🤗 <font color='blue'>Transformers API</font> can <font color='blue'>handle all of this</font> for us with a <font color='blue'>high-level function</font> that we'll dive into here. When you <font color='blue'>call</font> your <font color='blue'>tokenizer directly</font> on the <font color='blue'>sentence</font>, you get back <font color='blue'>inputs</font> that are <font color='blue'>ready</font> to pass through your <font color='blue'>model</font>:

In [36]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
for input in model_inputs:
  print(f"{input}: {model_inputs[input]}")

input_ids: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Here, the <font color='blue'>model_inputs</font> variable contains <font color='blue'>everything</font> that's necessary for a model to operate well. For <font color='blue'>DistilBERT</font>, that includes the <font color='blue'>input IDs</font> as well as the <font color='blue'>attention mask</font>. <font color='blue'>Other models</font> that accept additional inputs will also have those <font color='blue'>output</font> by the <font color='blue'>tokenizer object</font>.

As we'll see in some examples below, this method is very powerful. First, it can <font color='blue'>tokenize</font> a <font color='blue'>single sequence</font>:

In [24]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
for input in model_inputs:
  print(f"{input}: {model_inputs[input]}")

input_ids: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


It also handles <font color='blue'>multiple sequences</font> at a time, with <font color='blue'>no change</font> in the API:

In [25]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)
for input in model_inputs:
    print(f"{input}:")
    for idx, entry in enumerate(model_inputs[input]):
        print(f"  {input} for sentence {idx+1}: {entry}")

input_ids:
  input_ids for sentence 1: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
  input_ids for sentence 2: [101, 2061, 2031, 1045, 999, 102]
attention_mask:
  attention_mask for sentence 1: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  attention_mask for sentence 2: [1, 1, 1, 1, 1, 1]


It can <font color='blue'>pad</font> according to <font color='blue'>several objectives</font>:

In [26]:
# Pads the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
for input in model_inputs:
    print(f"{input}:")
    for idx, entry in enumerate(model_inputs[input]):
        print(f"  {input} for sentence {idx+1}: {entry}")

input_ids:
  input_ids for sentence 1: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
  input_ids for sentence 2: [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask:
  attention_mask for sentence 1: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  attention_mask for sentence 2: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
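
To make the effect of `padding="longest"` concrete, here is a minimal pure-Python sketch of the idea. It is not how the library implements padding: `pad_to_longest` is a hypothetical helper, and the padding ID of 0 is simply read off the output above.

```python
def pad_to_longest(batch, pad_id=0):
    """Right-pad every list of IDs to the length of the longest one.

    Hypothetical helper for illustration only; pad_id=0 matches the
    padding value visible in the tokenizer output above.
    """
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded IDs copied from the earlier tokenizer output
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102],
]
padded = pad_to_longest(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The longest sequence is returned unchanged, and the attention mask records which positions are padding.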


In [27]:
# Pads sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
for input in model_inputs:
    print(f"{input}:")
    for idx, entry in enumerate(model_inputs[input]):
        print(f"  {input} for sentence {idx+1}: {entry}")

input_ids:
  input_ids for sentence 1: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102, 0, 0, 0, ...]
  ... (remaining output truncated)

In [28]:
# Pads sequences up to the specified max length; longer sequences are left unchanged
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
for input in model_inputs:
    print(f"{input}:")
    for idx, entry in enumerate(model_inputs[input]):
        print(f"  {input} for sentence {idx+1}: {entry}")

input_ids:
  input_ids for sentence 1: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
  input_ids for sentence 2: [101, 2061, 2031, 1045, 999, 102, 0, 0]
attention_mask:
  attention_mask for sentence 1: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  attention_mask for sentence 2: [1, 1, 1, 1, 1, 1, 0, 0]


In [29]:
# Note: padding alone never shortens a sequence. Pass truncation=True to cut sequences longer than max_length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8, truncation=True)
for input in model_inputs:
    print(f"{input}:")
    for idx, entry in enumerate(model_inputs[input]):
        print(f"  {input} for sentence {idx+1}: {entry}")

input_ids:
  input_ids for sentence 1: [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
  input_ids for sentence 2: [101, 2061, 2031, 1045, 999, 102, 0, 0]
attention_mask:
  attention_mask for sentence 1: [1, 1, 1, 1, 1, 1, 1, 1]
  attention_mask for sentence 2: [1, 1, 1, 1, 1, 1, 0, 0]
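
The combination of truncation and padding to a fixed length can be sketched in plain Python. This is only an illustration under stated assumptions: `truncate_and_pad` is a hypothetical helper, the IDs 101 and 102 (the first and last IDs in every output above, discussed in the special tokens part of this section) and the padding ID 0 are read off the outputs rather than queried from the tokenizer, and the real library supports more strategies than this.

```python
def truncate_and_pad(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Fit one sequence into exactly max_length IDs.

    content_ids are the IDs without the leading 101 and trailing 102.
    All default IDs are assumptions taken from the outputs above.
    """
    # Reserve two slots so the first and last special IDs survive truncation
    kept = content_ids[: max_length - 2]
    ids = [cls_id] + kept + [sep_id]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


# The two sentences from above, with 101 and 102 stripped
long_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
short_ids = [2061, 2031, 1045, 999]

long_out, long_mask = truncate_and_pad(long_ids, max_length=8)
short_out, short_mask = truncate_and_pad(short_ids, max_length=8)
print(long_out, long_mask)
print(short_out, short_mask)
```

Note that the truncated sequence still ends with 102: the content is cut, not the closing special token.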


It can also <font color='blue'>truncate sequences</font>:

In [30]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Truncates sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
for input in model_inputs:
    print(f"{input}:")
    for idx, entry in enumerate(model_inputs[input]):
        print(f"  {input} for sentence {idx+1}: {entry}")

input_ids:
  input_ids for sentence 1: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
  input_ids for sentence 2: [101, 2061, 2031, 1045, 999, 102]
attention_mask:
  attention_mask for sentence 1: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  attention_mask for sentence 2: [1, 1, 1, 1, 1, 1]


In [31]:
# Truncates sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
for input in model_inputs:
    print(f"{input}:")
    for idx, entry in enumerate(model_inputs[input]):
        print(f"  {input} for sentence {idx+1}: {entry}")

input_ids:
  input_ids for sentence 1: [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
  input_ids for sentence 2: [101, 2061, 2031, 1045, 999, 102]
attention_mask:
  attention_mask for sentence 1: [1, 1, 1, 1, 1, 1, 1, 1]
  attention_mask for sentence 2: [1, 1, 1, 1, 1, 1]


The <font color='blue'>tokenizer object</font> can handle the <font color='blue'>conversion</font> to <font color='blue'>specific framework tensors</font>, which can then be <font color='blue'>sent</font> directly to the <font color='blue'>model</font>. In the following code samples, the `return_tensors` argument selects the framework: <font color='blue'>pt</font> returns <font color='blue'>PyTorch</font> tensors, <font color='blue'>tf</font> returns <font color='blue'>TensorFlow</font> tensors, and <font color='blue'>np</font> returns NumPy arrays:

In [32]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Return PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
type(model_inputs['input_ids'])

torch.Tensor

In [33]:
# Return TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
type(model_inputs['input_ids'])

tensorflow.python.framework.ops.EagerTensor

In [16]:
# Return NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
type(model_inputs['input_ids'])

numpy.ndarray
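
All three calls above pass `padding=True`. One reason is that a tensor or array must be rectangular, so sequences of different lengths cannot be stacked as they are; the tokenizer typically raises an error if you request tensors from a ragged batch. The sketch below illustrates that constraint in plain Python. `batch_shape` is a hypothetical helper and the IDs are toy values, not real tokenizer output.

```python
def batch_shape(rows):
    """Return (n_rows, n_cols) for a list of lists, or raise if it is ragged."""
    lengths = {len(row) for row in rows}
    if len(lengths) > 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(rows), lengths.pop() if lengths else 0)


unpadded = [[101, 1045, 102], [101, 102]]   # toy IDs, different lengths
padded = [[101, 1045, 102], [101, 102, 0]]  # second row padded with 0

print(batch_shape(padded))
try:
    batch_shape(unpadded)
except ValueError as err:
    print("cannot build a tensor:", err)
```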

### Special tokens

If we take a look at the <font color='blue'>input IDs</font> returned by the <font color='blue'>tokenizer</font>, we will see they are a <font color='blue'>tiny bit different</font> from what we had earlier:

In [19]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print('Tokenizer call: ', model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print('Manual conversion: ', ids)

Tokenizer call:  [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
Manual conversion:  [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


One token ID was <font color='blue'>added</font> at the <font color='blue'>beginning</font>, and <font color='blue'>one</font> at the <font color='blue'>end</font>. Let's decode the two sequences of IDs above to see what this is about:

In [20]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


The <font color='blue'>tokenizer</font> added the special word <font color='blue'>[CLS]</font> at the <font color='blue'>beginning</font> and the special word <font color='blue'>[SEP]</font> at the <font color='blue'>end</font>. This is because the <font color='blue'>model</font> was <font color='blue'>pretrained with those</font>, so to get the same results for inference we need to add them as well. Note that <font color='blue'>some models don't add special words</font>, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
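
As a rough illustration of what the tokenizer does here, the sketch below wraps the manually converted IDs with the two special IDs and compares the result with the tokenizer's output. The values 101 and 102 are read off the outputs above rather than queried from the tokenizer, and `add_special_tokens` is a hypothetical helper that only covers this BERT-style single-sequence case.

```python
CLS_ID, SEP_ID = 101, 102  # read off the outputs above; assumed, not queried


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a list of token IDs as [CLS] ... [SEP] (BERT-style, one sequence)."""
    return [cls_id] + list(ids) + [sep_id]


# IDs from tokenizer.tokenize + convert_tokens_to_ids, copied from above
manual_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
# IDs from calling the tokenizer directly, copied from above
tokenizer_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

print(add_special_tokens(manual_ids) == tokenizer_ids)
```

In practice you would let the tokenizer do this, since it knows which special tokens its model expects.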

### Wrapping up: From tokenizer to model

Now that we've seen all the individual steps the `tokenizer` object uses when applied to text, let's see one final time how it can handle

- multiple sequences (<font color='blue'>padding</font>)
- very long sequences (<font color='blue'>truncation</font>)
- multiple types of tensors

with its main API:

In [42]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
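
The call `model(**tokens)` relies on Python's keyword-argument unpacking: the tokenizer's dict-like output is spread into named arguments such as `input_ids` and `attention_mask`. The sketch below shows the mechanism with a plain dictionary and a hypothetical `toy_model` function standing in for the real model.

```python
def toy_model(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model's forward call."""
    return {"n_sequences": len(input_ids), "has_mask": attention_mask is not None}


# Plain dict with the same keys the tokenizer returned above
toy_tokens = {
    "input_ids": [[101, 2061, 2031, 1045, 999, 102]],
    "attention_mask": [[1, 1, 1, 1, 1, 1]],
}

# These two calls are equivalent
unpacked = toy_model(**toy_tokens)
explicit = toy_model(
    input_ids=toy_tokens["input_ids"],
    attention_mask=toy_tokens["attention_mask"],
)
print(unpacked == explicit)
```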


In [44]:
logits = output.logits
print(logits)

tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>)


In [45]:
import numpy as np

# Convert logits to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Display output probabilities as decimals (instead of the default of scientific notation)
np.set_printoptions(suppress=True, precision=8, floatmode='fixed')

# Print logits and corresponding probabilities
for i, sequence in enumerate(sequences):
    print(f"Sequence: {sequence}")
    print(f"Logits: {logits[i].detach().numpy()}")
    print(f"Probabilities: {probabilities[i].detach().numpy()}")
    print(f"Predicted Class: {torch.argmax(logits[i]).item()}\n")

Sequence: I've been waiting for a HuggingFace course my whole life.
Logits: [-1.56069887  1.61228395]
Probabilities: [0.04019519 0.95980483]
Predicted Class: 1

Sequence: So have I!
Logits: [-3.61831784  3.91374946]
Probabilities: [0.00053534 0.99946469]
Predicted Class: 1
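
To see what the softmax step computes, the sketch below reproduces the probabilities in plain Python from the logits printed above. The label mapping is an assumption for this checkpoint; check `model.config.id2label` before relying on it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label mapping for this checkpoint (verify with model.config.id2label)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Logits copied from the output above
batch_logits = [[-1.56069887, 1.61228395], [-3.61831784, 3.91374946]]

for logits_row in batch_logits:
    probs = softmax(logits_row)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    print(id2label[predicted], [round(p, 4) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits (as the cell above does) gives the same predicted class as taking the argmax of the probabilities.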

