# Handling multiple sequences (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face:

In [2]:
from huggingface_hub import notebook_login

notebook_login()

In the previous section, we explored the simplest of use cases: doing <font color='blue'>inference</font> on a <font color='blue'>single sequence</font> of a <font color='blue'>small length</font>. However, some questions emerge already:

- How do we handle <font color='blue'>multiple sequences</font>?
- How do we handle multiple sequences of <font color='blue'>different lengths</font>?
- Are <font color='blue'>vocabulary indices</font> the <font color='blue'>only inputs</font> that allow a model to <font color='blue'>work well</font>?
- Is there such a thing as <font color='blue'>too long</font> a <font color='blue'>sequence</font>?

Let's see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API.

### Models expect a batch of inputs

In the previous exercise you saw how <font color='blue'>sequences</font> get translated into <font color='blue'>lists of numbers</font>. Let's <font color='blue'>convert</font> this list of numbers to a <font color='blue'>tensor</font> and send it to the model:

In [47]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)  # This is a 1D tensor of shape [14] because we have 14 tokens
# This line will fail.
model(input_ids)

IndexError: too many indices for tensor of dimension 1

Oh no! Why did this fail? We followed the steps from the pipeline in section 2!

The problem is that we <font color='blue'>sent</font> a <font color='blue'>single sequence</font> to the model, whereas 🤗 Transformers models <font color='blue'>expect multiple sentences</font> by <font color='blue'>default</font>. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a sequence. But if you look closely, you'll see that the tokenizer didn't just convert the list of input IDs into a tensor, it <font color='blue'>added a dimension</font> on top of it:

In [45]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


Examining what we ran before, we see that we passed a 1D tensor into the model.

In [46]:
# The length of the 1D tensor passed to the model
len(input_ids)

14

Let's fix this by <font color='blue'>adding a new dimension</font>:

In [48]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])  # This is a 2D tensor of shape [1, 14]: 1 is the batch size (a single sequence) and 14 is the sequence length
print('Shape of input ids: ', input_ids.shape)
output = model(input_ids)

Shape of input ids:  torch.Size([1, 14])


Let's print the input IDs as well as the resulting logits:

In [49]:
print("Input IDs:", input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


<font color='blue'>Batching</font> is the act of <font color='blue'>sending multiple sentences</font> through the <font color='blue'>model</font>, all at once. If you only have one sentence, you can just build a batch with a single sequence:

In [8]:
batched_ids = [ids, ids]

This is a batch of two identical sequences!

**Try it out!** Convert this `batched_ids` list into a tensor and pass it through your model. Check that you obtain the same logits as before (but twice)!

In [9]:
# Exercise
import numpy as np

input_ids = torch.tensor(np.array(batched_ids))
output = model(input_ids)
print("Input IDs:", input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


<font color='blue'>Batching</font> allows the model to work when you <font color='blue'>feed it multiple sentences</font>. Using multiple sequences is just as simple as building a batch with a single sequence. There's a second issue, though. When you're trying to batch together <font color='blue'>two (or more) sentences</font>, they might be of <font color='blue'>different lengths</font>. If you've ever worked with tensors before, you know that they need to be of <font color='blue'>rectangular shape</font>, so you won't be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually <font color='blue'>pad</font> the inputs.

### Padding the inputs

The following list of lists <font color='blue'>cannot</font> be converted to a tensor:

In [10]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In order to work around this, we'll use <font color='blue'>padding</font> to make our tensors have a <font color='blue'>rectangular shape</font>. Padding makes sure all our sentences have the <font color='blue'>same length</font> by adding a <font color='blue'>special word</font> called the <font color='blue'>padding token</font> to the <font color='blue'>sentences</font> with <font color='blue'>fewer values</font>. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:

In [11]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
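The same padding can be done programmatically. Here is a minimal sketch in plain Python; the helper name `pad_batch` is hypothetical (it is not part of 🤗 Transformers), and it assumes we pad on the right up to the longest sequence in the batch:

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


padding_id = 100  # same illustrative padding ID as in the cell above

batched_ids = pad_batch([[200, 200, 200], [200, 200]], padding_id)
print(batched_ids)  # [[200, 200, 200], [200, 200, 100]]
```

Every row now has the same length, so the nested list is rectangular and can be converted to a tensor.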

The <font color='blue'>padding token ID</font> can be found in <font color='blue'>tokenizer.pad_token_id</font>. Let's use it and send our two sentences through the model individually and batched together:

In [12]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward0>)


There's <font color='blue'>something wrong</font> with the <font color='blue'>logits in our batched predictions</font>: the <font color='blue'>second row</font> should be the <font color='blue'>same</font> as the <font color='blue'>logits</font> for the second sentence, but we've got completely different values!

This is because the <font color='blue'>key feature</font> of <font color='blue'>Transformer models</font> is <font color='blue'>attention</font> layers that <font color='blue'>contextualize</font> each <font color='blue'>token</font>. These will <font color='blue'>take into account</font> the <font color='blue'>padding tokens</font> since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to <font color='blue'>tell</font> those <font color='blue'>attention layers</font> to <font color='blue'>ignore</font> the <font color='blue'>padding tokens</font>. This is done by using an <font color='blue'>attention mask</font>.
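To make this concrete, here is a toy sketch of the softmax that turns attention scores into attention weights. It is an illustration in plain Python, not the model's actual implementation: the scores are made up, the helper names are hypothetical, and masking is modeled by setting a score to negative infinity (real implementations commonly add a very large negative number instead). It also assumes at least one position stays unmasked.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, mask):
    """Softmax in which positions with mask == 0 receive zero weight."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    return softmax(masked)


# Made-up attention scores of one token against three positions;
# the third position plays the role of a padding token.
scores = [2.0, 1.0, 0.5]

unmasked = softmax(scores)                  # padding takes a share of the weight
masked = masked_softmax(scores, [1, 1, 0])  # padding gets exactly zero weight
two_tokens = softmax(scores[:2])            # the unpadded two-token sequence

print(unmasked)
print(masked)
print(two_tokens)
```

Without a mask, the padding position absorbs part of the attention weight, which shifts the weights of the real tokens. With the mask, the weights of the real tokens match those of the unpadded sequence.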

### Attention masks

*Attention masks* are tensors with the <font color='blue'>exact same shape</font> as the <font color='blue'>input IDs tensor</font>, filled with 0s and 1s: <font color='blue'>1s</font> indicate the corresponding tokens <font color='blue'>should</font> be attended to, and 0s indicate the corresponding tokens <font color='blue'>should not</font> be attended to (i.e., they <font color='blue'>should be ignored</font> by the attention layers of the model).

Let’s complete the previous example with an attention mask:

In [13]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


Now we get the same logits for the second sentence in the batch. Notice how the <font color='blue'>last value</font> of the <font color='blue'>second sequence</font> is a <font color='blue'>padding ID</font>, which is a 0 value in the attention mask.
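Writing the mask by hand does not scale beyond a couple of sequences. As a small sketch, the mask can be derived from the original (unpadded) sequence lengths; the helper name `attention_mask_from_lengths` is hypothetical, and right-side padding is assumed:

```python
def attention_mask_from_lengths(lengths):
    """Build a right-padded attention mask: 1 for real tokens, 0 for padding."""
    max_len = max(lengths)
    return [[1] * n + [0] * (max_len - n) for n in lengths]


sequences = [[200, 200, 200], [200, 200]]  # unpadded input IDs from the example above

attention_mask = attention_mask_from_lengths([len(seq) for seq in sequences])
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Deriving the mask from the lengths rather than from the padded IDs avoids mistakes if the padding ID also happens to appear as a regular token.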

**Try it out!** Apply the <font color='blue'>tokenization manually</font> on the <font color='blue'>two sentences</font> used in <font color='blue'>section 2</font> (“I've been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Pass them through the model and <font color='blue'>check</font> that you get the <font color='blue'>same logits</font> as in <font color='blue'>section 2</font>. Now <font color='blue'>batch them together</font> using the <font color='blue'>padding token</font>, then create the proper <font color='blue'>attention mask</font>. Check that you obtain the same results when going through the model!

Here is the approach taken in Section 2:

In [50]:
# Section 2 Code
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
]

# Tokenize and convert to IDs
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Extract the token IDs and attention masks
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]

# Input IDs for each sentence
print(input_ids[0])
print(input_ids[1])

# Padding in the attention masks
print(attention_mask[0])
print(attention_mask[1])

# Pass the tokenized inputs through the model
outputs = model(**inputs)

# Print the logits for each sentence
print(outputs.logits)

tensor([  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
         2607,  2026,  2878,  2166,  1012,   102])
tensor([ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0,    0,    0,
           0,    0,    0,    0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


And here is the manual approach described in this section:

In [53]:
# Exercise Code
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
]
# Tokenize manually, keeping the special tokens ([CLS], [SEP]) so the logits match section 2
tokens = [tokenizer.tokenize(sentence, add_special_tokens=True) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(sentence_tokens) for sentence_tokens in tokens]

# Each sentence on its own, as a batch of size 1
sequence1_ids = [ids[0]]
sequence2_ids = [ids[1]]

# Pad every sequence on the right up to the longest one, and build the matching attention mask
max_sequence_length = max(len(seq_ids) for seq_ids in ids)
batched_ids = [seq_ids + [tokenizer.pad_token_id] * (max_sequence_length - len(seq_ids)) for seq_ids in ids]
attention_mask = [[1] * len(seq_ids) + [0] * (max_sequence_length - len(seq_ids)) for seq_ids in ids]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
# print(outputs.logits)  # uncomment to compare the batched logits with the two rows above

tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>)
tensor([[ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


### Longer sequences

With Transformer models, there is a <font color='blue'>limit to the lengths</font> of the <font color='blue'>sequences</font> we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

- Use a model with a <font color='blue'>longer supported</font> sequence length.
- <font color='blue'>Truncate</font> your sequences.

Models have different supported sequence lengths, and some specialize in handling very long sequences. [Longformer](https://huggingface.co/docs/transformers/model_doc/longformer) is one example, and another is [LED](https://huggingface.co/docs/transformers/model_doc/led). If you're working on a task that requires very long sequences, we recommend you take a look at those models.

Otherwise, we recommend truncating your sequences to a maximum length:

In [16]:
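# `sequence` is meant to be a list of token IDs here; slicing a raw string would cut characters, not tokens
sequence = sequence[:max_sequence_length]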
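As a minimal sketch of truncation on token IDs, the snippet below uses a made-up list of 600 IDs and an assumed limit of 512; the helper name `truncate_ids` is hypothetical:

```python
def truncate_ids(ids, max_length):
    """Keep at most max_length token IDs, dropping everything after that."""
    return ids[:max_length]


ids = list(range(600))     # stand-in for the token IDs of a long sequence
max_sequence_length = 512  # assumed model limit, for illustration only

truncated = truncate_ids(ids, max_sequence_length)
print(len(truncated))  # 512
```

Note that plain slicing also cuts off a trailing special token such as `[SEP]` if the sequence had one. Calling the tokenizer with `truncation=True` and a `max_length` takes the special tokens into account when truncating.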