<a href="https://colab.research.google.com/github/abolfazlaghdaee/AI_Project/blob/main/multiple_sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the previous section, we explored the simplest of use cases: doing inference on a single sequence of a small length. However, some questions emerge already:

- How do we handle multiple sequences?
- How do we handle multiple sequences of different lengths?
- Are vocabulary indices the only inputs that allow a model to work well?
- Is there such a thing as too long a sequence?


Let’s see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API.

In [1]:
!pip install transformers



**Models expect a batch of inputs**

In [8]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."


tokens = tokenizer.tokenize(sequence)

ids = tokenizer.convert_tokens_to_ids(tokens)


input_ids = tf.constant(ids)

output = model(input_ids)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [10]:
output.logits

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.7276208,  2.8789372]], dtype=float32)>

This call is fragile, even though we followed the steps from the pipeline in section 2. In the run above it returned logits of shape (1, 2), but depending on the library version the same call can fail with a shape error.

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a sequence. But if you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor; it also added a dimension on top of it:

In [7]:
tokenized_inputs = tokenizer(sequence, return_tensors="tf")

print(tokenized_inputs["input_ids"])

tf.Tensor(
[[  101  1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026
   2878  2166  1012   102]], shape=(1, 16), dtype=int32)
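To make the batch dimension explicit when building the input by hand, the list of IDs can be wrapped in an outer list before it is turned into a tensor. The sketch below uses plain Python lists so that it runs without downloading a model. The IDs are copied from the printed tensor above with the first and last values (101 and 102) removed, and `shape_of` is a hypothetical helper written only for this illustration.

```python
# Token IDs of the example sentence, copied from the output above
# without the leading 101 and trailing 102.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607,
       2026, 2878, 2166, 1012]


def shape_of(nested):
    """Return the shape of a rectangular nested list (hypothetical helper)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


batch = [ids]  # wrap the single sequence in a batch of size 1

print(shape_of(ids))    # (14,)
print(shape_of(batch))  # (1, 14)
```

With TensorFlow, `tf.constant([ids])` builds the corresponding tensor of shape (1, 14), which can then be passed to the model.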


**Padding the inputs**

Tensors need to be rectangular, so a list of sequences with different lengths, such as `[[200, 200, 200], [200, 200]]`, cannot be converted to a tensor directly. To work around this, we’ll use *padding* to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special token, called the padding token, to the shorter sentences. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:

In [11]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
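]

The padding token ID can be found in `tokenizer.pad_token_id`. Let’s use it and send our two sequences through the model individually and batched together:

In [12]: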
model  = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)


sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(tf.constant(sequence1_ids)).logits)
print(model(tf.constant(sequence2_ids)).logits)
print(model(tf.constant(batched_ids)).logits)


tf.Tensor([[ 1.5693678 -1.3894583]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 0.58030033 -0.41252464]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[ 1.569368  -1.3894584]
 [ 1.3373476 -1.2163184]], shape=(2, 2), dtype=float32)
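The padding step described above can be sketched as a small helper that right-pads every sequence to the length of the longest one. This is a minimal illustration in plain Python, not part of the 🤗 Transformers API: `pad_batch` is a hypothetical name, and the padding ID of 0 is an assumption for the example (in practice, use `tokenizer.pad_token_id`).

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_length = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_length - len(seq)) for seq in sequences]


# 0 is an assumed padding ID for this example; use tokenizer.pad_token_id in practice.
batched_ids = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(batched_ids)  # [[200, 200, 200], [200, 200, 0]]
```

The result is rectangular, so it can be converted to a tensor with `tf.constant(batched_ids)`.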


**Attention masks**

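Look at the logits of the batched prediction above: the second row (1.337, -1.216) should match the logits of the second sequence on its own (0.580, -0.413), but it does not. Attention layers contextualize every token, so they also take the padding token into account. To get the same result with and without padding, we need to tell the attention layers to ignore the padding tokens, which is done with an attention mask.

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

Let’s complete the previous example with an attention mask; the second row should now match the logits of the second sequence alone: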

In [13]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask))


outputs.logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 1.569368  , -1.3894584 ],
       [ 0.58030206, -0.41252613]], dtype=float32)>
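Writing the mask by hand does not scale beyond toy examples. One way to derive it, sketched here in plain Python, is from the lengths of the unpadded sequences, which avoids relying on the padding ID never appearing as a real token. `build_attention_mask` is a hypothetical helper name used only for this illustration.

```python
def build_attention_mask(sequences):
    """Return one row per sequence: 1 for real tokens, 0 for padding positions."""
    max_length = max(len(seq) for seq in sequences)
    return [[1] * len(seq) + [0] * (max_length - len(seq)) for seq in sequences]


# Unpadded sequences from the example above.
attention_mask = build_attention_mask([[200, 200, 200], [200, 200]])
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```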

✏️ Try it out! Apply the tokenization manually on the two sentences used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Pass them through the model and check that you get the same logits as in section 2. Now batch them together using the padding token, then create the proper attention mask. Check that you obtain the same results when going through the model!

In [21]:
sequence1 = "I’ve been waiting for a HuggingFace course my whole life."
sequence2 = "I hate this so much!"


tokens1 = tokenizer.tokenize(sequence1)
tokens2 = tokenizer.tokenize(sequence2)

ids1 = tokenizer.convert_tokens_to_ids(tokens1)
ids2 = tokenizer.convert_tokens_to_ids(tokens2)


# Wrap each list of IDs in an outer list so the tensor has an explicit batch dimension
input_ids1 = tf.constant([ids1])
input_ids2 = tf.constant([ids2])



outputs1 = model(input_ids1)
outputs2 = model(input_ids2)

print(outputs1.logits)
outputs2.logits


tf.Tensor([[-2.571974   2.6852386]], shape=(1, 2), dtype=float32)


<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 3.1930926, -2.6685236]], dtype=float32)>

In [26]:
max_length = max(len(ids1), len(ids2))


ids_1_padded = ids1 + [tokenizer.pad_token_id] * (max_length - len(ids1))
ids_2_padded = ids2 + [tokenizer.pad_token_id] * (max_length - len(ids2))

# Create attention masks (1 for real tokens, 0 for padding tokens)
attention_mask_1 = [1] * len(ids1) + [0] * (max_length - len(ids1))
attention_mask_2 = [1] * len(ids2) + [0] * (max_length - len(ids2))


batched_input_ids = tf.constant([ids_1_padded, ids_2_padded])
batched_attention_mask = tf.constant([attention_mask_1, attention_mask_2])

batched_output = model(batched_input_ids, attention_mask=batched_attention_mask).logits
batched_output


<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-2.5719736,  2.6852374],
       [ 3.1930914, -2.668523 ]], dtype=float32)>
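The manual route through `tokenize` and `convert_tokens_to_ids` does not add the special tokens that calling the tokenizer directly adds (101 at the start and 102 at the end of the tensor printed earlier), so these logits are not expected to match the pipeline output of section 2 exactly. A minimal sketch of adding them by hand, assuming those two IDs and using a hypothetical helper name:

```python
CLS_ID = 101  # first ID in the tokenizer output printed earlier
SEP_ID = 102  # last ID in the tokenizer output printed earlier


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a list of token IDs with the start and end special tokens."""
    return [cls_id] + ids + [sep_id]


# IDs of the first example sentence without special tokens.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607,
       2026, 2878, 2166, 1012]

full_ids = add_special_tokens(ids)
print(len(ids), len(full_ids))  # 14 16
```

In a notebook with the tokenizer loaded, `tokenizer.cls_token_id` and `tokenizer.sep_token_id` give these values without hard-coding them.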

**Longer sequences**

With Transformer models, there is a limit to the lengths of the sequences we can pass the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

- Use a model with a longer supported sequence length.
- Truncate your sequences.


Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you’re working on a task that requires very long sequences, we recommend you take a look at those models.

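Otherwise, we recommend you truncate your sequences to a maximum length, here called `max_sequence_length`:


In [None]:
max_sequence_length = 512  # assumed limit; check the value for your checkpoint
ids = ids[:max_sequence_length]  # truncate the list of token IDs, not the raw string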
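A plain slice also cuts off any closing special token that was already appended. The sketch below shows one possible way to truncate while keeping the start and end tokens; `truncate_ids` is a hypothetical helper, the IDs 101 and 102 are taken from the tokenizer output printed earlier, and the input IDs are made up for the example.

```python
def truncate_ids(ids, max_length, cls_id=101, sep_id=102):
    """Return [cls] + body + [sep] with at most max_length tokens in total."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    body = ids[: max_length - 2]  # reserve two positions for the special tokens
    return [cls_id] + body + [sep_id]


# 20 made-up token IDs standing in for a long sequence.
long_ids = list(range(1000, 1020))

truncated = truncate_ids(long_ids, max_length=8)
print(truncated)  # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

When calling the tokenizer directly, passing `truncation=True` together with `max_length` handles this step for you.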