# Handling multiple sequences (TensorFlow)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]

In [2]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tf.constant(ids)
# Expected to fail (no batch dimension), but in this run the call returned logits; see the output below.
model(input_ids)

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.7276225,  2.8789392]], dtype=float32)>, hidden_states=None, attentions=None)

In [3]:
tokenized_inputs = tokenizer(sequence, return_tensors="tf")
print(tokenized_inputs["input_ids"])

tf.Tensor(
[[  101  1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026
   2878  2166  1012   102]], shape=(1, 16), dtype=int32)


In [4]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = tf.constant([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Input IDs: tf.Tensor(
[[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]], shape=(1, 14), dtype=int32)
Logits: tf.Tensor([[-2.7276225  2.8789392]], shape=(1, 2), dtype=float32)
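
The only change from the earlier cell is the extra pair of brackets in `tf.constant([ids])`, which adds a batch dimension so the shape becomes `(1, 14)` instead of `(14,)`. A minimal pure-Python sketch of that shape change follows; `nested_shape` is a hypothetical helper written for this illustration, not part of TensorFlow or Transformers.

```python
def nested_shape(values):
    """Return the shape of a rectangular nested list, e.g. [[1, 2]] -> (1, 2)."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


# Token IDs copied from the output above (no special tokens).
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

single_sequence = ids  # what tf.constant(ids) would receive
batch_of_one = [ids]   # what tf.constant([ids]) would receive

print(nested_shape(single_sequence))  # (14,)
print(nested_shape(batch_of_one))     # (1, 14)
```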


In [5]:
batched_ids = [ids, ids]
print(batched_ids)

[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012], [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]]


✏️ Try it out! Convert this batched_ids list into a tensor and pass it through your model. Check that you obtain the same logits as before (but twice)!

In [29]:
batched_input_ids = tf.constant(batched_ids)
print("Batched input IDs:", batched_input_ids)

Batched input IDs: tf.Tensor(
[[200 200 200]
 [200 200   0]], shape=(2, 3), dtype=int32)


In [30]:
output = model(batched_input_ids)
print("Batched logits:", output.logits)

Batched logits: tf.Tensor(
[[ 1.569367  -1.3894578]
 [ 1.3373486 -1.2163193]], shape=(2, 2), dtype=float32)


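#### 🎓 Check that you obtain the same logits as before (but twice): Yep!

Note: the output captured above shows the padded `[200, 200, ...]` batch from the padding section, so these two cells appear to have been run out of order. With `[ids, ids]`, both rows should match the single-sequence logits up to floating-point noise.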

# Padding the inputs

In [31]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In [32]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
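
The manual padding above can be generalized to any number of sequences. This is a minimal sketch in plain Python; `pad_batch` is a hypothetical helper name, and it pads on the right, matching the cell above.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


padding_id = 100  # same placeholder value as in the cell above
batched_ids = pad_batch([[200, 200, 200], [200, 200]], padding_id)
print(batched_ids)  # [[200, 200, 200], [200, 200, 100]]
```

In the next cell the placeholder is replaced by `tokenizer.pad_token_id`, which is the value the model was actually trained with.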

In [33]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(tf.constant(sequence1_ids)).logits)
print(model(tf.constant(sequence2_ids)).logits)
print(model(tf.constant(batched_ids)).logits)


tf.Tensor([[ 1.5693678 -1.3894578]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 0.58030325 -0.41252738]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[ 1.569367  -1.3894578]
 [ 1.3373486 -1.2163193]], shape=(2, 2), dtype=float32)


In [34]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask))
print(outputs.logits)

tf.Tensor(
[[ 1.569367   -1.3894578 ]
 [ 0.58029795 -0.4125215 ]], shape=(2, 2), dtype=float32)
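
With the mask applied, the second row (0.58029795, -0.4125215) is back in line with the logits obtained for `[[200, 200]]` on its own. The mask itself can be derived from the unpadded lengths instead of being typed by hand. A minimal sketch, assuming right padding; `attention_mask_for` is a hypothetical helper name.

```python
def attention_mask_for(sequences):
    """Build a 0/1 mask: 1 for real tokens, 0 for the right-padding positions."""
    max_len = max(len(seq) for seq in sequences)
    return [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]


unpadded = [[200, 200, 200], [200, 200]]
attention_mask = attention_mask_for(unpadded)
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```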


✏️ Try it out! Apply the tokenization manually on the two sentences used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Pass them through the model and check that you get the same logits as in section 2. Now batch them together using the padding token, then create the proper attention mask. Check that you obtain the same results when going through the model!

## Apply tokenization separately

In [35]:
# Re-run the model

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)


#### List of strings scenario

In [43]:
# Code copied from section 2
list_of_str_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

list_of_str_tokens = tokenizer(list_of_str_inputs)

print(list_of_str_tokens)


{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


#### Note: input_ids include the special tokens 101 ([CLS]) and 102 ([SEP])

#### Separate sequences scenario

---





In [54]:
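sequence1 = "I've been waiting for a HuggingFace course my whole life."
# Note: this ends with "." whereas the list-of-strings cell above uses "I hate this so much!"
sequence2 = "I hate this so much."

# Apply the tokenizer to each sequence separately
# (tokenize() splits the string into tokens without adding special tokens)
tokens1 = tokenizer.tokenize(sequence1)
tokens2 = tokenizer.tokenize(sequence2)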

print(tokens1, "\n", tokens2)

['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.'] 
 ['i', 'hate', 'this', 'so', 'much', '.']


In [55]:
# Convert each list of tokens to a list of vocabulary IDs

ids1 = tokenizer.convert_tokens_to_ids(tokens1)
ids2 = tokenizer.convert_tokens_to_ids(tokens2)

print(ids1, "\n", ids2)

input_ids1 = tf.constant(ids1)
input_ids2 = tf.constant(ids2)

print("Input IDs 1:", input_ids1)
print("Input IDs 2:", input_ids2)

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012] 
 [1045, 5223, 2023, 2061, 2172, 1012]
Input IDs 1: tf.Tensor(
[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
  2166  1012], shape=(14,), dtype=int32)
Input IDs 2: tf.Tensor([1045 5223 2023 2061 2172 1012], shape=(6,), dtype=int32)


#### Note that these IDs don't include 101 or 102: `tokenize()` followed by `convert_tokens_to_ids()` does not add special tokens
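
To see exactly what is missing, the special tokens can be added by hand. The sketch below hard-codes 101 and 102, the values that appear in the tokenizer output earlier in this notebook; `with_special_tokens` is a hypothetical helper, and in real code the IDs would normally be read from `tokenizer.cls_token_id` and `tokenizer.sep_token_id` rather than typed in.

```python
CLS_ID = 101  # [CLS], as seen at the start of the tokenizer output above
SEP_ID = 102  # [SEP], as seen at the end of the tokenizer output above


def with_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a list of token IDs with the start and end special tokens."""
    return [cls_id] + list(ids) + [sep_id]


# IDs produced by tokenize() + convert_tokens_to_ids() for the first sentence.
ids1 = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

full_ids1 = with_special_tokens(ids1)
print(len(ids1), len(full_ids1))  # 14 16
print(full_ids1[0], full_ids1[-1])  # 101 102
```

The result has the same 16 IDs that `tokenizer(...)` returned for this sentence in the list-of-strings cell.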

#### Separate list of sequences scenario

In [63]:
# Apply the tokenizer separately, but pass each sequence as a one-element list
tokens1_list_of_seq = tokenizer([sequence1])
tokens2_list_of_seq = tokenizer([sequence2])

print([sequence1], "\n", [sequence2])

print(tokens1_list_of_seq, "\n", tokens2_list_of_seq)

["I've been waiting for a HuggingFace course my whole life."] 
 ['I hate this so much.']
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]} 
 {'input_ids': [[101, 1045, 5223, 2023, 2061, 2172, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1]]}


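#### Note: Calling `tokenizer(...)` directly (here on one-element lists of strings) adds the 101 and 102 special tokens. It is the tokenizer call itself that adds them, not the square brackets; `tokenize()` followed by `convert_tokens_to_ids()` does not.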

# Logits comparison

In [77]:
# Logits for list of strings scenario
print("input ids:", list_of_str_tokens['input_ids'], "\n")
print("Attention mask", list_of_str_tokens['attention_mask'], "\n")

print(model(tf.constant(list_of_str_tokens['input_ids'][0])).logits)
print(model(tf.constant(list_of_str_tokens['input_ids'][1])).logits)

input ids: [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102]] 

Attention mask [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]] 

tf.Tensor([[-1.5606974  1.612282 ]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 4.1692314 -3.3464477]], shape=(1, 2), dtype=float32)


In [79]:
# Logits for separate sequence scenario
print(input_ids1, '\n')
print(input_ids2, '\n')

print(model(input_ids1).logits)
print(model(input_ids2).logits)

tf.Tensor(
[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
  2166  1012], shape=(14,), dtype=int32) 

tf.Tensor([1045 5223 2023 2061 2172 1012], shape=(6,), dtype=int32) 

tf.Tensor([[-2.7276225  2.8789392]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 3.1248865 -2.6449811]], shape=(1, 2), dtype=float32)


In [82]:
# Logits for separate list of sequence scenario
print(tokens1_list_of_seq['input_ids'], "\n Attention mask:", tokens1_list_of_seq['attention_mask'] )
print(tokens2_list_of_seq['input_ids'], "\n Attention mask:", tokens2_list_of_seq['attention_mask'] )

print(model(tf.constant(tokens1_list_of_seq['input_ids'][0])).logits)
print(model(tf.constant(tokens2_list_of_seq['input_ids'][0])).logits)


[[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]] 
 Attention mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
[[101, 1045, 5223, 2023, 2061, 2172, 1012, 102]] 
 Attention mask: [[1, 1, 1, 1, 1, 1, 1, 1]]
tf.Tensor([[-1.5606974  1.612282 ]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 4.2321453 -3.4100935]], shape=(1, 2), dtype=float32)


### Both are not quite right. Need to add padding to the shorter sentence 🧐

In [86]:
# Pad the shorter input_ids list (list of str scenario) with 0, the pad token ID
list_of_str_tokens['input_ids'][1] = list_of_str_tokens['input_ids'][1] + [0] * (
    len(list_of_str_tokens['input_ids'][0]) - len(list_of_str_tokens['input_ids'][1])
)

print(list_of_str_tokens['input_ids'])

[[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]]


# Now pad the shorter attention mask

In [90]:
print(list_of_str_tokens['attention_mask'][0])
print(list_of_str_tokens['attention_mask'][1])

# Attention mask also needs padding


[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1]


In [92]:
# Pad the shorter attention mask with 0 so the padded positions are ignored
list_of_str_tokens['attention_mask'][1] = list_of_str_tokens['attention_mask'][1] + [0] * (
    len(list_of_str_tokens['attention_mask'][0]) - len(list_of_str_tokens['attention_mask'][1])
)

print(list_of_str_tokens['attention_mask'][1])

[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]


In [97]:
# Recheck
print(tf.constant(list_of_str_tokens['input_ids'][0]),"\n")
print(tf.constant(list_of_str_tokens['attention_mask'][0]),"\n")

# Logits for the first sentence
print(model(tf.constant(list_of_str_tokens['input_ids'][0]), attention_mask = tf.constant(list_of_str_tokens['attention_mask'][0])).logits)

# Unchanged

tf.Tensor(
[  101  1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026
  2878  2166  1012   102], shape=(16,), dtype=int32) 

tf.Tensor([1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1], shape=(16,), dtype=int32) 

tf.Tensor([[-1.5606974  1.612282 ]], shape=(1, 2), dtype=float32)


In [98]:
# Logit for the second sentence
print(tf.constant(list_of_str_tokens['input_ids'][1]),"\n")
print(tf.constant(list_of_str_tokens['attention_mask'][1]),"\n")

print(model(tf.constant(list_of_str_tokens['input_ids'][1]), attention_mask = tf.constant(list_of_str_tokens['attention_mask'][1])).logits)


tf.Tensor(
[ 101 1045 5223 2023 2061 2172  999  102    0    0    0    0    0    0
    0    0], shape=(16,), dtype=int32) 

tf.Tensor([1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0], shape=(16,), dtype=int32) 

tf.Tensor([[ 4.1692314 -3.3464475]], shape=(1, 2), dtype=float32)


## 🎓 Pass them through the model and check that you get the same logits as in section 2: Yes, these are the same logits as in section 2. That's because the input IDs include the 101 and 102 special tokens added by the tokenizer.
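
Comparing logits by eye works here, but the last digits can differ between runs (for example -3.3464477 before padding and -3.3464475 after), so an exact `==` check would be too strict. A small sketch of a tolerance-based comparison, using values copied from the outputs in this notebook; `logits_match` and its default tolerance are assumptions made for this illustration.

```python
import math


def logits_match(a, b, abs_tol=1e-4):
    """Return True when two logit vectors agree element-wise within abs_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Second sentence: unpadded vs. padded with an attention mask.
unpadded = [4.1692314, -3.3464477]
padded_with_mask = [4.1692314, -3.3464475]
print(logits_match(unpadded, padded_with_mask))  # True

# [200, 200] alone vs. padded in a batch without an attention mask.
alone = [0.58030325, -0.41252738]
padded_no_mask = [1.3373486, -1.2163193]
print(logits_match(alone, padded_no_mask))  # False
```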

# Batch them

In [102]:
# Batched sep seq
batched_sep_seq = [ids1, ids2]
print(batched_sep_seq)

[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012], [1045, 5223, 2023, 2061, 2172, 1012]]


In [105]:
# This would raise a ValueError because the two lists have different lengths
# batched_sep_seq_ids = tf.constant(batched_sep_seq)
# print(batched_sep_seq_ids)

ValueError: ignored
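
The error comes from the two rows having different lengths: a tensor has to be rectangular. A quick plain-Python check, using the IDs printed above, makes the mismatch visible before calling `tf.constant`; `row_lengths` and `is_rectangular` are hypothetical helper names.

```python
def row_lengths(batch):
    """Return the length of every row in a batch of token ID lists."""
    return [len(row) for row in batch]


def is_rectangular(batch):
    """True when all rows have the same length, so the batch can become a tensor."""
    return len(set(row_lengths(batch))) <= 1


ids1 = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
ids2 = [1045, 5223, 2023, 2061, 2172, 1012]
batched_sep_seq = [ids1, ids2]

print(row_lengths(batched_sep_seq))     # [14, 6]
print(is_rectangular(batched_sep_seq))  # False
```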

### Need to pad the data

In [115]:
ids2 = ids2 + [tokenizer.pad_token_id]*(len(ids1)-len(ids2))

In [116]:
ids2

[1045, 5223, 2023, 2061, 2172, 1012, 0, 0, 0, 0, 0, 0, 0, 0]

In [117]:
batched_sep_seq = [ids1, ids2]
print(batched_sep_seq)
batched_sep_seq_ids = tf.constant(batched_sep_seq)
print(batched_sep_seq_ids)

[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012], [1045, 5223, 2023, 2061, 2172, 1012, 0, 0, 0, 0, 0, 0, 0, 0]]
tf.Tensor(
[[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]
 [ 1045  5223  2023  2061  2172  1012     0     0     0     0     0     0
      0     0]], shape=(2, 14), dtype=int32)


In [118]:
print(model(batched_sep_seq_ids).logits)

tf.Tensor(
[[-2.7276185  2.878935 ]
 [ 1.7251045 -1.5324826]], shape=(2, 2), dtype=float32)


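#### The first row matches the unbatched logits for sequence 1, but the second row (1.7251045, -1.5324826) differs from the unbatched logits for sequence 2 (3.1248865, -2.6449811) because no attention mask was passed, so the model also attended to the padding tokens. Neither row matches section 2, because these "pure" sequences don't include tokens 101 and 102. OK to move to the next section.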
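
Putting the pieces of the exercise together, a batch that should reproduce the per-sentence results needs three things: the special tokens, padding, and a matching attention mask. The sketch below prepares all three in plain Python. `prepare_batch` is a hypothetical helper, and the defaults (pad ID 0, 101 and 102 for the special tokens) are the values observed earlier in this notebook, not something to rely on for other checkpoints.

```python
def prepare_batch(sequences, pad_id=0, cls_id=101, sep_id=102):
    """Add special tokens, right-pad to the longest sequence, and build the mask."""
    with_special = [[cls_id] + list(seq) + [sep_id] for seq in sequences]
    max_len = max(len(seq) for seq in with_special)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in with_special]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in with_special
    ]
    return input_ids, attention_mask


ids1 = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
ids2 = [1045, 5223, 2023, 2061, 2172, 1012]  # unpadded IDs for the second sentence

input_ids, attention_mask = prepare_batch([ids1, ids2])
print(input_ids[1])       # [101, 1045, 5223, 2023, 2061, 2172, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# With TensorFlow and the model loaded as in the cells above, the batch would be passed as:
# model(tf.constant(input_ids), attention_mask=tf.constant(attention_mask))
```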