<a href="https://colab.research.google.com/github/addamit/LMExperiments/blob/main/batch_prompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [12]:

tokenizer = AutoTokenizer.from_pretrained("gpt2")


In [14]:
tokenizer.pad_token  = tokenizer.eos_token

# Why set the pad token explicitly?
# GPT-2's tokenizer has no dedicated padding token, so the line above reuses the
# end-of-sequence token (eos_token) as the padding token. This is a common practice
# for such tokenizers, and it must happen before the tokenizer is called with padding=True.

In [16]:
# tokenize a batch of inputs
prompts = [
    "Morning coffee is",
    "Short walks and runs are good"
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
print(inputs)

{'input_ids': tensor([[42997,  6891,   318, 50256, 50256, 50256],
        [16438, 11114,   290,  4539,   389,   922]]), 'attention_mask': tensor([[1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1]])}
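The printed tensors can be reproduced by hand, which makes it clearer what `padding=True` does. The sketch below is only an illustration: `pad_right` is a hypothetical helper, not part of the tokenizer API, and the token ids are copied from the output above rather than produced by a tokenizer.

```python
import torch

def pad_right(sequences, pad_id):
    """Right-pad lists of token ids to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return torch.tensor(input_ids), torch.tensor(attention_mask)

# Token ids copied from the output above; 50256 is the eos id reused as the pad id.
token_ids = [[42997, 6891, 318], [16438, 11114, 290, 4539, 389, 922]]
input_ids, attention_mask = pad_right(token_ids, pad_id=50256)
print(input_ids)
print(attention_mask)
```

The shorter prompt receives three pad ids on the right, and the mask marks exactly those positions with 0.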


In [19]:

inputs['input_ids'].shape, inputs['attention_mask'].shape

(torch.Size([2, 6]), torch.Size([2, 6]))

In [21]:

inputs['input_ids'], inputs['attention_mask']

(tensor([[42997,  6891,   318, 50256, 50256, 50256],
         [16438, 11114,   290,  4539,   389,   922]]),
 tensor([[1, 1, 1, 0, 0, 0],
         [1, 1, 1, 1, 1, 1]]))

In [22]:

model = AutoModelForCausalLM.from_pretrained("gpt2")


In [25]:
model.eval()
outputs = model.generate(input_ids = inputs['input_ids'],
               attention_mask = inputs['attention_mask'],
               max_new_tokens = 10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
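The warning matters because a decoder-only model continues from the last position of each row, and with right padding that position is a pad token for the shorter prompt. With the tokenizer itself, the usual remedy is the one the warning names: `padding_side='left'`. The sketch below shows the resulting layout with plain tensors; `right_to_left_padding` is a hypothetical helper written for this illustration, not a library function.

```python
import torch

def right_to_left_padding(input_ids, attention_mask, pad_id):
    """Move the padding of a right-padded batch to the left side."""
    left_ids = torch.full_like(input_ids, pad_id)
    left_mask = torch.zeros_like(attention_mask)
    lengths = attention_mask.sum(dim=1).tolist()
    for row, length in enumerate(lengths):
        if length > 0:  # guard: a slice of [-0:] would select the whole row
            left_ids[row, -length:] = input_ids[row, :length]
            left_mask[row, -length:] = 1
    return left_ids, left_mask

# Right-padded batch copied from the tokenizer output shown earlier.
right_ids = torch.tensor([[42997, 6891, 318, 50256, 50256, 50256],
                          [16438, 11114, 290, 4539, 389, 922]])
right_mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1]])

left_ids, left_mask = right_to_left_padding(right_ids, right_mask, pad_id=50256)
print(left_ids)
print(left_mask)

# With left padding, the last column holds a real token in every row.
last_is_real = bool(left_mask[:, -1].all())
print(last_is_real)
```

After the conversion every prompt ends in its own last real token, so reading the final position gives a prediction conditioned on the actual prompt.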


In [27]:

outputs.shape

torch.Size([2, 16])

In [32]:
text1 = tokenizer.decode(outputs[0], skip_special_tokens=True)
text2 = tokenizer.decode(outputs[1], skip_special_tokens=True)
print(f"Sentence 1: {text1}")
print(f"Sentence 2: {text2}")

Sentence 1: Morning coffee is

a good way to get your coffee

Sentence 2: Short walks and runs are good for a lot of people, but they're not


In [38]:
tokenizer.batch_decode(outputs, skip_special_tokens=False)

['Morning coffee is<|endoftext|><|endoftext|><|endoftext|>\n\na good way to get your coffee\n',
 "Short walks and runs are good for a lot of people, but they're not"]

In [39]:

tokenizer.batch_decode(outputs, skip_special_tokens=True)

['Morning coffee is\n\na good way to get your coffee\n',
 "Short walks and runs are good for a lot of people, but they're not"]

In [69]:
past_key_values = None
max_new_tokens = 10
batch_size = inputs['input_ids'].shape[0]

# Start from the padded prompts; each step appends one greedy token per row.
generated_sequences = inputs['input_ids'].clone()

with torch.no_grad():
    for step in range(max_new_tokens):
        if past_key_values is None:
            # First step: run the full (padded) prompts once to fill the KV cache.
            current_attention_mask = inputs['attention_mask']
            outputs = model(input_ids=inputs['input_ids'],
                            attention_mask=current_attention_mask,
                            use_cache=True)
        else:
            # Later steps: feed only the newest token and reuse the cache. The mask must
            # cover every cached position plus the new one, so append a column of 1s.
            new_column = torch.ones(batch_size, 1, dtype=current_attention_mask.dtype)
            current_attention_mask = torch.cat([current_attention_mask, new_column], dim=-1)
            outputs = model(input_ids=next_tokens,
                            attention_mask=current_attention_mask,
                            past_key_values=past_key_values,
                            use_cache=True)

        logits = outputs.logits
        past_key_values = outputs.past_key_values
        # Greedy choice from the last position. NOTE: the prompts are right-padded, so on
        # the first step the last position of the shorter prompt is a pad token.
        next_tokens = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        if step == 0:
            print(f"Shape Starting next tokens {next_tokens.shape}")

        # Stop as soon as any sequence in the batch emits EOS; this ends every row.
        if torch.any(next_tokens.squeeze(-1) == tokenizer.eos_token_id):
            print("Reached end of sentence")
            break

        generated_sequences = torch.cat([generated_sequences, next_tokens], dim=-1)
        print(tokenizer.batch_decode(generated_sequences, skip_special_tokens=True))




Shape Starting next tokens torch.Size([2, 1])
['Morning coffee isThe', 'Short walks and runs are good for']
['Morning coffee isThe best', 'Short walks and runs are good for a']
['Morning coffee isThe best way', 'Short walks and runs are good for a lot']
['Morning coffee isThe best way to', 'Short walks and runs are good for a lot of']
['Morning coffee isThe best way to get', 'Short walks and runs are good for a lot of people']
['Morning coffee isThe best way to get your', 'Short walks and runs are good for a lot of people,']
['Morning coffee isThe best way to get your coffee', 'Short walks and runs are good for a lot of people, but']
['Morning coffee isThe best way to get your coffee.', 'Short walks and runs are good for a lot of people, but they']
['Morning coffee isThe best way to get your coffee. It', "Short walks and runs are good for a lot of people, but they're"]
["Morning coffee isThe best way to get your coffee. It's", "Short walks and runs are good for a lot of people, but the
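The loop above stops the whole batch as soon as any one row produces EOS. One possible alternative, sketched here under assumptions, is to track a per-row `finished` flag and keep decoding until every row is done. `append_step` is a hypothetical helper, and the toy ids (`eos_id=2`, `pad_id=0`) are made up so that padding and EOS are distinguishable; with GPT-2 as configured in this notebook they would both be 50256.

```python
import torch

def append_step(generated, next_tokens, finished, eos_id, pad_id):
    """Append one decoding step; rows that already hit EOS only receive pad_id."""
    next_tokens = next_tokens.squeeze(-1)
    next_tokens = torch.where(finished, torch.full_like(next_tokens, pad_id), next_tokens)
    generated = torch.cat([generated, next_tokens.unsqueeze(-1)], dim=-1)
    finished = finished | (next_tokens == eos_id)
    return generated, finished

# Toy data standing in for model predictions (no model is called here).
generated = torch.tensor([[5, 6], [7, 8]])
finished = torch.zeros(2, dtype=torch.bool)
steps = [torch.tensor([[2], [3]]),   # row 0 emits EOS
         torch.tensor([[4], [2]]),   # row 1 emits EOS, row 0 is already finished
         torch.tensor([[4], [4]])]   # never reached

for next_tokens in steps:
    generated, finished = append_step(generated, next_tokens, finished, eos_id=2, pad_id=0)
    if finished.all():
        break

print(generated)
```

Row 0 stops growing meaningfully after its EOS and is padded, while row 1 continues for one more step before the loop ends.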

```
Initial attention mask:

For causal LMs it has shape [batch_size, sequence_length].
It contains 1s for real tokens and 0s for padding tokens.


Extending the attention mask:

For each newly generated token, extend the mask by appending a column of 1s.
The mask then covers the cached positions plus the new token, so the new token can attend to every previous non-padding token.


```
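The mask extension described in the note can be tried on its own, without a model. This is a minimal sketch: `extend_attention_mask` is a hypothetical helper name, and the starting mask is the one printed for the two prompts earlier.

```python
import torch

def extend_attention_mask(attention_mask):
    """Append one column of 1s so the mask also covers the newly generated token."""
    # new_ones keeps the dtype and device of the existing mask.
    new_column = attention_mask.new_ones((attention_mask.shape[0], 1))
    return torch.cat([attention_mask, new_column], dim=-1)

mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1]])

# Two decoding steps add two columns.
for _ in range(2):
    mask = extend_attention_mask(mask)

print(mask.shape)
print(mask)
```

The original 0s stay in place, so the padded prompt positions remain hidden while each generated token becomes visible to later steps.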

In [54]:
print(inputs['input_ids'].shape)
print(inputs['attention_mask'].shape)
print(outputs.keys())
print(outputs['logits'].shape)
print(len(outputs['past_key_values']))


torch.Size([2, 6])
torch.Size([2, 6])
odict_keys(['logits', 'past_key_values'])
torch.Size([2, 6, 50257])
12


In [57]:
next_token_id = torch.argmax(outputs['logits'][:, -1, :], dim=-1, keepdim=True)
print(next_token_id.shape)
tokenizer.batch_decode(next_token_id, skip_special_tokens=False)

torch.Size([2, 1])


['The', ' for']
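Because the batch is right-padded, `logits[:, -1, :]` reads the first prompt at a pad position, which plausibly explains why its first token is 'The' with no leading space. One way to read each row at its last real token instead is to index with the attention mask. The sketch below uses toy logits so the difference is visible; `last_real_token_logits` is a hypothetical helper, and left padding remains the simpler fix for actual generation.

```python
import torch

def last_real_token_logits(logits, attention_mask):
    """Pick each row's logits at its last non-padding position (right-padded batch)."""
    last_index = attention_mask.sum(dim=1) - 1
    rows = torch.arange(logits.shape[0])
    return logits[rows, last_index]

# Toy logits: batch of 2, sequence length 3, vocabulary of 4.
logits = torch.zeros(2, 3, 4)
logits[0, 0, 2] = 1.0   # row 0, last real position favours token 2
logits[0, 2, 3] = 1.0   # row 0, pad position favours token 3
logits[1, 2, 1] = 1.0   # row 1 has no padding; last position favours token 1
attention_mask = torch.tensor([[1, 0, 0],
                               [1, 1, 1]])

naive_next = logits[:, -1, :].argmax(dim=-1)
masked_next = last_real_token_logits(logits, attention_mask).argmax(dim=-1)
print(naive_next.tolist(), masked_next.tolist())
```

The two selections agree for the unpadded row and differ for the padded one.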

In [73]:

prompt = "Morning coffee is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids.shape


torch.Size([1, 3])

In [79]:
custom_attention_mask = torch.tensor([[1, 0, 0]], dtype=int)

outputs = model.generate(input_ids = input_ids, attention_mask = custom_attention_mask, max_new_tokens=10)

tokenizer.decode(outputs[0], skip_special_tokens=True)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Morning coffee is the next day, I was in the middle of'
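A mask of `[1, 0, 0]` asks the model to ignore the last two prompt tokens, which is a likely reason the continuation above drifts away from coffee. The sketch below is a simplified illustration of how a mask can remove positions from softmax attention. It is not GPT-2's actual implementation: the scores are made up, the causal mask is omitted, and `attention_weights` is a hypothetical helper.

```python
import torch

def attention_weights(scores, attention_mask):
    """Softmax over key positions, with masked positions forced to zero weight."""
    masked = scores.masked_fill(attention_mask == 0, float("-inf"))
    return torch.softmax(masked, dim=-1)

# Hypothetical raw attention scores of one query over three prompt tokens.
scores = torch.tensor([[2.0, 1.0, 0.5]])

only_first = attention_weights(scores, torch.tensor([[1, 0, 0]]))
all_tokens = attention_weights(scores, torch.tensor([[1, 1, 1]]))
print(only_first)
print(all_tokens)
```

With the partial mask all of the weight lands on the first token, while the full mask spreads it over the whole prompt.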

In [81]:
custom_attention_mask = torch.tensor([[1, 1, 1]], dtype=int)

outputs = model.generate(input_ids = input_ids, attention_mask = custom_attention_mask, max_new_tokens=10)

tokenizer.decode(outputs[0], skip_special_tokens=True)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Morning coffee is a great way to get a good night's sleep"