# Padding with Generative Models: What to Remember

When working with generative language models in HuggingFace Transformers, padding is crucial for batching and consistent input lengths. However, improper handling can lead to unexpected outputs or degraded performance.


## TLDR
* set the padding side to left
* set the correct padding token
* set the attention mask to tell the model which tokens are padding and should be ignored

In [1]:
import os

# TODO adjust to your GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

We set padding_side to "left".

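Explanation: Causal models generate tokens autoregressively from left to right, and each new token is predicted from the last position of the input. Left padding keeps the final prompt token in that last position for every sequence in the batch, so generation continues directly from the prompt rather than from padding tokens.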
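To make the difference between the two padding sides concrete, here is a minimal pure-Python sketch. The helper `pad_batch` and the token ids are hypothetical and only illustrate the layout; a real tokenizer does this for you.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token ids to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, real = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + real)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(real + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token ids; 0 stands in for the pad token id.
batch = [[11, 12, 13], [21, 22, 23, 24, 25]]

left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")

print(left_ids[0], left_mask[0])    # [0, 0, 11, 12, 13] [0, 0, 1, 1, 1]
print(right_ids[0], right_mask[0])  # [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]
```

With left padding, the last column of every row is a real token. With right padding, the shorter row ends in a pad token, which is the position generation would continue from.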

We set the pad token to the EOS token.

Explanation: For Llama and similar models, the EOS token is often reused as the pad token when the tokenizer does not define a dedicated one. For other models, check the tokenizer config or documentation.
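If you want to avoid overwriting a pad token that a tokenizer already defines, a small guard can help. The sketch below assumes the tokenizer exposes `pad_token` and `eos_token` attributes, as Hugging Face tokenizers do; the helper name `ensure_pad_token` and the stand-in object are made up for illustration.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Fall back to the EOS token only if the tokenizer defines no pad token."""
    if tokenizer.pad_token is None:
        if tokenizer.eos_token is None:
            raise ValueError("Tokenizer has neither a pad token nor an EOS token.")
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer.pad_token


# Stand-in object for illustration; a real tokenizer would come from
# AutoTokenizer.from_pretrained(...).
fake_tokenizer = SimpleNamespace(pad_token=None, eos_token="<|end_of_text|>")
print(ensure_pad_token(fake_tokenizer))  # <|end_of_text|>
```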

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


# We will use the same tokenizer, model, and prompt across all conditions.
TOKENIZER = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
MODEL = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
MODEL.to("cuda")
PROMPT = "Hello, how are you?"
MAX_NEW_TOKENS = 100

TOKENIZER.pad_token = TOKENIZER.eos_token
TOKENIZER.padding_side = "left"

MODEL.config.pad_token_id = TOKENIZER.eos_token_id


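def generate_text(input_ids, attention_mask):
    # Greedy decoding (sampling disabled) so that outputs are directly comparable across runs.
    output = MODEL.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=MAX_NEW_TOKENS,
        pad_token_id=TOKENIZER.eos_token_id,
        do_sample=False,
        temperature=None,
        top_k=None,
        top_p=None,
    )
    # Drop the prompt (and any padding) and decode only the newly generated tokens.
    return TOKENIZER.decode(output[0][len(input_ids[0]):], skip_special_tokens=False)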




First, we generate text without padding.
When we pad the input, we want to get the exact same output from the model as without padding.

In [3]:
# NO PADDING
input = TOKENIZER(PROMPT, return_tensors="pt").to("cuda")

print(f"Decoded input: {TOKENIZER.decode(input['input_ids'][0])}")

output_no_padding = generate_text(input_ids=input["input_ids"], attention_mask=None)

print(f"Output without padding: {output_no_padding}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Decoded input: <|begin_of_text|>Hello, how are you?
Output without padding:  I hope you are well. I am writing to you because I am looking for a job. I am a hard worker and I am very responsible. I am a good person and I am very friendly. I am looking for a job because I need to support my family. I am a good cook and I am very good at cleaning. I am also very good at taking care of children. I am looking for a job that will allow me to use my skills and help me to support my family


Next, we apply padding to max_length. We set the attention mask to ignore these padding tokens.

In [4]:
# PADDING AND ATTENTION MASK
input = TOKENIZER(PROMPT, return_tensors="pt", padding="max_length", max_length=100).to("cuda")

print(f"Decoded input: {TOKENIZER.decode(input['input_ids'][0])}")

# this is what the attention mask should look like (all 1s except for the padding tokens, which are 0s)
test_attention_mask = torch.where(input['input_ids'] == TOKENIZER.pad_token_id, torch.zeros_like(input['input_ids']), torch.ones_like(input['input_ids']))
assert torch.all(input["attention_mask"] == test_attention_mask)

output_padding = generate_text(input_ids=input["input_ids"], attention_mask=input["attention_mask"])
print(f"Output with padding: {output_padding}")

Decoded input: <|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_t

With left padding and the attention mask, the model should generate exactly the same output as without padding. The assertion below checks this.

In [5]:
assert output_no_padding == output_padding
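One reason the padded and unpadded runs can agree is that the attention mask also determines the position of each real token. For many decoder-only models, Transformers derives position ids from the attention mask during generation, so the first real token gets position 0 regardless of how much left padding precedes it. The following pure-Python sketch mirrors that idea; it is an illustration with a hypothetical helper name, not the library's code.

```python
def position_ids_from_mask(attention_mask):
    """Count real tokens cumulatively (minus one); padded slots get a dummy position of 1."""
    position_ids = []
    for row in attention_mask:
        seen = 0
        positions = []
        for m in row:
            seen += m
            positions.append(seen - 1 if m == 1 else 1)
        position_ids.append(positions)
    return position_ids


mask_left_padded = [[0, 0, 0, 1, 1, 1]]
mask_unpadded = [[1, 1, 1]]

print(position_ids_from_mask(mask_left_padded))  # [[1, 1, 1, 0, 1, 2]]
print(position_ids_from_mask(mask_unpadded))     # [[0, 1, 2]]
```

The real tokens receive positions 0, 1, 2 in both cases; the positions assigned to padded slots do not matter because those slots are masked out.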

## What not to do

1. **Not setting the pad token:** If you don’t set it, padding may raise an error or trigger warnings, and the model may treat padding tokens as valid input.
2. **Not passing attention mask:** If you omit the `attention_mask`, the model may attend to padding tokens, leading to poor or unpredictable generations (see 1st example below).
3. **Using right padding with causal models:** Generation continues from the last position of the input. With right padding, that position holds a pad token, so new tokens are appended after the padding instead of directly after the prompt, and the output can differ from the unpadded baseline (see 2nd example below).
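The three pitfalls above can be caught before calling `generate` with a small sanity check. This is a minimal sketch on plain Python lists; the function name `check_generation_inputs` and its messages are hypothetical. Note that when the pad token equals the EOS token, a prompt that legitimately contains EOS would also be flagged by the second check.

```python
def check_generation_inputs(input_ids, attention_mask, pad_id):
    """Return a list of human-readable problems found in a (possibly padded) batch."""
    problems = []
    if pad_id is None:
        problems.append("no pad token id set")
    if attention_mask is None:
        if pad_id is not None and any(pad_id in row for row in input_ids):
            problems.append("pad token id present but no attention mask passed")
        return problems
    # With left padding, the last position of every row is a real token.
    if any(row[-1] == 0 for row in attention_mask):
        problems.append("last position is padding (right padding?)")
    return problems


# Hypothetical token ids; 0 stands in for the pad token id.
print(check_generation_inputs([[0, 0, 5, 6]], [[0, 0, 1, 1]], pad_id=0))  # []
print(check_generation_inputs([[0, 0, 5, 6]], None, pad_id=0))
print(check_generation_inputs([[5, 6, 0, 0]], [[1, 1, 0, 0]], pad_id=0))
```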




In [6]:
# BAD: Padding, but no attention mask
input = TOKENIZER(PROMPT, return_tensors="pt", padding="max_length", max_length=100).to("cuda")

print(f"Decoded input: {TOKENIZER.decode(input['input_ids'][0])}")

output = generate_text(input_ids=input["input_ids"], attention_mask=None)

print(f"Output: {output}")

Decoded input: <|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_t

In [7]:
# BAD: Set padding side to "right", yields a different output
TOKENIZER.padding_side = "right"

input = TOKENIZER(PROMPT, return_tensors="pt", padding="max_length", max_length=100).to("cuda")

print(f"Decoded input: {TOKENIZER.decode(input['input_ids'][0])}")

output = generate_text(input_ids=input["input_ids"], attention_mask=input["attention_mask"])

print(f"Output: {output}")

Decoded input: <|begin_of_text|>Hello, how are you?<|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|en

## Summary

| Scenario              | Output quality | Notes                    |
|-----------------------|----------------|--------------------------|
| No padding            | Good           | Baseline                 |
| Left padding + mask   | Good           | Matches baseline         |
| Left padding, no mask | Bad            | Model attends to padding |
| Right padding + mask  | Bad            | Does not match baseline  |

## vLLM and Padding
Lastly, let's look at how vLLM handles batches of prompts with different lengths. Unlike the manual approach above, vLLM takes the raw prompts and manages batching internally, so you do not have to set a pad token, a padding side, or an attention mask yourself.
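vLLM's internals are not covered in this notebook. As a purely conceptual illustration of how a batch can be processed without per-sequence padding, the sketch below packs sequences into one flat list and records where each sequence starts. Whether and how a given vLLM version does this is an assumption here, and the helper `pack_without_padding` is hypothetical.

```python
from itertools import accumulate


def pack_without_padding(sequences):
    """Concatenate sequences into one flat list and record where each one starts."""
    flat_tokens = [tok for seq in sequences for tok in seq]
    # Sequence i occupies flat_tokens[starts[i]:starts[i + 1]].
    starts = [0] + list(accumulate(len(seq) for seq in sequences))
    return flat_tokens, starts


# Hypothetical token ids for two prompts of different lengths.
batch = [[11, 12, 13], [21, 22, 23, 24, 25]]
flat, starts = pack_without_padding(batch)

print(flat)    # [11, 12, 13, 21, 22, 23, 24, 25]
print(starts)  # [0, 3, 8]
```

No pad tokens appear in the packed batch, so there is nothing for an attention mask to hide; the offsets alone tell the attention computation which tokens belong together.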

In [8]:
# clear the GPU cache
import torch
import gc

del MODEL
gc.collect()
torch.cuda.empty_cache()

In [9]:

# Example: Using vLLM with proper padding
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B", device="cuda")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=100,
)


INFO 07-03 07:20:19 [__init__.py:244] Automatically detected platform cuda.
INFO 07-03 07:20:28 [config.py:823] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 07-03 07:20:29 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=16384.




INFO 07-03 07:20:34 [__init__.py:244] Automatically detected platform cuda.
INFO 07-03 07:20:35 [core.py:455] Waiting for init message from front-end.
INFO 07-03 07:20:35 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='meta-llama/Meta-Llama-3.1-8B', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_d




INFO 07-03 07:20:40 [default_loader.py:272] Loading weights took 2.73 seconds
INFO 07-03 07:20:41 [gpu_model_runner.py:1624] Model loading took 14.9889 GiB and 3.948407 seconds
INFO 07-03 07:20:46 [backends.py:462] Using cache directory: /home/elisabeth/.cache/vllm/torch_compile_cache/260813549e/rank_0_0 for vLLM's torch.compile
INFO 07-03 07:20:46 [backends.py:472] Dynamo bytecode transform time: 4.88 s
INFO 07-03 07:20:50 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 3.660 s
INFO 07-03 07:20:51 [monitor.py:34] torch.compile takes 4.88 s in total
INFO 07-03 07:20:52 [gpu_worker.py:227] Available KV cache memory: 105.91 GiB
INFO 07-03 07:20:52 [kv_cache_utils.py:715] GPU KV cache size: 867,600 tokens
INFO 07-03 07:20:52 [kv_cache_utils.py:719] Maximum concurrency for 131,072 tokens per request: 6.62x
INFO 07-03 07:21:10 [gpu_model_runner.py:2048] Graph capturing finished in 18 secs, took 0.66 GiB
INFO 07-03 07:21:10 [core.py:171] init engine 

In [10]:

# Prepare prompts
prompts = [
    PROMPT,
    "A longer prompt that will require padding"
]

# vLLM batches the prompts internally; no pad token or attention mask is set here.
outputs = llm.generate(prompts, sampling_params)
# Results come back in prompt order, so outputs[0] belongs to PROMPT.
output_vllm = outputs[0].outputs[0].text

# Print results
print(f"Output: {output_vllm}")


Output:  I hope you are well. I am writing to you because I am looking for a job. I am a hard worker and I am very responsible. I am a good person and I am very friendly. I am looking for a job because I need to support my family. I am a good cook and I am very good at cleaning. I am also very good at taking care of children. I am looking for a job that will allow me to use my skills and help me to support my family


Again, the output from vLLM is the same as the output without padding and the output of the manual left-padding approach.

In [11]:
assert output_no_padding == output_vllm == output_padding

## Further reading
* [HuggingFace Docs: Padding & Attention Mask](https://huggingface.co/docs/transformers/pad_truncation)
* [HuggingFace Forums: Padding for CausalLM](https://discuss.huggingface.co/t/the-effect-of-padding-side/67188)