# Efficient inference with Transformers

In this exercise, we take a closer look at our LLM. We will explore strategies for reducing the size of a model and examine how they affect the quality of its outputs.

In [None]:
! pip install transformers torch tqdm accelerate --upgrade --quiet

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import matplotlib
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Selecting the font size here will affect all the figures in this notebook
# Alternatively, you can set the font size for axis labels of each figure separately
font = {'size': 16}
matplotlib.rc('font', **font)

# Exercise 1: Estimating the size of our model [30 mins]

Recall that a HuggingFace `model` is simply a PyTorch neural network. Inspect the model by finding the following:

1. The total number of layers.
2. The names of all the parameter tensors inside each layer, along with their data types.
3. The output dimensionality of each layer, along with its data type.
4. The data type of each tensor and its estimated storage requirement in MB.
5. The total size of the model in MB.

**Hint:** You can use `print(model)` to see all the layers and sublayers inside the model.

**Hint:** Recall that the model and all of its layers are simply `torch.nn` modules. This means you can call `model.named_parameters()` or `layer.named_parameters()` to iterate over the parameters of the model or of a single layer. See the [documentation of `named_parameters`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_parameters).
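
The iteration pattern from the hint can be tried on a tiny network before running it on the full model. The sketch below is only an illustration: `toy` and `parameter_report` are hypothetical names that do not appear elsewhere in this notebook, and the toy network stands in for the real model.

```python
import torch

# Hypothetical toy network, used only to illustrate the iteration pattern.
toy = torch.nn.Sequential(
    torch.nn.Linear(4, 8),   # weight: 8x4, bias: 8
    torch.nn.ReLU(),         # no parameters
    torch.nn.Linear(8, 2),   # weight: 2x8, bias: 2
)

def parameter_report(module: torch.nn.Module) -> list:
    """Return (name, dtype, size in bytes) for every parameter tensor."""
    report = []
    for name, param in module.named_parameters():
        # Storage = number of elements * bytes per element
        size_bytes = param.numel() * param.element_size()
        report.append((name, param.dtype, size_bytes))
    return report

report = parameter_report(toy)
for name, dtype, size_bytes in report:
    print(f"{name:10s} {dtype}  {size_bytes / 1024**2:.6f} MB")

total_bytes = sum(size for _, _, size in report)
print(f"Total: {total_bytes} bytes")
```

The same function should work unchanged when passed `model` or a single layer, since both are `torch.nn.Module` instances.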

In [None]:
model_name = "Qwen/Qwen2.5-0.5B" # Very small model that we used in the last class. Can be used without a GPU.
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Your code here
print(model) # Printing the model will give you an overview of its layers

In [None]:
from collections import Counter

parameter_names = []
parameter_sizes = []
num_elements = []
element_sizes = []

# named_parameters() lists each parameter tensor only once. When lm_head.weight
# does not show up as a separate entry, it is tied to (shares storage with) the
# input embedding matrix. Chaining model.lm_head.named_parameters() on top would
# therefore count the same tensor twice, so we iterate over the model only.
for pname, pval in model.named_parameters():
    num_params = pval.numel()  # number of elements in this parameter tensor
    param_size = pval.element_size()  # Size of a single element in bytes
    parameter_names.append(pname)
    parameter_sizes.append(num_params * param_size)  # Size of this tensor
    num_elements.append(num_params)
    element_sizes.append(param_size)

total_bytes = sum(parameter_sizes)
print(f"Total number of parameters: {sum(num_elements)}")
print(f"Total model size: {total_bytes} bytes ({total_bytes / 1024**2:.1f} MB)")
print(f"Element size in bytes -> number of parameter tensors: {Counter(element_sizes)}")

element_count_sorted = sorted(zip(parameter_names, num_elements), key=lambda x: x[1], reverse=True)
print(f"Tensor with largest num_params: {element_count_sorted[:5]}")

# Exercise 2: Converting the model to bfloat16 [10 mins]

You might have noticed that `AutoModelForCausalLM.from_pretrained` loaded the model in the `torch.float32` data type. Load the model again, this time passing the argument `torch_dtype=torch.bfloat16` to the `from_pretrained` method.

Compute the size of individual layers and the total model size again.

**Solution:** Pass `torch_dtype=torch.bfloat16` to the `from_pretrained` method and run the previous code cell again.
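
To see the expected effect in isolation, the following sketch compares the storage of one small layer in both data types. It is an illustration under assumptions: `module_size_bytes` and the 16x16 layer are hypothetical and not part of the exercise code.

```python
import copy
import torch

def module_size_bytes(module: torch.nn.Module) -> int:
    """Sum of numel * element_size over all parameters of a module."""
    return sum(p.numel() * p.element_size() for p in module.parameters())

# Hypothetical toy layer: 16*16 weights + 16 biases = 272 parameters.
layer_fp32 = torch.nn.Linear(16, 16)

# Module.to(dtype) casts in place, so cast a copy to keep the float32 original.
layer_bf16 = copy.deepcopy(layer_fp32).to(torch.bfloat16)

size_fp32 = module_size_bytes(layer_fp32)  # 4 bytes per element
size_bf16 = module_size_bytes(layer_bf16)  # 2 bytes per element

print(f"float32:  {size_fp32} bytes")
print(f"bfloat16: {size_bf16} bytes")
```

Since `bfloat16` uses 2 bytes per element instead of 4, the total size should be halved while the number of parameters stays the same.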



# Exercise 3: Capturing different aspects of model outputs [20 mins]

We would like to compare the difference in the parameters and outputs of the models when we quantize them.

Write a function that, given a prompt, returns the following stats for the original model loaded in `torch.float32` precision.

1. First N output tokens generated in a greedy manner.
2. Softmax probabilities at each generation position. In other words, compute these probabilities not just for output tokens but _also for input tokens_.
3. Weights at the Kth transformer layer where K is a parameter to the function.

Make sure your function always returns all of these tensors in `float32` precision.

**Hint:** You can use the generate function that you wrote in one of the previous lectures.
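
The shape of the softmax output in point 2 is easy to get wrong, so here is a minimal sketch of greedy decoding on a stand-in model. Everything in it is an assumption made for illustration: `toy_logits`, the vocabulary size of 5, and the random embedding are hypothetical and only mimic the `(batch, sequence, vocabulary)` shape of real logits.

```python
import torch

torch.manual_seed(0)
vocab_size = 5

# Hypothetical stand-in for a language model: one logit vector per input position.
embed = torch.nn.Embedding(vocab_size, 8)
head = torch.nn.Linear(8, vocab_size)

def toy_logits(input_ids: torch.Tensor) -> torch.Tensor:
    """Map ids of shape (1, seq_len) to logits of shape (1, seq_len, vocab_size)."""
    return head(embed(input_ids))

input_ids = torch.tensor([[1, 3]])  # a "prompt" of two tokens
for _ in range(3):                  # generate three tokens greedily
    with torch.no_grad():
        logits = toy_logits(input_ids)
    next_id = logits[0, -1].argmax()  # most likely token after the last position
    input_ids = torch.cat([input_ids, next_id.reshape(1, 1)], dim=-1)

# Probabilities for every position seen in the last forward pass,
# i.e. for the prompt tokens as well as the generated ones.
probs = torch.softmax(logits[0], dim=-1).to(torch.float32)
print(input_ids.shape, probs.shape)
```

Note that the last forward pass sees one token fewer than the final sequence, so `probs` has one row fewer than `input_ids` has tokens.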

In [None]:
def generate(
        model: torch.nn.Module,
        prompt: str,
        gen_len: int,
        layer_number: int,
        pname: str = "mlp.up_proj.weight",
    ) -> tuple[str, torch.Tensor, torch.Tensor]:
    # Your code here
    # Note: this function uses the global `tokenizer` loaded above.
    tokenized_input = tokenizer(prompt, return_tensors="pt")
    input_ids = tokenized_input["input_ids"]
    for _ in range(gen_len):
        with torch.no_grad():
            # We should pass the attention mask, but it can be ignored for causal LLMs when there is just a single input
            output = model(input_ids).logits
        output = output.squeeze(dim=0)  # (seq_len, vocab_size)
        next_token_id = torch.argmax(output[-1], dim=-1)  # Most likely token after the final input token
        input_ids = torch.cat((input_ids, next_token_id.reshape(1, 1)), dim=-1)
    gen = tokenizer.decode(input_ids[0])  # Includes the input too
    # Probabilities at every position of the last forward pass, always in float32
    softmax_probs = torch.softmax(output, dim=-1).detach().to(torch.float32)
    w = model.get_parameter(f"model.layers.{layer_number}.{pname}")
    return gen, softmax_probs, w.detach().to(torch.float32)

prompt = "Einstein was born in"
gen, softmax_probs, w = generate(model, prompt, 10, 10, "self_attn.q_proj.weight")


# Exercise 4: Comparing differences in model outputs [20 mins]

Use three different versions of the model:
1. The originally loaded model in `torch.float32` format.
2. A model that you manually cast to `torch.float16` using the [`model.half()`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.half) function.
3. A model that you load in `torch.bfloat16` by passing the `torch_dtype=torch.bfloat16` argument to `from_pretrained`.

Compare the following properties of these three models:
1. Difference in generated tokens.
2. 10 most likely tokens at each generation position.
3. L2 norm of difference in weights.
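
Points 2 and 3 can be computed with `torch.topk` and `torch.linalg.norm`. The sketch below applies both to random tensors rather than to the real model, so the tensors `w_fp32` and `probs` are illustrative assumptions; casting to a lower precision and back to `float32` mimics what happens to the weights.

```python
import torch

torch.manual_seed(0)

# Hypothetical weight matrix standing in for one layer of the model.
w_fp32 = torch.randn(64, 64)

# Round to the lower-precision format, then cast back so we can subtract.
w_fp16 = w_fp32.to(torch.float16).to(torch.float32)
w_bf16 = w_fp32.to(torch.bfloat16).to(torch.float32)

# For a matrix, torch.linalg.norm defaults to the Frobenius norm,
# i.e. the L2 norm over all elements of the difference.
l2_fp16 = torch.linalg.norm(w_fp32 - w_fp16).item()
l2_bf16 = torch.linalg.norm(w_fp32 - w_bf16).item()
print(f"L2 diff FP16: {l2_fp16:.6f}")
print(f"L2 diff BF16: {l2_bf16:.6f}")

# Hypothetical softmax output: 3 positions over a vocabulary of 100 tokens.
probs = torch.softmax(torch.randn(3, 100), dim=-1)
top = torch.topk(probs, k=10, dim=-1)  # top.values and top.indices: (3, 10)
print(top.indices[0])                  # ids of the 10 most likely tokens at position 0
```

For values in this range, `bfloat16` keeps fewer mantissa bits than `float16`, so its rounding error should come out larger. With the real model, the token ids in `top.indices` can be turned into strings with `tokenizer.convert_ids_to_tokens`.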

In [None]:
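# Your code here
import copy

prompt = "Einstein was born in"
gen_len = 20
layer_num = 20
#pname = "self_attn.q_proj.weight"
pname = "mlp.up_proj.weight"

# Load the base model explicitly in float32 so that it does not depend on earlier cells.
model_fp32 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
# model.half() casts the module in place and returns the same object,
# so cast a copy to keep the float32 model intact.
model_fp16 = copy.deepcopy(model_fp32).half()
model_bf16 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

params = (prompt, gen_len, layer_num)
k_params = {"pname": pname}
g_base, s_base, w_base = generate(model_fp32, *params, **k_params)
g_fp16, s_fp16, w_fp16 = generate(model_fp16, *params, **k_params)
g_bf16, s_bf16, w_bf16 = generate(model_bf16, *params, **k_params)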

In [None]:
# FP32 vs. FP16 vs. BF16
labels = ["FP32", "FP16", "BF16"]
gens = [g_base, g_fp16, g_bf16]
softmaxes = [s_base, s_fp16, s_bf16]
weights = [w_base, w_fp16, w_bf16]

print("= Generations =")
for mname, g in zip(labels, gens):
    print(f"{mname} -- {g}")
print("\n\n")
print("= Softmax =")
for mname, s in zip(labels, softmaxes):
    print(f"- {mname} -")
    print(s.max(-1).values.to(torch.float32).numpy())
print("\n\n")
print("= Weight diff from base (L2 norm) =")
for mname, w in zip(labels, weights):
    print(f"- {mname} -")
    # L2 norm of the elementwise difference, as asked for in the exercise
    print(f"{torch.linalg.norm(w - w_base).item()}")

