# Efficient inference with Transformers

In this exercise, we will look more closely at our LLM. We will explore strategies for reducing the size of our models and examine how they affect the quality of model outputs.

In [None]:
! pip install transformers torch tqdm accelerate --upgrade --quiet

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import matplotlib
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Selecting the font size here will affect all the figures in this notebook
# Alternatively, you can set the font size for axis labels of each figure separately
font = {'size': 16}
matplotlib.rc('font', **font)

# Exercise 1: Estimating the size of our model [30 mins]

Recall that a HuggingFace `model` is simply a PyTorch neural network. Inspect the model by doing the following:

1. Print the total number of layers.
2. Print the names of all the parameter tensors inside each layer, along with their data types.
3. Print the output dimensionality of each layer, along with its data type.
4. Print the data type of each tensor and its estimated storage requirement in MB.
5. Print the total size of the model in MB.

**Hint:** You can use `print(model)` to see all the layers and sublayers inside the model.

**Hint:** Recall that all the layers and the model itself are simply `torch.nn` modules, which means you can call `model.named_parameters()` or `layer.named_parameters()` to iterate over the parameters of the model or of a layer. See the [documentation of `named_parameters`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_parameters).
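
As a rough illustration of this hint, the sketch below walks over `named_parameters()` of a tiny stand-in network instead of the real model. The `toy` module and the `parameter_report` helper are hypothetical names introduced only for this example. The sketch assumes 1 MB = 2**20 bytes and counts parameters only (not buffers).

```python
import torch
from torch import nn


def parameter_report(module):
    """Return (name, shape, dtype, size_in_bytes) for every parameter tensor."""
    rows = []
    for name, param in module.named_parameters():
        # numel() is the number of elements, element_size() the bytes per element.
        size_bytes = param.numel() * param.element_size()
        rows.append((name, tuple(param.shape), param.dtype, size_bytes))
    return rows


# Hypothetical toy network standing in for the real model.
toy = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

report = parameter_report(toy)
for name, shape, dtype, size_bytes in report:
    print(f"{name:10s} {str(shape):10s} {dtype}  {size_bytes / 2**20:.6f} MB")

total_bytes = sum(row[3] for row in report)
print(f"total: {total_bytes / 2**20:.6f} MB")
```

The same loop should work on `model` or on any individual layer, since both are `torch.nn` modules.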

In [None]:
model_name = "Qwen/Qwen2.5-0.5B" # Very small model that we used in the last class. Can be used without a GPU.
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Your code here

# Exercise 2: Converting the model to bfloat16 [10 mins]

You might have noticed that `AutoModelForCausalLM.from_pretrained` loaded the model in the `torch.float32` data type. Load the model again, this time passing the argument `torch_dtype=torch.bfloat16` to the `from_pretrained` method.

Compute the size of individual layers and the total model size again.
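
To get a feel for what to expect, here is a minimal sketch on a toy layer rather than the real model. It casts a copy with `.to(torch.bfloat16)` as a stand-in for reloading with `torch_dtype`. The names `size_in_bytes`, `toy_fp32` and `toy_bf16` are hypothetical.

```python
import copy

import torch
from torch import nn


def size_in_bytes(module):
    """Sum the storage of all parameter tensors of a module."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


# Hypothetical toy layer; deepcopy keeps the float32 original untouched,
# because Module.to() converts the parameters of the module it is called on.
toy_fp32 = nn.Linear(16, 16)
toy_bf16 = copy.deepcopy(toy_fp32).to(torch.bfloat16)

fp32_bytes = size_in_bytes(toy_fp32)
bf16_bytes = size_in_bytes(toy_bf16)
print(f"float32: {fp32_bytes} bytes, bfloat16: {bf16_bytes} bytes")
```

Each `bfloat16` element takes 2 bytes instead of 4, so the parameter storage of the toy layer halves.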

# Exercise 3: Capturing different aspects of model outputs [20 mins]

We would like to compare how the parameters and outputs of the models change when we quantize them.

Write a function that, given a prompt, returns the following stats for the original model loaded in `torch.float32` precision:

1. First N output tokens generated in a greedy manner.
2. Softmax probabilities at every position: compute these probabilities not just for the output tokens but _also for the input tokens_.
3. Weights at the Kth transformer layer where K is a parameter to the function.

Make sure your function always returns all these tensors in `float32` precision.

**Hint:** You can use the generate function that you wrote in one of the previous lectures.
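
The probability computation can be sketched on stand-in data. The snippet below assumes the model produces logits of shape `(batch, seq_len, vocab_size)`; the random `logits` tensor and the toy sizes are hypothetical placeholders, not real model output.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy sizes; a real model has a much larger vocabulary.
seq_len, vocab_size = 5, 11

# Stand-in for model logits, stored in reduced precision on purpose.
logits = torch.randn(1, seq_len, vocab_size).to(torch.bfloat16)

# Cast to float32 first so the returned probabilities are always float32.
probs = torch.softmax(logits.float(), dim=-1)

# Greedy decoding picks the most likely token at the last position.
next_token = probs[0, -1].argmax().item()

print(probs.shape, probs.dtype, next_token)
```

Because the softmax is taken over the last dimension, every position (input or generated) gets its own distribution over the vocabulary.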

In [None]:
# Your code here

# Exercise 4: Comparing differences in model outputs [20 mins]

Use three different versions of the model:
1. The originally loaded model in `torch.float32` format.
2. A model that you manually cast to `torch.float16` using the [`model.half()`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.half) function.
3. A model that you load in `torch.bfloat16` by passing the `torch_dtype=torch.bfloat16` argument to `from_pretrained`.
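
As background for the comparison, the short sketch below uses `torch.finfo` to print the bit width, machine epsilon and largest representable value of the three data types. It is only meant as orientation; the variable names `fp16` and `bf16` are introduced here for illustration.

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  eps={info.eps:.3e}  max={info.max:.3e}")

# Both 16-bit formats use the same storage but split the bits differently:
# float16 has finer precision (smaller eps), bfloat16 has a wider range (larger max).
fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)
```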

Compare the following properties of these three models:
1. Difference in generated tokens.
2. 10 most likely tokens at each generation position.
3. L2 norm of difference in weights.
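
The last two comparisons can be sketched on stand-in tensors, assuming the probabilities and weights have already been collected. The random `probs` and `w_fp32` tensors below are hypothetical placeholders for real model data, and the L2 norm is taken over the flattened difference (the Frobenius norm for a matrix).

```python
import torch

torch.manual_seed(0)

# Stand-in for softmax probabilities of shape (seq_len, vocab_size).
probs = torch.softmax(torch.randn(4, 50), dim=-1)

# Ten most likely token ids (and their probabilities) at each position,
# sorted from most to least likely.
top_probs, top_ids = torch.topk(probs, k=10, dim=-1)

# Stand-in for one float32 weight matrix and its reduced-precision copies.
w_fp32 = torch.randn(64, 64)
w_fp16 = w_fp32.to(torch.float16)
w_bf16 = w_fp32.to(torch.bfloat16)

# Cast back to float32 before subtracting so both operands share a dtype.
err_fp16 = torch.linalg.norm(w_fp32 - w_fp16.float()).item()
err_bf16 = torch.linalg.norm(w_fp32 - w_bf16.float()).item()

print(top_ids[0].tolist())
print(f"L2 error float16: {err_fp16:.6f}, bfloat16: {err_bf16:.6f}")
```

On values of this magnitude, `bfloat16` rounds more coarsely than `float16`, so its weight error is the larger of the two.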

In [None]:
# Your code here