# Cache sparsity

In [None]:
! pip install transformers torch tqdm accelerate bitsandbytes --upgrade --quiet

In [1]:
# Only execute this if you are using the machines at jupyterhub.uni-muenster.de
# These machines only have 10 GB of disk storage, which fills up quite quickly.

# ! rm -r ~/.cache/pip
# ! rm -r ~/.cache/huggingface/hub/

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch
import matplotlib
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Selecting the font size here will affect all the figures in this notebook
# Alternatively, you can set the font size for axis labels of each figure separately
font = {'size': 16}
matplotlib.rc('font', **font)

# Exercise 1: Comparing the Dynamic and Sliding Window Cache [30 mins]

In this exercise, we will load an 8-bit quantized model using `bitsandbytes`. Make sure you have access to a GPU, since `bitsandbytes` does not yet support CPUs.

First, load the model in 8 bits using the code below. Next, generate some long outputs using prompts like `Write a really long story`. Sample `10` sequences with a temperature of `0.5`. How long does the generation take? Are the generations coherent?
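To answer the runtime question consistently across cache types, it can help to wrap each call in a small timing helper. The sketch below is one possible way to do that; the helper name `timed` is hypothetical, and the commented-out call assumes `inputs` holds the tokenized prompt and that `max_new_tokens=512` is a suitable length (neither is prescribed by the exercise).

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without a model or GPU.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")

# With the model loaded, the same helper could wrap generation, e.g.
# (assumed names and values: `inputs`, max_new_tokens=512):
# outputs, seconds = timed(
#     model.generate, **inputs,
#     do_sample=True, temperature=0.5,
#     num_return_sequences=10, max_new_tokens=512,
# )
```

Dividing the elapsed time by the number of newly generated tokens gives a tokens-per-second figure that is easier to compare between runs of different lengths.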

Recall that Hugging Face uses the dynamic cache by default, which allocates memory for the cache on the fly as generation proceeds. Now we will simulate a scenario where we are memory constrained. Use a [`SlidingWindowCache`](https://huggingface.co/docs/transformers/v4.51.3/en/internal/generation_utils#transformers.SlidingWindowCache). What differences do you notice in runtime and coherence? Does changing the cache parameters help?
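To get a feel for why a bounded cache matters under memory pressure, the following back-of-the-envelope sketch compares a cache that grows with the sequence against one capped at a fixed window. The function names are hypothetical, and the dimensions (32 layers, 32 key/value heads, head size 96, 2 bytes per value, window of 1024) are illustrative assumptions rather than values read from the loaded model's config.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Approximate cache size: one key and one value tensor per layer."""
    return (2 * num_layers * batch_size * num_kv_heads
            * head_dim * num_tokens * bytes_per_value)


def cached_tokens(seq_len, window=None):
    """Tokens held in the cache: all of them, or at most `window`."""
    return seq_len if window is None else min(seq_len, window)


# Illustrative dimensions (assumed, not read from the model config).
dims = dict(num_layers=32, num_kv_heads=32, head_dim=96, batch_size=10)

for seq_len in (512, 2048, 4096):
    dynamic = kv_cache_bytes(cached_tokens(seq_len), **dims)
    sliding = kv_cache_bytes(cached_tokens(seq_len, window=1024), **dims)
    print(f"{seq_len:>5} tokens: dynamic {dynamic / 2**20:8.1f} MiB, "
          f"sliding {sliding / 2**20:8.1f} MiB")
```

The growing cache scales linearly with sequence length, while the windowed one stops growing once the window is full. The price is that tokens outside the window are no longer visible to attention, which is what the coherence question above probes.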

Tip: Feel free to use datasets from HuggingFace like [discrim-eval](https://huggingface.co/datasets/Anthropic/discrim-eval) that consist of longish prompts. Add instructions like "think step by step" to encourage the model to generate long outputs.

In [None]:
model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Your code here

# Exercise 2: Using the Sink Cache [30 mins]

Now generate using the [`SinkCache`](https://huggingface.co/docs/transformers/v4.51.3/en/internal/generation_utils#transformers.SinkCache). Compare the runtime and generation quality with the dynamic and sliding window caches.
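Before running the comparison, it may help to see which token positions each strategy keeps. The toy sketch below does not touch the model or the `transformers` cache classes; it only lists retained positions, under the assumption that a sink cache keeps the first `num_sink_tokens` positions plus the most recent ones, with `window_length` as the total budget. The function names are hypothetical.

```python
def sliding_window_positions(seq_len, window_length):
    """Positions kept when only the most recent `window_length` tokens are cached."""
    return list(range(max(0, seq_len - window_length), seq_len))


def sink_positions(seq_len, window_length, num_sink_tokens):
    """Positions kept when the first tokens (sinks) are pinned and the rest
    of the budget holds the most recent tokens (assumed policy)."""
    if seq_len <= window_length:
        return list(range(seq_len))
    num_recent = window_length - num_sink_tokens
    sinks = list(range(num_sink_tokens))
    recent = list(range(seq_len - num_recent, seq_len))
    return sinks + recent


print("sliding window:", sliding_window_positions(12, 6))
print("sink cache:    ", sink_positions(12, 6, 2))
```

Both strategies use the same memory budget here. The difference is that the sink variant gives up a few recent positions to keep the very first tokens, which is worth keeping in mind when comparing generation quality.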

In [None]:
# Your code here