# Project 3, Lightweight LLMs – Task 4
  
**Supervisor:** Sayedpedram Haeri Boroujeni  
**Course:** CPSC 4420 - Artificial Intelligence  
**Assignment:** Task 4  
**Deadline:** Friday, November 14, 2025  

## Contributors
- **Samuel Jordan**
- **Gabriel Hillesheim**  
- **Patrick Woods**
  
---

## Table of Contents
1. [Data Collection for Controller Training](#part-1-data-collection-for-controller-training)  
2. [Train a Lightweight Controller](#part-2-train-a-lightweight-controller)  
3. [Integrate the Controller into Generation](#part-3-integrate-the-controller-into-generation)  
4. [Compare vs Baselines](#part-4-compare-vs-baselines)  

---
## Part 1: Data Collection for Controller Training  
Goal: Gather training data linking saliency features to the optimal bit-width choice.

The final section of last week's notebook (implementing a Dynamic KV bit-width policy) now runs the model on text from HellaSwag and collects features needed to train a controller that will predict KV-cache bit-widths. For each token, it logs:
* entropy (model uncertainty)  
* token rarity (how infrequently the token appears)  
* attention variance  
* the KV bit-width chosen by our rule-based policy  
* latency per token  
* whether the model predicted the next token correctly  
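
As a point of reference for the entropy feature in the list above, here is a minimal sketch of one common way to compute it: the Shannon entropy (in nats) of the softmax over the next-token logits. The function name `token_entropy` and the use of natural logarithms are assumptions for illustration; the notebook's own implementation may differ.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits). Hypothetical helper."""
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A flat distribution is maximally uncertain; a peaked one is not.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])     # equals ln(4) for four equal logits
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
```

Higher values mean the model spreads probability over many candidate tokens, which is the situation where a wider KV bit-width is most likely to help.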

The collection run gathers about 1,000 tokens’ worth of data and saves it to `data/controller_training.csv`; the first ten rows are shown below:

```python
import pandas as pd

df = pd.read_csv("data/controller_training.csv")
print(df.head(10))
```

```
    entropy    rarity  attn_var  kv_bits    latency  accuracy
0  7.871669 -0.000000  0.048757       16  13.700008       0.0
1  5.129652  1.098612  0.048757        8  13.700008       0.0
2  5.238664  1.609438  0.048757        8  13.700008       0.0
3  2.877511  1.945910  0.048757        4  13.700008       0.0
4  1.377800  2.197225  0.048757        2  13.700008       1.0
5  4.657082  2.397895  0.048757        8  13.700008       0.0
6  3.797291  2.564949  0.048757        4  13.700008       0.0
7  4.184223  2.708050  0.048757        8  13.700008       0.0
8  7.871669  2.833213  0.020498       16  15.594651       0.0
9  5.196707  2.944439  0.020498        8  15.594651       0.0
```


```python
# Basic summary
print("=== Overall Summary (mean) ===")
print(df.mean())

print("\n=== Distribution Stats (std, min, max) ===")
print(df.describe().loc[["std", "min", "max"]])

# Correlation between features
print("\n=== Feature Correlations ===")
print(df.corr())
```

```
=== Overall Summary (mean) ===
entropy     3.877600
rarity      4.942119
attn_var    0.014616
kv_bits     6.426000
latency     9.755550
accuracy    0.271000
dtype: float64

=== Distribution Stats (std, min, max) ===
      entropy    rarity  attn_var    kv_bits    latency  accuracy
std  1.745641  1.318897  0.007587   4.010812   3.700416  0.444699
min  0.005374 -0.000000  0.004821   2.000000   3.887909  0.000000
max  8.417944  7.170888  0.048757  16.000000  25.573283  1.000000

=== Feature Correlations ===
           entropy    rarity  attn_var   kv_bits   latency  accuracy
entropy   1.000000  0.174027  0.113601  0.893349  0.113772 -0.434348
rarity    0.174027  1.000000 -0.096314  0.188033  0.044707 -0.265595
attn_var  0.113601 -0.096314  1.000000  0.099542  0.720528 -0.056500
kv_bits   0.893349  0.188033  0.099542  1.000000  0.090438 -0.343158
latency   0.113772  0.044707  0.720528  0.090438  1.000000 -0.045596
accuracy -0.434348 -0.265595 -0.056500 -0.343158 -0.045596  1.000000
```
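
The 0.89 correlation between `entropy` and `kv_bits` suggests that the rule-based policy is driven largely by entropy. As a rough illustration only, the sketch below uses entropy thresholds of 2.0, 4.0, and 6.0. These thresholds are hypothetical: they were picked solely because they reproduce the sample rows printed earlier, and the actual policy from last week's notebook may use different cut-offs or additional features.

```python
def rule_based_bits(entropy, thresholds=(2.0, 4.0, 6.0)):
    """Map entropy to a KV bit-width. Thresholds are hypothetical, not the project's."""
    low, mid, high = thresholds
    if entropy < low:
        return 2
    if entropy < mid:
        return 4
    if entropy < high:
        return 8
    return 16

# (entropy, kv_bits) pairs copied from the sample rows printed above.
sample = [
    (7.871669, 16), (5.129652, 8), (5.238664, 8), (2.877511, 4),
    (1.377800, 2), (4.657082, 8), (3.797291, 4), (4.184223, 8),
]
matches = all(rule_based_bits(e) == b for e, b in sample)
```

If the logged labels really are this close to a function of entropy alone, a learned controller trained on them will mostly relearn these thresholds, which is worth keeping in mind when interpreting its accuracy.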


---
## Part 2: Train a Lightweight Controller
Goal: Configure a small MLP (≈ 2 hidden layers, 64–128 neurons each) that predicts the KV bit-width (2, 4, 8, or 16).
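
To make the target architecture concrete, here is a minimal NumPy sketch of the forward pass of such an MLP. It assumes three input features (`entropy`, `rarity`, `attn_var`), 64 units per hidden layer, ReLU activations, and a four-way classification head over the bit-widths. The weights are random and untrained, and the helper names (`init_mlp`, `forward`, `predict_bits`) are hypothetical; the project itself may well use a different framework.

```python
import numpy as np

BIT_WIDTHS = np.array([2, 4, 8, 16])  # class index -> KV bit-width

def init_mlp(rng, n_in=3, hidden=64, n_out=4):
    """Random (untrained) weights for an MLP with two hidden layers."""
    sizes = [n_in, hidden, hidden, n_out]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b)), np.zeros(b))
        for a, b in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """Return class logits of shape (batch, 4) for a batch of feature rows."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return h

def predict_bits(params, x):
    """Pick the highest-scoring class and map it to a bit-width."""
    return BIT_WIDTHS[np.argmax(forward(params, x), axis=1)]

rng = np.random.default_rng(0)
params = init_mlp(rng)
# Columns: entropy, rarity, attn_var (assumed input features).
features = np.array([[7.87, 0.00, 0.049],
                     [1.38, 2.20, 0.049]])
bits = predict_bits(params, features)
n_params = sum(w.size + b.size for w, b in params)
```

With 64 hidden units this configuration has 4,676 parameters, so its inference cost should be small next to a transformer decode step. Training, for example with a cross-entropy loss against the logged `kv_bits` column, is omitted here.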

---
## Part 3: Integrate the Controller into Generation
Goal: Measure latency and accuracy after loading the controller during inference.
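
One possible shape for this integration is sketched below: the controller scores the four bit-widths for the current token's features, the best-scoring width is passed to the decode step, and the whole thing is timed. Everything here is a stand-in: `select_kv_bits`, `timed_step`, `dummy_controller`, and `dummy_decode_step` are hypothetical names, and reporting latency in milliseconds is an assumption.

```python
import time

BIT_WIDTHS = (2, 4, 8, 16)  # class index -> KV bit-width

def select_kv_bits(controller, features):
    """Map the controller's per-class scores to a KV bit-width."""
    scores = controller(features)
    best = max(range(len(BIT_WIDTHS)), key=lambda i: scores[i])
    return BIT_WIDTHS[best]

def timed_step(controller, features, decode_step):
    """Run one decode step at the selected bit-width; return (output, bits, ms)."""
    start = time.perf_counter()
    bits = select_kv_bits(controller, features)   # controller overhead is timed too
    output = decode_step(bits)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return output, bits, elapsed_ms

# Stand-ins for the trained controller and the model's decode step.
def dummy_controller(features):
    entropy = features[0]
    return [-abs(entropy - center) for center in (1.0, 3.0, 5.0, 7.0)]

def dummy_decode_step(bits):
    return f"decoded with {bits}-bit KV"

out, bits, ms = timed_step(dummy_controller, (7.87, 0.0, 0.049), dummy_decode_step)
```

Starting the timer before the controller call means the controller's own cost is counted in the per-token latency, which keeps the comparison against the rule-based policy fair.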

---
## Part 4: Compare vs Baselines
Goal: Compare the new learned controller's average accuracy, latency, and peak VRAM against the 4-bit KV baseline and our earlier rule-based dynamic KV policy.
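
A small helper along the following lines could aggregate per-token logs into the three metrics named in the goal. The record layout (`accuracy`, `latency`, `vram_mb`) and the function name `summarize` are assumptions, and the numbers in the example are placeholders for illustration, not measurements from this project.

```python
from statistics import mean

def summarize(records):
    """Aggregate per-token records into mean accuracy, mean latency, and peak VRAM."""
    return {
        "accuracy": mean(r["accuracy"] for r in records),
        "latency": mean(r["latency"] for r in records),
        "peak_vram_mb": max(r["vram_mb"] for r in records),
    }

# Placeholder records only; replace with logs from each method's run.
placeholder_run = [
    {"accuracy": 1.0, "latency": 10.0, "vram_mb": 100.0},
    {"accuracy": 0.0, "latency": 20.0, "vram_mb": 120.0},
]
summary = summarize(placeholder_run)
```

Calling `summarize` once per method (4-bit baseline, rule-based, learned controller) on the same evaluation tokens gives directly comparable rows. On CUDA, peak memory can be read with PyTorch's `torch.cuda.max_memory_allocated()`.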