# KVShuttle: FP16 Generation Quality on GPU

This notebook runs the end-to-end generation quality experiment using
FP16 PyTorch models on a CUDA GPU (T4/A100).

**What it does:**
1. Installs KVShuttle and dependencies
2. Runs FP16 model inference with 7 compressors on 3 models
3. Measures attention error, perplexity delta, and token agreement
4. Saves results JSON for local figure generation
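
The three metrics in step 3 can be illustrated with a minimal pure-Python sketch. These definitions are assumptions for illustration only and may differ from what KVShuttle computes; the function names below are hypothetical and are not part of the KVShuttle API. The attention-error term is sketched as a cosine similarity because the results table reports a `mean_key_cosine_sim` column.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (e.g. flattened keys)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def perplexity_delta(baseline_nlls, compressed_nlls):
    """Perplexity with the compressed cache minus the uncompressed baseline."""
    return perplexity(compressed_nlls) - perplexity(baseline_nlls)

def token_agreement(reference_ids, candidate_ids):
    """Fraction of compared positions where both sequences emit the same token ID."""
    n = min(len(reference_ids), len(candidate_ids))
    if n == 0:
        return 0.0
    matches = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return matches / n
```

Under these assumed definitions, a lossless compressor would give a cosine similarity of 1, a perplexity delta of 0, and a token agreement of 1.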

**Runtime:** ~2-3 hours on T4 (3 models x 50 prompts x 7 compressors)
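
As a rough sanity check on the runtime estimate, the sketch below works out the number of (model, prompt, compressor) runs and the implied time budget per run. It assumes runtime is spread evenly across runs and ignores fixed costs such as model loading; the helper names are hypothetical.

```python
def total_runs(models=3, prompts=50, compressors=7):
    """Number of (model, prompt, compressor) combinations evaluated."""
    return models * prompts * compressors

def seconds_per_run(total_hours, runs):
    """Average wall-clock seconds available per run."""
    return total_hours * 3600 / runs

runs = total_runs()
low = seconds_per_run(2, runs)
high = seconds_per_run(3, runs)
print(f"{runs} runs, roughly {low:.1f}-{high:.1f} s per run")
# prints: 1050 runs, roughly 6.9-10.3 s per run
```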

In [None]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # The device-properties attribute is `total_memory` (in bytes)
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    raise RuntimeError("No GPU detected! Go to Runtime > Change runtime type > GPU")

In [None]:
# Install dependencies
!pip install -q transformers accelerate datasets pyyaml tqdm

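# Clone KVShuttle (skipped if the directory already exists) and install it.
# Clone errors stay visible instead of being reported as "Already cloned".
!test -d KVShuttle || git clone https://github.com/your-repo/KVShuttle.git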
%cd KVShuttle
!pip install -q -e .

In [None]:
# Verify KVShuttle installation and torch backend
from kvshuttle.models.loader_torch import TORCH_MODEL_REGISTRY, load_model_torch
from kvshuttle.models.kv_extractor_torch import extract_kv_cache_torch
from kvshuttle.models.kv_injector_torch import forward_continuation_with_kv_cache_torch
from kvshuttle.compression.registry import list_compressors

print(f"Available models: {list(TORCH_MODEL_REGISTRY.keys())}")
print(f"Available compressors: {list_compressors()}")

## Run the experiment

This step runs `experiments/configs/generation_quality_torch.yaml`, which sets `backend: torch`.
For a quick test, reduce `prompts.count` in the config before running.
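
One way to reduce the prompt count without editing the original file is to write a modified copy of the config. The sketch below assumes the YAML has a top-level `prompts` mapping with a `count` key, as the `prompts.count` notation suggests; the helper names and the output filename in the usage note are hypothetical.

```python
import copy

def with_prompt_count(config, count):
    """Return a copy of the config dict with prompts.count overridden."""
    quick = copy.deepcopy(config)
    quick.setdefault("prompts", {})["count"] = count
    return quick

def write_quick_config(src_path, dst_path, count=5):
    """Load a YAML config, lower prompts.count, and save it under a new name."""
    import yaml  # PyYAML, installed in the dependency cell above

    with open(src_path) as f:
        config = yaml.safe_load(f)
    with open(dst_path, "w") as f:
        yaml.safe_dump(with_prompt_count(config, count), f, sort_keys=False)
    return dst_path
```

For example, `write_quick_config("experiments/configs/generation_quality_torch.yaml", "experiments/configs/generation_quality_quick.yaml")` would produce a 5-prompt variant whose path can be passed to `run_experiment` in place of the full config.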

In [None]:
# Run the full experiment
!python -m experiments.scripts.run_experiment experiments/configs/generation_quality_torch.yaml

In [None]:
# Inspect results
import json
from pathlib import Path

results_path = Path("experiments/results/generation_quality_fp16/results.json")
if results_path.exists():
    with open(results_path) as f:
        data = json.load(f)
    print(f"Metadata: {json.dumps(data['metadata'], indent=2)}")
    print(f"\nTotal results: {len(data['results'])}")
    
    # Quick summary
    import pandas as pd
    df = pd.DataFrame(data['results'])
    summary = df.groupby(['model', 'compressor']).agg({
        'mean_key_cosine_sim': 'mean',
        'perplexity_delta': 'mean',
        'token_agreement': 'mean',
    }).round(4)
    display(summary)
else:
    print("Results not found. Check experiment output above for errors.")

In [None]:
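# Download results for local figure generation (requires Google Colab).
# `results_path` is defined in the previous cell.
if results_path.exists():
    from google.colab import files
    files.download(str(results_path))
    print("Download started: results.json")
else:
    print("Results not found; nothing to download.")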