# 🧠 any4 Quantization Tutorial

This tutorial demonstrates:
- Running inference on a Hugging Face model
- Applying `any4` quantization from Meta
- Benchmarking speed and memory
- Evaluating accuracy with `lm-eval`, BigCode Eval, and a perplexity check

## 📦 1. Load Model and Tokenizer

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# Avoid HF warnings when pad token is missing
model.generation_config.pad_token_id = model.generation_config.eos_token_id

## 🔶 2. Inference with BF16 Model

In [None]:
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(text)

## 📊 3. Benchmarking (BF16)

In [None]:
from utils import get_model_size

model_size = get_model_size(model)
print(f"Model Size: {model_size / 2**30:.2f} GB")

In [None]:
from utils import benchmark_cuda_only_in_ms

model_cuda_time = benchmark_cuda_only_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"GPU Time: {model_cuda_time:.2f} ms")

In [None]:
from utils import benchmark_in_ms

model_total_time = benchmark_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"Total Time: {model_total_time:.2f} ms")
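
`benchmark_cuda_only_in_ms` and `benchmark_in_ms` come from this tutorial's `utils` module, whose source is not shown here. As a rough, CPU-only sketch of what `warmup` and `iters` arguments usually mean in helpers like these (`time_in_ms` and its `sync` hook are hypothetical names for illustration, not the real implementation):

```python
import time

def time_in_ms(fn, warmup=0, iters=1, sync=None):
    """Hypothetical helper: average wall-clock time of fn() in milliseconds."""
    # Untimed runs absorb one-time costs such as caching and lazy initialization.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure no earlier work is still in flight
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for queued work to finish before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters

calls = []
elapsed = time_in_ms(lambda: calls.append(sum(range(10_000))), warmup=2, iters=5)
print(f"{elapsed:.3f} ms per call, averaged over 5 timed runs")
```

With `warmup=0, iters=1`, as in the cells above, the single timed call also pays any one-time costs, so raising both values generally gives steadier numbers. On a GPU, a function such as `torch.cuda.synchronize` would be passed as `sync`.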

## 📈 4. Evaluation (BF16)

In [None]:
# Evaluate on LM Harness: PIQA and ARC-Easy
import json
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args={
        "pretrained": model,
        "tokenizer": tokenizer,
        "batch_size": 8
    },
    tasks=["piqa", "arc_easy"],
)
print(json.dumps(results["results"], indent=2))

In [None]:
# Evaluate on BigCode HumanEval
import argparse
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

import bigcode_eval
import bigcode_eval.evaluator
from bigcode_eval.arguments import EvalArguments
from eval import bigcode_default_args

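# Kwargs handlers must be passed as a list via `kwargs_handlers`;
# the first positional argument of Accelerator is `device_placement`.
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(weeks=52))])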
bigcode_evaluator = bigcode_eval.evaluator.Evaluator(
    accelerator=accelerator,
    model=model,
    tokenizer=tokenizer,
    args=argparse.Namespace(**bigcode_default_args),
)

result = bigcode_evaluator.evaluate("humaneval")
print(result)

In [None]:
# Evaluate on Open Pile (Perplexity)
from data import eval_perplexity, task_dataset_configs

result = eval_perplexity(
    model=model,
    tokenizer=tokenizer,
    batch_size=1,
    max_seq_len=2048,
    num_batches=10,
    **task_dataset_configs["pile-clean"]
)
print(result)
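
The internals of `eval_perplexity` are not shown in this tutorial. For reference, perplexity is conventionally defined as the exponential of the mean negative log-likelihood per token; a minimal sketch of that definition (the `perplexity` function here is illustrative, not the helper imported above):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token is as uncertain
# as a uniform choice among four options.
uniform_log_probs = [math.log(0.25)] * 8
print(f"Perplexity: {perplexity(uniform_log_probs):.2f}")
```

Lower is better: a perplexity near 4 means the model is, on average, about as unsure as if it were picking uniformly among four tokens.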

## 🧮 5. Apply any4 Quantization

In [None]:
from quantize import any4

# Apply any4 quantization to the model
model = any4(model)

Now, `Linear` layers inside the model are replaced with `Any4Linear`.

In [None]:
print(model)
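
The cells above treat `any4` as a black box. As a generic illustration of what storing a weight in 4 bits means (this is not any4's actual algorithm, which is not shown in this tutorial; the evenly spaced `codebook` and both function names are made up for the example), each weight can be replaced by the index of its nearest entry in a 16-value lookup table:

```python
def quantize_row(weights, codebook):
    """Map each weight to the index (0..15) of its nearest codebook entry."""
    assert len(codebook) == 16, "4 bits can address exactly 16 values"
    return [min(range(16), key=lambda i: abs(w - codebook[i])) for w in weights]

def dequantize_row(codes, codebook):
    """Recover approximate weights by looking each 4-bit code up in the table."""
    return [codebook[c] for c in codes]

# Illustrative codebook: 16 evenly spaced values in [-1, 1].
codebook = [-1.0 + 2.0 * i / 15 for i in range(16)]
row = [-0.93, -0.41, 0.02, 0.37, 0.88]

codes = quantize_row(row, codebook)
restored = dequantize_row(codes, codebook)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(codes, f"max error {max_err:.3f}")
```

Only the 4-bit codes and the small table need to be stored, which is where the memory saving comes from. Schemes differ mainly in how the 16 values are chosen.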

## 🔷 6. Inference with Quantized Model

In [None]:
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(text)

## 📊 7. Benchmarking (Quantized)

In [None]:
model_size = get_model_size(model)
print(f"Model Size: {model_size / 2**30:.2f} GB")

In [None]:
model_cuda_time = benchmark_cuda_only_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"GPU Time: {model_cuda_time:.2f} ms")

In [None]:
model_total_time = benchmark_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"Total Time: {model_total_time:.2f} ms")

> ✅ **Model size reduced** from ~2.79 GB → ~1.47 GB  
> ✅ **GPU time reduced** from ~20.52 ms → ~18.02 ms  
> ✅ **Total latency reduced** from ~56.94 ms → ~37.05 ms  

*Note: The embedding and LM head are not quantized, which limits size reduction on small models like Llama 3.2 1B.*
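
The note above can be checked with back-of-the-envelope arithmetic. The sketch below assumes the layer shapes from the public Llama 3.2 1B configuration (they are not stated in this tutorial), assumes the embedding and LM head are each counted once at 16 bits, and ignores any per-group scales or lookup tables a 4-bit format may store:

```python
# Assumed Llama 3.2 1B shapes (from the public model config, not from this tutorial).
HIDDEN, INTERMEDIATE, LAYERS = 2048, 8192, 16
KV_DIM = 512      # 8 key/value heads x 64 dims each
VOCAB = 128256

linear_per_layer = (
    2 * HIDDEN * HIDDEN          # q_proj, o_proj
    + 2 * HIDDEN * KV_DIM        # k_proj, v_proj
    + 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
)
linear_params = LAYERS * linear_per_layer
embed_params = 2 * VOCAB * HIDDEN        # embedding + LM head, counted separately (assumption)
norm_params = HIDDEN * (2 * LAYERS + 1)  # two norms per layer plus the final norm

def estimate_gib(linear_bits):
    """Estimated size in GiB when only the linear layers use `linear_bits` per weight."""
    total_bytes = linear_params * linear_bits / 8 + (embed_params + norm_params) * 2
    return total_bytes / 2**30

bf16_gib = estimate_gib(16)
int4_gib = estimate_gib(4)
print(f"BF16 estimate: {bf16_gib:.2f} GB, 4-bit linear estimate: {int4_gib:.2f} GB")
```

Under these assumptions the estimate is about 2.79 GB for BF16 and about 1.43 GB with 4-bit linear weights, close to the measured figures; the remaining gap is plausibly quantization metadata, which this sketch leaves out. The unquantized embedding and LM head contribute roughly 0.98 GB to either total, which is why the reduction falls short of 4x.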

## 📈 8. Evaluation (Quantized)

In [None]:
# Evaluate on LM Harness: PIQA and ARC-Easy
results = simple_evaluate(
    model="hf",
    model_args={
        "pretrained": model,
        "tokenizer": tokenizer,
        "batch_size": 8
    },
    tasks=["piqa", "arc_easy"],
)
print(json.dumps(results["results"], indent=2))
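
To judge accuracy loss, the quantized scores need to be set side by side with the BF16 ones. A small sketch of such a comparison follows; `accuracy_delta` is a hypothetical helper, the `"acc,none"` metric key is an assumption about how recent `lm-eval` versions name accuracy, and the numbers are placeholders rather than measured results:

```python
def accuracy_delta(baseline, quantized, metric="acc,none"):
    """Per-task change in a metric (quantized minus baseline)."""
    return {
        task: quantized[task][metric] - baseline[task][metric]
        for task in baseline
        if task in quantized
    }

# Placeholder values for illustration only, NOT results from this tutorial.
bf16_results = {"piqa": {"acc,none": 0.75}, "arc_easy": {"acc,none": 0.5}}
any4_results = {"piqa": {"acc,none": 0.5}, "arc_easy": {"acc,none": 0.5}}

deltas = accuracy_delta(bf16_results, any4_results)
for task, delta in deltas.items():
    print(f"{task}: {delta:+.4f}")
```

Because the cell above reassigns `results`, the BF16 scores from section 4 would have to be saved under a different name before quantizing in order to use a helper like this.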

In [None]:
# Evaluate on BigCode HumanEval
bigcode_evaluator = bigcode_eval.evaluator.Evaluator(
    accelerator=accelerator,
    model=model,
    tokenizer=tokenizer,
    args=argparse.Namespace(**bigcode_default_args),
)
result = bigcode_evaluator.evaluate("humaneval")
print(result)

In [None]:
# Evaluate on Open Pile (Perplexity)
result = eval_perplexity(
    model=model,
    tokenizer=tokenizer,
    batch_size=1,
    max_seq_len=2048,
    num_batches=10,
    **task_dataset_configs["pile-clean"]
)
print(result)

## ✅ 9. Conclusion

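In the runs above, `any4` delivers:
- **Model size reduction**: ~2.79 GB → ~1.47 GB, with the embedding and LM head left unquantized
- **Faster inference**: lower GPU time and lower total latency for the same prompt
- **Minimal accuracy loss**, judged by comparing the section 8 scores with those from section 4

This makes it a practical choice for deploying LLMs efficiently on GPU.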
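
The measurements reported in section 7 can be turned into improvement ratios with a few lines of arithmetic. The figures are copied from that section; the dictionary layout is just for illustration:

```python
# (before, after) pairs as reported in section 7.
measurements = {
    "model size (GB)": (2.79, 1.47),
    "GPU time (ms)": (20.52, 18.02),
    "total latency (ms)": (56.94, 37.05),
}

ratios = {name: before / after for name, (before, after) in measurements.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

That works out to roughly 1.90x smaller, 1.14x less GPU time, and 1.54x lower total latency. Since the benchmark used a single unwarmed iteration, the timing ratios are best read as indicative.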