# 🧠 any4 Quantization Tutorial

This tutorial demonstrates:
- Running inference on a Hugging Face model
- Applying `any4` quantization from Meta
- Benchmarking speed and memory
- Evaluating accuracy with `lm-eval` and BigCode Eval, plus perplexity on the Pile

## 📦 1. Load Model and Tokenizer

In [12]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# Avoid HF warnings when pad token is missing
model.generation_config.pad_token_id = model.generation_config.eos_token_id

## 🔶 2. Inference with BF16 Model

In [13]:
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(text)

Once upon a time there was a little girl who lived in a small village with her family. They were a happy family, but they were not as happy as they could be. Their house was not very big, and they had to share their bedroom with their younger brother. They had a dog, but they didn’t have many toys, and they didn’t have much money. They were poor.
One day, the little girl’s father died. Her mother was heartbroken. She didn’t know what to do. She couldn’t even think of anything to do. She couldn’t even think of anything to do.
Her mother looked at her little daughter and said, “My child, you are the only one who can help us now.” She told her daughter, “You must go to the village and find a job. You must find a job, and you must bring home enough money so that we can have enough food and clothing to live on.”
The little girl was very sad. She didn’t want to go to the village. She didn’t want to leave her mother and her brother. But she knew that she had no choice. She had to go. She had 

## 📊 3. Benchmarking (BF16)

In [14]:
from utils import get_model_size

model_size = get_model_size(model)
print(f"Model Size: {model_size / 2**30:.2f} GB")

Model Size: 2.79 GB


In [15]:
from utils import benchmark_cuda_only_in_ms

model_cuda_time = benchmark_cuda_only_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"GPU Time: {model_cuda_time:.2f} ms")

GPU Time: 19.63 ms


In [16]:
from utils import benchmark_in_ms

model_total_time = benchmark_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"Total Time: {model_total_time:.2f} ms")

Total Time: 20.85 ms
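The benchmark calls above use `warmup=0` and `iters=1`, so each reported number is a single measurement that includes any one-time setup cost. As a rough illustration of why warmup and repeated iterations matter, here is a minimal, standard-library sketch of a timing harness. The helper name `time_in_ms` and the stand-in workload are hypothetical and are not the `utils` functions used in this tutorial.

```python
import time
from statistics import mean, stdev


def time_in_ms(fn, warmup=3, iters=10):
    """Hypothetical helper: wall-clock time of fn() in milliseconds.

    Runs `warmup` untimed calls, then returns (mean, stdev) over `iters` timed calls.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time costs such as caching and lazy initialization
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    spread = stdev(samples) if len(samples) > 1 else 0.0
    return mean(samples), spread


# Stand-in workload; in the notebook this would be a call such as model.generate(...)
def workload():
    return sum(i * i for i in range(50_000))


avg_ms, spread_ms = time_in_ms(workload, warmup=2, iters=5)
print(f"{avg_ms:.2f} ms +/- {spread_ms:.2f} ms over 5 runs")
```

When timing GPU work this way, the clock should only be read after `torch.cuda.synchronize()`, because CUDA kernels are launched asynchronously. With `do_sample=True`, the number of generated tokens can also vary between runs, which adds further variance.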


## 📈 4. Evaluation (BF16)

In [17]:
# Evaluate on LM Harness: PIQA and ARC-Easy
import json
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args={
        "pretrained": model,
        "tokenizer": tokenizer,
        "batch_size": 8
    },
    tasks=["piqa", "arc_easy"],
)
print(json.dumps(results["results"], indent=2))

2025-07-09:04:29:52,193 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-07-09:04:29:52,194 INFO     [evaluator.py:188] Initializing hf model, with arguments: {'pretrained': LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_fe

{
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.6540404040404041,
    "acc_stderr,none": 0.009760749624427512,
    "acc_norm,none": 0.6031144781144782,
    "acc_norm_stderr,none": 0.010039236800583206
  },
  "piqa": {
    "alias": "piqa",
    "acc,none": 0.7421109902067464,
    "acc_stderr,none": 0.01020695666205627,
    "acc_norm,none": 0.7453754080522307,
    "acc_norm_stderr,none": 0.010164432237060466
  }
}


In [18]:
# Evaluate on HumanEval with the BigCode evaluation harness
import argparse
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

import bigcode_eval
import bigcode_eval.evaluator
from bigcode_eval.arguments import EvalArguments
from eval import bigcode_default_args

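# Pass the timeout through kwargs_handlers; Accelerator's first positional argument is device_placement
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(weeks=52))])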
bigcode_evaluator = bigcode_eval.evaluator.Evaluator(
    accelerator=accelerator,
    model=model,
    tokenizer=tokenizer,
    args=argparse.Namespace(**bigcode_default_args),
)

result = bigcode_evaluator.evaluate("humaneval")
print(result)

number of problems for this task is 164


  0%|          | 0/164 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 164/164 [01:55<00:00,  1.42it/s]

Evaluating generations...





{'pass@1': 0.16463414634146342}


In [19]:
# Evaluate perplexity on the Pile ("pile-clean" config, monology/pile-uncopyrighted)
from data import eval_perplexity, task_dataset_configs

result = eval_perplexity(
    model=model,
    tokenizer=tokenizer,
    batch_size=1,
    max_seq_len=2048,
    num_batches=10,
    **task_dataset_configs["pile-clean"]
)
print(result)

Evaluating perplexity on monology/pile-uncopyrighted on split train...


Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

31it [00:02, 14.12it/s, loss=2.31]


Perplexity: 14.16655158996582
14.16655158996582


## 🧮 5. Apply any4 Quantization

In [20]:
from quantize import any4

# Apply any4 quantization to the model
model = any4(model)

Quantizing: 100%|██████████| 112/112 [03:48<00:00,  2.04s/layer, model.layers.15.mlp.down_proj]   


Now, `Linear` layers inside the model are replaced with `Any4Linear`.
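To give a feel for what a 4-bit lookup-table layer stores, here is a conceptual sketch in plain Python. It is not the `Any4Linear` implementation. It assumes that weights are scaled per group (the module repr in this section reports `group_size=128`) and that each weight is then stored as a 4-bit index into a 16-entry table (one table per row is a plausible reading of `per_row=True`). The function names and the uniform placeholder table are illustrative only; a learned table would generally be non-uniform.

```python
import random


def quantize_row(row, lut, group_size):
    """Return (codes, scales, offsets) for one weight row.

    Each group of `group_size` weights is min-max scaled to [0, 1], then every
    scaled weight is replaced by the index of the nearest entry in `lut`.
    """
    codes, scales, offsets = [], [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) or 1.0  # avoid division by zero for constant groups
        scales.append(scale)
        offsets.append(lo)
        for w in group:
            x = (w - lo) / scale
            codes.append(min(range(len(lut)), key=lambda i: abs(lut[i] - x)))
    return codes, scales, offsets


def dequantize_row(codes, scales, offsets, lut, group_size):
    """Rebuild approximate weights from 4-bit codes plus per-group scale and offset."""
    return [lut[c] * scales[i // group_size] + offsets[i // group_size]
            for i, c in enumerate(codes)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy weight row
lut = [i / 15 for i in range(16)]  # uniform placeholder table with 16 entries

codes, scales, offsets = quantize_row(row, lut, group_size=128)
restored = dequantize_row(codes, scales, offsets, lut, group_size=128)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(codes)} codes, {len(scales)} groups, max abs error {max_err:.5f}")
```

Each weight costs 4 bits for its code, plus a small per-group overhead for the scale and offset and a per-table overhead for the 16 entries.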

In [21]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Any4Linear(in_features=2048, out_features=2048, bias=False, group_size=128, per_row=True)
          (k_proj): Any4Linear(in_features=2048, out_features=512, bias=False, group_size=128, per_row=True)
          (v_proj): Any4Linear(in_features=2048, out_features=512, bias=False, group_size=128, per_row=True)
          (o_proj): Any4Linear(in_features=2048, out_features=2048, bias=False, group_size=128, per_row=True)
        )
        (mlp): LlamaMLP(
          (gate_proj): Any4Linear(in_features=2048, out_features=8192, bias=False, group_size=128, per_row=True)
          (up_proj): Any4Linear(in_features=2048, out_features=8192, bias=False, group_size=128, per_row=True)
          (down_proj): Any4Linear(in_features=8192, out_features=2048, bias=False, group_size=128, per_row=True)
    

## 🔷 6. Inference with Quantized Model

In [22]:
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(text)

Once upon a time, I was a fan of the old-school, “How to be a Rock Star”-type advice books, which have now been replaced by the “How to be a Rock Star in 10 Easy Steps”-type books. The difference is that the first group of books was about how to be a rock star, while the second group of books are about how to be a rock star in 10 easy steps.
I’m not sure what to make of this, but I do think that these books are a bit of a mixed bag. On one hand, you have the old-school advice books that talk about how to be a rock star, while on the other hand, you have the new-school advice books that talk about how to be a rock star in 10 easy steps. I think that these two types of books are both valid and useful, but I also think that there’s something to be said for the old-school advice books, because they’re a bit more personal and they’re a bit more real.
The old-school advice books are a bit more personal and they’re a bit more real. They’re a bit more real because they’re a bit more real, and 

## 📊 7. Benchmarking (Quantized)

In [23]:
model_size = get_model_size(model)
print(f"Model Size: {model_size / 2**30:.2f} GB")

Model Size: 1.47 GB


In [24]:
model_cuda_time = benchmark_cuda_only_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"GPU Time: {model_cuda_time:.2f} ms")

GPU Time: 17.71 ms


In [25]:
model_total_time = benchmark_in_ms(model, warmup=0, iters=1, **inputs, do_sample=True, max_new_tokens=256)
print(f"Total Time: {model_total_time:.2f} ms")

Total Time: 28.97 ms


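> ✅ **Model size reduced** from ~2.79 GB → ~1.47 GB  
> ✅ **GPU time reduced** from ~19.63 ms → ~17.71 ms  
> ⚠️ **Total latency** went from ~20.85 ms → ~28.97 ms in this run; with `warmup=0` and `iters=1` each figure is a single measurement, so rerun with warmup and more iterations before drawing conclusions  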

*Note: The embedding and LM head are not quantized, which limits size reduction on small models like Llama 3.2 1B.*
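The effect described in the note can be checked with a back-of-the-envelope estimate built from the layer shapes printed in this tutorial. The sketch below assumes the embedding and LM head are counted as two separate 128256 × 2048 BF16 matrices, ignores the small normalization weights, and ignores the scales and lookup tables that a 4-bit format stores alongside the weights. How `get_model_size` actually counts these is not shown here, so treat the result as an approximation.

```python
GIB = 2**30
VOCAB, HIDDEN, NUM_LAYERS = 128256, 2048, 16

# (in_features, out_features) of the seven projections per decoder layer,
# taken from the printed model structure
LINEAR_SHAPES = [(2048, 2048), (2048, 512), (2048, 512), (2048, 2048),
                 (2048, 8192), (2048, 8192), (8192, 2048)]

linear_params = NUM_LAYERS * sum(i * o for i, o in LINEAR_SHAPES)

# Assumption: embedding and LM head counted as two separate VOCAB x HIDDEN matrices
unquantized_params = 2 * VOCAB * HIDDEN


def size_gib(bits_per_linear_weight):
    """Estimated model size when only the linear layers use the given bit width."""
    linear_bytes = linear_params * bits_per_linear_weight / 8
    unquantized_bytes = unquantized_params * 2  # BF16 = 2 bytes per value
    return (linear_bytes + unquantized_bytes) / GIB


print(f"BF16 estimate:  {size_gib(16):.2f} GiB")
print(f"4-bit estimate: {size_gib(4):.2f} GiB (before scales and lookup tables)")
print(f"Unquantized embedding + LM head: {unquantized_params * 2 / GIB:.2f} GiB")
```

Under these assumptions the unquantized matrices account for roughly 1 GiB on their own, so the overall size cannot shrink by the full 4x that the linear layers see. On larger models the linear layers make up a bigger share of the parameters, so the overall reduction would be expected to be closer to 4x.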

## 📈 8. Evaluation (Quantized)

In [26]:
# Evaluate on LM Harness: PIQA and ARC-Easy
results = simple_evaluate(
    model="hf",
    model_args={
        "pretrained": model,
        "tokenizer": tokenizer,
        "batch_size": 8
    },
    tasks=["piqa", "arc_easy"],
)
print(json.dumps(results["results"], indent=2))

2025-07-09:04:37:22,877 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-07-09:04:37:22,880 INFO     [evaluator.py:188] Initializing hf model, with arguments: {'pretrained': LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Any4Linear(in_features=2048, out_features=2048, bias=False, group_size=128, per_row=True)
          (k_proj): Any4Linear(in_features=2048, out_features=512, bias=False, group_size=128, per_row=True)
          (v_proj): Any4Linear(in_features=2048, out_features=512, bias=False, group_size=128, per_row=True)
          (o_proj): Any4Linear(in_features=2048, out_features=2048, bias=False, group_size=128, per_row=True)
        )
        (mlp): LlamaMLP(
          (gate_proj): Any4Linear(in_features=2048, out

{
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.6191077441077442,
    "acc_stderr,none": 0.009964428212260372,
    "acc_norm,none": 0.5778619528619529,
    "acc_norm_stderr,none": 0.010134620524592271
  },
  "piqa": {
    "alias": "piqa",
    "acc,none": 0.7274211099020674,
    "acc_stderr,none": 0.010389256803296016,
    "acc_norm,none": 0.7285092491838956,
    "acc_norm_stderr,none": 0.010376251176596135
  }
}
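To put the accuracy change in context, the BF16 and quantized scores can be compared against their reported standard errors. The sketch below copies the `acc,none` and `acc_stderr,none` values from the two `lm-eval` outputs in this tutorial and expresses each drop as a multiple of the combined standard error. Treating the two runs as independent is a simplifying assumption (both use the same test items), so this is a rough sanity check rather than a formal significance test.

```python
import math

# (accuracy, standard error) from the "acc,none" fields of the two runs
bf16_scores = {
    "arc_easy": (0.6540404040404041, 0.009760749624427512),
    "piqa": (0.7421109902067464, 0.01020695666205627),
}
any4_scores = {
    "arc_easy": (0.6191077441077442, 0.009964428212260372),
    "piqa": (0.7274211099020674, 0.010389256803296016),
}


def compare(task):
    """Return (accuracy drop, drop divided by the combined standard error)."""
    (acc_a, se_a), (acc_b, se_b) = bf16_scores[task], any4_scores[task]
    drop = acc_a - acc_b
    combined_se = math.sqrt(se_a**2 + se_b**2)
    return drop, drop / combined_se


for task in bf16_scores:
    drop, ratio = compare(task)
    print(f"{task}: drop = {drop * 100:.2f} points, {ratio:.1f}x the combined standard error")
```

On these numbers the ARC-Easy drop is about 3.5 points (roughly 2.5 combined standard errors), while the PIQA drop is about 1.5 points (roughly one combined standard error).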


In [27]:
# Evaluate on HumanEval with the BigCode evaluation harness
bigcode_evaluator = bigcode_eval.evaluator.Evaluator(
    accelerator=accelerator,
    model=model,
    tokenizer=tokenizer,
    args=argparse.Namespace(**bigcode_default_args),
)
result = bigcode_evaluator.evaluate("humaneval")
print(result)

number of problems for this task is 164


100%|██████████| 164/164 [02:19<00:00,  1.18it/s]

Evaluating generations...





{'pass@1': 0.11585365853658537}
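For readers unfamiliar with the metric, here is a short sketch of the standard unbiased pass@k estimator and how it reduces to a simple fraction for pass@1. It assumes one generated sample per problem; whether `bigcode_default_args` is configured that way is not shown in this tutorial. Under that assumption, the two scores reported here are consistent with 27 and 19 solved problems out of 164.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(num_solved, num_problems):
    """Average pass@1 over a benchmark, assuming one sample per problem."""
    solved = [pass_at_k(1, 1, 1)] * num_solved
    unsolved = [pass_at_k(1, 0, 1)] * (num_problems - num_solved)
    return sum(solved + unsolved) / num_problems


print(f"27 of 164 solved: pass@1 = {mean_pass_at_1(27, 164):.4f}")
print(f"19 of 164 solved: pass@1 = {mean_pass_at_1(19, 164):.4f}")
```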


In [28]:
# Evaluate perplexity on the Pile ("pile-clean" config, monology/pile-uncopyrighted)
result = eval_perplexity(
    model=model,
    tokenizer=tokenizer,
    batch_size=1,
    max_seq_len=2048,
    num_batches=10,
    **task_dataset_configs["pile-clean"]
)
print(result)

Evaluating perplexity on monology/pile-uncopyrighted on split train...


Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

31it [00:05,  6.17it/s, loss=2.48]

Perplexity: 15.88514232635498
15.88514232635498
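Perplexity is conventionally the exponential of the mean negative log-likelihood per token, so the two perplexity values in this tutorial can be converted back to an average loss and compared. The sketch below uses that standard definition; how `eval_perplexity` aggregates loss across batches is not shown here, so the helper is illustrative rather than a copy of it.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: spreading probability evenly over 4 tokens gives a perplexity of about 4
uniform_nlls = [math.log(4)] * 10
print(f"uniform over 4 tokens: {perplexity(uniform_nlls):.3f}")

# Perplexities reported in this tutorial (BF16 and quantized)
ppl_bf16, ppl_any4 = 14.16655158996582, 15.88514232635498
print(f"implied mean loss: {math.log(ppl_bf16):.3f} -> {math.log(ppl_any4):.3f} nats/token")
print(f"relative perplexity increase: {(ppl_any4 / ppl_bf16 - 1) * 100:.1f}%")
```

On these figures, quantization raises the implied mean loss by a little over 0.11 nats per token, which corresponds to a perplexity increase of about 12%.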





## ✅ 9. Conclusion

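In this run on Llama 3.2 1B, `any4` delivered:
- **Model size reduction**: ~2.79 GB → ~1.47 GB (about 47% smaller)
- **Lower GPU time**: ~19.63 ms → ~17.71 ms, measured in a single run without warmup
- **A measurable accuracy cost**: ARC-Easy accuracy 65.4% → 61.9%, PIQA 74.2% → 72.7%, HumanEval pass@1 16.5% → 11.6%, and Pile perplexity 14.17 → 15.89

This makes it a practical option for deploying LLMs on GPU when memory is the main constraint and a moderate drop in quality is acceptable.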