# Model Quantization with LLM Compressor

This notebook demonstrates how to quantize a language model using LLM Compressor, a library for model compression and optimization. We'll quantize the TinyLlama-1.1B-Chat-v1.0 model from full precision to INT8 weights and INT8 activations (the W8A8 scheme).

## Overview

The quantization process involves:
1. Installing the required library
2. Naming the source model and the output directory for the quantized model
3. Configuring the quantization recipe (SmoothQuant followed by GPTQ)
4. Running one-shot calibration and quantization



In [None]:
!pip install llmcompressor

In [None]:
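# Hugging Face Hub ID of the full-precision source model
full_precision_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Local directory where the quantized model is saved (passed as output_dir below)
quantized_model_name = "TinyLlama-1.1B-Chat-v1.0-INT8"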
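
The recipe in the next cell starts with SmoothQuant, which rescales each input channel so that activation outliers are partly shifted into the weights before quantization. The NumPy sketch below illustrates that idea only. It assumes `smoothing_strength` plays the role of the migration strength α in the usual SmoothQuant formulation, s_j = max|X_j|^α / max|W_j|^(1-α). The function name `smooth` and the toy data are hypothetical, and this is not LLM Compressor's implementation.

```python
import numpy as np

def smooth(activations, weight, alpha=0.8, eps=1e-8):
    """Illustrative SmoothQuant-style rescaling (hypothetical helper).

    activations: (tokens, in_features)
    weight:      (in_features, out_features), so the layer computes activations @ weight
    """
    act_max = np.abs(activations).max(axis=0)  # largest magnitude per input channel
    w_max = np.abs(weight).max(axis=1)         # largest magnitude per input channel
    scales = np.maximum(act_max, eps) ** alpha / np.maximum(w_max, eps) ** (1 - alpha)
    # Divide activations and multiply weights by the same per-channel scale,
    # which leaves the layer output unchanged up to floating-point rounding.
    return activations / scales, weight * scales[:, None], scales

def channel_spread(x):
    """Ratio between the largest and smallest per-channel peak magnitude."""
    peaks = np.abs(x).max(axis=0)
    return peaks.max() / peaks.min()

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
x[:, 3] *= 50.0                 # toy data: one outlier activation channel
w = rng.normal(size=(8, 4))

x_s, w_s, s = smooth(x, w, alpha=0.8)
same_output = np.allclose(x @ w, x_s @ w_s)
```

In this toy setup `same_output` is true, and `channel_spread(x_s)` is smaller than `channel_spread(x)`: the outlier channel no longer dominates the activation range, which is what makes a single INT8 scale per token workable.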

In [None]:
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot

# Select quantization algorithm. In this case, we:
#   * apply SmoothQuant to make the activations easier to quantize
#   * quantize the weights to int8 with GPTQ (static per channel)
#   * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Apply quantization using the built-in open_platypus calibration dataset.
#   * See the LLM Compressor examples for demos showing how to pass a custom calibration set
oneshot(
    model=full_precision_model_name,
    dataset="open_platypus",
    recipe=recipe,
    output_dir=quantized_model_name,
    max_seq_length=2048,          # maximum token length of each calibration sample
    num_calibration_samples=512,  # number of samples used for calibration
)
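
To make the comments in the recipe more concrete, here is a rough NumPy sketch of what "static per channel" weight scales and "dynamic per token" activation scales mean for a single linear layer. It uses plain round-to-nearest, which is a simplification: GPTQ picks the rounding of each weight to reduce the layer's output error on calibration data. All function names here are hypothetical and none of this is LLM Compressor's code.

```python
import numpy as np

def quantize_rows_int8(x):
    """Symmetric INT8 quantization with one scale per row (hypothetical helper)."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)   # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, weight):
    """Approximate x @ weight.T using INT8 operands.

    x:      (tokens, in_features)
    weight: (out_features, in_features)
    """
    # Static, per-channel: one scale per output channel, computed once offline.
    q_w, s_w = quantize_rows_int8(weight)
    # Dynamic, per-token: one scale per token, recomputed on every call.
    q_x, s_x = quantize_rows_int8(x)
    # Integer matmul with int32 accumulation, then rescale back to floating point.
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32).T
    return acc * s_x * s_w.T

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
w = rng.normal(size=(32, 64))

reference = x @ w.T
approx = int8_linear(x, w)
rel_error = np.linalg.norm(approx - reference) / np.linalg.norm(reference)
```

On this well-behaved random data `rel_error` stays small. Real activations tend to have outlier channels, which is why the recipe applies SmoothQuant before quantizing.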