# FP8 Post-Training Quantization (PTQ) with NVIDIA Model Optimizer

This notebook demonstrates how to perform FP8 Post-Training Quantization using NVIDIA's ModelOpt library with Min-Max calibration.

## Configure CUDA Environment

Set up environment variables for CUDA toolkit paths, including the CUDA architecture version, PTX assembler path, and library paths required for GPU operations.

In [1]:
import os
os.environ["TORCH_CUDA_ARCH_LIST"] = "12.1"
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda/bin/ptxas"
os.environ["PATH"] = "/usr/local/cuda/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")
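
Before moving on, it can help to confirm that the paths set above exist on the machine. The following is a minimal sketch; the helper name `check_cuda_paths` is hypothetical, and `/usr/local/cuda` is simply the location the cell above assumes.

```python
import os
import shutil

def check_cuda_paths(cuda_home="/usr/local/cuda"):
    """Report which CUDA toolkit locations exist (hypothetical helper)."""
    return {
        # ptxas is the assembler that TRITON_PTXAS_PATH points at
        "ptxas": os.path.isfile(os.path.join(cuda_home, "bin", "ptxas")),
        # lib64 is the directory prepended to LD_LIBRARY_PATH
        "lib64": os.path.isdir(os.path.join(cuda_home, "lib64")),
        # nvcc should be discoverable once cuda/bin is on PATH
        "nvcc_on_path": shutil.which("nvcc") is not None,
    }

print(check_cuda_paths())
```

A `False` entry suggests the toolkit lives elsewhere and the environment variables need adjusting.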

## Install Dependencies

Install PyTorch 2.9.0 with CUDA 13.0 support, along with bitsandbytes and NVIDIA Model Optimizer. **Note:** Restart the kernel after running this cell before proceeding.

In [2]:
# Run this cell first, then restart the kernel before running the next cell
%pip uninstall torch -y
%pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu130

%pip install "bitsandbytes>=0.43.2"
%pip install nvidia-modelopt
%pip install huggingface_hub
%pip install transformers
%pip install python-dotenv
%pip install accelerate
%pip install datasets

Found existing installation: torch 2.9.0+cu130
Uninstalling torch-2.9.0+cu130:
  Successfully uninstalled torch-2.9.0+cu130
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://download.pytorch.org/whl/cu130
Collecting torch==2.9.0
  Using cached https://download.pytorch.org/whl/cu130/torch-2.9.0%2Bcu130-cp312-cp312-manylinux_2_28_aarch64.whl.metadata (30 kB)
Using cached https://download.pytorch.org/whl/cu130/torch-2.9.0%2Bcu130-cp312-cp312-manylinux_2_28_aarch64.whl (512.4 MB)
Installing collected packages: torch
Successfully installed torch-2.9.0+cu130
Note: you may need to restart the kernel to use updated packages.

## Import Libraries, Verify Setup, and Authenticate

Import required libraries including PyTorch, transformers, and bitsandbytes. Verify that PyTorch and CUDA are properly installed. Load environment variables from a `.env` file and authenticate with HuggingFace Hub to access gated models.

In [3]:
# Run this cell after restarting the kernel
import os
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

from dotenv import load_dotenv
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

import bitsandbytes as bnb

# Load environment variables from .env file and login to Hugging Face
load_dotenv()
login(token=os.getenv("HF_TOKEN"))

PyTorch version: 2.9.0+cu130
CUDA available: True


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
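
The warning in the output above says the GPU's compute capability (12.1) falls outside the range this PyTorch build reports as supported (8.0 to 12.0). The comparison can be sketched in plain Python, as below. The function name is hypothetical and the tuples are copied from the warning text. In a live session, `torch.cuda.get_device_capability()` returns the device tuple.

```python
def capability_supported(device_cap, min_cap, max_cap):
    """Tuple comparison of (major, minor) compute capabilities."""
    return min_cap <= device_cap <= max_cap

device_cap = (12, 1)                 # GB10, as reported in the warning
supported_range = ((8, 0), (12, 0))  # range reported by this PyTorch build

ok = capability_supported(device_cap, *supported_range)
print(f"Capability {device_cap} within supported range: {ok}")
```

The notebook proceeds despite the warning, and the later cells stay in eager mode for this reason.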


## Configure Model and Dataset Parameters

Define the model name (Llama-3.1-8B-Instruct), calibration dataset (CNN DailyMail), batch size, and number of calibration samples for quantization.

In [4]:
model_name = "meta-llama/Llama-3.1-8B-Instruct"
dataset_name = "cnn_dailymail"
batch_size = 8
calib_samples = 128
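
As a quick sanity check, the number of calibration batches follows directly from these two values. The sketch below assumes the dataloader simply groups the samples into batches of `batch_size`.

```python
import math

batch_size = 8
calib_samples = 128

# Each calibration step consumes one batch
num_batches = math.ceil(calib_samples / batch_size)
print(f"{calib_samples} samples / batch size {batch_size} = {num_batches} batches")
```

This matches the `16/16` progress bar shown in the quantization output later in the notebook.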

## Load the Pre-trained Model and Compute Original Size

Load the Llama-3.1-8B-Instruct model in bfloat16 precision using automatic device mapping. Load the tokenizer and set the pad token. Import and use the helper function to compute and display the original model size in GB.

In [5]:
# Load model - use device_map="auto" to handle device placement automatically
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    dtype=torch.bfloat16,        # `dtype` replaces the deprecated `torch_dtype`
    device_map="auto",           # Don't use .cuda(), use this instead
    low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model

import sys
sys.path.append("..")  # Add parent directory to path

# Force reload the module to pick up changes
import importlib
import quantization_theory_helper
importlib.reload(quantization_theory_helper)

from quantization_theory_helper import compute_module_sizes
module_size = compute_module_sizes(model)
print(f"The model size is {module_size[''] * 1e-9} GB")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

The model size is 16.060522752 GB


## Import ModelOpt and Prepare Calibration Dataset

Import NVIDIA ModelOpt quantization tools and dataset utilities. Create a DataLoader from the CNN DailyMail dataset for calibration, which will be used to collect activation statistics during the quantization process.

In [6]:
import modelopt.torch.quantization as mtq
from modelopt.torch.utils.dataset_utils import create_forward_loop, get_dataset_dataloader

dataloader = get_dataset_dataloader(
    dataset_name=dataset_name,
    tokenizer=tokenizer,
    batch_size=batch_size,
    num_samples=calib_samples,
    device="cuda",
)



## Create Forward Loop for Calibration

Create a forward loop function that will run the calibration data through the model to collect activation statistics for quantization.

In [7]:
forward_loop = create_forward_loop(dataloader=dataloader)
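
Conceptually, a forward loop is a callable that takes the model and pushes every calibration batch through it. The sketch below illustrates that idea with a toy stand-in model. It is not ModelOpt's implementation, and `make_forward_loop` and `toy_model` are hypothetical names.

```python
def make_forward_loop(batches):
    """Return a callable that feeds every batch through a model."""
    def forward_loop(model):
        for batch in batches:
            # Outputs are discarded; calibration only needs the model to
            # observe the activations produced by each batch.
            model(**batch)
    return forward_loop

# Toy stand-in for a model: records which keyword inputs it received
calls = []

def toy_model(**batch):
    calls.append(sorted(batch))

loop = make_forward_loop([{"input_ids": [1, 2]}, {"input_ids": [3, 4]}])
loop(toy_model)
print(f"Model was called {len(calls)} times")
```

In the real workflow the batches come from the calibration dataloader, and the model is the one being quantized.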

## Apply FP8 Quantization

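Apply FP8 Post-Training Quantization to the model using ModelOpt with Min-Max calibration. `mtq.quantize` inserts quantizer modules into the model and runs the forward loop over the calibration data to calibrate the scaling factors used to represent weights and activations in 8-bit floating point format. The output below reports 771 inserted quantizers and 16 calibration batches (128 samples at a batch size of 8).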

In [8]:
# Try FP8 instead of FP4 - may have better Blackwell support
# If FP8 also fails, Triton doesn't support Blackwell GPUs yet for ModelOpt
quant_config = mtq.FP8_DEFAULT_CFG  # Changed from NVFP4_DEFAULT_CFG
model = mtq.quantize(model, quant_config, forward_loop=forward_loop)

Registered <class 'transformers.models.llama.modeling_llama.LlamaAttention'> to _QuantAttention for KV Cache quantization
Inserted 771 quantizers


100%|██████████| 16/16 [00:34<00:00,  2.18s/it]
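
To make the Min-Max idea concrete, here is a minimal pure-Python sketch of how a per-tensor scale can be derived: track the largest absolute value seen across the calibration batches (`amax`), then divide by the largest value the format can represent. The constant 448 is the maximum finite value of the FP8 E4M3 format and is an assumption about the format in use. The function names and sample data are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed format)

def calibrate_amax(batches):
    """Running maximum of absolute values over all calibration batches."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(v) for v in batch))
    return amax

batches = [[0.5, -1.25, 2.0], [-3.5, 1.0, 0.75]]  # made-up activations
amax = calibrate_amax(batches)
scale = amax / FP8_E4M3_MAX

# Dividing by the scale maps every observed value into the FP8 range
scaled = [v / scale for batch in batches for v in batch]
print(f"amax={amax}, scale={scale}, largest scaled magnitude={max(abs(v) for v in scaled)}")
```

The same ratio appears in the export warning later in the notebook, where the weight scaling factor is computed as `weight_quantizer.amax / weight_quantizer.maxbound`.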


## Test the Quantized Model

Generate text using the quantized model to verify it works correctly. Using eager mode (no torch.compile) for GPU compatibility.

In [9]:
# Using eager mode (no torch.compile) for Blackwell GPU compatibility
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Loading extension modelopt_cuda_ext_fp8...


FAILED: [code=1] tensor_quant_gpu_fp8.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output tensor_quant_gpu_fp8.cuda.o.d -DTORCH_EXTENSION_NAME=modelopt_cuda_ext_fp8 -DTORCH_API_INCLUDE_EXTENSION_H -isystem /root/src/github.com/elizabetht/language-modeling-from-scratch/.venv/lib/python3.12/site-packages/torch/include -isystem /root/src/github.com/elizabetht/language-modeling-from-scratch/.venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_121,code=sm_121 --compiler-options '-fPIC' -std=c++17 -c /root/src/github.com/elizabetht/language-modeling-from-scratch/.venv/lib/python3.12/site-packages/modelopt/torch/quantization/src/tensor_quant_gpu_fp8.cu -o tensor_quant_gpu_fp8.cuda.o 
In file in

Hello, my name is Amanda and I am a 33 year old mom


## Export Quantized Model

Export the quantized model in HuggingFace checkpoint format and save the tokenizer to the export directory.

In [10]:
from modelopt.torch.export import export_hf_checkpoint

export_path = "./quantized_model/fp8/"
export_hf_checkpoint(model, export_dir=export_path)
tokenizer.save_pretrained(export_path)

`torch_dtype` is deprecated! Use `dtype` instead!
  weight_scaling_factor = torch.tensor(weight_quantizer.amax / weight_quantizer.maxbound)


('./quantized_model/fp8/tokenizer_config.json',
 './quantized_model/fp8/special_tokens_map.json',
 './quantized_model/fp8/chat_template.jinja',
 './quantized_model/fp8/tokenizer.json')

## Compute Quantized Model Size

Reload the helper module and calculate the size of the quantized model to compare with the original model size and verify the memory savings from quantization.

In [11]:
import sys
sys.path.append("..")  # Add parent directory to path

# Force reload the module to pick up changes
import importlib
import quantization_theory_helper
importlib.reload(quantization_theory_helper)

from quantization_theory_helper import compute_module_sizes
module_size = compute_module_sizes(model)
print(f"The model size is {module_size[''] * 1e-9} GB")

The model size is 9.08120448 GB
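
The reduction is smaller than the 2x one might expect from halving the bytes per weight. One plausible accounting, sketched below, is that only the linear layers inside the decoder blocks are stored in FP8 while the embeddings, output head, and norm weights stay in bfloat16. This is an inference from the reported sizes, not something the notebook confirms. The architecture values are assumptions taken from the public Llama-3.1-8B configuration, and small extras such as scaling factors and buffers are ignored.

```python
# Assumed Llama-3.1-8B dimensions (not stated in this notebook)
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 128256, 4096, 14336, 32
KV_DIM = 1024  # 8 key/value heads x 128 head dimension

# Linear weights per decoder layer: q/o, k/v, and the three MLP projections
linear_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM + 3 * HIDDEN * INTERMEDIATE
linear_params = LAYERS * linear_per_layer

# Everything else: embeddings + output head, per-layer norms, final norm
other_params = 2 * VOCAB * HIDDEN + LAYERS * 2 * HIDDEN + HIDDEN

bf16_gb = (linear_params + other_params) * 2 * 1e-9       # 2 bytes everywhere
fp8_gb = (linear_params * 1 + other_params * 2) * 1e-9    # 1 byte for linears only
reduction = 1 - fp8_gb / bf16_gb

print(f"bfloat16: {bf16_gb:.2f} GB")
print(f"FP8 linears + bfloat16 rest: {fp8_gb:.2f} GB")
print(f"Reduction: {reduction:.1%}")
```

Under these assumptions the estimates land close to the 16.06 GB and 9.08 GB figures reported by the helper.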


## Save Model Config and Create Model Card

Save the model configuration to the export directory. Create a comprehensive README.md model card with metadata tags, model details, usage instructions, and licensing information for the HuggingFace model repository.

In [12]:


# Save model config
model.config.save_pretrained(export_path)
print("✓ Saved config")

# Create a README model card
model_card = """---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - llama
  - quantized
  - nvidia-modeloptimizer
  - fp8
library_name: nvidia-modeloptimizer
---

# Llama-3.1-8B-Instruct Quantized (ModelOpt FP8)

This is a quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) using [modelopt](https://github.com/NVIDIA/Model-Optimizer) with FP8 weight quantization.

## Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** modelopt FP8 Post-Training Quantization (PTQ)
- **Weight Precision:** FP8
- **Original Size:** ~16 GB (bfloat16)
- **Quantized Size:** ~9 GB

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
)

# Load tokenizer and generate
tokenizer = AutoTokenizer.from_pretrained("tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/).
"""

with open(os.path.join(export_path, "README.md"), "w") as f:
    f.write(model_card)
print("✓ Created model card")

print(f"\nModel saved to {export_path}")
print("Contents:", os.listdir(export_path))

✓ Saved config
✓ Created model card

Model saved to ./quantized_model/fp8/
Contents: ['special_tokens_map.json', 'README.md', 'model-00002-of-00002.safetensors', 'tokenizer_config.json', 'chat_template.jinja', 'hf_quant_config.json', 'generation_config.json', 'config.json', 'model-00001-of-00002.safetensors', 'model.safetensors.index.json', 'tokenizer.json']


## Upload to HuggingFace Hub

Create a HuggingFace repository and upload the quantized model, tokenizer, and model card to make it publicly available.

In [13]:
from huggingface_hub import create_repo, upload_folder

# Load write token from .env file
hf_write_token = os.getenv("HF_WRITE_TOKEN")

# Upload to HuggingFace Hub
repo_name = "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8"  # Change to your username/repo

try:
    # Create the repo (set private=True if you want it private)
    create_repo(repo_name, exist_ok=True, private=False, token=hf_write_token)
    print(f"✓ Repository created: {repo_name}")
    
    # Upload all files
    upload_folder(
        folder_path=export_path,
        repo_id=repo_name,
        repo_type="model",
        commit_message="Upload Llama-3.1-8B quantized with ModelOpt FP8",
        token=hf_write_token
    )
    print(f"✓ Uploaded to https://huggingface.co/{repo_name}")
    
except Exception as e:
    print(f"❌ Error: {e}")

✓ Repository created: tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8


Processing Files (0 / 0): |          |  0.00B /  0.00B            

New Data Upload: |          |  0.00B /  0.00B            

✓ Uploaded to https://huggingface.co/tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8
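
Because `os.getenv` returns `None` when a variable is missing, a missing write token may only surface later as an upload error. A small guard, sketched below, fails early instead. `require_env` is a hypothetical helper and the variable name in the demonstration is made up.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to .env or the environment before uploading"
        )
    return value

# Demonstration with a variable that is not expected to exist
try:
    require_env("HYPOTHETICAL_UNSET_TOKEN")
except RuntimeError as err:
    print(err)
```

In the upload cell this could replace the direct lookup, for example `hf_write_token = require_env("HF_WRITE_TOKEN")`.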
