Model Quantizer

Flox environment for quantizing HuggingFace models for offline inference. Four quantization backends cover the full precision-compression spectrum: AWQ 4-bit, FP8 via torchao, LLM Compressor (FP8, GPTQ, W8A8, NVFP4), and GGUF for the llama.cpp ecosystem (ollama, LM Studio, koboldcpp). HuggingFace-format outputs are written in hub cache layout so vLLM discovers quantized checkpoints with HF_HUB_OFFLINE=1; GGUF outputs are single-file models ready for llama.cpp-based tools.

Python 3.13 | PyTorch 2.9.1 (CUDA) | x86_64-linux, aarch64-linux

How It Works

This repository is a Flox environment. Flox is a package manager built on Nix that defines your entire development toolchain — system packages, Python runtime, CUDA libraries — in a single declarative manifest (.flox/env/manifest.toml). The .flox/ directory travels with the repo, so anyone who clones it gets an identical environment without installing anything manually beyond Flox itself.

Running flox activate does the following:

Provides Python 3.13 and PyTorch 2.9.1 with CUDA support from the Flox catalog (no pip/conda)
Creates a Python venv (first run only) and installs PyPI packages: torchao, transformers, accelerate, safetensors, huggingface-hub, autoawq, llmcompressor, datasets, gguf, sentencepiece
Removes the PyPI torch so Python falls through to the Flox-provided CUDA-enabled build via --system-site-packages
Applies compatibility patches for AutoAWQ and FP8/torchao (see AutoAWQ Compatibility Patches and FP8 Compatibility Patches)
Provides quantize-awq, quantize-fp8-local, quantize-fp8-production, quantize-llmc-local, quantize-llmc-production, quantize-gguf-local, quantize-gguf-production, and list-models commands (from the model-quantizer package)

No Docker, no conda, no manual virtualenv management. Clone the repo, install Flox (<70MB), activate, quantize.

Setup

Install Flox (one-time): follow the instructions at flox.dev/docs for your platform (apt, rpm, nix, or Docker).
Clone and activate:

git clone <this-repo>
cd model-quantizer
flox activate

The first activation provisions the Python venv and installs PyPI packages. Subsequent activations are instant.

Quick Start

flox activate

# AWQ 4-bit (~3.5x smaller than BF16)
quantize-awq Qwen/Qwen3-8B

# FP8 via torchao (native Hopper/Blackwell, ~2x smaller)
quantize-fp8-local Qwen/Qwen3-8B

# LLM Compressor -- FP8 dynamic (data-free, compressed-tensors for vLLM)
quantize-llmc-local Qwen/Qwen3-8B

# LLM Compressor -- W4A16 GPTQ (calibration-based)
quantize-llmc-local Qwen/Qwen3-8B gptq --online

# GGUF for llama.cpp ecosystem (ollama, LM Studio, koboldcpp)
quantize-gguf-local Qwen/Qwen3-8B Q4_K_M       # fast, dev/iteration
quantize-gguf-production Qwen/Qwen3-8B Q4_K_M   # validated, CI/pipeline

# List cached source and quantized models
list-models

All scripts default to offline mode, so models must already be in the cache directory. Pass --online (FP8, LLMC, GGUF) or set HF_OFFLINE=0 (AWQ) to download models on the fly. AWQ uses environment variables rather than CLI flags for most configuration — see AWQ Environment Variables. If HF_TOKEN is set in your shell environment, the HuggingFace libraries will use it automatically for gated model access.

The cache directory defaults to $FLOX_ENV_PROJECT (models live under $FLOX_ENV_PROJECT/hub/models--*/) but can be pointed anywhere — either by overriding MODEL_CACHE_DIR at activation time or by editing the default in .flox/env/manifest.toml:

# Override at activation time
MODEL_CACHE_DIR=/data/models flox activate

# Or change the default permanently in the manifest [vars] section:
#   MODEL_CACHE_DIR = "/data/models"
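
The hub/models--*/ naming follows the HuggingFace hub cache convention, in which the slash in a model ID becomes a double dash. A minimal Python sketch of that mapping, for orientation only; hub_model_dir is a hypothetical helper and is not shipped with this environment:

```python
from pathlib import PurePosixPath

def hub_model_dir(cache_root: str, model_id: str) -> PurePosixPath:
    """Return the hub-cache directory for a model ID such as 'Qwen/Qwen3-8B'."""
    # HF hub cache convention: 'org/name' -> 'models--org--name'
    folder = "models--" + model_id.replace("/", "--")
    return PurePosixPath(cache_root) / "hub" / folder

print(hub_model_dir("/data/models", "Qwen/Qwen3-8B"))
# /data/models/hub/models--Qwen--Qwen3-8B
```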

Features

Four quantization backends: AWQ 4-bit INT (AutoAWQ is unmaintained; compatibility patches are applied automatically — see AutoAWQ Compatibility Patches); FP8 torchao (E4M3); LLM Compressor (FP8, GPTQ, W8A8, NVFP4); GGUF (llama.cpp ecosystem)
Offline-first: all scripts default to cache-only model loading
HF hub cache output layout: quantized models appear as siblings of source models, ready for vLLM
Content-addressed output: each run produces a deterministic snapshot ID derived from a full configuration fingerprint; identical parameters always map to the same output path
Idempotent: re-running with the same parameters skips quantization if output already exists
Concurrent-safe: file-level locking prevents races when multiple quantization jobs target the same output directory
Smoke tests: optional forward pass and token generation on quantized output to verify correctness before publishing
JSON output mode: machine-readable status for pipeline integration (--json on all quantization scripts)
Checksum manifests (AWQ): opt-in SHA-256 checksums for all output files
CI-ready: same Flox environment in local dev and CI, with GitHub Actions support (see CI / Pipeline Usage)
Auto-provisioned Python venv with PyPI packages on first activation

Quantization Methods

AWQ 4-bit (`quantize-awq`)

Uses AutoAWQ for activation-aware 4-bit integer weight quantization. Compresses weights roughly 3.5x relative to BF16 with good quality retention. Works on all CUDA GPUs.

# Default: 4-bit, group_size=128
quantize-awq Qwen/Qwen3-8B

# Custom bit-width and group size
quantize-awq Qwen/Qwen3-8B 4 64

# Force re-quantize an existing output
FORCE_REQUANTIZE=1 quantize-awq Qwen/Qwen3-8B

# JSON output for scripting
quantize-awq --json Qwen/Qwen3-8B

# Skip smoke test for faster turnaround
quantize-awq --smoke-test off Qwen/Qwen3-8B

Output model is saved as <model-id>-AWQ in the cache directory.

FP8 via torchao (`quantize-fp8-local` / `quantize-fp8-production`)

Uses torchao to convert BF16 weights to FP8 E4M3 (weight-only, data-free), for approximately 2x compression vs BF16. Native hardware acceleration is available on Hopper SM90 (H100, H200) and on Blackwell (B200 is SM100; RTX 5090 is SM120). Note: L40S is Ada Lovelace SM89, not Hopper; it does not have native FP8 compute but can still load FP8 checkpoints via dequantization.

Two variants are provided:

quantize-fp8-local — fast, lightweight. Basic output validation. Best for local development and quick iteration.
quantize-fp8-production — adds --json-strict for structured JSON on both success and error, --reload-validate-device for reload validation, --lock-mode/--lock-timeout for stronger locking, and stage-based error tracking. Best for CI, serving pipelines, and long-term artifact storage.

Both share the same CLI interface and output layout. The production variant adds extra options (see FP8 Options).

When to choose: Use quantize-fp8-local when iterating on FP8 quantization or testing new models. Use quantize-fp8-production when the output will be served in production, stored as a build artifact, or generated in CI where you need structured error reporting and artifact integrity guarantees.

# Default: offline, auto device selection
quantize-fp8-local Qwen/Qwen3-8B

# Force rebuild, run smoke test after
quantize-fp8-local --force --smoke-test Qwen/Qwen3-8B

# Safetensors output (experimental)
quantize-fp8-local --allow-safetensors Qwen/Qwen3-8B

# Custom shard size for large models
quantize-fp8-local --max-shard-size 4GB Qwen/Qwen3-8B

# JSON output for scripting
quantize-fp8-local --json Qwen/Qwen3-8B

# Production: validated output, JSON status on stdout
quantize-fp8-production --json Qwen/Qwen3-8B

# Production: reload validation on CPU
quantize-fp8-production --json-strict --reload-validate-device cpu Qwen/Qwen3-8B

Output model is saved as <model-id>-FP8-TORCHAO in the cache directory.

LLM Compressor (`quantize-llmc-local` / `quantize-llmc-production`)

Uses llm-compressor (vLLM project) for unified quantization. Produces compressed-tensors format loaded natively by vLLM without format conversion.

Scheme	Command	Calibration	Output Suffix
FP8 dynamic	`quantize-llmc-local <model>`	No (data-free)	`-FP8-DYNAMIC`
FP8 block	`quantize-llmc-local <model> fp8 --fp8-scheme block`	No (data-free)	`-FP8-BLOCK`
W4A16 GPTQ	`quantize-llmc-local <model> gptq`	Yes	`-W4A16-GPTQ`
W8A8 SmoothQuant	`quantize-llmc-local <model> w8a8`	Yes	`-W8A8-SQ-GPTQ`
NVFP4	`quantize-llmc-local <model> nvfp4`	Yes	`-NVFP4`

Two variants are provided:

quantize-llmc-local — fast, lightweight. Best for local development and quick iteration.
quantize-llmc-production — adds strict shell mode with ERR traps, stage-based error tracking via --json-strict, artifact manifests with SHA-256 checksums, atomic publish from temp dir, and extended output validation. Best for CI, serving pipelines, and long-term artifact storage.

Both share the same CLI interface and output layout. The production variant adds --json-strict.

When to choose: Use quantize-llmc-local when iterating on quantization schemes or testing new models. Use quantize-llmc-production when the output will be served in production, stored as a build artifact, or generated in CI where you need structured error reporting and artifact integrity guarantees.

# FP8 dynamic (default, data-free, no network needed)
quantize-llmc-local Qwen/Qwen3-8B

# W4A16 GPTQ with custom calibration parameters
quantize-llmc-local Qwen/Qwen3-8B gptq --online --num-samples 1024 --seq-length 4096

# W8A8 SmoothQuant
quantize-llmc-local Qwen/Qwen3-8B w8a8 --online

# NVFP4 (Blackwell-native)
quantize-llmc-local Qwen/Qwen3-8B nvfp4 --online

# Validate output by loading in vLLM and running generation
quantize-llmc-local Qwen/Qwen3-8B --validate --validate-prompt "Explain gravity."

# Use a local calibration dataset
quantize-llmc-local Qwen/Qwen3-8B gptq --dataset-path ./calibration.jsonl --text-column content

# Sequential pipeline for large models that exceed GPU memory
quantize-llmc-local Qwen/Qwen3-32B gptq --online --pipeline sequential

# JSON output for scripting
quantize-llmc-local --json Qwen/Qwen3-8B

# Production: validated output with artifact manifest
quantize-llmc-production --json Qwen/Qwen3-8B

# Production: strict JSON errors
quantize-llmc-production --json-strict Qwen/Qwen3-8B gptq --online

Schemes that require calibration (gptq, w8a8, nvfp4) need a dataset. The default is open_platypus. Pass --online if the dataset is not already cached.

GGUF (`quantize-gguf-local` / `quantize-gguf-production`)

Uses llama.cpp to convert HuggingFace models to GGUF format for the llama.cpp inference ecosystem (ollama, LM Studio, koboldcpp). Two-phase pipeline: convert HF safetensors to F16 GGUF, then quantize to the target type.

Two variants are provided:

quantize-gguf-local — fast, lightweight. Performs basic file-existence checks on output. Best for local development and quick iteration.
quantize-gguf-production — adds GGUF structural validation (magic, version, tensor count, alignment, data region bounds), artifact SHA-256 checksums, stage-based error tracking via --json-strict (each error includes the pipeline stage that failed), pre-publish integrity checks (fingerprint round-trip, metadata cross-validation, artifact hash verification), and extended fingerprints that include converter SHA-256, llama-quantize SHA-256, and gguf Python module version. Best for CI, serving pipelines, and long-term artifact storage.

Both share the same CLI interface and output layout. The production variant adds extra options (--json-strict, --smoke-ngl, --require-smoke-pass).

When to choose: Use quantize-gguf-local when iterating on quant types or testing new models — it's faster and has fewer dependencies. Use quantize-gguf-production when the output will be served in production, stored as a build artifact, or generated in CI where you need structured error reporting and artifact integrity guarantees.

# Default: Q4_K_M (good balance of size and quality)
quantize-gguf-local Qwen/Qwen3-8B

# Higher quality, larger file
quantize-gguf-local Qwen/Qwen3-8B Q5_K_M

# Maximum compression
quantize-gguf-local Qwen/Qwen3-8B Q2_K

# With importance matrix for better quality at low bit-widths
quantize-gguf-local Qwen/Qwen3-8B IQ4_XS --imatrix imatrix.dat

# Smoke test to verify output loads correctly
quantize-gguf-local --smoke-test Qwen/Qwen3-8B Q4_K_M

# JSON output for scripting
quantize-gguf-local --json Qwen/Qwen3-8B Q4_K_M

# Force rebuild
quantize-gguf-local --force Qwen/Qwen3-8B Q4_K_M

# Production: validated output with artifact SHA-256
quantize-gguf-production --json Qwen/Qwen3-8B Q4_K_M

# Production: strict JSON errors + required smoke test
quantize-gguf-production --json-strict --require-smoke-pass --smoke-ngl 99 Qwen/Qwen3-8B Q4_K_M

The F16 intermediate GGUF is cached by default at $FLOX_ENV_CACHE/gguf-staging/, so quantizing the same model to multiple types (Q4_K_M, Q5_K_S, Q8_0) only runs the conversion once. Pass --no-cache-f16 to disable caching.

Importance matrices (--imatrix): An importance matrix records per-weight activation statistics from a calibration corpus. Providing one via --imatrix imatrix.dat improves quality for aggressive quant types — particularly the IQ-series (IQ2_XXS through IQ4_NL) and low-K types (Q2_K, Q3_K_*). Generate one with llama-imatrix from llama.cpp against a representative text sample. Standard K-quants (Q4_K_M and above) see little benefit.

Intermediate precision (--convert-type): The default f16 is appropriate for most models. Use bf16 if the source model was trained in BF16 and you want to preserve the original precision through the intermediate GGUF stage; this avoids the minor rounding introduced by the F16 conversion.

Output model is saved as <model-id>-GGUF-<TYPE> in the cache directory.

When to Use What

Method	Compression	Quality	GPU Support	Best For
AWQ 4-bit	~3.5x	Good	All CUDA	Fitting larger models in limited VRAM
FP8 torchao	~2x	Excellent	SM90+	Quick FP8 checkpoint, data-free
FP8 dynamic/block (llmc)	~2x	Excellent	SM90+	compressed-tensors for vLLM
W4A16 GPTQ (llmc)	~3.5x	Good	All CUDA	4-bit with vLLM-native format
W8A8 SmoothQuant (llmc)	~2x	Excellent	All CUDA	Production throughput, 8-bit
NVFP4 (llmc)	~4x	Good	SM120	Native Blackwell 4-bit float
GGUF Q4_K_M	~3.5-4x	Good	CPU/Any	llama.cpp, ollama, LM Studio
GGUF Q5_K_M	~3x	Very good	CPU/Any	Higher quality GGUF
GGUF Q8_0	~2x	Excellent	CPU/Any	Highest quality GGUF

Model Sizing Reference (32 GB VRAM)

Model	BF16	AWQ/GPTQ 4-bit	FP8	NVFP4	GGUF Q4_K_M
7-8B	16 GB	4.5 GB	8 GB	~4 GB	~5 GB
14B	28 GB	8 GB	14 GB	~7 GB	~9 GB
32B	64 GB	18 GB	32 GB	~16 GB	~20 GB
70B	140 GB	40 GB	70 GB	~35 GB	~43 GB

GGUF rows apply equally to both quantize-gguf-local and quantize-gguf-production — the quantization output is identical; only the validation level differs. See GGUF for details.
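
The BF16 and FP8 columns above follow directly from parameter count times bits per weight. A back-of-the-envelope sketch of that arithmetic (estimate_weight_gb is an illustrative name; the estimate ignores group-wise scales, zero points, and layers left unquantized, which is why real 4-bit checkpoints land somewhat above it):

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, taking 1 GB = 1e9 bytes."""
    # billions of parameters * (bits / 8) bytes per parameter = GB
    return params_billions * bits_per_weight / 8

for bits, label in [(16, "BF16"), (8, "FP8"), (4, "4-bit")]:
    print(f"8B @ {label}: ~{estimate_weight_gb(8, bits):.1f} GB")
# 8B @ BF16: ~16.0 GB
# 8B @ FP8: ~8.0 GB
# 8B @ 4-bit: ~4.0 GB
```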

Data-Free vs Calibration-Based Quantization

Method	Calibration Required	Notes
FP8 torchao	No (data-free)	Static weight cast to E4M3
FP8 dynamic/block (LLMC)	No (data-free)	Dynamic per-tensor or per-block scaling
GGUF (all types)	No (data-free)	Optional `--imatrix` improves low-bit quality
AWQ 4-bit	Yes (built-in default)	Uses calibration corpus for activation-aware quantization
W4A16 GPTQ (LLMC)	Yes	Requires `--online` or local dataset
W8A8 SmoothQuant (LLMC)	Yes	Requires `--online` or local dataset
NVFP4 (LLMC)	Yes	Requires `--online` or local dataset

Data-free methods are faster, work fully offline, and produce deterministic output. Calibration-based methods tune quantization parameters using activation statistics collected on a calibration dataset, which generally yields better quality at the same bit-width. They require a representative dataset, and network access unless the dataset is already cached or supplied locally via --dataset-path.

Output Layout

When WRITE_LOCAL_REPO_LAYOUT=1 (the default), quantized models are written in HuggingFace hub cache structure:

$QUANTIZED_OUTPUT_DIR/
  hub/
    models--Qwen--Qwen3-8B/              # source model (read-only)
      refs/main
      snapshots/<commit>/
        config.json, model.safetensors, ...
    models--Qwen--Qwen3-8B-AWQ/          # quantized output (AWQ example)
      refs/main
      snapshots/<snapshot-id>/
        config.json
        model.safetensors
        tokenizer.json
        tokenizer_config.json
        FINGERPRINT.json                  # AWQ, FP8, and GGUF
        awq_quantize_meta.json            # AWQ only
    models--Qwen--Qwen3-8B-GGUF-Q4_K_M/  # quantized output (GGUF example)
      refs/main
      snapshots/<snapshot-id>/
        Qwen3-8B-Q4_K_M.gguf             # single GGUF file
        FINGERPRINT.json
        gguf_quantize_info.json

The <snapshot-id> is a hash of the full quantization fingerprint (model ID, source commit, quant parameters, seed, etc.) — SHA-256 for AWQ and GGUF, SHA-1 for FP8 and LLMC. This makes each output content-addressed: changing any parameter produces a new snapshot directory, while identical parameters reuse the existing one.

When WRITE_LOCAL_REPO_LAYOUT=0, output is written to a flat directory under $QUANTIZED_OUTPUT_DIR/<model-id-suffix>/<snapshot-id>/.
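
For tooling that needs to locate a quantized checkpoint without going through vLLM, the hub layout can be walked directly. The sketch below assumes, as in the standard HuggingFace cache, that refs/main is a small text file holding the snapshot directory name; resolve_snapshot is a hypothetical helper, not a command provided by this environment:

```python
from pathlib import Path

def resolve_snapshot(model_dir: Path, ref: str = "main") -> Path:
    """Follow refs/<ref> to the snapshot directory it names."""
    # Assumption: the ref file contains only the snapshot ID (plus optional newline)
    snapshot_id = (model_dir / "refs" / ref).read_text().strip()
    snapshot = model_dir / "snapshots" / snapshot_id
    if not snapshot.is_dir():
        raise FileNotFoundError(f"{ref} points at missing snapshot {snapshot_id}")
    return snapshot
```

Called with the models--Qwen--Qwen3-8B-AWQ directory from the tree above, this would return the snapshots/<snapshot-id> path that contains config.json and the weights.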

Configuration

Global Environment Variables

These are set in the [vars] section of .flox/env/manifest.toml and apply to all scripts. Change the defaults directly in the manifest, or override per-session at activation time (e.g., MODEL_CACHE_DIR=/data/models flox activate):

Variable	Default	Description
`MODEL_CACHE_DIR`	`$FLOX_ENV_PROJECT`	HuggingFace cache root (scripts append `hub/models--*/`)
`QUANTIZED_OUTPUT_DIR`	`$FLOX_ENV_PROJECT`	Output root for quantized models
`WRITE_LOCAL_REPO_LAYOUT`	`1`	Write HF hub cache layout for vLLM auto-discovery

AWQ Environment Variables

Variable	Default	Description
`MODEL_REVISION`	`main`	HF revision (branch, tag, or 40-char commit hash)
`HF_OFFLINE`	`1`	Cache-only model loading (no network)
`TRUST_REMOTE_CODE`	`0`	Allow execution of model-repo custom code
`FORCE_REQUANTIZE`	`0`	Remove and rebuild existing complete output
`SHOW_SIZES`	`0`	Print source/output size comparison after quantization
`REQUIRE_CUDA`	`1`	Fail if CUDA is unavailable
`DEVICE_MAP`	(dynamic)	`auto`, `cuda0`, or `cpu`. Default: `cuda0` if CUDA available, else `cpu`
`QUANT_SEED`	`1337`	RNG seed for reproducibility
`DETERMINISTIC`	`0`	Request deterministic CUDA algorithms
`QUANT_SHARD_SIZE`	`5GB`	Weight shard size
`QUANT_SAFETENSORS`	`1`	Save in safetensors format
`SMOKE_TEST_MODE`	`full`	`full` (reload + generate), `fast` (in-memory), or `off`
`WRITE_CHECKSUMS`	`0`	Write `FILES_SHA256.json` manifest
`LOCK_TIMEOUT_SECONDS`	`0`	Seconds to wait for output lock (0 = fail immediately)
`LOCK_METHOD`	`auto`	`auto`, `flock`, or `mkdir`
`LOCK_STALE_SECONDS`	`0`	Remove stale locks older than this (0 = disabled)
`AWQ_CALIB_DATASET`	(unset)	Override calibration dataset
`AWQ_MAX_CALIB_SAMPLES`	(unset)	Max calibration samples
`AWQ_MAX_CALIB_SEQ_LEN`	(unset)	Max calibration sequence length
`AWQ_N_PARALLEL_CALIB_SAMPLES`	(unset)	Parallel calibration samples
`DETERMINISTIC_FAIL_CLOSED`	`1`	Make determinism setup errors fatal (0 = warn and continue)
`WRITE_WEIGHT_CHECKSUMS`	`0`	Include weight shards in `FILES_SHA256.json`
`FINGERPRINT_INCLUDE_VERS`	`1`	Include torch/transformers/awq versions in fingerprint
`FINGERPRINT_INCLUDE_SYS`	`0`	Include host/system info in fingerprint
`ALLOW_REMOTE_STALE`	`0`	Allow stale-lock cleanup across hosts on shared storage
`PYTHON`	`python3`	Python executable

FP8 (torchao) Options

The FP8 scripts use CLI flags for most configuration. They also respect MODEL_CACHE_DIR and QUANTIZED_OUTPUT_DIR (see Global Environment Variables above) and three version-gate environment variables: MIN_TORCH_VERSION (default: 2.1.0), MIN_TRANSFORMERS_VERSION (default: 4.40.0), MIN_TORCHAO_VERSION (default: 0.10.0).

Both quantize-fp8-local and quantize-fp8-production share this core interface:

quantize-fp8-local <model-id> [options]
quantize-fp8-production <model-id> [options]

  -c, --cache-dir DIR           HF cache root
  -o, --output-dir DIR          Output root
  -r, --revision REV            HF revision (default: main)
      --device MODE             auto|cpu|cuda (default: auto)
      --online                  Allow network access
      --trust-remote-code       Allow model repo custom code
      --force                   Rebuild even if output exists
      --suffix STR              Output suffix (default: -FP8-TORCHAO)
      --format FMT              torch|safetensors (default: torch)
      --allow-safetensors       Attempt safetensors format
      --max-shard-size STR      Weight shard size (e.g. 4GB)
      --offline-pick-latest     Pick newest cached snapshot when refs are missing
      --lock-ttl-seconds N      Lock TTL for cross-host stale handling (default: 21600)
      --smoke-test              Run generation on output
      --smoke-prompt STR        Prompt for smoke test (default: "Hello")
      --smoke-max-new-tokens N  Tokens to generate (default: 1)
      --smoke-temperature F     Temperature (default: 0.0)
      --no-validate             Skip structural validation
      --no-validate-quant       Skip quantization coverage check
      --validate-zip-crc        Run zip CRC checks on .bin shards (slow)
      --quant-min-ratio FLOAT   Min fraction of quantized layers (default: 0.80)
      --json                    JSON output on stdout (logs to stderr)

quantize-fp8-production adds:

      --json-strict                Like --json, and all failures also emit JSON to stdout
      --reload-validate-device DEV cpu|skip (default: cpu)
      --lock-mode MODE             auto|fd|mkdir (default: auto)
      --lock-timeout N             Lock wait seconds (0=fail-fast, -1=unlimited, default: 0)

GGUF Options

Both quantize-gguf-local and quantize-gguf-production share this core interface:

quantize-gguf-local <model-id> [quant-type] [options]
quantize-gguf-production <model-id> [quant-type] [options]

Quant types: Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_0, Q4_1, Q4_K_S, Q4_K_M,
             Q5_0, Q5_1, Q5_K_S, Q5_K_M, Q6_K, Q8_0, F16, F32,
             IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS, IQ4_NL

  -c, --cache-dir DIR         HF cache root
  -o, --output-dir DIR        Output root
  -r, --revision REV          HF revision (default: main)
      --suffix STR            Override output suffix (default: -GGUF-<TYPE>)
      --online                Allow network access
      --trust-remote-code     Allow model repo custom code
      --force                 Rebuild even if output exists
      --imatrix FILE          Importance matrix for quantization
      --convert-type TYPE     Intermediate precision: f16, bf16 (default: f16)
      --no-cache-f16          Do not cache intermediate F16 GGUF
      --threads N             Threads for llama-quantize (default: nproc)
      --smoke-test            Load output and generate tokens via llama-completion
      --smoke-prompt STR      Smoke test prompt (default: "Hello")
      --smoke-tokens N        Smoke test token count (default: 8)
      --lock-timeout N        Lock wait seconds (0=fail-fast, -1=unlimited, default: 0)
      --json                  JSON output on stdout (logs to stderr)

quantize-gguf-production adds:

      --json-strict           Emit structured JSON on errors (not just success)
      --smoke-ngl N           GPU layers for smoke test (default: 0 = CPU)
      --require-smoke-pass    Make smoke test failure fatal

LLM Compressor Options

Both quantize-llmc-local and quantize-llmc-production share this core interface:

quantize-llmc-local <model-id> [scheme] [options]
quantize-llmc-production <model-id> [scheme] [options]

Schemes: fp8 (default), gptq, w8a8, nvfp4

Quantization:
  --ignore LIST            Extra layer ignore patterns (comma-separated; repeatable)

FP8-specific:
  --fp8-scheme NAME        dynamic|block (default: dynamic)
  --fp8-pathway NAME       oneshot|model_free (default: oneshot)
  --model-free-device STR  Device for model_free_ptq (default: cuda:0)
  --model-free-workers N   Worker count for model_free_ptq (default: 8)

Calibration:
  --num-samples N          Calibration samples (default: 512)
  --seq-length N           Max sequence length (default: 2048)
  --batch-size N           Batch size (default: 1)
  --dataset NAME           HF dataset ID (default: open_platypus)
  --dataset-config NAME    HF dataset config name (optional)
  --dataset-path PATH      Local dataset (.json, .jsonl, .csv, .parquet, or directory)
  --text-column KEY        Text column name (default: text)
  --no-shuffle             Do not shuffle calibration samples
  --seed N                 RNG seed (default: 1234)
  --streaming              Stream dataset (hub ID or dvc:// path; online only)

Pipeline:
  --pipeline NAME          basic|datafree|sequential|independent
  --sequential-targets L   Decoder layer class names (comma-separated)
  --sequential-offload D   Offload device between layers (default: cpu)
  --no-qac                 Disable quantization-aware calibration
  --splits SPEC            Split percentages spec passed to oneshot
  --preprocessing-workers N  Dataset preprocessing workers
  --dataloader-workers N     DataLoader workers (default: 0)

Validation:
  --validate               Load output in vLLM and run checks
  --validate-prompt TEXT   Smoke test prompt (default: "Hello!")
  --validate-suite PATH    JSONL or txt regression suite
  --validate-seed N        Seed for vLLM sampling (default: 1)
  --validate-max-tokens N  Max tokens per prompt (default: 64)
  --validate-min-chars N   Min chars in output to count as pass (default: 1)

General:
  --model-revision REV     HF revision (default: main)
  --online                 Allow network access
  --trust-remote-code      Allow model repo custom code
  --use-auth-token         Use HuggingFace auth for private repos
  --force                  Overwrite existing output
  --suffix STR             Override output suffix
  --lock-timeout SECONDS   Lock wait time (0=fail, -1=unlimited)
  --log-dir PATH           llmcompressor log directory
  --json                   JSON output on stdout (logs to stderr)

quantize-llmc-production adds:

      --json-strict           Emit structured JSON on errors (not just success)

list-models Options

list-models [cache-dir]

Lists all HuggingFace models in the cache directory, showing model ID, total size, snapshot count, and detected quantization type ([AWQ], [FP8-TORCHAO], [compressed-tensors], [GGUF], [quantized]). The optional positional cache-dir argument overrides $MODEL_CACHE_DIR. Pass -h for usage help. list-models does not support --json.

Advanced Features

Locking

All seven quantization scripts implement file-level locking on the output directory so that concurrent quantization jobs cannot corrupt each other. The AWQ and FP8 scripts support both flock (preferred) and mkdir-based locks with configurable timeout and stale lock detection. The LLMC and GGUF scripts use flock with --lock-timeout. Lock timeout behavior: 0 (default) fails immediately if the lock is held, N > 0 waits up to N seconds, -1 waits indefinitely.

The production GGUF script also holds a separate flock on the F16 intermediate cache directory, allowing safe concurrent quantization of the same model to multiple types (e.g., Q4_K_M and Q5_K_S in parallel) without duplicate F16 conversions.
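
The timeout semantics described above (0 fails fast, N waits, -1 waits forever) can be illustrated with a small flock wrapper. This is a Unix-only sketch using Python's fcntl module, not the scripts' actual locking code; acquire_lock and its poll parameter are hypothetical:

```python
import fcntl
import time

def acquire_lock(path: str, timeout: float = 0, poll: float = 0.2):
    """Take an exclusive flock on path and return the open file.

    timeout: 0 = fail immediately, N > 0 = wait up to N seconds, -1 = wait forever.
    Closing the returned file releases the lock.
    """
    fh = open(path, "a")
    deadline = None if timeout < 0 else time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt; raises BlockingIOError if another holder exists
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fh
        except BlockingIOError:
            if deadline is not None and time.monotonic() >= deadline:
                fh.close()
                raise TimeoutError(f"lock is held: {path}")
            time.sleep(poll)
```

Polling with LOCK_NB rather than blocking in flock keeps the timeout enforceable without signals, at the cost of up to one poll interval of extra latency.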

Fingerprinting

All seven quantization scripts compute content-addressed snapshot IDs by hashing a fingerprint of all configuration inputs. For AWQ this includes: model ID, source commit, quantization parameters (bits, group size, zero point), calibration settings, device map, dtype policy, seed, determinism flag, save format, shard size, and optionally library versions and system info. FP8 includes a similar set plus script version and output format. LLMC hashes the model ID, resolved commit, revision, scheme, FP8 options, calibration parameters, and pipeline options. GGUF local hashes the model ID, source commit, quant type, convert type, imatrix SHA-256, script SHA, and llama.cpp version (7 fields). GGUF production extends this with converter SHA-256 (convert_hf_to_gguf.py), llama-quantize SHA-256, gguf Python module version, and trust_remote_code (11 fields). Identical configurations always produce the same output path.
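
The idea of deriving a snapshot ID from a configuration fingerprint can be sketched in a few lines. The field names and the canonical-JSON serialization below are assumptions made for illustration; the scripts' real fingerprint format is whatever they write to FINGERPRINT.json:

```python
import hashlib
import json

def snapshot_id(fingerprint: dict, algo: str = "sha256") -> str:
    """Hash a configuration fingerprint into a stable hex ID."""
    # Sorted keys and fixed separators give one canonical byte string per config
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.new(algo, canonical.encode("utf-8")).hexdigest()

# Hypothetical fingerprint fields
base = {"model_id": "Qwen/Qwen3-8B", "bits": 4, "group_size": 128, "seed": 1337}
reordered = dict(reversed(list(base.items())))

assert snapshot_id(base) == snapshot_id(reordered)                   # key order is irrelevant
assert snapshot_id(base) != snapshot_id({**base, "group_size": 64})  # any change gives a new ID
```

Canonicalizing before hashing is what makes the ID depend only on the configuration values rather than on incidental ordering.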

Determinism

Set DETERMINISTIC=1 (AWQ) or pass --seed N (LLMC) to improve reproducibility. The AWQ script seeds Python, NumPy, and PyTorch (see QUANT_SEED) and, when DETERMINISTIC=1, also enables torch.use_deterministic_algorithms(True). Bitwise-identical output across runs is not guaranteed because some CUDA kernels are non-deterministic, but output quality is consistent.
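
The seeding steps described above look roughly like the following. This is a generic sketch of the pattern, not the AWQ script's own code; seed_everything is a hypothetical name, and NumPy and PyTorch are treated as optional so the snippet runs anywhere:

```python
import random

def seed_everything(seed: int, deterministic: bool = False) -> None:
    """Seed Python, NumPy, and PyTorch RNGs; optionally request deterministic kernels."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        if deterministic:
            # Ops without a deterministic implementation will raise at call time
            torch.use_deterministic_algorithms(True)
    except ImportError:
        pass  # PyTorch not installed

seed_everything(1337)
```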

Smoke Tests

AWQ: --smoke-test full (default) reloads the saved checkpoint and runs a forward pass plus short generation. fast tests the in-memory model immediately after quantization. off skips generation but still validates output structure.
FP8: --smoke-test loads the output checkpoint and generates one token.
LLMC: --validate spawns a separate vLLM process to load the checkpoint and run generation, verifying the output is a valid vLLM-loadable model.
GGUF (both variants): --smoke-test runs llama-completion to load the output GGUF and generate a short sequence, verifying the file is valid and loadable. The production variant adds --smoke-ngl to control GPU offloading and --require-smoke-pass to make failure fatal.

Note: The smoke test flag is named differently across scripts for historical reasons: --smoke-test full|fast|off for AWQ, --smoke-test (boolean) for FP8 and GGUF, and --validate for LLMC. All serve the same purpose — verifying the quantized output loads and generates tokens correctly.

JSON Mode

All seven quantization scripts support --json for machine-readable output on stdout (logs go to stderr); list-models does not support --json. Each emits a single JSON object with status (ok or exists), source model, and output path; successful runs (ok) also include timing. Script-specific fields:

AWQ and FP8: source commit, snapshot ID, revision, and smoke test results. FP8 adds device mode, format, and validation coverage.
LLMC local: scheme, validation status, output size, and timestamp.
GGUF: quant type, GGUF filename, file size, and smoke test results. The production variant also includes artifact_sha256.

The three production variants (quantize-fp8-production, quantize-llmc-production, and quantize-gguf-production) additionally support --json-strict, which emits structured JSON on both success and error. Error objects include a stage field indicating where the failure occurred (one of: startup, preflight, locking, source-resolution, f16-conversion, quantization, artifact-validation, smoke-test, publish, complete).
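
A consumer of this output typically branches on status and, for failures, on stage. The sketch below relies only on the status and stage fields named above; the output_path key is an assumed name, so check it against a real result object before depending on it:

```python
import json

def summarize(raw: str) -> str:
    """Turn one JSON result object into a one-line summary."""
    result = json.loads(raw)
    status = result.get("status")
    if status in ("ok", "exists"):
        # "output_path" is an assumed key name, used here for illustration
        return f"{status}: {result.get('output_path', '<no path>')}"
    # Anything else is treated as a failure; strict-mode errors carry a stage
    return f"failed at stage {result.get('stage', 'unknown')}"

print(summarize('{"status": "exists", "output_path": "/data/out"}'))
# exists: /data/out
```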

CI / Pipeline Usage

The scripts are designed for unattended operation. With --json, all human-readable logs go to stderr and a single JSON object is written to stdout on completion, making it straightforward to parse results in a pipeline.

Running in CI

Flox provides GitHub Actions for CI integration. The environment travels with the repo, so CI gets the same toolchain as local development.

The commands (quantize-fp8-local, quantize-fp8-production, quantize-llmc-local, quantize-llmc-production, quantize-gguf-local, quantize-gguf-production, etc.) are binaries from the model-quantizer package and work in all contexts — interactive sessions and CI alike:

# .github/workflows/quantize.yml
jobs:
  quantize:
    runs-on: [self-hosted, gpu]  # needs NVIDIA GPU
    steps:
      - uses: actions/checkout@v4
      - uses: flox/install-flox-action@v2
      - uses: flox/activate-action@v1
        with:
          command: |
            quantize-fp8-production --online --json Qwen/Qwen3-8B > result.json
            cat result.json

For non-GitHub CI (GitLab, CircleCI, Jenkins), install Flox on the runner and use flox activate --:

flox activate -- quantize-fp8-production --online --json Qwen/Qwen3-8B > result.json

Key behaviors for automation

Exit codes: 0 on success or if output already exists, 1 on error
Idempotent: re-running with the same parameters detects existing output and exits immediately (JSON reports "status": "exists")
--force: bypasses the exists check and re-quantizes from scratch
Locking: concurrent jobs targeting the same output serialize automatically via file locks. Lock timeout is configurable: --lock-timeout 0 (default) fails immediately if the lock is held, --lock-timeout N waits up to N seconds, --lock-timeout -1 waits indefinitely
--json: structured output for parsing; logs on stderr won't pollute stdout
HF_TOKEN: set in CI secrets for gated model access — the HuggingFace libraries pick it up automatically. For local development, export it in your shell profile or run huggingface-cli login
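
The three --lock-timeout modes can be illustrated with a small advisory-lock helper. This is a sketch of the semantics only, assuming a Linux host and fcntl.flock; it is not the mechanism the scripts themselves use, and the names acquire_lock and release_lock are hypothetical.

```python
import fcntl
import os
import time


def acquire_lock(path: str, timeout: float = 0, poll: float = 0.1) -> int:
    """Return a locked file descriptor, or raise TimeoutError.

    timeout == 0: fail immediately if the lock is held
    timeout  > 0: wait up to that many seconds
    timeout  < 0: wait indefinitely
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    deadline = None if timeout < 0 else time.monotonic() + timeout
    while True:
        try:
            # Non-blocking exclusive lock; raises BlockingIOError if held elsewhere.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if deadline is not None and time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"lock busy: {path}")
            time.sleep(poll)


def release_lock(fd: int) -> None:
    """Unlock and close a descriptor returned by acquire_lock."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

In a CI matrix where several jobs may target the same output, a positive timeout lets later jobs wait for the first one and then report "status": "exists" instead of failing outright.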

Batch quantization example

#!/usr/bin/env bash
# Quantize a list of models, collect results
# Run inside flox activate, or use: flox activate -- bash batch-quantize.sh
models=(Qwen/Qwen3-8B meta-llama/Llama-3.1-8B-Instruct google/gemma-3-4b-it)

for model in "${models[@]}"; do
  echo "--- $model ---" >&2
  quantize-llmc-local --online --json "$model" >> results.jsonl
done

Each line in results.jsonl is a self-contained JSON object with status, output path, timing, and validation.
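
A short sketch of summarizing such a file follows. It relies only on the status field; the function name tally_statuses and the sample lines are illustrative, and a run that failed without emitting JSON would simply leave no line to count.

```python
import json
from collections import Counter


def tally_statuses(lines) -> Counter:
    """Count results by status across JSONL lines, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line).get("status", "unknown")] += 1
    return counts


# Illustrative lines; pass an open file object to read results.jsonl instead.
sample = ['{"status": "ok"}', '{"status": "exists"}', '{"status": "ok"}']
print(tally_statuses(sample))
```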

AutoAWQ Compatibility Patches

AutoAWQ 0.2.9 is the last release and is no longer maintained. The environment applies three patches on first activation to maintain compatibility with transformers 4.52+:

GELUTanh rename: PytorchGELUTanh was renamed to GELUTanh in transformers. Patched in awq/quantize/scale.py.
Catcher attribute proxy: The Catcher wrapper class in awq/quantize/quantizer.py does not proxy attribute access to the wrapped decoder layer. Transformers 4.57+ accesses attention_type on Qwen2/Qwen3 layers, crashing calibration. Patched by adding __getattr__ fallback.
Deprecation noise: AWQ's __init__.py overrides Python's warning filters and emits a deprecation notice on every import. Patched to remove the simplefilter override and warnings.warn call.

These patches are applied automatically during venv provisioning and do not require manual intervention.
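
The attribute-proxy fix follows a common Python pattern, sketched below with plain classes. This is an illustration of the idea, not AutoAWQ's code: the real Catcher wraps a torch module, and DecoderLayer here is a hypothetical stand-in.

```python
class DecoderLayer:
    """Hypothetical stand-in for a transformer decoder layer."""
    attention_type = "full_attention"


class Catcher:
    """Plain-Python stand-in for a wrapper placed around a layer during calibration."""

    def __init__(self, module):
        self.module = module

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so attributes the
        # wrapper lacks are forwarded to the wrapped layer.
        if name == "module":
            raise AttributeError(name)  # avoid recursion if __init__ has not run
        return getattr(self.module, name)


print(Catcher(DecoderLayer()).attention_type)  # full_attention
```

Without the fallback, the same access raises AttributeError, which is the failure mode described for Qwen2/Qwen3 layers above.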

FP8 Compatibility Patches

TorchAO's FP8 weight-only quantization (Float8WeightOnlyConfig) produces Float8Tensor subclasses that store quantized data in .qdata (float8_e4m3fn) and per-row scales in .scale (float32). These tensor subclasses do not support low-level operations (storage(), data_ptr(), view(-1)) that safetensors and transformers rely on during save_pretrained(). The environment applies two categories of fixes:

Venv patches (applied by `setup-venv`)

Five patches are applied to safetensors and transformers during venv provisioning:

storage_ptr() fallback (safetensors torch.py): The original code returns 0 when storage().data_ptr() raises NotImplementedError, causing all FP8 tensors to appear as shared storage. Patched to catch RuntimeError as well and return id(tensor) — a unique identifier per tensor — so safetensors correctly detects disjoint tensors.
_end_ptr() wrapping (safetensors torch.py): view(-1) and data_ptr() fail on Float8Tensor. Wrapped in try/except to fall back to id(tensor).
_end_ptr() wrapping (transformers modeling_utils.py): Same issue as above but in transformers' copy of the function. Uses tensor.element_size() instead of _SIZE[tensor.dtype].
_find_disjoint() wrapping (transformers modeling_utils.py): tensor.data_ptr() call wrapped in try/except to treat FP8 tensors as unique (non-shared).
_find_identical() wrapping (transformers modeling_utils.py): Same pattern — prevents false positive shared-tensor detection for FP8 tensors.
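
Why the fallback value matters can be shown without torch. In the sketch below, FakeFloat8Tensor is a hypothetical stand-in whose storage access fails, and the two functions approximate the behavior before and after the first patch; they are not copies of the safetensors source.

```python
class FakeFloat8Tensor:
    """Hypothetical stand-in for a tensor subclass without accessible storage."""

    def storage(self):
        raise NotImplementedError("no storage for this tensor subclass")


def storage_ptr_unpatched(tensor) -> int:
    try:
        return tensor.storage().data_ptr()
    except NotImplementedError:
        return 0  # every failing tensor reports the same "address"


def storage_ptr_patched(tensor) -> int:
    try:
        return tensor.storage().data_ptr()
    except (NotImplementedError, RuntimeError):
        return id(tensor)  # unique per live object


a, b = FakeFloat8Tensor(), FakeFloat8Tensor()
print(storage_ptr_unpatched(a) == storage_ptr_unpatched(b))  # True: looks shared
print(storage_ptr_patched(a) == storage_ptr_patched(b))      # False: disjoint
```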

All patches are idempotent (guarded by grep -q for a unique marker string) and version-sensitive: if safetensors or transformers update to natively support FP8 tensor subclasses, the patches print a warning and skip. Targets: safetensors 0.7.0, transformers 5.3.0.
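
The marker-guarded approach can be sketched in Python as follows. The environment itself uses grep -q in shell; the marker text, the function name apply_patch, and the three outcome labels here are all assumptions for illustration.

```python
MARKER = "# patched: fp8-tensor-subclass"  # hypothetical marker string


def apply_patch(source: str, old: str, new: str) -> tuple[str, str]:
    """Return (source, outcome); outcome is applied, already-patched, or skipped."""
    if MARKER in source:
        return source, "already-patched"  # idempotent: never patch twice
    if old not in source:
        # Upstream code changed, so warn and leave the file untouched.
        return source, "skipped"
    return source.replace(old, f"{new}  {MARKER}", 1), "applied"


original = "    return 0\n"
patched, outcome = apply_patch(original, "return 0", "return id(tensor)")
print(outcome)                                                        # applied
print(apply_patch(patched, "return 0", "return id(tensor)")[1])       # already-patched
print(apply_patch("    pass\n", "return 0", "return id(tensor)")[1])  # skipped
```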

Script-level fixes

The FP8 quantization scripts (quantize-fp8-local.sh, quantize-fp8-production.sh) apply two additional fixes:

Float8Tensor unwrapping: Before calling save_pretrained(), the script extracts .qdata and .scale from each Float8Tensor into plain tensors (with _scale suffix for scales), then passes the unwrapped state dict via state_dict=unwrapped. This bypasses all remaining serialization issues.
Validation safetensors fallback (local script only): Transformers 5.x always saves as safetensors regardless of the safe_serialization flag. The validation step therefore checks for .safetensors files as a fallback when out_format="torch" and no .bin files are found. The production script's validate_saved_artifacts() already checks both formats.
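
The unwrapping step might look roughly like the sketch below. It uses duck typing and placeholder values so it runs without torch; the function name unwrap_state_dict and the exact key naming (appending _scale to the parameter name) are assumptions based on the description above, not the scripts' code.

```python
from types import SimpleNamespace


def unwrap_state_dict(state_dict: dict) -> dict:
    """Split objects exposing .qdata and .scale into two plain entries."""
    unwrapped = {}
    for name, value in state_dict.items():
        if hasattr(value, "qdata") and hasattr(value, "scale"):
            unwrapped[name] = value.qdata             # quantized payload
            unwrapped[f"{name}_scale"] = value.scale  # per-row scales
        else:
            unwrapped[name] = value                   # ordinary tensor, kept as-is
    return unwrapped


# Placeholder values stand in for real tensors.
fake = {
    "layer.weight": SimpleNamespace(qdata="fp8-bytes", scale="row-scales"),
    "layer.bias": "plain-tensor",
}
print(sorted(unwrap_state_dict(fake)))
# ['layer.bias', 'layer.weight', 'layer.weight_scale']
```

With real tensors, the resulting dictionary is what would be handed to save_pretrained() through its state_dict argument.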

Packages

From Flox (declarative, pinned):

| Package | Version | Description |
| --- | --- | --- |
| `flox-cuda/python3Packages.torch` | 2.9.1 | PyTorch with CUDA support |
| `uv` | latest | Python package installer |
| `flox-cuda/llama-cpp` | latest | llama.cpp tools (convert-hf-to-gguf, llama-quantize, llama-completion) |
| `gcc-unwrapped` | latest | C++ standard library (libstdc++) for PyPI native extensions |
| `zlib` | latest | Compression library required by numpy |

From PyPI via uv (auto-provisioned on first flox activate):

| Package | Description |
| --- | --- |
| `torchao` | PyTorch native quantization (FP8 E4M3) |
| `transformers` | HuggingFace Transformers |
| `accelerate` | Model parallelism and device mapping |
| `safetensors` | Safe tensor serialization |
| `huggingface-hub` | HuggingFace Hub client |
| `autoawq` | AWQ 4-bit quantization |
| `llmcompressor` | vLLM's unified quantization library |
| `datasets` | HuggingFace Datasets (calibration data for GPTQ/W8A8/NVFP4) |
| `gguf` | GGUF format support (required by convert-hf-to-gguf) |
| `sentencepiece` | Tokenizer library (required by some model conversions) |

PyPI torch is automatically removed after installation so Python falls through to the Flox-provided CUDA-enabled torch via --system-site-packages.

System Requirements

OS: Linux (x86_64 or aarch64)
GPU: NVIDIA GPU with CUDA support. FP8 methods require SM90 or newer for native hardware acceleration: Hopper (H100, H200) or Blackwell (B200 at SM100, RTX 5090 at SM120). L40S is Ada Lovelace (SM89) and does not have native FP8 compute. AWQ and GPTQ work on all CUDA GPUs. GGUF quantization and inference are CPU-only and do not require a GPU.
Driver: NVIDIA driver compatible with CUDA 12.x (driver 525+)
VRAM: Depends on model size. 7-8B models need ~16 GB for loading + quantization workspace. 4-bit AWQ and GPTQ outputs need less VRAM at inference time, so larger models fit on the same GPU.
Disk: Source model + quantized output. HuggingFace-format methods (AWQ, FP8, LLMC) need ~2x the source model size. GGUF needs ~2.5x when caching the F16 intermediate (shared across quant types), or ~2x with --no-cache-f16.
Flox: must be installed (see Setup)
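
The disk multipliers above translate into a quick back-of-the-envelope estimate. The helper below is a sketch: its name and the method labels are hypothetical, and the figures are the approximate ratios quoted in this section rather than measured values.

```python
def estimate_disk_gb(source_gb: float, method: str, cache_f16: bool = True) -> float:
    """Approximate disk needed (source + output) using the ratios in this README."""
    if method == "gguf":
        # ~2.5x when the F16 intermediate is cached, ~2x with --no-cache-f16.
        return source_gb * (2.5 if cache_f16 else 2.0)
    if method in {"awq", "fp8", "llmc"}:
        return source_gb * 2.0
    raise ValueError(f"unknown method: {method}")


print(estimate_disk_gb(16, "gguf"))                   # 40.0
print(estimate_disk_gb(16, "gguf", cache_f16=False))  # 32.0
print(estimate_disk_gb(16, "fp8"))                    # 32.0
```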

Troubleshooting

"Source model not found in cache"

The model has not been downloaded. Pass --online to allow the script to fetch it, or download the model separately before running in offline mode.

"Lock busy"

Another job holds the lock, or a previous run was interrupted and left a stale lock. If no other job is running, remove the lock manually:

# AWQ locks (inside the output model directory)
rm -rf "$QUANTIZED_OUTPUT_DIR/hub/models--<org>--<model>-<suffix>/.quantize.lockdir"
rm -f  "$QUANTIZED_OUTPUT_DIR/hub/models--<org>--<model>-<suffix>/.quantize.lock"

# FP8 locks (inside the output model directory)
rm -rf "$QUANTIZED_OUTPUT_DIR/hub/models--<org>--<model>-<suffix>/.quantize.lock"

# LLMC / GGUF locks (in the output root)
rm -f  "$QUANTIZED_OUTPUT_DIR/.quantize-<model-slug>.lock"

For AWQ, set LOCK_STALE_SECONDS=300 to automatically clean stale locks older than 5 minutes.

CUDA out of memory

The model is too large for available VRAM. Options:

Close other GPU-using processes (nvidia-smi to check)
Use DEVICE_MAP=auto or --device auto for automatic multi-GPU or CPU offloading
For LLMC, use --pipeline sequential to quantize one decoder layer at a time
Quantize a smaller model or use a machine with more VRAM

"CUDA is not available"

PyTorch cannot see the GPU. Check that nvidia-smi works and that the Flox-provided torch has CUDA support:

flox activate
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
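
A slightly longer diagnostic, sketched below, also reports each GPU's compute capability, which is useful when checking the FP8 requirement listed under System Requirements. The function name cuda_report is hypothetical; the torch calls used are standard, and the import is guarded so the script still runs where torch is missing.

```python
def cuda_report() -> dict:
    """Collect torch/CUDA facts; returns {"torch": None} if torch is not importable."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if report["cuda_available"]:
        for index in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(index)
            report["devices"].append(
                {"name": torch.cuda.get_device_name(index), "sm": major * 10 + minor}
            )
    return report


print(cuda_report())
```

A cuda_build of None points at a CPU-only torch shadowing the Flox build, while a CUDA build with cuda_available False usually points at the driver.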

Venv issues after environment changes

If the Python venv gets into a bad state (version mismatches, broken patches), delete it and re-activate:

rm -rf .flox/cache/venv
flox activate  # recreates venv from scratch

Resources

AutoAWQ -- AWQ quantization library
torchao -- PyTorch native quantization
llm-compressor -- vLLM's unified quantization
llama.cpp -- GGUF quantization and inference
vLLM quantization docs -- Loading quantized models in vLLM
Flox -- Reproducible development environments
