Flox environment for quantizing HuggingFace models for offline inference. Four quantization backends cover the full precision-compression spectrum: AWQ 4-bit, FP8 via torchao, LLM Compressor (FP8, GPTQ, W8A8, NVFP4), and GGUF for the llama.cpp ecosystem (ollama, LM Studio, koboldcpp). HuggingFace-format outputs are written in hub cache layout so vLLM discovers quantized checkpoints with HF_HUB_OFFLINE=1; GGUF outputs are single-file models ready for llama.cpp-based tools.
Python 3.13 | PyTorch 2.9.1 (CUDA) | x86_64-linux, aarch64-linux
This repository is a Flox environment. Flox is a package manager built on Nix that defines your entire development toolchain — system packages, Python runtime, CUDA libraries — in a single declarative manifest (.flox/env/manifest.toml). The .flox/ directory travels with the repo, so anyone who clones it gets an identical environment without installing anything manually beyond Flox itself.
Running flox activate does the following:
- Provides Python 3.13 and PyTorch 2.9.1 with CUDA support from the Flox catalog (no pip/conda)
- Creates a Python venv (first run only) and installs PyPI packages: torchao, transformers, accelerate, safetensors, huggingface-hub, autoawq, llmcompressor, datasets, gguf, sentencepiece
- Removes the PyPI torch so Python falls through to the Flox-provided CUDA-enabled build via `--system-site-packages`
- Applies compatibility patches for AutoAWQ and FP8/torchao (see AutoAWQ Compatibility Patches and FP8 Compatibility Patches)
- Provides the `quantize-awq`, `quantize-fp8-local`, `quantize-fp8-production`, `quantize-llmc-local`, `quantize-llmc-production`, `quantize-gguf-local`, `quantize-gguf-production`, and `list-models` commands (from the `model-quantizer` package)
No Docker, no conda, no manual virtualenv management. Clone the repo, install Flox (<70MB), activate, quantize.
- Install Flox (one-time): follow the instructions at flox.dev/docs for your platform (apt, rpm, nix, or Docker).
- Clone and activate:
```shell
git clone <this-repo>
cd model-quantizer
flox activate
```

The first activation provisions the Python venv and installs PyPI packages. Subsequent activations are instant.
```shell
flox activate

# AWQ 4-bit (best compression, ~3.5x smaller)
quantize-awq Qwen/Qwen3-8B

# FP8 via torchao (native Hopper/Blackwell, ~2x smaller)
quantize-fp8-local Qwen/Qwen3-8B

# LLM Compressor -- FP8 dynamic (data-free, compressed-tensors for vLLM)
quantize-llmc-local Qwen/Qwen3-8B

# LLM Compressor -- W4A16 GPTQ (calibration-based)
quantize-llmc-local Qwen/Qwen3-8B gptq --online

# GGUF for llama.cpp ecosystem (ollama, LM Studio, koboldcpp)
quantize-gguf-local Qwen/Qwen3-8B Q4_K_M        # fast, dev/iteration
quantize-gguf-production Qwen/Qwen3-8B Q4_K_M   # validated, CI/pipeline

# List cached source and quantized models
list-models
```

All scripts default to offline mode, so models must already be in the cache directory. Pass `--online` (FP8, LLMC, GGUF) or set `HF_OFFLINE=0` (AWQ) to download models on the fly. AWQ uses environment variables rather than CLI flags for most configuration — see AWQ Environment Variables. If `HF_TOKEN` is set in your shell environment, the HuggingFace libraries will use it automatically for gated model access.
The cache directory defaults to $FLOX_ENV_PROJECT (models live under $FLOX_ENV_PROJECT/hub/models--*/) but can be pointed anywhere — either by overriding MODEL_CACHE_DIR at activation time or by editing the default in .flox/env/manifest.toml:
```shell
# Override at activation time
MODEL_CACHE_DIR=/data/models flox activate

# Or change the default permanently in the manifest [vars] section:
# MODEL_CACHE_DIR = "/data/models"
```

- Four quantization backends: AWQ 4-bit INT (AutoAWQ is unmaintained; compatibility patches are applied automatically — see AutoAWQ Compatibility Patches); FP8 torchao (E4M3); LLM Compressor (FP8, GPTQ, W8A8, NVFP4); GGUF (llama.cpp ecosystem)
- Offline-first: all scripts default to cache-only model loading
- HF hub cache output layout: quantized models appear as siblings of source models, ready for vLLM
- Content-addressed output: each run produces a deterministic snapshot ID derived from a full configuration fingerprint; identical parameters always map to the same output path
- Idempotent: re-running with the same parameters skips quantization if output already exists
- Concurrent-safe: file-level locking prevents races when multiple quantization jobs target the same output directory
- Smoke tests: optional forward pass and token generation on quantized output to verify correctness before publishing
- JSON output mode: machine-readable status for pipeline integration (`--json` on all quantization scripts)
- Checksum manifests (AWQ): opt-in SHA-256 checksums for all output files
- CI-ready: same Flox environment in local dev and CI, with GitHub Actions support (see CI / Pipeline Usage)
- Auto-provisioned Python venv with PyPI packages on first activation
Uses AutoAWQ for activation-aware 4-bit integer weight quantization. Produces the best compression ratio (~3.5x) with good quality retention. Works on all CUDA GPUs.
```shell
# Default: 4-bit, group_size=128
quantize-awq Qwen/Qwen3-8B

# Custom bit-width and group size
quantize-awq Qwen/Qwen3-8B 4 64

# Force re-quantize an existing output
FORCE_REQUANTIZE=1 quantize-awq Qwen/Qwen3-8B

# JSON output for scripting
quantize-awq --json Qwen/Qwen3-8B

# Skip smoke test for faster turnaround
quantize-awq --smoke-test off Qwen/Qwen3-8B
```

The output model is saved as `<model-id>-AWQ` in the cache directory.
Uses torchao to convert BF16 weights to FP8 E4M3 (weight-only, data-free). Approximately 2x compression vs BF16. Native FP8 hardware acceleration on Hopper (SM90: H100, H200) and Blackwell (SM100/SM120: B200, RTX 5090). Note: L40S is Ada Lovelace SM89, not Hopper — it does not have native FP8 compute but can still load FP8 checkpoints via dequantization.
Two variants are provided:
- `quantize-fp8-local` — fast, lightweight. Basic output validation. Best for local development and quick iteration.
- `quantize-fp8-production` — adds `--json-strict` for structured JSON on both success and error, `--reload-validate-device` for reload validation, `--lock-mode`/`--lock-timeout` for stronger locking, and stage-based error tracking. Best for CI, serving pipelines, and long-term artifact storage.
Both share the same CLI interface and output layout. The production variant adds extra options (see FP8 Options).
When to choose: Use quantize-fp8-local when iterating on FP8 quantization or testing new models. Use quantize-fp8-production when the output will be served in production, stored as a build artifact, or generated in CI where you need structured error reporting and artifact integrity guarantees.
```shell
# Default: offline, auto device selection
quantize-fp8-local Qwen/Qwen3-8B

# Force rebuild, run smoke test after
quantize-fp8-local --force --smoke-test Qwen/Qwen3-8B

# Safetensors output (experimental)
quantize-fp8-local --allow-safetensors Qwen/Qwen3-8B

# Custom shard size for large models
quantize-fp8-local --max-shard-size 4GB Qwen/Qwen3-8B

# JSON output for scripting
quantize-fp8-local --json Qwen/Qwen3-8B

# Production: validated output with strict JSON errors
quantize-fp8-production --json Qwen/Qwen3-8B

# Production: reload validation on CPU
quantize-fp8-production --json-strict --reload-validate-device cpu Qwen/Qwen3-8B
```

The output model is saved as `<model-id>-FP8-TORCHAO` in the cache directory.
Uses llm-compressor (vLLM project) for unified quantization. Produces compressed-tensors format loaded natively by vLLM without format conversion.
| Scheme | Command | Calibration | Output Suffix |
|---|---|---|---|
| FP8 dynamic | `quantize-llmc-local <model>` | No (data-free) | `-FP8-DYNAMIC` |
| FP8 block | `quantize-llmc-local <model> fp8 --fp8-scheme block` | No (data-free) | `-FP8-BLOCK` |
| W4A16 GPTQ | `quantize-llmc-local <model> gptq` | Yes | `-W4A16-GPTQ` |
| W8A8 SmoothQuant | `quantize-llmc-local <model> w8a8` | Yes | `-W8A8-SQ-GPTQ` |
| NVFP4 | `quantize-llmc-local <model> nvfp4` | Yes | `-NVFP4` |
Two variants are provided:
- `quantize-llmc-local` — fast, lightweight. Best for local development and quick iteration.
- `quantize-llmc-production` — adds strict shell mode with ERR traps, stage-based error tracking via `--json-strict`, artifact manifests with SHA-256 checksums, atomic publish from a temp dir, and extended output validation. Best for CI, serving pipelines, and long-term artifact storage.
Both share the same CLI interface and output layout. The production variant adds --json-strict.
When to choose: Use quantize-llmc-local when iterating on quantization schemes or testing new
models. Use quantize-llmc-production when the output will be served in production, stored as a
build artifact, or generated in CI where you need structured error reporting and artifact integrity
guarantees.
```shell
# FP8 dynamic (default, data-free, no network needed)
quantize-llmc-local Qwen/Qwen3-8B

# W4A16 GPTQ with custom calibration parameters
quantize-llmc-local Qwen/Qwen3-8B gptq --online --num-samples 1024 --seq-length 4096

# W8A8 SmoothQuant
quantize-llmc-local Qwen/Qwen3-8B w8a8 --online

# NVFP4 (Blackwell-native)
quantize-llmc-local Qwen/Qwen3-8B nvfp4 --online

# Validate output by loading in vLLM and running generation
quantize-llmc-local Qwen/Qwen3-8B --validate --validate-prompt "Explain gravity."

# Use a local calibration dataset
quantize-llmc-local Qwen/Qwen3-8B gptq --dataset-path ./calibration.jsonl --text-column content

# Sequential pipeline for large models that exceed GPU memory
quantize-llmc-local Qwen/Qwen3-32B gptq --online --pipeline sequential

# JSON output for scripting
quantize-llmc-local --json Qwen/Qwen3-8B

# Production: validated output with artifact manifest
quantize-llmc-production --json Qwen/Qwen3-8B

# Production: strict JSON errors
quantize-llmc-production --json-strict Qwen/Qwen3-8B gptq --online
```

Schemes that require calibration (gptq, w8a8, nvfp4) need a dataset. The default is `open_platypus`. Pass `--online` if the dataset is not already cached.
Uses llama.cpp to convert HuggingFace models to GGUF format for the llama.cpp inference ecosystem (ollama, LM Studio, koboldcpp). Two-phase pipeline: convert HF safetensors to F16 GGUF, then quantize to the target type.
Two variants are provided:
- `quantize-gguf-local` — fast, lightweight. Performs basic file-existence checks on output. Best for local development and quick iteration.
- `quantize-gguf-production` — adds GGUF structural validation (magic, version, tensor count, alignment, data region bounds), artifact SHA-256 checksums, stage-based error tracking via `--json-strict` (each error includes the pipeline stage that failed), pre-publish integrity checks (fingerprint round-trip, metadata cross-validation, artifact hash verification), and extended fingerprints that include converter SHA-256, llama-quantize SHA-256, and gguf Python module version. Best for CI, serving pipelines, and long-term artifact storage.
Both share the same CLI interface and output layout. The production variant adds extra options (--json-strict, --smoke-ngl, --require-smoke-pass).
When to choose: Use quantize-gguf-local when iterating on quant types or testing new models — it's faster and has fewer dependencies. Use quantize-gguf-production when the output will be served in production, stored as a build artifact, or generated in CI where you need structured error reporting and artifact integrity guarantees.
```shell
# Default: Q4_K_M (good balance of size and quality)
quantize-gguf-local Qwen/Qwen3-8B

# Higher quality, larger file
quantize-gguf-local Qwen/Qwen3-8B Q5_K_M

# Maximum compression
quantize-gguf-local Qwen/Qwen3-8B Q2_K

# With importance matrix for better quality at low bit-widths
quantize-gguf-local Qwen/Qwen3-8B IQ4_XS --imatrix imatrix.dat

# Smoke test to verify output loads correctly
quantize-gguf-local --smoke-test Qwen/Qwen3-8B Q4_K_M

# JSON output for scripting
quantize-gguf-local --json Qwen/Qwen3-8B Q4_K_M

# Force rebuild
quantize-gguf-local --force Qwen/Qwen3-8B Q4_K_M

# Production: validated output with artifact SHA-256
quantize-gguf-production --json Qwen/Qwen3-8B Q4_K_M

# Production: strict JSON errors + required smoke test
quantize-gguf-production --json-strict --require-smoke-pass --smoke-ngl 99 Qwen/Qwen3-8B Q4_K_M
```

The F16 intermediate GGUF is cached by default at `$FLOX_ENV_CACHE/gguf-staging/`, so quantizing the same model to multiple types (Q4_K_M, Q5_K_S, Q8_0) only runs the conversion once. Pass `--no-cache-f16` to disable caching.
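The staging-cache behavior just described can be sketched as follows. This is an illustration only — `f16_staging_path` and `quantize_to_gguf` are hypothetical names, not the script's actual internals:

```python
import tempfile
from pathlib import Path

# Sketch of the two-phase GGUF pipeline with an F16 staging cache:
# phase 1 (HF -> F16 GGUF) runs once per model; phase 2 (F16 -> target
# quant type) runs per type. Names and layout are illustrative.

def f16_staging_path(cache_root: Path, model_id: str) -> Path:
    """One staging file per source model, shared by all target quant types."""
    safe = model_id.replace("/", "--")
    return cache_root / "gguf-staging" / f"{safe}-F16.gguf"

def quantize_to_gguf(cache_root: Path, model_id: str, quant_type: str,
                     convert, quantize) -> Path:
    """convert: HF -> F16 GGUF (expensive); quantize: F16 -> target type."""
    f16 = f16_staging_path(cache_root, model_id)
    if not f16.exists():                 # phase 1 runs only on a cache miss
        f16.parent.mkdir(parents=True, exist_ok=True)
        convert(model_id, f16)
    out = cache_root / f"{model_id.replace('/', '--')}-{quant_type}.gguf"
    quantize(f16, out)                   # phase 2 runs for every quant type
    return out
```

With this shape, quantizing the same model to Q4_K_M and then Q5_K_S triggers the expensive conversion only once; `--no-cache-f16` would correspond to skipping the existence check.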
Importance matrices (--imatrix): An importance matrix records per-weight activation statistics from a calibration corpus. Providing one via --imatrix imatrix.dat improves quality for aggressive quant types — particularly the IQ-series (IQ2_XXS through IQ4_NL) and low-K types (Q2_K, Q3_K_*). Generate one with llama-imatrix from llama.cpp against a representative text sample. Standard K-quants (Q4_K_M and above) see little benefit.
Intermediate precision (--convert-type): The default f16 is appropriate for most models. Use bf16 if the source model was trained in BF16 and you want to preserve the original precision through the intermediate GGUF stage; this avoids the minor rounding introduced by the F16 conversion.
Output model is saved as <model-id>-GGUF-<TYPE> in the cache directory.
| Method | Compression | Quality | GPU Support | Best For |
|---|---|---|---|---|
| AWQ 4-bit | ~3.5x | Good | All CUDA | Fitting larger models in limited VRAM |
| FP8 torchao | ~2x | Excellent | SM90+ | Quick FP8 checkpoint, data-free |
| FP8 dynamic/block (llmc) | ~2x | Excellent | SM90+ | compressed-tensors for vLLM |
| W4A16 GPTQ (llmc) | ~3.5x | Good | All CUDA | 4-bit with vLLM-native format |
| W8A8 SmoothQuant (llmc) | ~2x | Excellent | All CUDA | Production throughput, 8-bit |
| NVFP4 (llmc) | ~4x | Good | SM120 | Native Blackwell 4-bit float |
| GGUF Q4_K_M | ~3.5-4x | Good | CPU/Any | llama.cpp, ollama, LM Studio |
| GGUF Q5_K_M | ~3x | Very good | CPU/Any | Higher quality GGUF |
| GGUF Q8_0 | ~2x | Excellent | CPU/Any | Highest quality GGUF |
| Model | BF16 | AWQ/GPTQ 4-bit | FP8 | NVFP4 | GGUF Q4_K_M |
|---|---|---|---|---|---|
| 7-8B | 16 GB | 4.5 GB | 8 GB | ~4 GB | ~5 GB |
| 14B | 28 GB | 8 GB | 14 GB | ~7 GB | ~9 GB |
| 32B | 64 GB | 18 GB | 32 GB | ~16 GB | ~20 GB |
| 70B | 140 GB | 40 GB | 70 GB | ~35 GB | ~43 GB |
GGUF rows apply equally to both quantize-gguf-local and quantize-gguf-production — the quantization output is identical; only the validation level differs. See GGUF for details.
| Method | Calibration Required | Notes |
|---|---|---|
| FP8 torchao | No (data-free) | Static weight cast to E4M3 |
| FP8 dynamic/block (LLMC) | No (data-free) | Dynamic per-tensor or per-block scaling |
| GGUF (all types) | No (data-free) | Optional --imatrix improves low-bit quality |
| AWQ 4-bit | Yes (built-in default) | Uses calibration corpus for activation-aware quantization |
| W4A16 GPTQ (LLMC) | Yes | Requires --online or local dataset |
| W8A8 SmoothQuant (LLMC) | Yes | Requires --online or local dataset |
| NVFP4 (LLMC) | Yes | Requires --online or local dataset |
Data-free methods are faster, work fully offline, and produce deterministic output. Calibration-based methods adapt quantization parameters to the model's weight distribution, generally achieving better quality at the same bit-width — but require a representative dataset and network access (unless using a local dataset via --dataset-path).
When WRITE_LOCAL_REPO_LAYOUT=1 (the default), quantized models are written in HuggingFace hub cache structure:
```text
$QUANTIZED_OUTPUT_DIR/
  hub/
    models--Qwen--Qwen3-8B/                # source model (read-only)
      refs/main
      snapshots/<commit>/
        config.json, model.safetensors, ...
    models--Qwen--Qwen3-8B-AWQ/           # quantized output (AWQ example)
      refs/main
      snapshots/<snapshot-id>/
        config.json
        model.safetensors
        tokenizer.json
        tokenizer_config.json
        FINGERPRINT.json                   # AWQ, FP8, and GGUF
        awq_quantize_meta.json             # AWQ only
    models--Qwen--Qwen3-8B-GGUF-Q4_K_M/   # quantized output (GGUF example)
      refs/main
      snapshots/<snapshot-id>/
        Qwen3-8B-Q4_K_M.gguf               # single GGUF file
        FINGERPRINT.json
        gguf_quantize_info.json
```
The <snapshot-id> is a hash of the full quantization fingerprint (model ID, source commit, quant parameters, seed, etc.) — SHA-256 for AWQ and GGUF, SHA-1 for FP8 and LLMC. This makes each output content-addressed: changing any parameter produces a new snapshot directory, while identical parameters reuse the existing one.
When WRITE_LOCAL_REPO_LAYOUT=0, output is written to a flat directory under $QUANTIZED_OUTPUT_DIR/<model-id-suffix>/<snapshot-id>/.
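The content-addressing scheme can be sketched in a few lines. The exact fingerprint fields and canonicalization are script-specific (and differ between AWQ, FP8, LLMC, and GGUF), so treat this as an illustration:

```python
import hashlib
import json

# Illustrative content-addressed snapshot ID: canonicalize the fingerprint
# as sorted, compact JSON, then hash it. Real scripts use SHA-256 (AWQ,
# GGUF) or SHA-1 (FP8, LLMC) over their own field sets.

def snapshot_id(fingerprint: dict, algo: str = "sha256") -> str:
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.new(algo, canonical.encode()).hexdigest()

fp = {
    "model_id": "Qwen/Qwen3-8B",
    "source_commit": "abc123",   # placeholder commit for illustration
    "bits": 4,
    "group_size": 128,
    "seed": 1337,
}
sid = snapshot_id(fp)
```

Because the JSON is sorted and compact, field order does not matter: identical parameters always hash to the same snapshot directory, and changing any parameter produces a new one.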
These are set in the [vars] section of .flox/env/manifest.toml and apply to all scripts. Change the defaults directly in the manifest, or override per-session at activation time (e.g., MODEL_CACHE_DIR=/data/models flox activate):
| Variable | Default | Description |
|---|---|---|
| `MODEL_CACHE_DIR` | `$FLOX_ENV_PROJECT` | HuggingFace cache root (scripts append `hub/models--*/`) |
| `QUANTIZED_OUTPUT_DIR` | `$FLOX_ENV_PROJECT` | Output root for quantized models |
| `WRITE_LOCAL_REPO_LAYOUT` | `1` | Write HF hub cache layout for vLLM auto-discovery |
| Variable | Default | Description |
|---|---|---|
| `MODEL_REVISION` | `main` | HF revision (branch, tag, or 40-char commit hash) |
| `HF_OFFLINE` | `1` | Cache-only model loading (no network) |
| `TRUST_REMOTE_CODE` | `0` | Allow execution of model-repo custom code |
| `FORCE_REQUANTIZE` | `0` | Remove and rebuild existing complete output |
| `SHOW_SIZES` | `0` | Print source/output size comparison after quantization |
| `REQUIRE_CUDA` | `1` | Fail if CUDA is unavailable |
| `DEVICE_MAP` | (dynamic) | `auto`, `cuda0`, or `cpu`. Default: `cuda0` if CUDA available, else `cpu` |
| `QUANT_SEED` | `1337` | RNG seed for reproducibility |
| `DETERMINISTIC` | `0` | Request deterministic CUDA algorithms |
| `QUANT_SHARD_SIZE` | `5GB` | Weight shard size |
| `QUANT_SAFETENSORS` | `1` | Save in safetensors format |
| `SMOKE_TEST_MODE` | `full` | `full` (reload + generate), `fast` (in-memory), or `off` |
| `WRITE_CHECKSUMS` | `0` | Write FILES_SHA256.json manifest |
| `LOCK_TIMEOUT_SECONDS` | `0` | Seconds to wait for output lock (0 = fail immediately) |
| `LOCK_METHOD` | `auto` | `auto`, `flock`, or `mkdir` |
| `LOCK_STALE_SECONDS` | `0` | Remove stale locks older than this (0 = disabled) |
| `AWQ_CALIB_DATASET` | (unset) | Override calibration dataset |
| `AWQ_MAX_CALIB_SAMPLES` | (unset) | Max calibration samples |
| `AWQ_MAX_CALIB_SEQ_LEN` | (unset) | Max calibration sequence length |
| `AWQ_N_PARALLEL_CALIB_SAMPLES` | (unset) | Parallel calibration samples |
| `DETERMINISTIC_FAIL_CLOSED` | `1` | Make determinism setup errors fatal (0 = warn and continue) |
| `WRITE_WEIGHT_CHECKSUMS` | `0` | Include weight shards in FILES_SHA256.json |
| `FINGERPRINT_INCLUDE_VERS` | `1` | Include torch/transformers/awq versions in fingerprint |
| `FINGERPRINT_INCLUDE_SYS` | `0` | Include host/system info in fingerprint |
| `ALLOW_REMOTE_STALE` | `0` | Allow stale-lock cleanup across hosts on shared storage |
| `PYTHON` | `python3` | Python executable |
The FP8 script uses CLI flags for most configuration. It also respects MODEL_CACHE_DIR and QUANTIZED_OUTPUT_DIR (see Global Environment Variables above) and three version-gate env vars: MIN_TORCH_VERSION (default: 2.1.0), MIN_TRANSFORMERS_VERSION (default: 4.40.0), MIN_TORCHAO_VERSION (default: 0.10.0).
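A version gate like `MIN_TORCH_VERSION` can be sketched as below. The parsing here is an assumption for illustration — the real script may handle pre-release or local version segments differently:

```python
import os

# Illustrative minimum-version gate driven by an environment variable,
# mirroring MIN_TORCH_VERSION / MIN_TRANSFORMERS_VERSION / MIN_TORCHAO_VERSION.

def version_tuple(v: str) -> tuple[int, ...]:
    """Parse '2.9.1' or '2.9.1+cu121' into a comparable tuple of ints."""
    return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())

def check_min_version(name: str, installed: str,
                      env_var: str, default_min: str) -> None:
    """Fail if the installed version is below the (overridable) minimum."""
    required = os.environ.get(env_var, default_min)
    if version_tuple(installed) < version_tuple(required):
        raise SystemExit(f"{name} {installed} is older than required {required}")

# e.g. check_min_version("torch", torch.__version__, "MIN_TORCH_VERSION", "2.1.0")
```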
Both quantize-fp8-local and quantize-fp8-production share this core interface:
```text
quantize-fp8-local <model-id> [options]
quantize-fp8-production <model-id> [options]

  -c, --cache-dir DIR        HF cache root
  -o, --output-dir DIR       Output root
  -r, --revision REV         HF revision (default: main)
  --device MODE              auto|cpu|cuda (default: auto)
  --online                   Allow network access
  --trust-remote-code        Allow model repo custom code
  --force                    Rebuild even if output exists
  --suffix STR               Output suffix (default: -FP8-TORCHAO)
  --format FMT               torch|safetensors (default: torch)
  --allow-safetensors        Attempt safetensors format
  --max-shard-size STR       Weight shard size (e.g. 4GB)
  --offline-pick-latest      Pick newest cached snapshot when refs are missing
  --lock-ttl-seconds N       Lock TTL for cross-host stale handling (default: 21600)
  --smoke-test               Run generation on output
  --smoke-prompt STR         Prompt for smoke test (default: "Hello")
  --smoke-max-new-tokens N   Tokens to generate (default: 1)
  --smoke-temperature F      Temperature (default: 0.0)
  --no-validate              Skip structural validation
  --no-validate-quant        Skip quantization coverage check
  --validate-zip-crc         Run zip CRC checks on .bin shards (slow)
  --quant-min-ratio FLOAT    Min fraction of quantized layers (default: 0.80)
  --json                     JSON output on stdout (logs to stderr)
```
quantize-fp8-production adds:
```text
  --json-strict                 Like --json, and all failures also emit JSON to stdout
  --reload-validate-device DEV  cpu|skip (default: cpu)
  --lock-mode MODE              auto|fd|mkdir (default: auto)
  --lock-timeout N              Lock wait seconds (0=fail-fast, -1=unlimited, default: 0)
```
Both quantize-gguf-local and quantize-gguf-production share this core interface:
```text
quantize-gguf-local <model-id> [quant-type] [options]
quantize-gguf-production <model-id> [quant-type] [options]

Quant types: Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_0, Q4_1, Q4_K_S, Q4_K_M,
             Q5_0, Q5_1, Q5_K_S, Q5_K_M, Q6_K, Q8_0, F16, F32,
             IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS, IQ4_NL

  -c, --cache-dir DIR        HF cache root
  -o, --output-dir DIR       Output root
  -r, --revision REV         HF revision (default: main)
  --suffix STR               Override output suffix (default: -GGUF-<TYPE>)
  --online                   Allow network access
  --trust-remote-code        Allow model repo custom code
  --force                    Rebuild even if output exists
  --imatrix FILE             Importance matrix for quantization
  --convert-type TYPE        Intermediate precision: f16, bf16 (default: f16)
  --no-cache-f16             Do not cache intermediate F16 GGUF
  --threads N                Threads for llama-quantize (default: nproc)
  --smoke-test               Load output and generate tokens via llama-completion
  --smoke-prompt STR         Smoke test prompt (default: "Hello")
  --smoke-tokens N           Smoke test token count (default: 8)
  --lock-timeout N           Lock wait seconds (0=fail-fast, -1=unlimited, default: 0)
  --json                     JSON output on stdout (logs to stderr)
```
quantize-gguf-production adds:
```text
  --json-strict              Emit structured JSON on errors (not just success)
  --smoke-ngl N              GPU layers for smoke test (default: 0 = CPU)
  --require-smoke-pass       Make smoke test failure fatal
```
Both quantize-llmc-local and quantize-llmc-production share this core interface:
```text
quantize-llmc-local <model-id> [scheme] [options]
quantize-llmc-production <model-id> [scheme] [options]

Schemes: fp8 (default), gptq, w8a8, nvfp4

Quantization:
  --ignore LIST              Extra layer ignore patterns (comma-separated; repeatable)

FP8-specific:
  --fp8-scheme NAME          dynamic|block (default: dynamic)
  --fp8-pathway NAME         oneshot|model_free (default: oneshot)
  --model-free-device STR    Device for model_free_ptq (default: cuda:0)
  --model-free-workers N     Worker count for model_free_ptq (default: 8)

Calibration:
  --num-samples N            Calibration samples (default: 512)
  --seq-length N             Max sequence length (default: 2048)
  --batch-size N             Batch size (default: 1)
  --dataset NAME             HF dataset ID (default: open_platypus)
  --dataset-config NAME      HF dataset config name (optional)
  --dataset-path PATH        Local dataset (.json, .jsonl, .csv, .parquet, or directory)
  --text-column KEY          Text column name (default: text)
  --no-shuffle               Do not shuffle calibration samples
  --seed N                   RNG seed (default: 1234)
  --streaming                Stream dataset (hub ID or dvc:// path; online only)

Pipeline:
  --pipeline NAME            basic|datafree|sequential|independent
  --sequential-targets L     Decoder layer class names (comma-separated)
  --sequential-offload D     Offload device between layers (default: cpu)
  --no-qac                   Disable quantization-aware calibration
  --splits SPEC              Split percentages spec passed to oneshot
  --preprocessing-workers N  Dataset preprocessing workers
  --dataloader-workers N     DataLoader workers (default: 0)

Validation:
  --validate                 Load output in vLLM and run checks
  --validate-prompt TEXT     Smoke test prompt (default: "Hello!")
  --validate-suite PATH      JSONL or txt regression suite
  --validate-seed N          Seed for vLLM sampling (default: 1)
  --validate-max-tokens N    Max tokens per prompt (default: 64)
  --validate-min-chars N     Min chars in output to count as pass (default: 1)

General:
  --model-revision REV       HF revision (default: main)
  --online                   Allow network access
  --trust-remote-code        Allow model repo custom code
  --use-auth-token           Use HuggingFace auth for private repos
  --force                    Overwrite existing output
  --suffix STR               Override output suffix
  --lock-timeout SECONDS     Lock wait time (0=fail, -1=unlimited)
  --log-dir PATH             llmcompressor log directory
  --json                     JSON output on stdout (logs to stderr)
```
quantize-llmc-production adds:
```text
  --json-strict              Emit structured JSON on errors (not just success)
```
```shell
list-models [cache-dir]
```
Lists all HuggingFace models in the cache directory, showing model ID, total size, snapshot count, and detected quantization type ([AWQ], [FP8-TORCHAO], [compressed-tensors], [GGUF], [quantized]). The optional positional cache-dir argument overrides $MODEL_CACHE_DIR. Pass -h for usage help. list-models does not support --json.
All six quantization scripts implement file-level locking on the output directory to prevent concurrent quantization jobs from corrupting each other. The AWQ and FP8 scripts support both flock (preferred) and mkdir-based locks with configurable timeout and stale lock detection. The LLMC and GGUF scripts use flock with --lock-timeout. Lock timeout behavior: 0 (default) fails immediately if the lock is held, N > 0 waits up to N seconds, -1 waits indefinitely.
The production GGUF script also holds a separate flock on the F16 intermediate cache directory, allowing safe concurrent quantization of the same model to multiple types (e.g., Q4_K_M and Q5_K_S in parallel) without duplicate F16 conversions.
All six quantization scripts compute content-addressed snapshot IDs by hashing a fingerprint of all configuration inputs. For AWQ this includes: model ID, source commit, quantization parameters (bits, group size, zero point), calibration settings, device map, dtype policy, seed, determinism flag, save format, shard size, and optionally library versions and system info. FP8 includes a similar set plus script version and output format. LLMC hashes the model ID, resolved commit, revision, scheme, FP8 options, calibration parameters, and pipeline options. GGUF local hashes the model ID, source commit, quant type, convert type, imatrix SHA-256, script SHA, and llama.cpp version (7 fields). GGUF production extends this with converter SHA-256 (convert_hf_to_gguf.py), llama-quantize SHA-256, gguf Python module version, and trust_remote_code (11 fields). Identical configurations always produce the same output path.
Set DETERMINISTIC=1 (AWQ) or --seed N (LLMC) to improve reproducibility. The AWQ script sets Python, NumPy, and PyTorch seeds, and optionally enables torch.use_deterministic_algorithms(True). Full bitwise reproducibility across runs is not guaranteed due to CUDA non-determinism, but output quality is consistent.
- AWQ: `--smoke-test full` (default) reloads the saved checkpoint and runs a forward pass plus short generation. `fast` tests the in-memory model immediately after quantization. `off` skips generation but still validates output structure.
- FP8: `--smoke-test` loads the output checkpoint and generates one token.
- LLMC: `--validate` spawns a separate vLLM process to load the checkpoint and run generation, verifying the output is a valid vLLM-loadable model.
- GGUF (both variants): `--smoke-test` runs `llama-completion` to load the output GGUF and generate a short sequence, verifying the file is valid and loadable. The production variant adds `--smoke-ngl` to control GPU offloading and `--require-smoke-pass` to make failure fatal.

Note: The smoke test flag is named differently across scripts for historical reasons: `--smoke-test full|fast|off` for AWQ, `--smoke-test` (boolean) for FP8 and GGUF, and `--validate` for LLMC. All serve the same purpose — verifying the quantized output loads and generates tokens correctly.
All six quantization scripts support `--json` for machine-readable output on stdout (logs go to stderr); list-models does not support `--json`. Each emits a JSON object with status (`ok` or `exists`), source model, and output path.

- AWQ and FP8 also include source commit, snapshot ID, revision, and smoke test results; FP8 adds device mode, format, and validation coverage.
- LLMC local includes scheme, validation status, output size, and timestamp. LLMC production adds `--json-strict` for structured JSON on both success and error, with stage-based error tracking.
- GGUF includes quant type, GGUF filename, file size, and smoke test results; the GGUF production variant also includes `artifact_sha256`.

Both production variants (quantize-llmc-production and quantize-gguf-production) support `--json-strict`, which emits structured JSON on both success and error. Error objects include a `stage` field indicating where the failure occurred (one of: startup, preflight, locking, source-resolution, f16-conversion, quantization, artifact-validation, smoke-test, publish, complete). Successful runs (`ok`) include timing.
The scripts are designed for unattended operation. With --json, all human-readable logs go to stderr and a single JSON object is written to stdout on completion, making it straightforward to parse results in a pipeline.
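A pipeline consumer of this contract might look like the following sketch. Only the `status` values (`ok`/`exists`) and the `stage` field are taken from the description above; the helper itself and its error handling are assumptions:

```python
import json
import subprocess
import sys

# Illustrative consumer of the --json contract: one JSON object on stdout,
# human-readable logs on stderr.

def run_quantize(cmd: list[str]) -> dict:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    result = json.loads(proc.stdout)        # stderr carries only logs
    if result.get("status") not in ("ok", "exists"):
        stage = result.get("stage", "unknown")
        raise RuntimeError(f"quantization failed at stage {stage!r}")
    return result

# e.g. run_quantize(["quantize-fp8-production", "--json", "Qwen/Qwen3-8B"])
```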
Flox provides GitHub Actions for CI integration. The environment travels with the repo, so CI gets the same toolchain as local development.
The commands (quantize-fp8-local, quantize-fp8-production, quantize-llmc-local, quantize-llmc-production, quantize-gguf-local, quantize-gguf-production, etc.) are binaries from the model-quantizer package and work in all contexts — interactive sessions and CI alike:
```yaml
# .github/workflows/quantize.yml
jobs:
  quantize:
    runs-on: [self-hosted, gpu]   # needs NVIDIA GPU
    steps:
      - uses: actions/checkout@v4
      - uses: flox/install-flox-action@v2
      - uses: flox/activate-action@v1
        with:
          command: |
            quantize-fp8-production --online --json Qwen/Qwen3-8B > result.json
            cat result.json
```

For non-GitHub CI (GitLab, CircleCI, Jenkins), install Flox on the runner and use `flox activate --`:
```shell
flox activate -- quantize-fp8-production --online --json Qwen/Qwen3-8B > result.json
```

- Exit codes: 0 on success or if output already exists, 1 on error
- Idempotent: re-running with the same parameters detects existing output and exits immediately (JSON reports `"status": "exists"`)
- `--force`: bypasses the exists check and re-quantizes from scratch
- Locking: concurrent jobs targeting the same output serialize automatically via file locks. Lock timeout is configurable: `--lock-timeout 0` (default) fails immediately if the lock is held, `--lock-timeout N` waits up to N seconds, `--lock-timeout -1` waits indefinitely
- `--json`: structured output for parsing; logs on stderr won't pollute stdout
- `HF_TOKEN`: set in CI secrets for gated model access — the HuggingFace libraries pick it up automatically. For local development, export it in your shell profile or run `huggingface-cli login`
```shell
#!/usr/bin/env bash
# Quantize a list of models, collect results
# Run inside flox activate, or use: flox activate -- bash batch-quantize.sh
models=(Qwen/Qwen3-8B meta-llama/Llama-3.1-8B-Instruct google/gemma-3-4b-it)
for model in "${models[@]}"; do
  echo "--- $model ---" >&2
  quantize-llmc-local --online --json "$model" >> results.jsonl
done
```

Each line in results.jsonl is a self-contained JSON object with status, output path, timing, and validation.
AutoAWQ 0.2.9 is the last release and is no longer maintained. The environment applies three patches on first activation to maintain compatibility with transformers 4.52+:
- GELUTanh rename: `PytorchGELUTanh` was renamed to `GELUTanh` in transformers. Patched in `awq/quantize/scale.py`.
- Catcher attribute proxy: The `Catcher` wrapper class in `awq/quantize/quantizer.py` does not proxy attribute access to the wrapped decoder layer. Transformers 4.57+ accesses `attention_type` on Qwen2/Qwen3 layers, crashing calibration. Patched by adding a `__getattr__` fallback.
- Deprecation noise: AWQ's `__init__.py` overrides Python's warning filters and emits a deprecation notice on every import. Patched to remove the `simplefilter` override and `warnings.warn` call.
These patches are applied automatically during venv provisioning and do not require manual intervention.
TorchAO's FP8 weight-only quantization (Float8WeightOnlyConfig) produces Float8Tensor subclasses that store quantized data in .qdata (float8_e4m3fn) and per-row scales in .scale (float32). These tensor subclasses do not support low-level operations (storage(), data_ptr(), view(-1)) that safetensors and transformers rely on during save_pretrained(). The environment applies two categories of fixes:
Five patches are applied to safetensors and transformers during venv provisioning:
- `storage_ptr()` fallback (safetensors `torch.py`): The original code returns `0` when `storage().data_ptr()` raises `NotImplementedError`, causing all FP8 tensors to appear as shared storage. Patched to catch `RuntimeError` as well and return `id(tensor)` — a unique identifier per tensor — so safetensors correctly detects disjoint tensors.
- `_end_ptr()` wrapping (safetensors `torch.py`): `view(-1)` and `data_ptr()` fail on Float8Tensor. Wrapped in try/except to fall back to `id(tensor)`.
- `_end_ptr()` wrapping (transformers `modeling_utils.py`): Same issue as above but in transformers' copy of the function. Uses `tensor.element_size()` instead of `_SIZE[tensor.dtype]`.
- `_find_disjoint()` wrapping (transformers `modeling_utils.py`): `tensor.data_ptr()` call wrapped in try/except to treat FP8 tensors as unique (non-shared).
- `_find_identical()` wrapping (transformers `modeling_utils.py`): Same pattern — prevents false-positive shared-tensor detection for FP8 tensors.
All patches are idempotent (guarded by `grep -q` for a unique marker string) and version-sensitive: if safetensors or transformers update to natively support FP8 tensor subclasses, the patches print a warning and skip. Targets: safetensors 0.7.0, transformers 5.3.0.
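The idempotent, version-sensitive patching scheme can be sketched as a marker-guarded text replacement. This is a simplified illustration of the approach, not the environment's actual patching code; the marker string and function name are hypothetical:

```python
from pathlib import Path

MARKER = "# model-quantizer-patch"  # hypothetical marker comment

def apply_patch(target, old, new):
    """Idempotent, version-sensitive text patch: skip if the marker is
    already present (patched before) or if `old` is missing (the library
    was updated past the patched code)."""
    path = Path(target)
    text = path.read_text()
    if MARKER in text:
        return "already-patched"
    if old not in text:
        print(f"warning: pattern not found in {target}, skipping patch")
        return "skipped"  # upstream changed; warn instead of breaking
    path.write_text(text.replace(old, f"{new}  {MARKER}", 1))
    return "patched"
```

The marker check runs first, so re-activating the environment never double-applies a patch, and a library upgrade that removes the target code degrades to a warning rather than a corrupted file.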
The FP8 quantization scripts (`quantize-fp8-local.sh`, `quantize-fp8-production.sh`) apply two additional fixes:
- Float8Tensor unwrapping: Before calling `save_pretrained()`, the script extracts `.qdata` and `.scale` from each Float8Tensor into plain tensors (with a `_scale` suffix for the scales), then passes the unwrapped state dict via `state_dict=unwrapped`. This bypasses all remaining serialization issues.
- Validation safetensors fallback (local script only): Transformers 5.x always saves as safetensors regardless of the `safe_serialization` flag. The validation step now checks for `.safetensors` files as a fallback when `out_format="torch"` and no `.bin` files are found. The production script's `validate_saved_artifacts()` already checks both formats.
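The unwrapping step amounts to flattening each tensor subclass into plain entries before serialization. A minimal sketch, using a stand-in class instead of torchao's real `Float8Tensor` and assuming the `_scale` key suffix described above:

```python
class FakeFloat8Tensor:
    """Stand-in for torchao's Float8Tensor: quantized payload in .qdata,
    per-row scales in .scale (plain lists here instead of real tensors)."""
    def __init__(self, qdata, scale):
        self.qdata = qdata
        self.scale = scale

def unwrap_state_dict(state_dict):
    """Replace Float8-style subclasses with plain entries so that
    save_pretrained(state_dict=...) never touches the subclass."""
    unwrapped = {}
    for name, tensor in state_dict.items():
        if hasattr(tensor, "qdata"):
            unwrapped[name] = tensor.qdata          # raw float8 payload
            unwrapped[name + "_scale"] = tensor.scale  # per-row scales
        else:
            unwrapped[name] = tensor  # ordinary tensors pass through
    return unwrapped
```

An inference stack that understands this layout can reconstruct the quantized weights from the paired `<name>` / `<name>_scale` entries.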
From Flox (declarative, pinned):
| Package | Version | Description |
|---|---|---|
| `flox-cuda/python3Packages.torch` | 2.9.1 | PyTorch with CUDA support |
| `uv` | latest | Python package installer |
| `flox-cuda/llama-cpp` | latest | llama.cpp tools (convert-hf-to-gguf, llama-quantize, llama-completion) |
| `gcc-unwrapped` | latest | C++ standard library (libstdc++) for PyPI native extensions |
| `zlib` | latest | Compression library required by numpy |
From PyPI via uv (auto-provisioned on first `flox activate`):

| Package | Description |
|---|---|
| `torchao` | PyTorch native quantization (FP8 E4M3) |
| `transformers` | HuggingFace Transformers |
| `accelerate` | Model parallelism and device mapping |
| `safetensors` | Safe tensor serialization |
| `huggingface-hub` | HuggingFace Hub client |
| `autoawq` | AWQ 4-bit quantization |
| `llmcompressor` | vLLM's unified quantization library |
| `datasets` | HuggingFace Datasets (calibration data for GPTQ/W8A8/NVFP4) |
| `gguf` | GGUF format support (required by convert-hf-to-gguf) |
| `sentencepiece` | Tokenizer library (required by some model conversions) |
PyPI torch is automatically removed after installation so Python falls through to the Flox-provided CUDA-enabled torch via `--system-site-packages`.
- OS: Linux (x86_64 or aarch64)
- GPU: NVIDIA GPU with CUDA support. FP8 methods require SM90+ (Hopper: H100, H200) or SM120+ (Blackwell: RTX 5090, B200) for native hardware acceleration; L40S is Ada Lovelace (SM89) and does not have native FP8 compute. AWQ and GPTQ work on all CUDA GPUs. GGUF quantization and inference are CPU-only and do not require a GPU.
- Driver: NVIDIA driver compatible with CUDA 12.x (driver 525+)
- VRAM: Depends on model size. 7-8B models need ~16 GB for loading + quantization workspace. AWQ and GPTQ 4-bit outputs fit larger models in less VRAM at inference time.
- Disk: Source model + quantized output. HuggingFace-format methods (AWQ, FP8, LLMC) need ~2x the source model size. GGUF needs ~2.5x when caching the F16 intermediate (shared across quant types), or ~2x with `--no-cache-f16`.
- Flox: must be installed (see Setup)
The model has not been downloaded. Pass `--online` to allow the script to fetch it, or download the model separately before running in offline mode.
A previous run was interrupted and left a stale lock. Remove it manually:
```bash
# AWQ locks (inside the output model directory)
rm -rf $QUANTIZED_OUTPUT_DIR/hub/models--<org>--<model>-<suffix>/.quantize.lockdir
rm -f $QUANTIZED_OUTPUT_DIR/hub/models--<org>--<model>-<suffix>/.quantize.lock
# FP8 locks (inside the output model directory)
rm -rf $QUANTIZED_OUTPUT_DIR/hub/models--<org>--<model>-<suffix>/.quantize.lock
# LLMC / GGUF locks (in the output root)
rm -f $QUANTIZED_OUTPUT_DIR/.quantize-<model-slug>.lock
```

For AWQ, set `LOCK_STALE_SECONDS=300` to automatically clean stale locks older than 5 minutes.
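The age-based stale-lock cleanup can be sketched as an mtime check. This is an illustration of the mechanism behind `LOCK_STALE_SECONDS`, not the scripts' actual code; the function name is hypothetical:

```python
import os
import time

def clean_stale_lock(path, stale_seconds=300):
    """Remove a lock file older than stale_seconds, mirroring
    LOCK_STALE_SECONDS=300. Fresh locks are left alone."""
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return False  # nothing to clean
    if age < stale_seconds:
        return False  # another run may legitimately hold the lock
    os.remove(path)
    return True
```

The mtime check means a lock is only treated as stale once its owner has made no progress for the full threshold, so a slow-but-alive quantization run keeps its lock.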
The model is too large for available VRAM. Options:
- Close other GPU-using processes (`nvidia-smi` to check)
- Use `DEVICE_MAP=auto` or `--device auto` for automatic multi-GPU or CPU offloading
- For LLMC, use `--pipeline sequential` to quantize one decoder layer at a time
- Quantize a smaller model or use a machine with more VRAM
PyTorch cannot see the GPU. Check that `nvidia-smi` works and that the Flox-provided torch has CUDA support:

```bash
flox activate
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```

If the Python venv gets into a bad state (version mismatches, broken patches), delete it and re-activate:

```bash
rm -rf .flox/cache/venv
flox activate  # recreates venv from scratch
```

- AutoAWQ -- AWQ quantization library
- torchao -- PyTorch native quantization
- llm-compressor -- vLLM's unified quantization
- llama.cpp -- GGUF quantization and inference
- vLLM quantization docs -- Loading quantized models in vLLM
- Flox -- Reproducible development environments