[v1.2.0] 2026-06-08
Save/Load Support for JointQ, RTN, and OneBit Quantizers
JointQ: Added get_quant_config(), finalize_quant_config_for_save(), and create_inference_layer() to JointQ class (onecomp/quantizer/jointq/_jointq.py)
Emits quant_method="gptq" to reuse GPTQLinear and vLLM GPTQ plugin (JointQ uses the same scale/zero/assignment structure as GPTQ)
create_inference_layer() converts JointQ's 3D assignment (out_features, num_groups, group_size) to 2D qweight (out_features, in_features) matching GPTQ format, with scale/zero transposition
Handles actorder permutation: restores original column order before passing to GPTQLinear so g_idx is constructed correctly
Symmetric quantization: shifts signed integers [-2^(n-1), 2^(n-1)-1] to unsigned [0, 2^n - 1] for GPTQLinear bit packing
Added bits == 1 warning in validate_params(): GPTQLinear weight packing does not support 1-bit; inference layer must be built with pack_weights=False
Added _build_quantization_bits() static method to emit per-layer quantization_bits metadata for mixed-precision save
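The symmetric shift and the 3D-to-2D flattening described for JointQ above can be illustrated with a small dependency-free sketch. It is not the code in _jointq.py: the function names are hypothetical, plain nested lists stand in for tensors, and the actorder permutation and scale/zero transposition are left out.

```python
def shift_to_unsigned(code: int, bits: int) -> int:
    """Map a signed code in [-2^(bits-1), 2^(bits-1)-1] to [0, 2^bits - 1]."""
    offset = 1 << (bits - 1)
    if not -offset <= code <= offset - 1:
        raise ValueError(f"code {code} is out of range for {bits}-bit symmetric quantization")
    return code + offset


def flatten_assignment(assignment, bits):
    """Flatten (out_features, num_groups, group_size) codes to (out_features, in_features).

    Groups are concatenated in order, so in_features = num_groups * group_size.
    Each signed code is shifted to the unsigned range on the way.
    """
    return [
        [shift_to_unsigned(code, bits) for group in row for code in group]
        for row in assignment
    ]


# One output row, two groups of two signed 2-bit codes.
assignment = [[[-2, -1], [0, 1]]]
qweight = flatten_assignment(assignment, bits=2)
print(qweight)  # [[0, 1, 2, 3]]
```

With bits=2 the offset is 2, so the signed range [-2, 1] maps onto [0, 3], which is the unsigned range that bit packing expects.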
RTN: Added get_quant_config(), finalize_quant_config_for_save(), create_inference_layer(), and RTNResult.compute_dequantized_weight() to RTN class (onecomp/quantizer/rtn/_rtn.py)
Emits quant_method="gptq" to reuse GPTQLinear and vLLM GPTQ plugin (RTN uses the same qweight/scales/qzeros tensor format)
compute_dequantized_weight() implements W = (quantized_weight - zero) * scale with per-channel and group-wise paths
create_inference_layer() transposes scale/zero from (out_features, num_groups) to (num_groups, out_features) for GPTQLinear compatibility
Added _build_quantization_bits() static method for per-layer metadata
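As an illustration of the RTN items above, the sketch below applies W = (quantized_weight - zero) * scale to a single output row on both the per-channel and the group-wise path, and shows the (out_features, num_groups) to (num_groups, out_features) transposition. The helper names are hypothetical and lists stand in for tensors; the real RTNResult.compute_dequantized_weight() may be organised differently.

```python
def dequantize_row(q_row, scale, zero, group_size=None):
    """Dequantize one output row with W = (q - zero) * scale.

    Per-channel path (group_size is None): scale and zero are scalars for the row.
    Group-wise path: scale and zero are lists with one entry per group.
    """
    if group_size is None:
        return [(q - zero) * scale for q in q_row]
    if len(q_row) % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    return [
        (q - zero[j // group_size]) * scale[j // group_size]
        for j, q in enumerate(q_row)
    ]


def transpose(matrix):
    """Turn (out_features, num_groups) into (num_groups, out_features)."""
    return [list(column) for column in zip(*matrix)]


per_channel = dequantize_row([0, 1, 2, 3], scale=0.5, zero=2)
group_wise = dequantize_row([0, 3, 1, 2], scale=[0.5, 2.0], zero=[1, 2], group_size=2)
print(per_channel)  # [-1.0, -0.5, 0.0, 0.5]
print(group_wise)   # [-0.5, 1.0, -2.0, 0.0]
```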
OneBit: Added get_quant_config(), finalize_quant_config_for_save(), create_inference_layer(), and OnebitResult.compute_dequantized_weight() to Onebit class (onecomp/quantizer/onebit/_onebit.py)
Emits quant_method="onebit" with OneBit-specific parameters (iters, use_importance_scaling, use_balancing, balance_iters, balance_alpha)
create_inference_layer() builds OneBitLinear via OneBitLinear.from_quantization_result()
Added _build_quantization_bits() static method for per-layer metadata
Apple Silicon / macOS support
MPS quantization: GPTQ (and AutoBit with GPTQ-only candidates) on device="mps"; cross-platform empty_cache() via new onecomp/utils/device.py (runner.py, quantizer/gptq/_gptq.py, quantizer/_quantizer.py)
MPS device placement (GPTQ on CPU, QEP correction on MPS): With device="mps", run_gptq moves the Hessian and weights to CPU for the full column-wise GPTQ loop (including inverse-Hessian Cholesky). The main reason is not missing Cholesky kernels on MPS (recent PyTorch supports them). If the GPTQ loop stayed on MPS, maxq.item() inside quantize() would run once per column, and each call waits for pending MPS work to finish and then reads a single scalar back to the host. This is a per-column host sync, not a full matrix copy per column, and the resulting loop is often several times slower than the CPU loop on Apple Silicon (~4× in internal benchmarks with PyTorch 2.12). When QEP weight correction runs (adjust_weight, typically under qep=True), per-layer work stays on MPS (e.g. weight @ delta_hatX); only the Cholesky solve uses CPU via _safe_cholesky_and_solve (one solve per layer). A full CPU fallback for QEP does not materially improve speed. Calibration forwards may still use MPS. Details: README (macOS / MPS).
MPS inference: load saved quantized models on Mac with QuantizedModelLoader + Transformers generate() (GemLite/vLLM remain Linux + CUDA)
macOS uv sync: added darwin to tool.uv.environments, --extra mps for MPS-enabled PyTorch from PyPI; --extra cpu is Linux-only (pytorch-cpu index); Linux-only markers on CUDA extras (cu118–cu130)
New Feature: Dashboard
Added dashboard/, a browser-based web app for OneCompression on SLURM-managed HPC GPU nodes without Docker: pick a Hugging Face model and quantization settings in the UI, run jobs on the GPU, deploy the quantized checkpoint, and validate inference via chat
Stack: React + Vite frontend (local PC), FastAPI API, Celery worker + user-built Redis, SQLite job DB, per-job output under backend/tmp/quantized/; CUDA quantization via onecomp and chat deploy via a separate vLLM subprocess from the same backend/.venv (onecomp + vllm>=0.21 in pyproject.toml)
Quantization methods exposed in the UI: gptq, autobit, jointq, and auto_run (VRAM-based bitwidth / group size); optional QEP (not with JointQ); fractional bit widths for autobit / auto_run
New Feature: Global PTQ (Post-Training Quantization)
Added GlobalPTQ and GlobalPTQDistributed post-process classes for KL-distillation-based global optimisation of continuous quantization parameters (scales and zeros for GPTQ; scaling factors for DBF)
GlobalPTQ: Single-GPU implementation with cosine-warmup LR scheduling, early stopping, mixed-precision support, and gradient accumulation
GlobalPTQDistributed: Multi-GPU implementation using HuggingFace Trainer + DeepSpeed ZeRO-2, supporting KL divergence and/or NTP loss with automatic best-state rollback
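To make the KL-distillation objective concrete, here is a minimal pure-Python sketch of a KL divergence between a full-precision (teacher) and a quantized (student) output distribution at one token position. The direction KL(teacher || student) and the function names are assumptions for illustration; the release notes do not state how GlobalPTQ defines or reduces its loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Teacher is uniform over two tokens; student puts 0.75 on the second token.
loss = kl_distillation_loss([0.0, 0.0], [0.0, math.log(3.0)])
print(round(loss, 4))  # 0.1438
```

The loss is zero when the two sets of logits produce the same distribution and grows as the quantized model drifts away from the original, which is what makes it usable as a target for tuning scales and zeros.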
Evaluation
Added onecomp.eval and the onecomp-eval CLI: one vLLM server, subprocess evaluators, aggregated summary.json / summary.csv
Added .pre-commit-config.yaml with black, isort, and local hooks (no-japanese, copyright-header, no-email-address); install with uv sync --extra dev then pre-commit install (see README)
OneBitLinear Inference Layer Improvements
Added OneBitLinear.from_quantization_result() class method: builds OneBitLinear from OnebitResult (mirrors the pattern used by GPTQLinear and DoubleBinaryLinear) (onecomp/quantizer/onebit/onebit_layer.py)
Added OneBitLinear.from_saved_state() class method: reconstructs OneBitLinear from saved state_dict tensors (a, b, sign_packed, optional bias), using the same cls.__new__ pattern as DoubleBinaryLinear (onecomp/quantizer/onebit/onebit_layer.py)
Removed preunpack parameter from OneBitLinear.__init__() and replace_linear_with_onebit_layer(): sign matrix is now always stored as packed uint8 and unpacked on demand during forward(), matching the DBF inference layer pattern (onecomp/quantizer/onebit/onebit_layer.py)
Normalized buffers to FP16 with detach() in OneBitLinear.__init__() to drop autograd graph
Added _load_from_state_dict() override to clear sign_matrix cache when loading from checkpoint
Extracted _unpack_sign_matrix() helper for sign matrix unpacking logic
Removed unreferenced functions replace_linear_with_onebit_layer() and extract_onebit_weights_for_save() from onebit_layer.py: layer construction is now handled by OneBitLinear.from_quantization_result() / OneBitLinear.from_saved_state(), and save-time weight extraction is covered by the unified create_inference_layer() / state_dict() path (onecomp/quantizer/onebit/onebit_layer.py)
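The packed-uint8 storage of the sign matrix can be pictured with the sketch below, which packs eight ±1 signs into one byte and unpacks them on demand. The bit order (most significant bit first) and the +1 to 1 / -1 to 0 mapping are assumptions made for illustration; the actual my_pack() / my_unpack() helpers may use a different layout.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 values into bytes, eight signs per byte, MSB first."""
    if len(signs) % 8 != 0:
        raise ValueError("number of signs must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(signs), 8):
        byte = 0
        for sign in signs[start:start + 8]:
            byte = (byte << 1) | (1 if sign > 0 else 0)
        packed.append(byte)
    return bytes(packed)


def unpack_signs(packed):
    """Inverse of pack_signs: expand each byte back into eight +1/-1 values."""
    return [
        1 if (byte >> shift) & 1 else -1
        for byte in packed
        for shift in range(7, -1, -1)
    ]


signs = [1, -1, 1, 1, -1, -1, -1, 1]
packed = pack_signs(signs)
print(packed.hex())                   # b1
print(unpack_signs(packed) == signs)  # True
```

Keeping only the packed form costs one bit per weight in storage and moves the unpacking cost to forward().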
QuantizedModelLoader: OneBit Support
QuantizedModelLoader now supports quant_method="onebit" (onecomp/quantized_model_loader.py)
Added OneBitLinear to import and layer replacement logic
Added OneBitLinear.from_saved_state() call path for creating empty OneBit layers during model loading
Hadamard hook registration now recognizes OneBitLinear as a quantized layer class
BlockWisePTQ / CBQ OneBit Optimizer Compatibility
Updated OneBit block-wise and cross-block quantization (CBQ) optimizers to work with packed-only OneBitLinear (onecomp/post_process/_blockwise/onebit_block_optimizer.py, onecomp/post_process/_blockwise/onebit_cbq_optimizer.py)
Reads current sign matrices from sign_packed via my_unpack() when sign_matrix is not present, while still allowing sign_matrix as a temporary optimization override
Writes sign updates back to sign_packed with my_pack() and clears sign_matrix so packed signs remain the single source of truth after hard evaluation, best-state restore, and final updates
Hoisted my_pack / my_unpack imports in the OneBit CBQ optimizer
Clarified OneBitLinear.sign_matrix as a non-persistent temporary override used by optimization flows such as BlockWisePTQ and CBQ (onecomp/quantizer/onebit/onebit_layer.py)
Bug Fix
Fixed GPTQLinear.from_saved_state(): _weight_is_packed now defaults to False when wbits == 1 (JointQ wbits=1 checkpoints are saved with pack_weights=False because GPTQLinear packing does not support 1-bit) (onecomp/quantizer/gptq/gptq_layer.py)
Fixed redundant symmetric shift in RTN inference layer (onecomp/quantizer/rtn/_rtn.py)
Fixed run_onebit() returning False on NaN/Inf detection; now raises ValueError with proper GPU tensor cleanup to prevent OOM cascading (onecomp/quantizer/onebit/onebit_impl.py)
Removed pre-computed dequantized_weight from run_onebit() return dict and OnebitResult; dequantized weight is now computed on demand via compute_dequantized_weight() (onecomp/quantizer/onebit/onebit_impl.py, onecomp/quantizer/onebit/_onebit.py)
QuantizedModelLoader._cast_fp16_to_target_dtype() now skips OneBitLinear in addition to GPTQLinear and DoubleBinaryLinear, so OneBit's fp16 scaling buffers (a, b, bias) are preserved when loading a OneBit-quantized model that requires bfloat16 (e.g. Gemma 3 / Gemma 4 detected via needs_bfloat16). Without this, the post-load safety-net cast rewrote OneBit's stored fp16 metadata to bfloat16, breaking the dtype contract that OneBitLinear.forward relies on (self.a.to(x.dtype) / self.b.to(x.dtype) casts to the activation dtype at compute time). Updated the function's docstring to list OneBitLinear alongside the other quantized layer types whose fp16 metadata is intentionally retained (onecomp/quantized_model_loader.py).
Tests
Enabled inherited test_forward_error tests for JointQ, OneBit, and RTN (previously skipped with "does not support create_inference_layer") (tests/onecomp/quantizer/jointq/test_jointq.py, tests/onecomp/quantizer/onebit/test_onebit.py, tests/onecomp/quantizer/rtn/test_rtn.py)
Added _forward_error_features class attribute to BaseQuantizeSpec for parameterizing layer size in test_forward_error; JointQ overrides to 32 (requires in_features divisible by pack_factor = 32 // wbits) (tests/onecomp/quantizer/test_module.py)
Changed JointQ test default bits from 1 to 2 to match GPTQLinear packing constraints (tests/onecomp/quantizer/jointq/test_jointq.py)
Updated check_equal_results in RTN and OneBit tests to use compute_dequantized_weight() instead of direct dequantized_weight attribute access
Updated apply_quantized_weights in RTN and OneBit tests to use compute_dequantized_weight() with proper dtype preservation
Tightened GPTQ unit test tolerances in tests/onecomp/quantizer/gptq/test_gptq.py so regressions in dequantized-weight error are detected earlier (error < 0.4, max_error < 1.71; previously 0.6 / 2.5) (tests/onecomp/quantizer/gptq/test_gptq.py)
Fixed tests/onecomp/quantizer/test_module.py to feed y_replaced consistently into q_proj / k_proj / v_proj after quantized weights are applied, aligning the replacement-path forward test with the intended residual update flow
Extracted the duplicated attention+MLP forward loop in test_quantize_error into TestModel.forward() (tests/onecomp/quantizer/test_module.py); both the pre-quantization and post-quantization inference paths now call model(inp) directly, eliminating 34 duplicate lines
Dependencies
Pinned the vllm optional dependency to vllm>=0.10,<0.22 in pyproject.toml (and regenerated uv.lock). vLLM 0.22.0 removed the legacy Exllama GPTQ kernel that OneComp's GPTQ serving uses for low bit-widths (2-/3-bit, and 4-/8-bit models that are asymmetric or use desc_act), so vLLM 0.22 and later are not supported — serving affected models on 0.22+ fails at runtime. Documented in docs/user-guide/vllm-inference.md and docs/getting-started/installation.md.
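A quick way to check an installed version string against this pin is sketched below. The helper is hypothetical and deliberately simplified: it compares only the leading major.minor components and ignores pre-release or local suffixes, so it is not a substitute for a full PEP 440 comparison.

```python
def vllm_version_supported(version: str) -> bool:
    """Return True if `version` falls inside vllm>=0.10,<0.22.

    Only the leading major.minor components are compared, which is enough
    for this range; pre-release and local suffixes are ignored.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (0, 10) <= (major, minor) < (0, 22)


for candidate in ("0.9.2", "0.10.0", "0.21.3", "0.22.0"):
    print(candidate, vllm_version_supported(candidate))
# 0.9.2 False
# 0.10.0 True
# 0.21.3 True
# 0.22.0 False
```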
Documentation
Documented save/load and vLLM compatibility for the newly-supported JointQ, RTN, and OneBit quantizers across the docs:
docs/api/quantizers/base.md: moved JointQ, RTN, and Onebit into the supported rows of the "Quantizer Feature Support" table (get_quant_config / create_inference_layer / Save / Quantized PPL/ACC all Yes), and added a new "Saved quant_method and vLLM compatibility" table mapping each quantizer to its emitted quant_method (gptq / mixed_gptq / dbf / onebit) and serving path
docs/user-guide/basic-usage.md: updated the quantized-model evaluation note and the "Quantizer feature support" table to include JointQ/RTN/OneBit, added a quant_method column, and clarified which saved models are vLLM-servable
docs/user-guide/vllm-inference.md: rewrote the "Supported Quantization Methods" table to distinguish vLLM's built-in GPTQ plugin (used for gptq: GPTQ uniform bits, JointQ, RTN) from the OneComp plugins (mixed_gptq, dbf), added a note that Onebit is not vLLM-servable, and listed the GPTQ/JointQ/AutoBit end-to-end examples; split the gptq row so GPTQ/RTN (wbits in {2, 3, 4, 8}) and JointQ (bits in {2, 3, 4}; bits=1 is OneComp load-only with pack_weights=False) document their distinct supported bit-widths
docs/algorithms/jointq.md: added a "Save and Load" section (emits quant_method="gptq", served by vLLM's built-in GPTQ plugin), a note that JointQ bits is limited to {2, 3, 4} for vLLM (the JointQ core quantizer rejects bits > 4) while bits=1 requires an explicit runner.save_quantized_model(..., pack_weights=False) and is OneComp load-only / not vLLM-servable, and clarified that JointQ does not support QEP (qep=False)
docs/algorithms/rtn.md: added a "Save and Load" section, a note that vLLM serving uses wbits in {2, 3, 4, 8} (RTN itself accepts a wider range, but GPTQ-compatible bit packing and vLLM serving are limited to these), and a warning that rotation-preprocessed RTN models cannot be served with vLLM (no online Hadamard transform), though they remain loadable with load_quantized_model()
docs/getting-started/quickstart.md, docs/index.md, README.md: updated quantized-model evaluation and vLLM integration descriptions to include JointQ/RTN/OneBit and reference the GPTQ built-in plugin path
Examples
Added example/vllm_inference/example_jointq_vllm_inference.py: end-to-end JointQ quantization (4-bit, group_size=128) → save → vLLM offline inference. Mirrors the GPTQ vLLM example, uses qep=False (JointQ does not support QEP), and documents the bits >= 2 requirement for vLLM bit-packing. Registered in the README example table.