v1.1.1

@FKKimura FKKimura released this 21 May 02:15
· 28 commits to main since this release
da41b49

[v1.1.1] 2026-05-21

New Feature: Quantization progress logging

  • Added QuantizationProgressTracker (onecomp/utils/quantization_progress.py), which emits a single [progress] INFO line per completed step with done/total, percentage, elapsed time, and a linearly extrapolated ETA. An optional thread_safe=True mode supports multi-GPU quantization.
  • Added a report_progress: bool = True flag to Runner.__init__ (onecomp/runner.py) and to the underlying entry points: run_chunked_quantization (onecomp/runner_methods/chunked_quantization.py), run_multi_gpu_quantization / run_quantization_phase (onecomp/runner_methods/multi_gpu_quantization.py), run_quantize_with_qep (onecomp/qep/_quantize_with_qep.py), and run_quantize_with_qep_arch (onecomp/qep/_quantize_with_qep_arch.py). Long quantization runs (calibration, chunked, multi-GPU, QEP) now report progress by default; pass report_progress=False for quiet runs.
  • Demoted some INFO-level per-layer / per-chunk logs to DEBUG to avoid duplicating the new [progress] line. They remain available via logging.basicConfig(level=logging.DEBUG) for deep debugging.
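
As a rough illustration of the progress line described above, the following is a minimal sketch of a tracker that logs done/total, percentage, elapsed time, and a linear ETA. The class name ProgressSketch, the injectable clock argument, and the exact line format are assumptions made for this example; this is not the onecomp QuantizationProgressTracker itself.

```python
import contextlib
import logging
import threading
import time

logger = logging.getLogger("progress_sketch")


class ProgressSketch:
    """Hypothetical stand-in for illustration; not the onecomp tracker."""

    def __init__(self, total, thread_safe=False, clock=time.monotonic):
        if total <= 0:
            raise ValueError("total must be positive")
        self.total = total
        self.done = 0
        self._clock = clock  # injectable so the sketch can be tested deterministically
        self._start = clock()
        # Take a real lock only when several workers may call step() concurrently.
        self._lock = threading.Lock() if thread_safe else contextlib.nullcontext()

    def step(self):
        """Record one completed step, log it, and return the formatted line."""
        with self._lock:
            self.done += 1
            elapsed = self._clock() - self._start
            # Linear ETA: assume each remaining step costs the average so far.
            eta = elapsed / self.done * (self.total - self.done)
            pct = 100.0 * self.done / self.total
            line = (
                f"[progress] {self.done}/{self.total} ({pct:.1f}%) "
                f"elapsed={elapsed:.1f}s eta={eta:.1f}s"
            )
            logger.info(line)
            return line
```

Calling step() once per completed layer or chunk yields one INFO line each time. The linear estimate is only as good as the assumption that the remaining steps cost about as much as the completed ones.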

Bug fixes: QEP + JointQ validation

  • Raise a clear error when Runner is configured with qep=True and a quantizer that does not support QEP (currently JointQ). Previously the run failed deep inside quantize_with_qep / adjust_weight with a confusing low-level error. Runner.check() now reports e.g. "Quantizer 'JointQ' (or one of its candidate quantizers) does not support QEP (Quantization Error Propagation). Set qep=False, or use a QEP-compatible quantizer (e.g., GPTQ, DBF, AutoBitQuantizer with QEP-compatible candidates)." Implementation: added flag_qep_supported (default True) on Quantizer, set to False on JointQ, and propagated via AutoBitQuantizer._sync_flags (only True when all candidate quantizers support QEP) (quantizer/_quantizer.py, quantizer/jointq/_jointq.py, quantizer/autobit/_autobit.py, runner.py).
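
The flag propagation described above can be sketched roughly as follows. The classes here are simplified stand-ins that reuse the names from this note for readability; the constructor signatures, the free-standing check() function, and the shortened error text are assumptions of the example, not the library's actual code.

```python
class Quantizer:
    # Default per the note above: quantizers support QEP unless they opt out.
    flag_qep_supported = True


class GPTQ(Quantizer):
    pass


class JointQ(Quantizer):
    flag_qep_supported = False


class AutoBitQuantizer(Quantizer):
    def __init__(self, candidates):
        self.candidates = list(candidates)
        self._sync_flags()

    def _sync_flags(self):
        # QEP is supported only if every candidate quantizer supports it.
        self.flag_qep_supported = all(
            c.flag_qep_supported for c in self.candidates
        )


def check(quantizer, qep):
    """Hypothetical stand-in for the validation done in Runner.check()."""
    if qep and not quantizer.flag_qep_supported:
        name = type(quantizer).__name__
        raise ValueError(
            f"Quantizer '{name}' (or one of its candidate quantizers) "
            "does not support QEP (Quantization Error Propagation). "
            "Set qep=False, or use a QEP-compatible quantizer."
        )
```

Validating up front this way turns a failure deep inside the quantization loop into an immediate configuration error.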

Bug fixes: VLM save / load

  • Runner.save_quantized_model() now copies all auxiliary *.json and *.jinja files (e.g. preprocessor_config.json, processor_config.json, special_tokens_map.json, chat_template.jinja) from the original model directory to the save directory, so the quantized model is fully self-contained for VLM / multimodal inference. Weight tensors (*.safetensors, *.bin, *.pt, *.pth), weight index files, config.json and generation_config.json are skipped, and any file already written by model.save_pretrained / tokenizer.save_pretrained is preserved (runner.py).
  • Source-model directory resolution (incl. huggingface_hub.snapshot_download fallback for Hub IDs) was extracted into a private helper Runner._resolve_source_model_dir() (runner.py).
  • load_quantized_model() now re-establishes the lm_head <-> embed_tokens weight tie for models with tie_word_embeddings=True. load_state_dict(..., assign=True) would otherwise leave lm_head.weight as the freshly initialised tensor (typically float16) while embed_tokens.weight got replaced with the checkpoint tensor (typically bfloat16), causing RuntimeError: expected mat1 and mat2 to have the same dtype at the final lm_head matmul during generation. The re-tie is gated on lm_head still being an nn.Linear so it does not interfere when lm_head itself was quantized (quantized_model_loader.py).
  • load_quantized_model() now reads torch_dtype from config.json when no explicit torch_dtype is passed by the caller, so the empty model is built in the same dtype as the saved checkpoint. Previously it always defaulted to torch.float16, which left non-quantized VLM submodules (e.g. multi_modal_projector in Cohere2Vision) at fp16 whenever load_state_dict(..., assign=True) could not find their key in the state_dict (quantized_model_loader.py).
  • load_quantized_model() now casts any leftover float16 parameters and buffers of non-quantized modules to model.config.torch_dtype after the lm_head re-tie step. Quantized layers (GPTQLinear, DoubleBinaryLinear) and float32 params (e.g. fp32 LayerNorm in mixed-precision models) are deliberately untouched. This generalises the existing lm_head re-tie to any non-quantized module and fixes the dtype mismatch reported in issue 64-3 (RuntimeError: ... c10::Half != c10::BFloat16 on VLM image features) (quantized_model_loader.py).
  • Added regression tests tests/onecomp/runner/test_save_quantized_aux_files.py (auxiliary-file copy whitelist), tests/onecomp/runner/test_load_tied_embeddings.py (tied-embedding dtype round-trip) and tests/onecomp/runner/test_load_excluded_module_dtype.py (non-quantized module dtype handling, including config-based empty-model dtype default, fp16 safety-net cast, fp32 preservation, and quantized-layer skip).
  • Loosened test_save_load_pipeline_tinyllama.py and test_save_load_pipeline_qwen3.py save/load round-trip threshold from absolute 1e-3 to relative 1% of the per-tensor logits magnitude (tests/onecomp/pre_process/test_save_load_pipeline_*.py). The original absolute bound was below fp16's representable precision once accumulated through the 22-28 decoder layers of TinyLlama / Qwen3, causing the gptq + save_dequantized cases to fail on aarch64 + Blackwell (GB200) where cuBLAS picks slightly different reduction kernels than reference x86_64 / Hopper hosts. The save/load equivalence intent is preserved via the relative comparison, which is robust to platform-specific fp16 rounding noise.
  • Set gpu_memory_utilization=0.78 explicitly when constructing LLM(...) in example/vllm_inference/example_autobit_vllm_inference.py and example/vllm_inference/example_gptq_vllm_inference.py. The vLLM default 0.92 cgroup-OOMs on UMA hosts (e.g. DGX Spark / GB200, 121.7 GiB UMA) because vLLM's startup memory check fails: the residual quantizer process leaves only ~106 GiB free, which is below 0.92 * 121.7 = 111.96 GiB. 0.78 matches the value already used in tests/vllm_plugins/gptq/test_mixed_gptq_e2e.py and is documented in the workspace slurm-submit.mdc rule.
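
The auxiliary-file copy rule from the first item in this list could look roughly like the sketch below. The function name copy_auxiliary_files, the pattern tuples, and in particular the *.index.json pattern used to recognise weight index files are assumptions of this example; the actual logic lives in runner.py and may differ.

```python
import fnmatch
import shutil
from pathlib import Path

# Files worth carrying over for VLM / multimodal inference.
AUX_PATTERNS = ("*.json", "*.jinja")
# Files that must not be copied. The index-file pattern is an assumption.
SKIP_PATTERNS = (
    "*.safetensors", "*.bin", "*.pt", "*.pth",
    "*.index.json", "config.json", "generation_config.json",
)


def _matches(name, patterns):
    return any(fnmatch.fnmatchcase(name, p) for p in patterns)


def copy_auxiliary_files(src_dir, dst_dir):
    """Copy auxiliary files from src_dir to dst_dir; return the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    copied = []
    for path in sorted(src.iterdir()):
        name = path.name
        if not path.is_file() or not _matches(name, AUX_PATTERNS):
            continue
        if _matches(name, SKIP_PATTERNS):
            continue
        target = dst / name
        if target.exists():
            # Preserve anything save_pretrained already wrote.
            continue
        shutil.copy2(path, target)
        copied.append(name)
    return copied
```

Checking the destination before copying is what keeps files written by model.save_pretrained / tokenizer.save_pretrained from being overwritten by the originals.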

Logging / observability tweaks

  • Runner._copy_auxiliary_files() now emits a plain INFO-level log line when an auxiliary file from the original model directory is not copied because the destination already contains a file of the same name (typically because tokenizer.save_pretrained wrote it just before, or because a previous save_quantized_model call did). The new line mirrors the existing "Copied %s to save directory" entry, so the auxiliary-copy step can be audited end-to-end (runner.py).
  • QuantizedModelLoader._cast_fp16_to_target_dtype() now returns the list of fully-qualified parameter / buffer names whose dtype was actually converted instead of a plain count. The post-load INFO log in load_quantized_model() includes those names so it is obvious which non-quantized submodules were normalised by the safety-net cast (e.g. multi_modal_projector.linear_* in Cohere2Vision). Existing tests are updated accordingly and a new test pins the buffer-name reporting (quantized_model_loader.py, tests/onecomp/runner/test_load_excluded_module_dtype.py, tests/onecomp/runner/test_save_quantized_aux_files.py).
  • QuantizedModelLoader.load_quantized_model() now detects tie_word_embeddings=True even when the flag is nested in a sub-config (e.g. model.config.text_config.tie_word_embeddings in Llama 3.2-Vision and other torchtune-derived VLMs) by walking one level of sub-configs. Previously the flag was only read from the top-level model.config, so VLMs that placed it in text_config skipped the post-load re-tie; with HF deduplicating lm_head.weight for tied checkpoints, that left lm_head.weight at the empty-model random initial values rather than re-pointing to embed_tokens.weight (quantized_model_loader.py).
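
The one-level sub-config walk described in the last item can be sketched as below. The helper name should_retie_sketch is hypothetical, and SimpleNamespace objects stand in for Hugging Face config objects; the real loader may identify sub-configs differently.

```python
from types import SimpleNamespace


def should_retie_sketch(config):
    """Return True if tie_word_embeddings is set at the top level
    or on a direct sub-config (one level deep, no recursion)."""
    if bool(getattr(config, "tie_word_embeddings", False)):
        return True
    for value in vars(config).values():
        # Plain values (ints, strings, dicts) have no such attribute,
        # so getattr falls back to False for them.
        if bool(getattr(value, "tie_word_embeddings", False)):
            return True
    return False


# Flag nested under text_config, as in the VLM case described above.
vlm_config = SimpleNamespace(
    tie_word_embeddings=False,
    text_config=SimpleNamespace(tie_word_embeddings=True),
)
needs_retie = should_retie_sketch(vlm_config)
```

Limiting the walk to one level matches the behaviour described in this note, so a flag buried two levels deep would not be picked up by this sketch.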

Tests

  • Added regression tests for the save/load fixes above: tests/onecomp/runner/test_save_quantized_aux_files.py (auxiliary-file copy whitelist), tests/onecomp/runner/test_load_tied_embeddings.py (tied-embedding dtype round-trip), and tests/onecomp/runner/test_load_excluded_module_dtype.py (non-quantized module dtype handling, including config-based empty-model dtype default, fp16 safety-net cast, fp32 preservation, and quantized-layer skip).
  • Added tests/onecomp/test_runner_check.py for the new qep=True validation path: JointQ + qep=True raises a clear ValueError, while JointQ + qep=False and GPTQ + qep=True both pass Runner.check().
  • Added tests/onecomp/runner/test_load_tied_embeddings.py::test_should_retie_word_embeddings_* unit tests covering top-level, nested-text-config, all-False and unrelated-sub-attribute shapes.
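
For the relative round-trip threshold used by the loosened test_save_load_pipeline_*.py tests, a comparison along the following lines is one plausible reading. Taking "logits magnitude" to mean the largest absolute reference value is an assumption of this sketch, as are the function name and the plain-list inputs; the real tests operate on tensors.

```python
def within_relative_tolerance(reference, candidate, rel=0.01):
    """True if the largest elementwise difference is at most `rel` times
    the largest absolute reference value (the assumed 'magnitude')."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    scale = max(abs(x) for x in reference)
    max_diff = max(abs(a - b) for a, b in zip(reference, candidate))
    return max_diff <= rel * scale


# A 0.05 deviation on logits of magnitude ~12 exceeds an absolute 1e-3
# bound but sits well inside a 1% relative bound (0.12).
reference_logits = [12.0, -8.0, 3.5]
reloaded_logits = [12.05, -8.0, 3.5]
ok = within_relative_tolerance(reference_logits, reloaded_logits)
```

Scaling the bound with the tensor's magnitude keeps the check meaningful for large logits, where a fixed absolute bound can be tighter than fp16 rounding allows.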

New Contributors