
Production-hardening: fix v2.1.0rc1 quantization-state and architecture-fallback bugs#28

Merged
codewithdark-git merged 2 commits into main from devin/1777241535-quantllm-production-hardening on Apr 27, 2026

Conversation


devin-ai-integration (Bot) commented on Apr 27, 2026

Summary

Production-hardening pass on top of v2.1.0rc1 (PR #27). Fixes the
"latest HuggingFace model is properly quantized but not loaded" symptom
reported on the most recent PR, plus the related architecture-fallback
regressions, and adds the missing CI / packaging / docs scaffolding so
those bugs can't silently come back.

Correctness fixes

1. Bug: from_config_only=True returned a random-weights model but left _is_quantized=True, so callers thought the model was both quantized and loaded.
   Fix: _is_quantized is now a derived property reading model.config.quantization_config and BitsAndBytes layer types at call time; from_config_only=True correctly reports False and warns.
2. Bug: A missing bitsandbytes install silently fell through to full precision but kept _is_quantized=True.
   Fix: The warning now describes the install command and the property reports False.
3. Bug: Pre-quantized HF repos (GPTQ / AWQ / etc.) were not recognised when the user passed quantize=False.
   Fix: _has_runtime_quantization() looks at quantization_config.quant_method and the actual layer types.
4. Bug: DEFAULT_ARCHITECTURE_FALLBACKS was dead code: it was only consulted when the HF config returned an empty model_type, which never happens in practice.
   Fix: resolve_model_type now consults the table directly and recognises version-suffix patterns: qwen3 → qwen2, llama4 → llama, phi4 → phi3, gemma3 → gemma2, etc.
5. Bug: register_architecture("newmodel", base_model_type="llama", model_class=NewCls) stored the class under "newmodel" but looked it up under "llama", so the fallback path silently ignored it.
   Fix: Lookup tries the original config.model_type first, then falls back to the resolved base family.
6. Bug: Duplicated if is_bnb and is_8bit ... block in the existing-quant detection branch.
   Fix: Removed.
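
To make the new state tracking concrete, here is a minimal sketch of the derived-property idea behind fixes 1-3 (and of the override setter mentioned in the review checklist below). It assumes the loaded HF model sits on self.model; apart from _is_quantized_override, quantization_config, and the BitsAndBytes layer class names, the identifiers are illustrative rather than the exact QuantLLM implementation.

```python
class TurboModel:
    # Loading code elided; assume the underlying HF model is stored on self.model.

    @property
    def is_quantized(self) -> bool:
        """Derived from the loaded model at call time instead of a cached flag."""
        # Opt-in override slot (used e.g. by from_gguf, where the HF config
        # carries no quantization_config).
        override = getattr(self, "_is_quantized_override", None)
        if override is not None:
            return override
        model = getattr(self, "model", None)
        if model is None:
            return False
        # Pre-quantized repos (GPTQ / AWQ / bnb exports) expose quantization_config.
        if getattr(model.config, "quantization_config", None) is not None:
            return True
        # Runtime bitsandbytes quantization shows up in the layer types.
        bnb_layer_names = {"Linear4bit", "Linear8bitLt"}
        return any(type(m).__name__ in bnb_layer_names for m in model.modules())

    @is_quantized.setter
    def is_quantized(self, value: bool) -> None:
        # Explicit override; normal loading paths never touch this.
        self._is_quantized_override = bool(value)
```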

Robustness for new architectures and consumer hardware

  • Expanded DEFAULT_ARCHITECTURE_FALLBACKS to 39 entries covering Llama 2/3/4, Mistral / Mixtral, Qwen 2 / 2-MoE / 3, Phi / Phi-3 / Phi-4, Gemma / Gemma 2 / Gemma 3, Falcon, Cohere / Command-R, DeepSeek (V2/V3), OLMo / OLMo 2, SmolLM / SmolLM 2 / SmolLM 3, Yi, StarCoder / StarCoder 2, InternLM / InternLM 2, Baichuan, ChatGLM, StableLM. A simplified sketch of the resolution and repo-name heuristics follows this list.
  • Pre-quantized repo detection. Names matching *-bnb-4bit, *-bnb-8bit, *-AWQ, *-GPTQ, *-INT4, *-FP8, *-EETQ, *-HQQ, *-AQLM log a friendly hint that the embedded quantization_config will be honoured.
  • GGUF-only repo hint. *-gguf / .gguf repo names now point users at from_gguf instead of silently failing.
  • TurboModel.report() returns a structured snapshot (model_id, params_billion, requested_bits, effective_loading_bits, is_quantized, quant_method, device, dtype, finetuned, lora_applied) so users on any hardware can verify what actually got loaded.
  • TurboModel.is_quantized is now a public property and the canonical answer, rather than an instance flag that can drift.
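
The following is a simplified sketch of the two loading heuristics above. The table entries shown are a small subset of the real 39-entry fallback table, and the helper names and exact suffix handling are illustrative rather than the QuantLLM code itself.

```python
import re

# Small subset of the real fallback table; keys are HF config.model_type values.
DEFAULT_ARCHITECTURE_FALLBACKS = {
    "qwen3": "qwen2",
    "llama4": "llama",
    "phi4": "phi3",
    "gemma3": "gemma2",
    "deepseek_v3": "llama",
}

def resolve_model_type(model_type: str) -> str:
    """Map an unknown HF model_type onto a known base family."""
    if model_type in DEFAULT_ARCHITECTURE_FALLBACKS:
        return DEFAULT_ARCHITECTURE_FALLBACKS[model_type]
    # Version-suffix heuristic for families not yet in the table:
    # strip a trailing version marker ("newmodel_v2" -> "newmodel") and retry.
    stripped = re.sub(r"[_-]?v?\d+$", "", model_type)
    return DEFAULT_ARCHITECTURE_FALLBACKS.get(stripped, model_type)

# Repo-name hints for pre-quantized and GGUF-only repos.
PREQUANT_SUFFIXES = ("-bnb-4bit", "-bnb-8bit", "-awq", "-gptq", "-int4",
                     "-fp8", "-eetq", "-hqq", "-aqlm")

def repo_name_hint(repo_id: str) -> str | None:
    name = repo_id.lower()
    if name.endswith(("-gguf", ".gguf")):
        return "Looks like a GGUF-only repo; load it with from_gguf()."
    if name.endswith(PREQUANT_SUFFIXES):
        return "Pre-quantized repo detected; its embedded quantization_config will be honoured."
    return None
```

For example, repo_name_hint("unsloth/llama-3-8b-bnb-4bit") would surface the pre-quantized hint described above, while an unrecognised name stays silent.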

Production hygiene

  • .github/workflows/ci.yml — runs ruff + pytest on Python 3.10/3.11/3.12 and validates the build with python -m build + twine check. PR #27 (Add architecture registration + fallback loading path for newly released HF model types) was merged with no real CI; this fixes that.
  • pyproject.toml — PEP 517/518 build metadata, conservative ruff lint profile (only blocker-class rules so we don't impose a giant reformat diff), pytest defaults.
  • .pre-commit-config.yaml — local enforcement (whitespace, EOF fixer, large-file guard, ruff with autofix).
  • CHANGELOG.md — full record of every change above.

Tests

All 25 tests pass locally (12 existing + 13 new):

  • tests/test_quantization_state.py — is_quantized derivation, from_config_only honesty, report() schema, override-setter contract.
  • tests/test_resolve_model_type.py — fallback-table consultation, family-suffix matching, registry-class lookup ergonomics, override precedence.

Docs

  • docs/guide/loading-models.md updated to advertise the now-automatic fallbacks, pre-quantized repo detection, and report() API.
  • docs/guide/consumer-hardware.md (new) — per-tier guidance for CPU-only, Apple Silicon, ≤ 8 GB / 12–24 GB / multi-GPU, plus a "when QuantLLM cannot quantize" troubleshooting table.

Review & Testing Checklist for Human

  • Sanity-check the is_quantized property semantics. _is_quantized used to be a plain attribute; it's now backed by a property + setter that writes _is_quantized_override. Any third-party code that pickled a TurboModel from v2.1.0rc1 will need a re-instantiation, but in-process behaviour is unchanged for the existing call sites.
  • Run a real end-to-end load on at least one currently-failing model (e.g. a Qwen3 or Phi-4 checkpoint) on a real GPU — the unit tests stub out transformers entirely. The fallback table changes are believed correct from the existing PR #27 (Add architecture registration + fallback loading path for newly released HF model types) logic, but the only CI we have is unit-level.
  • Verify the new tests/test_quantization_state.py::test_runtime_quantization_property_reads_model_config matches your real GPTQ / AWQ workflow — it uses a mock quant_method='gptq'. If your team has a known-good GPTQ repo, run turbo("that-repo") and confirm model.report()['quant_method'] matches (a verification sketch follows this checklist).
  • Confirm the DEFAULT_ARCHITECTURE_FALLBACKS mappings are sane for your supported models. I made conservative choices (qwen3→qwen2, phi4→phi3, gemma3→gemma2, deepseek_v3→llama) but architectures can drift in ways that break weight-loading even for similar-looking families.
  • CI workflow: the new .github/workflows/ci.yml will run on this PR — confirm the lint job, all three test matrices, and the build job pass before merging. CI was effectively absent from the repo before, so this is the first run.
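
For the manual end-to-end check, a snippet along these lines is enough to see what actually got loaded; the import path and model ID are placeholders, while turbo(), report(), and is_quantized are the APIs described in this PR.

```python
from quantllm import turbo  # import path assumed; adjust to the actual package layout

# Placeholder ID: use whichever Qwen3 / Phi-4 / GPTQ checkpoint your team trusts.
model = turbo("Qwen/Qwen3-8B")

snapshot = model.report()
print(snapshot["effective_loading_bits"], snapshot["quant_method"], snapshot["device"])

# The property and the snapshot should agree; a mismatch would suggest the
# old flag-drift bug has resurfaced.
if model.is_quantized != bool(snapshot["quant_method"]):
    print("Warning: is_quantized and report() disagree")
```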

Notes

  • The PR is intentionally scoped to one coherent context: making turbo() correctly load and quantize any HuggingFace model on any consumer hardware. Followups that I'd file separately:
    • Tag v2.1.0rc1 (the version is in setup.py but no git tag / GitHub release exists).
    • Decide whether to flip trust_remote_code to False by default to match HF's own default.
    • Split quantllm/core/turbo_model.py (2400+ lines) into loading.py, export/, finetune.py, registry.py for reviewability.
  • Studied Unsloth and PowerInfer for behavioural cues but deliberately did not pull in their kernels — the bug surface I'm fixing is in QuantLLM's loader/state-tracking, not its kernels. Two ideas that did make it in: Unsloth-style pre-quantized repo detection (*-bnb-4bit etc.) and PowerInfer-style "honest hybrid behaviour" via the new report() API for users on mixed CPU/GPU setups.
  • Linked review report attached in the original session — the bug numbering above maps directly to §2 / §3 of that report.

Link to Devin session: https://app.devin.ai/sessions/5f080418f8004a8cba7358b93abdabcf
Requested by: @codewithdark-git



Commit message: Production-hardening: fix v2.1.0rc1 quantization-state and architecture-fallback bugs

Reflects the issues identified in the codebase review on top of PR #27.

Correctness fixes
-----------------
* TurboModel._is_quantized is now a property derived from the loaded
  model's config.quantization_config and BitsAndBytes layer types,
  with an opt-in override slot used by from_gguf. This fixes:
  - from_config_only=True returning a random-weights model that was
    misreported as quantized;
  - missing bitsandbytes installs falling through silently while the
    flag stayed True;
  - pre-quantized HF repos (GPTQ/AWQ/etc.) not being recognized when
    the user passed quantize=False.
* resolve_model_type now consults DEFAULT_ARCHITECTURE_FALLBACKS for
  unknown HF model_types and recognizes version-suffix patterns
  (qwen3 -> qwen2, llama4 -> llama, phi4 -> phi3, gemma3 -> gemma2,
  ...). The old logic only consulted the table when the config's
  model_type was empty, which never happens in practice.
* register_architecture(model_class=...) is now discoverable under
  the original architecture name as well as the resolved base family,
  matching the documented API.
* Removed an accidentally duplicated 'if is_bnb and is_8bit ...'
  block in the existing-quant detection branch.

Robustness for new architectures and consumer hardware
-----------------------------------------------------
* Greatly expanded DEFAULT_ARCHITECTURE_FALLBACKS (Llama 2/3/4, Qwen
  2/2-MoE/3, Phi/3/4, Gemma/2/3, DeepSeek V2/V3, Cohere/Command-R,
  OLMo/2, SmolLM/2/3, Yi, StarCoder/2, InternLM/2, Baichuan, ChatGLM,
  StableLM, Falcon).
* Pre-quantized HF repo names (Unsloth-style *-bnb-4bit, *-AWQ,
  *-GPTQ, *-INT4, *-FP8, etc.) are detected and surfaced as a hint;
  the embedded quantization_config is honoured.
* GGUF-only repo names trigger a friendly hint pointing at from_gguf.
* New TurboModel.report() returns a structured snapshot of the actual
  loaded model state (quant_method, device, dtype, params_billion).
* TurboModel.is_quantized public property is the canonical answer
  rather than an instance flag that could drift.

Production hygiene
------------------
* New .github/workflows/ci.yml runs ruff + pytest on Python
  3.10/3.11/3.12 and validates the build with python -m build / twine check.
* New pyproject.toml provides PEP 517/518 build metadata plus a
  conservative ruff lint profile (only blocker-class rules) and
  pytest defaults.
* New .pre-commit-config.yaml for local pre-commit enforcement.
* New CHANGELOG.md documenting every change.

Tests
-----
* tests/test_quantization_state.py covers the from_config_only and
  is_quantized property fixes, the report() schema, and the override
  setter.
* tests/test_resolve_model_type.py covers the fallback-table
  consultation, family-suffix matching, and registry-class lookup
  ergonomics.

Docs
----
* docs/guide/loading-models.md updated to reflect the now-automatic
  fallbacks, the pre-quantized repo detection, and report().
* docs/guide/consumer-hardware.md added with per-tier guidance for
  CPU-only, Apple Silicon, 4-8 GB / 12-24 GB / multi-GPU.

devin-ai-integration Bot commented Apr 27, 2026

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Follow-up commit message: The matrix tests were failing with ModuleNotFoundError because we only
installed runtime deps but never installed the quantllm package
itself. Use `pip install --no-deps -e .` so the package is importable
without re-resolving the heavy (GPU-only) dependency set.

devin-ai-integration (Bot) left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


codewithdark-git merged commit dd07b0d into main on Apr 27, 2026
6 checks passed