Production-hardening: fix v2.1.0rc1 quantization-state and architecture-fallback bugs #28
Merged
codewithdark-git merged 2 commits into main on Apr 27, 2026
Conversation
…re-fallback bugs Reflects the issues identified in the codebase review on top of PR #27. Correctness fixes ----------------- * TurboModel._is_quantized is now a property derived from the loaded model's config.quantization_config and BitsAndBytes layer types, with an opt-in override slot used by from_gguf. This fixes: - from_config_only=True returning a random-weights model that was misreported as quantized; - missing bitsandbytes installs falling through silently while the flag stayed True; - pre-quantized HF repos (GPTQ/AWQ/etc.) not being recognized when the user passed quantize=False. * resolve_model_type now consults DEFAULT_ARCHITECTURE_FALLBACKS for unknown HF model_types and recognizes version-suffix patterns (qwen3 -> qwen2, llama4 -> llama, phi4 -> phi3, gemma3 -> gemma2, ...). The old logic only consulted the table when the config's model_type was empty, which never happens in practice. * register_architecture(model_class=...) is now discoverable under the original architecture name as well as the resolved base family, matching the documented API. * Removed an accidentally duplicated 'if is_bnb and is_8bit ...' block in the existing-quant detection branch. Robustness for new architectures and consumer hardware ----------------------------------------------------- * Greatly expanded DEFAULT_ARCHITECTURE_FALLBACKS (Llama 2/3/4, Qwen 2/2-MoE/3, Phi/3/4, Gemma/2/3, DeepSeek V2/V3, Cohere/Command-R, OLMo/2, SmolLM/2/3, Yi, StarCoder/2, InternLM/2, Baichuan, ChatGLM, StableLM, Falcon). * Pre-quantized HF repo names (Unsloth-style *-bnb-4bit, *-AWQ, *-GPTQ, *-INT4, *-FP8, etc.) are detected and surfaced as a hint; the embedded quantization_config is honoured. * GGUF-only repo names trigger a friendly hint pointing at from_gguf. * New TurboModel.report() returns a structured snapshot of the actual loaded model state (quant_method, device, dtype, params_billion). * TurboModel.is_quantized public property is the canonical answer rather than an instance flag that could drift. Production hygiene ------------------ * New .github/workflows/ci.yml runs ruff + pytest on Python 3.10/3.11 /3.12 and validates the build with python -m build / twine check. * New pyproject.toml provides PEP 517/518 build metadata plus a conservative ruff lint profile (only blocker-class rules) and pytest defaults. * New .pre-commit-config.yaml for local pre-commit enforcement. * New CHANGELOG.md documenting every change. Tests ----- * tests/test_quantization_state.py covers the from_config_only and is_quantized property fixes, the report() schema, and the override setter. * tests/test_resolve_model_type.py covers the fallback-table consultation, family-suffix matching, and registry-class lookup ergonomics. Docs ---- * docs/guide/loading-models.md updated to reflect the now-automatic fallbacks, the pre-quantized repo detection, and report(). * docs/guide/consumer-hardware.md added with per-tier guidance for CPU-only, Apple Silicon, 4-8 GB / 12-24 GB / multi-GPU.
Author
The matrix tests were failing with ModuleNotFoundError because we only installed runtime deps but never installed the quantllm package itself. Use `pip install --no-deps -e .` so the package is importable without re-resolving the heavy (GPU-only) dependency set.
codewithdark-git approved these changes on Apr 27, 2026
Summary
Production-hardening pass on top of `v2.1.0rc1` (PR #27). Fixes the "latest HuggingFace model is properly quantized but not loaded" symptom reported on the most recent PR, plus the related architecture-fallback regressions, and adds the missing CI / packaging / docs scaffolding so those bugs can't silently come back.
Correctness fixes
- `from_config_only=True` returned a random-weights model but left `_is_quantized=True`, so callers thought the model was both quantized and loaded. `_is_quantized` is now a derived property reading `model.config.quantization_config` and BitsAndBytes layer types at call time. `from_config_only=True` correctly reports `False` and warns. (A minimal sketch of the derivation follows this list.)
- A missing `bitsandbytes` install silently fell through to full precision but kept `_is_quantized=True`; it now reports `False`.
- Pre-quantized HF repos (GPTQ/AWQ/etc.) were not recognized when the user passed `quantize=False`. `_has_runtime_quantization()` looks at `quantization_config.quant_method` and the actual layer types.
- `DEFAULT_ARCHITECTURE_FALLBACKS` was dead code — only consulted when the HF config returned an empty `model_type`, which never happens in practice. `resolve_model_type` now consults the table directly and recognises version-suffix patterns: `qwen3` → `qwen2`, `llama4` → `llama`, `phi4` → `phi3`, `gemma3` → `gemma2`, etc.
- `register_architecture("newmodel", base_model_type="llama", model_class=NewCls)` stored the class under `"newmodel"` but looked it up under `"llama"`, so the fallback path silently ignored it. Lookup now tries `config.model_type` first, then falls back to the resolved base family.
- Removed an accidentally duplicated `if is_bnb and is_8bit ...` block in the existing-quant detection branch.
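A minimal sketch of the derived-property fix, assuming the attribute and helper names quoted above (`_is_quantized_override`, `_has_runtime_quantization`); the bitsandbytes layer check is an approximation of what the real code inspects, not the actual implementation:

```python
class TurboModel:
    def __init__(self, model):
        self.model = model
        self._is_quantized_override = None    # opt-in slot used by from_gguf

    @property
    def is_quantized(self) -> bool:
        if self._is_quantized_override is not None:
            return self._is_quantized_override
        return self._has_runtime_quantization()

    @is_quantized.setter
    def is_quantized(self, value: bool) -> None:
        self._is_quantized_override = value

    def _has_runtime_quantization(self) -> bool:
        # 1) Pre-quantized repos (GPTQ/AWQ/bnb/...) ship a quantization_config.
        qcfg = getattr(self.model.config, "quantization_config", None)
        if qcfg is not None and getattr(qcfg, "quant_method", None):
            return True
        # 2) Runtime bnb quantization shows up in the actual layer types.
        try:
            import bitsandbytes.nn as bnb_nn
        except ImportError:
            return False                      # bnb missing: never report True
        return any(
            isinstance(m, (bnb_nn.Linear4bit, bnb_nn.Linear8bitLt))
            for m in self.model.modules()
        )
```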
Robustness for new architectures and consumer hardware

- Expanded `DEFAULT_ARCHITECTURE_FALLBACKS` to 39 entries covering Llama 2/3/4, Mistral / Mixtral, Qwen 2 / 2-MoE / 3, Phi / Phi-3 / Phi-4, Gemma / Gemma 2 / Gemma 3, Falcon, Cohere / Command-R, DeepSeek (V2/V3), OLMo / OLMo 2, SmolLM / SmolLM 2 / SmolLM 3, Yi, StarCoder / StarCoder 2, InternLM / InternLM 2, Baichuan, ChatGLM, StableLM.
- Repo names ending in `*-bnb-4bit`, `*-bnb-8bit`, `*-AWQ`, `*-GPTQ`, `*-INT4`, `*-FP8`, `*-EETQ`, `*-HQQ`, `*-AQLM` log a friendly hint that the embedded `quantization_config` will be honoured (see the sketch after this list).
- `*-gguf` / `.gguf` repo names now point users at `from_gguf` instead of silently failing.
- `TurboModel.report()` returns a structured snapshot (`model_id`, `params_billion`, `requested_bits`, `effective_loading_bits`, `is_quantized`, `quant_method`, `device`, `dtype`, `finetuned`, `lora_applied`) so users on any hardware can verify what actually got loaded.
- `TurboModel.is_quantized` public property is the canonical answer rather than an instance flag that drifts.
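The repo-name hints can be illustrated with a small, self-contained sketch; the suffix list comes from the bullets above, while the function name `repo_name_hint` and the exact messages are hypothetical:

```python
import re

# Suffixes of pre-quantized HF repos, per the list above (case-insensitive).
_PREQUANT_SUFFIXES = re.compile(
    r"-(bnb-4bit|bnb-8bit|awq|gptq|int4|fp8|eetq|hqq|aqlm)$", re.IGNORECASE
)
_GGUF_HINT = re.compile(r"(-gguf$|\.gguf$)", re.IGNORECASE)

def repo_name_hint(repo_id: str) -> str | None:
    """Return a loading hint derived purely from the repo name, if any."""
    name = repo_id.split("/")[-1]
    if _GGUF_HINT.search(name):
        return "GGUF repo detected: use TurboModel.from_gguf instead."
    if _PREQUANT_SUFFIXES.search(name):
        return ("Pre-quantized repo detected: the embedded "
                "quantization_config will be honoured.")
    return None

print(repo_name_hint("unsloth/Llama-3.2-1B-Instruct-bnb-4bit"))
print(repo_name_hint("TheBloke/Mistral-7B-Instruct-v0.2-GGUF"))
```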
Production hygiene

- `.github/workflows/ci.yml` — runs ruff + pytest on Python 3.10/3.11/3.12 and validates the build with `python -m build` + `twine check`. PR #27 ("Add architecture registration + fallback loading path for newly released HF model types") was merged with no real CI; this fixes that.
- `pyproject.toml` — PEP 517/518 build metadata, a conservative ruff lint profile (only blocker-class rules so we don't impose a giant reformat diff), and pytest defaults.
- `.pre-commit-config.yaml` — local enforcement (whitespace, EOF fixer, large-file guard, ruff with autofix).
- `CHANGELOG.md` — full record of every change above.
Tests

All 25 tests pass locally (12 existing + 13 new):
- `tests/test_quantization_state.py` — `is_quantized` derivation, `from_config_only` honesty, `report()` schema, override-setter contract (an illustrative test sketch follows this list).
- `tests/test_resolve_model_type.py` — fallback-table consultation, family-suffix matching, registry-class lookup ergonomics, override precedence.
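As a sketch only, here is how the `is_quantized` derivation can be unit-tested without `transformers`, mirroring the stubbing approach described above. It reuses the `TurboModel` sketch from the correctness section; the stub class and test names are illustrative, not the actual contents of `tests/test_quantization_state.py`:

```python
# Assumes the TurboModel sketch from the correctness section is in scope.
from types import SimpleNamespace

class FakeModel:
    """Stands in for a HF model: a config object plus an empty module tree."""
    def __init__(self, quant_method=None):
        qcfg = SimpleNamespace(quant_method=quant_method) if quant_method else None
        self.config = SimpleNamespace(quantization_config=qcfg)

    def modules(self):
        return iter(())  # no bitsandbytes layers present

def test_prequantized_repo_is_recognised():
    tm = TurboModel(FakeModel(quant_method="gptq"))
    assert tm.is_quantized is True

def test_config_only_model_reports_unquantized():
    tm = TurboModel(FakeModel())  # random weights, no quantization_config
    assert tm.is_quantized is False

def test_override_setter_wins():
    tm = TurboModel(FakeModel())
    tm.is_quantized = True  # e.g. set by from_gguf
    assert tm.is_quantized is True
```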
Docs

- `docs/guide/loading-models.md` updated to advertise the now-automatic fallbacks, pre-quantized repo detection, and the `report()` API (a hedged usage sketch follows this list).
- `docs/guide/consumer-hardware.md` (new) — per-tier guidance for CPU-only, Apple Silicon, ≤ 8 GB / 12–24 GB / multi-GPU, plus a "when QuantLLM cannot quantize" troubleshooting table.
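A hedged usage sketch of the `report()` flow: the `turbo()` entry point and the field names come from this PR, but the import path and repo id are placeholders, and the values depend on your hardware:

```python
from quantllm import turbo  # assumed import path

model = turbo("some-org/some-model-AWQ")  # placeholder repo id
snapshot = model.report()

print(snapshot["quant_method"])           # e.g. "awq" for a pre-quantized repo
print(snapshot["effective_loading_bits"], snapshot["device"], snapshot["dtype"])
assert snapshot["is_quantized"] == model.is_quantized
```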
Review & Testing Checklist for Human

- Check the `is_quantized` property semantics. `_is_quantized` used to be a plain attribute; it's now backed by a property + setter that writes `_is_quantized_override`. Any third-party code that pickled a `TurboModel` from `v2.1.0rc1` will need a re-instantiation, but in-process behaviour is unchanged for the existing call sites.
- Smoke-test a new architecture (e.g. a `Qwen3` or `Phi-4` checkpoint) on a real GPU — the unit tests stub out `transformers` entirely. The fallback-table changes are believed correct from the existing PR #27 ("Add architecture registration + fallback loading path for newly released HF model types") logic, but the only CI we have is unit-level.
- Verify that `tests/test_quantization_state.py::test_runtime_quantization_property_reads_model_config` matches your real GPTQ / AWQ workflow — it uses a mock `quant_method='gptq'`. If your team has a known-good GPTQ repo, run `turbo("that-repo")` and confirm `model.report()['quant_method']` matches.
- Confirm the `DEFAULT_ARCHITECTURE_FALLBACKS` mappings are sane for your supported models. I made conservative choices (`qwen3` → `qwen2`, `phi4` → `phi3`, `gemma3` → `gemma2`, `deepseek_v3` → `llama`) but architectures can drift in ways that break weight-loading even for similar-looking families.
- Confirm `.github/workflows/ci.yml` runs on this PR — the lint job, all three test matrices, and the build job should pass before merging. CI was effectively absent from the repo before, so this is the first run.
Notes

The goal of this pass is that `turbo()` correctly loads and quantizes any HuggingFace model on any consumer hardware. Followups that I'd file separately:

- Tag and release `v2.1.0rc1` (the version is in `setup.py` but no git tag / GitHub release exists).
- Flip `trust_remote_code` to `False` by default to match HF's own default.
- Split `quantllm/core/turbo_model.py` (2400+ lines) into `loading.py`, `export/`, `finetune.py`, `registry.py` for reviewability.
- Surface the pre-quantized repo-name detection (`*-bnb-4bit` etc.) and PowerInfer-style "honest hybrid behaviour" via the new `report()` API for users on mixed CPU/GPU setups.

Link to Devin session: https://app.devin.ai/sessions/5f080418f8004a8cba7358b93abdabcf
Requested by: @codewithdark-git