
Production-hardening: fix v2.1.0rc1 quantization-state and architecture-fallback bugs#28

Merged
codewithdark-git merged 2 commits into main from devin/1777241535-quantllm-production-hardening on Apr 27, 2026

Conversation


devin-ai-integration (Bot) commented on Apr 27, 2026

Summary

Production-hardening pass on top of v2.1.0rc1 (PR #27). Fixes the
"latest HuggingFace model is properly quantized but not loaded" symptom
reported on the most recent PR, plus the related architecture-fallback
regressions, and adds the missing CI / packaging / docs scaffolding so
those bugs can't silently come back.

Correctness fixes

1. Bug: from_config_only=True returned a random-weights model but left _is_quantized=True, so callers thought the model was both quantized and loaded.
   Fix: _is_quantized is now a derived property reading model.config.quantization_config and BitsAndBytes layer types at call time; from_config_only=True correctly reports False and warns.
2. Bug: A missing bitsandbytes install silently fell through to full precision but kept _is_quantized=True.
   Fix: The warning now describes the install command and the property reports False.
3. Bug: Pre-quantized HF repos (GPTQ / AWQ / etc.) were not recognised when the user passed quantize=False.
   Fix: _has_runtime_quantization() looks at quantization_config.quant_method and the actual layer types.
4. Bug: DEFAULT_ARCHITECTURE_FALLBACKS was dead code: it was only consulted when the HF config returned an empty model_type, which never happens in practice.
   Fix: resolve_model_type now consults the table directly and recognises version-suffix patterns: qwen3 → qwen2, llama4 → llama, phi4 → phi3, gemma3 → gemma2, etc.
5. Bug: register_architecture("newmodel", base_model_type="llama", model_class=NewCls) stored the class under "newmodel" but looked it up under "llama", so the fallback path silently ignored it.
   Fix: Lookup tries the original config.model_type first, then falls back to the resolved base family.
6. Bug: Duplicated if is_bnb and is_8bit ... block in the existing-quant detection branch.
   Fix: Removed.
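
To make the new state tracking concrete, here is a minimal sketch of the derived-property idea behind fixes 1-3 (and of the override setter mentioned in the review checklist below). It assumes the loaded HF model sits on self.model; apart from _is_quantized_override, quantization_config, and the BitsAndBytes layer class names, the identifiers are illustrative rather than the exact QuantLLM implementation.

```python
class TurboModel:
    # Loading code elided; assume the underlying HF model is stored on self.model.

    @property
    def is_quantized(self) -> bool:
        """Derived from the loaded model at call time instead of a cached flag."""
        # Opt-in override slot (used e.g. by from_gguf, where the HF config
        # carries no quantization_config).
        override = getattr(self, "_is_quantized_override", None)
        if override is not None:
            return override
        model = getattr(self, "model", None)
        if model is None:
            return False
        # Pre-quantized repos (GPTQ / AWQ / bnb exports) expose quantization_config.
        if getattr(model.config, "quantization_config", None) is not None:
            return True
        # Runtime bitsandbytes quantization shows up in the layer types.
        bnb_layer_names = {"Linear4bit", "Linear8bitLt"}
        return any(type(m).__name__ in bnb_layer_names for m in model.modules())

    @is_quantized.setter
    def is_quantized(self, value: bool) -> None:
        # Explicit override; normal loading paths never touch this.
        self._is_quantized_override = bool(value)
```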

Robustness for new architectures and consumer hardware

  • Expanded DEFAULT_ARCHITECTURE_FALLBACKS to 39 entries covering Llama 2/3/4, Mistral / Mixtral, Qwen 2 / 2-MoE / 3, Phi / Phi-3 / Phi-4, Gemma / Gemma 2 / Gemma 3, Falcon, Cohere / Command-R, DeepSeek (V2/V3), OLMo / OLMo 2, SmolLM / SmolLM 2 / SmolLM 3, Yi, StarCoder / StarCoder 2, InternLM / InternLM 2, Baichuan, ChatGLM, StableLM. A simplified sketch of the resolution and repo-name heuristics follows this list.
  • Pre-quantized repo detection. Names matching *-bnb-4bit, *-bnb-8bit, *-AWQ, *-GPTQ, *-INT4, *-FP8, *-EETQ, *-HQQ, *-AQLM log a friendly hint that the embedded quantization_config will be honoured.
  • GGUF-only repo hint. *-gguf / .gguf repo names now point users at from_gguf instead of silently failing.
  • TurboModel.report() returns a structured snapshot (model_id, params_billion, requested_bits, effective_loading_bits, is_quantized, quant_method, device, dtype, finetuned, lora_applied) so users on any hardware can verify what actually got loaded.
  • TurboModel.is_quantized is now a public property and the canonical answer, rather than an instance flag that can drift.
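
The following is a simplified sketch of the two loading heuristics above. The table entries shown are a small subset of the real 39-entry fallback table, and the helper names and exact suffix handling are illustrative rather than the QuantLLM code itself.

```python
import re

# Small subset of the real fallback table; keys are HF config.model_type values.
DEFAULT_ARCHITECTURE_FALLBACKS = {
    "qwen3": "qwen2",
    "llama4": "llama",
    "phi4": "phi3",
    "gemma3": "gemma2",
    "deepseek_v3": "llama",
}

def resolve_model_type(model_type: str) -> str:
    """Map an unknown HF model_type onto a known base family."""
    if model_type in DEFAULT_ARCHITECTURE_FALLBACKS:
        return DEFAULT_ARCHITECTURE_FALLBACKS[model_type]
    # Version-suffix heuristic for families not yet in the table:
    # strip a trailing version marker ("newmodel_v2" -> "newmodel") and retry.
    stripped = re.sub(r"[_-]?v?\d+$", "", model_type)
    return DEFAULT_ARCHITECTURE_FALLBACKS.get(stripped, model_type)

# Repo-name hints for pre-quantized and GGUF-only repos.
PREQUANT_SUFFIXES = ("-bnb-4bit", "-bnb-8bit", "-awq", "-gptq", "-int4",
                     "-fp8", "-eetq", "-hqq", "-aqlm")

def repo_name_hint(repo_id: str) -> str | None:
    name = repo_id.lower()
    if name.endswith(("-gguf", ".gguf")):
        return "Looks like a GGUF-only repo; load it with from_gguf()."
    if name.endswith(PREQUANT_SUFFIXES):
        return "Pre-quantized repo detected; its embedded quantization_config will be honoured."
    return None
```

For example, repo_name_hint("unsloth/llama-3-8b-bnb-4bit") would surface the pre-quantized hint described above, while an unrecognised name stays silent.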

Production hygiene

  • .github/workflows/ci.yml — runs ruff + pytest on Python 3.10/3.11/3.12 and validates the build with python -m build + twine check. PR #27 (Add architecture registration + fallback loading path for newly released HF model types) was merged with no real CI; this fixes that.
  • pyproject.toml — PEP 517/518 build metadata, conservative ruff lint profile (only blocker-class rules so we don't impose a giant reformat diff), pytest defaults.
  • .pre-commit-config.yaml — local enforcement (whitespace, EOF fixer, large-file guard, ruff with autofix).
  • CHANGELOG.md — full record of every change above.

Tests

All 25 tests pass locally (12 existing + 13 new):

  • tests/test_quantization_state.py — is_quantized derivation, from_config_only honesty, report() schema, override-setter contract.
  • tests/test_resolve_model_type.py — fallback-table consultation, family-suffix matching, registry-class lookup ergonomics, override precedence.

Docs

  • docs/guide/loading-models.md updated to advertise the now-automatic fallbacks, pre-quantized repo detection, and report() API.
  • docs/guide/consumer-hardware.md (new) — per-tier guidance for CPU-only, Apple Silicon, ≤ 8 GB / 12–24 GB / multi-GPU, plus a "when QuantLLM cannot quantize" troubleshooting table.

Review & Testing Checklist for Human

  • Sanity-check the is_quantized property semantics. _is_quantized used to be a plain attribute; it's now backed by a property + setter that writes _is_quantized_override. Any third-party code that pickled a TurboModel from v2.1.0rc1 will need a re-instantiation, but in-process behaviour is unchanged for the existing call sites.
  • Run a real end-to-end load on at least one currently-failing model (e.g. a Qwen3 or Phi-4 checkpoint) on a real GPU — the unit tests stub out transformers entirely. The fallback table changes are believed correct from the existing PR #27 (Add architecture registration + fallback loading path for newly released HF model types) logic, but the only CI we have is unit-level.
  • Verify the new tests/test_quantization_state.py::test_runtime_quantization_property_reads_model_config matches your real GPTQ / AWQ workflow — it uses a mock quant_method='gptq'. If your team has a known-good GPTQ repo, run turbo("that-repo") and confirm model.report()['quant_method'] matches (a verification sketch follows this checklist).
  • Confirm the DEFAULT_ARCHITECTURE_FALLBACKS mappings are sane for your supported models. I made conservative choices (qwen3→qwen2, phi4→phi3, gemma3→gemma2, deepseek_v3→llama) but architectures can drift in ways that break weight-loading even for similar-looking families.
  • CI workflow: the new .github/workflows/ci.yml will run on this PR — confirm the lint job, all three test matrices, and the build job pass before merging. CI was effectively absent from the repo before, so this is the first run.
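
For the manual end-to-end check, a snippet along these lines is enough to see what actually got loaded; the import path and model ID are placeholders, while turbo(), report(), and is_quantized are the APIs described in this PR.

```python
from quantllm import turbo  # import path assumed; adjust to the actual package layout

# Placeholder ID: use whichever Qwen3 / Phi-4 / GPTQ checkpoint your team trusts.
model = turbo("Qwen/Qwen3-8B")

snapshot = model.report()
print(snapshot["effective_loading_bits"], snapshot["quant_method"], snapshot["device"])

# The property and the snapshot should agree; a mismatch would suggest the
# old flag-drift bug has resurfaced.
if model.is_quantized != bool(snapshot["quant_method"]):
    print("Warning: is_quantized and report() disagree")
```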

Notes

  • The PR is intentionally scoped to one coherent context: making turbo() correctly load and quantize any HuggingFace model on any consumer hardware. Followups that I'd file separately:
    • Tag v2.1.0rc1 (the version is in setup.py but no git tag / GitHub release exists).
    • Decide whether to flip trust_remote_code to False by default to match HF's own default.
    • Split quantllm/core/turbo_model.py (2400+ lines) into loading.py, export/, finetune.py, registry.py for reviewability.
  • Studied Unsloth and PowerInfer for behavioural cues but deliberately did not pull in their kernels — the bug surface I'm fixing is in QuantLLM's loader/state-tracking, not its kernels. Two ideas that did make it in: Unsloth-style pre-quantized repo detection (*-bnb-4bit etc.) and PowerInfer-style "honest hybrid behaviour" via the new report() API for users on mixed CPU/GPU setups.
  • Linked review report attached in the original session — the bug numbering above maps directly to §2 / §3 of that report.

Link to Devin session: https://app.devin.ai/sessions/5f080418f8004a8cba7358b93abdabcf
Requested by: @codewithdark-git



Commit message: Production-hardening: fix v2.1.0rc1 quantization-state and architecture-fallback bugs

Reflects the issues identified in the codebase review on top of PR #27.

Correctness fixes
-----------------
* TurboModel._is_quantized is now a property derived from the loaded
  model's config.quantization_config and BitsAndBytes layer types,
  with an opt-in override slot used by from_gguf. This fixes:
  - from_config_only=True returning a random-weights model that was
    misreported as quantized;
  - missing bitsandbytes installs falling through silently while the
    flag stayed True;
  - pre-quantized HF repos (GPTQ/AWQ/etc.) not being recognized when
    the user passed quantize=False.
* resolve_model_type now consults DEFAULT_ARCHITECTURE_FALLBACKS for
  unknown HF model_types and recognizes version-suffix patterns
  (qwen3 -> qwen2, llama4 -> llama, phi4 -> phi3, gemma3 -> gemma2,
  ...). The old logic only consulted the table when the config's
  model_type was empty, which never happens in practice.
* register_architecture(model_class=...) is now discoverable under
  the original architecture name as well as the resolved base family,
  matching the documented API.
* Removed an accidentally duplicated 'if is_bnb and is_8bit ...'
  block in the existing-quant detection branch.

Robustness for new architectures and consumer hardware
-----------------------------------------------------
* Greatly expanded DEFAULT_ARCHITECTURE_FALLBACKS (Llama 2/3/4, Qwen
  2/2-MoE/3, Phi/3/4, Gemma/2/3, DeepSeek V2/V3, Cohere/Command-R,
  OLMo/2, SmolLM/2/3, Yi, StarCoder/2, InternLM/2, Baichuan, ChatGLM,
  StableLM, Falcon).
* Pre-quantized HF repo names (Unsloth-style *-bnb-4bit, *-AWQ,
  *-GPTQ, *-INT4, *-FP8, etc.) are detected and surfaced as a hint;
  the embedded quantization_config is honoured.
* GGUF-only repo names trigger a friendly hint pointing at from_gguf.
* New TurboModel.report() returns a structured snapshot of the actual
  loaded model state (quant_method, device, dtype, params_billion).
* TurboModel.is_quantized public property is the canonical answer
  rather than an instance flag that could drift.

Production hygiene
------------------
* New .github/workflows/ci.yml runs ruff + pytest on Python
  3.10/3.11/3.12 and validates the build with python -m build / twine check.
* New pyproject.toml provides PEP 517/518 build metadata plus a
  conservative ruff lint profile (only blocker-class rules) and
  pytest defaults.
* New .pre-commit-config.yaml for local pre-commit enforcement.
* New CHANGELOG.md documenting every change.

Tests
-----
* tests/test_quantization_state.py covers the from_config_only and
  is_quantized property fixes, the report() schema, and the override
  setter.
* tests/test_resolve_model_type.py covers the fallback-table
  consultation, family-suffix matching, and registry-class lookup
  ergonomics.

Docs
----
* docs/guide/loading-models.md updated to reflect the now-automatic
  fallbacks, the pre-quantized repo detection, and report().
* docs/guide/consumer-hardware.md added with per-tier guidance for
  CPU-only, Apple Silicon, 4-8 GB / 12-24 GB / multi-GPU.

devin-ai-integration Bot commented Apr 27, 2026

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Follow-up commit message: The matrix tests were failing with ModuleNotFoundError because we only
installed runtime deps but never installed the quantllm package
itself. Use `pip install --no-deps -e .` so the package is importable
without re-resolving the heavy (GPU-only) dependency set.

devin-ai-integration (Bot) left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


codewithdark-git merged commit dd07b0d into main on Apr 27, 2026
6 checks passed