Auto-Adaptive Mixed-Precision Quantization (Target BPW) #18531
EAddario started this conversation in Show and tell
PR #15550 introduces an optimization routine that dynamically determines an optimal per-tensor quantization type mix to achieve a user-specified global bits-per-weight (bpw) target (e.g., `--target-bpw 4.5678`). Instead of relying on heuristic presets (like `Q4_K_M`), the function solves a constrained optimization problem: minimize total quantization error subject to a total size budget, by measuring the sensitivity of each layer and dynamically allocating the "bit budget" where it matters most.
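Stated a bit more formally (illustrative notation, not taken from the PR): for each tensor $t$ the routine picks a quantization type $q_t$ from the candidate set, and the allocation problem is roughly

$$
\min_{\{q_t\}} \; \sum_t E_t(q_t) \quad \text{s.t.} \quad \frac{\sum_t n_t \,\mathrm{bpw}(q_t)}{\sum_t n_t} \le \mathrm{bpw}_{\text{target}}
$$

where $E_t(q)$ is the estimated quantization error of tensor $t$ under type $q$ and $n_t$ is its number of weights.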
How it Works
Sensitivity Analysis:
The algorithm samples every tensor in the model, quantizes these samples into various formats (from `IQ1_S` up to `Q8_0`), and compares them against the original FP16/FP32 weights.
Pareto Optimization:
For each tensor, the algorithm builds a "size vs. error" curve and discards inefficient formats (formats that take up more space without offering better accuracy for that specific layer).
Global Resource Allocation (Lagrangian Solver):
The algorithm solves a global optimization problem: "How do we distribute X bits across Y tensors to minimize total error?" Error-sensitive tensors (such as `output.weight` and the attention `v` weights) generally receive higher bit-precision.
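To make steps 2 and 3 concrete, here is a minimal Python sketch of the idea. It is illustrative only: the names (`Candidate`, `pareto_prune`, `allocate`) and the greedy upgrade loop are my own simplification, not the PR's C++ implementation, which describes a Lagrangian-style solver over the same kind of per-tensor size/error candidates.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    fmt: str      # quantization type name, e.g. "IQ2_XS", "Q4_K", "Q8_0"
    nbytes: int   # size of this tensor if stored in fmt
    error: float  # estimated quantization error for this tensor in fmt

def pareto_prune(cands):
    """Keep only candidates that are not dominated
    (no other candidate is both smaller and lower-error)."""
    cands = sorted(cands, key=lambda c: (c.nbytes, c.error))
    kept, best_err = [], float("inf")
    for c in cands:
        if c.error < best_err:          # error must strictly improve as size grows
            kept.append(c)
            best_err = c.error
    return kept

def allocate(tensors, target_bpw):
    """tensors: dict name -> (n_weights, [Candidate, ...]).
    Start every tensor at its smallest candidate, then repeatedly apply the
    upgrade with the best error reduction per extra byte until the global
    byte budget implied by target_bpw is exhausted."""
    frontiers = {n: pareto_prune(c) for n, (_, c) in tensors.items()}
    choice = {n: 0 for n in frontiers}                      # index into frontier
    total_weights = sum(nw for nw, _ in tensors.values())
    budget = target_bpw * total_weights / 8                 # bytes allowed
    used = sum(frontiers[n][0].nbytes for n in frontiers)

    while True:
        best = None
        for n, f in frontiers.items():
            i = choice[n]
            if i + 1 >= len(f):
                continue
            extra = f[i + 1].nbytes - f[i].nbytes
            gain = f[i].error - f[i + 1].error
            if used + extra <= budget and (best is None or gain / extra > best[0]):
                best = (gain / extra, n, extra)
        if best is None:
            break
        _, n, extra = best
        choice[n] += 1
        used += extra
    return {n: frontiers[n][choice[n]].fmt for n in frontiers}
```

The actual routine additionally weights errors by tensor importance (configurable via the `--no-importance` option discussed below) and has to respect ggml's per-format constraints, but the core size/error trade-off it resolves has this shape.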
Advantages
Target arbitrary size models
Data-driven mixed precision can often improve quality at fixed size
Rather than following fixed rules (e.g. bumping `attn_v` to `Q5_K` for a 70B model) that may be sub-optimal for a given architecture or size, the quantization mix is determined by the actual error sensitivity of the specific model's weights. This often yields a better quality/size trade-off, especially in aggressive quantization scenarios (1.5 to 3.5 bpw) or for unusual architectures.
Allows better like-for-like comparisons between models and families
Standard quantization uses hardcoded rules like: "use Q4_K_M, except bump some tensors up/down, except fall back if incompatible, except keep some tensors unquantized..." and for that reason, two different models quantized with the same Q4_K_M type can end up with very different bpw (e.g. 4.75 and 4.30).
All things being equal, the performance of a model is usually proportional to its overall bpw; models with a higher bpw tend to perform better than lower bpw models. Since the higher-bpw model has simply been given more bits, it will typically perform better (lower perplexity, better eval scores, etc.) even if the underlying quantization method is identical. That makes the comparison not a controlled experiment, because it is between models with different effective compression ratios.
`--target-bpw` tries to address that by making the experiment more controlled: each model gets quantized to land on (approximately) the same global byte budget, so that performance differences are more attributable to architecture/training differences, quantization error behaviour at the same compression ratio, the optimizer's allocation decisions, etc.
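As a quick illustration of the arithmetic behind the comparison above (numbers are made up, not from the tests): bpw is roughly the file size in bits divided by the parameter count, so the same preset applied to models with different tensor shapes lands at different ratios.

```python
# Hypothetical numbers, for illustration only.
def effective_bpw(file_size_bytes: int, n_params: int) -> float:
    """Approximate bits-per-weight, ignoring GGUF metadata overhead."""
    return file_size_bytes * 8 / n_params

model_a = effective_bpw(file_size_bytes=4_750_000_000, n_params=8_000_000_000)
model_b = effective_bpw(file_size_bytes=4_300_000_000, n_params=8_000_000_000)
print(f"model A: {model_a:.2f} bpw, model B: {model_b:.2f} bpw")  # ~4.75 vs ~4.30
```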
Disadvantages
Quantization process is significantly slower than standard
This approach can take 5x-10x longer as it quantizes a sample of most tensors into 15 different formats, dequantizes them back to floats, computes error diffs, and selects the best size/error option that fits the global bpw budget.
However, the `--keep-bpw-state` option will save the above-mentioned computations to disk so that future quantizations, in the permissible bpw range for the same model, can be generated at normal speed. It also allows interrupting the computation and resuming it at a later time.
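The expensive part is the per-tensor error table described above; below is a minimal Python sketch of that loop and of caching it to disk. It is illustrative only: `fake_quantize` is a stand-in uniform quantizer (the real code round-trips through ggml's block-wise formats), and the on-disk format of `--keep-bpw-state` is the PR's own, not JSON.

```python
import json
import numpy as np

def fake_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Stand-in for a real quantize/dequantize round trip:
    symmetric uniform quantization to 2**bits levels."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) or 1.0
    return np.round(x / scale) * scale

def error_table(tensors: dict[str, np.ndarray], bit_widths=(2, 3, 4, 5, 6, 8)):
    """For each tensor, estimate the error of each candidate width
    on a sample of rows (here: mean squared difference)."""
    table = {}
    for name, w in tensors.items():
        sample = w[:: max(1, w.shape[0] // 32)]        # sample rows, not all of them
        table[name] = {
            bits: float(np.mean((sample - fake_quantize(sample, bits)) ** 2))
            for bits in bit_widths
        }
    return table

def save_state(table, path="bpw_state.json"):
    with open(path, "w") as f:
        json.dump(table, f)

def load_state(path="bpw_state.json"):
    try:
        with open(path) as f:
            return {n: {int(b): e for b, e in t.items()} for n, t in json.load(f).items()}
    except FileNotFoundError:
        return None
```

With a cached table, re-running for a different target only needs the allocation step, which is why subsequent quantizations run at normal speed.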
The optimization target is only a proxy for the model's performance quality
The process minimizes a per-tensor estimated error computed from sampled rows, not actual perplexity or divergence of output distributions (a future version may address this). Since errors interact nonlinearly across layers, there are no guarantees it will select the best possible quantization recipe subject to the bpw size constraint.
Furthermore, the process can operate in two modes: giving priority to important tensors (default) or treating each tensor equally (setting the `--no-importance` option). To my knowledge, there is no computationally feasible way to determine ahead of time which mode will yield better results, and two runs per model may be needed to obtain the best quality, but the default mode usually wins.
An imatrix with activations data is required for best results
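The reason activation data matters is that it lets the error estimate emphasise the weight columns that actually see large activations. A hedged sketch of that kind of weighting (assuming a weighted squared-error proxy; the PR's exact formula may differ, and `act_sq` here stands for the per-column mean squared activations an imatrix records):

```python
import numpy as np

def weighted_error(w: np.ndarray, w_deq: np.ndarray, act_sq: np.ndarray) -> float:
    """Squared reconstruction error between original and dequantized weights,
    weighted per input column by its recorded activation magnitude.
    Shapes: w, w_deq are [rows, cols]; act_sq is [cols]."""
    diff2 = (w - w_deq) ** 2                  # [rows, cols]
    return float((diff2 * act_sq).sum() / (act_sq.sum() * w.shape[0]))
```

Without an imatrix every column weighs the same (plain MSE), which is why results are best when activation data is available.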
Test results
Based on 132 tests with models from 11 different families, the `target_bpw_type()` optimization routine generated better quality models in 96 cases (~70%), and the same quality as standard quantization in 10 (~8%). However, even though the method often produced better quality, it lost in surprising cases: naive quants performed better in the remaining 25 tests (20%), sometimes by a significant margin (e.g. ERNIE-4.5-21B-A3B-PT-IQ1_M, granite-4.0-h-tiny-IQ2_M, granite-4.0-h-tiny-IQ1_M).
Of the 96 cases where it performed better, about 1/3 achieved higher scores when using the `--no-importance` option, forcing the algorithm to treat each tensor equally instead of prioritising some (e.g. attn_output, ffn_down, etc.).
Target BPW test results
Using `Cor(ln(PPL(Q)), ln(PPL(base)))` as the discriminant metric
Usage examples
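As a hedged illustration (the exact argument order, defaults, and whether a fallback quantization type is still required have not been verified against the PR; only `llama-quantize`, `--imatrix`, and the flags discussed above come from the source): an invocation along the lines of `llama-quantize --imatrix model.imatrix --target-bpw 4.5 model-F16.gguf model-bpw-4.5.gguf` would quantize the model to roughly 4.5 bits per weight, optionally adding `--keep-bpw-state` to cache the per-tensor error computations for later runs at other targets.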