Task
Replace fixed ternary thresholding with learned per-layer scaling factors.
Current: symmetric {-1, 0, +1}. New: asymmetric {-W_n, 0, +W_p} per layer.
Scientific Background
TTQ Paper (Zhu et al., 2016)
- Learns separate positive/negative scaling: {-W_n^l, 0, +W_p^l}
- Exceeds full-precision accuracy on CIFAR-10 (ResNet-32/44/56 by 0.04-0.36%)
- 3% higher accuracy than fixed TWN on ImageNet
- Dual gradient: one to the latent full-precision weights (which determine the ternary assignment), one to the scaling factors W_p^l and W_n^l (which learn the quantized values)
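The dual gradient above can be sketched in Zig as follows. This is a minimal sketch rather than the project's implementation: the names `Scales`, `ScaleGrads`, `ttqBackward`, and `delta` are hypothetical. The latent-weight rule (scale the incoming gradient by W_p, 1, or W_n depending on the bucket) follows my reading of the TTQ paper. The sign of the W_n gradient is derived by the chain rule from the {-W_n, 0, +W_p} convention used in this document.

```zig
const std = @import("std");

/// Per-layer scaling factors for the {-W_n, 0, +W_p} convention (hypothetical type).
const Scales = struct {
    w_pos: f32,
    w_neg: f32,
};

/// Accumulated gradients with respect to the two scaling factors (hypothetical type).
const ScaleGrads = struct {
    d_w_pos: f32 = 0.0,
    d_w_neg: f32 = 0.0,
};

/// Backward pass for one layer.
/// `latent` holds the full-precision weights, `grad_q` holds dL/dw_q for the
/// quantized weights, and `grad_latent` receives the gradient for the latent weights.
/// `delta` is the ternary boundary, assumed fixed during this pass.
fn ttqBackward(
    latent: []const f32,
    grad_q: []const f32,
    grad_latent: []f32,
    scales: Scales,
    delta: f32,
) ScaleGrads {
    var out = ScaleGrads{};
    for (latent, grad_q, grad_latent) |w, g, *gl| {
        if (w > delta) {
            out.d_w_pos += g; // w_q = +W_p, so d(w_q)/d(W_p) = +1
            gl.* = scales.w_pos * g;
        } else if (w < -delta) {
            out.d_w_neg -= g; // w_q = -W_n, so d(w_q)/d(W_n) = -1
            gl.* = scales.w_neg * g;
        } else {
            gl.* = g; // zero bucket: plain straight-through gradient
        }
    }
    return out;
}
```

The returned `ScaleGrads` would feed the optimizer step for the two scaling factors, while `grad_latent` updates the full-precision weights as usual.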
Gradient-Corrected STE (CVPR 2019, He et al.)
- Scale STE gradient by 1/S_l (inverse of scaling factor)
- Faster convergence than naive unit gradient
- Properly accounts for scaling factor × quantization interaction
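The 1/S_l correction amounts to one extra multiply in the backward pass. The sketch below is an assumption about how it could look here: `correctedSteGrad` and the `eps` guard against a near-zero scale are hypothetical, and which scale to pass for an asymmetric layer (w_pos, w_neg, or some combination) is left to the caller.

```zig
const std = @import("std");

/// Scale a straight-through gradient by the inverse of the layer's scaling factor.
/// `eps` is a hypothetical lower bound that avoids dividing by a near-zero scale.
fn correctedSteGrad(grad_q: []const f32, grad_latent: []f32, scale: f32, eps: f32) void {
    const s = if (scale > eps) scale else eps;
    const inv = 1.0 / s;
    for (grad_q, grad_latent) |g, *gl| {
        gl.* = g * inv; // naive STE would copy g unchanged
    }
}
```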
FOGZO (NeurIPS 2025)
- First-Order-Guided Zeroth-Order gradient descent
- Reduces STE bias while keeping cost low
- 1-22 PPL improvement for quantized LMs vs baseline STE
- Good for final training phases
Implementation
// Per-layer learned thresholds
const LayerQuantParams = struct {
    w_pos: f32 = 1.0, // positive scaling
    w_neg: f32 = 1.0, // negative scaling
    threshold: f32 = 0.5, // ternary boundary
};

fn ternarize(w: f32, params: LayerQuantParams) i2 {
    if (w > params.threshold) return 1; // maps to +w_pos
    if (w < -params.threshold) return -1; // maps to -w_neg
    return 0;
}
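To make the "maps to" comments concrete, the sketch below pairs `ternarize` with a `dequantize` helper that turns a ternary code back into its scaled value. `dequantize` is a hypothetical name that does not appear in the task description. The struct and `ternarize` are repeated so the block compiles on its own.

```zig
const std = @import("std");

const LayerQuantParams = struct {
    w_pos: f32 = 1.0, // positive scaling
    w_neg: f32 = 1.0, // negative scaling
    threshold: f32 = 0.5, // ternary boundary
};

fn ternarize(w: f32, params: LayerQuantParams) i2 {
    if (w > params.threshold) return 1;
    if (w < -params.threshold) return -1;
    return 0;
}

/// Hypothetical helper: map a ternary code to the asymmetric value it stands for.
fn dequantize(code: i2, params: LayerQuantParams) f32 {
    return switch (code) {
        1 => params.w_pos,
        -1 => -params.w_neg,
        else => 0.0, // code 0 (and the unused i2 value -2) map to zero
    };
}
```

With `w_pos = 0.7`, `w_neg = 0.3`, and `threshold = 0.5`, a weight of 0.9 dequantizes to 0.7, a weight of -0.9 to -0.3, and a weight of 0.2 to 0.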
// Gradient for threshold:
// dL/d_threshold = sum of dL/dw for weights near boundary
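One possible reading of the comment above, sketched under stated assumptions: the quantizer is piecewise constant in the threshold, so its exact derivative is zero almost everywhere and some surrogate is needed. The version below assumes a box surrogate of half-width `band` around each boundary. Raising the threshold flips weights near +threshold from +w_pos to 0 and weights near -threshold from -w_neg to 0, which fixes the signs. `thresholdGrad` and `band` are hypothetical and do not come from the cited papers.

```zig
const std = @import("std");

/// Surrogate dL/d_threshold for one layer, using a box window of half-width `band`.
/// `grad_q` holds dL/dw_q for the quantized weights.
fn thresholdGrad(
    weights: []const f32,
    grad_q: []const f32,
    w_pos: f32,
    w_neg: f32,
    threshold: f32,
    band: f32,
) f32 {
    var acc: f32 = 0.0;
    for (weights, grad_q) |w, g| {
        const d_pos = w - threshold; // distance to the positive boundary
        const d_neg = w + threshold; // distance to the negative boundary
        if (d_pos > -band and d_pos < band) {
            acc -= g * w_pos; // w_q would drop from +w_pos to 0
        } else if (d_neg > -band and d_neg < band) {
            acc += g * w_neg; // w_q would rise from -w_neg to 0
        }
    }
    return acc / (2.0 * band); // normalize by the window width
}
```

Weights outside both windows contribute nothing, which matches the "near boundary" wording. The choice of `band` trades gradient noise against bias and would need tuning.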
Changes
- src/hslm/trainer.zig: per-layer QuantParams struct, gradient update
- src/hslm/quantize.zig: asymmetric ternarization with learned thresholds
- 6 additional scaling params total (w_pos and w_neg per layer × 3 layers), or 9 if threshold is also learned per layer; negligible overhead either way
Expected
- 2-5% PPL improvement from better quantization adaptation
- Especially impactful in early layers (embedding projection)
- Compounds with OHEM: better thresholds → better hard example selection
References