feat(hslm): TTQ — Trained Ternary Quantization with learned thresholds per layer

## Task
Replace fixed ternary thresholding with learned per-layer scaling factors.
Current: symmetric {-1, 0, +1}. New: asymmetric {-W_n, 0, +W_p} per layer.

## Scientific Background

### TTQ Paper (Zhu et al., 2016)
- Learns separate positive/negative scaling: {-W_n^l, 0, +W_p^l}
- **Exceeds full-precision accuracy** on CIFAR-10 (ResNet-32/44/56 by 0.04-0.36%)
- **3% higher accuracy** than fixed TWN on ImageNet
- Dual gradient: one to the latent full-precision weights (learns the ternary assignment), one to the scaling coefficients W_p^l and W_n^l (learns the ternary values)
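
Under the convention used in this issue, where a quantized weight $w^t_i$ in layer $l$ takes a value in $\{-W_n^l, 0, +W_p^l\}$, the gradient to each scaling coefficient follows from the chain rule. The index sets $I_p^l$ and $I_n^l$ (weights assigned to the positive and negative code) are notation introduced here for illustration, and the sign convention may differ from the paper's.

$$
\frac{\partial L}{\partial W_p^l} = \sum_{i \in I_p^l} \frac{\partial L}{\partial w^t_i},
\qquad
\frac{\partial L}{\partial W_n^l} = -\sum_{i \in I_n^l} \frac{\partial L}{\partial w^t_i}
$$

Weights quantized to 0 contribute nothing to either sum.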

### Gradient-Corrected STE (CVPR 2019, He et al.)
- Scale STE gradient by 1/S_l (inverse of scaling factor)
- Faster convergence than naive unit gradient
- Properly accounts for scaling factor × quantization interaction

### FOGZO (NeurIPS 2025)
- First-Order-Guided Zeroth-Order gradient descent
- Reduces STE bias while keeping cost low
- **1-22 PPL improvement** for quantized LMs vs baseline STE
- Good for final training phases

## Implementation
```zig
// Per-layer learned thresholds
const LayerQuantParams = struct {
    w_pos: f32 = 1.0,  // positive scaling
    w_neg: f32 = 1.0,  // negative scaling
    threshold: f32 = 0.5,  // ternary boundary
};

fn ternarize(w: f32, params: LayerQuantParams) i2 {
    if (w > params.threshold) return 1;   // maps to +w_pos
    if (w < -params.threshold) return -1; // maps to -w_neg
    return 0;
}

// Gradient for threshold:
// dL/d_threshold = sum of dL/dw for weights near boundary
```
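
A minimal sketch of how the two scaling factors might receive their gradients, assuming the effective weight is `+w_pos`, `0`, or `-w_neg` as the comments above suggest. `ScaleGrads`, `dequantize`, and `scaleGrads` are hypothetical names, not existing code in `src/hslm`, and the threshold is treated as fixed here.

```zig
const std = @import("std");

// Mirrors the struct proposed above.
const LayerQuantParams = struct {
    w_pos: f32 = 1.0,
    w_neg: f32 = 1.0,
    threshold: f32 = 0.5,
};

// Hypothetical container for the per-layer scale gradients.
const ScaleGrads = struct {
    d_w_pos: f32 = 0.0,
    d_w_neg: f32 = 0.0,
};

fn ternarize(w: f32, params: LayerQuantParams) i2 {
    if (w > params.threshold) return 1;
    if (w < -params.threshold) return -1;
    return 0;
}

// Map a ternary code back to the effective weight used in the forward pass.
fn dequantize(code: i2, params: LayerQuantParams) f32 {
    return switch (code) {
        1 => params.w_pos,
        -1 => -params.w_neg,
        else => 0.0,
    };
}

// Accumulate dL/d(w_pos) and dL/d(w_neg) from the gradient with respect to
// the quantized weights (grad_q). The minus sign for w_neg comes from the
// effective weight being -w_neg.
fn scaleGrads(latent: []const f32, grad_q: []const f32, params: LayerQuantParams) ScaleGrads {
    std.debug.assert(latent.len == grad_q.len);
    var g = ScaleGrads{};
    var i: usize = 0;
    while (i < latent.len) : (i += 1) {
        switch (ternarize(latent[i], params)) {
            1 => {
                g.d_w_pos += grad_q[i];
            },
            -1 => {
                g.d_w_neg -= grad_q[i];
            },
            else => {},
        }
    }
    return g;
}
```

The trainer would then apply an ordinary optimizer step to `w_pos` and `w_neg` using these sums. A gradient for `threshold` is not covered by this sketch.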

## Changes
- `src/hslm/trainer.zig`: per-layer QuantParams struct, gradient update
- `src/hslm/quantize.zig`: asymmetric ternarization with learned thresholds
- 6 additional scaling params total (`w_pos` and `w_neg` per layer × 3 layers), plus one `threshold` per layer if it is also trained; negligible overhead

## Expected
- **2-5% PPL improvement** from better quantization adaptation
- Especially impactful in early layers (embedding projection)
- Compounds with OHEM: better thresholds → better hard example selection

## References
- TTQ: https://arxiv.org/abs/1612.01064
- Gradient-corrected STE: https://openaccess.thecvf.com/content_CVPR_2019/papers/He_Simultaneously_Optimizing_Weight_and_Quantizer_of_Ternary_Neural_Network_Using_CVPR_2019_paper.pdf
- FOGZO: NeurIPS 2025 poster 117072
