Description
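The current MTP (Multi-Token Prediction) path is experimental and provides "at most a slight speedup." This issue proposes Draft KV Stitching to improve on it: a small draft model proposes several tokens, the main model verifies them in one pass, and the main model's KV cache is updated in place for the tokens that are accepted.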
Proposed Changes
- Small Draft Model Integration: Allow the engine to load a tiny (e.g., DeepSeek 1.3B) model purely for drafting.
- Speculative Verification: The main DS4 engine verifies 4-8 draft tokens in a single Metal graph pass.
- KV Alignment: Ensure that when a draft is accepted, the "compressed KV" of the main model is updated without a full re-prefill.
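
To make the three items above concrete, here is a minimal sketch of one greedy draft-and-verify step. It is an illustration under assumptions, not the engine's code: `speculative_step`, `toy_target`, `toy_draft`, and the `kv_keep` return value are hypothetical names, the models are stand-in functions that map a token list to the next token, and verification is written as a loop where a real engine would score all drafted positions in one batched pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: token sequence -> greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    target: NextToken, draft: NextToken, tokens: List[int], k: int = 4
) -> Tuple[List[int], int]:
    """Run one draft/verify round.

    Returns (new_tokens, kv_keep): the tokens to append to the output, and
    the number of main-model KV entries that remain valid afterwards.
    """
    # 1. The draft model proposes k tokens autoregressively.
    ctx = list(tokens)
    proposed: List[int] = []
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The main model checks each drafted position (looped here for clarity).
    ctx = list(tokens)
    new_tokens: List[int] = []
    n_accepted = 0
    for t in proposed:
        expected = target(ctx)
        if expected != t:
            # First mismatch: emit the main model's token and stop.
            new_tokens.append(expected)
            break
        new_tokens.append(t)
        ctx.append(t)
        n_accepted += 1
    else:
        # Every draft matched, so the main model contributes one extra token.
        new_tokens.append(target(ctx))

    # 3. KV alignment: keep entries for the prompt plus the accepted drafts;
    #    entries written for rejected drafts must be discarded.
    kv_keep = len(tokens) + n_accepted
    return new_tokens, kv_keep


# Toy models for demonstration only.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 100


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the last token is 2 mod 3.
    return (ctx[-1] + 2) % 100 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 100


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [0], k=4))
```

With greedy acceptance the emitted tokens match what the main model would have produced on its own, so the speedup depends entirely on how often the draft agrees. The sketch assumes KV entries can be dropped per token; whether the main model's "compressed KV" allows that kind of truncation is the open question the KV Alignment item would need to settle.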