v0.9.2 — Custom op decode, Dynamo-traceable fast path
Decode wrapped in torch.library.custom_op for CUDA graph readiness. 1.14 ms/call decode. Dynamo-traceable fast path. 45 tok/s single-request (4x over v0.8.1 eager).