feat(minimax-remover): add NVFP4 kernelized video inpainting pipeline #123
Performance highlights (RTX 5060 Ti, SM120, CUDA 13):
- End-to-end: 2.6× speedup (30.7s → 11.9s for 123 frames, 3 segments)
- RTF improvement: 6.0 → 2.3 (processing-time / clip-duration)
- Per-layer GEMM: 1.14–1.30× speedup vs fp16 matmul
- FP4 GEMM with quantize overhead: 4–9× faster than fp16 on large FFN projections
- Precision: PSNR 52.0 dB (mean) / 45.2 dB (worst frame) vs fp16

Optimizations:
- NVFP4 W4A4 quantization with dynamic per-call activation (no offline calibration)
- Fused LayerNorm + adaLN + gate-residual in single fp32-stat Triton kernel
- Fused FFN-up GEMM + bias + GELU → FP4 output (skip re-quantization for FFN-down)
- QKV quantize-once optimization (reuse quantized norm output for Q/K/V projections)
- Manual graph-capturable denoise loop with CUDA Graph support
- BF16 transformer (NVFP4-native, eliminates fp16↔bf16 casts)
- SageAttention / FA2 backend integration
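To make the "dynamic per-call activation" item concrete, here is a minimal pure-Python sketch of block-scaled FP4 fake quantization. It is not this PR's kernel code. It assumes the usual NVFP4 layout of E2M1 values in blocks of 16, and it simplifies by keeping each block scale as a plain Python float rather than rounding it to FP8, and by breaking rounding ties toward the smaller magnitude. The names `quantize_block` and `fake_quantize` are hypothetical.

```python
# Illustrative only: simulates NVFP4-style block quantization on a list of floats.
# Assumption: block scales stay as Python floats (real NVFP4 stores them in FP8).

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in E2M1
BLOCK = 16  # elements sharing one scale factor


def quantize_block(block):
    """Derive the scale from this block's own max, so no offline calibration is needed."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude onto 6.0
    codes = []
    for v in block:
        # Nearest grid point; ties go to the smaller magnitude in this sketch.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def fake_quantize(values):
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out = []
    for start in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[start:start + BLOCK])
        out.extend(scale * c for c in codes)
    return out
```

Because the widest gap in the E2M1 grid is between 4 and 6, the per-element error in this sketch is bounded by one sixth of the block's maximum magnitude.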
Thanks for the contribution. The direction is useful: adding MiniMax-Remover as an isolated model package, reusing existing generic NVFP4 kernels instead of adding model-specific CUDA operators, is the right long-term shape.

Review result: Request changes / blocker before merge.

I do not see evidence that this affects the default FlashRT core path directly: the PR adds new files only and does not modify CMake, bindings, or existing model routing. That part is good. The blockers are mostly packaging/import/docs/test contract issues.

Before merge, please fix these items:
What I verified locally:
For the maintenance standard we are applying, please refer to the public review checklist: https://github.com/flashrt-project/FlashRT/blob/main/docs/pr_review_checklist.md

The relevant sections are mainly import boundaries / optional dependencies, new model acceptance, docs matching actual build flags, and default-environment tests. Once these are fixed, the PR will be much easier to review on the actual model/runtime behavior.
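One way to check an import boundary of the kind mentioned above is to import a package in a fresh interpreter and see which optional modules ended up loaded. The sketch below is an assumption about how such a check could look, not the project's actual test. The helper name `leaked_imports` is hypothetical.

```python
import subprocess
import sys


def leaked_imports(module, forbidden):
    """Import `module` in a clean subprocess and return the `forbidden` names it pulled in."""
    probe = (
        "import importlib, sys\n"
        f"importlib.import_module({module!r})\n"
        f"print(','.join(m for m in {list(forbidden)!r} if m in sys.modules))\n"
    )
    # A separate interpreter keeps the result independent of what this process already imported.
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    names = result.stdout.strip()
    return names.split(",") if names else []
```

For this PR, a call such as `leaked_imports("flash_rt", ["sageattention"])` would be expected to return an empty list in a default environment.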
Performance highlights (RTX 5060 Ti, SM120, CUDA 13):
Optimizations:
- `FLASHRT_ATTN_MODE=fa2` uses the dependency-light FA2 path without importing `sageattention`

Validation:
- `PYTHONPATH=. PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q tests/test_minimax_remover_smoke.py`: 8 passed, 2 skipped
- `python -m compileall -q flash_rt/models/minimax_remover tests/test_minimax_remover_smoke.py examples/minimax_remover_quickstart.py`: passed
- `git diff --check`: passed
- `flash_rt`, `flash_rt.models.minimax_remover`, `_kernels`, and `pipeline`: passed
- `FLASHRT_ATTN_MODE=fa2` dispatch check with a stub `flash_rt_fa2`: passed without touching SageAttention
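The dispatch behavior described in the last item can be sketched as a lazy, environment-driven import. This is a minimal illustration, not the PR's code. The function name `resolve_attention_backend` is hypothetical, and so are the `sage` mode name and its use as the default. Only `FLASHRT_ATTN_MODE`, `fa2`, `flash_rt_fa2`, and `sageattention` come from the description above.

```python
import importlib
import os


def resolve_attention_backend(env=None):
    """Pick an attention backend module, importing only the one that was selected."""
    env = os.environ if env is None else env
    # Assumption: "sage" is the default mode; the PR may choose differently.
    mode = env.get("FLASHRT_ATTN_MODE", "sage").lower()
    if mode == "fa2":
        # The FA2 path never references sageattention, so it need not be installed.
        module = importlib.import_module("flash_rt_fa2")
    elif mode == "sage":
        module = importlib.import_module("sageattention")
    else:
        raise ValueError(f"unsupported FLASHRT_ATTN_MODE: {mode!r}")
    return mode, module
```

Keeping both imports inside the branches, rather than at module top level, is what lets the package import cleanly when only one backend is present.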