feat(minimax-remover): add NVFP4 kernelized video inpainting pipeline by chenping9999 · Pull Request #123 · flashrt-project/FlashRT

chenping9999 · 2026-07-01T08:18:56Z

Performance highlights (RTX 5060 Ti, SM120, CUDA 13):

End-to-end: 2.6x speedup (30.7s -> 11.9s for 123 frames, 3 segments)
RTF improvement: 6.0 -> 2.3 (processing-time / clip-duration)
Per-layer GEMM: 1.14-1.30x speedup vs fp16 matmul
FP4 GEMM with quantize overhead: 4-9x faster than fp16 on large FFN projections
Precision: PSNR 52.0 dB (mean) / 45.2 dB (worst frame) vs fp16
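
As a sanity check on the figures above, the short sketch below recomputes the speedup and the clip duration implied by each RTF value. It uses only the numbers quoted in this description. The implied frame rate is derived here and is not stated in the PR.

```python
# Figures quoted in the PR description.
baseline_s = 30.7    # fp16 end-to-end time in seconds
optimized_s = 11.9   # NVFP4 end-to-end time in seconds
rtf_before = 6.0
rtf_after = 2.3
frames = 123

speedup = baseline_s / optimized_s

# RTF = processing-time / clip-duration, so clip-duration = processing-time / RTF.
clip_s_before = baseline_s / rtf_before
clip_s_after = optimized_s / rtf_after

# Derived value (an inference, not a number from the PR).
implied_fps = frames / clip_s_before

print(f"speedup:        {speedup:.2f}x")
print(f"clip duration:  {clip_s_before:.2f} s (before), {clip_s_after:.2f} s (after)")
print(f"implied fps:    {implied_fps:.1f}")
```

Both RTF values point to a clip of roughly five seconds, so the before and after figures are consistent with the same input.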

Optimizations:

NVFP4 W4A4 quantization with dynamic per-call activation (no offline calibration)
Fused LayerNorm + adaLN + gate-residual in single fp32-stat Triton kernel
Fused FFN-up GEMM + bias + GELU -> FP4 output (skip re-quantization for FFN-down)
QKV quantize-once optimization (reuse quantized norm output for Q/K/V projections)
Manual graph-capturable denoise loop with CUDA Graph support
BF16 transformer (NVFP4-native, eliminates fp16<->bf16 casts)
SageAttention / FA2 backend integration
Shared attention dispatch for installed processors and the manual fused block, so FLASHRT_ATTN_MODE=fa2 uses the dependency-light FA2 path without importing sageattention
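
To illustrate what "dynamic per-call activation" quantization means, here is a minimal pure-Python sketch of block-wise FP4 fake quantization. It is not the kernel from this PR. The function names are hypothetical, the block scale is kept as a plain Python float (real NVFP4 stores block scales in FP8 and adds a per-tensor scale), and ties are broken toward the smaller magnitude rather than by a hardware rounding mode.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # NVFP4 shares one scale across a block of 16 elements


def quantize_block(values):
    """Return (scale, codes) for one block; codes are signed grid values."""
    amax = max(abs(v) for v in values)
    # Dynamic scaling: the scale is computed from this call's data,
    # so no offline calibration set is needed.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def fake_quantize(row):
    """Quantize then dequantize a flat list whose length is a multiple of BLOCK."""
    if len(row) % BLOCK:
        raise ValueError(f"length {len(row)} is not a multiple of {BLOCK}")
    out = []
    for start in range(0, len(row), BLOCK):
        scale, codes = quantize_block(row[start:start + BLOCK])
        out.extend(c * scale for c in codes)
    return out


activations = [float(i - 8) for i in range(16)]
restored = fake_quantize(activations)
worst = max(abs(a - b) for a, b in zip(activations, restored))
print(f"worst absolute error in this block: {worst:.3f}")
```

Because the scale is recomputed on every call, the cost of quantizing is paid at runtime, which is why the fused FFN-up path that emits FP4 directly and the QKV quantize-once reuse are listed as separate optimizations.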

Validation:

PYTHONPATH=. PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q tests/test_minimax_remover_smoke.py: 8 passed, 2 skipped
python -m compileall -q flash_rt/models/minimax_remover tests/test_minimax_remover_smoke.py examples/minimax_remover_quickstart.py: passed
git diff --check: passed
Import check for flash_rt, flash_rt.models.minimax_remover, _kernels, and pipeline: passed
FLASHRT_ATTN_MODE=fa2 dispatch check with a stub flash_rt_fa2: passed without touching SageAttention
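
The stub-based dispatch check in the last item can be sketched roughly as follows. This is an illustration, not the code in the PR: the function name and the mode-to-module mapping are assumptions, and only the environment variable, the mode names, and the module names `flash_rt_fa2` and `sageattention` come from this thread.

```python
import importlib
import os
import sys
import types

# Mode -> module that must be importable for that mode.
# The mapping is an assumption made for this sketch.
_BACKEND_MODULES = {
    "fa2": "flash_rt_fa2",
    "sage_fp8": "sageattention",
}


def resolve_attention_backend(default="sage_fp8"):
    """Import and return (mode, module) for the backend chosen by the env var."""
    mode = os.environ.get("FLASHRT_ATTN_MODE", default)
    if mode not in _BACKEND_MODULES:
        raise ValueError(
            f"unknown FLASHRT_ATTN_MODE={mode!r}; "
            f"expected one of {sorted(_BACKEND_MODULES)}"
        )
    # Only the selected backend is imported, so fa2 never touches sageattention.
    return mode, importlib.import_module(_BACKEND_MODULES[mode])


# Demo with a stub backend, mirroring the validation step above.
stub_fa2 = types.ModuleType("flash_rt_fa2")
sys.modules["flash_rt_fa2"] = stub_fa2
sys.modules["sageattention"] = None  # any attempt to import it would now raise

os.environ["FLASHRT_ATTN_MODE"] = "fa2"
mode, backend = resolve_attention_backend()
print(mode, backend is stub_fa2)
```

Setting `sys.modules["sageattention"] = None` makes any import of that module fail, so the demo only succeeds if the fa2 path really avoids it.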

LiangSu8899 · 2026-07-01T18:39:36Z

Thanks for the contribution. The direction is useful: adding MiniMax-Remover as an isolated model package and reusing the existing generic NVFP4 kernels, rather than adding model-specific CUDA operators, is the right long-term shape.

Review result: Request changes / blocker before merge.
Risk class: R3/R4, because this adds a new model runtime path with optional dependencies, NVFP4 required-symbol checks, Triton kernels, and performance claims.

I do not see evidence that this affects the default FlashRT core path directly: the PR adds new files only and does not modify CMake, bindings, or existing model routing. That part is good. The blockers are mostly packaging/import/docs/test contract issues.

Before merge, please fix these items:

Keep optional dependency boundaries intact. import flash_rt.models.minimax_remover should succeed without diffusers, sageattention, or the MiniMax-Remover reference package installed. Right now package import reaches flash_rt/models/minimax_remover/_manual_denoise.py, which imports diffusers at module import time and fails in a default environment.
Move optional imports to runtime boundaries or fail-fast constructors. Dependencies such as diffusers, einops, triton, and sageattention should either be in a documented optional extra or imported lazily with a clear error message when the MiniMax-Remover path is actually constructed/used.
Add a package extra or clear install command for this model path. For example, a dedicated extra like minimax-remover = [...] or an equivalent documented command. The current pyproject.toml does not advertise the dependencies needed by this new path.
Fix the build documentation. ENABLE_CUTLASS_SM120_NVFP4_W4A16 is an internal compile definition, not the public CMake entry point users should pass. Please document the real build route, e.g. Blackwell GPU_ARCH=120 / 121 and the existing NVFP4 build conditions.
Make tests/test_minimax_remover_smoke.py pass in a default environment without diffusers installed. The test comments say import should succeed and _load_kernels() should fail clearly only when the kernel surface is missing, but currently the tests fail at import time.
Clarify the default attention dependency. The docs say the default is FLASHRT_ATTN_MODE=sage_fp8, which requires sageattention. Either make that dependency part of the model extra, make fa2 the dependency-light default, or fail fast with an explicit install hint.
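
One way to meet the first two items is a small helper that imports an optional dependency only when the model path is constructed. The sketch below is an assumption about shape, not the fix that landed: `require_optional` and `LazyRemoverPipeline` are hypothetical names, and the extra name `minimax-remover` is taken from the example above.

```python
import importlib


def require_optional(module_name, extra="minimax-remover"):
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        # Re-raise unchanged if some other module is the one that is missing.
        if not (module_name == missing or module_name.startswith(missing + ".")):
            raise
        raise ImportError(
            f"{module_name!r} is required for the MiniMax-Remover path; "
            f"install the '{extra}' extra to use it"
        ) from exc


class LazyRemoverPipeline:
    """Hypothetical pipeline: importing this module never imports diffusers."""

    def __init__(self, dependency="diffusers"):
        # Fail fast at construction time, not at package import time.
        self._dep = require_optional(dependency)


try:
    require_optional("no_such_module_flashrt_demo")
except ImportError as err:
    print(err)
```

With this layout, importing the package succeeds in a default environment, and the missing dependency is reported only when the pipeline is actually built.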

What I verified locally:

git diff --name-status origin/main...origin/pr123-head: only new MiniMax-Remover docs/model/test files are added.
git diff --check origin/main...origin/pr123-head: passed.
python -m compileall flash_rt/models/minimax_remover tests/test_minimax_remover_smoke.py: passed.
No private paths, debug breakpoints, or AI-generated traces found in the changed files.
In an environment without diffusers, PYTHONPATH=. python -c "import flash_rt.models.minimax_remover" fails with ModuleNotFoundError: No module named 'diffusers'.
In the same environment, PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 PYTHONPATH=. pytest -q tests/test_minimax_remover_smoke.py reports 6 failed, 2 skipped for the same import-boundary issue.

For the maintenance standard we are applying, please refer to the public review checklist:

https://github.com/flashrt-project/FlashRT/blob/main/docs/pr_review_checklist.md

The relevant sections are mainly import boundaries / optional dependencies, new model acceptance, docs matching actual build flags, and default-environment tests. Once these are fixed, the PR will be much easier to review on the actual model/runtime behavior.

chenping9999 requested a review from LiangSu8899 as a code owner July 1, 2026 08:18

chenping9999 and others added 3 commits July 2, 2026 14:16

feat(minimax-remover): lazy-load runtime deps and improve build docs

4b366e8

fix(minimax-remover): share attention mode dispatch

274d5ec

docs(minimax-remover): mention scipy extra

c67100d

LiangSu8899 approved these changes Jul 2, 2026

LiangSu8899 merged commit d44920a into flashrt-project:main Jul 2, 2026

LiangSu8899 merged 4 commits into flashrt-project:main from chenping9999:main
