feat(logging,trace): cuda-graph-compatible level-5/10 logging + fi_trace template additions/fixes#3172
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a device-side CUDA kernel and TVM FFI binding to compute per-tensor min/max/mean and NaN/Inf counts during CUDA-graph capture/replay; integrates the kernel into JIT/AOT packaging, changes capture-time pinned-buffer staging and deferred dump semantics for log levels 5 and 10, and exposes dump flush/clear APIs. Changes
Sequence Diagram(s)sequenceDiagram
participant Host as Host (Python)
participant JIT as JIT/AOT
participant TVM as TVM FFI
participant GPU as GPU Kernel
participant DevIO as Device printf
Host->>JIT: build/load api_log_stats module
JIT-->>Host: module (or build failure)
alt module built & dtype supported
Host->>Host: emit correlation marker id=N (deferred-to-GPU)
Host->>TVM: api_log_print_tensor_stats(tensor, id) on capture stream
TVM->>GPU: launch kernel (device-stream)
GPU->>GPU: per-thread convert/reduce (min/max/sum/nan/inf)
GPU->>DevIO: printf("[flashinfer stats] id=N ...")
else unsupported dtype or load/launch failure
Host->>Host: emit "statistics skipped: CUDA graph capture in progress"
end
Host->>Host: end capture
Host->>Host: cuda_graph.replay()
GPU->>DevIO: printf outputs on replay (if kernel ran)
Host->>Host: flush_graph_dumps() to persist pinned buffers (level 10)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a CUDA-graph-friendly mechanism for logging tensor statistics at Level 5. By utilizing a device-side kernel and printf, statistics can be captured and reported during graph replay without requiring stream synchronization. Feedback focuses on ensuring robustness and accuracy: specifically, adding a check for tensor contiguity to prevent incorrect memory access, using double precision for intermediate calculations to avoid data loss with large integers, and aligning the handling of infinite values in min/max reductions with the existing eager logging implementation.
| if tensor.dtype not in _GPU_STATS_SUPPORTED_DTYPES: | ||
| return None |
There was a problem hiding this comment.
The GPU statistics kernel performs a linear scan of the tensor data using data[i], which assumes the tensor is contiguous in memory. If a non-contiguous tensor (e.g., a slice or a transposed tensor) is passed, the kernel will read incorrect data or potentially access memory out of bounds. A check for tensor.is_contiguous() should be added here to ensure the kernel is only launched for supported layouts, falling back to the "skipped" message otherwise.
| if tensor.dtype not in _GPU_STATS_SUPPORTED_DTYPES: | |
| return None | |
| if tensor.dtype not in _GPU_STATS_SUPPORTED_DTYPES or not tensor.is_contiguous(): | |
| return None |
| __device__ inline float to_float_impl(T x) { | ||
| return static_cast<float>(x); | ||
| } |
There was a problem hiding this comment.
Using float as the intermediate type for statistics causes precision loss for int32_t and int64_t types when values exceed the 24-bit mantissa limit (approx. 16.7 million). Since the final output is formatted as a double and the sum is already tracked as a double, it is better to use double for the to_float_impl return type and the thread_min/thread_max accumulators to preserve precision for integer types.
template <typename T>
__device__ inline double to_double_impl(T x) {
return static_cast<double>(x);
}
| if (is_nan) { | ||
| thread_nan += 1; | ||
| } else if (is_inf) { | ||
| thread_inf += 1; | ||
| } else { |
There was a problem hiding this comment.
The current logic excludes infinite values from the min and max calculations, which differs from the behavior of the eager path where torch.min() and torch.max() include Infs. This inconsistency can be confusing for users. For example, a tensor containing [1.0, inf] would show max=1.0 in the GPU log but max=inf in the eager log. Consider including Infs in the min/max reduction while still counting them separately to match the eager logging output.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
csrc/api_log_stats.cu (1)
69-86: Integer dtypes lose precision because reduction goes throughfloat.For
int32,int64, anduint8,to_float_implcasts each element tofloatbefore min/max/sum. Float can only represent integers exactly up to 2^24, so for largeint32/int64tensors (e.g. token-id tensors, paged-KV indices, cu_seqlens with large offsets) the printedmin/maxand especiallymeanwill be inaccurate by potentially many ULPs.Since
int64indexing tensors are exactly the kind of inputs users are most likely to be debugging at level 5, this is worth fixing. One approach: keep adoubleaccumulator and dispatch min/max via a per-T traits struct so integer types reduce in their native domain.♻️ Sketch of a precision-preserving variant
+template <typename T> +struct StatsAccum { + using Acc = float; +}; +template <> struct StatsAccum<int32_t> { using Acc = double; }; +template <> struct StatsAccum<int64_t> { using Acc = double; }; +template <> struct StatsAccum<uint8_t> { using Acc = double; }; + +template <typename T> +__device__ inline typename StatsAccum<T>::Acc to_acc(T x) { + return static_cast<typename StatsAccum<T>::Acc>(x); +} +__device__ inline float to_acc(nv_half x) { return __half2float(x); } +__device__ inline float to_acc(nv_bfloat16 x) { return __bfloat162float(x); }Then template
PrintTensorStatsKernelsothread_min/thread_max/thread_sumuse the appropriate accumulator type forT.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/api_log_stats.cu` around lines 69 - 86, The reduction currently casts every element with to_float_impl inside PrintTensorStatsKernel so integer dtypes lose precision; change the kernel to use a type-traits dispatch (e.g., a template AccumTypeFor<T>) that selects double for integer inputs and float/double for float-like types, keep per-type min/max comparisons in the native domain (avoid isnan/isinf for non-floats by using IsFloatLike<T>), and make thread_min/thread_max/thread_sum use the chosen accumulator type (sum as double for integers) so min/max/mean are computed without truncation for int32/int64/uint8 while preserving existing float handling.docs/logging.rst (1)
253-256: Minor: device printf flushing wording.The phrasing "PyTorch routes device printf to the host stream" is slightly inaccurate. Device-side
printfis buffered by the CUDA runtime and flushed on sync points (e.g.cudaDeviceSynchronize/stream sync) — it is not specifically routed by PyTorch. Consider rewording, e.g. "the CUDA runtime flushes the device printf buffer to stdout on stream synchronization."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/logging.rst` around lines 253 - 256, Reword the sentence about device printf: replace "PyTorch routes device printf to the host stream" with wording that attributes flushing to the CUDA runtime and synchronization (e.g., mention that the CUDA runtime flushes the device printf buffer to stdout on stream or device synchronization). Update the line describing cuda_graph.replay() so it states the captured kernel prints statistics to stdout because the CUDA runtime flushes device printf on sync points (not that PyTorch routes it).flashinfer/api_logging.py (1)
1161-1199: First-call JIT build inside graph capture is safe but slow; consider pre-warming as optional optimization.
_get_api_log_stats_kernel()with@functools.cachemay run thebuild_and_load()chain on first level-5 capture. The build itself is host-only (nvcc subprocess + dlopen of .so) so it will not poison the capture stream. However, the build can take seconds, which may introduce unexpected latency on first use.Two options if latency is a concern:
- Pre-warm
_get_api_log_stats_kernel()once at module import or on first CUDA tensor observed outside graph capture.- Add a note to the level-5 docstring that a short warmup pass is recommended before entering graph capture.
This is an optional optimization; the current design is correct and works under capture.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/api_log_stats.cu`:
- Around line 130-135: The kernel launch line in launch_print_tensor_stats is
not formatted to match the project's clang-format rules; update the formatting
of the PrintTensorStatsKernel<<<...>>> call inside launch_print_tensor_stats
(the call to PrintTensorStatsKernel<T><<<1, kBlockSize, 0,
stream>>>(static_cast<const T*>(data_ptr), numel, tensor_id)) to match
clang-format, then run the pre-commit formatter and commit the change (e.g., run
pre-commit run clang-format --files csrc/api_log_stats.cu) so CI passes.
- Around line 113-126: The current printf in the tid==0 block prints sentinel
s_min/s_max and a misleading mean when numel>0 but valid==0; modify the tid==0
handling to check valid (computed from numel - s_nan[0] - s_inf[0]) and when
valid==0 print an explicit message such as "[flashinfer stats] id=%lld
numel=%lld all_nan_or_inf nan=%lld inf=%lld" (include tensor_id, numel,
s_nan[0], s_inf[0]) instead of printing s_min/s_max/mean, otherwise keep the
existing min/max/mean printing; ensure mean and use of s_min/s_max only occur
when valid>0.
---
Nitpick comments:
In `@csrc/api_log_stats.cu`:
- Around line 69-86: The reduction currently casts every element with
to_float_impl inside PrintTensorStatsKernel so integer dtypes lose precision;
change the kernel to use a type-traits dispatch (e.g., a template
AccumTypeFor<T>) that selects double for integer inputs and float/double for
float-like types, keep per-type min/max comparisons in the native domain (avoid
isnan/isinf for non-floats by using IsFloatLike<T>), and make
thread_min/thread_max/thread_sum use the chosen accumulator type (sum as double
for integers) so min/max/mean are computed without truncation for
int32/int64/uint8 while preserving existing float handling.
In `@docs/logging.rst`:
- Around line 253-256: Reword the sentence about device printf: replace "PyTorch
routes device printf to the host stream" with wording that attributes flushing
to the CUDA runtime and synchronization (e.g., mention that the CUDA runtime
flushes the device printf buffer to stdout on stream or device synchronization).
Update the line describing cuda_graph.replay() so it states the captured kernel
prints statistics to stdout because the CUDA runtime flushes device printf on
sync points (not that PyTorch routes it).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 233331c7-9d12-4bef-8735-8cf70cef6557
📒 Files selected for processing (7)
csrc/api_log_stats.cucsrc/flashinfer_api_log_stats_binding.cudocs/logging.rstflashinfer/aot.pyflashinfer/api_logging.pyflashinfer/jit/api_log_stats.pytests/utils/test_logging.py
|
Follow-up commit What's new
Usagefrom flashinfer.api_logging import flush_graph_dumps, clear_graph_dumps
# Eager warmup primes the pinned-buffer cache.
out = wrapper.run(q, kv_cache)
torch.cuda.synchronize()
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
wrapper.run(q, kv_cache)
# Per replay, flush to capture this replay's tensor values.
g.replay()
flush_graph_dumps()
q.copy_(new_q)
g.replay()
flush_graph_dumps() # dump now reflects the new inputs
clear_graph_dumps()Caveats
Files
Test plan
|
|
Added What it does
Usagepython tools/dump_with_cuda_graph.py \
--dump-dir /tmp/fi_dumps \
--include '*decode*' \
--max-count 10 \
-- \
python -m sglang.launch_server --model meta-llama/Llama-3-8B ...Anything before Caveats called out in the docstring
Smoke-testedIdempotency check also passes: a second |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/utils/test_logging.py (1)
46-68:⚠️ Potential issue | 🟡 Minor
setup_and_teardowndoesn't restoreFLASHINFER_DUMP_DIR, leaking state into later tests.The new level-10 graph tests (
test_level_10_cuda_graph_dumps,test_level_10_cuda_graph_requires_warmup) setFLASHINFER_DUMP_DIR(and re-setFLASHINFER_LOGDEST), but the autouse fixture only saves/restoresFLASHINFER_LOGLEVELandFLASHINFER_LOGDEST. Any test running after these will see aFLASHINFER_DUMP_DIRpointing at a now-deletedtmp_path, which can change behavior of_warn_dump()and the dump count tracking on module reimport.🧪 Extend the fixture to also restore the dump-related env vars
`@pytest.fixture`(autouse=True) def setup_and_teardown(self): """Reset environment and reimport logging module for each test.""" - # Store original environment - original_level = os.environ.get("FLASHINFER_LOGLEVEL") - original_dest = os.environ.get("FLASHINFER_LOGDEST") + # Store original environment + keys = ( + "FLASHINFER_LOGLEVEL", + "FLASHINFER_LOGDEST", + "FLASHINFER_DUMP_DIR", + ) + original = {k: os.environ.get(k) for k in keys} yield - # Restore original environment - if original_level is not None: - os.environ["FLASHINFER_LOGLEVEL"] = original_level - elif "FLASHINFER_LOGLEVEL" in os.environ: - del os.environ["FLASHINFER_LOGLEVEL"] - - if original_dest is not None: - os.environ["FLASHINFER_LOGDEST"] = original_dest - elif "FLASHINFER_LOGDEST" in os.environ: - del os.environ["FLASHINFER_LOGDEST"] + # Restore original environment + for k, v in original.items(): + if v is not None: + os.environ[k] = v + elif k in os.environ: + del os.environ[k] # Force reimport to pick up new environment variables if "flashinfer.api_logging" in sys.modules: del sys.modules["flashinfer.api_logging"]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utils/test_logging.py` around lines 46 - 68, The autouse fixture setup_and_teardown fails to save/restore FLASHINFER_DUMP_DIR (and any dump-related env vars), leaking state into later tests; update setup_and_teardown to capture original values for FLASHINFER_DUMP_DIR (and any other dump-related env vars you add), restore them in the teardown branch (mirroring the pattern used for FLASHINFER_LOGLEVEL and FLASHINFER_LOGDEST), and ensure the module reimport logic for flashinfer.api_logging still runs so functions like _warn_dump() and dump-count tracking see the restored environment.flashinfer/api_logging.py (1)
327-648:⚠️ Potential issue | 🟠 MajorPre-commit
ruff-formatis failing on this hunk; CI is currently blocked.The
pre-commitjob reports formatting changes required across the new dump-staging code in this file. Run the formatter locally before pushing:pre-commit run --all-files # or, scoped: ruff format flashinfer/api_logging.py🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/api_logging.py` around lines 327 - 648, The new dump-staging code (see functions _stage_tensor_to_pinned, _extract_tensors_and_metadata_pinned and _dump_function_inputs) is failing the pre-commit ruff-format hook; run the project formatter and re-stage the file to fix whitespace/formatting issues (e.g. run `pre-commit run --all-files` or `ruff format flashinfer/api_logging.py`) and then amend the commit so CI passes.
🧹 Nitpick comments (1)
flashinfer/api_logging.py (1)
308-316: Optional: narrow the blanketexcept Exception(BLE001).
torch.cuda.is_current_stream_capturing()is documented to either return aboolor raise on missing CUDA context; swallowing all exceptions hides genuine bugs (e.g., a CUDA driver failure manifesting as "stats look fine" until something else explodes). Thehasattrcheck above already covers older PyTorch, so thetry/excepthere is mostly defensive. Consider scoping it toRuntimeError(or removing it) so unexpected failures aren't masked.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/api_logging.py` around lines 308 - 316, The blanket except in _is_current_stream_capturing hides unexpected errors; replace the broad "except Exception" around the call to torch.cuda.is_current_stream_capturing() with a narrower catch (e.g., "except RuntimeError") or remove the try/except entirely so only the documented missing-CUDA-context error is caught while other failures surface; update the exception handling around torch.cuda.is_current_stream_capturing() accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/api_logging.py`:
- Around line 318-365: The pinned-buffer cache (_PINNED_DUMP_BUFFER_CACHE)
aliases multiple invocations of the same (func_name, key, shape, dtype) inside a
single torch.cuda.graph capture, causing different dump entries to share and
overwrite the same pinned tensor; update _stage_tensor_to_pinned to avoid
aliasing by either (A) when _is_current_stream_capturing() is true, append a
per-call unique discriminator (e.g., a _dump_call_counter[func_name] or
ephemeral UUID) to the cache_key so each in-graph call gets its own pinned
buffer, or (B) detect an existing cache entry during capture and raise a
RuntimeError mirroring the warmup check; ensure you reference and update the
same cache key logic in _stage_tensor_to_pinned and initialize/maintain the
per-call counter/state (or explicit error path) so _PENDING_GRAPH_DUMPS entries
will point to distinct buffers rather than aliasing.
- Around line 1403-1441: The level-5 stats path lazily JIT-compiles in
_get_api_log_stats_kernel (called by _launch_gpu_stats_kernel) which can trigger
illegal module loads during cudaStreamCaptureModeGlobal; fix by adding an eager
warmup call to _get_api_log_stats_kernel() during initialization (for example
invoke it from _warn_dump or _log_system_info when FLASHINFER_LOGLEVEL >= 5) so
the kernel is built before any capture, or alternatively update the docstring
where level-10 warmup is documented (around the level-10 note) to clearly state
that callers must warm up _get_api_log_stats_kernel() before starting captures
at level 5; reference _get_api_log_stats_kernel, _launch_gpu_stats_kernel,
_warn_dump, and _log_system_info to locate the changes.
In `@tests/utils/test_logging.py`:
- Around line 685-717: Update test_level_10_cuda_graph_requires_warmup to assert
that a RuntimeError is raised instead of accepting the "no exception" branch:
replace the try/except and the silent-success fallback with a
pytest.raises(RuntimeError) context around the with torch.cuda.graph(graph):
_id(x) block, import pytest at top of the test, and remove the broad bare except
to both make the failure deterministic (matching _stage_tensor_to_pinned and
_is_current_stream_capturing behavior) and satisfy Ruff BLE001 by narrowing the
exception expectation; keep existing uses of flashinfer_api and
_PINNED_DUMP_BUFFER_CACHE as-is.
---
Outside diff comments:
In `@flashinfer/api_logging.py`:
- Around line 327-648: The new dump-staging code (see functions
_stage_tensor_to_pinned, _extract_tensors_and_metadata_pinned and
_dump_function_inputs) is failing the pre-commit ruff-format hook; run the
project formatter and re-stage the file to fix whitespace/formatting issues
(e.g. run `pre-commit run --all-files` or `ruff format
flashinfer/api_logging.py`) and then amend the commit so CI passes.
In `@tests/utils/test_logging.py`:
- Around line 46-68: The autouse fixture setup_and_teardown fails to
save/restore FLASHINFER_DUMP_DIR (and any dump-related env vars), leaking state
into later tests; update setup_and_teardown to capture original values for
FLASHINFER_DUMP_DIR (and any other dump-related env vars you add), restore them
in the teardown branch (mirroring the pattern used for FLASHINFER_LOGLEVEL and
FLASHINFER_LOGDEST), and ensure the module reimport logic for
flashinfer.api_logging still runs so functions like _warn_dump() and dump-count
tracking see the restored environment.
---
Nitpick comments:
In `@flashinfer/api_logging.py`:
- Around line 308-316: The blanket except in _is_current_stream_capturing hides
unexpected errors; replace the broad "except Exception" around the call to
torch.cuda.is_current_stream_capturing() with a narrower catch (e.g., "except
RuntimeError") or remove the try/except entirely so only the documented
missing-CUDA-context error is caught while other failures surface; update the
exception handling around torch.cuda.is_current_stream_capturing() accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c91c3f8a-903f-4af1-9e7a-75ac8587bff9
📒 Files selected for processing (3)
docs/logging.rstflashinfer/api_logging.pytests/utils/test_logging.py
✅ Files skipped from review due to trivial changes (1)
- docs/logging.rst
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tools/dump_with_cuda_graph.py (2)
139-158: Idempotent branch returns the patched callable, not the original — contradicts the docstring.The docstring promises "the original
replaycallable (so callers can reverse the patch in tests)", but on the already-patched path you returntorch.cuda.CUDAGraph.replay, which is the wrapper itself; the true original is no longer accessible. Either stash the original on the wrapper at install time and return it here, or update the docstring to say it returns the currently-installed callable.♻️ Proposed fix (preserve and return the real original)
if getattr(torch.cuda.CUDAGraph.replay, "_flashinfer_autoflush", False): # Already patched (idempotent). - return torch.cuda.CUDAGraph.replay + return getattr(torch.cuda.CUDAGraph.replay, "_flashinfer_original", None) original = torch.cuda.CUDAGraph.replay def replay_with_flush(self, *args, **kwargs): ... replay_with_flush._flashinfer_autoflush = True # type: ignore[attr-defined] + replay_with_flush._flashinfer_original = original # type: ignore[attr-defined] torch.cuda.CUDAGraph.replay = replay_with_flush # type: ignore[assignment] return original🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/dump_with_cuda_graph.py` around lines 139 - 158, The idempotent branch in install_replay_autoflush currently returns torch.cuda.CUDAGraph.replay (the patched wrapper), which contradicts the docstring promise to return the original replay callable; fix by storing the real original replay when you first patch (e.g., attach it to the wrapper under a unique attribute name like _flashinfer_original_replay when wrapping in install_replay_autoflush) and in the early-return path return that stored original (check _flashinfer_autoflush to detect patching and return the attached _flashinfer_original_replay), leaving the wrapper flag _flashinfer_autoflush to indicate idempotency.
200-200: Nit: use unpacking instead of list concatenation (ruff RUF005).- sys.argv = [module] + rest[2:] + sys.argv = [module, *rest[2:]]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/dump_with_cuda_graph.py` at line 200, Replace the list concatenation that builds sys.argv (currently using [module] + rest[2:]) with list unpacking to satisfy ruff RUF005; locate the assignment to sys.argv that references the variables module and rest and change it to use the unpacking form (module followed by the unpacked slice of rest) so the result is the same but uses unpacking instead of concatenation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/dump_with_cuda_graph.py`:
- Around line 194-217: The wrapper's Python detection and -m/script handling are
too narrow: update the logic around head/target_argv (symbols: head,
target_argv, rest) to treat any executable whose basename contains "python" (or
equals sys.executable) as a Python interpreter, and scan rest while stripping
leading interpreter flags (flags beginning with "-" and their potential values,
e.g., -W warnings, -X args) to find the first non-flag token; if that token is
"-m" call runpy.run_module(module, run_name="__main__", alter_sys=True) with
sys.argv set to the module + its args (symbols: runpy.run_module, sys.argv),
otherwise treat the first non-flag token as a script path and call
runpy.run_path(script, run_name="__main__") with sys.argv set to script + its
args (symbol: runpy.run_path); only fall back to os.execvp(head, target_argv)
when head is not a Python interpreter (symbol: os.execvp).
---
Nitpick comments:
In `@tools/dump_with_cuda_graph.py`:
- Around line 139-158: The idempotent branch in install_replay_autoflush
currently returns torch.cuda.CUDAGraph.replay (the patched wrapper), which
contradicts the docstring promise to return the original replay callable; fix by
storing the real original replay when you first patch (e.g., attach it to the
wrapper under a unique attribute name like _flashinfer_original_replay when
wrapping in install_replay_autoflush) and in the early-return path return that
stored original (check _flashinfer_autoflush to detect patching and return the
attached _flashinfer_original_replay), leaving the wrapper flag
_flashinfer_autoflush to indicate idempotency.
- Line 200: Replace the list concatenation that builds sys.argv (currently using
[module] + rest[2:]) with list unpacking to satisfy ruff RUF005; locate the
assignment to sys.argv that references the variables module and rest and change
it to use the unpacking form (module followed by the unpacked slice of rest) so
the result is the same but uses unpacking instead of concatenation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 662546b0-9164-4a5c-b4a9-45c1edccdb48
📒 Files selected for processing (1)
tools/dump_with_cuda_graph.py
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/api_logging.py`:
- Around line 797-827: flush_graph_dumps currently writes tensor files but never
promotes the per-dump or session JSONL records from execution_status
"graph_capture_pending_flush" to a terminal state, so consumers still see them
as pending; modify flush_graph_dumps to, after each successful write (inside the
loop, when a tensor file is saved and before incrementing flushed), append a
small completion record to both the per-dump metadata.jsonl and the central
session.jsonl indicating execution_status="completed" (or the same terminal
state used by eager mode), include identifying fields like func_name, kind,
dump_dir and a timestamp, and ensure you reference the same keys used by
_dump_function_inputs/_dump_function_outputs so readers can correlate entries.
- Around line 472-481: The change unconditionally stages tensors into pinned
contiguous buffers via
_extract_tensors_and_metadata_pinned/_stage_tensor_to_pinned which destroys
source strides for eager (non-capture) dumps; revert to using the original
CPU-path (_extract_tensors_and_metadata using arg.cpu()) for the
non-capturing/eager path and only call _extract_tensors_and_metadata_pinned when
_is_current_stream_capturing() is true (apply same fix in
_dump_function_outputs), ensure tensor_details["stride"] is recorded from the
source tensor before any staging/copy occurs (not from the pinned buffer), and
update the public decorator and _extract_tensors_and_metadata docstrings to
accurately reflect when stride/contiguity is preserved versus lost.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7e5f3b69-362a-40e5-9c48-c83212209614
📒 Files selected for processing (2)
csrc/api_log_stats.cuflashinfer/api_logging.py
🚧 Files skipped from review as they are similar to previous changes (1)
- csrc/api_log_stats.cu
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (4)
flashinfer/api_logging.py (4)
472-481:⚠️ Potential issue | 🟠 MajorEager Level-10 dumps now silently lose CUDA tensor strides (duplicate).
_extract_tensors_and_metadata_pinnedis invoked unconditionally for both capture and eager paths (also at lines 660–668). In eager mode,_stage_tensor_to_pinnedallocates a contiguous pinned buffer andpinned.copy_(t, ...)produces a contiguous tensor regardless of source stride, soinputs.pt/outputs.ptandtensor_details["stride"](lines 566, 720–726) no longer reflect the original layout. The decorator docstring at line 1922 and_extract_tensors_and_metadata's docstring (lines 276/282) still advertise stride preservation. Restrict pinned staging to the capture path (or pre-warm the cache during the first eager call without losing the source stride), and recordtensor_details["stride"]from the source tensor before staging.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/api_logging.py` around lines 472 - 481, The eager-mode path is losing original CUDA tensor strides because _extract_tensors_and_metadata_pinned (which calls _stage_tensor_to_pinned and uses pinned.copy_) is being invoked unconditionally; change the call site around _is_current_stream_capturing() so that you only stage to pinned buffers when capturing (use _extract_tensors_and_metadata for eager), and when you must pre-warm the pinned cache during the first eager call, explicitly record tensor_details["stride"] from the source tensor before calling _stage_tensor_to_pinned so the original stride is preserved in the metadata; update both call sites (the one around _is_current_stream_capturing and the duplicate at lines 660–668) and ensure _extract_tensors_and_metadata_pinned documents/returns original stride if staging occurs.
318-365:⚠️ Potential issue | 🟠 MajorPinned-buffer aliasing on repeated in-graph calls (duplicate).
The
(func_name, key, shape, dtype)cache key still aliases when the same@flashinfer_apiis invoked more than once during a single capture (e.g., a method called inside a captured loop, or twodecodecalls back-to-back). All such call sites end up sharing the same pinned tensor and_PENDING_GRAPH_DUMPSwill hold multiple entries pointing at the same buffer; after replay every aliaseddump_dirflushes identical content. Either disambiguate the cache key during capture (e.g., fold in_dump_call_counter[func_name]), refuse to alias and raise like the warmup case, or extend the comment to make the in-graph repeat case explicit.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/api_logging.py` around lines 318 - 365, The pinned-buffer cache currently aliasing repeated in-graph calls must be disambiguated: modify _stage_tensor_to_pinned to include a capture-specific call identifier in the cache key when inside a capture (instead of the current (func_name, key, shape, dtype) only). Add/consume a per-function-in-capture counter (e.g., _dump_call_counter[func_name] or similar) incremented on each dump invocation during capture and fold that counter into cache_key when _is_current_stream_capturing() is true; ensure the counter is incremented before building cache_key and persisted/cleared appropriately so multiple in-graph calls get distinct pinned buffers and update the top comment to document this behavior.
1405-1422:⚠️ Potential issue | 🟠 MajorLevel-5 stats also need eager warmup for capture (duplicate).
_get_api_log_stats_kernel()is@functools.cache-decorated and triggersgen_api_log_stats_module().build_and_load()on first use. Because_launch_gpu_stats_kernelis only invoked whenis_capturingis True, a "first run is the captured run" workflow underFLASHINFER_LOGLEVEL=5will JIT-build (andcuModuleLoadData) inside the capture region — which is prohibited undercudaStreamCaptureModeGlobaland aborts capture. Either eagerly prime the kernel from_warn_dump/_log_system_infowhen level ≥ 5, or document the warmup requirement alongside the level-10 note in the docstring (lines 1970–1983).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/api_logging.py` around lines 1405 - 1422, The stats JIT kernel (_get_api_log_stats_kernel) can be first-built during a CUDA stream capture because _launch_gpu_stats_kernel runs only when is_capturing is true; to avoid illegal JIT/cudaModuleLoadData inside capture, eagerly warm up the kernel when logging level >= 5 by calling _get_api_log_stats_kernel() from the startup/info path (e.g., inside _warn_dump and/or _log_system_info) so the build/load happens before any capture; ensure you handle the None return (build failure) as the existing callers do and keep the existing level-10 docstring note for FLASHINFER_LOGLEVEL while adding a short comment about the warmup at level-5.
797-827:⚠️ Potential issue | 🟡 Minor
flush_graph_dumpsdoesn't promoteexecution_statusto a terminal state (duplicate).After successful tensor writes here, neither the per-dump
metadata.jsonlnor the centralsession.jsonlgets a follow-up record — they retainexecution_status="graph_capture_pending_flush"forever. Consumers filtering byexecution_status == "completed"(the eager-mode terminal state used at line 714) will treat flushed dumps as still pending. Append a small completion record withexecution_status="completed"(and ideally agraph_capture_flushed=Truemarker plus timestamp) after each successful write so on-disk state matches reality.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/api_logging.py` around lines 797 - 827, In flush_graph_dumps, after each successful tensor write (inside the try block that writes tensors for entries from _PENDING_GRAPH_DUMPS and after incrementing flushed), append a small completion record to the per-dump metadata stream and the central session stream indicating execution_status="completed", graph_capture_flushed=True, and a timestamp (use entry["dump_dir"], entry["kind"], entry.get("func_name") to populate context); ensure this write happens only on success and does not swallow exceptions from the tensor save step, and reuse any existing helper(s) used elsewhere for writing metadata/session JSONL to keep format consistent.
🧹 Nitpick comments (1)
flashinfer/api_logging.py (1)
2042-2055: Verify intent: include/exclude filter now also gates level-3 console logging.Level 3+ logging (the
_log_function_inputs/_log_function_outputsconsole output) is now gated by_should_dump_function(func_name)along with level-10 disk dumps. This is consistent with the linked commit message ("gate level-3+ logging path by include/exclude filter"), but worth confirming: users running with onlyFLASHINFER_LOGLEVEL=3andFLASHINFER_DUMP_INCLUDE/FLASHINFER_DUMP_EXCLUDEset previously got logs for every API; they will now see logs only for filtered APIs. Consider documenting this side-effect in the env-var reference around lines 1944–1948 (the docstring still describesFLASHINFER_DUMP_INCLUDE/EXCLUDEpurely as dump filters), and/or rename_should_dump_functionto something like_should_log_functionto reflect the broader scope.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/api_logging.py` around lines 2042 - 2055, The current change also gates level-3 console logging by _should_dump_function(func_name), which unintentionally prevents normal FLASHINFER_LOGLEVEL=3 console logs for unfiltered APIs; revert that behavior by making console logging (calls to _log_function_inputs and _log_function_outputs when _API_LOG_LEVEL >= 3) not depend on _should_dump_function, while keeping _should_dump_function gating only for disk dumps (level 10 or explicit dump paths). Concretely, update the pre- and post-execution checks so that if _API_LOG_LEVEL >= 3 you call _log_function_inputs/_log_function_outputs unconditionally (or based on a separate _should_log_function predicate if you prefer), and keep _should_dump_function checks only for the level-10 dump branch; also update the env-var docstring around the FLASHINFER_DUMP_INCLUDE/EXCLUDE lines to state these filters apply to dump files (not console level-3 logging) or rename _should_dump_function to _should_log_function and adjust doc accordingly if you intended to change semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/api_logging.py`:
- Around line 830-839: clear_graph_dumps currently only clears
_PENDING_GRAPH_DUMPS but the docstring says it "Releases the pinned host
buffers"; either evict the corresponding entries from _PINNED_DUMP_BUFFER_CACHE
for the keys removed from _PENDING_GRAPH_DUMPS (so buffers are freed and future
captures will perform eager warmup), or update the clear_graph_dumps docstring
to accurately state that only the pending-write registry is cleared while pinned
buffers in _PINNED_DUMP_BUFFER_CACHE remain cached for reuse; locate the
function clear_graph_dumps and change its implementation to iterate removed keys
and pop from _PINNED_DUMP_BUFFER_CACHE, or modify its docstring text to the
corrected behavior.
---
Duplicate comments:
In `@flashinfer/api_logging.py`:
- Around line 472-481: The eager-mode path is losing original CUDA tensor
strides because _extract_tensors_and_metadata_pinned (which calls
_stage_tensor_to_pinned and uses pinned.copy_) is being invoked unconditionally;
change the call site around _is_current_stream_capturing() so that you only
stage to pinned buffers when capturing (use _extract_tensors_and_metadata for
eager), and when you must pre-warm the pinned cache during the first eager call,
explicitly record tensor_details["stride"] from the source tensor before calling
_stage_tensor_to_pinned so the original stride is preserved in the metadata;
update both call sites (the one around _is_current_stream_capturing and the
duplicate at lines 660–668) and ensure _extract_tensors_and_metadata_pinned
documents/returns original stride if staging occurs.
- Around line 318-365: The pinned-buffer cache currently aliasing repeated
in-graph calls must be disambiguated: modify _stage_tensor_to_pinned to include
a capture-specific call identifier in the cache key when inside a capture
(instead of the current (func_name, key, shape, dtype) only). Add/consume a
per-function-in-capture counter (e.g., _dump_call_counter[func_name] or similar)
incremented on each dump invocation during capture and fold that counter into
cache_key when _is_current_stream_capturing() is true; ensure the counter is
incremented before building cache_key and persisted/cleared appropriately so
multiple in-graph calls get distinct pinned buffers and update the top comment
to document this behavior.
- Around line 1405-1422: The stats JIT kernel (_get_api_log_stats_kernel) can be
first-built during a CUDA stream capture because _launch_gpu_stats_kernel runs
only when is_capturing is true; to avoid illegal JIT/cudaModuleLoadData inside
capture, eagerly warm up the kernel when logging level >= 5 by calling
_get_api_log_stats_kernel() from the startup/info path (e.g., inside _warn_dump
and/or _log_system_info) so the build/load happens before any capture; ensure
you handle the None return (build failure) as the existing callers do and keep
the existing level-10 docstring note for FLASHINFER_LOGLEVEL while adding a
short comment about the warmup at level-5.
- Around line 797-827: In flush_graph_dumps, after each successful tensor write
(inside the try block that writes tensors for entries from _PENDING_GRAPH_DUMPS
and after incrementing flushed), append a small completion record to the
per-dump metadata stream and the central session stream indicating
execution_status="completed", graph_capture_flushed=True, and a timestamp (use
entry["dump_dir"], entry["kind"], entry.get("func_name") to populate context);
ensure this write happens only on success and does not swallow exceptions from
the tensor save step, and reuse any existing helper(s) used elsewhere for
writing metadata/session JSONL to keep format consistent.
---
Nitpick comments:
In `@flashinfer/api_logging.py`:
- Around line 2042-2055: The current change also gates level-3 console logging
by _should_dump_function(func_name), which unintentionally prevents normal
FLASHINFER_LOGLEVEL=3 console logs for unfiltered APIs; revert that behavior by
making console logging (calls to _log_function_inputs and _log_function_outputs
when _API_LOG_LEVEL >= 3) not depend on _should_dump_function, while keeping
_should_dump_function gating only for disk dumps (level 10 or explicit dump
paths). Concretely, update the pre- and post-execution checks so that if
_API_LOG_LEVEL >= 3 you call _log_function_inputs/_log_function_outputs
unconditionally (or based on a separate _should_log_function predicate if you
prefer), and keep _should_dump_function checks only for the level-10 dump
branch; also update the env-var docstring around the
FLASHINFER_DUMP_INCLUDE/EXCLUDE lines to state these filters apply to dump files
(not console level-3 logging) or rename _should_dump_function to
_should_log_function and adjust doc accordingly if you intended to change
semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
Addresses inline review comments on PR flashinfer-ai#3172. Each item below maps to one or more comments from gemini-code-assist or coderabbit. csrc/api_log_stats.cu (level-5 stats kernel): - Use ``double`` (not ``float``) for the reduction accumulators and per-thread min/max. Float's 24-bit mantissa drops precision past ~16.7M for int32_t/int64_t inputs; the kernel always emits ``%.6f`` anyway. Drops the ``CUDART_INF_F`` sentinels for ``CUDART_INF``. - Include ``+/-Inf`` in the min/max reduction (still counted separately in ``inf=N``). Pre-fix the GPU path showed e.g. ``min=1 max=1`` for ``[1.0, +inf]`` while eager ``torch.min/max`` showed ``max=+inf``; the inconsistency was confusing. - New explicit "all non-finite" branch (``valid == 0``) so a tensor of pure NaN/Inf doesn't print the misleading sentinel ``min=inf max=-inf mean=0.000000``; instead we say ``(all non-finite) nan=N inf=M``. flashinfer/api_logging.py: - ``_launch_gpu_stats_kernel`` now early-returns ``None`` when the tensor is non-contiguous. The kernel does a linear scan via ``data[i]`` and would otherwise read garbage / out-of-bounds memory for transposed views or slices. - Eager warm-up of the level-5 stats kernel at import time when ``FLASHINFER_LOGLEVEL>=5``. Without this, the first stats call inside ``torch.cuda.graph(...)`` triggers ``cuModuleLoadData`` via ``build_and_load()``, which is forbidden under ``cudaStreamCaptureModeGlobal`` and aborts the capture. - ``_dump_function_inputs``/``_dump_function_outputs``: restrict the pinned-buffer staging path to capture mode and keep the legacy ``.cpu()`` extraction in eager mode. Pre-fix, eager dumps silently lost CUDA tensor strides because the pinned destination is contiguous, contradicting the docstring promise of stride/contiguity preservation. In eager we now also call a new ``_prime_pinned_buffer(...)`` that allocates (but doesn't copy into) the pinned cache so a subsequent captured call still finds a pre-allocated buffer. - New ``_DumpWarmupRequired(RuntimeError)`` subclass; ``_stage_tensor_to_pinned`` raises it (instead of bare ``RuntimeError``) when capture finds a cache miss. Both ``_dump_function_inputs`` and the ``flashinfer_api`` decorator now special-case this subclass and let it propagate to user code, while still swallowing other dump failures via the generic ``Exception`` branch. Pre-fix, the broad ``except Exception`` blocks silently swallowed the warmup error so the contract was un-enforceable from a user-test perspective. - ``flush_graph_dumps``: after a successful tensor-file write, append a completion record to per-dump ``metadata.jsonl`` and the central ``session.jsonl`` promoting ``execution_status`` from ``graph_capture_pending_flush`` to ``completed`` (or ``inputs_saved`` for the inputs half). Consumers that filter by terminal state now see flushed dumps as completed instead of stuck in pending. - ``clear_graph_dumps``: docstring rewritten to honestly describe current behavior — only the deferred-write registry is cleared; the pinned host buffers in ``_PINNED_DUMP_BUFFER_CACHE`` are intentionally retained so subsequent replays can reuse them without ``cudaHostAlloc`` (illegal under capture). tests/utils/test_logging.py: - ``test_level_10_cuda_graph_requires_warmup`` now asserts the ``RuntimeError`` explicitly via ``pytest.raises(..., match=r"(?i)pinned host memory")``. Pre-fix, the test accepted both the "exception" and "no exception" branches, so a regression that silently swallowed the warmup error would still leave it green. All 18 tests in ``tests/utils/test_logging.py`` pass. Skipped (out of scope or stale): - clang-format complaint on csrc/api_log_stats.cu was already addressed in commit 6551545. - ``tools/dump_with_cuda_graph.py`` was deleted in commit 67d066b, so the interpreter-detection comment is moot. Deferred (separate follow-ups): - Inf-vs-eager parity in the *exclusion* of Inf from the *sum* — the comment is right that this is debatable; we keep the current behavior (Inf affects min/max, excluded from mean) and document explicitly in the kernel. - Pinned-buffer aliasing within a single graph: same ``(func, key, shape, dtype)`` captured twice in one graph (e.g. inside a loop body) still aliases the same pinned buffer. Worth a follow-up that either disambiguates with a per-call counter or detects-and-errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/bot run |
e6924c3 to
e9ad9c9
Compare
At FLASHINFER_LOGLEVEL=5 the host path (.min().item() etc.) cannot run inside torch.cuda.graph(...) because .item() synchronizes the stream, so statistics were silently skipped. Replace the skip with a single-block CUDA kernel that computes min/max/mean/nan/inf and emits one printf line per tensor — the launch is captured into the graph and the printf fires on every replay. The host log records a correlation id so kernel output can be matched back to the API call/argument that produced it. Supported dtypes: float32, float16, bfloat16, int32, int64, uint8. Other dtypes fall back to the legacy skip message. Approach mirrors flashinfer-ai/debug-print#2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At FLASHINFER_LOGLEVEL=10 the dump path called .cpu() on every input/output, which synchronizes the captured stream and is illegal under torch.cuda.graph(...). Stage all dump tensors through cached pinned host buffers (allocated lazily during eager warmup, since cudaHostAlloc is forbidden under capture) and issue captured non_blocking copy_() ops so each replay refreshes the buffers in place. The actual inputs.pt/outputs.pt writes are deferred to a new flush_graph_dumps() API that the user calls after each g.replay() — that function synchronizes the stream, then writes the buffer's current contents to disk so the dump always reflects the most recent replay. clear_graph_dumps() releases the held pinned buffers. Caveats documented in flashinfer/api_logging.py: - requires at least one eager warmup call before capture (to populate the pinned-buffer cache); - the captured buffers are contiguous so original strides are not preserved (matches the existing safetensors path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets users run an unmodifiable Python program (e.g. sglang) under
FLASHINFER_LOGLEVEL=10 + torch.cuda.graph(...) without touching the
program's source. The wrapper:
* sets the FLASHINFER_DUMP_* env vars from CLI flags;
* monkey-patches torch.cuda.CUDAGraph.replay so flush_graph_dumps()
fires automatically after every replay (idempotent);
* runs the target via runpy when it is `python ...` so the patch stays
alive in-process; for non-Python targets it falls back to execvp and
warns.
Usage:
python tools/dump_with_cuda_graph.py \
--dump-dir /tmp/fi_dumps --include '*decode*' --max-count 10 \
-- python -m sglang.launch_server ...
Strongly recommend setting --include and --max-count: without scoping,
every replay rewrites every captured dump, which is heavy for high-QPS
workloads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting FLASHINFER_LOGLEVEL=10 inside a real workload (sglang DSR1 TP=8 warmup observed) used to emit ~28M [flashinfer stats] lines for every flashinfer API call, regardless of FLASHINFER_DUMP_INCLUDE — the include/exclude filter only gated _dump_function_inputs/outputs, not _log_function_inputs/outputs. The volume can overrun upstream HTTP health-check polling and abort the engine before the dump path even runs. Apply _should_dump_function in the level-3+ logging branches too, so include/exclude narrows BOTH log emission AND tensor dumps. Verified with sglang DSR1 FP8 TP=8 + dump_with_cuda_graph.py --include='BatchMLAPagedAttentionWrapper.run,top_k_renorm_probs,...': apilog dropped from 653 MB to 128 MB and the server reached graph capture + replay successfully (was previously timing out warmup). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A long real-workload run (e.g. sglang under InferenceX) issues the same flashinfer API at the same shape thousands of times during decode replay and emits a fresh dump every time, which inflates raw dump volume well past anything sanitize_dumps.py needs as input. The existing global FLASHINFER_DUMP_MAX_COUNT cap kicks in too uniformly: it stops after N dumps total, so rarely-seen shapes get dropped before they fire. Add a per-(func_name, input-shape-signature) cap. When set: FLASHINFER_DUMP_PER_SHAPE_LIMIT=5 every distinct shape that the workload actually exercises gets up to 5 sample dumps; further calls with the same shape are skipped. No shape synthesis — the set of shapes captured is exactly what the workload exercised. Useful as a passive prune feeding the existing flashinfer-bench / flashinfer-trace workload pipeline. Implementation: - new _compute_input_shape_signature(args, kwargs) — string-only, hashable, host-side-metadata-only (no GPU sync, safe under cuda graph capture). Tensor inputs contribute (shape, dtype); scalar kwargs contribute their repr() so e.g. different block_size counts as a distinct shape. - new _dump_shape_counter dict keyed by (func_name, signature), per-process. TP ranks see identical shapes and each contribute their own sample (fine for the typical 8-rank case). - gate runs after the existing include/exclude filter and global count cap, so all three caps compose. Default 0 = disabled (back-compat with all existing collectors). Verified with a synthetic 15-call test (3 shapes × 5 calls): with cap=2 the dump dir contains exactly 6 entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes that together let LOGLEVEL=10 dumps land for real-workload
collection through an inference engine that uses both
torch.inference_mode() and torch.cuda.graph(...) capture (sglang DSR1
TP=8 was the proving ground).
1. _stage_tensor_to_pinned now wraps both the pinned-buffer alloc and
the copy_() in `with torch.inference_mode(False):`. Without this,
sglang's outer torch.inference_mode() marks the cached pinned buffer
as an "inference tensor" on first allocation, which causes every
subsequent in-place copy_() to raise:
RuntimeError: Inplace update to inference tensor outside
InferenceMode is not allowed.
Net effect was every dump silently failing once requests started
flowing.
2. _install_cuda_graph_replay_autoflush patches torch.cuda.CUDAGraph
.replay (idempotently) to call flush_graph_dumps() after every
replay, gated on FLASHINFER_LOGLEVEL >= 10. Without this in-process
patch, the previous mechanism (tools/dump_with_cuda_graph.py)
monkey-patched only the parent process. sglang spawns TP worker
processes via multiprocessing.spawn, which gives each worker a fresh
Python interpreter that re-imports flashinfer; the wrapper's patch
never reached the workers, so captured D2H copies happened but the
inputs.{pt,safetensors} files were never written.
Verified end-to-end: with --attention-backend flashinfer, LOGLEVEL=10,
SAFETENSORS=1, and cuda graphs enabled, sglang reaches "fired up" and
external workload drivers (e.g. InferenceX benchmark_serving.py) drive
real requests whose dumps land on disk through the captured graph
replay path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that together let LOGLEVEL=10 capture real-workload shapes under sglang's cuda-graph decode without tanking decode throughput by 1-3 orders of magnitude. 1. Replace per-CUDAGraph.replay autoflush with atexit + SIGTERM hooks. The previous per-replay flush (commit 570ea4c) ran a few hundred save_file() calls after every captured graph replay, dropping decode from ~100 tok/s/rank to ~1 tok/s/rank when DUMP_DIR was on NFS, and was the cause of all-requests-failed under sustained sglang inference. Replays now refresh the pinned host buffers via captured D2H copies; one shutdown-time flush serializes the latest values for every captured shape — exactly what's needed for workload collection. 2. Add FLASHINFER_DISABLE_GRAPH_STATS=1 to skip the captured printf- from-graph stats path. _launch_gpu_stats_kernel embeds a device-side printf into every captured graph; under sustained replay this floods host stdout (~tens of thousands of [flashinfer stats] lines/s with ~122 traced ops × 8 TP × per-replay), saturating sglang's stdout pipe and stalling the inference scheduler. Set the new env var when collecting workloads under cuda graphs to suppress the printf. 3. Make flush_graph_dumps mkdir(parents=True, exist_ok=True) before each save_file. This lets the orchestrator wipe DUMP_DIR between the warmup→inference boundary (drops eager-mode warmup dumps) while still letting the captured-graph deferred dumps land at flush time — the registry's dump_dir paths just get re-created. Verified end-to-end on B200 + DSR1 FP8 TP=8 + InferenceX: - v9 (per-replay flush, NFS dump): all 16 requests timed out; 0 success - v10 (shutdown-only flush + DISABLE_GRAPH_STATS, plan-only filter): 8/8 successful, 1590 tok/s total throughput, 367 ms TTFT - post-warmup wipe verified safe via flush self-heal Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two general improvements to make level-5 stats and level-10 dumps work on real workloads under cuda graph capture: 1. Drop the per-filtered-call debug log. At FLASHINFER_LOGLEVEL>=10 the logger is at DEBUG, so emitting one line per filtered call (every gemm / rmsnorm / etc in a sustained inference run) saturates stderr at ~325k lines/sec and drops decode throughput by ~30x when an INCLUDE/EXCLUDE filter is set. 2. Add FLASHINFER_DUMP_MAX_TENSOR_MB env var (default 0 = old behaviour). Tensors over the cap are recorded as safe attrs (shape/dtype/device/stride) only — never via _serialize_value, which falls into the generic branch and calls str(t) → tensor.__repr__(), reading device memory and triggering cudaErrorStreamCaptureInvalidated under capture. Also avoids the multi-GB-per-replay D2H tax for giant KV-cache args (the canonical workload schemas treat such inputs as "type": "random" downstream). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Document FLASHINFER_DUMP_PER_SHAPE_LIMIT, FLASHINFER_DUMP_MAX_TENSOR_MB, and FLASHINFER_DISABLE_GRAPH_STATS env vars in the dump-config table. - Rewrite the CUDA Graph Compatibility section: replace the obsolete per-replay flush_graph_dumps() pattern with the atexit/SIGTERM auto-flush model that ships with the cuda-graph-aware logging. - Note when to flip FLASHINFER_DISABLE_GRAPH_STATS=1 (sustained-replay scenarios where device printf saturates stdout). - Note FLASHINFER_DUMP_MAX_TENSOR_MB rationale (avoids multi-GB D2H per replay AND the cudaErrorStreamCaptureInvalidated trap from str(tensor) under capture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…FER_DUMP_MAX_TENSOR_MB Both env vars were added for a workload-collection use case that isn't part of this PR's intended scope. Reverting to keep the cuda-graph + level 5/10 logging implementation minimal: - Remove FLASHINFER_DUMP_PER_SHAPE_LIMIT (introduced in fa8affd): drops the per-(func, shape) cap, _compute_input_shape_signature(), and the _dump_shape_counter registry. - Remove FLASHINFER_DUMP_MAX_TENSOR_MB (introduced in d237a0e): drops the safe-metadata-only path for oversized tensors and the _tensor_nbytes_mb() helper. _extract_tensors_and_metadata_pinned() reverts to the simple "stage every tensor" form. - Update docs/logging.rst accordingly. Net delta vs the previous tip: -2 features, -124 LoC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the env-var escape hatch and its branch in _log_tensor_statistics. Under cuda-graph capture, the path now always delegates stats to the captured GPU kernel (or falls back to the legacy "[statistics skipped]" message for unsupported dtypes), with no opt-out. - Drop FLASHINFER_DISABLE_GRAPH_STATS env var declaration. - Drop the if _DISABLE_GRAPH_STATS: branch in the capture-path stats code. - Drop the corresponding rows / paragraphs in docs/logging.rst. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tool's premise — monkey-patch CUDAGraph.replay so flush_graph_dumps() runs after every replay — is now obsolete. The atexit/SIGTERM hook in api_logging._install_cuda_graph_dump_autoflush() flushes once at process shutdown, which is the recommended pattern; per-replay flushing is explicitly avoided because it drops decode throughput from ~100 tok/s/rank to ~1 tok/s/rank when DUMP_DIR is on NFS. Beyond the obsolete monkey-patch, the wrapper does nothing that an env-var prefix can't do: FLASHINFER_LOGLEVEL=10 FLASHINFER_DUMP_DIR=... python -m my_program The patch also wouldn't propagate into multiprocess.spawn TP workers (the intended target), so even at its peak it didn't solve its own use case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add @flashinfer_api(trace=gemm_fp8_nt_groupwise_trace) on the FP8 group-wise GEMM (cutlass + trtllm backends). Captures the trtllm canonical scale layout (a_scale=[M, K//bk], b_scale=[K//bk, N//bk]) used by sglang's --moe-runner-backend flashinfer_trtllm DSR1 path. Validated against a real cuda-graph stage-1 run on 8x B200: this op fired ~28k times per TP worker but produced no trace JSON because the function had a bare @flashinfer_api with no template attached. - Fix trtllm_batch_decode_mla_trace and xqa_batch_decode_mla_trace to model the rank-4 [num_pages, 1, page_size, head_dim_qk] kv_cache layout via a kv_pad_dim const, switch workspace_buffer dtype int8 → uint8, and add the missing skip_softmax_threshold_scale_factor scalar. - Fix mla_rope_quantize_fp8_trace to use rank-2 K tensors (num_k_heads=1 collapsed) instead of inheriting the rank-3 GQA template _ROPE_QUANT_AXES / _ROPE_QUANT_INPUTS. - Add example invocation in tests/trace/example.py and regenerate tests/trace/fi_trace_out/ accordingly. Tests pass: 440 passed, 8 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…K_div_block]
The trtllm backend of gemm_fp8_nt_groupwise actually expects b_scale as
[N//bk, K//bk] — the transposed form of the layout described in
flashinfer/gemm/gemm_base.py:5681-5683. tests/gemm/test_groupwise_scaled_gemm_fp8.py:128-129
proves this:
if backend == "trtllm":
b_scale = b_scale.t().contiguous()
sglang's layers/quantization/fp8_utils.py produces the same transposed
layout, and a cuda-graph stage-1 run on 8x B200 shows runtime
b_scale.shape = (17, 56) for K=7168, N=2112, block=128 (i.e. (N//bk,
K//bk), with stride (56, 1) confirming row-major contiguous storage —
not a transposed view).
The previous template declared b_scale as [K_div_block, N_div_block],
which made the matcher in flashinfer-bench's sanitize_fi_log.py reject
every real gemm_fp8_nt_groupwise call: the K_div_block axis bound to
K//bk via a_scale at step 3, then b_scale dim-0 (which is actually
N//bk) tried to bind it to N//bk at step 4 → conflict → no template
match → no workload extracted.
Update the template, the example call in tests/trace/example.py, and
the regenerated JSON. Trace tests still pass: 440 passed, 8 skipped.
The flashinfer source-code docstring is left as-is (separate issue).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses inline review comments on PR flashinfer-ai#3172. Each item below maps to one or more comments from gemini-code-assist or coderabbit. csrc/api_log_stats.cu (level-5 stats kernel): - Use ``double`` (not ``float``) for the reduction accumulators and per-thread min/max. Float's 24-bit mantissa drops precision past ~16.7M for int32_t/int64_t inputs; the kernel always emits ``%.6f`` anyway. Drops the ``CUDART_INF_F`` sentinels for ``CUDART_INF``. - Include ``+/-Inf`` in the min/max reduction (still counted separately in ``inf=N``). Pre-fix the GPU path showed e.g. ``min=1 max=1`` for ``[1.0, +inf]`` while eager ``torch.min/max`` showed ``max=+inf``; the inconsistency was confusing. - New explicit "all non-finite" branch (``valid == 0``) so a tensor of pure NaN/Inf doesn't print the misleading sentinel ``min=inf max=-inf mean=0.000000``; instead we say ``(all non-finite) nan=N inf=M``. flashinfer/api_logging.py: - ``_launch_gpu_stats_kernel`` now early-returns ``None`` when the tensor is non-contiguous. The kernel does a linear scan via ``data[i]`` and would otherwise read garbage / out-of-bounds memory for transposed views or slices. - Eager warm-up of the level-5 stats kernel at import time when ``FLASHINFER_LOGLEVEL>=5``. Without this, the first stats call inside ``torch.cuda.graph(...)`` triggers ``cuModuleLoadData`` via ``build_and_load()``, which is forbidden under ``cudaStreamCaptureModeGlobal`` and aborts the capture. - ``_dump_function_inputs``/``_dump_function_outputs``: restrict the pinned-buffer staging path to capture mode and keep the legacy ``.cpu()`` extraction in eager mode. Pre-fix, eager dumps silently lost CUDA tensor strides because the pinned destination is contiguous, contradicting the docstring promise of stride/contiguity preservation. In eager we now also call a new ``_prime_pinned_buffer(...)`` that allocates (but doesn't copy into) the pinned cache so a subsequent captured call still finds a pre-allocated buffer. - New ``_DumpWarmupRequired(RuntimeError)`` subclass; ``_stage_tensor_to_pinned`` raises it (instead of bare ``RuntimeError``) when capture finds a cache miss. Both ``_dump_function_inputs`` and the ``flashinfer_api`` decorator now special-case this subclass and let it propagate to user code, while still swallowing other dump failures via the generic ``Exception`` branch. Pre-fix, the broad ``except Exception`` blocks silently swallowed the warmup error so the contract was un-enforceable from a user-test perspective. - ``flush_graph_dumps``: after a successful tensor-file write, append a completion record to per-dump ``metadata.jsonl`` and the central ``session.jsonl`` promoting ``execution_status`` from ``graph_capture_pending_flush`` to ``completed`` (or ``inputs_saved`` for the inputs half). Consumers that filter by terminal state now see flushed dumps as completed instead of stuck in pending. - ``clear_graph_dumps``: docstring rewritten to honestly describe current behavior — only the deferred-write registry is cleared; the pinned host buffers in ``_PINNED_DUMP_BUFFER_CACHE`` are intentionally retained so subsequent replays can reuse them without ``cudaHostAlloc`` (illegal under capture). tests/utils/test_logging.py: - ``test_level_10_cuda_graph_requires_warmup`` now asserts the ``RuntimeError`` explicitly via ``pytest.raises(..., match=r"(?i)pinned host memory")``. Pre-fix, the test accepted both the "exception" and "no exception" branches, so a regression that silently swallowed the warmup error would still leave it green. All 18 tests in ``tests/utils/test_logging.py`` pass. Skipped (out of scope or stale): - clang-format complaint on csrc/api_log_stats.cu was already addressed in commit 6551545. - ``tools/dump_with_cuda_graph.py`` was deleted in commit 67d066b, so the interpreter-detection comment is moot. Deferred (separate follow-ups): - Inf-vs-eager parity in the *exclusion* of Inf from the *sum* — the comment is right that this is debatable; we keep the current behavior (Inf affects min/max, excluded from mean) and document explicitly in the kernel. - Pinned-buffer aliasing within a single graph: same ``(func, key, shape, dtype)`` captured twice in one graph (e.g. inside a loop body) still aliases the same pinned buffer. Worth a follow-up that either disambiguates with a per-call counter or detects-and-errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pre-commit's `fix-end-of-files` hook found 11 fi_trace JSONs in this PR's diff missing a trailing newline. The fi_trace auto-dump writes each JSON via `json.dumps(...)` without an explicit `"\n"` at EOF; the hook adds one. No semantic change to any trace template. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1a7e30c to
6331423
Compare
6331423 to
008a283
Compare
|
/bot run |
|
@yyihuang is not authorized to trigger this CI job. cc: @yzh119, @sricketts, @yongwww |
|
/bot run |
Summary
Two related changes to
@flashinfer_api:FLASHINFER_LOGLEVEL=5and=10. Both levels previously had host-side paths that were unsafe insidetorch.cuda.graph(...)capture (tensor.min().item(),tensor.cpu(), tensor repr on nested CUDA tensors, etc.). Level 5 skipped stats under capture; Level 10 could invalidate graph capture or only preserve the latest replay state.fi_tracetemplate additions/fixes uncovered while validating (1) end-to-end against sglang DSR1 FP8 TP=8 with--attention-backend trtllm_mla. Several flashinfer entry points exercised by that workload had bare@flashinfer_api(no trace template) or had templates whose declared shapes did not match the actual runtime layout.Part A: CUDA-graph-compatible logging
Level 5: stats under CUDA graph capture
Replaces the
[statistics skipped: CUDA graph capture in progress]path for supported dtypes with a small captured CUDA kernel that computes min/max/mean/nan/inf and emits one line via device-sideprintf. The launch is captured into the graph, so the printf fires on everyg.replay(). The host log records a correlation marker so the kernel-emitted line can be matched back to the API call/argument.Supported dtypes:
float32,float16,bfloat16,int32,int64,uint8. Other dtypes (for example fp8/fp4) fall back to the legacy skip message.Files:
csrc/api_log_stats.cu,csrc/flashinfer_api_log_stats_binding.cu,flashinfer/jit/api_log_stats.py,flashinfer/api_logging.py,flashinfer/aot.py.Example output inside
torch.cuda.graph(...):On
g.replay():Level 10: tensor dumps under CUDA graph capture
Current behavior under CUDA graph capture:
.cpu(), allocate pinned host memory, or insert D2H copy nodes into the captured graph.torch.cuda.CUDAGraph.capture_begin/capture_endare wrapped to tag deferred dumps with the owning graph id.torch.cuda.CUDAGraph.replay()is wrapped so FlashInfer automatically flushes graph dumps after every replay for that graph. No sglang code injection is needed.inputs.pt/outputs.ptcompatibility files, and also writes immutablegraph_flushes/flush_XXXX/snapshots.list/tuple/dictinputs are recursively extracted and dumped with stable keys such asarg_2__0,arg_2__1, while metadata recordstensor_keylinks soreplay_from_dump()can reconstruct containers.This means root
inputs.pt/outputs.ptcontain the latest flushed replay, whilegraph_flushes/flush_XXXX/preserves per-replay snapshots.Caveats:
FLASHINFER_DUMP_INCLUDE,FLASHINFER_DUMP_MAX_COUNT, andFLASHINFER_DUMP_MAX_SIZE_GBfor short targeted debug runs.clear_graph_dumps()or process exit.kill -9can still lose pending deferred writes.Files:
flashinfer/api_logging.py,docs/logging.rst,tests/utils/test_logging.py.Part B:
fi_tracetemplate additions / fixesEnd-to-end validation under sglang DSR1 +
trtllm_mlashowed these gaps:gemm_fp8_nt_groupwisehad notrace=template. The op fires heavily in DSR1 +flashinfer_trtllmMoE but produced no trace JSON.trtllm_batch_decode_mla_traceandxqa_batch_decode_mla_tracedeclaredkv_cacheas rank-3[num_pages, page_size, head_dim_qk], but the kernel accepts and sglang passes the rank-4[num_pages, 1, page_size, head_dim_qk]form. They were also missingskip_softmax_threshold_scale_factor, and workspace dtype was corrected fromint8touint8.mla_rope_quantize_fp8_traceinherited the rank-3 GQA rope quant axes withnum_k_heads, but MLA passes rank-2 K tensors withnum_k_heads=1collapsed.gemm_fp8_nt_groupwise_trace.b_scaleaxes were corrected to[N_div_block, K_div_block]to match the trtllm path and sglang runtime layout.Files:
flashinfer/gemm/gemm_base.py,flashinfer/trace/templates/{gemm,attention,rope}.py,tests/trace/example.py, regenerated JSONs intests/trace/fi_trace_out/.Test plan
Unit / formatting
pre-commit run --files docs/logging.rst flashinfer/api_logging.py tests/utils/test_logging.pypassed.CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/averyh/flashinfer-pr3172 pytest -q tests/utils/test_logging.pypassed:20 passed, 2 warnings.python -m compileall -q flashinfer/api_logging.py tests/utils/test_logging.pypassed.git diff --checkpassed.pytest tests/trace/passed previously: 440 passed, 8 skipped.tests/utils/test_logging_replay.pypartial local run: 14 passed, 2 failed due local environment/JIT setup before replay validation:test_bmm_fp8_replay: cuDNN reported multiple CUDA runtime libraries,libcudart.so.12andlibcudart.so.13.test_mm_fp4_replay: local JIT build failed becausecutlass/arch/barrier.hwas missing.Manual CUDA graph smoke tests
g.replay()after mutating the input shows updated stats; multiple replays work.graph_flushes/flush_XXXX/snapshots.replay_from_dump().SGLang single-GPU validation, May 13 2026
Environment:
2.9.1+cu1280.5.10.post1/home/averyh/flashinfer-pr3172/flashinfer/__init__.pymeta-llama/Llama-3.2-3B-Instructpython -m sglang.launch_server --attention-backend flashinfer --sampling-backend flashinfer --cuda-graph-bs 1 --cuda-graph-max-bs 1 --context-length 512 --max-total-tokens 1024 --mem-fraction-static 0.55 --dtype bfloat16Level 5 result:
attention_backend='flashinfer',sampling_backend='flashinfer',disable_cuda_graph=False,cuda_graph_bs=[1].Capture cuda graph bs [1]andCapture cuda graph end." Paris. The capital".cuda graph: True.[stats deferred to GPU kernel: id=...]markers.[flashinfer stats] id=...lines.[statistics skipped: CUDA graph capture in progress]lines were observed for this run.Level 10 result:
attention_backend='flashinfer',sampling_backend='flashinfer',disable_cuda_graph=False,cuda_graph_bs=[1].FLASHINFER_DUMP_INCLUDE='BatchDecodeWithPagedKVCacheWrapper.run'andFLASHINFER_DUMP_MAX_COUNT=64.flush_graph_dumps (CUDAGraph.replay): wrote ...replay flushes.graph_flushes, and 96 immutable replay snapshot dirs.arg_1,arg_2__0,arg_2__1, proving both the query tensor and nested K/V cache tuple tensors were dumped.result.flush_0001toflush_0004snapshots had different input/output sums, confirming replay snapshots are not just the last buffer state.Capture cuda graph failed,cudaErrorStreamCaptureInvalidated, orBatchDecodeWithPagedKVCache failederrors in the final level-10 run.Larger integration validation
(api, axes)tuple produced a trace JSON.cudaErrorStreamCaptureInvalidated. Runtime is dominated by unthrottled device-side printf rate.PR state
008a2836103fcf86yyihuang/flashinfer:cuda-graph-api-logging