Skip to content

CUTracer v0.3.0 Release Notes

Latest

Choose a tag to compare

@FindHao FindHao released this 10 Jun 03:38
· 5 commits to main since this release

πŸŽ‰ Minor Release β€” 49 commits since v0.2.1

CUTracer v0.3.0 (2026-04-23 β†’ 2026-06-09) is a substantial minor release that focuses on three big themes: (1) a new high-performance RapidJSON trace serializer that becomes the default and removes ~16% of host-CPU overhead, (2) Blackwell-class race detection β€” a new TMEM-lifetime detector for the analyze data-race command plus warpgroup-targeted random-delay injection that can finally reproduce SM100 flash-attention dealloc races, and (3) a reusable cuda-gdb hang-analysis backend that samples live CUDA hangs and feeds structured evidence into the AI reasoner pipeline. The release also completes the NVBit 1.8 TMA migration, ships a unified AI all command behind a clean Reasoner protocol, expands schemas to cover every record type the writer emits, and migrates the repository from facebookresearch to facebookexperimental with the wiki now living under docs/.


✨ Highlights

  • RapidJSON trace serializer (default) β€” Removes ~16% host CPU and ~8% malloc churn from the per-record write path; byte-identical output to the legacy nlohmann path for reg/mem/opcode/mem_value records.
  • TMEM Lifetime Detector D3 β€” New detector in analyze data-race that finds Blackwell SM100 TMEM dealloc-vs-LDTM races, with prototype-matching 256-vs-0 verdicts on the case-15 baseline/fixed pair.
  • Warpgroup-targeted random delay β€” --delay-warpgroup-ids / --delay-warp-mask / --delay-enable-prob flags let random_delay stall an entire warp or warpgroup as a single scheduling unit, enabling reproduction of cross-warpgroup races that per-thread delays could not.
  • cuda-gdb live hang analyzer β€” Reusable cutracer.debugger Python backend samples PC offsets, SASS context, register operands, HGMMA/TMA memory evidence, and effective-blocker inference from a live hung CUDA process.
  • NVBit 1.8 TMA pipeline β€” 3-diff migration to first-class nvbit_parse_tma_transfer_info(); trace now carries structured tensor metadata (dim, dtype, strides, swizzle, SMEM dst/src/mbar addresses) instead of opaque parameter handles.
  • Unified AI all command β€” All --ai CLI paths re-routed through a new Reasoner Protocol; new cross-domain all command runs deadlock + data-race reasoning in one invocation.
  • Repository migrated to facebookexperimental β€” Wiki content moved into docs/ with auto-sync workflow; CI gains H100 lanes (meta-triton + upstream Triton nightly) and retires the legacy T4 GitHub-hosted workflow.

🏎️ Capture-Side Serializer Overhaul (RapidJSON)

A capture CPU profile attributed ~16% of host CPU to per-record nlohmann::json DOM serialization plus ~8% malloc churn from per-record std::stringstream hex formatting. This release ports the entire hot path to a streaming rapidjson::Writer and makes it the default.

  • Opt-in RapidJSON path (D107298645) β€” Adds streaming Writer into a reused thread_local StringBuffer for reg_trace / mem_addr_trace / mem_value_trace / opcode_only; gated behind CUTRACER_JSON_ENGINE={nlohmann,rapidjson,ab} with an ab oracle that semantically compares both engines per record and reports per-type match/mismatch counters at writer teardown.
  • Unit-test the RapidJSON serializer β€” Extracted to its own translation unit with cxx goldens, exercising all four record types.
  • Port tma_trace (D107320461) β€” Removes the last nlohmann fallback in the per-record path; semantically identical on 2,304 TMA descriptors.
  • RapidJSON becomes the default (D107323672) β€” Removes the CUTRACER_JSON_ENGINE switch, the A/B comparison machinery, build_nlohmann_line, and the entire per-record nlohmann serializer family (~390 lines net). nlohmann is retained only for the one-time kernel_metadata header.
  • Heap-corruption fix in TraceWriter (#225) β€” Serialize public mutators with std::mutex to prevent libstdc++ _M_mutate heap double-free that surfaced as "JSON decode error - unexpected character" at ~7% per kernel-run on aarch64. Replaces a misleading "CRITICAL FIX" std::move dance that did not actually fix the race.
  • OSS install fix (#224) β€” install_third_party.sh now downloads rapidjson headers so the OSS CI build succeeds.

πŸ—οΈ Blackwell Race Detection

TMEM Lifetime Detector D3

Final diff of the case-15 TMEM dealloc-race stack lands a third detector in the analyze data-race registry alongside DataRaceRAWDetector. The new cutracer/analyze/fb/data_race/tmem_lifetime/detector.py module reconciles cross-warp ARRIVE sets per TCGEN05.DEALLOC and emits one Finding(severity=ERROR, category=TMEM_LIFETIME) for every (CTA, warp) that LDTM-reads TMEM but never ARRIVEs on the guarding mbarrier before deallocation. Three-phase pipeline (static guard scan β†’ per-CTA event classification β†’ per-DEALLOC window reconciliation) matches the prototype's 256-vs-0 PASS on the case-15 baseline / D105385008-fixed trace pair. Hopper traces yield no DEALLOCs and short-circuit cleanly via the existing decoder.is_tmem_dealloc / is_tmem_read / is_mbarrier_arrive dispatch.

Warpgroup-Targeted Random Delay (5-diff stack)

Per-thread random delay washes out at the warp boundary (warps are lock-step, so the effective stall is max(thread_delays)), and existing CTA-local modes operate on clusters, not warp ranges within a single CTA. This stack teaches random_delay to stall an entire warp or warpgroup as a single scheduling unit β€” the exact timing distribution needed to reproduce SM100 flash-attention TMEM dealloc races.

  • Device function + host wrapper β€” New instrument_delay_warpgroup(pred, delay_ns, warp_mask) in inject_funcs.cu and instrument_warpgroup_delay_injection host wrapper. Computes CTA-local warp id from threadIdx/blockDim, tests warp_mask bit, and __nanosleeps the entire warp uniformly via the same 1ms chunked loop as instrument_delay_random_cluster.
  • CLI flags + env vars β€” CUTRACER_DELAY_WARP_MASK (hex/oct/dec via new get_var_uint32_auto parser; existing get_var_uint32 silently dropped 0x prefixes via atoll) and CUTRACER_DELAY_WARPGROUP_ID (integer warpgroup index, resolves to 0xF << (id * 4) on the host).
  • Dispatcher routing + --delay-enable-prob β€” Routes warp targeting through the existing delay dispatcher in cutracer.cu and adds a new probability gate that bypasses the 50/50 PC gate for tighter control over injection density.
  • Persist warp targeting in delay config JSON β€” Replay mode (--delay-load-path) now round-trips warp/warpgroup targeting for deterministic reproduction.

Periodic Hang Snapshots

Adds periodic-snapshot sampling so the analyzer can correlate evolving warp state across multiple sample windows rather than relying on a single point-in-time capture.

Blackwell Test Coverage

  • Blackwell FP8 GEMM E2E test β€” Permanent E2E test exercising the upstream Blackwell TLX UTCQMMA path with real FP8 data.
  • Block-scaled UTC*MMA fixture β€” New Blackwell mxfp8 / mxfp4 GEMM fixture for block-scaled tensor-core coverage.
  • Local-only Buck test for Blackwell FA-WS data race reproducer β€” Captures the case-15 reproducer as a permanent regression guard (local-only because of GPU requirements).
  • UTC*MMA A=tmem fix β€” Replaces the positional guard with a per-operand find("gdesc") loop so URx+1 is correctly pushed as the high half of the B gdesc; header rewritten to cite Blackwell ISA slot semantics.

πŸ” Live Hang Analysis (cuda-gdb backend)

A new reusable cutracer.debugger Python package wraps cuda-gdb to extract structured evidence from a live hung CUDA process. Designed to feed the AI reasoner pipeline.

  • Reusable backend (#212) β€” Pure Python backend with no gdb runtime dependency in the base layer; provides the foundation for the follow-up cuda-gdb command.
  • Preserve cuda-gdb warp slot identity β€” Parses cuda-gdb Wp separately from logical warp id; CUTracer parity needs logical warp id (first_active_threadIdx.x // 32) while cuda-gdb focus commands need the physical Wp slot. Both are now serialized in debugger opcode records.
  • PC offsets and SASS context β€” Records runtime PC, kernel-relative offset, and a small disassembly window for each sampled warp.
  • SASS register operands β€” Reads scalar and uniform registers named by the sampled SASS instruction and serializes captured values; failures are recorded as explicit register-read errors so reports never invent values.
  • Effective PC evidence β€” Distinguishes the instruction cuda-gdb stopped on from the effective blocking instruction reported to CUTracer analysis (post-barrier rule).
  • Effective-blocker inference β€” Small cuda-gdb disassembly parser implements the immediate-previous-instruction rule for recovering the effective blocking PC when cuda-gdb stops on a known unsafe post-barrier instruction.
  • HGMMA and TMA memory evidence β€” Captures Hopper GMMA and TMA-related memory state from the live process for the reasoner.
  • Dynamic cuda-gdb AI evidence reporting β€” Evidence stream surfaced directly in --ai reports.

🧠 AI Reasoner Refactor

  • Unified all command + Reasoner protocol (D103870397) β€” All --ai paths re-routed through the new Reasoner Protocol. New cross-domain cutracer analyze all --ai runs deadlock + data-race reasoning together. Centralizes LLM-client construction and report composition.
  • AIDeadlockAnalyzer retired β€” No longer imported from production code; per-reasoner tests (test_deadlock_reasoner.py, test_data_race_reasoner.py, test_unified_reasoner.py) cover the --ai paths end-to-end via fake LLM clients.
  • Dedup β€” Removed redundant base deadlock_command and dead tritonparse fallback shim.

πŸ“‘ TMA Pipeline (NVBit 1.8 Migration)

Three-diff stack migrates TMA tracing off non-public ISA semantics (URa/URa+1 layout, MULTICAST register-shift workarounds) and onto NVBit 1.8's first-class TMA APIs.

  • GPU-side capture β€” Asks NVBit for the TMA parameter handle directly via the new public API.
  • Host-side parsing β€” Calls nvbit_parse_tma_transfer_info() from the receive thread to decode the runtime handle into a structured TMATransferInfo_t; trace now carries fully-structured tensor metadata (dim, mode, coords, dtype, rank, global address/dim/strides, box dim, swizzle).
  • Python decoder consumption β€” analyze tma consumes the new structured tma_transfer_info field.
  • SMEM dst/src addresses β€” Serializes dst.data_address, dst.mbar_address (and src.* for UTMASTG) from NVBit 1.8's TMAAddress_t union. These are the official NVBit API equivalent of the URb/URb+1 register values the data race detector needs to build barrier-to-data mappings β€” capture no longer relies on implicit register snapshotting in reg_trace.

πŸ§ͺ Schema & Validation

  • Trace schemas for every record type + drift guard (D106872962) β€” Adds JSON Schemas for tma_trace, mem_value_trace, and mem_addr_trace. Previously SCHEMAS_BY_TYPE only covered reg_trace / mem_trace / opcode_only / kernel_metadata, so validate / compare silently skipped every current trace's data records. New CI drift guard catches writer/schema divergence.
  • SM100 uniform-register indices β€” reg_trace schema's uregs_indices maximum raised from 62 to 255 (matches regs_indices) to avoid chasing per-arch maximums; both are uint8_t register indices where high values encode special registers.

πŸ–₯️ CLI Changes

# New short alias for --output-dir (matches gcc/ffmpeg/curl/tar convention)
cutracer trace -o ./logs -- python my_kernel.py

# Warpgroup-targeted delay
cutracer trace -i random_delay --delay-ns 100000 \
    --delay-warpgroup-ids 1 \
    --delay-enable-prob 1.0 \
    -- python my_kernel.py

cutracer trace -i random_delay --delay-ns 100000 \
    --delay-warp-mask 0xF0 \
    -- python my_kernel.py

# Unified AI cross-domain analysis (deadlock + data-race in one shot)
cutracer analyze all trace.ndjson --ai -o report.md
  • Friendlier trace errors β€” Three UX fixes for the intentionally permissive trace subcommand (ignore_unknown_options=True, nargs=-1, type=click.UNPROCESSED): clearer messages when wrapped command tokens leak into CUTracer's own option parser, a warning when running under buck2 test, and absolute-path resolution for the wrapped command.
  • -o short alias β€” Closes the foot-gun where a stray -o ./logs was silently appended to the wrapped command (yielding /bin/sh: - : invalid option) because the trace subcommand passes unknown options through to the target binary.

πŸ“ Configuration Changes

Removed Environment Variables

Variable Notes
CUTRACER_JSON_ENGINE The opt-in switch from the RapidJSON migration window. No longer required β€” RapidJSON is now the default and only per-record path; the nlohmann per-record code was removed.

New Environment Variables

Variable Description Default
CUTRACER_DELAY_WARP_MASK CTA-local warp bitmask for random_delay. Accepts hex (0xF), octal (0o17), or decimal (15). 0 = disabled. 0
CUTRACER_DELAY_WARPGROUP_ID Warpgroup index; >= 0 selects warps [4N..4N+3] and wins over CUTRACER_DELAY_WARP_MASK. -1 = disabled. -1
CUTRACER_DELAY_ENABLE_PROB Probability gate for delay injection that bypasses the default 50/50 PC gate. (per-default)

πŸ”„ Dependency Updates

  • NVBit: 1.7 β†’ 1.8 (first-class TMA APIs: nvbit_parse_tma_transfer_info, TMATransferInfo_t, TMAAddress_t)
  • RapidJSON: now a vendored header dependency installed by install_third_party.sh (used as the default trace serializer)

πŸ—οΈ Infrastructure

  • Repository migration β€” facebookresearch/CUTracer β†’ facebookexperimental/CUTracer. PR references in commit messages reflect the new org from #213 onward.
  • Wiki β†’ docs/ β€” Mirrors what TritonParse did in D102010588: the GitHub Wiki content now lives in-repo under docs/ with an auto-sync workflow.
  • CONTRIBUTING.md β€” Reset to fbcode GitHub standard.
  • Monthly stale-PR cleanup workflow β€” Adds a scheduled job to prune stale PR branches.
  • Per-test/per-run subdir layout β€” tests/py_add reorganized with TEST_KEEP_OUTPUT support for debugging.

CI

  • H100 CI lanes (#219, #217) β€” Two new H100 workflows alongside the existing T4 test.yml: a meta-triton lane (sets SKIP_PROTON=1 since FB Triton lags upstream by ~100 days) and an upstream Triton lane. Both use the new self-hosted H100 runner.
  • Triton-Nightly matrix leg (#218) β€” Replaces install-triton.sh with a matrixed nightly install; nightly pinned to the latest upstream commit (workaround for in-flight breakage).
  • Inductor inline-compile (#223) β€” test_add.py forces inductor inline compile to dodge an upstream Triton subprocess-pool driver crash.
  • Retired legacy cu128 T4 workflow (#226) β€” test.yml removed in favor of the new H100 lanes.
  • fb-only unit test split (#222) β€” tests/unit/fb/ separation to fix OSS CI on tests with Meta-internal dependencies.
  • PyPI publish hardening β€” pypa/gh-action-pypi-publish pinned to a commit SHA in nightly-pypi.yml.

πŸ› Notable Bug Fixes

  • NDJSON heap double-free (#225) β€” Concurrent std::string::append on the shared json_buffer_ triggered libstdc++ _M_mutate double-free under ASan; surfaced in OSS CI as random "unexpected character" JSON decode errors at ~7% per kernel-run on aarch64. Fixed by serializing write_trace with std::mutex.
  • UTC*MMA A=tmem high-half β€” Per-operand find("gdesc") loop replaces a positional guard that was dropping URx+1 (B gdesc high half).
  • Type-checking target β€” Fixed broken fbcode//triton/tools/CUTracer/python:cutracer-type-checking (unmanaged regression from D103464).
  • uregs_indices schema bound β€” Raised from 62 to 255 so new Blackwell architectures don't break validate.

⚠️ Breaking Changes

  • CUTRACER_JSON_ENGINE env var removed β€” The opt-in switch from the migration window is gone now that RapidJSON is the default and the nlohmann per-record path has been deleted. Output is byte-identical for reg/mem/opcode/mem_value records and semantically identical for tma_trace. No user action required unless you were setting CUTRACER_JSON_ENGINE=ab for serializer parity testing.
  • GitHub organization moved β€” Repository now lives at facebookexperimental/CUTracer. Update any pinned remotes / submodule URLs / CI references.
  • Legacy T4 CI workflow retired β€” Anything that referenced the test.yml workflow name in T4-targeted automation should switch to the new H100 workflows.

Everything else is additive (new flags, new schemas, new analyzers, new commands).


πŸ™ Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework β€” and in particular for the NVBit 1.8 TMA APIs that let this release retire CUTracer's non-public-ISA TMA workarounds.

Contributors to this release: Chen Li, Lei Wang, Xu Zhao, Yueming Hao (alphabetical).


πŸ“„ License

  • MIT License β€” Meta Platforms, Inc. contributions
  • BSD-3-Clause License β€” NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


πŸ“š Documentation

Full documentation is now in-repo under docs/ (auto-synced to the Wiki).


πŸ”— Links