Release CUTracer v0.3.0 Release Notes · facebookexperimental/CUTracer

🎉 Minor Release — 49 commits since v0.2.1

CUTracer v0.3.0 (2026-04-23 → 2026-06-09) is a substantial minor release that focuses on three big themes: (1) a new high-performance RapidJSON trace serializer that becomes the default and removes ~16% of host-CPU overhead, (2) Blackwell-class race detection — a new TMEM-lifetime detector for the analyze data-race command plus warpgroup-targeted random-delay injection that can finally reproduce SM100 flash-attention dealloc races, and (3) a reusable cuda-gdb hang-analysis backend that samples live CUDA hangs and feeds structured evidence into the AI reasoner pipeline. The release also completes the NVBit 1.8 TMA migration, ships a unified AI all command behind a clean Reasoner protocol, expands schemas to cover every record type the writer emits, and migrates the repository from facebookresearch to facebookexperimental with the wiki now living under docs/.

✨ Highlights

RapidJSON trace serializer (default) — Removes ~16% host CPU and ~8% malloc churn from the per-record write path; byte-identical output to the legacy nlohmann path for reg/mem/opcode/mem_value records.
TMEM Lifetime Detector D3 — New detector in analyze data-race that finds Blackwell SM100 TMEM dealloc-vs-LDTM races, with prototype-matching 256-vs-0 verdicts on the case-15 baseline/fixed pair.
Warpgroup-targeted random delay — --delay-warpgroup-ids / --delay-warp-mask / --delay-enable-prob flags let random_delay stall an entire warp or warpgroup as a single scheduling unit, enabling reproduction of cross-warpgroup races that per-thread delays could not.
cuda-gdb live hang analyzer — Reusable cutracer.debugger Python backend samples PC offsets, SASS context, register operands, HGMMA/TMA memory evidence, and effective-blocker inference from a live hung CUDA process.
NVBit 1.8 TMA pipeline — 3-diff migration to first-class nvbit_parse_tma_transfer_info(); trace now carries structured tensor metadata (dim, dtype, strides, swizzle, SMEM dst/src/mbar addresses) instead of opaque parameter handles.
Unified AI all command — All --ai CLI paths re-routed through a new Reasoner Protocol; new cross-domain all command runs deadlock + data-race reasoning in one invocation.
Repository migrated to facebookexperimental — Wiki content moved into docs/ with auto-sync workflow; CI gains H100 lanes (meta-triton + upstream Triton nightly) and retires the legacy T4 GitHub-hosted workflow.

🏎️ Capture-Side Serializer Overhaul (RapidJSON)

A capture CPU profile attributed ~16% of host CPU to per-record nlohmann::json DOM serialization plus ~8% malloc churn from per-record std::stringstream hex formatting. This release ports the entire hot path to a streaming rapidjson::Writer and makes it the default.

Opt-in RapidJSON path (D107298645) — Adds streaming Writer into a reused thread_local StringBuffer for reg_trace / mem_addr_trace / mem_value_trace / opcode_only; gated behind CUTRACER_JSON_ENGINE={nlohmann,rapidjson,ab} with an ab oracle that semantically compares both engines per record and reports per-type match/mismatch counters at writer teardown.
Unit-test the RapidJSON serializer — Extracted to its own translation unit with cxx goldens, exercising all four record types.
Port tma_trace (D107320461) — Removes the last nlohmann fallback in the per-record path; semantically identical on 2,304 TMA descriptors.
RapidJSON becomes the default (D107323672) — Removes the CUTRACER_JSON_ENGINE switch, the A/B comparison machinery, build_nlohmann_line, and the entire per-record nlohmann serializer family (~390 lines net). nlohmann is retained only for the one-time kernel_metadata header.
Heap-corruption fix in TraceWriter (#225) — Serialize public mutators with std::mutex to prevent libstdc++ _M_mutate heap double-free that surfaced as "JSON decode error - unexpected character" at ~7% per kernel-run on aarch64. Replaces a misleading "CRITICAL FIX" std::move dance that did not actually fix the race.
OSS install fix (#224) — install_third_party.sh now downloads rapidjson headers so the OSS CI build succeeds.

🏗️ Blackwell Race Detection

TMEM Lifetime Detector D3

Final diff of the case-15 TMEM dealloc-race stack lands a third detector in the analyze data-race registry alongside DataRaceRAWDetector. The new cutracer/analyze/fb/data_race/tmem_lifetime/detector.py module reconciles cross-warp ARRIVE sets per TCGEN05.DEALLOC and emits one Finding(severity=ERROR, category=TMEM_LIFETIME) for every (CTA, warp) that LDTM-reads TMEM but never ARRIVEs on the guarding mbarrier before deallocation. Three-phase pipeline (static guard scan → per-CTA event classification → per-DEALLOC window reconciliation) matches the prototype's 256-vs-0 PASS on the case-15 baseline / D105385008-fixed trace pair. Hopper traces yield no DEALLOCs and short-circuit cleanly via the existing decoder.is_tmem_dealloc / is_tmem_read / is_mbarrier_arrive dispatch.

Warpgroup-Targeted Random Delay (5-diff stack)

Per-thread random delay washes out at the warp boundary (warps are lock-step, so the effective stall is max(thread_delays)), and existing CTA-local modes operate on clusters, not warp ranges within a single CTA. This stack teaches random_delay to stall an entire warp or warpgroup as a single scheduling unit — the exact timing distribution needed to reproduce SM100 flash-attention TMEM dealloc races.

Device function + host wrapper — New instrument_delay_warpgroup(pred, delay_ns, warp_mask) in inject_funcs.cu and instrument_warpgroup_delay_injection host wrapper. Computes CTA-local warp id from threadIdx/blockDim, tests warp_mask bit, and __nanosleeps the entire warp uniformly via the same 1ms chunked loop as instrument_delay_random_cluster.
CLI flags + env vars — CUTRACER_DELAY_WARP_MASK (hex/oct/dec via new get_var_uint32_auto parser; existing get_var_uint32 silently dropped 0x prefixes via atoll) and CUTRACER_DELAY_WARPGROUP_ID (integer warpgroup index, resolves to 0xF << (id * 4) on the host).
Dispatcher routing + --delay-enable-prob — Routes warp targeting through the existing delay dispatcher in cutracer.cu and adds a new probability gate that bypasses the 50/50 PC gate for tighter control over injection density.
Persist warp targeting in delay config JSON — Replay mode (--delay-load-path) now round-trips warp/warpgroup targeting for deterministic reproduction.

Periodic Hang Snapshots

Adds periodic-snapshot sampling so the analyzer can correlate evolving warp state across multiple sample windows rather than relying on a single point-in-time capture.

Blackwell Test Coverage

Blackwell FP8 GEMM E2E test — Permanent E2E test exercising the upstream Blackwell TLX UTCQMMA path with real FP8 data.
Block-scaled UTC*MMA fixture — New Blackwell mxfp8 / mxfp4 GEMM fixture for block-scaled tensor-core coverage.
Local-only Buck test for Blackwell FA-WS data race reproducer — Captures the case-15 reproducer as a permanent regression guard (local-only because of GPU requirements).
UTC*MMA A=tmem fix — Replaces the positional guard with a per-operand find("gdesc") loop so URx+1 is correctly pushed as the high half of the B gdesc; header rewritten to cite Blackwell ISA slot semantics.

🔍 Live Hang Analysis (cuda-gdb backend)

A new reusable cutracer.debugger Python package wraps cuda-gdb to extract structured evidence from a live hung CUDA process. Designed to feed the AI reasoner pipeline.

Reusable backend (#212) — Pure Python backend with no gdb runtime dependency in the base layer; provides the foundation for the follow-up cuda-gdb command.
Preserve cuda-gdb warp slot identity — Parses cuda-gdb Wp separately from logical warp id; CUTracer parity needs logical warp id (first_active_threadIdx.x // 32) while cuda-gdb focus commands need the physical Wp slot. Both are now serialized in debugger opcode records.
PC offsets and SASS context — Records runtime PC, kernel-relative offset, and a small disassembly window for each sampled warp.
SASS register operands — Reads scalar and uniform registers named by the sampled SASS instruction and serializes captured values; failures are recorded as explicit register-read errors so reports never invent values.
Effective PC evidence — Distinguishes the instruction cuda-gdb stopped on from the effective blocking instruction reported to CUTracer analysis (post-barrier rule).
Effective-blocker inference — Small cuda-gdb disassembly parser implements the immediate-previous-instruction rule for recovering the effective blocking PC when cuda-gdb stops on a known unsafe post-barrier instruction.
HGMMA and TMA memory evidence — Captures Hopper GMMA and TMA-related memory state from the live process for the reasoner.
Dynamic cuda-gdb AI evidence reporting — Evidence stream surfaced directly in --ai reports.

🧠 AI Reasoner Refactor

Unified all command + Reasoner protocol (D103870397) — All --ai paths re-routed through the new Reasoner Protocol. New cross-domain cutracer analyze all --ai runs deadlock + data-race reasoning together. Centralizes LLM-client construction and report composition.
AIDeadlockAnalyzer retired — No longer imported from production code; per-reasoner tests (test_deadlock_reasoner.py, test_data_race_reasoner.py, test_unified_reasoner.py) cover the --ai paths end-to-end via fake LLM clients.
Dedup — Removed redundant base deadlock_command and dead tritonparse fallback shim.

📡 TMA Pipeline (NVBit 1.8 Migration)

Three-diff stack migrates TMA tracing off non-public ISA semantics (URa/URa+1 layout, MULTICAST register-shift workarounds) and onto NVBit 1.8's first-class TMA APIs.

GPU-side capture — Asks NVBit for the TMA parameter handle directly via the new public API.
Host-side parsing — Calls nvbit_parse_tma_transfer_info() from the receive thread to decode the runtime handle into a structured TMATransferInfo_t; trace now carries fully-structured tensor metadata (dim, mode, coords, dtype, rank, global address/dim/strides, box dim, swizzle).
Python decoder consumption — analyze tma consumes the new structured tma_transfer_info field.
SMEM dst/src addresses — Serializes dst.data_address, dst.mbar_address (and src.* for UTMASTG) from NVBit 1.8's TMAAddress_t union. These are the official NVBit API equivalent of the URb/URb+1 register values the data race detector needs to build barrier-to-data mappings — capture no longer relies on implicit register snapshotting in reg_trace.

🧪 Schema & Validation

Trace schemas for every record type + drift guard (D106872962) — Adds JSON Schemas for tma_trace, mem_value_trace, and mem_addr_trace. Previously SCHEMAS_BY_TYPE only covered reg_trace / mem_trace / opcode_only / kernel_metadata, so validate / compare silently skipped every current trace's data records. New CI drift guard catches writer/schema divergence.
SM100 uniform-register indices — reg_trace schema's uregs_indices maximum raised from 62 to 255 (matches regs_indices) to avoid chasing per-arch maximums; both are uint8_t register indices where high values encode special registers.

🖥️ CLI Changes

# New short alias for --output-dir (matches gcc/ffmpeg/curl/tar convention)
cutracer trace -o ./logs -- python my_kernel.py

# Warpgroup-targeted delay
cutracer trace -i random_delay --delay-ns 100000 \
    --delay-warpgroup-ids 1 \
    --delay-enable-prob 1.0 \
    -- python my_kernel.py

cutracer trace -i random_delay --delay-ns 100000 \
    --delay-warp-mask 0xF0 \
    -- python my_kernel.py

# Unified AI cross-domain analysis (deadlock + data-race in one shot)
cutracer analyze all trace.ndjson --ai -o report.md

Friendlier trace errors — Three UX fixes for the intentionally permissive trace subcommand (ignore_unknown_options=True, nargs=-1, type=click.UNPROCESSED): clearer messages when wrapped command tokens leak into CUTracer's own option parser, a warning when running under buck2 test, and absolute-path resolution for the wrapped command.
-o short alias — Closes the foot-gun where a stray -o ./logs was silently appended to the wrapped command (yielding /bin/sh: - : invalid option) because the trace subcommand passes unknown options through to the target binary.

📁 Configuration Changes

Removed Environment Variables

Variable	Notes
`CUTRACER_JSON_ENGINE`	The opt-in switch from the RapidJSON migration window. No longer required — RapidJSON is now the default and only per-record path; the `nlohmann` per-record code was removed.

New Environment Variables

Variable	Description	Default
`CUTRACER_DELAY_WARP_MASK`	CTA-local warp bitmask for `random_delay`. Accepts hex (`0xF`), octal (`0o17`), or decimal (`15`). 0 = disabled.	`0`
`CUTRACER_DELAY_WARPGROUP_ID`	Warpgroup index; `>= 0` selects warps `[4N..4N+3]` and wins over `CUTRACER_DELAY_WARP_MASK`. `-1` = disabled.	`-1`
`CUTRACER_DELAY_ENABLE_PROB`	Probability gate for delay injection that bypasses the default 50/50 PC gate.	(per-default)

🔄 Dependency Updates

NVBit: 1.7 → 1.8 (first-class TMA APIs: nvbit_parse_tma_transfer_info, TMATransferInfo_t, TMAAddress_t)
RapidJSON: now a vendored header dependency installed by install_third_party.sh (used as the default trace serializer)

🏗️ Infrastructure

Repository migration — facebookresearch/CUTracer → facebookexperimental/CUTracer. PR references in commit messages reflect the new org from #213 onward.
Wiki → docs/ — Mirrors what TritonParse did in D102010588: the GitHub Wiki content now lives in-repo under docs/ with an auto-sync workflow.
CONTRIBUTING.md — Reset to fbcode GitHub standard.
Monthly stale-PR cleanup workflow — Adds a scheduled job to prune stale PR branches.
Per-test/per-run subdir layout — tests/py_add reorganized with TEST_KEEP_OUTPUT support for debugging.

CI

H100 CI lanes (#219, #217) — Two new H100 workflows alongside the existing T4 test.yml: a meta-triton lane (sets SKIP_PROTON=1 since FB Triton lags upstream by ~100 days) and an upstream Triton lane. Both use the new self-hosted H100 runner.
Triton-Nightly matrix leg (#218) — Replaces install-triton.sh with a matrixed nightly install; nightly pinned to the latest upstream commit (workaround for in-flight breakage).
Inductor inline-compile (#223) — test_add.py forces inductor inline compile to dodge an upstream Triton subprocess-pool driver crash.
Retired legacy cu128 T4 workflow (#226) — test.yml removed in favor of the new H100 lanes.
fb-only unit test split (#222) — tests/unit/fb/ separation to fix OSS CI on tests with Meta-internal dependencies.
PyPI publish hardening — pypa/gh-action-pypi-publish pinned to a commit SHA in nightly-pypi.yml.

🐛 Notable Bug Fixes

NDJSON heap double-free (#225) — Concurrent std::string::append on the shared json_buffer_ triggered libstdc++ _M_mutate double-free under ASan; surfaced in OSS CI as random "unexpected character" JSON decode errors at ~7% per kernel-run on aarch64. Fixed by serializing write_trace with std::mutex.
UTC*MMA A=tmem high-half — Per-operand find("gdesc") loop replaces a positional guard that was dropping URx+1 (B gdesc high half).
Type-checking target — Fixed broken fbcode//triton/tools/CUTracer/python:cutracer-type-checking (unmanaged regression from D103464).
uregs_indices schema bound — Raised from 62 to 255 so new Blackwell architectures don't break validate.

⚠️ Breaking Changes

CUTRACER_JSON_ENGINE env var removed — The opt-in switch from the migration window is gone now that RapidJSON is the default and the nlohmann per-record path has been deleted. Output is byte-identical for reg/mem/opcode/mem_value records and semantically identical for tma_trace. No user action required unless you were setting CUTRACER_JSON_ENGINE=ab for serializer parity testing.
GitHub organization moved — Repository now lives at facebookexperimental/CUTracer. Update any pinned remotes / submodule URLs / CI references.
Legacy T4 CI workflow retired — Anything that referenced the test.yml workflow name in T4-targeted automation should switch to the new H100 workflows.

Everything else is additive (new flags, new schemas, new analyzers, new commands).

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework — and in particular for the NVBit 1.8 TMA APIs that let this release retire CUTracer's non-public-ISA TMA workarounds.

Contributors to this release: Chen Li, Lei Wang, Xu Zhao, Yueming Hao (alphabetical).

📄 License

MIT License — Meta Platforms, Inc. contributions
BSD-3-Clause License — NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.

📚 Documentation

Full documentation is now in-repo under docs/ (auto-synced to the Wiki).

🔗 Links

Repository: https://github.com/facebookexperimental/CUTracer
PyPI: https://pypi.org/project/cutracer/
Issues: https://github.com/facebookexperimental/CUTracer/issues

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUTracer v0.3.0 Release Notes

Choose a tag to compare

Sorry, something went wrong.