Minor Release: 49 commits since v0.2.1
CUTracer v0.3.0 (2026-04-23 to 2026-06-09) is a substantial minor release that focuses on three big themes: (1) a new high-performance RapidJSON trace serializer that becomes the default and removes ~16% of host-CPU overhead, (2) Blackwell-class race detection, with a new TMEM-lifetime detector for the `analyze data-race` command plus warpgroup-targeted random-delay injection that can finally reproduce SM100 flash-attention dealloc races, and (3) a reusable cuda-gdb hang-analysis backend that samples live CUDA hangs and feeds structured evidence into the AI reasoner pipeline. The release also completes the NVBit 1.8 TMA migration, ships a unified AI `all` command behind a clean `Reasoner` protocol, expands schemas to cover every record type the writer emits, and migrates the repository from `facebookresearch` to `facebookexperimental` with the wiki now living under `docs/`.
Highlights
- RapidJSON trace serializer (default): Removes ~16% host CPU and ~8% malloc churn from the per-record write path; byte-identical output to the legacy `nlohmann` path for reg/mem/opcode/mem_value records.
- TMEM Lifetime Detector D3: New detector in `analyze data-race` that finds Blackwell SM100 TMEM dealloc-vs-LDTM races, with prototype-matching 256-vs-0 verdicts on the case-15 baseline/fixed pair.
- Warpgroup-targeted random delay: `--delay-warpgroup-ids` / `--delay-warp-mask` / `--delay-enable-prob` flags let `random_delay` stall an entire warp or warpgroup as a single scheduling unit, enabling reproduction of cross-warpgroup races that per-thread delays could not.
- cuda-gdb live hang analyzer: Reusable `cutracer.debugger` Python backend samples PC offsets, SASS context, register operands, HGMMA/TMA memory evidence, and effective-blocker inference from a live hung CUDA process.
- NVBit 1.8 TMA pipeline: 3-diff migration to first-class `nvbit_parse_tma_transfer_info()`; trace now carries structured tensor metadata (dim, dtype, strides, swizzle, SMEM dst/src/mbar addresses) instead of opaque parameter handles.
- Unified AI `all` command: All `--ai` CLI paths re-routed through a new `Reasoner` Protocol; new cross-domain `all` command runs deadlock + data-race reasoning in one invocation.
- Repository migrated to `facebookexperimental`: Wiki content moved into `docs/` with auto-sync workflow; CI gains H100 lanes (meta-triton + upstream Triton nightly) and retires the legacy T4 GitHub-hosted workflow.
Capture-Side Serializer Overhaul (RapidJSON)
A capture CPU profile attributed ~16% of host CPU to per-record nlohmann::json DOM serialization plus ~8% malloc churn from per-record std::stringstream hex formatting. This release ports the entire hot path to a streaming rapidjson::Writer and makes it the default.
- Opt-in RapidJSON path (D107298645): Adds streaming `Writer` into a reused `thread_local StringBuffer` for `reg_trace` / `mem_addr_trace` / `mem_value_trace` / `opcode_only`; gated behind `CUTRACER_JSON_ENGINE={nlohmann,rapidjson,ab}` with an `ab` oracle that semantically compares both engines per record and reports per-type match/mismatch counters at writer teardown.
- Unit-test the RapidJSON serializer: Extracted to its own translation unit with cxx goldens, exercising all four record types.
- Port `tma_trace` (D107320461): Removes the last `nlohmann` fallback in the per-record path; semantically identical on 2,304 TMA descriptors.
- RapidJSON becomes the default (D107323672): Removes the `CUTRACER_JSON_ENGINE` switch, the A/B comparison machinery, `build_nlohmann_line`, and the entire per-record `nlohmann` serializer family (~390 lines net). `nlohmann` is retained only for the one-time `kernel_metadata` header.
- Heap-corruption fix in TraceWriter (#225): Serialize public mutators with `std::mutex` to prevent libstdc++ `_M_mutate` heap double-free that surfaced as "JSON decode error - unexpected character" at ~7% per kernel-run on aarch64. Replaces a misleading "CRITICAL FIX" `std::move` dance that did not actually fix the race.
- OSS install fix (#224): `install_third_party.sh` now downloads `rapidjson` headers so the OSS CI build succeeds.
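
The `ab` oracle from the migration window compared the two engines semantically rather than byte-for-byte. The Python sketch below illustrates that idea only; the two serializer functions, the record layout, and the counter keys are hypothetical stand-ins and not CUTracer's C++ implementation.

```python
import json
from collections import Counter


def serialize_dom(record: dict) -> str:
    """Stand-in for a DOM-style engine: build the whole object, then dump it."""
    return json.dumps(record, sort_keys=True)


def serialize_streaming(record: dict) -> str:
    """Stand-in for a streaming engine: emit key/value pairs one at a time."""
    parts = [f"{json.dumps(k)}:{json.dumps(v)}" for k, v in record.items()]
    return "{" + ",".join(parts) + "}"


def ab_compare(records):
    """Compare both engines per record and count match/mismatch per record type."""
    counters = Counter()
    for rec in records:
        # Semantic comparison: parse both lines back and compare the objects,
        # so key order and whitespace differences do not count as mismatches.
        same = json.loads(serialize_dom(rec)) == json.loads(serialize_streaming(rec))
        counters[(rec["type"], "match" if same else "mismatch")] += 1
    return counters


# Hypothetical records; real trace records carry more fields.
records = [
    {"type": "reg_trace", "pc": 16, "regs": [1, 2, 3]},
    {"type": "opcode_only", "pc": 32},
]
counters = ab_compare(records)
```

The two stand-in engines produce different bytes (key order and separators differ), yet every record lands in a `match` bucket, which is the property a semantic oracle is meant to check.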
Blackwell Race Detection
TMEM Lifetime Detector D3
Final diff of the case-15 TMEM dealloc-race stack lands a third detector in the `analyze data-race` registry alongside `DataRaceRAWDetector`. The new `cutracer/analyze/fb/data_race/tmem_lifetime/detector.py` module reconciles cross-warp ARRIVE sets per `TCGEN05.DEALLOC` and emits one `Finding(severity=ERROR, category=TMEM_LIFETIME)` for every (CTA, warp) that LDTM-reads TMEM but never ARRIVEs on the guarding mbarrier before deallocation. Three-phase pipeline (static guard scan → per-CTA event classification → per-DEALLOC window reconciliation) matches the prototype's 256-vs-0 PASS on the case-15 baseline / D105385008-fixed trace pair. Hopper traces yield no DEALLOCs and short-circuit cleanly via the existing `decoder.is_tmem_dealloc` / `is_tmem_read` / `is_mbarrier_arrive` dispatch.
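
The per-DEALLOC window reconciliation can be sketched as follows. This is a simplified illustration under stated assumptions: events arrive already classified as `LDTM`, `ARRIVE`, or `DEALLOC`, the event dictionaries and the `reconcile` helper are hypothetical, and the static guard scan that ties ARRIVEs to the guarding mbarrier is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str
    category: str
    cta: int
    warp: int


def reconcile(events):
    """Emit one finding per (CTA, warp) that read TMEM without arriving before DEALLOC."""
    readers = defaultdict(set)   # cta -> warps that LDTM-read in the open window
    arrivers = defaultdict(set)  # cta -> warps that ARRIVEd in the open window
    findings = []
    for ev in events:  # events are assumed to be in trace order
        cta, warp, kind = ev["cta"], ev["warp"], ev["kind"]
        if kind == "LDTM":
            readers[cta].add(warp)
        elif kind == "ARRIVE":
            arrivers[cta].add(warp)
        elif kind == "DEALLOC":
            for w in sorted(readers[cta] - arrivers[cta]):
                findings.append(Finding("ERROR", "TMEM_LIFETIME", cta, w))
            # The DEALLOC closes the window for this CTA.
            readers[cta].clear()
            arrivers[cta].clear()
    return findings


baseline = [
    {"cta": 0, "warp": 0, "kind": "LDTM"},
    {"cta": 0, "warp": 4, "kind": "DEALLOC"},
]
fixed = [
    {"cta": 0, "warp": 0, "kind": "LDTM"},
    {"cta": 0, "warp": 0, "kind": "ARRIVE"},
    {"cta": 0, "warp": 4, "kind": "DEALLOC"},
]
baseline_findings = reconcile(baseline)
fixed_findings = reconcile(fixed)
```

In this toy pair the baseline trace yields one finding and the fixed trace yields none, mirroring the shape (not the counts) of the 256-vs-0 verdict described above.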
Warpgroup-Targeted Random Delay (5-diff stack)
Per-thread random delay washes out at the warp boundary (warps are lock-step, so the effective stall is `max(thread_delays)`), and existing CTA-local modes operate on clusters, not warp ranges within a single CTA. This stack teaches `random_delay` to stall an entire warp or warpgroup as a single scheduling unit, which is the exact timing distribution needed to reproduce SM100 flash-attention TMEM dealloc races.
- Device function + host wrapper: New `instrument_delay_warpgroup(pred, delay_ns, warp_mask)` in `inject_funcs.cu` and `instrument_warpgroup_delay_injection` host wrapper. Computes CTA-local warp id from `threadIdx`/`blockDim`, tests `warp_mask` bit, and `__nanosleep`s the entire warp uniformly via the same 1ms chunked loop as `instrument_delay_random_cluster`.
- CLI flags + env vars: `CUTRACER_DELAY_WARP_MASK` (hex/oct/dec via new `get_var_uint32_auto` parser; existing `get_var_uint32` silently dropped `0x` prefixes via `atoll`) and `CUTRACER_DELAY_WARPGROUP_ID` (integer warpgroup index, resolves to `0xF << (id * 4)` on the host).
- Dispatcher routing + `--delay-enable-prob`: Routes warp targeting through the existing delay dispatcher in `cutracer.cu` and adds a new probability gate that bypasses the 50/50 PC gate for tighter control over injection density.
- Persist warp targeting in delay config JSON: Replay mode (`--delay-load-path`) now round-trips warp/warpgroup targeting for deterministic reproduction.
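
The device-side targeting logic described in this stack can be modeled on the host in a few lines. The sketch below is a Python model, not the CUDA code in `inject_funcs.cu`; the helper names are hypothetical, and the thread linearization (x fastest, then y, then z) is the usual CUDA convention, assumed here rather than taken from the source.

```python
WARP_SIZE = 32


def cta_local_warp_id(thread_idx, block_dim):
    """Linearize threadIdx within the CTA and divide by the warp size."""
    tx, ty, tz = thread_idx
    bx, by, _bz = block_dim
    linear = tx + ty * bx + tz * bx * by
    return linear // WARP_SIZE


def warp_is_targeted(thread_idx, block_dim, warp_mask):
    """True when the thread's CTA-local warp has its bit set in warp_mask."""
    wid = cta_local_warp_id(thread_idx, block_dim)
    # A 32-bit mask can only address CTA-local warps 0..31.
    return wid < 32 and bool((warp_mask >> wid) & 1)


def effective_warp_stall(thread_delays_ns):
    """Lock-step warps wait for the slowest lane, so per-thread delays collapse to max()."""
    return max(thread_delays_ns)


block_dim = (256, 1, 1)
in_wg1 = warp_is_targeted((128, 0, 0), block_dim, 0xF0)  # thread 128 is in warp 4
in_wg0 = warp_is_targeted((0, 0, 0), block_dim, 0xF0)    # thread 0 is in warp 0
stall = effective_warp_stall([10, 500, 40])
```

With mask `0xF0`, warps 4 through 7 are delayed as a unit while warps 0 through 3 run undisturbed, which is the cross-warpgroup skew that per-thread delays cannot produce.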
Periodic Hang Snapshots
Adds periodic-snapshot sampling so the analyzer can correlate evolving warp state across multiple sample windows rather than relying on a single point-in-time capture.
Blackwell Test Coverage
- Blackwell FP8 GEMM E2E test: Permanent E2E test exercising the upstream Blackwell TLX UTCQMMA path with real FP8 data.
- Block-scaled UTC*MMA fixture: New Blackwell `mxfp8`/`mxfp4` GEMM fixture for block-scaled tensor-core coverage.
- Local-only Buck test for Blackwell FA-WS data race reproducer: Captures the case-15 reproducer as a permanent regression guard (local-only because of GPU requirements).
- UTC*MMA A=tmem fix: Replaces the positional guard with a per-operand `find("gdesc")` loop so `URx+1` is correctly pushed as the high half of the B `gdesc`; header rewritten to cite Blackwell ISA slot semantics.
Live Hang Analysis (cuda-gdb backend)
A new reusable cutracer.debugger Python package wraps cuda-gdb to extract structured evidence from a live hung CUDA process. Designed to feed the AI reasoner pipeline.
- Reusable backend (#212): Pure Python backend with no `gdb` runtime dependency in the base layer; provides the foundation for the follow-up cuda-gdb command.
- Preserve cuda-gdb warp slot identity: Parses cuda-gdb `Wp` separately from logical warp id; CUTracer parity needs logical warp id (`first_active_threadIdx.x // 32`) while cuda-gdb focus commands need the physical `Wp` slot. Both are now serialized in debugger opcode records.
- PC offsets and SASS context: Records runtime PC, kernel-relative offset, and a small disassembly window for each sampled warp.
- SASS register operands: Reads scalar and uniform registers named by the sampled SASS instruction and serializes captured values; failures are recorded as explicit register-read errors so reports never invent values.
- Effective PC evidence: Distinguishes the instruction cuda-gdb stopped on from the effective blocking instruction reported to CUTracer analysis (post-barrier rule).
- Effective-blocker inference: Small cuda-gdb disassembly parser implements the immediate-previous-instruction rule for recovering the effective blocking PC when cuda-gdb stops on a known unsafe post-barrier instruction.
- HGMMA and TMA memory evidence: Captures Hopper GMMA and TMA-related memory state from the live process for the reasoner.
- Dynamic cuda-gdb AI evidence reporting: Evidence stream surfaced directly in `--ai` reports.
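
Two of the rules above are small enough to sketch directly: the logical warp id derivation and the immediate-previous-instruction rule. The code below is an illustration only; `SassLine`, `effective_blocker`, and the predicate that marks a stop as an unsafe post-barrier instruction are hypothetical, and the actual set of unsafe instructions used by `cutracer.debugger` is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SassLine:
    pc_offset: int  # kernel-relative offset
    text: str       # disassembled instruction text


def logical_warp_id(first_active_thread_idx_x: int) -> int:
    """Logical warp id as described above: first active threadIdx.x // 32."""
    return first_active_thread_idx_x // 32


def effective_blocker(
    window: List[SassLine],
    stop_offset: int,
    is_unsafe_post_barrier: Callable[[SassLine], bool],
) -> SassLine:
    """Return the instruction to report as the blocker for a sampled warp."""
    for i, line in enumerate(window):
        if line.pc_offset == stop_offset:
            # Immediate-previous-instruction rule: step back exactly one line
            # when the stop landed on a flagged post-barrier instruction.
            if i > 0 and is_unsafe_post_barrier(line):
                return window[i - 1]
            return line
    raise ValueError(f"stop offset {stop_offset:#x} not in disassembly window")


# Hypothetical disassembly window; the predicate is supplied by the caller.
window = [
    SassLine(0x100, "BAR.SYNC 0x0"),
    SassLine(0x110, "IADD3 R2, R2, 0x1, RZ"),
]
blocker = effective_blocker(window, 0x110, lambda line: line.pc_offset == 0x110)
```

Keeping both the stopped instruction and the inferred blocker in the record lets a report show what the debugger observed alongside what the analysis concluded.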
AI Reasoner Refactor
- Unified `all` command + Reasoner protocol (D103870397): All `--ai` paths re-routed through the new `Reasoner` Protocol. New cross-domain `cutracer analyze all --ai` runs deadlock + data-race reasoning together. Centralizes LLM-client construction and report composition.
- `AIDeadlockAnalyzer` retired: No longer imported from production code; per-reasoner tests (`test_deadlock_reasoner.py`, `test_data_race_reasoner.py`, `test_unified_reasoner.py`) cover the `--ai` paths end-to-end via fake LLM clients.
- Dedup: Removed redundant base `deadlock_command` and dead `tritonparse` fallback shim.
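
A protocol-based design of this kind can be sketched with `typing.Protocol`. Everything in the block below is hypothetical (the method names, the `FakeLLM` client, and the report layout); it only shows how a shared protocol lets one command fan out to several reasoners and how a fake client makes that testable.

```python
from typing import List, Protocol


class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...


class Reasoner(Protocol):
    name: str

    def reason(self, evidence: str, llm: LLMClient) -> str: ...


class FakeLLM:
    """Deterministic client for tests: echoes the prompt instead of calling a model."""

    def complete(self, prompt: str) -> str:
        return f"[fake] {prompt}"


class DeadlockReasoner:
    name = "deadlock"

    def reason(self, evidence: str, llm: LLMClient) -> str:
        return llm.complete(f"deadlock analysis of: {evidence}")


class DataRaceReasoner:
    name = "data-race"

    def reason(self, evidence: str, llm: LLMClient) -> str:
        return llm.complete(f"data-race analysis of: {evidence}")


def run_all(evidence: str, reasoners: List[Reasoner], llm: LLMClient) -> str:
    """Run every reasoner against the same evidence and compose one report."""
    sections = [f"{r.name}:\n{r.reason(evidence, llm)}" for r in reasoners]
    return "\n\n".join(sections)


report = run_all("trace.ndjson", [DeadlockReasoner(), DataRaceReasoner()], FakeLLM())
```

Because the reasoners only depend on the `LLMClient` shape, client construction stays in one place and the same path runs in tests and in production.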
TMA Pipeline (NVBit 1.8 Migration)
Three-diff stack migrates TMA tracing off non-public ISA semantics (URa/URa+1 layout, MULTICAST register-shift workarounds) and onto NVBit 1.8's first-class TMA APIs.
- GPU-side capture: Asks NVBit for the TMA parameter handle directly via the new public API.
- Host-side parsing: Calls `nvbit_parse_tma_transfer_info()` from the receive thread to decode the runtime handle into a structured `TMATransferInfo_t`; trace now carries fully-structured tensor metadata (dim, mode, coords, dtype, rank, global address/dim/strides, box dim, swizzle).
- Python decoder consumption: `analyze tma` consumes the new structured `tma_transfer_info` field.
- SMEM dst/src addresses: Serializes `dst.data_address`, `dst.mbar_address` (and `src.*` for `UTMASTG`) from NVBit 1.8's `TMAAddress_t` union. These are the official NVBit API equivalent of the URb/URb+1 register values the data race detector needs to build barrier-to-data mappings, so capture no longer relies on implicit register snapshotting in `reg_trace`.
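
A barrier-to-data mapping built from these addresses might look like the sketch below. The JSON layout is an assumption: the field names `tma_transfer_info`, `dst`, `data_address`, and `mbar_address` come from the notes above, but their nesting in the emitted trace record is guessed for illustration, and `barrier_to_data` is a hypothetical helper.

```python
from collections import defaultdict


def barrier_to_data(records):
    """Group SMEM destination addresses by the mbarrier that guards them."""
    mapping = defaultdict(list)
    for rec in records:
        info = rec.get("tma_transfer_info")
        if not info:
            continue  # not a TMA record, or no structured info captured
        dst = info.get("dst")
        if not dst or "mbar_address" not in dst:
            continue  # e.g. a store-style transfer with no guarding barrier
        mapping[dst["mbar_address"]].append(dst["data_address"])
    return dict(mapping)


# Hypothetical records with made-up addresses.
records = [
    {"type": "tma_trace", "tma_transfer_info": {"dst": {"data_address": 0x1000, "mbar_address": 0x40}}},
    {"type": "tma_trace", "tma_transfer_info": {"dst": {"data_address": 0x2000, "mbar_address": 0x40}}},
    {"type": "tma_trace", "tma_transfer_info": {"dst": {"data_address": 0x3000, "mbar_address": 0x48}}},
    {"type": "reg_trace"},
]
mapping = barrier_to_data(records)
```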
Schema & Validation
- Trace schemas for every record type + drift guard (D106872962): Adds JSON Schemas for `tma_trace`, `mem_value_trace`, and `mem_addr_trace`. Previously `SCHEMAS_BY_TYPE` only covered `reg_trace` / `mem_trace` / `opcode_only` / `kernel_metadata`, so `validate` / `compare` silently skipped every current trace's data records. New CI drift guard catches writer/schema divergence.
- SM100 uniform-register indices: `reg_trace` schema's `uregs_indices` maximum raised from 62 to 255 (matches `regs_indices`) to avoid chasing per-arch maximums; both are `uint8_t` register indices where high values encode special registers.
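
A drift guard of this kind reduces to a set comparison between the record types the writer emits and the types the schema registry covers. The sketch below is a minimal illustration; `schema_drift` and `indices_valid` are hypothetical helpers, and the writer type list is assembled from the record names mentioned in these notes rather than read from the actual writer.

```python
def schema_drift(writer_types, schemas_by_type):
    """Return (types missing a schema, schemas with no matching writer type)."""
    missing = sorted(set(writer_types) - set(schemas_by_type))
    unused = sorted(set(schemas_by_type) - set(writer_types))
    return missing, unused


def indices_valid(indices, maximum=255):
    """Bound check in the spirit of the raised uregs_indices maximum."""
    return all(0 <= i <= maximum for i in indices)


# Assumed writer output, based on the record types named in these notes.
writer_types = [
    "reg_trace", "mem_addr_trace", "mem_value_trace",
    "opcode_only", "tma_trace", "kernel_metadata",
]
# Coverage before this release, as described above (schema bodies elided).
old_schemas = {"reg_trace": {}, "mem_trace": {}, "opcode_only": {}, "kernel_metadata": {}}

missing, unused = schema_drift(writer_types, old_schemas)
old_bound_ok = indices_valid([63, 200], maximum=62)
new_bound_ok = indices_valid([63, 200], maximum=255)
```

Failing CI whenever `missing` is non-empty is one way to keep a validator from silently skipping records it has no schema for.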
CLI Changes

    # New short alias for --output-dir (matches gcc/ffmpeg/curl/tar convention)
    cutracer trace -o ./logs -- python my_kernel.py

    # Warpgroup-targeted delay
    cutracer trace -i random_delay --delay-ns 100000 \
      --delay-warpgroup-ids 1 \
      --delay-enable-prob 1.0 \
      -- python my_kernel.py

    cutracer trace -i random_delay --delay-ns 100000 \
      --delay-warp-mask 0xF0 \
      -- python my_kernel.py

    # Unified AI cross-domain analysis (deadlock + data-race in one shot)
    cutracer analyze all trace.ndjson --ai -o report.md

- Friendlier `trace` errors: Three UX fixes for the intentionally permissive `trace` subcommand (`ignore_unknown_options=True`, `nargs=-1`, `type=click.UNPROCESSED`): clearer messages when wrapped command tokens leak into CUTracer's own option parser, a warning when running under `buck2 test`, and absolute-path resolution for the wrapped command.
- `-o` short alias: Closes the foot-gun where a stray `-o ./logs` was silently appended to the wrapped command (yielding `/bin/sh: - : invalid option`) because the trace subcommand passes unknown options through to the target binary.
Configuration Changes
Removed Environment Variables
| Variable | Notes |
|---|---|
| `CUTRACER_JSON_ENGINE` | The opt-in switch from the RapidJSON migration window. No longer required: RapidJSON is now the default and only per-record path; the `nlohmann` per-record code was removed. |
New Environment Variables
| Variable | Description | Default |
|---|---|---|
| `CUTRACER_DELAY_WARP_MASK` | CTA-local warp bitmask for `random_delay`. Accepts hex (`0xF`), octal (`0o17`), or decimal (`15`). `0` = disabled. | `0` |
| `CUTRACER_DELAY_WARPGROUP_ID` | Warpgroup index N; `>= 0` selects warps [4N..4N+3] and wins over `CUTRACER_DELAY_WARP_MASK`. `-1` = disabled. | `-1` |
| `CUTRACER_DELAY_ENABLE_PROB` | Probability gate for delay injection that bypasses the default 50/50 PC gate. | (per-default) |
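
The precedence between the two targeting variables can be modeled as below. This is a Python sketch of the documented behavior, not the C++ host code: `parse_uint32_auto` is a hypothetical analogue of `get_var_uint32_auto`, and it relies on Python's `int(text, 0)` for prefix detection, which accepts `0x`, `0o`, and plain decimal.

```python
def parse_uint32_auto(text: str) -> int:
    """Parse hex (0xF), octal (0o17), or decimal (15) into a 32-bit unsigned value."""
    value = int(text.strip(), 0)  # base 0 lets the prefix choose the radix
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError(f"value out of uint32 range: {text!r}")
    return value


def resolve_warp_mask(env: dict) -> int:
    """Warpgroup id >= 0 wins over the raw mask; a result of 0 means disabled."""
    warpgroup_id = int(env.get("CUTRACER_DELAY_WARPGROUP_ID", "-1"))
    if warpgroup_id >= 0:
        # Warpgroup N covers warps [4N..4N+3]; keep the result within 32 bits.
        return (0xF << (warpgroup_id * 4)) & 0xFFFFFFFF
    return parse_uint32_auto(env.get("CUTRACER_DELAY_WARP_MASK", "0"))


both_set = resolve_warp_mask(
    {"CUTRACER_DELAY_WARPGROUP_ID": "1", "CUTRACER_DELAY_WARP_MASK": "0x1"}
)
mask_only = resolve_warp_mask({"CUTRACER_DELAY_WARP_MASK": "0xF0"})
disabled = resolve_warp_mask({})
```

Passing a plain dictionary instead of reading `os.environ` directly keeps the resolution logic easy to test.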
Dependency Updates
- NVBit: 1.7 → 1.8 (first-class TMA APIs: `nvbit_parse_tma_transfer_info`, `TMATransferInfo_t`, `TMAAddress_t`)
- RapidJSON: now a vendored header dependency installed by `install_third_party.sh` (used as the default trace serializer)
Infrastructure
- Repository migration: `facebookresearch/CUTracer` → `facebookexperimental/CUTracer`. PR references in commit messages reflect the new org from #213 onward.
- Wiki → `docs/`: Mirrors what TritonParse did in D102010588; the GitHub Wiki content now lives in-repo under `docs/` with an auto-sync workflow.
- CONTRIBUTING.md: Reset to fbcode GitHub standard.
- Monthly stale-PR cleanup workflow: Adds a scheduled job to prune stale PR branches.
- Per-test/per-run subdir layout: `tests/py_add` reorganized with `TEST_KEEP_OUTPUT` support for debugging.
CI
- H100 CI lanes (#219, #217): Two new H100 workflows alongside the existing T4 `test.yml`: a `meta-triton` lane (sets `SKIP_PROTON=1` since FB Triton lags upstream by ~100 days) and an upstream Triton lane. Both use the new self-hosted H100 runner.
- Triton-Nightly matrix leg (#218): Replaces `install-triton.sh` with a matrixed nightly install; nightly pinned to the latest upstream commit (workaround for in-flight breakage).
- Inductor inline-compile (#223): `test_add.py` forces inductor inline compile to dodge an upstream Triton subprocess-pool driver crash.
- Retired legacy cu128 T4 workflow (#226): `test.yml` removed in favor of the new H100 lanes.
- fb-only unit test split (#222): `tests/unit/fb/` separation to fix OSS CI on tests with Meta-internal dependencies.
- PyPI publish hardening: `pypa/gh-action-pypi-publish` pinned to a commit SHA in `nightly-pypi.yml`.
Notable Bug Fixes
- NDJSON heap double-free (#225): Concurrent `std::string::append` on the shared `json_buffer_` triggered libstdc++ `_M_mutate` double-free under ASan; surfaced in OSS CI as random "unexpected character" JSON decode errors at ~7% per kernel-run on aarch64. Fixed by serializing `write_trace` with `std::mutex`.
- UTC*MMA A=tmem high-half: Per-operand `find("gdesc")` loop replaces a positional guard that was dropping `URx+1` (B `gdesc` high half).
- Type-checking target: Fixed broken `fbcode//triton/tools/CUTracer/python:cutracer-type-checking` (unmanaged regression from D103464).
- `uregs_indices` schema bound: Raised from 62 to 255 so new Blackwell architectures don't break `validate`.
Breaking Changes
- `CUTRACER_JSON_ENGINE` env var removed: The opt-in switch from the migration window is gone now that RapidJSON is the default and the `nlohmann` per-record path has been deleted. Output is byte-identical for reg/mem/opcode/mem_value records and semantically identical for `tma_trace`. No user action required unless you were setting `CUTRACER_JSON_ENGINE=ab` for serializer parity testing.
- GitHub organization moved: Repository now lives at `facebookexperimental/CUTracer`. Update any pinned remotes / submodule URLs / CI references.
- Legacy T4 CI workflow retired: Anything that referenced the `test.yml` workflow name in T4-targeted automation should switch to the new H100 workflows.
Everything else is additive (new flags, new schemas, new analyzers, new commands).
Acknowledgments
CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework, and in particular for the NVBit 1.8 TMA APIs that let this release retire CUTracer's non-public-ISA TMA workarounds.
Contributors to this release: Chen Li, Lei Wang, Xu Zhao, Yueming Hao (alphabetical).
License
- MIT License: Meta Platforms, Inc. contributions
- BSD-3-Clause License: NVIDIA NVBit components
See LICENSE and LICENSE-BSD for details.
Documentation
Full documentation is now in-repo under docs/ (auto-synced to the Wiki).