CUTracer v0.2.1 Release Notes
Patch Release: 24 commits since v0.2.0
This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.
Highlights
- PyTorch CapturedTraceback Integration: Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
- Kernel Events Recording: New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
- Cluster-Level Delay Mode: New `--delay-mode cluster` that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
- Instruction Table Embedding: Per-instruction static table with SASS binary encoding embedded in kernel_metadata, eliminating redundant per-record SASS strings from JSON traces
- Custom Delay Patterns: New `--delay-patterns` flag for injecting delays at arbitrary SASS instruction types, including a `"*"` wildcard mode
- NVBit 1.8 TMA Fix: Critical fix for TMA operand extraction regression introduced by NVBit 1.8's new `TMA_PARAM_HANDLE` operand type
PyTorch CapturedTraceback Callstack Capture
A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.
- Backtrace refactor (#202): Extract `capture_cpu_callstack()` into `capture_cpu_callstack_backtrace()` to prepare for alternative backends
- CapturedTraceback module (#203): New `python_callstack.cpp` (~360 lines) that dynamically resolves Python C API functions via `dlsym(dlopen(NULL))`, with zero compile-time Python/PyTorch dependencies. Calls `CapturedTraceback.extract().summary()` when the current thread holds the GIL
- Dynamic mode selection (#204): New `CpuCallstackMode` enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. `AUTO` (default) tries PyTorch first, falls back to `backtrace()`. New `cpu_callstack_source` field in kernel_metadata JSON output
- Auto-GIL acquisition: New `auto_gil` mode that re-acquires the GIL via `PyGILState_Ensure()` for Triton kernels where `__triton_launcher` releases the GIL before `cuLaunchKernelEx`. Safe because the Python frame chain is frozen at the Triton `launch()` call site
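The mode selection described above can be pictured with a small Python model. This is only a sketch: the real logic lives in CUTracer's C++ code, and the function names, the `pytorch_capture_ok` flag, the returned source strings, and the behavior of `PYTORCH` mode when capture fails are all assumptions made for illustration. `auto_gil` is left out because the notes do not say how it maps onto the enum.

```python
from enum import Enum


class CpuCallstackMode(Enum):
    """Python mirror of the four enum values named in the release notes."""
    AUTO = "auto"
    PYTORCH = "pytorch"
    BACKTRACE = "backtrace"
    DISABLED = "disabled"


def parse_cpu_callstack_mode(value: str) -> CpuCallstackMode:
    """Map a --cpu-callstack style value to a mode (hypothetical helper).

    The legacy "1" is treated as AUTO and "0" as DISABLED, matching the
    behavior change described in these notes.
    """
    aliases = {"1": "auto", "0": "disabled"}
    key = value.strip().lower()
    return CpuCallstackMode(aliases.get(key, key))


def resolve_source(mode: CpuCallstackMode, pytorch_capture_ok: bool):
    """Return which backend would supply the callstack, or None.

    pytorch_capture_ok is a hypothetical flag meaning "CapturedTraceback
    produced frames" (for example, the thread held the GIL).
    """
    if mode is CpuCallstackMode.DISABLED:
        return None
    if mode is CpuCallstackMode.BACKTRACE:
        return "backtrace"
    if mode is CpuCallstackMode.PYTORCH:
        # Assumption: no fallback when PyTorch capture is forced.
        return "pytorch" if pytorch_capture_ok else None
    # AUTO: try PyTorch first, fall back to backtrace().
    return "pytorch" if pytorch_capture_ok else "backtrace"
```

Under this model, `resolve_source(parse_cpu_callstack_mode("1"), False)` yields `"backtrace"`, which is the fallback path the notes describe for `AUTO`.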
# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py
# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py
Kernel Events Recording
New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. Key innovation: callstack deduplication using FNV-1a hashing. Each unique Python callstack is emitted once as a callstack_def record, with subsequent launches referencing it by callstack_id.
- Three modes: `dedup` (recommended), `full` (inline callstack per launch), `nostack` (metadata only)
- Zero overhead when disabled (default)
- Query integration: The `query` command now handles kernel events files seamlessly. `callstack_def` records are cached and resolved, and `kernel_launch` records get a `caller` field injected with the innermost call site frame
- Recommended query pattern: `--group-by kernel_checksum --count` for launch frequency analysis
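To make the dedup idea concrete, here is a minimal Python sketch of FNV-1a based callstack deduplication. The 64-bit FNV-1a constants are the standard ones, but everything else is an assumption for illustration: the `DedupWriter` class, the `type`, `frames`, and `kernel` field names, the hash width, and the choice to hash newline-joined frames are not taken from CUTracer's actual writer. Only the `callstack_def` and `kernel_launch` record names and the `callstack_id` reference come from the notes.

```python
import json

FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


class DedupWriter:
    """Emit each unique callstack once, then reference it by id."""

    def __init__(self):
        self.seen = set()   # callstack ids already written
        self.lines = []     # NDJSON lines, in emission order

    def record_launch(self, kernel: str, frames: list) -> None:
        # Hypothetical id scheme: hash of the newline-joined frames.
        cid = fnv1a_64("\n".join(frames).encode("utf-8"))
        if cid not in self.seen:
            self.seen.add(cid)
            self.lines.append(json.dumps(
                {"type": "callstack_def", "callstack_id": cid, "frames": frames}))
        self.lines.append(json.dumps(
            {"type": "kernel_launch", "kernel": kernel, "callstack_id": cid}))
```

Two launches from the same call site produce one `callstack_def` line followed by two short `kernel_launch` lines, which is where the file-size saving comes from.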
# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py
# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
Cluster-Level Random Delay Mode
New --delay-mode cluster (#207) that delays only one randomly-selected CTA within each cluster while other CTAs proceed at normal speed (~430 lines). This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.
- Uses `cluster_ctaid`/`cluster_nctaid` PTX registers for CTA selection within each cluster
- `cluster_seed` stored in delay config JSON for deterministic replay
- Host-side cluster dimension detection via `CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION` (including CUDA graph captures), with fallback to `cuFuncGetAttribute`
- One-time-per-kernel diagnostic log of runtime cluster dimensions
- Automatic warning when used against non-cluster-launched kernels (no-op detection)
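The "one delayed CTA per cluster, replayable from a seed" property can be illustrated with a host-side Python simulation. This is not how CUTracer selects the CTA on the device; the notes only say that `cluster_ctaid`/`cluster_nctaid` and a stored `cluster_seed` are involved. The function names, the way the seed is mixed with a cluster index, and the treatment of single-CTA clusters as a no-op are assumptions.

```python
import random


def pick_delayed_rank(cluster_seed: int, cluster_index: int, cluster_size: int) -> int:
    """Deterministically choose which CTA rank in a cluster gets delayed.

    Hypothetical mixing: a per-cluster RNG seeded from the stored seed and
    the cluster index, so that a replay with the same seed picks the same CTA.
    """
    rng = random.Random(cluster_seed * 1_000_003 + cluster_index)
    return rng.randrange(cluster_size)


def should_delay(cluster_seed: int, cluster_index: int,
                 cta_rank: int, cluster_size: int) -> bool:
    """True only for the single selected CTA rank within its cluster."""
    if cluster_size <= 1:
        # Assumption: a kernel launched without clusters has nothing to skew.
        return False
    return cta_rank == pick_delayed_rank(cluster_seed, cluster_index, cluster_size)
```

For any cluster of size 4, exactly one of the four ranks returns `True`, and rerunning with the same seed selects the same rank, which is the property that makes a failing run reproducible.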
# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
--delay-mode cluster --delay-ns 10000 \
--kernel-filters my_cluster_kernel -- python test.py
Custom Delay Patterns
New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.
- Comma-separated patterns: `--delay-patterns "UTMALDG,UTMASTG"`
- Wildcard `--delay-patterns "*"` matches all instructions, injecting delay on a random 50% subset of ALL SASS instructions
- Also adds `SYNCS.EXCH` (mbarrier init) to the built-in delay injection patterns
- Plumbed through `CUTRACER_DELAY_PATTERNS` environment variable
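The pattern semantics above (comma-separated substrings plus a `"*"` wildcard) can be sketched in a few lines of Python. The helper names and the sample SASS strings in the usage below are made up for illustration. The notes state the 50% figure only for the wildcard case, so applying the same coin flip to every matched instruction via `keep_probability` is an assumption.

```python
import random


def parse_delay_patterns(spec: str) -> list:
    """Split a comma-separated pattern string, dropping empty entries."""
    return [p.strip() for p in spec.split(",") if p.strip()]


def matches(sass: str, patterns: list) -> bool:
    """Substring match against SASS text; "*" matches every instruction."""
    return "*" in patterns or any(p in sass for p in patterns)


def select_delay_points(sass_lines, patterns, rng, keep_probability=0.5):
    """Return indices of matching instructions that receive a delay.

    Assumption: each match is kept independently with keep_probability.
    """
    return [i for i, sass in enumerate(sass_lines)
            if matches(sass, patterns) and rng.random() < keep_probability]
```

For example, with the illustrative listing `["UTMALDG.2D [UR8], [UR4] ;", "IADD3 R1, R1, 0x1, RZ ;", "UTMASTG.2D [UR4], [UR8] ;"]` and patterns parsed from `"UTMALDG,UTMASTG"`, only indices 0 and 2 are candidates; a seeded `random.Random` then decides which candidates are kept.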
# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
--delay-patterns "UTMALDG,UTMASTG" -- python test.py
# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
--delay-patterns "*" -- python test.py
Instruction Table Embedding & Trace Format Optimization
Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:
- Instruction table: Each `kernel_metadata` record now includes an `instructions` array indexed by `opcode_id`, containing SASS disassembly, binary encoding (via NVBit 1.8 `getSassBinary()` API), register indices, and uniform register indices
- Schema update: `kernel_metadata.schema.json` updated with the `instructions` property definition
- Per-record SASS removal: `j["sass"]` serialization removed from JSON output (text mode unchanged). The Python `TraceReader` caches the instruction table and injects `sass` into records on read via `opcode_id` lookup. Backward compatible: old traces with inline SASS still work
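For tools that parse the raw JSON rather than going through `TraceReader`, the lookup can be approximated as follows. This is a sketch, not the reader's actual code: it assumes a metadata record is recognizable by a `type` field equal to `kernel_metadata` and that trace records belong to the most recently seen metadata record. Only the `instructions` array, the `opcode_id` index, and the per-entry `sass` field come from these notes.

```python
import json


def inject_sass(ndjson_lines):
    """Yield records with "sass" restored from the cached instruction table."""
    instructions = []  # table from the most recent kernel_metadata record
    for line in ndjson_lines:
        rec = json.loads(line)
        if rec.get("type") == "kernel_metadata":  # hypothetical discriminator
            instructions = rec.get("instructions", [])
        elif "sass" not in rec and "opcode_id" in rec:
            idx = rec["opcode_id"]
            if 0 <= idx < len(instructions):
                rec["sass"] = instructions[idx]["sass"]
        # Old traces that still carry inline "sass" pass through unchanged.
        yield rec
```

Keeping the inline value when one is present mirrors the backward-compatibility behavior described above.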
Bug Fixes
- NVBit 1.8 TMA operand extraction regression: NVBit 1.8 changed `UTMALDG`/`UTMASTG`/`UTMAREDG` from two separate `MREF` operands to a single `TMA_PARAM_HANDLE` operand. CUTracer's operand loop had no handler, silently dropping all UR information and breaking `tma_trace`, `data-race`, `tma`, `mma`, and `dataflow` analysis commands. Added a manual `TMA_PARAM_HANDLE` decoder plus unhandled-operand debug logging for future NVBit changes
- Kernel hash always 0x: `compute_kernel_checksum()` was only called inside the instrumentation path; now computed at metadata creation time via idempotent `ensure_kernel_checksum()` helper
- Log truncation on SIGTERM (#206): Added `flush_log_files()` before `raise(SIGTERM)` in deadlock detection, kernel timeout, and no-data timeout termination paths
- Cluster warning noise (#210): Gated `[CLUSTER]` warning by `should_instrument` so non-matching kernels don't clutter stderr; also fixed legacy `MEMTRACE:` prefix to `CUTracer:`
- Kernel events writer lifetime: Fixed multi-context crash where first context teardown destroyed the shared writer; now cleaned up only when all contexts are gone
- Python callstack lineno: Clamped negative `PyFrame_GetLineNumber()` return values unconditionally (previously only inside `PyErr_Occurred()` block)
- Kernel events NDJSON enforcement: Force NDJSON format for kernel events writer regardless of `CUTRACER_TRACE_FORMAT`, with warning when text mode is active
CLI Changes
New Flags
# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1
# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack
# Cluster delay mode
cutracer trace --delay-mode cluster # (alongside existing random/fixed)
# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2" # or "*" for all
Query Command Updates
# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
Configuration Changes
Updated Environment Variables
| Variable | Change | Description |
|---|---|---|
| `CUTRACER_CPU_CALLSTACK` | Extended | Now accepts: auto, pytorch, backtrace, auto_gil, 0, 1 (was: 0/1 only) |
| `CUTRACER_KERNEL_EVENTS` | New | Kernel events recording mode: 0 (disabled, default), dedup, full, nostack |
| `CUTRACER_DELAY_MODE` | Extended | New cluster value (alongside existing random/fixed) |
| `CUTRACER_DELAY_PATTERNS` | New | Comma-separated SASS instruction substrings for delay injection; "*" for all instructions |
Security & CI
- CodeQL fixes (#209): Replaced `fopen("w")` with `open(O_CREAT|O_WRONLY|O_TRUNC, 0644)` + `fdopen()` for explicit file permissions; added `permissions: contents: read` to CI workflows; pinned `peter-evans/create-pull-request` to commit SHA
- CI run-all-then-summarize (#205): Changed CI from fail-fast to run all test suites before reporting, with a summary table showing pass/fail counts
- Remove Meta-internal references: Cleaned up internal-only content from OSS files (readme.md, copyright headers)
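The explicit-permissions pattern from the CodeQL fix has a direct Python analogue, shown here only to illustrate the idea; CUTracer's change is in its C++ sources, and the helper name `open_log_for_write` is hypothetical. The file is created through the low-level `open` call with an explicit mode, then wrapped in a buffered file object.

```python
import os


def open_log_for_write(path: str):
    """Create or truncate path with an explicit 0644 mode, then wrap the fd.

    The effective mode is still filtered by the process umask, as with the
    C open() call.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    return os.fdopen(fd, "w")
```

The returned object behaves like any text file opened for writing, so it can be used in a `with` block and is closed (along with the descriptor) on exit.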
Dependency Updates
- tritonparse: `>=0.4.0` → `>=0.4.3`
- CUDA 13.2: `13.2.0` → `13.2.1` (driver `595.45.04` → `595.58.03`) in `install_cuda.sh`
- NCCL: Added CUDA-12.6 pin override `v2.29.3-1` (sm50 support bug fix)
Testing
- E2E Buck test: New `py_add` E2E test target (~370 lines) covering kernel filters, three trace format modes, and cross-format consistency validation
- Kernel events query tests: 115+ lines of new tests for query command kernel events support, plus 99 lines of reader unit tests
- Formatter tests: Added Buck target for `test_formatters.py`
Tooling
- `listgpu.sh`: New utility script (~150 lines) that lists all GPU processes with user, command, start time, CPU%, GPU memory, and full command lines
- `setup_local_skills.sh`: Idempotent script to mount in-repo AI agent skills into `~/.llms/skills/` and `~/.claude/skills/`
- `sync_install_cuda` agent skill: Codified the recurring install_cuda.sh upstream sync workflow as reusable AI agent skills for Devmate and Claude Code
Breaking Changes
- `CUTRACER_CPU_CALLSTACK` behavior changed: the old `1` value now behaves as `auto` (tries PyTorch first, falls back to backtrace). Existing scripts using `--cpu-callstack 1` will get better callstacks automatically; `--cpu-callstack 0` still disables capture.
- JSON trace format: the per-record `"sass"` field is no longer emitted in new traces (moved to the `kernel_metadata.instructions` table). The Python `TraceReader` transparently injects it from the instruction table, so downstream tools using the Python API are unaffected. Tools that parse raw JSON and relied on per-record `"sass"` should read from `kernel_metadata.instructions[opcode_id].sass` instead.
Contributors
- Hongtao Yu
- Lei Wang
- Yueming Hao
Acknowledgments
CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.
License
- MIT License: Meta Platforms, Inc. contributions
- BSD-3-Clause License: NVIDIA NVBit components
See LICENSE and LICENSE-BSD for details.
Documentation
Full documentation is available in the Wiki.