CUTracer v0.2.1 Release Notes

🎉 Patch Release — 24 commits since v0.2.0

This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.

✨ Highlights

PyTorch CapturedTraceback Integration — Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
Kernel Events Recording — New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
Cluster-Level Delay Mode — New --delay-mode cluster that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
Instruction Table Embedding — Per-instruction static table with SASS binary encoding embedded in kernel_metadata, eliminating redundant per-record SASS strings from JSON traces
Custom Delay Patterns — New --delay-patterns flag for injecting delays at arbitrary SASS instruction types, including a "*" wildcard mode
NVBit 1.8 TMA Fix — Critical fix for TMA operand extraction regression introduced by NVBit 1.8's new TMA_PARAM_HANDLE operand type

🧠 PyTorch CapturedTraceback Callstack Capture

A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.

Backtrace refactor (#202) — Extract capture_cpu_callstack() into capture_cpu_callstack_backtrace() to prepare for alternative backends
CapturedTraceback module (#203) — New python_callstack.cpp (~360 lines) that dynamically resolves Python C API functions via dlsym(dlopen(NULL)) — zero compile-time Python/PyTorch dependencies. Calls CapturedTraceback.extract().summary() when the current thread holds the GIL
Dynamic mode selection (#204) — New CpuCallstackMode enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. AUTO (default) tries PyTorch first, falls back to backtrace(). New cpu_callstack_source field in kernel_metadata JSON output
Auto-GIL acquisition — New auto_gil mode that re-acquires the GIL via PyGILState_Ensure() for Triton kernels where __triton_launcher releases the GIL before cuLaunchKernelEx. Safe because the Python frame chain is frozen at the Triton launch() call site

# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py

# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py
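
The AUTO fallback described above can be modeled roughly as follows. This is a minimal Python sketch, not the C++ implementation: the `parse_mode` and `capture_callstack` helpers, the capture callbacks, and the source label strings are hypothetical stand-ins, and `auto_gil` is left out because the notes do not say how it maps onto the enum.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class CpuCallstackMode(Enum):
    # Mirrors the four enum values named in the release notes.
    AUTO = "auto"
    PYTORCH = "pytorch"
    BACKTRACE = "backtrace"
    DISABLED = "disabled"


def parse_mode(value: str) -> CpuCallstackMode:
    """Map a --cpu-callstack value to a mode; legacy "1" behaves as auto."""
    legacy = {"0": CpuCallstackMode.DISABLED, "1": CpuCallstackMode.AUTO}
    if value in legacy:
        return legacy[value]
    return CpuCallstackMode(value)  # raises ValueError on unknown input


# A capture callback returns a list of frames, or None if it cannot capture.
Capture = Callable[[], Optional[List[str]]]


def capture_callstack(
    mode: CpuCallstackMode, pytorch_capture: Capture, backtrace_capture: Capture
) -> Tuple[List[str], str]:
    """Return (frames, source); the source labels here are assumed values."""
    if mode is CpuCallstackMode.DISABLED:
        return [], "disabled"
    if mode in (CpuCallstackMode.AUTO, CpuCallstackMode.PYTORCH):
        frames = pytorch_capture()
        if frames:
            return frames, "pytorch"
        if mode is CpuCallstackMode.PYTORCH:
            return [], "pytorch"  # forced mode: no fallback
    # AUTO falls through to here when the PyTorch path yields nothing.
    return backtrace_capture() or [], "backtrace"
```

In this model, only AUTO falls back; a forced `pytorch` mode reports an empty stack rather than switching backends, which is an assumption about the intended behavior.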

📋 Kernel Events Recording

New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. Key innovation: callstack deduplication using FNV-1a hashing — each unique Python callstack is emitted once as a callstack_def record, with subsequent launches referencing it by callstack_id.

Three modes: dedup (recommended), full (inline callstack per launch), nostack (metadata only)
Zero overhead when disabled (default)
Query integration — The query command now handles kernel events files seamlessly: callstack_def records are cached and resolved, kernel_launch records get a caller field injected with the innermost call site frame
Recommended query pattern: --group-by kernel_checksum --count for launch frequency analysis

# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py

# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
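
To illustrate the dedup mode, here is a small Python sketch of FNV-1a based callstack deduplication. The record types `callstack_def` and `kernel_launch` and the `callstack_id` field come from the notes above; the 64-bit hash width, the hex ID encoding, and the `type`, `frames`, and `kernel` field names are assumptions made for the example.

```python
import json
from typing import Iterable, Iterator, List, Tuple

FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def dedup_records(launches: Iterable[Tuple[str, List[str]]]) -> Iterator[str]:
    """Yield NDJSON lines, emitting each unique callstack only once."""
    seen = set()
    for kernel_name, frames in launches:
        digest = fnv1a_64("\n".join(frames).encode("utf-8"))
        callstack_id = format(digest, "016x")
        if callstack_id not in seen:
            seen.add(callstack_id)
            yield json.dumps(
                {"type": "callstack_def", "callstack_id": callstack_id, "frames": frames}
            )
        # Every launch references its callstack by ID instead of inlining it.
        yield json.dumps(
            {"type": "kernel_launch", "kernel": kernel_name, "callstack_id": callstack_id}
        )
```

With this layout, a training loop that launches the same kernel from the same call site thousands of times pays for the full callstack text once.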

🔀 Cluster-Level Random Delay Mode

New --delay-mode cluster (#207, ~430 lines) delays only one randomly selected CTA within each cluster while the other CTAs proceed at normal speed. This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.

Uses cluster_ctaid/cluster_nctaid PTX registers for CTA selection within each cluster
cluster_seed stored in delay config JSON for deterministic replay
Host-side cluster dimension detection via CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION (including CUDA graph captures), with fallback to cuFuncGetAttribute
One-time-per-kernel diagnostic log of runtime cluster dimensions
Automatic warning when used against non-cluster-launched kernels (no-op detection)

# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
    --delay-mode cluster --delay-ns 10000 \
    --kernel-filters my_cluster_kernel -- python test.py
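
The selection rule can be modeled on the host as below. This is only a Python sketch of the idea: the real selection runs on the device using the cluster_ctaid/cluster_nctaid registers, and the seed-mixing function, the linear `cluster_index`, and the helper names here are hypothetical.

```python
import random


def delayed_cta_rank(cluster_seed: int, cluster_index: int, ctas_per_cluster: int) -> int:
    """Pick the one CTA rank to delay in a cluster, reproducibly per seed."""
    # Hypothetical mixing: derive a per-cluster stream from the stored seed.
    rng = random.Random(cluster_seed * 1_000_003 + cluster_index)
    return rng.randrange(ctas_per_cluster)


def should_delay(
    cluster_seed: int, cluster_index: int, cta_rank: int, ctas_per_cluster: int
) -> bool:
    """True only for the selected CTA; other CTAs run at normal speed."""
    if ctas_per_cluster <= 1:
        # Assumed no-op: a kernel launched without clusters has no peer CTA.
        return False
    return cta_rank == delayed_cta_rank(cluster_seed, cluster_index, ctas_per_cluster)
```

Because the choice depends only on the seed and the cluster position, replaying with the same cluster_seed delays the same CTAs, which is what makes a failing run reproducible.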

🎯 Custom Delay Patterns

New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.

Comma-separated patterns: --delay-patterns "UTMALDG,UTMASTG"
Wildcard --delay-patterns "*" matches every SASS instruction and injects delay on a random 50% subset of them
Also adds SYNCS.EXCH (mbarrier init) to the built-in delay injection patterns
Plumbed through CUTRACER_DELAY_PATTERNS environment variable

# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
    --delay-patterns "UTMALDG,UTMASTG" -- python test.py

# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
    --delay-patterns "*" -- python test.py
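
The matching rule can be sketched in Python as follows. Substring matching, comma separation, and the 50% wildcard rate come from the notes above; the helper names and the per-instruction coin flip used to pick the wildcard subset are assumptions.

```python
import random
from typing import List


def parse_patterns(raw: str) -> List[str]:
    """Split a comma-separated pattern string, dropping empty entries."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def should_inject(sass: str, patterns: List[str], rng: random.Random) -> bool:
    """Decide whether a SASS instruction is a delay injection point."""
    if "*" in patterns:
        # Wildcard: every instruction is a candidate, about half are chosen.
        return rng.random() < 0.5
    # Otherwise a pattern matches if it occurs anywhere in the SASS text.
    return any(pattern in sass for pattern in patterns)
```

For example, `should_inject("UTMALDG.2D [UR8], [UR4] ;", parse_patterns("UTMALDG,UTMASTG"), random.Random(0))` selects the instruction, while an `IADD3` instruction is skipped under the same patterns.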

📦 Instruction Table Embedding & Trace Format Optimization

Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:

Instruction table — Each kernel_metadata record now includes an instructions array indexed by opcode_id, containing SASS disassembly, binary encoding (via NVBit 1.8 getSassBinary() API), register indices, and uniform register indices
Schema update — kernel_metadata.schema.json updated with the instructions property definition
Per-record SASS removal — j["sass"] serialization removed from JSON output (text mode unchanged). The Python TraceReader caches the instruction table and injects sass into records on read via opcode_id lookup. Backward compatible: old traces with inline SASS still work
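
For tools that parse the raw JSON rather than going through TraceReader, the lookup could look roughly like this Python sketch. The `instructions`, `opcode_id`, and `sass` names come from the notes; the `type` discriminator on the metadata record and the assumption that trace records follow the metadata of the kernel they belong to are illustrative only.

```python
import json
from typing import Iterable, Iterator, List


def inject_sass(lines: Iterable[str]) -> Iterator[dict]:
    """Re-attach SASS text to trace records via the instruction table."""
    instructions: List[dict] = []
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "kernel_metadata":
            # Cache the table for the records that follow this kernel.
            instructions = record.get("instructions", [])
        elif "sass" not in record and "opcode_id" in record:
            index = record["opcode_id"]
            if 0 <= index < len(instructions):
                record["sass"] = instructions[index]["sass"]
        # Old traces with inline "sass" pass through unchanged.
        yield record
```

Leaving records that already carry `sass` untouched is what keeps the same code path working for traces written before this release.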

🐛 Bug Fixes

NVBit 1.8 TMA operand extraction regression — NVBit 1.8 changed UTMALDG/UTMASTG/UTMAREDG from two separate MREF operands to a single TMA_PARAM_HANDLE operand. CUTracer's operand loop had no handler, silently dropping all UR information and breaking tma_trace, data-race, tma, mma, and dataflow analysis commands. Added a manual TMA_PARAM_HANDLE decoder plus unhandled-operand debug logging for future NVBit changes
Kernel hash always 0x — compute_kernel_checksum() was only called inside instrumentation path; now computed at metadata creation time via idempotent ensure_kernel_checksum() helper
Log truncation on SIGTERM (#206) — Added flush_log_files() before raise(SIGTERM) in deadlock detection, kernel timeout, and no-data timeout termination paths
Cluster warning noise (#210) — Gated [CLUSTER] warning by should_instrument so non-matching kernels don't clutter stderr; also fixed legacy MEMTRACE: prefix to CUTracer:
Kernel events writer lifetime — Fixed multi-context crash where first context teardown destroyed the shared writer; now cleaned up only when all contexts are gone
Python callstack lineno — Clamped negative PyFrame_GetLineNumber() return values unconditionally (previously only inside PyErr_Occurred() block)
Kernel events NDJSON enforcement — Force NDJSON format for kernel events writer regardless of CUTRACER_TRACE_FORMAT, with warning when text mode is active

🖥️ CLI Changes

New Flags

# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1

# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack

# Cluster delay mode
cutracer trace --delay-mode cluster  # (alongside existing random/fixed)

# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2"  # or "*" for all

Query Command Updates

# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
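
What the query does with a kernel events file can be approximated by the Python sketch below: cache callstack_def records, attach a caller to each kernel_launch, and count launches per kernel_checksum. The `type` and `frames` field names and the innermost-first frame order are assumptions, not the documented file schema.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def resolve_and_count(lines: Iterable[str]) -> Tuple[List[dict], Counter]:
    """Resolve callstack references and count launches per kernel_checksum."""
    callstacks: Dict[str, List[str]] = {}
    launches: List[dict] = []
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") == "callstack_def":
            callstacks[record["callstack_id"]] = record.get("frames", [])
        elif record.get("type") == "kernel_launch":
            frames = callstacks.get(record.get("callstack_id"), [])
            # Assumed ordering: the first frame is the innermost call site.
            record["caller"] = frames[0] if frames else None
            launches.append(record)
            counts[record.get("kernel_checksum")] += 1
    return launches, counts
```

Calling `counts.most_common()` on the result gives a launch frequency ranking comparable in spirit to `--group-by kernel_checksum --count`.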

📁 Configuration Changes

Updated Environment Variables

| Variable | Change | Description |
| --- | --- | --- |
| `CUTRACER_CPU_CALLSTACK` | Extended | Now accepts: `auto`, `pytorch`, `backtrace`, `auto_gil`, `0`, `1` (was: `0`/`1` only) |
| `CUTRACER_KERNEL_EVENTS` | New | Kernel events recording mode: `0` (disabled, default), `dedup`, `full`, `nostack` |
| `CUTRACER_DELAY_MODE` | Extended | New `cluster` value (alongside existing `random`/`fixed`) |
| `CUTRACER_DELAY_PATTERNS` | New | Comma-separated SASS instruction substrings for delay injection; `"*"` for all instructions |

🔒 Security & CI

CodeQL fixes (#209) — Replaced fopen("w") with open(O_CREAT|O_WRONLY|O_TRUNC, 0644) + fdopen() for explicit file permissions; added permissions: contents: read to CI workflows; pinned peter-evans/create-pull-request to commit SHA
CI run-all-then-summarize (#205) — Changed CI from fail-fast to run all test suites before reporting, with a summary table showing pass/fail counts
Remove Meta-internal references — Cleaned up internal-only content from OSS files (readme.md, copyright headers)
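
The file-permission pattern from the CodeQL fix is a C change, but the same idea can be shown with the equivalent calls in Python's os module. This is an analogue for illustration, with a hypothetical helper name, not code from the repository.

```python
import os
from typing import TextIO


def open_for_write(path: str, mode: int = 0o644) -> TextIO:
    """Create or truncate a file with explicit permission bits."""
    # os.open takes the permission bits directly (still subject to the
    # process umask), unlike a plain open(path, "w").
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, mode)
    # Wrap the descriptor in a buffered text file object, like fdopen() in C.
    return os.fdopen(fd, "w")
```

Stating the mode at the call site makes the intended permissions visible to reviewers and static analyzers instead of leaving them to library defaults.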

🔄 Dependency Updates

tritonparse: >=0.4.0 → >=0.4.3
CUDA 13.2: 13.2.0 → 13.2.1 (driver 595.45.04 → 595.58.03) in install_cuda.sh
NCCL: Added CUDA-12.6 pin override v2.29.3-1 (sm50 support bug fix)

🧪 Testing

E2E Buck test — New py_add E2E test target (~370 lines) covering kernel filters, three trace format modes, and cross-format consistency validation
Kernel events query tests — 115+ lines of new tests for query command kernel events support, plus 99 lines of reader unit tests
Formatter tests — Added Buck target for test_formatters.py

🛠️ Tooling

listgpu.sh — New utility script (~150 lines) that lists all GPU processes with user, command, start time, CPU%, GPU memory, and full command lines
setup_local_skills.sh — Idempotent script to mount in-repo AI agent skills into ~/.llms/skills/ and ~/.claude/skills/
sync_install_cuda agent skill — Codified the recurring install_cuda.sh upstream sync workflow as reusable AI agent skills for Devmate and Claude Code

⚠️ Breaking Changes

CUTRACER_CPU_CALLSTACK behavior changed: the old 1 value now behaves as auto (tries PyTorch first, falls back to backtrace). Existing scripts using --cpu-callstack 1 will get better callstacks automatically; --cpu-callstack 0 still disables capture.
JSON trace format: the per-record "sass" field is no longer emitted in new traces (moved to kernel_metadata.instructions table). The Python TraceReader transparently injects it from the instruction table, so downstream tools using the Python API are unaffected. Tools that parse raw JSON and relied on per-record "sass" should read from kernel_metadata.instructions[opcode_id].sass instead.

👥 Contributors

Hongtao Yu
Lei Wang
Yueming Hao

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.

📄 License

MIT License — Meta Platforms, Inc. contributions
BSD-3-Clause License — NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.

📚 Documentation

Full documentation is available in the Wiki.

🔗 Links

Repository: https://github.com/facebookresearch/CUTracer
PyPI: https://pypi.org/project/cutracer/
Issues: https://github.com/facebookresearch/CUTracer/issues
