
CUTracer v0.2.1 Release 🎉


@FindHao released this 23 Apr 03:41

CUTracer v0.2.1 Release Notes

🎉 Patch Release: 24 commits since v0.2.0

This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.


✨ Highlights

  • PyTorch CapturedTraceback Integration: Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
  • Kernel Events Recording: New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
  • Cluster-Level Delay Mode: New --delay-mode cluster that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
  • Instruction Table Embedding: Per-instruction static table with SASS binary encoding embedded in kernel_metadata, eliminating redundant per-record SASS strings from JSON traces
  • Custom Delay Patterns: New --delay-patterns flag for injecting delays at arbitrary SASS instruction types, including a "*" wildcard mode
  • NVBit 1.8 TMA Fix: Critical fix for TMA operand extraction regression introduced by NVBit 1.8's new TMA_PARAM_HANDLE operand type

🧠 PyTorch CapturedTraceback Callstack Capture

A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.

  • Backtrace refactor (#202): Extract capture_cpu_callstack() into capture_cpu_callstack_backtrace() to prepare for alternative backends
  • CapturedTraceback module (#203): New python_callstack.cpp (~360 lines) that dynamically resolves Python C API functions via dlsym(dlopen(NULL)), so there are zero compile-time Python/PyTorch dependencies. Calls CapturedTraceback.extract().summary() when the current thread holds the GIL
  • Dynamic mode selection (#204): New CpuCallstackMode enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. AUTO (default) tries PyTorch first, falls back to backtrace(). New cpu_callstack_source field in kernel_metadata JSON output
  • Auto-GIL acquisition: New auto_gil mode that re-acquires the GIL via PyGILState_Ensure() for Triton kernels where __triton_launcher releases the GIL before cuLaunchKernelEx. Safe because the Python frame chain is frozen at the Triton launch() call site
# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py

# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py
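
The AUTO fallback order described above can be illustrated from the Python side. This is only a sketch, not CUTracer's C++ implementation: it assumes the private torch.utils._traceback module is importable, uses the standard library traceback module as a stand-in for the backtrace() path, and the function name capture_callstack and the returned source labels are hypothetical.

```python
import traceback


def capture_callstack():
    """Return (source, frames): try PyTorch's CapturedTraceback, else fall back."""
    try:
        # Private PyTorch module; its availability is an assumption of this sketch.
        from torch.utils._traceback import CapturedTraceback

        summary = CapturedTraceback.extract().summary()
        source = "pytorch"
    except Exception:
        # PyTorch is missing or capture failed: use the standard library instead.
        # (CUTracer itself falls back to the C backtrace() call at this point.)
        summary = traceback.extract_stack()
        source = "backtrace"
    frames = [f"{fs.filename}:{fs.lineno} in {fs.name}" for fs in summary]
    return source, frames


def launch_kernel():
    # Stand-in for user code that would launch a GPU kernel.
    return capture_callstack()


source, frames = launch_kernel()
print(source, len(frames))
```

The source label plays the role that cpu_callstack_source plays in kernel_metadata: it tells a reader which backend produced the frames.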

📋 Kernel Events Recording

New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. The key innovation is callstack deduplication using FNV-1a hashing: each unique Python callstack is emitted once as a callstack_def record, and subsequent launches reference it by callstack_id.
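
A minimal sketch of the deduplication idea, assuming the 64-bit FNV-1a variant and assuming the hash itself serves as callstack_id (the release notes specify neither). The record names callstack_def and kernel_launch and the callstack_id key come from the notes; the type, frames, and kernel_name keys and the DedupWriter class are hypothetical.

```python
import json

FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


class DedupWriter:
    """Collects NDJSON lines, emitting each unique callstack only once."""

    def __init__(self):
        self.seen = set()
        self.lines = []

    def record_launch(self, kernel_name, frames):
        callstack_id = fnv1a_64("\n".join(frames).encode("utf-8"))
        if callstack_id not in self.seen:
            # First time this callstack is seen: write its definition.
            self.seen.add(callstack_id)
            self.lines.append(json.dumps(
                {"type": "callstack_def", "callstack_id": callstack_id, "frames": frames}))
        # Every launch stores only the small id, not the full stack.
        self.lines.append(json.dumps(
            {"type": "kernel_launch", "kernel_name": kernel_name, "callstack_id": callstack_id}))


writer = DedupWriter()
stack = ["main.py:10 in main", "train.py:42 in step"]
for name in ("add_kernel", "mul_kernel", "add_kernel"):
    writer.record_launch(name, stack)
print(len(writer.lines))
```

With one shared callstack and three launches, the sketch writes one definition plus three short launch records, which is where the file-size saving comes from.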

  • Three modes: dedup (recommended), full (inline callstack per launch), nostack (metadata only)
  • Zero overhead when disabled (default)
  • Query integration: The query command now handles kernel events files seamlessly: callstack_def records are cached and resolved, kernel_launch records get a caller field injected with the innermost call site frame
  • Recommended query pattern: --group-by kernel_checksum --count for launch frequency analysis
# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py

# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
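
For tools that read the NDJSON file directly instead of going through cutracer query, the resolution step might look like the sketch below. It assumes a type discriminator key, a frames list ordered outermost first, and the sample records shown; only callstack_def, kernel_launch, callstack_id, caller, and kernel_checksum are names taken from these notes.

```python
import json
from collections import Counter


def resolve_kernel_events(lines):
    """Cache callstack_def records and inject a caller field into launches."""
    callstacks = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") == "callstack_def":
            callstacks[record["callstack_id"]] = record.get("frames", [])
            continue  # definitions are consumed, not passed downstream
        if record.get("type") == "kernel_launch":
            frames = callstacks.get(record.get("callstack_id"), [])
            # Assumption: the last frame is the innermost call site.
            record["caller"] = frames[-1] if frames else None
        yield record


def launch_counts(records):
    """Rough equivalent of --group-by kernel_checksum --count."""
    return Counter(r["kernel_checksum"] for r in records if "kernel_checksum" in r)


sample = [
    '{"type": "callstack_def", "callstack_id": 1, "frames": ["main.py:10 in main", "train.py:42 in step"]}',
    '{"type": "kernel_launch", "kernel_checksum": "0xabc", "callstack_id": 1}',
    '{"type": "kernel_launch", "kernel_checksum": "0xdef", "callstack_id": 1}',
    '{"type": "kernel_launch", "kernel_checksum": "0xabc", "callstack_id": 1}',
]
resolved = list(resolve_kernel_events(sample))
print(launch_counts(resolved))
```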

🔀 Cluster-Level Random Delay Mode

New --delay-mode cluster (#207) that delays only one randomly-selected CTA within each cluster while other CTAs proceed at normal speed (~430 lines). This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.

  • Uses cluster_ctaid/cluster_nctaid PTX registers for CTA selection within each cluster
  • cluster_seed stored in delay config JSON for deterministic replay
  • Host-side cluster dimension detection via CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION (including CUDA graph captures), with fallback to cuFuncGetAttribute
  • One-time-per-kernel diagnostic log of runtime cluster dimensions
  • Automatic warning when used against non-cluster-launched kernels (no-op detection)
# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
    --delay-mode cluster --delay-ns 10000 \
    --kernel-filters my_cluster_kernel -- python test.py
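
The selection rule can be simulated on the host to see why storing cluster_seed makes a run replayable. This is a sketch under assumptions: the release notes do not say how the seed and cluster index are combined, so the SplitMix64-style mixer and the names delayed_rank and should_delay are illustrative only.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def mix64(x: int) -> int:
    """SplitMix64 step, used here as an arbitrary deterministic mixer."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def delayed_rank(cluster_seed: int, cluster_id: int, ctas_per_cluster: int) -> int:
    """Pick which CTA rank inside a cluster receives the delay."""
    return mix64(cluster_seed ^ cluster_id) % ctas_per_cluster


def should_delay(cluster_seed: int, cluster_id: int, cta_rank: int, ctas_per_cluster: int) -> bool:
    if ctas_per_cluster <= 1:
        return False  # not a cluster launch, so cluster mode is a no-op
    return cta_rank == delayed_rank(cluster_seed, cluster_id, ctas_per_cluster)


seed = 12345
for cluster_id in range(4):
    flags = [should_delay(seed, cluster_id, rank, 4) for rank in range(4)]
    print(cluster_id, flags)
```

Each cluster gets exactly one delayed CTA, and rerunning with the same seed reproduces the same choice, which is the property deterministic replay depends on.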

🎯 Custom Delay Patterns

New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.

  • Comma-separated patterns: --delay-patterns "UTMALDG,UTMASTG"
  • Wildcard --delay-patterns "*" matches all instructions, injecting delay on a random 50% subset of ALL SASS instructions
  • Also adds SYNCS.EXCH (mbarrier init) to the built-in delay injection patterns
  • Plumbed through CUTRACER_DELAY_PATTERNS environment variable
# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
    --delay-patterns "UTMALDG,UTMASTG" -- python test.py

# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
    --delay-patterns "*" -- python test.py
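
The matching semantics (comma-separated substrings, "*" as a match-all wildcard, then a random 50% subset) can be sketched as follows. The helper names, the seeded generator, and the assumption that the 50% sampling is applied per matching instruction are illustrative, not taken from the CUTracer source.

```python
import random


def parse_patterns(spec: str) -> list:
    """Split a comma-separated pattern string, dropping empty entries."""
    return [p.strip() for p in spec.split(",") if p.strip()]


def matches(sass: str, patterns: list) -> bool:
    """Substring match against a SASS line; "*" matches every instruction."""
    if "*" in patterns:
        return True
    return any(p in sass for p in patterns)


def pick_delay_sites(sass_lines, patterns, seed, probability=0.5):
    """Return indices of matching instructions chosen for delay injection."""
    rng = random.Random(seed)
    return [
        i for i, sass in enumerate(sass_lines)
        if matches(sass, patterns) and rng.random() < probability
    ]


program = [
    "UTMALDG.2D [UR4], [UR8] ;",
    "FADD R0, R1, R2 ;",
    "UTMASTG.2D [UR4], [UR8] ;",
    "SYNCS.EXCH.64 URZ, [UR6], UR4 ;",
]
patterns = parse_patterns("UTMALDG,UTMASTG")
print([i for i, s in enumerate(program) if matches(s, patterns)])
```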

📦 Instruction Table Embedding & Trace Format Optimization

Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:

  • Instruction table: Each kernel_metadata record now includes an instructions array indexed by opcode_id, containing SASS disassembly, binary encoding (via NVBit 1.8 getSassBinary() API), register indices, and uniform register indices
  • Schema update: kernel_metadata.schema.json updated with the instructions property definition
  • Per-record SASS removal: j["sass"] serialization removed from JSON output (text mode unchanged). The Python TraceReader caches the instruction table and injects sass into records on read via opcode_id lookup. Backward compatible: old traces with inline SASS still work
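
For consumers that parse the raw JSON rather than using TraceReader, the lookup might be reproduced along these lines. This is a sketch, not the TraceReader code: the type discriminator and the sample records are assumptions, while instructions, opcode_id, and sass are the names given in these notes.

```python
import json


def inject_sass(lines):
    """Re-attach sass to trace records using the kernel_metadata instruction table."""
    table = []
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "kernel_metadata":
            # Cache the table of the most recent kernel.
            table = record.get("instructions", [])
        elif "sass" not in record and "opcode_id" in record:
            opcode_id = record["opcode_id"]
            if 0 <= opcode_id < len(table):
                record["sass"] = table[opcode_id]["sass"]
        # Old traces with inline sass pass through unchanged.
        yield record


trace = [
    '{"type": "kernel_metadata", "instructions": [{"sass": "LDG.E R0, [R2] ;"}, {"sass": "EXIT ;"}]}',
    '{"type": "reg_trace", "opcode_id": 1}',
    '{"type": "reg_trace", "opcode_id": 0, "sass": "inline from an old trace"}',
]
for rec in inject_sass(trace):
    print(rec.get("sass"))
```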

🐛 Bug Fixes

  • NVBit 1.8 TMA operand extraction regression: NVBit 1.8 changed UTMALDG/UTMASTG/UTMAREDG from two separate MREF operands to a single TMA_PARAM_HANDLE operand. CUTracer's operand loop had no handler for it, silently dropping all UR information and breaking the tma_trace, data-race, tma, mma, and dataflow analysis commands. Added a manual TMA_PARAM_HANDLE decoder plus unhandled-operand debug logging for future NVBit changes
  • Kernel hash always 0x: compute_kernel_checksum() was only called inside the instrumentation path; the checksum is now computed at metadata creation time via the idempotent ensure_kernel_checksum() helper
  • Log truncation on SIGTERM (#206): Added flush_log_files() before raise(SIGTERM) in the deadlock detection, kernel timeout, and no-data timeout termination paths
  • Cluster warning noise (#210): Gated the [CLUSTER] warning by should_instrument so non-matching kernels don't clutter stderr; also changed the legacy MEMTRACE: prefix to CUTracer:
  • Kernel events writer lifetime: Fixed a multi-context crash where the first context teardown destroyed the shared writer; it is now cleaned up only when all contexts are gone
  • Python callstack lineno: Clamped negative PyFrame_GetLineNumber() return values unconditionally (previously only inside the PyErr_Occurred() block)
  • Kernel events NDJSON enforcement: Force NDJSON format for the kernel events writer regardless of CUTRACER_TRACE_FORMAT, with a warning when text mode is active
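
The writer lifetime fix follows a familiar reference-counting pattern, sketched here in Python for brevity. The class and method names are hypothetical, and the real fix lives in CUTracer's C++ context teardown path.

```python
class SharedWriter:
    """A writer shared by several contexts, closed only when the last one exits."""

    def __init__(self):
        self.active_contexts = 0
        self.is_open = False
        self.lines = []

    def on_context_init(self):
        if self.active_contexts == 0:
            self.is_open = True  # the first context opens the writer
        self.active_contexts += 1

    def write(self, line):
        if not self.is_open:
            raise RuntimeError("write after close")
        self.lines.append(line)

    def on_context_term(self):
        self.active_contexts -= 1
        if self.active_contexts == 0:
            self.is_open = False  # only the last context closes it


writer = SharedWriter()
writer.on_context_init()
writer.on_context_init()
writer.on_context_term()           # first teardown: writer must stay usable
writer.write("launch from ctx 2")
writer.on_context_term()           # last teardown: now it is closed
print(writer.is_open)
```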

🖥️ CLI Changes

New Flags

# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1

# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack

# Cluster delay mode
cutracer trace --delay-mode cluster  # (alongside existing random/fixed)

# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2"  # or "*" for all

Query Command Updates

# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count

📝 Configuration Changes

Updated Environment Variables

| Variable | Change | Description |
| --- | --- | --- |
| CUTRACER_CPU_CALLSTACK | Extended | Now accepts: auto, pytorch, backtrace, auto_gil, 0, 1 (was: 0/1 only) |
| CUTRACER_KERNEL_EVENTS | New | Kernel events recording mode: 0 (disabled, default), dedup, full, nostack |
| CUTRACER_DELAY_MODE | Extended | New cluster value (alongside existing random/fixed) |
| CUTRACER_DELAY_PATTERNS | New | Comma-separated SASS instruction substrings for delay injection; "*" for all instructions |
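
When these variables are set from a launcher script, a small check against the accepted values listed above can catch typos early. This is a sketch: the build_env helper is hypothetical, the accepted-value sets are copied from the table, and CUTRACER_DELAY_PATTERNS is left unvalidated because it is free-form.

```python
import os

# Accepted values, copied from the environment variable table above.
ALLOWED = {
    "CUTRACER_CPU_CALLSTACK": {"auto", "pytorch", "backtrace", "auto_gil", "0", "1"},
    "CUTRACER_KERNEL_EVENTS": {"0", "dedup", "full", "nostack"},
    "CUTRACER_DELAY_MODE": {"random", "fixed", "cluster"},
}


def build_env(settings, base=None):
    """Return a copy of the environment with validated CUTracer settings applied."""
    env = dict(os.environ if base is None else base)
    for name, value in settings.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        env[name] = value
    return env


env = build_env(
    {
        "CUTRACER_KERNEL_EVENTS": "dedup",
        "CUTRACER_DELAY_MODE": "cluster",
        "CUTRACER_DELAY_PATTERNS": "UTMALDG,UTMASTG",
    },
    base={},
)
print(env)
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching the traced workload.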

🔒 Security & CI

  • CodeQL fixes (#209): Replaced fopen("w") with open(O_CREAT|O_WRONLY|O_TRUNC, 0644) + fdopen() for explicit file permissions; added permissions: contents: read to CI workflows; pinned peter-evans/create-pull-request to commit SHA
  • CI run-all-then-summarize (#205): Changed CI from fail-fast to run all test suites before reporting, with a summary table showing pass/fail counts
  • Remove Meta-internal references: Cleaned up internal-only content from OSS files (readme.md, copyright headers)
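
The explicit-permission pattern from the CodeQL fix has a direct Python counterpart in os.open and os.fdopen, shown here as an analogy to the C change rather than as CUTracer code. The helper name and the file name are illustrative.

```python
import os
import tempfile


def open_log_for_write(path):
    """Create or truncate a file with mode 0644, then wrap the descriptor."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    return os.fdopen(fd, "w")


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "cutracer_example.log")
    with open_log_for_write(log_path) as log:
        log.write("hello\n")
    print(oct(os.stat(log_path).st_mode & 0o777))
```

Passing the mode at creation time means the permissions never depend on a library default; the process umask can still clear bits but cannot add any.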

🔄 Dependency Updates

  • tritonparse: >=0.4.0 → >=0.4.3
  • CUDA 13.2: 13.2.0 → 13.2.1 (driver 595.45.04 → 595.58.03) in install_cuda.sh
  • NCCL: Added CUDA-12.6 pin override v2.29.3-1 (sm50 support bug fix)

🧪 Testing

  • E2E Buck test: New py_add E2E test target (~370 lines) covering kernel filters, three trace format modes, and cross-format consistency validation
  • Kernel events query tests: 115+ lines of new tests for query command kernel events support, plus 99 lines of reader unit tests
  • Formatter tests: Added Buck target for test_formatters.py

🛠️ Tooling

  • listgpu.sh: New utility script (~150 lines) that lists all GPU processes with user, command, start time, CPU%, GPU memory, and full command lines
  • setup_local_skills.sh: Idempotent script to mount in-repo AI agent skills into ~/.llms/skills/ and ~/.claude/skills/
  • sync_install_cuda agent skill: Codified the recurring install_cuda.sh upstream sync workflow as reusable AI agent skills for Devmate and Claude Code

⚠️ Breaking Changes

  • CUTRACER_CPU_CALLSTACK behavior changed: the old 1 value now behaves as auto (tries PyTorch first, falls back to backtrace). Existing scripts using --cpu-callstack 1 will get better callstacks automatically; --cpu-callstack 0 still disables capture.
  • JSON trace format: the per-record "sass" field is no longer emitted in new traces (moved to kernel_metadata.instructions table). The Python TraceReader transparently injects it from the instruction table, so downstream tools using the Python API are unaffected. Tools that parse raw JSON and relied on per-record "sass" should read from kernel_metadata.instructions[opcode_id].sass instead.

👥 Contributors

  • Hongtao Yu
  • Lei Wang
  • Yueming Hao

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.


📄 License

  • MIT License: Meta Platforms, Inc. contributions
  • BSD-3-Clause License: NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.


🔗 Links