
CUTracer v0.2.0 Release


@FindHao FindHao released this 08 Apr 23:14
· 79 commits to main since this release

🎉 Major Release: 114 commits since v0.1.0

CUTracer v0.2.0 brings Blackwell GPU support, a unified CLI experience, advanced data race reduction, and significant improvements to trace infrastructure.


✨ Highlights

  • Blackwell (SM100) GPU Support: Tensor core instruction tracing for UTC*MMA, UTMALDG, UTMAREDG, and TMA descriptors
  • Unified CLI: New cutracer trace subcommand replaces manual CUDA_INJECTION64_PATH setup
  • Data Race Reducer: DDMin bisection algorithm to automatically find minimal race-triggering configurations
  • NVBit 1.8: Updated from NVBit 1.7.7.1 to 1.8, with a critical fix for <<<>>> kernel launch deadlocks
  • CPU Call Stack Capture: Per-kernel-launch host-side stack traces for debugging
  • Kernel Timeouts & Safety Limits: Configurable execution timeout and trace file size limits

🏗️ Blackwell GPU Support

Full tracing support for NVIDIA Blackwell architecture:

  • UTC*MMA tensor core instructions: Trace Blackwell's new warp-group MMA operations (#161)
  • UTMAREDG tracing: Support for TMA reduction instructions (#162)
  • UTMALDG decoder: Decode TMA load descriptor parameters
  • TMA descriptor tracing: Capture and decode TMA descriptor fields for tile configuration analysis (#155)
  • TMA descriptor decoding in SASS: Extract descriptor parameters from cubin SASS output
  • Tensor memory delay injection: Extend random delay to TMA instructions for data race detection (#189)

🖥️ Unified CLI

The CLI has been completely revamped with a unified cutracer entry point:

cutracer trace: Run and Trace

# Trace a CUDA application (replaces manual CUDA_INJECTION64_PATH setup)
cutracer trace --instrument opcode_only -- python my_kernel.py

# With cubin dump and output directory
cutracer trace --instrument reg_trace --dump-cubin --output-dir ./traces -- python my_kernel.py

# Shell-style environment variable passthrough
cutracer trace CUTRACER_DELAY_NS=1000 -- python my_kernel.py

cutracer query: Query Trace Data

# Filter and query traces
cutracer query trace.ndjson --filter "warp=24"
cutracer query trace.ndjson --filter "cta=[0,0,0],opcode=LDG"  # Multi-condition AND filter
cutracer query trace.ndjson --output result.ndjson --compress
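
The multi-condition filter shown above can be read as a logical AND over field comparisons. The following is a minimal Python sketch of that semantics, not CUTracer's actual implementation: the helper names `parse_filter` and `filter_records` are hypothetical, and the record layout (`warp`, `cta`, `opcode` as top-level NDJSON keys) is assumed from the filter examples.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_filter(expr: str) -> Dict[str, Any]:
    """Split 'k1=v1,k2=v2' on commas that are not inside [...] lists."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(expr):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(expr[start:i])
            start = i + 1
    parts.append(expr[start:])

    conditions: Dict[str, Any] = {}
    for part in parts:
        key, _, raw = part.partition("=")
        try:
            value = json.loads(raw)  # numbers and JSON lists such as [0,0,0]
        except json.JSONDecodeError:
            value = raw  # bare strings such as LDG
        conditions[key.strip()] = value
    return conditions


def filter_records(lines: Iterable[str], expr: str) -> Iterator[dict]:
    """Yield the NDJSON records that satisfy every condition (logical AND)."""
    conditions = parse_filter(expr)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if all(record.get(k) == v for k, v in conditions.items()):
            yield record


sample_lines = [
    '{"warp": 24, "cta": [0, 0, 0], "opcode": "LDG"}',
    '{"warp": 24, "cta": [1, 0, 0], "opcode": "LDG"}',
    '{"warp": 3, "cta": [0, 0, 0], "opcode": "STG"}',
]
matches = list(filter_records(sample_lines, "cta=[0,0,0],opcode=LDG"))
```

Here `matches` holds only the first sample record, since it is the only one that satisfies both conditions.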

cutracer analyze: Analyze Traces

# Warp execution summary
cutracer analyze warp-summary trace.ndjson

cutracer reduce: Minimize Race Configs

# Find minimal delay configuration that triggers a race
cutracer reduce --config delay_config.json -- python my_kernel.py

cutracer sass: SASS Extraction

# Extract SASS from cubin files
cutracer sass --cubin kernel.cubin

🔍 Data Race Detection Enhancements

DDMin Bisection Reducer (#187)

Automatically reduce a delay configuration to the minimal set of delay points that still trigger a data race, using the delta debugging (ddmin) algorithm:

  • Exponentially faster than brute-force elimination
  • Produces minimal reproducible configurations
  • Integrated via cutracer reduce CLI command
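
To illustrate the idea, here is a minimal Python sketch of the classic ddmin algorithm applied to a list of delay points. It is not CUTracer's reducer: the function name `ddmin`, the `triggers_race` callback, and the use of plain integers as delay points are assumptions made for the example. In practice the callback would rerun the kernel with the candidate configuration and report whether the race still appears.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def _split(items: List[T], n: int) -> List[List[T]]:
    """Split items into n contiguous, near-equal, non-empty chunks (n <= len)."""
    k, m = divmod(len(items), n)
    return [items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


def ddmin(items: Sequence[T], triggers_race: Callable[[List[T]], bool]) -> List[T]:
    """Return a 1-minimal subset of items for which triggers_race is still True."""
    current = list(items)
    if not triggers_race(current):
        raise ValueError("the full configuration must trigger the race")
    n = 2  # current granularity: number of chunks
    while len(current) >= 2:
        subsets = _split(current, n)
        reduced = False
        # Step 1: does a single chunk alone still trigger the race?
        for subset in subsets:
            if triggers_race(subset):
                current, n, reduced = subset, 2, True
                break
        # Step 2: otherwise, can one chunk be dropped (test each complement)?
        if not reduced:
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if triggers_race(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: otherwise refine the granularity, or stop at single elements.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


# Hypothetical oracle: the race needs delay points 2 and 5 to be active together.
minimal = ddmin(list(range(8)), lambda cfg: {2, 5} <= set(cfg))
```

With this oracle, `minimal` ends up as `[2, 5]`: removing either remaining delay point makes the race disappear.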

Per-Thread Random Delay Mode (#186)

  • New CUTRACER_DELAY_MODE=per_thread for thread-level delay granularity
  • Better coverage for detecting fine-grained data races

Delay Config Mutator (#145)

  • Programmatic API for manipulating delay configurations
  • Enables automated delay sweep workflows

⏱️ Reliability & Safety

  • Kernel execution timeout (CUTRACER_KERNEL_TIMEOUT_S) β€” Kill kernels that exceed a time limit (#169)
  • No-data timeout β€” Detect silent hangs when no trace data is produced
  • Trace file size limit (CUTRACER_TRACE_SIZE_LIMIT_MB) β€” Prevent runaway disk usage (#169)
  • Periodic flush β€” TraceWriter and log files flush periodically during kernel hangs, ensuring data is available for post-mortem analysis
  • Configurable channel buffer size (CUTRACER_CHANNEL_RECORDS) β€” Tune buffer for hang debugging scenarios
  • Fix <<<>>> deadlock β€” Preload flush_channel via fatbin + NVBit tool API to eliminate kernel launch deadlocks (#199)
  • Fix CUDA graph handling β€” Prevent graph build/capture phase from prematurely executing per-launch side effects
  • Fix trace overwrite β€” Trace file write mode changed from append to overwrite across runs

🔧 Instrumentation Improvements

  • Instruction category system: Conditional instrumentation based on instruction categories (#134)
  • IPOINT configuration: Configure instrumentation points via environment variables (#183)
  • Register indices in trace output: CPU-side static mapping of register operands (#143)
  • opcode_only trace writing: Lightweight opcode-only mode now writes structured trace output
  • Auto-enable cubin dump: Cubin dump auto-enabled when instrumentation is active (#191)
  • Kernel checksum: Robust delay config replay using kernel binary checksums (#133, #141)
  • CPU call stack capture: Host-side stack trace for each kernel launch (#172)
  • Re-execute compiled kernel: Ensure trace captures both compilation and execution runs

📝 Configuration Changes

Renamed Environment Variables

| Old | New |
| --- | --- |
| TRACE_FORMAT_NDJSON | CUTRACER_TRACE_FORMAT |
| CUTRACER_TRACE_OUTPUT_DIR | CUTRACER_OUTPUT_DIR |

CUTRACER_TRACE_FORMAT now also accepts string names (e.g., ndjson_zst, ndjson, log) in addition to numeric values (#193).
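
For scripts that still export the old names, a small migration shim may help. The sketch below is not part of CUTracer: `migrate_env` and `RENAMED` are hypothetical names. It assumes that values can be carried over unchanged, which the note above suggests for CUTRACER_TRACE_FORMAT since numeric values are still accepted.

```python
from typing import Dict, Mapping

# Old name -> new name, taken from the rename table in these release notes.
RENAMED = {
    "TRACE_FORMAT_NDJSON": "CUTRACER_TRACE_FORMAT",
    "CUTRACER_TRACE_OUTPUT_DIR": "CUTRACER_OUTPUT_DIR",
}


def migrate_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env with v0.1.0 variable names replaced by v0.2.0 names."""
    migrated = dict(env)
    for old, new in RENAMED.items():
        if old in migrated:
            value = migrated.pop(old)
            # If the new name is already set, it takes precedence.
            migrated.setdefault(new, value)
    return migrated


legacy = {"TRACE_FORMAT_NDJSON": "2", "CUTRACER_TRACE_OUTPUT_DIR": "./traces"}
updated = migrate_env(legacy)
```

A typical use would be `subprocess.run(cmd, env=migrate_env(os.environ))` in a wrapper script while callers are updated.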

New Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| CUTRACER_KERNEL_TIMEOUT_S | Kernel execution timeout in seconds | (disabled) |
| CUTRACER_TRACE_SIZE_LIMIT_MB | Max trace file size in MB | (unlimited) |
| CUTRACER_CHANNEL_RECORDS | Channel buffer record count | (default) |
| CUTRACER_CPU_CALLSTACK | Enable CPU call stack capture | 0 |
| CUTRACER_DELAY_MODE | Delay mode (uniform/per_thread) | uniform |
| CUTRACER_OUTPUT_DIR | Unified output directory for all artifacts | . |
| CUTRACER_IPOINT | Instrumentation point configuration | (default) |
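
As a rough illustration of how these variables might be combined with the cutracer trace subcommand from a Python driver script, the sketch below only builds the command line and environment; it does not launch anything. The helper `build_trace_invocation` is hypothetical, it uses only the flags and variables named in these notes, and it assumes the variables are read from the process environment of the traced run.

```python
import os
from typing import Dict, List, Mapping, Optional, Sequence, Tuple


def build_trace_invocation(
    app_cmd: Sequence[str],
    instrument: str = "opcode_only",
    output_dir: Optional[str] = None,
    kernel_timeout_s: Optional[int] = None,
    trace_size_limit_mb: Optional[int] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build (argv, env) for a `cutracer trace` run with optional safety limits."""
    env = dict(os.environ if base_env is None else base_env)
    if kernel_timeout_s is not None:
        env["CUTRACER_KERNEL_TIMEOUT_S"] = str(kernel_timeout_s)
    if trace_size_limit_mb is not None:
        env["CUTRACER_TRACE_SIZE_LIMIT_MB"] = str(trace_size_limit_mb)

    cmd = ["cutracer", "trace", "--instrument", instrument]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    cmd += ["--", *app_cmd]  # everything after "--" is the traced application
    return cmd, env


cmd, env = build_trace_invocation(
    ["python", "my_kernel.py"],
    instrument="reg_trace",
    output_dir="./traces",
    kernel_timeout_s=60,
    trace_size_limit_mb=512,
    base_env={},
)
```

The resulting pair could then be passed to `subprocess.run(cmd, env=env)`; the values 60 and 512 are arbitrary examples, not recommended settings.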

🔄 Dependency Updates

  • NVBit: 1.7.7.1 → 1.7.7.3 → 1.8 (#164, #198)
  • nlohmann/json: Updated default to 3.12.0
  • Python: CI updated to Python 3.13
  • GitHub Actions: Updated to latest versions
  • JSON parsing: Migrated to orjson for faster JSON I/O via tritonparse _json_compat
  • Daily NVBit update check: Automated GitHub Action to detect upstream NVBit releases (#163)

🐍 Python Package Improvements

  • CLP archive support β€” Dump and read CLP compressed log archives (#118, #148)
  • Unified logger module β€” Consistent logging across all Python modules
  • Schema validation β€” Migrated trace validation into the cutracer Python module (#154)
  • Query enhancements β€” Hex filters, --all-lines flag, NDJSON output, --output, --compress (#136)
  • Multi-condition AND filter β€” Filter by multiple fields simultaneously (#139)
  • JSON list value filters β€” Support cta=[0,0,0] style filter expressions
  • KernelConfig abstraction β€” Clean API for trace metadata
  • TraceWriter metadata β€” write_metadata() and kernel_metadata event support (#153)
  • Truncated trace detection β€” Detect and handle truncated trace files gracefully
  • GB200 aarch64 support β€” Installation scripts updated for GB200 platforms (#159, #173)
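
Truncated traces typically arise when a run is killed mid-write, leaving a partial final line. The following Python sketch shows one plausible way to tolerate that case. It is an assumption about the general technique, not the cutracer module's code, and `read_ndjson_tolerant` is a hypothetical name.

```python
import json
from typing import List, Tuple


def read_ndjson_tolerant(text: str) -> Tuple[List[dict], bool]:
    """Parse NDJSON text; return (records, truncated).

    A malformed final line is treated as truncation and dropped.
    A malformed line anywhere else is a real error and is re-raised.
    """
    records: List[dict] = []
    truncated = False
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            if index == len(lines) - 1:
                truncated = True  # partial last record, e.g. after a timeout kill
                break
            raise
    return records, truncated


cut_text = '{"pc": "0x10"}\n{"pc": "0x20"}\n{"pc": "0x3'
records, truncated = read_ndjson_tolerant(cut_text)
```

Here `records` contains the two complete entries and `truncated` is `True`, so a caller can still analyze the usable prefix and warn about the cut-off tail.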

📋 Requirements

  • CUDA Toolkit: Aligned with NVBit 1.8 requirements
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package
  • NVBit 1.8: Bundled (auto-downloaded during build)

⚠️ Breaking Changes

  • TRACE_FORMAT_NDJSON renamed to CUTRACER_TRACE_FORMAT (#192)
  • CUTRACER_TRACE_OUTPUT_DIR renamed to CUTRACER_OUTPUT_DIR (#167)
  • CLI entry point unified to cutracer (replaces cutraceross)
  • --all flag renamed to --all-lines (#157)
  • analysis module renamed to query (#135)
  • pc field in trace output changed to hex string format (#137)
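
Downstream scripts that read traces from both versions may need to accept either pc representation. A minimal Python sketch, assuming the old format stored pc as an integer and the new one as a hex string such as "0x1a0" (the exact formatting is not specified in these notes, and `pc_to_int` is a hypothetical helper):

```python
from typing import Union


def pc_to_int(pc: Union[int, str]) -> int:
    """Normalize a trace record's pc field to an integer.

    Accepts an integer (assumed v0.1.0 style) or a hex string with or
    without a 0x prefix (assumed v0.2.0 style).
    """
    if isinstance(pc, int):
        return pc
    return int(pc, 16)  # base 16 accepts an optional "0x" prefix


old_style = pc_to_int(416)
new_style = pc_to_int("0x1a0")
```

Both calls yield the same integer, 416, so comparisons and sorting by pc behave the same regardless of which release produced the trace.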

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework and the v1.8 release.


📄 License

  • MIT License: Meta Platforms, Inc. contributions
  • BSD-3-Clause License: NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.

