CUTracer v0.2.0 Release
Major Release: 114 commits since v0.1.0
CUTracer v0.2.0 brings Blackwell GPU support, a unified CLI experience, advanced data race reduction, and significant improvements to trace infrastructure.
Highlights
- Blackwell (SM100) GPU Support: Tensor core instruction tracing for UTC*MMA, UTMALDG, UTMAREDG, and TMA descriptors
- Unified CLI: New `cutracer trace` subcommand replaces manual `CUDA_INJECTION64_PATH` setup
- Data Race Reducer: DDMin bisection algorithm to automatically find minimal race-triggering configurations
- NVBit 1.8: Updated from NVBit 1.7.7.1 to 1.8, with a critical fix for `<<<>>>` kernel launch deadlocks
- CPU Call Stack Capture: Per-kernel-launch host-side stack traces for debugging
- Kernel Timeouts & Safety Limits: Configurable execution timeout and trace file size limits
Blackwell GPU Support
Full tracing support for NVIDIA Blackwell architecture:
- UTC*MMA tensor core instructions: Trace Blackwell's new warp-group MMA operations (#161)
- UTMAREDG tracing: Support for TMA reduction instructions (#162)
- UTMALDG decoder: Decode TMA load descriptor parameters
- TMA descriptor tracing: Capture and decode TMA descriptor fields for tile configuration analysis (#155)
- TMA descriptor decoding in SASS: Extract descriptor parameters from cubin SASS output
- Tensor memory delay injection: Extend random delay to TMA instructions for data race detection (#189)
Unified CLI
The CLI has been completely revamped with a unified `cutracer` entry point:
`cutracer trace`: Run and Trace
```bash
# Trace a CUDA application (replaces manual CUDA_INJECTION64_PATH setup)
cutracer trace --instrument opcode_only -- python my_kernel.py
# With cubin dump and output directory
cutracer trace --instrument reg_trace --dump-cubin --output-dir ./traces -- python my_kernel.py
# Shell-style environment variable passthrough
cutracer trace CUTRACER_DELAY_NS=1000 -- python my_kernel.py
```
`cutracer query`: Query Trace Data
```bash
# Filter and query traces
cutracer query trace.ndjson --filter "warp=24"
cutracer query trace.ndjson --filter "cta=[0,0,0],opcode=LDG" # Multi-condition AND filter
cutracer query trace.ndjson --output result.ndjson --compress
```
`cutracer analyze`: Analyze Traces
```bash
# Warp execution summary
cutracer analyze warp-summary trace.ndjson
```
`cutracer reduce`: Minimize Race Configs
```bash
# Find minimal delay configuration that triggers a race
cutracer reduce --config delay_config.json -- python my_kernel.py
```
`cutracer sass`: SASS Extraction
```bash
# Extract SASS from cubin files
cutracer sass --cubin kernel.cubin
```
Data Race Detection Enhancements
DDMin Bisection Reducer (#187)
Automatically reduce a delay configuration to the minimal set of delay points that still trigger a data race, using the delta debugging (ddmin) algorithm:
- Exponentially faster than brute-force elimination
- Produces minimal reproducible configurations
- Integrated via the `cutracer reduce` CLI command
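
The reduction idea can be illustrated with a minimal sketch of the classic ddmin loop. This is not CUTracer's implementation: `ddmin`, `triggers_race`, and the delay-point names `d0` to `d7` are hypothetical, and a real reducer would re-run the target program for each candidate rather than call a Python predicate.

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence[str], triggers_race: Callable[[List[str]], bool]) -> List[str]:
    """Return a 1-minimal subset of `items` for which `triggers_race` stays True.

    Assumes triggers_race(list(items)) is True for the full input.
    """
    current = list(items)
    n = 2  # number of chunks the current set is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False

        # 1) Try each chunk on its own.
        for subset in subsets:
            if triggers_race(subset):
                current, n, reduced = subset, 2, True
                break

        # 2) Otherwise try dropping one chunk at a time (the complements).
        if not reduced:
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if triggers_race(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # 3) Otherwise refine the split, or stop once chunks are single items.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


# Hypothetical race that needs delays at both d2 and d5 to show up.
delay_points = [f"d{i}" for i in range(8)]
minimal = ddmin(delay_points, lambda s: "d2" in s and "d5" in s)
print(minimal)  # ['d2', 'd5']
```

Each call to the predicate stands in for one instrumented run, so the number of predicate calls is the cost that matters in practice.
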
Per-Thread Random Delay Mode (#186)
- New `CUTRACER_DELAY_MODE=per_thread` for thread-level delay granularity
- Better coverage for detecting fine-grained data races
Delay Config Mutator (#145)
- Programmatic API for manipulating delay configurations
- Enables automated delay sweep workflows
Reliability & Safety
- Kernel execution timeout (`CUTRACER_KERNEL_TIMEOUT_S`): Kill kernels that exceed a time limit (#169)
- No-data timeout: Detect silent hangs when no trace data is produced
- Trace file size limit (`CUTRACER_TRACE_SIZE_LIMIT_MB`): Prevent runaway disk usage (#169)
- Periodic flush: TraceWriter and log files flush periodically during kernel hangs, ensuring data is available for post-mortem analysis
- Configurable channel buffer size (`CUTRACER_CHANNEL_RECORDS`): Tune buffer for hang debugging scenarios
- Fix `<<<>>>` deadlock: Preload `flush_channel` via fatbin + NVBit tool API to eliminate kernel launch deadlocks (#199)
- Fix CUDA graph handling: Prevent graph build/capture phase from prematurely executing per-launch side effects
- Fix trace overwrite: Trace file write mode changed from append to overwrite across runs
Instrumentation Improvements
- Instruction category system: Conditional instrumentation based on instruction categories (#134)
- IPOINT configuration: Configure instrumentation points via environment variables (#183)
- Register indices in trace output: CPU-side static mapping of register operands (#143)
- `opcode_only` trace writing: Lightweight opcode-only mode now writes structured trace output
- Auto-enable cubin dump: Cubin dump auto-enabled when instrumentation is active (#191)
- Kernel checksum: Robust delay config replay using kernel binary checksums (#133, #141)
- CPU call stack capture: Host-side stack trace for each kernel launch (#172)
- Re-execute compiled kernel: Ensure trace captures both compilation and execution runs
Configuration Changes
Renamed Environment Variables
| Old | New |
|---|---|
| `TRACE_FORMAT_NDJSON` | `CUTRACER_TRACE_FORMAT` |
| `CUTRACER_TRACE_OUTPUT_DIR` | `CUTRACER_OUTPUT_DIR` |

`CUTRACER_TRACE_FORMAT` now also accepts string names (e.g., `ndjson_zst`, `ndjson`, `log`) in addition to numeric values (#193).
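
For launch scripts that still export the old names, a small shim can translate them before starting a traced run. The sketch below is illustrative only: `RENAMED_VARS` mirrors the rename table in these notes, `migrate_env` is a hypothetical helper that is not part of the cutracer package, and it assumes old values remain valid under the new names. The value `"1"` in the usage lines is a placeholder, not a documented format code.

```python
from typing import Dict, Mapping

# Old -> new names, taken from the rename table in these release notes.
RENAMED_VARS = {
    "TRACE_FORMAT_NDJSON": "CUTRACER_TRACE_FORMAT",
    "CUTRACER_TRACE_OUTPUT_DIR": "CUTRACER_OUTPUT_DIR",
}


def migrate_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `env` with deprecated names moved to their new names.

    If the new name is already set, its value wins and the old one is dropped.
    """
    migrated = dict(env)
    for old, new in RENAMED_VARS.items():
        if old in migrated:
            value = migrated.pop(old)
            migrated.setdefault(new, value)
    return migrated


legacy = {"TRACE_FORMAT_NDJSON": "1", "CUTRACER_TRACE_OUTPUT_DIR": "./traces"}
print(migrate_env(legacy))
# {'CUTRACER_TRACE_FORMAT': '1', 'CUTRACER_OUTPUT_DIR': './traces'}
```
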
New Environment Variables
| Variable | Description | Default |
|---|---|---|
| `CUTRACER_KERNEL_TIMEOUT_S` | Kernel execution timeout in seconds | (disabled) |
| `CUTRACER_TRACE_SIZE_LIMIT_MB` | Max trace file size in MB | (unlimited) |
| `CUTRACER_CHANNEL_RECORDS` | Channel buffer record count | (default) |
| `CUTRACER_CPU_CALLSTACK` | Enable CPU call stack capture | 0 |
| `CUTRACER_DELAY_MODE` | Delay mode (uniform/per_thread) | `uniform` |
| `CUTRACER_OUTPUT_DIR` | Unified output directory for all artifacts | . |
| `CUTRACER_IPOINT` | Instrumentation point configuration | (default) |
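
As a rough illustration of how these variables might be combined for one run, the sketch below assembles an environment and a `cutracer trace` command line without executing anything. `build_trace_env` and `trace_argv` are hypothetical helpers, the limit values are arbitrary, and using `"1"` to enable `CUTRACER_CPU_CALLSTACK` is an assumption based on its listed default of 0.

```python
import shlex
from typing import Dict, List, Optional


def build_trace_env(
    kernel_timeout_s: Optional[int] = None,
    trace_size_limit_mb: Optional[int] = None,
    cpu_callstack: bool = False,
    delay_mode: str = "uniform",
    output_dir: str = ".",
) -> Dict[str, str]:
    """Build CUTRACER_* variables for one run; limits left as None are omitted."""
    if delay_mode not in ("uniform", "per_thread"):
        raise ValueError(f"unknown delay mode: {delay_mode}")
    env = {
        "CUTRACER_CPU_CALLSTACK": "1" if cpu_callstack else "0",  # assumed encoding
        "CUTRACER_DELAY_MODE": delay_mode,
        "CUTRACER_OUTPUT_DIR": output_dir,
    }
    if kernel_timeout_s is not None:
        env["CUTRACER_KERNEL_TIMEOUT_S"] = str(kernel_timeout_s)
    if trace_size_limit_mb is not None:
        env["CUTRACER_TRACE_SIZE_LIMIT_MB"] = str(trace_size_limit_mb)
    return env


def trace_argv(app: List[str], instrument: str = "opcode_only") -> List[str]:
    """Compose a `cutracer trace` command line in the form used in these notes."""
    return ["cutracer", "trace", "--instrument", instrument, "--", *app]


env = build_trace_env(kernel_timeout_s=120, trace_size_limit_mb=1024, cpu_callstack=True)
argv = trace_argv(["python", "my_kernel.py"])
print(" ".join(f"{k}={v}" for k, v in sorted(env.items())), shlex.join(argv))
# To launch for real: subprocess.run(argv, env={**os.environ, **env}, check=True)
```
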
Dependency Updates
- NVBit: 1.7.7.1 → 1.7.7.3 → 1.8 (#164, #198)
- nlohmann/json: Updated default to 3.12.0
- Python: CI updated to Python 3.13
- GitHub Actions: Updated to latest versions
- JSON parsing: Migrated to orjson for faster JSON I/O via tritonparse `_json_compat`
- Daily NVBit update check: Automated GitHub Action to detect upstream NVBit releases (#163)
Python Package Improvements
- CLP archive support: Dump and read CLP compressed log archives (#118, #148)
- Unified logger module: Consistent logging across all Python modules
- Schema validation: Migrated trace validation into the `cutracer` Python module (#154)
- Query enhancements: Hex filters, `--all-lines` flag, NDJSON output, `--output`, `--compress` (#136)
- Multi-condition AND filter: Filter by multiple fields simultaneously (#139)
- JSON list value filters: Support `cta=[0,0,0]` style filter expressions
- KernelConfig abstraction: Clean API for trace metadata
- TraceWriter metadata: `write_metadata()` and `kernel_metadata` event support (#153)
- Truncated trace detection: Detect and handle truncated trace files gracefully
- GB200 aarch64 support: Installation scripts updated for GB200 platforms (#159, #173)
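
To make the multi-condition AND filter and the JSON list value filter concrete, here is a minimal sketch of how an expression such as `cta=[0,0,0],opcode=LDG` could be evaluated against NDJSON records. It is not the cutracer query implementation: `parse_filter` and `matches` are hypothetical, and the record fields (`warp`, `cta`, `opcode`) are assumed from the filter examples in these notes rather than from the actual trace schema.

```python
import json
from typing import Any, Dict


def parse_filter(expr: str) -> Dict[str, Any]:
    """Parse 'field=value,field=value' into a dict, keeping [...] lists intact."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(expr):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:  # only split on top-level commas
            parts.append(expr[start:i])
            start = i + 1
    parts.append(expr[start:])

    conditions: Dict[str, Any] = {}
    for part in parts:
        field, _, raw = part.partition("=")
        try:
            value = json.loads(raw)  # numbers and JSON lists such as [0,0,0]
        except json.JSONDecodeError:
            value = raw  # bare strings such as LDG
        conditions[field.strip()] = value
    return conditions


def matches(record: Dict[str, Any], conditions: Dict[str, Any]) -> bool:
    """AND semantics: every condition must equal the record's field."""
    return all(record.get(key) == value for key, value in conditions.items())


# Hypothetical NDJSON lines, one JSON object per line.
lines = [
    '{"warp": 24, "cta": [0, 0, 0], "opcode": "LDG"}',
    '{"warp": 24, "cta": [1, 0, 0], "opcode": "LDG"}',
    '{"warp": 3, "cta": [0, 0, 0], "opcode": "STG"}',
]
conditions = parse_filter("cta=[0,0,0],opcode=LDG")
hits = [rec for rec in map(json.loads, lines) if matches(rec, conditions)]
print(len(hits))  # 1
```
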
Requirements
- CUDA Toolkit: Aligned with NVBit 1.8 requirements
- libzstd: Required for trace compression
- Python 3.10+: For Python package
- NVBit 1.8: Bundled (auto-downloaded during build)
Breaking Changes
- `TRACE_FORMAT_NDJSON` renamed to `CUTRACER_TRACE_FORMAT` (#192)
- `CUTRACER_TRACE_OUTPUT_DIR` renamed to `CUTRACER_OUTPUT_DIR` (#167)
- CLI entry point unified to `cutracer` (replaces `cutraceross`)
- `--all` flag renamed to `--all-lines` (#157)
- `analysis` module renamed to `query` (#135)
- `pc` field in trace output changed to hex string format (#137)
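
Downstream scripts that treated `pc` as an integer will need a conversion step. The helper below is a sketch under the assumption that the new value is a plain hexadecimal string such as `"0x1a0"`; these notes do not specify the exact formatting, and `pc_to_int` is a hypothetical name.

```python
from typing import Union


def pc_to_int(pc: Union[int, str]) -> int:
    """Normalize a trace `pc` value to an int.

    Accepts the older integer form and a hex string with or without a 0x prefix.
    """
    if isinstance(pc, int):
        return pc
    return int(pc, 16)


print(pc_to_int(416), pc_to_int("0x1a0"), pc_to_int("1a0"))  # 416 416 416
```
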
Acknowledgments
CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework and the v1.8 release.
License
- MIT License: Meta Platforms, Inc. contributions
- BSD-3-Clause License: NVIDIA NVBit components
See LICENSE and LICENSE-BSD for details.
Documentation
Full documentation is available in the Wiki.