CUTracer v0.1.0 Release 🎉
🎉 Initial Public Release
CUTracer is an NVBit-based CUDA binary instrumentation tool for GPU kernel analysis and debugging. It enables runtime-level insights without requiring application recompilation.
✨ Highlights
- Zero-modification runtime injection - Attach to any CUDA application via CUDA_INJECTION64_PATH
- GPU Hang Detection - Automatic deadlock identification with process termination
- Data Race Detection - Random delay injection with deterministic replay support
- Triton/Proton Integration - Per-warp instruction histograms with IPC calculation
- Efficient Trace Compression - NDJSON + Zstd (~92% space savings)
- Python Analysis Toolkit - Available on PyPI: pip install cutracer
🔧 Instrumentation Modes
| Mode | Environment Variable | Description |
|---|---|---|
| opcode_only | CUTRACER_INSTRUMENT=opcode_only | Lightweight instruction counting |
| reg_trace | CUTRACER_INSTRUMENT=reg_trace | Register value tracing (R/UREG support) |
| mem_addr_trace | CUTRACER_INSTRUMENT=mem_addr_trace | Memory address tracing |
| mem_value_trace | CUTRACER_INSTRUMENT=mem_value_trace | Memory address + value tracing (Global/Shared/Local) |
| random_delay | CUTRACER_INSTRUMENT=random_delay | Delay injection for race detection |
Multiple modes can be enabled simultaneously with comma-separated values.
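As a rough illustration of combining modes, the sketch below builds the environment for an instrumented run from Python. The helper names `cutracer_env` and `run_instrumented` are hypothetical and not part of CUTracer. The library path is the one used in the examples in these notes and may differ on your system.

```python
import os
import subprocess


def cutracer_env(lib_path, modes, base_env=None):
    """Return a copy of the environment with CUTracer injection variables set.

    Hypothetical helper for illustration; not part of the cutracer package.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_INJECTION64_PATH"] = os.path.expanduser(lib_path)
    # Several modes are passed as a single comma-separated value.
    env["CUTRACER_INSTRUMENT"] = ",".join(modes)
    return env


def run_instrumented(cmd, lib_path, modes):
    """Launch cmd (a list of argv strings) with CUTracer attached."""
    return subprocess.run(cmd, env=cutracer_env(lib_path, modes), check=False)


env = cutracer_env(
    "~/CUTracer/lib/cutracer.so",
    ["opcode_only", "mem_addr_trace"],
    base_env={},
)
print(env["CUTRACER_INSTRUMENT"])
```

A call such as `run_instrumented(["python", "your_kernel.py"], "~/CUTracer/lib/cutracer.so", ["reg_trace"])` would then start the application with the chosen modes.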
📊 Built-in Analyses
Instruction Histogram (proton_instr_histogram)
- Clock-delimited per-warp instruction counting
- CSV output: warp_id,region_id,instruction,count
- Integration with Triton Proton for IPC calculation
- Requires kernels to emit clock instructions (e.g., Triton pl.scope())
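The CSV columns listed above could be post-processed along these lines. This is a minimal sketch that assumes the file has exactly those four columns, that the instruction field contains no unquoted commas, and that a header row may or may not be present. The function name `instructions_per_region` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

COLUMNS = ["warp_id", "region_id", "instruction", "count"]


def instructions_per_region(csv_file):
    """Sum instruction counts per (warp_id, region_id) pair."""
    totals = Counter()
    for row in csv.DictReader(csv_file, fieldnames=COLUMNS):
        if row["warp_id"] == "warp_id":
            continue  # skip a header row if the file has one
        key = (int(row["warp_id"]), int(row["region_id"]))
        totals[key] += int(row["count"])
    return totals


# Made-up sample rows for illustration only; not real CUTracer output.
sample = io.StringIO(
    "warp_id,region_id,instruction,count\n"
    "0,0,IADD3,4\n"
    "0,0,LDG.E,2\n"
    "1,0,IADD3,4\n"
)
totals = instructions_per_region(sample)
for (warp, region), n in sorted(totals.items()):
    print(warp, region, n)
```

Dividing each per-region total by a cycle count for the same region (for example, one reported by Proton) would give an IPC figure. That step is left out because the cycle data format is not described here.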
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
Deadlock/Hang Detection (deadlock_detection)
- Detects warps stuck in stable PC loops
- Automatic SIGTERM → SIGKILL termination sequence
- Detailed warp state logging for post-mortem analysis
- Auto-enables reg_trace instrumentation
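For readers unfamiliar with the SIGTERM → SIGKILL escalation mentioned above, here is a generic sketch of the pattern from a supervising Python process. It is not CUTracer's implementation: CUTracer performs the termination itself, and the grace period it uses is not stated in these notes. The function name and the `grace_seconds` value are assumptions for illustration.

```python
import subprocess
import sys


def terminate_then_kill(proc, grace_seconds=5.0):
    """Ask a process to exit, then force it if it does not comply in time."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()


# Stand-in for a hung workload: a child process that sleeps for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
status = terminate_then_kill(child, grace_seconds=2.0)
print("exit status:", status)
```

The polite signal gives the process a chance to flush logs before the forced kill, which matters when warp state is being written for post-mortem analysis.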
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
Data Race Detection (random_delay)
- Injects delays at synchronization points to expose timing-dependent races
- Target SASS patterns:
  - SYNCS.PHASECHK.TRANS64.TRYWAIT (mbarrier try_wait)
  - SYNCS.ARRIVE.TRANS64.RED.A1T0 (mbarrier arrive)
  - UTMALDG.2D (TMA load)
  - WARPGROUP.DEPBAR.LE (MMA wait)
Deterministic Replay Support:
- CUTRACER_DELAY_DUMP_PATH: Export delay config JSON for recording
- CUTRACER_DELAY_LOAD_PATH: Load delay config JSON for exact replay
- Workflow: Discover race with random delays → Reproduce exactly with saved config
# Record mode
CUTRACER_DELAY_NS=1000 \
CUTRACER_DELAY_DUMP_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py
# Replay mode (deterministic reproduction)
CUTRACER_DELAY_LOAD_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py
📁 Output Formats
| Mode | Extension | Description |
|---|---|---|
| 0 | .log | Human-readable text format |
| 1 (default) | .ndjson.zst | NDJSON + Zstd compressed |
| 2 | .ndjson | NDJSON uncompressed |
Set the mode via the TRACE_FORMAT_NDJSON environment variable.
The compression level is configurable via CUTRACER_ZSTD_LEVEL (1-22, default: 22).
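Because NDJSON stores one JSON object per line, an uncompressed trace (mode 2) can be inspected with the Python standard library alone. The sketch below is an assumption-laden illustration. The record fields `warp` and `pc` are made up, since the trace schema is not described in these notes, and the helper names are hypothetical rather than part of the cutracer package.

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by(records, key):
    """Count records by the value stored under key (None if the key is absent)."""
    return Counter(rec.get(key) for rec in records)


# Made-up records for illustration only; not real CUTracer output.
sample = io.StringIO(
    '{"warp": 24, "pc": 16}\n'
    '{"warp": 24, "pc": 32}\n'
    '{"warp": 3, "pc": 16}\n'
)
print(count_by(iter_records(sample), "warp"))
```

For a real file, pass an open file handle instead of the `io.StringIO` sample. A `.ndjson.zst` trace would need to be decompressed first, for example with `zstd -d`.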
🐍 Python Package
Available on PyPI:
pip install cutracer
Features
Validation:
- JSON syntax and schema validation
- Text format validation
- Cross-format consistency checking
- Transparent Zstd compression handling
Analysis:
- TraceReader: Stream trace records from NDJSON files
- StreamingGrouper: Memory-efficient grouped analysis
- WarpSummary: Warp execution status for hang analysis (completed/in-progress/missing)
- Multi-format output: table, JSON, CSV
CLI Tools:
# Validate trace files
cutraceross validate trace.ndjson
cutraceross validate trace.ndjson.zst --verbose
# Analyze trace data
cutraceross analyze trace.ndjson --head 20
cutraceross analyze trace.ndjson --filter "warp=24"
cutraceross analyze trace.ndjson --group-by warp --count
⚙️ Configuration Reference
| Variable | Description | Default |
|---|---|---|
| CUTRACER_INSTRUMENT | Instrumentation modes (comma-separated) | (none) |
| CUTRACER_ANALYSIS | Analysis types (comma-separated) | (none) |
| KERNEL_FILTERS | Kernel name filters (substring match) | (none) |
| INSTR_BEGIN / INSTR_END | Instruction index range filter | 0 / UINT32_MAX |
| TRACE_FORMAT_NDJSON | Output format (0/1/2) | 1 |
| CUTRACER_ZSTD_LEVEL | Zstd compression level | 22 |
| CUTRACER_DELAY_NS | Delay value in nanoseconds | 0 (disabled) |
| CUTRACER_DELAY_DUMP_PATH | Export delay config JSON | (none) |
| CUTRACER_DELAY_LOAD_PATH | Load delay config for replay | (none) |
| TOOL_VERBOSE | Verbosity level (0/1/2) | 0 |
| CUTRACER_DUMP_CUBIN | Dump cubin files | 0 |
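A launcher script might sanity-check a few of these values before starting a long run. The sketch below only encodes the ranges and defaults documented in the table. The function name `check_cutracer_settings` is hypothetical, and CUTracer's own handling of invalid values is not described here.

```python
def check_cutracer_settings(env):
    """Return a list of problems found in CUTracer-related settings.

    env is any mapping of variable names to strings, such as os.environ.
    """
    problems = []

    def check_int(name, allowed, default):
        raw = env.get(name, str(default))
        try:
            value = int(raw)
        except ValueError:
            problems.append(f"{name}={raw!r} is not an integer")
            return
        if value not in allowed:
            problems.append(f"{name}={value} is outside the documented values")

    check_int("TRACE_FORMAT_NDJSON", range(0, 3), 1)   # 0, 1, or 2
    check_int("CUTRACER_ZSTD_LEVEL", range(1, 23), 22)  # 1 through 22
    check_int("TOOL_VERBOSE", range(0, 3), 0)           # 0, 1, or 2
    return problems


print(check_cutracer_settings({"CUTRACER_ZSTD_LEVEL": "30"}))
```

Passing `os.environ` checks the current shell's configuration, and an empty list means nothing fell outside the documented values.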
📋 Requirements
- CUDA Toolkit: Aligned with NVBit requirements
- libzstd: Required for trace compression
- Python 3.10+: For Python package
⚠️ API Stability Notice
This is the initial public release (v0.1.0). APIs and configuration options may change in future versions as we gather feedback and iterate on the design.
Known Limitations
- Instruction histogram requires clock instruction boundaries (e.g., Triton pl.scope())
- Nested regions not supported for instruction histogram analysis
🙏 Acknowledgments
CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.
📄 License
- MIT License - Meta Platforms, Inc. contributions
- BSD-3-Clause License - NVIDIA NVBit components
See LICENSE and LICENSE-BSD for details.
📚 Documentation
Full documentation is available in the Wiki.