CUTracer v0.1.0 Release 🎉

@FindHao released this 06 Feb 03:01 · 194 commits to main since this release

🎉 Initial Public Release

CUTracer is an NVBit-based CUDA binary instrumentation tool for GPU kernel analysis and debugging. It enables runtime-level insights without requiring application recompilation.


✨ Highlights

  • Zero-modification runtime injection - Attach to any CUDA application via CUDA_INJECTION64_PATH
  • GPU Hang Detection - Automatic deadlock identification with process termination
  • Data Race Detection - Random delay injection with deterministic replay support
  • Triton/Proton Integration - Per-warp instruction histograms with IPC calculation
  • Efficient Trace Compression - NDJSON + Zstd (~92% space savings)
  • Python Analysis Toolkit - Available on PyPI: pip install cutracer

🔧 Instrumentation Modes

| Mode | Environment Variable | Description |
| --- | --- | --- |
| opcode_only | CUTRACER_INSTRUMENT=opcode_only | Lightweight instruction counting |
| reg_trace | CUTRACER_INSTRUMENT=reg_trace | Register value tracing (R/UREG support) |
| mem_addr_trace | CUTRACER_INSTRUMENT=mem_addr_trace | Memory address tracing |
| mem_value_trace | CUTRACER_INSTRUMENT=mem_value_trace | Memory address + value tracing (Global/Shared/Local) |
| random_delay | CUTRACER_INSTRUMENT=random_delay | Delay injection for race detection |

Multiple modes can be enabled simultaneously with comma-separated values.
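
As a minimal sketch of combining modes, the helper below builds the environment for a traced run from Python. The mode and variable names come from the table above; the helper name `cutracer_env` and its arguments are illustrative assumptions, not part of CUTracer.

```python
import os

# Mode names taken from the Instrumentation Modes table.
KNOWN_MODES = {
    "opcode_only", "reg_trace", "mem_addr_trace",
    "mem_value_trace", "random_delay",
}

def cutracer_env(modes, lib_path, base_env=None):
    """Return an environment dict with CUTracer injection enabled.

    `cutracer_env` is a hypothetical helper written for this example.
    """
    unknown = [m for m in modes if m not in KNOWN_MODES]
    if unknown:
        raise ValueError(f"unknown instrumentation mode(s): {unknown}")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_INJECTION64_PATH"] = os.path.expanduser(lib_path)
    # Multiple modes are passed as a single comma-separated value.
    env["CUTRACER_INSTRUMENT"] = ",".join(modes)
    return env

env = cutracer_env(
    ["opcode_only", "mem_addr_trace"],
    "~/CUTracer/lib/cutracer.so",
    base_env={},
)
print(env["CUTRACER_INSTRUMENT"])  # opcode_only,mem_addr_trace
```

The resulting dict can be passed as `env=` to `subprocess.run` when launching the CUDA application.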


📊 Built-in Analyses

Instruction Histogram (proton_instr_histogram)

  • Clock-delimited per-warp instruction counting
  • CSV output: warp_id,region_id,instruction,count
  • Integration with Triton Proton for IPC calculation
  • Requires kernels to emit clock instructions (e.g., Triton pl.scope())

CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
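
The CSV output can be post-processed with the standard library. The sketch below sums instruction counts per warp and region. It assumes the file starts with a header row matching the column names listed above and that `warp_id` and `region_id` are integers; the sample rows are synthetic, not real CUTracer output.

```python
import csv
import io
from collections import Counter

def instructions_per_region(csv_file):
    """Sum instruction counts per (warp_id, region_id) pair."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        key = (int(row["warp_id"]), int(row["region_id"]))
        totals[key] += int(row["count"])
    return totals

# Synthetic rows for illustration only.
sample = io.StringIO(
    "warp_id,region_id,instruction,count\n"
    "0,0,LDG.E,4\n"
    "0,0,FADD,2\n"
    "1,0,LDG.E,4\n"
)
totals = instructions_per_region(sample)
print(totals[(0, 0)])  # 6
```

Dividing such a total by the region's cycle count reported by Proton would give an IPC figure; that step is omitted here because it depends on Proton's output.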

Deadlock/Hang Detection (deadlock_detection)

  • Detects warps stuck in stable PC loops
  • Automatic SIGTERM → SIGKILL termination sequence
  • Detailed warp state logging for post-mortem analysis
  • Auto-enables reg_trace instrumentation

CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
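
To illustrate the SIGTERM → SIGKILL escalation mentioned above, here is a generic sketch of the pattern using Python's `subprocess` module. CUTracer performs the termination itself; this is not its implementation, and the grace period is an assumed value. The sketch is POSIX-only.

```python
import signal
import subprocess
import sys

def terminate_with_escalation(proc, grace_seconds=5.0):
    """Send SIGTERM, then SIGKILL if the process has not exited in time.

    The grace period is an assumption made for this sketch.
    """
    proc.terminate()  # SIGTERM: ask the process to exit
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught or ignored
        return proc.wait()

# Stand-in for a hung application: a child process that only sleeps.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"]
)
rc = terminate_with_escalation(child, grace_seconds=2.0)
print(rc)  # negative signal number on POSIX
```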

Data Race Detection (random_delay)

  • Injects delays at synchronization points to expose timing-dependent races
  • Target SASS patterns:
    • SYNCS.PHASECHK.TRANS64.TRYWAIT (mbarrier try_wait)
    • SYNCS.ARRIVE.TRANS64.RED.A1T0 (mbarrier arrive)
    • UTMALDG.2D (TMA load)
    • WARPGROUP.DEPBAR.LE (MMA wait)

Deterministic Replay Support:

  • CUTRACER_DELAY_DUMP_PATH: Export delay config JSON for recording
  • CUTRACER_DELAY_LOAD_PATH: Load delay config JSON for exact replay
  • Workflow: Discover race with random delays → Reproduce exactly with saved config

# Record mode
CUTRACER_DELAY_NS=1000 \
CUTRACER_DELAY_DUMP_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

# Replay mode (deterministic reproduction)
CUTRACER_DELAY_LOAD_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py
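
The same record/replay workflow can be driven from a Python script. The sketch below only builds the two sets of environment overrides from the variables shown above; the helper names `record_env` and `replay_env` are hypothetical.

```python
import os

LIB = os.path.expanduser("~/CUTracer/lib/cutracer.so")

def record_env(delay_ns, dump_path):
    """Overrides for a run that injects random delays and saves them."""
    return {
        "CUDA_INJECTION64_PATH": LIB,
        "CUTRACER_ANALYSIS": "random_delay",
        "CUTRACER_DELAY_NS": str(delay_ns),
        "CUTRACER_DELAY_DUMP_PATH": dump_path,
    }

def replay_env(load_path):
    """Overrides for a run that replays a previously saved delay config."""
    return {
        "CUDA_INJECTION64_PATH": LIB,
        "CUTRACER_ANALYSIS": "random_delay",
        "CUTRACER_DELAY_LOAD_PATH": load_path,
    }

config = "./delay_config.json"
rec = record_env(1000, config)
rep = replay_env(config)
# The replay run loads the same file that the recording run dumped.
print(rec["CUTRACER_DELAY_DUMP_PATH"] == rep["CUTRACER_DELAY_LOAD_PATH"])  # True
```

Each dict can be merged over `os.environ` and passed as `env=` to `subprocess.run(["python", "your_kernel.py"], ...)`.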

📁 Output Formats

| Mode | Extension | Description |
| --- | --- | --- |
| 0 | .log | Human-readable text format |
| 1 (default) | .ndjson.zst | NDJSON + Zstd compressed |
| 2 | .ndjson | NDJSON uncompressed |

Set the mode via the TRACE_FORMAT_NDJSON environment variable.

The compression level is configurable via CUTRACER_ZSTD_LEVEL (1-22, default: 22).
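
Because NDJSON stores one JSON object per line, an uncompressed trace (mode 2) can be streamed with the standard library alone. The sketch below counts records per warp. The `warp` and `pc` field names and the sample records are assumptions for illustration; check them against a real trace before relying on them.

```python
import io
import json
from collections import Counter

def iter_ndjson(lines):
    """Yield one dict per non-empty line of an NDJSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Synthetic records; field names are assumed, not taken from CUTracer.
sample = io.StringIO(
    '{"warp": 24, "pc": 16}\n'
    '{"warp": 24, "pc": 32}\n'
    '{"warp": 7, "pc": 16}\n'
)
per_warp = Counter(rec["warp"] for rec in iter_ndjson(sample))
print(per_warp[24])  # 2
```

With a real file, pass an open file object instead of the `StringIO` sample. A `.ndjson.zst` trace needs to be decompressed first, for example with `zstd -d`.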


🐍 Python Package

Available on PyPI:

pip install cutracer

Features

Validation:

  • JSON syntax and schema validation
  • Text format validation
  • Cross-format consistency checking
  • Transparent Zstd compression handling

Analysis:

  • TraceReader: Stream trace records from NDJSON files
  • StreamingGrouper: Memory-efficient grouped analysis
  • WarpSummary: Warp execution status for hang analysis (completed/in-progress/missing)
  • Multi-format output: table, JSON, CSV

CLI Tools:

# Validate trace files
cutraceross validate trace.ndjson
cutraceross validate trace.ndjson.zst --verbose

# Analyze trace data
cutraceross analyze trace.ndjson --head 20
cutraceross analyze trace.ndjson --filter "warp=24"
cutraceross analyze trace.ndjson --group-by warp --count

⚙️ Configuration Reference

| Variable | Description | Default |
| --- | --- | --- |
| CUTRACER_INSTRUMENT | Instrumentation modes (comma-separated) | (none) |
| CUTRACER_ANALYSIS | Analysis types (comma-separated) | (none) |
| KERNEL_FILTERS | Kernel name filters (substring match) | (none) |
| INSTR_BEGIN / INSTR_END | Instruction index range filter | 0 / UINT32_MAX |
| TRACE_FORMAT_NDJSON | Output format (0/1/2) | 1 |
| CUTRACER_ZSTD_LEVEL | Zstd compression level | 22 |
| CUTRACER_DELAY_NS | Delay value in nanoseconds | 0 (disabled) |
| CUTRACER_DELAY_DUMP_PATH | Export delay config JSON | (none) |
| CUTRACER_DELAY_LOAD_PATH | Load delay config for replay | (none) |
| TOOL_VERBOSE | Verbosity level (0/1/2) | 0 |
| CUTRACER_DUMP_CUBIN | Dump cubin files | 0 |

📋 Requirements

  • CUDA Toolkit: A version supported by NVBit
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package

⚠️ API Stability Notice

This is the initial public release (v0.1.0). APIs and configuration options may change in future versions as we gather feedback and iterate on the design.

Known Limitations

  • Instruction histogram requires clock instruction boundaries (e.g., Triton pl.scope())
  • Nested regions not supported for instruction histogram analysis

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.


📄 License

  • MIT License - Meta Platforms, Inc. contributions
  • BSD-3-Clause License - NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.


🔗 Links