A collection of profiling tools, inspired by challenges in profiling large CUDA applications using nvprof
and nvvp
.
A trace analysis tool that can report exposed GPU activities - GPU activites that actually contribute to the overall program execution time.
$ ./openvprof.py --helpUsage: openvprof.py [OPTIONS] COMMAND [ARGS]...
Options:
--debug print debugging messages
--help Show this message and exit.
Commands:
driver-time Show a histogram of driver API times
filter filter file INPUT to contain only records between START and...
kernel-time Show a histogram of kernel times (ns)
list-edges
list-ranges print summary statistics of ranges
list-records
stats
summary
timeline Generate a chrome:://tracing timeline
$ ./openvprof.py summary -r CUDATreeLearner::Train ~/ibmGBT_data/timeline.nvprof
Selected timeslices cover 19.952495698s
Marker Report
=============
<not implemented>
Communication Report
====================
Active communication Time-Slices: 6.124770744s
Exposed communication Time-Slices: 2.315078656s
Active Communication Time-Slices
--------------------------------
cpu-gpu0 6.122501773s
gpu0-cpu 0.002268971s
gpu0-gpu0 0.0s
gpu0-gpu1 0.0s
gpu0-gpu2 0.0s
gpu0-gpu3 0.0s
cpu-gpu1 0.0s
gpu1-cpu 0.0s
gpu1-gpu0 0.0s
gpu1-gpu1 0.0s
gpu1-gpu2 0.0s
gpu1-gpu3 0.0s
cpu-gpu2 0.0s
gpu2-cpu 0.0s
gpu2-gpu0 0.0s
gpu2-gpu1 0.0s
gpu2-gpu2 0.0s
gpu2-gpu3 0.0s
cpu-gpu3 0.0s
gpu3-cpu 0.0s
gpu3-gpu0 0.0s
gpu3-gpu1 0.0s
gpu3-gpu2 0.0s
gpu3-gpu3 0.0s
Exposed communication breakdown
-------------------------------
Runtime Report
==============
Any CUDA Runtime Time-Slices: 5.527960601s
Exposed CUDA Runtime Time-Slices: 0.307504125s
Exposed Runtime by Thread
-------------------------
1917872512 0.307504125s
Any Runtime by Call
-----------------------
cudaMemcpyAsync 3.366746327s
cudaEventSynchronize 2.095184403s
cudaLaunchKernel 0.031538696s
cudaEventRecord 0.025995506s
cudaDeviceSynchronize 0.005472751s
cudaStreamSynchronize 0.00254224s
cudaGetLastError 0.000480678s
Exposed Runtime by Call
-----------------------
cudaMemcpyAsync 0.254882977s
cudaLaunchKernel 0.029222504s
cudaEventSynchronize 0.011957917s
cudaDeviceSynchronize 0.005472751s
cudaEventRecord 0.003424999s
cudaStreamSynchronize 0.00254224s
cudaGetLastError 7.37e-07s
Exposed Runtime Breakdown
-------------------------
(53816, 1917872512, 'cudaMemcpyAsync') 0.254882977s
(53816, 1917872512, 'cudaLaunchKernel') 0.029222504s
(53816, 1917872512, 'cudaEventSynchronize') 0.011957917s
(53816, 1917872512, 'cudaDeviceSynchronize') 0.005472751s
(53816, 1917872512, 'cudaEventRecord') 0.003424999s
(53816, 1917872512, 'cudaStreamSynchronize') 0.00254224s
(53816, 1917872512, 'cudaGetLastError') 7.37e-07s
Any Runtime Breakdown
---------------------
(53816, 1917872512, 'cudaMemcpyAsync') 3.366746327s
(53816, 1917872512, 'cudaEventSynchronize') 2.095184403s
(53816, 1917872512, 'cudaLaunchKernel') 0.031538696s
(53816, 1917872512, 'cudaEventRecord') 0.025995506s
(53816, 1917872512, 'cudaDeviceSynchronize') 0.005472751s
(53816, 1917872512, 'cudaStreamSynchronize') 0.00254224s
(53816, 1917872512, 'cudaGetLastError') 0.000480678s
Kernel Report
=============
Any GPU Kernel Time-Slices: 1.411868259s
Exposed GPU Kernel Time-Slices: 0.001103871s
Active kernel time-slices by GPU
--------------------------------
GPU 0 Kernel Time: 1.411868259s
GPU 1 Kernel Time: 0.0s
GPU 2 Kernel Time: 0.0s
GPU 3 Kernel Time: 0.0s
Active kernel time-slices on GPU 0
-----------------------------------
_Z12histogram256PK6uchar4S1_jPKjjPKfS5_PcPViPfm 0.813565823s
_Z21histogram256_fulldataPK6uchar4S1_jPKjjPKfS5_PcPViPfm 0.598302436s
An open CUDA GPU profiler using CuPTI and Nvidia Management Library.
docker build -f ppc64le.Dockerfile -t openvprof/ppc64le .
docker run -v pwd
/container:/host -it openvprof/ppc64le
mkdir build
cd build
cmake ..
make
Run a CUDA program and produce an output record openvprof.json
LD_PRELOAD=libopenvprof.so ./my-exe
OPENVPROF_LOG_LEVEL
:trace
,debug
,info
,warn
,err
,crit
OPENVPROF_OUTPUT_PATH
: control the output path that can be loaded into chrome://tracing
Convert openvprof.json
to trace.json
python3 scrips/trace.py
Open Chromium or Chrome to chrome://tracing
and load trace.json
Recording:
- Kernel activity
- Memcpy activity
- raw NVLink traffic
- raw PCIe traffic
- Defer writing report to disk with
OPENVPROF_DEFER_IO
env variable
Visualizing:
- logical link traffic visualization
- physical link traffic visualization
- Split or combine different transfer directions on same link
These tools were conceived and partially prototyped at IBM's T.J. Watson Research Center during a fall 2018 internship