# **Introduction to the NVIDIA CUDA Toolkit**

NVIDIA's CUDA Toolkit offers a suite of command-line tools essential for developing, debugging, and optimizing GPU-accelerated applications. Below is an overview of key tools and their basic usage:

----
Add the CUDA `bin` directory to `PATH` inside the Jupyter notebook. Even when `nvcc` is found in a terminal, the ipykernel process may not inherit the same `PATH`, so `!nvcc` can fail in a notebook cell.

In [18]:
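import os
# Append the CUDA bin directory (default Linux install location) so "!" shell commands can find nvcc.
os.environ["PATH"] += os.pathsep + "/usr/local/cuda/bin"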
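
Re-running the cell above appends the same directory again each time. A minimal sketch of an idempotent variant follows; the helper name `add_to_path` is hypothetical, and `/usr/local/cuda/bin` is assumed to be where the toolkit is installed.

```python
import os
import shutil

CUDA_BIN = "/usr/local/cuda/bin"  # assumed install location; adjust if yours differs


def add_to_path(directory, env=os.environ):
    """Append directory to PATH only if it is not already listed."""
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.append(directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


add_to_path(CUDA_BIN)
add_to_path(CUDA_BIN)  # second call changes nothing

# shutil.which returns None if nvcc is still not reachable through PATH.
print("nvcc found at:", shutil.which("nvcc"))
```

If the last line prints `None`, the toolkit is probably installed somewhere else and `CUDA_BIN` needs to point there.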

----
## **01 - NVIDIA CUDA Compiler (nvcc)**

Compiles CUDA source files (.cu) into executable programs or object files.

In [19]:
!nvcc -o my_program my_program.cu
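
Without an architecture flag, `nvcc` targets its default architecture, which is why the `cuobjdump` output in section 03 reports `sm_52` even though Nsight Compute in section 07 reports compute capability 8.6 for this GPU. The sketch below builds an `nvcc` command line with an optional `-arch=sm_XY` flag. The helper `nvcc_command` is hypothetical and only assembles the argument list; it does not run the compiler.

```python
def nvcc_command(source, output, compute_capability=None):
    """Build an nvcc argument list, optionally targeting a compute capability such as "8.6"."""
    cmd = ["nvcc"]
    if compute_capability is not None:
        major, minor = compute_capability.split(".")
        cmd.append(f"-arch=sm_{major}{minor}")  # e.g. "8.6" becomes -arch=sm_86
    cmd += ["-o", output, source]
    return cmd


# Same command as the cell above (default architecture).
print(" ".join(nvcc_command("my_program.cu", "my_program")))

# Variant that targets the GPU's own architecture.
print(" ".join(nvcc_command("my_program.cu", "my_program", "8.6")))
```

The resulting list can be passed to `subprocess.run`, or the printed string can be pasted after `!` in a notebook cell.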

----
## **02 - NVIDIA System Management Interface (nvidia-smi)**

Monitors and manages NVIDIA GPU devices, providing information on usage, temperature, and running processes.

In [20]:
!nvidia-smi

Tue Nov 19 13:06:17 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.59                 Driver Version: 556.13         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 ...    On  |   00000000:01:00.0  On |                  N/A |
| N/A   54C    P3             22W /   80W |     385MiB /   6144MiB |     28%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
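
The table above is meant for reading, not parsing. For scripts, `nvidia-smi` also offers a CSV mode through `--query-gpu` and `--format`. The sketch below parses that CSV. The function names and the sample row are illustrative assumptions (the GPU name is a placeholder), and the parser assumes every field is numeric, which may not hold on systems that report `[N/A]`.

```python
import csv
import subprocess

QUERY_FIELDS = ["name", "temperature.gpu", "utilization.gpu", "memory.used", "memory.total"]


def query_gpus():
    """Run nvidia-smi in CSV mode and return its raw output (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpus(csv_text):
    """Turn one CSV row per GPU into a dictionary with integer metrics."""
    gpus = []
    for row in csv.reader(csv_text.strip().splitlines(), skipinitialspace=True):
        name, temp, util, used, total = [field.strip() for field in row]
        gpus.append({
            "name": name,
            "temperature_c": int(temp),
            "utilization_pct": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return gpus


# Sample row with the metrics from the table above; replace with query_gpus() on a GPU machine.
sample = "Example GPU, 54, 28, 385, 6144\n"
for gpu in parse_gpus(sample):
    print(gpu)
```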
                                                

----
## **03 - CUDA Object Dump (cuobjdump)**

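Purpose: Extracts information from CUDA binaries, including the embedded PTX, SASS, and ELF sections. It accepts both standalone cubin files and host executables that contain embedded device code.

Necessity: Useful for checking which architectures an executable was built for and for reading the PTX the compiler generated, which helps with performance optimization and debugging.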

In [21]:
!cuobjdump --dump-ptx my_program


Fatbin elf code:
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin elf code:
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
arch = sm_52
code version = [8,5]
host = linux
compile_size = 64bit
compressed
ptxasOptions = 

//
//
//
//
//
//

.version 8.5
.target sm_52
.address_size 64

//

.visible .entry _Z9vectorAddPKfS0_Pfi(
.param .u64 _Z9vectorAddPKfS0_Pfi_param_0,
.param .u64 _Z9vectorAddPKfS0_Pfi_param_1,
.param .u64 _Z9vectorAddPKfS0_Pfi_param_2,
.param .u32 _Z9vectorAddPKfS0_Pfi_param_3
)
{
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<11>;


ld.param.u64 %rd1, [_Z9vectorAddPKfS0_Pfi_param_0];
ld.param.u64 %rd2, [_Z9vectorAddPKfS0_Pfi_param_1];
ld.param.u64 %rd3, [_Z9vectorAddPKfS0_Pfi_param_2];
ld.param.u32 %r2, [_Z9vectorAddPKfS0_Pfi_param_3];
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r3, %r4, %r5;
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra $L__BB0_2;

cvta.to
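
The visible PTX instructions compute a global thread index and skip threads that fall outside the array. The Python sketch below emulates that logic sequentially. The output is cut off before the kernel body, so the element-wise addition is an assumption based on the kernel name `vectorAdd`, and the function name `vector_add_emulated` is hypothetical.

```python
def vector_add_emulated(a, b, n, grid_dim, block_dim):
    """Sequentially emulate a 1D CUDA launch of grid_dim blocks with block_dim threads each."""
    c = [0.0] * n
    for ctaid in range(grid_dim):        # %ctaid.x: block index
        for tid in range(block_dim):     # %tid.x: thread index within the block
            # mad.lo.s32 %r1, %r3, %r4, %r5  ->  i = ntid.x * ctaid.x + tid.x
            i = block_dim * ctaid + tid
            # setp.ge.s32 %p1, %r1, %r2; @%p1 bra  ->  skip threads with i >= n
            if i >= n:
                continue
            c[i] = a[i] + b[i]           # assumed kernel body (not shown in the truncated output)
    return c


# 2 blocks of 2 threads cover 4 indices; the thread with i == 3 is skipped because n == 3.
print(vector_add_emulated([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], 3, grid_dim=2, block_dim=2))
```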

----
## **04 - Disassemble CUDA binaries (nvdisasm)**

Purpose: Disassembles standalone cubin files into SASS, the GPU's native machine code, in human-readable form. Unlike `cuobjdump`, it does not read host executables, so the source is compiled to a cubin first.

Necessity: Essential for analyzing the low-level assembly code generated by the CUDA compiler, aiding in performance optimization and debugging.

In [22]:
!nvcc -cubin -o my_program.cubin my_program.cu
!nvdisasm my_program.cubin

	.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM52 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM52)"
	.elftype	@"ET_EXEC"


//--------------------- .nv.info                  --------------------------
	.section	.nv.info,"",@"SHT_CUDA_INFO"
	.align	4


	//----- nvinfo : EIATTR_REGCOUNT
	.align		4
        /*0000*/ 	.byte	0x04, 0x2f
        /*0002*/ 	.short	(.L_1 - .L_0)
	.align		4
.L_0:
        /*0004*/ 	.word	index@(_Z9vectorAddPKfS0_Pfi)
        /*0008*/ 	.word	0x00000008


	//----- nvinfo : EIATTR_FRAME_SIZE
	.align		4
.L_1:
        /*000c*/ 	.byte	0x04, 0x11
        /*000e*/ 	.short	(.L_3 - .L_2)
	.align		4
.L_2:
        /*0010*/ 	.word	index@(_Z9vectorAddPKfS0_Pfi)
        /*0014*/ 	.word	0x00000000


	//----- nvinfo : EIATTR_MIN_STACK_SIZE
	.align		4
.L_3:
        /*0018*/ 	.byte	0x04, 0x12
        /*001a*/ 	.short	(.L_5 - .L_4)
	.align		4
.L_4:
        /*001c*/ 	.word	index@(_Z9vectorAddPKfS0_Pfi)
        /*0020*/ 	.word	0x00000000
.L_5:


//--------------------- .nv.info.

----
## **05 - NVIDIA Nsight Systems (nsys)**

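Collects system-wide performance data, such as CPU activity, CUDA API calls, kernel launches, and memory transfers, into a report file that can be opened in the Nsight Systems GUI to guide optimization of CUDA applications.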

In [23]:
!nsys profile ./my_program

Collecting data...
Test PASSED
Generating '/tmp/nsys-report-e057.qdstrm'
Generated:
    /home/darshith2000/learn-cuda/01_introduction/report1.nsys-rep


----
## **06 - NVIDIA Visual Profiler**

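Provides a graphical interface for profiling CUDA applications, offering detailed performance insights. It is a legacy tool; NVIDIA recommends Nsight Systems and Nsight Compute for current GPUs.

In [24]:
# !nvvp  # Not executed here: the notebook kernel cannot open a GUI window.

Run from a terminal with a display, `nvvp` launches the NVIDIA Visual Profiler GUI.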

----
## **07 - NVIDIA Nsight Compute**

A performance analysis tool for CUDA kernels, providing detailed metrics and guidance for optimization.

In [25]:
!ncu ./my_program

==PROF== Connected to process 27802 (/home/darshith2000/learn-cuda/01_introduction/my_program)
==PROF== Profiling "vectorAdd" - 0: 0%....50%....100% - 8 passes
Test PASSED
==PROF== Disconnected from process 27802
[27802] my_program@127.0.0.1
  vectorAdd(const float *, const float *, float *, int) (4, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         6.77
    SM Frequency                    Mhz       882.11
    Elapsed Cycles                cycle         2936
    Memory Throughput                 %         1.36
    DRAM Throughput                   %         1.02
    Duration                         us         3.33
    L1/TEX Cache Throughput           %         7.12
    L2 Cache Throughput               %         1.36
    SM Active C
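
Two of the reported numbers can be cross-checked by hand. The sketch below takes the launch configuration and metrics from the output above and derives the total thread count and an approximate duration. Treating duration as elapsed cycles divided by SM frequency is a simplifying assumption, not a statement of how Nsight Compute defines the metric.

```python
import math

# Values copied from the ncu output above.
grid = (4, 1, 1)          # blocks per grid
block = (256, 1, 1)       # threads per block
elapsed_cycles = 2936
sm_frequency_mhz = 882.11

# Total threads launched = number of blocks * threads per block.
total_threads = math.prod(grid) * math.prod(block)

# 1 MHz is one cycle per microsecond, so cycles / MHz gives microseconds.
duration_us = elapsed_cycles / sm_frequency_mhz

print("total threads:", total_threads)
print("approx. duration (us):", round(duration_us, 2))
```

The derived duration agrees with the reported `Duration` of 3.33 us, which suggests the kernel is far too short to say much about throughput on this GPU.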