# **CUDA Libraries**

In [None]:
import os
os.environ["PATH"] += ":/usr/local/cuda-12.6/bin"

# Verify nvcc is now accessible
!nvcc --version

## **1. cuBLAS - Basic Linear Algebra Subprograms Library**

### **a. Matrix-Vector Multiplication**

```cpp
cublasSgemv(
    handle,           // cuBLAS library context
    CUBLAS_OP_N,     // Operation: N=no transpose, T=transpose, C=conjugate transpose
    M,               // Number of rows in matrix A
    N,               // Number of columns in matrix A
    &alpha,          // Scalar multiplier for A*x
    d_A,             // Device pointer to matrix A
    M,               // Leading dimension of A (column-major: stride between columns, >= M)
    d_x,             // Device pointer to vector x
    1,               // Stride between elements in x
    &beta,           // Scalar multiplier for y
    d_y,             // Device pointer to vector y
    1                // Stride between elements in y
);
```
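cuBLAS assumes column-major storage, so element (i, j) of `A` sits at `A[i + j * lda]`. A minimal host-side sketch of what the call above computes in the `CUBLAS_OP_N` case can be handy for checking GPU output. `gemv_ref` is a hypothetical helper name, not part of cuBLAS, and the use of `std::vector<float>` is an assumption for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not a cuBLAS call) for
//   y = alpha * A * x + beta * y
// where A is M x N and stored column-major with leading dimension lda.
void gemv_ref(int M, int N, float alpha,
              const std::vector<float>& A, int lda,
              const std::vector<float>& x,
              float beta, std::vector<float>& y) {
    for (int i = 0; i < M; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < N; ++j) {
            // Column j starts at offset j * lda; row i is i elements into it.
            acc += A[i + static_cast<std::size_t>(j) * lda] * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}
```

Comparing `d_y` (copied back to the host) against this reference within a small tolerance is one way to catch a row-major/column-major mix-up.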

[./cuBLAS/matrix_vector_mul.cu](./cuBLAS/matrix_vector_mul.cu)

In [None]:
!nvcc -g -G -lcublas -o ./cuBLAS/matrix_vector_mul ./cuBLAS/matrix_vector_mul.cu
!./cuBLAS/matrix_vector_mul

### **b. Matrix-Matrix Multiplication**

```cpp
cublasSgemm(
   handle,           // cuBLAS library context
   CUBLAS_OP_N,     // Operation for matrix A: N=no transpose, T=transpose, C=conjugate transpose
   CUBLAS_OP_N,     // Operation for matrix B
   M,               // Rows of output matrix C and op(A)
   N,               // Columns of output matrix C and op(B)
   K,               // Columns of op(A) and rows of op(B)
   &alpha,          // Scalar multiplier for A*B
   d_A,             // Device pointer to matrix A
   M,               // Leading dimension of A (column-major: stride between columns, >= M)
   d_B,             // Device pointer to matrix B
   K,               // Leading dimension of B
   &beta,           // Scalar multiplier for C
   d_C,             // Device pointer to matrix C
   M                // Leading dimension of C
);
```
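The leading dimensions above (`M`, `K`, `M`) follow from column-major storage with no transposition. As a hedged sketch, the same computation on the host could look like this; `gemm_ref` is a hypothetical name and the `std::vector<float>` containers are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not a cuBLAS call) for
//   C = alpha * A * B + beta * C
// with A (M x K), B (K x N) and C (M x N) all stored column-major.
void gemm_ref(int M, int N, int K, float alpha,
              const std::vector<float>& A, int lda,
              const std::vector<float>& B, int ldb,
              float beta, std::vector<float>& C, int ldc) {
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[i + static_cast<std::size_t>(k) * lda] *
                       B[k + static_cast<std::size_t>(j) * ldb];
            }
            const std::size_t idx = i + static_cast<std::size_t>(j) * ldc;
            C[idx] = alpha * acc + beta * C[idx];
        }
    }
}
```

If the host data is row-major, cuBLAS sees each buffer as the transpose of what was intended, so the operands or the operation flags need adjusting before results will match a reference like this one.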

[./cuBLAS/matrix_matrix_mul.cu](./cuBLAS/matrix_matrix_mul.cu)

In [None]:
!nvcc -g -G -lcublas -o ./cuBLAS/matrix_matrix_mul ./cuBLAS/matrix_matrix_mul.cu
!./cuBLAS/matrix_matrix_mul

### **c. Batched Matrix Multiplication**

```cpp
cublasSgemmBatched(
   handle,           // cuBLAS library context
   CUBLAS_OP_N,     // Operation for matrices A: N=no transpose, T=transpose
   CUBLAS_OP_N,     // Operation for matrices B
   M,               // Rows of output matrices C and op(A)
   N,               // Columns of output matrices C and op(B)
   K,               // Columns of op(A) and rows of op(B)
   &alpha,          // Scalar multiplier for A*B
   d_Aarray,        // Array of pointers to device matrices A
   M,               // Leading dimension of A matrices
   d_Barray,        // Array of pointers to device matrices B
   K,               // Leading dimension of B matrices
   &beta,           // Scalar multiplier for C
   d_Carray,        // Array of pointers to device matrices C
   M,               // Leading dimension of C matrices
   BATCH            // Number of matrices in batch
);
```
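`d_Aarray`, `d_Barray` and `d_Carray` are arrays of pointers, one per matrix, and the arrays themselves must live in device memory. When the matrices are packed back to back in one allocation, the pointer table can be built on the host with plain pointer arithmetic. The sketch below assumes that packed layout; `make_batch_pointers` is a hypothetical helper, not a cuBLAS function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns one pointer per matrix for `batch` matrices
// packed contiguously, each `stride` elements apart. `base` is never
// dereferenced here, so it may be a device pointer.
std::vector<float*> make_batch_pointers(float* base, std::size_t stride, int batch) {
    std::vector<float*> ptrs(batch);
    for (int b = 0; b < batch; ++b) {
        ptrs[b] = base + static_cast<std::size_t>(b) * stride;
    }
    return ptrs;
}
```

In real use the returned table would still need to be copied to the device (for example with `cudaMemcpy`) before being passed to the call. For evenly spaced matrices, `cublasSgemmStridedBatched` takes a base pointer and stride directly and avoids the table.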

[./cuBLAS/matrix_batched_mul.cu](./cuBLAS/matrix_batched_mul.cu)

In [None]:
!nvcc -g -G -lcublas -o ./cuBLAS/matrix_batched_mul ./cuBLAS/matrix_batched_mul.cu
!./cuBLAS/matrix_batched_mul

In [None]:
!make SRC=./cuBLAS/matrix_vector_mul.cu clean
!make SRC=./cuBLAS/matrix_matrix_mul.cu clean
!make SRC=./cuBLAS/matrix_batched_mul.cu clean

## **2. cuDNN - Deep Neural Networks Library**

### **a. Convolution Layer**

```cpp
cudnnConvolutionForward(
    cudnn,                 // cuDNN handle
    &alpha,                // Scaling factor applied to the convolution result
    input_descriptor,      // Input tensor descriptor
    d_input,              // Input data pointer
    kernel_descriptor,     // Filter descriptor
    d_kernel,             // Filter data pointer
    convolution_descriptor, // Convolution descriptor
    algo_perf[0].algo,    // Convolution algorithm
    workspace,            // Workspace memory pointer
    workspace_size,       // Workspace size in bytes
    &beta,                // Scaling factor applied to the existing output values (0 overwrites them)
    output_descriptor,     // Output tensor descriptor
    d_output              // Output data pointer
);
```
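The output descriptor has to match the size the convolution actually produces. cuDNN can report it through `cudnnGetConvolution2dForwardOutputDim`; the sketch below reproduces the documented per-dimension formula on the host so the expected shape can be checked by hand. `conv_out_dim` is a hypothetical helper name.

```cpp
// Hypothetical helper (not a cuDNN call): output length along one spatial
// dimension of a convolution, using integer division as in the cuDNN docs:
//   out = 1 + (in + 2*pad - ((kernel - 1)*dilation + 1)) / stride
int conv_out_dim(int in, int kernel, int pad, int stride, int dilation) {
    const int effective_kernel = (kernel - 1) * dilation + 1;
    return 1 + (in + 2 * pad - effective_kernel) / stride;
}
```

For example, a 3-wide kernel with padding 1, stride 1 and dilation 1 keeps a 5-wide input at width 5.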

[./cuDNN/convolution_layer.cu](./cuDNN/convolution_layer.cu)

In [None]:
!nvcc -g -G -lcudnn -lcuda -lcudart -o ./cuDNN/convolution_layer ./cuDNN/convolution_layer.cu
!./cuDNN/convolution_layer

### **b. ReLU Activation Layer**

```cpp
cudnnActivationBackward(
    cudnn,
    activation_descriptor,    // Activation function parameters
    &alpha,                  // Scaling factor applied to the computed gradient
    tensor_descriptor,       // Output tensor descriptor (forward pass)
    d_output,               // Output from forward pass
    tensor_descriptor,       // Gradient tensor descriptor
    d_dy,                   // Incoming gradient
    tensor_descriptor,       // Input tensor descriptor (forward pass)
    d_input,                // Input from forward pass
    &beta,                  // Scaling factor applied to the existing contents of d_dx
    tensor_descriptor,       // Output gradient tensor descriptor
    d_dx                    // Output gradient
);
```
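For ReLU, the backward pass lets the incoming gradient through wherever the forward input was positive and zeroes it elsewhere. A minimal host-side sketch, assuming that convention (gradient 0 at exactly x = 0) and the usual `alpha`/`beta` blending, is shown below. `relu_backward_ref` is a hypothetical name, not a cuDNN function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not a cuDNN call) for the ReLU backward pass:
//   dx[i] = alpha * (x[i] > 0 ? dy[i] : 0) + beta * dx[i]
void relu_backward_ref(float alpha,
                       const std::vector<float>& x,   // forward-pass input
                       const std::vector<float>& dy,  // incoming gradient
                       float beta,
                       std::vector<float>& dx) {      // gradient w.r.t. x
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float grad = x[i] > 0.0f ? dy[i] : 0.0f;
        dx[i] = alpha * grad + beta * dx[i];
    }
}
```

With `alpha = 1` and `beta = 0`, `dx` is simply a masked copy of `dy`.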

[./cuDNN/relu_activation.cu](./cuDNN/relu_activation.cu)

In [None]:
!nvcc -g -G -lcudnn -o ./cuDNN/relu_activation ./cuDNN/relu_activation.cu
!./cuDNN/relu_activation

In [None]:
!make SRC=./cuDNN/convolution_layer.cu clean
!make SRC=./cuDNN/relu_activation.cu clean

## **3. cuFFT - Fast Fourier Transform Library**

### **a. 1D FFT - Real to Complex**

```cpp
cufftExecR2C(
    plan,      // FFT plan handle
    d_input,   // Input array (real)
    d_output   // Output array (complex, N/2 + 1 elements for an N-point transform)
);
```

[./cuFFT/fft_1d_r2c.cu](./cuFFT/fft_1d_r2c.cu)

In [None]:
!nvcc -g -G -lcufft -o ./cuFFT/fft_1d_r2c ./cuFFT/fft_1d_r2c.cu
!./cuFFT/fft_1d_r2c

### **b. 1D FFT - Complex to Complex**

```cpp
cufftExecC2C(
    plan,           // FFT plan handle
    d_input,        // Input array (complex)
    d_output,       // Output array (complex)
    CUFFT_FORWARD   // Transform direction (CUFFT_FORWARD or CUFFT_INVERSE)
);
```
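cuFFT transforms are unnormalized: a forward transform followed by an inverse one returns the input scaled by N. The naive O(N²) DFT below is a host-side sketch that follows the same sign convention (-1 forward, +1 inverse) and can serve as a reference for small inputs. `dft_ref` and the `cplx` alias are hypothetical names, and double precision is an assumption made to keep the comparison tolerance tight.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical CPU reference (not a cuFFT call): unnormalized DFT.
// sign = -1 gives the forward transform, sign = +1 the inverse,
// and neither direction applies a 1/N factor.
std::vector<cplx> dft_ref(const std::vector<cplx>& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>(k * t) / static_cast<double>(n);
            acc += in[t] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = acc;
    }
    return out;
}
```

For real-valued input the result satisfies X[N-k] = conj(X[k]), which is why a real-to-complex transform only needs to store N/2 + 1 outputs.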

[./cuFFT/fft_1d_c2c.cu](./cuFFT/fft_1d_c2c.cu)

In [None]:
!nvcc -g -G -lcufft -o ./cuFFT/fft_1d_c2c ./cuFFT/fft_1d_c2c.cu
!./cuFFT/fft_1d_c2c

In [None]:
!make SRC=./cuFFT/fft_1d_r2c.cu clean
!make SRC=./cuFFT/fft_1d_c2c.cu clean

## **4. Thrust - Parallel Algorithms Library**

### **a. Vector Transform**

```cpp
thrust::transform(
    d_vec.begin(),     // Input iterator - start of input range
    d_vec.end(),       // Input iterator - end of input range
    d_vec.begin(),     // Output iterator - where to write results
    square()           // Function object instance to apply to each element
);
```
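The snippet does not show how `square` is defined. Thrust algorithms mirror the C++ standard library, so a host-only analogue with `std::transform` may help make the shape of the functor concrete. The `int` element type and the functor body below are assumptions, and `squared_in_place` is a hypothetical helper; a functor used on the device would also need its `operator()` marked `__host__ __device__`.

```cpp
#include <algorithm>
#include <vector>

// Assumed definition of the unary functor: returns the square of its argument.
struct square {
    int operator()(int x) const { return x * x; }
};

// Hypothetical host-side analogue of the call above: input and output ranges
// are the same, so every element is overwritten with its square.
std::vector<int> squared_in_place(std::vector<int> v) {
    std::transform(v.begin(), v.end(), v.begin(), square());
    return v;
}
```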

[./thrust/vector_transform.cu](./thrust/vector_transform.cu)

In [None]:
!nvcc -g -G -o ./thrust/vector_transform ./thrust/vector_transform.cu
!./thrust/vector_transform

### **b. Vector Sort**

```cpp
thrust::sort(
    d_vec.begin(),     // Start of device vector
    d_vec.end()        // End of device vector
);
```

[./thrust/vector_sort.cu](./thrust/vector_sort.cu)

In [None]:
!nvcc -g -G -o ./thrust/vector_sort ./thrust/vector_sort.cu
!./thrust/vector_sort

### **c. Vector Reduction Sum**

```cpp
int sum = thrust::reduce(
    d_vec.begin(),         // Start of input range
    d_vec.end(),          // End of input range
    0,                    // Initial value for reduction
    thrust::plus<int>()   // Binary operation for reduction
);
```
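The closest standard-library counterpart is `std::accumulate`, sketched here on the host. One difference worth keeping in mind: `std::accumulate` folds strictly left to right, while a parallel reduction may combine elements in a different order, so the binary operation should be associative and commutative. `sum_ref` is a hypothetical helper and the `int` element type is taken from the snippet above.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical host-side analogue of the reduction above.
// The initial value 0 is an int, so the running total is accumulated as int.
int sum_ref(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0, std::plus<int>());
}
```

The type of the initial value also fixes the accumulator type, so summing floating-point data with an integer `0` would truncate; `0.0f` is the safer initial value in that case.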

[./thrust/vector_reduction_sum.cu](./thrust/vector_reduction_sum.cu)

In [None]:
!nvcc -g -G -o ./thrust/vector_reduction_sum ./thrust/vector_reduction_sum.cu
!./thrust/vector_reduction_sum

### **d. Vector Prefix Scan (Inclusive Sum)**

```cpp
thrust::inclusive_scan(
    d_vec.begin(),     // Input start
    d_vec.end(),       // Input end
    d_vec.begin()      // Output start (in-place)
);
```
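An inclusive scan writes, at each position, the sum of all elements up to and including that position; an exclusive scan (`thrust::exclusive_scan`) leaves the current element out. The host-side sketch below shows both on small inputs. `inclusive_sum` and `exclusive_sum` are hypothetical helper names, and the `int` element type is an assumption.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side analogue of the in-place inclusive scan above.
std::vector<int> inclusive_sum(std::vector<int> v) {
    std::partial_sum(v.begin(), v.end(), v.begin());
    return v;
}

// Exclusive variant for comparison: out[i] is the sum of v[0..i-1].
std::vector<int> exclusive_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    int running = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = running;
        running += v[i];
    }
    return out;
}
```

For the input `{1, 2, 3, 4}` the inclusive result is `{1, 3, 6, 10}` and the exclusive result is `{0, 1, 3, 6}`.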

[./thrust/vector_inclusive_scan_prefix_sum.cu](./thrust/vector_inclusive_scan_prefix_sum.cu)

In [None]:
!nvcc -g -G -o ./thrust/vector_inclusive_scan_prefix_sum ./thrust/vector_inclusive_scan_prefix_sum.cu
!./thrust/vector_inclusive_scan_prefix_sum

### **e. Vector Conditional Copy**

```cpp
auto end = thrust::copy_if(
    d_vec.begin(),         // Input start
    d_vec.end(),          // Input end
    d_result.begin(),     // Output start
    is_even()             // Predicate functor
);
```
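The returned iterator marks one past the last element written, so the output buffer is usually sized for the worst case and trimmed afterwards. A host-only sketch with `std::copy_if` shows the pattern; the body of `is_even` is an assumption (the snippet does not show it), and `keep_even` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Assumed definition of the predicate: true for even integers.
struct is_even {
    bool operator()(int x) const { return x % 2 == 0; }
};

// Hypothetical host-side analogue of the conditional copy above.
std::vector<int> keep_even(const std::vector<int>& v) {
    std::vector<int> result(v.size());  // worst case: every element passes
    auto end = std::copy_if(v.begin(), v.end(), result.begin(), is_even());
    // Trim to the number of elements actually copied.
    result.resize(static_cast<std::size_t>(std::distance(result.begin(), end)));
    return result;
}
```

The relative order of the kept elements is preserved, which matches the stable behaviour of the device version.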

[./thrust/vector_conditional_copy.cu](./thrust/vector_conditional_copy.cu)

In [None]:
!nvcc -g -G -o ./thrust/vector_conditional_copy ./thrust/vector_conditional_copy.cu
!./thrust/vector_conditional_copy

In [None]:
!make SRC=./thrust/vector_transform.cu clean
!make SRC=./thrust/vector_sort.cu clean
!make SRC=./thrust/vector_reduction_sum.cu clean
!make SRC=./thrust/vector_inclusive_scan_prefix_sum.cu clean
!make SRC=./thrust/vector_conditional_copy.cu clean