# CUDA on Colab

This notebook, based on an example from Nvidia, shows how to check the GPU status of your Colab notebook, check out a GitHub repository containing your C++ code, compile it using either g++ for the CPU or nvcc for the GPU, and run it.

Profiling is covered only briefly, with nvprof, at the end of the notebook.

Author: Evelyn Mitchell
Source Repository: https://github.com/evelynmitchell/cuda-on-colab
Date: 2023-12-04

The nvidia-smi CLI reports the status of your GPU. Sample outputs for different GPU types follow.

In [None]:
!nvidia-smi

A100 GPU
```

```

V100 GPU
```
Mon Dec  4 18:42:36 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

T4 GPU
```
Mon Dec  4 18:40:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

# libcuda Driver

If you install some libraries from source, such as Triton, you may lose access to the libcuda driver library that is already installed in Colab. This can happen when you uninstall the packaged Triton in order to install it from source. Following the diagnosis and fix in [6], we will find out whether the library is installed, then make sure it is on the shared library search path.

The problem shows up as an error like:
```
libcuda.so cannot found
```

To check whether the CUDA driver library is available, run:
```
!ldconfig -p |grep libcuda
```
If the driver library is missing from the linker cache, the result looks like this:
```
libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
```
Note that ```libcuda.so``` is not listed; only the CUDA runtime library ```libcudart.so``` is.

[6] (https://github.com/pytorch/pytorch/issues/107960#issuecomment-1709589190)

In [None]:
!ldconfig -p | grep libcuda

To find the path to ```libcuda.so```, run:
```
find /usr -name 'libcuda.so'
```
Which should output something similar to:
```
/usr/local/cuda-11.8/compat/libcuda.so
/usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/lib64-nvidia/libcuda.so
```
The version numbers may be different.

In [None]:
!find /usr -name 'libcuda.so'

We have the same issue as in [6], in that the ```stubs``` path is incorrect, so we apply the same fix: add the directory ```/usr/lib64-nvidia```, which contains ```libcuda.so```, to the shared library cache with:
```
ldconfig /usr/lib64-nvidia
```

In [None]:
!ldconfig /usr/lib64-nvidia

Then verify that ```libcuda.so``` now shows up in the list of shared libraries:

In [None]:
!ldconfig -p | grep libcuda

The output should now include ```libcuda.so```:
```
	libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
	libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
	libcudadebugger.so.1 (libc6,x86-64) => /usr/lib64-nvidia/libcudadebugger.so.1
	libcuda.so.1 (libc6,x86-64) => /usr/lib64-nvidia/libcuda.so.1
	libcuda.so (libc6,x86-64) => /usr/lib64-nvidia/libcuda.so
```
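
As a complement to the `ldconfig` check, a program can ask the dynamic loader directly whether it is able to load the driver library. The following is a minimal sketch, not part of the original notebook: the helper name `can_load_library` is hypothetical, and it assumes a Linux system where `dlopen` from `<dlfcn.h>` is available (older glibc versions need `-ldl` at link time).

```cpp
#include <dlfcn.h>  // dlopen, dlclose (POSIX dynamic loading)

// Hypothetical helper: returns true if the dynamic loader can find and
// load the named shared library using its normal search path, which
// includes the cache that ldconfig maintains.
bool can_load_library(const char *name)
{
  void *handle = dlopen(name, RTLD_LAZY);
  if (handle == nullptr)
    return false;
  dlclose(handle);
  return true;
}
```

Calling `can_load_library("libcuda.so.1")` from a small `main` and printing the result gives a quick yes/no answer. On a runtime without a GPU driver it is expected to return false.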

# C++ for CUDA
Install the C++ build toolchain. It should already be available on Colab.

In [None]:
!apt-get install -y build-essential

Nvidia's C++ compiler for the GPU is called nvcc. It is already installed on Colab, as is build-essential, which provides g++.

In [None]:
!nvcc --version

## Get the code
This notebook shows the files inline. You can also check out the repository containing the C++ files to compile:
```
!git clone https://github.com/evelynmitchell/cuda-on-colab
```

The following simple C++ example adds the elements of two arrays on the CPU, without using the GPU.

In [None]:
%%file /tmp/simple.cpp
#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

## Build the code for CPU

In [None]:
# Compile the code checked out from the repository to
# create the binary ./simple in the current directory,
# add the executable bit, and then run it.
# !g++ /content/cuda-on-colab/src/simple.cpp -o simple
# !chmod +x ./simple
# !./simple

# compile the code in the cell to create the binary /tmp/simple, 
#  and then run it.
! g++ /tmp/simple.cpp -o /tmp/simple && /tmp/simple
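
Since the GPU versions later in this notebook are profiled, it can be useful to have a CPU baseline to compare against. The sketch below is an illustration rather than part of the original example: it times the same `add` loop with `std::chrono::steady_clock`. The helper name `time_add_ms` is hypothetical, and the measured time depends on the machine and the compiler flags.

```cpp
#include <chrono>
#include <vector>

// Same element-wise addition as in simple.cpp
void add(int n, const float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

// Hypothetical helper: fills x with 1.0f and y with 2.0f, runs add()
// once on n elements, and returns the elapsed wall-clock time in
// milliseconds. The result is left in y so the caller can check it.
double time_add_ms(int n, std::vector<float> &x, std::vector<float> &y)
{
  x.assign(n, 1.0f);
  y.assign(n, 2.0f);

  auto start = std::chrono::steady_clock::now();
  add(n, x.data(), y.data());
  auto stop = std::chrono::steady_clock::now();

  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A small `main` could call `time_add_ms(1<<20, x, y)` with two empty vectors and print the returned value. Only the call to `add` is timed, not the initialization.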



## Compile to a CUDA kernel

Adding the ```__global__``` specifier to a function tells nvcc to compile it as a CUDA kernel: a function that runs on the GPU and is launched from host code.

This version fails to compile because the kernel is called like an ordinary function. The error and the fix follow this section.

In [None]:
%%file /tmp/simple_cuda.cu
#include <iostream>
#include <math.h>

// CUDA Kernel function to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Deliberately wrong: the kernel is called like an ordinary function,
  // without launch parameters, so nvcc rejects this line
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

In [None]:
# from the repository
# !nvcc /content/cuda-on-colab/src/simple_cuda.cu -o simple_cuda && ./simple_cuda

# from the cell
! nvcc /tmp/simple_cuda.cu -o /tmp/simple_cuda && /tmp/simple_cuda

## Configure kernel launch

The error from the previous compilation, ```__global__ function call must be configured```, is corrected by adding the kernel launch parameters ```<<<gridsize, blocksize>>>``` to the function call.

In [None]:
%%file /tmp/simple_cuda_kernel_launch.cu
#include <iostream>
#include <math.h>

// CUDA Kernel function to add the elements of two arrays on the GPU

__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

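  // Allocate Unified Memory, accessible from both CPU and GPU.
  // Host pointers returned by new are generally not valid on the GPU.
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  // <<< (gridsize), (blocksize) >>>
  // <<<1,1>>> means 1 block with 1 thread
  add<<<1,1>>>(N, x, y);

  // Kernel launches are asynchronous: wait for the GPU to finish
  // before reading the results on the host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);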

  return 0;
}

In [None]:
# from the repository
# !nvcc /content/cuda-on-colab/src/simple_cuda_kernel_launch.cu -o simple_cuda_kernel_launch

# from the cell
!nvcc /tmp/simple_cuda_kernel_launch.cu -o /tmp/simple_cuda_kernel_launch && /tmp/simple_cuda_kernel_launch
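
The two launch parameters multiply: a launch of `<<<gridsize, blocksize>>>` starts `gridsize * blocksize` threads in total. The host-side arithmetic can be sketched in plain C++ as below. This is an illustration that goes slightly beyond the notebook, which only uses a single block, and the helper names `total_threads` and `blocks_needed` are hypothetical. `blocks_needed` uses the common round-up division, so the grid covers every element even when `n` is not a multiple of the block size.

```cpp
// Total number of threads started by a <<<grid_size, block_size>>> launch.
constexpr long long total_threads(int grid_size, int block_size)
{
  return static_cast<long long>(grid_size) * block_size;
}

// Smallest number of blocks of block_size threads that gives at least
// one thread per element for n elements (integer division, rounded up).
constexpr int blocks_needed(int n, int block_size)
{
  return (n + block_size - 1) / block_size;
}

// Compile-time checks of the arithmetic
static_assert(total_threads(1, 1) == 1, "<<<1,1>>> is a single thread");
static_assert(total_threads(1, 256) == 256, "<<<1,256>>> is 256 threads");
static_assert(blocks_needed(1 << 20, 256) == 4096, "1M elements / 256");
static_assert(blocks_needed(1000, 256) == 4, "partial last block");
```

Because the helpers are `constexpr`, the `static_assert` lines are verified by the compiler, so the file only needs to compile for the checks to pass.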


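## Configure kernel threads

Launching with ```<<<1,256>>>``` runs the kernel on one block of 256 threads. For the threads to share the work, each one has to process a different subset of the elements; otherwise every thread repeats the whole loop.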


In [None]:
%%file /tmp/simple_cuda_kernel_threads.cu
#include <iostream>
#include <math.h>

// CUDA Kernel function to add the elements of two arrays on the GPU

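__global__
void add(int n, float *x, float *y)
{
  // Each thread starts at its own index within the block and strides
  // by the block size, so the threads split the array between them
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  // Allocate Unified Memory, accessible from both CPU and GPU.
  // Host pointers returned by new are generally not valid on the GPU.
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  // <<< (gridsize), (blocksize) >>>
  // <<<1,256>>> means 1 block with 256 threads
  // "CUDA GPUs run kernels using blocks of threads that are a multiple of
  // 32 in size, so 256 threads is a reasonable size to choose."
  add<<<1,256>>>(N, x, y);

  // Kernel launches are asynchronous: wait for the GPU to finish
  // before reading the results on the host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);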

  return 0;
}

In [None]:
# from the repository
# !nvcc /content/cuda-on-colab/src/simple_cuda_kernel_threads.cu -o simple_cuda_kernel_threads

# from the cell
!nvcc /tmp/simple_cuda_kernel_threads.cu -o /tmp/simple_cuda_kernel_threads && /tmp/simple_cuda_kernel_threads
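
To see how a block of threads can split the loop between them, the sketch below emulates the indexing on the CPU in plain C++. It is an illustration under stated assumptions, not CUDA code: `thread_idx` and `block_dim` are ordinary parameters standing in for CUDA's built-in `threadIdx.x` and `blockDim.x`, and the function names are hypothetical. Each emulated thread starts at its own index and advances by the block size, so every element is processed exactly once.

```cpp
#include <vector>

// Work done by one emulated thread: start at thread_idx and step by
// block_dim, mirroring a strided loop inside a CUDA kernel.
void add_one_thread(int thread_idx, int block_dim,
                    int n, const float *x, float *y)
{
  for (int i = thread_idx; i < n; i += block_dim)
    y[i] = x[i] + y[i];
}

// Emulate a single block by running its threads one after another.
// On a GPU the threads would run in parallel instead.
void add_one_block(int block_dim, int n, const float *x, float *y)
{
  for (int t = 0; t < block_dim; t++)
    add_one_thread(t, block_dim, n, x, y);
}

// Hypothetical check: returns true if every element equals expected.
bool all_equal(const std::vector<float> &v, float expected)
{
  for (float value : v)
    if (value != expected)
      return false;
  return true;
}
```

With `x` filled with 1.0f and `y` with 2.0f, calling `add_one_block(256, n, x.data(), y.data())` should leave every element of `y` equal to 3.0f. If each emulated thread looped over all `n` elements instead, every element would be incremented 256 times.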


## Profile the CUDA code

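nvprof is Nvidia's command-line profiler for CUDA code. In its default summary mode it reports the time spent in each kernel and in each CUDA API call. Note that nvprof does not support GPUs with compute capability 8.0 or higher, such as the A100, where Nvidia's Nsight tools take its place.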

In [None]:
# %cd /content/cuda-on-colab to run from the repository
# %cd /tmp to run from the cell
!nvprof /tmp/simple_cuda_kernel_launch

## Memory profiling

nvprof can also report memory activity. First we compile an example that
allocates its arrays in Unified Memory with cudaMallocManaged, then we profile it.

In [None]:
%%file /tmp/simple_cuda_memory_alloc.cu
#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

In [None]:
# from the repository
# !nvcc /content/cuda-on-colab/src/simple_cuda_memory_alloc.cu -o simple_cuda_memory_alloc

# from the cell
!nvcc /tmp/simple_cuda_memory_alloc.cu -o /tmp/simple_cuda_memory_alloc && /tmp/simple_cuda_memory_alloc 


In [None]:
# run the executable built in /tmp with nvprof
!nvprof /tmp/simple_cuda_memory_alloc
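
When reading the memory figures, it helps to know how much the example allocates. As a rough worked calculation (a sketch with hypothetical helper names, assuming a 4-byte `float` as on Colab's x86-64 hosts): each array holds `1<<20` floats, so the two `cudaMallocManaged` calls request 4 MiB each, 8 MiB in total.

```cpp
#include <cstddef>

// Bytes requested for one array of n floats, as in N*sizeof(float).
constexpr std::size_t array_bytes(std::size_t n)
{
  return n * sizeof(float);
}

// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::size_t bytes)
{
  return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For comparison, the T4 output of nvidia-smi shown earlier reports 15360 MiB of GPU memory.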