## Numba + CUDA on Google Colab

By default, Google Colab cannot run numba + CUDA because two libraries, `libdevice` and `libnvvm.so`, are not found. So we need to make sure that the notebook can find them.

First, we look for these libraries on the system. To do that, we simply run the `find` command to search recursively, starting at the root of the filesystem. The leading exclamation mark tells the Jupyter notebook to pass the line to the Linux shell instead of running it as Python.

In [1]:
!find / -iname 'libdevice'
!find / -iname 'libnvvm.so'


/usr/local/cuda-10.0/nvvm/libdevice
/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so


Then, we point numba to the two libraries by setting the corresponding environment variables:

In [0]:
import os
os.environ['NUMBAPRO_LIBDEVICE'] = "/usr/local/cuda-10.0/nvvm/libdevice"
os.environ['NUMBAPRO_NVVM'] = "/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so"
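
The paths above are tied to the CUDA version installed on the machine (here `cuda-10.0`). As a rough sketch, the search and the setup can be combined in Python so that the version is not hard-coded. The helper names `find_first` and `configure_numba_cuda`, and the default search root `/usr/local`, are assumptions made for this illustration; unlike `find -iname`, the match is case-sensitive.

```python
import os

def find_first(root, name):
    """Return the first path under `root` whose last component is `name`, or None."""
    for dirpath, dirnames, filenames in os.walk(root):
        if name in dirnames or name in filenames:
            return os.path.join(dirpath, name)
    return None

def configure_numba_cuda(root="/usr/local"):
    """Locate libdevice and libnvvm.so under `root` (assumed location) and
    export the two environment variables used in the cell above."""
    libdevice = find_first(root, "libdevice")
    nvvm = find_first(root, "libnvvm.so")
    # only touch the environment if both libraries were found
    if libdevice and nvvm:
        os.environ["NUMBAPRO_LIBDEVICE"] = libdevice
        os.environ["NUMBAPRO_NVVM"] = nvvm
    return libdevice, nvvm
```

Calling `configure_numba_cuda()` returns the two paths it found, or `None` for a library that is missing, so it is easy to check the result before importing numba.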


And we're done! Now let's get started. 

## A first CUDA kernel

Let's begin by implementing a very simple CUDA kernel to compute the square root of each value in an array. First, here is our array:

In [10]:
import numpy as np
a = np.arange(4096,dtype=np.float32)
a

array([0.000e+00, 1.000e+00, 2.000e+00, ..., 4.093e+03, 4.094e+03,
       4.095e+03], dtype=float32)

As we have seen in [part II](https://thedatafrog.com/boost-python-gpu/), and as discussed in the introduction, we can simply use numba's vectorize decorator to compute the square root of all elements in parallel on the GPU: 

In [11]:
import math
from numba import vectorize

@vectorize(['float32(float32)'], target='cuda')
def gpu_sqrt(x):
    return math.sqrt(x)
  
gpu_sqrt(a)

array([ 0.       ,  1.       ,  1.4142135, ..., 63.97656  , 63.98437  ,
       63.992188 ], dtype=float32)

This time, as an exercise, we'll do the same with a custom CUDA kernel. 

We first define our kernel: 

In [0]:
from numba import cuda

@cuda.jit
def gpu_sqrt_kernel(x, out):
  idx = cuda.grid(1)
  out[idx] = math.sqrt(x[idx])

Let's discuss this code in some detail.

We have an input array of 4096 values, so we will use 4096 threads on the GPU. 

Our input and output arrays are one-dimensional, so we will use a one-dimensional *grid* of threads (we will discuss grids in detail in the next section). The call `cuda.grid(1)` returns the unique index of the current thread in the whole grid. With 4096 threads, `idx` will range from 0 to 4095.

Then, we see in the code that each thread is going to deal with a single element of the input array, producing a single element in the output array. 
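
To make the index returned by `cuda.grid(1)` more concrete, here is a small CPU-only sketch that lists the index each thread would get in a one-dimensional launch. For a 1D grid, this index is the block index times the block size, plus the thread index within the block. The function name `global_indices` is made up for this illustration, and the loops run sequentially, whereas on the GPU the threads run in parallel.

```python
def global_indices(blocks_per_grid, threads_per_block):
    """List the value cuda.grid(1) would return for every thread of a 1D launch."""
    indices = []
    for block_idx in range(blocks_per_grid):         # plays the role of cuda.blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of cuda.threadIdx.x
            # threads_per_block plays the role of cuda.blockDim.x
            indices.append(block_idx * threads_per_block + thread_idx)
    return indices

# 32 blocks of 128 threads, the configuration used in the launch cell of this section
idx = global_indices(32, 128)
print(len(idx), idx[0], idx[-1])  # 4096 0 4095
```

Each of the 4096 indices appears exactly once, which is why every thread can safely handle its own element of the array.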

Now, we copy our input array to the GPU device, create an output array on the device with the same shape, and finally launch the kernel: 

In [23]:
# move input data to the device
d_a = cuda.to_device(a)
# create output data on the device
d_out = cuda.device_array_like(d_a)

# we decide to use 32 blocks, each containing 128 threads
blocks_per_grid = 32
threads_per_block = 128
gpu_sqrt_kernel[blocks_per_grid, threads_per_block](d_a, d_out)
# wait for all threads to complete
cuda.synchronize()
# copy the output array back to the host system
# and print it
print(d_out.copy_to_host())

[ 0.         1.         1.4142135 ... 63.97656   63.98437   63.992188 ]


**Exercises:**

- Go back to the previous cell, and try to decrease the number of blocks per grid, or the number of threads per block. 
- Then try to increase the number of blocks per grid, or the number of threads per block.
- Try to remove the `cuda.synchronize()` call.

**Results**: 

- When you reduce the number of threads, either by decreasing the number of blocks per grid or the number of threads per block, some elements are not processed, and the corresponding slots at the end of the output array keep their initial value. Here that value shows up as 0, but `cuda.device_array_like` does not initialize the memory it allocates, so this is not guaranteed.
- If, on the other hand, you increase the number of threads, everything seems to work fine. However, the extra threads access memory past the end of the arrays, which is an error even though we cannot see it. We will see later how to expose this error. Debugging is one of the difficulties of CUDA, as error messages are not always visible.
- Finally, you might have expected that commenting out the call to `cuda.synchronize` would leave the output array only partially filled, or not filled at all. That would be a good guess, since the CPU keeps running the main program (the code in the cell) while the GPU processes the data asynchronously. However, the copy to the host performs an implicit synchronization, so the call to `cuda.synchronize` is not necessary.
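
The first two results come down to simple arithmetic on the launch configuration. The CPU-only sketch below counts, for a given configuration, how many elements are left unprocessed and how many threads get an index beyond the end of the array. The helper names `launch_coverage` and `blocks_needed` are invented for this illustration and are not part of numba.

```python
def launch_coverage(n_elements, blocks_per_grid, threads_per_block):
    """Return (total threads, unprocessed elements, threads indexing out of bounds)."""
    n_threads = blocks_per_grid * threads_per_block
    unprocessed = max(0, n_elements - n_threads)
    out_of_bounds = max(0, n_threads - n_elements)
    return n_threads, unprocessed, out_of_bounds

def blocks_needed(n_elements, threads_per_block):
    """Smallest number of blocks providing at least one thread per element."""
    return (n_elements + threads_per_block - 1) // threads_per_block

print(launch_coverage(4096, 16, 128))  # (2048, 2048, 0): second half of the array untouched
print(launch_coverage(4096, 64, 128))  # (8192, 0, 4096): 4096 threads past the end
print(blocks_needed(4096, 128))        # 32
```

With 4096 elements and 128 threads per block, exactly 32 blocks are needed, which matches the configuration used in the launch cell.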

## Execution configuration: grid, blocks, and threads

In our first example above, we decided to use 4096 threads. These threads were arranged as a one-dimensional grid containing 32 blocks of 128 threads each. That's quite mysterious and in this section, I will try and answer the questions you may have. 

First, what is a grid? We have used a one dimensional grid in our simple example but can a grid have more dimensions? Why would we need that? 

Then, why do we need to arrange threads into blocks within the grid? This is actually driven by the hardware, so we will need to learn a bit more about the GPU architecture to understand that better. 

Finally, how did we come up with the magic numbers 32 and 128? Could we use other values like 16 and 256 (which still total 4096), or 64 and 64? How do we make this choice?



Let's find out which GPU we are using on the Google Colab platform:

In [31]:
cuda.detect()

Found 1 CUDA devices
id 0             b'Tesla T4'                              [SUPPORTED]
                      compute capability: 7.5
                           pci device id: 4
                              pci bus id: 0
Summary:
	1/1 devices are supported


True

We see that it's an NVIDIA [T4](https://www.nvidia.com/en-us/data-center/tesla-t4/), and here is the [white paper with the specs](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf) for the Turing architecture. This paper describes the largest Turing chip, the TU102 GPU (the T4 itself is built on the smaller TU104 chip of the same architecture, with 40 streaming multiprocessors enabled). The full TU102 has:

- 72 streaming multiprocessors
- 64 CUDA cores per multiprocessor
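
As a quick sanity check on these two numbers, the total number of CUDA cores on a full TU102 is simply their product. The variable names are chosen for this illustration only.

```python
sm_count = 72        # streaming multiprocessors on the full TU102
cores_per_sm = 64    # CUDA cores per streaming multiprocessor

total_cores = sm_count * cores_per_sm
print(total_cores)  # 4608
```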

Here is a schematic view of the TU102 GPU: 

![TU102 GPU](tu_102_diagram.jpg)