## Numba CUDA intro

> **GITHUB: https://github.com/daniel-vera-g/numba-cuda-intro**

## Short summary of CUDA components

1. host: CPU
2. device: GPU
3. host memory: system main memory
4. device memory: onboard memory on a GPU card
5. kernel: a GPU function launched by the host and executed on the device
6. device function: a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function)

![](http://upload.wikimedia.org/wikipedia/commons/thumb/5/59/CUDA_processing_flow_%28En%29.PNG/450px-CUDA_processing_flow_%28En%29.PNG)

### CUDA structure

A grid contains blocks with threads:

![](https://www.researchgate.net/profile/Omar-Bouattane/publication/321666991/figure/fig2/AS:572931245260800@1513608861931/Figure-2-Execution-model-of-a-CUDA-program-on-NVidias-GPU-Hierarchy-grid-blocks-and.png)
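The index arithmetic behind this hierarchy can be sketched in plain Python, without touching the GPU. The helper names `launch_config` and `global_index` are hypothetical, and the block size of 128 is just an example value. The sketch shows how a 1D grid is sized to cover `n` elements and how each thread derives its global index from its block index, the block size, and its index within the block.

```python
import math


def launch_config(n_elements, threads_per_block=128):
    # Round up so that the grid covers every element.
    blocks_per_grid = math.ceil(n_elements / threads_per_block)
    return blocks_per_grid, threads_per_block


def global_index(block_idx, block_dim, thread_idx):
    # Same arithmetic a thread performs on the device in a 1D grid.
    return block_idx * block_dim + thread_idx


n = 1000
blocks, threads = launch_config(n, 128)

# Enumerate every thread in the grid (host-side illustration only).
indices = [global_index(b, threads, t)
           for b in range(blocks)
           for t in range(threads)]

# Threads whose index is >= n have no element to work on and must do nothing.
active = [i for i in indices if i < n]
print(blocks, len(indices), len(active))
```

Because the grid is rounded up to whole blocks, some threads end up past the end of the data, which is why kernels usually start with a bounds check.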

## Use CUDA with Numba

**NOTE**: If you don't have a CUDA-capable GPU, you can enable the _simulator_ by starting JupyterLab with the environment variable `NUMBA_ENABLE_CUDASIM=1`:

> `NUMBA_ENABLE_CUDASIM=1 jupyter-lab`

To learn more about how to use the simulator for debugging, see the [cuda-debugger](./cuda-debugger.ipynb) section.

```python
from numba import cuda

# Lists the detected GPUs; this will probably fail if you don't have a GPU.
print(cuda.gpus)
```

Output:

```
<Managed Device 0>
```


### CUDA Python rules

When a function is decorated with `@cuda.jit`, Numba's just-in-time compiler creates an optimized version of the code that is executed on the GPU.

The compiled code follows the [Single instruction, multiple threads](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads) (SIMT) model used by CUDA, in which every instruction is executed by many threads in parallel. Under this model, the array expressions that are common in NumPy code should be avoided inside a kernel: every thread would repeat the same whole-array operation instead of sharing the work, which defeats the goal of a performance boost over the CPU. Instead, each thread should handle its own part of the data.

### Not supported operations

For the reasons explained above, some Python features are not supported in CUDA kernel code. The most important ones to be aware of are:

1. **Generators** that are constructed with the `yield` statement
2. **Comprehensions**, such as _list, dict, set or generator_ comprehensions, which are common in functional-style Python code
3. **Exception handling** with `try...except` or `try...finally`


For a more detailed list, consult the [Numba documentation](https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html#constructs).

### Support of NumPy operations

Dynamic memory allocation inside a kernel is inefficient: it causes many additional reads and writes to global memory, which greatly reduces performance. Numba therefore does not allow memory-allocating features in kernel code.

As a result, NumPy features such as array creation, array methods, and functions that return a new array are not supported inside kernels.

<!--- ## Table of contents -->

<!--- - A crash course over the theory: [CUDA components](#Short-summary-of-CUDA-components) -->
<!--- - Use CUDA in with Numba: [CUDA with Numba](#Use-CUDA-with-Numba) -->

## Continue here...

> You're here 👉: [Basics](./numba_cuda_tutorial.ipynb)

**CUDA concepts in Numba:**

- What Kernels are and how they work: [Kernels](kernels.ipynb)
- How to manage memory when doing operations: [Memory management](memory-management.ipynb)
- How to debug Numba code: [Debugging](./cuda-debugger.ipynb)
- Other useful CUDA features in Numba: [Other Numba features](./other-cuda-features.ipynb)