# Introduction to GPU Programming with Python
## Intro to CUDA

With Numba, we compile a Python function for the GPU by decorating it with `numba.cuda.jit`.

Questions
* What is CUDA?

Objectives
* Learn CUDA terminology
* Learn the CUDA programming model
* Understand what the CUDA kernel is
* Understand the CUDA block-threading model


### CUDA terminology
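Before we jump into CUDA with Python, let's go over the CUDA terminology and the main execution concept. The *host* is the CPU together with its memory, and the *device* is the GPU together with its memory: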

![](images/host_device.png)

### CUDA Programming Model
1. Copy data from CPU memory to GPU memory (remember: the CPU and the GPU are physically separate and each has its own memory)
![](images/cuda_model1.png)

2. Load the GPU program and execute it
![](images/cuda_model2.png)

3. Copy the results from GPU memory back to CPU memory (so that you can print them, analyze them, etc.)
![](images/cuda_model3.png)
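
The three steps above can be sketched with Numba as follows. This is a minimal sketch, not a definitive implementation: the kernel `double_elements`, the wrapper `run_on_gpu`, and the array size are hypothetical names and values chosen for illustration, and the example assumes that NumPy and Numba are installed. If they are missing, or if no CUDA-capable GPU is found, the wrapper returns `None` instead of launching anything.

```python
def run_on_gpu(n=1024, threads_per_block=128):
    """Run the three-step CUDA workflow; return None when no GPU is usable."""
    try:
        import numpy as np
        from numba import cuda
    except ImportError:
        return None  # NumPy or Numba is not installed
    if not cuda.is_available():
        return None  # no CUDA driver or device found

    @cuda.jit
    def double_elements(data):
        i = cuda.grid(1)      # global index of this thread in the 1D grid
        if i < data.size:     # threads past the end of the array do nothing
            data[i] *= 2.0

    host_data = np.arange(n, dtype=np.float32)

    # Step 1: copy data from CPU (host) memory to GPU (device) memory
    device_data = cuda.to_device(host_data)

    # Step 2: load the GPU program (the kernel) and execute it
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    double_elements[blocks_per_grid, threads_per_block](device_data)

    # Step 3: copy the results from GPU memory back to CPU memory
    return device_data.copy_to_host()


result = run_on_gpu()
if result is None:
    print("No usable CUDA GPU found; nothing was run.")
else:
    print(result[:4])
```

The imports and the kernel definition live inside the wrapper only so that the script degrades gracefully on a machine without a GPU; in a GPU-only script they would normally sit at module level.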

#### CUDA kernel
We have been talking about CUDA kernels, but what is a CUDA kernel?
![](images/cuda_kernel.png)

In CUDA, the work is divided across a grid of threads, and a kernel is the program that each of those threads executes independently.

This differs from a CPU program, where we have to spell out every operation and every loop explicitly.

Let's look at the matrix addition example.
In the CPU implementation we would loop over all the elements of matrix A:

![](images/matrix_cpu.png)

In the GPU implementation, each thread computes a single element, so the explicit loop disappears:

![](images/matrix_gpu2.png)
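
To make the contrast concrete, here is a minimal sketch in plain Python (no GPU required). `add_cpu` loops over every element explicitly, while `add_kernel` contains only the work for the single element at position `(row, col)`, the way a kernel body does. The `launch` helper is only a sequential stand-in for the GPU starting one thread per element; all function names here are hypothetical.

```python
def add_cpu(a, b):
    """CPU style: the program itself loops over every element."""
    rows, cols = len(a), len(a[0])
    c = [[0] * cols for _ in range(rows)]
    for row in range(rows):
        for col in range(cols):
            c[row][col] = a[row][col] + b[row][col]
    return c


def add_kernel(a, b, c, row, col):
    """Kernel style: the work of ONE thread, for ONE element. No loops."""
    c[row][col] = a[row][col] + b[row][col]


def launch(kernel, a, b):
    """Sequential stand-in for a GPU launch: one 'thread' per element.

    On a real GPU these calls would run in parallel, in no particular order.
    """
    rows, cols = len(a), len(a[0])
    c = [[0] * cols for _ in range(rows)]
    for row in range(rows):
        for col in range(cols):
            kernel(a, b, c, row, col)
    return c


A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]
print(add_cpu(A, B))             # [[11, 22, 33], [44, 55, 66]]
print(launch(add_kernel, A, B))  # same result, computed element by element
```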

Unfortunately, there is another layer of complexity: threads are grouped into blocks, and blocks are grouped into a grid.

![](images/cuda_block_grid2.png)

### CUDA Execution Model
Thread Layout:
* Threads are organized into blocks
* Blocks are organized into a grid
* Each block runs on a single streaming multiprocessor (SM); an SM can hold several blocks at the same time, up to a hardware limit

Each thread uses IDs to decide what data to work on:
* Block IDs (e.g. `blockIdx.x`, `blockIdx.y`)
* Thread IDs (e.g. `threadIdx.x`, `threadIdx.y`)

This model simplifies memory addressing when processing multidimensional data.
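
As an illustration of how these IDs select data, the sketch below emulates a 1D launch in plain Python and computes the global index that each thread would derive from its block ID, the block size, and its thread ID. The argument names mirror CUDA's `blockIdx.x`, `blockDim.x` (the number of threads per block), and `threadIdx.x`; the helper `global_index` and the sizes are illustrative assumptions.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Position of a thread within the whole 1D grid."""
    return block_idx * block_dim + thread_idx


n = 10           # number of data elements
block_dim = 4    # threads per block
grid_dim = 3     # blocks in the grid: 3 * 4 = 12 threads cover 10 elements

data = list(range(100, 100 + n))
touched = []
for block_idx in range(grid_dim):
    for thread_idx in range(block_dim):
        i = global_index(block_idx, block_dim, thread_idx)
        if i < n:    # the last two threads have no element to work on
            touched.append((block_idx, thread_idx, i, data[i]))

for entry in touched:
    print("block %d, thread %d -> element %d (value %d)" % entry)
```

Inside a Numba kernel, `cuda.grid(1)` returns this same global index, so the arithmetic rarely has to be written by hand.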

The program executed on those GPU threads is called a kernel.
The CUDA programmer is responsible for setting up the grid and tuning it for better performance.

### Choosing the block size

* On the software side, the block size determines how many threads share a given area of shared memory.
* On the hardware side, the block size must be large enough to keep the execution units fully occupied.

The block size you choose depends on:
* The size of the data array
* The size of the shared memory per block (e.g. 64 KB)
* The maximum number of threads per block supported by the hardware (e.g. 512 or 1024)
* The maximum number of threads per multiprocessor (MP) (e.g. 2048)
* The maximum number of blocks per MP (e.g. 32)
* The number of threads that are scheduled and executed together (a “warp”, i.e. 32 threads)

Rules of thumb for threads per block:

* It should be a multiple of the warp size (32).
* A good place to start is 128-512, but benchmarking is required to determine the optimal value.
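
The arithmetic behind these rules can be sketched in plain Python. The helpers below are hypothetical, and the hardware limits are the example figures quoted in the list above (warp size 32, at most 1024 threads per block, 2048 threads and 32 blocks per multiprocessor), not values queried from a real device. The occupancy estimate is deliberately simplified: it ignores shared memory and register usage.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024
MAX_THREADS_PER_MP = 2048
MAX_BLOCKS_PER_MP = 32


def blocks_per_grid(n_elements, threads_per_block):
    """Smallest number of blocks that gives every element its own thread."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def resident_blocks_per_mp(threads_per_block):
    """Blocks that fit on one multiprocessor, by thread and block limits only."""
    return min(MAX_THREADS_PER_MP // threads_per_block, MAX_BLOCKS_PER_MP)


n = 1_000_000
for threads_per_block in (32, 128, 256, 512, 1024):
    assert threads_per_block % WARP_SIZE == 0
    assert threads_per_block <= MAX_THREADS_PER_BLOCK
    blocks = blocks_per_grid(n, threads_per_block)
    resident = resident_blocks_per_mp(threads_per_block)
    active = resident * threads_per_block
    print(f"{threads_per_block:5d} threads/block: {blocks:6d} blocks, "
          f"{resident:2d} resident blocks/MP, {active} active threads/MP")
```

With these example limits, a block size of 32 leaves a multiprocessor half full (32 blocks of 32 threads is 1024 of the 2048 possible threads), while 128 and above can fill it. That is one reason to start in the 128-512 range and then benchmark.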


## Key points
* **Host and Device**
    * The Device (GPU) cannot work without the Host (CPU)
    * Both Host and Device have their own memory
* **Kernel and Device functions**
    * A Kernel is called from the Host and runs on the Device
    * A Device function is called from the Device
* **Threads, Blocks, Grids**
    * They can be 1D, 2D, or 3D, depending on the dimensionality of the data
    * Threads and blocks have IDs, which are used to select the portion of the data to work on
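
The difference between a kernel and a device function, together with a 2D thread layout, can be sketched with Numba as follows. This is a minimal sketch under assumptions: the names `square`, `square_matrix`, and `run_example` are hypothetical, NumPy and Numba are assumed to be installed, and the wrapper returns `None` when they are missing or no CUDA-capable GPU is found.

```python
def run_example():
    """Return the kernel's result, or None when no CUDA GPU is usable."""
    try:
        import numpy as np
        from numba import cuda
    except ImportError:
        return None  # NumPy or Numba is not installed
    if not cuda.is_available():
        return None  # no CUDA driver or device found

    @cuda.jit(device=True)
    def square(x):
        # Device function: can only be called from code running on the GPU
        return x * x

    @cuda.jit
    def square_matrix(matrix):
        # Kernel: launched from the host, here on a 2D grid of threads
        row, col = cuda.grid(2)
        if row < matrix.shape[0] and col < matrix.shape[1]:
            matrix[row, col] = square(matrix[row, col])

    host_matrix = np.arange(12, dtype=np.float32).reshape(3, 4)
    device_matrix = cuda.to_device(host_matrix)

    threads_per_block = (16, 16)  # 2D block of 256 threads, a multiple of 32
    blocks_per_grid = (1, 1)      # one block is enough to cover 3 x 4 elements
    square_matrix[blocks_per_grid, threads_per_block](device_matrix)
    return device_matrix.copy_to_host()


out = run_example()
if out is None:
    print("No usable CUDA GPU found; nothing was run.")
else:
    print(out)
```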