# Introduction to CUDA Python

*CUDA* is NVIDIA's proprietary parallel computing platform and API, which lets software use GPUs for accelerated general-purpose processing, notably in high-performance computing (HPC). It is a software layer that gives direct access to the GPU, manages data movement between the CPU and the GPU as necessary, and provides a library of APIs for a range of parallel computing needs. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools.

*Numba* is a just-in-time (JIT) compiler for Python functions that exposes a simple interface for accelerating numerically focused code. It is an attractive option for Python programmers who want to GPU-accelerate their applications without writing C/C++ code, especially developers already performing computationally heavy operations on NumPy arrays.

## What is Numba?

* It is a function compiler: a module that turns a Python function into a faster, compiled one.
* It is type-specializing: ordinary Python functions accept generic data types, which makes them flexible but slow, so Numba generates a fast implementation for each set of argument types the function is actually called with.
* It is just-in-time: functions are translated when they are first called, so the compiler knows which argument types you are using. This also lets Numba be used interactively in Jupyter.
* It is numerically focused: support is aimed mainly at numerical data types, and string support is limited.
* CUDA C/C++ is the most common, performant and flexible way to use CUDA, and PyCUDA exposes the entire API from Python. Numba, by contrast, enables massive acceleration, often with limited modification of existing code, while letting developers write directly in Python.
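The type-specialization and first-call behaviour described in the list above can be pictured with a toy dispatcher. This is a plain-Python sketch of the idea, not Numba's implementation: `toy_jit` and its `specializations` dictionary are hypothetical names, and the "compilation" step simply stores the original function.

```python
def toy_jit(func):
    """Toy stand-in for a JIT dispatcher (hypothetical, not Numba's code)."""
    specializations = {}  # maps argument-type tuples to an implementation

    def dispatcher(*args):
        signature = tuple(type(arg).__name__ for arg in args)
        if signature not in specializations:
            # A real JIT compiler would generate machine code for these
            # types here; this toy just reuses the Python function.
            specializations[signature] = func
        return specializations[signature](*args)

    dispatcher.specializations = specializations
    return dispatcher


@toy_jit
def add(a, b):
    return a + b


add(1, 2)      # first call with (int, int): new specialization
add(1.0, 2.0)  # first call with (float, float): another specialization
add(3, 4)      # (int, int) again: reuses the cached specialization
print(list(add.specializations))  # [('int', 'int'), ('float', 'float')]
```

The point of the sketch is the cache keyed on argument types: work is done once per distinct set of types, and later calls with the same types skip it.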

# 1. Compile for the CPU

The Numba compiler is typically enabled by applying a function decorator to a Python function, for example `@jit`:

In [5]:
!pip install numba
from numba import jit
import math

# This is the function decorator syntax and is equivalent to `hypot = jit(hypot)`.
# The Numba compiler is just a function you can call whenever you want!
@jit
def hypot(x, y):
    # Implementation from https://en.wikipedia.org/wiki/Hypot
    x = abs(x);
    y = abs(y);
    t = min(x, y);
    x = max(x, y);
    t = t / x;
    return x * math.sqrt(1+t*t)



In [6]:
hypot(3.0, 4.0)

5.0

Calling the function triggers the compiler, which generates a machine-code implementation of the function for float inputs. Numba also keeps the original Python implementation of the function in the `.py_func` attribute:

In [7]:
hypot.py_func(3.0, 4.0)

5.0
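To check which specializations Numba has compiled so far, a jitted function's `signatures` attribute can be inspected. The sketch below assumes a recent Numba release. `scale` is a made-up function used only for illustration, and the import guard is there only so the snippet degrades gracefully where Numba is not installed.

```python
try:
    from numba import jit
except ImportError:  # Numba not installed: skip the demonstration
    jit = None

compiled_signatures = []

if jit is not None:
    @jit(nopython=True)
    def scale(x, factor):
        # Hypothetical example function
        return x * factor

    scale(2, 3)      # first call with integers compiles one specialization
    scale(2.0, 3.0)  # first call with floats compiles a second one
    compiled_signatures = scale.signatures

# With Numba available, this lists one type tuple per compiled specialization
print(compiled_signatures)
```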

## Benchmarking

An important part of using Numba is measuring the performance of the new code. The `%timeit` magic function measures how long a call takes. In the results below, the Numba-compiled version is faster than the original Python function, but the built-in `math.hypot` is faster still, because Numba adds some overhead to each function call.

In [8]:
%timeit hypot.py_func(3.0, 4.0)
%timeit hypot(3.0, 4.0)
%timeit math.hypot(3.0, 4.0)

502 ns ± 3.43 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
172 ns ± 0.711 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
102 ns ± 0.471 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
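`%timeit` is only available inside IPython and Jupyter. In a plain Python script, a comparable measurement can be sketched with the standard-library `timeit` module. `hypot_py` below is a copy of the pure-Python function above, `ns_per_call` is a hypothetical helper, and the loop and repeat counts are arbitrary choices.

```python
import math
import timeit


def hypot_py(x, y):
    # Same algorithm as the un-jitted hypot above
    x = abs(x)
    y = abs(y)
    t = min(x, y)
    x = max(x, y)
    t = t / x
    return x * math.sqrt(1 + t * t)


def ns_per_call(func, number=100_000, repeat=5):
    """Return the best observed time per call in nanoseconds."""
    timer = timeit.Timer(lambda: func(3.0, 4.0))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


print(f"pure Python: {ns_per_call(hypot_py):.0f} ns per call")
print(f"math.hypot:  {ns_per_call(math.hypot):.0f} ns per call")
```

Unlike `%timeit`, which reports a mean and standard deviation, this helper reports the best of several repeats, and the wrapping `lambda` adds a small constant overhead to every measurement. The numbers are therefore useful for comparing functions with each other rather than as absolute figures.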


# Use Numba to Compile a Function for the CPU

The following function uses the Monte Carlo method to estimate π.
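The estimate rests on an area ratio. As a sketch of the reasoning, let $n$ stand for `nsamples`, let $X$ and $Y$ be the two uniform draws on $[0, 1)$, and let $\text{acc}$ be the number of points that land inside the quarter disc:

$$
P\left(X^2 + Y^2 < 1\right) = \frac{\text{area of quarter unit disc}}{\text{area of unit square}} = \frac{\pi/4}{1} = \frac{\pi}{4},
\qquad
\pi \approx 4 \cdot \frac{\text{acc}}{n}.
$$

The statistical error of this estimate shrinks roughly as $1/\sqrt{n}$, so even with a million samples only the first two or three decimal places can be expected to be stable from run to run.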

In [9]:
nsamples = 1000000
import random


@jit
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [10]:
from numpy import testing

# Check that the compiled and pure-Python versions agree to two decimal places
testing.assert_almost_equal(monte_carlo_pi(nsamples), monte_carlo_pi.py_func(nsamples), decimal=2)

In [11]:
%timeit monte_carlo_pi(nsamples)
%timeit monte_carlo_pi.py_func(nsamples)

28 ms ± 79.6 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
270 ms ± 5.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
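The `%timeit` figures above do not include compilation, since `monte_carlo_pi` had already been called (and therefore compiled) in the assertion cell. A rough way to see the one-off compilation cost is to time the first and second calls separately. This is a sketch under a few assumptions: `estimate_pi` is a fresh copy of the function so that it has not been compiled yet, `timed_call` is a hypothetical helper, and the import guard only exists so the snippet still runs (uncompiled) where Numba is missing.

```python
import random
import time

try:
    from numba import jit
except ImportError:  # fall back to the uncompiled function
    def jit(func):
        return func


@jit
def estimate_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


def timed_call(nsamples):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = estimate_pi(nsamples)
    return result, time.perf_counter() - start


first_result, first_time = timed_call(100_000)    # includes compilation under Numba
second_result, second_time = timed_call(100_000)  # reuses the compiled code
print(f"first call:  {first_time * 1e3:.1f} ms")
print(f"second call: {second_time * 1e3:.1f} ms")
```

With Numba installed, the first call is expected to be noticeably slower than the second. For short-running programs this compilation time can outweigh the speed-up, which is worth keeping in mind when reading benchmark numbers.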


# How does Numba actually work?

We can see the result of type inference by calling the `.inspect_types()` method, which prints an annotated version of the source code.

In [12]:
hypot.inspect_types()

hypot (float64, float64)
--------------------------------------------------------------------------------
# File: /var/folders/gr/j8nwj3f15hgg45p8gkw2c5b00000gn/T/ipykernel_62403/1334060371.py
# --- LINE 7 --- 
# label 0
#   x = arg(0, name=x)  :: float64
#   y = arg(1, name=y)  :: float64

@jit

# --- LINE 8 --- 

def hypot(x, y):

    # --- LINE 9 --- 

    # Implementation from https://en.wikipedia.org/wiki/Hypot

    # --- LINE 10 --- 
    #   $4load_global.0 = global(abs: <built-in function abs>)  :: Function(<built-in function abs>)
    #   x.1 = call $4load_global.0(x, func=$4load_global.0, args=[Var(x, 1334060371.py:7)], kws=(), vararg=None, varkwarg=None, target=None)  :: (float64,) -> float64
    #   del x
    #   del $4load_global.0

    x = abs(x);

    # --- LINE 11 --- 
    #   $34load_global.4 = global(abs: <built-in function abs>)  :: Function(<built-in function abs>)
    #   y.1 = call $34load_global.4(y, func=$34load_global.4, args=[Var(y, 1334060371.py:7)], kws=(), v


# Custom CUDA Kernels in Python with Numba


# Multidimensional Grids and Shared Memory for CUDA Python with Numba