<a href="https://colab.research.google.com/github/carlnotsagan/LSST-DSFP-Session15-Materials/blob/main/fields_intro_to_GPUs_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from __future__ import print_function, division, absolute_import 

# Introduction to GPUs (in Python):

By Carl Fields (Los Alamos National Lab)

*This exercise was designed heavily based on tutorials from the [GTC 2018 Conference](https://github.com/ContinuumIO/gtc2018-numba).* 

---

## Problem 0) Creating Our HPC Environment
**Before** beginning, we want to prepare a new environment for the purpose of this exercise.

Be sure to activate our HPC environment from last time:

```linux
$ conda activate hpc
```
---



## Learning Objectives

- Using Numba decorators to speed up algorithms
- Using `timeit` profiler for generic algorithm profiling
- Creating complex algorithms utilizing Numba decorators
- Using `line_profiler` to expose parallelizable regions in complex algorithms
- Creating Numpy Ufuncs
- Exploring shared memory parallelism in Numba
- Characterizing strong scaling 
---

We can check that our installation worked by trying to import the required libraries:

In [None]:
import numpy as np
import math
import matplotlib.pyplot as plt

%matplotlib notebook

## Problem 1) Exploring a basic algorithm targeting the GPU


We want to begin by comparing a native NumPy function, which runs on the CPU, with a Ufunc that targets the GPU.

Let's start by considering the addition of two arrays.

**Problem 1a** Run the following cell to compute the addition of two arrays:

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

np.add(a, b)

array([11, 22, 33, 44])

More information on Numpy Ufuncs can be found [here](https://docs.scipy.org/doc/numpy/reference/ufuncs.html).
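As a brief illustration of why Ufuncs are convenient, the sketch below redefines the `a` and `b` arrays from Problem 1a so that it runs on its own, and shows that a Ufunc also broadcasts across scalars and arrays of different shapes. The names `plus_scalar` and `outer_sum` are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# A ufunc works element by element and broadcasts a scalar across an array.
plus_scalar = np.add(a, 100)

# Broadcasting a (4, 1) column against a (4,) row gives a (4, 4) table of sums.
outer_sum = np.add(b.reshape(4, 1), a)

print(plus_scalar)
print(outer_sum.shape)
```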

Next, we want to explore using Numba to create Ufuncs that target the GPU.

**Problem 1b** Use the `vectorize` decorator in Numba to write a function that adds two arrays. Use the `int64` data type and `target='cuda'`. Note: An overview including some common terminology for CUDA programming can be found [here](https://numba.readthedocs.io/en/stable/cuda/overview.html).

In [None]:
from numba import vectorize

@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    return x + y

Before running we need to be sure we have the hardware we would like to target!

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Check that the GPU was found:

In [None]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0
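Since the Ufuncs in this exercise are compiled by Numba rather than TensorFlow, it may also be useful to ask Numba directly whether it can see the GPU. The following is a minimal sketch, assuming Numba is installed; the flag name `has_gpu` is illustrative.

```python
# Optional cross-check that does not depend on TensorFlow.
try:
    from numba import cuda
    has_gpu = bool(cuda.is_available())
except ImportError:
    # Numba is not installed in this environment.
    has_gpu = False

if has_gpu:
    # Name of the currently selected CUDA device.
    print('Numba found GPU:', cuda.get_current_device().name)
else:
    print('Numba did not find a CUDA-capable GPU')
```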


More information about the available devices can be listed as follows:

In [None]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 2500660274138816220
 xla_global_id: -1, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 14444920832
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 4334664493303736742
 physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
 xla_global_id: 416903419]

**Problem 1c** Run your Ufunc which utilizes the GPU and compare the numerical result to Numpy.

In [None]:
print('a+b:\n', add_ufunc(a, b))
assert np.allclose(np.add(a, b),add_ufunc(a, b))

a+b:
 [11 22 33 44]


Now, let's use our favorite `timeit` magic command to see how much we benefited from targeting the GPU.


**Problem 1d** Use `timeit` to compare the execution time of the default NumPy Ufunc and our new Numba Ufunc.

In [None]:
%timeit np.add(a,b)   # NumPy on CPU

The slowest run took 49.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 5: 506 ns per loop


In [None]:
%timeit add_ufunc(a,b) # Numba on GPU

1000 loops, best of 5: 1.22 ms per loop


The GPU result is... *slower*?

**Problem 1e** Discuss why our GPU result may be slower in this situation and how we can modify the problem to benefit from the GPU.

**Answer** 

Some points to consider: 

- **Our inputs are too small**: The GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 integers each. We need a much larger array to even keep the GPU busy.

- **Our calculation is too simple**: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations per value (often called "arithmetic intensity"), then the GPU will spend most of its time waiting for data to move around.

- **We copy the data to and from the GPU**: While including the copy time can be realistic for a single function, often we want to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.

- **Our data types are larger than necessary**: Our example uses `int64` when we probably don't need it. For values this small, `int32` would halve the amount of data that has to be transferred.
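To make the first and last points concrete, the NumPy-only sketch below compares how many bytes a host-to-device copy would have to move for different array sizes and integer widths. The sizes and the names `n_small` and `n_large` are illustrative assumptions, not values prescribed by the exercise.

```python
import numpy as np

n_small = 4          # size of the arrays used in Problem 1
n_large = 1_000_000  # illustrative "large" size (an assumption)

for n in (n_small, n_large):
    as_int64 = np.arange(n, dtype=np.int64)
    as_int32 = as_int64.astype(np.int32)
    # nbytes is the size of the array buffer, i.e. what a copy to the GPU must move.
    print(n, 'elements:', as_int64.nbytes, 'bytes as int64,',
          as_int32.nbytes, 'bytes as int32')
```

Halving the element width halves the data moved, but only the larger array gives the GPU enough elements to work on in parallel.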

## Problem 2) Exploring a more complex, data-intensive algorithm targeting the GPU

We will now consider a more complex example where we can take some of the necessary steps to make the problem more efficient to run on GPUs. A few of these steps include: 

1. Using native `math` module functions, described [here](https://docs.python.org/3/library/math.html).
2. Using a data type that is no more precise than necessary. Consider using `float32` instead of `float64`.
3. Solving a more complex problem, one with more math operations per element than the addition of two arrays.
4. Precomputing constant values when possible.


**Problem 2a** Take the above steps to define the Ufunc for a Gaussian PDF using Numba `vectorize` again targeting `cuda`: 

 $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left ( \frac{x-\mu}{\sigma} \right )^{2}}$.

 More information on the normal distribution can be found [here](https://en.wikipedia.org/wiki/Normal_distribution).
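Before writing the GPU version, it can help to have a plain-Python reference to check against. The sketch below evaluates the Gaussian PDF with the usual $1/(\sigma\sqrt{2\pi})$ normalization using only the standard `math` module. The names `gaussian_pdf_reference` and `SQRT_2PI_REF` are hypothetical and are not part of the exercise.

```python
import math

SQRT_2PI_REF = math.sqrt(2.0 * math.pi)  # normalization constant, computed once

def gaussian_pdf_reference(x, mean, sigma):
    """Scalar Gaussian PDF, intended only as a CPU reference."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * SQRT_2PI_REF)

# At the mean of a standard normal the density is 1 / sqrt(2 * pi), about 0.3989.
print(gaussian_pdf_reference(0.0, 0.0, 1.0))
```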

In [None]:
import math  # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy

SQRT_2PI = np.float32((2*math.pi)**0.5)  # Precompute this constant as a float32.  Numba will inline it at compile time.

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def gaussian_pdf(x, mean, sigma):
    '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''
    return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)

**Problem 2b** Evaluate our Ufunc a million times with $\mu=0$ and $\sigma=1$. Use `np.random.uniform` to create our `x` array, drawn from the interval [-3, 3]:

In [None]:
# Evaluate the Gaussian a million times!
x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)
mean = np.float32(0.0)
sigma = np.float32(1.0)

# Quick test
gaussian_pdf(x[0], 0.0, 1.0)

array([0.00636183], dtype=float32)

**Problem 2c** Perform the same calculation using SciPy's native `norm` distribution (details [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html)). Time the result using `timeit`.

In [None]:
import scipy.stats # for definition of gaussian distribution
norm_pdf = scipy.stats.norm
%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)

10 loops, best of 5: 37.5 ms per loop


**Problem 2d** Time the result for our GPU Ufunc for comparison:

In [None]:
%timeit gaussian_pdf(x, mean, sigma)

The slowest run took 35.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 5: 5.34 ms per loop
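As a rough sanity check, the timings reported above can be turned into ratios. The numbers below are copied from the cell outputs in this notebook; they will vary from run to run and between GPUs, and the variable names are illustrative.

```python
# Timings copied from the outputs above (best of 5), in seconds.
numpy_add_s = 506e-9    # Problem 1d, NumPy on CPU, 4 elements
gpu_add_s = 1.22e-3     # Problem 1d, Numba on GPU, 4 elements
scipy_pdf_s = 37.5e-3   # Problem 2c, SciPy on CPU, 1e6 elements
gpu_pdf_s = 5.34e-3     # Problem 2d, Numba on GPU, 1e6 elements

slowdown_add = gpu_add_s / numpy_add_s
speedup_pdf = scipy_pdf_s / gpu_pdf_s

print(f'4-element add: GPU is about {slowdown_add:.0f}x slower')
print(f'1e6-element Gaussian: GPU is about {speedup_pdf:.1f}x faster')
```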


**Problem 2e** That's a big improvement!

Discuss some of the overhead costs still associated with this approach.

**Answer**: Copying data to and from the GPU. Each call still transfers the input array from the host to the device and the result back again.
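One way this copy overhead might be reduced is to place the input on the GPU once and keep the output there until it is needed. The sketch below assumes Numba's `cuda.to_device`, `cuda.device_array`, and the `out=` argument of CUDA Ufuncs; the single-argument `std_normal_pdf` ($\mu=0$, $\sigma=1$ fixed) is a hypothetical simplification of the Ufunc above, and a NumPy fallback is included so the block still runs without a GPU.

```python
import math
import numpy as np

try:
    from numba import cuda, vectorize
    use_gpu = bool(cuda.is_available())
except ImportError:
    use_gpu = False

x = np.random.uniform(-3, 3, size=1_000_000).astype(np.float32)
SQRT_2PI = np.float32(math.sqrt(2.0 * math.pi))

if use_gpu:
    @vectorize(['float32(float32)'], target='cuda')
    def std_normal_pdf(x):
        # Hypothetical helper: standard normal PDF (mean 0, sigma 1).
        return math.exp(-0.5 * x * x) / SQRT_2PI

    x_device = cuda.to_device(x)                           # copy the input once
    out_device = cuda.device_array(shape=x.shape, dtype=np.float32)
    std_normal_pdf(x_device, out=out_device)               # result stays on the GPU
    result = out_device.copy_to_host()                     # copy back only when needed
else:
    # CPU fallback so the sketch still runs without a GPU.
    result = np.exp(-0.5 * x * x) / SQRT_2PI

print(result[:3])
```

Timing only the call on `x_device` with `%timeit` would then show the cost of the kernel without the host-to-device transfer of `x`.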

## Problem 3) Memory Management 

## Problem 4) Writing CUDA Kernels


## Problem 5) Matrix multiplication (Optional/Challenge)