# Lecture 13: Hardware Acceleration Implementation

In this lecture, we will walk through the backend scaffolding that gives Needle hardware acceleration.

**GPU runtime**

In this lecture, we are going to make use of C++ and CUDA to build accelerated linear algebra libraries. In order to do so, please make sure you select a runtime type with GPU:

$$
\verb|Runtime|
\longrightarrow
\verb|Change runtime type|
\longrightarrow
\verb|Hardware accelerator: GPU|
\longrightarrow
\verb|Save|
$$

After you have started the right runtime, you can run the following command to check whether a GPU is available.

In [1]:
!nvidia-smi

Wed Nov  2 11:01:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:06:00.0 Off |                  N/A |
| 30%   37C    P8    16W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 30%   45C    P8    32W / 350W |      5MiB / 24268MiB |      0%      Default |
|       
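
The same check can be scripted from Python. The following is a minimal sketch, assuming the `nvidia-smi` tool is on `PATH` whenever a driver is installed; the helper name `gpu_available` is hypothetical and not part of Needle.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:  # driver tools are not installed or not on PATH
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_available())
```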

## Preparation

To get started, we can clone the related repo from GitHub.

In [None]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p 10714f22
%cd /content/drive/MyDrive/10714f22
# comment out the following line if you run it for the second time
# as you already have a local copy of lecture14
# !git clone https://github.com/dlsyscourse/lecture14 
!ln -s /content/drive/MyDrive/10714f22/lecture14 /content/needle

Mounted at /content/drive
/content/drive/MyDrive
/content/drive/MyDrive/10714f22
Cloning into 'lecture14'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 53 (delta 15), reused 50 (delta 12), pack-reused 0
Unpacking objects: 100% (53/53), done.


In [None]:
!python3 -m pip install pybind11

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pybind11
  Downloading pybind11-2.10.0-py3-none-any.whl (213 kB)
     |████████████████████████████████| 213 kB 5.2 MB/s 
Installing collected packages: pybind11
Successfully installed pybind11-2.10.0


### Build project

We use pybind11 to build a C++/CUDA library for acceleration. Run `make` to build the library.

In [2]:
#%cd /content/needle
!make clean
!make

rm -rf build python/needle/backend_ndarray/ndarray_backend*.so
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python: /home/willyu/anaconda3/envs/10-414/bin/python3.9 (found version "3.9.13") found components: Development Interpreter 
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Found pybind11: /home/willyu/anaconda3/envs/10-414/lib/python3.9/site-packages/pybind11/include (found version "2.10.0")
-- Looking for pthr

We can then run the following commands to add the package path to both `PYTHONPATH` and the `sys.path` of the current session.

In [1]:
%set_env PYTHONPATH ./python:/env/python
import sys
sys.path.append("./python")

env: PYTHONPATH=./python:/env/python


## File organization

Now click the files panel on the left side. You should be able to see these files:

Python:
- `needle/backend_ndarray/ndarray.py`
- `needle/backend_ndarray/ndarray_backend_numpy.py`

C++/CUDA:
- `src/ndarray_backend_cpu.cc`
- `src/ndarray_backend_cuda.cu`

The main goal of this lecture is to create an accelerated NDArray library. As a result, we do not need to deal with `needle.Tensor` for now and will focus on the `backend_ndarray` implementation.

After we build up this array library, we can use it to power backend array computations in Needle.

### Intro

In [2]:
from needle import backend_ndarray as nd

We can create a CUDA tensor from the data by specifying the `device` argument.

In [3]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())

In [4]:
x.device

cuda()

In [5]:
y = x + 1
y

NDArray([2. 3. 4.], device=cuda())

In [6]:
y = x + x
y

NDArray([2. 4. 6.], device=cuda())

`.numpy()` returns a **new** NumPy array on the CPU instead of modifying the tensor in place.

In [7]:
x.numpy()

array([1., 2., 3.], dtype=float32)

In [8]:
x.device

cuda()

### Key Data Structures

Key data structures in `backend_ndarray`:

- `NDArray`: the container that holds a device-specific ndarray
- `BackendDevice`: the backend device
    - `mod` holds the module that implements all functions
    - check out `ndarray_backend_numpy.py` for a Python-side reference
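
To make the role of `mod` concrete, here is a minimal sketch of a device object that forwards unknown attribute lookups to its module. It is an illustration under stated assumptions rather than the code in `ndarray.py`: `toy_mod` and its `scalar_add` function are hypothetical stand-ins for a real backend module.

```python
from types import SimpleNamespace


class BackendDevice:
    """Toy device: a name plus a module-like object that implements the ops."""

    def __init__(self, name, mod):
        self.name = name
        self.mod = mod

    def __repr__(self):
        return self.name + "()"

    def __getattr__(self, attr):
        # Called only when normal lookup fails, so device.<op>
        # falls through to mod.<op>.
        return getattr(self.mod, attr)


# Hypothetical stand-in for a backend module such as ndarray_backend_numpy.
toy_mod = SimpleNamespace(
    scalar_add=lambda data, val: [v + val for v in data]
)

device = BackendDevice("toy", toy_mod)
print(device)                           # toy()
print(device.scalar_add([1, 2, 3], 1))  # [2, 3, 4]
```

With this kind of forwarding, swapping the backend only means constructing the device with a different module.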

## GPU execution trace

Now, let us take a look at what happens when we execute the following code:

In [None]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())
y = x + 1

In [None]:
x.device.from_numpy

<function needle.backend_ndarray.ndarray_backend_cuda.PyCapsule.from_numpy>

In [None]:
x = nd.NDArray([1, 2, 3])

In [None]:
x.device.from_numpy

<function needle.backend_ndarray.ndarray_backend_cuda.PyCapsule.from_numpy>

The addition `x + 1` has the following trace:

backend_ndarray/ndarray.py
- `NDArray.__add__`
- `NDArray.ewise_or_scalar`
- `ndarray_backend_cpu.cc:ScalarAdd`
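
The three layers in this trace can be imitated in plain Python. The sketch below is a toy, not Needle's implementation: `ToyArray`, the `backend` namespace, and the `trace` list are all hypothetical names used to record the order in which the layers are entered.

```python
from types import SimpleNamespace

trace = []  # records the order in which the layers are entered


def ewise_add(a, b):
    trace.append("backend.ewise_add")
    return [x + y for x, y in zip(a, b)]


def scalar_add(a, val):
    trace.append("backend.scalar_add")
    return [x + val for x in a]


# Hypothetical stand-in for the compiled backend module.
backend = SimpleNamespace(ewise_add=ewise_add, scalar_add=scalar_add)


class ToyArray:
    def __init__(self, data):
        self.data = list(data)

    def ewise_or_scalar(self, other, ewise_func, scalar_func):
        # Pick the elementwise or the scalar kernel based on the operand type.
        trace.append("ToyArray.ewise_or_scalar")
        if isinstance(other, ToyArray):
            return ToyArray(ewise_func(self.data, other.data))
        return ToyArray(scalar_func(self.data, other))

    def __add__(self, other):
        trace.append("ToyArray.__add__")
        return self.ewise_or_scalar(other, backend.ewise_add, backend.scalar_add)


y = ToyArray([1, 2, 3]) + 1
print(y.data)  # [2, 3, 4]
print(trace)   # ['ToyArray.__add__', 'ToyArray.ewise_or_scalar', 'backend.scalar_add']
```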

In [None]:
y.numpy()

array([2., 3., 4.], dtype=float32)

The call `y.numpy()` has the following trace:

- `NDArray.numpy`
- `ndarray_backend_cpu.cc:to_numpy`

### Reading C++/CUDA code

Read:
- `src/ndarray_backend_cpu.cc`
- `src/ndarray_backend_cuda.cu`

Optional:
- `CMakeLists.txt`: this sets up the build; you likely do not need to tweak it.

## NDArray

Open `python/needle/backend_ndarray/ndarray.py`.

An NDArray contains the following fields:
- `handle`: the backend handle to a flat array that stores the data
- `shape`: the shape of the NDArray
- `strides`: the strides that determine how we access multi-dimensional elements
- `offset`: the offset of the first element
- `device`: the backend device that backs the computation

### Strided transformation

We can use the strides and offset to perform transformations and slicing with zero copy.

- Broadcast: insert $0$ strides
- Transpose: swap the strides
- Slice: change the offset and shape 
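
The three transformations above share one indexing rule: an element lives at `offset + sum(index[d] * strides[d])` in the flat buffer. The sketch below illustrates this with plain Python lists; the helper names `flat_index` and `read_view` are hypothetical and only read the shared buffer under different `(shape, strides, offset)` descriptions.

```python
import itertools


def flat_index(index, strides, offset=0):
    """Position in the flat buffer of a multi-dimensional index."""
    return offset + sum(i * s for i, s in zip(index, strides))


def read_view(buf, shape, strides, offset=0):
    """List the elements of a strided view in row-major order."""
    return [
        buf[flat_index(idx, strides, offset)]
        for idx in itertools.product(*(range(n) for n in shape))
    ]


buf = [0, 1, 2, 3, 4, 5]  # flat storage shared by every view

print(read_view(buf, (2, 3), (3, 1)))        # [0, 1, 2, 3, 4, 5]
print(read_view(buf, (3, 2), (1, 3)))        # transpose: [0, 3, 1, 4, 2, 5]
print(read_view(buf, (2, 2), (3, 1), 1))     # slice:     [1, 2, 4, 5]
print(read_view(buf, (2, 3, 2), (3, 1, 0)))  # broadcast: [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

None of the views changes `buf`; only the description of how to walk it changes.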

For most computations, however, we call `array.compact()` first to obtain contiguous, aligned memory before running the computation.

In [10]:
import numpy as np
x = nd.NDArray([0, 1, 2, 3, 4, 5],
               device=nd.cpu_numpy())

#### Reshaping

$$
\begin{align*}
\verb|y|[i, j]
&=
\verb|x|[\verb|strides|[0] \times i + \verb|strides|[1] \times j] \\
&=
\verb|x|[3i + \color{gray}{1}j]
\end{align*}
$$

In [16]:
y = nd.NDArray.make(shape=(2,3),
                    strides=(3,1),
                    device=x.device,
                    handle=x._handle,
                    offset=0)
y

NDArray([[0. 1. 2.]
 [3. 4. 5.]], device=cpu_numpy())
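
The strides `(3, 1)` used here are the row-major strides of a contiguous `(2, 3)` array: the last dimension has stride 1, and each earlier stride is the product of the dimension sizes to its right. A minimal sketch of that rule, with a hypothetical helper name `compact_strides`:

```python
def compact_strides(shape):
    """Row-major strides (in elements) for a contiguous array of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


print(compact_strides((2, 3)))     # (3, 1)
print(compact_strides((2, 3, 4)))  # (12, 4, 1)
```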

#### Transpose

$$
\begin{align*}
\verb|z|[i, j]
&=
\verb|x|[\verb|strides|^\mathsf{T}[0] \times i + \verb|strides|^\mathsf{T}[1] \times j] \\
&=
\verb|x|[\color{gray}{1}i + 3j]
\end{align*}
$$

In [17]:
z = nd.NDArray.make(shape=(3,2),
                    strides=(1,3),
                    device=x.device,
                    handle=x._handle,
                    offset=0)
z

NDArray([[0. 3.]
 [1. 4.]
 [2. 5.]], device=cpu_numpy())

#### Slicing

$$
\begin{align*}
\verb|w|[i, j]
&=
\verb|x|[\color{blue}{1} + \verb|strides|[0] \times i + \verb|strides|[1] \times j] \\
&=
\verb|x|[\color{blue}{1} + 3i + \color{gray}{1}j]
\end{align*}
$$

In [14]:
w = nd.NDArray.make(shape=(2,2),
                    strides=(3,1),
                    device=x.device,
                    handle=x._handle,
                    offset=1)
w

NDArray([[1. 2.]
 [4. 5.]], device=cpu_numpy())

#### Broadcasting

$$
\begin{align*}
\verb|b|[i, j, k]
&=
\verb|y|[\verb|strides|[0] \times i + \verb|strides|[1] \times j + \verb|strides|[2] \times k] \\
&=
\verb|y|[3i + \color{gray}{1}j \color{gray}{+ 0k}]
\end{align*}
$$

In [18]:
b = nd.NDArray.make(shape=(2,3,4),
                    strides=(3,1,0),
                    device=y.device,
                    handle=y._handle,
                    offset=0)
b

NDArray([[[0. 0. 0. 0.]
  [1. 1. 1. 1.]
  [2. 2. 2. 2.]]

 [[3. 3. 3. 3.]
  [4. 4. 4. 4.]
  [5. 5. 5. 5.]]], device=cpu_numpy())
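
The broadcast view `b` still refers to only six stored values. Compacting such a view means copying its elements into a fresh buffer in row-major order. The sketch below shows the idea on plain Python lists; it is an assumption-laden illustration of what a `compact()` step has to do, not the backend's implementation, and the function name `compact` here is a standalone helper.

```python
import itertools


def compact(buf, shape, strides, offset=0):
    """Copy a strided view into a new flat list laid out in row-major order."""
    out = []
    for idx in itertools.product(*(range(n) for n in shape)):
        out.append(buf[offset + sum(i * s for i, s in zip(idx, strides))])
    return out


buf = [0, 1, 2, 3, 4, 5]
dense = compact(buf, (2, 3, 4), (3, 1, 0))

print(len(buf), len(dense))  # 6 24
print(dense[:8])             # [0, 0, 0, 0, 1, 1, 1, 1]
```

After compaction the data can be described by ordinary row-major strides, at the cost of materializing all 24 elements.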

## CUDA Acceleration

Open `src/ndarray_backend_cuda.cu` and take a look at the current implementation of the GPU ops.


### Adding operators

- Add an implementation in `ndarray_backend_cuda.cu` and expose it via pybind11
- Call into the operator in `ndarray.py`
- Write test cases
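
The three steps can be rehearsed in plain Python before touching CUDA. The sketch below is a toy under stated assumptions: `ToyArray` and the `backend` namespace are hypothetical, `ewise_mul` mirrors the function name used later in this lecture, and `scalar_mul` is an assumed name for its scalar counterpart.

```python
from types import SimpleNamespace

# Step 1 (stand-in): the backend gains two new functions. In the lecture
# these would be written in CUDA and exposed through pybind11.
backend = SimpleNamespace(
    ewise_mul=lambda a, b: [x * y for x, y in zip(a, b)],
    scalar_mul=lambda a, val: [x * val for x in a],
)


class ToyArray:
    def __init__(self, data):
        self.data = [float(v) for v in data]

    # Step 2: the Python-side operator calls into the backend.
    def __mul__(self, other):
        if isinstance(other, ToyArray):
            return ToyArray(backend.ewise_mul(self.data, other.data))
        return ToyArray(backend.scalar_mul(self.data, other))


# Step 3: a test case that compares against known results.
def test_mul():
    x = ToyArray([1, 2, 3])
    assert (x * 2).data == [2.0, 4.0, 6.0]
    assert (x * x).data == [1.0, 4.0, 9.0]


test_mul()
print((ToyArray([1, 2, 3]) * 2).data)  # [2.0, 4.0, 6.0]
```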

In [24]:
!make

-- Found pybind11: /home/willyu/anaconda3/envs/10-414/lib/python3.9/site-packages/pybind11/include (found version "2.10.0")
-- Found cuda, building cuda backend
Wed Nov  2 13:26:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:06:00.0 Off |                  N/A |
| 30%   40C    P8    23W / 350W |    261MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  

The error below occurs because `ndarray.py`, and with it the compiled backend, was imported before `ewise_mul` existed, so the running kernel still holds the old module. We can either restart the Jupyter kernel or prepare a separate test file.

In [25]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())
x * 2

AttributeError: module 'needle.backend_ndarray.ndarray_backend_cuda' has no attribute 'ewise_mul'

Running a separate `.py` script from the command line starts a new Python session and therefore avoids the issue. This is **common development practice** in large projects that involve a Python/C++ *foreign function interface* (FFI).

In [26]:
!python3 test/test_mul.py

[2. 4. 6.]
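
The effect can be reproduced without Needle. The sketch below uses a temporary pure-Python module, `toy_backend`, as a hypothetical stand-in for the compiled backend: the running interpreter keeps serving the version it imported first, while a fresh interpreter started with `subprocess` sees the newly added function.

```python
import importlib
import subprocess
import sys
import tempfile
from pathlib import Path

OLD = "def ewise_add(a, b):\n    return a + b\n"
NEW = OLD + "\n\ndef ewise_mul(a, b):\n    return a * b\n"
CHECK = (
    "import sys; sys.path.insert(0, sys.argv[1]); import toy_backend; "
    "print(hasattr(toy_backend, 'ewise_mul'))"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_backend.py"  # hypothetical stand-in module
    path.write_text(OLD)

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    first = importlib.import_module("toy_backend")   # loaded before the "rebuild"

    path.write_text(NEW)                             # simulate a rebuild adding ewise_mul
    second = importlib.import_module("toy_backend")  # served from sys.modules
    stale = hasattr(second, "ewise_mul")

    # A new interpreter starts with an empty module cache.
    result = subprocess.run(
        [sys.executable, "-c", CHECK, tmp],
        capture_output=True, text=True, check=True,
    )
    fresh = result.stdout.strip()
    sys.path.remove(tmp)

print(stale)  # False
print(fresh)  # True
```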


## Needle `Tensor` backend

So far we have only played with the `backend_ndarray` (sub)module, which is a self-contained NDArray implementation within Needle.

We can connect NDArray back to Needle as its backend.

In [27]:
import needle as ndl

In [28]:
x = ndl.Tensor([1,2,3], device=ndl.cuda(), dtype="float32")
y = ndl.Tensor([2,3,5], device=ndl.cuda(), dtype="float32")
z = x + y

In [29]:
type(z.cached_data)

needle.backend_ndarray.ndarray.NDArray