<a href="https://colab.research.google.com/github/anvibhagavathula/CNN-Kernel/blob/main/4_gemm_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GEMM on GPU

## 1. Set-up

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Make sure your token is stored in a txt file at the location below.
# This way there is no risk of pushing it to your repo.
# Never share your token with anyone; it is effectively your GitHub password!
with open('/content/gdrive/MyDrive/ece5545/token.txt') as f:
    token = f.readline().strip()
# Use another file to store your github username
with open('/content/gdrive/MyDrive/ece5545/git_username.txt') as f:
    handle = f.readline().strip()

In [None]:
# Clone your github repo
YOUR_TOKEN = token
YOUR_HANDLE = handle
BRANCH = "main"

%mkdir /content/gdrive/MyDrive/ece5545
%cd /content/gdrive/MyDrive/ece5545
!git clone https://{YOUR_TOKEN}@github.com/ML-HW-SYS/a3-{YOUR_HANDLE}.git
%cd /content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}
!git checkout {BRANCH}
!git pull
%cd /content/gdrive/MyDrive/ece5545

PROJECT_ROOT = f"/content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}"

mkdir: cannot create directory ‘/content/gdrive/MyDrive/ece5545’: File exists
/content/gdrive/MyDrive/ece5545
fatal: destination path 'a3-anvibhagavathula' already exists and is not an empty directory.
/content/gdrive/MyDrive/ece5545/a3-anvibhagavathula
M	src/ops.py
M	tests/test_dwsp_2dconv_gpu.py
Already on 'main'
Your branch is behind 'origin/main' by 18 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 2), reused 4 (delta 2), pack-reused 0
Unpacking objects: 100% (4/4), 425 bytes | 0 bytes/s, done.
From https://github.com/ML-HW-SYS/a3-anvibhagavathula
   9ffe26d..dc711be  main       -> origin/main
Updating 3a2aa83..dc711be
error: Your local changes to the following files would be overwritten by merge:
	src/ops.py
	tests/test_dwsp_2dconv_gpu.py
Please commit your changes or stash them befor

In [None]:
# This extension reloads all imports before running each cell
%load_ext autoreload
%autoreload 2

In [None]:
!ls {PROJECT_ROOT}

1-conv1d_cpu.ipynb  3-conv1d_fpga.ipynb  5-conv2d_dw_gpu.ipynb	README.md  tests
2-conv1d_gpu.ipynb  4-gemm_gpu.ipynb	 leaderboard_id.txt	src


## 2. Install TVM

In [None]:
!pip install tlcpack-nightly-cu102 -f https://tlcpack.ai/wheels

Looking in links: https://tlcpack.ai/wheels
Collecting tlcpack-nightly-cu102
  Downloading https://github.com/tlc-pack/tlcpack/releases/download/v0.12.dev/tlcpack_nightly_cu102-0.15.dev118%2Bg51bdaec6e-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (428.6 MB)
Installing collected packages: tlcpack-nightly-cu102
Successfully installed tlcpack-nightly-cu102-0.15.dev118+g51bdaec6e


## 3. Implement the `make_gemm_gpu_scheduler` function in `src.ops`

In that function, you are required to implement general matrix multiplication (GEMM) and use TVM to optimize it.
Let $A \in \mathbb{R}^{M \times K}$ and $W \in \mathbb{R}^{K \times N}$, then
$$
\operatorname{gemm}(A, W)_{ij} = \sum_{k=0}^{K-1} A[i, k] \, W[k, j], \quad \forall i \in \{0, \dots, M-1\},\ j \in \{0, \dots, N-1\}
$$

The reference result is computed with the NumPy matrix product; please see its documentation for more detail: [link](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html).

`make_gemm_gpu_scheduler` takes $M$, $K$, and $N$, which are the dimensions of the two 2D input arrays.
You should return both the TVM scheduler and the TVM operators for
1. Input $A$
2. Input $W$
3. Output $B$

The scheduler should be usable to build a function with signature $func(A, W, B)$.
Please see the following cells for usage.
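As a point of reference for what the GEMM kernel built in the following cells (`make_gemm_gpu_scheduler`) has to compute, the matrix product can be written as a plain triple loop. The sketch below is pure Python and only illustrates the definition; `gemm_reference` and the small inputs are hypothetical names chosen for this example and are not part of `src.ops`.

```python
def gemm_reference(a, w):
    """Naive GEMM on lists of lists: a is M x K, w is K x N, result is M x N."""
    m, k, n = len(a), len(w), len(w[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):          # rows of the output
        for j in range(n):      # columns of the output
            acc = 0.0
            for p in range(k):  # reduction over the shared dimension
                acc += a[i][p] * w[p][j]
            out[i][j] = acc
    return out


# Hypothetical 2 x 3 and 3 x 2 inputs, small enough to check by hand.
a_small = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w_small = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
b_small = gemm_reference(a_small, w_small)
print(b_small)
```

The two outer loops are independent across `(i, j)`, which is what a GPU schedule can spread over blocks and threads; only the innermost reduction carries a dependency.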

In [None]:
import tvm
import numpy as np
import sys
# Adding assignment 3 to the system path
# Make sure this matches your git directory
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_gemm_gpu_scheduler

M = 1024
N = 512
K = 2048
dtype = 'float32'
a_np = np.random.rand(M, K).astype(dtype)
w_np = np.random.rand(K, N).astype(dtype)
b_np = np.matmul(a_np, w_np)

s, A, W, B = make_gemm_gpu_scheduler(M, K, N)
func = tvm.build(s, [A, W, B], "cuda")

dev = tvm.cuda(0)
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((M, N), dtype), dev)
func(a, w, b)
# Time a single run of the compiled kernel (number=1, repeat=1)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Answer:", b_np)
print("Output:", b)
print("Matmul TVM: %f ms" % (evaluator(a, w, b).mean * 1e3))

Answer: [[504.48987 502.16168 501.24298 ... 501.7519  500.59567 503.5177 ]
 [521.4901  529.48444 516.503   ... 519.0668  511.2707  509.10162]
 [508.5849  511.44553 503.95245 ... 509.95572 500.98517 504.83356]
 ...
 [520.79333 520.29065 519.69934 ... 513.336   511.75153 513.4515 ]
 [511.9951  517.05914 507.03625 ... 518.98175 513.7041  505.92166]
 [517.14667 520.35077 504.18915 ... 509.2376  516.76984 511.081  ]]
Output: [[504.4899  502.16248 501.24332 ... 501.75116 500.5963  503.51743]
 [521.4898  529.4843  516.5035  ... 519.06635 511.27127 509.10147]
 [508.58475 511.44574 503.95206 ... 509.95605 500.98486 504.83395]
 ...
 [520.7934  520.2907  519.6992  ... 513.3356  511.75177 513.4513 ]
 [511.99506 517.0588  507.03586 ... 518.98206 513.70374 505.92203]
 [517.1471  520.35095 504.1891  ... 509.23767 516.77014 511.08044]]
Matmul TVM: 6.375487 ms
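The printed "Answer" and "Output" agree only to about six significant digits, most likely because the GPU kernel and NumPy accumulate the float32 sums in a different order. A comparison with a relative tolerance is therefore more meaningful than exact equality. The sketch below uses three values copied from the first row of each printout; the helper `all_close` and the tolerance `1e-4` are assumptions made for illustration, not values taken from the assignment's tests.

```python
import math

# First three entries of the first row, copied from the printed output above.
answer = [504.48987, 502.16168, 501.24298]  # np.matmul result
output = [504.4899, 502.16248, 501.24332]   # TVM kernel result


def all_close(xs, ys, rel_tol=1e-4):
    """True if the sequences have equal length and every pair is within rel_tol."""
    return len(xs) == len(ys) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(xs, ys)
    )


max_rel_err = max(abs(x - y) / abs(x) for x, y in zip(answer, output))

print("exactly equal:", answer == output)
print("close within 1e-4:", all_close(answer, output))
print("largest relative error:", max_rel_err)
```

On the full arrays, `np.testing.assert_allclose` with an `rtol` argument would play the same role.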


In [None]:
# Calculating the Numpy implementation runtime
import numpy as np
import time

M = 1024
N = 512
K = 2048
dtype = 'float32'
a_np = np.random.rand(M, K).astype(dtype)
w_np = np.random.rand(K, N).astype(dtype)

# perf_counter is a monotonic, high-resolution clock suited to short intervals
start_time = time.perf_counter()
b_np = np.matmul(a_np, w_np)
end_time = time.perf_counter()

# Calculate elapsed time
elapsed_time_ms = (end_time - start_time) * 1000

print("(Numpy baseline):", elapsed_time_ms, "ms")

(Numpy baseline): 27.416229248046875 ms
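To put the two timings on a common scale, they can be converted to throughput. A GEMM of these shapes performs one multiply and one add per `(i, j, k)` triple, so `2 * M * N * K` floating-point operations in total. The sketch below plugs in the two times printed above. Both are single-run measurements, and the NumPy run is on the CPU with freshly generated inputs, so the resulting ratio is a rough indication only. The helper name `gflops` is hypothetical.

```python
M, N, K = 1024, 512, 2048
flops = 2 * M * N * K  # one multiply and one add per (i, j, k) triple

tvm_ms = 6.375487               # from the "Matmul TVM" line above
numpy_ms = 27.416229248046875   # from the "(Numpy baseline)" line above


def gflops(total_flops, elapsed_ms):
    """Convert an operation count and a time in milliseconds to GFLOP/s."""
    return total_flops / (elapsed_ms * 1e-3) / 1e9


print(f"TVM   : {gflops(flops, tvm_ms):.1f} GFLOP/s")
print(f"NumPy : {gflops(flops, numpy_ms):.1f} GFLOP/s")
print(f"speedup: {numpy_ms / tvm_ms:.2f}x")
```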


In [None]:
print(tvm.lower(s, [A, W, B], simple_mode=True))

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((1024, 2048), "float32"), B: T.Buffer((2048, 512), "float32"), C: T.Buffer((1024, 512), "float32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        blockIdx_y = T.launch_thread("blockIdx.y", 256)
        blockIdx_x = T.launch_thread("blockIdx.x", 128)
        C_1 = T.Buffer((524288,), data=C.data)
        for x_inner_init, y_inner_init in T.grid(4, 4):
            C_1[blockIdx_y * 2048 + x_inner_init * 512 + blockIdx_x * 4 + y_inner_init] = T.float32(0)
        for k_outer, k_inner, x_inner, y_inner in T.grid(512, 4, 4, 4):
            A_1 = T.Buffer((2097152,), data=A.data)
            B_1 = T.Buffer((1048576,), data=B.data)
            C_1[blockIdx_y * 2048 + x_inner * 512 + blockIdx_x * 4 + y_inner] = C_1[blockIdx_y * 2048 + x_inner * 512 + blockIdx_x * 4 + y_inner] + A_1[blockIdx_y * 8192 + x_in
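The index expressions in the lowered IR can be read as a 4 x 4 output tile per block: `blockIdx.y` (extent 256) selects a group of four rows, `blockIdx.x` (extent 128) selects a group of four columns, and `k_outer` x `k_inner` (512 x 4) walks the 2048-long reduction. The sketch below is a plain-Python check of that reading, not TVM code. It confirms that the flat offset used for `C_1` equals `row * 512 + col` and that every output element is written by exactly one `(block, inner)` combination. The names `TILE`, `GRID_Y`, `GRID_X`, and `seen` are introduced here for illustration.

```python
M, N, TILE = 1024, 512, 4
GRID_Y, GRID_X = 256, 128  # extents of blockIdx.y and blockIdx.x in the IR

seen = bytearray(M * N)  # how many times each flat output offset is produced
for block_y in range(GRID_Y):
    for block_x in range(GRID_X):
        for x_inner in range(TILE):
            for y_inner in range(TILE):
                # Offset exactly as it appears in the lowered IR.
                flat = block_y * 2048 + x_inner * 512 + block_x * 4 + y_inner
                # The same offset expressed as (row, col) in a row-major 1024 x 512 array.
                row = block_y * TILE + x_inner
                col = block_x * TILE + y_inner
                assert flat == row * N + col
                seen[flat] += 1

covered_once = all(count == 1 for count in seen)
print("every element covered exactly once:", covered_once)
```

Because the IR shows only block indices and no `threadIdx` bindings in the visible portion, each block appears to compute its 16 outputs serially.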

In [None]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_gemm_gpu.py

/content/gdrive/MyDrive/ece5545/a3-anvibhagavathula
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.4.0
rootdir: /content/gdrive/MyDrive/ece5545/a3-anvibhagavathula
plugins: anyio-3.7.1
collected 20 items

tests/test_gemm_gpu.py ....................                                                  [100%]

