# INT8, 32x32 VERSION EXAMPLE

This notebook follows the same structure as `example_basic`, but showcases how to use the accelerator with a different arithmetic (int8 instead of FP16) and array size (32x32 instead of 8x16).

Note that requantization inside the accelerator is not supported yet. With int8 inputs, the output partial sums therefore use 32 bits: 16 for the product of two int8 values, plus 16 bits of headroom to avoid overflow during the reduction.
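
As a rough sanity check of that bit budget, the sketch below computes how many bits a signed accumulator needs for a given number of int8 products. The helper `accumulator_bits` is hypothetical (it is not part of the SAURIA library), and it assumes the worst case where every operand has magnitude 128.

```python
def accumulator_bits(n_terms, operand_bits=8):
    """Bits needed for a signed sum of n_terms products of signed operands."""
    max_operand = 2 ** (operand_bits - 1)    # 128 for int8 (magnitude of -128)
    max_sum = n_terms * max_operand ** 2     # worst-case magnitude of the sum
    # A signed b-bit integer holds values up to 2**(b-1) - 1
    return max_sum.bit_length() + 1

# A single int8 x int8 product
single_product_bits = accumulator_bits(1)

# Reduction over C_in * Kh * Kw = 64 * 3 * 3 terms, as in the convolution in this notebook
conv_bits = accumulator_bits(64 * 3 * 3)

# Largest number of worst-case products that still fits in a signed 32-bit accumulator
max_terms_int32 = (2 ** 31 - 1) // (128 * 128)

print(single_product_bits, conv_bits, max_terms_int32)
```

Under these assumptions, the reductions used in this notebook stay well inside the 32-bit range.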

In [None]:
# Let's import the dependencies we need
import numpy as np
import sys
import torch
import dotenv

# LOAD SYSTEM ENVIRONMENT VARIABLES - To compile Verilator from here
dotenv.load_dotenv('../env', override=True)

sys.path.insert(1, './../') # To find the libraries inside Python folder
import src.hw_versions as hwv
import src.sauria_lib as slib

### Compile verilator model

Remember that this only needs to be done once.

In [None]:
import os
import subprocess

# Version - See 'Python/versions/hw_versions.py'
sauria_version = 'int8_32x32'

cwd = os.getcwd()

os.chdir("../../test/verilator")
# Redirect the compilation output to a log file (closed automatically on exit)
with open("verilator_compile.log","w") as f1:
    subprocess.call(["sh","./compile_sauria.sh",sauria_version],stdout=f1)
os.chdir(cwd)

### Define convolution & generate random tensors

This time we use int8 values, so we generate them directly with `randint`. Note that the upper bound of `randint` is exclusive, so the generated values lie in [-127, 126].

In [None]:
# Convolution options:
C_in = 64       # Input Channels
C_out = 64      # Output Channels
Kh,Kw = 3,3     # Kernel size
s = 1           # Strides
d = 1           # Dilation coefficient
#p = 0          # Padding (UNSUPPORTED ATM!)

# Output tensor shape
Cw = 64         # Output tensor width
Ch = 64         # Output tensor height

# Input tensor shape determined by output tensor shape
Aw = (1+s*(Cw-1)) + (1+d*(Kw-1)) - 1
Ah = (1+s*(Ch-1)) + (1+d*(Kh-1)) - 1

# Randomly generate input tensors
tensor_A_torch = torch.randint(-127,127, (C_in, Ah, Aw), dtype=torch.int8)

# Randomly generate weights and biases
tensor_B_torch = torch.randint(-127,127, (C_out, C_in, Kh, Kw), dtype=torch.int8)
tensor_bias_torch = torch.randint(-127,127, (C_out, 1, 1), dtype=torch.int8)

# Perform convolution with Pytorch and print result
tensor_C_torch = tensor_bias_torch + torch.nn.functional.conv2d(tensor_A_torch.double(),tensor_B_torch.double(),stride=s,padding=0,dilation=d)
tensor_C_torch = tensor_C_torch.int()

print(tensor_C_torch.shape)
print(tensor_C_torch[:3,:3,:3])
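
The input shape formula used in the cell above can be checked against the standard output size formula for an unpadded convolution. The two helper functions below are hypothetical names introduced only for this check.

```python
def conv_input_size(out_size, kernel, stride=1, dilation=1):
    """Input size needed to produce out_size outputs (no padding)."""
    return (1 + stride * (out_size - 1)) + (1 + dilation * (kernel - 1)) - 1

def conv_output_size(in_size, kernel, stride=1, dilation=1):
    """Output size of an unpadded convolution."""
    return (in_size - dilation * (kernel - 1) - 1) // stride + 1

# Values used in this notebook: 64 outputs, 3x3 kernel, stride 1, dilation 1
in_size = conv_input_size(64, kernel=3, stride=1, dilation=1)
round_trip = conv_output_size(in_size, kernel=3, stride=1, dilation=1)

# A strided, dilated case to confirm the two formulas stay consistent
in_size_sd = conv_input_size(64, kernel=3, stride=2, dilation=2)
round_trip_sd = conv_output_size(in_size_sd, kernel=3, stride=2, dilation=2)

print(in_size, round_trip, in_size_sd, round_trip_sd)
```

With the notebook's settings, a 64x64 output requires a 66x66 input.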

### Convert tensors to numpy

As in the FP16 case, we convert the tensors to numpy arrays before passing them to the SAURIA library.

In [None]:
# Input tensor is the same, but converted to numpy
tensor_A = np.array(tensor_A_torch.detach())

# Weights tensor (randomly generated above), converted to numpy
tensor_B = np.array(tensor_B_torch.detach())

# Bias can be added by preloading data into the array
# (This is OPTIONAL! It adds the cost of replicating the data!)
# (However, it is useful as an example of data preloading)
bias_numpy = np.array(tensor_bias_torch.detach())
preload_C = np.zeros([C_out,Ch,Cw])
preload_C[:,:,:] = np.reshape(bias_numpy,[C_out,1,1])

# Convert result into numpy to compare
tensor_C = np.array(tensor_C_torch.detach())

print(tensor_C.shape)
print(tensor_C[:3,:3,:3])
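
To illustrate the replication cost mentioned in the comments above, here is a small sketch with made-up sizes (the `_demo` names are not part of the notebook). It shows that the preload tensor is just the per-channel bias broadcast over every output position.

```python
import numpy as np

# Small illustrative sizes
c_out_demo, ch_demo, cw_demo = 4, 2, 3
bias_demo = np.arange(c_out_demo, dtype=np.int8).reshape(c_out_demo, 1, 1)

# Same pattern as the cell above: fill a full output-shaped tensor with the bias
preload_demo = np.zeros((c_out_demo, ch_demo, cw_demo))
preload_demo[:, :, :] = bias_demo

# Equivalent view obtained with explicit broadcasting
broadcast_demo = np.broadcast_to(bias_demo, (c_out_demo, ch_demo, cw_demo))

# Each bias value is replicated once per output position
replication_factor = preload_demo.size // bias_demo.size

print(np.array_equal(preload_demo, broadcast_demo), replication_factor)
```

The replication factor equals `Ch * Cw`, which is the extra data that preloading moves compared to adding the bias in software.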

### Define the tiling configuration, generate the CONV dictionary

In [None]:
# Dictionary of hardware parameters describing the version of SAURIA
HW_PARAMS = hwv.get_params(sauria_version)

# Array with the tensor shapes to compute
tensor_shapes = [tensor_A.shape, tensor_B.shape, tensor_C.shape]

# Dictionary describing the tiling sizes
TILING_DICT = {
    'C_tile_shape'  :   [64,8,64],  #[C_out, Ch, Cw]
    'tile_cin'      :   64,
    'X_used'        :   32,
    'Y_used'        :   32
}

# Dictionary fully describing the convolution to compute
CONV_DICT = slib.get_conv_dict(tensor_shapes, TILING_DICT, HW_PARAMS, d=d, s=s, preloads=True)

print(CONV_DICT)
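
To get a feeling for what the tiling configuration implies, the sketch below counts how many tiles are needed to cover the output tensor and the input channels. `count_tiles` is a hypothetical helper, not a function of the SAURIA library, and the ceiling division is an assumption: the library may require tile sizes that divide the tensor dimensions exactly.

```python
import math

def count_tiles(out_shape, tile_shape, c_in, tile_cin):
    """Hypothetical helper: tiles per output dimension and along C_in."""
    per_dim = [math.ceil(full / tile) for full, tile in zip(out_shape, tile_shape)]
    cin_tiles = math.ceil(c_in / tile_cin)
    return per_dim, cin_tiles

# Output tensor [C_out, Ch, Cw] = [64, 64, 64] with the tiling used above
per_dim, cin_tiles = count_tiles([64, 64, 64], [64, 8, 64], c_in=64, tile_cin=64)
total_tiles = math.prod(per_dim) * cin_tiles

print(per_dim, cin_tiles, total_tiles)
```

With these values the output is split only along its height, into 8 tiles of 8 rows each.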

### Run RTL Emulation

In [None]:
SAURIA_outputs, SAURIA_stats = slib.Conv2d_SAURIA(tensor_A, tensor_B, preload_C, tensor_C, CONV_DICT, HW_PARAMS, generate_vcd=False, print_statistics=True, silent=False)

### Compare results to golden

In [None]:
# Print and compare to Pytorch result
print("From Pytorch:")
print(tensor_C[:3,:3,:3])

print("\nFrom SAURIA:")
SAURIA_outputs = SAURIA_outputs.astype(np.int32)
print(SAURIA_outputs[:3,:3,:3])

print("\nAverage absolute error:")
print(np.abs(SAURIA_outputs - tensor_C).mean())
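
Since both results are integers, an exact comparison can be more informative than an average error. The sketch below uses a hypothetical helper, `compare_to_golden`, on small made-up arrays; it is one possible way to summarize mismatches, not something the library provides.

```python
import numpy as np

def compare_to_golden(result, golden):
    """Hypothetical helper: summarize the mismatch between two integer arrays."""
    diff = np.abs(result.astype(np.int64) - golden.astype(np.int64))
    return {
        "exact_match": bool(np.array_equal(result, golden)),
        "max_abs_error": int(diff.max()),
        "num_mismatches": int(np.count_nonzero(diff)),
    }

# Made-up data: one element deliberately differs by 3
golden_demo = np.array([[1, -2], [300, 4]], dtype=np.int32)
result_demo = golden_demo.copy()
result_demo[1, 0] += 3

report = compare_to_golden(result_demo, golden_demo)
print(report)
```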

### Matrix-matrix multiplication example

Now let's try to perform a matrix-matrix multiplication directly.

In [None]:
# GEMM options:
C = 512         # Input Channels
L = 256         # Input size (number of hyperdimensional vectors)
K = 512         # Output Channels

# Generate random input matrices
matrix_A = np.random.randint(-127,127,size=(C,L),dtype=np.int8)
matrix_B = np.random.randint(-127,127,size=(K,C),dtype=np.int8)

bias_numpy = np.random.randint(-127,127,size=(K,1),dtype=np.int8)

# Compute matmul in numpy (cast to int32 to avoid overflow)
matmul_C = np.matmul(matrix_B.astype(np.int32), matrix_A.astype(np.int32))

# Add bias as a 2nd step
# (NOTE: we will add the bias in software, so the golden values passed to the library must be the matmul-only result)
matrix_C = matmul_C + bias_numpy.astype(np.int32)

print(matrix_C.shape)
print(matrix_C[:3,:3])
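
The cast to int32 in the cell above matters because NumPy keeps the input dtype for integer `matmul`, so int8 operands produce an int8 result that silently wraps around. A minimal sketch with made-up values:

```python
import numpy as np

# Four products of 100 * 100 = 10000 each; the exact sum is 40000
row_demo = np.full((1, 4), 100, dtype=np.int8)
col_demo = np.full((4, 1), 100, dtype=np.int8)

# int8 in, int8 out: the result cannot represent 40000 and wraps around
wrapped = np.matmul(row_demo, col_demo)

# Casting first gives the exact result
exact = np.matmul(row_demo.astype(np.int32), col_demo.astype(np.int32))

print(wrapped.dtype, wrapped[0, 0], exact.dtype, exact[0, 0])
```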

### Convolution equivalence

A GEMM can be seen as a particular case of the convolution operation. The equivalence between tensor and matrix dimensions is shown below.

In [None]:
# To use SAURIA for GEMM operation, the equivalence is as follows:
C_in = C        # Input Channels
C_out = K       # Output Channels
Kh,Kw = 1,1     # Kernel size           -> ALWAYS 1,1 BECAUSE THERE IS NO KERNEL!
s = 1           # Strides               -> ALWAYS 1!
d = 1           # Dilation coefficient  -> ALWAYS 1!
#p = 0          # Padding (UNSUPPORTED ATM!)
Cw = L          # Output tensor width
Ch = 1          # Output tensor height  -> ALWAYS 1!
Aw = Cw         # Input tensor shape is the same because there is no convolutional kernel
Ah = Ch

# Now we must simply reshape the matrices to fit the tensor template:
matrix_A_reshaped = np.reshape(matrix_A, (C_in,Ah,Aw))
matrix_B_reshaped = np.reshape(matrix_B, (C_out,C_in,Kh,Kw))
matmul_C_reshaped = np.reshape(matmul_C, (C_out,Ch,Cw))

# Array with the "tensor" shapes to compute
tensor_shapes = [matrix_A_reshaped.shape, matrix_B_reshaped.shape, matmul_C_reshaped.shape]

# Dictionary describing the tiling sizes
TILING_DICT = {
    'C_tile_shape'  :   [128,1,256],  #[C_out, Ch, Cw]
    'tile_cin'      :   256,
    'X_used'        :   32,
    'Y_used'        :   32
}

# Dictionary fully describing the convolution to compute
CONV_DICT = slib.get_conv_dict(tensor_shapes, TILING_DICT, HW_PARAMS, d=d, s=s, preloads=True)

print(CONV_DICT)
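
The equivalence can be verified in plain NumPy on small made-up matrices. In the sketch below, the 1x1 convolution with stride 1 is written as an `einsum` over the channel dimension; all names are illustrative and independent from the variables in the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_l, n_k = 8, 5, 4   # small illustrative sizes for C, L, K

mat_a = rng.integers(-127, 127, size=(n_c, n_l), dtype=np.int8)
mat_b = rng.integers(-127, 127, size=(n_k, n_c), dtype=np.int8)

# Reference GEMM in int32
gemm = mat_b.astype(np.int32) @ mat_a.astype(np.int32)

# Reshape into the tensor template: A -> [C_in, Ah, Aw], B -> [C_out, C_in, Kh, Kw]
ten_a = mat_a.reshape(n_c, 1, n_l).astype(np.int32)
ten_b = mat_b.reshape(n_k, n_c, 1, 1).astype(np.int32)

# 1x1 convolution, stride 1: out[k, h, w] = sum_c B[k, c, 0, 0] * A[c, h, w]
conv = np.einsum('kc,chw->khw', ten_b[:, :, 0, 0], ten_a)

# Dropping the height dimension of size 1 recovers the [K, L] matrix
same = np.array_equal(conv.squeeze(axis=1), gemm)
print(conv.shape, same)
```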

### GEMM Emulation with SAURIA

Now let's run the hardware emulation. To switch things up a little, we will add the bias externally in software instead of preloading the bias values into the array. This avoids replicating the bias data, so it requires less bandwidth.

(*Note that the way the bias is added does not depend on whether we perform a GEMM or a convolution. Both options work in both cases, and each is shown once here for illustration.*)

In [None]:
SAURIA_outputs_2, SAURIA_stats_2 = slib.Conv2d_SAURIA(matrix_A_reshaped, matrix_B_reshaped, None, matmul_C_reshaped, CONV_DICT, HW_PARAMS, generate_vcd=False, print_statistics=True, silent=False)

# Squeeze to reshape from [K,1,L] into [K,L]
SAURIA_outputs_2 = SAURIA_outputs_2.squeeze()

# This time we add the bias externally via software
SAURIA_matrix = SAURIA_outputs_2 + bias_numpy.astype(np.int32)

In [None]:
# Print and compare to Numpy result
print("From Numpy:")
print(matrix_C[:3,:3])

print("\nFrom SAURIA:")
SAURIA_outputs_2_plusbias = SAURIA_matrix.astype(np.int32)
print(SAURIA_outputs_2_plusbias[:3,:3])

print("\nAverage absolute error:")
print(np.abs(SAURIA_outputs_2_plusbias - matrix_C).mean())