# Test functional correctness of quant layer implementation

In [5]:
%load_ext autoreload
%autoreload 2

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
import math

from src.quant_layer import attention, ffn, layer_kernel_gt

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# Long input text; the two adjacent literals below are concatenated into a single string.
text_512 = (
    'This project aims to implement a transformer layer on a cluster of FPGAs. In recent years transformers have outperformed traditional convolutional neural networks in many fields, but serial performance is dismal and parallel GPU performance is power-intensive. Specialized architectures have been studied little, especially using FPGA platforms. This research will improve transformer inference performance by offloading computationally intensive sections of the network to reconfigurable accelerators running on a cluster of multiple FPGA devices. This research will result in an acceleration architecture for a single layer of a transformer network along with a performance comparison with CPU and GPU baselines. We propose the investigation of distributed transformer inference across a cluster of multiple field programmable gate arrays (FPGAs). This research will investigate the partitioning of a transformer layer across multiple FPGA devices along with networking between FPGAs in the cluster. Transformers have become a dominant machine learning architecture for many domains such as natural language processing, therefore high speed inference is desirable. However, networks sizes and limited FPGA resources often make inference on a single FPGA slow due to limited parallelism and pipeline depth or impossible due to limited resources. The purpose of this research is to explore methods to overcome these challenges by introducing parallelism through multi-FPGA clusters. Transformers are highly parallel neural network architectures which consist of stacks of encoder and decoder layers. These layers consist of many linear transformations on matrices which are represented by matrix-matrix multiplication. Within an encoder/decoder layer there is an opportunity to parallelize both between concurrent general matrix multiplies (GeMM) and within each GeMM. '
    'Attempting to serialize these operations on a CPU leads to high execution time and is a poor utilization of the CPU\'s general purpose architecture. GPUs can deliver high throughput inference for transformers, though they are power-hungry and do not achieve the low latency required by some applications. Both in the datacenter and at the edge, low-latency and efficient inference is desired. Optimally, there would be an architecture that could scale between these two extremes of computational demand. State-of-the-art transformers can contain upwards of 12 layers and multiply matrices on the order of 1024x1024 elements. In addition, the trend of increasing transformer size does not show signs of slowing. This large use of memory and FLOPs leads to difficulty mapping an entire transformer network to a '
)
text_128 = 'This project aims to implement a transformer layer on a cluster of FPGAs. In recent years transformers have outperformed traditional convolutional neural networks in many fields, but serial performance is dismal and parallel GPU performance is power-intensive. Specialized architectures have been studied little, especially using FPGA platforms. This research will improve transformer inference performance by offloading computationally intensive sections of the network to reconfigurable accelerators running on a cluster of multiple FPGA devices. This research will result in an acceleration architecture for a single layer of a transformer network along with a  '
text = text_512
# Tokenize the text, then run only BERT's embedding module to obtain the encoder input.
encoded_input = tokenizer(text, return_tensors='pt')
embedding_output = model.embeddings(
    input_ids=encoded_input['input_ids'],
    position_ids=None,
    token_type_ids=encoded_input['token_type_ids'],
    inputs_embeds=None,
    past_key_values_length=0,
)

In [7]:
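# Reference path: run the first encoder layer in two steps, attention followed by the feed-forward network.
layer = model.encoder.layer[0]
attention_out = attention(layer, embedding_output)
output_gt = ffn(layer, attention_out)
print(output_gt)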

tensor([[[ 0.0052,  0.0445, -0.2700,  ...,  0.1715, -0.0361,  0.0237],
         [-0.5202,  0.4449,  0.1375,  ...,  0.1542,  0.5503,  0.0874],
         [ 0.2533,  0.4850,  0.1049,  ..., -0.0433,  0.4303, -0.8026],
         ...,
         [ 0.5527,  0.6596,  0.1512,  ...,  0.1748,  0.3423, -0.0273],
         [ 0.7551,  0.6538,  0.7972,  ..., -0.1573, -0.2155,  0.1298],
         [-0.1963, -0.1912,  0.2267,  ..., -0.2514,  0.5664, -1.2584]]],
       grad_fn=<MulBackward0>)


In [8]:
# Path under test: compute the same encoder layer in a single call.
output_test = layer_kernel_gt(layer, embedding_output)
print(output_test)

tensor([[[ 0.0052,  0.0445, -0.2700,  ...,  0.1715, -0.0361,  0.0237],
         [-0.5202,  0.4449,  0.1375,  ...,  0.1542,  0.5503,  0.0874],
         [ 0.2533,  0.4850,  0.1049,  ..., -0.0433,  0.4303, -0.8026],
         ...,
         [ 0.5527,  0.6596,  0.1512,  ...,  0.1748,  0.3423, -0.0273],
         [ 0.7551,  0.6538,  0.7972,  ..., -0.1573, -0.2155,  0.1298],
         [-0.1963, -0.1912,  0.2267,  ..., -0.2514,  0.5664, -1.2584]]],
       grad_fn=<MulBackward0>)


In [9]:
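# Both paths must agree within torch.allclose's default tolerances.
assert torch.allclose(output_gt, output_test), 'layer_kernel_gt disagrees with attention + ffn'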
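
For reference, `torch.allclose(input, other)` with its default tolerances (`rtol=1e-05`, `atol=1e-08`) passes only if every element satisfies `|input - other| <= atol + rtol * |other|`. The sketch below mirrors that criterion in plain Python on flat lists, so the tightness of the check and the location of the worst element are easy to inspect. `allclose_report` is a hypothetical helper written for this illustration; it is not part of torch or of `src.quant_layer`.

```python
import math


def allclose_report(values, reference, rtol=1e-5, atol=1e-8):
    """Apply torch.allclose's elementwise criterion to two flat sequences.

    `reference` plays the role of torch.allclose's `other` argument.
    Returns (passed, max_abs_diff, worst_index); worst_index is -1 when
    no element differs.
    """
    if len(values) != len(reference):
        raise ValueError("sequences must have the same length")
    passed = True
    max_abs_diff = 0.0
    worst_index = -1
    for i, (v, r) in enumerate(zip(values, reference)):
        diff = abs(v - r)
        # A NaN difference fails the check, as it does with equal_nan=False.
        if math.isnan(diff) or diff > atol + rtol * abs(r):
            passed = False
        if diff > max_abs_diff:
            max_abs_diff = diff
            worst_index = i
    return passed, max_abs_diff, worst_index
```

Assuming the tensors from the cells above, it could be called as `allclose_report(output_gt.detach().flatten().tolist(), output_test.detach().flatten().tolist())`. Reporting the largest absolute difference alongside the pass/fail result helps when a later quantized path is expected to deviate and the tolerances need to be chosen deliberately.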

# Generate Ground Truth
Using this input (`embedding_output`) and the parameters of the first encoder layer (`model.encoder.layer[0]`), generate a set of ground-truth values.