In [152]:
import torch
from transformers import BertModel, BertTokenizer

In [44]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [160]:
# Input is the first 512 tokens generated from the proposal for this project.
text = 'This project aims to implement a transformer layer on a cluster of FPGAs. In recent years transformers have outperformed traditional convolutional neural networks in many fields, but serial performance is dismal and parallel GPU performance is power-intensive. Specialized architectures have been studied little, especially using FPGA platforms. This research will improve transformer inference performance by offloading computationally intensive sections of the network to reconfigurable accelerators running on a cluster of multiple FPGA devices. This research will result in an acceleration architecture for a single layer of a transformer network along with a performance comparison with CPU and GPU baselines. We propose the investigation of distributed transformer inference across a cluster of multiple field programmable gate arrays (FPGAs). This research will investigate the partitioning of a transformer layer across multiple FPGA devices along with networking between FPGAs in the cluster. Transformers have become a dominant machine learning architecture for many domains such as natural language processing, therefore high speed inference is desirable. However, networks sizes and limited FPGA resources often make inference on a single FPGA slow due to limited parallelism and pipeline depth or impossible due to limited resources. The purpose of this research is to explore methods to overcome these challenges by introducing parallelism through multi-FPGA clusters. Transformers are highly parallel neural network architectures which consist of stacks of encoder and decoder layers. These layers consist of many linear transformations on matrices which are represented by matrix-matrix multiplication. Within an encoder/decoder layer there is an opportunity to parallelize both between concurrent general matrix multiplies (GeMM) and within each GeMM. Attempting to serialize these operations on a CPU leads to high execution time and is a poor utilization of the CPU\'s general purpose architecture. GPUs can deliver high throughput inference for transformers, though they are power-hungry and do not achieve the low latency required by some applications. Both in the datacenter and at the edge, low-latency and efficient inference is desired. Optimally, there would be an architecture that could scale between these two extremes of computational demand. State-of-the-art transformers can contain upwards of 12 layers and multiply matrices on the order of 1024x1024 elements. In addition, the trend of increasing transformer size does not show signs of slowing. This large use of memory and FLOPs leads to difficulty mapping an entire transformer network to a '
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input['input_ids'].shape

torch.Size([1, 512])

### Ground truth
This output hidden state is what we care about. After passing this tokenized input through BERT's embedder and it's 12 encoder layers, this is the result. We want to show that our implementation yields the same `last_hidden_state`.

In [161]:
output = model(**encoded_input)
output.last_hidden_state.shape, output.last_hidden_state

(torch.Size([1, 512, 768]),
 tensor([[[-3.7457e-01, -6.9886e-01, -4.4155e-04,  ..., -3.0714e-01,
           -3.8659e-01,  4.7352e-01],
          [-6.7209e-01, -7.5042e-01, -6.9455e-01,  ...,  1.4919e-01,
            1.1460e+00,  1.7025e-01],
          [-8.8504e-01, -6.3164e-01, -5.9148e-01,  ...,  2.0482e-01,
            1.7474e-01,  2.4267e-01],
          ...,
          [-2.5009e-01,  4.4047e-02, -2.1806e-01,  ...,  1.0060e-01,
            2.7695e-01,  8.8146e-01],
          [-7.5948e-01,  7.5724e-02, -3.9088e-01,  ..., -4.3433e-01,
            2.8015e-01,  7.4720e-01],
          [-3.3422e-01, -5.3717e-02,  5.4829e-01,  ...,  5.3513e-01,
           -3.9397e-01, -2.6217e-01]]], grad_fn=<NativeLayerNormBackward0>))

### Step 1: Explicitly call model.embeddings and model.encoder
Now, do what BERT does in the forward pass, but explicitly.

In [121]:
embedding_output = model.embeddings(
    input_ids=encoded_input['input_ids'],
    position_ids=None,
    token_type_ids=encoded_input['token_type_ids'],
    inputs_embeds=None,
    past_key_values_length=0,
)

This is the input data we are now working with. It has gone through the encoder.

In [122]:
embedding_output

tensor([[[ 0.1686, -0.2858, -0.3261,  ..., -0.0276,  0.0383,  0.1640],
         [-0.6485,  0.6739, -0.0932,  ...,  0.4475,  0.6696,  0.1820],
         [ 0.3184,  0.3346, -0.0722,  ...,  0.0517, -0.0069, -0.7439],
         ...,
         [ 0.7621,  0.5828, -0.0454,  ...,  0.3383,  0.0191, -0.0997],
         [ 0.9247,  0.4532,  0.6505,  ..., -0.0661, -0.1281, -0.0595],
         [-0.2301, -0.4165,  0.3172,  ..., -0.2481,  0.5677, -1.6841]]],
       grad_fn=<NativeLayerNormBackward0>)

Now pass it through the encoder layers, all 12 of them.

In [124]:
head_mask = model.get_head_mask(None, model.config.num_hidden_layers)
extended_attention_mask = model.get_extended_attention_mask(attention_mask=encoded_input['attention_mask'],
                                                            input_shape=encoded_input['input_ids'].size(),
                                                            device=encoded_input['input_ids'].device)

In [125]:
encoder_outputs = model.encoder(
    embedding_output,
    attention_mask=extended_attention_mask,
    head_mask=head_mask,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    past_key_values=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
)

Notice that we got the same result that we did when calling `model.forward()` at the beginning.

In [126]:
encoder_outputs

(tensor([[[-3.7457e-01, -6.9886e-01, -4.4155e-04,  ..., -3.0714e-01,
           -3.8659e-01,  4.7352e-01],
          [-6.7209e-01, -7.5042e-01, -6.9455e-01,  ...,  1.4919e-01,
            1.1460e+00,  1.7025e-01],
          [-8.8504e-01, -6.3164e-01, -5.9148e-01,  ...,  2.0482e-01,
            1.7474e-01,  2.4267e-01],
          ...,
          [-2.5009e-01,  4.4047e-02, -2.1806e-01,  ...,  1.0060e-01,
            2.7695e-01,  8.8146e-01],
          [-7.5948e-01,  7.5724e-02, -3.9088e-01,  ..., -4.3433e-01,
            2.8015e-01,  7.4720e-01],
          [-3.3422e-01, -5.3717e-02,  5.4829e-01,  ...,  5.3513e-01,
           -3.9397e-01, -2.6217e-01]]], grad_fn=<NativeLayerNormBackward0>),)

Can you replicate this `encoder_outputs` result very simply?

### Step 2: Call each encoder layer's `forward`

In [130]:
hidden_states = embedding_output
for layer_module in model.encoder.layer:
    layer_outputs = layer_module(hidden_states)
    hidden_states = layer_outputs[0]

Yes! Call each layer's `forward()`.

In [131]:
hidden_states

tensor([[[-3.7457e-01, -6.9886e-01, -4.4155e-04,  ..., -3.0714e-01,
          -3.8659e-01,  4.7352e-01],
         [-6.7209e-01, -7.5042e-01, -6.9455e-01,  ...,  1.4919e-01,
           1.1460e+00,  1.7025e-01],
         [-8.8504e-01, -6.3164e-01, -5.9148e-01,  ...,  2.0482e-01,
           1.7474e-01,  2.4267e-01],
         ...,
         [-2.5009e-01,  4.4047e-02, -2.1806e-01,  ...,  1.0060e-01,
           2.7695e-01,  8.8146e-01],
         [-7.5948e-01,  7.5724e-02, -3.9088e-01,  ..., -4.3433e-01,
           2.8015e-01,  7.4720e-01],
         [-3.3422e-01, -5.3717e-02,  5.4829e-01,  ...,  5.3513e-01,
          -3.9397e-01, -2.6217e-01]]], grad_fn=<NativeLayerNormBackward0>)

How low can we go?

### Step 3: For each layer, call it's submodules

In [142]:
hidden_states = embedding_output
for layer_module in model.encoder.layer:
    # MHA + LayerNorm
    self_attn_out = layer_module.attention(hidden_states)
    attn_out = self_attn_out[0]
    # First FF
    intermediate_out = layer_module.intermediate(attn_out)
    # Second FF + LayerNorm
    layer_out = layer_module.output(intermediate_out, attn_out)
    hidden_states = layer_out
    

In [144]:
hidden_states.shape, hidden_states

(torch.Size([1, 512, 768]),
 tensor([[[-3.7457e-01, -6.9886e-01, -4.4155e-04,  ..., -3.0714e-01,
           -3.8659e-01,  4.7352e-01],
          [-6.7209e-01, -7.5042e-01, -6.9455e-01,  ...,  1.4919e-01,
            1.1460e+00,  1.7025e-01],
          [-8.8504e-01, -6.3164e-01, -5.9148e-01,  ...,  2.0482e-01,
            1.7474e-01,  2.4267e-01],
          ...,
          [-2.5009e-01,  4.4047e-02, -2.1806e-01,  ...,  1.0060e-01,
            2.7695e-01,  8.8146e-01],
          [-7.5948e-01,  7.5724e-02, -3.9088e-01,  ..., -4.3433e-01,
            2.8015e-01,  7.4720e-01],
          [-3.3422e-01, -5.3717e-02,  5.4829e-01,  ...,  5.3513e-01,
           -3.9397e-01, -2.6217e-01]]], grad_fn=<NativeLayerNormBackward0>))

### Step 4: Implement the layer's operations
Now we use the model just as a repository of each layer's weights. Implement the linear transform, multi-head attention, feedforward, and layernorm. The cell below shows the parameters in one encoder layer.

In [146]:
model.encoder.layer[0]

BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=768, out_features=768, bias=True)
      (key): Linear(in_features=768, out_features=768, bias=True)
      (value): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=768, out_features=3072, bias=True)
  )
  (output): BertOutput(
    (dense): Linear(in_features=3072, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

In [159]:
def attention(layer, hidden_states):
    '''
    hidden_states: <bs, seqlen, dmodel>
    '''
    bs, seqlen, dmodel = hidden_states.size()
    num_heads = layer.attention.self.attention_head_size
    Wq = layer.attention.self.query
    Wk = layer.attention.self.key
    Wv = layer.attention.self.value
    drop1 = layer.attention.self.dropout
    
    Wg = layer.attention.output.dense
    layernorm = layer.attention.output.LayerNorm
    drop2 = layer.attention.output.dropout
    
    # Linear transform to get multiple heads
    Q = Wq(hidden_states) # <bs, seqlen, dmodel>
    K = Wk(hidden_states) # <bs, seqlen, dmodel>
    V = Wv(hidden_states) # <bs, seqlen, dmodel>
    ### TODO: MHA with the right tensor shapes.
    K = torch.transpose(K, 1, 2) # <bs, dmodel, seqlen>
    K = torch.view(K, shape=(bs, dmodel/num_heads))
    
    
    
attention(model.encoder.layer[0], embedding_output)

AttributeError: module 'torch' has no attribute 'view'

In [None]:
hidden_states = embedding_output
for layer_module in model.encoder.layer:
    # MHA + LayerNorm
    self_attn_out = layer_module.attention(hidden_states)
    attn_out = self_attn_out[0]
    # First FF
    intermediate_out = layer_module.intermediate(attn_out)
    # Second FF + LayerNorm
    layer_out = layer_module.output(intermediate_out, attn_out)
    hidden_states = layer_out