# Bidirectional Encoder Representations from Transformers (BERT)

While BERT is similar to models like GPT, the focus of BERT is to understand text rather than generate it. This is useful in a variety of tasks like ranking how positive a review of a product is, or predicting if an answer to a question is correct.

The general idea of a transformer: the encoder summarizes an input into an abstract, meaning-rich representation, and the decoder uses that representation to generate text.

The point of an "encoder only" transformer like BERT is to summarize some input sequence into an abstract, dense, and meaning rich representation.

In [None]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
import torch.optim as optim
from multiprocessing import Pool, cpu_count

# defining the device on which the data will live
device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
!pip install datasets
!pip install nltk

from datasets import load_dataset
import nltk

In [None]:
# the dataset is big, to make things easier we are going to be streaming a subset
dataset= load_dataset('wikipedia', '20220301.en', trust_remote_code= True, streaming=True)

# a sentence tokenizer we will be using to extract sentences from articles
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# breaking wikipedia articles into sentences and paragraphs

import itertools

num_articles= 10000

# getting n articles
dataset_iter= iter(dataset['train'])
articles= list(itertools.islice(dataset_iter, num_articles))

# getting paragraphs
paragraphs= []
for article in articles:
    paragraphs.extend(article['text'].splitlines())

# filtering paragraphs so they are hopefully actually paragraphs
paragraphs= [p for p in paragraphs if len(p)>50]

# dividing paragraphs into sentences
divided_paragraphs= []
for p in paragraphs:
    divided_paragraphs.append(nltk.sent_tokenize(p))

# only using paragraphs with 3 or more sentences
divided_paragraphs= [pls for pls in divided_paragraphs if len(pls)>=3]


In [None]:
# using the paragraph data to construct pairs of consecutive sentences and pairs
# of random sentences

import random

positive_pairs= []
negative_pairs= []

num_paragraphs= len(divided_paragraphs)

for i, paragraph in enumerate(divided_paragraphs):
    for j in range(len(paragraph)-1):
        positive_pairs.append((paragraph[j], paragraph[j+1]))
        rand_par= i

        # avoid drawing the negative sentence from the same paragraph
        while rand_par == i:
            rand_par= random.randint(0, num_paragraphs-1)

        rand_sent= random.randint(0, len(divided_paragraphs[rand_par])-1)
        negative_pairs.append((paragraph[j], divided_paragraphs[rand_par][rand_sent]))


# Tokenization

In order to feed data into our model we need to somehow turn our sentences into vectors. The first step is a tokenizer, which breaks up text into individual tokens and maps each one to an integer ID.

In [None]:
from transformers.models.bert.tokenization_bert_fast import BertTokenizerFast

tokenizer= BertTokenizerFast.from_pretrained('google-bert/bert-base-uncased')

In [None]:
# playing around with the tokenizer
sentence= "Here's a weird word: Withoutadoubticus."
print(f'Original sentence: "{sentence}"')

demo_tokens= tokenizer([sentence])
print(f"Token IDs: {demo_tokens['input_ids']}")

tokens= tokenizer.convert_ids_to_tokens(demo_tokens['input_ids'][0])
print(f'Token values: {tokens}')

Original sentence: "Here's a weird word: Withoutadoubticus."
Token IDs: [[101, 2182, 1005, 1055, 1037, 6881, 2773, 1024, 2302, 9365, 12083, 29587, 1012, 102]]
Token values: ['[CLS]', 'here', "'", 's', 'a', 'weird', 'word', ':', 'without', '##ado', '##ub', '##ticus', '.', '[SEP]']


The tokenizer broke our sentence up into individual components which may have included dividing individual words into more than one component. This is called sub-word tokenization, meaning the tokenizer has both words and word components in its vocabulary. This is important because it allows the tokenizer to express complicated words as a series of tokens.
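
To make sub-word tokenization concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers apply to a single word. The tiny vocabulary `toy_vocab` and the function `wordpiece_split` are hypothetical and exist only for illustration; the real tokenizer works from its 30,522-entry vocabulary and applies extra normalization first.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# hypothetical toy vocabulary, chosen to mirror the split seen above
toy_vocab = {"without", "with", "##ado", "##ub", "##ticus", "word"}

print(wordpiece_split("withoutadoubticus", toy_vocab))
print(wordpiece_split("word", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

Because the longest match wins, "without" is preferred over the shorter "with", and a word with no matching pieces collapses to `[UNK]`.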

In [None]:
# special tokens
tokenizer

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

# Defining Training Batches

Each batch will contain 128 individual sentence pair examples, 64 of which are positive pairs and 64 of which are negative pairs. To keep our model fairly small and to speed up training, we'll set the context window of the model to 64 tokens. So, at the end of this process, we'll get a tensor of shape [number_of_batches x 128 (batch_size) x 64 (sequence_length)].

In [None]:
# number of examples in the batch
batch_size= 128 # should be divisible by 2
# sequence length of model
max_input_length= 64


# defining parallelizable function for processing batches
def process_batch(batch_index):

    # establishing bounds of the batch
    start_index= batch_index * batch_size
    end_index= start_index + batch_size

    if end_index > len(positive_pairs):
        return None, None, None

    # getting the sentence pairs of the batch, and if they are pos or neg
    sentence_pairs= []
    is_positives= []

    # creating positive pairs
    sentence_pairs.extend(positive_pairs[start_index:start_index+int(batch_size / 2)])
    # positive labels
    is_positives.extend([1] * int(batch_size / 2))

    # creating negative pairs
    sentence_pairs.extend(negative_pairs[start_index:start_index+int(batch_size / 2)])
    # negative labels
    is_positives.extend([0] * int(batch_size / 2))

    # defining outputs
    # at the end of the day we need to know three things:
    #   - the tokens for the sequences in a batch
    #   - which sentence the tokens belong to, for the segment embedding
    #   - if the examples in the batch are positive or negative
    # these keep track of the first two
    batch_sentence_location_tokens= []
    batch_sequence_tokens= []

    # tokenizing pairs
    for sentence_pair in sentence_pairs:
        sentence1= sentence_pair[0]
        sentence2= sentence_pair[1]

        # tokenizing both sentences
        tokens= tokenizer([sentence1, sentence2])
        sentence1_tokens= tokens['input_ids'][0]
        sentence2_tokens= tokens['input_ids'][1]

        # trimming down tokens
        if len(sentence1_tokens) + len(sentence2_tokens) > max_input_length:
            """
            If the sentences are too long I opted to preserve the end of the first sentence and the
            beginning of the second sentence. This should still allow long inputs to be reasonably
            interpretable by the model.
            """
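            # drop each sentence's outer special token before slicing so that [CLS] / [SEP]
            # are not duplicated when a sentence is shorter than the slice
            sentence1_tokens= [101] + sentence1_tokens[1:][-int(max_input_length / 2) + 1:]
            sentence2_tokens= sentence2_tokens[:-1][:int(max_input_length / 2) - 1] + [102]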

        # creating sentence tokens - a vector which has zeros in the length of the first sentence,
        # and ones for the length of the second sentence. We'll use this vector to look up the
        # segment embedding when we go to build the model.
        sentence_tokens= [0] * len(sentence1_tokens) + [1] * len(sentence2_tokens)

        # combining and padding
        pad_num= max_input_length - (len(sentence1_tokens) + len(sentence2_tokens))
        sequence_tokens= sentence1_tokens + sentence2_tokens + [0]*pad_num
        sentence_location_tokens= sentence_tokens + [1]*pad_num
        """
        We combine our sentence tokens together, and if the combined length is less than the model
        length we add a bunch of pad tokens. If we do add pad tokens, we say the pad tokens belong
        to the second sentence for convenience sake.
        """

        # adding to batch
        batch_sequence_tokens.append(sequence_tokens)
        batch_sentence_location_tokens.append(sentence_location_tokens)

    return torch.tensor(batch_sentence_location_tokens), torch.tensor(batch_sequence_tokens), torch.tensor(is_positives)
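
The packing step for a single sentence pair can be looked at in isolation. The sketch below uses plain Python lists and made-up token IDs; `pack_pair` is a hypothetical helper, not part of the notebook, and it assumes the pair already fits in `max_len`.

```python
def pack_pair(sentence1_tokens, sentence2_tokens, max_len, pad_id=0):
    """Concatenate two token lists, build segment ids, and right-pad both to max_len."""
    assert len(sentence1_tokens) + len(sentence2_tokens) <= max_len
    segment_ids = [0] * len(sentence1_tokens) + [1] * len(sentence2_tokens)
    pad_num = max_len - len(segment_ids)
    tokens = sentence1_tokens + sentence2_tokens + [pad_id] * pad_num
    # padding is assigned to segment 1, matching the convention used in process_batch
    segment_ids = segment_ids + [1] * pad_num
    return tokens, segment_ids


# toy ids: 101 = [CLS], 102 = [SEP]; the other ids are made up
tokens, segments = pack_pair([101, 7, 8, 102], [101, 9, 102], max_len=10)
print(tokens)    # [101, 7, 8, 102, 101, 9, 102, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```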


In [None]:
# determine the number of batches
num_batches= len(positive_pairs) // batch_size

# use a pool of workers equal to the number of CPU cores
with Pool(processes=cpu_count()) as pool:
    results= list(tqdm(pool.imap(process_batch, range(num_batches)), total=num_batches))

# filter out None results from the process_batch function
results= [result for result in results if result[0] is not None]

# unpack results into batches
sentence_location_batches, sequence_tokens_batches, is_positives_batches= zip(*results)

# stack tensors into final batches
sentence_location_batches= torch.stack(sentence_location_batches).to(device)
sequence_tokens_batches= torch.stack(sequence_tokens_batches).to(device)
is_positives_batches= torch.stack(is_positives_batches).to(device)

100%|██████████| 7311/7311 [05:12<00:00, 23.40it/s]


In [None]:
sentence_location_batches.shape

torch.Size([7311, 128, 64])

# Creating a Masking Function

As we build our masking function we don't want to inadvertently mask out special tokens like [CLS], [SEP], and [PAD]. We only want to mask out tokens which correspond to the sentences themselves.

After we train our model, the [MASK] token will never be seen when the model is actually being used to make inferences. If we only ever replaced selected tokens with [MASK], the model might learn to disregard other words that are important for understanding the sequence generally. So, when we decide a random token should be masked, we usually replace it with the [MASK] token, but we also sometimes preserve the original token value, and sometimes replace it with a completely random token.

In the original BERT paper they decided to mask 15% of words within the input. Of that 15%, 80% are replaced with [MASK], while 10% are replaced with a random word and 10% are not replaced at all.
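
In terms of the whole input, these percentages mean that roughly 12% of eligible tokens become [MASK] (0.15 × 0.8), 1.5% become a random token, and 1.5% are selected but left unchanged. The following is a minimal single-sequence sketch of that rule in plain Python. `mask_sequence` is a hypothetical helper for illustration; like the batched function in this notebook, it may occasionally draw a special token as the random replacement.

```python
import random

MASK_ID = 103
SPECIAL_IDS = {0, 100, 101, 102, 103}  # [PAD], [UNK], [CLS], [SEP], [MASK]


def mask_sequence(token_ids, vocab_size, select_prob=0.15, rng=random):
    """Return (corrupted ids, 0/1 indicator of positions the model must predict)."""
    corrupted = list(token_ids)
    selected = [0] * len(token_ids)
    # only ordinary tokens may be selected; special tokens are left alone
    eligible = [i for i, t in enumerate(token_ids) if t not in SPECIAL_IDS]
    num_to_select = int(len(eligible) * select_prob)
    for i in rng.sample(eligible, num_to_select):
        selected[i] = 1
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, selected


rng = random.Random(0)
ids = [101] + list(range(1000, 1040)) + [102] + [0] * 6
corrupted, selected = mask_sequence(ids, vocab_size=30522, rng=rng)
print(sum(selected))  # int(40 * 0.15) positions are selected for prediction
```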

In [None]:
# listing out vocab for random token masking
vocab= tokenizer.get_vocab()
valid_token_ids= list(vocab.values())

def mask_batch(batch_tokens, clone=True):
    if clone:
        batch_tokens= torch.clone(batch_tokens)

    # define the percentage of tokens to potentially mask
    replace_percentage= 0.15

    # define tokens that should not be replaced
    excluded_tokens= {0, 100, 101, 102, 103}

    # create a mask to identify tokens that are eligible for replacement
    eligible_mask= ~torch.isin(batch_tokens, torch.tensor(list(excluded_tokens)).to(device))

    # count the number of eligible tokens
    num_eligible_tokens= eligible_mask.sum().item()

    # calculate the number of tokens to potentially mask
    num_tokens_to_mask= int(num_eligible_tokens * replace_percentage)

    # create a random permutation of eligible token indices
    eligible_indices= eligible_mask.nonzero(as_tuple=True)
    random_indices= torch.randperm(num_eligible_tokens)[:num_tokens_to_mask]

    # create a probability distribution for replacement
    replacement_probs= torch.tensor([0.8, 0.1, 0.1]) # probs for [103, random token, leave unchanged]
    replacement_choices= torch.multinomial(replacement_probs, num_tokens_to_mask, replacement=True)

    # vector to store if a token was masked (0: not masked, 1: masked)
    masked_indicator= torch.zeros_like(batch_tokens, dtype=torch.int32)

    # apply replacements based on sampled choices
    for i, idx in enumerate(random_indices):
        row= eligible_indices[0][idx]
        col= eligible_indices[1][idx]

        # replacing with mask
        if replacement_choices[i]== 0:
            batch_tokens[row, col]= 103
            masked_indicator[row, col]= 1

        # replacing with random tokens
        elif replacement_choices[i]== 1:
            batch_tokens[row, col]= random.choice(valid_token_ids)
            masked_indicator[row, col]= 1

        # not replacing at all
        elif replacement_choices[i]== 2:
            masked_indicator[row, col]= 1


    return batch_tokens, masked_indicator


In [None]:
batch_tokens, masked_indicator= mask_batch(sequence_tokens_batches[0])
batch_tokens

tensor([[  101,   103, 11140,  ...,     0,     0,     0],
        [  101,   101,  9617,  ...,     0,     0,     0],
        [  101,  4286,  2973,  ...,     0,     0,     0],
        ...,
        [  101,  1037,  2146,  ...,     0,     0,     0],
        [  101,  1997, 13193,  ...,     0,     0,     0],
        [  101,   103,  1996,  ...,     0,     0,     0]], device='cuda:0')

# Defining the Model

Now that we have tokenization, batching, and masking in place, we can build the model itself.

# Embedding

A BERT style model, in being a derivative of transformers, expects a high dimensional vector to represent each word. The model will use these vectors to reason about words, allowing it to (hopefully) create a strong understanding of the input text. So, we need to turn our tokens (which are just integers) into these high dimensional vectors.

The embedding portion of the model will take care of both the conversion of tokens into vectors and the addition of positional information by using lookup tables. We'll define random vectors for every possible token, random vectors that correspond to each input position, and random vectors which correspond to the two sentence inputs. We'll replace tokens and positions with these random vectors, and use them to represent a token and its position. Naturally, the model will do a bad job at first as we're using completely random data, but these random values will be learnable parameters of the model, so the model will learn how to make good vectors for both token and position encoding.
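
As a rough illustration of the lookup-and-sum idea, here is a sketch using plain Python lists instead of `nn.Embedding`. The tables are random and not trained, and the sizes (`d_model = 4`, a 10-token vocabulary) are made up to keep the output small; `make_table` and `embed` are hypothetical names.

```python
import random


def make_table(num_rows, d_model, rng):
    """A lookup table of random vectors; in the real model these are learnable."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(num_rows)]


def embed(token_ids, segment_ids, tok_table, pos_table, seg_table):
    """Sum the token, position, and segment vectors for every position in a sequence."""
    out = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + p + s for t, p, s in
               zip(tok_table[tok], pos_table[position], seg_table[seg])]
        out.append(vec)
    return out


rng = random.Random(0)
d_model = 4  # tiny on purpose; the notebook uses 256
tok_table = make_table(10, d_model, rng)  # toy vocabulary of 10 tokens
pos_table = make_table(6, d_model, rng)   # toy maximum sequence length of 6
seg_table = make_table(2, d_model, rng)   # two segments (sentence 1 and sentence 2)

vectors = embed([1, 5, 2, 1], [0, 0, 1, 1], tok_table, pos_table, seg_table)
print(len(vectors), len(vectors[0]))  # 4 4
```

Token 1 appears at positions 0 and 3, but the two resulting vectors differ because different position and segment vectors are added to the same token vector.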

In [None]:
class Embedding(nn.Module):
    """
    This is the first component of the model which converts tokens into vectors. These vectors are
    learned throughout the training process, where there's essentially a lookup table for each
    word.
    Here we're saying we'll represent the words with vectors of length 256 with the parameter
    d_model=256, and we're saying we're dealing with two sentences with n_segments=2.
    """

    def __init__(self, d_model, vocab_size, input_length, n_segments) -> None:
        super(Embedding, self).__init__()
        # token embedding
        self.tok_embed= nn.Embedding(vocab_size, d_model)
        # position embedding
        self.pos_embed= nn.Embedding(input_length, d_model)
        # segment (token type) embedding
        self.seg_embed= nn.Embedding(n_segments, d_model)


    def forward(self, x, seg_location):
        seq_len= x.size(1)
        pos= torch.arange(seq_len, dtype=torch.long, device=x.device)
        # [seq_len, ] -> [batch_size, seq_len]
        pos= pos.unsqueeze(0).expand_as(x)

        embedding= self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg_location)

        return embedding


In [None]:
e= Embedding(d_model=256, vocab_size=tokenizer.vocab_size, input_length=max_input_length,
             n_segments=2)
e.to(device)

Embedding(
  (tok_embed): Embedding(30522, 256)
  (pos_embed): Embedding(256, 256)
  (seg_embed): Embedding(2, 256)
)

In [None]:
# we can pass in a batch of data through this module and see what we get.
dummy_embedding= e(sequence_tokens_batches[0], sentence_location_batches[0])

print(dummy_embedding.shape)
print(dummy_embedding)

torch.Size([128, 64, 256])
tensor([[[-0.2935, -0.2505, -0.3464,  ..., -1.4271,  1.4150,  0.4988],
         [-1.4458,  2.4072, -0.3145,  ..., -1.8255, -0.2334,  1.1606],
         [-0.6110,  0.8082, -1.3392,  ...,  0.1489,  1.7131,  0.6110],
         ...,
         [-1.9659, -0.7140,  0.2818,  ...,  0.7296, -1.3188, -0.2563],
         [-1.9306, -0.3924,  1.0346,  ...,  2.0033, -1.6667, -0.4071],
         [-1.6372,  0.5793,  0.6758,  ...,  0.7765, -1.3375, -0.4205]],

        [[-0.2935, -0.2505, -0.3464,  ..., -1.4271,  1.4150,  0.4988],
         [-1.3037,  1.4894,  0.6361,  ..., -1.1468,  0.0313,  0.4960],
         [-0.9860,  1.4732,  0.0764,  ..., -1.1544,  1.1918,  1.5997],
         ...,
         [-1.9659, -0.7140,  0.2818,  ...,  0.7296, -1.3188, -0.2563],
         [-1.9306, -0.3924,  1.0346,  ...,  2.0033, -1.6667, -0.4071],
         [-1.6372,  0.5793,  0.6758,  ...,  0.7765, -1.3375, -0.4205]],

        [[-0.2935, -0.2505, -0.3464,  ..., -1.4271,  1.4150,  0.4988],
         [-0.6913,

# Multi-Headed Self Attention

BERT is a transformer style model, so multi-headed self-attention is a critical component.

First of all, we can implement a single attention head. We'll assume the query, key, and value have already been created, which makes this quick to write. This module doesn't have any learnable parameters; those will live in the multi-headed self-attention mechanism, which will employ this as a sub-component.

In [None]:
class ScaledDotProductAttention(nn.Module):

    def __init__(self, dropout=0.1) -> None:
        super(ScaledDotProductAttention, self).__init__()
        self.dropout= nn.Dropout(p=dropout)


    def forward(self, Q, K, V):
        # Q, K, V of size [batch, head, seq_len, head_dim]
        attn= (Q @ K.transpose(-2, -1)) * (1.0 / math.sqrt(K.size(-1)))
        attn= F.softmax(attn, dim=-1)
        attn= self.dropout(attn)
        context= attn @ V

        return context, attn


In [None]:
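# sanity check
# note: a freshly constructed module is in training mode, so dropout (p=0.1) is active here;
# it randomly zeroes some attention weights and scales the rest by 1/0.9, which is why the
# rows below do not sum to 1. Call sample.eval() to get deterministic weights.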
q= torch.tensor([[[1.1, 1.3], [0.9, 0.8]]]).to(device)
k= torch.tensor([[[0.9, 1.0], [0.2, 2.1]]]).to(device)
v= torch.tensor([[[1.1, 1.3], [0.9, 0.8]]]).to(device)

sample= ScaledDotProductAttention().to(device)
sample(q, k, v)

(tensor([[[1.0856, 1.1030],
          [0.5572, 0.6586]]], device='cuda:0'),
 tensor([[[0.4282, 0.6829],
          [0.5066, 0.0000]]], device='cuda:0'))
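
The attention weights printed above look odd at first: the rows do not sum to 1 and one entry is exactly zero. That is the effect of dropout in training mode. The sketch below recomputes the weights for the same `q` and `k` in plain Python without dropout, then applies the 1 / (1 - 0.1) rescaling that dropout gives to surviving entries; `softmax` and `attention_weights` are helper names introduced only for this check.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K):
    """softmax(Q K^T / sqrt(d_k)) for lists of row vectors, with no dropout."""
    d_k = len(K[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    return [softmax(row) for row in scores]


Q = [[1.1, 1.3], [0.9, 0.8]]
K = [[0.9, 1.0], [0.2, 2.1]]
weights = attention_weights(Q, K)

# without dropout every row sums to 1
print([[round(w, 4) for w in row] for row in weights])
# dropout with p=0.1 rescales every surviving weight by 1 / (1 - 0.1)
print([[round(w / 0.9, 4) for w in row] for row in weights])
```

The rescaled values for the surviving entries agree with the 0.4282, 0.6829, and 0.5066 shown in the cell output; the fourth entry was dropped to zero in that particular run.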

We have a batch of examples which need to be turned into queries, keys, and values, which then need to be further divided into multiple heads. This means we effectively have two axes to parallelize self-attention across: the batch dimension and a new dimension for the heads. PyTorch's matrix multiplication treats all leading dimensions as batch dimensions, so we can parallelize across both by moving the head dimension right next to the batch dimension, giving tensors of shape [batch_size, n_heads, seq_len, head_dim].

Here we are defining a few hyperparameters that we'll use throughout training.
- n_heads specifies how many attention heads exist per MHSA block
- query_key_dim specifies how big the query and key vectors will be
- value_dim specifies how big the value vectors will be.
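
The reshaping described above can be sketched for a single example with nested lists. `split_heads` and `merge_heads` are hypothetical helpers that mimic what a `view` followed by a `permute` does to one sequence; the sizes are tiny and made up.

```python
def split_heads(x, n_heads, head_dim):
    """[seq_len][n_heads * head_dim] -> [n_heads][seq_len][head_dim]"""
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(n_heads)]


def merge_heads(heads):
    """[n_heads][seq_len][head_dim] -> [seq_len][n_heads * head_dim]"""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[pos]] for pos in range(seq_len)]


n_heads, head_dim, seq_len = 3, 2, 4
# each row stands in for one projected token vector of length n_heads * head_dim
x = [[pos * 10 + i for i in range(n_heads * head_dim)] for pos in range(seq_len)]

heads = split_heads(x, n_heads, head_dim)
print(len(heads), len(heads[0]), len(heads[0][0]))  # 3 4 2
print(merge_heads(heads) == x)                      # True
```

Each head sees a contiguous slice of every token vector, and merging the heads back simply concatenates those slices in head order.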

In [None]:
class MultiHeadSelfAttention(nn.Module):
    """
    MHSA has four sets of parameters, all of which are dense linear modules: three that turn the
    tensors of the model into the queries, keys, and values for MHSA, and one that turns the
    output of MHSA back into the shape needed for modeling.
    These are "pointwise dense modules", which is how nn.Linear behaves by default in PyTorch:
    it treats the last dimension as the vector dimension and applies the same weights to every
    vector. So if you have, for instance, an input of shape [batch_size, seq_len, input_dim] and
    you want to turn that into an output of shape [batch_size, seq_len, output_dim], you can use
    nn.Linear(input_dim, output_dim).
    """

    def __init__(self, d_model, query_key_dim, value_dim, n_heads, dropout=0.1) -> None:
        super(MultiHeadSelfAttention, self).__init__()
        self.query_key_dim= query_key_dim
        self.value_dim= value_dim
        self.n_heads= n_heads
        # defining the linear layers that construct the query, key, and value
        self.W_Q= nn.Linear(d_model, n_heads * query_key_dim)
        self.W_K= nn.Linear(d_model, n_heads * query_key_dim)
        self.W_V= nn.Linear(d_model, n_heads * value_dim)
        # parameterless system that calculates Attention
        self.attn= ScaledDotProductAttention(dropout)
        # projects final output of MHSA back into model dimension
        self.proj_back= nn.Linear(n_heads * value_dim, d_model)


    def forward(self, x):
        batch_size, seq_len, emb_dim= x.size()
        # passing x (embedding) through dense networks
        qs= self.W_Q(x) # [batch_size, seq_len, (n_heads * query_key_dim)]
        ks= self.W_K(x) # [batch_size, seq_len, (n_heads * query_key_dim)]
        vs= self.W_V(x) # [batch_size, seq_len, (n_heads * value_dim)]

        # dividing out heads -- [batch_size, seq_len, n_heads, qk/v_dim]
        qs= qs.view(batch_size, -1, self.n_heads, self.query_key_dim)
        ks= ks.view(batch_size, -1, self.n_heads, self.query_key_dim)
        vs= vs.view(batch_size, -1, self.n_heads, self.value_dim)

        # moving the head dimension next to the batch dimension
        qs= qs.permute(0, 2, 1, 3)
        ks= ks.permute(0, 2, 1, 3) # [batch_size, n_heads, seq_len, qk/v_dim]
        vs= vs.permute(0, 2, 1, 3)

        # passing batches/heads through the 'scaled dot product'
        head_results, _= self.attn(qs, ks, vs)

        # permuting back between head and seq_len dimensions
        head_results= head_results.permute(0, 2, 1, 3) # [batch_size, seq_len, n_heads, value_dim]

        # combining the last dim to effectively concatenate the result of the heads
        # [batch_size, seq_len, (n_heads * value_dim)]
        head_results= head_results.contiguous().view(batch_size, seq_len, -1)

        # projecting result of head back into model dimension
        return self.proj_back(head_results)


In [None]:
# example usage
d_model= 256
sample_embeddings= torch.tensor([[[1.1] * d_model] * max_input_length] * batch_size).to(device)
print("Sample embeddings shape:", sample_embeddings.shape)

attn= MultiHeadSelfAttention(d_model=d_model, query_key_dim=64, value_dim=64, n_heads=3).to(device)
output= attn(sample_embeddings)
print('Output shape of MHSA:', output.shape)
# the output should be the same size as the input

Sample embeddings shape: torch.Size([128, 64, 256])
Output shape of MHSA: torch.Size([128, 64, 256])


# Pointwise Feed Forward

We already used pointwise dense layers to construct the query, key, and value in multi-headed self-attention. As per the classic transformer architecture, the same process is also applied to the token vectors themselves.

Just like in multi-headed self-attention, this applies a neural network to each word vector individually, allowing the model to learn to manipulate individual vectors as necessary.

In [None]:
class PointwiseFeedForwardNet(nn.Module):
    """
    In this particular implementation we are expanding the vectors to four times their length with
    a neural network, applying a non-linear activation function, then compressing that data back
    into the original model dimension length (d_model=256, in this example).
    Basically, we are allowing our model to stretch each word vector out into a bigger
    representation, allowing the model to represent each vector in a diverse number of ways, then
    we are passing that larger representation through a function that manipulates the vector in
    complex ways. The model can learn to exploit that large number of complex representations to
    create better word vectors, which are then compressed back into the original modeling dimension.
    """

    def __init__(self, d_model, d_ff, dropout=0.1) -> None:
        super(PointwiseFeedForwardNet, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.gelu= nn.GELU()
        self.fc2 = nn.Linear(d_ff, d_model)
        self.dropout= nn.Dropout(p=dropout)


    def forward(self, x):
        x= self.fc1(x) # [batch_size, seq_len, d_model] -> [batch_size, seq_len, d_ff]
        x= self.gelu(x)
        x= self.dropout(x)
        x= self.fc2(x) # [batch_size, seq_len, d_ff] -> [batch_size, seq_len, d_model]

        return x


In [None]:
# example usage
d_model= 256
d_ff= 4 * d_model

sample= PointwiseFeedForwardNet(d_model=d_model, d_ff=d_ff).to(device)
sample_embeddings= torch.tensor([[[1.1] * d_model] * max_input_length] * batch_size).to(device)
sample(sample_embeddings).shape

torch.Size([128, 64, 256])

# The Encoder Block

Now that we have multi-headed self-attention and pointwise feed forward figured out, we can implement the entire encoder block.

Here we are:
- Passing the input through multi-headed self-attention
- Adding the original input to the output of MHSA, combining both, creating the first skip connection.
- Passing that through pointwise feed forward
- Adding the output of pointwise feed forward to the previous skip connection output, creating the second skip connection.

Skip connections help a model learn more easily by combining simple and more complex information together, allowing the model to use both to its advantage.

Very few details about the Transformer have changed in the last five years, but one of them departs slightly from the original paper. In the original, Add & Norm is applied **after** each transformation, such as multi-head attention (Post-LN). It is now more common to apply LayerNorm **before** the transformation (Pre-LN), inside the residual branch. This is called the **pre-norm formulation**, and it is the one we are going to implement as well.
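
The difference between the two orderings is easiest to see side by side. This is a minimal sketch on a single vector in plain Python: `layer_norm` omits the learnable scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or feed forward.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln(x, sublayer):
    """Original Transformer ordering: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    """Pre-norm ordering: x + sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(v):
    # hypothetical stand-in for attention or feed forward
    return [0.5 * a for a in v]


x = [1.0, 2.0, 4.0, 9.0]
print(post_ln(x, toy_sublayer))
print(pre_ln(x, toy_sublayer))
```

In the pre-norm form the input `x` passes through the skip connection untouched, while in the post-norm form even the skip path is normalized at every block.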

In [None]:
class ResidualConnection(nn.Module):
    """
    A residual connection around a layer-normalized sublayer: x + dropout(sublayer(norm(x))).
    Note that, for code simplicity, the norm is applied first (Pre-LN) as opposed to last.
    """

    def __init__(self, size, dropout=0.0) -> None:
        super(ResidualConnection, self).__init__()
        self.norm= nn.LayerNorm(size)
        self.dropout= nn.Dropout(p=dropout)


    def forward(self, x, sublayer):
        # apply residual connection to any sublayer with the same size
        pre_ln= self.norm(x)
        x= x + self.dropout(sublayer(pre_ln))

        return x


In [None]:
class EncoderBlock(nn.Module):
    """
    Defining the Encoder block.
    """

    def __init__(self, d_model, d_ff, query_key_dim, value_dim, n_heads, dropout=0.0) -> None:
        super(EncoderBlock, self).__init__()
        self.connection1= ResidualConnection(d_model, dropout)
        self.mhsa= MultiHeadSelfAttention(d_model, query_key_dim, value_dim, n_heads, dropout)
        self.connection2= ResidualConnection(d_model, dropout)
        self.pwff= PointwiseFeedForwardNet(d_model, d_ff, dropout)


    def forward(self, x):
        x= self.connection1(x, self.mhsa)
        x= self.connection2(x, self.pwff)

        return x


In [None]:
# example usage
sample= EncoderBlock(d_model, d_ff, query_key_dim=64, value_dim=64, n_heads=3).to(device)
sample_embeddings= torch.tensor([[[1.1] * d_model] * max_input_length] * batch_size).to(device)
sample(sample_embeddings).shape

torch.Size([128, 64, 256])

Let's get onto the fun stuff, actually building BERT!

# Building BERT

BERT is straightforward if you understand the two major subcomponents we have already created:
- It has an embedding sub-module which turns the token_ids of the input into vectors
- It has a bunch of encoder blocks which manipulate the input to create a dense and meaning rich representation. Under the hood these consist of multi-headed self-attention and pointwise feed forward layers.

The only new things are the projection head and the classifier, which turn certain vectors in the output of the last encoder layer into predictions. The classifier looks at the first output vector, which corresponds to the [CLS] token that always starts the input (we set our data up that way), and predicts whether or not the sentences in the input belong together.

By default (n_layers=2) we only use two encoder blocks to speed up training. A full-size BERT model stacks many more encoder blocks on top of one another.

In [None]:
class BERT(nn.Module):
    """
    BERT consists of the following:
    - An embedding module
    - A list of encoder blocks
    - A dense network for classifying if sequences are a positive pair
    - A dense network for projecting word vectors into probabilities
    """

    def __init__(self, vocab_size, input_length, n_segments, n_layers=2, d_model=256, d_ff=1024,
                 n_heads=4, query_key_dim=64, value_dim=64, dropout=0.1) -> None:
        super(BERT, self).__init__()
        self.input_length= input_length
        # for converting tokens into vector embeddings
        self.embedding= Embedding(d_model, vocab_size, input_length, n_segments)
        # encoder blocks stacked on top of one another
        self.encoder_blocks= nn.ModuleList([
            EncoderBlock(d_model, d_ff, query_key_dim, value_dim, n_heads, dropout)
            for _ in range(n_layers)
        ])
        self.ln_final= nn.LayerNorm(d_model)
        # for projecting a word vector (or tensor of them) into token predictions
        self.proj_head= nn.Linear(d_model, vocab_size, bias=False)
        # for converting the first output token into a binary classification
        self.classifier= nn.Linear(d_model, 1, bias=False)

        # initialize Linear modules with Glorot / fan_avg
        # let LayerNorm and Embedding modules use default initializations
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None: nn.init.zeros_(m.bias)


    def forward(self, x, seg_location, masked_token_locations):
        # x is our token indices of shape [batch_size x seq_len]
        batch_size, seq_len= x.size()
        assert seq_len <= self.input_length, f'Cannot forward sequence of length {seq_len}, block size is only {self.input_length}'

        embeddings= self.embedding(x, seg_location)
        # x of shape [batch x seq_len x model_dim]
        x= embeddings

        for block in self.encoder_blocks:
            x= block(x)
        # forward the final layerNorm
        x= self.ln_final(x)

        # for every example in the batch, this takes the first vector and passes it to the
        # classifier linear network for prediction
        clsf_logits= self.classifier(x[:,0,:])

        # passing the encoder outputs at the masked positions through the projection head
        masked_token_embeddings= x[masked_token_locations.bool()]
        token_logits= self.proj_head(masked_token_embeddings)

        return clsf_logits, token_logits


Notice how the classifier is defined as a linear network of output size 1. This is because we're making a binary classification (yes or no). The classifier outputs a raw score (a logit); once that score is squashed into a probability with a sigmoid, values over 0.5 will be interpreted as true, while values of less than 0.5 will be interpreted as false. The projection head does something similar, except it looks at all masked tokens, and instead of making a true or false prediction, it has to predict what token should be there. Thus, its output is of length tokenizer.vocab_size, meaning we predict, out of all tokens, which token a particular masked word should be.
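
As a small sketch of how the two outputs could be turned into predictions at inference time: the functions `predict_is_pair` and `predict_token` below are hypothetical, and they assume the classifier's output is a raw logit that still needs a sigmoid.

```python
import math


def sigmoid(z):
    """Squash a raw logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_is_pair(clsf_logit):
    """sigmoid(z) > 0.5 exactly when z > 0, so thresholding the logit at 0 is equivalent."""
    return sigmoid(clsf_logit) > 0.5


def predict_token(token_logits):
    """Pick the vocabulary index with the largest logit."""
    return max(range(len(token_logits)), key=lambda i: token_logits[i])


print(predict_is_pair(1.3), predict_is_pair(-0.4))  # True False
print(predict_token([0.1, 2.5, -1.0, 0.7]))         # 1
```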

In [None]:
# --- BERT LARGE hyperparameters config ---
vocab_size= 30522
input_length= 512
n_segments= 2
N= 24
d_model= 1024
d_ff= 4 * d_model
h= 16
dk= 64
dv= 64
dropout=0.1

model= BERT(vocab_size, input_length, n_segments, N, d_model, d_ff, h, dk, dv, dropout).to(device)

total_params= sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of parameters: {total_params}\n')

model

Number of parameters: 365347840



BERT(
  (embedding): Embedding(
    (tok_embed): Embedding(30522, 1024)
    (pos_embed): Embedding(512, 1024)
    (seg_embed): Embedding(2, 1024)
  )
  (encoder_blocks): ModuleList(
    (0-23): 24 x EncoderBlock(
      (connection1): ResidualConnection(
        (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (mhsa): MultiHeadSelfAttention(
        (W_Q): Linear(in_features=1024, out_features=1024, bias=True)
        (W_K): Linear(in_features=1024, out_features=1024, bias=True)
        (W_V): Linear(in_features=1024, out_features=1024, bias=True)
        (attn): ScaledDotProductAttention(
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (proj_back): Linear(in_features=1024, out_features=1024, bias=True)
      )
      (connection2): ResidualConnection(
        (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (pwff):

# Pre-Training BERT

Self Supervised Training for pre training the model

- BERT uses a pre training step which is designed to encourage the model to understand language generally, then allows for fine tuning to allow the model to learn specific tasks.

- BERT is pre-trained on two objectives simultaneously: "masked language modeling", which is like fill in the blank, and "next sentence prediction", which is essentially asking the model to predict if two sentences make sense with one another.
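
To make the "fill in the blank" objective concrete, here is a simplified sketch of how one training example could be masked. Everything in it is an assumption for illustration: `mask_tokens`, `MASK_ID` and `SPECIAL_IDS` are hypothetical names, the token ids are made up, and the special ids follow the bert-base-uncased vocabulary. The notebook's own `mask_batch` function, defined earlier, may differ in its details.

```python
import random

MASK_ID = 103                 # assumed [MASK] id (bert-base-uncased vocabulary)
SPECIAL_IDS = {0, 101, 102}   # assumed [PAD], [CLS] and [SEP] ids

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Hypothetical helper: replace random non-special tokens with [MASK]."""
    rng = random.Random(seed)
    masked = list(token_ids)
    locations = [0] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # special tokens and padding are never masked
        if tok not in SPECIAL_IDS and rng.random() < mask_prob:
            masked[i] = MASK_ID
            locations[i] = 1
    return masked, locations

# made-up ids for a sentence pair followed by two padding positions
tokens = [101, 2023, 2003, 102, 2009, 2573, 102, 0, 0]
masked, locations = mask_tokens(tokens, mask_prob=0.5)

# masked language modeling target: the original tokens at the masked positions
mlm_targets = [t for t, is_masked in zip(tokens, locations) if is_masked]
# next sentence prediction target: 1 if the second sentence really follows the first
nsp_target = 1

print(masked, locations, mlm_targets, nsp_target)
```

The model sees `masked`, is scored on how well it recovers `mlm_targets`, and separately on whether it predicts `nsp_target`.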

In [None]:
def self_supervised_train(model, sen_loc_batches, seq_tkn_batches, labels_batches,
                          lr=1e-3, epochs=50, verbose:bool=True):

    # expect indices, not one-hot vectors
    token_criterion= nn.CrossEntropyLoss()
    # for logits directly
    classification_criterion= nn.BCEWithLogitsLoss()

    optimizer= optim.Adam(model.parameters(), lr=lr)

    # keeping track of the losses across all epochs
    losses= [[]]

    # these epochs can take a while, keeping it at a fairly small number
    for epoch in range(epochs):
        for location_batch, sequence_batch, classtarg_batch in tqdm(zip(sen_loc_batches,
                                                                        seq_tkn_batches,
                                                                        labels_batches)):
            # zeroing out gradients from last iteration
            optimizer.zero_grad()

            # masking the tokens in the input sequence
            masked_tokens, masked_token_locations= mask_batch(sequence_batch)

            # generating class and masked token predictions
            clsf_logits, token_logits= model(masked_tokens, location_batch, masked_token_locations)

            # setting up target for masked token prediction
            masked_token_targets= sequence_batch[masked_token_locations.bool()]

            # calculating loss for masked language modeling
            loss_mlm= token_criterion(token_logits, masked_token_targets)

            # calculating loss for next sentence classification
            loss_clsf= classification_criterion(clsf_logits.squeeze(), classtarg_batch.float())

            # combining losses
            loss= loss_mlm + loss_clsf

            # keeping track of loss across the current epoch
            losses[-1].append(float(loss))

            # backpropagation
            loss.backward()
            optimizer.step()


        if verbose:
            print(f'=======Epoch {epoch} Completed=======')
            print(f'Average loss in this epoch: {np.mean(losses[-1])}')
        losses.append([])

    return losses


In [None]:
# --- BERT NANO (17M params) hyperparameters config ---
vocab_size= tokenizer.vocab_size
input_length= max_input_length
n_segments= 2
N= 2
d_model= 256
d_ff= 4 * d_model
h= 4
dk= 64
dv= 64
dropout=0.1

model= BERT(vocab_size, input_length, n_segments, N, d_model, d_ff, h, dk, dv, dropout).to(device)

losses_hist= self_supervised_train(model, sentence_location_batches, sequence_tokens_batches,
                                   is_positives_batches, lr=0.001, epochs=10)

7311it [18:08,  6.72it/s]


Average loss in this epoch: 7.618604445715085


7311it [16:55,  7.20it/s]


Average loss in this epoch: 7.4066862244724065


7311it [16:53,  7.21it/s]


Average loss in this epoch: 7.373141652499744


7311it [16:48,  7.25it/s]


Average loss in this epoch: 7.358185444757365


7311it [16:50,  7.23it/s]


Average loss in this epoch: 7.338351139469848


7311it [16:42,  7.29it/s]


Average loss in this epoch: 7.31868136028571


7311it [16:32,  7.36it/s]


Average loss in this epoch: 7.299446144242275


7311it [16:44,  7.27it/s]


Average loss in this epoch: 7.287740415007777


7311it [16:44,  7.28it/s]


Average loss in this epoch: 7.276668093074381


7311it [16:46,  7.26it/s]

Average loss in this epoch: 7.268032291489576





Working through some highlights:
- We define our model, and put it on the device (a CPU or GPU)
- We define "criteria", these are the functions which will calculate the loss (how wrong the model was) from masked language modeling and next sentence prediction
- We define an optimizer, which will look at how large the loss was and update the model accordingly
- We go over all the data over n epochs
- We iterate over all batches of the data
- We mask our batch randomly with the masking function we defined previously
- We run the masked tokens, along with location information and where the masked tokens are, through the model. We get back predictions for next sentence prediction and predictions as to what the model thinks each masked token should be
- We pass our predictions through each respective criterion, with what the outputs should have been, to calculate loss
- We call loss.backward() to calculate how the model should change to be less wrong at this particular example
- We allow the optimizer to update the model based on the model's performance on this batch
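
The loss arithmetic in these steps can be worked through by hand. The sketch below is plain Python with made-up logits; `cross_entropy` and `bce_with_logits` are hypothetical helper names that mirror, for a single example, what `nn.CrossEntropyLoss` and `nn.BCEWithLogitsLoss` compute before averaging over the batch.

```python
import math

def cross_entropy(logits, target_index):
    # softmax cross-entropy for one masked position: -log softmax(logits)[target]
    max_logit = max(logits)  # subtracting the max keeps exp() numerically stable
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]

def bce_with_logits(logit, target):
    # binary cross-entropy computed from a raw logit; target is 0.0 or 1.0
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# made-up logits over a 4-token toy vocabulary; the correct token is index 2
loss_mlm = cross_entropy([0.5, -1.0, 2.0, 0.1], target_index=2)
# made-up next-sentence logit; the pair really is consecutive (target 1.0)
loss_clsf = bce_with_logits(1.2, 1.0)
# the two objectives are simply added, as in the training loop
loss = loss_mlm + loss_clsf

# reference points: a model that guesses uniformly
uniform_mlm_loss = math.log(30522)  # uniform over the 30522-token vocabulary
uniform_clsf_loss = math.log(2)     # coin flip on the sentence pair
print(loss_mlm, loss_clsf, loss, uniform_mlm_loss, uniform_clsf_loss)
```

For scale, uniform guessing over 30522 tokens gives a masked language modeling loss of ln(30522), roughly 10.3, and a coin-flip classifier adds ln(2), roughly 0.69, so the combined losses printed during training sit below that baseline.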

# Fine Tuning BERT

The exact process of fine tuning depends on the type of data you're trying to fine tune against. Let's use sentiment analysis as an example.

We now have a BERT model that has some understanding of text. Let's use it to do something. The amazon_polarity dataset is an open dataset of Amazon reviews that contains information about whether a review is positive or negative. It consists of a big batch of review titles, review content, and labels saying if the review is positive or negative.

In [None]:
fine_tune_ds= load_dataset('fancyzhx/amazon_polarity')

for elem in fine_tune_ds['train']:
    print(elem)
    break

README.md:   0%|          | 0.00/6.81k [00:00<?, ?B/s]

train-00000-of-00004.parquet:   0%|          | 0.00/260M [00:00<?, ?B/s]

train-00001-of-00004.parquet:   0%|          | 0.00/258M [00:00<?, ?B/s]

train-00002-of-00004.parquet:   0%|          | 0.00/255M [00:00<?, ?B/s]

train-00003-of-00004.parquet:   0%|          | 0.00/254M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3600000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/400000 [00:00<?, ? examples/s]

{'label': 1, 'title': 'Stuning even for the non-gamer', 'content': 'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'}


We are going to use this data to fine tune our BERT model to predict if the review is positive or negative.

First we need to turn this data into data that makes sense in a BERT model. The exact approach for this process can vary from task to task. Luckily for us this dataset consists of pairs of text (the title and content), so we can format the fine-tuning data just like we formatted the pre-training data previously.

In [None]:
def preprocess_ft_data(data, input_length, max_num=100000):

    data_tokens= []
    data_positional= []
    data_targets= []

    # unpacking data
    for i, elem in enumerate(data):

        # stopping once max_num examples have been collected
        if i>= max_num:
            break

        # tokenizing the title and content
        sentence1= elem['title']
        sentence2= elem['content']
        tokens= tokenizer([sentence1, sentence2])
        sentence1_tokens= tokens['input_ids'][0]
        sentence2_tokens= tokens['input_ids'][1]

        # trimming down tokens so the pair fits in input_length
        # (101 and 102 are the [CLS] and [SEP] ids in the bert-base-uncased vocabulary)
        if len(sentence1_tokens) + len(sentence2_tokens) > input_length:
            sentence1_tokens= [101] + sentence1_tokens[-int(input_length / 2) + 1:]
            sentence2_tokens= sentence2_tokens[:int(input_length / 2) - 1] + [102]

        # creating segment ids: 0 for the title, 1 for the content
        sentence_tokens= [0] * len(sentence1_tokens) + [1] * len(sentence2_tokens)

        # combining and padding (padding positions get token id 0 and segment id 1)
        pad_num= input_length - (len(sentence1_tokens) + len(sentence2_tokens))
        sequence_tokens= sentence1_tokens + sentence2_tokens + [0] * pad_num
        sentence_location_tokens= sentence_tokens + [1] * pad_num

        data_tokens.append(sequence_tokens)
        data_positional.append(sentence_location_tokens)
        data_targets.append(elem['label'])

    return torch.tensor(data_positional), torch.tensor(data_tokens), torch.tensor(data_targets)
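
To see what a single packed example looks like without loading the tokenizer, here is a toy sketch. `pack_pair` is a hypothetical helper and the token ids are made up, with 101 and 102 standing in for [CLS] and [SEP]; it only mirrors the combining and padding step of the function above, not the trimming.

```python
def pack_pair(first_ids, second_ids, input_length, pad_id=0):
    """Hypothetical helper: join two already-tokenized texts into model inputs."""
    n_used = len(first_ids) + len(second_ids)
    assert n_used <= input_length, 'trim the texts before packing'
    pad_num = input_length - n_used
    tokens = first_ids + second_ids + [pad_id] * pad_num
    # segment 0 marks the first text; segment 1 marks the second text and the padding
    segments = [0] * len(first_ids) + [1] * (len(second_ids) + pad_num)
    return tokens, segments

# made-up token ids for a short title and a short review body
title_ids = [101, 7, 8, 102]
content_ids = [101, 9, 10, 11, 102]
tokens, segments = pack_pair(title_ids, content_ids, input_length=12)

print(tokens)    # [101, 7, 8, 102, 101, 9, 10, 11, 102, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```

Every example ends up with exactly `input_length` entries in both lists, which is what lets them be stacked into tensors.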


In [None]:
# processing data into modeling data
train_pos, train_tok, train_targ= preprocess_ft_data(fine_tune_ds['train'], max_input_length)
test_pos, test_tok, test_targ= preprocess_ft_data(fine_tune_ds['test'], max_input_length)

# moving training to device
train_pos= train_pos.to(device)
train_tok= train_tok.to(device)
train_targ= train_targ.to(device)

# moving testing to device
test_pos= test_pos.to(device)
test_tok= test_tok.to(device)
test_targ= test_targ.to(device)

Supervised Training for fine tuning the model

We can run the same training code as before, except on the fine-tuning dataset and with the pre-trained model with a new classification head. Here we don't care about the masked language modeling objective, so we pass the original tokens into the model rather than the masked ones. The cleanest way to do this is to pass a mask of all zeros, so that no tokens are sent through the projection head.

Actually, the first time I ran this model I forgot to change any of this code, so I was still optimizing on masked language modeling as well as on the classification of positive or negative reviews. The function below optimizes only the classification loss. I'm sure you could experiment with the joint strategy; there might be some merit to getting the model to better understand the type of text used in reviews specifically.

In [None]:
def supervised_train(model, train_loc_batches, train_tkn_batches, train_lbl_batches, batch_size,
                     lr=1e-3, epochs=50, verbose:bool=True):

    # for logits directly
    classification_criterion= nn.BCEWithLogitsLoss()

    # resetting the optimizer to have access to the parameters of the new head
    optimizer= optim.Adam(model.parameters(), lr=lr)

    # keeping track of the losses across all epochs, one list per epoch
    ft_losses= [[]]

    for epoch in range(epochs):
        for i in tqdm(range(0, train_loc_batches.shape[0], batch_size)):

            if i+batch_size>= train_loc_batches.shape[0]:
                break

            # getting batch
            train_pos_batch = train_loc_batches[i:i+batch_size]
            train_tok_batch = train_tkn_batches[i:i+batch_size]
            train_targ_batch= train_lbl_batches[i:i+batch_size]

            # zeroing out gradients from last iteration
            optimizer.zero_grad()

            # fine tuning does not use masked language modeling, so we pass an all-zero mask:
            # no token is sent through the projection head, and the model sees the original
            # tokens, since [MASK] never appears when the model is used for inference
            masked_token_locations= torch.zeros_like(train_tok_batch)

            # generating class predictions; token_logits is empty and unused here
            clsf_logits, token_logits= model(train_tok_batch, train_pos_batch, masked_token_locations)

            # calculating loss for the positive / negative review classification
            loss_clsf= classification_criterion(clsf_logits.squeeze(), train_targ_batch.float())

            # the classification loss is the only objective during fine tuning
            loss= loss_clsf

            ft_losses[-1].append(float(loss))

            # backpropagation
            loss.backward()
            optimizer.step()

        if verbose:
            print(f'=======Epoch {epoch} Completed=======')
            print(f'Average loss in this epoch: {np.mean(ft_losses[-1])}')
        ft_losses.append([])

    return ft_losses


Before we fine tune, let's replace the classifier with a randomly initialized linear layer. This allows us to preserve BERT's general language understanding, but start fresh in the part of the model that's doing the classification, which is good to do because we are classifying something completely different.

In [None]:
# the new training objective is still binary classification, except these parameters will be used
# to decide if a review was positive or negative
model.classifier= nn.Linear(d_model, 1, bias=False).to(device)

losses_hist= supervised_train(model, train_pos, train_tok, train_targ, batch_size,
                              lr=0.001, epochs=15)

100%|█████████▉| 781/782 [01:31<00:00,  8.50it/s]


Average loss in this epoch: 0.5975763738231683


100%|█████████▉| 781/782 [01:29<00:00,  8.76it/s]


Average loss in this epoch: 0.4568517957927323


100%|█████████▉| 781/782 [01:30<00:00,  8.60it/s]


Average loss in this epoch: 0.3947166712542044


100%|█████████▉| 781/782 [01:29<00:00,  8.72it/s]


Average loss in this epoch: 0.359711596434614


100%|█████████▉| 781/782 [01:30<00:00,  8.62it/s]


Average loss in this epoch: 0.3369802456353904


100%|█████████▉| 781/782 [01:29<00:00,  8.73it/s]


Average loss in this epoch: 0.31829915106983403


100%|█████████▉| 781/782 [01:30<00:00,  8.58it/s]


Average loss in this epoch: 0.30147400132062036


100%|█████████▉| 781/782 [01:31<00:00,  8.54it/s]


Average loss in this epoch: 0.2875791148362483


100%|█████████▉| 781/782 [01:29<00:00,  8.68it/s]


Average loss in this epoch: 0.27459024081767447


100%|█████████▉| 781/782 [01:29<00:00,  8.69it/s]


Average loss in this epoch: 0.2645113267548258


100%|█████████▉| 781/782 [01:30<00:00,  8.63it/s]


Average loss in this epoch: 0.2537397682266107


100%|█████████▉| 781/782 [01:30<00:00,  8.61it/s]


Average loss in this epoch: 0.24425093834401704


100%|█████████▉| 781/782 [01:29<00:00,  8.69it/s]


Average loss in this epoch: 0.23666734255077593


100%|█████████▉| 781/782 [01:31<00:00,  8.58it/s]


Average loss in this epoch: 0.2266305345331203


100%|█████████▉| 781/782 [01:29<00:00,  8.74it/s]

Average loss in this epoch: 0.21977813532349394





This dataset has a test set, so we can apply this fine tuned BERT model to see how good it is at classifying reviews it's never seen before.

In [None]:
def test_accuracy(model, test_loc_batches, test_tkn_batches, test_lbl_batches, batch_size):

    is_correct= []
    predicted_class= []
    original_class= []

    # switching off dropout and gradient tracking while evaluating
    model.eval()
    with torch.no_grad():
        for i in tqdm(range(0, test_loc_batches.shape[0], batch_size)):

            if i+batch_size>= test_loc_batches.shape[0]:
                break

            # getting batch
            test_pos_batch = test_loc_batches[i:i+batch_size]
            test_tok_batch = test_tkn_batches[i:i+batch_size]
            test_targ_batch= test_lbl_batches[i:i+batch_size]

            # making prediction, not masking anything (all-zero mask on the same device)
            no_mask= torch.zeros_like(test_tok_batch)
            clsf_logits, _= model(test_tok_batch, test_pos_batch, no_mask)

            # converting logits to probabilities then rounding to classifications
            res= torch.sigmoid(clsf_logits).round().squeeze()

            # keeping track of the original class (positive or negative) and if the model was correct
            original_class.extend(np.array(test_targ_batch.to('cpu')))
            is_correct.extend(np.array((res== test_targ_batch).to('cpu')))
            predicted_class.extend(np.array(res.detach().to('cpu')))

    # accuracy rate, original_class, and predicted_class
    return (sum(list(is_correct)) / len(is_correct)), original_class, predicted_class


In [None]:
acc, original_class, predicted_class= test_accuracy(model, test_pos, test_tok, test_targ, batch_size)
print(f'\nBERT-based model accuracy: {(acc * 100):02.2f}%')

100%|█████████▉| 781/782 [00:08<00:00, 89.53it/s]


BERT-based model accuracy: 84.78%





In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print(classification_report(original_class, predicted_class))

              precision    recall  f1-score   support

           0       0.85      0.85      0.85     49405
           1       0.85      0.85      0.85     50563

    accuracy                           0.85     99968
   macro avg       0.85      0.85      0.85     99968
weighted avg       0.85      0.85      0.85     99968
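
The numbers in a report like this can be recomputed from the two label lists with a few counts. The sketch below is plain Python on made-up labels; `binary_metrics` is a hypothetical helper, not part of scikit-learn, and it reports the metrics for a single class (the "support" column in the report is simply the number of true examples of each class).

```python
def binary_metrics(original, predicted, positive=1):
    """Accuracy, precision, recall and F1 for one class, from two label lists."""
    pairs = list(zip(original, predicted))
    tp = sum(1 for o, p in pairs if o == positive and p == positive)  # true positives
    fp = sum(1 for o, p in pairs if o != positive and p == positive)  # false positives
    fn = sum(1 for o, p in pairs if o == positive and p != positive)  # false negatives
    correct = sum(1 for o, p in pairs if o == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# made-up labels: 1 is a positive review, 0 is a negative review
toy_original  = [1, 1, 1, 0, 0, 0, 1, 0]
toy_predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(toy_original, toy_predicted))  # (0.75, 0.75, 0.75, 0.75)
```

Calling it with `positive=0` gives the row for the negative class; averaging the two rows gives the "macro avg" line.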



We got a model that could classify if a review was positive or negative with 84.78% accuracy. That might not sound that impressive, but the BERT model used in this example is virtually microscopic. If you used more encoder blocks, a larger model dimension, and played around with a few other model parameters, I think you could easily pass 90%.

We built a BERT style model from scratch. Along the way we explored tokenization, data processing, embedding, multi-headed self-attention, pointwise feed forward, pre-training, and fine tuning. By the end of that process we had trained the model on Wikipedia articles to understand text, then fine-tuned it to classify if product reviews were positive or negative.

In [None]:
# https://towardsdatascience.com/bert-intuitively-and-exhaustively-explained-48a24ecc1c8a

In [None]:
# https://github.com/DanielWarfield1/MLWritingAndResearch/blob/main/BERTFromScratch.ipynb