* In this assignment you will implement an encoder model trained with only the **Masked Language Modelling (MLM)** objective.
* We will use a simple BERT with the following modifications:
  * We use only MLM (masking words) and **skip** the NSP (Next Sentence Prediction) objective
  * Therefore, we won't use the [CLS] token
* Again, it is absolutely fine if your loss value does not match the one given here.
* Just ensure that the model overfits the training data
* You may increase the size of the training data if you want to test your implementation. In that case, we recommend using the tokenizer library from Hugging Face



# Installations

In [None]:
!pip install torchdata==0.6.0 # to be compatible with torch 2.0
!pip install portalocker==2.0.0
!pip install -U torchtext==0.15.1

# Common Imports

In [None]:
import torch
from torch import Tensor

import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.nn.functional import one_hot

import torch.optim as optim

#text lib
import torchtext

# tokenizer
from torchtext.data.utils import get_tokenizer

#build vocabulary
from torchtext.vocab import vocab
from torchtext.vocab import build_vocab_from_iterator

# get input_ids (numericalization)
from torchtext.transforms import VocabTransform

# get embeddings
from torch.nn import Embedding

from pprint import pprint
from yaml import safe_load
import copy
import numpy as np

In [None]:
import nltk
nltk.download('punkt')

# Tokenize the given text

In [None]:
batch_size = 10

In [None]:
class Tokenizer(object):

  def __init__(self,text):
    self.text = text
    self.word_tokenizer = get_tokenizer(tokenizer="basic_english",language='en')
    self.vocab_size = None

  def get_tokens(self):
    for sentence in self.text.strip().split('\n'):
      yield self.word_tokenizer(sentence)

  def build_vocab(self):
    v = build_vocab_from_iterator(self.get_tokens(),
                                  min_freq=1,specials=['<unk>','<mask>'])
    v.set_default_index(v['<unk>']) # index of OOV
    self.vocab_size = len(v)
    return v

  def token_ids(self):
    v = self.build_vocab()
    vt = VocabTransform(v)
    tokens = self.word_tokenizer(self.text)
    num_tokens = len(tokens)
    # the reshape below only works if num_tokens is divisible by batch_size
    max_seq_len = int(np.ceil(num_tokens/batch_size))
    data = torch.tensor(vt(tokens),dtype=torch.int64)
    return data.reshape(batch_size,max_seq_len)



In [None]:
text = """Best known for the invention of Error Correcting Codes, he was a true polymath who applied his mathematical and problem-solving skills to numerous disciplines.
Reflecting on the significant benefits I received from Hamming, I decided to develop a tribute to his legacy. There has not been a previous biography of Hamming, and the few articles about him restate known facts and assumptions and leave us with open questions.
One thought drove me as I developed this legacy project: An individual's legacy is more than a list of their attempts and accomplishments. Their tribute should also reveal the succeeding generations they inspired and enabled and what each attempted and achieved.
This book is a unique genre containing my version of a biography that intertwines the story "of a life" and a multi-player memoir with particular events and turning points recalled by those, including me, who he inspired and enabled.
Five years of research uncovered the people, places, opportunities, events, and influences that shaped Hamming. I discovered unpublished information, stories, photographs, videos, and personal remembrances to chronicle his life, which helped me put Hamming's
legacy in the context I wanted.The result demonstrates many exceptional qualities, including his noble pursuit of excellence and helping others. Hamming paid attention to the details, his writings continue to influence, and his guidance is a timeless gift to the world.
This biography is part of """

In [None]:
Tk = Tokenizer(text)

In [None]:
input_ids = Tk.token_ids()
print(input_ids.shape)

* We need to mask some words randomly based on the mask probability
* The token id for `<mask>` is 1
* The function given below takes in the input ids and replaces some of the ids with 1 (the token id for `<mask>`)
* Since the loss is computed only over the predictions for masked tokens, we set the labels of all non-masked positions to -100 (the default `ignore_index` of `nn.CrossEntropyLoss`)
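* To see why -100 works as a label for non-masked positions, here is a minimal sketch with made-up logits and labels (the sizes are illustrative, not the assignment's data). It suggests that `nn.CrossEntropyLoss` averages only over the positions whose label is not -100.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy example: 2 sequences of 3 tokens each, vocabulary of 5 (illustrative sizes)
logits = torch.randn(2, 3, 5)
labels = torch.tensor([[-100, 4, -100],
                       [2, -100, -100]])

criterion = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# CrossEntropyLoss expects (N, C) logits and (N,) targets, so flatten batch and sequence
loss_all = criterion(logits.reshape(-1, 5), labels.reshape(-1))

# the same loss computed by hand over the two labelled positions only
keep = labels != -100
loss_masked_only = nn.functional.cross_entropy(logits[keep], labels[keep])

print(loss_all.item(), loss_masked_only.item())  # the two values agree
```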

In [None]:
def getdata(ip_ids,mask_token_id,mask_prob=0.2):
  masked_ids = copy.deepcopy(ip_ids)
  # uniform draw in [0,1): each position is masked with probability mask_prob
  mask_random_idx = torch.rand_like(ip_ids,dtype=torch.float64)>(1-mask_prob)
  masked_ids[mask_random_idx]=mask_token_id
  labels = copy.deepcopy(ip_ids)
  neg_mask = ~mask_random_idx
  # -100 marks the positions that the loss should ignore
  labels[neg_mask]=torch.tensor(-100)
  return (masked_ids,labels,mask_random_idx)

In [None]:
mask_token_id = torch.tensor([1],dtype=torch.int64)
x,y,mask_mtx = getdata(input_ids,mask_token_id)
print(x[0,:],'\n',y[0,:])

* Now we have our inputs and labels stored in x and y, respectively
* It is always good to test the implementation by displaying the input sentence with masked tokens
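* Besides reading the masked sentence, a few assertions can check the masking itself. The sketch below is self-contained: `mask_tokens` is a hypothetical stand-in for `getdata` that assumes a uniform random draw per position, and the token ids are random numbers rather than real text.

```python
import torch

def mask_tokens(ip_ids, mask_token_id=1, mask_prob=0.2):
  # stand-in for getdata: each position is masked independently with probability mask_prob
  is_masked = torch.rand(ip_ids.shape) < mask_prob
  masked_ids = ip_ids.clone()
  masked_ids[is_masked] = mask_token_id
  labels = ip_ids.clone()
  labels[~is_masked] = -100
  return masked_ids, labels, is_masked

torch.manual_seed(0)
ids = torch.randint(2, 100, (100, 100))  # fake token ids; 0 and 1 are kept for the specials
x_chk, y_chk, m_chk = mask_tokens(ids)

masked_fraction = m_chk.float().mean().item()
print('masked fraction:', masked_fraction)       # expected to be close to mask_prob

assert torch.all(x_chk[m_chk] == 1)              # masked inputs hold the <mask> id
assert torch.equal(x_chk[~m_chk], ids[~m_chk])   # all other inputs are untouched
assert torch.equal(y_chk[m_chk], ids[m_chk])     # labels keep the original id where masked
assert torch.all(y_chk[~m_chk] == -100)          # and -100 everywhere else
```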

In [None]:
v = Tk.build_vocab()
words = []
for idx in x[0,:]:
  words.append(v.vocab.get_itos()[idx.item()])
print(' '.join(words))

best known for the invention of error correcting codes , he <mask> <mask> true <mask> <mask> applied <mask> <mask> <mask> problem-solving <mask> to <mask> disciplines .


* Also display the words that are masked

In [None]:
words = []
for idx in y[0,:]:
  if idx != -100:
    words.append(v.vocab.get_itos()[idx.item()])
print(' '.join(words))

was a polymath who his mathematical and skills numerous


# Configuration

In [None]:
vocab_size = Tk.vocab_size
seq_len = x.shape[1]
embed_dim = 32
dmodel = embed_dim
dq = torch.tensor(4)
dk = torch.tensor(4)
dv = torch.tensor(4)
heads = torch.tensor(8)
d_ff = 4*dmodel

# Model

In [None]:
class MHA(nn.Module):
  pass


class FFN(nn.Module):
  pass



class Prediction(nn.Module):
  pass


class PositionalEncoding(nn.Module):
  pass


class Embed(nn.Module):

  def __init__(self,vocab_size,embed_dim):
    super(Embed,self).__init__()
    embed_weights = None # to be implemented: initialize with seed 70


  def forward(self,x):
    '''
    Take in the input ids and output the final embeddings (token embedding + positional embedding)
    '''
    out = None
    return out


class EncoderLayer(nn.Module):

  def __init__(self,dmodel,dq,dk,dv,d_ff,heads):
    super(EncoderLayer,self).__init__()
    self.mha = MHA(dmodel,dq,dk,dv,heads)
    self.layer_norm_1 = torch.nn.LayerNorm(dmodel)
    self.layer_norm_2 = torch.nn.LayerNorm(dmodel)
    self.ffn = FFN(dmodel,d_ff)

  def forward(self,x):
    # to be implemented: combine self.mha, self.ffn and the two layer norms
    out = None
    return out
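
* The `MHA` stub is left for you to implement. As one possible starting point, here is a minimal sketch of unmasked multi-head self-attention. It is not the reference solution: the class name `MultiHeadSelfAttention`, the bias-free projections and the use of plain Python ints for `dq`, `dk`, `dv` and `heads` are all assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
  """Illustrative multi-head self-attention (hypothetical, not the assignment's solution)."""

  def __init__(self, dmodel, dq, dk, dv, heads):
    super().__init__()
    assert dq == dk, "queries and keys must share a dimension for the dot product"
    self.heads, self.dk, self.dv = heads, dk, dv
    # one projection per role, covering all heads at once
    self.W_q = nn.Linear(dmodel, heads * dq, bias=False)
    self.W_k = nn.Linear(dmodel, heads * dk, bias=False)
    self.W_v = nn.Linear(dmodel, heads * dv, bias=False)
    self.W_o = nn.Linear(heads * dv, dmodel, bias=False)

  def forward(self, x):
    bs, seq_len, _ = x.shape
    # project, then split into heads: (bs, heads, seq_len, d)
    q = self.W_q(x).view(bs, seq_len, self.heads, self.dk).transpose(1, 2)
    k = self.W_k(x).view(bs, seq_len, self.heads, self.dk).transpose(1, 2)
    v = self.W_v(x).view(bs, seq_len, self.heads, self.dv).transpose(1, 2)
    # scaled dot-product attention; no mask, so every token attends to every token
    scores = q @ k.transpose(-2, -1) / math.sqrt(self.dk)
    weights = torch.softmax(scores, dim=-1)
    context = weights @ v  # (bs, heads, seq_len, dv)
    # merge the heads back: (bs, seq_len, heads*dv)
    context = context.transpose(1, 2).reshape(bs, seq_len, self.heads * self.dv)
    return self.W_o(context)

# shape check with the sizes from the configuration cell
mha = MultiHeadSelfAttention(dmodel=32, dq=4, dk=4, dv=4, heads=8)
out = mha(torch.randn(10, 26, 32))
print(out.shape)  # torch.Size([10, 26, 32])
```

* The output has the same shape as the input, which is what lets a residual connection and `LayerNorm(dmodel)` be applied around it in `EncoderLayer`.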

In [None]:
class BERT(nn.Module):

  def __init__(self,vocab_size,dmodel,dq,dk,dv,d_ff,heads,num_layers=1):
    super(BERT,self).__init__() # required before registering sub-modules
    self.embed_lookup = Embed(vocab_size,dmodel)
    self.enc_layers = nn.ModuleList(EncoderLayer(dmodel,dq,dk,dv,d_ff,heads) for i in range(num_layers))
    self.predict = Prediction(dmodel,vocab_size)

  def forward(self,input_ids):
    x = self.embed_lookup(input_ids)
    for enc_layer in self.enc_layers:
      x = enc_layer(x)
    out = self.predict(x)
    return out

In [None]:
model = BERT(vocab_size,dmodel,dq,dk,dv,d_ff,heads,num_layers=1)
optimizer = optim.SGD(model.parameters(),lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training the model

In [None]:
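def train(token_ids,labels,epochs=1000):
  loss_trace = []
  for epoch in range(epochs):
    out = model(token_ids)
    loss = None # to be implemented: MLM loss over the masked positions
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    loss_trace.append(loss.item()) # record the loss of every epoch
  return loss_trace


In [None]:
loss_trace = train(x,y,20000)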

* The loss should reach about 0.02 after 20000 epochs (again, it is absolutely fine if you get a different value)
* Let us predict the masked tokens for all the samples in the tiny dataset

In [None]:
with torch.inference_mode():
  predictions = torch.argmax(model(x),dim=-1)

In [None]:
v = Tk.build_vocab()
masked_words = []
predicted_words=[]
for index,idx in enumerate(y.flatten()):
  # to display only the masked tokens
  if idx != -100:
    masked_words.append(v.vocab.get_itos()[idx.item()])
    predicted_words.append(v.vocab.get_itos()[predictions.flatten()[index].item()])
print('Masked Words: ')
print(' '.join(masked_words))
print('Predicted Words: ')
print(' '.join(predicted_words))

Masked Words: 
was a polymath who his mathematical and skills numerous reflecting decided his articles him restate facts us with . thought drove developed than their also generations inspired each unique genre containing that intertwines story of with points , the . information , videos , context . demonstrates his pursuit excellence helping writings , is a timeless
Predicted Words: 
was a polymath who his mathematical and skills numerous reflecting decided his articles him restate facts us with . thought drove developed than their also generations inspired each unique genre containing that intertwines story of with points , the . information , videos , context . demonstrates his pursuit excellence helping writings , is a timeless
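
* The two word lists above can also be compared programmatically as accuracy over the masked positions only. A minimal sketch, where `masked_accuracy` is a hypothetical helper and the small tensors are made-up stand-ins for `predictions` and `y`:

```python
import torch

def masked_accuracy(predictions, labels, ignore_index=-100):
  """Fraction of masked positions (label != ignore_index) that are predicted correctly."""
  masked = labels != ignore_index
  if masked.sum() == 0:
    return float('nan')  # nothing was masked
  correct = (predictions[masked] == labels[masked]).sum().item()
  return correct / masked.sum().item()

# toy stand-ins for `predictions` and `y`
labels = torch.tensor([[-100, 7, -100, 3],
                       [5, -100, -100, 9]])
predictions = torch.tensor([[2, 7, 4, 3],
                            [5, 1, 0, 8]])
acc = masked_accuracy(predictions, labels)
print(acc)  # 0.75 (3 of the 4 masked positions are correct)
```

* On an overfitted model, `masked_accuracy(predictions, y)` would be expected to be at or near 1.0 on the training data.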
