<h1 style="color:orange;text-align:center;font-family:courier;font-size:280%">Masked Language model From Scratch Using Encoder only Transformers</h1>
<p style="color:orange;text-align:center;font-family:courier"> The objective is to understand and train a MLM(Masked language model) which can be also called as a type of foundational model.

### Objectives 
* Understand the theory and building blocks of NLP(Natural Language Processing Pipeline.
* Generate a basic understanding of how to use Tensorflow Keras API for creating custom utilities(We leverage our pipeline using KerasNLP package).
* Simplify the pedagogy of explaining NLP topics and specific to MLM.
* As an example we will use **Wikitext dataset** since this is an unsupervised learning task.
<!-- * Though the code works there are significant drawbacks with yolov1 which has been addressed on YoloV2,YoloV3 -->


<p style="text-align:center"><img src="assets/MLM.jpeg" alt="textcla" width="540"/>

### Import the dependencies

In [54]:
import os
import random
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import GPT2TokenizerFast #use BPE tokenizer
from tensorflow.keras.layers import Input,Embedding,Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras_nlp.layers import TransformerEncoder,SinePositionEncoding

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

As a beginner if you are barging into NLP it is very crucial to understand tokenizers(Words to Numbers), there are different techniques used to achieve this.

Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model. It’s used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa. To know more https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt

### Load the Dataset

In [56]:
dataset = pd.read_csv("wiki001_sample.csv")
text = dataset["text"][:]
text = [str(i) for i in text if len(str(i)) >3]
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
vocab_size = tokenizer.vocab_size
tokenizer.add_tokens(" [MASK]")
MASK_TOKEN_IDX = tokenizer(" [MASK]").input_ids[0]
print("Mask idx: ",MASK_TOKEN_IDX)

Mask idx:  50257


### Custom DataLoader for MLM task
* The objective here is randomly mask words and train the Model to predict the Masked words, in higher level it learns a language.
  * Example : INPUT: The dog was [MASK] in the dawn. OUTPUT: The dog was barking in the dawn.  

In [63]:
class MLM_dataset(tf.keras.utils.Sequence):
    def __init__(self,text_list,eos_text,max_token_length,batch_size,tokenizer,shuffle=True):
        self.textlist = text_list
        self.shuffle = shuffle
        self.batchsize = batch_size
        self.max_token_length = max_token_length
        self.eos_text = eos_text
        self.tokenizer = tokenizer
        self.indices = list(range(len(self.textlist)))
        
    def __len__(self):
        return int(len(self.indices)/self.batchsize)
    
    
    def on_epoch_end(self):
        if self.shuffle == True:
            np.random.shuffle(self.indices)
            
    
    def prepare_sample(self,idx):
        sample = self.tokenizer(self.textlist[idx])
        
        tokens = [sample.input_ids]
        pad_tokens = pad_sequences(tokens,maxlen=self.max_token_length,padding="post",truncating="post",value=self.eos_text)[0]
        weights = np.zeros((self.max_token_length))
        output_tokens = pad_tokens.copy()
        choices = [np.random.randint(3,18,3)]
        random_mask_indices = random.choice(choices)
        pad_tokens[random_mask_indices] = MASK_TOKEN_IDX
        weights[random_mask_indices] = 1
        return pad_tokens,output_tokens,weights

    def __getitem__(self,idx):
        masked_inp = []
        unmasked_out = []
        weights = []
        
        batch_list = list(range(idx * self.batchsize,(idx + 1)*self.batchsize))
        for idx_ in batch_list:
            x,y,z = self.prepare_sample(idx_)
            masked_inp.append(x)
            unmasked_out.append(y)
            weights.append(z)
            
        return tf.cast(masked_inp,tf.int32),tf.cast(unmasked_out,tf.int32),tf.cast(weights,tf.float32)

### Create dataset pipeline

In [69]:
EOS_text = tokenizer.encode("<|endoftext|>")[0] #from gpt2 vocab
max_seq_length = 32 #Max token length
batch = 32

mlmdataset = MLM_dataset(text_list=text,batch_size=batch,
                         eos_text=EOS_text,max_token_length=max_seq_length,
                         tokenizer=tokenizer)

### Build the model and set the model parameters

In [70]:
EMBEDDING_DIM = 256
INTERMEDIATE_DIM = 512
NUM_BLOCKS = 6
HEADS=6

def mlm_nn():
    vocab_size = MASK_TOKEN_IDX+3
    embedding_dim=EMBEDDING_DIM
    inputs = Input((None,), dtype=tf.int32)
    embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(inputs)
    positional_encoding = SinePositionEncoding()(embedding)
    outputs = embedding + positional_encoding
    for i in range(NUM_BLOCKS):
        encoder = TransformerEncoder(intermediate_dim=INTERMEDIATE_DIM,num_heads=HEADS,activation="gelu")(outputs)
    
    output_node = Dense(vocab_size)(encoder)
    model = tf.keras.models.Model(inputs,output_node)
    return model

### Model initialization and hyper-parameters setting

In [71]:
nn = mlm_nn()
epochs = 100
loss_func = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(1e-4)

### Train the model - (custom training loop)

In [9]:
for i in range(epochs):
    epoch_loss= 0
    print("EPOCH : ",i)
    for idx,(x,y,z) in enumerate(mlmdataset):
        with tf.GradientTape() as tape:
            nn_out = nn(x)
            loss_ = loss_func(y,nn_out,sample_weight=z)
            epoch_loss+=loss_
        trainable_vars = nn.trainable_variables
        gradients = tape.gradient(loss_, trainable_vars)
        optimizer.apply_gradients(zip(gradients,trainable_vars))
        if idx%100==0:
            print(f"loss : {loss_.numpy()}")
            nn.save_weights("savednn.h5")
    epoch_loss_value = epoch_loss/int(len(text)/24)
    print(f"epochloss : {epoch_loss_value.numpy()}")

### An experimental training was already executed, will utilize that for direct inference
* Load the trained weights for inference **wiki_base256.h5**

In [17]:
nn.load_weights("wiki_base256.h5")

In [20]:
def predict_masked_word(text):
    tokenized = tf.cast(tokenizer.encode(text),tf.int32)
    tokenized = tokenized[tf.newaxis,:].numpy()
    ELEMENTS = np.where(tokenized[0]==MASK_TOKEN_IDX)[0]
    indices = tf.argmax(nn.predict(tokenized,verbose=0),axis=-1)[0]
    for i in ELEMENTS:
        tokenized[0][i] =  indices[i]
    return tokenizer.decode(tokenized[0])

### Model Inference

In [52]:
test_examples = ["The revolution of russian army was [MASK] by civiliztion of",
                 "the dog was [MASK] in the kennel",
                 "A business proposal is a written offer from a [MASK] to a prospective sponsor",
                 "the law states we can never fight [MASK] our own customs"]
for i in test_examples:
    print(f"INPUT: {i}")
    print(f"OUTPUT: {predict_masked_word(i)}")
    print("-"*80)

INPUT: The revolution of russian army was [MASK] by civiliztion of
OUTPUT: The revolution of russian army was captured by civiliztion of
--------------------------------------------------------------------------------
INPUT: the dog was [MASK] in the kennel
OUTPUT: the dog was placed in the kennel
--------------------------------------------------------------------------------
INPUT: A business proposal is a written offer from a [MASK] to a prospective sponsor
OUTPUT: A business proposal is a written offer from a reference to a prospective sponsor
--------------------------------------------------------------------------------
INPUT: the law states we can never fight [MASK] our own customs
OUTPUT: the law states we can never fight against our own customs
--------------------------------------------------------------------------------
