<a href="https://colab.research.google.com/github/ashaduzzaman-sarker/Transfer-learning-and-Generalised-Language-Models/blob/main/End_to_end_Masked_Language_Modeling_with_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset.

## Introduction

- **Masked Language Modeling (MLM):** A task where a model predicts masked words in a sentence using the surrounding context.
  
- **Functionality:** Given an input with one or more mask tokens, the model generates likely substitutions for each masked word.

- **Example:**
  - Input: "I have watched this [MASK] and it was awesome."
  - Output: "I have watched this movie and it was awesome."

- **Training Method:** MLM is effective for training language models in a self-supervised manner, eliminating the need for human-annotated labels.

- **Fine-tuning:** The model trained with MLM can be fine-tuned for various supervised NLP tasks, such as sentiment classification.

- **Objective:** The tutorial focuses on building a BERT model from scratch, training it using MLM, and fine-tuning it for sentiment classification.

- **Tools Used:** Keras TextVectorization and MultiHeadAttention layers are employed to create a BERT Transformer-Encoder network architecture.

## Setup

In [1]:
!pip install tf-nightly keras-nlp tensorflow

Collecting tf-nightly
  Downloading tf_nightly-2.18.0.dev20240808-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting keras-nlp
  Downloading keras_nlp-0.14.4-py3-none-any.whl.metadata (6.8 kB)
Collecting tb-nightly~=2.18.0.a (from tf-nightly)
  Downloading tb_nightly-2.18.0a20240809-py3-none-any.whl.metadata (1.6 kB)
Collecting keras-nightly>=3.2.0.dev (from tf-nightly)
  Downloading keras_nightly-3.4.1.dev2024081003-py3-none-any.whl.metadata (5.8 kB)
Collecting tensorflow-text (from keras-nlp)
  Downloading tensorflow_text-2.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Downloading tf_nightly-2.18.0.dev20240808-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (636.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m636.5/636.5 MB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading keras_nlp-0.14.4-py3-none-any.whl (572 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m

In [2]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"
import keras_nlp
import keras
import tensorflow as tf
from keras import layers
from keras.layers import TextVectorization
from dataclasses import dataclass
import pandas as pd
import numpy as np
import glob
import re
from pprint import pprint

## Hyperparameter Configuration

In [3]:
@dataclass
class Config:
    MAX_LEN = 256
    BATCH_SIZE = 32
    LR = 0.001
    VOCAB_SIZE = 30000
    EMBED_DIM = 128
    NUM_HEADS = 8  # Number of attention heads
    FF_DIM = 128  # Hidden layer size in feed forward network inside transformer
    NUM_LAYERS = 1

config = Config()

## Load the IMDB Dataset

In [4]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  51.8M      0  0:00:01  0:00:01 --:--:-- 51.8M


In [5]:
# Load the dataset into a Pandas dataframe
def get_text_list_from_files(files):
    text_list = []
    for name in files:
        with open(name) as f:
            for line in f:
                text_list.append(line)
    return text_list

def get_data_from_text_files(folder_name):
    pos_files = glob.glob("aclImdb/" + folder_name + "/pos/*.txt")
    neg_files = glob.glob("aclImdb/" + folder_name + "/neg/*.txt")

    pos_text = get_text_list_from_files(pos_files)
    neg_text = get_text_list_from_files(neg_files)

    df = pd.DataFrame(
        {
            "review": pos_text + neg_text,
            "sentiment": [0] * len(pos_text) + [1] * len(neg_text),
        }
    )
    df = df.sample(len(df)).reset_index(drop=True)
    return df

train_df = get_data_from_text_files("train")
test_df = get_data_from_text_files("test")

all_data = pd.concat([train_df, test_df])

## Dataset preparation

- **TextVectorization Layer:** Used to convert text into integer token IDs.
  - It can transform text into sequences of token indices or into a dense, unordered set of tokens.

- **Preprocessing Functions:**
  1. **`get_vectorize_layer`:** Builds the TextVectorization layer for the model.
  2. **`encode`:** Encodes raw text into integer token IDs.
  3. **`get_masked_input_and_labels`:** Masks 15% of input tokens randomly in each sequence to prepare the data for the masked language modeling task.

In [23]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape("!#$%&'()*+,-./:;<=>?@[\]^_`{|}~"), ""
    )


In [24]:
def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
    """Build Text vectorization layer

    Args:
      texts (list): List of string i.e input texts
      vocab_size (int): vocab size
      max_seq (int): Maximum sequence lenght.
      special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].

    Returns:
        layers.Layer: Return TextVectorization keras Layer
    """
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        standardize=custom_standardization,
        output_sequence_length=max_seq,
    )
    vectorize_layer.adapt(texts)

    # Insert mask token in Vocabulary
    vocab = vectorize_layer.get_vocabulary()
    vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer

vectorize_layer = get_vectorize_layer(
    all_data.review.values.tolist(),
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)

# Get mask token id for masked language model
mask_token_id = vectorize_layer(["[mask]"]).numpy()[0][0]

In [25]:
def encode(texts):
    encoded_texts = vectorize_layer(texts)
    return encoded_texts.numpy()

def get_masked_input_and_labels(encoded_texts):
    # 15% BERT masking
    inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
    # Do not mask special tokens
    inp_mask[encoded_texts <= 2] = False
    # Set targets to -1 by default, it means ignore
    labels = -1 * np.ones(encoded_texts.shape, dtype=int)
    # Set labels for masked tokens
    labels[inp_mask] = encoded_texts[inp_mask]

    # Prepare input
    encoded_texts_masked = np.copy(encoded_texts)
    # Set input to [MASK] which is the last token for the 90% of tokens, leaving 10% unchanged
    inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
    encoded_texts_masked[
        inp_mask_2mask
    ] = mask_token_id  # mask token is the last in the dict

    # Set 10% to a random token
    inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
    encoded_texts_masked[inp_mask_2random] = np.random.randint(
        3, mask_token_id, inp_mask_2random.sum()
    )

    # Prepare sample_weights to pass to .fit() method
    sample_weights = np.ones(labels.shape)
    sample_weights[labels == -1] = 0

    # y_labels would be same as encoded_texts i.e input tokens
    y_labels = np.copy(encoded_texts)

    return encoded_texts_masked, y_labels, sample_weights


In [26]:
# We hava 25000 training samples
x_train = encode(train_df.review.values)  # encode reviews with vectorizer
y_train = train_df.sentiment.values
train_classifier_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(1000)
    .batch(config.BATCH_SIZE)
)

In [27]:
# We have 25000 testing samples
x_test = encode(test_df.review.values)
y_test = test_df.sentiment.values
test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(
    config.BATCH_SIZE
)

In [28]:
# Build the dataset for end to end model input
test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices(
    (test_df.review.values, y_test)
).batch(config.BATCH_SIZE)

In [29]:
# Prepare the dataset for masked language model
x_all_reviews = encode(all_data.review.values)
x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(
    x_all_reviews
)

mlm_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_train, y_masked_labels, sample_weights)
)

mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)

## Create BERT model (Pretraining Model) for masked language modeling

In [30]:
def bert_module(query, key, value, i):
    # Multi headed self-attention
    attention_output = layers.MultiHeadAttention(
        num_heads=config.NUM_HEADS,
        key_dim=config.EMBED_DIM // config.NUM_HEADS,
        name="encoder_{}_multiheadattention".format(i),
    )(query, key, value)
    attention_output = layers.Dropout(0.1, name="encoder_{}_att_dropout".format(i))(
        attention_output
    )
    attention_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}_att_layernormalization".format(i)
    )(query + attention_output)

    # Feed-Forward layer
    ffn = keras.Sequential(
        [
            layers.Dense(config.FF_DIM, activation="relu"),
            layers.Dense(config.EMBED_DIM),
        ],
        name = "encoder_{}_ffn".format(i)
    )
    ffn_output = ffn(attention_output)
    ffn_output = layers.Dropout(0.1, name="encoder_{}_ffn_dropout".format(i))(
        ffn_output
    )
    sequence_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}_ffn_layernormalization".format(i)
    )(attention_output + ffn_output)
    return sequence_output

loss_fn = keras.losses.SparseCategoricalCrossentropy(reduction=None)
loss_tracker = tf.keras.metrics.Mean(name="loss")

In [31]:
class MaskedLanguageModel(keras.Model):
    def train_step(self, inputs):
        if len(inputs) == 3:
            features, labels, sample_weight = inputs
        else:
            features, labels = inputs
            sample_weight = None

        with tf.GradientTape() as tape:
            predictions = self(features, training=True)
            loss = loss_fn(labels, predictions, sample_weight=sample_weight)

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Compute our own metrics
        loss_tracker.update_state(loss, sample_weight=sample_weight)

        # Return a dict mapping metric names to current value
        return {"loss": loss_tracker.result()}

    @property
    def metrics(self):
        return [loss_tracker]

In [32]:
def create_masked_language_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype="int64")

    word_embeddings = layers.Embedding(
        config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding"
    )(inputs)

    position_embeddings = keras_nlp.layers.PositionEmbedding(
        sequence_length=config.MAX_LEN
    )(word_embeddings)

    embeddings = word_embeddings + position_embeddings

    encoder_output = embeddings
    for i in range(config.NUM_LAYERS):
        encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)

    mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")(
        encoder_output
    )
    mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model")

    optimizer = keras.optimizers.Adam(learning_rate=config.LR)
    mlm_model.compile(optimizer=optimizer)
    return mlm_model

id2token = dict(enumerate(vectorize_layer.get_vocabulary()))
token2id = {y: x for x, y in id2token.items()}

In [33]:
class MaskedTextGenerator(keras.callbacks.Callback):
    def __init__(self, sample_tokens, top_k=5):
        self.sample_tokens = sample_tokens
        self.k = top_k

    def decode(self, tokens):
        return " ".join([id2token[t] for t in tokens if t != 0])

    def convert_ids_to_tokens(self, id):
        return id2token[id]

    def on_epoch_end(self, epoch, logs=None):
        prediction = self.model.predict(self.sample_tokens)
        masked_index = np.where(self.sample_tokens == mask_token_id)
        masked_index = masked_index[1]
        mask_prediction = prediction[0][masked_index]

        top_indices = mask_prediction[0].argsort()[-self.k :][::-1]
        values = mask_prediction[0][top_indices]

        for i in range(len(top_indices)):
            p = top_indices[i]
            v = values[i]
            tokens = np.copy(sample_tokens[0])
            tokens[masked_index[0]] = p
            result = {
                "input_text": self.decode(sample_tokens[0].numpy()),
                "prediction": self.decode(tokens),
                "probability": v,
                "predicted mask token": self.convert_ids_to_tokens(p),
            }
            pprint(result)

sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])
generator_callback = MaskedTextGenerator(sample_tokens.numpy())

bert_masked_model = create_masked_language_bert_model()
bert_masked_model.summary()

## Train and Save the Model

In [34]:
bert_masked_model.fit(
      mlm_ds,
      epochs=5,
      callbacks=[generator_callback]
)

bert_masked_model.save("bert_mlm_imdb.keras")

Epoch 1/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 448ms/step
{'input_text': 'i have watched this mask and it was awesome',
 'predicted mask token': 'i',
 'prediction': 'i have watched this i and it was awesome',
 'probability': 0.09340388}
{'input_text': 'i have watched this mask and it was awesome',
 'predicted mask token': 'was',
 'prediction': 'i have watched this was and it was awesome',
 'probability': 0.06273627}
{'input_text': 'i have watched this mask and it was awesome',
 'predicted mask token': 'this',
 'prediction': 'i have watched this this and it was awesome',
 'probability': 0.04304767}
{'input_text': 'i have watched this mask and it was awesome',
 'predicted mask token': 'movie',
 'prediction': 'i have watched this movie and it was awesome',
 'probability': 0.035238024}
{'input_text': 'i have watched this mask and it was awesome',
 'predicted mask token': 'it',
 'prediction': 'i have watched this it and it was awesome',
 'probability': 0.026751947}


## Fine-tune a sentiment classification model

In [35]:
# Load pretrained bert model
mlm_model = keras.models.load_model(
    "bert_mlm_imdb.keras",
    custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

pretrained_bert_model = keras.Model(
    mlm_model.input,
    mlm_model.get_layer("encoder_0_ffn_layernormalization").output
)

# Freeze it
pretrained_bert_model.trainable = False


def create_classifier_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype="int64")
    sequence_output = pretrained_bert_model(inputs)
    pooled_output = layers.GlobalMaxPooling1D()(sequence_output)
    hidden_layer = layers.Dense(64, activation="relu")(pooled_output)
    outputs = layers.Dense(1, activation="sigmoid")(hidden_layer)
    classifer_model = keras.Model(inputs, outputs, name="classification")
    optimizer = keras.optimizers.Adam()
    classifer_model.compile(
        optimizer=optimizer,
        loss="binary_crossentropy",
        metrics=["accuracy"]
    )
    return classifer_model

classifer_model = create_classifier_bert_model()
classifer_model.summary()

In [37]:
# Train the classifier with frozen BERT stage
classifer_model.fit(
    train_classifier_ds,
    epochs=5,
    validation_data=test_classifier_ds,
)

Epoch 1/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 13ms/step - accuracy: 0.6378 - loss: 0.6653 - val_accuracy: 0.7699 - val_loss: 0.4793
Epoch 2/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - accuracy: 0.7528 - loss: 0.5030 - val_accuracy: 0.7587 - val_loss: 0.5050
Epoch 3/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 8ms/step - accuracy: 0.7682 - loss: 0.4915 - val_accuracy: 0.7801 - val_loss: 0.4673
Epoch 4/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 8ms/step - accuracy: 0.7647 - loss: 0.4898 - val_accuracy: 0.7766 - val_loss: 0.4715
Epoch 5/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 8ms/step - accuracy: 0.7631 - loss: 0.4933 - val_accuracy: 0.7837 - val_loss: 0.4609


<keras.src.callbacks.history.History at 0x7cb47c6323b0>

In [38]:
# Unfreeze the BERT model for Fine-Tuning
pretrained_bert_model.trainable = True
optimizer = keras.optimizers.Adam()

classifer_model.compile(
    optimizer=optimizer,
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

classifer_model.fit(
    train_classifier_ds,
    epochs=1, #5,
    validation_data=test_classifier_ds,
)

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m23s[0m 21ms/step - accuracy: 0.8283 - loss: 0.3840 - val_accuracy: 0.8590 - val_loss: 0.3359


<keras.src.callbacks.history.History at 0x7cb47c644700>

## Create an end-to-end model and evaluate it

In [39]:
def create_end_to_end_model(model):
    inputs_string = keras.Input(shape=(1,), dtype="string")
    indices = vectorize_layer(inputs_string)
    outputs = model(indices)
    end_to_end_model = keras.Model(inputs_string, outputs, name="end_to_end_model")
    optimizer = keras.optimizers.Adam(learning_rate=config.LR)

    end_to_end_model.compile(
        optimizer=optimizer,
        loss="binary_crossentropy",
        metrics=["accuracy"]
    )

    return end_to_end_model

end_to_end_classification_model = create_end_to_end_model(classifer_model)
end_to_end_classification_model.evaluate(test_raw_classifier_ds)

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 8ms/step - accuracy: 0.8560 - loss: 0.0000e+00


[0.0, 0.0, 0.859000027179718, 0.859000027179718]