<a href="https://colab.research.google.com/github/ashaduzzaman-sarker/Transfer-learning-and-Generalised-Language-Models/blob/main/Pretraining_BERT_with_Hugging_Face_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pretraining BERT using Hugging Face Transformers on NSP and MLM

## Introduction

- **BERT Overview:**
  - **BERT (Bidirectional Encoder Representations from Transformers):** A model that leverages the Transformer architecture for natural language processing (NLP) tasks.

![](https://miro.medium.com/v2/resize:fit:1400/1*B-Kd1JHDms479Id2uCW22A.png)

  - **Transfer Learning:** Similar to techniques used in computer vision, BERT can be pretrained on large datasets and then fine-tuned for specific NLP tasks.

- **Transformer Architecture:**
  - **Attention Mechanism:** The core of the Transformer, which learns contextual relations between words or subwords in a text.
  - **Encoder-Decoder Structure:** In its original form, the Transformer includes both an encoder (reads the text input) and a decoder (produces predictions). BERT uses only the encoder, since its goal is to produce a language model.
  - **Bidirectionality:** Unlike directional models that process text sequentially, the Transformer encoder in BERT reads the entire sequence simultaneously, allowing it to capture context from both directions.

![](https://storrs.io/content/images/2021/06/Screen-Shot-2021-06-27-at-8.36.36-AM.png)
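
To make the attention and bidirectionality points above concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only: it assumes identity projections (queries, keys, and values are the input vectors themselves), whereas a real Transformer layer applies learned projections and multiple heads. The names `self_attention` and `toy_embeddings` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    d = len(vectors[0])
    weights = []
    for q in vectors:
        # Every query is scored against every position, left and right alike.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of all input vectors.
    outputs = [
        [sum(w * v[dim] for w, v in zip(row, vectors)) for dim in range(d)]
        for row in weights
    ]
    return weights, outputs


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_weights, attn_outputs = self_attention(toy_embeddings)
for row in attn_weights:
    print([round(w, 3) for w in row])
```

Because no causal mask is applied, the first position receives non-zero weight from the last one, which is the sense in which the encoder is bidirectional.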

- **Training Objectives for BERT:**
  - **Masked Language Modeling (MLM):** A fraction of the tokens in a sequence is masked (15% in the original BERT paper; this notebook uses `MLM_PROB = 0.20`), and the model predicts the original tokens from the surrounding context.
  - **Next Sentence Prediction (NSP):** The model predicts whether the second sentence in a pair is the subsequent sentence in the original document or a random sentence from the corpus.

- **Training and Pretraining:**
  - **Google's Pretrained BERT:** While a pretrained BERT model for English is available, there may be a need to pretrain BERT from scratch for other languages or domains.
  - **Pretraining Process:** The example provided uses the WikiText English dataset and trains BERT from scratch, optimizing both MLM and NSP objectives with the 🤗 Transformers library.

## Setup

In [None]:
!pip install git+https://github.com/huggingface/transformers.git
!pip install datasets
!pip install huggingface-hub
!pip install nltk

!pip install -q --upgrade tensorflow keras

## Imports

In [2]:
import nltk
import random
import logging

import tensorflow as tf
from tensorflow import keras

nltk.download("punkt")

tf.keras.utils.set_random_seed(42)

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Define Variables

In [3]:
TOKENIZER_BATCH_SIZE = 128    # Number of documents per batch when training the tokenizer
TOKENIZER_VOCABULARY = 25000    # Total number of unique subwords the tokenizer can have

BLOCK_SIZE = 128    # Target number of tokens per sample, including special tokens
NSP_PROB = 0.50    # Probability that sentence B is a random sentence instead of the true next one
SHORT_SEQ_PROB = 0.1    # Probability of creating a shorter-than-target sequence
MAX_LENGTH = 512    # Length that every sample is padded to

MLM_PROB = 0.20    # Probability of masking a token for MLM

TRAIN_BATCH_SIZE = 2    # Batch size for training
MAX_EPOCHS = 1    # Maximum number of epochs
LEARNING_RATE = 1e-5    # Learning rate for training

MODEL_CHECKPOINT = "bert-base-uncased"    # Model checkpoint to use

## Load the WikiText dataset

- **WikiText Dataset:**
  - A collection of language modeling datasets extracted from "Good" and "Featured" articles on Wikipedia. The largest variant, WikiText-103, contains over 100 million tokens; this notebook loads the much smaller `wikitext-2-raw-v1` configuration (about 2 million training tokens).
  
- **Loading the Dataset:**
  - The dataset is available via the 🤗 Datasets library.
  - The `load_dataset` function downloads and loads all three splits (train, validation, and test).
  
- **Focus:**
  - The tokenizer is trained on the full train split. To keep the demonstration fast, model pretraining uses only the first 1,000 rows of the train and validation splits.

In [4]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")



In [5]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


## Training a new Tokenizer

- **Training a Tokenizer:**
  - A custom tokenizer captures the specific vocabulary and subwords in your dataset.
  - Essential for Transformer models using subword tokenization.

- **🤗 Transformers Tokenizer:**
  - Converts inputs into token IDs and prepares data for the model.

- **Process:**
  - Start by listing all raw documents from the WikiText corpus to train the tokenizer.
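
As a rough illustration of what "converting inputs into token IDs" means for a subword tokenizer, the sketch below applies a tiny, hand-written WordPiece-style vocabulary with greedy longest-match-first splitting. It only shows how an already learned vocabulary is applied, not how the vocabulary is trained. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical and are not part of the 🤗 Transformers API.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a trained tokenizer would have thousands of entries.
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "token": 5, "##izer": 6, "##s": 7, "train": 8, "##ing": 9,
}

pieces = wordpiece_tokenize("tokenizers", toy_vocab)
ids = [toy_vocab[p] for p in pieces]
print(pieces, ids)  # ['token', '##izer', '##s'] [5, 6, 7]
```

A vocabulary learned from your own corpus changes which pieces exist, and therefore how domain-specific words are split.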

In [6]:
all_texts = [
    doc for doc in dataset["train"]["text"] if len(doc) > 0 and not doc.startswith(" =")
]

In [7]:
# Create a batch iterator for tokenizer training
def batch_iterator():
    for i in range(0, len(all_texts), TOKENIZER_BATCH_SIZE):
        yield all_texts[i : i + TOKENIZER_BATCH_SIZE]
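
The batching pattern in the cell above can be checked on a toy list. This is a standalone sketch with a hypothetical `iter_batches` helper that takes the texts and batch size as arguments instead of reading the notebook's globals.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


toy_texts = [f"document {n}" for n in range(10)]
batch_sizes = [len(batch) for batch in iter_batches(toy_texts, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

The last batch is simply shorter when the number of documents is not a multiple of the batch size.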

In [8]:
"""
In this notebook, we replicate an existing tokenizer by training a new version
with the same algorithms and parameters, starting by loading the desired
tokenizer model (here `bert-base-uncased`, as set in MODEL_CHECKPOINT).
"""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)



In [9]:
# Next, we train our tokenizer using the full train split of the Wikitext-2 dataset.
tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=TOKENIZER_VOCABULARY
)

## Data Pre-processing


In [10]:
# Keep only the first 1,000 rows of each split to shorten the demonstration
dataset["train"] = dataset["train"].select(range(1000))
dataset["validation"] = dataset["validation"].select(range(1000))

- **Preprocessing for BERT Pretraining:**
  - **Tasks:** Prepare data for the Next Sentence Prediction (NSP) and Masked Language Modeling (MLM) tasks.
  - **DataCollator:** 🤗 Transformers provides a `DataCollatorForLanguageModeling` for MLM, but NSP requires manual preparation.

- **`prepare_train_features` Function:**
  - **NSP Preparation:** Create sentence pairs (A, B) where B either follows A or is randomly sampled, and assign labels (0 if B follows A, 1 if B is random). This is the convention the code below uses.
  - **Tokenization:** Convert text to token IDs for BERT's embedding lookup.
  - **Additional Inputs:** Generate necessary inputs like `token_type_ids` and `attention_mask` for the model.
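
The steps listed above can be sketched on a toy corpus without any tokenizer. The sketch below is a simplification under stated assumptions: sentences are split on whitespace instead of being converted to subword IDs, each pair uses exactly one sentence per side, and nothing is truncated. The helpers `make_nsp_example` and `build_inputs` are hypothetical names; the label convention (0 = B follows A, 1 = B is random) matches the code in this notebook.

```python
import random

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def make_nsp_example(doc, corpus, rng, random_next_prob=0.5):
    """Return (sentence_a, sentence_b, label) from a document with >= 2 sentences."""
    idx = rng.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if rng.random() < random_next_prob:
        # Sample B from a different document: label 1 (random sentence).
        other_docs = [d for d in corpus if d is not doc]
        sentence_b = rng.choice(rng.choice(other_docs))
        return sentence_a, sentence_b, 1
    # Use the sentence that really follows A: label 0.
    return sentence_a, doc[idx + 1], 0


def build_inputs(tokens_a, tokens_b, max_length):
    """Add special tokens, segment ids, and an attention mask, then pad."""
    tokens = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)  # assumed non-negative in this sketch
    return {
        "tokens": tokens + [PAD] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


corpus = [
    ["the cat sat", "it purred loudly", "then it slept"],
    ["stocks rose today", "analysts were surprised"],
]
rng = random.Random(0)
sent_a, sent_b, nsp_label = make_nsp_example(corpus[0], corpus, rng)
features = build_inputs(sent_a.split(), sent_b.split(), max_length=12)
print(sent_a, "|", sent_b, "| label:", nsp_label)
print(features)
```

Padding positions get segment id 0 and attention mask 0, so the model ignores them.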

In [11]:
# We define the maximum number of tokens after tokenization that each training sample
# will have
max_num_tokens = BLOCK_SIZE - tokenizer.num_special_tokens_to_add(pair=True)


def prepare_train_features(examples):

    """Function to prepare features for NSP task

    Arguments:
      examples: A dictionary with 1 key ("text")
        text: List of raw documents (str)
    Returns:
      examples:  A dictionary with 4 keys
        input_ids: List of tokenized, concatenated, and batched
          sentences from the individual raw documents (int)
        token_type_ids: List of integers (0 or 1) corresponding
          to: 0 for sentence no. 1 and padding, 1 for sentence
          no. 2
        attention_mask: List of integers (0 or 1) corresponding
          to: 1 for non-padded tokens, 0 for padded
        next_sentence_label: List of integers (0 or 1) corresponding
          to: 0 if the second sentence actually follows the first,
          1 if the sentence is sampled from somewhere else in the corpus
    """

    # Remove unwanted samples (empty lines and section headings) from the training set
    examples["document"] = [
        d.strip() for d in examples["text"] if len(d) > 0 and not d.startswith(" =")
    ]
    # Split the documents from the dataset into their individual sentences
    examples["sentences"] = [
        nltk.tokenize.sent_tokenize(document) for document in examples["document"]
    ]
    # Convert the tokens into ids using the trained tokenizer
    examples["tokenized_sentences"] = [
        [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent)) for sent in doc]
        for doc in examples["sentences"]
    ]

    # Define the outputs
    examples["input_ids"] = []
    examples["token_type_ids"] = []
    examples["attention_mask"] = []
    examples["next_sentence_label"] = []

    for doc_index, document in enumerate(examples["tokenized_sentences"]):

        current_chunk = []  # a buffer that stores the current working segments
        current_length = 0
        i = 0

        # We *usually* want to fill up the entire sequence since we are padding
        # to `block_size` anyways, so short sequences are generally wasted
        # computation. However, we *sometimes*
        # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
        # sequences to minimize the mismatch between pretraining and fine-tuning.
        # The `target_seq_length` is just a rough target however, whereas
        # `block_size` is a hard limit.
        target_seq_length = max_num_tokens

        if random.random() < SHORT_SEQ_PROB:
            target_seq_length = random.randint(2, max_num_tokens)

        while i < len(document):
            segment = document[i]
            current_chunk.append(segment)
            current_length += len(segment)
            if i == len(document) - 1 or current_length >= target_seq_length:
                if current_chunk:
                    # `a_end` is how many segments from `current_chunk` go into the `A`
                    # (first) sentence.
                    a_end = 1
                    if len(current_chunk) >= 2:
                        a_end = random.randint(1, len(current_chunk) - 1)

                    tokens_a = []
                    for j in range(a_end):
                        tokens_a.extend(current_chunk[j])

                    tokens_b = []

                    if len(current_chunk) == 1 or random.random() < NSP_PROB:
                        is_random_next = True
                        target_b_length = target_seq_length - len(tokens_a)

                        # This should rarely go for more than one iteration for large
                        # corpora. However, just to be careful, we try to make sure that
                        # the random document is not the same as the document
                        # we're processing.
                        for _ in range(10):
                            random_document_index = random.randint(
                                0, len(examples["tokenized_sentences"]) - 1
                            )
                            if random_document_index != doc_index:
                                break

                        random_document = examples["tokenized_sentences"][
                            random_document_index
                        ]
                        random_start = random.randint(0, len(random_document) - 1)
                        for j in range(random_start, len(random_document)):
                            tokens_b.extend(random_document[j])
                            if len(tokens_b) >= target_b_length:
                                break
                        # We didn't actually use these segments so we "put them back" so
                        # they don't go to waste.
                        num_unused_segments = len(current_chunk) - a_end
                        i -= num_unused_segments
                    else:
                        is_random_next = False
                        for j in range(a_end, len(current_chunk)):
                            tokens_b.extend(current_chunk[j])

                    input_ids = tokenizer.build_inputs_with_special_tokens(
                        tokens_a, tokens_b
                    )
                    # add token type ids, 0 for sentence a, 1 for sentence b
                    token_type_ids = tokenizer.create_token_type_ids_from_sequences(
                        tokens_a, tokens_b
                    )

                    padded = tokenizer.pad(
                        {"input_ids": input_ids, "token_type_ids": token_type_ids},
                        padding="max_length",
                        max_length=MAX_LENGTH,
                    )

                    examples["input_ids"].append(padded["input_ids"])
                    examples["token_type_ids"].append(padded["token_type_ids"])
                    examples["attention_mask"].append(padded["attention_mask"])
                    # Label 1 means sentence B is random; 0 means B follows A
                    examples["next_sentence_label"].append(1 if is_random_next else 0)
                    current_chunk = []
                    current_length = 0
            i += 1

    # We delete all the unnecessary columns from our dataset
    del examples["document"]
    del examples["sentences"]
    del examples["text"]
    del examples["tokenized_sentences"]

    return examples


tokenized_dataset = dataset.map(
    prepare_train_features, batched=True, remove_columns=["text"], num_proc=1,
)


- **Masked Language Modeling (MLM) Preprocessing:**
  - **Token Masking:** Randomly replace some tokens with `[MASK]` and adjust labels to include only the masked tokens.
  - **Special Tokens:** Ensure the `[MASK]` token is included if you trained your own tokenizer.

- **Using DataCollator:**
  - **DataCollatorForLanguageModeling:** Utilize this collator from the 🤗 Transformers library to prepare the dataset for MLM.
  - **Compatibility:** Works on the dataset already prepared for NSP.
  - **Parameters:** The collator is created with `mlm=True` and `mlm_probability=MLM_PROB` (0.20 in this notebook, whereas the original BERT paper masks 15% of tokens), with `return_tensors='tf'` to get `tf.Tensor` objects.
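
The masking step described above can be sketched in plain Python. This is not the collator's actual implementation: it assumes the commonly documented defaults (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged) and the label value `-100` for positions the loss should ignore. The function name `mask_tokens` and all toy IDs are hypothetical.

```python
import random

IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, mlm_probability, rng):
    """Return (masked_ids, labels) following an 80/10/10 masking scheme."""
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token_id in enumerate(input_ids):
        # Never select special tokens; select other tokens with mlm_probability.
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


rng = random.Random(42)
# Toy IDs: 0 = [PAD], 2 = [CLS], 3 = [SEP], 4 = [MASK]
input_ids = [2, 10, 11, 12, 3, 13, 14, 15, 3, 0, 0]
masked_ids, labels = mask_tokens(
    input_ids,
    special_ids={0, 2, 3, 4},
    mask_id=4,
    vocab_size=20,
    mlm_probability=0.2,
    rng=rng,
)
print(masked_ids)
print(labels)
```

Only positions whose label is not `-100` contribute to the MLM loss, which is what "adjust labels to include only the masked tokens" refers to.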

In [12]:
from transformers import DataCollatorForLanguageModeling

collater = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=MLM_PROB, return_tensors="tf"
)

- **Training Set Definition:**
  - **`to_tf_dataset` Method:** Provided by 🤗 Datasets to integrate the dataset with the collator for model training.

- **Key Parameters:**
  - **`columns`:** Specifies the columns for independent variables.
  - **`label_cols`:** Specifies the columns for dependent variables (labels).
  - **`batch_size`:** Defines the batch size for training.
  - **`shuffle`:** Option to shuffle the training dataset.
  - **`collate_fn`:** Refers to the collator function used for data preparation.
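
How these parameters interact can be sketched with a small generator. This is an illustration under assumptions, not the 🤗 Datasets implementation: the helper `iterate_batches` and the collator `toy_collate` are hypothetical. The point it shows is that the collator runs on each batch first, so a column it creates (here `labels`) can be requested in `label_cols` even though the dataset itself has no such column.

```python
import random


def iterate_batches(rows, columns, label_cols, batch_size, shuffle, collate_fn, seed=0):
    """Yield (features, labels) pairs built from collated batches of `rows`."""
    order = list(range(len(rows)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        samples = [rows[i] for i in order[start : start + batch_size]]
        batch = collate_fn(samples)  # may add new columns such as "labels"
        features = {name: batch[name] for name in columns}
        labels = {name: batch[name] for name in label_cols}
        yield features, labels


def toy_collate(samples):
    """Stack per-sample fields into lists and add a placeholder `labels` column."""
    batch = {key: [s[key] for s in samples] for key in samples[0]}
    batch["labels"] = [[-100] * len(s["input_ids"]) for s in samples]
    return batch


rows = [
    {"input_ids": [n, n + 1], "attention_mask": [1, 1], "next_sentence_label": n % 2}
    for n in range(5)
]
batches = list(
    iterate_batches(
        rows,
        columns=["input_ids", "attention_mask"],
        label_cols=["labels", "next_sentence_label"],
        batch_size=2,
        shuffle=True,
        collate_fn=toy_collate,
    )
)
print([len(features["input_ids"]) for features, _ in batches])  # [2, 2, 1]
```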

In [13]:
train = tokenized_dataset["train"].to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["labels", "next_sentence_label"],
    batch_size=TRAIN_BATCH_SIZE,
    shuffle=True,
    collate_fn=collater,
)

validation = tokenized_dataset["validation"].to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["labels", "next_sentence_label"],
    batch_size=TRAIN_BATCH_SIZE,
    shuffle=True,
    collate_fn=collater,
)

## Defining the model

- **Model Configuration:**
  - **Purpose:** Define model architecture parameters such as transformer layers, attention heads, and hidden dimensions.

- **BertConfig Class:**
  - **Usage:** Utilize `BertConfig` from the 🤗 Transformers library to set up the configuration.
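  - **Pretrained Model:** Use the `from_pretrained()` method with the model name (here `bert-base-uncased`, the value of `MODEL_CHECKPOINT`) to replicate the configuration from the original BERT paper. Only the configuration is loaded; no pretrained weights are used.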
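
To give a feel for what these architecture parameters imply, the sketch below estimates the number of weights in a BERT encoder from its configuration values. The default arguments are the values assumed here for `bert-base-uncased` (12 layers, hidden size 768, intermediate size 3072, vocabulary of 30,522), and the function `bert_encoder_param_count` is a hypothetical helper. It counts the embeddings, encoder layers, and pooler only, not the MLM and NSP heads.

```python
def bert_encoder_param_count(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=2,
):
    """Count weights and biases of embeddings, encoder layers, and pooler."""
    layer_norm = 2 * hidden_size  # scale and bias
    embeddings = (
        vocab_size + max_position_embeddings + type_vocab_size
    ) * hidden_size + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers (expand, then project back), followed by a LayerNorm.
    feed_forward = (
        (hidden_size * intermediate_size + intermediate_size)
        + (intermediate_size * hidden_size + hidden_size)
        + layer_norm
    )
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


print(bert_encoder_param_count())  # 109482240
print(bert_encoder_param_count(vocab_size=25000))  # 105241344
```

The second call shows the effect of a 25,000-entry vocabulary such as the one trained in this notebook. Note that the configuration loaded from the checkpoint keeps the checkpoint's own vocabulary size unless `vocab_size` is overridden.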


In [14]:
from transformers import BertConfig

config = BertConfig.from_pretrained(MODEL_CHECKPOINT)

- **Model Definition:**
  - **Class Used:** `TFBertForPreTraining` from the 🤗 Transformers library.
  
- **Functionality:**
  - This class manages the entire process, including model definition, input handling, and loss calculation.
  
- **Ease of Use:**
  - Simply define the model with the desired configuration, and the class takes care of the rest.
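
As a rough picture of the loss calculation mentioned above, the sketch below adds an MLM cross-entropy, averaged over masked positions only, to an NSP cross-entropy. It assumes the total pretraining loss is the plain sum of the two terms and that label `-100` marks ignored positions; the exact reduction inside `TFBertForPreTraining` may differ. All names (`pretraining_loss`, `cross_entropy`) are hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label):
    """Sum of mean MLM loss over masked positions and the NSP loss."""
    active = [
        (logits, label)
        for logits, label in zip(mlm_logits, mlm_labels)
        if label != IGNORE_INDEX
    ]
    mlm_loss = sum(cross_entropy(lg, lb) for lg, lb in active) / max(len(active), 1)
    nsp_loss = cross_entropy(nsp_logits, nsp_label)
    return mlm_loss + nsp_loss


# Toy example: 3 positions, vocabulary of 4, only the middle position is masked.
mlm_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
mlm_labels = [IGNORE_INDEX, 2, IGNORE_INDEX]
nsp_logits = [0.0, 0.0]
total_loss = pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label=0)
print(round(total_loss, 4))  # 2.0794
```

With uniform logits the result is ln 4 + ln 2 = ln 8, the loss of a model that guesses at random on both tasks.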

In [15]:
from transformers import TFBertForPreTraining

model = TFBertForPreTraining(config)

In [16]:
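# Use the learning rate defined above; no loss is passed because the model
# computes its own pretraining loss internally
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE))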

In [None]:
model.fit(train, validation_data=validation, epochs=MAX_EPOCHS)

