<a href="https://colab.research.google.com/github/Zenith1618/LLM/blob/main/Extractive_Question_Answering_by_Finetuning_DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Packages 📦 and Basic Setup

In [1]:
%%capture
!pip install git+https://github.com/huggingface/transformers.git
!pip install datasets
!pip install huggingface-hub

In [2]:
from pprint import pprint

#Loading the Dataset💿

In [3]:
%%capture
from datasets import load_dataset
datasets = load_dataset("squad")

The datasets object is a DatasetDict with one key per split. For SQuAD these are the training and validation sets (there is no test split in this dataset), and each split has a column for the context, the question and the answers to that question. To access an actual element, select a split first, then give an index.

In [4]:
pprint(datasets["train"][0])

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the '
            "Main Building's gold dome is a golden statue of the Virgin Mary. "
            'Immediately in front of the Main Building and facing it, is a '
            'copper statue of Christ with arms upraised with the legend '
            '"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
            'of the Sacred Heart. Immediately behind the basilica is the '
            'Grotto, a Marian place of prayer and reflection. It is a replica '
            'of the grotto at Lourdes, France where the Virgin Mary reputedly '
            'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
            'main drive (and in a direct line that connects through 3 statues '
            'and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did t
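
The answers field stores each answer as text plus a character offset into the context, so the answer can be recovered by slicing the context. A minimal sketch on a made-up record (the record and the helper name extract_answer are illustrative assumptions, not part of SQuAD or the datasets library):

```python
# A made-up record that mimics the layout of a SQuAD example.
context = "The Grotto is a replica of the grotto at Lourdes, France."
answer_text = "Lourdes, France"
record = {
    "context": context,
    "question": "Where is the original grotto?",
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context.
        "answer_start": [context.index(answer_text)],
    },
}


def extract_answer(example, which=0):
    """Slice the answer out of the context using its character offset."""
    start = example["answers"]["answer_start"][which]
    length = len(example["answers"]["text"][which])
    return example["context"][start:start + length]


print(extract_answer(record))
```

The preprocessing step later in the notebook relies on the same character offsets to locate the answer among the tokens.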

#Preprocessing

Before we can feed those texts to our model, we need to preprocess them. This is done by a Transformers Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires.


To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure:

*   We get a tokenizer that corresponds to the model architecture we want to use.
*   We download the vocabulary used when pretraining this specific checkpoint.

In [5]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Unlike in many other tasks, we will not simply truncate long contexts here, because the answer may lie in the part that gets cut off, which would make the example unanswerable. Instead, we let one long example produce several input features, each shorter than the maximum length of the model. Consecutive features overlap, so that an answer falling at a split point is still likely to be fully contained in one of them. The amount of overlap is controlled by the hyperparameter doc_stride.

In [6]:
max_length = 384  # The maximum length of a feature (question and context)
doc_stride = 128  # The authorized overlap between two parts of the context when splitting
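
To make the effect of the overlap concrete, here is a toy sketch of splitting a token list into overlapping windows. It is an illustration under simplifying assumptions, not the tokenizer's actual implementation: the helper split_with_overlap is hypothetical, and it windows over context tokens only, whereas max_length in this notebook also has to cover the question and the special tokens.

```python
def split_with_overlap(tokens, window, stride):
    """Split tokens into chunks of at most `window` items.

    Consecutive chunks share `stride` items, so each new chunk starts
    `window - stride` items after the previous one.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += window - stride
    return chunks


context_tokens = list(range(10))  # stand-in for 10 context tokens
chunks = split_with_overlap(context_tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

With a window of 4 and an overlap of 2, every token except those at the very start and end appears in two chunks, so a short answer that straddles one boundary is still intact in the neighbouring chunk.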

We want to avoid truncating the question, and instead only truncate the context to ensure the task remains solvable. To do that, we set truncation to "only_second", so that only the second sequence (the context) in each pair is truncated. To get the list of features capped by the maximum length, we set return_overflowing_tokens to True and pass doc_stride to stride. To find where the answer falls within each feature, we also set return_offsets_mapping to True, which returns an "offset_mapping" giving the character span of every token in the original text. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the CLS index for both the start and end position. We could also simply discard those examples from the training set if a flag such as allow_impossible_answers were False; the code below always keeps them.

In [7]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results in one example
    # possibly giving several features when a context is long, each of those features having a context that overlaps a bit
    # with the context of the previous feature.
    examples["question"] = [q.lstrip() for q in examples["question"]]   # Strip leading whitespace
    examples["context"] = [c.lstrip() for c in examples["context"]]
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a
    # map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original
    # context. This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what
        # is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this
        # span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the
            # CLS index).
            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the
                # answer.
                # Note: we could go after the last offset if the answer is the last word (edge
                # case).
                while (
                    token_start_index < len(offsets)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
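
The span-labelling logic inside the function can be tried in isolation on toy data. The sketch below makes simplifying assumptions: tokens are whitespace-separated words rather than WordPiece tokens, indices are relative to the context only (no question or special tokens), and the hypothetical helper returns None where the notebook would store the CLS index.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    return [match.span() for match in re.finditer(r"\S+", text)]


def label_answer_span(offsets, start_char, end_char):
    """Return (start_token, end_token) for the answer, or None if the
    answer is not fully inside the span covered by `offsets`."""
    first, last = 0, len(offsets) - 1
    if not offsets or not (
        offsets[first][0] <= start_char and offsets[last][1] >= end_char
    ):
        return None
    # Move past every token that starts at or before the answer start,
    # then step back one to land on the first answer token.
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    # Move back past every token that ends at or after the answer end,
    # then step forward one to land on the last answer token.
    while last >= 0 and offsets[last][1] >= end_char:
        last -= 1
    return first - 1, last + 1


context = "The cat sat on the mat"
answer = "on the mat"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
full_span = label_answer_span(offsets, start_char, end_char)
# A window holding only the first three tokens does not contain the answer.
partial_span = label_answer_span(offsets[:3], start_char, end_char)
print(full_span, partial_span)
```

Here the full context yields token indices 3 to 5 ("on", "the", "mat"), while the shortened window is treated as an impossible answer.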

In [8]:
tokenized_datasets = datasets.map(
    prepare_train_features,
    batched=True,   # pass batches of examples to the function, so one example can yield several features
    # Remove the columns that existed before tokenization was applied, so that the only
    # features remaining are the ones we actually want to pass to our model
    remove_columns=datasets["train"].column_names,
    num_proc=3,     # number of processes used for multiprocessing
)
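
Removing the original columns matters because a batched function may return more rows than it received, and the old columns would then no longer line up with the new ones. The following toy stand-in illustrates the idea on a single batch; toy_batched_map and split_into_pieces are hypothetical helpers written for this sketch, not the datasets library's implementation.

```python
def split_into_pieces(batch):
    """Toy batched function: emit one output row per 5-character piece of each text."""
    pieces = []
    for text in batch["text"]:
        pieces.extend(text[i:i + 5] for i in range(0, len(text), 5))
    return {"piece": pieces}


def toy_batched_map(batch, function, remove_columns=()):
    """Hypothetical stand-in for a batched map applied to one batch."""
    kept = {name: values for name, values in batch.items() if name not in remove_columns}
    merged = {**kept, **function(batch)}
    # All columns of a table must have the same number of rows.
    lengths = {len(values) for values in merged.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return merged


batch = {"text": ["abcdefgh", "xyz"]}  # 2 input rows
result = toy_batched_map(batch, split_into_pieces, remove_columns=["text"])
print(result)  # 3 output rows
```

Calling toy_batched_map without remove_columns raises the ValueError, since the 2-row text column cannot sit next to the 3-row piece column.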

Because all our data has been padded or truncated to the same length, and it is not too large, we can simply convert it to a dict of NumPy arrays, ready for training. (If the data had variable length, or were too large to fit in memory, we could use a tf.data.Dataset instead.)

In [9]:
train_set = tokenized_datasets["train"].with_format("numpy")[:]  # Load the whole dataset as a dict of numpy arrays
validation_set = tokenized_datasets["validation"].with_format("numpy")[:]

#✍️ Fine Tuning the model

Now that our data is ready, we can download the pretrained model.

In [10]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForQuestionAnswering: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForQuestionAnswering were not initialized from the PyTorch model and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it

Because we are switching the task of DistilBERT from language modelling to question answering, the original output layer (head) is discarded and replaced by a new, randomly initialized head for the new task. That is why the warning lists some weights as unused and others as newly initialized.

In [13]:
import tensorflow as tf
from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=5e-5)

When fine-tuning, we should use a small learning rate so that training does not diverge. A typical range is around 1e-5 to 1e-4.


In [17]:
# Optional: enable mixed-precision (float16) training. Comment out the next line to train in float32.
keras.mixed_precision.set_global_policy("mixed_float16")

model.compile(optimizer=optimizer)


ValueError: Could not interpret optimizer identifier: <keras.src.optimizers.adam.Adam object at 0x7ac301baa290>
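
This error usually means that the optimizer object and the model come from two different Keras packages. A likely cause here (an assumption, since the notebook does not show the installed versions) is that recent Transformers releases build their TensorFlow models on the standalone tf_keras package when it is installed, while from tensorflow import keras can resolve to a different Keras. A minimal sketch of one possible workaround, which creates the optimizer from tf_keras when available; the helper names are hypothetical:

```python
import importlib.util


def keras_package_name():
    """Report which Keras package build_optimizer would use.

    Assumption: the model was built with tf_keras when that package is installed.
    """
    if importlib.util.find_spec("tf_keras") is not None:
        return "tf_keras"
    return "tensorflow.keras"


def build_optimizer(learning_rate=5e-5):
    """Create Adam from the chosen Keras package (requires TensorFlow to be installed)."""
    try:
        import tf_keras as keras
    except ImportError:
        from tensorflow import keras
    return keras.optimizers.Adam(learning_rate=learning_rate)


print(keras_package_name())
```

If this diagnosis applies, passing build_optimizer() to model.compile should avoid the mismatch; if the error persists, the installed TensorFlow and Transformers versions are worth checking.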

As a convenience, all Transformers models come with a default loss which matches their output head, so we can call compile without specifying one.