## Q1) Select the Bert-base-uncased model.

## Q2) Calculate the number of parameters of the selected model from the code. Does the calculated count match the count reported in the respective paper?

In [None]:
%pip install transformers
%pip install datasets
%pip install evaluate
%pip install "accelerate>=0.20.1"
%pip install sentencepiece
%pip install sacrebleu
%pip install torch
%pip install transformers[torch]
%pip install prettytable


In [None]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

total_params = sum(p.numel() for p in model.parameters())

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
print(f"Total parameters in the model (Actual): {total_params}")
print("Total parameters in the model (On Paper): 108432384")

Total parameters in the model (Actual): 109514298
Total parameters in the model (On Paper): 108432384


In [None]:
from prettytable import PrettyTable

print("Description of Parameters:")

# to display trainable parameters in the model
def count_parameters(model):
    table = PrettyTable(["Modules", "Parameters"])
    total_params = 0
    for name, parameter in model.named_parameters():
        # skip the non-trainable parameter (if it doesn't require gradient)
        if not parameter.requires_grad:
            continue
        params = parameter.numel()
        table.add_row([name, params])
        total_params += params
    print(table)
    print(f"Total Trainable Params: {total_params}")
    return total_params
    # Source: Stackoverflow (https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model)

count_parameters(model)

Description of Parameters:
+---------------------------------------------------------+------------+
|                         Modules                         | Parameters |
+---------------------------------------------------------+------------+
|          bert.embeddings.word_embeddings.weight         |  23440896  |
|        bert.embeddings.position_embeddings.weight       |   393216   |
|       bert.embeddings.token_type_embeddings.weight      |    1536    |
|             bert.embeddings.LayerNorm.weight            |    768     |
|              bert.embeddings.LayerNorm.bias             |    768     |
|     bert.encoder.layer.0.attention.self.query.weight    |   589824   |
|      bert.encoder.layer.0.attention.self.query.bias     |    768     |
|      bert.encoder.layer.0.attention.self.key.weight     |   589824   |
|       bert.encoder.layer.0.attention.self.key.bias      |    768     |
|     bert.encoder.layer.0.attention.self.value.weight    |   589824   |
|      bert.encoder.laye

109514298

### Reasons for the difference in the total number of parameters:
    1) Additional biases on each of the self-attention query, key, value and output projections.
    2) An extra layer normalization at the beginning (after the embeddings).
    3) The token type embedding weights are not counted in the paper.
    4) The prediction head's bias, weights and layer normalization are specific to this pretrained masked-LM model.
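
As a rough cross-check of the total printed above, the count can be rebuilt by hand. This is a sketch under stated assumptions: the standard BERT-base hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, vocabulary 30522, 512 positions, 2 token types) are typed in rather than read from the checkpoint, and the masked-LM decoder weight is assumed to be tied to the word embeddings so that it is counted only once.

```python
# Assumed BERT-base hyperparameters (not read from the checkpoint).
vocab_size, max_positions, type_vocab = 30522, 512, 2
hidden, layers, intermediate = 768, 12, 3072


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale vector and one shift vector

# Word, position and token type embeddings, followed by one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm                      # LayerNorm after attention
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + layer_norm                      # LayerNorm after feed-forward
)

# Masked-LM head: transform dense + LayerNorm + one output bias per vocabulary entry.
# Assumption: the decoder weight is tied to the word embeddings and adds nothing.
mlm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * encoder_layer + mlm_head

print(f"embeddings:    {embeddings:,}")
print(f"encoder layer: {encoder_layer:,} x {layers}")
print(f"MLM head:      {mlm_head:,}")
print(f"total:         {total:,}")
```

Under these assumptions the sum reproduces the 109514298 reported by `model.parameters()`, which suggests the gap to the paper's figure comes from what is counted rather than from a different architecture.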

## Q3) Pretrain the selected model on the train split of ‘wikitext-2-raw-v1’ for 5 epochs. Use hyperparameters of your choice.

### Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


### Training the Tokenizer

In [None]:
# to convert a dataset to a text file
def dataset_to_text(dataset, output_filename="data.txt"):
    with open(output_filename, "w") as f:
        for t in dataset["text"]:
            print(t, file=f)

dataset_to_text(dataset["train"], "train.txt")


In [None]:
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"] # special tokens to be used in the vocabulary or text processing

files = ["train.txt"]

vocab_size = 30522

max_length = 512

truncate_longer_samples = True # to truncate longer samples to match the maximum length (max_length)

In [None]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()

tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)

tokenizer.enable_truncation(max_length=max_length) # enabling truncation of sequences longer than the specified maximum length
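
For intuition about what the trained vocabulary is used for, here is a minimal pure-Python sketch of the greedy longest-match-first splitting that WordPiece applies to a single word at encoding time. The function name and the toy vocabulary are hypothetical and only for illustration; the real `BertWordPieceTokenizer` also handles normalization, pre-tokenization and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the one trained above.
toy_vocab = {"un", "##aff", "##able", "token", "##izer"}

print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("tokenizer", toy_vocab))  # ['token', '##izer']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```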

In [None]:
import os
import json
from transformers import BertTokenizerFast

model_path = "pretrained_tokenizer" # to save the tokenizer model

if not os.path.isdir(model_path):
    os.mkdir(model_path)

tokenizer.save_model(model_path)

with open(os.path.join(model_path, "config.json"), "w") as f:
    # tokenizer configuration parameters
    tokenizer_cfg = {
        "do_lower_case": True,
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "model_max_length": max_length,
        "max_len": max_length,
    }
    json.dump(tokenizer_cfg, f)

tokenizer = BertTokenizerFast.from_pretrained(model_path) # loading the saved tokenizer model from the model path

tokenizer.save_pretrained(model_path) # saving the tokenizer configuration and vocabulary to the model path


('pretrained_tokenizer/tokenizer_config.json',
 'pretrained_tokenizer/special_tokens_map.json',
 'pretrained_tokenizer/vocab.txt',
 'pretrained_tokenizer/added_tokens.json',
 'pretrained_tokenizer/tokenizer.json')

### Tokenize the Dataset

In [None]:
# encode examples with truncation if specified, else without truncation
def encode_with_truncation(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length",
                     max_length=max_length, return_special_tokens_mask=True)

def encode_without_truncation(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True)

# choosing the appropriate encoding function based on the truncate_longer_samples flag defined above
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation

train_dataset = dataset["train"].map(encode, batched=True)

train_dataset = train_dataset.select(range(2000))

test_dataset = dataset["test"].map(encode, batched=True)

test_dataset = test_dataset.select(range(500))

# setting the format of datasets based on truncation flag - for Torch or special tokens mask
if truncate_longer_samples:
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
    test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])


In [None]:
from itertools import chain

# to group texts into chunks of max_length when truncation is not applied
def group_texts(examples):
    # concatenating examples into a single list for each key in the examples dictionary
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # when the total length exceeds max_length, truncate to multiples of max_length
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length

    # splitting the concatenated examples into chunks of max_length
    result = {
        k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated_examples.items()
    }
    return result

# processing train and test datasets by grouping texts into chunks of max_length when truncation is not applied
if not truncate_longer_samples:
    train_dataset = train_dataset.map(group_texts, batched=True,
                                      desc=f"Grouping texts in chunks of {max_length}")
    test_dataset = test_dataset.map(group_texts, batched=True,
                                    desc=f"Grouping texts in chunks of {max_length}")

    train_dataset.set_format("torch")
    test_dataset.set_format("torch")
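
To see what the grouping step does, here is a small self-contained variant run on a toy batch. It is a sketch, not the cell above: the name `group_into_chunks` and the explicit `chunk_size` argument are introduced here for illustration, and unlike `group_texts` it always drops the ragged tail, even when the batch is shorter than one chunk.

```python
from itertools import chain


def group_into_chunks(examples, chunk_size):
    """Concatenate every column, drop the ragged tail, and cut into fixed-size chunks."""
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Keep only as many tokens as fill whole chunks.
    total_length = (total_length // chunk_size) * chunk_size
    return {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }


# Three tokenized "documents" with 3, 2 and 4 tokens: 9 tokens in total.
batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
chunks = group_into_chunks(batch, chunk_size=4)
print(chunks)  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

The ninth token is discarded because it cannot fill a chunk, which is the price of having every training example at exactly the same length without padding.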


### Training

In [None]:
from transformers import BertForMaskedLM, BertConfig, DataCollatorForLanguageModeling, TrainingArguments
from transformers import Trainer
import accelerate
import torch

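# num_hidden_layers=12 is the BERT-base depth; the original keyword `hidden_state` is not a BertConfig option
model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length, num_hidden_layers=12)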
model = BertForMaskedLM(config=model_config)

# data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./Bert-Base-Uncased-Pretrained",
    evaluation_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=5,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
)

perplexity_scores=[]

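# looping through 5 epochs (as mentioned in the question): each trainer.train() call
# runs one epoch (num_train_epochs=1) and continues from the current model weights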
for epoch in range(5):
    trainer.train()

    eval_results = trainer.evaluate()

    eval_loss = eval_results.get('eval_loss', None)

    perplexity = torch.exp(torch.tensor(eval_loss)).item()

    perplexity_scores.append(perplexity)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,No log,8.469707


Epoch,Training Loss,Validation Loss
1,No log,7.896328


Epoch,Training Loss,Validation Loss
1,No log,7.526801


Epoch,Training Loss,Validation Loss
1,No log,7.407213


Epoch,Training Loss,Validation Loss
1,No log,7.209648
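
The Training Loss column shows "No log" in every table. A likely explanation, sketched below, is that each `trainer.train()` call performs far fewer optimizer steps than the logging interval. The values for `logging_steps` (500, which I believe is the `TrainingArguments` default) and the single device are assumptions, since neither is set explicitly in the cell above.

```python
import math

train_examples = 2000            # train_dataset.select(range(2000))
per_device_batch_size = 5        # per_device_train_batch_size
gradient_accumulation_steps = 8
logging_steps = 500              # assumed: the TrainingArguments default
num_devices = 1                  # assumed: a single GPU

# Batches seen in one epoch, and how many optimizer updates they produce.
batches_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
effective_batch_size = per_device_batch_size * num_devices * gradient_accumulation_steps

print(f"batches per epoch:    {batches_per_epoch}")
print(f"optimizer updates:    {updates_per_epoch}")
print(f"effective batch size: {effective_batch_size}")
print(f"training loss logged: {updates_per_epoch >= logging_steps}")
```

With 50 updates per call and an interval of 500, no training-loss entry is written; setting a smaller `logging_steps` would make the column informative.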


## Q4) Compute and report the perplexity scores using the inbuilt function on the test split of ‘wikitext-2-raw-v1’ for each epoch. Do the scores decrease after every epoch? Why or why not?

In [None]:
print(perplexity_scores)

[4788.35302734375, 2485.92333984375, 1886.669189453125, 1536.3048095703125, 1346.3994140625]


### Reasons why perplexity decreases:

> As the output above shows, the perplexity scores decrease after every epoch. The reasons for this can be:


    1)
    2)
    3)
    4)
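
Since the training loop computes perplexity as the exponential of the evaluation loss, the trend can be checked directly from the printed scores. The sketch below only reuses the five values reported above; it recovers the loss implied by each score and measures how much each epoch improved on the previous one.

```python
import math

# Perplexity scores printed above, one per epoch.
perplexity_scores = [4788.35302734375, 2485.92333984375, 1886.669189453125,
                     1536.3048095703125, 1346.3994140625]

# perplexity = exp(loss), so the loss behind each score is its natural logarithm.
implied_losses = [math.log(p) for p in perplexity_scores]

pairs = list(zip(perplexity_scores, perplexity_scores[1:]))
strictly_decreasing = all(prev > curr for prev, curr in pairs)
relative_drops = [1 - curr / prev for prev, curr in pairs]

print("implied losses:", [round(loss, 3) for loss in implied_losses])
print("strictly decreasing:", strictly_decreasing)
print("relative drop per epoch:", [f"{drop:.1%}" for drop in relative_drops])
```

The scores fall every epoch, but each relative drop is smaller than the one before it, which is the usual shape of a loss curve that is still improving while starting to flatten.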

## Q5) Push the pre-trained model to HuggingFace

In [None]:
# Token = <redacted> (enter the access token in the login prompt; do not store it in the notebook)

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
trainer.push_to_hub()

'https://huggingface.co/dewanshsinghchandel/Bert-Base-Uncased-Pretrained/tree/main/'

## Q6 a) Fine-tune the final pretrained model on the following three tasks:
## Classification: SST-2

In [None]:
from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("sst2")
# concatenating train, validation, and test sets into a single dataset
# caution: the sst2 test split ships without gold labels (label = -1), so those rows land in both new splits
dataset = concatenate_datasets([dataset["train"], dataset["validation"], dataset["test"]])
dataset = dataset.train_test_split(test_size=0.2, seed=1)
dataset

Downloading builder script:   0%|          | 0.00/3.77k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.85k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.10k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.44M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 56033
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 14009
    })
})

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("dewanshsinghchandel/Bert-Base-Uncased-Pretrained",num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
def tokenize_function(train_dataset):
    return tokenizer(train_dataset['sentence'], padding='max_length', truncation=True)


tokenized_dataset = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_dataset['train'][:1000]
test_dataset = tokenized_dataset['test'][:250]

Map:   0%|          | 0/14009 [00:00<?, ? examples/s]

In [None]:
import tensorflow as tf

# training features and labels for the final model
train_features = { x: train_dataset[x] for x in tokenizer.model_input_names  }
train_set_for_final_model = tf.data.Dataset.from_tensor_slices((train_features, train_dataset['label'] )) # creating tensorflow dataset
train_set_for_final_model = train_set_for_final_model.shuffle(len(train_dataset)).batch(8) # shuffling

# testing features and labels for the final model
test_features = {x: test_dataset[x] for x in tokenizer.model_input_names}
test_set_for_final_model = tf.data.Dataset.from_tensor_slices((test_features, test_dataset["label"]))
test_set_for_final_model =test_set_for_final_model.batch(8) # batching

In [None]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("dewanshsinghchandel/Bert-Base-Uncased-Pretrained", num_labels=3)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), # loss function for sparse categorical labels
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Q9) Push the fine-tuned model to Hugging Face

In [None]:
# Token = <redacted> (enter the access token in the login prompt; do not store it in the notebook)

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from transformers import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="finetune_sst2", tokenizer=tokenizer, hub_model_id="dewanshsinghchandel/finetune_sst2"
)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/dewanshsinghchandel/finetune_sst2 into local empty directory.


Download file tf_model.h5:   0%|          | 8.00k/418M [00:00<?, ?B/s]

Clean file tf_model.h5:   0%|          | 1.00k/418M [00:00<?, ?B/s]

In [None]:
model.fit(train_set_for_final_model, validation_data=test_set_for_final_model, epochs=3, callbacks=[push_to_hub_callback])

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7dcc20065300>

## Q7 a) Calculate the scores for the following metrics on the test splits. Note that metrics depend on the selected task:
## Classification: Accuracy, Precision, Recall, F1


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

test_labels = test_dataset["label"] # extracting true labels from the test data
predictions = model.predict(test_set_for_final_model).logits

predicted_labels = predictions.argmax(axis=1)

accuracy = accuracy_score(test_labels, predicted_labels)
precision = precision_score(test_labels, predicted_labels, average='weighted')
recall = recall_score(test_labels, predicted_labels, average='weighted')
f1 = f1_score(test_labels, predicted_labels, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Accuracy: 0.376
Precision: 0.141376
Recall: 0.376
F1 Score: 0.20548837209302329


  _warn_prf(average, modifier, msg_start, len(result))


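**Reasoning for good/bad performance:** As inferred from the output above, the accuracy is low (0.376), so the model performs poorly on the test data. The weighted recall equals the accuracy, and the weighted precision (0.141376) is exactly the accuracy squared (0.376² = 0.141376). That is the pattern produced when a model predicts the same class for every sample, and the scikit-learn warning printed after the scores (`_warn_prf`) is consistent with at least one class never being predicted. The scores therefore point to a collapsed classifier rather than to a large number of false positives for one class.

> The reasons for poor performance can be:


    1) The model might have overfitted the training data, which was limited to 1000 examples.
    2) Noise/inconsistency in the data can mislead the model and negatively impact its performance. Here the unlabelled SST-2 test split (label -1) was merged into the data before splitting.
    3) The inputs were tokenized with the `bert-base-uncased` vocabulary, while the model was pretrained with the tokenizer trained above, so token ids may not line up with the learned embeddings.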
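
The scores reported above can be reproduced by a classifier that always outputs one class. The sketch below is a pure-Python illustration with a hypothetical label distribution (94 of 250 test samples in the predicted class, chosen to match the 0.376 accuracy); it does not use the real predictions, and `weighted_scores` is a helper written here that mirrors support-weighted averaging.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (undefined values count as 0)."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, support in Counter(y_true).items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec = true_pos / predicted if predicted else 0.0
        rec = true_pos / support
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = support / n
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical distribution: 94 samples of class 1, 156 of class 0.
y_true = [1] * 94 + [0] * 156
y_pred = [1] * 250  # the classifier always predicts class 1

accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

A constant predictor with class share p gives accuracy p, weighted precision p², weighted recall p and weighted F1 of 2p²/(1+p), which for p = 0.376 matches all four reported numbers.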

## Q8) Calculate the number of parameters in the model after fine-tuning. Does it remain the same as the pre-trained model?

In [None]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("dewanshsinghchandel/Bert-Base-Uncased-Pretrained")

total_params_pretrained = sum(p.numel() for p in model.parameters())

model = AutoModelForMaskedLM.from_pretrained("dewanshsinghchandel/finetune_sst2",from_tf=True)

total_params_finetune_1 = sum(p.numel() for p in model.parameters())

tf_model.h5:   0%|          | 0.00/438M [00:00<?, ?B/s]

All TF 2.0 model weights were used when initializing BertForMaskedLM.

Some weights of BertForMaskedLM were not initialized from the TF 2.0 model and are newly initialized: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
print("total_params_pretrained = ",total_params_pretrained)
print("total_params_finetune_1 = ",total_params_finetune_1)

total_params_pretrained =  109514298
total_params_finetune_1 =  109514298


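### Yes, the number of parameters remains the same after fine-tuning

Note that both checkpoints are loaded into the same `BertForMaskedLM` architecture here, so the counts match by construction. The loading message above shows masked-LM head weights being newly initialized, which means the fine-tuned classification head is not part of this count.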
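
To make the comparison concrete, the two task heads can be counted by hand. This is a sketch under assumptions about the usual Hugging Face head layouts: the masked-LM head is taken to be a dense transform, a LayerNorm and an output bias (decoder weight tied to the embeddings), and the sequence-classification head is taken to be the pooler dense layer plus a linear classifier with `num_labels=3` as used above.

```python
hidden, vocab_size, num_labels = 768, 30522, 3


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


# Assumed head layouts (see the lead-in); only the backbone is shared between them.
mlm_head = linear(hidden, hidden) + 2 * hidden + vocab_size
classification_head = linear(hidden, hidden) + linear(hidden, num_labels)

reported_mlm_total = 109_514_298            # printed above for both checkpoints
backbone = reported_mlm_total - mlm_head    # embeddings + encoder layers

classification_total = backbone + classification_head

print(f"backbone:             {backbone:,}")
print(f"masked-LM model:      {backbone + mlm_head:,}")
print(f"classification model: {classification_total:,}")
print(f"difference:           {reported_mlm_total - classification_total:,}")
```

Under these assumptions the fine-tuned classifier would have slightly fewer parameters than the masked-LM model, so a like-for-like check would load the fine-tuned checkpoint with a sequence-classification class before counting.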

## Q6 b) Fine-tune the final pretrained model on the following three tasks:
## Question-Answering: SQuAD

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "dewanshsinghchandel/Bert-Base-Uncased-Pretrained"

model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/611 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at dewanshsinghchandel/Bert-Base-Uncased-Pretrained and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/224k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/704k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [None]:
from datasets import load_dataset, concatenate_datasets

squad = load_dataset("squad")

# merging train and validation, then re-splitting 80/20 with a fixed seed
squad = concatenate_datasets([squad["train"], squad["validation"]])
squad = squad.train_test_split(test_size=0.2, seed=1)
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 78535
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 19634
    })
})

In [None]:
# Preprocess the data to a BERT format
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]] # stripping leading/trailing spaces from questions
    # Tokenize questions and contexts, ensuring a maximum length of 384 tokens
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second", # ensures the context is truncated when needed
        return_offsets_mapping=True,
        padding="max_length", # pads sequences to their maximum length
    )

    # extracting offset mappings and answer positions
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    # determining start and end positions for answers
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # finding start and end positions in the context for the answers
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # if the answer lies outside the (possibly truncated) context, label it with position 0
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # finding token positions corresponding to the answer's character positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
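# select() keeps a Dataset object (slicing with [:n] would return a plain dict, which Trainer cannot index)
train_dataset = tokenized_squad['train'].select(range(1000))
test_dataset = tokenized_squad['test'].select(range(250))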

Map:   0%|          | 0/78535 [00:00<?, ? examples/s]

Map:   0%|          | 0/19634 [00:00<?, ? examples/s]
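
The core of `preprocess_function` is the conversion from an answer's character span to token positions. The sketch below isolates that step on a toy sentence. The helper name `char_span_to_token_span` and the whitespace-based offsets are hypothetical stand-ins for the tokenizer's `offset_mapping`, used only to make the logic easy to follow.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map an answer's character span to token indices, or (0, 0) if it is outside the context."""
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0
    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


context = "BERT was released by Google in 2018"

# Hypothetical whitespace tokenization: one (start_char, end_char) pair per word.
offsets, pos = [], 0
for word in context.split():
    offsets.append((pos, pos + len(word)))
    pos += len(word) + 1

answer = "Google"
start_char = context.index(answer)
end_char = start_char + len(answer)

full_span = char_span_to_token_span(offsets, 0, len(offsets) - 1, start_char, end_char)
truncated_span = char_span_to_token_span(offsets, 0, 3, start_char, end_char)

print(full_span)       # (4, 4): "Google" is the fifth word
print(truncated_span)  # (0, 0): the answer was cut off by truncation
```

The second call mimics a context truncated after "by": the answer no longer falls inside the window, so both positions fall back to 0, just as in the preprocessing cell.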

In [None]:
!pip install accelerate -U
!pip install "accelerate>=0.20.1"
!pip install "transformers[torch]"



In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, default_data_collator, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("dewanshsinghchandel/Bert-Base-Uncased-Pretrained")
model = AutoModelForQuestionAnswering.from_pretrained("dewanshsinghchandel/Bert-Base-Uncased-Pretrained")

# training configuration
training_args = TrainingArguments(
    output_dir="./finetune_squad_v2",
    evaluation_strategy="epoch",
    learning_rate=0.01,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=11,
    num_train_epochs=1,
    weight_decay=0.01,
)

# trainer for fine-tuning
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

trainer.train()


## Q9) Push the fine-tuned model to Hugging Face

In [None]:
# Token = <redacted> (enter the access token in the login prompt; do not store it in the notebook)

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
trainer.push_to_hub()

## Q7 b) Calculate the scores for the following metrics on the test splits. Note that metrics depend on the selected task:
## Question-Answering: squad_v2, F1, METEOR, BLEU, ROUGE, exact-match


In [None]:
import evaluate
import numpy as np
import sacrebleu

squad_v2_metric = evaluate.load("squad_v2")
meteor_metric = evaluate.load("meteor")
rouge_metric = evaluate.load("rouge")

# the QA head returns start/end logits, so answers are decoded as token spans of the input
output = trainer.predict(test_dataset)
start_logits, end_logits = output.predictions
start_labels, end_labels = output.label_ids

pred_starts = np.argmax(start_logits, axis=1)
pred_ends = np.argmax(end_logits, axis=1)

# to turn a (start, end) token span into text; an empty or inverted span gives ""
def decode_span(input_ids, start, end):
    return tokenizer.decode(input_ids[start : end + 1], skip_special_tokens=True).strip()

all_input_ids = test_dataset["input_ids"]
predictions = [decode_span(ids, s, e) for ids, s, e in zip(all_input_ids, pred_starts, pred_ends)]
references = [decode_span(ids, s, e) for ids, s, e in zip(all_input_ids, start_labels, end_labels)]

# squad_v2 expects ids, a no-answer probability, and answers given as text/answer_start lists
# (answer_start is a placeholder here because only the answer text is compared)
ids = [str(i) for i in range(len(predictions))]
squad_predictions = [
    {"id": i, "prediction_text": p, "no_answer_probability": 0.0}
    for i, p in zip(ids, predictions)
]
squad_references = [
    {"id": i, "answers": {"text": [r] if r else [], "answer_start": [0] if r else []}}
    for i, r in zip(ids, references)
]
squad_v2_result = squad_v2_metric.compute(predictions=squad_predictions, references=squad_references)

results = {
    "squad_v2_f1": squad_v2_result["f1"],
    "exact_match": squad_v2_result["exact"],
    "meteor": meteor_metric.compute(predictions=predictions, references=references)["meteor"],
    "bleu": sacrebleu.corpus_bleu(predictions, [references]).score,
    "rouge1": rouge_metric.compute(predictions=predictions, references=references)["rouge1"],
}

print(results)

## Q8) Calculate the number of parameters in the model after fine-tuning. Does it remain the same as the pre-trained model?

In [None]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("dewanshsinghchandel/Bert-Base-Uncased-Pretrained")

total_params_pretrained = sum(p.numel() for p in model.parameters())

model = AutoModelForMaskedLM.from_pretrained("dewanshsinghchandel/results")

total_params_finetune_2 = sum(p.numel() for p in model.parameters())

config.json:   0%|          | 0.00/669 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForMaskedLM were not initialized from the model checkpoint at dewanshsinghchandel/results and are newly initialized: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
print("total_params_pretrained = ",total_params_pretrained)
print("total_params_finetune_2 = ",total_params_finetune_2)

total_params_pretrained =  109514298
total_params_finetune_2 =  109514298


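### Yes, the number of parameters remains the same after fine-tuning

As with the classification model, both checkpoints are loaded as `BertForMaskedLM`, and the message above lists the masked-LM head weights as newly initialized. The question-answering head (`qa_outputs`, 768 × 2 + 2 = 1538 parameters) is therefore not included in this count.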

## Q10) Write appropriate comments and rationale behind:
## a) Poor/good performance.
There are several reasons behind our model's poor performance: <br>
1) Training on few examples: To reduce training time, we reduced the number of examples in the training dataset, which significantly reduced the model's performance.

2) Hyperparameter tuning: Model hyperparameters play a crucial role in determining performance. Careful tuning of hyperparameters, such as learning rate, regularization strength, and architecture-specific parameters, can lead to improved generalization and better overall performance.

3) Diversity of human language: Our NLP model may perform poorly because of the vast and nuanced nature of human language. Languages are rich in variety, including dialects, slang, and cultural nuances that may not be adequately represented in the training data. If the model has been trained primarily on a specific subset of language, it may struggle to generalize and comprehend the full spectrum of linguistic diversity.

## b) Understanding from the number of parameters between pretraining and fine-tuning of the model.

1) Domain-specific features in fine-tuning: Fine-tuning may introduce task-specific adaptations to the model architecture, typically a new output head in place of the pretraining head. These changes may slightly alter the parameter count, but they are minor compared to the overall total. The goal is to retain the general knowledge gained during pretraining while tailoring the model to the intricacies of the target task.

2) Parameter freezing and feature extraction: In some cases, certain layers or parameters of the pretrained model may be frozen during fine-tuning, especially if the lower layers capture generic features. This focuses learning on the task-specific layers and reduces the risk of overfitting. Freezing lowers the number of trainable parameters while the total parameter count stays the same.

# Computing the metrics over the models

### Pretrained ->

## Contribution