# How To Train Model for Open Book Q&A Technique
In this notebook we demonstrate how to train a model for use with the top-scoring Open Book Q&A method. The Open Book method was first presented by JJ (@jjinho) [here][1]; Quangteo (@quangbk) then reduced its RAM usage [here][2], and Anil (@nlztrk) combined it with Q&A [here][3]. Radek (@radek1) demonstrated the strength of Q&A [here][5]. Next, Mgoksu (@mgoksu) demonstrated how to achieve a top public LB score of 0.807 with this method [here][4] by finetuning DeBERTa large.

In order to train a model for use with Open Book Q&A, we need a CSV that contains `prompt` (i.e. the question), `A, B, C, D, E` (i.e. the answer choices), and a `context` column extracted from Wikipedia pages for each question. To generate the `context` column, we run Mgoksu's notebook [here][4]. In code cell #5, we load our CSV without the `context` column with the code `trn = pd.read_csv(OUR_DATASET.CSV)`. Then in code cell #21 our dataset is saved to disk as `test_context.csv` with the `context` column added.
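
Before running the retrieval notebook, it can help to confirm which of the expected columns a CSV already has. The snippet below is a minimal sketch on a tiny made-up CSV, not part of the original workflow. The `answer` column is an assumption (training needs a label), and the helper name `missing_columns` is hypothetical.

```python
import csv
import io

# Columns the final training CSV is expected to have.
# "answer" is assumed here because training needs a label column.
REQUIRED_COLUMNS = ["prompt", "A", "B", "C", "D", "E", "answer", "context"]

def missing_columns(fieldnames):
    """Return the required columns that are absent from a CSV header."""
    return [col for col in REQUIRED_COLUMNS if col not in fieldnames]

# Tiny made-up CSV standing in for OUR_DATASET.CSV (hypothetical content)
sample_csv = io.StringIO(
    "prompt,A,B,C,D,E,answer\n"
    "What is 1+1?,1,2,3,4,5,B\n"
)
header = csv.DictReader(sample_csv).fieldnames
missing = missing_columns(header)
print(missing)  # only the context column still has to be generated
```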

I searched for and concatenated all publicly shared datasets into one 60k CSV and then ran Mgoksu's notebook with `NUM_TITLES_INCLUDE = 5` and `NUM_SENTENCES_INCLUDE = 20`. This added a `context` column. I uploaded the resulting CSV file to a Kaggle dataset [here][6]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)
 
(image source [here][7])

[1]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[2]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[3]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[4]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[7]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0

# Load CSV
We will load a 60k CSV of `prompt`, `A, B, C, D, E`, and `context` from my Kaggle dataset [here][1]. This dataset is all publicly shared datasets concatenated and then processed with Mgoksu's notebook [here][2] to create a `context` column. (To learn more about the datasets within, read my discussion post.) This Kaggle dataset also contains the competition `train.csv` with an added `context` column (to be used as a validation dataset).

In this train notebook, we have internet turned on and can choose whatever model we wish to download and train. After we finetune this model, we will create a second notebook with the Open Book Q&A technique and load the finetuned model from the output of this notebook. The second notebook will have internet turned off so that it can be submitted to Kaggle's competition.

[1]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[2]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model

In [53]:
import os
# Set CUDA visible devices to GPU 0 and 1 (The 2xT4 GPUs that Kaggle provides)
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

from typing import Optional, Union
import pandas as pd, numpy as np, torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer, AutoModelForMultipleChoice, EarlyStoppingCallback, \
                         TrainingArguments, Trainer, set_seed
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

# Random seed
seed = 42
set_seed(seed)
# Define constants
VER=1
# Number of encoder layers to freeze (DeBERTa-v3-large has 24 layers in total)
FREEZE_LAYERS = 18
# Boolean to freeze embeddings
FREEZE_EMBEDDINGS = True
# Length of context + question + answers
MAX_INPUT = 256
# The Hugging Face model we're using
MODEL = 'microsoft/deberta-v3-large'

In [54]:
# Read the Q&A data (with context) from a CSV file; it is split into train/validation/test below
qna_df = pd.read_csv('/kaggle/input/openbook-qna/openbook-qna-data.csv')
print('Validation data size:', qna_df.shape )
qna_df.head()

Validation data size: (200, 8)


| | prompt | A | B | C | D | E | answer | context |
|---|---|---|---|---|---|---|---|---|
| 0 | Which of the following statements accurately d... | MOND is a theory that reduces the observed mis... | MOND is a theory that increases the discrepanc... | MOND is a theory that explains the missing bar... | MOND is a theory that reduces the discrepancy ... | MOND is a theory that eliminates the observed ... | D | In cosmology, the missing baryon problem is an... |
| 1 | Which of the following is an accurate definiti... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... | A | Dynamic scaling (sometimes known as Family-Vic... |
| 2 | Which of the following statements accurately d... | The triskeles symbol was reconstructed as a fe... | The triskeles symbol is a representation of th... | The triskeles symbol is a representation of a ... | The triskeles symbol represents three interloc... | The triskeles symbol is a representation of th... | A | thumb\|Neolithic triple spiral symbol A triskel... |
| 3 | What is the significance of regularization in ... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | C | In physics, especially quantum field theory, r... |
| 4 | Which of the following statements accurately d... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | D | Kinematic diffraction is the approach to study... |


In [55]:
# Split into train (50%), validation (25%), and test (25%) sets
train_df, temp_df = train_test_split(qna_df, test_size=0.5, random_state=seed)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=seed)

# Data Loader
Code is from Radek's notebook [here][1] with modifications to the tokenization process from Chris Deotte's notebook [here][2].

[1]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[2]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1
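
For the Open Book setting, the retrieved `context` has to reach the model together with the question and each answer choice. The snippet below is a rough sketch of one way the text pairs could be laid out before tokenization. The layout, the `[SEP]` separator string, and the helper name `build_text_pairs` are assumptions for illustration and are not taken from the referenced notebooks.

```python
def build_text_pairs(example, sep=" [SEP] "):
    """Build one (first, second) text pair per answer choice.

    The context is repeated as the first segment; the second segment
    holds the question followed by a single answer choice.
    """
    first = [example["context"]] * 5
    second = [example["prompt"] + sep + example[option] for option in "ABCDE"]
    return first, second

# Made-up example row with the same columns as the training CSV
example = {
    "context": "Water boils at 100 degrees Celsius at sea level.",
    "prompt": "At what temperature does water boil at sea level?",
    "A": "50 C", "B": "100 C", "C": "150 C", "D": "200 C", "E": "0 C",
    "answer": "B",
}
first, second = build_text_pairs(example)
print(second[1])

# The pairs could then be tokenized with a call along these lines, so that
# only the context is truncated when the input exceeds MAX_INPUT tokens:
# tokenizer(first, second, truncation="only_first", max_length=MAX_INPUT)
```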

In [56]:
# Mapping options to indices
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k, v in option_to_index.items()}

def preprocess(example):
    # Repeat the prompt for all five choices
    first_sentence = [example['prompt']] * 5
    # Extract sentences corresponding to choices 'A' to 'E'
    second_sentences = [example[option] for option in 'ABCDE']
    # Tokenize the sentence pairs with the global `tokenizer` (created in the next cell)
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation=False)
    # Assign the index corresponding to the correct answer as the label
    tokenized_example['label'] = option_to_index[example['answer']]

    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        # Determine the label name ('label' or 'labels')
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        # Extract labels from each example and remove the corresponding key
        labels = [feature.pop(label_name) for feature in features]
        # Compute batch size and number of choices
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        # Restructure features into a list of dictionaries for each choice
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        # Flatten the list of dictionaries
        flattened_features = sum(flattened_features, [])

        # Tokenize and pad the examples into a batch
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        # Reshape the batch to have dimensions (batch_size, num_choices, -1)
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add labels to the batch as a PyTorch tensor
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        # Return the formatted batch
        return batch
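
The flatten, pad, and reshape steps in the collator above can be hard to picture. Here is a minimal NumPy sketch of the same idea on made-up token ids. The pad id `0` and the tiny sequences are assumptions for illustration; the real collator delegates padding to `tokenizer.pad`.

```python
import numpy as np

batch_size, num_choices = 2, 5

# Made-up tokenized features: each example holds 5 token-id lists of varying length
features = [
    {"input_ids": [[1, 2], [1, 2, 3], [1], [1, 2], [1, 2, 3, 4]]},
    {"input_ids": [[5], [5, 6], [5, 6, 7], [5], [5, 6]]},
]

# Flatten to batch_size * num_choices sequences
flat = [ids for feature in features for ids in feature["input_ids"]]

# Pad every sequence to the longest one in the batch (pad id 0 is assumed)
max_len = max(len(ids) for ids in flat)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in flat])

# Restore the (batch_size, num_choices, seq_len) layout the model expects
batch = padded.reshape(batch_size, num_choices, -1)
print(padded.shape, batch.shape)
```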

In [57]:
# Create tokenizer and datasets
tokenizer = AutoTokenizer.from_pretrained(MODEL)

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

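# Drop the pandas index column that Dataset.from_pandas adds after the shuffled split
train_dataset = train_dataset.remove_columns(["__index_level_0__"])
val_dataset = val_dataset.remove_columns(["__index_level_0__"])
test_dataset = test_dataset.remove_columns(["__index_level_0__"])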

train_dataset

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Dataset({
    features: ['prompt', 'A', 'B', 'C', 'D', 'E', 'answer', 'context'],
    num_rows: 100
})

In [58]:
# Tokenize datasets
tokenized_train_dataset = train_dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_val_dataset = val_dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
# We do not remove the answer column from the test dataset for evaluation
tokenized_test_dataset = test_dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E'])

tokenized_train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 100
})

# Build Model
We will use a Hugging Face `AutoModelForMultipleChoice`. For the list of possible models, see Hugging Face's repository [here][1]. We can also optionally freeze the embeddings and the lower encoder layers, which accelerates training and uses less memory. However, validation accuracy may decrease.

[1]: https://huggingface.co/models

In [59]:
# Loading in the original deberta model
model = AutoModelForMultipleChoice.from_pretrained(MODEL)

Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [60]:
if FREEZE_EMBEDDINGS:
    print('Freezing embeddings.')
    for param in model.deberta.embeddings.parameters():
        param.requires_grad = False
if FREEZE_LAYERS>0:
    print(f'Freezing {FREEZE_LAYERS} layers.')
    for layer in model.deberta.encoder.layer[:FREEZE_LAYERS]:
        for param in layer.parameters():
            param.requires_grad = False

Freezing embeddings.
Freezing 18 layers.
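
To see how much freezing reduces the number of trainable parameters, one can count them before and after. The sketch below uses a toy stack of linear layers as a stand-in for the encoder; the layer sizes and counts are made up and do not reflect DeBERTa.

```python
import torch.nn as nn

# Toy stand-in for an encoder: 4 small layers instead of DeBERTa's 24
toy_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
freeze_layers = 3

# Same pattern as the cell above: switch off gradients for the first layers
for layer in toy_layers[:freeze_layers]:
    for param in layer.parameters():
        param.requires_grad = False

total = sum(p.numel() for p in toy_layers.parameters())
trainable = sum(p.numel() for p in toy_layers.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable} of {total}")
```

The same two `sum(...)` lines can be applied to `model.parameters()` to check the effect of `FREEZE_LAYERS` and `FREEZE_EMBEDDINGS` on the real model.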


# MAP@3 Metric
The competition metric is MAP@3, so we write a custom metric function to pass to Hugging Face's Trainer. See the discussion [here][1].

[1]: https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/435602

In [61]:
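def map_at_3(predictions, labels):
    """Mean average precision at 3 when each row has exactly one correct label."""
    map_sum = 0
    # Indices of the three highest-scoring choices per row, best first
    pred = np.argsort(-1*np.array(predictions),axis=1)[:,:3]
    for x,y in zip(pred,labels):
        # Score 1, 1/2, or 1/3 when the label sits at rank 1, 2, or 3; otherwise 0
        z = [1/i if y==j else 0 for i,j in zip([1,2,3],x)]
        map_sum += np.sum(z)
    return map_sum / len(predictions)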

# Define metrics computation function for Hugging Face Trainer
def compute_metrics(p):
    # computing the predictions and the labels
    predictions, labels = p.predictions, p.label_ids

    # Log multiple metrics: map@3 and accuracy
    return {"map@3": map_at_3(predictions.tolist(), labels.tolist()),
            "accuracy": accuracy_score(labels, predictions.argmax(axis=1))}
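
As a sanity check on the metric, here is a small worked example with made-up scores. It is a sketch that re-implements the same idea under the assumption of one correct label per question; the function name `map_at_3_from_scores` is hypothetical.

```python
import numpy as np

def map_at_3_from_scores(scores, labels):
    """Average of 1/rank of the correct choice, counting only ranks 1 to 3."""
    top3 = np.argsort(-np.asarray(scores), axis=1)[:, :3]
    total = 0.0
    for ranked, label in zip(top3, labels):
        for rank, choice in enumerate(ranked, start=1):
            if choice == label:
                total += 1.0 / rank
    return total / len(labels)

# Made-up scores for three questions with five choices each
scores = [
    [0.9, 0.4, 0.3, 0.2, 0.1],  # label 0 is ranked 1st, contributes 1
    [0.5, 0.9, 0.3, 0.2, 0.1],  # label 0 is ranked 2nd, contributes 1/2
    [0.1, 0.2, 0.3, 0.4, 0.0],  # label 4 is ranked 5th, contributes 0
]
labels = [0, 0, 4]
result = map_at_3_from_scores(scores, labels)
print(result)  # (1 + 1/2 + 0) / 3 = 0.5
```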

# Train and Save 
We will now train and save our model using Hugging Face's easy-to-use Trainer.

In [62]:
training_args = TrainingArguments(
    warmup_ratio=0.1, 
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=2,
    num_train_epochs=30,
    report_to='none',
    output_dir = f'./checkpoints_{VER}',
    overwrite_output_dir=True,
    fp16=True,
    gradient_accumulation_steps=8,
    logging_steps=25,
    evaluation_strategy='steps',
    eval_steps=25,
    save_strategy="steps",
    save_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model='map@3',
    lr_scheduler_type='cosine',
    weight_decay=0.01,
    save_total_limit=2,
)
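
The batch-related arguments above combine into an effective batch size. The arithmetic below is a rough sketch that assumes both T4 GPUs are used and that the training split has 100 rows; the exact number of optimizer steps depends on how the Trainer rounds partial accumulation windows.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 2              # assumption: both GPUs listed in CUDA_VISIBLE_DEVICES are used
num_train_examples = 100  # size of the training split in this notebook

# Examples that contribute to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)

# Approximate optimizer updates per epoch
approx_updates_per_epoch = num_train_examples / effective_batch_size

print(effective_batch_size)      # 16
print(approx_updates_per_epoch)  # 6.25
```

With `logging_steps=25` and `eval_steps=25`, this means an evaluation happens roughly every four epochs on a dataset of this size.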

In [63]:
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics = compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)

# Start training
trainer.train()
# Save the trained model
trainer.save_model(f'model_v{VER}')

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Map@3 | Accuracy |
|---|---|---|---|---|
| 25 | 1.6175 | 1.60533 | 0.566667 | 0.4 |
| 50 | 1.61 | 1.602876 | 0.76 | 0.64 |
| 75 | 1.5679 | 1.364298 | 0.79 | 0.68 |
| 100 | 0.8827 | 1.08131 | 0.723333 | 0.58 |
| 125 | 0.4246 | 1.147815 | 0.77 | 0.66 |
| 150 | 0.2936 | 1.200993 | 0.77 | 0.66 |
| 175 | 0.2628 | 1.225123 | 0.776667 | 0.66 |
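
With `load_best_model_at_end=True` and `metric_for_best_model='map@3'`, the checkpoint with the highest validation MAP@3 is the one that is kept. The sketch below replays the logged values through a simplified patience counter. It is an illustration of the idea, not the library's implementation.

```python
# (step, validation MAP@3) pairs copied from the training log above
eval_log = [
    (25, 0.566667), (50, 0.76), (75, 0.79), (100, 0.723333),
    (125, 0.77), (150, 0.77), (175, 0.776667),
]
patience = 5  # same value as early_stopping_patience

best_step, best_score = None, float("-inf")
evals_without_improvement = 0
for step, score in eval_log:
    if score > best_score:
        # New best checkpoint: remember it and reset the counter
        best_step, best_score = step, score
        evals_without_improvement = 0
    else:
        evals_without_improvement += 1
    if evals_without_improvement >= patience:
        break  # simplified early-stopping rule

print(best_step, best_score, evals_without_improvement)
```

In this replay the best checkpoint is the one at step 75, and the counter reaches 4, which is still below the patience of 5.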




# Verify Saved Model
During training, we see the MAP@3 validation score above. Let's load the saved model and compute MAP@3 on the held-out test split to verify that our model was saved correctly.

In [64]:
del model, trainer
model = AutoModelForMultipleChoice.from_pretrained(f'model_v{VER}')
trainer = Trainer(model=model,
                  tokenizer=tokenizer,
                  data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer))

In [67]:
test_predictions = trainer.predict(tokenized_test_dataset).predictions
predictions_as_ids = np.argsort(-test_predictions, 1)
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]
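
The conversion from model scores to a ranked answer string can be checked on a small made-up array. This sketch mirrors the cell above with hypothetical logits for two questions.

```python
import numpy as np

# Made-up logits for two questions, columns correspond to choices A..E
logits = np.array([
    [0.1, 2.0, 0.5, -1.0, 0.3],
    [1.5, 0.2, 0.1, 3.0, -0.5],
])

# Sort choices from highest to lowest score for each question
order = np.argsort(-logits, axis=1)

# Map indices to letters and keep the top three as a space-separated string
letters = np.array(list("ABCDE"))[order]
top3_strings = [" ".join(row[:3]) for row in letters]
print(top3_strings)  # ['B C E', 'D A B']
```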

# Compute Test Score

In [68]:
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
import numpy as np
def precision_at_k(r, k):
    """Precision at k"""
    assert k <= len(r)
    assert k != 0
    return sum(int(x) for x in r[:k]) / k

def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u].split()
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
    return map_at_3 / U

In [69]:
m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
print( 'Test MAP@3 =',m )

Test MAP@3 = 0.77


MAP@3 is the official evaluation metric of the competition, but out of curiosity let's check the accuracy too.

In [84]:
test_predictions = trainer.predict(tokenized_test_dataset).predictions
test_labels = [option_to_index[answer] for answer in tokenized_test_dataset['answer']]

test_accuracy = accuracy_score(test_labels, test_predictions.argmax(axis=1))
print( 'Test Accuracy =',test_accuracy )



Test Accuracy = 0.66
