<a href="https://www.kaggle.com/code/emmermarcell/training-an-open-book-model?scriptVersionId=154209282" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Training a model on the open book dataset
The second notebook of our homework contains the training process of the 'microsoft/deberta-v3-large' LLM on the open book database we created in the [context-creation][1] notebook. The entire notebook is heavily inspired from the popular competition notebook of the [Kaggle - LLM Science Exam][2], called [How To Train Open Book Model - Part 1][3]. There, [Chris Deotte][4] used a dataset containig 65k question and answer pairs with context compiled from different notebooks using e.g. gpt-3.5 turbo for data augmentation. This augmentation step was left out in our solution due to time constraints but it's an obvious choice of improvement in the future.

We use the 'microsoft/deberta-v3-large' on the openbook data simply because it turned out to be one of the most successful model in terms of highest [notebook scores][5] on kaggle


[1]: https://www.kaggle.com/code/emmermarcell/context-creation
[2]: https://www.kaggle.com/competitions/kaggle-llm-science-exam
[3]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1
[4]: https://www.kaggle.com/cdeotte
[5]: https://www.kaggle.com/competitions/kaggle-llm-science-exam/models

In [1]:
!pip install -qq wandb --upgrade

# Load CSV

First we load in  the csv file generated in the [context-creation][1] notebook and split it into a training, validation, and a test set.

[1]: https://www.kaggle.com/code/emmermarcell/context-creation

In [2]:
import os

# Set CUDA visible devices to GPU 0 and 1 (The 2xT4 GPUs that Kaggle provides)
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

from typing import Optional, Union
import pandas as pd, numpy as np, torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from datasets import Dataset, load_metric
from dataclasses import dataclass
import transformers
from transformers import AutoTokenizer, AutoModelForMultipleChoice, EarlyStoppingCallback, \
                         TrainingArguments, Trainer, set_seed
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

# Random seed
seed = 42
set_seed(seed)
# Define constants
VER=1
# Number of layers to freeze, DeBERTa has a total number of 24 layers
FREEZE_LAYERS = 18
# Boolean to freeze embeddings
FREEZE_EMBEDDINGS = True
# Length of context + question + answers
MAX_INPUT = 256
# The Hugging Face model we're using
MODEL = 'microsoft/deberta-v3-large'



In [3]:
# Import Weights & Biases library
import wandb

wandb.login()


%env WANDB_PROJECT=llm_science_exam_open_book_approach
%env WANDB_LOG_MODEL=true

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


env: WANDB_PROJECT=llm_science_exam_open_book_approach
env: WANDB_LOG_MODEL=true


In [4]:
# Read validation data from a CSV file
qna_df = pd.read_csv('/kaggle/input/context-creation/openbook-qna-data.csv')
print('Validation data size:', qna_df.shape )
qna_df.head()

Validation data size: (200, 8)


Unnamed: 0,prompt,A,B,C,D,E,answer,context
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D,"In cosmology, the missing baryon problem is an..."
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A,Dynamic scaling (sometimes known as Family-Vic...
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A,thumb|Neolithic triple spiral symbol A triskel...
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C,"In physics, especially quantum field theory, r..."
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D,Kinematic diffraction is the approach to study...


In [5]:
# train validation test split
train_df, temp_df = train_test_split(qna_df, test_size=0.5, random_state=seed)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=seed)

# Data Loader
We use a custom class for tokenizing the input data with dynamic padding in order to use it as an input to the ['microsoft/deberta-v3-large'][1] model.

The implementation of the preprocess function and the DataCollatorForMultipleChoice class is not ours, it's from Radek's notebook [here][2] with modifications to the tokenization process from Chris Deotte's notebook [here][3].

[1]: https://huggingface.co/microsoft/deberta-v3-large
[2]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[3]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1

In [6]:
# Mapping options to indices
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k, v in option_to_index.items()}

def preprocess(example):
    # Repeat the prompt for all five choices
    first_sentence = [example['prompt']] * 5
    # Extract sentences corresponding to choices 'A' to 'E'
    second_sentences = [example[option] for option in 'ABCDE']
    # Tokenize the sentences using the provided tokenizer (not shown in this snippet)
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation=False)
    # Assign the index corresponding to the correct answer as the label
    tokenized_example['label'] = option_to_index[example['answer']]

    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        # Determine the label name ('label' or 'labels')
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        # Extract labels from each example and remove the corresponding key
        labels = [feature.pop(label_name) for feature in features]
        # Compute batch size and number of choices
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        # Restructure features into a list of dictionaries for each choice
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        # Flatten the list of dictionaries
        flattened_features = sum(flattened_features, [])

        # Tokenize and pad the examples into a batch
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        # Reshape the batch to have dimensions (batch_size, num_choices, -1)
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add labels to the batch as a PyTorch tensor
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        # Return the formatted batch
        return batch

In [7]:
# Create tokenizer and datasets
tokenizer = AutoTokenizer.from_pretrained(MODEL)

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

train_dataset = train_dataset.remove_columns(["__index_level_0__"])

train_dataset

Downloading tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/580 [00:00<?, ?B/s]

Downloading spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Dataset({
    features: ['prompt', 'A', 'B', 'C', 'D', 'E', 'answer', 'context'],
    num_rows: 100
})

In [8]:
# Tokenize datasets
tokenized_train_dataset = train_dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_val_dataset = val_dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
# We do not remove the answer column from the test dataset for evaluation
tokenized_test_dataset = test_dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E'])

tokenized_train_dataset

  0%|          | 0/100 [00:00<?, ?ex/s]

  0%|          | 0/50 [00:00<?, ?ex/s]

  0%|          | 0/50 [00:00<?, ?ex/s]

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 100
})

# Building the model

In [9]:
def model_init():
    # Loading in the deberta model
    model = AutoModelForMultipleChoice.from_pretrained(MODEL)
    # We freeze the first 18 layers of the model for faster training time. 
    # However, his is compensated by lower valuation accuracy.
    if FREEZE_EMBEDDINGS:
        print('Freezing embeddings.')
        for param in model.deberta.embeddings.parameters():
            param.requires_grad = False
    if FREEZE_LAYERS>0:
        print(f'Freezing {FREEZE_LAYERS} layers.')
        for layer in model.deberta.encoder.layer[:FREEZE_LAYERS]:
            for param in layer.parameters():
                param.requires_grad = False
    
    return model

# Exploring Hyperparameter Combinations With Sweeps

In [10]:
# method and metric
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'map@3',
        'goal': 'maximize'
    },
}


# hyperparameters
parameters_dict = {
    'learning_rate': {
      'min': 1e-6,
      'max': 1e-4
    },
   'epochs':{
      'values': [20, 30, 50]
   },
    'weight_decay': {
      'values': [0.0, 0.01, 0.02, 0.03, 0.04, 0.05]
    },
    'warmup_ratio': {
      'values': [0.0, 0.05, 0.1, 0.15, 0.2]
    },
    'gradient_accumulation_steps': {
      'values': [2, 4, 8, 16]
    },
    'early_stopping_patience': {
      'values': [5, 10]
    },
}


sweep_config['parameters'] = parameters_dict

In [11]:
sweep_id = wandb.sweep(sweep_config, project='llm_science_exam_open_book_approach')

Create sweep with ID: o017v6q1
Sweep URL: https://wandb.ai/import_this/llm_science_exam_open_book_approach/sweeps/o017v6q1


# MAP@3 Metric
The competition metric is MAP@3 therefore we will make a custom code to add to Hugging Face's trainer. Discussion [here][1]

[1]: https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/435602

In [12]:
def map_at_3(predictions, labels):
    map_sum = 0
    pred = np.argsort(-1*np.array(predictions),axis=1)[:,:3]
    for x,y in zip(pred,labels):
        z = [1/i if y==j else 0 for i,j in zip([1,2,3],x)]
        map_sum += np.sum(z)
    return map_sum / len(predictions)

# Define metrics computation function for Hugging Face Trainer
def compute_metrics(p):
    # computing the predictions and the labels
    predictions, labels = p.predictions, p.label_ids

    # Log multiple metrics: map@3 and accuracy
    return {"map@3": map_at_3(predictions.tolist(), labels.tolist()),
            "accuracy": accuracy_score(labels, predictions.argmax(axis=1))}

# Compute Validation Score

In [13]:
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
import numpy as np
def precision_at_k(r, k):
    """Precision at k"""
    assert k <= len(r)
    assert k != 0
    return sum(int(x) for x in r[:k]) / k

def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u].split()
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
    return map_at_3 / U

# Train and Save 
We now train our model using the Hugging Face Trainer API and leverage Weights & Biases (W&B) Sweeps to perform hyperparameter search. A great article that describes this method can be found [here][1].

[1]: https://wandb.ai/matt24/vit-snacks-sweeps/reports/Hyperparameter-Search-for-HuggingFace-Transformer-Models--VmlldzoyMTUxNTg0

In [14]:
def train(config=None):
    with wandb.init(config=config):
        # set sweep configuration
        config = wandb.config
        
        # Run the wandb magic comand!
        # This displays the details of each training
        %wandb


        # set training arguments
        training_args = TrainingArguments(
            output_dir = f'/kaggle/working/checkpoints_{VER}',
            report_to='wandb',
            num_train_epochs=config.epochs,
            learning_rate=config.learning_rate,
            weight_decay=config.weight_decay,
            warmup_ratio=config.warmup_ratio,
            gradient_accumulation_steps=config.gradient_accumulation_steps,
            per_device_train_batch_size=1,
            per_device_eval_batch_size=2,
            overwrite_output_dir=True,
            fp16=True,
            logging_steps=25,
            evaluation_strategy='steps',
            eval_steps=25,
            save_strategy="steps",
            save_steps=25,
            load_best_model_at_end=True,
            metric_for_best_model='map@3',
            lr_scheduler_type='cosine',
            save_total_limit=2,
        )


        # define training loop
        trainer = Trainer(
            model_init=model_init,
            args=training_args,
            tokenizer=tokenizer,
            data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
            train_dataset=tokenized_train_dataset,
            eval_dataset=tokenized_val_dataset,
            compute_metrics = compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=config.early_stopping_patience)],
        )
        

        # start training loop
        trainer.train()
        
        
        # Save the trained model
        trainer.save_model(f'/kaggle/working/model_v{VER}')
        
        
        # Free up space
        del model, trainer
        gc.collect()
        
        model = AutoModelForMultipleChoice.from_pretrained(f'/kaggle/working/model_v{VER}')
        trainer = Trainer(model=model,
                          tokenizer=tokenizer,
                          data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer))
        
        # Verify Saved Model
        test_predictions = trainer.predict(tokenized_test_dataset).predictions
        predictions_as_ids = np.argsort(-test_predictions, 1)
        predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
        predictions_as_string = test_df['prediction'] = [
            ' '.join(row) for row in predictions_as_answer_letters[:, :3]
        ]
        test_labels = [option_to_index[answer] for answer in tokenized_test_dataset['answer']]
        
        m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
        test_accuracy = accuracy_score(test_labels, test_predictions.argmax(axis=1))
        
        # Log the metrics on the test set
        wandb.log({'Test MAP@3': m,
                  'Test Accuracy': test_accuracy})
        
        print('Test MAP@3 = ',m)
        print('Test Accuracy = ', test_accuracy)

In [15]:
# Run the sweep agent
wandb.agent(sweep_id, train, count=5)

[34m[1mwandb[0m: Agent Starting Run: 3xvws7p7 with config:
[34m[1mwandb[0m: 	early_stopping_patience: 10
[34m[1mwandb[0m: 	epochs: 50
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 8.2034401103388e-05
[34m[1mwandb[0m: 	warmup_ratio: 0.05
[34m[1mwandb[0m: 	weight_decay: 0.04
[34m[1mwandb[0m: Currently logged in as: [33memmermarci[0m ([33mimport_this[0m). Use [1m`wandb login --relogin`[0m to force relogin
cat: /sys/module/amdgpu/initstate: No such file or directory


Downloading pytorch_model.bin:   0%|          | 0.00/874M [00:00<?, ?B/s]

Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Map@3,Accuracy
25,1.6094,1.568839,0.76,0.62
50,0.8053,1.538824,0.74,0.58
75,0.1616,1.747755,0.706667,0.54
100,0.0401,2.641541,0.72,0.54
125,0.0415,3.035136,0.7,0.52
150,0.0041,3.714784,0.69,0.52
175,0.0099,2.432216,0.716667,0.54
200,0.0002,3.054488,0.723333,0.52
225,0.0,2.925502,0.686667,0.5
250,0.0002,2.917889,0.726667,0.56


Traceback (most recent call last):
  File "/tmp/ipykernel_32/2400146832.py", line 58, in train
    del model, trainer
UnboundLocalError: local variable 'model' referenced before assignment


VBox(children=(Label(value='1670.376 MB of 1670.376 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,█▆▃▃▂▂▃▂▁▅▅
eval/loss,▁▁▂▅▆█▄▆▅▅▆
eval/map@3,█▆▃▄▂▁▄▅▁▅▆
eval/runtime,▅▁▇▁▁▁▁▆▁▁█
eval/samples_per_second,▄█▂████▃██▁
eval/steps_per_second,▄█▂████▃██▁
train/epoch,▁▁▂▂▂▂▃▃▄▄▅▅▅▅▆▆▇▇▇▇███
train/global_step,▁▁▂▂▂▂▃▃▄▄▅▅▅▅▆▆▇▇▇▇███
train/learning_rate,██▇▇▆▅▄▃▂▁▁
train/loss,█▅▂▁▁▁▁▁▁▁▁

0,1
eval/accuracy,0.56
eval/loss,2.99789
eval/map@3,0.73667
eval/runtime,5.1936
eval/samples_per_second,9.627
eval/steps_per_second,2.503
train/epoch,44.0
train/global_step,275.0
train/learning_rate,0.0
train/loss,0.0


Run 3xvws7p7 errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: [32m[41mERROR[0m Run 3xvws7p7 errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: Agent Starting Run: vr44p2h3 with config:
[34m[1mwandb[0m: 	early_stopping_patience: 10
[34m[1mwandb[0m: 	epochs: 50
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 3.623632873706874e-06
[34m[1mwandb[0m: 	warmup_ratio: 0.05
[34m[1mwandb[0m: 	weight_decay: 0
cat: /sys/module/amdgpu/initstate: No such file or directory


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.




Step,Training Loss,Validation Loss,Map@3,Accuracy
25,1.6177,1.610545,0.283333,0.14
50,1.6177,1.609479,0.43,0.26
75,1.6115,1.608089,0.583333,0.4
100,1.612,1.606702,0.636667,0.48
125,1.6065,1.604766,0.693333,0.54
150,1.6056,1.600988,0.693333,0.54
175,1.5972,1.593765,0.68,0.54
200,1.5908,1.58719,0.716667,0.58
225,1.5891,1.577128,0.716667,0.58
250,1.5619,1.566447,0.706667,0.58


Traceback (most recent call last):
  File "/tmp/ipykernel_32/2400146832.py", line 58, in train
    del model, trainer
UnboundLocalError: local variable 'model' referenced before assignment


VBox(children=(Label(value='1670.376 MB of 1670.376 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▃▅▆▇▇▇█████
eval/loss,███▇▇▇▆▅▃▂▁▁
eval/map@3,▁▃▆▇██▇█████
eval/runtime,▁▂▁▂▂█▂▂▂▁▂▂
eval/samples_per_second,█▇█▇▇▁▇▇▇█▇▇
eval/steps_per_second,█▇█▇▇▁▇▇▇█▇▇
train/epoch,▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇███
train/global_step,▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇███
train/learning_rate,██▇▇▆▅▄▃▂▂▁▁
train/loss,██▇▇▇▇▅▅▄▁▂▁

0,1
eval/accuracy,0.58
eval/loss,1.56289
eval/map@3,0.70667
eval/runtime,5.0374
eval/samples_per_second,9.926
eval/steps_per_second,2.581
train/epoch,48.0
train/global_step,300.0
train/learning_rate,0.0
train/loss,1.5608


Run vr44p2h3 errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: [32m[41mERROR[0m Run vr44p2h3 errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: Agent Starting Run: m9chx59e with config:
[34m[1mwandb[0m: 	early_stopping_patience: 10
[34m[1mwandb[0m: 	epochs: 20
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 6.728546499681623e-05
[34m[1mwandb[0m: 	warmup_ratio: 0.05
[34m[1mwandb[0m: 	weight_decay: 0.04
cat: /sys/module/amdgpu/initstate: No such file or directory


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.




Step,Training Loss,Validation Loss,Map@3,Accuracy
25,1.615,1.614609,0.333333,0.24
50,1.6157,1.591963,0.72,0.58
75,1.6044,1.602128,0.813333,0.74
100,1.4204,1.505946,0.653333,0.52
125,1.0604,1.317188,0.696667,0.54
150,0.6357,1.202507,0.693333,0.56
175,0.4051,1.262098,0.743333,0.66
200,0.3974,2.076625,0.726667,0.56
225,0.2705,1.889822,0.753333,0.58
250,0.1513,1.964887,0.726667,0.58


Traceback (most recent call last):
  File "/tmp/ipykernel_32/2400146832.py", line 58, in train
    del model, trainer
UnboundLocalError: local variable 'model' referenced before assignment


VBox(children=(Label(value='1670.376 MB of 1670.376 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▆█▅▅▅▇▅▆▆▆▆▇
eval/loss,▄▃▄▃▂▁▁▇▅▆▄▇█
eval/map@3,▁▇█▆▆▆▇▇▇▇▇▇▇
eval/runtime,█▁▄▂▃▃▃▃▂▄▃▄▃
eval/samples_per_second,▁█▅▇▆▆▆▆▇▅▆▅▆
eval/steps_per_second,▁█▅▆▆▆▆▆▆▅▆▅▆
train/epoch,▁▁▂▂▂▂▃▃▃▃▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/global_step,▁▁▂▂▂▂▃▃▃▃▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/learning_rate,███▇▇▆▆▅▄▃▃▂▁
train/loss,███▇▅▃▂▂▂▁▁▁▁

0,1
eval/accuracy,0.64
eval/loss,2.30509
eval/map@3,0.77667
eval/runtime,5.0502
eval/samples_per_second,9.901
eval/steps_per_second,2.574
train/epoch,13.0
train/global_step,325.0
train/learning_rate,2e-05
train/loss,0.1151


Run m9chx59e errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: [32m[41mERROR[0m Run m9chx59e errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: Agent Starting Run: o395ip9l with config:
[34m[1mwandb[0m: 	early_stopping_patience: 10
[34m[1mwandb[0m: 	epochs: 20
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 8.34491372502644e-05
[34m[1mwandb[0m: 	warmup_ratio: 0.1
[34m[1mwandb[0m: 	weight_decay: 0.04
cat: /sys/module/amdgpu/initstate: No such file or directory


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.




Step,Training Loss,Validation Loss,Map@3,Accuracy
25,1.6149,1.620336,0.386667,0.24
50,1.5944,1.526233,0.75,0.6
75,1.3839,1.468326,0.506667,0.38
100,1.0086,1.218655,0.843333,0.76
125,0.7273,1.075755,0.753333,0.62
150,0.635,1.10832,0.74,0.58
175,0.2946,1.055765,0.81,0.7
200,0.3982,1.793868,0.77,0.64
225,0.1165,1.956651,0.773333,0.7
250,0.1363,1.481208,0.813333,0.7


Traceback (most recent call last):
  File "/tmp/ipykernel_32/2400146832.py", line 58, in train
    del model, trainer
UnboundLocalError: local variable 'model' referenced before assignment


VBox(children=(Label(value='1670.376 MB of 1670.376 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▆▃█▆▆▇▆▇▇▇▇▆▆
eval/loss,▃▃▃▂▁▁▁▄▄▃▅▇██
eval/map@3,▁▇▃█▇▆▇▇▇██▇▇▇
eval/runtime,▂▂▂▃▁▂▂▁▂█▂▁▁▂
eval/samples_per_second,▇▇▇▆█▇▇█▇▁▇██▇
eval/steps_per_second,▇▇▇▆█▇▇█▇▁▇██▇
train/epoch,▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/global_step,▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/learning_rate,▃███▇▇▆▆▅▄▃▃▂▁
train/loss,██▇▅▄▄▂▃▁▂▁▁▁▁

0,1
eval/accuracy,0.6
eval/loss,2.86175
eval/map@3,0.75333
eval/runtime,5.0438
eval/samples_per_second,9.913
eval/steps_per_second,2.577
train/epoch,14.0
train/global_step,350.0
train/learning_rate,2e-05
train/loss,0.0223


Run o395ip9l errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: [32m[41mERROR[0m Run o395ip9l errored: UnboundLocalError("local variable 'model' referenced before assignment")
[34m[1mwandb[0m: Agent Starting Run: ptd1z34q with config:
[34m[1mwandb[0m: 	early_stopping_patience: 5
[34m[1mwandb[0m: 	epochs: 50
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 1.3952780399567949e-05
[34m[1mwandb[0m: 	warmup_ratio: 0.2
[34m[1mwandb[0m: 	weight_decay: 0.04
cat: /sys/module/amdgpu/initstate: No such file or directory


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing embeddings.
Freezing 18 layers.


Traceback (most recent call last):
  File "/tmp/ipykernel_32/2400146832.py", line 50, in train
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2693, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1921, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pa

VBox(children=(Label(value='0.000 MB of 0.000 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

Run ptd1z34q errored: OutOfMemoryError('CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 13.47 GiB already allocated; 11.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')
[34m[1mwandb[0m: [32m[41mERROR[0m Run ptd1z34q errored: OutOfMemoryError('CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 13.47 GiB already allocated; 11.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')


We are far behind the best MAP@3 metric of the competition, which is 0.933208 made by [Team H2O LLM Studio][1] but that is to be expected. They have used a dataset of 2.46 TB of data, where we only used our original 200 questions and answers with the additional context column. Nonetheless, it is exciting to see that our model reached a far better outcome than random guessing from such a small dataset.


MAP@3 is the official evaluaion metric of the competition, but out of curiosity we checked the accuracy too.

[1]: https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/446422