### https://www.kaggle.com/competitions/kaggle-llm-science-exam/overview

#### Context
As the scope of large language model capabilities expands, a growing area of research is using LLMs to characterize themselves. Because many preexisting NLP benchmarks have been shown to be trivial for state-of-the-art models, there has also been interesting work showing that LLMs can be used to create more challenging tasks to test ever more powerful models.

At the same time, methods like quantization and knowledge distillation are being used to effectively shrink language models and run them on more modest hardware. The Kaggle environment provides a unique lens to study this, as submissions are subject to both GPU and time limits.

The dataset for this challenge was generated by giving GPT-3.5 snippets of text on a range of scientific topics pulled from Wikipedia, asking it to write a multiple-choice question (with a known answer), and then filtering out easy questions.

Right now we estimate that the largest models run on Kaggle are around 10 billion parameters, whereas GPT-3.5 clocks in at 175 billion parameters. If a question-answering model can ace a test written by a question-writing model more than 10 times its size, this would be a genuinely interesting result; on the other hand, if a larger model can effectively stump a smaller one, this has compelling implications for the ability of LLMs to benchmark and test themselves.

#### I am using https://www.kaggle.com/code/wlifferth/starter-notebook-ranked-predictions-with-bert as a starter notebook. All due credit to the author, William Lifferth, for this notebook.

#### LLM Science Exam
This starter notebook walks through a basic example of using BERT to rank the answers to each question. We'll finetune BERT on the 200 public questions, then use the AutoModelForMultipleChoice class to generate probabilities that each option correctly answers the prompt, and finally we'll turn those predictions into a MAP@3-formatted prediction like A B C.
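
For reference, MAP@3 gives full credit when the correct answer is ranked first, half credit when it is second, and a third when it is third. Below is a minimal sketch of the metric, assuming exactly one correct answer per question. The function name `map_at_3` and the sample data are illustrative and not part of the competition's code.

```python
def map_at_3(predictions, answers):
    """Mean average precision at 3 for single-answer questions.

    predictions: list of strings such as "A B C" (best guess first)
    answers:     list of correct option letters such as "B"
    """
    total = 0.0
    for prediction, answer in zip(predictions, answers):
        ranked = prediction.split()[:3]
        if answer in ranked:
            # Credit is 1 / rank of the first correct guess
            total += 1.0 / (ranked.index(answer) + 1)
    return total / len(answers)


# Correct answer ranked 1st, 2nd, and 3rd respectively
score = map_at_3(["A B C", "B A C", "C D E"], ["A", "A", "E"])
print(round(score, 4))  # (1 + 1/2 + 1/3) / 3
```

Because partial credit falls off quickly with rank, the ordering of the three guesses matters as much as which letters are included.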

In [1]:
import os

print(os.getcwd())

C:\Users\anind\OneDrive\Anindo\Python\projects\aiprojects\etc


In [2]:
import pandas as pd

In [3]:
# Let's import the public training set and take a look

train_df = pd.read_csv('kaggle-llm-science-exam/train.csv')
train_df.head()

| | id | prompt | A | B | C | D | E | answer |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Which of the following statements accurately d... | MOND is a theory that reduces the observed mis... | MOND is a theory that increases the discrepanc... | MOND is a theory that explains the missing bar... | MOND is a theory that reduces the discrepancy ... | MOND is a theory that eliminates the observed ... | D |
| 1 | 1 | Which of the following is an accurate definiti... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... | A |
| 2 | 2 | Which of the following statements accurately d... | The triskeles symbol was reconstructed as a fe... | The triskeles symbol is a representation of th... | The triskeles symbol is a representation of a ... | The triskeles symbol represents three interloc... | The triskeles symbol is a representation of th... | A |
| 3 | 3 | What is the significance of regularization in ... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | C |
| 4 | 4 | Which of the following statements accurately d... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | D |


In [4]:
# For convenience we'll turn our pandas Dataframe into a Dataset
from datasets import Dataset
train_ds = Dataset.from_pandas(train_df)

### What is AutoTokenizer?

In 🤗 Transformers, AutoTokenizer is a factory that picks the right tokenizer class + vocab for a model name. You just give the checkpoint ID, and it returns the correct tokenizer (WordPiece for BERT, SentencePiece for ALBERT, BPE for RoBERTa, etc.). This ensures you don’t mix the wrong tokenizer with a model, which would hurt performance.

The “best” one is the one paired with the model you’re using (same pretraining, same vocab). Within BERT-style models, pick based on language, casing, and domain.

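Let's use **bert-base-uncased** (WordPiece), which lowercases text before tokenizing. A cased checkpoint such as **bert-base-cased** is the better choice when casing matters (e.g., NER, proper nouns).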
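
To make the WordPiece idea concrete, here is a toy sketch of greedy longest-match-first subword splitting. The tiny vocabulary is made up for illustration, and the function name `wordpiece_split` is hypothetical. The real BERT tokenizer also handles text normalization, punctuation, and special tokens such as `[CLS]` and `[SEP]`, none of which are modeled here.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary, far smaller than any real checkpoint's
toy_vocab = {"dis", "##cre", "##pan", "##cy", "mon", "##d"}
print(wordpiece_split("discrepancy", toy_vocab))
print(wordpiece_split("mond", toy_vocab))
print(wordpiece_split("galaxy", toy_vocab))
```

The point of the sketch is that the splits depend entirely on the vocabulary, which is why a tokenizer must come from the same checkpoint as the model.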

In [5]:
from transformers import AutoTokenizer

# The path of the model checkpoint we want to use

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [6]:
# We'll create a dictionary to convert option names (A, B, C, D, E) into indices and back again
options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

In [7]:
# Preprocess the data

def preprocess(example):
    # The AutoModelForMultipleChoice class expects a set of question/answer pairs
    # so we'll copy our question 5 times before tokenizing
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    for option in options:
        second_sentence.append(example[option])
    # Our tokenizer will turn our text into token IDs BERT can understand
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

tokenized_train_ds = train_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])


In [24]:
tokenized_train_ds[:1]

{'id': [0],
 'input_ids': [[[101, 2029, 1997, 1996, 2206, 8635, 14125, 5577, 1996, 4254, 1997, 6310,
    8446, 2937, 10949, 1006, 12256, 2094, 1007, 2006, 1996, 5159, 1000, 4394,
    3347, 14001, 2594, 3742, 1000, 5860, 2890, 9739, 5666, 1999, 9088, 12906,
    1029, 102, 12256, 2094, 2003, 1037, 3399, 2008, 13416, 1996, 5159, 4394,
    3347, 14001, 2594, 3742, 1999, 9088, 12906, 2011, 2695, 10924, 1996, 4598,
    1997, 1037, 2047, 2433, 1997, 3043, 2170, 1000, 18001, 2601, 3043, 1012,
    1000, 102],
   [101, 2029, 1997, 1996, 2206, 8635, 14125, 5577, 1996, 4254, 1997, 6310,
    8446, 2937, 10949, 1006, 12256, 2094, 1007, 2006, 1996, 5159,

### We will be using the AutoModelForMultipleChoice class

#### What it does

It wraps a pretrained encoder (like BERT, RoBERTa, DeBERTa, etc.) plus a classification head designed for multiple-choice input.

The model expects a batch of questions, each with several answer options. It then:

- processes each question/option pair separately through the encoder,
- applies a classification layer to score each pair,
- outputs one logit per choice (higher = more likely correct).
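
The flow described here can be sketched with NumPy to show the tensor shapes involved. The `toy_encoder` below is a stand-in (a mean over random token embeddings), not BERT, and all sizes and weights are made-up assumptions. Only the flatten, score, and reshape pattern mirrors what a multiple-choice head does.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_choices, seq_len, hidden = 2, 5, 8, 16
vocab_size = 100  # made-up size for the toy embedding table

# Token IDs shaped (batch, choices, sequence)
input_ids = rng.integers(0, vocab_size, size=(batch_size, num_choices, seq_len))

embeddings = rng.normal(size=(vocab_size, hidden))  # stand-in for encoder weights
classifier_w = rng.normal(size=(hidden, 1))         # one score per encoded sequence
classifier_b = np.zeros(1)


def toy_encoder(flat_ids):
    """Map (n, seq_len) token IDs to (n, hidden) pooled vectors."""
    return embeddings[flat_ids].mean(axis=1)


# 1. Flatten so every (question, option) pair is its own sequence
flat_ids = input_ids.reshape(-1, seq_len)          # (batch * choices, seq_len)
# 2. Encode each pair independently
pooled = toy_encoder(flat_ids)                     # (batch * choices, hidden)
# 3. Score each pair with a single-output linear layer
scores = pooled @ classifier_w + classifier_b      # (batch * choices, 1)
# 4. Reshape so each row holds the logits for one question
logits = scores.reshape(batch_size, num_choices)   # (batch, choices)

print(logits.shape)
print(logits.argmax(axis=1))  # index of the highest-scoring option per question
```

Because the head emits a single score per pair, the same weights work for any number of options; the number of choices only appears in the final reshape.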

In [8]:
# The following data collator (adapted from https://huggingface.co/docs/transformers/tasks/multiple_choice)
# will dynamically pad our questions at batch-time so we don't have to make every question the length
# of our longest question.
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

In [9]:

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
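
As a plain-Python illustration of what the collator does, the sketch below flattens a batch, pads every sequence to the longest one in that batch, and regroups the result per question. It is a simplified stand-in for `tokenizer.pad` that handles only `input_ids` and an attention mask. The helper name `pad_batch` and the token IDs are hypothetical.

```python
def pad_batch(features, pad_id=0):
    """features: list of questions, each a list of token-ID lists (one per option)."""
    num_choices = len(features[0])
    # Flatten (batch, choices) into one list of sequences
    flat = [seq for question in features for seq in question]
    # Pad only to the longest sequence in this batch, not in the whole dataset
    max_len = max(len(seq) for seq in flat)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in flat]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in flat]

    # Regroup into (batch, choices, max_len)
    def regroup(rows):
        return [rows[i:i + num_choices] for i in range(0, len(rows), num_choices)]

    return {"input_ids": regroup(input_ids), "attention_mask": regroup(attention_mask)}


batch = pad_batch([
    [[101, 7, 102], [101, 7, 8, 102]],       # question 1, two options
    [[101, 9, 102], [101, 9, 10, 11, 102]],  # question 2, two options
])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

Padding per batch rather than to a global maximum keeps short batches small, which is the saving the comment on the collator refers to.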

In [10]:
# Now we'll instantiate the model that we'll finetune on our public dataset, then use to
# make predictions on the private dataset.

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")




Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# The arguments here are selected to run quickly; feel free to play with them.
model_dir = 'finetuned_bert'
training_args = TrainingArguments(
    output_dir=model_dir,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to='none'
)

In [12]:
# Generally it's a bad idea to validate on your training set, but because our training set
# for this problem is so small we're going to train on all our data.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_train_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
)



In [13]:
# Training can take about a minute on a GPU, but the run logged below took about 39 minutes (train_runtime of roughly 2322 s)
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 1.603954 |
| 2 | No log | 1.584196 |
| 3 | No log | 1.431856 |




TrainOutput(global_step=150, training_loss=1.5702899169921876, metrics={'train_runtime': 2321.7422, 'train_samples_per_second': 0.258, 'train_steps_per_second': 0.065, 'total_flos': 129426866048760.0, 'train_loss': 1.5702899169921876, 'epoch': 3.0})
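
Two numbers in this output can be checked by hand. With 200 examples and a batch size of 4, each epoch takes 50 optimizer steps, so 3 epochs give the reported `global_step=150` (this assumes a single device and no gradient accumulation). Also, a model that guesses uniformly among 5 options has a cross-entropy loss of ln 5 ≈ 1.609, so validation losses between 1.60 and 1.43 are only modestly better than chance. The `Training Loss` column most likely reads `No log` because the Trainer's default logging interval of 500 steps exceeds the 150 total steps; that is an inference from the defaults, not something the output states.

```python
import math

num_examples = 200
batch_size = 4
num_epochs = 3

# Optimizer steps, assuming one device and no gradient accumulation
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)

# Cross-entropy of a uniform guess over 5 options
chance_loss = math.log(5)
print(round(chance_loss, 4))

# How far each epoch's reported validation loss sits below chance
for epoch, val_loss in enumerate([1.603954, 1.584196, 1.431856], start=1):
    print(epoch, round(chance_loss - val_loss, 4))
```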

In [14]:
# Now we can actually make predictions on our questions
predictions = trainer.predict(tokenized_train_ds)



In [15]:
# The following function gets the indices of the highest scoring answers for each row
# and converts them back to our answer format (A, B, C, D, E)
import numpy as np
def predictions_to_map_output(predictions):
    sorted_answer_indices = np.argsort(-predictions)
    top_answer_indices = sorted_answer_indices[:,:3] # Get the first three answers in each row
    top_answers = np.vectorize(index_to_option.get)(top_answer_indices)
    return np.apply_along_axis(lambda row: ' '.join(row), 1, top_answers)
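
To see what the ranking step does, here is the same argsort idea applied to a small hand-written logits array. The values are made up, and the helper `top3_options` is an illustrative re-declaration (using a list comprehension instead of `np.vectorize`) so the snippet runs on its own.

```python
import numpy as np

index_to_option = dict(enumerate("ABCDE"))


def top3_options(logits):
    """Return the three highest-scoring options per row as 'X Y Z' strings."""
    # Negating the logits makes argsort return indices in descending score order
    order = np.argsort(-logits, axis=1)[:, :3]
    return [" ".join(index_to_option[int(i)] for i in row) for row in order]


toy_logits = np.array([
    [0.1, 2.0, -1.0, 0.5, 1.5],  # best: B, then E, then D
    [3.0, 0.0, 2.5, -2.0, 1.0],  # best: A, then C, then E
])
print(top3_options(toy_logits))
```

Only the ordering of the logits matters here, so there is no need to convert them to probabilities before ranking.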

In [16]:
# Let's double check our output looks correct:
predictions_to_map_output(predictions.predictions)

array(['D B C', 'A C D', 'A C B', 'C E A', 'D B E', 'C B E', 'A C D',
       'A D B', 'C E D', 'A C B', 'B E C', 'A D B', 'C B A', 'D C E',
       'B A E', 'C A D', 'E B C', 'A D B', 'A B D', 'E D A', 'B D C',
       'B C D', 'C A B', 'C A B', 'E A B', 'E D A', 'A B C', 'D C B',
       'E B A', 'C B D', 'B C A', 'E C D', 'D E B', 'A C E', 'E D B',
       'A B D', 'E A D', 'A C E', 'E D A', 'A E C', 'E B D', 'B E A',
       'B C D', 'D C E', 'A E B', 'A B C', 'B E C', 'C B A', 'B E A',
       'C E D', 'B D E', 'E D A', 'C A B', 'A C D', 'B A C', 'C E A',
       'C B D', 'C B A', 'D A C', 'B C A', 'B E D', 'B E C', 'B C E',
       'C A E', 'A E C', 'E A D', 'C D A', 'B E C', 'A C E', 'D B A',
       'D E B', 'A E D', 'A D B', 'B D A', 'D A E', 'D B E', 'B D C',
       'B E C', 'C E B', 'E A C', 'C D E', 'A C E', 'C B D', 'B D A',
       'A B C', 'B A D', 'A D B', 'B D E', 'E A C', 'D B A', 'D B A',
       'B E C', 'B E C', 'E B A', 'E B D', 'C D B', 'C D B', 'E C A',
       'C B A', 'D C

In [25]:
# Now we can load up our test set to use our model on!
# The public test.csv isn't the real dataset (it's actually just a copy of train.csv without the answer column)
# but it has the same format as the real test set, so using it is a good way to ensure our code will work when we submit.


test_df = pd.read_csv('kaggle-llm-science-exam/test.csv')
test_df.head()

| | id | prompt | A | B | C | D | E |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Which of the following statements accurately d... | MOND is a theory that reduces the observed mis... | MOND is a theory that increases the discrepanc... | MOND is a theory that explains the missing bar... | MOND is a theory that reduces the discrepancy ... | MOND is a theory that eliminates the observed ... |
| 1 | 1 | Which of the following is an accurate definiti... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... |
| 2 | 2 | Which of the following statements accurately d... | The triskeles symbol was reconstructed as a fe... | The triskeles symbol is a representation of th... | The triskeles symbol is a representation of a ... | The triskeles symbol represents three interloc... | The triskeles symbol is a representation of th... |
| 3 | 3 | What is the significance of regularization in ... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... |
| 4 | 4 | Which of the following statements accurately d... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... |


In [26]:
# There are more verbose/elegant ways of doing this, but if we give our test set a dummy `answer` column
# we can make predictions directly with our trainer.

test_df['answer'] = 'A'

# Other than that we'll preprocess it in the same way we preprocessed train.csv
test_ds = Dataset.from_pandas(test_df)
tokenized_test_ds = test_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])


In [None]:
# Here we'll generate our "real" predictions on the test set
test_predictions = trainer.predict(tokenized_test_ds)



In [None]:
# Now we can create our submission using the id column from test.csv
submission_df = test_df[['id']].copy()  # copy so the new column is not assigned to a slice of test_df
submission_df['prediction'] = predictions_to_map_output(test_predictions.predictions)

submission_df.head()
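
Before submitting, it can help to sanity-check that every prediction matches the `A B C` format this notebook produces: three distinct letters from A to E separated by single spaces. The check below is a hypothetical helper, not part of the starter notebook, and the sample rows are made up.

```python
def is_valid_prediction(prediction, options="ABCDE", k=3):
    """True if the string is k distinct option letters separated by single spaces."""
    parts = prediction.split(" ")
    return (
        len(parts) == k
        and all(len(part) == 1 and part in options for part in parts)
        and len(set(parts)) == k
    )


# Two valid rows, then a duplicate letter, too few guesses, and an unknown letter
sample_predictions = ["D B C", "A C D", "A A B", "A B", "A B F"]
results = [is_valid_prediction(p) for p in sample_predictions]
print(results)
```

In the notebook, one way to apply it would be `submission_df['prediction'].map(is_valid_prediction).all()`.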