**Dataset Description**

Your challenge in this competition is to answer multiple-choice questions written by an LLM. While the specifics of the process used to generate these questions aren't public, we've included 200 sample questions with answers to show the format, and to give a general sense of the kind of questions in the test set. However, there may be a distributional shift between the sample questions and the test set, so solutions that generalize to a broad set of questions are likely to perform better. Each question consists of a prompt (the question), 5 options labeled A, B, C, D, and E, and the correct answer labeled answer (this holds the label of the most correct answer, as defined by the generating LLM).

This competition uses a hidden test. When your submitted notebook is scored, the actual test data (including a sample submission) will be made available to your notebook. The test set has the same format as the provided test.csv but has ~4,000 questions that may differ in subject matter.

**Files**

train.csv - a set of 200 questions with the answer column
test.csv - the test set; your task is to predict the top three most probable answers given the prompt. NOTE: the test data you see here is just a copy of the training data without the answers. The unseen re-run test set comprises ~4,000 different prompts.
sample_submission.csv - a sample submission file in the correct format
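
The task statement above asks for the three most probable answers per prompt. A minimal sketch of turning per-option scores into a ranked top-three string is shown here. The scores are made-up numbers, `top3_labels` is a hypothetical helper, and the space-separated output is an assumption that should be checked against sample_submission.csv.

```python
LABELS = ["A", "B", "C", "D", "E"]

def top3_labels(scores):
    """Return the three highest-scoring option labels, best first, space-separated."""
    # Pair each label with its score and sort by score, highest first
    ranked = sorted(zip(LABELS, scores), key=lambda pair: pair[1], reverse=True)
    return " ".join(label for label, _ in ranked[:3])

# Made-up scores for one question, in the order A, B, C, D, E
example_scores = [0.10, 0.20, 0.05, 0.60, 0.04]
print(top3_labels(example_scores))
```

Any model that produces one score per option could feed a helper like this; how the scores are produced is left open.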

**Columns**

prompt - the text of the question being asked
A - option A; if this option is correct, then answer will be A
B - option B; if this option is correct, then answer will be B
C - option C; if this option is correct, then answer will be C
D - option D; if this option is correct, then answer will be D
E - option E; if this option is correct, then answer will be E
answer - the most correct answer, as defined by the generating LLM (one of A, B, C, D, or E).

In [1]:
# When using Google Colab, restart the runtime after installing transformers

In [2]:
# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Path to the training data
train_data = "/content/drive/MyDrive/LLM Science Exam/train.csv"

In [6]:
# Path to the test data
test_data = "/content/drive/MyDrive/LLM Science Exam/test.csv"

In [7]:
df_test = pd.read_csv(test_data)

In [8]:
df_train = pd.read_csv(train_data)

In [9]:
df_train.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [10]:
df_train.drop(columns=['id'], inplace=True)

In [11]:
df_test.head()

Unnamed: 0,id,prompt,A,B,C,D,E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


In [12]:
df_test.drop(columns=['id'], inplace=True)

In [13]:
# Source: https://huggingface.co/docs/transformers/training#train-with-pytorch-trainer
# Transformers gives access to 1K+ pre-trained models
# Fine-tuning = training a pre-trained model on a task-specific dataset

# The tokenizer processes text
# A padding and truncation strategy is needed to handle variable sequence lengths
# The datasets map method applies a preprocessing function over the entire dataset

from transformers import AutoTokenizer
# Transformers is the name of a Hugging Face library; "transformer" is also the name of a
# type of deep learning model architecture
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# bert-base-cased is a 12-layer BERT model with about 110 million parameters, pre-trained on
# English text with a masked language modeling objective

In [14]:
!pip install --upgrade transformers



In [15]:
# This function takes a dictionary of examples as input
# Applies tokenization using the Hugging Face transformers library's tokenizer
# Tokenizes the text content stored under the "text" key
# padding="max_length" - pads tokenized sequences to the maximum length
# truncation=True - truncates sequences to fit within the maximum length

# def tokenize_function(examples):
#   return tokenizer(examples['text'], padding = "max_length", truncation=True)
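
To make the padding and truncation comments above concrete, here is a minimal pure-Python sketch of what forcing a sequence to a fixed length means. It is an illustration only: `pad_or_truncate` is a hypothetical helper, the token ids are made up, `pad_id=0` is an illustrative default, and a real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly max_length entries."""
    # Truncation: drop everything past max_length
    clipped = token_ids[:max_length]
    # Padding: fill the remainder with the pad id
    return clipped + [pad_id] * (max_length - len(clipped))

short_ids = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids)
print(long_ids)
```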

In [16]:
# Map the answer label (A-E) in each row to the text of the corresponding option
def translate_answer(row):
  label = row['answer']
  # Return NaN for any label outside A-E
  return row[label] if label in ('A', 'B', 'C', 'D', 'E') else np.nan

In [17]:
# Apply the function, overwriting the 'answer' column with the text of the correct option
df_train['answer'] = df_train.apply(translate_answer, axis=1)

In [18]:
df_train.head()

Unnamed: 0,prompt,A,B,C,D,E,answer
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,MOND is a theory that reduces the discrepancy ...
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the evolution of sel...
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,The triskeles symbol was reconstructed as a fe...
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


In [19]:
# Format the dataset so it can be fed into the tokenization function
# Separate the context (the text that contains the answer) from the question and the answer
# Merge the columns that contain the possible answers into one context string
context = df_train['A']+ ' ' + df_train['B']+ ' ' + df_train['C']+ ' ' + df_train['D']+ ' ' + df_train['E']

In [20]:
new_df_train = pd.DataFrame({'question': df_train['prompt'], 'context': context, 'answers': df_train['answer']})

In [21]:
new_df_train.head()

Unnamed: 0,question,context,answers
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that reduces the discrepancy ...
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the evolution of sel...
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol was reconstructed as a fe...
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


In [22]:
# datasets is a library by Hugging Face
# streamlines process of working with diverse datasets in ML
!pip install datasets



In [23]:
# importing Dataset allows access to functionalities within the library
from datasets import Dataset

In [24]:
# common approach when working with Hugging Face's datasets library
# converts pandas DataFrame into a dataset object
dataset = Dataset.from_pandas(new_df_train)

In [25]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        # Expects SQuAD-style answers: {"text": [...], "answer_start": [...]}
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
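
The core of the function above is converting a character span into token positions by walking the offset mapping. A minimal sketch of that step in isolation is shown here, assuming hand-written offsets with one token per word. `char_span_to_token_span` is a hypothetical helper, and unlike the function above it returns None instead of (0, 0) when the span is not fully covered.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token) indices.

    offsets: list of (char_start, char_end) pairs, one per context token.
    Returns None when the span is not fully covered by the tokens.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # First token: the last one whose start is at or before start_char
    first = 0
    while first + 1 < len(offsets) and offsets[first + 1][0] <= start_char:
        first += 1
    # Last token: the first one whose end is at or after end_char
    last = len(offsets) - 1
    while last - 1 >= 0 and offsets[last - 1][1] >= end_char:
        last -= 1
    return first, last

# Hand-written offsets for the context "MOND is a theory" (one token per word)
offsets = [(0, 4), (5, 7), (8, 9), (10, 16)]
print(char_span_to_token_span(offsets, 8, 16))  # character span of "a theory"
```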

In [29]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

TypeError: string indices must be integers
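
The TypeError above arises because preprocess_function indexes each answer as a SQuAD-style dict (`answer["answer_start"][0]`, `answer["text"][0]`), while the answers column built earlier holds plain strings. A minimal sketch of one way to build such dicts follows, assuming the answer text appears verbatim in the context. `to_squad_answer` is a hypothetical helper and the option texts are shortened, made-up stand-ins.

```python
def to_squad_answer(context, answer_text):
    """Build a SQuAD-style answer dict by locating answer_text in context."""
    # str.find returns the first match, or -1 when there is none
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer text not found in context")
    return {"text": [answer_text], "answer_start": [start]}

# Shortened, made-up option texts standing in for the option columns
options = ["MOND reduces the mass.", "MOND increases the gap.", "MOND reduces the gap."]
context = " ".join(options)
answer = to_squad_answer(context, options[2])
print(answer)
```

Applied row by row to the context and answers columns before calling Dataset.from_pandas, a helper like this would give preprocess_function the structure it expects. One caveat: because only the first match is used, an answer whose text also occurs inside an earlier option would be located at the wrong offset.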

In [None]:
# applies function tokenize_function in batches to new_df_train
# applying in batches speeds up process
# tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

In [None]:
# tokenized_datasets_test = dataset_test.map(tokenize_function, batched=True)

In [None]:
# A data collator is used for NLP tasks
# It processes batches of input data and combines them before they are fed into a machine learning model
# It ensures uniformity when the elements/rows/data points are not the same length
# Data collators are even more necessary with larger datasets
# from transformers import DataCollatorWithPadding

In [None]:
# initialize data collator object
# data_collator = DataCollatorWithPadding(tokenizer)
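
As an illustration of what a padding collator does, here is a minimal pure-Python sketch that pads each batch to the length of its longest member and builds the matching attention mask. It is not the library's implementation: `collate_with_padding` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an illustrative default.

```python
def collate_with_padding(batch, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 2054, 102], [101, 2054, 2003, 1037, 102]]
collated = collate_with_padding(batch)
print(collated["input_ids"])
print(collated["attention_mask"])
```

Padding per batch rather than to one global maximum keeps short batches short, which is the main reason to pad in the collator instead of at tokenization time.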

In [None]:
# AutoModelForQuestionAnswering is a class that loads a pre-trained model for question answering tasks
# TrainingArguments stores and handles the training arguments and hyperparameters for model training
# TrainingArguments allows configuration of various settings
# Trainer is a high-level interface for training and evaluating models
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

In [None]:
# Load a pre-trained DistilBERT model for question answering
# DistilBERT is smaller and lighter than its predecessor BERT, with 40% fewer parameters, which makes it cheaper to run
# DistilBERT is trained with "knowledge distillation": a larger model like BERT acts as a teacher and transfers knowledge to DistilBERT, which keeps most of the performance while reducing size
# DistilBERT uses self-attention to understand context in text by capturing relationships between words
# Note: the tokenizer above was loaded from bert-base-cased; the tokenizer and model checkpoints should match
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
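
The knowledge distillation idea mentioned in the comments above can be sketched as a soft-target loss: the student is penalized when its softened output distribution differs from the teacher's. This is a simplified illustration with made-up logits and an illustrative temperature. The actual DistilBERT training objective combines more terms and is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher
print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The student that agrees with the teacher receives the lower loss, which is the signal that pushes a smaller model toward the larger model's behavior.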

In [None]:
# Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`
!pip install transformers[torch]

In [None]:
# Define training hyperparameters in TrainingArguments (output_dir is required; it specifies where to save the model)
training_args = TrainingArguments(
    output_dir="qa_model",
    # evaluation_strategy="epoch"
    # learning_rate=2e-5,
    # per_device_train_batch_size=16,
    # per_device_eval_batch_size=16,
    # num_train_epochs=3,
    # weight_decay=0.01,
    # push_to_hub=True,
)


In [None]:
dataset.shape

In [None]:
dataset_test = Dataset.from_pandas(df_test)  # dataset_test was never created above
dataset_test.shape

In [None]:
# To evaluate during training, split the training dataset so that 20% is held out as a test subset
dataset = dataset.train_test_split(test_size = 0.2)

In [None]:
# identify type of object
type(dataset)

In [None]:
# identify attributes of object
# dir(dataset)

In [None]:
dataset['train'][0]

In [None]:
dataset['test'][0]

In [None]:
# if you are using a datasets.dataset_dict.DatasetDict, you might need to convert it into PyTorch datasets.
from torch.utils.data import DataLoader

In [None]:
# train_dataset = dataset['train']
# val_dataset = dataset['test']

In [None]:
# train_dataloader = DataLoader(train_dataset, collate_fn=data_collator)

In [None]:
# val_dataloader = DataLoader(val_dataset, collate_fn=data_collator)

In [None]:
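# Pass training args, model, dataset, and tokenizer to Trainer (the data collator is left commented out)
# Note: Trainer needs tokenized features (input_ids, attention_mask, start_positions, end_positions),
# which the raw question/context/answers columns in `dataset` do not provide
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=tokenizer,
    # data_collator=data_collator,
)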

In [None]:
# Call train() to finetune the model
trainer.train()

In [None]:
# Initializes a question-answering pipeline using the Hugging Face Transformers library
# The pipeline is designed for natural language processing tasks related to question answering
# It uses a pre-trained model to extract answers from a given context
# (no model argument is passed, so the library's default checkpoint is loaded, not the model fine-tuned above)
# You pass a question and its corresponding context to the pipeline
# The model returns an answer span based on its understanding of the language and context
from transformers import pipeline
qa_model = pipeline("question-answering")

In [None]:
# Create a new DataFrame to store the predicted answers
df_pred = pd.DataFrame(columns=['prompt','top_answer_1','top_answer_2','top_answer_3'])

In [None]:
rows = []
for _, row in df_train.iterrows():
  # Define the prompt and answer choices
  prompt = row['prompt']
  choices = [str(row['A']), str(row['B']), str(row['C']), str(row['D']), str(row['E'])]

  # Use the concatenated choices as the context and get the top three answer spans
  answers = qa_model(question=prompt, context=' '.join(choices), top_k=3)

  # Collect the predicted spans (DataFrame.append was removed in pandas 2.0)
  rows.append({
      'prompt': prompt,
      'top_answer_1': answers[0]['answer'],
      'top_answer_2': answers[1]['answer'],
      'top_answer_3': answers[2]['answer']
  })

# Build the predictions DataFrame in one step
df_pred = pd.DataFrame(rows, columns=['prompt','top_answer_1','top_answer_2','top_answer_3'])
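
A question-answering pipeline returns text spans rather than option labels, so a further step is needed to get back to A-E. A minimal sketch of that mapping follows, assuming a span is attributed to the first option that contains it. `span_to_label` is a hypothetical helper and the option texts are shortened, made-up stand-ins.

```python
def span_to_label(span, options):
    """Return the label of the first option containing the predicted span."""
    cleaned = span.strip()
    if not cleaned:
        return None
    for label, text in options.items():
        if cleaned in text:
            return label
    return None  # the span crosses option boundaries or matches nothing

# Shortened, made-up option texts for one question
options = {
    "A": "MOND reduces the observed missing mass.",
    "B": "MOND increases the discrepancy.",
    "C": "MOND explains the missing baryons.",
}
print(span_to_label("increases the discrepancy", options))
```

Because many options in this dataset share long prefixes, a short span can match several options; this sketch simply takes the first match, which is a design shortcut rather than a reliable rule.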

In [None]:
df_pred.head()

In [None]:
df_combined = pd.merge(df_train, df_pred, how='inner',on='prompt')

In [None]:
df_combined.head()

In [None]:
# 'answer' already holds the text of the correct option (it was overwritten earlier),
# so copy it instead of calling translate_answer again, which would return NaN here
df_combined['answer_translated'] = df_combined['answer']

In [None]:
df_combined.head()

In [None]:
# Evaluate performance
# df_combined['accuracy'] = df_combined['answer_translated'].equals(df_combined['top_answer_1'])
df_combined['accuracy'] = np.where(df_combined['answer_translated'] == df_combined['top_answer_1'], 1, 0)

In [None]:
# Count the occurrences of each category
category_counts = df_combined['accuracy'].value_counts()

# Create a bar plot
plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
sns.barplot(x=category_counts.index, y=category_counts.values, palette='viridis')

# Set plot labels and title
plt.xlabel('Accuracy')
plt.ylabel('Count')

# Rotate x-axis labels for better readability (optional)
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.show()

In [None]:
# calculate the percentage of accuracy
accuracy = df_combined['accuracy'].mean() * 100

print(f"The percentage of accuracy is {accuracy:.2f}%")
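
The accuracy above only scores the first prediction, while the task statement asks for the top three answers. A minimal sketch of a top-three hit rate is shown here, using made-up labels. `top_k_hit_rate` is a hypothetical helper, and this is not necessarily the competition's official metric.

```python
def top_k_hit_rate(true_answers, ranked_predictions):
    """Share of questions whose true answer appears among the ranked predictions."""
    hits = sum(
        1 for truth, preds in zip(true_answers, ranked_predictions) if truth in preds
    )
    return hits / len(true_answers)

# Made-up labels for four questions
true_answers = ["D", "A", "A", "C"]
ranked_predictions = [["D", "B", "A"], ["B", "A", "E"], ["C", "D", "E"], ["C", "A", "B"]]
print(f"Top-3 hit rate: {top_k_hit_rate(true_answers, ranked_predictions):.2%}")
```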