<a href="https://colab.research.google.com/github/csabi0312/DeepLProject/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Packages

In [1]:
!pip install datasets transformers evaluate
!pip install transformers[torch]



In [2]:
# Import packages
import random
import numpy as np
import pandas as pd
from tqdm import tqdm   # Progress bars for loops
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict, load_metric
from transformers import BertForSequenceClassification, BertTokenizer, TrainingArguments, Trainer
import evaluate


# Setting the random seed
seed_value = 42
random.seed(seed_value)

# Data manipulations

In [3]:
# Loading the questions
data = pd.read_csv("https://raw.githubusercontent.com/csabi0312/DeepLProject/main/train.csv",index_col=0)

In [4]:
data.head()

id,prompt,A,B,C,D,E,answer
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [5]:
# Creating a dictionary to map the values to numbers
mapping = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}

# Mapping the letters in the 'answer' column to integer labels
data['answer'] = data['answer'].map(mapping)
data.head()

id,prompt,A,B,C,D,E,answer
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,3
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,0
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,0
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,2
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,3


In [6]:
# Train-Val-Test split:
# Splitting the DataFrame into training, validation, and test datasets with a 2:1:1 ratio
train, temp = train_test_split(data, test_size=0.5, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

print(len(train))
print(len(val))
print(len(test))

100
50
50


We want to merge each question with its possible answers and create a dataset with two columns: the merged text and the answer as an integer label.<br> <br>
To reduce overfitting and enrich the data, we also merge each training question with a random permutation of its answers.<br>
This means that each question yields 2 datapoints in the new training dataset: one with the original order of the answers and one with a permuted order.

In [7]:
def shuffle_list_with_index(list_, original_index):
    # Shuffling the list without changing the original list
    shuffled_list = random.sample(list_, len(list_))

    # Finding the new index of the original right element
    right_word = list_[original_index]
    new_index = shuffled_list.index(right_word)

    return shuffled_list, new_index
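
The helper above finds the new label by looking up the text of the correct answer. As a minimal sketch, a hypothetical index-based variant (`shuffle_positions`, not part of the original notebook) could track the correct answer by position instead, which would stay unambiguous even if two options happened to have identical text. The toy option strings are placeholders.

```python
import random

def shuffle_positions(options, correct_index, rng=random):
    """Hypothetical variant: shuffle positions and track the correct one by index."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[k] is the original position of the option now in slot k
    shuffled = [options[i] for i in order]
    new_index = order.index(correct_index)
    return shuffled, new_index

rng = random.Random(0)  # local generator so the toy check is repeatable
options = ["alpha", "beta", "gamma", "delta", "epsilon"]
shuffled, new_index = shuffle_positions(options, 3, rng)

# The correct answer text is unchanged; only its position may differ
assert shuffled[new_index] == options[3]
assert sorted(shuffled) == sorted(options)
```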

In [8]:
# Testing the shuffling and the merging for the first row of the dataset
row = train.iloc[0]
question = row.iloc[0]                          # 'prompt' column
answers = [row.iloc[i] for i in range(1, 6)]    # columns 'A' to 'E'
label = row.iloc[6]                             # integer-encoded 'answer'

new_answers, new_label = shuffle_list_with_index(answers,label)

result1 = ['\n'.join([question] + answers),label]
result2 = ['\n'.join([question] + new_answers), new_label]

for i in result1:
  print(i)
print('\n')
for j in result2:
  print(j)

Which of the following statements accurately describes the relationship between the dimensions of a diffracting object and the angular spacing of features in the diffraction pattern?
The angular spacing of features in the diffraction pattern is indirectly proportional to the dimensions of the object causing the diffraction. Therefore, if the diffracting object is smaller, the resulting diffraction pattern will be narrower.
The angular spacing of features in the diffraction pattern is directly proportional to the dimensions of the object causing the diffraction. Therefore, if the diffracting object is smaller, the resulting diffraction pattern will be narrower.
The angular spacing of features in the diffraction pattern is independent of the dimensions of the object causing the diffraction. Therefore, if the diffracting object is smaller, the resulting diffraction pattern will be the same as if it were big.
The angular spacing of features in the diffraction pattern is inversely proportio

In [9]:
# Performing the data augmentation process for the training set
# Initializing an empty list
X_t=[]

# Looping through the 'train' DataFrame using tqdm for progress visualization
for index in tqdm(range(len(train))):
    # Extracting the current row, its question, the five answers and the label
    row = train.iloc[index]
    question = row.iloc[0]
    answers = [row.iloc[i] for i in range(1, 6)]
    label = row.iloc[6]

    # Shuffling the answers and updating the label using the 'shuffle_list_with_index' function
    new_answers, new_label = shuffle_list_with_index(answers,label)

    # Combining the question and answers with a new line and adding the original label to the list
    result1 = ['\n '.join([question] + answers),label]
    # Combining the question and shuffled answers with a new line and adding the updated label to the list
    result2 = ['\n '.join([question] + new_answers), new_label]

    # Appending both results to the list 'X_t'
    X_t.append(result1)
    X_t.append(result2)

# Creating a new DataFrame 'df_train' from the list 'X_t' with the specified column names
df_train = pd.DataFrame(X_t, columns=["text","label"])

# Displaying the first few rows of the newly created DataFrame
df_train.head()

100%|██████████| 100/100 [00:00<00:00, 1619.81it/s]


,text,label
0,Which of the following statements accurately d...,3
1,Which of the following statements accurately d...,4
2,What is the second law of thermodynamics?\n Th...,4
3,What is the second law of thermodynamics?\n Th...,0
4,What is radiometric dating?\n Radiometric dati...,1


In [10]:
# Bringing the validation set into the same form as the training set (without augmentation)
X_v=[]

for index in tqdm(range(len(val))):
  row = val.iloc[index]
  question = row.iloc[0]
  answers = [row.iloc[i] for i in range(1, 6)]
  label = row.iloc[6]

  result1 = ['\n '.join([question] + answers),label]
  X_v.append(result1)

df_val = pd.DataFrame(X_v, columns=["text","label"])
df_val.shape

100%|██████████| 50/50 [00:00<00:00, 3764.34it/s]


(50, 2)

In [11]:
# Bringing the test set into the same form
X_e=[]

for index in tqdm(range(len(test))):
  row = test.iloc[index]
  question = row.iloc[0]
  answers = [row.iloc[i] for i in range(1, 6)]
  label = row.iloc[6]

  result1 = ['\n '.join([question] + answers),label]
  X_e.append(result1)

df_test = pd.DataFrame(X_e, columns=["text","label"])
df_test.shape

100%|██████████| 50/50 [00:00<00:00, 1592.12it/s]


(50, 2)

Now we have the 3 necessary datasets for training and testing:


*   df_train: enriched train dataset
*   df_val : validation dataset
*   df_test : test dataset
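
Before training, it can help to check how the integer labels are distributed in each split, since a classifier over five options has a chance level of 0.2. The following is a minimal sketch using only the standard library; `label_shares` is a hypothetical helper and `toy_labels` is made-up data standing in for something like `df_train["label"].tolist()`.

```python
from collections import Counter

def label_shares(labels, num_labels=5):
    """Return the fraction of examples per label id (0..num_labels-1)."""
    counts = Counter(labels)
    total = len(labels)
    return {k: counts.get(k, 0) / total for k in range(num_labels)}

# Toy labels (an assumption for illustration, not the notebook's data)
toy_labels = [3, 4, 4, 0, 1, 1, 2, 3, 0, 2]
shares = label_shares(toy_labels)
print(shares)
```

A strongly skewed distribution would mean that a model could reach above-chance accuracy by always predicting the most frequent label.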



In [12]:
# Create Hugging Face DatasetDict from the separate DataFrames
raw_datasets = DatasetDict({
    "train": Dataset.from_pandas(df_train),
    "validation": Dataset.from_pandas(df_val),
    "test": Dataset.from_pandas(df_test),
})

In [14]:
# Specifying hyperparameters
DATA_COLUMN = 'text'
LABEL_COLUMN = 'label'
LEARNING_RATE = 5e-5
BATCH_SIZE = 8
NUM_EPOCHS = 10
NUM_LABELS = 5    # There are 5 labels corresponding to the answers: A B C D E

# Load pre-trained model and tokenizer
model_name = "reyhanemyr/bert-base-uncased-finetuned-scientific-eval"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples[DATA_COLUMN], padding=True, truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Define accuracy metric (datasets.load_metric is deprecated in favor of the evaluate library)
metric = evaluate.load("accuracy")

# Define training arguments
training_args = TrainingArguments(
    output_dir="./bert_sequence_classification",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    evaluation_strategy="epoch",   # named `eval_strategy` in newer transformers releases
    save_total_limit=2,
    num_train_epochs=NUM_EPOCHS,
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=lambda p: metric.compute(predictions=p.predictions.argmax(axis=1), references=p.label_ids),
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

print(results)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at reyhanemyr/bert-base-uncased-finetuned-scientific-eval and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at reyhanemyr/bert-base-uncased-finetuned-scientific-eval and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([15, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([15]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.62795,0.2
2,No log,1.624234,0.2
3,No log,1.71886,0.16
4,No log,1.664168,0.26
5,No log,1.774031,0.24
6,No log,1.880902,0.2
7,No log,1.917006,0.16
8,No log,1.954061,0.24
9,No log,1.942363,0.3
10,No log,1.973765,0.24


{'eval_loss': 1.9737648963928223, 'eval_accuracy': 0.24, 'eval_runtime': 1.6714, 'eval_samples_per_second': 29.915, 'eval_steps_per_second': 4.188, 'epoch': 10.0}


In [15]:
# Evaluate the model on the test set
test_results = trainer.evaluate(tokenized_datasets["test"])

print(test_results)

{'eval_loss': 1.88422429561615, 'eval_accuracy': 0.22, 'eval_runtime': 1.5265, 'eval_samples_per_second': 32.756, 'eval_steps_per_second': 4.586, 'epoch': 10.0}


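With a test accuracy of 0.22, this is only 2 percentage points better than the original BERT, and barely above the 0.20 chance level for five answer options.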
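
To put that gap in perspective, here is a small worked calculation based only on the numbers printed above (50 test examples, `eval_accuracy` of 0.22, five answer options). It is a rough sketch, assuming independent examples and a simple binomial standard error rather than a formal significance test.

```python
import math

n_test = 50         # size of the test split printed above
accuracy = 0.22     # eval_accuracy reported on the test set
chance = 1 / 5      # uniform guessing over five options

n_correct = round(accuracy * n_test)   # questions answered correctly
n_chance = round(chance * n_test)      # questions expected correct by guessing

# Standard error of an accuracy measured on n_test examples at the chance rate
std_err = math.sqrt(chance * (1 - chance) / n_test)

print(n_correct, n_chance)       # 11 10
print(round(std_err, 3))         # 0.057
print(round(math.log(5), 3))     # 1.609, cross-entropy of a uniform guess
```

The difference amounts to a single question, well inside one standard error, and the first-epoch validation loss of about 1.63 sits close to the 1.609 of a uniform guess.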

Maybe we need to give the question and each answer option to the model separately, sentence by sentence, instead of as one merged text?
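
One possible reading of this idea, which the notebook does not try, is the multiple-choice formulation: pair the question with each option separately and let the model score each pair. The `transformers` library provides `BertForMultipleChoice`, which expects inputs shaped `(batch_size, num_choices, sequence_length)`. The sketch below only builds the pairs; `build_choice_pairs` is a hypothetical helper and the option strings are placeholders.

```python
def build_choice_pairs(question, options):
    """Pair the question with each option separately (one pair per candidate answer)."""
    return [(question, option) for option in options]

question = "What is radiometric dating?"
options = ["option A text", "option B text", "option C text",
           "option D text", "option E text"]   # placeholder strings
pairs = build_choice_pairs(question, options)

first_sentences = [q for q, _ in pairs]
second_sentences = [o for _, o in pairs]
# A tokenizer call such as tokenizer(first_sentences, second_sentences, truncation=True)
# would then yield five encodings, i.e. one (num_choices, seq_len) block per question.
print(len(pairs))
```

With this layout the label stays the index of the correct option, so the shuffling augmentation above would still apply unchanged.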