## Evaluate the model

For the evaluation phase, I selected 300 questions from the dataset and rephrased them. I then embedded the rephrased questions and asked the model to retrieve the matching dataset entries again. Using the sentence-embedding model together with a FAISS nearest-neighbor search (k=1), the top-1 accuracy was 0.86.

In [8]:
import pickle
import faiss
import numpy as np
import pandas as pd
from zipfile import ZipFile
from sentence_transformers import SentenceTransformer

In [9]:
# Read the rephrased questions; the context manager closes the file for us
with open("changed_samples2.txt", "r", encoding="utf-8") as my_file:
    data = my_file.read()

# Join the lines, split on "?", and drop the fragment after the last "?"
question_list = data.replace('\n', ' ').split("?")
question_list.pop()
print(question_list)

['what is justin bieber’s brother called', '  which role did natalie portman portray in star wars', '  selena gomez is from which state', '  grand bahama island is a part of which country', '  what currency should be brought to the bahamas', '  which character did john noble act as in the lord of the rings', '  which team is joakim noah associated with', '  the nfl team redskins originates from where', '  in which location did saki reside', '  what is sacha baron cohen’s current age', '  at the start of ww2, which two nations attacked poland', '  which time zone applies to cleveland, ohio', '  who did draco malfoy marry in the end', '  what nations share a border with the usa', '  where on a map can rome, italy be found', '  what is nina dobrev’s nationality', '  iceland is a territory of which nation', '  who was the first kennedy to pass away', '  which books were authored by beverly cleary', '  from which country did the philippines achieve independence', '  what is the main airport
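The printed list shows that every entry after the first keeps leading spaces left over from the newline replacement. A minimal sketch of a stricter parser follows, assuming the file holds one `?`-terminated question after another; `split_questions` and the sample text are hypothetical and not part of the notebook.

```python
def split_questions(text: str) -> list[str]:
    """Split a block of '?'-terminated questions into clean strings."""
    # Collapse newlines to spaces, as the cell above does
    parts = text.replace("\n", " ").split("?")
    # Strip surrounding whitespace and drop empty fragments
    return [p.strip() for p in parts if p.strip()]


# Hypothetical sample text, not taken from changed_samples2.txt
sample = "what is the capital of france?\n\nwho wrote hamlet?\n"
print(split_questions(sample))
# ['what is the capital of france', 'who wrote hamlet']
```

Unlike `question_list.pop()`, which always discards the last fragment, this variant only discards fragments that are empty after stripping, so a final question without a trailing `?` would be kept.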

In [12]:
d = pd.read_csv('../Data/df_merged.csv')

# Unpack the cached embeddings (named `zf` to avoid shadowing the built-in `zip`)
with ZipFile('../my-embeddings.zip', 'r') as zf:
    zf.printdir()  # list the contents of the archive
    zf.extractall()

with open("my-embeddings.pkl", "rb") as fIn:
    cache_data = pickle.load(fIn)
    embeddings = cache_data['embeddings']
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Build the exact L2 index once, keyed by the DataFrame row ids,
# instead of rebuilding and re-pickling it for every query
index = faiss.IndexIDMap(faiss.IndexFlatL2(embeddings.shape[1]))
index.add_with_ids(embeddings, d.index.values)

with open("faiss_index.pickle", "wb") as f:
    pickle.dump(faiss.serialize_index(index), f)

# Embed the rephrased questions and retrieve the nearest row id for each one
embeddingsQ = model.encode(question_list)
L2, ID = index.search(np.asarray(embeddingsQ, dtype="float32"), k=1)
indexes = ID.flatten().tolist()

File Name                                             Modified             Size
my-embeddings.pkl                              2025-05-10 13:20:50    139155638
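`faiss.IndexFlatL2` performs an exhaustive search and ranks stored vectors by squared Euclidean distance to the query. As a rough illustration of that ranking, here is a pure-Python sketch; `nearest_by_l2` and the toy vectors are hypothetical stand-ins, not FAISS code or the real embeddings.

```python
def nearest_by_l2(query, vectors, ids):
    """Return (id, squared_distance) of the stored vector closest to `query`."""
    best_id, best_dist = None, float("inf")
    for vec_id, vec in zip(ids, vectors):
        # Squared L2 distance between the query and one stored vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        if dist < best_dist:
            best_id, best_dist = vec_id, dist
    return best_id, best_dist


# Hypothetical 2-D "embeddings" keyed by row id
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
toy_ids = [10, 11, 12]

best_id, best_dist = nearest_by_l2([0.9, 1.2], toy_vectors, toy_ids)
print(best_id)  # 11
```

The returned id plays the role of the `ID` array from `index.search`, which is why the row ids passed to `add_with_ids` can be compared directly with the expected labels.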


In [14]:
print(indexes)
len(indexes)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 515, 22, 23, 871, 2388, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 3467, 42, 43, 44, 45, 14754, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 2496, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 3434, 78, 79, 80, 81, 82, 1406, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 387, 102, 103, 104, 105, 180, 107, 108, 109, 110, 135, 2191, 113, 114, 115, 197, 117, 118, 119, 3433, 121, 122, 123, 124, 3164, 126, 127, 128, 129, 264, 131, 132, 133, 134, 135, 136, 3118, 138, 3702, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 842, 156, 157, 158, 159, 160, 161, 162, 3757, 164, 165, 166, 1403, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 2063, 183, 3021, 185, 186, 187, 188, 189, 190, 3283, 192, 193, 194, 195, 196, 116, 198, 199, 200, 201, 202, 203, 204, 205, 78253, 207, 208, 209, 210, 211, 212, 29236, 214, 215, 

300

In [15]:
from sklearn.metrics import accuracy_score

# Expected row ids: rows 0-300 of the dataset, minus row 247
labels = list(range(0, 301))
labels.remove(247)  # Overlooked row that doesn't exist in the rephrased questions
accuracy = accuracy_score(labels, indexes)  # (y_true, y_pred)
accuracy


0.8633333333333333
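For this retrieval setup, `accuracy_score` reduces to the share of positions where the retrieved row id equals the expected one. The sketch below spells that out; `top1_accuracy` and the toy lists are hypothetical. The reported value is consistent with 259 correct retrievals out of 300.

```python
def top1_accuracy(expected, predicted):
    """Fraction of positions where the predicted id equals the expected id."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and equally long")
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


# Hypothetical toy run: the third query retrieves the wrong row
toy_expected = [0, 1, 2, 3]
toy_predicted = [0, 1, 515, 3]
print(top1_accuracy(toy_expected, toy_predicted))  # 0.75
```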

In [16]:
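# Collect predicted ids that fall outside the label set.
# Note: a wrong prediction that is itself another label's id
# (e.g. 180 returned at position 106) passes this membership test.
different = [i for i in indexes if i not in labels]
different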

[515,
 871,
 2388,
 3467,
 14754,
 2496,
 3434,
 1406,
 387,
 2191,
 3433,
 3164,
 3118,
 3702,
 842,
 3757,
 1403,
 2063,
 3021,
 3283,
 78253,
 29236,
 2363,
 2419,
 3069,
 42294,
 3577,
 1590,
 2064,
 2672,
 3328,
 3576,
 3673,
 2188,
 1475,
 1639]
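An accuracy of 0.8633 over 300 questions corresponds to 41 misses, yet the list above has 36 entries. The membership test skips wrong predictions that coincide with another expected id, such as 180 at position 106 in the printed `indexes`. A position-wise comparison would catch those as well; the sketch below uses a hypothetical helper, `find_mismatches`, and toy data.

```python
def find_mismatches(expected, predicted):
    """Return (position, expected_id, predicted_id) for every wrong retrieval."""
    return [
        (pos, e, p)
        for pos, (e, p) in enumerate(zip(expected, predicted))
        if e != p
    ]


# Hypothetical toy data: position 1 retrieves id 3, which is a valid label elsewhere
toy_expected = [0, 1, 2, 3, 4]
toy_predicted = [0, 3, 2, 3, 515]

# Membership test: only sees ids outside the label set
outside = [p for p in toy_predicted if p not in toy_expected]
print(outside)  # [515]

# Position-wise test: sees every wrong retrieval
mismatches = find_mismatches(toy_expected, toy_predicted)
print(mismatches)  # [(1, 1, 3), (4, 4, 515)]
```

Applied to the notebook's variables, this would be `find_mismatches(labels, indexes)`, and the predicted ids could then be passed to `d.index.isin(...)` in the same way as `different`.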

When I reviewed some of the wrong predictions, I noticed that several of them pointed to rows containing the same question as the expected row. These duplicate entries might have been added to the dataset to improve context during training. Still, it would not be right to remove them from the dataset, because matching answers can belong to different questions.

In [17]:
df2 = d[d.index.isin(different)]
df2

Unnamed: 0.1,Unnamed: 0,question,answers
387,387,when was the last dallas cowboys super bowl win?,Super Bowl XXX
515,515,who is the prime minister of ethiopia now?,Meles Zenawi
842,842,what is the currency of spain called?,Euro
871,871,what type of political system does russia have?,Constitutional republic
1403,1403,what university did bill clinton graduated from?,Yale Law School
1406,1406,when did the ny knicks last win a championship?,1973 NBA Finals
1475,1475,who does david beckham play for in 2013?,Paris Saint-Germain F.C.
1590,1590,when and where did the battle of antietam take...,Sharpsburg
1639,1639,what was the name of the book hitler wrote whi...,Mein Kampf
2063,2063,who killed vincent chin film?,"Ronald Ebens,Michael Nitz"
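To check the observation that some mispredicted rows repeat a question found elsewhere in the dataset, one option is to group row ids by normalized question text. This is only a sketch: `group_duplicate_questions` and the toy questions are hypothetical, and it detects exact repeats after lowercasing and whitespace cleanup, not paraphrases.

```python
from collections import defaultdict


def group_duplicate_questions(questions):
    """Map each normalized question text to the row ids that share it."""
    groups = defaultdict(list)
    for row_id, text in questions.items():
        # Normalize case and whitespace so trivial variants compare equal
        key = " ".join(text.lower().split())
        groups[key].append(row_id)
    # Keep only the questions that occur in more than one row
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical toy rows, not taken from df_merged.csv
toy_questions = {
    5: "what is the capital of france?",
    6: "who wrote hamlet?",
    900: "What is the capital of  France?",
}
duplicates = group_duplicate_questions(toy_questions)
print(duplicates)  # {'what is the capital of france?': [5, 900]}
```

On the notebook's DataFrame, the input could be built with `d['question'].to_dict()`. A wrong prediction whose row id shares a group with the expected id would then be a duplicate question rather than a true retrieval error.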
