# Reducing the question set for evaluation
Now that the top-1 hit rate is high, we can shrink the question set to speed up further experiments, while keeping the smaller set stratified.<br>
To do this, we run the evaluation code and keep track of which questions could not be answered, then build a mini set with a similar ratio of answered to unanswered questions.  

The next cell loads the data, the vector store, and the models from notebook 04.

In [12]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_classic.retrievers import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from utils import load_processed_data

_, questions_ground_truth = load_processed_data("../data/processed/squad_processed.pkl")

embed = HuggingFaceEmbeddings(model_name="google/embeddinggemma-300m", encode_kwargs={'normalize_embeddings': True})
persist_directory = "./chroma/04_google_gemma"
vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embed)

rerank_model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")

compressor = CrossEncoderReranker(model=rerank_model, top_n=10)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor, 
    base_retriever=vectorstore.as_retriever(search_kwargs={'k': 10})
)

Now we can run the evaluation and keep questions in separate lists.

In [5]:
import random
from tqdm import tqdm

k_values = [1, 3, 5, 7, 10]
hits = {k: 0 for k in k_values}
total_mrr = 0.0
total_questions = len(questions_ground_truth)
max_k = max(k_values)

# Questions split by whether the target paragraph was retrieved
found_questions = []
not_found_questions = []

print(f"Starting evaluation on {total_questions} questions...")

for q_data in tqdm(questions_ground_truth):
    question = q_data['question']
    target_id = q_data['ground_truth_para_id']
    
    retrieved_docs = retriever.invoke(question, k=max_k) 
    retrieved_ids = [doc.metadata.get('para_id') for doc in retrieved_docs]

    if target_id in retrieved_ids:
        # Mark as found
        found_questions.append(q_data)
        
        # Metrics
        rank = retrieved_ids.index(target_id) + 1
        total_mrr += 1.0 / rank
        for k in k_values:
            if rank <= k:
                hits[k] += 1
    else:
        # mark as not found
        not_found_questions.append(q_data)

print("\n--- Evaluation Results ---")
print(f"MRR: {total_mrr / total_questions:.4f}")
for k in k_values:
    print(f"Hit Rate@{k}: {(hits[k] / total_questions)*100:.2f}%")

print("\nCategorization Summary:")
print(f"Total Found: {len(found_questions)}")
print(f"Total Not Found: {len(not_found_questions)}")

Starting evaluation on 2265 questions...


100%|██████████| 2265/2265 [15:13<00:00,  2.48it/s]


--- Evaluation Results ---
MRR: 0.9226
Hit Rate@1: 89.05%
Hit Rate@3: 95.32%
Hit Rate@5: 96.42%
Hit Rate@7: 96.64%
Hit Rate@10: 96.78%

Categorization Summary:
Total Found: 2192
Total Not Found: 73
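
As a reference for how the numbers above are derived, here is a minimal sketch that computes MRR and Hit Rate@k from a list of 1-based ranks. The function name `summarize_ranks` and the toy ranks are made up for illustration and are not part of the notebook or of `utils`; `None` stands for a question whose target paragraph was not retrieved at all.

```python
def summarize_ranks(ranks, k_values=(1, 3, 5, 7, 10)):
    """Compute MRR and Hit Rate@k from 1-based ranks (None = not retrieved)."""
    total = len(ranks)
    # A miss contributes 0 to the reciprocal-rank sum but still counts in the denominator.
    mrr = sum(1.0 / r for r in ranks if r is not None) / total
    hit_rates = {
        k: sum(1 for r in ranks if r is not None and r <= k) / total
        for k in k_values
    }
    return mrr, hit_rates


# Toy data: two hits at rank 1, one at rank 2, one miss, one hit at rank 6.
toy_ranks = [1, 1, 2, None, 6]
mrr, hit_rates = summarize_ranks(toy_ranks)

print(f"MRR: {mrr:.4f}")
for k, rate in hit_rates.items():
    print(f"Hit Rate@{k}: {rate * 100:.2f}%")
```

Both metrics divide by the total number of questions, so unanswered questions pull MRR and every hit rate down equally.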





With that done, we create the mini set of questions and run the evaluation on it to check that its metrics are in line with those of the full set. 

In [14]:
sample_size = 350


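# Share of questions whose target paragraph was retrieved in the full run
found_ratio = len(found_questions) / total_questions
# int() truncates, so the rounding remainder goes to the not-found group
n_found = int(sample_size * found_ratio)
n_not_found = sample_size - n_found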

questions_mini_set = random.sample(found_questions, n_found) + random.sample(not_found_questions, n_not_found)
random.shuffle(questions_mini_set) # just in case :D

print(f"Stratified mini set created with {len(questions_mini_set)} questions.")

Stratified mini set created with 350 questions.
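
To make the split concrete, the arithmetic can be reproduced in isolation. This is only a sketch: the counts 2192 and 73 are copied from the evaluation output above, and the variable names `n_found_full` and `n_not_found_full` are introduced here for illustration.

```python
sample_size = 350
n_found_full, n_not_found_full = 2192, 73  # counts from the full evaluation output
total = n_found_full + n_not_found_full

found_ratio = n_found_full / total
n_found = int(sample_size * found_ratio)   # truncates 338.72 down to 338
n_not_found = sample_size - n_found

print(f"found: {n_found}, not found: {n_not_found}")
# Share of found questions in the mini set; Hit Rate@10 on the mini set
# cannot exceed this, because the remaining questions are known misses.
print(f"found share: {n_found / sample_size * 100:.2f}%")
```

Because every question in the found group had its paragraph retrieved within the top 10, this share is also the value Hit Rate@10 should take on any mini set built this way, provided the retriever is deterministic.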


In [15]:
from utils import evaluate_retrieval
results_mini_set = evaluate_retrieval(questions_mini_set, retriever)

Starting evaluation on 350 questions...


100%|██████████| 350/350 [02:09<00:00,  2.70it/s]


--- Evaluation Results ---
MRR: 0.9128
Hit Rate@1: 87.14%
Hit Rate@3: 95.71%
Hit Rate@5: 96.57%
Hit Rate@7: 96.57%
Hit Rate@10: 96.57%





MRR and the hit rates at 1, 3 and 5 stayed close to the full-set values, while the hit rates at 7 and 10 plateau at the same value as Hit Rate@5. After rerunning several times, to make sure I was not always sampling with the same seed, these three rates always stayed at 96.57%. Hit Rate@10 is effectively fixed by construction: the mini set always contains 338 found questions out of 350, and 338/350 = 96.57%. For the other two, the gap between 96.42% (Hit Rate@5 on the full question set) and 96.78% (Hit Rate@10 on the full question set) corresponds to about 8 questions whose answer paragraph is ranked below the 5th document. Each of these has only about a 15% chance (338 out of 2192 found questions) of being drawn into the mini set, so a given sample contains few of them, if any. <br>
I will not stratify the mini set further to account for these 8 questions; I accept 96.57% as the plateau for this system.<br><br>
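
One way to sanity-check this reasoning is to treat the draw as sampling without replacement and compute the hypergeometric probabilities directly. The sketch below assumes exactly 8 "late" questions (the text only estimates this number) and uses only the standard library; the names `pool`, `drawn`, `late` and `prob_exactly` are illustrative.

```python
from math import comb

pool = 2192   # found questions in the full set
drawn = 338   # found questions drawn into the mini set
late = 8      # assumed number of questions found only after rank 5


def prob_exactly(j):
    """Probability that exactly j of the late questions end up in the sample."""
    return comb(late, j) * comb(pool - late, drawn - j) / comb(pool, drawn)


p_none = prob_exactly(0)
expected = late * drawn / pool  # mean of the hypergeometric distribution

print(f"P(no late question in the mini set): {p_none:.3f}")
print(f"P(at least one):                     {1 - p_none:.3f}")
print(f"Expected number of late questions:   {expected:.2f}")
```

If the result disagrees with what repeated runs show, it is worth confirming that the sampling cell, and not only the evaluation cell, was rerun each time.<br>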
Lastly, I will save the data for later use.

In [17]:
import pickle

output_path = "../data/processed/squad_processed_mini.pkl"
output_data = {
    "questions_ground_truth": questions_mini_set
}

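# "xb" creates the file and raises FileExistsError if it already exists,
# so an earlier export is never silently overwritten
with open(output_path, "xb") as f: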
    pickle.dump(output_data, f)

print(f"Data successfully exported to {output_path}")

Data successfully exported to ../data/processed/squad_processed_mini.pkl
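
For later notebooks, the exported file can be read back with `pickle` directly. The sketch below is an assumption about how that could look, not the project's `load_processed_data` helper (whose expected file layout is not shown here); `load_mini_set` is a hypothetical name, and the round trip runs on a throwaway file with made-up records so it does not touch the real export.

```python
import os
import pickle
import tempfile


def load_mini_set(path):
    """Read the exported dict and return its list of question records."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["questions_ground_truth"]


# Round-trip demo on a temporary file with hypothetical records.
demo_questions = [{"question": "q1", "ground_truth_para_id": 0}]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_mini.pkl")
with open(demo_path, "xb") as f:  # same exclusive-create mode as the export cell
    pickle.dump({"questions_ground_truth": demo_questions}, f)

loaded = load_mini_set(demo_path)
print(f"Loaded {len(loaded)} question(s)")
```

With the real file, the call would be `load_mini_set("../data/processed/squad_processed_mini.pkl")`.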
