# Gold Set Evaluation

Evaluating an LLM's accuracy is not straightforward. Traditional classification
metrics like accuracy, precision, recall, and F-beta are well suited to problems
with discrete classes, where Type I and Type II errors can be counted.
However, when many differently worded answers can be semantically similar,
an evaluation that accounts for meaning rather than exact matches is required.

*Note:* Of course, you can restrict the LLM to yes/no or multiple-choice
questions, but that leaves most of what the LLM can do unassessed.

For these reasons, to capture semantic similarity between the answers generated
by the LLM and the expected answers, we go back to the embedding space. The
embedding model captures context and semantic meaning during training, so we can
use the vector store's embedding function to embed the generated and expected
answers as vectors and measure their cosine similarity. Similar vectors
correspond to similar answers.
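
As a minimal sketch of the measure itself, cosine similarity is the dot product of two vectors divided by the product of their norms. The helper name `cosine_sim` and the toy 3-dimensional vectors below are illustrative assumptions, not part of this notebook's pipeline; real embeddings have many more dimensions.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for real embeddings.
same_direction = cosine_sim(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = cosine_sim(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0.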

## Load
Let's start by loading the vector store.

In [None]:
import os
from pathlib import Path
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
import numpy as np
from dotenv import load_dotenv
import ollama

load_dotenv()


OLLAMA_SERVER_URL = os.getenv("OLLAMA_SERVER_URL")
DPATH_VECTORSTORE = Path.cwd() / "data" / "vectorstore"

client = ollama.Client(host=OLLAMA_SERVER_URL)
models_server = client.list()["models"]
model_names = [model.model for model in models_server]

In [None]:
# Get the only model in the server
model_name = model_names[0]

EMBEDDINGS = OllamaEmbeddings(model=model_name, base_url=OLLAMA_SERVER_URL)
vector_store_path = DPATH_VECTORSTORE / "vs_deepseek-r1"
VECTOR_STORE = FAISS.load_local(
    vector_store_path, embeddings=EMBEDDINGS, allow_dangerous_deserialization=True
)

## Load in Gold Set Session

Load a session in which we asked questions that have an expected answer. That
way we can compare the answer we expect with the answer the LLM generated.

In practice, a gold set should contain many more questions, provided by subject
matter experts who know the answers and can evaluate responses to provide
feedback.

In [22]:
import pandas as pd


dpath = Path.cwd() / "data" / "sessions"
fname = "chat_history_session_2945694660.csv"
fpath = dpath / fname

raw = pd.read_csv(fpath, index_col = 0)
raw

Unnamed: 0,session_id,user_input,chatbot_output,response_time_sec,reference_page,context_pdf,pdf_pages,user_feedback,created_at,updated_at
3,2945694660,why is the sky blue?,"<think>\nOkay, let's start by understanding th...",11.44895,7,SlamonetalSCIENCE1987.pdf,8,,2025-06-10 00:30:16.946532,
4,2945694660,what is HER-2/neu,"<think>\nHmm, the user wants me to explain wha...",18.639963,6,SlamonetalSCIENCE1987.pdf,8,,2025-06-10 00:40:51.758936,
5,2945694660,"In the initial survey, tissue from how many br...","<think>\nOkay, let's tackle this query step by...",14.046112,6,SlamonetalSCIENCE1987.pdf,8,,2025-06-10 00:47:28.563297,


The first question, "why is the sky blue?", is deliberately unrelated to the
source PDF. It is there to check that the model doesn't make up an answer.

## Prepare
Let's keep the relevant columns and use an answer key to add the expected answer for each question.

In [25]:
# Answers from article and web search lookup
answer_key = {
    "why is the sky blue?": "I don't know",
    "what is HER-2/neu": "HER-2/neu is a receptor protein that, in healthy cells, helps control cell growth and division.",
    "In the initial survey, tissue from how many breast cancers was evaluated for alterations in the HER-2/neu gene?": "103",
}

prepared = (
    raw
    .filter(["session_id", "user_input", "chatbot_output"])
    .assign(expected_answer=lambda x: x["user_input"].map(answer_key))
)
prepared

Unnamed: 0,session_id,user_input,chatbot_output,expected_answer
3,2945694660,why is the sky blue?,"<think>\nOkay, let's start by understanding th...",I don't know
4,2945694660,what is HER-2/neu,"<think>\nHmm, the user wants me to explain wha...","HER-2/neu is a receptor protein that, in healt..."
5,2945694660,"In the initial survey, tissue from how many br...","<think>\nOkay, let's tackle this query step by...",103


## Embed Answers
Let's embed answers so that we can compare them.

In [None]:
def embed_text(vector_store, text: str) -> np.ndarray:
    """Embed `text` with the vector store's embedding function."""
    return np.array(vector_store.embedding_function.embed_query(text))

processed = (
    prepared
    .assign(
        # Embed each generated answer and each expected answer separately
        generated_embeded=lambda x: x["chatbot_output"].apply(lambda text: embed_text(VECTOR_STORE, text)),
        expected_embeded=lambda x: x["expected_answer"].apply(lambda text: embed_text(VECTOR_STORE, text)),
    )
)
processed

Unnamed: 0,session_id,user_input,chatbot_output,expected_answer,generated_embeded,expected_embeded
3,2945694660,why is the sky blue?,"<think>\nOkay, let's start by understanding th...",I don't know,"[0.0013492916, 0.004949752, 0.0018895468, 0.00...","[0.00016478242, -0.0017969977, -0.0024264022, ..."
4,2945694660,what is HER-2/neu,"<think>\nHmm, the user wants me to explain wha...","HER-2/neu is a receptor protein that, in healt...","[0.007146416, -0.0031549868, -0.008919642, -5....","[0.0116439555, 0.000839068, 0.005928954, 0.006..."
5,2945694660,"In the initial survey, tissue from how many br...","<think>\nOkay, let's tackle this query step by...",103,"[0.0031050346, 0.016526, 0.0020599884, -0.0129...","[0.0002817883, 0.0021807922, 0.00036940924, 0...."
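
The generated answers in the table above begin with a `<think>` reasoning trace, which gets embedded together with the final answer. If the trace should not influence the similarity score, one option is to strip it before embedding. The sketch below assumes the trace is closed with a matching `</think>` tag, which the truncated output above does not show; `strip_reasoning` is a hypothetical helper, not part of this notebook.

```python
import re

# Matches a <think>...</think> span, including newlines inside it.
# Assumption: the model closes its reasoning trace with </think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> spans and surrounding whitespace (hypothetical helper)."""
    return THINK_BLOCK.sub("", text).strip()


# Made-up example text, not an actual response from the session.
example = "<think>\nOkay, let's start.\n</think>\n\nI don't know."
cleaned = strip_reasoning(example)
```

Such a helper could be applied to `chatbot_output` before calling `embed_text`. Whether that raises or lowers the scores for this gold set is not something this sketch establishes.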


In [None]:
# from torch import cosine_similarity
from langchain_community.utils.math import cosine_similarity


similarities = cosine_similarity(
    processed["generated_embeded"].tolist(), 
    processed["expected_embeded"].tolist()
)
similarities

array([[0.9282745 , 0.81796825, 0.85264665],
       [0.46150412, 0.58925928, 0.40713658],
       [0.34591719, 0.47816363, 0.30056028]])

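`similarities` above holds the pairwise similarities between every generated
answer (rows) and every expected answer (columns). Only the diagonal compares
an answer with its own expected answer (first generated vs. first expected,
etc.), so those are the only values that make sense to evaluate.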

In [38]:
eval_similarities = np.diag(similarities)
eval_similarities

array([0.9282745 , 0.58925928, 0.30056028])
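
Since only the matched pairs are needed, the same values can also be computed row by row without building the full pairwise matrix. The following is a minimal sketch under that assumption; `rowwise_cosine` is a hypothetical helper, and the toy 2-dimensional vectors stand in for the real embeddings.

```python
import numpy as np


def rowwise_cosine(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cosine similarity between x[i] and y[i] for each row i (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dots = np.sum(x * y, axis=1)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return dots / norms


# Toy vectors standing in for the generated and expected embeddings.
generated = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
expected = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
paired = rowwise_cosine(generated, expected)
```

For a gold set of three questions the difference is negligible, but the pairwise matrix grows quadratically with the number of questions while the row-wise version grows linearly.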

The similarities to evaluate, `eval_similarities`, measure how close each
generated answer is to its expected answer. To evaluate the model's performance
in aggregate, we can take the average across the gold set to get a sense of how
well the model does overall.

In [39]:
gs_avg_similarity = eval_similarities.mean()
gs_avg_similarity

0.6060313542407915
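
An average can hide a weak individual result, so it may help to look at the spread as well. The sketch below reuses the three diagonal values printed earlier; the `summary` dictionary and its keys are illustrative assumptions rather than an established reporting format.

```python
import numpy as np

# Diagonal similarities copied from the output above.
eval_similarities = np.array([0.9282745, 0.58925928, 0.30056028])

summary = {
    "mean": float(eval_similarities.mean()),
    "min": float(eval_similarities.min()),
    "max": float(eval_similarities.max()),
    # Position of the question whose answer is least similar to the expected one.
    "worst_question_index": int(eval_similarities.argmin()),
}
```

Here the lowest score belongs to the third question, whose expected answer is the short string "103".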