# Part 2: Evaluating our LLM application

So far, we've chosen typical/arbitrary values for the various parts of our RAG application. But if we were to change something, such as our chunking logic, embedding model, or LLM, how can we know that we have a better configuration than before? A generative task like this is very difficult to assess quantitatively, so we need to develop creative ways to do so.

Because we have many moving parts in our application, we need to perform both unit/component and end-to-end evaluation. Component-wise evaluation can involve evaluating our retrieval in isolation (is the best source in our set of retrieved chunks?) and evaluating our LLM's response (given the best source, is the LLM able to produce a quality answer?). As for end-to-end evaluation, we can assess the quality of the entire system (given all the data, what is the quality of the response?).

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Component and end-to-end evaluation.

## Setup

In [None]:
import os

os.environ["ANYSCALE_API_BASE"] = "https://ray-summit-training-jrvwy.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata-staging.com/v1/chat/completions"
os.environ["ANYSCALE_API_KEY"] = "tZLmCV1WtQAtnDx93MM5w3xaNhYV3whhcFRTKoH1GYQ"

os.environ["OPENAI_API_BASE"] = "https://api.openai.com/v1"
# os.environ["OPENAI_API_KEY"] = ...

## Golden Context Dataset

In an ideal world, we would have a golden validation dataset: given a set of queries, we would have the correct sources that answer those queries, and optionally the correct answer that should be returned by the LLM.

For this example, we have manually collected 177 representative user queries and identified the correct source in the documentation that answers each query.

In [None]:
from pathlib import Path
import json

golden_dataset_path = Path("../datasets/eval-dataset-v1.jsonl")

with open(golden_dataset_path, "r") as f:
    # Each line of the JSONL file is one JSON record.
    data = [json.loads(item) for item in f]
    
len(data)

Our dataset contains 'question' and 'source' pairs. If we have a golden context dataset, it is the best option for evaluation.

In [None]:
data[:5]

## Cold Start

We may not always have a prepared dataset of questions and the best source to answer each question readily available. To address this cold start problem, we could use an LLM to look at our documents and generate questions that a specific chunk would answer. This provides us with quality questions and the exact source the answer is in. However, this dataset generation method could be a bit noisy. The generated questions may not always resemble what your users would ask, and the information in the chunk we label as the best source may also appear in other chunks. Nonetheless, this is a great way to start our development process while we collect and manually label a high-quality dataset.
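The structure of such a synthetic dataset can be sketched as follows. This is a minimal sketch under assumptions: `generate_questions` is a hypothetical callable standing in for an LLM call (it is not a LlamaIndex API), `fake_generate_questions` is a placeholder that only returns templated strings, and the chunks and URLs are made up. The output mirrors the `question`/`source` format of the golden dataset.

```python
from typing import Callable, Dict, List


def build_synthetic_dataset(
    chunks: List[Dict[str, str]],
    generate_questions: Callable[[str, int], List[str]],
    questions_per_chunk: int = 2,
) -> List[Dict[str, str]]:
    """Pair every generated question with the source of the chunk it came from."""
    dataset = []
    for chunk in chunks:
        questions = generate_questions(chunk["text"], questions_per_chunk)
        for question in questions[:questions_per_chunk]:
            question = question.strip()
            if question:  # Skip empty generations.
                dataset.append({"question": question, "source": chunk["source"]})
    return dataset


def fake_generate_questions(text: str, n: int) -> List[str]:
    """Hypothetical placeholder for an LLM call; returns templated questions only."""
    topic = text.split(".")[0]
    return [f"Question {i + 1} about: {topic}?" for i in range(n)]


# Made-up chunks for illustration.
example_chunks = [
    {"text": "Chunk one text. More detail.", "source": "https://example.com/docs/a.html#intro"},
    {"text": "Chunk two text.", "source": "https://example.com/docs/b.html"},
]
synthetic_data = build_synthetic_dataset(example_chunks, fake_generate_questions)
print(len(synthetic_data))
```

In a real run, the placeholder would be replaced by a function that prompts an LLM with the chunk text, and the resulting pairs would still deserve a manual spot check for the noise described above.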

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show the synthetic data generation process.

In [None]:
from llama_index.evaluation import DatasetGenerator

In [None]:
# TODO

Since we already have a dataset with representative user queries and ground truth labels, we will use that for evaluation instead of a synthetically generated dataset.

## Evaluating Retrieval

The first component to evaluate in our RAG application is retrieval. Given a query, is our retriever pulling in the correct context to answer that query? Regardless of how good our LLM is, if it does not have the right context to answer the question, it cannot provide the right answer.

We can use our golden context dataset to evaluate retrieval. The simplest approach is that for each query in our dataset, we can test to see if the correct source is included in any of the chunks that are retrieved by our retriever. This measures "hit rate".

However, simply checking for existence can be misleading if we increase the number of chunks that we retrieve. Therefore, we also want to check the score that our retriever gives for the correct source. A higher score means our retriever is accurately determining the correct context. 

To summarize, for each query in our evaluation dataset, we will measure the following:
1. Is the correct source included in any of the retrieved chunks?
2. What is the score our retriever gives to the correct source?
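To see why an existence check alone can mislead as we retrieve more chunks, here is a small sketch on made-up rankings. The helper name `hit_rate_at_k` and the toy data are assumptions for illustration only; the point is that hit rate can only stay the same or go up as `k` grows, even though the ranking itself has not improved.

```python
from typing import Dict, List


def hit_rate_at_k(ranked_sources: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of queries whose expected source appears in the top-k retrieved sources."""
    hits = sum(expected[query] in sources[:k] for query, sources in ranked_sources.items())
    return hits / len(ranked_sources)


# Toy rankings (made up for illustration): each list is ordered best-first.
ranked_sources = {
    "q1": ["doc-a", "doc-b", "doc-c"],
    "q2": ["doc-c", "doc-a", "doc-b"],
    "q3": ["doc-b", "doc-c", "doc-d"],
}
expected = {"q1": "doc-a", "q2": "doc-b", "q3": "doc-a"}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(ranked_sources, expected, k))
```

This is why we also record the retriever's score for the correct source: it tells us how confidently the correct context was ranked, not only whether it made the cut.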

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show retrieval evaluation

First, let's get a retriever over the vector database. We have packaged this as a utility; it is set up the same way as in notebook 1.

In [None]:
from utils import get_retriever

In [None]:
retriever = get_retriever(similarity_top_k=5)

Now let's evaluate our retriever. 

In [None]:
results = []


for entry in data:
    query = entry["question"]
    expected_source = entry['source']
    
    retrieved_nodes = retriever.retrieve(query)
    retrieved_sources = [node.metadata['source'] for node in retrieved_nodes]
    
    # If our label does not include a section, then any sections on the page should be considered a hit.
    if "#" not in expected_source:
        retrieved_sources = [source.split("#")[0] for source in retrieved_sources]
    
    if expected_source in retrieved_sources:
        is_hit = True
        score = retrieved_nodes[retrieved_sources.index(expected_source)].score
    else:
        is_hit = False
        score = 0.0
    
    result = {
        "is_hit": is_hit,
        "score": score,
        "retrieved": retrieved_sources,
        "expected": expected_source,
        "query": query,
    }
    results.append(result)

In [None]:
results[:2]

Let's see how well our retriever does. It's not great right now, but we now have a solid metric to evaluate our retriever for future optimizations.

In [None]:
total_hits = sum(result["is_hit"] for result in results)
hit_percentage = total_hits / len(results)
hit_percentage

In [None]:
average_score = sum(result["score"] for result in results) / len(results)
average_score
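Because misses are recorded with a score of 0.0, the average above mixes hit rate and similarity score into one number. A minimal sketch, using made-up entries in the same shape as `results`, that also reports the mean score over hits only:

```python
from statistics import mean

# Made-up entries in the same shape as the `results` list built above.
toy_results = [
    {"is_hit": True, "score": 0.82},
    {"is_hit": False, "score": 0.0},
    {"is_hit": True, "score": 0.74},
    {"is_hit": False, "score": 0.0},
]

hit_scores = [r["score"] for r in toy_results if r["is_hit"]]
mean_over_all = mean(r["score"] for r in toy_results)
mean_over_hits = mean(hit_scores) if hit_scores else 0.0

print(f"mean score over all queries: {mean_over_all:.2f}")
print(f"mean score over hits only:   {mean_over_hits:.2f}")
```

Tracking both numbers makes it easier to tell whether a change improved how often the correct source is found or how highly it is scored when it is found.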

## End-to-end evaluation

While we can evaluate our retriever in isolation, ultimately we want to evaluate our RAG application end-to-end, which includes the final response generated from our LLM.

To effectively evaluate our generated responses, we need "ground truth" responses. These ground truth responses can be generated by feeding the correct context to a "golden" LLM. Then, we can use an LLM to evaluate our generated responses compared to the ground truth responses.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show e2e evaluation

### Choosing a Golden LLM

To generate ground truth responses, and then to evaluate the generated responses against the ground truth, we need a "golden" LLM. But which LLM should we use? We now run into a problem: we need to determine the quality of different LLMs to choose a "golden" LLM, but doing so requires a "golden" LLM. Leaderboards on general benchmarks provide a rough indication of which LLMs perform better, but in this case, we will go with the eye test.

Let's get responses from both GPT-4 and Llama2-70B and see for ourselves which one is better.

In [None]:
from bs4 import BeautifulSoup

def fetch_text_from_source(source: str) -> str:
    """Return the text of a locally stored documentation page, or of one section if the source has an anchor."""
    # Split on the first "#" only, so the unpacking cannot fail.
    url, anchor = source.split("#", 1) if "#" in source else (source, None)
    file_path = Path("/efs/shared_storage/amog/", url.split("https://")[-1])
    with open(file_path, "r", encoding="utf-8") as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, "html.parser")
    if anchor:
        target_element = soup.find(id=anchor)
        if target_element:
            return target_element.get_text()
    # No anchor, or the anchor was not found: fall back to the full page text.
    return soup.get_text()

In [None]:
example_source = data[0]["source"]
print(example_source)

text = fetch_text_from_source(example_source)
text

In [None]:
# Content for inference
system_content = "Answer the query using the context provided."

In [None]:
import openai

def generate_response(llm_name, 
                      temperature, 
                      system_content, 
                      user_content):
    
    response = openai.ChatCompletion.create(
        model=llm_name,
        temperature=temperature,
        messages=[
                {"role": "system", "content": system_content},
                {"role": "assistant", "content": ""},
                {"role": "user", "content": user_content},
            ],)
    return response["choices"][-1]["message"]["content"]
        
            

Let's get responses from GPT-4.

In [None]:
gpt4_responses = []
llm_name = "gpt-4"
max_context_length = 8192

openai.api_base = os.environ["OPENAI_API_BASE"]
openai.api_key = os.environ["OPENAI_API_KEY"]

for entry in data[:5]:
    query = entry["question"]
    source = entry["source"]
    context = fetch_text_from_source(source)
    
    context_length = max_context_length - len(system_content)
    user_content = f"The query is {query} and the additional context is {context}"[:context_length]
    
    response = generate_response(llm_name, temperature=0.0, system_content=system_content, user_content=user_content)
    gpt4_responses.append(response)


In [None]:
gpt4_responses
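The prompt-building step is the same for every model we query, so it can be factored into a helper. A minimal sketch, where `build_user_content` is a hypothetical name and the query and context are made up. Note that the slice counts characters, whereas model context limits are counted in tokens, so this is only a rough budget and not an exact fit to the context window.

```python
def build_user_content(query: str, context: str, system_content: str, max_context_length: int) -> str:
    """Format the prompt and cut it to a character budget (an approximation of a token budget)."""
    budget = max(max_context_length - len(system_content), 0)
    user_content = f"The query is {query} and the additional context is {context}"
    return user_content[:budget]


example = build_user_content(
    query="What is a placement group?",  # made-up query
    context="x" * 10_000,  # stand-in for a long documentation page
    system_content="Answer the query using the context provided.",
    max_context_length=8192,
)
print(len(example))
```

An exact budget would require counting tokens with the tokenizer of the model being called, which differs between GPT-4 and Llama 2.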

Now let's get responses from Llama2-70B.

In [None]:
llama_responses = []
llm_name = "meta-llama/Llama-2-70b-chat-hf"
max_context_length = 4096

openai.api_base = os.environ["ANYSCALE_API_BASE"]
openai.api_key = os.environ["ANYSCALE_API_KEY"]

for entry in data[:5]:
    query = entry["question"]
    source = entry["source"]
    context = fetch_text_from_source(source)
    
    context_length = max_context_length - len(system_content)
    user_content = f"The query is {query} and the additional context is {context}"[:context_length]
    
    response = generate_response(llm_name, temperature=0.0, system_content=system_content, user_content=user_content)
    llama_responses.append(response)

In [None]:
llama_responses

Now let's compare the two.

In [None]:
BOLD = '\033[1m'
END = '\033[0m'
    
for query, gpt_response, llama_response in zip(data[:5], gpt4_responses, llama_responses):
    print(f"{BOLD}Query:{END} {query['question']}")
    print(f"{BOLD}GPT4 answer:{END} {gpt_response}")
    print(f"{BOLD}Llama2-70B answer:{END} {llama_response}")
    print("\n")

Based on these answers, we go with GPT-4 as our "golden" LLM.

### Generating our Golden Responses

Now that we have chosen which LLM to use, we can generate our reference responses. Let's generate 10 reference responses and save them to a file.

In [None]:
golden_responses = []
llm_name = "gpt-4"
max_context_length = 8192

openai.api_base = os.environ["OPENAI_API_BASE"]
openai.api_key = os.environ["OPENAI_API_KEY"]

ten_samples = data[:10]

for entry in ten_samples:
    query = entry["question"]
    source = entry["source"]
    context = fetch_text_from_source(source)
    
    context_length = max_context_length - len(system_content)
    user_content = f"The query is {query} and the additional context is {context}"[:context_length]
    
    response = generate_response(llm_name, temperature=0.0, system_content=system_content, user_content=user_content)
    golden_responses.append(response)

In [None]:
reference_dataset = [{"question": entry["question"], "source": entry["source"], "response": response} for entry, response in zip(ten_samples, golden_responses)]

In [None]:
with open("golden-responses.json", "w") as file:
    json.dump(reference_dataset, file, indent=4)

## Evaluating our Query Engine

Once we have reference responses, we can get the generated responses from our query engine. We then pass both responses to our golden LLM to evaluate the responses from our application.

In [None]:
with open("golden-responses.json", "r") as file:
    golden_responses = json.load(file)
golden_responses

In [None]:
from utils import get_query_engine

In [None]:
openai.api_base = os.environ["ANYSCALE_API_BASE"]
openai.api_key = os.environ["ANYSCALE_API_KEY"]
query_engine = get_query_engine(similarity_top_k=5)

# Store both the original response object and the response string.
rag_responses = []
rag_response_str = []

for entry in golden_responses:
    query = entry["question"]
    response = query_engine.query(query)
    rag_responses.append(response)
    rag_response_str.append(response.response)

In [None]:
rag_response_str

In [None]:
# Evaluation prompt
system_content = """
    You are given a query, a reference answer, and a candidate answer.
    You must score the candidate answer between 1 and 5 on how well it answers the query,
    using the reference answer as the golden truth.
    You must also provide your reasoning for the score.
    Return your response as valid JSON following the exact format outlined below.
    Do not add any other details.

    {"score": score,
     "reasoning": reasoning}
    """

In [None]:
evaluation_scores = []
llm_name = "gpt-4"
max_context_length = 8192

openai.api_base = os.environ["OPENAI_API_BASE"]
openai.api_key = os.environ["OPENAI_API_KEY"]

for rag_response, golden_response in zip(rag_response_str, golden_responses):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    generated_answer = rag_response
    
    context_length = max_context_length - len(system_content)
    user_content = f"The query is {query}, the reference answer is {golden_answer}, and the candidate answer is {generated_answer}"[:context_length]
    
    response = generate_response(llm_name, temperature=0.0, system_content=system_content, user_content=user_content)
    evaluation_scores.append(response)

In [None]:
evaluation_scores

Now let's parse the score and the reasoning from the LLM output.

In [None]:
import json
import re

def parse_evaluation_score(evaluation_result):
    """Extract the score and reasoning from the evaluator's raw output."""
    # Prefer strict JSON parsing, since the prompt asks for valid JSON.
    try:
        parsed = json.loads(evaluation_result)
        return {"score": float(parsed["score"]), "reasoning": str(parsed["reasoning"])}
    except (ValueError, KeyError, TypeError):
        pass

    # Fall back to regular expressions for output that is almost JSON.
    score_match = re.search(r'"score"\s*:\s*([0-9]+(?:\.[0-9]+)?)', evaluation_result)
    reasoning_match = re.search(r'"reasoning"\s*:\s*"((?:[^"\\]|\\.)*)"', evaluation_result)
    if score_match and reasoning_match:
        return {"score": float(score_match.group(1)), "reasoning": reasoning_match.group(1)}

    # Signal a parsing failure explicitly instead of returning an empty string.
    return {"score": None, "reasoning": ""}
    

In [None]:
parsed_evaluation_scores = [parse_evaluation_score(result) for result in evaluation_scores]

In [None]:
parsed_evaluation_scores

Let's save the query, both responses, and the score to a JSON file.

In [None]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": rag_response,
     "score": parsed_score["score"],
     "reasoning": parsed_score["reasoning"],
    }
    for parsed_score, rag_response, golden_response in zip(parsed_evaluation_scores, rag_response_str, golden_responses)
]

In [None]:
scores

In [None]:
with open("eval-scores.json", "w") as file:
    json.dump(scores, file, indent=4)

We can also calculate the average score.

In [None]:
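# Skip entries whose score could not be parsed, so the average stays numeric.
valid_scores = [score["score"] for score in scores if isinstance(score["score"], (int, float))]
average_scores = sum(valid_scores) / len(valid_scores) if valid_scores else float("nan")
average_scores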

## Alternative forms of evaluation

Generating reference responses and then using them for evaluation can give us a more accurate assessment of how our query engine is performing. However, this approach can be expensive. We have to make an initial pass through GPT-4 to generate the reference response, and then another pass through GPT-4 to evaluate our application's responses against the reference response.

We can explore other evaluation metrics to get a better sense of how our query engine is performing, without needing to make multiple passes through GPT-4.

### Evaluating for hallucination/relevancy

One metric we can test is relevancy, which does not require generating reference responses. With this approach, we check whether the generated response is relevant both to the query and to at least one of the retrieved sources. This indicates that our LLM is not making up a response, but is instead answering the question that was asked using the retrieved context.

This does NOT check whether the response is a correct response.

This capability is built into LlamaIndex, via the QueryResponseEvaluator. We use gpt-4 as the evaluator.

In [None]:
from llama_index.evaluation import QueryResponseEvaluator
from llama_index.llms import OpenAI
from llama_index import ServiceContext

def evaluate_response_relevance(queries: list, responses: list):
    openai.api_base = os.environ["OPENAI_API_BASE"]
    openai.api_key = os.environ["OPENAI_API_KEY"]

    llm = OpenAI(model="gpt-4", temperature=0.0)
    service_context = ServiceContext.from_defaults(llm=llm)
    evaluator = QueryResponseEvaluator(service_context=service_context)

    evals = []
    for query, response in zip(queries, responses):
        evals.append(str(evaluator.evaluate(query, response)))
    
    # Fraction of responses judged relevant (counting matches, not list length).
    return sum(val == "YES" for val in evals) / len(evals)

In [None]:
# Use the same entries that the RAG responses were generated from.
relevance_results = evaluate_response_relevance(queries=[sample["question"] for sample in golden_responses], responses=rag_responses)

In [None]:
relevance_results