# Automated Evaluation of LLM Based QA Systems

This notebook was part of the workshop **Automated Evaluation of LLM Based QA Systems** presented at Machine Learning Conference 2024. It shows the basics of designing a RAG application and how to optimize and monitor its performance with an evaluator model and a curated dataset of questions. A brief description of your goals is given at the start of every practical session. The notebook is supplemented by a PowerPoint presentation, which is also available in this repository.

### Notebook requirements

To run the notebook, you need access to the OpenAI API through Azure for an LLM (the notebook was designed with GPT-3.5 in mind) and for the embedding model _text-embedding-ada-002_, which is required because the prepared RAG database already contains embeddings calculated with it. We also provide the notebook `prepare_data.ipynb` for creating the RAG database from scratch, so you can use a different embedding model.

If you want to use different models, replace the LangChain classes `AzureChatOpenAI` and `AzureOpenAIEmbeddings` with their equivalent counterparts. Don't forget that changing the embedding function requires rebuilding the RAG database!

### Authors
MlPrague 2024, 22.04.2024 9:00 - 12:30, Workshop lecturers:
  - Ondřej Finke, O2 [Dataclair.ai](https://dataclair.ai/), ondrej.finke@o2.cz
  - Alexandr Vendl, O2 [Dataclair.ai](https://dataclair.ai/), alexandr.vendl@o2.cz
  - Marek Matiáš, O2 [Dataclair.ai](https://dataclair.ai/), marek.matias@o2.cz


Practical sessions:
  - Session 1: Getting to know the RAG application
  - Session 2: Manually evaluating the RAG application
  - Session 3: Training the evaluator model
  - Session 4: Optimizing the RAG application

In [None]:
# CELL 1 - IMPORTS AND SETTINGS
# imports
import random
import time
from functools import wraps

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.runnables import (
    ConfigurableField,
    Runnable,
    RunnableLambda,
    RunnablePassthrough,
)
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from sklearn.metrics import classification_report, confusion_matrix


# supporting functions
def make_autopct(values):
    """format of numbers in the pie chart"""

    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct * total / 100.0))
        return f"{pct:.1f}% ({val:d})"

    return my_autopct


def retry_on_exception(
    func,
    max_retries: int = 3,
    initial_delay: float = 1.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
    exceptions: tuple = (Exception,),
    on_timeout: Exception = Exception("Connection failed, please try again."),
):
    """
    General decorator function to retry the decorated function on exception.
    It retries the function up to `max_retries` times.
    """

    @wraps(func)
    def wrapper(*args, **kwargs):
        retries = 0
        delay = initial_delay
        while retries < max_retries:
            try:
                result = func(*args, **kwargs)
                return result
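            except exceptions as e:  # only retry on the configured exception types
                print(f"Exception occurred: {e}")
                retries += 1
                if retries < max_retries:
                    # Sleep, then grow the delay exponentially (with optional jitter)
                    time.sleep(delay)
                    delay *= exponential_base * (1 + jitter * random.random())
        # All retries failed: surface the configured error instead of returning []
        raise on_timeout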

    return wrapper


# SETTINGS FOR THE LLM MODELS
# Add your api keys and endpoints
api_key = ""
api_endpoint = ""
api_version = ""
ans_model_type = "gpt-35-turbo"             # LLM for answering
emb_model_type = "text-embedding-ada-002"   # embedding model. ADA-2 was used in db

---

# Session 1: Getting to know the RAG application

During the first practical session, your goal is to get familiar with the presented implementation of a RAG application built on top of the GPT-3.5 model. The RAG chain is built as a LangChain Runnable which, when called, does the following:
1) Retrieve documents from vector database based on user question
2) Select documents according to the search type and hyperparameters
3) Generate response based on the found documents
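
The three steps above can be sketched without any external service. The following is only a toy illustration: word overlap stands in for vector search, `fake_llm` stands in for the GPT-3.5 call, and all function names and documents are made up for this example.

```python
from typing import Callable

# Toy in-memory "database"; the notebook itself uses a Chroma vector store.
DOCS = [
    "Petr Pavel won the 2023 Czech presidential election.",
    "The 2023 Polish parliamentary election was held on 15 October 2023.",
    "Emmanuel Macron won the 2022 French presidential election.",
]


def retrieve(question: str, docs: list[str]) -> list[tuple[float, str]]:
    """Step 1: score every document by word overlap with the question."""
    q_words = set(question.lower().replace("?", "").split())
    scored = []
    for doc in docs:
        d_words = set(doc.lower().replace(".", "").split())
        scored.append((len(q_words & d_words) / len(q_words), doc))
    return sorted(scored, reverse=True)


def select(scored: list[tuple[float, str]], num_docs: int) -> str:
    """Step 2: keep the top documents and merge them into one context string."""
    top = [doc for _, doc in scored[:num_docs]]
    return "".join(f"Context {i + 1}:\n{doc}\n\n" for i, doc in enumerate(top))


def generate(question: str, context: str, llm: Callable[[str], str]) -> str:
    """Step 3: fill the prompt and hand it to the (here: fake) LLM."""
    prompt = f"Provided articles:\n{context}\nUsers question: {question}"
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    # Placeholder for a real model call: echoes the first context document.
    return prompt.split("Context 1:\n")[1].split("\n")[0]


question = "Who won the Czech presidential election?"
context = select(retrieve(question, DOCS), num_docs=1)
answer = generate(question, context, fake_llm)
print(answer)
```

The real chain follows the same shape, with the retrieval and generation steps delegated to the vector database and the LLM.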

**Information about the data**

- 567 documents present in the database from 22 Wikipedia articles
- Articles were scraped in January 2024; please follow these links for the January revision of each article
  - [2024 United States presidential election](https://en.wikipedia.org/w/index.php?title=2024_United_States_presidential_election&oldid=1200006968)
  - [2024 Taiwanese presidential election](https://en.wikipedia.org/w/index.php?title=2024_Taiwanese_presidential_election&oldid=1199911440)
  - [2023 Turkish presidential election](https://en.wikipedia.org/w/index.php?title=2023_Turkish_presidential_election&oldid=1198592810)
  - [2023 Turkish parliamentary election](https://en.wikipedia.org/w/index.php?title=2023_Turkish_parliamentary_election&oldid=1199995550)
  - [2023 Slovak parliamentary election](https://en.wikipedia.org/w/index.php?title=2023_Slovak_parliamentary_election&oldid=1199500182)
  - [2023 Singaporean presidential election](https://en.wikipedia.org/w/index.php?title=2023_Singaporean_presidential_election&oldid=1196788263)
  - [2023 Serbian parliamentary election](https://en.wikipedia.org/w/index.php?title=2023_Serbian_parliamentary_election&oldid=1195591108)
  - [2023 Polish parliamentary election](https://en.wikipedia.org/w/index.php?title=2023_Polish_parliamentary_election&oldid=1196759103)
  - [2023 Finnish parliamentary election](https://en.wikipedia.org/w/index.php?title=2023_Finnish_parliamentary_election&oldid=1200416522)
  - [2023 Estonian parliamentary election](https://en.wikipedia.org/w/index.php?title=2023_Estonian_parliamentary_election&oldid=1199658758)
  - [2023 Egyptian presidential election](https://en.wikipedia.org/w/index.php?title=2023_Egyptian_presidential_election&oldid=1199675829)
  - [2023 Czech presidential election](https://en.wikipedia.org/w/index.php?title=2023_Czech_presidential_election&oldid=1200526766)
  - [2023 Bulgarian parliamentary election](https://en.wikipedia.org/w/index.php?title=2023_Bulgarian_parliamentary_election&oldid=1194904336)
  - [2022 Turkmenistan presidential election](https://en.wikipedia.org/wiki/2022_Turkmenistan_presidential_election)
  - [2022 South Korean presidential election](https://en.wikipedia.org/w/index.php?title=2022_South_Korean_presidential_election&oldid=1189688011)
  - [2022 Slovenian parliamentary election](https://en.wikipedia.org/wiki/2022_Slovenian_parliamentary_election)
  - [2022 Maltese general election](https://en.wikipedia.org/wiki/2022_Maltese_general_election)
  - [2022 Latvian parliamentary election](https://en.wikipedia.org/w/index.php?title=2022_Latvian_parliamentary_election&oldid=1197667621)
  - [2022 Hungarian parliamentary election](https://en.wikipedia.org/w/index.php?title=2022_Hungarian_parliamentary_election&oldid=1196216171)
  - [2022 French presidential election](https://en.wikipedia.org/w/index.php?title=2022_French_presidential_election&oldid=1200325554)
  - [2022 Bulgarian parliamentary election](https://en.wikipedia.org/w/index.php?title=2022_Bulgarian_parliamentary_election&oldid=1193509245)
  - [2022 Austrian presidential election](https://en.wikipedia.org/w/index.php?title=2022_Austrian_presidential_election&oldid=1184476052)

In [None]:
# CELL 2
# PROMPT TEMPLATE FOR THE RAG APPLICATION
# template needs to include {context} and {question}: the context is filled with the
# retrieved documents and the question with the user's question when the chain is invoked

rag_prompt_template = """You are an AI assistant trained for political sciences. You provide an answer to a question solely based on the provided context.
Answer in full sentences. At the end of your answer, write name of the article you used as a source in square brackets, like so: [name of the article].
If you don't have a context for asked question, reply with the following message: "I'm sorry, but I cannot find answer to this question in my data.".
Provided articles:
{context}

Users question: {question}
"""


def get_rag_chain() -> Runnable:
    """RAG langchain chain with following steps

    1) Retrieve documents based on user question
    2) Select top n documents
    3) Generate response based on the found documents

    When invoking the chain, the following rules have to be followed:
    Keys for input dictionary:
        - question: str - question you want to ask about the data
        - search: dict - dictionary defining the search type

    Output dictionary:
        - answer: str - answer from the LLM
        - documents: list[Document] - list of documents retrieved from the database
        - context: str - Documents passed into the prompt in the form of merged str
    """

    # define the retriever
    # embedding model
    embeddings = AzureOpenAIEmbeddings(
        api_key=api_key,
        azure_endpoint=api_endpoint,
        api_version=api_version,
        model=emb_model_type,
        tiktoken_model_name="cl100k_base",
    )
    # load chroma db
    db = Chroma(persist_directory="./data", embedding_function=embeddings)

    # supporting functions
    def retrieve_docs(dictionary, db=db) -> list[Document]:
        # retrieves documents based on search type
        # maximal marginal relevance
        if dictionary["search"]["type"] == "mmr":
            docs = db.max_marginal_relevance_search(
                dictionary["question"],
                k=dictionary["search"]["num_docs"],
                fetch_k=dictionary["search"]["num_docs"] * 4,
                lambda_mult=dictionary["search"]["lambda_mult"],
            )
        # vector similarity
        elif dictionary["search"]["type"] == "vector":
            docs = db.similarity_search_with_relevance_scores(
                dictionary["question"], k=dictionary["search"]["num_docs"]
            )
            docs = [
                document[0]
                for document in docs
                if document[1] > dictionary["search"]["similarity_treshold"]
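            ]
        else:
            # fail early instead of hitting an unbound `docs` variable below
            raise ValueError(f"Unknown search type: {dictionary['search']['type']}")
        return docs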

    def select_docs(dictionary):
        # merge retrieved documents into the string used for the prompt
        # context length is num_docs or less, depending on the threshold in vector search
        context = ""
        for i in range(len(dictionary["documents"])):
            context += f"Context {i+1}:\n{dictionary['documents'][i].page_content}\n\n"

        if context == "":
            print("WARNING: LLM didn't receive any context")
        return context

    # define the prompt and the model used for the RAG application
    prompt = ChatPromptTemplate.from_template(rag_prompt_template)
    model = AzureChatOpenAI(
        openai_api_key=api_key,
        azure_endpoint=api_endpoint,
        azure_deployment=ans_model_type,
        openai_api_version=api_version,
        temperature=0,
        tiktoken_model_name="cl100k_base",
    ).configurable_fields(
        temperature=ConfigurableField(
            id="temp",
            name="LLM Temperature",
        )
    )

    # define the full chain
    chain = (
        RunnablePassthrough.assign(documents=RunnableLambda(retrieve_docs))
        | RunnablePassthrough.assign(context=RunnableLambda(select_docs))
        | RunnablePassthrough.assign(answer=prompt | model | StrOutputParser())
    )

    return chain


# prepare the chain
rag_chain = get_rag_chain()


# function for calling the chain with retries on exception
@retry_on_exception
def rag_chain_with_retry(input_dict, temperature=0):
    return rag_chain.with_config(configurable={"temp": temperature}).invoke(input_dict)

### Running the RAG

The RAG chain is called using the `rag_chain_with_retry` function. Its inputs are the model `temperature` (default = 0) and `input_dict`, a dictionary with the keys `question` and `search`, where `search` holds another dictionary defining which type of search is used. The dictionaries for the two search types are defined as the variables _search_mmr_ and _search_vec_.

Output of the chain is a dictionary with keys:
- `documents` - list of documents found by the search (langchain Document class)
- `context` - documents converted to strings for the llm prompt.
- `answer` - string with answer
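
To build intuition for the `mmr` search type and its `lambda_mult` setting, here is a minimal pure-Python sketch of greedy maximal marginal relevance. It is not LangChain's implementation (for instance, the `fetch_k` pre-filtering is not modelled), and the two-dimensional vectors are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query, doc_vectors, k, lambda_mult):
    """Greedy MMR: return the indices of up to k documents.

    score = lambda_mult * sim(query, doc)
            - (1 - lambda_mult) * max sim(doc, already selected docs)
    """
    selected: list[int] = []
    candidates = list(range(len(doc_vectors)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, doc_vectors[i])
            redundancy = max(
                (cosine(doc_vectors[i], doc_vectors[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.1],   # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of document 0
    [0.6, 0.8],   # 2: less relevant, but different from document 0
]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # [0, 2]: favours diversity
```

With `lambda_mult=1` the near-duplicate is picked because only relevance counts; lowering the value penalizes documents that resemble those already selected.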

In [None]:
# RAG RUNTIME
# question you want to ask the RAG app
question = "Who is the president of Czech Republic?"

# search settings
# maximum marginal relevance
# optimizes for similarity to query and diversity among selected documents.
# lambda_mult <0,1> - degree of diversity between the documents, 1 = minimum diversity
search_mmr = {"type": "mmr", "num_docs": 4, "lambda_mult": 1}

# vector similarity
# returns num_docs closest documents based on vector similarity
# similarity_treshold <0,1> - documents with similarity below this value are ignored
search_vec = {"type": "vector", "num_docs": 5, "similarity_treshold": 0}

# input dictionary for the RAG chain
rag_input_dict = {"question": question, "search": search_mmr}

# Call the chain
rag_answer = rag_chain_with_retry(temperature=0, input_dict=rag_input_dict)

# print the answer
print("")
print(f"Question: {question}")
print(f"Answer: {rag_answer['answer']}")
print("-------")
print(f"Number of documents used: {len(rag_answer['documents'])}")

---

# Session 2: Manual RAG optimization

In the second session, try to optimize the RAG performance manually. Look at the subset of provided questions and ground truth answers, generate new answers using your RAG app, and try to change the RAG settings to achieve better results. You can change the prompt in the variable `rag_prompt_template` above, the search type and its hyperparameters, the number of documents provided in the prompt, and the model temperature. All of these have some effect on the answer quality.
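
One way to keep track of the settings you try is to enumerate them up front. The sketch below builds a small grid of `mmr` configurations; the candidate values are arbitrary assumptions, not recommendations from the workshop.

```python
from itertools import product

# Hypothetical candidate values; choose ranges that suit your own experiments.
num_docs_options = [2, 4]
lambda_mult_options = [0.0, 0.5, 1.0]
temperature_options = [0, 0.7]

configs = []
for num_docs, lambda_mult, temperature in product(
    num_docs_options, lambda_mult_options, temperature_options
):
    configs.append(
        {
            "search": {
                "type": "mmr",
                "num_docs": num_docs,
                "lambda_mult": lambda_mult,
            },
            "temperature": temperature,
        }
    )

print(len(configs))  # 2 * 3 * 2 = 12 combinations
print(configs[0])
```

Each entry holds a `search` dictionary in the format expected by the RAG chain together with the temperature to use for that run.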

In [None]:
# load 8 questions and pretty-print the answers
mdf = pd.read_csv("./data/validation_rag_questions.csv", sep=",")
mdf = mdf.sample(8, random_state=42)

# answer these questions with rag
search_mmr = {"type": "mmr", "num_docs": 4, "lambda_mult": 0}
search_vec = {"type": "vector", "num_docs": 5, "similarity_treshold": 0}

mdf["rag_answer"] = "empty"
for index, row in mdf.iterrows():
    try:
        # run the chain
        rag_input_dict = {"question": row["question"], "search": search_mmr}
        rag_answer = rag_chain_with_retry(temperature=0, input_dict=rag_input_dict)
        # save the results
        mdf.loc[index, "rag_answer"] = rag_answer["answer"]
        print(f"Row {index} answered")
    except Exception as err:
        print(f"WARNING: Row {index}, Exception raised as {err}")

In [None]:
# update pandas settings to show full answers
pd.set_option("display.max_colwidth", None)
# show the answers
mdf[["question", "correct_answer", "rag_answer"]]

In [None]:
# reset the pandas settings
pd.reset_option("display.max_colwidth")

---

# Session 3: Evaluator model

During the third session, your goal is to train an evaluator model that predicts whether the provided `answer` is the same as the `ground_truth`. This model can be LLM based (as in the next cells) or built with classic machine learning methods (for example, logistic regression on cosine distances between the ground truth, generated answer, and question). For training the evaluator, you are provided with a dataset which includes:
  - question
  - human_answer - this is the answer created by the human and is considered correct
  - rag_answer - answer generated to the same question by LLM (already generated)
  - same_answer - bool marked by human stating if the human_answer and rag_answer are the same

Your goal is to optimize the **prompt** of the evaluator model so that, when provided with question, human_answer, and rag_answer, it returns the same category as the human did in `same_answer`.
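
For comparison with the LLM-based approach, a classic baseline could look roughly like the sketch below. It is a deliberate simplification: bag-of-words counts stand in for real embeddings, and a fixed threshold (an arbitrary assumption) stands in for a trained logistic regression.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector; a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def same_answer(human_answer: str, rag_answer: str, threshold: float = 0.5) -> bool:
    """Predict the `same_answer` label; the threshold is an arbitrary assumption."""
    return cosine_similarity(bow(human_answer), bow(rag_answer)) >= threshold


print(same_answer("Petr Pavel is the president.", "The president is Petr Pavel."))
print(same_answer("Petr Pavel is the president.", "I cannot find the answer."))
```

A purely lexical baseline like this cannot reliably separate two answers that differ only in the key fact (for example, a different name in an otherwise identical sentence), which is one reason to consider an LLM evaluator.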

**Let's first get acquainted with the chain for the evaluator model**. The model expects `question`, `ground_truth`, and `generated_answer` as inputs into the prompt. The `format_instructions` variable is handled automatically by LangChain. The output of the model is a dictionary with the key `correct_answer` holding a bool value.

In [None]:
# PROMPT TEMPLATE FOR THE EVALUATOR
# template needs to include {question}, {ground_truth}, {generated_answer}, and {format_instructions};
# the format instructions are required by the output parser
compare_prompt_template = """You are given a question, perfect answer and a candidate answer.
Your task is to determine if the candidate answer is almost as good as the perfect answer.

question:
{question}

perfect answer (ground truth):
{ground_truth}

candidate answer:
{generated_answer}

Format instructions:
{format_instructions}"""


def get_compare_chain() -> Runnable:
    """Langchain Runnable for comparing two answers

    Requirements for input dictionary keys:
        - question: str - question user asked
        - ground_truth: str - Baseline correct answer
        - generated_answer: str - Answer which we want to compare to the correct answer

    """

    # define the output parser
    class Candidate(BaseModel):
        correct_answer: bool = Field(
            description="is the candidate answer same as perfect answer?"
        )

    parser = JsonOutputParser(pydantic_object=Candidate)

    # define prompt and model for the comparison
    prompt = PromptTemplate(
        template=compare_prompt_template,
        input_variables=["question", "ground_truth", "generated_answer"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )
    model = AzureChatOpenAI(
        openai_api_key=api_key,
        azure_endpoint=api_endpoint,
        azure_deployment=ans_model_type,
        openai_api_version=api_version,
        temperature=0,
        tiktoken_model_name="cl100k_base",
    )

    # chain
    chain = prompt | model | parser
    return chain


# prepare the chain
compare_chain = get_compare_chain()


@retry_on_exception
def compare_chain_with_retry(input_dict):
    return compare_chain.invoke(input_dict)

In [None]:
# Example of compare runtime
# compare chain dictionary
compare_input_dict = {
    "question": "Who is the president of Czech Republic?",
    "ground_truth": "Petr Pavel is the president of Czech Republic.",
    "generated_answer": "Miloš Zeman is the current president of Czechia.",
}
# run the chain
compare_answer = compare_chain_with_retry(compare_input_dict)

# print answer
print(f"Question: {compare_input_dict['question']}")
print(f"Ground truth: {compare_input_dict['ground_truth']}")
print(f"Generated answer: {compare_input_dict['generated_answer']}")
print("-------")
print(
    f"Is the generated answer same as ground truth? {compare_answer['correct_answer']}"
)

### Training the evaluator model

Training the evaluator model in this case simply means optimizing the prompt in the variable `compare_prompt_template` so that the evaluator ideally matches the labels assigned by the human evaluator in the provided dataset.
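
Agreement between the evaluator and the human labels can be summarized with a few standard metrics. The sketch below computes them by hand on made-up labels so the arithmetic is visible; the notebook's own report relies on scikit-learn instead.

```python
def agreement_report(human: list[bool], evaluator: list[bool]) -> dict[str, float]:
    """Accuracy, precision and recall of evaluator labels against human labels."""
    pairs = list(zip(human, evaluator))
    tp = sum(h and e for h, e in pairs)            # both say "same"
    tn = sum((not h) and (not e) for h, e in pairs)  # both say "different"
    fp = sum((not h) and e for h, e in pairs)      # evaluator too lenient
    fn = sum(h and (not e) for h, e in pairs)      # evaluator too strict
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels, for illustration only.
human_labels = [True, True, False, False, True]
evaluator_labels = [True, False, False, True, True]

report = agreement_report(human_labels, evaluator_labels)
print(report)
```

Low precision suggests the prompt makes the evaluator too lenient, while low recall suggests it is too strict; either observation can guide the next prompt revision.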

In [None]:
# loading data
edf = pd.read_csv("./data/validation_evaluator_model.csv", sep=",")
edf.head(2)

In [None]:
# running the LLM evaluator
edf["ai_same_answer"] = None
for index, row in edf.iterrows():
    try:
        # get answer
        compare_input_dict = {
            "question": row["question"],
            "ground_truth": row["human_answer"],
            "generated_answer": row["rag_answer"],
        }
        # run the chain
        compare_answer = compare_chain_with_retry(compare_input_dict)
        # save the results
        edf.loc[index, "ai_same_answer"] = compare_answer["correct_answer"]
    except Exception as err:
        print(f"Row {index}, Exception raised as {err}")

In [None]:
# comparing the evaluator results to baseline
def rate_evaluator(edf: pd.DataFrame):
    """Function for showing report about the evaluator"""
    edf["ai_same_answer"] = edf["ai_same_answer"].astype(bool)
    # fixed label order, so both classes appear even if one is never predicted
    used_labels = [True, False]
    cm = confusion_matrix(
        edf["same_answer"],
        edf["ai_same_answer"],
        labels=used_labels,
    )
    # define figures
    fig, ax = plt.subplots(figsize=(9, 4), nrows=1, ncols=2)
    fig.suptitle("Evaluator score\n(Comparison evaluator generated labels to human)")
    values = [
        len(edf[edf["same_answer"] == edf["ai_same_answer"]]),
        len(edf[edf["same_answer"] != edf["ai_same_answer"]]),
    ]
    # pie chart with the share of matching and mismatching labels
    ax[0].pie(
        values,
        labels=["Match", "Miss"],
        autopct=make_autopct(values),
        startangle=45,
        colors=["forestgreen", "tab:red"],
    )
    ax[0].axis("equal")

    ax[1].imshow(cm, cmap="summer")
    ax[1].set_xticks(np.arange(len(used_labels)))
    ax[1].set_yticks(np.arange(len(used_labels)))
    ax[1].set_xticklabels(used_labels, rotation=90)
    ax[1].set_yticklabels(used_labels)
    ax[1].set_xlabel("Evaluator predicted label")
    ax[1].set_ylabel("Human label")

    for i in range(len(used_labels)):
        for j in range(len(used_labels)):
            ax[1].text(j, i, cm[i, j], ha="center", va="center", color="k")

    plt.tight_layout()
    plt.show()

    print(classification_report(edf["same_answer"], edf["ai_same_answer"]))


human_acc = len(edf[edf["same_answer"] == True]) / len(edf)
ai_acc = len(edf[edf["ai_same_answer"] == True]) / len(edf)
print(f"Human expert evaluated accuracy as: {human_acc:0,.2f}")
print(f"Automated evaluator evaluated accuracy as: {ai_acc:0,.2f}")
print(f"Absolute error of evaluator: {abs(ai_acc-human_acc):0,.2f}")
print("\nMore detailed evaluator accuracy: ")
rate_evaluator(edf)

---

# Session 4: Optimizing the RAG application


In the last session, your goal is to optimize the RAG application using your trained evaluator model.
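
Conceptually, the optimization is a loop: answer every question with a given configuration, let the evaluator judge each answer, and compare the resulting accuracy across configurations. The sketch below shows that loop with stand-in functions (`fake_rag` and `fake_judge` are hypothetical placeholders) so that it runs without API access.

```python
from typing import Callable


def score_config(
    questions: list[dict],
    answer_fn: Callable[[str, dict], str],
    judge_fn: Callable[[str, str, str], bool],
    search: dict,
) -> float:
    """Fraction of questions the evaluator judges as answered correctly."""
    correct = 0
    for item in questions:
        answer = answer_fn(item["question"], search)
        correct += judge_fn(item["question"], item["correct_answer"], answer)
    return correct / len(questions)


# Stand-ins so the sketch runs offline; swap in the real chains when available.
def fake_rag(question: str, search: dict) -> str:
    # Pretend the app only answers well when it sees at least 3 documents.
    if search["num_docs"] >= 3:
        return "Petr Pavel is the president."
    return "I'm sorry, but I cannot find answer to this question in my data."


def fake_judge(question: str, ground_truth: str, answer: str) -> bool:
    return ground_truth.lower() in answer.lower()


questions = [
    {
        "question": "Who is the president of Czech Republic?",
        "correct_answer": "Petr Pavel",
    }
]
candidates = [
    {"type": "vector", "num_docs": 1, "similarity_treshold": 0},
    {"type": "vector", "num_docs": 5, "similarity_treshold": 0},
]

scores = [score_config(questions, fake_rag, fake_judge, s) for s in candidates]
best = candidates[scores.index(max(scores))]
print(scores)
print(best)
```

Passing the answering and judging functions as arguments keeps the scoring loop independent of any particular model, which makes it easy to compare configurations side by side.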

### Running the evaluation

In [None]:
df = pd.read_csv("./data/validation_rag_questions.csv", sep=",")
# search types
# maximum marginal relevance
search_mmr = {"type": "mmr", "num_docs": 1, "lambda_mult": 1}
# vector similarity
search_vec = {"type": "vector", "num_docs": 5, "similarity_treshold": 0}

df["rag_answer"] = "empty"
df["is_rag_correct"] = None

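for index, row in df.iterrows():
    # reset so the except block never reports a stale or undefined evaluator answer
    compare_answer = None
    try: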
        # get answer
        rag_input_dict = {"question": row["question"], "search": search_vec}
        rag_answer = rag_chain_with_retry(temperature=0, input_dict=rag_input_dict)

        # compare chain
        compare_input_dict = {
            "question": row["question"],
            "ground_truth": row["correct_answer"],
            "generated_answer": rag_answer["answer"],
        }
        # run the chain
        compare_answer = compare_chain_with_retry(compare_input_dict)

        # save the results
        df.loc[index, "rag_answer"] = rag_answer["answer"]
        df.loc[index, "is_rag_correct"] = compare_answer["correct_answer"]

        print(f"Row {index} compared")
    except Exception as err:
        print(
            f"WARNING: Row {index}, Exception raised as {err}, Evaluator answer: {compare_answer}"
        )

In [None]:
df.head(2)

In [None]:
# General RAG performance
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 5))
fig.suptitle("Automated evaluation")
ax[0].set_title("Was RAG correct?")
# fix the label order so that True is always green and False always red
rag_counts = df["is_rag_correct"].value_counts().reindex([True, False], fill_value=0)
ax[0].pie(
    rag_counts,
    labels=rag_counts.index,
    autopct=make_autopct(rag_counts),
    startangle=45,
    colors=["forestgreen", "tab:red"],
)
cdf = df.dropna(subset=["article"])
matched = cdf.apply(
    lambda row: row["rag_answer"].lower().find(row["article"].lower()) != -1, axis=1
).sum()
values = [matched, len(cdf) - matched]
ax[1].set_title("Did RAG cite the correct article?")
ax[1].pie(
    values,
    labels=["True", "False"],
    autopct=make_autopct(values),
    startangle=45,
    colors=["forestgreen", "tab:red"],
)
plt.tight_layout()
plt.show()

In [None]:
# RAG performance per question type
ncols = 3
nrows = 2
fig, ax = plt.subplots(ncols=ncols, nrows=nrows, figsize=(12, 10))
fig.suptitle("Performance per question type")
# go through question types
for col in range(ncols):
    for row in range(nrows):
        # `q_type` avoids shadowing the built-in `type`
        q_type = df["Type"].value_counts().index[col * nrows + row]
        sdf = df[df["Type"] == q_type]
        ax[row, col].pie(
            sdf["is_rag_correct"].value_counts(),
            labels=sdf["is_rag_correct"].value_counts().index,
            autopct=make_autopct(sdf["is_rag_correct"].value_counts()),
            startangle=45,
            colors=["forestgreen", "tab:red"],
        )
        ax[row, col].set_title(f"Questions: {q_type}")

plt.tight_layout()
plt.show()

---