# Step 4: Evaluate the application

## 1. Run the app on our evaluation dataset

In [1]:
%run 01-llm-app-setup.ipynb

In [2]:
import pandas as pd

gen_dataset = pd.read_csv('generated_qa.csv')

In [3]:
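# Placeholder columns for the generated answer and the retrieved contexts
gen_dataset["answer"] = None
gen_dataset["contexts"] = None

# Only the first two questions are run here to keep the notebook fast
for idx, item in gen_dataset.iloc[:2].iterrows():
    result = rag_chain.invoke(item.question)
    gen_dataset.at[idx, "answer"] = result["answer"]
    gen_dataset.at[idx, "contexts"] = result["context"]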


In [4]:
gen_dataset.iloc[:2]

| | question | ground_truth | ground_truth_context | answer | contexts |
|---|---|---|---|---|---|
| 0 | What is the core controller of the autonomous ... | LLM (large language model) | LLM Powered Autonomous Agents\n    \nDate: Jun... | The core controller of the autonomous agents d... | [LLM Powered Autonomous Agents\n    \nDate: Ju... |
| 1 | What is considered as utilizing the short-term... | In-context learning, as seen in Prompt Enginee... | Memory\n\nShort-term memory: I would consider ... | Utilizing the short-term memory of the model i... | [Memory\n\nShort-term memory: I would consider... |


## 2. Run evaluation

This step might take a few minutes. For each row, we call the evaluation functions loaded from `03-metrics-definition.ipynb` and also run Ragas on it.
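The custom metrics themselves live in `03-metrics-definition.ipynb` and are not shown in this step. As a rough illustration of what retrieval checks of this kind could compute, here is a minimal sketch. The function names, the exact-match comparison, the use of plain strings for contexts, and the `-1` "not found" value are all assumptions made for this example, not the definitions used in this project.

```python
from typing import List


def context_correctness_sketch(ground_truth_context: str, contexts: List[str]) -> bool:
    """Hypothetical check: was the ground-truth chunk retrieved at all?"""
    return ground_truth_context in contexts


def ground_truth_context_rank_sketch(ground_truth_context: str, contexts: List[str]) -> int:
    """Hypothetical rank: 0-based position of the ground-truth chunk, or -1 if missing."""
    try:
        return contexts.index(ground_truth_context)
    except ValueError:
        # Assumed sentinel for "the retriever did not return the chunk"
        return -1


# Toy retrieved contexts, invented for this example
retrieved = ["chunk about memory", "chunk about planning"]

print(context_correctness_sketch("chunk about memory", retrieved))
print(ground_truth_context_rank_sketch("chunk about planning", retrieved))
print(ground_truth_context_rank_sketch("chunk about tools", retrieved))
```

Exact string matching is the simplest choice and only works when the ground-truth context is stored as the same chunk the retriever returns; a softer comparison would be needed otherwise.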

In [5]:
%run 03-metrics-definition.ipynb

In [6]:
results_lst = []

for idx, row in gen_dataset.iloc[:2].iterrows():  # Subset to the rows answered above to keep it fast
    # Custom metrics comparing the retrieved contexts with the ground-truth context
    custom_eval_results = {
        "context_correctness": context_correctness(row["ground_truth_context"], row["contexts"]),
        "ground_truth_context_rank": ground_truth_context_rank(row["ground_truth_context"], row["contexts"]),
        "context_rougel_score": context_rougel_score(row["ground_truth_context"], row["contexts"]),
    }

    # Ragas metrics for the same row; the | dict merge requires Python 3.9+
    ragas_eval_results = evaluate_w_ragas(row)
    results_lst.append(custom_eval_results | ragas_eval_results)

results_df = pd.DataFrame(results_lst)




In [7]:
results_df

| | context_correctness | ground_truth_context_rank | context_rougel_score | context_precision | faithfulness | answer_correctness |
|---|---|---|---|---|---|---|
| 0 | True | 0 | 1.0 | 1.0 | 1.0 | 0.589765 |
| 1 | True | 0 | 1.0 | 1.0 | 1.0 | 0.469287 |


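And that's it! Aggregating each metric over the rows (for example, taking the mean of each column) gives a single number per metric that summarizes how the application performs. Keep in mind that we only evaluated the first two rows here to keep things fast; run the loops over the full dataset for a representative evaluation.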
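As a minimal sketch of that aggregation, the cell below rebuilds `results_df` from the two rows shown above so that it runs on its own, then takes the column means. It assumes every metric column is numeric or boolean; in the notebook you would skip the reconstruction and call `.mean()` on the existing `results_df`.

```python
import pandas as pd

# Values copied from the results_df output above (first two rows only)
results_df = pd.DataFrame(
    {
        "context_correctness": [True, True],
        "ground_truth_context_rank": [0, 0],
        "context_rougel_score": [1.0, 1.0],
        "context_precision": [1.0, 1.0],
        "faithfulness": [1.0, 1.0],
        "answer_correctness": [0.589765, 0.469287],
    }
)

# Cast everything to float so the boolean column averages to a rate between 0 and 1
summary = results_df.astype(float).mean()
print(summary.round(4))
```

The mean is a reasonable default for these columns, but for a rank-style metric a median or a count of misses may be easier to interpret on a larger dataset.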