### The purpose of this notebook is to evaluate different RAG approaches on the Toyota Tacoma user manual dataset.

A set of manually curated questions, answers, and relevant page numbers has been created.
The RAG answers are compared to the curated answer for each question to evaluate each method's effectiveness.

In [1]:
from dotenv import load_dotenv
import json
import numpy as np
import pandas as pd
from pathlib import Path
import os
import sys
from tqdm.auto import tqdm
import warnings

warnings.filterwarnings('ignore')

CD = globals()['_dh'][0]
sys.path.append(str(Path(CD).parent / 'rag_flask'))

from data_pipeline import DataDownloadPreprocess
import rag_query

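# Wrap CD in Path, as above, so this also works when IPython stores the directory as a string.
load_dotenv(dotenv_path=Path(CD).parent / '.env')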

True

In [2]:
prompt_options=[
    'Answer the QUESTION based on the CONTEXT from the user manual. Respond with the page numbers from the CONTEXT.',
    '''Answer the QUESTION based on the CONTEXT from the user manual. \
If the provided CONTEXT does not provide the information needed to answer the QUESTION, then just respond with the page numbers from the CONTEXT \
and tell the user to look at those page numbers.''',
    'Answer the QUESTION soley based on the CONTEXT from the user manual. Your response must include the relevant page numbers from the CONTEXT.'
]

In [3]:
q = pd.read_csv('tacoma_manual_rag - question_answer.csv', dtype=str)

In [15]:
model_answers = list()

In [4]:
models = ['gemma2', 'gpt-4o-mini']

In [5]:
prompt_options

['Answer the QUESTION based on the CONTEXT from the user manual. Respond with the page numbers from the CONTEXT.',
 'Answer the QUESTION based on the CONTEXT from the user manual. If the provided CONTEXT does not provide the information needed to answer the QUESTION, then just respond with the page numbers from the CONTEXT and tell the user to look at those page numbers.',
 'Answer the QUESTION soley based on the CONTEXT from the user manual. Your response must include the relevant page numbers from the CONTEXT.']

In [28]:
# NOTE: `config` is not defined in the cells shown here; it is assumed to be a
# dict of settings loaded earlier in the session.
for model in models:
    print('Model:', model)
    rq = rag_query.RagQuery(
        host='localhost',
        llm_model=model,
        eval_model='gemma2',
        index_name=config['index_name'],
        embedding_model_name=config['sentence_transformer_model_name'],
        search_type=config['search_type'],
        num_es_results=config['num_es_results'],
        num_es_candidates=config['num_es_candidates'],
        vehicle_name='Toyota Tacoma 2020',
        similarity_threshold=config['es_knn_similarity_threshold'],
    )

    for prompt in prompt_options:
        print('prompt:\n', prompt, end='\n\n')

        for question in tqdm(q['Question'].values.tolist()):
            # Answers are graded in a later cell, so evaluation is switched off here.
            answer = rq.rag(query=question, evaluate=False, prompt_str=prompt)
            model_answers.append([model, prompt, question, answer])


Model: gemma2
prompt:
 Answer the QUESTION based on the CONTEXT from the user manual. Respond with the page numbers from the CONTEXT.



100%|███████████████████████████████████████████████████████████████████████████████| 26/26 [1:35:14<00:00, 219.77s/it]


prompt:
 Answer the QUESTION based on the CONTEXT from the user manual. If the provided CONTEXT does not provide the information needed to answer the QUESTION, then just respond with the page numbers from the CONTEXT and tell the user to look at those page numbers.



100%|███████████████████████████████████████████████████████████████████████████████| 26/26 [1:36:59<00:00, 223.83s/it]


prompt:
 Answer the QUESTION soley based on the CONTEXT from the user manual. Your response must include the relevant page numbers from the CONTEXT.



100%|███████████████████████████████████████████████████████████████████████████████| 26/26 [1:26:19<00:00, 199.23s/it]


In [29]:
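model_answer_df = pd.DataFrame(model_answers, columns=['model', 'prompt', 'question', 'answer_dict'])

# Expand each answer dict into its own columns. The keys are read from the last
# collected answer rather than from the leftover loop variable `answer`.
for k in model_answers[-1][3].keys():
    model_answer_df[k] = model_answer_df['answer_dict'].map(lambda x: x.get(k, None))

model_answer_df.to_csv('model_answers.csv', index=False)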

In [19]:
model_answer_df = pd.read_csv('model_answers.csv')
model_answer_df = model_answer_df.merge(
    right=q.rename(columns={'Question': 'question', 'Answer': 'ground_truth_answer'}),
    on='question')

In [21]:
# From DataTalksClub LLM Zoomcamp Module 4:
# https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/offline-rag-evaluation.ipynb
prompt_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Original Answer: {answer_orig}
Generated Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

rq = rag_query.RagQuery(
        llm_model='gpt-4o-mini',
        eval_model='gpt-4o-mini',
        index_name=os.getenv('INDEX_NAME'),
        embedding_model_name=os.getenv('SENTENCE_TRANSFORMER_MODEL_NAME'),
        search_type='knn',
        num_neighbors=os.getenv('NUM_NEIGHBORS'),
        num_candidates=os.getenv('NUM_CANDIDATES'),
        host='localhost'
    )

2024-10-01 18:22:22,316 - INFO - Use pytorch device_name: cpu
2024-10-01 18:22:22,317 - INFO - Load pretrained SentenceTransformer: multi-qa-mpnet-base-dot-v1


In [22]:
model_answer_df['eval_prompt'] = model_answer_df.apply(
    lambda x: prompt_template.format(
        answer_orig=x['ground_truth_answer'],
        question=x['question'],
        answer_llm=x['answer'],
    ),
    axis=1,
)

In [None]:
tqdm.pandas()
model_answer_df['eval_result'] = model_answer_df['eval_prompt'].progress_apply(lambda x: rq.llm(prompt=x, model='gpt-4o-mini'))

In [41]:
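eval_results = model_answer_df['eval_result'].values.tolist()
# Each result is indexed with [0] to take its first element, which holds the response text.
eval_results = [i[0] for i in eval_results]
# The evaluator sometimes emits a typographic closing quote, which is not valid JSON.
eval_results = [json.loads(i.replace('”', '"')) for i in eval_results]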

model_answer_df = model_answer_df.reset_index(drop=True).merge(
    right=pd.DataFrame(eval_results),
    left_index=True,
    right_index=True)
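
The parsing step above fails on the first reply that is not clean JSON. A minimal sketch of a more tolerant parser follows. The helper name `parse_eval_json` and the fallback record are assumptions for illustration and are not part of `rag_query`.

```python
import json


def parse_eval_json(raw):
    """Parse an evaluator reply into a dict, tolerating common formatting slips."""
    # Normalise typographic double quotes to plain ASCII quotes.
    text = raw.strip().replace('\u201c', '"').replace('\u201d', '"')
    # Keep only the outermost {...} span, dropping any text around it.
    start, end = text.find('{'), text.rfind('}')
    fallback = {'Relevance': None, 'Explanation': 'unparsable: ' + raw[:80]}
    if start == -1 or end <= start:
        return fallback
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return fallback


# Hypothetical reply with surrounding chatter and typographic quotes.
sample = 'Evaluation: {\u201cRelevance\u201d: \u201cRELEVANT\u201d, \u201cExplanation\u201d: \u201cMatches.\u201d}'
print(parse_eval_json(sample))
```

Returning a fallback record instead of raising keeps the row count aligned with `model_answer_df`, so the index-based merge still lines up. Unparsable replies can then be found by filtering on a missing `Relevance`.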

In [112]:
print(f'Results for model {models[0]}')

model_answer_df.loc[model_answer_df['model'] == models[0]].groupby(
    ['prompt', 'Relevance'], as_index=False).size().pivot(columns='Relevance', index='prompt', values='size')

Results for model gemma2


| prompt | NON_RELEVANT | PARTLY_RELEVANT | RELEVANT |
|---|---|---|---|
| Answer the QUESTION based on the CONTEXT from the user manual. If the provided CONTEXT does not provide the information needed to answer the QUESTION, then just respond with the page numbers from the CONTEXT and tell the user to look at those page numbers. | 20 | 5 | 1 |
| Answer the QUESTION based on the CONTEXT from the user manual. Respond with the page numbers from the CONTEXT. | 17 | 8 | 1 |
| Answer the QUESTION soley based on the CONTEXT from the user manual. Your response must include the relevant page numbers from the CONTEXT. | 15 | 10 | 1 |


In [111]:
print(f'Results for model {models[1]}')

model_answer_df.loc[model_answer_df['model'] == models[1]].groupby(
    ['prompt', 'Relevance'], as_index=False).size().pivot(columns='Relevance', index='prompt', values='size')

Results for model gpt-4o-mini


| prompt | NON_RELEVANT | PARTLY_RELEVANT | RELEVANT |
|---|---|---|---|
| Answer the QUESTION based on the CONTEXT from the user manual. If the provided CONTEXT does not provide the information needed to answer the QUESTION, then just respond with the page numbers from the CONTEXT and tell the user to look at those page numbers. | 2.0 | 9.0 | 15.0 |
| Answer the QUESTION based on the CONTEXT from the user manual. Respond with the page numbers from the CONTEXT. | NaN | 7.0 | 19.0 |
| Answer the QUESTION soley based on the CONTEXT from the user manual. Your response must include the relevant page numbers from the CONTEXT. | 1.0 | 4.0 | 21.0 |


In [114]:
# Summary statistics for model response times.
model_answer_df.groupby('model', as_index=False)['response_time'].describe()

| | model | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 0 | gemma2 | 78.0 | 213.839553 | 30.574609 | 149.295184 | 192.681688 | 210.900636 | 231.422424 | 296.395015 |
| 1 | gpt-4o-mini | 78.0 | 1.977341 | 1.14265 | 0.584887 | 1.232641 | 1.597632 | 2.296391 | 6.319951 |


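Both models performed best with the third prompt option, which produced the fewest NON_RELEVANT answers for Gemma2 (15 of 26) and the most RELEVANT answers for gpt-4o-mini (21 of 26). The Gemma2 results are disappointing, as I would prefer to run this application without an external paid service: only 1 of 26 answers was rated RELEVANT for any prompt, and the mean response time was about 214 seconds versus about 2 seconds for gpt-4o-mini.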
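
To make the comparison easier to read, the counts can be converted into shares per prompt. The sketch below re-enters the counts from the two result tables by hand, with the missing gpt-4o-mini cell taken as 0. The `prompt_option` numbering (1 to 3, following the order of `prompt_options`) is a labeling assumption added for brevity.

```python
import pandas as pd

labels = ['NON_RELEVANT', 'PARTLY_RELEVANT', 'RELEVANT']

# Counts copied from the result tables; prompts are numbered in `prompt_options` order.
counts = pd.DataFrame(
    [
        ['gemma2', 1, 17, 8, 1],
        ['gemma2', 2, 20, 5, 1],
        ['gemma2', 3, 15, 10, 1],
        ['gpt-4o-mini', 1, 0, 7, 19],
        ['gpt-4o-mini', 2, 2, 9, 15],
        ['gpt-4o-mini', 3, 1, 4, 21],
    ],
    columns=['model', 'prompt_option'] + labels,
).set_index(['model', 'prompt_option'])

# Divide each row by its total (26 questions) to get the share of each label.
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares['RELEVANT'])
```

On the notebook's own data, the same shares could be computed from `model_answer_df` with `pd.crosstab(..., normalize='index')`.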