### The purpose of this notebook is to evaluate different retrieval methods for elasticsearch on the tacoma manual dataset.

A set of manually curated questions, answers, and relevant page numbers has been created.
Page numbers returned by elasticsearch will be compared to known page numbers for each question to evaluate each method's effectivenes.

In [10]:
import warnings
warnings.filterwarnings('ignore')

from dotenv import load_dotenv
import numpy as np
import pandas as pd
from pathlib import Path
import os
import sys
from tqdm.auto import tqdm

CD = globals()['_dh'][0]
sys.path.append(str(Path(CD).parent / 'rag_flask'))

from data_pipeline import DataDownloadPreprocess
from rag_query import RagQuery

load_dotenv(dotenv_path=CD.parent / '.env')

True

In [11]:
models = {'multi-qa-distilbert-cos-v1': '',
          'multi-qa-mpnet-base-dot-v1': 'mpnet_'}

In [12]:
knn_similarity_options = [
    'cosine',
    'l2_norm'
]

In [13]:
# run datapipeline for both model options, creating different elasticsearch indices.
# I already have data downloaded and preprocessed for inserting into Elastic,
# so this just creates and fills a few different indexes - one for each combination
# of model type and similarity option.
dp = DataDownloadPreprocess(host='localhost', output_path=CD.parent / 'rag_flask' / 'data')

index_names = list()

for sim_opt in knn_similarity_options:
    for model_name, model_prefix in models.items():
        index_name = f'{model_prefix}{sim_opt}_manual_index'
        print(index_name)
        dp.config['es_knn_similarity_measure'] = sim_opt
        index_names.append(index_name)
        for v_ind, v in enumerate(dp.vehicles):
            if v_ind == 0:
                elastic_delete_create_index=True
            else:
                elastic_delete_create_index=False
    
            print('Moving to elastic')
            dp.elasticsearch_embed(
                input_file_name=v['prefix'] + model_prefix + 'output_embedding.parquet',
                index_name=index_name,
                delete_create_index=elastic_delete_create_index
            )

cosine_manual_index
Moving to elastic
using index name: cosine_manual_index
737 records added
Moving to elastic
using index name: cosine_manual_index
609 records added
mpnet_cosine_manual_index
Moving to elastic
using index name: mpnet_cosine_manual_index
737 records added
Moving to elastic
using index name: mpnet_cosine_manual_index
609 records added
l2_norm_manual_index
Moving to elastic
using index name: l2_norm_manual_index
737 records added
Moving to elastic
using index name: l2_norm_manual_index
609 records added
mpnet_l2_norm_manual_index
Moving to elastic
using index name: mpnet_l2_norm_manual_index
737 records added
Moving to elastic
using index name: mpnet_l2_norm_manual_index
609 records added


In [47]:
# okay, now there are four different indexes. For each question, evaluate the performance with each index, along with text search,
#and a few different knn neighbor settings.
q = pd.read_csv('tacoma_manual_rag - question_page_num.csv', dtype={'Question': str, 'Page': int})
q = q.groupby('Question', as_index=False)['Page'].apply(list)

In [44]:
eval_params_df = pd.read_csv('tacoma_manual_rag - retrieval_evaluation_params.csv',
                          dtype={'search_type': str, 'index_name': str, 'model_name': str,
                                 'num_es_results': 'Int64', 'num_es_candidates': 'Int64',
                                'similarity': 'Int64', 'sim_opt': str, 'model_prefix': str})

eval_params_df = eval_params_df.replace({np.nan: None})

cosine = eval_params_df['sim_opt'] == 'cosine'
# Cosine similarity must be as fraction.
eval_params_df.loc[cosine, 'similarity'] = eval_params_df.loc[cosine, 'similarity'] / 100

eval_params = eval_params_df.to_dict('records')

In [64]:
eval_params[4]

{'search_type': 'knn',
 'index_name': 'cosine_manual_index',
 'model_name': 'multi-qa-distilbert-cos-v1',
 'num_es_results': 5,
 'num_es_candidates': 25,
 'similarity': 0.2,
 'sim_opt': 'cosine',
 'model_prefix': None}

In [71]:
ep = eval_params[4]
q = 'Does the tuck need to be washed after driving on salty roads?'
index_name = ep.get('index_name')
    
rq = RagQuery(
host='localhost',
#llm_model=ep.get('model_name'),
#eval_model='gemma2',
# Text search uses the default name 'manual_index'.
# If you do not have this index set, then any of the ones created above will work.
index_name=index_name,
embedding_model_name=ep.get('model_name'),
search_type=ep.get('search_type'),
num_es_results=ep.get('num_es_results'),
num_es_candidates=ep.get('num_es_candidates'),
vehicle_name='Toyota Tacoma 2020',
similarity_threshold=None#ep.get('similarity')
)
result = rq.user_query_to_es(input_text=question)
#q_results.append([ep_ind, q_ind, [i['page_ind'] for i in result]])
len(result)

BadRequestError: BadRequestError(400, 'x_content_parse_exception', "[1:16244] [knn] similarity doesn't support values of type: VALUE_NULL")

In [16]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

In [48]:
q_results = list()
for ep_ind, ep in enumerate(tqdm(eval_params)):
    for q_ind, question in enumerate(q['Question'].values):
        
        index_name = ep.get('index_name')
        if index_name == 'manual_index':
            index_name = 'cosine_manual_index'
            
        rq = RagQuery(
        host='localhost',
        #llm_model=ep.get('model_name'),
        #eval_model='gemma2',
        # Text search uses the default name 'manual_index'.
        # If you do not have this index set, then any of the ones created above will work.
        index_name=index_name,
        embedding_model_name=ep.get('model_name'),
        search_type=ep.get('search_type'),
        num_es_results=ep.get('num_es_results'),
        num_es_candidates=ep.get('num_es_candidates'),
        vehicle_name='Toyota Tacoma 2020',
        similarity_threshold=ep.get('similarity')
        )
        result = rq.user_query_to_es(input_text=question)
        q_results.append([ep_ind, q_ind, [i['page_ind'] for i in result]])

100%|██████████████████████████████████████████████████████████████████████████████| 484/484 [1:58:46<00:00, 14.72s/it]


In [49]:
q_results_df = pd.DataFrame(q_results, columns=['params_ind', 'question_ind', 'result_pages'])

In [50]:
q_results_df.to_csv('es_evaluation_question_answers.csv', index=False)

In [52]:
q_results_df = q.merge(
    right=q_results_df,
    left_index=True,
    right_on='question_ind'
)

In [53]:
def hit_rate(ground_truth, result_pages):
    """Calculate hit rate. Any ground truth page found to be matching is considered a hit."""
    
    total = len(ground_truth)
    hits = 0
    
    for i in range(total):
        for gt in ground_truth[i]:
            if gt in result_pages[i]:
                hits += 1
                break
    return hits / total

def hit_rate_multi_ground_truth(ground_truth, result_pages):
    """Calculate hit rate. The number of ground truth pages is taken into account."""
    
    total = len(ground_truth)
    hits = list()
    
    for i in range(total):
        hit_count = 0
        gt_count = len(ground_truth[i])
        for gt in ground_truth[i]:
            if gt in result_pages[i]:
                hit_count += 1
        hits.append(hit_count / gt_count)
    return sum(hits) / total

In [54]:
def mean_reciprocal_rank(ground_truth, result_pages):
    """Calculate mean reciprocal rank. Any ground truth page found to be matching is considered a hit."""
    
    total = len(ground_truth)
    hit_ranks = list()
    
    for i in range(total):
        for gt in ground_truth[i]:
            for rp_count, rp in enumerate(result_pages[i]):
                if gt == rp:
                    hit_ranks.append(1 / (rp_count + 1))
                    break
    return sum(hit_ranks) / total

def mean_reciprocal_rank_multi_ground_truth(ground_truth, result_pages):
    """Calculate mean reciprocal rank. The number of ground truth pages is taken into account."""
    
    total = len(ground_truth)
    total_hit_ranks = list()
    
    for i in range(total):
        gt_count = len(ground_truth[i])
        hit_ranks = list()
        for gt in ground_truth[i]:
            for rp_count, rp in enumerate(result_pages[i]):
                if gt == rp:
                    hit_ranks.append((1 / (rp_count + 1)) / gt_count)

        total_hit_ranks.append(sum(hit_ranks))
                    
    return sum(total_hit_ranks) / total

In [55]:
q_results_df.to_clipboard(index=False)

In [72]:
q_results_hit_rate = q_results_df.groupby('params_ind')[['Page', 'result_pages']].apply(
    lambda x: hit_rate(ground_truth=x['Page'].values.tolist(), result_pages=x['result_pages'].values.tolist())
)

q_results_hit_rate_multi = q_results_df.groupby('params_ind')[['Page', 'result_pages']].apply(
    lambda x: hit_rate_multi_ground_truth(ground_truth=x['Page'].values.tolist(), result_pages=x['result_pages'].values.tolist())
)

q_results_mrr = q_results_df.groupby('params_ind')[['Page', 'result_pages']].apply(
    lambda x: mean_reciprocal_rank(ground_truth=x['Page'].values.tolist(), result_pages=x['result_pages'].values.tolist())
)

q_results_mrr = q_results_df.groupby('params_ind')[['Page', 'result_pages']].apply(
    lambda x: mean_reciprocal_rank_multi_ground_truth(ground_truth=x['Page'].values.tolist(), result_pages=x['result_pages'].values.tolist())
)

In [73]:
eval_params_df['hit_rate'] = q_results_hit_rate
eval_params_df['hit_rate_multi'] = q_results_hit_rate_multi
eval_params_df['mean_reciprocal_rank'] = q_results_mrr
eval_params_df['mean_reciprocal_rank_multi_ground_truth'] = q_results_mrr

In [74]:
eval_params_df.sort_values(['mean_reciprocal_rank', 'hit_rate'], ascending=False).head(10)

Unnamed: 0,search_type,index_name,model_name,num_es_results,num_es_candidates,similarity,sim_opt,model_prefix,hit_rate,hit_rate_multi,mean_reciprocal_rank,mean_reciprocal_rank_multi_ground_truth
3,string,,,20,,,,,0.692308,0.628205,0.667409,0.667409
2,string,,,15,,,,,0.615385,0.551282,0.650115,0.650115
1,string,,,10,,,,,0.615385,0.541667,0.617332,0.617332
34,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,25.0,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
36,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,25.0,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
74,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,50.0,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
76,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,50.0,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
114,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,100.0,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
116,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,100.0,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
134,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,15,25.0,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447


In [75]:
eval_params_df.sort_values(['hit_rate', 'mean_reciprocal_rank'], ascending=False).head(10)

Unnamed: 0,search_type,index_name,model_name,num_es_results,num_es_candidates,similarity,sim_opt,model_prefix,hit_rate,hit_rate_multi,mean_reciprocal_rank,mean_reciprocal_rank_multi_ground_truth
34,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,25,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
36,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,25,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
74,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,50,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
76,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,50,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
114,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,100,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
116,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,10,100,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
134,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,15,25,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
136,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,15,25,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
154,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,20,25,0.2,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447
156,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,20,25,0.3,cosine,mpnet_,0.961538,0.951923,0.587447,0.587447


In [76]:
eval_params_df.sort_values(['num_es_results', 'hit_rate', 'mean_reciprocal_rank'], ascending=[True, False, False]).head(5)

Unnamed: 0,search_type,index_name,model_name,num_es_results,num_es_candidates,similarity,sim_opt,model_prefix,hit_rate,hit_rate_multi,mean_reciprocal_rank,mean_reciprocal_rank_multi_ground_truth
14,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,25,0.2,cosine,mpnet_,0.923077,0.88141,0.578793,0.578793
16,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,25,0.3,cosine,mpnet_,0.923077,0.88141,0.578793,0.578793
18,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,25,0.4,cosine,mpnet_,0.923077,0.88141,0.578793,0.578793
54,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,50,0.2,cosine,mpnet_,0.923077,0.88141,0.578793,0.578793
56,knn,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,50,0.3,cosine,mpnet_,0.923077,0.88141,0.578793,0.578793


In [77]:
eval_params_df.loc[eval_params_df['search_type'] == 'hybrid'].sort_values(
    ['num_es_results', 'hit_rate', 'mean_reciprocal_rank'], ascending=[True, False, False]).head(5)

Unnamed: 0,search_type,index_name,model_name,num_es_results,num_es_candidates,similarity,sim_opt,model_prefix,hit_rate,hit_rate_multi,mean_reciprocal_rank,mean_reciprocal_rank_multi_ground_truth
254,hybrid,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,25,0.2,cosine,mpnet_,0.730769,0.666667,0.420032,0.420032
256,hybrid,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,25,0.3,cosine,mpnet_,0.730769,0.666667,0.420032,0.420032
258,hybrid,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,25,0.4,cosine,mpnet_,0.730769,0.666667,0.420032,0.420032
294,hybrid,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,50,0.2,cosine,mpnet_,0.730769,0.666667,0.420032,0.420032
296,hybrid,mpnet_cosine_manual_index,multi-qa-mpnet-base-dot-v1,5,50,0.3,cosine,mpnet_,0.730769,0.666667,0.420032,0.420032


Because it is more important to include relevant context than to be concerned about order when results are used as RAG context, hit rate will be considered the more important measure than mean reciprocal rank. Including irrelevent context only adds to cost and processing time, so the small gains in hit rate from increasing the number of Elasticsearch results are not worth returning more than 5 results. The tested similarity thresholds did not have an impact on results.

When creating the ground truth dataset, I found that multiple pages were needed for full answer context, but it was not easily doable to chunk text by a method besides page number. Given that pages already have a degree of logical chunking, I decided to review results for multiple pages, and include these as separate measures- Hit Rate Multi-Ground Truth, and Mean Reciprocal Rank Multi-Ground Truth. These measures test for the presence and rank, respectively, and weight by the number of ground truth pages for each question.

The top result for Hit Rate with the lowest number of results and candidates will be selected.
In this case, the chosen parameters are:
- multi-qa-mpnet-base-dot-v1 as the embedding model
- 5 results (size or k parameter)
- 25 candidates
- KNN search type
- Cosine similarity
- 0.2 Similarity threshold