# Map Evaluation Data to Original Dataset

As part of evaluating our RAG system, we need to map the relevant chunks in the evaluation set back to documents in the original dataset. Without this mapping we cannot tell whether a retrieved document is the correct one, so we cannot compute ranking metrics such as Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP). In this notebook we assess the strategy used to match each relevant chunk in the evaluation set to its source document in the original dataset.
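
To make the dependency on the mapping concrete, here is a minimal sketch of MRR. It assumes each question has exactly one relevant document id and that the retriever returns a ranked list of document ids. The function names and the toy ids are illustrative and do not come from the project code.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the first occurrence of relevant_id, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, relevant_id) pairs, one pair per question."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in results]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: the document ids are made up for illustration.
toy_results = [
    (["a", "b", "c"], "a"),  # relevant document ranked first  -> 1.0
    (["a", "b", "c"], "b"),  # relevant document ranked second -> 0.5
    (["a", "b", "c"], "z"),  # relevant document not retrieved -> 0.0
]
print(mean_reciprocal_rank(toy_results))  # 0.5
```

The `relevant_id` in each pair is exactly what the mapping in this notebook has to provide: the id of the document that contains the relevant chunk.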

## Load Data

First, we load the original dataset and a sample of the evaluation set.

In [None]:
import pandas as pd

df = pd.read_csv('data/Cleantech Media Dataset/cleantech_media_dataset_v2_2024-02-23.csv')
df.head()

In [None]:
df_eval_subset = pd.read_csv('data/Cleantech Media Dataset/cleantech_rag_evaluation_data_2024-02-23.csv')
# Drop exact duplicate rows, then draw a random sample of 10 rows
df_eval_subset = df_eval_subset.drop_duplicates().sample(10)
df_eval_subset.head()

## Preprocess Data

To associate each chunk with a specific document, each document needs a unique identifier. The preprocessing class creates this identifier by hashing the content of the document. It also handles other preprocessing steps, such as removing duplicates and concatenating the content of each document. More information can be found in the [preprocessing notebook](preprocessing.ipynb).
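
The idea of a content-derived identifier can be sketched as follows. This is an illustration only: the hash function used by `Preprocessor` is not shown in this notebook, so SHA-256 and the helper name `content_hash` are assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hex digest of the UTF-8 encoded text; identical text yields an identical id."""
    # Assumption: SHA-256 stands in for whatever hash the Preprocessor actually uses.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Toy documents, made up for illustration.
docs = ["Solar capacity grew.", "Wind farms expand.", "Solar capacity grew."]
ids = [content_hash(doc) for doc in docs]

print(ids[0] == ids[2])  # True: same content, same id
print(len(set(ids)))     # 2
```

Because the id depends only on the content, two documents with identical content receive the same id. That is why duplicates have to be removed before the id can serve as a unique key.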

In [None]:
from src.preprocessing.preprocessor import Preprocessor

default_df = Preprocessor(df, verbose=True, explode=False, concatenate_contents=True).preprocess()

After preprocessing, the dataframe has a new column `id`, which holds the unique identifier for each document. The cell below counts duplicated ids; if every id is unique, it returns 0.

In [None]:
default_df['id'].duplicated().sum()

## Mapping Evaluation Data to Original Dataset

The mapping between relevant chunks from the evaluation set and documents in the dataset is done in the `EvaluationSetPreprocessor` class. This class uses a fuzzy matching strategy to find the best-matching document for each relevant chunk. The id of the best match is stored in the `best_match_id` column, and the similarity score between the relevant chunk and the best match is stored in the `best_match_score` column.

In [None]:
from src.preprocessing.eval_preprocessor import EvaluationSetPreprocessor

eval_processor = EvaluationSetPreprocessor(default_df, df_eval_subset, verbose=True)
eval_df = eval_processor.preprocess()
eval_df.head()
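
The internals of `EvaluationSetPreprocessor` are not shown here, so the following is only a sketch of how fuzzy matching of a chunk against documents can work. It uses `difflib.SequenceMatcher` from the standard library. The helper names `partial_score` and `find_best_match`, the 0 to 1 score scale, and the toy documents are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def partial_score(chunk: str, document: str) -> float:
    """Share of the chunk's characters covered by matching blocks in the document (0 to 1)."""
    if not chunk:
        return 0.0
    matcher = SequenceMatcher(None, chunk, document, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(chunk)


def find_best_match(chunk: str, documents: dict):
    """documents maps document id -> content. Returns (best_id, best_score)."""
    if not documents:
        return None, 0.0
    scores = {doc_id: partial_score(chunk, content) for doc_id, content in documents.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Toy documents, made up for illustration.
toy_documents = {
    "doc-1": "In 2023 offshore wind capacity doubled in Europe.",
    "doc-2": "Battery prices fell.",
}
best_id, best_score = find_best_match("offshore wind capacity doubled", toy_documents)
print(best_id, best_score)  # doc-1 1.0
```

In this toy case the chunk appears verbatim in `doc-1`, so the score is exactly 1.0. A chunk with small differences, such as changed whitespace or punctuation, would score slightly below 1.0 and could still be matched. `SequenceMatcher` is slow on long texts, so this sketch is meant to convey the idea rather than to scale to the full dataset.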

## Assess Matching Strategy

Now we manually assess the matching strategy. The `highlight_matches` function takes the evaluation dataframe and the preprocessed dataframe as input. For the first two evaluation rows it displays the relevant chunk, followed by the content of the best-matching document with the words it shares with the chunk highlighted. The `min_words` parameter sets the minimum number of consecutive matching words a run must contain to be highlighted. The highlighting is not perfect, but it gives a good idea of how well the matching strategy works and helps us locate the relevant chunk within the document content.

In [None]:
import re
from IPython.display import display, HTML


def highlight_matches(eval_df: pd.DataFrame, default_df: pd.DataFrame, min_words: int = 2):
    """Display each relevant chunk and its best-matching document with shared word runs highlighted."""

    def highlight_text(text, chunk, min_words):
        # Unique words of the chunk, escaped so that punctuation is matched literally
        words = [re.escape(word) for word in dict.fromkeys(chunk.split())]
        if not words:
            return text
        word_pattern = r'\b(?:' + '|'.join(words) + r')\b'
        # A run is one or more chunk words separated by whitespace
        run_pattern = rf"{word_pattern}(?:\s+{word_pattern})*"

        def highlighter(run):
            # Only mark runs of at least min_words words
            if len(run.group(0).split()) >= min_words:
                return f"<mark>{run.group(0)}</mark>"
            return run.group(0)

        return re.sub(run_pattern, highlighter, text, flags=re.IGNORECASE)

    # Only the first two evaluation rows are displayed
    for _, row in eval_df[:2].iterrows():
        best_match = default_df[default_df['id'] == row['best_match_id']]

        if not best_match.empty:
            best_match_content = best_match['content'].values[0]
            highlighted_content = highlight_text(best_match_content, row['relevant_chunk'], min_words)
        else:
            highlighted_content = "No match found"

        display(HTML(f"<b>Relevant chunk:</b> {row['relevant_chunk']}"))
        print("\n")
        display(HTML(f"<b>Best Match:</b> {highlighted_content}"))
        print("\n\n" * 2)


highlight_matches(eval_df, default_df)
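
Manual inspection is most useful for the weakest matches. The sketch below selects rows whose `best_match_score` falls below a threshold and returns them with the lowest score first. The helper name `rows_needing_review`, the threshold of 90, and the 0 to 100 score scale in the toy data are assumptions; the actual scale of `best_match_score` depends on the matcher used.

```python
def rows_needing_review(records, threshold):
    """Return the records with best_match_score below threshold, lowest score first."""
    flagged = [row for row in records if row["best_match_score"] < threshold]
    return sorted(flagged, key=lambda row: row["best_match_score"])


# Toy records, made up for illustration; the 0 to 100 scale is an assumption.
toy_records = [
    {"best_match_id": "doc-1", "best_match_score": 100},
    {"best_match_id": "doc-2", "best_match_score": 62},
    {"best_match_id": "doc-3", "best_match_score": 88},
]
review = rows_needing_review(toy_records, threshold=90)
print([row["best_match_id"] for row in review])  # ['doc-2', 'doc-3']
```

With the real data, the records could be obtained via `eval_df.to_dict("records")`.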