# Overview
The purpose of this notebook is to demonstrate how to evaluate a rerank model. In the previous notebook, we saw that as we increased k, the number of chunks returned, we got better recall@k scores. This is fairly intuitive, but it also poses a problem: if we increase k too much, won't that mean higher input token costs and a greater chance that the LLM glosses over the answer because it's receiving too much information?

This is one of the key benefits of adding a rerank model to your IR system. Once vector search has narrowed the list of possible context chunks down to fewer than 10, we can afford to run a separate model that outputs a relevance score for each one. We can then drop chunks that score below a threshold and use the scores to "rerank" the rest, so that the LLM gets only the most relevant results from our vector search.
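The filter-then-rerank idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation used later in this notebook: the function name `filter_and_rerank`, the threshold value, and the scores are all hypothetical.

```python
from typing import List, Tuple

def filter_and_rerank(
    scored_chunks: List[Tuple[str, float]], threshold: float, top_n: int
) -> List[Tuple[str, float]]:
    """Drop chunks scoring below `threshold`, then return the `top_n` highest scoring."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]

# Hypothetical relevance scores, listed in the order the vector search returned them.
candidates = [
    ("chunk about billing", 0.12),
    ("chunk about password resets", 0.91),
    ("chunk about shipping", 0.05),
    ("chunk about account lockouts", 0.67),
]

top_chunks = filter_and_rerank(candidates, threshold=0.10, top_n=2)
print(top_chunks)
```

Only the chunks that survive both the threshold and the `top_n` cut-off would be passed to the LLM, which is what keeps the input token count down.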

That's exactly what we'll do in this notebook. We'll incorporate a rerank model (an open-source cross-encoder loaded through `sentence-transformers`) and validate whether it improves our core metrics.

# Background
When evaluating a rerank model, it's crucial to focus on metrics that reflect both the quality and relevance of the reranked results, as well as the model's ability to improve upon the initial ranking. Normalized Discounted Cumulative Gain (NDCG) is often considered one of the most important metrics, as it accounts for the position of relevant items in the ranked list and can handle graded relevance judgments. Mean Average Precision (MAP) is another valuable metric that provides a single figure of merit for the overall ranking quality across multiple queries. For scenarios where the top results are particularly important, metrics like Precision@k and Mean Reciprocal Rank (MRR) can offer insights into the model's performance at specific cut-off points.
 
In addition to these standard information retrieval metrics, it's beneficial to consider comparative metrics that directly measure the improvement over the base ranking. This can include the percentage of queries improved, the average change in relevant document positions, or a paired statistical test (such as a Wilcoxon signed-rank test) comparing the reranked results to the original ranking. It's also important to evaluate the model's efficiency, considering factors like inference time and computational resources required, especially for applications with strict latency requirements. Ultimately, the choice of metrics should align with the specific goals of your reranking task and the priorities of your system, balancing between relevance, user satisfaction, and operational constraints.
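As a rough sketch of the comparative metrics mentioned above, the snippet below computes the percentage of queries improved or worsened and the average change in the position of the first relevant document. The function names and the toy rankings are hypothetical, and it uses only the standard library; a paired significance test would need an extra dependency and is left out.

```python
from typing import Dict, List, Optional, Sequence, Set

def first_relevant_rank(relevant: Set[str], ranked: Sequence[str]) -> Optional[int]:
    """1-based rank of the first relevant document, or None if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return rank
    return None

def compare_rankings(
    relevant_sets: List[Set[str]],
    base_rankings: List[List[str]],
    reranked_rankings: List[List[str]],
) -> Dict[str, float]:
    improved = worsened = 0
    rank_gains = []
    for relevant, base, reranked in zip(relevant_sets, base_rankings, reranked_rankings):
        before = first_relevant_rank(relevant, base)
        after = first_relevant_rank(relevant, reranked)
        if before is None or after is None:
            continue  # No relevant document retrieved, so the rank change is undefined.
        rank_gains.append(before - after)  # Positive means the document moved up.
        if after < before:
            improved += 1
        elif after > before:
            worsened += 1
    n = len(relevant_sets)
    return {
        "pct_improved": 100 * improved / n if n else 0.0,
        "pct_worsened": 100 * worsened / n if n else 0.0,
        "mean_rank_gain": sum(rank_gains) / len(rank_gains) if rank_gains else 0.0,
    }

# Toy data: four queries, each with one relevant document.
relevant_sets = [{"a"}, {"b"}, {"c"}, {"d"}]
base_rankings = [["x", "a", "y"], ["b", "x", "y"], ["x", "y", "c"], ["x", "y", "z"]]
reranked_rankings = [["a", "x", "y"], ["x", "b", "y"], ["c", "x", "y"], ["x", "y", "z"]]

summary = compare_rankings(relevant_sets, base_rankings, reranked_rankings)
print(summary)
```

Reporting the share of worsened queries alongside the share improved helps show whether an average gain hides regressions on some queries.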

## How to Evaluate
To evaluate a rerank model using these metrics, start by preparing a test set consisting of queries, their corresponding initial rankings, and human-annotated relevance judgments for each query-document pair. Run your rerank model on the initial rankings to produce a new set of reranked results, then calculate the chosen metrics for both the initial and reranked results.

For example, to compute NDCG@k, calculate the Discounted Cumulative Gain (DCG) over the top k results in the order the system ranked them, then normalize it by the Ideal DCG, which is the DCG of an ideal ordering in which documents are sorted by their true relevance. For MAP, compute the precision at every position where a relevant document is retrieved, average those values over the number of relevant documents to get the average precision for the query, then take the mean across all queries.

To assess improvement, compare the metric scores between the initial and reranked results. You can use statistical tests like paired t-tests or Wilcoxon signed-rank tests to determine whether the improvements are significant. It's also valuable to analyze per-query performance to identify where the rerank model excels or struggles. Finally, consider evaluating on different subsets of your data to ensure consistent performance across various query types or document categories.
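To make the NDCG@k and average precision calculations concrete, here is a small worked sketch on a single toy query. The document IDs and relevance judgments are made up for illustration, and the helper names are not the ones used by the `IRMetricsCalculator` class later in this notebook.

```python
import math
from typing import Dict, List

def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (position i is discounted by log2(i + 2))."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def average_precision(ranked_ids: List[str], relevance: Dict[str, float]) -> float:
    """Sum of precision at each relevant hit, divided by the number of relevant documents."""
    relevant = {doc_id for doc_id, gain in relevance.items() if gain > 0}
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical judgments: d1 and d2 are relevant, everything else is not.
relevance = {"d1": 1.0, "d2": 1.0}
initial = ["d3", "d1", "d4", "d2"]
reranked = ["d1", "d2", "d3", "d4"]

for name, ranking in [("initial", initial), ("reranked", reranked)]:
    print(name, round(ndcg_at_k(ranking, relevance, k=3), 4), average_precision(ranking, relevance))
```

In this toy case the reranked ordering places both relevant documents first, so both metrics reach 1.0, while the initial ordering scores lower on each.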



# What Will We Do? 
* We'll take the best run from our previous model, run it through a ReRank model, and then recalculate the results.

**Let's get started!**

# Get Validation Dataset

In [None]:
import pandas as pd

def get_clean_eval_dataset():
    EVAL_PATH = '../data/eval-datasets/2_rerank_validation.csv'
    eval_df = pd.read_csv(EVAL_PATH)

    # Clean up the DataFrame
    eval_df = eval_df.rename(columns=lambda x: x.strip())  # Remove any leading/trailing whitespace from column names
    eval_df = eval_df.drop(columns=[col for col in eval_df.columns if col.startswith('Unnamed')])  # Remove unnamed columns
    eval_df = eval_df.dropna(how='all')  # Remove rows that are all NaN
    
    # Strip whitespace from string columns
    for col in eval_df.select_dtypes(['object']):
        eval_df[col] = eval_df[col].str.strip()
    
    # Ensure 'relevant_doc_ids' is a string column
    eval_df['relevant_doc_ids'] = eval_df['relevant_doc_ids'].astype(str)

    return eval_df

previous_run_df = get_clean_eval_dataset()
eval_df = get_clean_eval_dataset()

## Cross-Encoders in Natural Language Processing

Cross-encoders are powerful neural network models used for comparing two pieces of text, such as a query and a document. Unlike bi-encoders which encode texts separately, cross-encoders process both texts simultaneously, allowing for rich interactions between them at every layer of the network. This approach often yields higher accuracy in tasks like passage reranking, as it can capture complex relationships between the query and potential answers. However, cross-encoders are computationally intensive and less efficient for large-scale retrieval tasks. They're typically used to rerank a small set of candidates initially retrieved by faster methods, striking a balance between accuracy and efficiency in information retrieval systems.

Because we only have 5 candidates, we can get through the ReRank pretty fast!

## CrossEncoder Reranking with Long Passages
This implementation addresses the challenge of reranking long text passages (2056 tokens) using a CrossEncoder model with a 512 token limit. The `CrossEncoderReRankTask` class includes a `chunk_text` method that splits long passages into smaller chunks on word boundaries. Note that `chunk_text` measures length in characters rather than tokens and the chunks it produces do not overlap, so each chunk typically stays well under the model's token limit. During reranking, each chunk is scored separately against the query, and the final score for a passage is the maximum score across all its chunks. This approach allows the reranker to consider the entire content of long passages while respecting the model's token limit, potentially improving the accuracy of the reranking process for documents that exceed the sequence length of the standard BERT model (which this cross-encoder is based on).
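The chunk-then-take-the-max idea can be sketched independently of any model. The snippet below is an illustration only: it uses overlapping word windows, which is a variation on (not a copy of) the `chunk_text` method in this notebook, and `overlap_score` is a stand-in scorer that simply counts shared words where a real cross-encoder would go. All names and parameter values here are hypothetical.

```python
from typing import Callable, List

def chunk_words(text: str, window: int, overlap: int) -> List[str]:
    """Split text into windows of `window` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # The last window already reaches the end of the text.
    return chunks

def max_chunk_score(
    query: str,
    passage: str,
    score_fn: Callable[[str, str], float],
    window: int,
    overlap: int,
) -> float:
    """Score every chunk against the query and keep the best score for the passage."""
    chunks = chunk_words(passage, window, overlap)
    return max((score_fn(query, chunk) for chunk in chunks), default=float("-inf"))

def overlap_score(query: str, chunk: str) -> float:
    # Stand-in scorer (assumption): number of lowercase words shared by query and chunk.
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))

passage = "to change your login go to settings and reset your password there"
print(chunk_words(passage, window=4, overlap=2))
print(max_chunk_score("reset password", passage, overlap_score, window=4, overlap=2))
```

Taking the maximum means a passage is ranked by its single most relevant region, so a long passage is not penalized for containing unrelated text elsewhere.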

In [None]:
from sentence_transformers import CrossEncoder as SentenceTransformerCrossEncoder
from pydantic import BaseModel
from typing import List, Tuple
import numpy as np
from abc import ABC, abstractmethod

class Passage(BaseModel):
    chunk: str
    file_name: str
    score: float = 0.0

class BaseReRankTask(ABC):
    @abstractmethod
    def rerank(self, query_text: str, passages: List[Passage]) -> List[Passage]:
        pass

class CrossEncoderReRankTask(BaseReRankTask):
    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-12-v2', score_threshold: float = -0.999, max_length: int = 512):
        self.cross_encoder = SentenceTransformerCrossEncoder(model_name)
        self.score_threshold = score_threshold
        self.max_length = max_length

    def chunk_text(self, text: str, max_length: int) -> List[str]:
        words = text.split()
        chunks = []
        current_chunk = []
        current_length = 0

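        for word in words:
            # Start a new chunk once adding this word would exceed max_length characters.
            # The `current_chunk` check avoids emitting an empty chunk when a single word is too long.
            if current_chunk and current_length + len(word) + 1 > max_length: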
                chunks.append(" ".join(current_chunk))
                current_chunk = [word]
                current_length = len(word)
            else:
                current_chunk.append(word)
                current_length += len(word) + 1

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks

    def rerank(self, query: str, passages: List[Passage]) -> List[Passage]:
        all_input_pairs = []
        chunk_map = {}

        for i, passage in enumerate(passages):
            chunks = self.chunk_text(passage.chunk, self.max_length)
            for j, chunk in enumerate(chunks):
                all_input_pairs.append([query, chunk])
                chunk_map[(i, j)] = chunk

        # Get scores from the cross-encoder
        scores = self.cross_encoder.predict(all_input_pairs)

        # Aggregate scores for each original passage
        passage_scores = {}
        for (i, j), score in zip(chunk_map.keys(), scores):
            if i not in passage_scores:
                passage_scores[i] = []
            passage_scores[i].append(score)

        # Calculate final score for each passage (e.g., using max score)
        final_scores = {i: max(scores) for i, scores in passage_scores.items()}

        # Sort passages based on their scores in descending order
        sorted_passages = sorted([(score, passages[i]) for i, score in final_scores.items()], key=lambda x: x[0], reverse=True)

        # Update passage scores and return
        result = []
        for score, passage in sorted_passages:
            passage.score = float(score)
            result.append(passage)

        return result

# Define the ReRank task
reranker: BaseReRankTask = CrossEncoderReRankTask()

# Copy IRMetrics Calculator
This class is copied from the previous notebook. We included it here, rather than moving it to a utility module, so that it's easy to modify.

In [None]:
import json
import numpy as np

class IRMetricsCalculator:
    def __init__(self, df):
        self.df = df

    @staticmethod
    def precision_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        return len(set(relevant) & set(retrieved_k)) / k if k > 0 else 0

    @staticmethod
    def recall_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        return len(set(relevant) & set(retrieved_k)) / len(relevant) if len(relevant) > 0 else 0

    @staticmethod
    def dcg_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        dcg = 0
        for i, item in enumerate(retrieved_k):
            if item in relevant:
                dcg += 1 / np.log2(i + 2)
        return dcg

    @staticmethod
    def ndcg_at_k(relevant, retrieved, k):
        dcg = IRMetricsCalculator.dcg_at_k(relevant, retrieved, k)
        idcg = IRMetricsCalculator.dcg_at_k(relevant, relevant, k)
        return dcg / idcg if idcg > 0 else 0

    @staticmethod
    def parse_json_list(json_string):
        try:
            return json.loads(json_string)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {json_string} with error {e}")
            return []

    def calculate_metrics(self, k_values=[1, 3, 5]):
        for k in k_values:
            self.df[f'precision@{k}'] = self.df.apply(lambda row: self.precision_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
            self.df[f'recall@{k}'] = self.df.apply(lambda row: self.recall_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
            self.df[f'ndcg@{k}'] = self.df.apply(lambda row: self.ndcg_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
        return self.df

# Setup Task Runner
Similar to what we did with the VectorDB search, we'll set up a task runner that iterates through our validation dataset and recalculates the ranks.

In [None]:
from typing import List
import pandas as pd
import json


class ReRankTaskRunner:
    def __init__(self, eval_df: pd.DataFrame, reranker: BaseReRankTask):
        self.eval_df = eval_df
        self.reranker = reranker

    def _get_unique_file_paths(self, results: List[Passage]) -> List[str]:
        # Since Python 3.7, dicts retain insertion order.
        return list(dict.fromkeys(r.file_name for r in results))


    def run(self) -> pd.DataFrame:
        # Make a copy of the dataframe so we don't modify the original.
        df = self.eval_df.copy()
        
        results = []
        for index, row in df.iterrows():
            query: str = row['query_text']
            
            # Rebuild Passage objects from the chunks retrieved in the previous notebook
            chunks: List[dict] = json.loads(row['retrieved_chunks'])
            passages: List[Passage] = [Passage(chunk=chunk['chunk'], file_name=chunk['relative_path']) for chunk in chunks]

            reranked_passages = self.reranker.rerank(query, passages)
            
            # Extract unique file paths (in ranked order) for comparison with the validation dataset.
            ordered_filepaths: List[str] = self._get_unique_file_paths(reranked_passages)

            # Create new record
            result = {
                'query_text': query,
                'relevant_doc_ids': row['relevant_doc_ids'],
                'retrieved_doc_ids': json.dumps(ordered_filepaths),
            }
            results.append(result)

        new_dataframe = pd.DataFrame(results)

        ir_calc: IRMetricsCalculator = IRMetricsCalculator(new_dataframe)
        return ir_calc.calculate_metrics()


In [None]:
# Run the validation
reranked_results_df = ReRankTaskRunner(eval_df, reranker).run()

# Compare Original Ranking with New Ranking
We'll copy the experiment summarizer from the previous notebook as well and compare the reranked results to the previous results.

In [None]:
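import json
import pandas as pd
import numpy as np
from typing import List

class ExperimentSummarizer:
    def __init__(self, df):
        self.df = pd.DataFrame(df)
        self.summary_df = None

    @staticmethod
    def parse_doc_ids(doc_ids):
        # The ID columns hold JSON-encoded lists (e.g. '["a.pdf", "b.pdf"]'), so parse them
        # as JSON; fall back to a comma split for plain comma-separated strings.
        try:
            parsed = json.loads(doc_ids)
            return parsed if isinstance(parsed, list) else [str(parsed)]
        except (json.JSONDecodeError, TypeError):
            return [d.strip() for d in str(doc_ids).split(',') if d.strip()]

    @staticmethod
    def calculate_ap(relevant_docs, retrieved_docs):
        relevant_set = set(ExperimentSummarizer.parse_doc_ids(relevant_docs))
        retrieved_list = ExperimentSummarizer.parse_doc_ids(retrieved_docs)
        relevant_count = 0
        total_precision = 0
        
        # Accumulate precision at each rank where a relevant document appears
        for i, doc in enumerate(retrieved_list, 1):
            if doc in relevant_set:
                relevant_count += 1
                total_precision += relevant_count / i
        
        return total_precision / len(relevant_set) if relevant_set else 0

    @staticmethod
    def calculate_reciprocal_rank(relevant_docs, retrieved_docs):
        relevant_set = set(ExperimentSummarizer.parse_doc_ids(relevant_docs))
        retrieved_list = ExperimentSummarizer.parse_doc_ids(retrieved_docs)
        
        # Reciprocal of the rank of the first relevant document
        for i, doc in enumerate(retrieved_list, 1):
            if doc in relevant_set:
                return 1 / i
        
        return 0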

    def calculate_map(self):
        self.df['AP'] = self.df.apply(lambda row: self.calculate_ap(row['relevant_doc_ids'], row['retrieved_doc_ids']), axis=1)
        return self.df['AP'].mean()

    def calculate_mrr(self):
        self.df['RR'] = self.df.apply(lambda row: self.calculate_reciprocal_rank(row['relevant_doc_ids'], row['retrieved_doc_ids']), axis=1)
        return self.df['RR'].mean()

    def calculate_mean_metrics(self):
        return self.df[[
            'precision@1', 'recall@1', 'ndcg@1',
            'precision@3', 'recall@3', 'ndcg@3',
            'precision@5', 'recall@5', 'ndcg@5'
        ]].mean()

    def calculate_top_k_percentages(self):
        top_1 = (self.df['precision@1'] > 0).mean() * 100
        top_3 = (self.df['precision@3'] > 0).mean() * 100
        top_5 = (self.df['precision@5'] > 0).mean() * 100
        return top_1, top_3, top_5

    def analyze(self):
        map_score = self.calculate_map()
        mrr_score = self.calculate_mrr()
        mean_metrics = self.calculate_mean_metrics()
        top_1, top_3, top_5 = self.calculate_top_k_percentages()

        self.summary_df = pd.DataFrame({
            'Metric': [
                'MAP (Mean Average Precision)',
                'MRR (Mean Reciprocal Rank)',
                'Mean Precision@1', 'Mean Recall@1', 'Mean NDCG@1',
                'Mean Precision@3', 'Mean Recall@3', 'Mean NDCG@3',
                'Mean Precision@5', 'Mean Recall@5', 'Mean NDCG@5',
                '% Queries with Relevant Doc in Top 1',
                '% Queries with Relevant Doc in Top 3',
                '% Queries with Relevant Doc in Top 5'
            ],
            'Value': [
                map_score,
                mrr_score,
                mean_metrics['precision@1'], mean_metrics['recall@1'], mean_metrics['ndcg@1'],
                mean_metrics['precision@3'], mean_metrics['recall@3'], mean_metrics['ndcg@3'],
                mean_metrics['precision@5'], mean_metrics['recall@5'], mean_metrics['ndcg@5'],
                top_1, top_3, top_5
            ]
        })
        return self.summary_df

    def get_summary(self):
        if self.summary_df is None:
            self.analyze()
        return self.summary_df

In [None]:
# Generate a summary of the original and reranked results
# Let's use the class above to create aggregate metrics to see how well the system performs.
original_summary = ExperimentSummarizer(eval_df).analyze()
rerank_summary = ExperimentSummarizer(reranked_results_df).analyze()

In [None]:
class ExperimentComparator:
    def __init__(self, *experiment_data):
        self.experiments = experiment_data

    def compare_metrics(self):
        merged_df = pd.DataFrame({'Metric': self.experiments[0][0]['Metric']})
        for df, name in self.experiments:
            merged_df = pd.merge(merged_df, df, on='Metric', how='left')
            merged_df = merged_df.rename(columns={'Value': name})
        
        base_exp = self.experiments[0][1]
        for df, name in self.experiments[1:]:
            merged_df[f'Change_{name}_vs_{base_exp}'] = merged_df[name] - merged_df[base_exp]
            merged_df[f'PercentChange_{name}_vs_{base_exp}'] = ((merged_df[name] - merged_df[base_exp]) / merged_df[base_exp]) * 100
        
        return merged_df

    def print_comparison(self):
        comparison = self.compare_metrics()
        
        def color_change(val):
            if pd.isna(val):
                return ''
            return 'color: red' if val < 0 else 'color: green' if val > 0 else ''
        
        def background_color_change(val):
            if pd.isna(val):
                return ''
            return 'background-color: #ffcccb' if val < 0 else 'background-color: #90ee90' if val > 0 else ''
        
        change_columns = [col for col in comparison.columns if col.startswith('Change_') or col.startswith('PercentChange_')]
        styled = comparison.style
        
        for col in change_columns:
            styled = styled.map(color_change, subset=[col])
            styled = styled.map(background_color_change, subset=[col])
        
        numeric_columns = comparison.select_dtypes(include=[np.number]).columns
        format_dict = {col: '{:.6f}' for col in numeric_columns}
        
        for col in change_columns:
            if col.startswith('PercentChange_'):
                format_dict[col] = '{:.2f}%'
        
        styled = styled.format(format_dict)
        return styled

    def analyze(self):
        return self.print_comparison()

In [None]:
experiment_comparator = ExperimentComparator(
    (original_summary, "Original"),
    (rerank_summary, "ReRanked")
)
experiment_comparator.analyze()

# Conclusion
As you can see, in this run the ReRanker improved our MAP by 80% and MRR by 50%, and it improved our precision metrics, including Precision@3. This matters: because we don't want to put all 5 documents into our RAG solution, precision@k, MAP, and MRR are more important at this step than recall.

# Next Steps
For now we'll skip validating the entire IR system. The validation dataset for this ReRanker came straight out of the first notebook's outputs so we effectively evaluated the IR system already. 

Move to the next notebook to start getting into LLM validation.