# Overview
This notebook builds on all the previous notebooks. We validated our chunking strategy and embeddings, reranking, our LLM-as-a-judge prompt, and finally the RAG prompt itself. The last step is to put it all together and validate the entire system.

Remember from our first two notebooks that we don't expect to get the top result every time. What we want to see is how the overall system performs on our validation dataset. To do this, we'll build a local RAG system.

#### What Metrics Should I Care About?
For an E2E system, we care about a couple of metrics. These metrics have been given new names with the introduction of RAG, but they build on many existing metrics used in the data science space.


# What Will We Do?
* Curate a dataset of questions and ground truth answers (we've created one already)
* Modify our grading rubric to include the ground truth answer.
* Set up a retrieval task
* Inject the context from our retrieval task into our LLM and validate the results with our updated rubric.
* Compare our custom setup to RAGAS (an open source evaluation tool).

At the end of this notebook, we should have a working example that demonstrates, end to end, how well our retrieval system is working.

**Note:** RAGAS is a great tool for E2E evaluation. However, it doesn't work very well in a production system, where we don't know the answers ahead of time. Because of this, the LLM-as-a-judge prompt we used previously is better suited to validating your production system. It's not uncommon to use this grading rubric either as a safeguard before returning answers to users, or to run it asynchronously on a subset of responses to gather metrics for more advanced observability.
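
As a rough illustration of those two production patterns, the sketch below gates an answer on the judge's grade and picks a deterministic sample of requests for offline grading. The names `guard_response`, `should_grade`, and `FALLBACK_MESSAGE`, the threshold of 4, and the 10% sample rate are all assumptions made for this example; none of them come from the notebook's code.

```python
import hashlib
from typing import Optional

# Hypothetical fallback text shown when the judge's grade is too low.
FALLBACK_MESSAGE = "Sorry, I could not find a reliable answer to that question."

def guard_response(answer: str, grade: Optional[float], min_grade: float = 4.0) -> str:
    """Return the answer only if the judge's grade clears the threshold."""
    if grade is None or grade < min_grade:
        return FALLBACK_MESSAGE
    return answer

def should_grade(request_id: str, sample_rate: float = 0.1) -> bool:
    """Deterministically select roughly `sample_rate` of requests for async grading."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Safeguard: a low or missing grade falls back to the canned message.
print(guard_response("Paris is the capital of France.", grade=5.0))
print(guard_response("Some shaky answer.", grade=2.0))

# Observability: the same request id always lands in the same bucket.
print(should_grade("request-123"))
```

Hashing the request id rather than calling `random.random()` keeps the sampling decision reproducible, which can make it easier to find out later why a given response was or was not graded.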

# Import Validation Dataset


In [None]:
import pandas as pd

eval_df = pd.read_csv('../data/eval-datasets/5_e2e_validation.csv')

In [None]:
import chromadb
import boto3

# Initialize Chroma client from persistent disk. We'll use the same collection from our first notebook
chroma_client = chromadb.PersistentClient(path="../data/chroma")

# Also initialize the Bedrock client from the same session so we can call some embedding models!
session = boto3.Session(profile_name='default')
bedrock = session.client('bedrock-runtime')

# Setup Embeddings / Chunk Retrieval Task
Included in this repo is a `data/chroma` directory that contains a prebuilt Chroma collection using the chunking strategy and embedding model we selected as the best in our first notebook. Instead of recreating it, we can modify the ChromaHelper from the first notebook to load the existing collection from disk.

In [None]:
from pydantic import BaseModel
from typing import List, Dict
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction

class RetrievalResult(BaseModel):
    id: str
    document: str
    embedding: List[float]
    distance: float
    metadata: Dict = {}

class ChromaDBRetrievalTask:
    def __init__(self, chroma_client, collection_name: str, embedding_function):
        self.client = chroma_client
        self.collection_name = collection_name
        self.embedding_function = embedding_function
        # Get the existing collection
        self.collection = self._get_collection()

    def _get_collection(self):
        return self.client.get_collection(
            name=self.collection_name,
            embedding_function=self.embedding_function
        )

    def retrieve(self, query_text: str, n_results: int = 5) -> List[RetrievalResult]:
        # Query the collection
        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results,
            include=['embeddings', 'documents', 'metadatas', 'distances']
        )
        # Transform the results into RetrievalResult objects
        retrieval_results = []
        for i in range(len(results['ids'][0])):
            retrieval_results.append(RetrievalResult(
                id=results['ids'][0][i],
                document=results['documents'][0][i],
                embedding=results['embeddings'][0][i],
                distance=results['distances'][0][i],
                metadata=results['metadatas'][0][i] if results['metadatas'][0] else {}
            ))
        return retrieval_results

# Define some experiment variables
EMBEDDING_MODEL: str = "amazon.titan-embed-text-v2:0"
COLLECTION_NAME: str = 'experiment_3_collection'

# This is a handy function Chroma implemented for calling bedrock. Lets use it!
embedding_function = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name=EMBEDDING_MODEL
)

# Create our retrieval task for Chroma.
chroma_retrieval_task: ChromaDBRetrievalTask = ChromaDBRetrievalTask(
    chroma_client = chroma_client, 
    collection_name = COLLECTION_NAME,
    embedding_function = embedding_function
)    

# Setup ReRank

In [None]:
from sentence_transformers import CrossEncoder as SentenceTransformerCrossEncoder
from pydantic import BaseModel
from typing import List, Tuple
import numpy as np
from abc import ABC, abstractmethod

class Passage(BaseModel):
    chunk: str
    file_name: str
    score: float = 0.0

class CrossEncoderReRankTask:
    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-12-v2', score_threshold: float = -0.999, max_length: int = 512):
        self.cross_encoder = SentenceTransformerCrossEncoder(model_name)
        self.score_threshold = score_threshold
        self.max_length = max_length

    def chunk_text(self, text: str, max_length: int) -> List[str]:
        words = text.split()
        chunks = []
        current_chunk = []
        current_length = 0

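        for word in words:
            # Start a new chunk once adding this word would exceed max_length (in characters).
            # The current_chunk check avoids emitting an empty chunk when the first word is too long.
            if current_chunk and current_length + len(word) + 1 > max_length: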
                chunks.append(" ".join(current_chunk))
                current_chunk = [word]
                current_length = len(word)
            else:
                current_chunk.append(word)
                current_length += len(word) + 1

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks

    def rerank(self, query: str, passages: List[Passage]) -> List[Passage]:
        all_input_pairs = []
        chunk_map = {}

        for i, passage in enumerate(passages):
            chunks = self.chunk_text(passage.chunk, self.max_length)
            for j, chunk in enumerate(chunks):
                all_input_pairs.append([query, chunk])
                chunk_map[(i, j)] = chunk

        # Get scores from the cross-encoder
        scores = self.cross_encoder.predict(all_input_pairs)

        # Aggregate scores for each original passage
        passage_scores = {}
        for (i, j), score in zip(chunk_map.keys(), scores):
            if i not in passage_scores:
                passage_scores[i] = []
            passage_scores[i].append(score)

        # Calculate final score for each passage (e.g., using max score)
        final_scores = {i: max(scores) for i, scores in passage_scores.items()}

        # Sort passages based on their scores in descending order
        sorted_passages = sorted([(score, passages[i]) for i, score in final_scores.items()], key=lambda x: x[0], reverse=True)

        # Update passage scores and return
        result = []
        for score, passage in sorted_passages:
            passage.score = float(score)
            result.append(passage)

        # Lets only return the top 2.
        return result[:2]

# Define the ReRank task. By default we use ms-marco-MiniLM from the Hugging Face Sentence Transformers library
reranker: CrossEncoderReRankTask =  CrossEncoderReRankTask()

# Setup Full Retrieval Task
In this step we'll create a class that combines our chromaDB collection & ReRank to return the most relevant passages we can find

In [None]:
class RetrievalTask:
    def __init__(self, chroma_retriever: ChromaDBRetrievalTask, reranker: CrossEncoderReRankTask):
        self.chroma_retriever = chroma_retriever
        self.reranker = reranker

    # Retrieve our results, Rerank, and return the top 2 results
    def retrieve(self, query, n_results=5) -> List[Passage]:
        initial_results: List[RetrievalResult] = self.chroma_retriever.retrieve(query, n_results)
        passages = [Passage(chunk=r.document, file_name = r.metadata['relative_path']) for r in initial_results]
        return self.reranker.rerank(query, passages)

retrieval_task: RetrievalTask = RetrievalTask(
    chroma_retriever=chroma_retrieval_task,
    reranker=reranker
)
        

# Setup RAG With Bedrock
This is mostly reused from the previous task, with one exception: in the RAGClient, we make a retrieval call to populate the context. We store the result in two columns, `context` and `context_chunks`. `context` is the formatted string passed into the prompt and is used for the LLM-as-a-judge evaluation, while `context_chunks` is a JSON list of the raw chunks, which is the format required by RAGAS, the tool we'll compare our custom evaluator against.

In [None]:
import boto3
import pandas as pd
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict, Any
import json

class BaseBedrockClient:
    def __init__(self, user_prompt: str, system_prompt: str, model_id: str, hyper_params: dict):
        self.client = boto3.client('bedrock-runtime')
        self.user_prompt = user_prompt
        self.system_prompt = system_prompt
        self.model_id = model_id
        self.hyper_params = hyper_params

    def create_chat_payload(self, inputs: dict) -> list[dict]:
        prompt = self.user_prompt.format(**inputs)
        return [{"role": "user", "content": [{"text": prompt}]}]

    def call(self, messages: list[dict]) -> str:
        response = self.client.converse(
            modelId=self.model_id,
            messages=messages,
            inferenceConfig=self.hyper_params,
            system=[{"text": self.system_prompt}]
        )
        return response['output']['message']['content'][0]['text']

    def call_threaded(self, message_lists: List[List[Dict[str, Any]]]) -> List[str]:
        future_to_position = {}
        with ThreadPoolExecutor(max_workers=5) as executor:
            for i, request in enumerate(message_lists):
                future = executor.submit(self.call, request)
                future_to_position[future] = i
            
            responses = [None] * len(message_lists)
            for future in as_completed(future_to_position):
                position = future_to_position[future]
                try:
                    response: str = future.result()
                    responses[position] = response
                except Exception as exc:
                    print(f"Request at position {position} generated an exception: {exc}")
                    responses[position] = None
        return responses

class RAGClient(BaseBedrockClient):
    def __init__(self, user_prompt: str, system_prompt: str, model_id: str, hyper_params: dict, retrieval_task: RetrievalTask):
        super().__init__(user_prompt, system_prompt, model_id, hyper_params)
        self.retrieval_task = retrieval_task

    def extract_response(self, llm_output: str) -> str:
        response_match = re.search(r'<response>(.*?)</response>', llm_output, re.DOTALL)
        return response_match.group(1).strip() if response_match else "No response found"

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()

        message_lists = []
        contexts = []  # Store context as it's passed into the prompt
        context_lists = [] # Store context for RAGAS evaluation
        for _, row in df.iterrows():
            # Get passages for context
            passages: List[Passage] = self.retrieval_task.retrieve(row["query_text"])
            # Combine into single context
            context = "\n\n".join(f"###File name:\n{p.file_name}\n###Passage:\n{p.chunk}" for p in passages)

            # Store contexts for downstream dependencies
            contexts.append(context)
            context_lists.append(json.dumps([p.chunk for p in passages]))
            
            # Construct message list using the query text and relevant passages retrieved.
            message_lists.append(self.create_chat_payload({
                "query_text": row["query_text"],
                "context": context
            }))
        
        responses = self.call_threaded(message_lists)

        df['context'] = contexts
        df['context_chunks'] = context_lists
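        # call_threaded returns None for failed requests, so guard before parsing
        df['llm_response'] = [
            self.extract_response(r) if r is not None else "Error occurred during processing"
            for r in responses
        ]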
        return df

class EvaluationClient(BaseBedrockClient):
    def __init__(self, user_prompt: str, system_prompt: str, model_id: str, hyper_params: dict):
        super().__init__(user_prompt, system_prompt, model_id, hyper_params)

    def extract_score_and_thinking(self, llm_output: str) -> tuple:
        thinking_match = re.search(r'<thinking>(.*?)</thinking>', llm_output, re.DOTALL)
        score_match = re.search(r'<score>(.*?)</score>', llm_output, re.DOTALL)

        thinking = thinking_match.group(1).strip() if thinking_match else "No thinking found"
        score = float(score_match.group(1)) if score_match else None
        
        return score, thinking

    def evaluate(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        message_lists = [self.create_chat_payload({
            "query_text": row["query_text"],
            "context": row["context"],
            "llm_response": row["llm_response"],
            "ground_truth": row["ground_truth"]
        }) for _, row in df.iterrows()]
        
        responses = self.call_threaded(message_lists)

        llm_scores = []
        llm_thinking = []

        for response in responses:
            if response is not None:
                score, thinking = self.extract_score_and_thinking(response)
                llm_scores.append(score)
                llm_thinking.append(thinking)
            else:
                llm_scores.append(None)
                llm_thinking.append("Error occurred during processing")

        df['grade'] = llm_scores
        df['reasoning'] = llm_thinking
        
        return df

# Define RAG Prompts
This is the same prompt we used in the previous task. 

In [None]:
# System Prompt
RAG_SYSTEM_PROMPT = """You are an advanced AI assistant specialized in Retrieval Augmented Generation (RAG).
Your primary function is to provide accurate, concise, and relevant answers based solely on the given context.
Follow these guidelines strictly:

1. Use only information from the provided context. Do not introduce external knowledge or make assumptions.
2. Ensure your answers are complete, addressing all aspects of the question using available information.
3. Be extremely concise. Use as few words as possible while maintaining clarity and completeness.
4. Maintain 100% accuracy based on the given context. If the context doesn't contain enough information to answer fully, state this clearly.
5. Structure your responses for maximum clarity. Use bullet points or numbered lists when appropriate.
6. If the context contains technical information, explain it in simple terms as if speaking to a non-technical person.
7. Do not apologize or use phrases like "Based on the context provided" or "According to the information given".
8. If asked about something not in the context, simply state "The provided context does not contain information about [topic]."

Your goal is to achieve the highest possible score on context utilization, completeness, conciseness, accuracy, and clarity."""

# User Prompt
RAG_USER_PROMPT = """Answer the following question using only the provided context:

<query>
{query_text}
</query>

<context>
{context}
</context>

Instructions:
1. Read the question and context carefully.
2. Formulate a concise and accurate answer based solely on the given context.
3. Ensure your response is clear and easily understandable to a non-technical person.
4. Do not include any information not present in the context.
5. If the context doesn't contain relevant information, state this clearly and concisely.
6. Place your response in <response></response> tags."""

# Reuse Rubric 
We'll reuse our rubric from the previous task, but add the ground truth answer to the inputs and slightly tweak our evaluation criteria to account for context relevancy. Because we want the maximum score to stay at 5 (an arbitrary choice), we'll remove one of the previous evaluation criteria.
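
The rubric prompt in the next cell asks the judge to list one 0-or-1 score per criterion inside `<thinking>` tags and the total inside `<score>` tags. A small sanity check, sketched below, is to parse the per-criterion lines and confirm they add up to the reported total. The helper names `parse_criterion_scores` and `total_is_consistent` are hypothetical, and the sketch assumes the judge follows the `Criterion: 0 - explanation` line format shown in the prompt's example output.

```python
import re
from typing import Dict

# The five criteria named in the rubric's system prompt.
CRITERIA = [
    "Context Utilization",
    "Completeness",
    "Conciseness",
    "Context Relevancy",
    "Clarity",
]

def parse_criterion_scores(thinking: str) -> Dict[str, int]:
    """Extract 'Criterion: 0' or 'Criterion: 1' entries from the judge's reasoning."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{re.escape(name)}\s*:\s*([01])\b", thinking)
        if match:
            scores[name] = int(match.group(1))
    return scores

def total_is_consistent(thinking: str, total: float) -> bool:
    """True only if every criterion was found and the scores sum to the total."""
    scores = parse_criterion_scores(thinking)
    return len(scores) == len(CRITERIA) and sum(scores.values()) == total

example_thinking = """
Context Utilization: 1 - The answer strictly uses information from the context.
Completeness: 1 - The response covers all key elements of the question.
Conciseness: 1 - The answer doesn't repeat information.
Context Relevancy: 0 - The context was not relevant to the question.
Clarity: 1 - The response is clear and easy to follow.
"""

print(parse_criterion_scores(example_thinking))
print(total_is_consistent(example_thinking, 4))
```

Rows where this check fails are reasonable candidates for manual spot checks, since the judge either skipped a criterion or miscounted the total.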

In [None]:
# System Prompt
RUBRIC_SYSTEM_PROMPT = """You are an expert judge evaluating Retrieval Augmented Generation (RAG) applications.
Your task is to evaluate given answers based on context and questions using the criteria provided.
Evaluation Criteria (Score either 0 or 1 for each, total score is the sum):
1. Context Utilization: Does the answer use only information provided in the context, without introducing external or fabricated details?
2. Completeness: Does the answer thoroughly address all key elements of the question based on the available context, without significant omissions?
3. Conciseness: Does the answer efficiently use words to address the question and avoid unnecessary redundancy?
4. Context Relevancy: Is the context returned sufficient to provide an answer like the gold standard answer?
5. Clarity: Is the answer easy to understand and follow?
Your role is to provide a fair and thorough evaluation for each criterion, explaining your reasoning clearly."""

# User Prompt
RUBRIC_USER_PROMPT = """Please evaluate the following RAG response:

Question:
<query_text>
{query_text}
</query_text>

Ground truth answer:
<ground_truth>
{ground_truth}
</ground_truth>


Generated answer:
<llm_response>
{llm_response}
</llm_response>

Context:
<context>
{context}
</context>

Evaluation Steps:
1. Carefully read the provided context, question, and answer.
2. For each evaluation criterion, assign a score of either 0 or 1:
   - Context Utilization
   - Completeness
   - Conciseness
   - Context Relevancy
   - Clarity
3. Provide a clear explanation for each score, referencing specific aspects of the response.
4. Calculate the total score by adding up the points awarded (minimum 0, maximum 5).
5. Present your evaluation inside <thinking></thinking> tags.
6. Include individual criterion scores (0 or 1) in the thinking tags and the total score inside <score></score> tags.
7. Ensure your response is valid XML and provides a comprehensive evaluation.
8. Use the ground truth to evaluate whether the context returned was sufficient to answer the question fully. If not, score Context Relevancy as 0.

Example Output Format:
<thinking>
Context Utilization: 1 - The answer strictly uses information from the context without introducing external details.
Completeness: 1 - The response covers all key elements of the question based on the available context.
Conciseness: 1 - The answer is helpful and doesn't repeat the same information more than once.
Context Relevancy: 0 - The context was not relevant to the question.
Clarity: 1 - The response is clear and easy to follow.
</thinking>
<score>4</score>

Please provide your detailed evaluation."""

In [None]:
# Define different models we can use. 
SONNET_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

# Initialize RAG Client
rag_client: RAGClient = RAGClient(
    RAG_USER_PROMPT, 
    RAG_SYSTEM_PROMPT, 
    HAIKU_ID,
    {"temperature": 0.5, "maxTokens": 4096},
    retrieval_task
)

# Initialize Eval Client
eval_client = EvaluationClient(
    RUBRIC_USER_PROMPT, 
    RUBRIC_SYSTEM_PROMPT, 
    HAIKU_ID, 
    {"temperature": 0.7, "maxTokens": 4096}
)

In [None]:
# Generate RAG responses
rag_df = rag_client.process(eval_df)

In [None]:
# Evaluate RAG Responses
llm_as_a_judge_results_df = eval_client.evaluate(rag_df)

# Create Summary View of Results

In [None]:
import pandas as pd
import numpy as np
from textwrap import fill

class E2EEvaluator:
    def __init__(self, df):
        self.df = df
        # Drop rows where the judge failed to return a score so the statistics aren't NaN
        self.grades = df['grade'].dropna().astype(float)
    
    def calculate_metrics(self):
        return {
            'Mean': np.mean(self.grades),
            'Median': np.median(self.grades),
            'Standard Deviation': np.std(self.grades),
            'Minimum Grade': np.min(self.grades),
            'Maximum Grade': np.max(self.grades)
        }
    
    def generate_report(self):
        metrics = self.calculate_metrics()
        report = "E2E Validation Result\n"
        report += "========================\n\n"
        
        for metric, value in metrics.items():
            report += f"{metric}: {value:.2f}\n"
        
        return report
    
    def analyze_grade_distribution(self):
        return self.df['grade'].value_counts().sort_index()

    def pretty_print_lowest_results(self, n=3, width=80):
        lowest_results = self.df.nsmallest(n, 'grade')
        for index, row in lowest_results.iterrows():
            print(f"{'='*width}\n")
            print(f"Grade: {row['grade']}\n")
            print("Query Text:")
            print(fill(row['query_text'], width=width))
            print("\nLLM Response:")
            print(fill(row['llm_response'], width=width))
            print("\nReasoning:")
            print(fill(row['reasoning'], width=width))
            print(f"\n{'='*width}\n")

In [None]:
# Build the evaluator from the LLM-as-a-judge results
evaluator = E2EEvaluator(llm_as_a_judge_results_df)

# Generate and print the report
print(evaluator.generate_report())

# Analyze grade distribution
print(evaluator.analyze_grade_distribution())

In [None]:
# Look at the results and spot check them
llm_as_a_judge_results_df

# E2E Test Results
The E2E test results are pretty good! However, they don't account for scenarios where you simply don't have the correct context. This is where RAGAS is really useful: it provides some more granular metrics. A word of caution, though: it is very slow.

# Bonus - Use RAGAS
In this last section, we'll use a library called RAGAS as an alternative to building this all manually.

In [None]:
import json
import pandas as pd
from datasets import Dataset
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness
)
from langchain_community.embeddings import BedrockEmbeddings
from langchain_aws import ChatBedrock
from ragas import evaluate, RunConfig

class RAGASEvaluator:
    def __init__(self, embedding_model, model_id, bedrock_client):
        self.embedding_model_id = embedding_model
        self.model_id = model_id
        self.bedrock_client = bedrock_client
        
        # Define embedding model for RAGAS
        self.ragas_embedding_model = BedrockEmbeddings(model_id=self.embedding_model_id, client=self.bedrock_client)
        
        # Define the llm to use for RAGAS evaluation
        self.ragas_llm = ChatBedrock(
            model_id=self.model_id,
            model_kwargs = {"temperature": 0.5, "max_tokens": 2048}
        )
        
        # Define metrics we care about
        self.METRICS = [
            context_precision,
            faithfulness,
            answer_relevancy,
            context_recall,
            answer_correctness
        ]

    def notebook_to_hf_dataset(self, df: pd.DataFrame) -> Dataset:
        # Initialize the new data structure
        data_samples = {
            'question': [],
            'answer': [],
            'contexts': [],
            'ground_truth': []
        }
        
        # Iterate through each row in the dataframe
        for _, row in df.iterrows():
            # Add the question (query_text)
            data_samples['question'].append(row['query_text'])
            
            # Add the answer (llm_response)
            data_samples['answer'].append(row['llm_response'])
            
            # Parse the context_chunks JSON string and add to contexts
            context_chunks = json.loads(row['context_chunks'])
            data_samples['contexts'].append(context_chunks)
            
            # Add the ground truth
            data_samples['ground_truth'].append(row['ground_truth'])
        
        # Create and return the Dataset object
        return Dataset.from_dict(data_samples)

    def evaluate_rag(self, rag_df: pd.DataFrame):
        # Convert dataframe to HuggingFace dataset
        ragas_dataset = self.notebook_to_hf_dataset(rag_df)
        
        # Perform evaluation
        result = evaluate(
            ragas_dataset,
            metrics=self.METRICS,
            llm=self.ragas_llm,
            embeddings=self.ragas_embedding_model,
            run_config=RunConfig(max_workers=1)
        )
        
        return result

In [None]:
# Run Ragas Evaluation
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

ragas_evaluator = RAGASEvaluator(EMBEDDING_MODEL, HAIKU_ID, bedrock)
ragas_results = ragas_evaluator.evaluate_rag(rag_df)
print(json.dumps(ragas_results, indent=4))

# Conclusion
In this notebook, we combined our embeddings, ReRank, and prompt to run E2E tests on our entire system. We explored doing this manually and also with a tool like RAGAS. Based on our findings, document-level chunking worked very well for this use case.

```json
{
    "context_precision": 0.9583333332833334,
    "faithfulness": 0.8954968944099378,
    "answer_relevancy": 0.7086921115930934,
    "context_recall": 1.0,
    "answer_correctness": 0.6089388020774851
}
```
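
One way to read these scores quickly is to sort them so the weakest metric surfaces first. The snippet below is only a sketch: the values are copied from the results above, and `rank_metrics` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Scores copied from the RAGAS run reported above.
ragas_scores: Dict[str, float] = {
    "context_precision": 0.9583333332833334,
    "faithfulness": 0.8954968944099378,
    "answer_relevancy": 0.7086921115930934,
    "context_recall": 1.0,
    "answer_correctness": 0.6089388020774851,
}

def rank_metrics(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (metric, score) pairs sorted from weakest to strongest."""
    return sorted(scores.items(), key=lambda item: item[1])

for name, value in rank_metrics(ragas_scores):
    print(f"{name:<20} {value:.2f}")
```

In this run the retrieval-side metrics (`context_recall`, `context_precision`) score higher than the answer-side ones (`answer_correctness`, `answer_relevancy`), which may suggest that the generation step is the more promising place to look for improvements.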

## Takeaways
While tools like RAGAS are great for quick experiments and for evaluating an end-to-end system, they often aren't as flexible as building something custom. They also don't provide the level of granularity you want when building an IR system in general. By adding validation datasets at each touchpoint in your LLM-augmented system, you can get a much more comprehensive view of what's happening and where the bottlenecks to better performance are.

Another important takeaway is that how you chunk your data matters (arguably) more than which model you choose to generate the RAG results. We explored only basic chunking strategies. Their advantages are that they're easy, fast, and don't cost a lot of credits to index your data. Their downside is that they're basic, and more advanced chunking strategies could unlock greater performance.

No two datasets or IR problems are exactly the same. It's important to evaluate your system to understand how it is performing, and to update your chunking and validation sets over time.