## Build a Baseline Retrieval Augmented Generation Solution
In this notebook we will build an initial solution that uses a pre-trained model augmented with contextual data from a vector store retriever. At a high level, the solution works as follows:
- Retrieve the top-k documents most similar to the user's query from the vector store.
- Provide the retrieved documents as part of the prompt to the model, along with the user's question.
- Generate the answer using the model.

![Basic RAG](images/chatbot_lang.png)
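
The three steps above can be illustrated with a toy, dependency-free sketch. Everything in it is hypothetical: the three-sentence `corpus`, the word-overlap scoring (a crude stand-in for embedding similarity), and the helper names `tokenize`, `retrieve`, and `build_prompt`. It is meant only to show the shape of the flow, not the implementation used later in this notebook.

```python
import string

# Hypothetical three-document corpus; the notebook uses a FAISS vector store instead.
corpus = {
    "doc-1": "Banks must report large cash transactions to the regulator.",
    "doc-2": "Customers can dispute a card charge within sixty days.",
    "doc-3": "Capital reserves must cover a fixed share of risk-weighted assets.",
}


def tokenize(text):
    """Lowercase the text, strip punctuation, and return the set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def retrieve(query, k=2):
    """Rank documents by word overlap with the query and keep the top k."""
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(query_words & tokenize(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, doc_ids):
    """Place the retrieved text ahead of the question."""
    context = "\n".join(corpus[doc_id] for doc_id in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}"


question = "How are large cash transactions reported?"
top_ids = retrieve(question, k=1)
prompt = build_prompt(question, top_ids)
print(top_ids)
print(prompt)  # this string is what would be sent to the model for generation
```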

We'll evaluate several aspects of the solution including:
- The accuracy of the retrieved context
- The quality of the generated answer

These metrics will help determine whether a solution using purely pre-trained models is viable, or whether we need to consider more complex strategies or fine-tuning.

In [1]:
import sys
import os
module_path = "../.."
sys.path.append(os.path.abspath(module_path))
from utils.environment_validation import validate_environment, validate_model_access
validate_environment()

Validating base environment
Base environment validated successfully


langchain==0.3.5 has been installed successfully.


In [2]:
required_models = [
    "amazon.titan-embed-text-v2:0",
    "mistral.mixtral-8x7b-instruct-v0:1",
    "mistral.mistral-7b-instruct-v0:2",
    "anthropic.claude-3-haiku-20240307-v1:0"
]
validate_model_access(required_models)

### Data Ingestion
The prepared datasets have been split into training and validation sets. We will load documents associated with both sets into a vector store for retrieval.

In [1]:
from pathlib import Path
from itertools import chain
from rich import print as rprint
from IPython.display import display, Markdown
import json
import langchain
from langchain_core.documents import Document
from langchain_aws.chat_models import ChatBedrockConverse
from langchain_aws.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS
import boto3

import pickle
from io import BytesIO

import warnings
warnings.filterwarnings("ignore")

import asyncio
import nest_asyncio
# Allow nested use of the event loop that is already running in the notebook
nest_asyncio.apply()

data_path = Path("data/prepared_data")
train_data = (data_path / "prepared_data_train.jsonl").read_text().splitlines()
test_data = (data_path / "prepared_data_test.jsonl").read_text().splitlines()

doc_ids = []
documents = []

# Create a list of LangChain documents that can then be used to ingest into a vector store

for record in chain(train_data, test_data):
    json_record = json.loads(record)
    if json_record["ref_doc_id"] not in doc_ids:
        doc_ids.append(json_record["ref_doc_id"])
        doc = Document(page_content=json_record["context"], metadata=json_record["section_metadata"])
        documents.append(doc)

print(f"Loaded {len(documents)} sections")

Loaded 1340 sections


In [2]:
import mlflow_utils
import mlflow

mlflow_config_path = Path("mlflow_config.json")
if not mlflow_config_path.exists():
    rprint(
        "No MLFlow configuration found. Please run the first notebook to set up MLFlow."
    )
else:
    mlflow_config = json.loads(mlflow_config_path.read_text())
    server_status = mlflow_utils.check_server_status(
        mlflow_config["tracking_server_name"]
    )
    if server_status["IsActive"]:
        rprint(
            f'MLFlow server is available. The current status is: {server_status["TrackingServerStatus"]}'
        )
        mlflow_available = True
        mlflow.set_tracking_uri(mlflow_config["tracking_server_arn"])
    else:
        mlflow_available = False
        rprint(
            f'MLFlow server is not available. The current status is: {server_status["TrackingServerStatus"]}'
        )

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


Next we will initialize the embedding model that will be used to vectorize the documents and queries. We will use the `amazon.titan-embed-text-v2:0` model for this purpose.

In [3]:
boto3_session=boto3.session.Session()

bedrock_runtime = boto3_session.client("bedrock-runtime")

embedding_modelId = "amazon.titan-embed-text-v2:0"

embed_model = BedrockEmbeddings(
    model_id=embedding_modelId,
    model_kwargs={"dimensions": 1024, "normalize": True},
    client=bedrock_runtime,
)

query = "Do I really need to fine-tune the large language models?"
response = embed_model.embed_query(query)
rprint(f"Generated an embedding with {len(response)} dimensions. Sample of first 10 dimensions:\n", response[:10])
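
The `"normalize": True` setting above asks the model for unit-length vectors. A practical consequence, sketched below with made-up two-dimensional vectors rather than real embeddings, is that the plain dot product of two normalized vectors equals their cosine similarity. The helper names are illustrative and not part of any library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for two embeddings
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)  # both are 24/25 up to floating-point rounding
```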

The documents can now be ingested into a vector store. We will utilize a local vector store backed by the `faiss` library for this purpose. In production scenarios, a more scalable solution like OpenSearch or pgvector should be used.

In [4]:
vector_store_file = "baseline_rag_vec_db.pkl"

if not Path(vector_store_file).exists():
    rprint(f"Vector store file {vector_store_file} does not exist. Will create a new vector store.")
    CREATE_NEW = True
else:
    rprint(f"Vector store file {vector_store_file} already exists and will be reused. Delete it or change the file name above if you wish to create a new vector store.")
    CREATE_NEW = False 

if CREATE_NEW:
    vec_db = FAISS.from_documents(documents, embed_model)
    # Persist the serialized index; the context manager closes the file handle
    with open(vector_store_file, "wb") as f:
        pickle.dump(vec_db.serialize_to_bytes(), f)
else:
    if not Path(vector_store_file).exists():
        raise FileNotFoundError(f"Vector store file {vector_store_file} not found. Set CREATE_NEW to True to create a new vector store.")

    # Only load pickle files that you created yourself: unpickling can execute arbitrary code
    with open(vector_store_file, "rb") as f:
        serialized_db = pickle.load(f)
    vec_db = FAISS.deserialize_from_bytes(serialized=serialized_db, embeddings=embed_model, allow_dangerous_deserialization=True)

### Evaluate the retrieval performance
Before moving on to the generation step, we should validate the performance of the retriever. The large language model will not be able to generate accurate answers if the retrieved context is not relevant.

We will evaluate the retriever using the validation set, which contains 400 questions along with their relevant contexts. For each question, we have the unique document id of the relevant context, so the evaluation is simple: we retrieve the top-k documents for each question and check whether the relevant context is among them. From this we calculate the recall, or Hit Rate, of the retriever.

Additionally, we'll compute the Mean Reciprocal Rank (MRR), which is the average of the reciprocal ranks of the first relevant document. For example, if we retrieve 5 documents (k=5) and the relevant document is ranked 2nd, the reciprocal rank is 1/2. A question whose relevant document does not appear in the top-k contributes 0. We calculate the reciprocal rank for each question and then take the average to get the MRR.
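
To make the two metrics concrete, here is a small worked example on made-up data. The function name `hit_rate_and_mrr` and the toy document ids are illustrative only; the logic mirrors the definitions above, with a miss contributing 0 to both sums.

```python
def hit_rate_and_mrr(results, k):
    """Compute Hit Rate@k and MRR@k.

    `results` is a list of (relevant_id, ranked_ids) pairs, where
    `ranked_ids` is the retriever output ordered from best to worst.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for relevant_id, ranked_ids in results:
        top_k = ranked_ids[:k]
        if relevant_id in top_k:
            hits += 1
            # Ranks are 1-based, so add 1 to the list index
            reciprocal_rank_sum += 1.0 / (top_k.index(relevant_id) + 1)
    n = len(results)
    return hits / n, reciprocal_rank_sum / n


toy_results = [
    ("a", ["a", "x", "y"]),  # found at rank 1, reciprocal rank 1
    ("b", ["x", "b", "y"]),  # found at rank 2, reciprocal rank 1/2
    ("c", ["x", "y", "z"]),  # not found, contributes 0
    ("d", ["x", "y", "d"]),  # found at rank 3, reciprocal rank 1/3
]

hit_rate, mrr = hit_rate_and_mrr(toy_results, k=3)
print(f"Hit rate @k=3: {hit_rate}")  # 3 of 4 questions are hits
print(f"MRR @k=3: {mrr}")  # (1 + 1/2 + 0 + 1/3) / 4
```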

In [5]:
test_data = (data_path / "prepared_data_test.jsonl").read_text().splitlines()
retriever_evaluation_data = []

# we only need the ref_doc_id and question from the test data

for record in test_data:
    json_record = json.loads(record)
    retriever_evaluation_data.append({"ref_doc_id":json_record["ref_doc_id"], "question":json_record["question"]})

In [8]:
if mlflow_available:
    pre_signed_url = mlflow_utils.create_presigned_url(mlflow_config["tracking_server_name"])
    display(Markdown(f"Our experiment results will be logged to MLFlow. You can view them from the [MLFlow UI]({pre_signed_url})") )

Our experiment results will be logged to MLFlow. You can view them from the MLFlow UI.

In [None]:
k = 3 # number of documents to retrieve
faiss_retriever = vec_db.as_retriever(search_kwargs={"k": k})


correct = 0
reciprocal_rank = 0
num_examples = 400 # Number of examples to evaluate
for i, eval_data in enumerate(retriever_evaluation_data[:num_examples]):
    returned_docs = faiss_retriever.invoke(eval_data["question"])
    returned_doc_ids = [doc.metadata["unique_id"] for doc in returned_docs]
    if eval_data["ref_doc_id"] in returned_doc_ids:
        correct += 1
        reciprocal_rank += 1 / (returned_doc_ids.index(eval_data["ref_doc_id"]) + 1)
    else:
        continue

hit_rate = correct / num_examples
mrr = reciprocal_rank / num_examples

print(f"Hit rate @k={k}: {hit_rate}")
print(f"MRR @k={k}: {mrr}")

if mlflow_available:
    mlflow.set_experiment("Retriever Evaluation")
    with mlflow.start_run(run_name="baseline_retriever"):
        mlflow.log_param("retriever", "FAISS")
        mlflow.log_param("k", k)
        mlflow.log_metric("hit_rate", hit_rate)
        mlflow.log_metric("mrr", mrr)

Hit rate @k=3: 0.9225
MRR @k=3: 0.8541666666666665
🏃 View run baseline_retriever at: https://us-west-2.experiments.sagemaker.aws/#/experiments/1/runs/68b15b86e8544ffeaec476d2e8a34db8
🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/1


The evaluation results above may vary, but we should see a hit rate of over 0.92 and an MRR of over 0.85. These results are quite good and indicate that the retriever is able to find the relevant context for most questions. If this were not the case, using a different embedding model or fine-tuning the retriever would be options to consider. A number of libraries can be used to fine-tune or train a custom embedding model for retrieval, including:
- [sentence-transformers](https://www.sbert.net/docs/sentence_transformer/training_overview.html)
- [RAGatouille](https://github.com/bclavie/RAGatouille)

There are other ways to improve the retriever performance such as using hybrid search that combines both dense and sparse retrieval methods. 

For example, we can improve the performance of the above retriever by ensembling it with a sparse retriever like BM25. This tends to work well with domain-specific datasets, as it combines the strengths of keyword search with those of semantic search. We'll use LangChain's [EnsembleRetriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/ensemble/) to combine the dense retriever with BM25. Note, however, that many vector databases, such as [OpenSearch](https://opensearch.org/docs/latest/search-plugins/hybrid-search/), offer hybrid search capabilities out of the box.
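
One common way to merge a dense and a sparse ranking is weighted Reciprocal Rank Fusion (RRF), which the LangChain documentation describes as the approach behind `EnsembleRetriever`. The sketch below is a simplified illustration under assumptions, not the library's code: the function name `weighted_rrf`, the toy document ids, and the smoothing constant `c=60` (a conventional choice) are all illustrative.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one list.

    Each document receives weight / (rank + c) from every list it
    appears in; documents are then sorted by their total score.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["d1", "d2", "d3"]   # e.g. from the vector store
sparse_ranking = ["d3", "d4", "d1"]  # e.g. from BM25

fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.75, 0.25])
print(fused)
```

In this toy run the fused list has four entries even though each input list has three, because fusion keeps the union of the inputs. If the retriever you use behaves the same way, truncate the fused list to `k` before computing Hit Rate or MRR when you need a strict top-k comparison against a single retriever.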


In [10]:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
bm_25 = BM25Retriever.from_documents(documents)
bm_25.k = k


ensemble_retriever = EnsembleRetriever(
    retrievers=[faiss_retriever, bm_25], weights=[0.75, 0.25] # you can tune the weights here
)

correct = 0
reciprocal_rank = 0  # running sum of reciprocal ranks, averaged below to get the MRR
num_examples = 400 # Number of examples to evaluate
for i, eval_data in enumerate(retriever_evaluation_data[:num_examples]):
    returned_docs = ensemble_retriever.invoke(eval_data["question"])
    returned_doc_ids = [doc.metadata["unique_id"] for doc in returned_docs]
    if eval_data["ref_doc_id"] in returned_doc_ids:
        correct += 1
        reciprocal_rank += 1 / (returned_doc_ids.index(eval_data["ref_doc_id"]) + 1)
    else:
        continue

hit_rate = correct / num_examples
mrr = reciprocal_rank / num_examples

print(f"Hit rate with Hybrid Search @k={k}: {hit_rate}")
print(f"MRR with Hybrid Search @k={k}: {mrr}")

if mlflow_available:
    mlflow.set_experiment("Retriever Evaluation")
    with mlflow.start_run(run_name="hybrid_retriever"):
        mlflow.log_param("k", k)
        mlflow.log_param("retriever", "hybrid")
        mlflow.log_param("weights", ensemble_retriever.weights)
        mlflow.log_metric("hit_rate", hit_rate)
        mlflow.log_metric("mrr", mrr)

Hit rate with Hybrid Search @k=3: 0.9725
MRR with Hybrid Search @k=3: 0.8829166666666665
🏃 View run hybrid_retriever at: https://us-west-2.experiments.sagemaker.aws/#/experiments/1/runs/b6167f5d161f48d193539582813992d4
🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/1


You should see an improvement in the hit rate and MRR after ensembling with BM25.

### Build the Retrieval Augmented Generation (RAG) pipeline
Now that we are satisfied that the retriever is performing reasonably well, we can move on to the generation step. We'll build a basic chain that, given a question, retrieves the relevant context and invokes a Large Language Model to generate the answer. We will use the smaller `mistral.mistral-7b-instruct-v0:2` model to generate the responses; this is also the model that we will fine-tune in the subsequent notebooks.

In [11]:
from langchain_aws.chat_models import ChatBedrockConverse
from langchain_core.prompts import ChatPromptTemplate

llm_modelId = "mistral.mistral-7b-instruct-v0:2"

llm = ChatBedrockConverse(
    model_id=llm_modelId, max_tokens=1000, temperature=0,
    client=bedrock_runtime,
)


Below is the prompt template that will be used to generate the answer. It's a simple template that will provide basic single-turn functionality and not include any guardrails to constrain the interaction. This is a good starting point but in production scenarios, you would want to add more sophisticated guardrails to ensure the model generates safe and accurate responses.

In [12]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from operator import itemgetter

template = """You are a Banking Regulatory Compliance expert. You have been asked to provide guidance on the following question using the referenced regulations below.
If the referenced regulations do not provide an answer, indicate to the user that you are unable to provide an answer and suggest they consult with a legal expert.

----------------------
{context}
----------------------

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()

setup_and_retrieval = RunnableParallel(
    {"context": ensemble_retriever, "question": RunnablePassthrough()}
)

# produce an output that contains the answer and the context that was passed to the model
generate_answer = {"answer": prompt | llm | output_parser,
                   "context": itemgetter("context")}

chain = setup_and_retrieval | generate_answer
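
Note that the chain above passes the retrieved `Document` objects straight into the `{context}` placeholder, so they are likely rendered through their string representation, metadata included. A common alternative is to join only the text of each document before it reaches the prompt. The sketch below shows such a helper; `format_docs` is a hypothetical name, and the `Doc` dataclass is a stand-in used here only so the example runs without LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join the text of the retrieved documents, dropping their metadata."""
    return separator.join(doc.page_content for doc in docs)


retrieved = [
    Doc("Section 1: Reporting thresholds.", {"unique_id": "sec-1"}),
    Doc("Section 2: Record retention.", {"unique_id": "sec-2"}),
]
context_text = format_docs(retrieved)
print(context_text)
```

If you adopt something like this, one option is to apply it in a step placed just before `prompt` rather than inside `setup_and_retrieval`, so that the `context` key in the chain output still holds the original documents.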

Let's invoke the chain with a sample test question and examine the results.

In [13]:
sample_record = json.loads(test_data[10])
sample_question = sample_record["question"]
sample_answer = sample_record["answer"]
rprint(f"[bold green]Sample question:[/bold green] {sample_question}")
response = chain.invoke(sample_question)
generated_answer = response["answer"]
rprint(f"[bold green]Generated answer:[/bold green] {generated_answer}")
rprint(f"[bold green]Ground truth answer:[/bold green] {sample_answer}")

### RAG Evaluation
While a manual examination of the generated answers is one of the more reliable ways to evaluate the model, it is not scalable especially as we iterate on the pipeline. In this section we will leverage an automated evaluation framework [RAGAS](https://arxiv.org/abs/2309.15217) (Retrieval Augmented Generation Assessment) along with its implementation in the [ragas](https://docs.ragas.io/en/stable/index.html) python library. RAGAS proposes a number of metrics to evaluate the quality of the generated answers. We will use the following metrics:
- Faithfulness: measures the factual consistency of the generated answer against the given context.
- Answer Relevancy: assesses how pertinent the generated answer is to the given prompt.
- Answer Semantic Similarity: assesses the semantic resemblance between the generated answer and the ground truth.
- Answer Correctness: gauges the accuracy of the generated answer when compared to the ground truth.

RAGAS uses an LLM as a judge for many of the metrics, and as such it can be very sensitive to the choice of LLM and to generation parameters such as temperature. Metrics may vary significantly from one LLM to another, and even with the same LLM you may see differences from run to run, even at low temperature settings. The metrics are still useful, however, because they give us a relative measure of performance improvement when comparing different models and pipelines from one iteration to the next.

In [14]:
from ragas.metrics import faithfulness, answer_similarity, answer_relevancy, answer_correctness
from ragas.integrations.langchain import EvaluatorChain
import math

We will use the Amazon Nova Lite model as the judge for the RAGAS metrics. We will also use the default prompts within RAGAS for the evaluation. 

In [15]:
import os 
os.environ["OPENAI_API_KEY"] = "12345" # Ragas raises exception if this is not set

eval_llm = ChatBedrockConverse(
    model_id="us.amazon.nova-lite-v1:0",
    max_tokens=1000,
    temperature=0,
    client=bedrock_runtime,
)

In [16]:
async def generate_answer_async(rag_chain, example):
    """Helper function to generate an answer asynchronously"""
    example = json.loads(example)
    response = await rag_chain.ainvoke(example["question"])
    contexts = [doc.page_content for doc in response["context"]]
    row = {"question": example["question"], "answer": response["answer"], "contexts": contexts, "ground_truth": example["answer"]}
    return row

Evaluation can be time consuming, so we will only use the first 25 examples from the test dataset.

In [26]:
# get the generated responses for the first NUM_SAMPLE_LLM_EVALUATION examples in the test data

NUM_SAMPLE_LLM_EVALUATION = 25
eval_rows = []
for example in test_data[:NUM_SAMPLE_LLM_EVALUATION]:
    eval_rows.append(generate_answer_async(chain, example))
event_loop = asyncio.get_event_loop()
eval_data= event_loop.run_until_complete(asyncio.gather(*eval_rows))

In [27]:
async def evaluate_llm_async(metric, rows):
    sem = asyncio.Semaphore(5)

    async def limited_invoke(row):
        async with sem:
            return await metric.ainvoke(row)

    tasks = [asyncio.create_task(limited_invoke(row)) for row in rows]
    return await asyncio.gather(*tasks)

Next we define the metrics for evaluation

In [28]:
faithfulness_metric = EvaluatorChain(metric=faithfulness, llm=eval_llm, embeddings=embed_model)
answer_relevancy_metric = EvaluatorChain(metric=answer_relevancy, llm=eval_llm, embeddings=embed_model)
answer_similarity_metric = EvaluatorChain(metric=answer_similarity, llm=eval_llm, embeddings=embed_model)
answer_correctness_metric = EvaluatorChain(metric=answer_correctness, llm=eval_llm, embeddings=embed_model)

[**Faithfulness:**](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/faithfulness/) measures the extent to which the claims in the generated answer are supported by the context. It is calculated as the ratio of the number of claims in the generated answer that are supported by the context to the total number of claims in the generated answer. In other words, it helps us detect hallucinations, as we would expect all claims in the generated answer to be supported by the context.
It does not reflect the accuracy or correctness of the claims, only whether they are supported by the context.
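
As a worked example of the ratio described above, suppose the judge LLM has already split an answer into four claims and decided, for each one, whether the context supports it. The function name and the verdict list below are hypothetical; in ragas, both the claim extraction and the verdicts come from the evaluator LLM.

```python
def faithfulness_ratio(claim_verdicts):
    """Return supported claims divided by total claims.

    `claim_verdicts` holds one boolean per claim in the generated answer:
    True when the retrieved context supports the claim.
    """
    if not claim_verdicts:
        # No claims to judge, so there is no meaningful score
        return float("nan")
    return sum(claim_verdicts) / len(claim_verdicts)


# Hypothetical verdicts for an answer containing four claims
verdicts = [True, True, False, True]
score = faithfulness_ratio(verdicts)
print(score)  # 3 supported claims out of 4
```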

**NOTE:** If you see a message `Failed to parse output. Returning None.` during the evaluation, it simply means that ragas was unable to parse the output from the model. This can happen if the model generates an output that is not in the expected format. These samples will be ignored when calculating the aggregate metric.

In [None]:
faithfulness_evals = event_loop.run_until_complete(evaluate_llm_async(faithfulness_metric, eval_data))
faithfulness_scores = [eval["faithfulness"] for eval in faithfulness_evals if not math.isnan(eval["faithfulness"])]
faithfulness_score = sum(faithfulness_scores) / len(faithfulness_scores)
print("Faithfulness Score: ", faithfulness_score)

In [None]:
# you can filter on the low scoring examples for further analysis
# [e for e in faithfulness_evals if e["faithfulness"] < 0.5]

[**Answer Relevancy:**](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/answer_relevance/) attempts to measure how pertinent the generated answer is to the given prompt. It works by having the evaluator LLM generate synthetic questions based on the generated answer and then calculating the average semantic similarity between the given question and the synthetic questions. The idea is that a more complete and pertinent answer should yield synthetic questions that are more similar to the given question. 
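
The averaging step described above can be sketched with made-up two-dimensional vectors standing in for real embeddings. The function names and vectors are illustrative assumptions; in ragas, the synthetic questions are generated by the evaluator LLM and embedded with the configured embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_question_similarity(question_vec, synthetic_vecs):
    """Average similarity between the original and the synthetic questions."""
    sims = [cosine(question_vec, vec) for vec in synthetic_vecs]
    return sum(sims) / len(sims)


# Made-up embeddings: one original question and three synthetic questions
question_vec = [1.0, 0.0]
synthetic_vecs = [
    [1.0, 0.0],  # same direction as the original question
    [0.0, 1.0],  # unrelated question
    [1.0, 1.0],  # partially related question
]

relevancy = mean_question_similarity(question_vec, synthetic_vecs)
print(relevancy)
```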

In [23]:
relevancy_evals = event_loop.run_until_complete(evaluate_llm_async(answer_relevancy_metric, eval_data))
relevancy_scores = [eval["answer_relevancy"] for eval in relevancy_evals if not math.isnan(eval["answer_relevancy"])]
relevancy_score = sum(relevancy_scores) / len(relevancy_scores)

print("Answer Relevancy Score: ", relevancy_score)

[**Answer semantic similarity:**](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/semantic_similarity/) measures the cosine similarity between the ground truth answer and the generated answer.  

In [24]:
answer_similarity_evals = event_loop.run_until_complete(evaluate_llm_async(answer_similarity_metric, eval_data))
similarity_scores = [eval["answer_similarity"] for eval in answer_similarity_evals if not math.isnan(eval["answer_similarity"])]
similarity_score = sum(similarity_scores) / len(similarity_scores)

print("Answer Similarity Score: ", similarity_score)

Answer Similarity Score:  0.8717184941452971


[**Answer Correctness**](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/factual_correctness/): combines factual similarity, assessed by the evaluator LLM, with the semantic similarity between the generated answer and the ground truth. It is calculated as a weighted average of the two. Factual similarity is calculated similarly to Faithfulness, but it also considers overlapping claims between the generated answer and the ground truth.
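
A rough sketch of that weighted average follows. It assumes factual similarity is an F1-style score over claims that are true positives (in both the answer and the ground truth), false positives (only in the answer), and false negatives (only in the ground truth), and it assumes weights of 0.75 and 0.25. Both are assumptions based on the ragas documentation and may differ in the version you have installed; the function names are hypothetical.

```python
def factual_f1(tp, fp, fn):
    """F1-style overlap between answer claims and ground-truth claims."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def weighted_correctness(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual similarity and semantic similarity."""
    w_factual, w_semantic = weights
    factual = factual_f1(tp, fp, fn)
    total_weight = w_factual + w_semantic
    return (w_factual * factual + w_semantic * semantic_similarity) / total_weight


# Hypothetical counts: 3 shared claims, 1 unsupported claim, 1 missed claim
correctness = weighted_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9)
print(correctness)  # 0.75 * 0.75 + 0.25 * 0.9
```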

In [25]:
answer_correctness_evals = event_loop.run_until_complete(evaluate_llm_async(answer_correctness_metric, eval_data))
correctness_scores = [eval["answer_correctness"] for eval in answer_correctness_evals if not math.isnan(eval["answer_correctness"])]
correctness_score = sum(correctness_scores) / len(correctness_scores)

print("Answer Correctness Score: ", correctness_score)

In [None]:
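# NOTE: this hard-coded value overwrites the relevancy score computed above.
# Skip this cell to keep the computed score.
relevancy_score = 0.78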

Let's save the evaluation metrics so we can compare them with the fine-tuned model in the subsequent notebooks.

In [24]:
with open("base_evaluation.json", "w") as f:
    metrics = {
        "faithfulness": faithfulness_score,
        "relevancy": relevancy_score,
        "similarity": similarity_score,
        "correctness": correctness_score,
    }
    json.dump(metrics, f)

if mlflow_available:
    mlflow.set_experiment("Banking Regulations RAG Evaluation")
    with mlflow.start_run(run_name="baseline_rag"):
        # These are evaluation results, so log them as metrics rather than params
        mlflow.log_metric("faithfulness", faithfulness_score)
        mlflow.log_metric("relevancy", relevancy_score)
        mlflow.log_metric("similarity", similarity_score)
        mlflow.log_metric("correctness", correctness_score)

### Conclusion
In this notebook, we have demonstrated how to use LangChain to build a hybrid search system that combines BM25 and FAISS retrievers to retrieve relevant documents for a given question. We have also shown how to use LangChain to generate answers to questions using a language model and evaluate the generated answers using Ragas metrics.