## Evaluate our flow

The prior notebook showed how to take the Amazon Bedrock Knowledge Bases we created in `setup_knowledge_bases.ipynb` and put them in a structured flow using [Amazon Bedrock Prompt Flows](https://aws.amazon.com/bedrock/prompt-flows/).

This notebook will demonstrate how we can evaluate the accuracy of our flows using the [Ragas](https://docs.ragas.io/en/stable/getstarted/) framework, a commonly used open source framework for evaluating the accuracy of RAG applications.

**Table of Contents:**

1. Complete prerequisites

    a. [Organize imports](#Organize%20imports)

    b. [Get our flows](#Create%20common%20objects)

2. [Load our dataset](#Load%20data%20to%20Knowledge%20Bases)

3. [Execute our flow](#Execute%20our%20flow)

4. [Measure accuracy](#Measure%20accuracy)

5. [Cleanup](#Cleanup)

6. [Conclusion](#Conclusion)

Let's start with some imports.

###  1a. Organize Imports <a id='Organize%20imports'></a>

In [None]:
import os
import time
import boto3
import logging
import pprint
import json
import pandas as pd
from tqdm import tqdm
from botocore.client import Config
from langchain_aws.chat_models.bedrock import ChatBedrock
from langchain_aws.embeddings.bedrock import BedrockEmbeddings
from langchain_aws.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    answer_similarity,
    answer_correctness,
    answer_relevancy,
    faithfulness
    )

model_id_eval = "anthropic.claude-3-haiku-20240307-v1:0"
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
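# Pass the timeout/retry settings defined above; without this, bedrock_config is never used
bedrock_client = boto3.client('bedrock-runtime', config=bedrock_config)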
bedrock_agent = boto3.client(service_name="bedrock-agent", region_name="us-west-2")
bedrock_agent_rt = boto3.client(service_name="bedrock-agent-runtime", region_name="us-west-2")
llm_for_evaluation = ChatBedrock(model_id= model_id_eval, client=bedrock_client)
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                                    client=bedrock_client)



###  1b. Get our flows <a id='Create%20common%20objects'></a>

Let's collect a flow ID and one of its alias IDs. Both are required to invoke the flow. The cell below takes the first flow in the account and the first alias of that flow.

In [None]:
flow_summaries = bedrock_agent.list_flows()['flowSummaries']
flow_id = flow_summaries[0]['id']

flow_aliases = bedrock_agent.list_flow_aliases(flowIdentifier = flow_id)['flowAliasSummaries']
flow_alias = flow_aliases[0]['id']
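
Taking `flow_summaries[0]` works when the account holds a single flow. With several flows, selecting by name is safer. The following is a minimal sketch, assuming each summary dict carries `name` and `id` keys as returned by `list_flows`; the helper `find_flow_id` and the example names and IDs are hypothetical.

```python
def find_flow_id(flow_summaries, flow_name):
    """Return the id of the single flow named flow_name (hypothetical helper)."""
    matches = [summary['id'] for summary in flow_summaries if summary['name'] == flow_name]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one flow named {flow_name!r}, found {len(matches)}")
    return matches[0]


# Illustrative data shaped like the 'flowSummaries' list; names and ids are made up
example_summaries = [
    {'id': 'FLOW000001', 'name': 'router-flow'},
    {'id': 'FLOW000002', 'name': 'other-flow'},
]
example_flow_id = find_flow_id(example_summaries, 'router-flow')
```

In the notebook, this could be called as `find_flow_id(flow_summaries, "<your flow name>")` in place of the index lookup.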

###  2. Load our dataset <a id='Load%20data%20to%20Knowledge%20Bases'></a>


Let's load the question and answer pairs that we will use for evaluation.

In [None]:
question_set = pd.read_csv("data/questions_and_answers.csv")
question_set
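
The later cells read the `question`, `gt_answer`, and `llm_contexts` columns from this file, so it can help to confirm they are present before invoking the flow. The following is a small sketch of such a check; the helper `missing_columns` is hypothetical and works on any iterable of column names.

```python
# Columns that the later cells of this notebook read from the CSV file
REQUIRED_COLUMNS = ('question', 'gt_answer', 'llm_contexts')


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from columns (hypothetical helper)."""
    present = set(columns)
    return [name for name in required if name not in present]


# Illustrative call with a column list that lacks 'llm_contexts'
example_missing = missing_columns(['question', 'gt_answer'])
```

With the DataFrame loaded above, the check would be `missing_columns(question_set.columns)`; an empty list means all three columns are available.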

###  3. Execute our flow <a id='Execute%20our%20flow'></a>

The cell below invokes our flow once per question and stores the responses.

In [None]:
flow_outputs = []
for question in tqdm(question_set['question']):

    response = bedrock_agent_rt.invoke_flow(
        flowAliasIdentifier=flow_alias,
        flowIdentifier=flow_id,
        inputs=[
            {
                'content': {
                    'document': question
                },
                'nodeName': 'FlowInputNode',
                'nodeOutputName': 'document'
            },
        ]
    )
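    # Read the streamed events and keep the one that carries the flow's output document
    events = list(response['responseStream'])
    output_events = [event for event in events if 'flowOutputEvent' in event]
    flow_outputs.append(output_events[0]['flowOutputEvent']['content']['document'])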
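
The loop above appends one output per question, so a single failed invocation would abort the run and leave `flow_outputs` shorter than `question_set`. One possible way to keep the two aligned is sketched below. The helper `collect_outputs` and the `fake_invoke` stand-in are hypothetical; in the notebook, `invoke` would wrap the `invoke_flow` call.

```python
def collect_outputs(questions, invoke):
    """Call invoke(question) for each question, recording None when a call fails."""
    outputs = []
    for question in questions:
        try:
            outputs.append(invoke(question))
        except Exception as error:  # broad on purpose: keep the batch running
            print(f"Flow invocation failed for {question!r}: {error}")
            outputs.append(None)
    return outputs


def fake_invoke(question):
    """Stand-in for the real flow call; fails on one question to show the behavior."""
    if question == 'bad question':
        raise RuntimeError('simulated failure')
    return f"answer to {question}"


example_outputs = collect_outputs(['q1', 'bad question', 'q2'], fake_invoke)
```

Rows whose output is `None` would need to be dropped, together with their question, reference, and contexts, before the evaluation step.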
    

###  4. Measure Accuracy <a id='Measure%20accuracy'></a>

Now that we have our prompt flow responses, we can evaluate them using the Ragas library.

In [None]:
import ast

# Each 'llm_contexts' cell holds a stringified list; parse it safely instead of calling eval()
context_list = [ast.literal_eval(context) for context in tqdm(question_set['llm_contexts'])]

metrics = [
    # context_precision,
    # context_recall,  # currently this metric might trigger a timeout error raised by bedrock: ValueError: Error raised by bedrock service: Read timeout on endpoint URL: "https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-v2/invoke"
    answer_similarity,
    answer_correctness,
    # answer_relevancy,
    # faithfulness
]

column_map = {
    "question": "question",
    "contexts": "llm_contexts",
    "answer": "llm_answer",
    "reference": "reference",
}

ragas_dataset = Dataset.from_dict({
    "question": question_set['question'],
    "llm_answer": flow_outputs,
    "reference": question_set['gt_answer'],
    "llm_contexts": context_list,
})

# Evaluate the flow outputs, using the Bedrock chat model as judge and the Bedrock embeddings model
eval_result = evaluate(
    ragas_dataset,
    metrics=metrics,
    column_map=column_map,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
    raise_exceptions=False,
)
eval_result
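
`answer_similarity` scores how close the flow's answer is to the reference answer in embedding space. As a rough illustration of that idea, and not Ragas's actual implementation, the sketch below computes the cosine similarity of two toy vectors. In the notebook the vectors would come from `bedrock_embeddings`; the function name and the example vectors here are assumptions made for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of an answer and a reference
identical = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # same direction, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])            # unrelated directions, 0
```

A score near 1 means the two texts point in almost the same direction in embedding space, which is why a paraphrased but correct answer can still score highly on this metric.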

## 5. Cleanup <a id='Cleanup'></a>

As a best practice, you should delete AWS resources that are no longer required. This will help you avoid incurring unnecessary costs.

<div class="alert alert-block alert-info">
<b>Note:</b> If you are running this notebook as part of a workshop session, all resources will be cleaned up by default at the end of the session. If you are running this notebook outside of a workshop session, you can clean up the resources associated with this notebook by uncommenting the following code cell and running it.
</div>

Running the following cell will delete the following resources:
* Knowledge Bases.
* Amazon OpenSearch Serverless Collections.
* The files that were uploaded to the S3 buckets; not the S3 buckets themselves.

In [None]:
'''
# Note: 'delete_kb' available through ./scripts/helper_functions.py
delete_kb(bedrock_agt_client, kb_1_id)
delete_kb(bedrock_agt_client, kb_2_id)

# Note: 'delete_aoss_collection' and 'get_aoss_collection_id' are available through ./scripts/helper_functions.py
delete_aoss_collection(aoss_client, get_aoss_collection_id(kb_1_aoss_collection_arn))
delete_aoss_collection(aoss_client, get_aoss_collection_id(kb_2_aoss_collection_arn))

# Note: 'delete_s3_object' available through ./scripts/helper_functions.py
delete_s3_object(s3_client, kb_1_s3_bucket_name, s3_key_prefix + '/' + kb_1_downloaded_file_name)
delete_s3_object(s3_client, kb_2_s3_bucket_name, s3_key_prefix + '/' + kb_2_downloaded_file_name)
'''

## 6. Conclusion <a id='Conclusion'></a>

We have now seen how to build an advanced, router-based RAG assistant with Amazon Bedrock Prompt Flows and how to evaluate its answers with Ragas. In the process, we learned how Amazon Bedrock, with its LLMs and Knowledge Bases (KBs), makes it easy to build generative AI applications.