## Evaluate our flow

This notebook shows how we can take the Amazon Bedrock Knowledge Bases we created in `rag-router.ipynb` and put them in a structured flow using [Amazon Bedrock Prompt Flows](https://aws.amazon.com/bedrock/prompt-flows/).

This will allow us to have a versioned flow where we can specify all of the sequential components, as well as any conditions we want to model. 

We will start with a description of a RAG framework with additional modules (e.g., current date, web search, etc.) to generate a prompt flow as shown below.

In [None]:
import os
import time
import boto3
import logging
import pprint
import json
import pandas as pd
from tqdm import tqdm
from botocore.client import Config
from langchain_aws.chat_models.bedrock import ChatBedrock
from langchain_aws.embeddings.bedrock import BedrockEmbeddings
from langchain_aws.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    answer_similarity,
    answer_correctness,
    answer_relevancy,
    faithfulness
    )

# Model that Ragas uses as the judge LLM
model_id_eval = "anthropic.claude-3-haiku-20240307-v1:0"

# Generous timeouts for long-running evaluation calls; automatic retries are disabled
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
# Pass the config to the runtime client so the timeouts actually take effect
bedrock_client = boto3.client('bedrock-runtime', config=bedrock_config)
bedrock_agent = boto3.client(service_name="bedrock-agent", region_name="us-west-2")
bedrock_agent_rt = boto3.client(service_name="bedrock-agent-runtime", region_name="us-west-2")
llm_for_evaluation = ChatBedrock(model_id=model_id_eval, client=bedrock_client)
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",
                                       client=bedrock_client)



### Get our flows

Let's collect our flow ID and flow alias ID, which we need in order to invoke the flow.

In [None]:
# Take the first flow returned for this account and region
flow_summaries = bedrock_agent.list_flows()['flowSummaries']
flow_id = flow_summaries[0]['id']

# Take the first alias of that flow
flow_aliases = bedrock_agent.list_flow_aliases(flowIdentifier=flow_id)['flowAliasSummaries']
flow_alias = flow_aliases[0]['id']
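
Taking the first entry works when the account holds a single flow with a single alias. As a hedged sketch for the case where several exist, the hypothetical helper `find_id_by_name` below picks the entry whose `name` field matches. The flow name `rag-router-flow` and the IDs are made-up example values, not values from this notebook.

```python
def find_id_by_name(summaries, name):
    """Return the 'id' of the first summary whose 'name' equals `name`."""
    for summary in summaries:
        if summary.get("name") == name:
            return summary["id"]
    raise ValueError(f"No entry named {name!r} found")


# Stand-in data shaped like the 'flowSummaries' list (hypothetical values)
example_summaries = [
    {"id": "FLOW000001", "name": "other-flow"},
    {"id": "FLOW000002", "name": "rag-router-flow"},
]
selected_flow_id = find_id_by_name(example_summaries, "rag-router-flow")
```

The same helper could be applied to the `flowAliasSummaries` list, since those entries also carry `id` and `name` fields.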

### Load our dataset

Let's load the question-and-answer pairs we will use for evaluation.

In [None]:
question_set = pd.read_csv("data/questions_and_answers.csv")
question_set
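
The later cells read the `question`, `gt_answer`, and `llm_contexts` columns from this file, so it can help to check for them before invoking the flow. The sketch below is one possible check; `missing_columns` is a hypothetical helper, and the example column list is invented for illustration.

```python
# Columns that the later cells in this notebook read from the CSV
REQUIRED_COLUMNS = ("question", "gt_answer", "llm_contexts")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Stand-in for question_set.columns (hypothetical example)
example_columns = ["question", "gt_answer"]
missing = missing_columns(example_columns)
```

In the notebook this would be called as `missing_columns(question_set.columns)`, and an empty list means all expected columns are present.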

### Execute our flow

The cell below invokes our flow once per question and stores the responses.

In [None]:
flow_outputs = []
for question in tqdm(question_set['question']):

    response = bedrock_agent_rt.invoke_flow(
        flowAliasIdentifier=flow_alias,
        flowIdentifier=flow_id,
        inputs=[
            {
                'content': {
                    'document': question
                },
                'nodeName': 'UserInput',
                'nodeOutputName': 'document'
            },
        ]
    )
    # The response is an event stream; keep only the events that carry flow output
    events = list(response['responseStream'])
    output_events = [event for event in events if 'flowOutputEvent' in event]

    flow_outputs.append(output_events[0]['flowOutputEvent']['content']['document'])
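
The response stream can carry more than the output event. As a minimal sketch, the hypothetical helper `extract_flow_document` below scans a list of events, returns the first output document, and raises if the stream reports an unsuccessful completion. It assumes a `flowCompletionEvent` with a `completionReason` of `SUCCESS` marks a successful run, and the example events are invented stand-ins for a real stream.

```python
def extract_flow_document(events):
    """Return the document from the first flowOutputEvent in `events`.

    Raises RuntimeError if a completion event reports a reason other than
    'SUCCESS' (assumed success value) or if no output event is present.
    """
    document = None
    for event in events:
        if "flowOutputEvent" in event:
            if document is None:
                document = event["flowOutputEvent"]["content"]["document"]
        elif "flowCompletionEvent" in event:
            reason = event["flowCompletionEvent"].get("completionReason")
            if reason != "SUCCESS":
                raise RuntimeError(f"Flow did not complete successfully: {reason}")
    if document is None:
        raise RuntimeError("No flowOutputEvent found in the response stream")
    return document


# Invented events shaped like items of response['responseStream']
example_events = [
    {"flowOutputEvent": {"content": {"document": "Paris"}}},
    {"flowCompletionEvent": {"completionReason": "SUCCESS"}},
]
answer = extract_flow_document(example_events)
```

Inside the loop above, this could be used as `extract_flow_document(list(response['responseStream']))`.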
    
    

### Execute evaluation

Now that we have our prompt flow responses, we can evaluate them using the Ragas library.

In [None]:
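import ast

# The CSV stores each list of contexts as a string; parse it back into a Python list.
# ast.literal_eval only accepts Python literals, so it is safer than eval.
context_list = []
for context in tqdm(question_set['llm_contexts']):
    context_list.append(ast.literal_eval(context))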

metrics = [
    # context_precision,
    # context_recall,  # currently this metric might trigger timeout error raised by bedrock: ValueError: Error raised by bedrock service: Read timeout on endpoint URL: "https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-v2/invoke"
    answer_similarity,
    answer_correctness,
    # answer_relevancy,
    # faithfulness
]

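# Map the column names Ragas expects to the column names in our dataset
column_map = {
    "question": "question",
    "contexts": "llm_contexts",
    "answer": "llm_answer",
    "reference": "reference",
}

# Convert pandas Series to plain lists before building the dataset
ragas_dataset = Dataset.from_dict({
    "question": question_set['question'].tolist(),
    "llm_answer": flow_outputs,
    "reference": question_set['gt_answer'].tolist(),
    "llm_contexts": context_list,
})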

# Evaluate
eval_result = evaluate(
    ragas_dataset,
    metrics=metrics,
    column_map=column_map,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
    raise_exceptions=False,
)
eval_result
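
Because `raise_exceptions=False` is set, a failed metric call does not stop the run. The sketch below assumes such rows end up with a NaN score, which is worth confirming for your Ragas version. The hypothetical helper `mean_ignoring_nan` averages one metric while counting the rows it skipped, and the example rows are invented rather than real results.

```python
import math


def mean_ignoring_nan(rows, metric):
    """Average `metric` over `rows`, skipping missing or NaN scores.

    Returns (mean, skipped_count); mean is None when no row has a score.
    """
    scores = []
    skipped = 0
    for row in rows:
        value = row.get(metric)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            skipped += 1
        else:
            scores.append(value)
    mean = sum(scores) / len(scores) if scores else None
    return mean, skipped


# Invented per-question scores, not output from this notebook
example_rows = [
    {"answer_correctness": 0.5},
    {"answer_correctness": float("nan")},
    {"answer_correctness": 1.0},
]
mean_score, skipped_rows = mean_ignoring_nan(example_rows, "answer_correctness")
```

If your Ragas version exposes `eval_result.to_pandas()`, the per-question rows could be obtained with `eval_result.to_pandas().to_dict("records")` and passed to this helper.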