<div class="alert alert-block alert-info">

# RAG System Evaluation
    
This notebook follows up on the previous notebook, in which we explored the overall evaluation approach and a RAG system's overall accuracy.

In this notebook we take a closer look at specific RAG metrics and explore how different components and configurations affect overall accuracy.



## Solution architecture
<img src="https://d3q8adh3y5sxpk.cloudfront.net/meetingrecordings/modelevaluation/architecture.png" alt="LLM selection process" width="900" height="550">

Based on the solution architecture, we experiment with the RAG components below and evaluate their impact on several metrics relevant for RAG.

- 1) Embedding model: amazon.titan-embed-text-v1 vs amazon.titan-e1t-medium 
- 2) Text Splitter: TokenTextSplitter vs CharacterTextSplitter
- 3) Retriever: OpenSearch VectorStoreRetriever search types “similarity” vs “mmr”
- 4) Prompt Template: For each LLM we evaluate two different prompt templates
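
The four components above span a small experiment grid. As a minimal sketch, assuming two options per component, the combinations could be enumerated with `itertools.product`. The dictionary keys, the `template_1`/`template_2` labels, and the function name are illustrative placeholders, not part of the helper functions defined later in this notebook.

```python
from itertools import product

# Options taken from the list above; the prompt template labels are placeholders.
experiment_grid = {
    "embedding_model": ["amazon.titan-embed-text-v1", "amazon.titan-e1t-medium"],
    "text_splitter": ["TokenTextSplitter", "CharacterTextSplitter"],
    "search_type": ["similarity", "mmr"],
    "prompt_template": ["template_1", "template_2"],
}

def enumerate_configurations(grid):
    """Return one dict per combination of component options."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]

configurations = enumerate_configurations(experiment_grid)
print(len(configurations))  # 2 * 2 * 2 * 2 = 16 combinations per LLM
```

Evaluating every combination for every LLM multiplies the number of runs quickly, so in practice it may be preferable to vary one component at a time against a fixed baseline.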


## RAG evaluation metrics

This notebook explores the following metrics:

LangSmith evaluators: 
-  a. "cot_qa"
-  b. "conciseness"
-  c. "relevance"

RAGAS metrics: 
-  a. context_precision
-  b. faithfulness
-  c. context_recall
-  d. answer_relevancy

LlamaIndex: 
-  a. Faithfulness: measure if the response from a query engine matches any source nodes
-  b. Relevancy: measure if the response and source nodes match the query
-  c. Correctness: assess the relevance and correctness of a generated answer against a reference answer
-  d. Semantic Similarity: evaluates the quality of a question answering system via semantic similarity

Further information on RAG evaluation metrics can be found here: https://blog.worldline.tech/2024/01/12/metric-driven-rag-development.html
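
To build intuition for the two retrieval-oriented RAGAS metrics, the sketch below computes a rank-aware context precision and a context recall from binary labels supplied by hand. This is a simplification under stated assumptions: RAGAS obtains such judgments from an LLM, and the exact formulas may differ between RAGAS versions. The function names are illustrative.

```python
def context_precision_at_k(relevance):
    """Average of precision@k over the ranks that hold a relevant chunk.

    relevance: list of 0/1 labels for the retrieved chunks, in ranked order.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank
    return weighted_sum / hits if hits else 0.0

def context_recall(attributed):
    """Share of ground-truth statements that can be attributed to the retrieved context.

    attributed: list of 0/1 labels, one per statement in the reference answer.
    """
    return sum(attributed) / len(attributed) if attributed else 0.0

print(round(context_precision_at_k([1, 0, 1]), 3))  # 0.833
print(context_recall([1, 1, 0, 1]))  # 0.75
```

Context precision rewards retrievers that rank relevant chunks first, while context recall only asks whether the needed information was retrieved at all.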

In [None]:
# install dependencies
%pip install --force-reinstall -r requirements.txt

In [None]:
# Append your RagasBedrock package path
import sys
sys.path.append("./ragas/")

In [1]:
# restart kernel to ensure proper version of libraries is loaded
from IPython.display import display_html
def restartkernel() :
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)
restartkernel()

In [1]:
!pip list | grep -E "awscli|boto3|botocore|langchain|mlflow|plotly|tiktoken|nltk|python-dotenv|xmltodict|requests-aws4auth|pypdf|opensearch-py|sagemaker|nest-asyncio"
# also review requirements.txt for reference if needed

awscli                    1.32.19
boto3                     1.34.36
botocore                  1.34.36
langchain                 0.1.5
langchain-community       0.0.17
langchain-core            0.1.18
langchain-openai          0.0.5
langchainhub              0.1.14
mlflow                    2.10.0
mypy-boto3-bedrock        1.34.0
nest-asyncio              1.6.0
nltk                      3.8.1
opensearch-py             2.4.2
plotly                    5.9.0
pypdf                     3.17.4
python-dotenv             1.0.0
requests-aws4auth         1.2.3
sagemaker                 2.207.1
tiktoken                  0.5.2
xmltodict                 0.13.0

In [2]:
# load environment variables 
import boto3
import os
import botocore
from botocore.config import Config
import langchain
import sagemaker
import pandas as pd

from langchain.llms.bedrock import Bedrock
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from typing import Dict

import json
import requests
import csv
import time
import nltk
import sys

from dotenv import load_dotenv, find_dotenv
from mlflow import MlflowClient

# load environment variables that are stored in the local file dev-mlflow.env
load_dotenv(find_dotenv('dev-mlflow.env'),override=True)

session = sagemaker.Session()
bucket = session.default_bucket()

# fail early with a clear message if a required variable is missing
# (OPENSEARCH_COLLECTION may be set to an empty string to create a new collection below)
required_vars = ['OPENSEARCH_COLLECTION', 'AWS_ACCESS_KEY', 'AWS_SECRET_TOKEN', 'REGION', 'MLFLOW_TRACKING_URI']
missing_vars = [name for name in required_vars if os.getenv(name) is None]
if missing_vars:
    raise EnvironmentError(f'Missing environment variables: {missing_vars}')

# Initialize mlflow client

mlflow_client = MlflowClient(tracking_uri=os.environ['MLFLOW_TRACKING_URI'])

# Initialize Bedrock runtime
config = Config(
   retries = {
      'max_attempts': 8
   }
)
bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        config=config
)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/huthmac/Library/Application Support/sagemaker/config.yaml


In [17]:
# Create a new mlflow experiment

experiment_description = (
    "RAG system evaluation project. "
    "This experiment compares RAG configurations on the Amazon 10-K dataset."
)

experiment_tags = {
    "project_name": "rag-eval",
    "use_case": "information extraction",
    "team": "aws-ai-ml-analytics",
    "source": "Amazon 10k",
    "mlflow.note.content": experiment_description,
}

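experiment_name = "RAG_system_accuracy"
# create_experiment raises RESOURCE_ALREADY_EXISTS when the name is taken,
# so reuse the existing experiment when this cell is re-run
existing_experiment = mlflow_client.get_experiment_by_name(experiment_name)
if existing_experiment is None:
    llm_experiment = mlflow_client.create_experiment(name=experiment_name, tags=experiment_tags)
else:
    llm_experiment = existing_experiment.experiment_id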

# Use search_experiments() to search on the project_name tag key
rag_experiment = mlflow_client.search_experiments(
    filter_string="tags.`project_name` = 'rag-eval'"
)

print(rag_experiment[0])

RestException: RESOURCE_ALREADY_EXISTS: Experiment(name=LLM_accuracy) already exists. Error: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(pymysql.err.IntegrityError) (1062, "Duplicate entry 'LLM_accuracy' for key 'experiments.name'")
[SQL: INSERT INTO experiments (name, artifact_location, lifecycle_stage, creation_time, last_update_time) VALUES (%(name)s, %(artifact_location)s, %(lifecycle_stage)s, %(creation_time)s, %(last_update_time)s)]
[parameters: {'name': 'LLM_accuracy', 'artifact_location': '', 'lifecycle_stage': 'active', 'creation_time': 1707334252751, 'last_update_time': 1707334252751}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

In [4]:
# Initialize LLMs (Claude-V2, Cohere, LLama2)

## 1a. Initialize Claude-v2
llm01_inference_modifier = {
    "max_tokens_to_sample": 545,
    "temperature": 0,
    "stop_sequences": ["\n\nHuman"],
}
LLM_01_NAME= "anthropic.claude-v2"
llm01 = langchain.llms.bedrock.Bedrock( #create a Bedrock llm client
    model_id=LLM_01_NAME,
    model_kwargs=llm01_inference_modifier
)

## 1b. Initialize Cohere Command
llm02_inference_modifier = { 
    "max_tokens": 545,
    "temperature": 0,    
}
LLM_02_NAME= "cohere.command-text-v14"
llm02 = langchain.llms.bedrock.Bedrock( #create a Bedrock llm client
    model_id=LLM_02_NAME,
    model_kwargs=llm02_inference_modifier
)

## 1c. Initialize Llama
llm03_inference_modifier = { 
    "max_gen_len": 545,
    "top_p": 0.9, 
    "temperature": 0,    
}
LLM_03_NAME= "meta.llama2-13b-chat-v1"
llm03 = langchain.llms.bedrock.Bedrock( #create a Bedrock llm client
    model_id=LLM_03_NAME,
    model_kwargs=llm03_inference_modifier
)

llms = [
    llm01,
    llm02,
    llm03
]

## 1d. Initialize eval llm
inference_modifier = { 
    "max_gen_len": 545,
    "top_p": 0.9, 
    "temperature": 0,    
}
LLM_EVAL_NAME= "meta.llama2-70b-chat-v1"
langchain_eval_llm = langchain.llms.bedrock.Bedrock( #create a Bedrock llm client
    model_id=LLM_EVAL_NAME,
    model_kwargs=inference_modifier
)

In [5]:
## 2. download ground truth dataset
import xmltodict
url = 'https://d3q8adh3y5sxpk.cloudfront.net/rageval/qsdata_20.xml'

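# Send an HTTP GET request to download the file
response = requests.get(url)

# Stop with a clear error if the download failed (non-2xx status code)
response.raise_for_status()

# Parse the raw bytes so the XML parser handles the character encoding itself;
# response.text can mis-decode UTF-8 content when the server sends no charset
xml_data = xmltodict.parse(response.content)

# Convert the dictionary to a Pandas DataFrame
qa_dataset = pd.DataFrame(xml_data['data']['records'])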

prompts = []
for row in qa_dataset.itertuples():
    item = {
        'prompt': str(row[1]['Question']),
        'context': str(row[1]['Context']),
        'output': str(row[1]['Answer']['question_answer']),
        'page': str(row[1]['Page'])
    }
    prompts.append(item)

# example prompt
print(prompts[0])

{'prompt': "Who is Amazon's Senior Vice President and General Counsel?", 'context': 'Available Information\nOur investor relations website is amazon.com/ir and we encourage investors to use it as a way of easily finding information about us. We promptly make available on this website, free of charge, the reports that we file or furnish with the Securities and Exchange Commission (“SEC”), corporate governance information (including our Code of Business Conduct and Ethics), and select press releases.\nExecutive Officers and Directors\nThe following tables set forth certain information regarding our Executive Officers and Directors as of January 25, 2023:\nInformation About Our Executive Officers\nName Age Position\nJeffrey P. Bezos. Mr. Bezos founded Amazon.com in 1994 and has served as Executive Chair since July 2021. He has served as Chair of the Board since 1994 and served as Chief Executive Officer from May 1996 until July 2021, and as President from 1994 until June 1

In [9]:
# 3. Create token_text_splitter and char_text_splitter for evaluation

## 3a. download context / Amazon annual report
import numpy as np
import pypdf
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [ "https://d3q8adh3y5sxpk.cloudfront.net/rageval/AMZN-2023-10k.pdf"]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)
    

loader = PyPDFDirectoryLoader("./data/")
documents = loader.load()

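# chunk_size and chunk_overlap are counted in tokens for the token splitter
token_text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=100)
# the "CharacterTextSplitter" variant in this notebook is LangChain's RecursiveCharacterTextSplitter;
# its chunk_size and chunk_overlap are counted in characters
char_text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)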

token_text_list = token_text_splitter.split_documents(documents)
char_text_list = char_text_splitter.split_documents(documents)
    
print("TokenTextSplitter split documents into " + str(len(token_text_list)) + " chunks.\n")
print("CharacterTextSplitter split documents into " + str(len(char_text_list)) + " chunks.\n")

TokenTextSplitter split documents into 354 chunks.

CharacterTextSplitter split documents into 1364 chunks.
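
Both splitters use the same `chunk_size` and `chunk_overlap` values, but one counts tokens and the other counts characters. A token usually spans several characters, which is why the token splitter yields far fewer chunks. The sliding-window idea behind `chunk_overlap` can be sketched as follows. This is a simplified character-level illustration with a hypothetical function name, not how the LangChain splitters are implemented (they also respect separators or token boundaries).

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of one chunk at the head of the next, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.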



In [6]:
# 4. create vectors and store each set of document chunks in its own index in the vector database (OpenSearch Serverless)
## 4a. connect to OpenSearchServerless
import time
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

host = os.environ['OPENSEARCH_COLLECTION']  # serverless collection endpoint, without https://
print(f"host: {host}")
region = os.environ['REGION']  # e.g. us-east-1
print(f'region: {region}')


service = 'aoss'
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)

## 4b. create vectordatabase if it does not exist yet
if host == '':
    print('creating collection')
    vector_store_name = 'rag-eval'
    encryption_policy_name = "rag-eval-ep"
    network_policy_name = "rag-eval-np"
    access_policy_name = 'rag-eval-ap'
    identity = boto3.client('sts').get_caller_identity()['Arn']

    aoss_client = boto3.client('opensearchserverless')

    security_policy = aoss_client.create_security_policy(
        name = encryption_policy_name,
        policy = json.dumps(
            {
                'Rules': [{'Resource': ['collection/' + vector_store_name],
                'ResourceType': 'collection'}],
                'AWSOwnedKey': True
            }),
        type = 'encryption'
    )

    network_policy = aoss_client.create_security_policy(
        name = network_policy_name,
        policy = json.dumps(
            [
                {'Rules': [{'Resource': ['collection/' + vector_store_name],
                'ResourceType': 'collection'}],
                'AllowFromPublic': True}
            ]),
        type = 'network'
    )

    collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')

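    # poll until the collection leaves the CREATING state
    while True:
        status = aoss_client.list_collections(collectionFilters={'name':vector_store_name})['collectionSummaries'][0]['status']
        if status == 'ACTIVE':
            print(f'new collection {vector_store_name} created')
            break
        if status == 'FAILED':
            raise RuntimeError(f'creation of collection {vector_store_name} failed')
        time.sleep(10)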

    access_policy = aoss_client.create_access_policy(
        name = access_policy_name,
        policy = json.dumps(
            [
                {
                    'Rules': [
                        {
                            'Resource': ['collection/' + vector_store_name],
                            'Permission': [
                                'aoss:CreateCollectionItems',
                                'aoss:DeleteCollectionItems',
                                'aoss:UpdateCollectionItems',
                                'aoss:DescribeCollectionItems'],
                            'ResourceType': 'collection'
                        },
                        {
                            'Resource': ['index/' + vector_store_name + '/*'],
                            'Permission': [
                                'aoss:CreateIndex',
                                'aoss:DeleteIndex',
                                'aoss:UpdateIndex',
                                'aoss:DescribeIndex',
                                'aoss:ReadDocument',
                                'aoss:WriteDocument'],
                            'ResourceType': 'index'
                        }],
                    'Principal': [identity],
                    'Description': 'Easy data policy'}
            ]),
        type = 'data'
    )

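    # build the endpoint host name from the collection id and the region configured above
    # (the original lookup of AWS_DEFAULT_REGION fails with a TypeError when that variable is unset)
    host = collection['createCollectionDetail']['id'] + '.' + region + '.aoss.amazonaws.com'
    print(f'new aoss host: {host}')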

aospy_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    pool_maxsize=20,
)
print(f'aospy client:{aospy_client}')

host: lx0j8y3mu9ht6r5xv7za.us-east-1.aoss.amazonaws.com
region: us-east-1
aospy client:<OpenSearch([{'host': 'lx0j8y3mu9ht6r5xv7za.us-east-1.aoss.amazonaws.com', 'port': 443}])>
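
Collection creation is asynchronous, so the cell above polls the collection status. A minimal sketch of the same polling pattern with an upper time limit is shown below. The function name `wait_for_status` and its parameters are hypothetical, and the status source is passed in as a callable so the example runs without any AWS calls.

```python
import time

def wait_for_status(get_status, success="ACTIVE", failure="FAILED",
                    timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll get_status() until it returns the success status, fails, or times out."""
    waited = 0
    while waited <= timeout_s:
        status = get_status()
        if status == success:
            return status
        if status == failure:
            raise RuntimeError(f"resource entered status {failure}")
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"status {success} not reached within {timeout_s} seconds")

# Simulated status sequence in place of a real list_collections call
fake_statuses = iter(["CREATING", "CREATING", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)  # ACTIVE
```

With a real client, `get_status` would wrap the `list_collections` lookup used in the cell above.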


In [7]:
# 5. create and save prompt templates for eval
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain import hub


### Claude prompt templates
prompt_template_claude_1 = """
        Human: Given report provided, please read it and analyse the content.
        Please answer the following question: {question} basing the answer only on the information from the report
        and return it inside <question_answer></question_answer> XML tags.

        If a particular bit of information is not present, return an empty string.
        Each returned answer should be concise, remove extra information if possible.
        The report will be given between <report></report> XML tags.

        <report>
        {context}
        </report>

        Return the answer inside <question_answer></question_answer> XML tags.
        Assistant:"""

PROMPT_CLAUDE_1 = PromptTemplate(
    template=prompt_template_claude_1, input_variables=["question", "context"]
)

prompt_template_claude_2 = """
        Human: 
        You are a helpful, respectful, and honest assistant, dedicated to providing valuable and accurate information.

        Assistant:
        Understood. I will provide information based on the context given, without relying on prior knowledge.

        Human:
        If you don't see answer in the context just reply "not available" in XML tags.

        Assistant:
        Noted. I will respond with "not available" if the information is not available in the context.

        Human:
        Now read this context and answer the question and return the answer inside <question_answer></question_answer> XML tags. 
        {context}

        Assistant:
        Based on the provided context above and information from the retriever source, I will provide the answer in  and return it inside <question_answer></question_answer> XML tags to the below question
        {question}
        """

PROMPT_CLAUDE_2 = PromptTemplate(
    template=prompt_template_claude_2, input_variables=["question", "context"]
)

### Llama2 prompt templates
prompt_template_llama_1 = """
        [INST] Given report provided, please read it and analyse the content.
        Please answer the following question: {question} basing the answer only on the information from the report
        and return it inside <question_answer></question_answer> XML tags.

        If a particular bit of information is not present, return an empty string.
        Each returned answer should be concise, remove extra information if possible.
        The report will be given between <report></report> XML tags.

        <report>
        {context}
        </report>

        Return the answer inside <question_answer></question_answer> XML tags. [/INST]
        """
PROMPT_LLAMA_1 = PromptTemplate(
    template=prompt_template_llama_1, input_variables=["question", "context"]
)

prompt_template_llama_2 = """
        [INST]
        You are a helpful, respectful, and honest assistant, dedicated to providing valuable and accurate information.
        [/INST]

        Understood. I will provide information based on the context given, without relying on prior knowledge.

        [INST]
        If you don't see answer in the context just reply "not available" in XML tags.
        [/INST]

        Noted. I will respond with "not available" if the information is not available in the context.

        [INST]
        Now read this context and answer the question and return the answer inside <question_answer></question_answer> XML tags. 
        {context}
        [/INST]

        Based on the provided context above and information from the retriever source, I will provide the answer in  and return it inside <question_answer></question_answer> XML tags to the below question
        {question}
        """
PROMPT_LLAMA_2 = PromptTemplate(
    template=prompt_template_llama_2, input_variables=["question", "context"]
)


### Cohere Command prompt templates
prompt_template_command_1 = """
        Human: Given report provided, please read it and analyse the content.
        Please answer the following question: {question} basing the answer only on the information from the report
        and return it inside <question_answer></question_answer> XML tags.

        If a particular bit of information is not present, return an empty string.
        Each returned answer should be concise, remove extra information if possible.
        The report will be given between <report></report> XML tags.

        <report>
        {context}
        </report>

        Return the answer inside <question_answer></question_answer> XML tags.
        Assistant:"""

PROMPT_COMMAND_1 = PromptTemplate(
    template=prompt_template_command_1, input_variables=["question", "context"]
)

prompt_template_command_2 = """
        Human: 
        You are a helpful, respectful, and honest assistant, dedicated to providing valuable and accurate information.

        Assistant:
        Understood. I will provide information based on the context given, without relying on prior knowledge.

        Human:
        If you don't see answer in the context just reply "not available" in XML tags.

        Assistant:
        Noted. I will respond with "not available" if the information is not available in the context.

        Human:
        Now read this context and answer the question and return the answer inside <question_answer></question_answer> XML tags. 
        {context}

        Assistant:
        Based on the provided context above and information from the retriever source, I will provide the answer in  and return it inside <question_answer></question_answer> XML tags to the below question
        {question}
        """
PROMPT_COMMAND_2 = PromptTemplate(
    template=prompt_template_command_2, input_variables=["question", "context"]
)

# generic prompt template for all LLMs
generic_rag_template = hub.pull("rlm/rag-prompt")

prompttemplates = [
    {'template_name': 'generic_rag_template', 'template': generic_rag_template},
    {'template_name': 'prompt_template_claude_1', 'template': PROMPT_CLAUDE_1},
    {'template_name': 'prompt_template_claude_2', 'template': PROMPT_CLAUDE_2},
    {'template_name': 'prompt_template_command_1', 'template': PROMPT_COMMAND_1},
    {'template_name': 'prompt_template_command_2', 'template': PROMPT_COMMAND_2},
    {'template_name': 'prompt_template_llama_1', 'template': PROMPT_LLAMA_1},
    {'template_name': 'prompt_template_llama_2', 'template': PROMPT_LLAMA_2},
]
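
The model-specific templates above ask the LLM to wrap its answer in `<question_answer></question_answer>` XML tags. Before comparing a response with a ground-truth answer, it can help to strip those tags. A minimal sketch, assuming the tags appear at most once per response; the helper name `extract_answer` is illustrative and not used elsewhere in this notebook.

```python
import re

# Non-greedy match across newlines between the opening and closing tag
ANSWER_PATTERN = re.compile(r"<question_answer>(.*?)</question_answer>", re.DOTALL)

def extract_answer(raw_output):
    """Return the text inside <question_answer> tags, or the stripped raw output if no tags are found."""
    match = ANSWER_PATTERN.search(raw_output)
    if match is None:
        return raw_output.strip()
    return match.group(1).strip()

print(extract_answer(" <question_answer>\nDavid Zapolsky\n</question_answer>"))  # David Zapolsky
```

Falling back to the raw text keeps the helper usable with the generic template, which does not request XML tags.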

In [13]:
# create helper function to create RAG systems for evaluation
import random
from langchain.embeddings import BedrockEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain import hub
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings  
from langchain.vectorstores import OpenSearchVectorSearch

# # LangChain requires AWS4Auth
# from requests_aws4auth import AWS4Auth
# def get_aws4_auth():
#     region = os.environ.get("Region", os.environ["REGION"])
#     service = "aoss"
#     credentials = boto3.Session().get_credentials()
#     return AWS4Auth(
#         credentials.access_key,
#         credentials.secret_key,
#         region,
#         service,
#         session_token=credentials.token,
#     )
# aws4_auth = get_aws4_auth()


def create_rag(rag_system_details):
    existing_vector_store = rag_system_details["vector_store"]
    llm = rag_system_details["llm"]
    aospy_client = rag_system_details["aospy_client"]
    index_name = rag_system_details["index_name"]
    embedding_model = rag_system_details["embedding_model"]
    embedding_model_name = rag_system_details["embedding_model_name"]
    splitter_name = rag_system_details["splitter_name"]
    text_chunks = rag_system_details["text_chunks"]
    prompt_template_name = rag_system_details["prompt_template_name"]
    prompt_template = rag_system_details["prompt_template"]
    chain_type= rag_system_details["chain_type"]
    search_type= rag_system_details["search_type"]
    retriever_k = rag_system_details["retriever_k"]
    score_threshold = rag_system_details["score_threshold"]
    fetch_k = rag_system_details["fetch_k"]
    lambda_mult = rag_system_details["lambda_mult"]
    if existing_vector_store == "":
        # create index
        knn_index = {
            "settings": {
                "index.knn": True,
                
            },
            "mappings": {
                "properties": {
                    "vector_field": {
                        "type": "knn_vector",
                        "dimension": 1536,
                        "store": True
                    },
                    "text": {
                        "type": "text",
                        "store": True
                    },
                }
            }
        }

        # drop a stale index if one exists (a 404 "not found" response is ignored), then create it fresh
        aospy_client.indices.delete(index=index_name, ignore=[400, 404])
        aospy_client.indices.create(index=index_name, body=knn_index)

        # generate embeddings
        full_opensearch_endpoint = 'https://' + os.environ['OPENSEARCH_COLLECTION']

        vector_store = OpenSearchVectorSearch.from_documents(
                    index_name = index_name,
                    documents = text_chunks,
                    embedding = embedding_model,
                    opensearch_url=full_opensearch_endpoint,
                    http_auth=auth,
                    use_ssl=True,
                    verify_certs=True,
                    connection_class=RequestsHttpConnection,
                    timeout=60*3,
                    bulk_size=1000,
                    is_aoss=True
                )  
    else:
        vector_store = existing_vector_store
        
    random_identifier = random.randrange(100, 1000, 3)
    run_name=f'LLM_{llm.model_id}_embeddings{embedding_model_name}_split_{splitter_name}_template_{prompt_template_name}_search_{search_type}_chain_{chain_type}_k_{retriever_k}_{random_identifier}'

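    # the retriever expects the number of documents under the key "k";
    # the remaining settings only apply to specific search types
    search_kwargs = {"k": retriever_k}
    if search_type == "mmr":
        search_kwargs.update({"fetch_k": fetch_k, "lambda_mult": lambda_mult})
    elif search_type == "similarity_score_threshold":
        search_kwargs["score_threshold"] = score_threshold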

    retriever = vector_store.as_retriever(search_type = search_type, search_kwargs=search_kwargs)

    qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type=chain_type,
            retriever=retriever,
            return_source_documents=True,
            chain_type_kwargs = {"prompt": prompt_template}
        )
    
    return run_name, vector_store, qa_chain


In [14]:
# test RAG system
rag_system_details = {
    "aospy_client": aospy_client,
    "vector_store": "",
    "llm": llm01,
    "splitter_name": "TokenTextSplitter",
    "text_chunks": token_text_list,
    "index_name": "rag-eval-tokentextsplitter",
    "embedding_model_name": 'bedrock_embeddings',
    "embedding_model": BedrockEmbeddings(client=bedrock_runtime),
    "prompt_template_name": "PROMPT_CLAUDE_1",
    "prompt_template": PROMPT_CLAUDE_1,
    "chain_type": "stuff",
    "search_type": "similarity", # alternative: "mmr", or "similarity_score_threshold" (Default: similarity)
    "retriever_k": 4, # Amount of documents to return (Default: 4)
    "score_threshold": 0, # Minimum relevance threshold for similarity_score_threshold
    "fetch_k": 20, # Amount of documents to pass to MMR algorithm (Default: 20)
    "lambda_mult": 0.5, # Diversity of results returned by MMR, 1 for minimum diversity and 0 for maximum. (Default: 0.5)
         
}
run_name, vector_store, qa_chain = create_rag(rag_system_details)
query = "Who is Amazon's Senior Vice President and General Counsel?"
result = qa_chain.invoke(query)
print(result)

Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.integer)
Error in LangChainTracerV1.on_chain_end callback: ValueError('Unknown run type: retriever')


{'query': "Who is Amazon's Senior Vice President and General Counsel?", 'result': ' <question_answer>\nDavid Zapolsky\n</question_answer>', 'source_documents': []}
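
The `fetch_k` and `lambda_mult` settings only matter for the "mmr" search type. Maximal marginal relevance (MMR) first fetches `fetch_k` candidates by similarity and then greedily picks documents that are relevant to the query but not redundant with those already selected. The sketch below illustrates that selection step on toy two-dimensional vectors. It is a simplified illustration with hypothetical function names, not the LangChain implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Greedy MMR: return indices of up to k documents, in selection order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_index, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document (0 if none selected yet)
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        candidates.remove(best_index)
    return selected

query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns the two near-duplicates. A lower value penalizes redundancy, so the second pick becomes the less similar but more diverse document.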


In [33]:
import mlflow
from datasets import Dataset
import ragas
#import tqdm as notebook_tqdm
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    faithfulness,
    context_recall,
    answer_relevancy,
)


def run_ragas_eval(rag_system_eval_details, rag_system_details):

    experiment_name = rag_system_eval_details["experiment_name"]
    run_name = rag_system_eval_details["run_name"]
    qa_chain = rag_system_eval_details["qa_chain"]
    ground_truth = rag_system_eval_details["ground_truth"]

    llm_experiment = mlflow.set_experiment(experiment_name)

    # Initiate the MLflow run context
    with mlflow.start_run(run_name=run_name) as run:
        # list of metrics we're going to use from RAGAS
        metrics = [
            faithfulness,
            answer_relevancy,
            context_recall,
            context_precision,
            # harmfulness,
        ]

        basic_qa_ragas_dataset = []

        for item in ground_truth:
            result = qa_chain.invoke(item['prompt'])

            context_sequence = []
            for doc in result["source_documents"]:
                context_sequence.append(doc.page_content)

            basic_qa_ragas_dataset.append(
                    {"question" : item['prompt'],
                    "answer" : result["result"],
                    "contexts" : context_sequence,
                    "ground_truths" : [item['output']]
                    }
                )
        # build the evaluation dataset once, after all questions have been answered
        basic_qa_ragas_df = pd.DataFrame(basic_qa_ragas_dataset)
        basic_qa_ragas = Dataset.from_pandas(basic_qa_ragas_df)

        # evaluate
        ragas_result = evaluate(basic_qa_ragas, metrics=metrics)
        evals_df = ragas_result.to_pandas()
        

        # Log parameters used for the RAG system
        params = {
            "llm_name": rag_system_details["llm"].model_id,
            "splitter_name": rag_system_details["splitter_name"],
            "index_name": rag_system_details["index_name"],
            "embedding_model_name": rag_system_details["embedding_model_name"],
            "prompt_template_name": rag_system_details["prompt_template_name"],
            "chain_type": rag_system_details["chain_type"],
            "search_type": rag_system_details["search_type"],
            "retriever_k": rag_system_details["retriever_k"],
            "score_threshold": rag_system_details["score_threshold"],
            "fetch_k": rag_system_details["fetch_k"],
            "lambda_mult": rag_system_details["lambda_mult"]
        }
        mlflow.log_params(params)

        print(f'faithfulness mean: {evals_df["faithfulness"].mean()}')
        print(f'answer_relevancy mean: {evals_df["answer_relevancy"].mean()}')
        print(f'context_recall: {evals_df["context_recall"].mean()}')
        print(f'context_precision: {evals_df["context_precision"].mean()}')

        mlflow_metrics_results = {
            "faithfulness_mean": evals_df["faithfulness"].mean(),
            "answer_relevancy_mean": evals_df["answer_relevancy"].mean(),
            "context_recall_mean": evals_df["context_recall"].mean(),
            "context_precision_mean": evals_df["context_precision"].mean(),

        }
        # Log evaluation metrics that were calculated
        mlflow.log_metrics(mlflow_metrics_results)
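
The context-based metrics depend on the retrieved chunks, so a record with an empty `contexts` list (for example when a chain returns `'source_documents': []`, as in the test call earlier in this notebook) cannot be scored meaningfully. A minimal sketch of a pre-evaluation check is shown below. The function name `split_by_context` and the sample records are illustrative assumptions.

```python
def split_by_context(records):
    """Separate evaluation records into those with and without retrieved contexts."""
    with_context, without_context = [], []
    for record in records:
        if record.get("contexts"):
            with_context.append(record)
        else:
            without_context.append(record)
    return with_context, without_context

# Hand-made records in the same shape as the dataset built in run_ragas_eval
sample_records = [
    {"question": "q1", "answer": "a1", "contexts": ["chunk text"], "ground_truths": ["a1"]},
    {"question": "q2", "answer": "a2", "contexts": [], "ground_truths": ["a2"]},
]
scorable, unscorable = split_by_context(sample_records)
print(len(scorable), len(unscorable))  # 1 1
```

Reporting the number of records without contexts next to the metric means makes it easier to tell a retrieval failure apart from a low-quality answer.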

In [34]:
rag_system_eval_details = {
    "experiment_name": experiment_name,
    "run_name": run_name,
    "qa_chain": qa_chain,
    "ground_truth": prompts
}
run_ragas_eval(rag_system_eval_details, rag_system_details)

Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.integer)
Error in LangChainTracerV1.on_chain_end callback: ValueError('Unknown run type: retriever')

evaluating with [faithfulness]


Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.integer)
  0%|          | 0/2 [00:00<?, ?it/s]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.integer)
Error in LangChainTracerV1.on_chain_end callback: ValidationError(model='LLMRun', errors=[{'loc': ('response', 'generations', 0, 0, 'type'), 'msg': "unexpected value; permitted: 'Generation'", 'type': 'value_error.const', 'ctx': {'given': 'ChatGeneration', 'permitted': ('Generation',)}}])
 50%|█████     | 1/2 [00:55<00:55, 55.28s/it]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.i

evaluating with [answer_relevancy]


  0%|          | 0/2 [00:00<?, ?it/s]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.integer)
Error in LangChainTracerV1.on_chain_end callback: ValidationError(model='LLMRun', errors=[{'loc': ('response', 'generations', 0, 0, 'type'), 'msg': "unexpected value; permitted: 'Generation'", 'type': 'value_error.const', 'ctx': {'given': 'ChatGeneration', 'permitted': ('Generation',)}}])
 50%|█████     | 1/2 [00:18<00:18, 18.47s/it]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.i

evaluating with [context_recall]


  0%|          | 0/2 [00:00<?, ?it/s]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.integer)
Error in LangChainTracerV1.on_chain_end callback: ValidationError(model='LLMRun', errors=[{'loc': ('response', 'generations', 0, 0, 'type'), 'msg': "unexpected value; permitted: 'Generation'", 'type': 'value_error.const', 'ctx': {'given': 'ChatGeneration', 'permitted': ('Generation',)}}])
 50%|█████     | 1/2 [00:35<00:35, 35.69s/it]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.i

evaluating with [context_precision]


  0%|          | 0/2 [00:00<?, ?it/s]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.integer)
Error in LangChainTracerV1.on_chain_end callback: ValidationError(model='LLMRun', errors=[{'loc': ('response', 'generations', 0, 0, 'type'), 'msg': "unexpected value; permitted: 'Generation'", 'type': 'value_error.const', 'ctx': {'given': 'ChatGeneration', 'permitted': ('Generation',)}}])
 50%|█████     | 1/2 [00:33<00:33, 33.99s/it]Failed to load rag-eval-run-2 session, using empty session: 1 validation error for TracerSessionV1
id
  value is not a valid integer (type=type_error.i

faithfulness mean: 0.8616780045351474
answer_relevancy mean: 0.4933361895826723
context_recall: 0.15294117647058825
context_precision: 0.02869710494905517
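
The four RAGAS means above can be read per pipeline stage: context_precision and context_recall score the retrieved context, while faithfulness and answer_relevancy score the generated answer. The sketch below groups the scores printed above by stage. The grouping helper and its name are illustrative and not part of RAGAS.

```python
from statistics import mean

# Mean scores copied from the RAGAS run above.
ragas_scores = {
    "faithfulness": 0.8616780045351474,
    "answer_relevancy": 0.4933361895826723,
    "context_recall": 0.15294117647058825,
    "context_precision": 0.02869710494905517,
}

# Which pipeline stage each metric mainly reflects.
METRIC_STAGE = {
    "faithfulness": "generation",
    "answer_relevancy": "generation",
    "context_recall": "retrieval",
    "context_precision": "retrieval",
}


def mean_score_by_stage(scores):
    """Average the metric scores for each pipeline stage."""
    grouped = {}
    for metric, score in scores.items():
        grouped.setdefault(METRIC_STAGE[metric], []).append(score)
    return {stage: mean(values) for stage, values in grouped.items()}


stage_means = mean_score_by_stage(ragas_scores)
weakest_stage = min(stage_means, key=stage_means.get)
print(stage_means, weakest_stage)
```

For this run the retrieval metrics are far lower than the generation metrics, which suggests reviewing the embedding model, text splitter, and retriever settings before the prompt template.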


In [312]:
# LLAMA_INDEX EVAL

## use results from LLMInformationExtraction.ipynb
### query,llm,output,trainingoutput,context,trainingcontext,evaluationmetric,score,feedback
predictions_df = pd.read_csv('eval_run_predictions.csv')
print(f'column names: {predictions_df.columns}')
# note: DataFrame.count() reports non-null values per column, not a single row count
print(f'no of rows: {predictions_df.count()}')

column names: Index(['query', 'llm', 'output', 'trainingoutput', 'context',
       'trainingcontext', 'evaluationmetric', 'score', 'feedback'],
      dtype='object')
no of rows: query               63
llm                 63
output              63
trainingoutput      63
context             63
trainingcontext     60
evaluationmetric     0
score                0
feedback             0
dtype: int64
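
The semantic similarity metric used later in this notebook compares embeddings of the generated answer and the reference answer, and passes when the similarity reaches a threshold (0.8 in this notebook). The sketch below shows that idea with cosine similarity on plain Python lists. The toy vectors stand in for real embeddings, and the function names are illustrative rather than LlamaIndex APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_similarity(response_vec, reference_vec, threshold=0.8):
    """Return (score, passing) where passing means score >= threshold."""
    score = cosine_similarity(response_vec, reference_vec)
    return score, score >= threshold


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
close_score, close_pass = passes_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 2.5])
far_score, far_pass = passes_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(close_pass, far_pass)
```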


In [305]:
# run evaluation directly with llama_index on an existing dataframe
## Faithfulness: measure if the response from a query engine matches any source nodes
## Relevancy: measure if the response and source nodes match the query
## Correctness: assess the relevance and correctness of a generated answer against a reference answer
## Semantic Similarity: evaluates the quality of a question answering system via semantic similarity

from llama_index.llms import Bedrock
from llama_index.embeddings import BedrockEmbedding
from llama_index import (
    ServiceContext
)

from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator
)
from llama_index.embeddings import SimilarityMode
from llama_index import Document


model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 512
}

#LLM_EVAL_NAME= "meta.llama2-70b-chat-v1"
eval_llm = Bedrock(model="anthropic.claude-v2",
              #context_size=512,
              temperature=0,
              additional_kwargs={'max_tokens_to_sample': 512,'top_k': 10})

embed_model = BedrockEmbedding.from_credentials(
    model_name='amazon.titan-embed-g1-text-02'
)

service_context_eval = ServiceContext.from_defaults(
    llm=eval_llm, 
    embed_model=embed_model, 
)

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_eval)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context_eval)
similarity_threshold = 0.8
semantic_evaluator = SemanticSimilarityEvaluator(service_context=service_context_eval,
                                                 similarity_mode=SimilarityMode.DEFAULT,
                                                 similarity_threshold=similarity_threshold) # 0.8 default
correctness_evaluator = CorrectnessEvaluator(service_context=service_context_eval) # encountered parsing errors with this class

def run_evals(qa_df):
    results_list = []
    for row in qa_df.itertuples(index=False):
        question = row.query
        reference_answer = row.trainingoutput
        generated_answer = row.output
        # drop empty-list markers, then split the context string into one chunk per line
        retrieved_context = row.context.replace('[]', '').split("\n")
        #print(f'retrieved context: {retrieved_context}')
        #print(f'retrieved context type: {type(retrieved_context)}')

        faithfulness = False
        faithfulness_feedback  = 'not calculated'
        faithfulness_score =  0.0
        relevancy = False
        relevancy_feedback =  'not calculated'
        relevancy_score  =  0.0
        correctness = False
        correctness_feedback = 'not calculated'
        correctness_score = 0.0  # correctness is not calculated below, so default to 0.0 like the other metrics
        
        if retrieved_context and retrieved_context[0] != '':

            faithfulness_results = faithfulness_evaluator.evaluate(
                query=question,
                response=generated_answer,
                contexts=retrieved_context
                )
            
            relevancy_results = relevancy_evaluator.evaluate(
                query=question,
                response=generated_answer,
                contexts=retrieved_context
                )
            faithfulness = faithfulness_results.passing
            faithfulness_feedback  = faithfulness_results.feedback
            faithfulness_score =  faithfulness_results.score
            relevancy = relevancy_results.passing
            relevancy_feedback =  relevancy_results.feedback
            relevancy_score  =  relevancy_results.score
            
        semantic_results = semantic_evaluator.evaluate(
            response=generated_answer,
            reference=reference_answer
        )

        # correctness_results = correctness_evaluator.evaluate(
        #     query=question,
        #     response=generated_answer,
        #     reference=reference_answer
        # )

        # correctness= correctness_results.passing
        # correctness_feedback= correctness_results.feedback
        # correctness_score= correctness_results.score

        cur_result_dict = {
            "query": question,
            "generated_answer": generated_answer,
            "correctness": correctness,
            "correctness_feedback": correctness_feedback,
            "correctness_score": correctness_score,
            "semantic_similarity": semantic_results.passing,
            "semantic_similarity_threshold": similarity_threshold,
            "semantic_similarity_score": semantic_results.score,
            "faithfulness": faithfulness,
            "faithfulness_feedback": faithfulness_feedback,
            "faithfulness_score": faithfulness_score,
            "relevancy": relevancy,
            "relevancy_feedback": relevancy_feedback,
            "relevancy_score": relevancy_score
        }
        results_list.append(cur_result_dict)
    evals_df = pd.DataFrame(results_list)
    return evals_df
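
Once `run_evals` has produced its results, a per-metric pass rate and mean score make runs easier to compare. The sketch below works on a list of dictionaries shaped like `cur_result_dict` above. The sample rows are made up for illustration, and `summarize_results` is a hypothetical helper, not part of LlamaIndex. With the DataFrame returned by `run_evals`, `evals_df.to_dict('records')` yields the same shape.

```python
def summarize_results(results, metrics=("faithfulness", "relevancy", "semantic_similarity")):
    """Return {metric: {"pass_rate": ..., "mean_score": ...}} for a list of result dicts."""
    summary = {}
    for metric in metrics:
        passed = [bool(r[metric]) for r in results]
        scores = [float(r[f"{metric}_score"]) for r in results]
        summary[metric] = {
            "pass_rate": sum(passed) / len(passed) if passed else 0.0,
            "mean_score": sum(scores) / len(scores) if scores else 0.0,
        }
    return summary


# Made-up rows that mirror the keys of cur_result_dict.
sample_results = [
    {"faithfulness": True, "faithfulness_score": 1.0,
     "relevancy": True, "relevancy_score": 1.0,
     "semantic_similarity": True, "semantic_similarity_score": 0.9},
    {"faithfulness": False, "faithfulness_score": 0.0,
     "relevancy": True, "relevancy_score": 1.0,
     "semantic_similarity": False, "semantic_similarity_score": 0.7},
]

summary = summarize_results(sample_results)
for metric, stats in summary.items():
    print(f"{metric}: pass_rate={stats['pass_rate']:.2f}, mean_score={stats['mean_score']:.2f}")
```

Rows that were skipped because of an empty context keep their default `False` and `0.0` values, so they lower the faithfulness and relevancy averages.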

In [None]:
# TEST LLAMA_INDEX

In [15]:
## load data
!mkdir -p ./data

from urllib.request import urlretrieve
urls = [
    'https://d3q8adh3y5sxpk.cloudfront.net/rageval/AMZN-2023-10k.pdf',
]

filenames = [
    'AMZN-2023-10k.pdf',
]

data_root = "./data/"

for url, filename in zip(urls, filenames):
    file_path = data_root + filename
    urlretrieve(url, file_path)

In [22]:
from llama_index import (
    SimpleDirectoryReader,
    LLMPredictor,
    ServiceContext,
    get_response_synthesizer,
    set_global_service_context
)
from llama_index.indices.document_summary import DocumentSummaryIndex
import nest_asyncio

nest_asyncio.apply()


In [56]:
from llama_index.llms import Bedrock
from llama_index.embeddings import BedrockEmbedding

model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 512
}

llm = Bedrock(model="anthropic.claude-v2",
              #context_size=512,
              temperature=0,
              additional_kwargs={'max_tokens_to_sample': 512,'top_k': 10})

embed_model = BedrockEmbedding.from_credentials(
    model_name='amazon.titan-embed-g1-text-02'
)

# chunking parameters used when documents are split into nodes
chunk_size = 512
chunk_overlap = 20
service_context = ServiceContext.from_defaults(llm=llm, 
                                               embed_model=embed_model, 
                                               chunk_size=chunk_size,
                                               chunk_overlap=chunk_overlap,
                                            )
set_global_service_context(service_context)
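
To see what `chunk_size` and `chunk_overlap` control, the sketch below splits a token list into fixed-size windows in which consecutive windows share `chunk_overlap` tokens. It is a simplified illustration that treats whitespace-separated words as tokens. LlamaIndex's own splitter works with a real tokenizer and tries to respect sentence boundaries, so its chunks will differ.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # stop once the current window reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

A larger overlap repeats more text across neighboring chunks, which can help retrieval when an answer straddles a chunk boundary, at the cost of indexing more chunks.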



In [57]:
filename_fn = lambda filename: {"file_path": filename, "file_name": filename.replace('data/', "").replace('.pdf', "")}

# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data", file_metadata=filename_fn
).load_data()

In [72]:
#review metadata
print(documents[50].metadata)

{'page_label': '51', 'file_name': 'AMZN-2023-10k', 'file_path': 'data/AMZN-2023-10k.pdf'}
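
The metadata printed above comes from `filename_fn`, which derives `file_name` by string replacement. A sketch of the same idea using `pathlib` follows. `build_file_metadata` is a hypothetical name. Unlike the `replace` version, `Path.stem` drops every parent directory and only the final extension.

```python
from pathlib import Path


def build_file_metadata(file_path):
    """Return the metadata dict for a document path."""
    return {"file_path": file_path, "file_name": Path(file_path).stem}


metadata = build_file_metadata("data/AMZN-2023-10k.pdf")
print(metadata)
```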


In [59]:
from llama_index import SimpleDirectoryReader
from llama_index.vector_stores import (
    OpensearchVectorStore,
    OpensearchVectorClient,
)
from llama_index import VectorStoreIndex, StorageContext

In [61]:
import os

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

host = os.environ['OPENSEARCH_COLLECTION'] # OpenSearch endpoint, for example: my-test-domain.us-east-1.aoss.amazonaws.com
service = 'aoss'
region = 'us-east-1'
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)

endpoint = 'https://' + host
print(f'endpoint: {endpoint}')
index_name = "rag-eval-v1"
# OpensearchVectorClient stores text in this field by default
text_field = "content"
# OpensearchVectorClient stores embeddings in this field by default
embedding_field = "embedding"

client = OpensearchVectorClient(
    endpoint=endpoint,
    index=index_name, 
    dim=1536, 
    embedding_field=embedding_field, 
    text_field=text_field,
    http_auth=auth, 
    use_ssl=True, 
    verify_certs=True, 
    connection_class=RequestsHttpConnection, 
    timeout=10,
)
print(client)

endpoint: https://lx0j8y3mu9ht6r5xv7za.us-east-1.aoss.amazonaws.com
<llama_index.vector_stores.opensearch.OpensearchVectorClient object at 0x28d65c290>


In [62]:
# initialize vector store
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# initialize an index using our sample data and the client we just created
index = VectorStoreIndex.from_documents(
    documents=documents, storage_context=storage_context
)

In [63]:
# run query
query_engine = index.as_query_engine()
res = query_engine.query("Who is Amazon's Senior Vice President and General Counsel?")
res.response

'Empty Response'
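
An 'Empty Response' typically means the query engine had no retrieved text to synthesize from, so inspecting the retrieved nodes is a useful first check. The helper below is a sketch that only assumes the response object exposes `response` and `source_nodes` attributes, as LlamaIndex `Response` objects do. It is demonstrated with a stand-in object rather than a live query, and `diagnose_response` is a hypothetical name.

```python
from types import SimpleNamespace


def diagnose_response(res):
    """Describe whether a query response was backed by any retrieved nodes."""
    source_nodes = getattr(res, "source_nodes", None) or []
    if not source_nodes:
        return "no source nodes retrieved: check the index contents, field names, and filters"
    if not (res.response or "").strip() or res.response == "Empty Response":
        return f"{len(source_nodes)} nodes retrieved but no answer synthesized: check the LLM call"
    return f"answer backed by {len(source_nodes)} source nodes"


# Stand-in for a query result with nothing retrieved.
empty_result = SimpleNamespace(response="Empty Response", source_nodes=[])
print(diagnose_response(empty_result))
```

In the notebook this would be called as `diagnose_response(res)` right after the query.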

In [278]:
# query with metadata filtering (not working yet: returns 'Empty Response', like the unfiltered query above)
from llama_index import Document
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter, MetadataFilter,FilterOperator
import regex as re

# Create a query engine that only searches certain documents.
metadata_query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(
                key="term", value='{"file_path": "data/AMZN-2023-10k.pdf"}'
            )
            #ExactMatchFilter(key="file_name", value="AMZN-2023-10k")
            
        ]
    )
)

res = metadata_query_engine.query(
    "who is Amazon's Senior Vice President and General Counsel?"
)
res.response

'Empty Response'
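
Independent of the vector store, an exact-match metadata filter is meant to keep only nodes whose metadata value equals the requested value. The sketch below shows that behavior on plain dictionaries, using the metadata shape printed earlier. The second record is invented for contrast and `exact_match` is a hypothetical helper. It can serve as a reference when checking what the OpenSearch-backed filter returns.

```python
def exact_match(records, key, value):
    """Keep records whose metadata[key] equals value exactly."""
    return [r for r in records if r.get("metadata", {}).get(key) == value]


records = [
    {"text": "page 51 text",
     "metadata": {"page_label": "51", "file_name": "AMZN-2023-10k",
                  "file_path": "data/AMZN-2023-10k.pdf"}},
    # Invented record from another file, for contrast.
    {"text": "other text",
     "metadata": {"page_label": "1", "file_name": "other-report",
                  "file_path": "data/other-report.pdf"}},
]

matches = exact_match(records, key="file_name", value="AMZN-2023-10k")
print(len(matches))
```

Exact matching is case-sensitive, so a value such as `amzn-2023-10k` would match nothing here.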

In [151]:
# use the Amazon Bedrock Knowledge Bases retriever
import boto3
from botocore.config import Config
from langchain.chains import RetrievalQA
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever

kb_id = "<knowledge_base_id>"

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config)

retriever = AmazonKnowledgeBasesRetriever(
        knowledge_base_id=kb_id,
        retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
        client=bedrock_agent_client,  # reuse the client configured with the timeouts above
    )

# llm and claude_prompt are expected to be defined in earlier cells
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": claude_prompt}
)
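
With `chain_type="stuff"`, all retrieved documents are placed into a single prompt together with the question. The sketch below mimics that step in plain Python. The template text is a placeholder and not the notebook's `claude_prompt`, and `build_stuff_prompt` is a hypothetical helper.

```python
PLACEHOLDER_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_stuff_prompt(question, documents, template=PLACEHOLDER_TEMPLATE, separator="\n\n"):
    """Join every retrieved document into one context string and fill the template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


retrieved = ["Chunk A about executive officers.", "Chunk B about investor relations."]
prompt = build_stuff_prompt("Who is the General Counsel?", retrieved)
print(prompt)
```

Because every retrieved chunk is included, `numberOfResults` directly affects the prompt length.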

{'prompt': "Who is Amazon's Senior Vice President and General Counsel?", 'context': 'Available Information\nOur investor relations website is amazon.com/ir and we encourage investors to use it as a way of easily finding information about us. We promptly make available on this website, free of charge, the reports that we file or furnish with the Securities and Exchange Commission (“SEC”), corporate governance information (including our Code of Business Conduct and Ethics), and select press releases.\nExecutive Officers and Directors\nThe following tables set forth certain information regarding our Executive Officers and Directors as of January 25, 2023:\nInformation About Our Executive Officers\nName Age Position\nJeffrey P. Bezos. Mr. Bezos founded Amazon.com in 1994 and has served as Executive Chair since July 2021. He has served as Chair of the Board since 1994 and served as Chief Executive Officer from May 1996 until July 2021, and as President from 1994 until June 1