# Adding observability and evaluation to a RAG-based Q&A application using Langfuse and the RAGAS framework with Knowledge Bases for Amazon Bedrock


### Context

In this notebook, we dive deep into building a Q&A application using the Retrieve API provided by Knowledge Bases for Amazon Bedrock, along with LangChain, Langfuse, and RAGAS for evaluating and debugging the responses. We query the knowledge base to get the desired number of document chunks based on similarity search, prompt Anthropic Claude with the query and the retrieved chunks, score the LLM responses using metrics such as faithfulness, answer relevancy, context precision, and harmfulness, and then ingest the traces into Langfuse.

### Knowledge Bases for Amazon Bedrock Introduction

With knowledge bases, you can securely connect foundation models (FMs) in Amazon Bedrock to your company
data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant,
context-specific, and accurate responses without continuously retraining the FM. All information retrieved from
knowledge bases comes with source attribution to improve transparency and minimize hallucinations. For more information on creating a knowledge base using console, please refer to this [post](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html).

### Pattern

We can implement the solution using the Retrieval Augmented Generation (RAG) pattern. RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context. Here, we perform RAG on the knowledge base created in the previous notebook or through the console.
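
As a rough illustration of the retrieve-then-augment flow described above, the sketch below uses a toy word-overlap ranking over an in-memory list in place of the knowledge base's vector search. The names `toy_retrieve` and `augment_prompt`, and the sample passages, are hypothetical and not part of any AWS or LangChain API.

```python
def toy_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by how many query words they share (stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment_prompt(query: str, chunks: list[str]) -> str:
    """Place the retrieved chunks in the prompt ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up passages, for illustration only.
corpus = [
    "Net cash from operating activities rose in 2021.",
    "The company opened two new offices.",
    "Cash used in investing activities peaked in 2020.",
]
query = "What happened to operating cash in 2021?"
chunks = toy_retrieve(query, corpus, k=1)
prompt = augment_prompt(query, chunks)
print(prompt)
```

The augmented prompt, not the bare question, is what would be sent to the model in the generation step.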

### Pre-requisites

Before we can answer the questions and calculate the scores, the documents must be processed and stored in a knowledge base. For this notebook, we use a `synthetic dataset for 10K financial reports` to create the knowledge base. You also need to create a project in Langfuse before ingesting traces into it.

1. Upload your documents (data source) to an Amazon S3 bucket.
2. Create a knowledge base using [01_create_ingest_documents_test_kb_multi_ds.ipynb](/knowledge-bases/01-rag-concepts/01_create_ingest_documents_test_kb_multi_ds.ipynb).
3. Note the knowledge base ID.

<!-- ![data_ingestion.png](./images/data_ingestion.png) -->
<img src="../images/data_ingestion.png" width=50% height=20% />

4. API keys: you need a Langfuse public and secret key to get started. Follow the steps below to create them in your project settings.

    1. Create [Langfuse account](https://cloud.langfuse.com/auth/sign-up)
    2. Create a new project
    3. Create new API credentials: on the project's Settings page, click Create new API keys to generate a new secret and public key pair. Store these keys securely; you will need them in this notebook.

    <!-- ![](./images/LangfuseAPIKEY.png)     -->
    <img src="./images/LangfuseAPIKEY.png" width=50% height=20% />


#### Notebook Walkthrough


For our notebook we will use the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock, which converts user queries into
embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom
workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, and the relevance `scores` of the retrievals.
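
To make the shape of that output concrete, here is a minimal sketch that flattens a response into rows. The `sample_response` dictionary is hand-written to mimic the structure returned by the `retrieve` call of a `bedrock-agent-runtime` client; its text, URI, and scores are made up, and `summarize_results` is a hypothetical helper.

```python
# Hand-written sample mimicking the shape of a Retrieve API response.
# The text, URI and scores are made up for illustration.
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Net cash provided by operating activities increased."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/octank-10k.pdf"}},
            "score": 0.71,
        },
        {
            "content": {"text": "Cash used in investing activities decreased."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/octank-10k.pdf"}},
            "score": 0.58,
        },
    ]
}


def summarize_results(response: dict) -> list[dict]:
    """Flatten each retrieval result into text, location type, URI and score."""
    rows = []
    for result in response.get("retrievalResults", []):
        location = result.get("location", {})
        rows.append(
            {
                "text": result["content"]["text"],
                "location_type": location.get("type"),
                "uri": location.get("s3Location", {}).get("uri"),
                "score": result.get("score"),
            }
        )
    return rows


rows = summarize_results(sample_response)
for row in rows:
    print(f'{row["score"]:.2f}  {row["location_type"]}  {row["uri"]}')
```

With a live client, the same helper could be applied to the dictionary returned by `retrieve(knowledgeBaseId=..., retrievalQuery={"text": ...})`.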


We will then augment the original prompt with the retrieved text chunks and pass it to the `anthropic.claude-3-haiku-20240307-v1:0` model, using prompt engineering patterns suited to your use case.

Finally, we will score the generated responses with RAGAS using metrics such as faithfulness, answer relevancy, context precision, and harmfulness, and ingest the traces into Langfuse. For evaluation, we will use `anthropic.claude-3-sonnet-20240229-v1:0`.

### Ask question


<!-- ![retrieveapi.png](./images/retrieveAPI.png) -->
<img src="./images/retrieveAPI.png" width=50% height=20% />


#### Evaluation
1. Use RAGAS to compute scores for:
    1. **Faithfulness:** Measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.
    2. **Answer Relevancy:** Assesses how pertinent the generated answer is to the given prompt. Incomplete answers and answers containing redundant information receive lower scores. It is computed from the question and the answer, with values between 0 and 1; higher scores indicate better relevancy.
    3. **Context Precision:** Evaluates whether the relevant items in the retrieved contexts are ranked at the top. Ideally, all relevant chunks appear at the top ranks. It is computed from the question and the contexts, with values between 0 and 1; higher scores indicate better precision.
    4. **Aspect Critique:** Assesses submissions against predefined aspects such as harmlessness and correctness. You can also define your own aspects to evaluate submissions against your specific criteria.
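
The arithmetic behind two of these metrics can be sketched as follows, assuming the definitions given in the RAGAS documentation: faithfulness as the share of answer claims supported by the context, and context precision as the mean of precision@k over the ranks that hold a relevant chunk. In RAGAS the per-claim and per-chunk verdicts come from the evaluator LLM; here they are supplied by hand, and both function names are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Two of three claims supported by the context.
print(faithfulness_score([True, True, False]))
# Same two relevant chunks, ranked differently.
print(context_precision_score([True, False, True]))
print(context_precision_score([False, True, True]))
```

The last two calls show why ranking matters: the same relevant chunks score lower when an irrelevant chunk is ranked first.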
    

### USE CASE:

#### Dataset

In this example, you will use Octank's financial 10-K reports (a synthetically generated dataset) as the text corpus for Q&A. This data is already ingested into the knowledge base. You will need the `knowledge base id` and the API keys from your Langfuse project to run this example.
In your own use case, you can sync different files for different domain topics and run this notebook in the same manner to evaluate model responses using the Retrieve API from Knowledge Bases.


### Python 3.10

⚠  For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠

### Setup

To run this notebook, install the dependencies: LangChain, RAGAS, Langfuse, and the pinned boto3 and botocore versions.


In [None]:
%pip install --upgrade pip
%pip install boto3==1.34.85 --force-reinstall --quiet
# botocore must come from the same release line as boto3 (1.34.x here)
%pip install botocore==1.34.85 --force-reinstall --quiet
%pip install langchain==0.0.342 --force-reinstall --quiet
%pip install ragas==0.0.20 --force-reinstall --quiet
# this notebook uses the langfuse.trace() API of the v2 Python SDK
%pip install "langfuse<3" --force-reinstall --quiet

#### Restart the kernel with the updated packages that are installed through the dependencies above

In [None]:
# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
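
After the restart, it can be useful to confirm which versions the kernel actually sees. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and the package list simply mirrors the install cell above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in ["boto3", "botocore", "langchain", "ragas", "langfuse"]:
    print(f"{package}: {installed_version(package) or 'not installed'}")
```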

### Follow the steps below to set up necessary packages

1. Import the libraries needed to create the `bedrock-runtime` client for invoking foundation models and the `bedrock-agent-runtime` client for using the Retrieve API provided by Knowledge Bases for Amazon Bedrock.
2. Import LangChain to:
   1. Initialize the Bedrock model `anthropic.claude-3-haiku-20240307-v1:0` as the large language model that performs query completions using the RAG pattern.
   2. Initialize the Bedrock model `anthropic.claude-3-sonnet-20240229-v1:0` as the large language model that performs RAG evaluation.
   3. Initialize the LangChain retriever integrated with Knowledge Bases.
   4. Later in the notebook, we will compose the retriever, the prompt, and the LLM into a chain to build our Q&A application.

In [None]:
import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever

pp = pprint.PrettyPrinter(indent=2)

kb_id = "<knowledge base-id>"  # replace with your knowledge base ID

# get the keys for your project from https://cloud.langfuse.com
LANGFUSE_PUBLIC_KEY = "<public-key>"  # replace with your public key
LANGFUSE_SECRET_KEY = "<secret-key>"  # replace with your secret key


bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config
                              )

model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 3000
}

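# Caveat: Claude 3 models on Bedrock are served through the Messages API. If this
# text-completion wrapper rejects the model ID, a chat wrapper (for example
# BedrockChat in newer LangChain releases) may be needed instead.
llm_for_text_generation = Bedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0",
              model_kwargs=model_kwargs_claude,
              streaming=True,
              client=bedrock_client,)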


llm_for_evaluation = Bedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0",
              model_kwargs=model_kwargs_claude,
              streaming=True,
              client = bedrock_client,)

bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",client=bedrock_client)
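
The cell above uses hard-coded placeholders for the knowledge base ID and the Langfuse keys. In line with the earlier advice to store the keys securely, one alternative is to read them from environment variables. This is a minimal sketch; the variable names, including `KB_ID`, are assumptions chosen to mirror the placeholders above, and `read_setting` and `missing_settings` are hypothetical helpers.

```python
import os


def read_setting(name: str, default: str = "") -> str:
    """Return an environment variable, falling back to a default when unset."""
    return os.environ.get(name, default)


def missing_settings(names: list[str]) -> list[str]:
    """List the names whose environment variable is unset or empty."""
    return [name for name in names if not read_setting(name)]


# Assumed variable names; adjust them to your own conventions.
required = ["LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "KB_ID"]
missing = missing_settings(required)
if missing:
    print("Set these environment variables before running:", ", ".join(missing))
```

Once the variables are set, `kb_id = read_setting("KB_ID")` would replace the placeholder assignment.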

### Retrieve API: Process flow 

Create an `AmazonKnowledgeBasesRetriever` object from LangChain, which calls the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock. The API converts user queries into
embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom
workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, and the relevance `scores` of the retrievals.

In [None]:

retriever = AmazonKnowledgeBasesRetriever(
        knowledge_base_id=kb_id,
        retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
        # endpoint_url=endpoint_url,
        # region_name="us-east-1",
        # credentials_profile_name="<profile_name>",
    )

`score`: Each returned text chunk carries a relevance score that indicates how closely the chunk matches the query.
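
One way to use the score is to drop weakly matching chunks before they reach the prompt. The sketch below works on plain dictionaries with made-up scores. The threshold and the `filter_by_score` helper are hypothetical; with the LangChain retriever, the score is expected to be exposed in each document's metadata, though that may depend on the LangChain version.

```python
def filter_by_score(chunks: list[dict], min_score: float) -> list[dict]:
    """Keep chunks at or above min_score, highest score first."""
    kept = [chunk for chunk in chunks if chunk["score"] >= min_score]
    return sorted(kept, key=lambda chunk: chunk["score"], reverse=True)


# Made-up chunks; real scores depend on the embedding model and vector store.
chunks = [
    {"text": "Operating cash flow grew.", "score": 0.44},
    {"text": "Financing inflows came from debt issuance.", "score": 0.69},
    {"text": "Office lease renewed.", "score": 0.21},
]
strong = filter_by_score(chunks, min_score=0.4)
print([chunk["score"] for chunk in strong])
```

A suitable threshold is data dependent, so it is worth inspecting the scores for a few representative queries before fixing one.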

### Prompt specific to the model to personalize responses 

Here, we use the prompt below to make the model act as a financial advisor AI system that answers questions using fact-based and statistical information when possible. We provide the `Retrieve API` responses from above as the `{context}` in the prompt for the model to refer to, along with the user's `{question}`.

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

PROMPT_TEMPLATE = """
    Human: You are a financial advisor AI system that provides answers to questions using fact-based and statistical information when possible.
    Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    <context>
    {context}
    </context>

    <question>
    {question}
    </question>

    The response should be specific and use statistics or numbers when possible.

    Assistant:"""

prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

# Setup RAG pipeline
rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm_for_text_generation
    | StrOutputParser() 
)
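
In the chain above, the retriever's output (a list of documents) is passed to the prompt as-is, so the `{context}` slot appears to receive the string form of that list, metadata included. A common alternative is to join only the chunk texts first. This is a minimal sketch; `Doc` is a stand-in for LangChain's document class that models only `page_content`, and `format_docs` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only page_content is modeled)."""
    page_content: str


def format_docs(docs: list) -> str:
    """Join chunk texts with blank lines so the prompt sees plain text."""
    return "\n\n".join(doc.page_content for doc in docs)


# Made-up chunk texts, for illustration only.
docs = [Doc("Revenue grew 12% in 2021."), Doc("Operating costs fell 3%.")]
context = format_docs(docs)
print(context)
```

In the chain, such a helper could be piped after the retriever, for example `{"context": retriever | format_docs, ...}`, assuming the installed LangChain version coerces plain functions into runnables.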

## Preparing the Evaluation Data

As RAGAS aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. In this case, all you need to prepare are the `questions`.

In [None]:
from datasets import Dataset

questions = [
    "What was the primary reason for the increase in net cash provided by operating activities for Octank Financial in 2021?",
    "In which year did Octank Financial have the highest net cash used in investing activities, and what was the primary reason for this?",
    "What was the primary source of cash inflows from financing activities for Octank Financial in 2021?",
    "Calculate the year-over-year percentage change in cash and cash equivalents for Octank Financial from 2020 to 2021.",
    "Based on the information provided, what can you infer about Octank Financial's overall financial health and growth prospects?"
]



## Evaluating the RAG application
First, import all the metrics you want to use from `ragas.metrics`. 

In [None]:
#import metrics
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision
)
from ragas.metrics.critique import harmfulness

from ragas.llms import LangchainLLM

ragas_bedrock_model = LangchainLLM(llm_for_evaluation)

#set embeddings model for evaluating answer relevancy metric
answer_relevancy.embeddings = bedrock_embeddings

#specify the metrics here
metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        harmfulness
    ]

# use the Bedrock evaluation model for every metric
for m in metrics:
    m.llm = ragas_bedrock_model



### Score each trace

This means you run the evaluations for each trace item, which gives you a much better idea of how each call to your RAG pipeline is performing.

Now let's initialize the Langfuse client SDK.

In [None]:
from langfuse import Langfuse

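# The host must match the region of your Langfuse project
# (EU: https://cloud.langfuse.com, US: https://us.cloud.langfuse.com).
langfuse = Langfuse(
    public_key=LANGFUSE_PUBLIC_KEY,
    secret_key=LANGFUSE_SECRET_KEY,
    host="https://us.cloud.langfuse.com",
)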


Here we define a utility function that scores a trace with the metrics you choose.

In [None]:
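def score_with_ragas(query, chunks, answer):
    """Score one (question, contexts, answer) tuple with every metric in `metrics`."""
    scores = {}
    for m in metrics:
        print(f"calculating {m.name}")
        # each metric receives the same single-row input
        scores[m.name] = m.score_single(
            {"question": query, "contexts": chunks, "answer": answer}
        )
    return scores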
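
To exercise the scoring loop without calling Bedrock, the metrics can be replaced by stubs. The sketch below assumes only what the function above relies on: each metric has a `name` and a `score_single` method that takes a row with `question`, `contexts`, and `answer`. `StubMetric` and `score_row` are hypothetical and not part of RAGAS.

```python
class StubMetric:
    """Hypothetical stand-in exposing the two members the scoring loop relies on."""

    def __init__(self, name: str, fixed_score: float):
        self.name = name
        self._fixed_score = fixed_score

    def score_single(self, row: dict) -> float:
        # A real metric would prompt the evaluator LLM; this returns a constant.
        missing = {"question", "contexts", "answer"} - row.keys()
        if missing:
            raise KeyError(f"row is missing {sorted(missing)}")
        return self._fixed_score


def score_row(metrics: list, query: str, chunks: list[str], answer: str) -> dict:
    """Same loop as the utility above, with the metrics passed in explicitly."""
    row = {"question": query, "contexts": chunks, "answer": answer}
    return {m.name: m.score_single(row) for m in metrics}


stub_metrics = [StubMetric("faithfulness", 1.0), StubMetric("harmfulness", 0.0)]
scores = score_row(stub_metrics, "What drove growth?", ["Sales rose."], "Higher sales.")
print(scores)
```

Passing the metrics as an argument, rather than reading a module-level list, makes it straightforward to swap in stubs like these when testing the tracing code.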

Now we compute the scores for each request. All steps are logged as spans in a single trace in Langfuse. Traces are the top-level entity in the Langfuse API. They represent an execution flow in an LLM application, usually triggered by an external event. Spans represent durations of units of work in a trace. You can read more about traces and spans in the [Langfuse documentation](https://langfuse.com/docs/tracing/overview). Once the scores are computed, you can add them to the trace in Langfuse.

In [None]:

for query in questions:
    # One trace per question; each step below is logged as a span on it.
    trace = langfuse.trace(name="rag", user_id="user1234", tags=["development"])

    # Retrieval step: fetch the chunks and log them as a span.
    contexts = [doc.page_content for doc in retriever.get_relevant_documents(query)]
    trace.span(
        name="retrieval",
        input={"question": query},
        output={"contexts": contexts},
    )

    # Generation step: the chain runs its own retrieval for the same query.
    answer = rag_chain.invoke(query)
    trace.span(
        name="generation",
        input={"question": query, "contexts": contexts},
        output={"answer": answer},
    )

    # Compute scores for the (question, contexts, answer) tuple and attach them to the trace.
    ragas_scores = score_with_ragas(query, contexts, answer)
    for m in metrics:
        trace.score(name=m.name, value=ragas_scores[m.name])

# Send any events still buffered by the Langfuse client.
langfuse.flush()

Now you can see the traces with the RAGAS metrics in the Langfuse UI.

<!-- ![langfuse](./images/rag-eval-online-langfuse.png) -->
<img src="./images/rag-eval-online-langfuse.png" width=50% height=50% />


Click on Traces to see the execution trace in detail. This is a simple example, so you only see a couple of steps on the right. Click on each step to explore it.

<!-- ![](./images/LangfuseTraceDetailsRetreive.jpg) -->
<img src="./images/LangfuseTraceDetailsRetreive.jpg" width=50% height=20% />
<!-- ![](./images/LangfuseTraceDetailsGeneration.jpg) -->
<img src="./images/LangfuseTraceDetailsGeneration.jpg" width=50% height=20% />

> Note: The scores above give a relative idea of the performance of your RAG application. Use them with caution and not as standalone scores. Also note that we used only 5 questions for evaluation. As a best practice, use enough data to cover different aspects of your documents when evaluating the model.

Based on the scores, you can review other components of your RAG workflow to further improve them. A few recommended options are to review your chunking strategy, refine the prompt instructions, and increase `numberOfResults` to provide additional context.
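
When deciding which component to revisit, it can help to average each metric across questions and to list the questions that fall below a chosen threshold. The following is a minimal sketch over made-up scores; `summarize_scores`, `low_scoring`, and the threshold value are hypothetical.

```python
from statistics import mean


def summarize_scores(per_question: list[dict]) -> dict:
    """Average each metric across questions."""
    if not per_question:
        return {}
    metric_names = per_question[0].keys()
    return {name: mean(row[name] for row in per_question) for name in metric_names}


def low_scoring(per_question: list[dict], metric: str, threshold: float) -> list[int]:
    """Indexes of questions whose score for `metric` falls below `threshold`."""
    return [i for i, row in enumerate(per_question) if row[metric] < threshold]


# Made-up scores for three questions, for illustration only.
collected = [
    {"faithfulness": 1.0, "answer_relevancy": 0.9},
    {"faithfulness": 0.5, "answer_relevancy": 0.8},
    {"faithfulness": 0.75, "answer_relevancy": 0.4},
]
print(summarize_scores(collected))
print(low_scoring(collected, "faithfulness", threshold=0.7))
```

Low faithfulness on a question tends to point at retrieval or prompt instructions, whereas low answer relevancy more often points at the prompt or the question itself, so inspecting the flagged traces in Langfuse is a reasonable next step.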

<div class="alert alert-block alert-warning">
<b>Note:</b> Remember to delete the knowledge base, the OpenSearch Serverless (OSS) index, and the related IAM roles and policies to avoid incurring charges.
</div>