### LLM Evaluation using LlamaIndex

The LlamaIndex end-to-end evaluation guide explains how to set up and run an evaluation with LlamaIndex: preparing data, creating an index, selecting a model, configuring evaluation criteria, and running the evaluation pipeline. It also covers how to analyze the results and iterate on improvements.

For detailed instructions, visit the [LlamaIndex End-to-End Evaluation](https://docs.llamaindex.ai/en/stable/optimizing/evaluation/e2e_evaluation/) documentation.

In [10]:
%pip install --upgrade pip
%pip install boto3==1.33.2 --force-reinstall --quiet
%pip install botocore==1.33.2 --force-reinstall --quiet
%pip install langchain==0.0.342 --force-reinstall --quiet
%pip install llama-index==0.9.3.post1 --force-reinstall --quiet

### LlamaIndex Evaluation

In [11]:
import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from llama_index import (
    ServiceContext,
    set_global_service_context
)
from langchain.embeddings.bedrock import BedrockEmbeddings
from llama_index.embeddings import LangchainEmbedding

pp = pprint.PrettyPrinter(indent=2)

# 120 s connect/read timeouts and no automatic retries; applied to the agent runtime client only.
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config)

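# Inference parameters for the Amazon Titan text model.
parameters = {
    "maxTokenCount": 2000,
    "stopSequences": [],
    "temperature": 0,
    "topP": 0.9
}

# Inference parameters for Anthropic Claude v2.
model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 3000
}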

embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
)

llm = Bedrock(model_id = "amazon.titan-text-lite-v1",
              model_kwargs=parameters,
              client = bedrock_client,)

llm_claude = Bedrock(model_id = "anthropic.claude-v2",
              model_kwargs=model_kwargs_claude,
              client = bedrock_client,)

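# Claude v2 and the Titan embedding model become the defaults for everything LlamaIndex runs below.
service_context = ServiceContext.from_defaults(llm=llm_claude,
                                               embed_model=embed_model)
set_global_service_context(service_context)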

In [12]:
import nest_asyncio
nest_asyncio.apply()

In [13]:
from typing import Tuple, List
import pandas as pd
from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    PairwiseComparisonEvaluator,
)


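# Each evaluator uses the LLM from service_context (Claude v2) as the judge.
faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
correctness_evaluator = CorrectnessEvaluator(service_context=service_context)

# The loop below reads these columns: question, ground_truths, contexts, answer.
df = pd.read_csv("capstone_eval_output.csv")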

def run_llamaindex_eval():
    results_list = []
    try:
        for index, row in df.iterrows():
            

            print(f"Processing Row #: {index}")
            question = str(row['question'])
            reference_answer = str(row['ground_truths'])
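            # read_csv returns the contexts cell as one string; list() on it would
            # split it into single characters, so pass it as one context instead.
            # (If the column stores a Python list literal, parse it with
            # ast.literal_eval to recover the individual passages.)
            contexts = [str(row['contexts'])]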
            generated_answer = str(row['answer'])


            correctness_results = correctness_evaluator.evaluate(
                query=question,
                response=generated_answer,
                reference=reference_answer
            )
            faithfulness_results = faithfulness_evaluator.evaluate(
                query=question,
                response=generated_answer,
                contexts=contexts
                )
            relevancy_results = relevancy_evaluator.evaluate(
                query=question,
                response=generated_answer,
                contexts=contexts
                )
            
            cur_result_dict = {
                "query": question,
                "generated_answer": generated_answer,
                "correctness": correctness_results.passing,
                "correctness_feedback": correctness_results.feedback,
                "correctness_score": correctness_results.score,
                "faithfulness": faithfulness_results.passing,
                "faithfulness_feedback": faithfulness_results.feedback,
                "faithfulness_score": faithfulness_results.score,
                "relevancy": relevancy_results.passing,
                "relevancy_feedback": relevancy_results.feedback,
                "relevancy_score": relevancy_results.score,
            }
            results_list.append(cur_result_dict)
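            print(f"Length of result list: {len(results_list)}")
    except Exception as e:
        # Stop on the first failure but keep the rows evaluated so far.
        print(f"An error occurred: {e}")
    # Build the DataFrame once, so it is defined even if the first row fails.
    return pd.DataFrame(results_list)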
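
Because `pd.read_csv` returns each `contexts` cell as a string, the list of passages has to be rebuilt before it reaches the evaluators. The following is a minimal sketch of one way to do that, assuming the column holds the string form of a Python list. The helper name `parse_contexts` is hypothetical and not part of LlamaIndex or pandas.

```python
import ast
from typing import List


def parse_contexts(cell: object) -> List[str]:
    """Turn a CSV cell into a list of context passages.

    Assumption: the cell is either a real sequence or the string form of a
    Python list, e.g. "['passage one', 'passage two']".
    """
    if isinstance(cell, (list, tuple)):
        return [str(item) for item in cell]
    text = str(cell)
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the whole cell as a single passage.
        return [text]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]


print(parse_contexts("['passage one', 'passage two']"))
print(parse_contexts("plain text with no brackets"))
```

If the file was written in a different format (for example JSON), the parsing step would need to change accordingly.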

In [14]:
result_df = run_llamaindex_eval()

In [15]:
result_df

In [16]:
# These columns hold the evaluators' pass/fail flags, so their means are pass rates.
print(f'Correctness pass rate: {result_df.correctness.mean()} \nFaithfulness pass rate: {result_df.faithfulness.mean()} \nRelevancy pass rate: {result_df.relevancy.mean()}')
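
The pass/fail flags and the numeric `*_score` columns answer different questions, so it can help to report both. Below is a minimal sketch in plain Python, assuming rows shaped like the dictionaries built in `run_llamaindex_eval`. The `summarize` helper and the sample values are made up for illustration.

```python
from statistics import mean
from typing import Dict, List, Optional

METRICS = ("correctness", "faithfulness", "relevancy")


def summarize(rows: List[dict]) -> Dict[str, Dict[str, Optional[float]]]:
    """Compute pass rate and mean score per metric, skipping missing values."""
    summary = {}
    for metric in METRICS:
        flags = [row[metric] for row in rows if row.get(metric) is not None]
        scores = [row[f"{metric}_score"] for row in rows
                  if row.get(f"{metric}_score") is not None]
        summary[metric] = {
            "pass_rate": mean(1.0 if flag else 0.0 for flag in flags) if flags else None,
            "mean_score": mean(scores) if scores else None,
        }
    return summary


# Hypothetical sample rows, not real evaluation output.
rows = [
    {"correctness": True, "correctness_score": 4.5,
     "faithfulness": True, "faithfulness_score": 1.0,
     "relevancy": True, "relevancy_score": 1.0},
    {"correctness": False, "correctness_score": 2.0,
     "faithfulness": True, "faithfulness_score": 1.0,
     "relevancy": False, "relevancy_score": 0.0},
]

for metric, stats in summarize(rows).items():
    print(metric, stats)
```

With the real results, `result_df.to_dict("records")` would produce rows in this shape.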

## Conclusion
Congratulations on completing this module on retrieval augmented generation! This is an important technique that combines the power of large language models with the precision of retrieval methods. By augmenting generation with relevant retrieved passages, the responses we received became more coherent, consistent, and grounded. You should feel proud of learning this approach. I'm sure the knowledge you've gained will be very useful for building creative and engaging language generation systems. Well done!

In the above implementation of RAG-based question answering, we explored the following concepts and how to implement them using Amazon Bedrock and its LangChain integration:

- Loading documents and generating embeddings to create a vector store
- Retrieving documents relevant to the question
- Preparing a prompt that serves as input to the LLM
- Presenting an answer in a human-friendly manner
- Keeping source knowledge up to date and improving trust in the system by providing citations with every answer
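
The steps above can be sketched end to end without any external service. This is a toy illustration only: it swaps the Bedrock embedding model for a bag-of-words counter and stops at the prompt rather than calling an LLM. All function names and documents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Entry = Tuple[str, str, Counter]  # (source id, text, embedding)


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_store(docs: Dict[str, str]) -> List[Entry]:
    """Toy 'vector store': one embedded entry per document."""
    return [(source, text, embed(text)) for source, text in docs.items()]


def retrieve(store: List[Entry], question: str, k: int = 2) -> List[Entry]:
    """Return the k entries most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, hits: List[Entry]) -> str:
    """Assemble the LLM input, tagging each passage with its source for citation."""
    context = "\n".join(f"[{source}] {text}" for source, text, _ in hits)
    return ("Answer using only the context below and cite sources in brackets.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


docs = {
    "doc-1": "Amazon Bedrock provides access to foundation models through an API.",
    "doc-2": "A vector store holds embeddings so similar passages can be retrieved.",
    "doc-3": "Citations let readers check an answer against its source.",
}
store = build_store(docs)
question = "What does a vector store hold?"
hits = retrieve(store, question, k=2)
print(build_prompt(question, hits))
```

Keeping the source id next to each passage is what makes citations possible later: the model sees the bracketed tags in its context and can repeat them in the answer.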

### Take-aways
- Experiment with different vector stores
- Try the various models available in Amazon Bedrock and compare their outputs
- Explore options such as persistent storage of embeddings and document chunks
- Integrate with enterprise data stores

# Thank You