# Build and Evaluate Retrieval Augmented Generation with Elastic, Anthropic Claude 3.7, Amazon Bedrock, Langchain and RAGAS

## Introduction

In this notebook we will show you how to use Langchain, Anthropic Claude 3.7, Elastic and RAGAS to build a Retrieval Augmented Generation (RAG) solution and evaluate its responses.


#### Use case

Evaluation of RAG (Retrieval-Augmented Generation) Application for Private Documentation Question Answering 


#### Persona
As an analyst at Anycompany, Bob wants to evaluate the response of a Retrieval-Augmented Generation (RAG) application for answering questions based on the company's private documentation. The RAG application combines information retrieval and language generation techniques to provide accurate and relevant answers to users' questions. 

#### Implementation
To fulfill this use case, in this notebook we will show how to evaluate a RAG Application to answer questions from business data. We will use the Anthropic Claude 3.7 Sonnet Foundation model, Elastic, Langchain and RAGAS. 

#### Python 3.10

⚠  For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠


## Installation

To run this notebook you need to install the dependencies: boto3, botocore, langchain, langchain-aws, langchain-elasticsearch, elasticsearch, pypdf and ragas.

In [None]:
%pip install --upgrade pip
%pip install boto3 --force-reinstall --quiet
%pip install botocore --force-reinstall --quiet
%pip install langchain --force-reinstall --quiet
%pip install langchain-aws --force-reinstall --quiet
%pip install langchain-elasticsearch --force-reinstall --quiet
%pip install elasticsearch==8.18.0 --force-reinstall --quiet
%pip install pypdf --force-reinstall --quiet
%pip install ragas==0.2.6 --force-reinstall --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.2 requires nvidia-ml-py3==7.352.0, which is not installed.
aiobotocore 2.19.0 requires botocore<1.36.4,>=1.36.0, but you have botocore 1.36.12 which is incompatible.
amazon-sagemaker-sql-magic 0.1.3 requires sqlparse==0.5.0, but you have sqlparse 0.5.3 which is incompatible.
autogluon-multimodal 1.2 requires jsonschema<4.22,>=4.18, but you have jsonschema 4.23.0 which is incompatible.
autogluon-multimodal 1.2 requires nltk<3.9,>=3.4.5, but you have nltk 3.9.1 which is incompatible.
autogluon-multimodal 1.2 requires omegaconf<2.3.0,>=2.1.1, but you have omegaconf 2.3.0 which is incompatible.
jupyter-scheduler 2.10.0 requires pytz<=2024.2,>=2023.3, but you have pytz 2025.1 which is incompatible.
mlflow 2.20.0 requires pyarr

## Kernel Restart

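Restart the kernel so that it picks up the updated packages installed above. The snippet below relies on the classic Jupyter Notebook JavaScript API; if it has no effect (for example in JupyterLab), restart the kernel from the Kernel menu instead.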

In [2]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
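
After the restart, it can help to confirm that the kernel actually sees the packages installed above. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `environment_report` are hypothetical, and the package list is copied from the install cell.

```python
import sys
from importlib import metadata

# Distribution names taken from the %pip install cell above.
REQUIRED_PACKAGES = [
    "boto3", "botocore", "langchain", "langchain-aws",
    "langchain-elasticsearch", "elasticsearch", "pypdf", "ragas",
]

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def environment_report(packages=REQUIRED_PACKAGES):
    """Map 'python' and each package name to its version (None if missing)."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for package in packages:
        report[package] = installed_version(package)
    return report

# Print one line per entry so missing packages stand out.
for name, version in environment_report().items():
    print(f"{name:26s} {version or 'NOT INSTALLED'}")
```

If the `python` entry is not 3.10, or a package shows as not installed, rerun the installation cell before continuing.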

## Setup 

Import the necessary libraries

In [3]:
import json
import os
import sys
import boto3
import botocore
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_aws import ChatBedrockConverse
from langchain_aws import AmazonKnowledgeBasesRetriever
from langchain_aws import BedrockEmbeddings
from botocore.client import Config
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_elasticsearch import ElasticsearchStore
from elasticsearch import Elasticsearch
from langchain.schema.runnable import RunnablePassthrough
from langchain.chains import RetrievalQA
from getpass import getpass
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader
from pathlib import Path
from datasets import Dataset
import pandas as pd

## Initialization

Initialize the Bedrock Runtime client, the chat model (`ChatBedrockConverse`) and the embedding model (`BedrockEmbeddings`).

In [None]:
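# Timeouts are in seconds; 'max_attempts': 0 disables botocore's automatic retries
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
# Pass the config to the client so that the settings above take effect
bedrock_client = boto3.client('bedrock-runtime', config=bedrock_config)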

modelId = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0' # change this to use a different version from the model provider
embeddingmodelId = 'amazon.titan-embed-text-v2:0' # change this to use a different embedding model

llm = ChatBedrockConverse(model_id=modelId, client=bedrock_client)
embeddings = BedrockEmbeddings(model_id=embeddingmodelId,client=bedrock_client)

## Read files from directory

Download the NLTK resources, then load the PDF file from the `media` directory.

In [5]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/sagemaker-user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [6]:
TMP_DIR = os.path.join(os.path.dirname(os.path.realpath('__file__')), 'media/2018-SaoE-Enhancing_the_Simulation_of_Complex_Mechanical_Systems_with_Machine_Learning.pdf')
loader = PyPDFLoader(TMP_DIR)
documents = loader.load()

## Split Documents

Chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.

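Here we are chunking documents into passages of at most 1000 characters with an overlap of 0 characters. By default, `RecursiveCharacterTextSplitter` measures `chunk_size` and `chunk_overlap` in characters, not tokens.

Here we are using the Recursive Character Text Splitter, but Langchain offers more advanced splitters to reduce the chance of context being lost.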

In [7]:
text_splitter = RecursiveCharacterTextSplitter(
        separators=['\n\n', '\n', '.', ','],
        chunk_size=1000,
        chunk_overlap=0
        )
texts = text_splitter.split_documents(documents)
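
To make the splitting behaviour concrete, here is a simplified, pure-Python sketch of recursive character splitting. It is not Langchain's implementation: the function name `split_text` is hypothetical, there is no overlap handling, and separators at chunk boundaries are dropped. It only illustrates the idea of trying coarse separators first and falling back to finer ones until every chunk fits within `chunk_size` characters.

```python
def split_text(text, separators=("\n\n", "\n", ".", ","), chunk_size=1000):
    """Split text into chunks of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Pick the first (coarsest) separator that occurs in the text.
    sep, rest = None, ()
    for i, candidate_sep in enumerate(separators):
        if candidate_sep in text:
            sep, rest = candidate_sep, separators[i + 1:]
            break
    if sep is None:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        merged = current + sep + piece if current else piece
        if len(merged) <= chunk_size:
            current = merged  # keep growing the current chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) <= chunk_size:
                current = piece
            else:
                # The piece alone is too long: split it with the finer separators.
                chunks.extend(split_text(piece, rest, chunk_size))
                current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

sample = "a" * 30 + "\n\n" + "b" * 30 + ". " + "c" * 50
chunks = split_text(sample, chunk_size=40)
print([len(c) for c in chunks])  # [30, 30, 40, 11]
```

Every chunk length is counted in characters, which is why a `chunk_size` of 1000 yields passages far shorter than 1000 tokens.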

## Connect to Elasticsearch

Because we are using an Elastic Cloud deployment, we'll use the Cloud ID to identify it. To find the Cloud ID, go to your [Elastic Cloud deployments](https://cloud.elastic.co/deployments) and select your deployment.

We will use ElasticsearchStore to connect to our Elastic Cloud deployment. It makes it easy to create an index and add data to it.

In [8]:
cloud_id = getpass("Elastic deployment Cloud ID: ")
cloud_api_key = getpass("Elastic deployment API Key: ")
index_name= "new-index-1"

vector_store = ElasticsearchStore(
        es_cloud_id=cloud_id,  
        index_name= index_name, 
        embedding=embeddings,
        es_api_key=cloud_api_key)

Elastic deployment Cloud ID:  ········
Elastic deployment API Key:  ········
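
When the notebook runs unattended, interactive prompts are inconvenient. A possible variation, sketched below, reads each credential from an environment variable and only falls back to `getpass` when the variable is not set. The helper `read_secret` and the variable names `ELASTIC_CLOUD_ID` and `ELASTIC_API_KEY` are assumptions for illustration, not part of the original notebook.

```python
import os
from getpass import getpass

def read_secret(env_var, prompt):
    """Return the value of env_var if it is set and non-empty, otherwise prompt for it."""
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    return getpass(prompt)

# Usage in the notebook (environment variable names are hypothetical):
# cloud_id = read_secret("ELASTIC_CLOUD_ID", "Elastic deployment Cloud ID: ")
# cloud_api_key = read_secret("ELASTIC_API_KEY", "Elastic deployment API Key: ")
```

Either way, the credentials never appear in the notebook source or its saved output.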


## Index data into Elasticsearch and initialize retriever

Next, we will index the data into Elasticsearch using ElasticsearchStore.from_documents. We will use the Cloud ID, API key and index name set in the previous step, and pass the BedrockEmbeddings instance to embed the texts.

In [9]:
# from_documents is a class method, so call it on the class rather than on an instance
vectordb = ElasticsearchStore.from_documents(
        texts, 
        embeddings,
        index_name=index_name,
        es_cloud_id=cloud_id,
        es_api_key=cloud_api_key
        )

retriever = vectordb.as_retriever()
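
Conceptually, a dense-vector retriever embeds the query and returns the passages whose embeddings are most similar to it. The toy sketch below illustrates that idea with cosine similarity over hand-written three-dimensional vectors. It is only an illustration: the actual scoring happens inside Elasticsearch and may use a different similarity or an approximate search, the passages and vectors here are made up, and `k=4` is chosen because the evaluation results later in this notebook show four retrieved contexts per question.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed, k=4):
    """indexed: list of (passage, vector) pairs. Return the k most similar passages."""
    scored = [(cosine_similarity(query_vector, vec), passage) for passage, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]

# Made-up passages and embeddings, for illustration only.
toy_index = [
    ("passage about decision trees", [0.9, 0.1, 0.0]),
    ("passage about bolt preload", [0.0, 0.2, 0.9]),
    ("passage about verification", [0.7, 0.7, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
# ['passage about decision trees', 'passage about verification']
```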

## Model Invocation and Response Generation using RetrievalQA chain

Now that the passages are stored in Elasticsearch and the LLM is initialized, we can ask a question: the chain retrieves the relevant passages and passes them to the LLM to generate an answer.

In [10]:
query = "What is Machine learning?"


machinelearning_advisor_template = """
    Human: You will be acting as Poly, a Machine Learning advisor on complex Mechanical systems created by the company Polymath. 
    Your goal is to give users advice on the Application of Machine Learning to Mechanical Systems. You will be replying to users 
    who are asking questions about the Application of Machine Learning in Mechanical Systems 
    and who will be confused if you don't respond in the character of Poly.
    
    You should maintain a friendly customer service tone.

    Here is the document you should reference when answering the user: <context>{context}</context>

    Here are some important rules for the interaction:
    - Always stay in character, as Poly, a Machine Learning advisor on complex Mechanical systems
    - If you are unsure how to respond, say “Sorry, I didn’t understand that. Could you repeat the question?”
    - If someone asks something irrelevant, say, “Sorry, I am Poly and I give advice on the Application of Machine Learning in Mechanical Systems. Do you have a related question today I can help you with?”

    Here is an example of how to respond in a standard interaction:

    <example>
    User: Hi, how were you created and what do you do?
    Poly: Hello! My name is Poly, and I was created by Polymath  to give advice on Application of Machine Learning in Mechanical Systems. 
        What can I help you with today?
    User: Hi, how can I use Decision Trees?
    Poly: The method for using decision trees described in the given context is to first rank
        all the rules within the decision tree to find the most meaningful and reliable design subspaces.
        Then, determine how many significant rules exist within the dataset to present the engineer
        with the most reliable and important knowledge. Finally, use the resulting rules from the decision tree
        to make predictions or decisions based on the given target variable.
    User: Hi, What is Machine Learning?
    Poly: Machine learning is a tool that allows predictions about future behavior to be drawn from existing
        data sets. It is used in everything from spam filters to self-driving cars. While the field has been
        around for decades, it has flourished over the last several years as computing power has increased
        and user-friendly toolkits have been developed in a variety of programming languages. We do not
        go into detail on machine learning theory and practice in this paper, but we do introduce the
        concepts necessary for understanding our implementation of it.
    </example>

    Here is the user’s question: <question> {question} </question>

    How do you respond to the user’s question?
    Think about your answer first before you respond. Put your response in <response></response> tags.
    Assistant: <response>"""

prompt = PromptTemplate(template=machinelearning_advisor_template, input_variables=["context","question"])
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",retriever=retriever, return_source_documents=True, chain_type_kwargs={"prompt": prompt})
response = qa_chain.invoke(query)
print(response["result"])

Machine learning is a technique that allows predictions or decisions to be made from data. It involves steps like:

1. Framing the problem - Determining what question you are trying to answer and what data is needed.

2. Training an algorithm - Providing data to a machine learning algorithm like linear regression, decision trees, neural networks etc. to create a model that maps the input data to the desired output.

3. Evaluating the model - Checking the model's performance on test data that was not used for training to ensure it generalizes well.

The context provided mentions that machine learning has flourished in recent years due to increased computing power and user-friendly toolkits. It allows finding patterns and making predictions from data in fields ranging from spam filtering to self-driving cars. The key aspects are having relevant data and selecting the appropriate algorithm for the problem at hand.
</response>
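
Because the prompt ends with an opening `<response>` tag, the model output above ends with a dangling `</response>` tag, and that tag also appears in the responses passed to the evaluation. A small post-processing helper can remove it. The sketch below is one possible approach; the function name `extract_response` is hypothetical.

```python
import re

def extract_response(text):
    """Return the text inside <response>...</response>, tolerating a missing opening or closing tag."""
    match = re.search(r"<response>(.*?)(?:</response>|$)", text, flags=re.DOTALL)
    body = match.group(1) if match else text
    # Remove a dangling closing tag left over from the opening tag in the prompt.
    body = body.replace("</response>", "")
    return body.strip()

print(extract_response("Machine learning is a technique.\n</response>"))
# Machine learning is a technique.
```

In the notebook it could be applied as `extract_response(response["result"])` before printing the answer or storing it in an evaluation sample.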


## Preparing the Evaluation Data

As RAGAS aims to be a reference-free evaluation framework, the required preparation of the evaluation dataset is minimal. You need to prepare `questions` and `references` pairs; the remaining information (retrieved contexts and responses) is produced through inference, as shown below. The `references` are only needed by the reference-based metrics used in this notebook (context recall, context precision with reference, answer correctness and semantic similarity). If you only use reference-free metrics such as faithfulness or response relevancy, the questions alone are enough.

In [11]:
from ragas import SingleTurnSample, EvaluationDataset

questions = ["What is first step in machine learning metamodel?", 
             "What is the final step in machine learning metamodel?"]

references = ["The first step in machine learning is framing the problem",
                 "The final step in machine learning is verification"]

samples = []

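for question, reference in zip(questions, references):
    # Invoke the chain once; it returns the answer and the source documents it used
    # (return_source_documents=True was set when the chain was created)
    output = qa_chain.invoke(question)
    samples.append(
        SingleTurnSample(
            user_input=question,
            retrieved_contexts=[doc.page_content for doc in output["source_documents"]],
            response=output["result"],
            reference=reference
        )
    )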

dataset = EvaluationDataset(samples=samples)

## Evaluating the RAG application

First, import all the metrics you want to use from `ragas.metrics`. Then, you can use the `evaluate()` function and simply pass in the relevant metrics and the prepared dataset. Below is a brief description of the metrics:

* **Faithfulness**: This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.
* **Response Relevancy**: This metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer. Note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, because cosine similarity ranges from -1 to 1.
* **Context Precision**: Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally all the relevant chunks must appear at the top ranks. This metric is computed using the question, ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
* **Context Recall**: Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.
* **Context Entities Recall**: This metric gives the measure of recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone. Simply put, it is a measure of what fraction of entities are recalled from ground_truths. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in ground_truths, because in cases where entities matter, we need the contexts which cover them.
* **Answer Semantic Similarity**: The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
* **Answer Correctness**: The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score. Users also have the option to employ a ‘threshold’ value to round the resulting score to binary, if desired.
* **Aspect Critique**: This is designed to assess submissions based on predefined aspects such as harmlessness, maliciousness, coherence, and conciseness. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not. This evaluation is performed using the ‘answer’ as input.
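
Several of the metrics above reduce to simple ratios once the LLM judge has produced its verdicts. The sketch below shows that arithmetic for context precision, context entities recall and faithfulness. It is an illustration under stated assumptions, not the RAGAS implementation: the function names are hypothetical, the verdict lists and entity sets are made-up inputs, and in RAGAS the verdicts, entities and claims are extracted by the evaluation LLM.

```python
def context_precision(verdicts):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    verdicts: list of 0/1 flags, one per retrieved chunk, in rank order.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision@rank, counted only at relevant ranks
    return total / hits if hits else 0.0

def context_entity_recall(context_entities, reference_entities):
    """Fraction of reference entities that also appear in the retrieved context."""
    reference = set(reference_entities)
    if not reference:
        return 0.0
    return len(reference & set(context_entities)) / len(reference)

def faithfulness(supported_claims, total_claims):
    """Fraction of claims in the answer that the retrieved context supports."""
    return supported_claims / total_claims if total_claims else 0.0

# Hypothetical verdicts: the only relevant chunk is ranked first, then second.
print(context_precision([1, 0, 0, 0]))  # 1.0
print(context_precision([0, 1, 0, 0]))  # 0.5

# Hypothetical entity sets.
print(context_entity_recall({"scikit-learn", "decision tree"}, {"scikit-learn", "FEA"}))  # 0.5

# Hypothetical claim counts: 3 of 4 claims are supported by the context.
print(faithfulness(3, 4))  # 0.75
```

The context precision examples show why ranking matters: the same single relevant chunk scores 1.0 at rank one but only 0.5 at rank two.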

In [12]:
from ragas.metrics import (
        LLMContextRecall, 
        Faithfulness, 
        LLMContextPrecisionWithReference, 
        AnswerCorrectness, 
        ResponseRelevancy, 
        SemanticSimilarity, 
        AspectCritic
    )
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import evaluate 

#You can also choose a different model for evaluation
llm_for_evaluation = LangchainLLMWrapper(ChatBedrockConverse(model_id=modelId, client=bedrock_client))
bedrock_embeddings = LangchainEmbeddingsWrapper(BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",client=bedrock_client))

#specify the metrics here
metrics = [
    LLMContextRecall(llm=llm_for_evaluation), 
    LLMContextPrecisionWithReference(llm=llm_for_evaluation),
    AnswerCorrectness(llm=llm_for_evaluation, embeddings=bedrock_embeddings), 
    ResponseRelevancy(llm=llm_for_evaluation, embeddings=bedrock_embeddings),
    Faithfulness(llm=llm_for_evaluation),
    SemanticSimilarity(embeddings=bedrock_embeddings),
    AspectCritic(name="harmfulness", 
         definition="Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?", 
         llm=llm_for_evaluation
        ),
    AspectCritic(name="maliciousness", 
                 definition="Is the submission intended to harm, deceive, or exploit users?", 
                 llm=llm_for_evaluation
                ),
    AspectCritic(name="coherence", 
             definition="Is the submission logical, relevant, and informative along with clear structure?", 
             llm=llm_for_evaluation
            ),
    AspectCritic(name="conciseness", 
         definition="Is the submission brief, direct, and avoids unnecessary wordiness while conveying intended meaning?", 
         llm=llm_for_evaluation
        )
    ]

result = evaluate(
    dataset = dataset, 
    metrics=metrics
)

df = result.to_pandas()

df.style.set_properties(**{'text-align': 'left'}).set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])
pd.options.display.max_colwidth = 8000
df


Unnamed: 0,user_input,retrieved_contexts,response,reference,context_recall,llm_context_precision_with_reference,answer_correctness,answer_relevancy,faithfulness,semantic_similarity,harmfulness,maliciousness,coherence,conciseness
0,What is first step in machine learning metamodel?,"[variety of ways to produce models that use the best parts of multiple algorithms. Most popular \nalgorithms are available in preprogrammed toolkits like scikit-learn; one only needs to provide an \nappropriately formatted data set and select values for the algorithm’s parameters. For this paper, \nfour different algorithms were tested on the data, and each algorithm’s parameters were tuned to \nproduce optimal results. \nThe final step for any machine learning metamodel is verification. In the most general terms, \nbefore training an algorithm, the data set is split into “training” and “test” sets. The metamodel is \ntrained using only the training data and subsequently used with the test data inputs to make \npredictions that are compared to the test data outputs. The metamodel’s performance is evaluated \nbased on its prediction accuracy (there can be more to this process, but it is again out of the scope, breakout model typically has orders of magnitude fewer nodes and elements than the system \nmodel and runs in minutes rather than days. A set of feature values of interest is generated (e.g., \nfriction coefficients between 0.05 and 0.5, bolt preloads between 0 ft-lb and 40 ft-lb), and the \nbreakout model is analyzed with each combination of features. The results of those analyses \nbecome the data set for training the metamodel. An appropriate number of data points must be \ngathered to prevent overfitting of the data and accurately represent the input space (again, methods \nexist for testing the suitability of the size of the data set but are outside the scope of this paper). \nOnce the problem has been framed, the next step is to create the metamodel using a machine \nlearning algorithm. There are many algorithms to choose from, ranging from fairly simple linear \nregression to the multilayered neural nets of deep learning. 
Algorithms can also be combined in a, Once a metamodel has been developed, it can produce information about a component much more \nquickly than FEA. Metamodels could potentially be used to derive statistical information about \nsystems rather than just a single deterministic solution. For instance, it should be possible to \nquantify the effects of changing the hinge bolt preload on the behavior of the hatch system as a \nwhole. On a larger model with multiple contributing metamodels, it may be possible to calculate \nthe relative importance of damage to different components. Machine learning is a field with great \npotential, and its potential to assist FEA should be further explored. \n7. References \n1. Scikit-learn User Guide. Release 0.19.1. scikit-learn.org. 2017. \n \n8. Acknowledgment \nThis material is based upon work supported by the Naval Sea Systems Command under Contract \nNo. N00024-16-C-4006. ATA gratefully acknowledges the support of Mr. Randall Goodnight, \nNavy Technical Monitor., 4.3. Generate a Metamodel \nThe gathered data can now be used to train an algorithm to produce a metamodel. Since we only \nhave one input feature, the data points shown in Figure 5 are enough to represent the relationship \nbetween moment and stiffness. As an added advantage, the data are simple enough that we can \nmake a plot of our metamodel’s predictions and visually inspect its accuracy. \nSince these data were fairly simple, we used a collection of machine learning algorithms from \nscikit-learn.org to produce metamodels that can predict the stiffness of the hinge given a moment \nvalue. A method called cross-validation was used to iterate across the parameters of these \nalgorithms and select the best settings for training. The best algorithm for these data is a decision \ntree, which uses if-then-else decision rules to approximate relationships between data (Scikit-learn \nUser Manual, 2017). 
The estimated relationship between moment and stiffness for the example]","The first step in creating a machine learning metamodel is to frame the problem you want to solve using machine learning. This involves:\n\n1. Identifying the input features or variables that impact the output you want to predict or model. For example, if you want to model the stiffness of a mechanical hinge, the input features could be things like the applied moment, material properties, dimensions, etc.\n\n2. Generating training data by running simulations, experiments, or collecting real-world data that map the input features to the desired output. You need a sufficient amount of data that spans the range of input variables to properly train the model.\n\n3. Splitting the data into training and test sets. The training set is used to fit the machine learning model, while the test set is used to evaluate the model's performance on unseen data.\n\nOnce the problem is properly framed with input features and training/test data, then you can proceed to selecting and training a machine learning algorithm to create the metamodel. But clearly defining the problem and obtaining quality data is the crucial first step before any algorithm can be applied.\n</response>",The first step in machine learning is framing the problem,1.0,0.5,0.271318,0.920343,1.0,0.539818,0,0,1,1
1,What is the final step in machine learning metamodel?,"[variety of ways to produce models that use the best parts of multiple algorithms. Most popular \nalgorithms are available in preprogrammed toolkits like scikit-learn; one only needs to provide an \nappropriately formatted data set and select values for the algorithm’s parameters. For this paper, \nfour different algorithms were tested on the data, and each algorithm’s parameters were tuned to \nproduce optimal results. \nThe final step for any machine learning metamodel is verification. In the most general terms, \nbefore training an algorithm, the data set is split into “training” and “test” sets. The metamodel is \ntrained using only the training data and subsequently used with the test data inputs to make \npredictions that are compared to the test data outputs. The metamodel’s performance is evaluated \nbased on its prediction accuracy (there can be more to this process, but it is again out of the scope, breakout model typically has orders of magnitude fewer nodes and elements than the system \nmodel and runs in minutes rather than days. A set of feature values of interest is generated (e.g., \nfriction coefficients between 0.05 and 0.5, bolt preloads between 0 ft-lb and 40 ft-lb), and the \nbreakout model is analyzed with each combination of features. The results of those analyses \nbecome the data set for training the metamodel. An appropriate number of data points must be \ngathered to prevent overfitting of the data and accurately represent the input space (again, methods \nexist for testing the suitability of the size of the data set but are outside the scope of this paper). \nOnce the problem has been framed, the next step is to create the metamodel using a machine \nlearning algorithm. There are many algorithms to choose from, ranging from fairly simple linear \nregression to the multilayered neural nets of deep learning. 
Algorithms can also be combined in a, Once a metamodel has been developed, it can produce information about a component much more \nquickly than FEA. Metamodels could potentially be used to derive statistical information about \nsystems rather than just a single deterministic solution. For instance, it should be possible to \nquantify the effects of changing the hinge bolt preload on the behavior of the hatch system as a \nwhole. On a larger model with multiple contributing metamodels, it may be possible to calculate \nthe relative importance of damage to different components. Machine learning is a field with great \npotential, and its potential to assist FEA should be further explored. \n7. References \n1. Scikit-learn User Guide. Release 0.19.1. scikit-learn.org. 2017. \n \n8. Acknowledgment \nThis material is based upon work supported by the Naval Sea Systems Command under Contract \nNo. N00024-16-C-4006. ATA gratefully acknowledges the support of Mr. Randall Goodnight, \nNavy Technical Monitor., 4.3. Generate a Metamodel \nThe gathered data can now be used to train an algorithm to produce a metamodel. Since we only \nhave one input feature, the data points shown in Figure 5 are enough to represent the relationship \nbetween moment and stiffness. As an added advantage, the data are simple enough that we can \nmake a plot of our metamodel’s predictions and visually inspect its accuracy. \nSince these data were fairly simple, we used a collection of machine learning algorithms from \nscikit-learn.org to produce metamodels that can predict the stiffness of the hinge given a moment \nvalue. A method called cross-validation was used to iterate across the parameters of these \nalgorithms and select the best settings for training. The best algorithm for these data is a decision \ntree, which uses if-then-else decision rules to approximate relationships between data (Scikit-learn \nUser Manual, 2017). 
The estimated relationship between moment and stiffness for the example]","The final step for any machine learning metamodel is verification, according to the context provided. Before training an algorithm, the data set is split into ""training"" and ""test"" sets. The metamodel is trained using only the training data and then used with the test data inputs to make predictions that are compared to the test data outputs. The metamodel's performance is evaluated based on its prediction accuracy on the test data. This verification step ensures that the trained metamodel can make accurate predictions on new, unseen data and prevents overfitting to the training data.\n</response>",The final step in machine learning is verification,1.0,1.0,0.341432,0.988179,0.75,0.699059,0,0,1,0
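
The answer correctness scores in the table are noticeably lower than the semantic similarity scores. Answer correctness is described above as a weighted combination of factual similarity and semantic similarity, so the two columns can be used to back out the factual component. The sketch below does this under the assumption that the weights are 0.75 for the factual part and 0.25 for the semantic part; the weights are not stated in this notebook, so treat them as an assumption and check them against your RAGAS version.

```python
FACTUAL_WEIGHT = 0.75   # assumed weight of the factual (F1) component
SEMANTIC_WEIGHT = 0.25  # assumed weight of the semantic similarity component

def answer_correctness(factual_f1, semantic_similarity):
    """Weighted combination of the factual and semantic components."""
    return FACTUAL_WEIGHT * factual_f1 + SEMANTIC_WEIGHT * semantic_similarity

def implied_factual_f1(correctness, semantic_similarity):
    """Solve the weighted combination for the factual component."""
    return (correctness - SEMANTIC_WEIGHT * semantic_similarity) / FACTUAL_WEIGHT

# answer_correctness and semantic_similarity values copied from the results table above.
rows = [
    {"answer_correctness": 0.271318, "semantic_similarity": 0.539818},
    {"answer_correctness": 0.341432, "semantic_similarity": 0.699059},
]
for row in rows:
    f1 = implied_factual_f1(row["answer_correctness"], row["semantic_similarity"])
    print(round(f1, 3))
```

Under these assumed weights, the implied factual component is low for both questions, which would be consistent with long generated answers containing many statements that the one-sentence references do not cover.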


## Delete Elasticsearch Index

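Delete the Elasticsearch index to clean up the data created in this notebook. The `ignore_status` option suppresses the error that would otherwise be raised if the index does not exist.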

In [13]:
es = Elasticsearch(cloud_id=cloud_id, api_key=cloud_api_key)
es.options(ignore_status=[400,404]).indices.delete(index=index_name)

ObjectApiResponse({'acknowledged': True})

## Conclusion
You have now experimented with the `RAGAS` SDK to evaluate a RAG application using Anthropic Claude 3.7 as the judge and Elastic as the retriever.

### Takeaways
- Adapt this notebook to experiment with different Claude models available through Amazon Bedrock. 
- Change the prompts to fit your specific use case and evaluate the output of different models.
- Experiment with the token length of prompts and responses to understand its effect on the latency and responsiveness of the service.
- Apply different prompt engineering principles to get better outputs.

## Thank You