# Local Experimentation of RAG Systems on SageMaker Studio

## Install dependencies

This notebook demonstrates invoking Bedrock models directly using the AWS SDK, but for later notebooks in the workshop you'll also need to install [LangChain](https://github.com/hwchase17/langchain).

In this example, you will use [Facebook AI Similarity Search (Faiss)](https://faiss.ai/) as the vector database to store your embeddings. There are CPU or GPU options available, depending on your platform.

In [None]:
%pip install -Uq langchain==0.3.24
%pip install -Uq pydantic==2.11.3
%pip install -Uq sqlalchemy==2.0.40
%pip install -Uq faiss-cpu==1.10.0 # For CPU Installation
#%pip install faiss-gpu # For CUDA 7.5+ Supported GPU's.
%pip install -Uq pypdf==5.4.0
%pip install -Uq tiktoken==0.9.0
%pip install -Uq langchain_huggingface==0.1.2
%pip install -Uq datasets==3.5.0
%pip install -Uq ragas==0.2.14
%pip install -Uq python-dotenv==1.1.0

In [None]:
from IPython.display import display_html

display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

In [None]:
!mkdir -p ./data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

As part of Amazon's culture, the CEO always includes a copy of the 1997 Letter to Shareholders with every new release. This will cause repetition, take longer to generate embeddings, and may skew your results. In the next section you will take the downloaded data, trim the 1997 letter (the last 3 pages), and overwrite each file with its processed version.

In [None]:
import glob
from pypdf import PdfReader, PdfWriter

local_pdfs = glob.glob(data_root + '*.pdf')

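for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    # Copy every page except the last 3, which hold the 1997 letter.
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    # Opening in 'wb' mode truncates the file, so the trimmed PDF replaces the original.
    # Note: re-running this cell trims another 3 pages from each file.
    with open(local_pdf, 'wb') as new_file:
        pdf_writer.write(new_file)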


Now that you have clean PDFs to work with, you will enrich your documents with metadata, then use a process called "chunking" to break up a larger document into small pieces. These small pieces will allow you to generate embeddings without surpassing the input limit of the embedding model.

In this example you will break the documents into 1000-character chunks with a 100-character overlap. The overlap lets each chunk keep some of the context from its neighbor, so the embeddings retain that context too.
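
The sliding-window idea behind chunking with overlap can be sketched in plain Python. This is a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the `chunk_text` helper is a hypothetical name used only here.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 2500-character toy string yields three windows; each shares 100 characters with the previous one.
sample_chunks = chunk_text("a" * 2500)
print([len(c) for c in sample_chunks])
```

With a step of 900 characters, the last 100 characters of one chunk reappear at the start of the next.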

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()  # one Document per PDF page
    for document_fragment in document:
        # Replace the loader's metadata with year/source; copy so pages don't share one dict.
        document_fragment.metadata = dict(metadata[idx])
        
    print(f'{len(document)} {document}\n')
    documents += document

# In our testing, character-based splitting works better with this PDF data set.
text_splitter = RecursiveCharacterTextSplitter(
    # 1000-character chunks with a 100-character overlap, as described above.
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)

## Create the boto3 client

Interaction with the Bedrock API is done via the boto3 SDK. To create the Bedrock client, we provide a utility method that supports different options for passing credentials to boto3.
If you are running these notebooks from your own computer, make sure you have [installed the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) before proceeding.


#### Use default credential chain

If you are running this notebook from a SageMaker Studio notebook and your SageMaker Studio role has permissions to access Bedrock, you can just run the cells below as-is. This is also the case if you are running these notebooks from a computer whose default credentials have access to Bedrock.

#### Use a different role

If you or your company has set up a specific role to access Bedrock, you can specify that role by uncommenting the line `#os.environ['BEDROCK_ASSUME_ROLE'] = '<YOUR_VALUES>'` in the cell below before executing it. Ensure that your current user or role has permission to assume that role.

#### Use a specific profile

If you are running these notebooks from your own computer, you have set up the AWS CLI with multiple profiles, and the profile that has access to Bedrock is not the default one, you can uncomment the line `#os.environ['AWS_PROFILE'] = '<YOUR_VALUES>'` and specify the profile to use.

#### Note about `langchain`

The Bedrock classes provided by `langchain` create a default Bedrock boto3 client. We recommend explicitly creating the Bedrock client using the instructions below, and passing it to the class instantiation methods using `client=bedrock_client`.
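
The three credential options above could be wired up roughly as follows. This is a minimal sketch, not the workshop's own utility: the helper names `bedrock_client_options` and `make_bedrock_client` and the session name `bedrock-notebook` are assumptions made for illustration. The boto3 calls themselves (`boto3.Session`, `sts.assume_role`, `session.client`) are standard.

```python
import os


def bedrock_client_options(env):
    """Read the two optional environment variables described above."""
    return {
        "profile_name": env.get("AWS_PROFILE"),
        "assume_role_arn": env.get("BEDROCK_ASSUME_ROLE"),
    }


def make_bedrock_client(env=os.environ):
    """Create a bedrock-runtime client using a profile and/or an assumed role."""
    import boto3  # imported lazily so the helper above works without boto3 installed

    opts = bedrock_client_options(env)
    # profile_name=None falls back to the default credential chain.
    session = boto3.Session(profile_name=opts["profile_name"])
    if opts["assume_role_arn"] is None:
        return session.client("bedrock-runtime")

    # Exchange the current credentials for temporary ones tied to the Bedrock role.
    creds = session.client("sts").assume_role(
        RoleArn=opts["assume_role_arn"],
        RoleSessionName="bedrock-notebook",  # hypothetical session name
    )["Credentials"]
    return session.client(
        "bedrock-runtime",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```

With neither variable set, `make_bedrock_client()` behaves like a plain `boto3.client("bedrock-runtime")` call on the default credential chain.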

In [None]:
#### Uncomment the following lines to run from your local environment outside of the AWS account with Bedrock access

#import os
#os.environ['BEDROCK_ASSUME_ROLE'] = '<YOUR_VALUES>'
#os.environ['AWS_PROFILE'] = 'bedrock-user'

In [None]:
import os
import boto3
import json
import sys

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import print_ww

bedrock_client = boto3.client("bedrock-runtime")

## Building a FAISS vector database

In this example, you will use the open-source `Alibaba-NLP/gte-base-en-v1.5` embedding model, loaded locally through Hugging Face, to generate the embeddings for your FAISS vector database.

The `TokenCounterHandler` callback is a utility class you can attach to your LLM objects and chains to report token counts. It will output the token counts at the end of your result chain, or it can be attached to an LLM object and invoked manually.

In [None]:
from utils.TokenCounterHandler import TokenCounterHandler

token_counter = TokenCounterHandler()

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "Alibaba-NLP/gte-base-en-v1.5"
model_kwargs = {"device": "cpu", 'trust_remote_code': True}
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

Next you will import the `Document` and `FAISS` classes from LangChain. Using these classes will allow you to quickly generate embeddings with the local embedding model and store them locally in your FAISS vector store.

In [None]:
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS

In this step you will process documents and prepare them to be converted to vectors for the vector store.

Here you will use the `from_documents` function in the LangChain FAISS provider to build a vector database from your document embeddings.

In [None]:
db = FAISS.from_documents(docs, embeddings)

To avoid having to completely regenerate your embeddings all the time, you can save and load the vector store from the local filesystem. In the next section you will save the vector store locally, and reload it.
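
One way to apply this save-and-reload pattern is a small "load if present, otherwise build" helper. The sketch below is an assumption about how you might structure it; `load_or_build` is a hypothetical name, and the FAISS calls in the trailing comment are the same ones used in this notebook.

```python
import os


def load_or_build(index_dir, build_fn, load_fn, save_fn):
    """Return a cached index if index_dir exists, otherwise build one and save it."""
    if os.path.isdir(index_dir):
        return load_fn(index_dir)
    index = build_fn()
    save_fn(index, index_dir)
    return index


# Possible usage with the objects defined in this notebook:
# db = load_or_build(
#     "faiss_index",
#     build_fn=lambda: FAISS.from_documents(docs, embeddings),
#     load_fn=lambda d: FAISS.load_local(d, embeddings, allow_dangerous_deserialization=True),
#     save_fn=lambda index, d: index.save_local(d),
# )
```

Loading with `allow_dangerous_deserialization=True` unpickles data from disk, so only load index directories that you created yourself.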

In [None]:
db.save_local("faiss_index")

In [None]:
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

In [None]:
db = new_db

## Similarity Searching

Here you will set your search query, and look for documents that match.

In [None]:
query = "How has AWS evolved?"

### Basic Similarity Search

The results that come back from the `similarity_search_with_score` API are sorted by score from lowest to highest. The score value is the [L-squared (or L2)](https://en.wikipedia.org/wiki/Lp_space) distance of each result. Lower scores are better, representing a shorter distance between vectors.
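
To build intuition for these scores, the sketch below computes distances by hand on toy vectors. It assumes the index reports squared L2 distances (as Faiss flat L2 indexes do) and that embeddings are normalized to unit length, as configured with `normalize_embeddings` above. Under those assumptions the score equals `2 - 2 * cosine_similarity`, so ranking by lowest distance matches ranking by highest cosine similarity. The vectors and names are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query_vec = normalize([1.0, 2.0, 2.0])
doc_vecs = {
    "close": normalize([1.0, 2.0, 1.5]),
    "far": normalize([-2.0, 0.5, 1.0]),
}

# Sort ascending by distance: the best match comes first.
ranked = sorted(doc_vecs, key=lambda name: squared_l2(query_vec, doc_vecs[name]))
for name in ranked:
    d = squared_l2(query_vec, doc_vecs[name])
    print(f"{name}: distance={d:.4f}, 2 - 2*cos={2 - 2 * cosine(query_vec, doc_vecs[name]):.4f}")
```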

In [None]:
results_with_scores = db.similarity_search_with_score(query)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\nScore: {score}\n\n")

### Similarity Search with Metadata Filtering
Additionally, you can provide metadata to your query to filter the scope of your results. The `filter` parameter for search queries is a dictionary of metadata key/value pairs that will be matched to results to include/exclude them from your query.

In [None]:
filter = dict(year=2022)

In the next section, you will notice that your query returns fewer results than the basic search, because the filter criteria narrow the result set.

In [None]:
results_with_scores = db.similarity_search_with_score(query, filter=filter)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}, Score: {score}\n\n")

### Top-K Matching

Top-K Matching is a filtering technique that uses a two-stage approach.

1. Perform a similarity search, returning the top K matches.
2. Apply your metadata filter on the smaller resultset.

Note: A caveat for Top-K matching is that if the number of matches fetched in the first stage (`fetch_k` below) is too small, there is a chance that after filtering there will be no results to return.

Using Top-K matching requires 2 values:
- `k`, the max number of results to return at the end of our query
- `fetch_k`, the max number of results to return from the similarity search before applying filters
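
The two stages and the caveat can be illustrated on toy data. This is a sketch of the idea only, not LangChain's implementation; `top_k_with_filter` and the sample records are hypothetical.

```python
def top_k_with_filter(scored_docs, metadata_filter, k=4, fetch_k=20):
    """scored_docs is a list of (metadata_dict, distance) pairs."""
    # Stage 1: keep the fetch_k nearest candidates (lowest distance first).
    candidates = sorted(scored_docs, key=lambda pair: pair[1])[:fetch_k]
    # Stage 2: drop candidates whose metadata does not match every filter key.
    matches = [
        pair for pair in candidates
        if all(pair[0].get(key) == value for key, value in metadata_filter.items())
    ]
    # Return at most k of the surviving matches.
    return matches[:k]


toy = [
    ({"year": 2021}, 0.10), ({"year": 2022}, 0.20), ({"year": 2021}, 0.30),
    ({"year": 2019}, 0.40), ({"year": 2022}, 0.50), ({"year": 2022}, 0.60),
]

print(len(top_k_with_filter(toy, {"year": 2022}, k=2, fetch_k=6)))  # enough candidates
print(len(top_k_with_filter(toy, {"year": 2022}, k=2, fetch_k=4)))  # fewer than k survive
print(len(top_k_with_filter(toy, {"year": 2022}, k=2, fetch_k=1)))  # nothing survives
```

Raising `fetch_k` gives the filter more candidates to work with, at the cost of a larger first-stage search.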


In [None]:
results = db.similarity_search(query, filter=filter, k=2, fetch_k=4)
for doc in results:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\n\n")

### Maximal Marginal Relevance

Another measurement of results is Maximal Marginal Relevance (MMR). The focus of MMR is to minimize the redundancy of your search results while still maintaining relevance by re-ranking the results to provide both similarity and diversity.

In the next section you will use the `max_marginal_relevance_search` API to run the same query as in the Metadata Filtering section, but with reranked results.
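
A minimal sketch of the MMR re-ranking idea follows. It uses the standard greedy formulation, in which each step picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. The function and toy vectors are illustrative assumptions, not LangChain's internal code. The parameter name `lambda_mult` mirrors the one LangChain exposes, where 1.0 means pure relevance and lower values favor diversity.

```python
def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy Maximal Marginal Relevance; returns indices of the selected documents."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = dot(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


toy_query = [0.8, 0.6]
toy_docs = [
    [1.0, 0.0],    # index 0: less relevant, but different from the others
    [0.6, 0.8],    # index 1: most relevant
    [0.56, 0.83],  # index 2: nearly a duplicate of index 1
]

print(mmr(toy_query, toy_docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(toy_query, toy_docs, k=2, lambda_mult=0.5))  # relevance balanced with diversity
```

With `lambda_mult=1.0` the near-duplicate is returned second; with `lambda_mult=0.5` it is skipped in favor of the more distinct document.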

In [None]:
results = db.max_marginal_relevance_search(query, filter=filter)
for doc in results:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\n\n")

## Q&A with Llama 3 and Retrieved Vectors

Now that you are able to query from the vector store, you're ready to feed context into your LLM.

Using the LangChain wrapper for Bedrock, creating an object for the LLM can be done in a single line of code where you specify the `model_id` of the desired LLM (Llama 3.1 8B Instruct in this case) and any model-level arguments.

In [None]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

vector_store = db

retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 3})

retriever.invoke("How has AWS evolved?")

# Enable Model Access in Amazon Bedrock

To use models in Amazon Bedrock, you will need to enable access for them.

Before going further, [go to the Bedrock console](https://console.aws.amazon.com/bedrock/home?#/modelaccess), choose `Enable All Models`, then `Next`, then `Submit`.

**In a workshop environment, some models may return errors; these are safe to ignore.**

In [None]:
from langchain_aws import ChatBedrock

model_kwargs = { 
        "max_tokens": 512,
        "temperature": 0,  
        "top_p": 0.5
    }

llm = ChatBedrock(
    model_id="us.meta.llama3-1-8b-instruct-v1:0", 
    client=bedrock_client, 
    model_kwargs=model_kwargs,
    callbacks=[token_counter]
)

In [None]:
llm.invoke("How has AWS evolved?")

In [None]:
import boto3
from botocore.client import Config
from langchain_aws.llms import SagemakerEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_aws.embeddings import BedrockEmbeddings

In [None]:
prompt_template = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are an assistant for question-answering tasks. Answer the following question using the provided context. If you don't know the answer, just say "I don't know.".
<|start_header_id|>user<|end_header_id|>
Context: {context} 
Question: {question}
<|start_header_id|>assistant<|end_header_id|> 
Answer:
"""

In [None]:
def build_messages(data):
    system_content = f"""You are an assistant for question-answering tasks. Answer the following question in 5 sentences using the provided context. If you don't know the answer, just say "I don't know."."""
    user_content = f"""
        Context: {data["context"]} 
        
        Question: {data["question"]}
        """

    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ]

    return messages

In [None]:
prompt = PromptTemplate.from_template(prompt_template)


def format_docs(docs):
    results = "\n\n".join(doc.page_content for doc in docs)
    return results


qa_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
query = "How has AWS evolved?"

print(f"Question: {query}")
print(f"Answer: {qa_chain.invoke(query)}")

Since you have a model object set up, you can use it to get a baseline of what the LLM will produce without any provided context.

Something you will notice is that with the prompt "How has AWS evolved?", the answer isn't bad, but it's not exactly what you'd look for from the lens of an executive. You'd want to hear about how they approached things that led to evolution, whereas the baseline results are just facts that indicate change. Later in the notebook, you will provide context to get a more tailored answer.

In [None]:
import json
print(llm.invoke("How has AWS evolved?"))

token_counter.report()

With your LLM ready to go, you'll create a prompt template to utilize context to answer a given question. Prompt formats will be different by model, so if you change your model you will also likely need to adjust your prompt.
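
As an illustration of a model-specific format, the sketch below assembles a prompt following Meta's published Llama 3 instruct layout, in which each message ends with an `<|eot_id|>` token. The `build_llama3_prompt` helper is a hypothetical name, and whether you need to write these tokens yourself depends on the wrapper you use: a chat-style wrapper may apply the model's template for you.

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn prompt in the Llama 3 instruct layout."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # The trailing assistant header cues the model to start its answer.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


example_prompt = build_llama3_prompt(
    system="You are an assistant for question-answering tasks.",
    user="Context: (retrieved chunks go here)\nQuestion: How has AWS evolved?",
)
print(example_prompt)
```

Switching to a different model family would mean replacing this helper rather than the rest of the chain.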

In [None]:
from langchain.prompts import PromptTemplate

prompt_template = """
    <|begin_of_text|>
    <|start_header_id|>system<|end_header_id|>
    You are an assistant for question-answering tasks. Answer the following question using the provided context. If you don't know the answer, just say "I don't know.".
    <|start_header_id|>user<|end_header_id|>
    Context: {context} 
    Question: {question}
    <|start_header_id|>assistant<|end_header_id|> 
    Answer:
    """

With the LLM endpoint object created, you are ready to create your first chain!

This chain is a simple example composed with the LangChain Expression Language (LCEL), which will:
- take a query as input
- generate query embeddings
- query the vector database for relevant document chunks based on the query embedding
- inject the context and original query into the prompt template
- invoke the LLM with the completed prompt
- return the LLM result

The `format_docs` helper takes the same approach as LangChain's [`stuff` chain type](https://python.langchain.com/docs/modules/chains/document/stuff): it simply takes the context documents and inserts them into the prompt.

Because the chain returns the retrieved documents under the `context` key next to the `answer`, each result also contains the document chunks from the vector database, to illustrate where the context came from.
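
The steps listed above can be sketched in plain Python with stand-in callables for the retriever and the LLM. Everything here (`answer_with_sources`, `fake_retrieve`, `fake_generate`, and the toy template) is hypothetical and exists only to show the data flow; the real chain below uses LangChain runnables.

```python
def answer_with_sources(question, retrieve, generate, template):
    """Retrieve context, fill the prompt, call the LLM, and keep the sources."""
    docs = retrieve(question)  # embed the query and fetch matching chunks
    context = "\n\n".join(doc["page_content"] for doc in docs)
    prompt = template.format(context=context, question=question)  # inject context and query
    answer = generate(prompt)  # invoke the LLM with the completed prompt
    return {"question": question, "context": docs, "answer": answer}


seen_prompts = []


def fake_retrieve(question):
    # Stand-in for a vector store lookup.
    return [
        {"page_content": "chunk one", "source": "a.pdf"},
        {"page_content": "chunk two", "source": "b.pdf"},
    ]


def fake_generate(prompt):
    # Stand-in for an LLM call; records the prompt it received.
    seen_prompts.append(prompt)
    return "stub answer"


toy_template = "Context: {context}\nQuestion: {question}\nAnswer:"
toy_result = answer_with_sources("How has AWS evolved?", fake_retrieve, fake_generate, toy_template)
print(toy_result["answer"], [doc["source"] for doc in toy_result["context"]])
```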

In [None]:
retriever = db.as_retriever()

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

def format_docs(docs):
    results = "\n\n".join(doc.page_content for doc in docs)
    return results

prompt = PromptTemplate.from_template(prompt_template)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

In [None]:
query = "How has AWS evolved?"

print(f"Question: {query}")

result = rag_chain_with_source.invoke(query)

print(f"\nAnswer: {result['answer']}")

print(f"\nContext Documents: ")
for srcdoc in result["context"]:
    print(f"{srcdoc.metadata['source']}")
    print("----------")
    print(f"{srcdoc.page_content}\n")

Now that your chain is set up, you can supply queries to it and generate responses based on your source documents.

You'll note that the LLM response references the context documents provided, using them to formulate a response calling out things that were mentioned specifically by Amazon's CEO.

In [None]:
query = "How has AWS evolved?"

print(f"Question: {query}")

result = rag_chain_with_source.invoke(query)

print(f"\nAnswer: {result['answer']}")

print(f"\nContext Documents: ")
for srcdoc in result["context"]:
    print(f"{srcdoc.metadata['source']}")
    print("----------")
    print(f"{srcdoc.page_content}\n")

In [None]:
query = "Why is Amazon successful?"

print(f"Question: {query}")

result = rag_chain_with_source.invoke(query)

print(f"\nAnswer: {result['answer']}")

print(f"\nContext Documents: ")
for srcdoc in result["context"]:
    print(f"{srcdoc.metadata['source']}")
    print("----------")
    print(f"{srcdoc.page_content}\n")

In [None]:
query = "What business challenges has Amazon experienced?"

print(f"Question: {query}")

result = rag_chain_with_source.invoke(query)

print(f"\nAnswer: {result['answer']}")

print(f"\nContext Documents: ")
for srcdoc in result["context"]:
    print(f"{srcdoc.metadata['source']}")
    print("----------")
    print(f"{srcdoc.page_content}\n")

In [None]:
query = "How was Amazon impacted by COVID-19?"

print(f"Question: {query}")

result = rag_chain_with_source.invoke(query)

print(f"\nAnswer: {result['answer']}")

print(f"\nContext Documents: ")
for srcdoc in result["context"]:
    print(f"{srcdoc.metadata['source']}")
    print("----------")
    print(f"{srcdoc.page_content}\n")

## Evaluate Retrieval-Augmented Generation (RAG) pipelines with Amazon Bedrock and Ragas

In this section we'll explore ways to evaluate the quality of Retrieval-Augmented Generation (RAG) pipelines with open-source tools like [RAGAS](https://docs.ragas.io/en/v0.1.21/index.html). We will leverage the local vector database created in the previous lab and the generated RAG results to show offline evaluation and scoring.

### 📊 RAGAS Evaluation Metrics

We're going to measure the following aspects of a RAG system. These metrics are defined in **[RAGAS](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/)**:

- 🔍 **[Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/)**  
  Measures how factually consistent the generated answer is with the retrieved context. It evaluates whether the answer could reasonably be derived from the context.

- 🎯 **[Response Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/)**  
  Assesses how relevant the generated answer is to the original user query. A high score indicates the answer is on-topic and useful.

- 🧠 **[Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/)**  
  Measures how many of the retrieved contexts are truly relevant to answering the question. Precision reflects the "purity" of the retrieved chunks.

- 📥 **[Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/)**  
  Evaluates how well the retrieved context covers the information needed to answer the question completely. High recall means fewer relevant facts are missed.

- 🧬 **[Answer Similarity](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_similarity/)**  
  Compares the generated answer to a reference answer (if available), measuring how semantically close they are using embedding-based similarity.

- ✅ **[Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_correctness/)**  
  Evaluates whether the generated answer is factually correct and aligns with known ground-truth answers, if such references are available.

> 📚 Want to dive deeper into how each metric is computed?  
Check out the full [RAGAS metrics documentation](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/).
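
As a worked example of one of these metrics, the sketch below computes Context Precision@K from a list of per-chunk relevance verdicts, using the formulation in the RAGAS documentation: the mean of precision@k over the ranks that hold a relevant chunk. In RAGAS the verdicts come from an LLM judge; here they are hard-coded toy values, and `context_precision_at_k` is a hypothetical helper name.

```python
def context_precision_at_k(relevance):
    """relevance is a list of 0/1 verdicts for the retrieved chunks, in rank order."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


print(context_precision_at_k([1, 1, 0]))  # relevant chunks ranked first
print(context_precision_at_k([1, 0, 1]))  # one relevant chunk ranked below an irrelevant one
print(context_precision_at_k([0, 0, 1]))  # the only relevant chunk ranked last
```

The score rewards retrievers that rank relevant chunks ahead of irrelevant ones, not just retrievers that include them somewhere in the results.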

In [None]:
# RAGAS
import ragas
from ragas.run_config import RunConfig
from ragas.metrics.base import MetricWithLLM, MetricWithEmbeddings
from ragas import evaluate
from ragas.metrics import Faithfulness, LLMContextPrecisionWithoutReference, ResponseRelevancy
from ragas.metrics import answer_relevancy, faithfulness
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.dataset_schema import SingleTurnSample

from langchain_aws import ChatBedrock as LangChainBedrock

### Score with RAGAS

Let's take a small example of a single trace and see how you can score it with Ragas. We first define a utility function to score your trace with the metrics you chose.

In this example, as we don't have ground truth for the actual answer, we will use metrics that focus on the retrieved contexts and the LLM-generated answers. The metrics we selected are `Faithfulness`, `LLMContextPrecisionWithoutReference`, and `ResponseRelevancy`:
- The **Faithfulness** metric measures how factually consistent a `response` is with the `retrieved context`. It ranges from 0 to 1, with higher scores indicating better consistency.
- The **LLMContextPrecisionWithoutReference** metric can be used when you have retrieved contexts and a response for a `user_input`, but no reference answer. To estimate whether a retrieved context is relevant, this method uses the LLM to compare each chunk present in `retrieved_contexts` with the `response`.
- The **ResponseRelevancy** metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrieved_contexts` and the `response`.
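
To make the 0 to 1 range of Faithfulness concrete: RAGAS describes the score as the number of claims in the response that are supported by the retrieved context, divided by the total number of claims. The sketch below shows only that final arithmetic with hand-written verdicts. In RAGAS itself, an LLM extracts the claims and judges each one. The helper name, the example claims, and the choice to return 0.0 when there are no claims are assumptions made for this sketch.

```python
def faithfulness_score(claim_verdicts):
    """claim_verdicts maps each claim in the answer to True if the context supports it."""
    if not claim_verdicts:
        return 0.0  # choice made for this sketch when no claims are found
    supported = sum(1 for is_supported in claim_verdicts.values() if is_supported)
    return supported / len(claim_verdicts)


toy_verdicts = {
    "claim A (stated in the retrieved context)": True,
    "claim B (stated in the retrieved context)": True,
    "claim C (not found in the retrieved context)": False,
    "claim D (stated in the retrieved context)": True,
}
print(faithfulness_score(toy_verdicts))
```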

When you have the corresponding ground truth for the expected answers, you can explore other metrics that RAGAS provides. Note that certain metrics require an evaluator embedding model and/or an evaluator LLM, so you need to prepare those evaluators accordingly. In this example, we will use a `Claude` model as the evaluator LLM and the `titan embedding` model provided by Amazon Bedrock as the evaluator embedding model.

In [None]:
evaluator_llm = LangchainLLMWrapper(LangChainBedrock(model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0"))
evaluator_emb = LangchainEmbeddingsWrapper(BedrockEmbeddings(client=bedrock_client, model_id="amazon.titan-embed-text-v1"))

You can evaluate a single metric by initializing it with the relevant evaluator LLM, as shown below:

In [None]:
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

Then we need to prepare an evaluation sample. An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.

**SingleTurnSample**

SingleTurnSample represents a single-turn interaction between a user, LLM, and expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.

In [None]:
def generate_answer_context_with_knowledgebase(query):
    print(f"Generating answer and context for question {query}")
    response = rag_chain_with_source.invoke(query)
    source_documents = response["context"]
    contexts_list_raw = list(map(lambda x: x.page_content, source_documents))
    contexts = contexts_list_raw
    answer = response["answer"]
    return {"question": query,
           "answer": answer,
           "contexts": contexts}

response = generate_answer_context_with_knowledgebase(query)

In [None]:
sample = SingleTurnSample(
        user_input=query,
        response=response["answer"],
        retrieved_contexts=response["contexts"]
    )


Metrics can be broadly classified into two categories based on the type of data they evaluate:

**Single turn metrics**: These metrics evaluate the performance of the AI application based on a single turn of interaction between the user and the AI. All metrics in Ragas that support single-turn evaluation inherit from the `SingleTurnMetric` class and are scored using the `single_turn_ascore` method, which expects a `SingleTurnSample` object as input.

**Multi-turn metrics**: These metrics evaluate the performance of the AI application based on multiple turns of interaction between the user and the AI. All metrics in Ragas that support multi-turn evaluation inherit from the `MultiTurnMetric` class and are scored using the `multi_turn_ascore` method, which expects a `MultiTurnSample` object as input.

In this example, we will only show single turn metrics.
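
Because the scoring methods are coroutines, several metrics can be awaited concurrently. The sketch below shows that pattern with a stand-in class: `StubMetric` is hypothetical and only mimics the `name` attribute and the `single_turn_ascore` coroutine, so it runs without Ragas or model access.

```python
import asyncio


class StubMetric:
    """Hypothetical stand-in exposing the same coroutine name as a Ragas single-turn metric."""

    def __init__(self, name, value):
        self.name = name
        self._value = value

    async def single_turn_ascore(self, sample):
        await asyncio.sleep(0)  # placeholder for the LLM or embedding call
        return self._value


async def score_all(metrics, sample):
    """Score one sample with every metric concurrently and return {name: score}."""
    scores = await asyncio.gather(*(m.single_turn_ascore(sample) for m in metrics))
    return {m.name: s for m, s in zip(metrics, scores)}


stub_metrics = [StubMetric("faithfulness", 0.9), StubMetric("answer_relevancy", 0.8)]
# In a plain Python script, start an event loop explicitly.
stub_scores = asyncio.run(score_all(stub_metrics, sample={"user_input": "How has AWS evolved?"}))
print(stub_scores)
```

Inside a notebook an event loop is already running, so you would write `await score_all(...)` directly instead of calling `asyncio.run(...)`, just as the cells in this section use a bare `await`.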

In [None]:
await context_precision.single_turn_ascore(sample)

We can also initialize a few RAGAS metrics together and check which evaluator (LLM or embedding model) each metric requires.

In [None]:
metrics=[
        ragas.metrics.faithfulness,
        ragas.metrics.answer_relevancy
    ]

In [None]:
# util function to init Ragas Metrics
def init_ragas_metrics(metrics, llm, embedding):
    for metric in metrics:
        if isinstance(metric, MetricWithLLM):
            print(metric.name + " llm")
            metric.llm = llm
        if isinstance(metric, MetricWithEmbeddings):
            print(metric.name + " embedding")
            metric.embeddings = embedding
        run_config = RunConfig()
        metric.init(run_config)

In [None]:
init_ragas_metrics(
    metrics,
    llm=evaluator_llm,
    embedding=evaluator_emb,
)

In [None]:
for metric in metrics:
    print(f"calculating {metric.name}")
    score = await metric.single_turn_ascore(sample)
    print(f"score is: {score}")

### Key Workflow Summary
- Data Preparation: Prepare your knowledge base data source.
- In-memory database setup: A vector index is created and populated.
- Model Deployment: Load the embedding model on the local machine and leverage Amazon Bedrock serverless models for quick testing.
- RAG Pipeline: Queries retrieve relevant context, and the LLM generates answers.
- RAGAS evaluation: Evaluate some sample queries using the LLM based metrics.


This notebook provides an end-to-end example of building an experimental RAG system locally.

# Congratulations on finishing Lab 1. Now please continue on to the next Lab.