# AISA Capstone 2 Assignment

## Overview
This notebook provides an environment for you to build your intuition on the steps to take when developing a high quality Retrieval Augmented Generation (RAG) solution. 
RAG solutions retrieve data before calling the large language model (LLM) to generate an answer. 
The retrieved data is used to augment the prompt to the LLM by adding the relevant retrieved data in context. 
Any RAG solution is only as good as its data retrieval process, which is the particular focus of this notebook and of the accompanying questions and tasks.
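
To make the "augment the prompt" step concrete, here is a minimal sketch of how retrieved text might be placed in front of a question. The function name `build_augmented_prompt`, the prompt wording, and the placeholder chunks are assumptions for illustration only; LlamaIndex builds its own prompts internally and they will differ from this.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Place retrieved text ahead of the question so the LLM can ground its answer."""
    # Number each chunk so the context block is easy to read.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks stand in for whatever the retriever returns.
prompt = build_augmented_prompt(
    "What did the author do growing up?",
    ["(text of retrieved chunk 1)", "(text of retrieved chunk 2)"],
)
print(prompt)
```

If the retriever returns the wrong chunks, the LLM only ever sees the wrong context, which is why the rest of the notebook measures retrieval on its own.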

The RAG solution developed here is built with the LlamaIndex framework, a popular framework in the industry for developing RAG and agent-based solutions. In addition to providing a core set of tools for orchestrating RAG and agent workflows, it integrates with a broad variety of platforms for model inference (LLM, embedding, ...) and, importantly, provides tooling for solution evaluation.

## Prerequisites for running the notebook
- You have been granted access to the Bedrock models that you are going to use, in the region (**us-west-2**) where you are going to use Bedrock - 
[reference](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html)
- Your SageMakerExecutionRole has permissions to invoke Bedrock models - 
[reference](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-prereq.html)
- This notebook has been tested on a SageMaker Notebook Instance running a `conda_python3` kernel
- The AWS region used for Amazon Bedrock must be one where the models being used are 1/ available, and 2/ enabled for use. This notebook was tested with Bedrock region `us-west-2`
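
As a rough way to check the "available in the region" prerequisite, the sketch below compares a list of wanted model IDs against the region's model catalogue. It assumes the boto3 `bedrock` control-plane client and its `list_foundation_models` call; the helper names `find_unavailable_models` and `check_region` are hypothetical. Note that a model being listed does not prove that access has been granted to your account, so this covers only the first of the two conditions.

```python
def find_unavailable_models(bedrock_client, wanted_model_ids):
    """Return the wanted model IDs that the region's model catalogue does not list."""
    response = bedrock_client.list_foundation_models()
    listed = {summary["modelId"] for summary in response["modelSummaries"]}
    return sorted(set(wanted_model_ids) - listed)


def check_region(region_name, wanted_model_ids):
    """Build a control-plane client for the region and run the check."""
    import boto3  # imported here so the helper above can be exercised without AWS access

    client = boto3.client("bedrock", region_name=region_name)
    return find_unavailable_models(client, wanted_model_ids)


# Example call (requires AWS credentials, so it is left commented out):
# check_region("us-west-2", ["amazon.titan-embed-text-v1", "mistral.mistral-7b-instruct-v0:2"])
```

An empty list from `check_region` suggests that every wanted model is offered in that region.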

## Implementation
This notebook uses LlamaIndex to define and execute the RAG solution. We will be using the following tools:

- **LLM (Large Language Model)**: e.g. Anthropic Claude Haiku available through Amazon Bedrock

  LLMs are used in the notebook for 1/ RAG response generation, to show the overall RAG workflow in action, and 2/ for generating test questions on the indexed content (LlamaIndex nodes) for retrieval evaluation.
  
- **Text Embeddings Model**: e.g. Amazon Titan Embeddings available through Amazon Bedrock

  This embedding model is used to generate semantic vector representations of both the content to be stored (LlamaIndex nodes) and the questions input to the RAG solution.
  
- **Document Loader**: SimpleDirectoryReader (LlamaIndex)

  Before your chosen LLM can act on your data you need to load it. LlamaIndex does this via data connectors, also called Readers. Data connectors ingest data from different data sources and format the data into Document objects. A Document is a collection of data (currently text, and in future, images and audio) and metadata about that data.
  
  This implementation uses SimpleDirectoryReader, which creates documents out of every file in a given directory. It can read a variety of formats including Markdown, PDFs, Word documents, and PowerPoint decks.

- **Vector Store**: VectorStoreIndex (LlamaIndex)

  In this notebook we are using this in-memory vector store to store both the embeddings and the documents. In an enterprise context this could be replaced with a persistent store such as Amazon OpenSearch Service, Amazon RDS for PostgreSQL with pgvector, ChromaDB, Pinecone or Weaviate.
  
  LlamaIndex abstracts the underlying vector database storage implementation with the VectorStoreIndex class. This wraps the Index, which is a data structure composed of Document objects, designed to enable querying by an LLM. The Index is designed to be complementary to your querying strategy.
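
The embeddings model and the vector store work together: content and questions are turned into vectors, and retrieval returns the stored vectors closest to the question vector. The sketch below illustrates that idea with cosine similarity over a plain dictionary. The three-dimensional vectors, the node names, and the helper names are made up for illustration; this is not how VectorStoreIndex is implemented internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda node_id: cosine_similarity(query_vec, store[node_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional "embeddings"; real models return far more dimensions.
toy_store = {
    "node_a": [1.0, 0.0, 0.0],
    "node_b": [0.7, 0.7, 0.0],
    "node_c": [0.0, 0.0, 1.0],
}
print(top_k([0.9, 0.1, 0.0], toy_store, k=2))  # ['node_a', 'node_b']
```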

----

## Setup

Install required Python modules for constructing the RAG solution.
You only need to run this once.

In [1]:
%pip install \
    llama-index \
    llama-index-llms-bedrock \
    llama-index-embeddings-bedrock

Note: you may need to restart the kernel to use updated packages.


Download the default RAG test source data to our target source_docs directory. 
You only need to run this once.

In [2]:
source_docs_dir = './source_docs/'

The following creates the source_docs directory and downloads a document to that directory. The contents of this directory, 
initially the document that is downloaded here, will be used in the steps that follow.

After running this notebook in its entirety and reviewing its operation, delete this content and add your own content to the directory.

In [3]:
# Download and load data
!mkdir -p {source_docs_dir}
!wget --no-check-certificate 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O {source_docs_dir}'/paul_graham_essay.txt'

--2025-02-19 14:51:22--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘./source_docs//paul_graham_essay.txt’


2025-02-19 14:51:22 (43.1 MB/s) - ‘./source_docs//paul_graham_essay.txt’ saved [75042/75042]



Import auxiliary modules

In [4]:
import logging
import sys
import os
import pandas as pd
import boto3  # AWS SDK for Python

In [5]:
# This is required when running within a jupyter notebook, otherwise you will get errors when llamaindex modules run
import nest_asyncio

nest_asyncio.apply()

Import required Python modules for constructing and evaluating the RAG solution

In [6]:
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.text_splitter import TokenTextSplitter


from llama_index.core.evaluation import (
    DatasetGenerator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)

# VectorStoreIndex is already imported at the top of this cell
from llama_index.core import (
    SimpleDirectoryReader,
    Response,
)

## Configure the models that will be used for the RAG pipeline

**Note**: By default this notebook will use the `us-west-2` region. This region has support for the models used in this notebook. You should not need to change this setting.

In [7]:
AWS_REGION = "us-west-2"
# AWS_REGION = "us-east-1"  # this is an alternative setting to use if desired

Define the set of Bedrock model IDs that we'll use when developing and testing our solution.

Establish a connection to the Amazon Bedrock service

In [8]:
# Pin the client to the same region that is used for the models below
boto3_bedrock = boto3.client("bedrock-runtime", region_name=AWS_REGION)

### Configure the target embeddings models for use with LlamaIndex

In [9]:
titan_text_embeddings_multilingual_v1_id = "amazon.titan-embed-text-v1"
titan_text_embeddings_multilingual_v2_id = "amazon.titan-embed-text-v2:0"
cohere_text_embeddings_english_id = "cohere.embed-english-v3"
cohere_text_embeddings_multilingual_id = "cohere.embed-multilingual-v3"

Configure our chosen embeddings model for use with llama_index

In [10]:
titan_text_embeddings_v2 = BedrockEmbedding(model=titan_text_embeddings_multilingual_v2_id, region_name=AWS_REGION)
titan_text_embeddings_v1 = BedrockEmbedding(model=titan_text_embeddings_multilingual_v1_id, region_name=AWS_REGION)
cohere_text_embeddings_english = BedrockEmbedding(model=cohere_text_embeddings_english_id, region_name=AWS_REGION)
# Use the multilingual model ID here (the English ID was previously passed by mistake)
cohere_text_embeddings_multilingual = BedrockEmbedding(model=cohere_text_embeddings_multilingual_id, region_name=AWS_REGION)

### Configure the target LLMs for use with LlamaIndex

The following Mistral models produce good questions for evaluation. The Titan model produces questions of lesser quality 
and sometimes not in the format needed by the tools. 

**Note** Most Bedrock LLMs do not produce questions in a format that can be directly used for evaluation with the tooling as it is configured in this notebook.

In [11]:
instruct_mistral7b_id = "mistral.mistral-7b-instruct-v0:2"
instruct_mixtral8x7b_id="mistral.mixtral-8x7b-instruct-v0:1"
titan_text_express_id = "amazon.titan-text-express-v1"

In [12]:
# set the parameters to be applied when invoking the model
model_kwargs_llm = {
    "temperature": 0.5,
    "top_p": 0.9,
    "top_k": 200,
    "max_tokens": 4096
}

In [13]:
llm_mistral7b = Bedrock(model=instruct_mistral7b_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_mixtral8x7b = Bedrock(model=instruct_mixtral8x7b_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)
llm_titan_express = Bedrock(model=titan_text_express_id, client=boto3_bedrock, model_kwargs=model_kwargs_llm, region_name=AWS_REGION)

### Use the following cell to configure the embeddings model to use for the cells that follow

The embeddings model is a critical choice for the accuracy of your RAG solution.
Experiment with the options here to see which is best for your content.
If you want more, test with further alternatives. There are many that are readily supported by llama_index.

In [14]:
# KEY CELL 01

embed_model = titan_text_embeddings_v1
# embed_model = titan_text_embeddings_v2
# embed_model = cohere_text_embeddings_english
# embed_model = cohere_text_embeddings_multilingual

### Use the following cell to configure the LLM to use for the cells that follow
The LLM will be used for question generation and RAG answer generation in this notebook as it is currently configured.
The default value llm_mistral7b works well with the code and should be used if possible. 

In [15]:
llm_model = llm_mistral7b
# llm_model = llm_mixtral8x7b
# llm_model = llm_titan_express

In [16]:
# Set LlamaIndex default model settings to what was set in the cells above
Settings.embed_model = embed_model
Settings.llm = llm_model

## Read in the documents for adding to our data store

Read the documents in the `./source_docs/` directory into a structure ready for use by llama_index

In [17]:
reader = SimpleDirectoryReader(source_docs_dir)
documents = reader.load_data()

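Quick check here to see that all of your documents were read. The count is the number of Document objects created: one per file for plain-text files such as the default essay, and typically one per page for paginated formats such as PDF.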

In [18]:
len(documents)

1

## Create and run the document ingestion pipeline

The following cell defines two different document ingestion pipelines. 
If you have time, test using both of these, then create your own and test with that as well.

In [19]:
# Define two sets of transformations for the ingestion pipelines for initial experimentation

transformations_00=[
        TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=100),
        embed_model,
    ]

transformations_01=[
        SentenceSplitter(chunk_size=512, chunk_overlap=100),
        TitleExtractor(),
        embed_model,
    ]
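
Both pipelines split the text into chunks of at most `chunk_size` with `chunk_overlap` shared between neighbouring chunks. The sketch below illustrates the overlap idea on whitespace-separated words. It is a simplified stand-in with a hypothetical function name: the LlamaIndex splitters count tokenizer tokens rather than words, and SentenceSplitter additionally tries to respect sentence boundaries.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_with_overlap(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Larger overlap reduces the chance that a fact is cut in half at a chunk boundary, at the cost of more nodes to embed and search.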


### Use the following cell to configure the data ingestion pipeline for processing the source data

In [20]:
# KEY CELL 02

# create the pipeline with one of the transformation configurations defined above

pipeline = IngestionPipeline(transformations=transformations_00)
# pipeline = IngestionPipeline(transformations=transformations_01)


### Run the configured ingestion pipeline 

In [21]:
# run the pipeline
nodes = pipeline.run(documents=documents)
print(f"number of nodes: {len(nodes)}")

number of nodes: 44


The following cell gives the nodes predictable ids, which may make test analysis easier. It is not essential.

In [22]:
# By default, the node ids are set to random uuids. 
# To ensure same id's per run, we manually set them to consistent sequential numbers.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

In [23]:
# validate that node has an embedding associated with it
for idx, node in enumerate(nodes):
    if node.id_ == "node_0":
        print(node.embedding)

[0.1701388955116272, -0.23478008806705475, -0.08697916567325592, -0.028327545151114464, 0.6199652552604675, -0.1631365716457367, -0.27135413885116577, 0.0008344720699824393, -0.011516204103827477, 0.08718894422054291, -0.004535590298473835, 0.19147858023643494, 0.131626158952713, 0.24182581901550293, 0.07359664887189865, 0.14907407760620117, 0.14453125, 0.47193285822868347, -0.293070912361145, 0.006018519401550293, -0.09528356045484543, 0.2752821147441864, -0.17750290036201477, 0.03931206464767456, -0.016977719962596893, 0.03742766007781029, -0.06696686893701553, -0.08252314478158951, 0.008521411567926407, 0.1655237227678299, -0.31513309478759766, 0.1688946783542633, 0.16609518229961395, -0.11681133508682251, -0.11134259402751923, -0.2849247455596924, 0.023553241044282913, 0.061834488064050674, 0.47647568583488464, -0.13562282919883728, 0.16093748807907104, 0.12187500298023224, -0.2752893567085266, -0.49947917461395264, -0.07238136976957321, 0.12523147463798523, -0.01371527649462223, -

## Create the VectorStoreIndex
This creates our vector database, in memory in this case, using the nodes that were created in the previous step

In [24]:
vector_index = VectorStoreIndex(nodes=nodes)

## Test that we have a valid starting point for our evaluation
We run a quick system test with the default llama_index RAG workflow with a question that is relevant to our dataset

Instantiate a query engine object

In [25]:
query_engine = vector_index.as_query_engine()

Specify a question that can be answered by the document(s) that have been ingested. For the default document, the following is a valid question.

In [26]:
example_query="What did the author do growing up?"

Run the default RAG pipeline with the example query. This should give a meaningful result. Don't worry if the answer is overly verbose, etc. We'll fix that later.

In [27]:
response = query_engine.query(example_query)
print(response)

 The author attended RISD (Rhode Island School of Design) and learned to paint there. Prior to that, he was not one of the kids who could draw in high school. He dropped out of RISD in 1993 and moved to New York City to pursue a career as a painter. He had a friend, Idelle Weber, who was a painter and taught him a lot. He was also nervous about money and decided to write another book on Lisp to live off the royalties and paint. He also worked on various projects such as spam filters, cooking for groups, and buying a building in Cambridge to use as an office. He eventually met his future partner, Jessica Livingston, at a party and they started a venture capital firm together.


----

# Evaluate the retrieval accuracy of the VectorIndex

## Create a set of question and node (context) pairs to drive the tests that follow
This uses the LLM configured above (`Settings.llm`) and the document text stored in the nodes (created during document ingestion) to generate questions, pairing each question with the node it was generated from.

This will make many calls to the specified LLM (one per node, with each call asking for `num_questions_per_chunk` questions). These calls will likely be throttled by Bedrock. The llama_index API will work through the throttling except in extreme cases.
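
"Working through the throttling" generally means retrying a failed call after a pause. The sketch below shows a generic retry loop with exponential backoff. It is an illustration of the pattern only, not the retry logic that llama_index uses; `call_with_backoff`, `flaky_call`, and the use of `RuntimeError` as a stand-in for a throttling error are all assumptions.

```python
import time


def call_with_backoff(func, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func, waiting base_delay * 2**attempt between failed attempts."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RuntimeError:  # stand-in for a throttling error
            if attempt == max_retries:
                raise  # give up once the retry budget is spent
            sleep(base_delay * 2 ** attempt)


attempts = {"count": 0}
delays = []  # records the waits instead of actually sleeping


def flaky_call():
    """Fail twice, then succeed, to imitate a briefly throttled service."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return "ok"


result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Each retry adds waiting time, which is one reason the wall time for question generation can be far longer than the CPU time.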

In [28]:
%%time
qa_dataset = generate_question_context_pairs(nodes, num_questions_per_chunk=1)

  7%|▋         | 3/44 [00:09<02:19,  3.41s/it]Retrying llama_index.llms.bedrock.utils.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests.  Wait before trying again..
Retrying llama_index.llms.bedrock.utils.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests.  Wait before trying again..
  9%|▉         | 4/44 [00:31<07:03, 10.58s/it]Retrying llama_index.llms.bedrock.utils.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised ThrottlingException: An error occurred (ThrottlingException) whe

CPU times: user 989 ms, sys: 52.6 ms, total: 1.04 s
Wall time: 12min 54s





Take a look at the sample queries generated. These should be meaningful questions related to your document content.

In [29]:
# qa_dataset.queries maps each query id to its question text
for query in qa_dataset.queries.values():
    print(query)

What programming language did the author use when he first started programming on the IBM 1401 computer in 9th grade?
In what year did the narrator first start programming on a microcomputer, and what type of computer was it?
In what novel did the author describe an intelligent computer named Mike that inspired the speaker to pursue a career in AI?
In which universities did the author apply for graduate school in Artificial Intelligence during the 1980s?
In what year did Paul Graham write most of his book "On Lisp"?
In what year did the author first consider the idea of becoming an artist after visiting the Carnegie Institute?
What topic did the author choose for his PhD dissertation in computer science, despite not having a significant amount of time to write it?
What art school did the author initially apply to and why?
What arrangement existed between the students and faculty in the painting department at the Accademia, allowing both parties to adhere to the conventions of a 19th ce

## Instantiate a retriever against the index for testing



### Set the number of items to return from the Retriever
This is a trade-off: more returned content is not always better. Consider how this may impact your pipeline and evaluation results, and experiment with it.
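
One side of the trade-off can be worked out in advance. If each test question has exactly one expected node (as is the case for the question/context pairs generated above), then at most one of the k retrieved nodes can be relevant, so precision can never exceed 1/k. The sketch below computes that ceiling; `best_case_precision` is a hypothetical helper, and it assumes precision is defined as relevant retrieved nodes divided by retrieved nodes.

```python
def best_case_precision(k, relevant_per_question=1):
    """Upper bound on precision when only relevant_per_question nodes can match."""
    return min(k, relevant_per_question) / k


for k in (2, 3, 4):
    print(k, round(best_case_precision(k), 3))
# 2 0.5
# 3 0.333
# 4 0.25
```

So a falling precision as k grows is expected here; hit rate and recall are the metrics that can improve with a larger k.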

In [30]:
# KEY CELL 03

number_of_items_to_return = 2
# number_of_items_to_return = 3
# number_of_items_to_return = 4

In [31]:
retriever = vector_index.as_retriever(similarity_top_k=number_of_items_to_return)

Run a quick system test on the retriever and check that the output nodes look reasonable

In [32]:
example_query="What did the author do growing up?"

In [33]:
retrieved_nodes = retriever.retrieve(example_query)
print(retrieved_nodes)

[NodeWithScore(node=TextNode(id_='node_13', embedding=None, metadata={'file_path': '/home/ec2-user/SageMaker/elvtr-ai-solution-architect/class-11/source_docs/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2025-02-19', 'last_modified_date': '2025-02-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='5a4e7269-3f03-41dd-aebd-1b8e8d7a5fea', node_type='4', metadata={'file_path': '/home/ec2-user/SageMaker/elvtr-ai-solution-architect/class-11/source_docs/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2025-02-19', 'last_modified_date': '2025-02-19'}, hash

## Evaluate the Quality of Retrieval from the VectorIndex

Instantiate a RetrieverEvaluator with the metrics that we want to review

In [34]:
metrics = ["hit_rate", "mrr", "precision", "recall"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)

In [35]:
# Evaluate on a single query
# The output is verbose, but may be useful for looking at specific results

query_id = 1  # change this to match the position (in the generated list) of the query of interest

sample_id, sample_query = list(qa_dataset.queries.items())[query_id]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

Query: In what year did the narrator first start programming on a microcomputer, and what type of computer was it?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.5, 'recall': 1.0}



In [36]:
# to see detail on which nodes were returned, etc, we can look at the whole returned object
eval_result

RetrievalEvalResult(query='In what year did the narrator first start programming on a microcomputer, and what type of computer was it?', expected_ids=['node_1'], expected_texts=None, retrieved_ids=['node_1', 'node_0'], retrieved_texts=["memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.\n\nWith microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1]\n\nThe first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.\n\nComputers were expensive in those da
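
The four metric values can be reproduced by hand from `expected_ids` and `retrieved_ids`. The sketch below uses the common definitions of these metrics (hit rate: any relevant node retrieved; MRR: reciprocal rank of the first relevant node; precision and recall: overlap divided by retrieved and expected counts respectively). The function `retrieval_metrics` is a hypothetical helper written for this illustration, not the library's implementation.

```python
def retrieval_metrics(expected_ids, retrieved_ids):
    """Compute hit rate, MRR, precision and recall for one query."""
    expected = set(expected_ids)
    hits = {node_id for node_id in retrieved_ids if node_id in expected}
    # Rank (1-based) of the first relevant node, or None when nothing relevant was retrieved.
    first_rank = next(
        (rank for rank, node_id in enumerate(retrieved_ids, start=1) if node_id in expected),
        None,
    )
    return {
        "hit_rate": 1.0 if hits else 0.0,
        "mrr": 1.0 / first_rank if first_rank else 0.0,
        "precision": len(hits) / len(set(retrieved_ids)) if retrieved_ids else 0.0,
        "recall": len(hits) / len(expected) if expected else 0.0,
    }


# Ids taken from the evaluation result shown above.
print(retrieval_metrics(["node_1"], ["node_1", "node_0"]))
# {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.5, 'recall': 1.0}
```

If the expected node had been returned second instead of first, MRR would drop to 0.5 while the other three values stayed the same.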

In [37]:
# Run evaluation on the entire test dataset (autogenerated above)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In [38]:
def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    precision = full_df["precision"].mean()
    recall = full_df["recall"].mean()

    metric_df = pd.DataFrame({"retrievers": [name],
                              "hit_rate": [hit_rate], "mrr": [mrr],
                              "precision": [precision], "recall": [recall],
                             })
    return metric_df, full_df


### Top-level Evaluation Results

In [39]:
summary, detail = display_results(f"top-{number_of_items_to_return} eval", eval_results)
summary

|   | retrievers | hit_rate | mrr | precision | recall |
|---|--------|--------|--------|--------|--------|
| 0 | top-2 eval | 0.636364 | 0.534091 | 0.318182 | 0.636364 |


In [40]:
# Optionally, look at the detailed, question by question metrics:
detail

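|   | hit_rate | mrr | precision | recall |
|---|--------|--------|--------|--------|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 0.5 | 1.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 1.0 | 0.5 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 1.0 | 1.0 | 0.5 | 1.0 |
| 7 | 1.0 | 0.5 | 0.5 | 1.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 1.0 | 1.0 | 0.5 | 1.0 |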


----

# Assignment Instructions

**01** Run the notebook in full with the default document set provided (one document in a .txt file).

Copy the Top-level Evaluation Results (from the cell a little way above this one) into this cell.

Save the notebook and rename it as `capstone-02-01-first-run.ipynb`. Download the notebook for assignment submission.

**Answer**

| Metric | Value |
|--------|--------|
| Retriever Hit Rate | 0.636364 |
| MRR | 0.534091 |
| Precision | 0.318182 |
| Recall | 0.636364 |


**02** Replace the default document with one or more of your own documents and re-run all the cells.

- First make a copy of the first notebook and paste it into the same directory.
- Delete the original content in source_docs
- Delete the cell in the notebook that will download the original content again to source_docs
- Upload your content to source_docs
- Initially test with a small set - 20 to 40 nodes, aka document chunks - to save you time as you experiment. This may mean that you delete some of your content to reduce its size.
- Run the whole notebook with your small set of test documents and observe the results.

Wrapping up:
- Copy the Top-level Evaluation Results (from the cell a little way above this one) into this cell.
- The top-level results returned are your `baseline` results. As you experiment with different configurations, in the following assignment tasks, you will see the route to improved accuracy from this baseline.
- Describe your test dataset in this cell. How many documents, how many chunks, what is the topic of the content, and what is the language?

This completes this step.

- Save and rename the notebook as `capstone-02-02-my-data-baseline.ipynb`. Download the notebook for assignment submission.

**03** Using your data, experiment with different embedding models

- First make a copy of the previous notebook,`capstone-02-02-my-data-baseline.ipynb`, and paste it into the same directory.
- Change which embedding model is going to be used from the original setting. 
There are three other options prepared in this notebook, see `KEY CELL 01`. 
- Run all the following cells in the notebook and note the accuracy metrics for the selected model.
- Repeat with one or more of the other embeddings models, noting the accuracy metrics for the selected model each time.
Note: Changing the embeddings model will change the accuracy of the retriever. Exactly how much will depend on your content.

Wrapping up:
- For each embeddings model that you experiment with, copy the Top-level Evaluation Results (from the cell a little way above this one) into this cell, along with an indication as to which embeddings model was being used.

- Briefly describe your results in this cell. Which model was best? Did you see a major improvement in the metrics? Can you suggest a reason why the best embeddings model is performing better than the worst?
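
When you compare several runs, it can help to collect the top-level metrics in one Markdown table that you can paste into this cell. The sketch below is one possible way to do that. The helper name `comparison_table` and the run label are assumptions; the single example row uses the values from the first run shown earlier in this notebook, and you would add one entry per experiment.

```python
def comparison_table(runs):
    """Render {run_label: metrics_dict} as a Markdown table, one row per run."""
    metric_names = ["hit_rate", "mrr", "precision", "recall"]
    header = "| Run | " + " | ".join(metric_names) + " |"
    divider = "|" + "-----|" * (len(metric_names) + 1)
    rows = [
        "| " + label + " | "
        + " | ".join(f"{metrics[name]:.6f}" for name in metric_names)
        + " |"
        for label, metrics in runs.items()
    ]
    return "\n".join([header, divider, *rows])


# Values copied from the top-level results of the first run above.
runs = {
    "titan_text_embeddings_v1, top-2": {
        "hit_rate": 0.636364,
        "mrr": 0.534091,
        "precision": 0.318182,
        "recall": 0.636364,
    },
}
print(comparison_table(runs))
```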

This completes this step.


- Save and rename the notebook as `capstone-02-03-my-data-embeddings.ipynb`. Download the notebook for assignment submission.

**04** Using your data, experiment with different ingestion pipelines

- First make a copy of the previous notebook,`capstone-02-03-my-data-embeddings.ipynb`, and paste it into the same directory.
- Make sure the notebook is configured to use the best performing embeddings model.
- Change which pipeline is going to be used from the original setting. 
There is one other option prepared in this notebook, see `KEY CELL 02`.
- Run all the following cells in the notebook and note the accuracy metrics for the selected pipeline.
- Optionally, create your own pipeline and experiment with that to further improve the accuracy metrics of your solution.
Note: Changing the ingestion pipeline should change the accuracy of the retriever. Exactly how much will depend on your content. 
The two example pipelines in this notebook may not make much of a difference for your content. 
If you have time, you will learn most by creating your own pipeline and seeing the change in the metrics.

Wrapping up:
- For each pipeline that you experiment with, copy the Top-level Evaluation Results (from the cell a little way above this one) into this cell, along with the definition of the pipeline being used.

- Briefly describe your results in this cell. Which pipeline was best? Did you see a major improvement in the metrics? Can you suggest a reason why the best pipeline is performing better than the worst?

This completes this step.


- Save and rename the notebook as `capstone-02-04-my-data-pipeline.ipynb`. Download the notebook for assignment submission.

**05** Using your data, experiment with different values of k

- First make a copy of the previous notebook,`capstone-02-04-my-data-pipeline.ipynb`, and paste it into the same directory.
- Make sure the notebook is configured to use the best performing pipeline and embeddings model.
- Change the value of *k* that is going to be used from the original setting. 
There are two other options prepared in this notebook, see `KEY CELL 03`.
- Run all the cells that follow below in the notebook and note the accuracy metrics for the selected value of *k*. 
*Note*: with this setting, you do not need to re-run the cells above the cell where you make the update.


Wrapping up:
- For each value of *k* set for the `retriever`, run the evaluation and copy the Top-level Evaluation Results into this cell, along with noting the value of *k* being used.

- Briefly describe your results in this cell. Which value of *k* gave the best results? Did you see a major improvement in the metrics? Can you suggest a reason why the best value of *k* is performing better than the worst?

This completes this step.


- Save and rename the notebook as `capstone-02-05-my-data-pipeline-with-k.ipynb`. Download the notebook for assignment submission.

**06** Summarize

- Briefly summarize, in four paragraphs, what you learned regarding the following topics:
    - The configuration of the retriever for your content
    - The process of experimenting with different settings and evaluating the results
    - Which evaluation metric was most useful and why
    - What you might do next, if you had 40 hours or more to work on this, to further improve the quality of the retriever
- Save your summary in this cell.
    
This completes this step.


- Save and rename the notebook as `capstone-02-06-summary.ipynb`. Download the notebook for assignment submission.
    


Post each of the notebooks, individually, to submit your assignment. Do not zip the set of notebooks, as that makes it harder for the grading process. 

## The following assignment tasks are completely optional 
The following tasks are intended for students who want to dive deeper. They are more open-ended and require changing and augmenting the code shared above.

**Optional Task 1** Text Chunking

Experiment further with advanced chunking options and see if you can further improve the accuracy of your Retriever (by having better pre-processed data).

Good candidates to look at are [Semantic chunking](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/), and
[Semantic double merging chunking](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_double_merging_chunking/).

**Optional Task 2** Explore other Transformations

Using the following [document](https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/transformations/) as your starting point, consider which transformations would be most applicable to your set of documents and experiment and see the impact of those changes.

There are lots of [transformations](https://docs.llamaindex.ai/en/v0.9.48/module_guides/loading/ingestion_pipeline/transformations.html) to consider including 
[text splitters](https://docs.llamaindex.ai/en/v0.9.48/module_guides/loading/node_parsers/modules.html#text-splitters),
[node parsers](https://docs.llamaindex.ai/en/v0.9.48/module_guides/loading/node_parsers/modules.html), 
and [metadata extractors](https://docs.llamaindex.ai/en/v0.9.48/module_guides/loading/documents_and_nodes/usage_metadata_extractor.html).



**Optional Task 3** Embeddings Model

Experiment further with other state-of-the-art embeddings models and see if you can further improve the accuracy of your Retriever.

The embeddings models developed by [Voyage AI](https://www.voyageai.com) are some of the best in the industry. They also provide generous free-tier use of the embeddings models (as a service). Signing-up to get a developer account is a fairly light-weight process. 

The benefits that you'll get are 1/ seeing how to use a whole new model family for embeddings with LlamaIndex, 2/ a deeper knowledge of your embeddings options, and 3/ perhaps, a more accurate Retriever for your solution.

If you prefer to try other embeddings models, such as those provided by OpenAI, that's also well worth exploring.