# RAG Fundamentals and Semantic Chunking

Retrieval-Augmented Generation (RAG) begins with adding contextual data to the prompt that is passed to a Large Language Model (LLM). The expectation is that In-Context Learning (ICL) takes place, leading the LLM to produce better results.

RAG can be more effective when Semantic Chunking is used. The basic idea is to retrieve and compile small "chunks" of data to augment the prompt sent to an LLM, rather than inserting entire documents that contain the topic of interest alongside information that is not relevant to the user interaction. For example, imagine a book about how to assemble a computer. It will contain sections about CPUs, motherboards, displays, and so on. Now suppose you have a question about how to install a hard drive. Would you read the section about keyboards, or would you go straight to the one about hard drives?

The same idea is applicable to LLMs. In addition, the context window<sup>‡</sup> is limited, so use the available space wisely.
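As a rough illustration of spending the context window wisely, the sketch below greedily keeps the highest-ranked chunks until a token budget is reached. The helper names (`estimate_tokens`, `pack_chunks`) and the whitespace-based token estimate are assumptions made for this example; a real tokenizer would report different counts.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    # A real tokenizer will usually report a different count.
    return len(text.split())


def pack_chunks(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks, in ranked order, while they fit in the token budget."""
    selected: list[str] = []
    used_tokens = 0

    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used_tokens + cost > max_tokens:
            break  # Stop at the first chunk that does not fit.
        selected.append(chunk)
        used_tokens += cost

    return selected


# Hypothetical chunks, already ordered from most to least relevant.
ranked_chunks = [
    "How to install a hard drive in a desktop case",
    "Connecting SATA and power cables",
    "Choosing a keyboard layout",
]
print(pack_chunks(ranked_chunks, max_tokens=15))
```

With a budget of 15 estimated tokens, only the first two chunks fit, and the keyboard chunk is left out.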

One challenge that emerges from Semantic Chunking is determining the optimal chunk size. A chunk with too much information produces an embedding that tries to encode too many meanings, while a chunk with too little information produces an embedding that cannot express enough meaning.

Context for RAG and semantic chunking comes from the paper "Fostering Trust and Quantifying Value of AI and ML," of which I am an author, and which was presented at [The 2024 IARIA Annual Congress on Frontiers in Science, Technology, Services, and Applications](https://www.iaria.org/conferences2024/ProgramIARIACongress24.html).

.........

‡ <sup><sub>The number of tokens a model can receive as input. Its capacity influences how much information can be leveraged to run inferences.</sub></sup> 

## The Code

In this example, we'll explore semantic chunking and see the full pipeline from raw data through to chunking and embedding our data, ready for RAG.

This notebook implements an intuitive, albeit simple, version of RAG and Semantic Chunking. Most ML practitioners will be able to follow all the steps: splitting the text into chunks, persisting the information to a vector database, building a prompt, and querying an LLM.

<p style="text-align: center;">* * * * *</p>

The implementation begins with importing the necessary packages and reading the environment variables. [ChromaDB](https://www.trychroma.com) is used as the vector database to store embeddings, and [OpenAI](https://openai.com) is used for generating embeddings and running inference.

In [1]:
import chromadb
import json
import os

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from openai import OpenAI

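# Fail early with a clear message when the key is missing; assigning None to os.environ raises a TypeError.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this notebook.")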

The implementation of `chunk_text` is naive and intended only to convey the basic idea. Here, the code groups a fixed number of lines together (a "chunk") and collects the chunks in a list. The intuition is that lines near each other are more likely to address the same topic than lines that are far apart.

In [2]:
def chunk_text(*, file_name: str, max_lines_per_chunk: int = 10) -> list[str]:
    text_chunks: list[str] = []

    with open(file_name, "r") as source_file:
        chunk: list[str] = []

        # Iterating over the file yields one line at a time and stops at end of file.
        for line_content in source_file:
            chunk.append(line_content)

            if len(chunk) == max_lines_per_chunk:
                text_chunks.append(' '.join(chunk))
                chunk.clear()

        # Flush the last, possibly shorter, chunk; skip it when it is empty.
        if chunk:
            text_chunks.append(' '.join(chunk))

    return text_chunks
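A common variation on fixed-size chunking, not used in this notebook, lets consecutive chunks share a few lines so that a sentence split at a chunk boundary still appears whole in at least one chunk. The sketch below is a minimal illustration; the function name `chunk_lines_with_overlap` and the `overlap` parameter are assumptions introduced for this example, and it works on a list of lines rather than a file.

```python
def chunk_lines_with_overlap(
    lines: list[str],
    max_lines_per_chunk: int = 10,
    overlap: int = 2,
) -> list[str]:
    if not 0 <= overlap < max_lines_per_chunk:
        raise ValueError("overlap must be non-negative and smaller than max_lines_per_chunk")

    # Each new window starts `step` lines after the previous one.
    step = max_lines_per_chunk - overlap
    text_chunks: list[str] = []

    for start in range(0, len(lines), step):
        window = lines[start:start + max_lines_per_chunk]
        text_chunks.append(" ".join(window))
        if start + max_lines_per_chunk >= len(lines):
            break  # This window already reached the end of the input.

    return text_chunks


sample_lines = [f"line {n}" for n in range(1, 8)]
for chunk in chunk_lines_with_overlap(sample_lines, max_lines_per_chunk=4, overlap=1):
    print(chunk)
```

With seven input lines, a chunk size of four, and an overlap of one, the result is two chunks that both contain "line 4".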

In addition to storing the embeddings, `update_vector_db` assigns an incremental id (i.e., "id-1", "id-2", ..., "id-_n_") to each entry. The ids matter because, once a query identifies the chunks most relevant to a question, we also want to fetch each candidate's two neighboring chunks: the one immediately before ("id-{_i_-1}") and the one immediately after ("id-{_i_+1}").

In [3]:
def update_vector_db(*, vector_db, text_chunks: list[str], start_index: int = 1):
    index_id = start_index

    for chunk in text_chunks:
        vector_db.upsert(
            documents=[chunk],
            ids=[f"id-{index_id}"]
        )

        index_id += 1
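Because `upsert` takes parallel lists of documents and ids, the same id scheme can also be written as a single batched call. The sketch below is one possible variant, not the notebook's implementation: `update_vector_db_batched` and `RecordingCollection` are hypothetical names, and the stand-in class only records calls so the example runs without a database. Very large inputs may still need to be split into smaller batches.

```python
def update_vector_db_batched(*, vector_db, text_chunks: list[str], start_index: int = 1) -> list[str]:
    # Build the same "id-<n>" sequence used by the per-chunk version.
    ids = [f"id-{start_index + offset}" for offset in range(len(text_chunks))]

    if text_chunks:
        # Assumes the collection accepts parallel lists of documents and ids.
        vector_db.upsert(documents=text_chunks, ids=ids)

    return ids


class RecordingCollection:
    """Hypothetical stand-in for a collection; it only records upsert calls."""

    def __init__(self):
        self.stored: dict[str, str] = {}

    def upsert(self, *, documents: list[str], ids: list[str]) -> None:
        self.stored.update(zip(ids, documents))


fake_db = RecordingCollection()
print(update_vector_db_batched(vector_db=fake_db, text_chunks=["first", "second"]))
```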

`ChunkInstance` is a convenient data structure to store a text chunk, its id, and the chunks immediately before and after it. The class also implements `__str__()`, which makes it easy to print or log instance data.

In [4]:
class ChunkInstance:
    prior_chunk_id = None
    prior_document = None
    post_chunk_id = None
    post_document = None

    def __init__(self, *, chunk_id: str, document: str):
        self.chunk_id = chunk_id
        self.document = document
    
    def __str__(self):
        chunk_dic = {
            "chunk_id": self.chunk_id,
            "document": self.document,
            "prior_chunk_id": self.prior_chunk_id,
            "prior_document": self.prior_document,
            "post_chunk_id": self.post_chunk_id,
            "post_document": self.post_document
        }

        # Neighbor fields are already in the dictionary; they serialize as null when a chunk has no neighbor.

        return json.dumps(chunk_dic, indent=4)
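For comparison, a similar structure can be expressed with the standard library's `dataclasses` module, which generates the constructor and gives each instance its own optional fields. This is a sketch of an alternative, not a replacement for the class above; the name `ChunkRecord` is hypothetical and chosen to avoid clashing with `ChunkInstance`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ChunkRecord:
    # Hypothetical name; the fields mirror the ones used by ChunkInstance.
    chunk_id: str
    document: str
    prior_chunk_id: Optional[str] = None
    prior_document: Optional[str] = None
    post_chunk_id: Optional[str] = None
    post_document: Optional[str] = None

    def __str__(self) -> str:
        # asdict() keeps the field order, so the JSON follows the declaration above.
        return json.dumps(asdict(self), indent=4)


record = ChunkRecord(chunk_id="id-1", document="first chunk")
record.post_chunk_id = "id-2"
record.post_document = "second chunk"
print(record)
```

The first chunk of a document has no prior neighbor, so its `prior_chunk_id` and `prior_document` are serialized as `null`.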

Here `query_text` is encoded and a query is run against the vector database. The query results are the chunks whose embeddings are closest to the encoding of `query_text`. Then, for each chunk, its prior and post neighbors are fetched to add more context.

In [5]:
def query_vector_db(*, vector_db, query_text: str, max_number_of_results: int = 2) -> list[ChunkInstance]:
    query_results_list = []

    query_results = vector_db.query(
                        query_texts=[query_text],
                        n_results=max_number_of_results,
                    )

    chunk_index = 0

    for chunk_id in query_results["ids"][0]:
        chunk_instance = ChunkInstance(
            chunk_id=chunk_id,
            document=query_results["documents"][0][chunk_index]
            )
        
        chunk_index += 1

        prior_chunk_id = f"id-{int(chunk_id[3:]) - 1}"
        post_chunk_id = f"id-{int(chunk_id[3:]) + 1}"

        adjacent_chunks = vector_db.get(ids=[prior_chunk_id, post_chunk_id])

        # get() may return fewer entries than requested (e.g., the first chunk has no prior neighbor),
        # so each returned id is matched explicitly.
        adjacent_pairs = zip(adjacent_chunks["ids"], adjacent_chunks["documents"])
        for adjacent_chunk_id, adjacent_document in adjacent_pairs:
            if adjacent_chunk_id == prior_chunk_id:
                chunk_instance.prior_chunk_id = prior_chunk_id
                chunk_instance.prior_document = adjacent_document
            elif adjacent_chunk_id == post_chunk_id:
                chunk_instance.post_chunk_id = post_chunk_id
                chunk_instance.post_document = adjacent_document
            
        query_results_list.append(chunk_instance)

    return query_results_list
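To make "closest embeddings" concrete, the sketch below ranks a few toy two-dimensional vectors against a query vector using cosine similarity. It is only an illustration: the vectors and the helper names (`cosine_similarity`, `top_matches`) are made up for this example, and the distance metric that a vector database collection actually uses may differ from cosine similarity.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction to compare.
    return dot / (norm_a * norm_b)


def top_matches(query: list[float], embeddings: dict[str, list[float]], n_results: int = 2) -> list[str]:
    # Sort chunk ids from most to least similar to the query vector.
    ranked = sorted(
        embeddings,
        key=lambda chunk_id: cosine_similarity(query, embeddings[chunk_id]),
        reverse=True,
    )
    return ranked[:n_results]


# Toy embeddings; real ones have hundreds or thousands of dimensions.
toy_embeddings = {
    "id-1": [1.0, 0.0],
    "id-2": [0.8, 0.6],
    "id-3": [0.0, 1.0],
}
print(top_matches([0.9, 0.1], toy_embeddings))
```

The query vector points mostly along the first axis, so "id-1" ranks first, followed by "id-2".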

When asking a question to an LLM, we can experiment with the prompt format and the temperature.

> Note: Make sure that you have access to the OpenAI `gpt-3.5-turbo` model.

In [6]:
def ask_llm(*, client, prompt: str, temperature: float = 0) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": prompt
            }
        ],
        temperature=temperature,
        max_tokens=150,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        )
    
    return response.choices[0].message.content

<p style="text-align: center;">* * * * *</p>

### Generating the embeddings

Here we are using OpenAI's embedding function to generate the embeddings for the chunks of text we extracted from the document, then we save the embeddings and the text chunks to the vector database.

> Note: Make sure that you have set the `OPENAI_API_KEY` environment variable with your OpenAI API Key or project API Key. Also verify that you have access to the `text-embedding-3-small` model.

In [7]:
chroma_client = chromadb.Client()
openai_client = OpenAI()

embedding_function = OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
    )

docs_collection = chroma_client.get_or_create_collection(
    name="indexed_documents",
    embedding_function=embedding_function
    )

chunked_text = chunk_text(
    file_name="Fostering_Trust_and_Quantifying_Value_of_AI_and_ML.txt"
    )

update_vector_db(
    vector_db=docs_collection,
    text_chunks=chunked_text
    )

### Querying the vector database and selecting the RAG context

Here the vector database is queried to find the chunks whose embeddings are closest (most relevant) to a question posed by a user. Then more context is added by fetching the chunks that are immediately before and after the ones returned by the query.

`rag_context` will be added to the prompt used to ask questions to the LLM. 

In [8]:
user_question = "How do trustors build trust with trustees?"

chunk_instances = query_vector_db(
    vector_db=docs_collection,
    query_text=user_question
    )

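rag_context = ""
for chunk_instance in chunk_instances:
    # Skip neighbors that do not exist (the first chunk has no prior, the last has no post).
    rag_context += "\n".join(
        document
        for document in [
            chunk_instance.prior_document,
            chunk_instance.document,
            chunk_instance.post_document
        ]
        if document is not None
    )

    print(chunk_instance)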

{
    "chunk_id": "id-7",
    "document": "trustworthy. More specifically, the trustor\u2019s act would be to\n invest in building a product and offer it to customers with the\n \n promise that it will generate value to them; more value than\n what is paid in return for the service. The trustor decides how\n much to invest, and the trustee decides whether to reciprocate\n and give continuity to the business relationship.\n Note that the trustee does not have to be held to similar\n standards for trustworthiness as the trustor. The objective is to\n make them [customers] trusting\u2014above a minimum threshold\n",
    "prior_chunk_id": "id-6",
    "prior_document": "computing the trustworthiness of AI and ML systems. Here,\n trust is defined as the willingness to interact with an AI/ML\n system while being aware that a model inference is fallible.\n The framework, however, is not without its challenges.\n There are several other elements to be considered in an AI/MLpowered system in ord

<p style="text-align: center;">* * * * *</p>

### The Prompt

The prompt text needs to give the LLM effective instructions and make sure that the provided RAG context is taken into account.

In [9]:
prompt = f"""Answer the QUESTION based on the CONTEXT given.
If you do not know the answer and cannot find the answer in CONTEXT, say "I don't know."

QUESTION:
{user_question}

CONTEXT:
{rag_context}

ANSWER:
"""

print(prompt)

Answer the QUESTION based on the CONTEXT given.
If you do not know the answer and cannot find the answer in CONTEXT, say "I don't know."

QUESTION:
How do trustors build trust with trustees?

CONTEXT:
computing the trustworthiness of AI and ML systems. Here,
 trust is defined as the willingness to interact with an AI/ML
 system while being aware that a model inference is fallible.
 The framework, however, is not without its challenges.
 There are several other elements to be considered in an AI/MLpowered system in order for it to gain the trust of its users.
 Good inferences are one of them, but so is data privacy,
 mitigating bias, measuring qualitative aspects, tracking the
 trust level over time, model training automation, and so on.
 The paradigm explored in this paper assumes that trust is
 built by the trustor’s initial act, signaling that the actor is

trustworthy. More specifically, the trustor’s act would be to
 invest in building a product and offer it to customers with the
 

<p style="text-align: center;">* * * * *</p>

## Reference result

Before moving to the last step, we need a reference point: an inference result obtained without RAG and semantic chunking. This can be done by asking the `user_question` directly to the LLM, without a well-crafted prompt or RAG context.

In [10]:
answer = ask_llm(
    client=openai_client,
    prompt=user_question
    )

print(f"----------------\n\033[1;34;48manswer:\033[00m {answer}\n")

----------------
answer: Trustors can build trust with trustees by:

1. Communicating openly and honestly: Trustors should communicate their expectations, concerns, and feedback openly and honestly with trustees. This helps to establish transparency and build a foundation of trust.

2. Demonstrating reliability and consistency: Trustors can build trust by consistently following through on their commitments and demonstrating reliability in their actions. This helps to establish a sense of dependability and trustworthiness.

3. Showing respect and empathy: Trustors should show respect and empathy towards trustees, acknowledging their perspectives and feelings. This helps to build a sense of mutual understanding and trust.

4. Being transparent and accountable: Trustors should be transparent about their intentions, decisions, and actions, and hold themselves accountable for their behavior



## Bringing everything together

Here we ask the same question multiple times, varying only the temperature. You will be able to observe how the answers get progressively more creative. Depending on your use case, this may be a welcome variation or a disastrous outcome.

For example, if you're experimenting with text or A/B testing, results with a higher temperature may be quite handy. On the other hand, if you are preparing a financial report, not so much.

There are no rules of thumb or guiding principles that are universally good or bad. It will depend on your use case and what you're trying to achieve with the use of an LLM.
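As background on why a higher temperature yields more varied answers, sampling temperature is commonly described as a divisor applied to the model's token scores before they are turned into probabilities. The sketch below illustrates that idea with made-up scores; `softmax_with_temperature` is a hypothetical helper, and treating temperature 0 as "always pick the top token" is an assumption for this example, not a statement about how any particular model is implemented.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    if temperature <= 0:
        # Assumption: temperature 0 behaves like greedy decoding,
        # putting all probability on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability.
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate tokens.
toy_logits = [2.0, 1.0, 0.5]
for temperature in [0, 0.5, 1, 1.9]:
    probabilities = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

As the temperature rises, the probabilities move closer together, so lower-ranked tokens are sampled more often.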

In [11]:
for temperature in [0, 0.5, 1, 1.9]:
    answer = ask_llm(
        client=openai_client,
        prompt=prompt,
        temperature=temperature
    )

    print(f"----------------\n\033[1;31;48mtemperature:\033[00m {temperature}\n\033[1;34;48manswer:\033[00m {answer}\n")

----------------
temperature: 0
answer: Trustors build trust with trustees by investing in building a product and offering it to customers with the promise that it will generate more value than what is paid in return for the service. The trustor decides how much to invest, and the trustee decides whether to reciprocate and give continuity to the business relationship.

----------------
temperature: 0.5
answer: Trustors build trust with trustees by investing in building a product and offering it to customers with the promise that it will generate more value than what is paid in return for the service. The trustor decides how much to invest, and the trustee decides whether to reciprocate and give continuity to the business relationship.

----------------
temperature: 1
answer: Trustors build trust with trustees by investing in building a product and offering it to customers with the promise that it 