# Introduction to Chunking
In this exercise we focus on chunking: the process of splitting documents into smaller pieces
that can then be retrieved during the RAG process.
This worksheet starts with simple chunking approaches and then shows more advanced techniques. However, it is worth noting
that more advanced does not always mean better: there are always tradeoffs to be made regarding performance, complexity and cost.

In [None]:
#initial setup
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.utils import OPENAI_API_KEY

In [None]:
"""
for this exercise we will use a text document - 
and apply different chunking techniques to it.
"""

# set up the OpenAI dependencies and a helper to load the example document
import openai
from typing import Tuple
from src.openai import get_response
openai.api_key = OPENAI_API_KEY


def load_example_text() -> str:
    with open("example_document.txt", "r") as file:
        txt = file.read()
    return txt

In [None]:
"""
fixed length chunking:
the most basic approach to chunking - take your document and split
it into equal sized pieces (either by character count, token count or word count)

the advantage of fixed length chunking is that it is simple to implement and run.
the disadvantage is that it does not do a good job of preserving context: there
is always a risk that an idea will be split between multiple chunks, which in the
worst case makes each chunk meaningless as a retrievable element.

TODO: run the cell and get a sense of the issues that may arise from fixed
length chunking. play around with the chunk size to see how it impacts the preservation/
destruction of context
"""

CHUNK_SIZE_CHARACTERS = 100

input_txt = load_example_text()
# remove paragraph and line breaks
input_txt = input_txt.replace("\n", " ").replace("\r", " ")
chunks = [input_txt[i:i + CHUNK_SIZE_CHARACTERS] for i in range(0, len(input_txt), CHUNK_SIZE_CHARACTERS)]


print(f"split example into {len(chunks)} chunks")
for index,chunk in enumerate(chunks):
    print(f"chunk {index}: {chunk}")
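
The cell above splits by character count. As a minimal sketch of the word count variant mentioned in the description, the hypothetical helper `chunk_by_words` below (not part of this worksheet's `src` package) groups whitespace-separated words instead of characters, so a chunk never ends in the middle of a word. The sample text is made up for illustration.

```python
def chunk_by_words(text: str, words_per_chunk: int) -> list[str]:
    """Split text into chunks of at most `words_per_chunk` whitespace-separated words."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# illustrative sample text, not the worksheet's example document
sample_text = "Chunking splits a document into smaller pieces for retrieval."
for index, chunk in enumerate(chunk_by_words(sample_text, words_per_chunk=4)):
    print(f"chunk {index}: {chunk}")
```

Word-based chunks still cut across sentences and ideas, so the context problem described above remains; only the mid-word cuts go away.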


In [None]:
"""
sliding window chunking:
given that fixed length chunking has a tendency to break up context at "unnatural" boundaries,
one solution is to use an overlap between chunks. the assumption is that even if a piece of context
is incomplete in one chunk it will be fully captured in another thanks to the overlap.
the cost of this approach is redundancy: the overlapping text is embedded more than once, which increases the amount of data you embed

TODO: run the cell and understand what the code is doing. tweak the chunk size and overlap size and observe
the impact on the preservation of context. think about the tradeoff between overlap size and context preservation
"""

CHUNK_SIZE_CHARACTERS = 100
OVERLAP_SIZE_CHARACTERS = 30

input_txt = load_example_text()
# remove paragraph and line breaks
input_txt = input_txt.replace("\n", " ").replace("\r", " ")

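# the window advances by (chunk size - overlap), so neighbouring chunks share OVERLAP_SIZE_CHARACTERS characters
step = CHUNK_SIZE_CHARACTERS - OVERLAP_SIZE_CHARACTERS
assert step > 0, "the overlap must be smaller than the chunk size"
chunks = []
for i in range(0, len(input_txt), step):
    chunks.append(input_txt[i:i + CHUNK_SIZE_CHARACTERS])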

print(f"split example into {len(chunks)} chunks")
for index,chunk in enumerate(chunks):
    print(f"chunk {index}: {chunk}")
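
To make the redundancy tradeoff concrete, here is a rough sketch that wraps the sliding window in a function and measures how much text ends up being embedded relative to the original. The names `sliding_window_chunks` and `redundancy_ratio` are hypothetical helpers introduced for this illustration. Unlike the cell above, the sketch stops as soon as a window reaches the end of the text, so it does not emit a trailing chunk that only repeats the overlap.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the text is fully covered, so stop here
    return chunks


def redundancy_ratio(text: str, chunks: list[str]) -> float:
    """Total characters across all chunks divided by the length of the original text."""
    if not text:
        return 0.0
    return sum(len(chunk) for chunk in chunks) / len(text)


# illustrative sample text, not the worksheet's example document
sample_text = "abcdefghij" * 10
for overlap in (0, 10, 20):
    window_chunks = sliding_window_chunks(sample_text, chunk_size=40, overlap=overlap)
    ratio = redundancy_ratio(sample_text, window_chunks)
    print(f"overlap {overlap}: {len(window_chunks)} chunks, redundancy ratio {ratio:.2f}")
```

A ratio of 1.0 means no text is duplicated; the further it rises above 1.0, the more characters are embedded more than once.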


In [None]:
"""
semantic chunking: 
the examples above took a programmatic approach to chunking without considering the specifics of the
information we want to chunk. given that preserving semantic information is crucial to the
performance of RAG, we should also consider approaches which aim more directly to preserve it

document structure guided chunking - an 'obvious' approach to consider is using the punctuation and line breaks given
in the document itself to form our chunk boundaries,
i.e. using sentence breaks & paragraph breaks to separate the chunks

TODO: what would be the pitfalls of this approach? 
"""

BREAK_CHARACTERS = [".", "!", "?", "\n", "\r"]
input_txt = load_example_text()

chunks = []
current_chunk = ""

for char in input_txt:
    current_chunk += char
    if char in BREAK_CHARACTERS:
        chunks.append(current_chunk.strip())
        current_chunk = ""

if current_chunk:
    chunks.append(current_chunk.strip())

# remove empty chunks
chunks = [chunk for chunk in chunks if chunk]

print(f"split example into {len(chunks)} chunks")
chunk_sizes = [len(chunk) for chunk in chunks]
print(f"average chunk size: {sum(chunk_sizes) / len(chunk_sizes)}")
print(f"min chunk size: {min(chunk_sizes)}")
print(f"max chunk size: {max(chunk_sizes)}")

for index,chunk in enumerate(chunks):
    print(f"chunk {index}: {chunk}")
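
The size statistics printed above usually show a wide spread between the smallest and largest chunk. One possible way to even this out, sketched here under assumptions, is to pack consecutive sentences into a chunk until a character budget is reached. The helper `pack_sentences` is hypothetical, and its sentence rule (split after `.`, `!` or `?` followed by whitespace) is deliberately naive: unlike the cell above it does not break on bare line breaks, and a single sentence longer than the budget is kept whole.

```python
import re


def pack_sentences(text: str, max_chars: int) -> list[str]:
    """Group consecutive sentences into chunks of at most `max_chars` characters where possible."""
    # naive sentence rule: split after ., ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # adding this sentence would exceed the budget: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# illustrative sample text, not the worksheet's example document
sample_text = "Chunking matters. It shapes retrieval! Does size matter? Often it does."
for index, chunk in enumerate(pack_sentences(sample_text, max_chars=40)):
    print(f"chunk {index} ({len(chunk)} chars): {chunk}")
```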

In [None]:
"""
agentic chunking: 
make use of an LLM to do the chunking for you. this approach can work very well for finding natural contextual
boundaries between chunks. the downside, however, is that it comes with a high cost and is not well suited to
large / constantly growing datasets

TODO: run the script to do chunking with the help of an LLM, play around with the prompt to improve its performance
"""

client = openai.OpenAI(
    api_key=openai.api_key
)


input_txt = load_example_text()
prompt = (
    "create smaller chunks of approximately 100 characters for the following text, "
    "but prioritize preserving the semantic context of the chunks. "
    f"separate the chunks using | characters. text: {input_txt}"
)
full_response, _ = get_response(client, prompt, model="gpt-4")
break_character = "|"
chunks = [chunk.strip() for chunk in full_response.split(break_character) if chunk.strip()]

print(f"split example into {len(chunks)} chunks")
chunk_sizes = [len(chunk) for chunk in chunks]
print(f"average chunk size: {sum(chunk_sizes) / len(chunk_sizes)}")
print(f"min chunk size: {min(chunk_sizes)}")
print(f"max chunk size: {max(chunk_sizes)}")

for index,chunk in enumerate(chunks):
    print(f"chunk {index}: {chunk}")
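
Because the chunks here come back from an LLM, it may be worth checking that the model only split the text and did not rewrite, drop or add anything. The sketch below is one possible sanity check, using a hypothetical helper `chunks_preserve_text` that compares the original and the rejoined chunks with all whitespace removed. It assumes the source text does not itself contain the `|` separator.

```python
def strip_whitespace(text: str) -> str:
    """Remove all whitespace so that differences in spacing and line breaks are ignored."""
    return "".join(text.split())


def chunks_preserve_text(original: str, chunks: list[str]) -> bool:
    """Return True if the chunks, joined in order, contain the same characters as the original."""
    rejoined = "".join(strip_whitespace(chunk) for chunk in chunks)
    return rejoined == strip_whitespace(original)


# illustrative sample data, standing in for input_txt and the LLM's chunks
sample_text = "Chunking splits documents.\nGood chunks keep ideas together."
faithful_chunks = ["Chunking splits documents.", "Good chunks keep ideas together."]
rewritten_chunks = ["Chunking splits docs.", "Good chunks keep ideas together."]

print(f"faithful chunks preserve text: {chunks_preserve_text(sample_text, faithful_chunks)}")
print(f"rewritten chunks preserve text: {chunks_preserve_text(sample_text, rewritten_chunks)}")
```

In this notebook the same check could be applied to `input_txt` and `chunks` from the cell above; a failed check suggests tightening the prompt.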

In [None]:
# TODO: read this article on different chunking strategies and their tradeoffs
# https://masteringllm.medium.com/11-chunking-strategies-for-rag-simplified-visualized-df0dbec8e373 

# TODO (Optional)
# implement a chunking approach we have not already covered here