# <a id='toc1_'></a>[Chunking](#toc0_)

In this notebook, we will explore the next part of the pipeline: chunking. Chunking is the process of splitting the parsed documents into fragments of text (chunks) to be embedded in the vector store. Optimising chunking means building chunks that each represent a single semantic unit, i.e. the chunks that will be easiest to retrieve later on in the pipeline.

Here, we will showcase a basic chunking strategy, and then study three other ways to perform better chunking.

**Table of contents**<a id='toc0_'></a>    
- [Chunking](#toc1_)    
- [Setup](#toc2_)    
- [Strategies](#toc3_)    
  - [Basic text splitter](#toc3_1_)    
  - [Hierarchical chunking](#toc3_2_)    
  - [Semantic chunking](#toc3_3_)    
- [To go the extra mile - Agentic chunking](#toc4_)    
  - [Multiple techniques](#toc4_1_)    


# <a id='toc2_'></a>[Setup](#toc0_)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
from pathlib import Path

from dotenv import load_dotenv

os.chdir(Path.cwd().joinpath(".."))
print(Path.cwd())
load_dotenv(override=True)

In [None]:
from bs4 import BeautifulSoup
from IPython.display import HTML, display
from langchain.schema import Document
from langchain_experimental.text_splitter import SemanticChunker
from langchain_text_splitters import (
    HTMLHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

from lib.models import embeddings, llm
from lib.utils import AgentChunker, load_documents

To illustrate our different chunking strategies, we will use two documents: a generated webpage that gives some information about the solar system (displayed below), and page 42 of the Data for Finance report.



In [None]:
DATA_PATH = "data/2_docs"
HTML_DOCUMENT_PATH = f"{DATA_PATH}/solar_system.html"
html_document_title = Path(HTML_DOCUMENT_PATH).name
PDF_DOCUMENT_PATH = f"{DATA_PATH}/Artefact-data-for-Finance-Report.pdf"
pdf_document_title = Path(PDF_DOCUMENT_PATH).name
EXAMPLE_PAGE_NUMBER = 42
BASE_CHUNK_SIZE = 512

In [None]:
with open(HTML_DOCUMENT_PATH, "r") as file:
    html_document = file.read()
display(HTML(html_document))

# <a id='toc3_'></a>[Strategies](#toc0_)

## <a id='toc3_1_'></a>[Basic text splitter](#toc0_)

A widely used and simple to implement chunking strategy is to use a recursive character splitter. 

This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs (then sentences, then words) together for as long as possible, as those generally appear to be the most strongly related pieces of text. We have made sure that these characters are indeed present in the example document.

For now, we will use a chunk overlap of 51 characters (about 1/10th of the chunk size). For the chunk size, we need to consider the embedding step that comes later. The embedding model we use is OpenAI's ada-002, which maps each chunk to a single 1536-dimensional vector, so we keep chunks short, at 512 characters (`BASE_CHUNK_SIZE`), so that each embedding represents a focused piece of text.
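To make the recursive logic concrete, here is a minimal standard-library sketch of the idea. It is a simplification, not LangChain's implementation: the function name `recursive_split` is hypothetical, chunk overlap is left out, and empty pieces between repeated separators are dropped.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying the coarsest separator first (hypothetical helper, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], tuple(separators[1:])
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


example_text = "Mercury is small.\n\nVenus is hot.\n\nEarth has life and water."
example_chunks = recursive_split(example_text, chunk_size=20)
for chunk in example_chunks:
    print(repr(chunk))
```

Paragraphs that fit are kept whole, and only the last paragraph, which exceeds 20 characters, is broken further on spaces.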

In [None]:
chunker = RecursiveCharacterTextSplitter(chunk_size=BASE_CHUNK_SIZE, chunk_overlap=round(BASE_CHUNK_SIZE / 10))

We will use the plain text of this webpage to perform our basic recursive split:

In [None]:
html_document_text = Document(
    # Use a real newline ("\n") as separator so the splitter can find line breaks
    page_content=BeautifulSoup(html_document, "html.parser").get_text(separator="\n"),
    metadata={"source": html_document_title},
)
basic_chunks = chunker.split_documents([html_document_text])

We take a closer look at the chunks from this example:

In [None]:
print(f"{len(basic_chunks)} chunks")

for i, chunk in enumerate(basic_chunks, start=1):
    print(f"\n###\nChunk number {i}:\n{chunk.page_content}")

It is clear that these chunks are not optimal. We will now explore more thorough chunking strategies.

## <a id='toc3_2_'></a>[Hierarchical chunking](#toc0_)

Hierarchical chunking splits a document along its structural elements (titles, sections, subsections) rather than at arbitrary character positions. For this to work, the structure needs to be encoded directly in the documents, which is not always the case for raw text inputs. It is therefore the parsing step's responsibility to ensure that this structure is preserved. Some formats, such as HTML or Markdown, include this structure natively, and are thus a good starting point, with dedicated chunkers available in LangChain.

HTML is particularly interesting because it makes it easy to chunk data scraped from web pages.

We specify the headers on which we want to split in `SPLITTING_HEADERS`, then use LangChain's built-in HTML splitter. Afterwards, we run the result through our basic recursive splitter to keep the desired chunk size.

In [None]:
SPLITTING_HEADERS = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

hierarchical_chunks = HTMLHeaderTextSplitter(headers_to_split_on=SPLITTING_HEADERS).split_text(html_document)

If we want chunks of bounded size (for embedding purposes), we can then split the larger chunks further using the usual recursive text splitter:

In [None]:
hierarchical_chunks = chunker.split_documents(hierarchical_chunks)

We take a closer look at the chunks from this example:

In [None]:
print(f"{len(hierarchical_chunks)} chunks")

for i, chunk in enumerate(hierarchical_chunks, start=1):
    print(f"\n###\nChunk number {i}:\n{chunk.page_content}")

We also see that the structure of the document is stored in the chunk metadata, which can be very useful when optimizing the retrieval strategy. Notably, the titles of all the headers in the hierarchy above the chunk are preserved, which can come in very handy in the retrieval phase to retrieve other relevant chunks.
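To illustrate how header titles can end up in chunk metadata, here is a minimal sketch built on Python's standard `html.parser`. It is not how `HTMLHeaderTextSplitter` is implemented: the class name `HeaderSplitter` and the metadata keys (plain tag names) are assumptions made for this example.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect text sections and remember the headers above each one
    (hypothetical helper for illustration)."""

    def __init__(self, header_tags=("h1", "h2", "h3", "h4")):
        super().__init__()
        self.header_tags = header_tags
        self.active = {}       # header tag -> title currently in scope
        self.in_header = None  # header tag being read, if any
        self.buffer = []       # text pieces of the section being built
        self.sections = []     # finished sections with their metadata

    def _flush(self):
        content = " ".join(self.buffer).strip()
        if content:
            self.sections.append({"content": content, "metadata": dict(self.active)})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()
            level = self.header_tags.index(tag)
            # A new header closes every header of the same or deeper level.
            for deeper in self.header_tags[level:]:
                self.active.pop(deeper, None)
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            previous = self.active.get(self.in_header, "")
            self.active[self.in_header] = (previous + " " + text).strip()
        else:
            self.buffer.append(text)

    def close(self):
        super().close()
        self._flush()


sample_html = (
    "<h1>Solar System</h1><p>Intro text.</p>"
    "<h2>Mercury</h2><p>Closest planet.</p>"
    "<h2>Venus</h2><p>Hottest planet.</p>"
)
splitter = HeaderSplitter()
splitter.feed(sample_html)
splitter.close()
for section in splitter.sections:
    print(section)
```

Each section carries the titles of all headers above it, so a chunk about Venus still knows it belongs to the "Solar System" page.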

In [None]:
print(f"We have {len(hierarchical_chunks)} chunks")
hierarchical_chunks[:4]

These chunks are already much better, and can be used for a basic RAG pipeline. However, we will look at even more advanced ways to do this chunking.

## <a id='toc3_3_'></a>[Semantic chunking](#toc0_)

The next approach is to use Semantic Chunking. This relatively new method uses embeddings to determine semantic breaking points within the text.

This method groups each sentence with its neighbours through a rolling window (by default one sentence on each side, i.e. up to three sentences), embeds each group, and then calculates the embedding distance between consecutive groups. For example, if we have six sentences:
* The groups of sentences to be embedded are: [1,2], [1,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6]
* The calculated distances are: d([1,2],[1,2,3]), d([1,2,3],[2,3,4]), d([2,3,4],[3,4,5]), d([3,4,5],[4,5,6]), d([4,5,6],[5,6])
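The windowing and distance steps can be sketched with the standard library only. This sketch assumes that each sentence is grouped with `buffer_size` neighbours on each side and that consecutive groups are compared. The bag-of-words `toy_embed` function is a stand-in for a real embedding model, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., ? and ! followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def build_windows(sentences, buffer_size=1):
    """Join each sentence with up to buffer_size neighbours on each side."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


def toy_embed(text):
    """Stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


sample_text = (
    "Mercury is the smallest planet. Mercury orbits closest to the Sun. "
    "Venus is the hottest planet. Venus has a thick atmosphere. "
    "Budgets are planned every quarter. Finance teams review budgets monthly."
)
sentences = split_sentences(sample_text)
windows = build_windows(sentences, buffer_size=1)
vectors = [toy_embed(window) for window in windows]
# One distance per boundary between two consecutive windows.
distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
for i, distance in enumerate(distances, start=1):
    print(f"boundary after sentence {i}: {distance:.3f}")
```

Six sentences give six windows and five distances, one per candidate split point.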

It then looks at all calculated embedding distances and splits the document at the sentence boundaries where the distance is above a certain threshold. The threshold can be defined in several ways:
* Percentile: distance greater than the Xth percentile of all distances
* Standard deviation: distance more than X standard deviations above the mean
* Interquartile: distance more than X times the interquartile range above the mean
* Gradient: for specific documents with high levels of similarity, looks for anomalies in the gradient of the distances instead of a fixed value
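As a rough illustration of how such thresholds select breakpoints, here is a standard-library sketch covering the first three options. The formulas are one plausible reading of the list above and may differ in detail from what `SemanticChunker` computes. The function names and the sample distances are made up for the example, and the gradient variant is left out.

```python
import statistics


def percentile(values, q):
    """q-th percentile (q in [0, 100]) with linear interpolation."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def breakpoint_threshold(distances, kind, amount):
    """Return the distance above which a boundary becomes a split point."""
    if kind == "percentile":
        return percentile(distances, amount)
    if kind == "standard_deviation":
        return statistics.mean(distances) + amount * statistics.pstdev(distances)
    if kind == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + amount * iqr
    raise ValueError(f"Unknown threshold type: {kind}")


def find_breakpoints(distances, kind, amount):
    """Indices of the boundaries whose distance exceeds the threshold."""
    threshold = breakpoint_threshold(distances, kind, amount)
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up distances between consecutive sentence groups.
sample_distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.70, 0.09, 0.13, 0.14, 0.10]

print(find_breakpoints(sample_distances, "percentile", 70))
print(find_breakpoints(sample_distances, "standard_deviation", 1))
print(find_breakpoints(sample_distances, "interquartile", 1.5))
```

With a 70th percentile threshold, the three largest of the ten distances become breakpoints, while the two stricter settings keep only the two clear outliers.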

For this example, we will use page 42 of the Data for Finance report, as well as the ada-002 embedding model deployed on Azure.

In [None]:
page42 = [
    doc
    for doc in load_documents(DATA_PATH)
    if (doc.metadata["source"] == PDF_DOCUMENT_PATH and doc.metadata["page"] == EXAMPLE_PAGE_NUMBER - 1)
]

print(page42[0].metadata)
print(page42[0].page_content)

In [None]:
semantic_chunks = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    buffer_size=1,
    breakpoint_threshold_amount=70,
).split_documents(page42)

print(f"{len(semantic_chunks)} chunks")

for i in range(len(semantic_chunks)):
    print(
        f"""Chunk {i + 1}:
          {semantic_chunks[i].page_content}
          """
    )

Here we have set a rather arbitrary 70th percentile threshold, i.e. we split at the boundaries where the embedding distance between the two adjacent sentence groups is in the top 30% of distances. This gives us 6 chunks with no fixed size, which seem to make some sense semantically.

# <a id='toc4_'></a>[To go the extra mile - Agentic chunking](#toc0_)

Finally, the last technique we will use is agentic chunking. Agentic chunking is more of a general concept than a defined technique, and consists in directly using an LLM to identify the semantic chunks within the text.

There are multiple ways of using an Agent to define chunks. Here we will explore two ways that mirror the previously explored methods:
* Recursive agent-based chunking: we use the agent to split a text into two semantically distinct parts, and do so recursively until the chunks are below the target size
* Iterative agent-based chunking: we go through all sentences in the document one by one. At each step, we ask the agent whether the sentence starts a new chunk or should be merged with the previous chunk.
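The control flow of both variants can be sketched as follows. This is not the `AgentChunker` from `lib.utils`: the function names are hypothetical, and the two `fake_*` functions are deterministic stand-ins for what would be LLM calls in practice.

```python
import re


def recursive_agent_chunking(sentences, pick_split, max_chars):
    """Ask the agent where to cut, then recurse on both halves
    until each chunk fits in max_chars (or is a single sentence)."""
    text = " ".join(sentences)
    if len(text) <= max_chars or len(sentences) < 2:
        return [text]
    cut = pick_split(sentences)
    cut = min(max(cut, 1), len(sentences) - 1)  # keep both halves non-empty
    return (
        recursive_agent_chunking(sentences[:cut], pick_split, max_chars)
        + recursive_agent_chunking(sentences[cut:], pick_split, max_chars)
    )


def iterative_agent_chunking(sentences, starts_new_chunk):
    """Walk through the sentences and ask the agent, for each one,
    whether it opens a new chunk or extends the previous one."""
    chunks = []
    for sentence in sentences:
        if chunks and not starts_new_chunk(chunks[-1], sentence):
            chunks[-1] = chunks[-1] + " " + sentence
        else:
            chunks.append(sentence)
    return chunks


def fake_pick_split(sentences):
    """Stand-in for an LLM call: always cut in the middle."""
    return len(sentences) // 2


def fake_starts_new_chunk(current_chunk, sentence):
    """Stand-in for an LLM call: new chunk when no word is shared."""
    def words(text):
        return set(re.findall(r"[a-z]+", text.lower()))
    return not (words(current_chunk) & words(sentence))


sample_sentences = [
    "Mercury is small.",
    "Mercury orbits fast.",
    "Budgets are planned quarterly.",
    "Budgets need review.",
]
recursive_chunks = recursive_agent_chunking(sample_sentences, fake_pick_split, max_chars=40)
iterative_chunks = iterative_agent_chunking(sample_sentences, fake_starts_new_chunk)
print(recursive_chunks)
print(iterative_chunks)
```

The recursive variant bounds chunk size but needs one agent call per split, while the iterative variant needs one call per sentence and gives no control over chunk length.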

We instantiate the agent we will use, in this case GPT-4o. As these techniques are still experimental, there is no integrated LangChain method for them, but we have developed simple functions for both techniques.

In [None]:
llm_chunker_recursive = AgentChunker(agent=llm, chunk_size=BASE_CHUNK_SIZE)

agent_chunks_recursive = llm_chunker_recursive.split_documents(page42)

In [None]:
for i in range(len(agent_chunks_recursive)):
    print(
        f"""Chunk {i + 1}:
          {agent_chunks_recursive[i].page_content}
          """
    )

We can see that these chunks are well separated semantically, so this method performs well, even though it is also very costly. We also try the other method, iterative chunking.

In [None]:
llm_chunker_iterative = AgentChunker(
    agent=llm,
    chunk_size=BASE_CHUNK_SIZE,
    recursive=False,
)

agent_chunks_iterative = llm_chunker_iterative.split_documents(page42)

In [None]:
for i in range(len(agent_chunks_iterative)):
    print(
        f"""Chunk {i + 1}:
          {agent_chunks_iterative[i].page_content}
          """
    )

This method also gives good chunks, albeit with very uneven chunk lengths.

## <a id='toc4_1_'></a>[Multiple techniques](#toc0_)

Finally, we will combine the last three techniques in order to showcase a thorough chunking process. We will use them in the following order:
* We will first perform hierarchical chunking to split along document structure elements and keep said structure in the metadata
* Within these bricks, we will perform semantic chunking based on the 70th percentile to refine our chunks using our embeddings
* Finally, if some chunks are still too large, we will use recursive agent-based chunking to split them until we reach the desired character length
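The ordering of these three stages can be sketched independently of any library. In the sketch below, every function is a hypothetical stand-in: blank-line splitting plays the role of the structural splitter, sentence splitting plays the role of the semantic chunker, and `textwrap.wrap` plays the role of the agent-based fallback.

```python
import re
import textwrap


def run_pipeline(text, structural_split, semantic_split, fallback_split, max_chars):
    """Apply the stages in order; the fallback only runs on oversized chunks."""
    chunks = []
    for section in structural_split(text):
        for piece in semantic_split(section):
            if len(piece) > max_chars:
                chunks.extend(fallback_split(piece, max_chars))
            else:
                chunks.append(piece)
    return chunks


def toy_structural_split(text):
    """Stand-in for hierarchical chunking: split on blank lines."""
    return [part for part in text.split("\n\n") if part.strip()]


def toy_semantic_split(section):
    """Stand-in for semantic chunking: one chunk per sentence."""
    return [s for s in re.split(r"(?<=[.?!])\s+", section.strip()) if s]


def toy_fallback_split(piece, max_chars):
    """Stand-in for agent-based chunking: wrap on word boundaries."""
    return textwrap.wrap(piece, width=max_chars)


long_sentence = "Jupiter is by far the largest planet in the whole solar system."
sample_document = "Mercury is small. Venus is hot.\n\n" + long_sentence
pipeline_chunks = run_pipeline(
    sample_document,
    toy_structural_split,
    toy_semantic_split,
    toy_fallback_split,
    max_chars=30,
)
for chunk in pipeline_chunks:
    print(repr(chunk))
```

Running the costly stage only on chunks that exceed the size limit is a design choice of this sketch; it keeps the number of expensive calls proportional to the number of oversized chunks rather than to the size of the document.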

In [None]:
chunker_1 = HTMLHeaderTextSplitter(headers_to_split_on=SPLITTING_HEADERS)

chunker_2 = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    buffer_size=1,
    breakpoint_threshold_amount=70,
)

chunker_3 = AgentChunker(llm, chunk_size=BASE_CHUNK_SIZE, recursive=True)

In [None]:
# We do not have a parsed document yet, so we use the generated solar system page

full_pipeline_chunks = chunker_3.split_documents(chunker_2.split_documents(chunker_1.split_text(html_document)))

In [None]:
for i in range(len(full_pipeline_chunks)):
    print(
        f"""Chunk {i + 1}:
          {full_pipeline_chunks[i].page_content}
          """
    )