# 3.1. Chunking

The search solution comprises both **ingestion** and **retrieval**. One does not exist without the other.

While the other experiments focus on data retrieval, ingestion is equally important to the effectiveness of the search solution.

Several aspects of data ingestion need to be tested during the experimentation phase:

```{note}
Other pre- and post-processing techniques include Optical Character Recognition (OCR), data conversion, use of Azure Form Recognizer to extract information from documents, chunking, summarization, post-processing to make data more "human-like", video captioning, speech-to-text, and tagging. All of these methods need to be considered and tested as part of the ingestion pipeline experimentation.
```

https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md#learnings-from-engagements-1

When processing data, splitting the source documents into chunks requires care and expertise: the resulting chunks must be small enough to be effective during fact retrieval, but not so small that they lack the context needed during summarization.

```{note}
Our goal here is not to identify which chunking strategy is the “best” in general but rather to demonstrate how various choices of chunking may have a non-trivial impact on the ultimate outcome from the retrieval-augmented-generation solution.
```

<!-- https://vectara.com/blog/grounded-generation-done-right-chunking/#:~:text=In%20the%20context%20of%20Grounded%20Generation%2C%20chunking%20is,find%20natural%20segments%20like%20complete%20sentences%20or%20paragraphs. -->

## Why Chunking Size Matters

As noted in the [Azure AI Search documentation](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview), the models used to generate embedding vectors limit the length of the text fragments they accept as input. For example, the maximum input length for the Azure OpenAI embedding models is **8,191** tokens. Given that each token is around 4 characters of text for common OpenAI models, this limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the Large Language Models (LLMs) used for indexing and queries.
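As a rough sanity check on the limit described above, the following sketch estimates token counts from character counts using the 4-characters-per-token rule of thumb. The names `MAX_EMBEDDING_TOKENS`, `estimate_tokens`, and `fits_embedding_limit` are illustrative and not part of any SDK. The estimate is only an approximation; a real pipeline would count tokens with the model's own tokenizer.

```python
import math

MAX_EMBEDDING_TOKENS = 8191  # limit quoted in the text for Azure OpenAI embedding models
CHARS_PER_TOKEN = 4          # rule of thumb from the text, not an exact figure


def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_embedding_limit(text: str, limit: int = MAX_EMBEDDING_TOKENS) -> bool:
    """Return True if the estimated token count is within the model limit."""
    return estimate_tokens(text) <= limit


short_doc = "word " * 1_000   # 5,000 characters
long_doc = "word " * 10_000   # 50,000 characters

for name, doc in (("short_doc", short_doc), ("long_doc", long_doc)):
    print(name, estimate_tokens(doc), fits_embedding_limit(doc))
```

A document that fails this check has to be split before it is embedded, which is what the chunking step in this section does.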

**Relevance and Granularity**: A small chunk size, such as 128 tokens, yields more granular chunks. This granularity presents a risk, however: vital information might not be among the top retrieved chunks, especially if the similarity _top_k_ setting is as restrictive as 2. Conversely, a chunk size of 512 tokens is more likely to capture all necessary information within the top chunks, so that answers to queries are readily available. To assess this trade-off, we use the _Faithfulness_ and _Relevancy_ metrics, which measure the absence of 'hallucinations' and the relevance of responses to the query and the retrieved contexts, respectively.

**Response Generation Time**: As the `chunk_size` increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system. Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.
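Both effects follow from simple arithmetic: the retriever passes roughly `chunk_size * top_k` tokens to the LLM. The sketch below makes that explicit for the two chunk sizes discussed above. The function name `context_budget` and the 2,000-token document length are assumptions chosen for illustration only.

```python
def context_budget(chunk_size: int, top_k: int, doc_tokens: int) -> dict:
    """Tokens passed to the LLM and the share of one document they can cover."""
    context_tokens = chunk_size * top_k
    coverage = min(1.0, context_tokens / doc_tokens)
    return {
        "chunk_size": chunk_size,
        "context_tokens": context_tokens,
        "coverage": coverage,
    }


DOC_TOKENS = 2_000  # hypothetical document length, for illustration only

for size in (128, 512):
    print(context_budget(size, top_k=2, doc_tokens=DOC_TOKENS))
```

With `top_k=2`, the smaller chunk size can surface only a small slice of the document, while the larger one sends four times as many tokens to the LLM for every query.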

In essence, determining the optimal `chunk_size` is about striking a balance: capturing all essential information without sacrificing speed. It's vital to test thoroughly with various sizes to find a configuration that suits the specific use case and dataset.
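One way to organize such testing is a small sweep over candidate sizes. The sketch below is a minimal harness under stated assumptions: `split_fixed`, `sweep_chunk_sizes`, and `placeholder_score` are hypothetical names, the splitter is a naive fixed-size one, and the score is a stand-in that would be replaced by a real evaluation such as the Faithfulness and Relevancy metrics mentioned earlier.

```python
from typing import Callable


def split_fixed(tokens: list[str], chunk_size: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `chunk_size` tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def sweep_chunk_sizes(
    tokens: list[str],
    sizes: list[int],
    evaluate: Callable[[list[list[str]]], float],
) -> list[dict]:
    """Chunk the same input at each size and record the evaluation score."""
    results = []
    for size in sizes:
        chunks = split_fixed(tokens, size)
        results.append(
            {"chunk_size": size, "n_chunks": len(chunks), "score": evaluate(chunks)}
        )
    return results


def placeholder_score(chunks: list[list[str]]) -> float:
    """Stand-in metric (mean chunk length); swap in a real retrieval evaluation."""
    if not chunks:
        return 0.0
    return sum(len(chunk) for chunk in chunks) / len(chunks)


tokens = [f"tok{i}" for i in range(1_000)]  # synthetic input, not real data
results = sweep_chunk_sizes(tokens, sizes=[128, 256, 512], evaluate=placeholder_score)
for row in results:
    print(row)
```

Keeping the evaluation function as a parameter lets the same loop compare sizes on quality and, if timing is added around the call, on latency as well.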

https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5

Example code: https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/data-chunking/textsplit-data-chunking-example.ipynb

See also [Common Chunking Technique](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview), [Content overlap considerations](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#content-overlap-considerations), and [Simple example of how to create chunks with sentences](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#content-overlap-considerations).
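To make the idea of content overlap concrete, here is a minimal sliding-window sketch. It is not the implementation used by the linked samples: `chunk_with_overlap` is a hypothetical helper, and whitespace-separated words stand in for model tokens.

```python
def chunk_with_overlap(
    tokens: list[str], chunk_size: int, overlap: int
) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary keeps some of its context in both chunks.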

CODE: https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/samples/searchEvaluation/customskills/utils/chunker/text_chunker.py


In [1]:
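%pip install langchain-community==0.0.18
# %pip install langchain-core==0.1.20
# unstructured 0.12.x declares Requires-Python >=3.9.0,<3.12
%pip install unstructured==0.12.3
%pip install unstructured-client==0.17.0
%pip install langchain==0.1.5
# tqdm is imported by the cells that follow; tiktoken is needed by from_tiktoken_encoder
%pip install tqdm tiktoken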


ERROR: Ignored the following yanked versions: 0.8.3, 0.10.19.dev18
ERROR: Ignored the following versions that require a different python version: 0.12.0 Requires-Python >=3.9.0,<3.12; 0.12.2 Requires-Python >=3.9.0,<3.12; 0.12.3 Requires-Python >=3.9.0,<3.12; 0.12.4 Requires-Python >=3.9.0,<3.12
ERROR: Could not find a version that satisfies the requirement unstructured==0.12.3 (from versions: 0.0.1.dev0, 0.2.0, 0.2.1, ...)


In [2]:
import glob

import tqdm
from langchain_community.document_loaders import UnstructuredMarkdownLoader, UnstructuredFileLoader
from langchain.text_splitter import MarkdownTextSplitter, NLTKTextSplitter, RecursiveCharacterTextSplitter

# See also: https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/samples/searchEvaluation/customskills/utils/chunker/text_chunker.py


In [None]:
%pip install "unstructured[md]"

In [5]:
def load_documents_from_folder(path: str) -> list[list]:
    """Load every file matching the glob pattern `path`, one entry per file."""
    print("Loading documents...")
    markdown_documents = []
    for file in tqdm.tqdm(glob.glob(path, recursive=True)):
        loader = UnstructuredFileLoader(file)
        document = loader.load()  # list of Document objects for this file
        markdown_documents.append(document)
    return markdown_documents

In [6]:
markdown_documents = load_documents_from_folder("../data/docs/**/*.md")
# TODO: Move this to a Storage Account?

Loading documents...


100%|██████████| 777/777 [00:57<00:00, 13.57it/s]


In [2]:
def create_chunks(documents: list) -> dict:
    """Split each document into chunks, keyed as 'chunk{document index}_{chunk index}'."""
    print("Creating chunks...")
    # chunk_size and chunk_overlap are measured in tiktoken tokens, not characters.
    markdown_splitter = MarkdownTextSplitter.from_tiktoken_encoder(
        chunk_size=300, chunk_overlap=30
    )
    # lengths = {[number of chunks]: [number of documents with that number of chunks]}
    lengths = {}
    all_chunks = {}
    for doc_index, document in enumerate(tqdm.tqdm(documents)):
        # `document` is the list returned by loader.load(); only its first entry is used.
        current_chunks_text_list = markdown_splitter.split_text(
            document[0].page_content
        )  # output = ["content chunk1", "content chunk2", ...]

        for i, chunk in enumerate(current_chunks_text_list):
            all_chunks[f"chunk{doc_index}_{i}"] = {
                "chunk_id": i,
                "chunk_text": chunk,
                "source": document[0].metadata["source"],
            }

        n_chunks = len(current_chunks_text_list)
        lengths[n_chunks] = lengths.get(n_chunks, 0) + 1

    print(lengths)
    return all_chunks

In [7]:
chunks = create_chunks(markdown_documents)

Creating chunks...


100%|██████████| 777/777 [00:02<00:00, 277.03it/s]

{1: 145, 2: 137, 4: 69, 5: 74, 7: 43, 12: 8, 16: 6, 32: 1, 18: 7, 3: 98, 8: 30, 6: 58, 20: 2, 9: 20, 13: 13, 15: 10, 10: 11, 17: 2, 11: 19, 14: 12, 43: 1, 25: 1, 26: 2, 22: 3, 30: 1, 19: 2, 38: 1, 29: 1}
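The printed dictionary maps a chunk count to the number of documents that produced that many chunks. As a quick, hedged way to read it, the sketch below copies the output above into a variable (named `chunks_per_doc` here for illustration) and derives a few summary figures with plain Python.

```python
# Copied from the notebook output: {chunks in a document: number of such documents}
chunks_per_doc = {
    1: 145, 2: 137, 4: 69, 5: 74, 7: 43, 12: 8, 16: 6, 32: 1, 18: 7, 3: 98,
    8: 30, 6: 58, 20: 2, 9: 20, 13: 13, 15: 10, 10: 11, 17: 2, 11: 19, 14: 12,
    43: 1, 25: 1, 26: 2, 22: 3, 30: 1, 19: 2, 38: 1, 29: 1,
}

n_documents = sum(chunks_per_doc.values())
n_chunks = sum(count * docs for count, docs in chunks_per_doc.items())
mean_chunks = n_chunks / n_documents
largest = max(chunks_per_doc)
single_chunk_share = chunks_per_doc[1] / n_documents

print(f"documents: {n_documents}")
print(f"chunks: {n_chunks}")
print(f"mean chunks per document: {mean_chunks:.2f}")
print(f"largest document: {largest} chunks")
print(f"single-chunk documents: {single_chunk_share:.1%}")
```

The document total matches the 777 files reported by the progress bar, which is a useful check that no file was dropped during chunking.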





In this experiment, we use the OpenAI ada model to embed the chunks.


Upload the files to a storage account so that we can create an indexer:
https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/samples/searchEvaluation/upload_files.py


In [11]:
%pip install python-dotenv




