# Ungraded Lab: Chunking
---
Welcome to the Ungraded Lab on Chunking! As you saw in the lectures, chunking breaks large texts into smaller, manageable pieces, which is essential for efficiently working with vector databases and language models. 


# Table of Contents
- [ 1 - Introduction](#1)
  - [ 1.1 Importing necessary libraries](#1-1)
  - [ 1.2 Downloading the data](#1-2)
- [ 2 - Fixed-size chunking](#2)
  - [ 2.1 Example Chunking Code](#2-1)
  - [ 2.2 Chunking with overlap](#2-2)
- [ 3 - Variable-size chunking - Recursive Character Splitting](#3)
  - [ 3.1 Pseudo-code for variable-size chunking methods](#3-1)
  - [ 3.2 Mixing fixed and variable-sized chunking](#3-2)
- [ 4 - Chunking on real data](#4)
  - [ 4.1 Getting the data](#4-1)
  - [ 4.2 Chunking the chapters](#4-2)
  - [ 4.3 Loading Chunks into a Vector Database](#4-3)
- [ 5 - Searching ](#5)
- [ 6 - Incorporating in a RAG system](#6)


<a id='1'></a>
## 1 - Introduction

---

Chunking plays an important role in information retrieval. For example, when building a vector database from a collection of books, different chunk sizes can serve different purposes. Cataloging entire books as single vectors may help in identifying broad themes, but misses specific details. Chunking closer to the paragraph or sentence level enables the retrieval of specific information or concepts.

Language models typically have limitations on the amount of text they can process at once, known as the "context window." Chunking helps ensure that text inputs remain within these boundaries, allowing models to handle large documents, like novels, by splitting them into smaller sections.

In this ungraded lab you will explore ways of chunking and see how it can impact RAG systems!


<div align="center">
  <img src="images/chunking.png" alt="Overview" width="80%">
</div>

<a id='1-1'></a>
### 1.1 Importing necessary libraries


In [None]:
from typing import List
import requests
import re
import weaviate
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.util import generate_uuid5
import tqdm
from weaviate.classes.query import Filter

In [None]:
import flask_app
from utils import (
    generate_with_single_input, 
    suppress_subprocess_output
)

<a id='1-2'></a>
### 1.2 Downloading the data

Now you need some text long enough to justify chunking. Let's take a part of the [Pro Git book](https://git-scm.com/book/en/v2), specifically a section called "What is Git?".

In [None]:
url = "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc"
source_text = requests.get(url).text

In [None]:
print(source_text[:1000])

In [None]:
print(f"There are about {len(source_text.split())} words in this chapter. Depending on how your LLM tokenizes words, you'd expect roughly {round(len(source_text.split())*1.3)} tokens.")
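
The estimate above can be wrapped in a small helper to judge how many pieces a text needs for a given context window. This is a minimal sketch: the 1.3 tokens-per-word ratio is the same rough heuristic used in the cell above, and `estimate_tokens`, `min_chunks_needed`, and the `context_window` parameter are hypothetical names introduced here, not part of the lab's utilities.

```python
import math

# Rough heuristic only; the real ratio depends on the tokenizer and the text
TOKENS_PER_WORD = 1.3


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the whitespace-separated word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def min_chunks_needed(text: str, context_window: int) -> int:
    """Smallest number of equal pieces so that each piece fits the window (hypothetical helper)."""
    if context_window <= 0:
        raise ValueError("context_window must be a positive integer")
    return max(1, math.ceil(estimate_tokens(text) / context_window))
```

For real token counts you would use the tokenizer of the model you are working with; this helper only gives an order of magnitude.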

<a id='2'></a>
## 2 - Fixed-size chunking
---
Fixed-size chunking means breaking texts into pieces of the same size. For example, you might split an article into parts of 100 words each or sections of 200 characters each. This method is common because it is simple to implement and produces chunks of predictable size.

It works by dividing texts into pieces that have a set number of units. These units can be words, characters, or even tokens. The number of units in each piece is the same up to a maximum limit, and there can be an optional overlap between the pieces.
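
As an illustration of a different unit, here is a minimal sketch that counts characters instead of words. The function name `get_chunks_fixed_size_chars` is hypothetical and is not used elsewhere in this lab. Note that, unlike word-based chunking, it can cut a word in the middle.

```python
from typing import List


def get_chunks_fixed_size_chars(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Slice the string in steps of chunk_size; the last piece may be shorter
    return [text[i: i + chunk_size] for i in range(0, len(text), chunk_size)]


print(get_chunks_fixed_size_chars("Git stores snapshots, not differences.", 10))
```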


<div align="center">
  <img src="images/fixed_size.png" alt="Fixed Size Chunking" width="80%">
</div>

<a id='2-1'></a>
### 2.1 Example Chunking Code

Let's now look at an implementation of fixed-size chunking. There are many ways to implement it; the following is one possibility.

In [None]:
def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    """
    Splits a given text into chunks of a specified fixed size.

    Args:
        text (str): The input text to be split into chunks.
        chunk_size (int): The maximum number of words per chunk.

    Returns:
        List[str]: A list of text chunks, each containing up to 'chunk_size' words.
    """
    # Split the input text into individual words
    text_words = text.split()
    
    # Initialize a list to hold the chunks of words
    chunks = []
    
    # Iterate over the word indices in steps of 'chunk_size'
    for i in range(0, len(text_words), chunk_size):
        # Select a sublist of words from 'i' to 'i + chunk_size'
        chunk_words = text_words[i: i + chunk_size]
        
        # Join the selected words into a single string with spaces in between
        chunk = " ".join(chunk_words)
        
        # Add the chunk to the list of chunks
        chunks.append(chunk)
    
    # Return the list of word chunks
    return chunks

In [None]:
fixed_size_chunks = get_chunks_fixed_size(source_text, chunk_size = 100)

In [None]:
print(len(fixed_size_chunks))

In [None]:
fixed_size_chunks[0:3]

<a id='2-2'></a>
### 2.2 Chunking with overlap

Let's modify the code to allow overlapping, so that consecutive chunks share some words.


<div align="center">
  <img src="images/overlap.png" alt="Chunking with overlap" width="80%">
</div>

In [None]:
def get_chunks_fixed_size_with_overlap(text: str, chunk_size: int, overlap_fraction: float) -> List[str]:
    """
    Splits a given text into chunks of a fixed size with a specified overlap fraction between consecutive chunks.

    Parameters:
    - text (str): The input text to be split into chunks.
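    - chunk_size (int): The number of new words in each chunk. Every chunk after the first also starts with the overlapping words taken from the end of the previous chunk.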
    - overlap_fraction (float): The fraction of the chunk size that should overlap with the adjacent chunk.
      For example, an overlap_fraction of 0.2 means 20% of the chunk size will be used as overlap.

    Returns:
    - List[str]: A list of chunks (each a string) where each chunk might overlap with its adjacent chunk.
    """

    # Split the text into individual words
    text_words = text.split()
    
    # Calculate the number of words to overlap between consecutive chunks
    overlap_int = int(chunk_size * overlap_fraction)
    
    # Initialize a list to store the resulting chunks
    chunks = []
    
    # Iterate over text in steps of chunk_size to create chunks
    for i in range(0, len(text_words), chunk_size):
        # Determine the start and end indices for the current chunk,
        # taking into account the overlap with the previous chunk
        chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
        
        # Join the selected words to form a chunk string
        chunk = " ".join(chunk_words)
        
        # Append the chunk to the list of chunks
        chunks.append(chunk)
    
    # Return the list of chunks
    return chunks
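
In the function above, the overlap is prepended, so every chunk after the first can hold up to `chunk_size` plus the overlap words. An alternative formulation, sketched here as an assumption and not used in the rest of this lab, keeps every chunk at `chunk_size` words at most and advances the window by `chunk_size - overlap` words. The name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Word-based chunks of at most chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    stride = chunk_size - overlap  # How far the window moves on each step
    chunks = []

    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start: start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk made only of overlap words is produced
        if start + chunk_size >= len(words):
            break

    return chunks
```

With this variant the chunk length is bounded by `chunk_size`, which can be convenient when the limit comes from a model's context window.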

In [None]:
for chosen_size in [5, 25, 100]:
    chunks = get_chunks_fixed_size_with_overlap(source_text, chosen_size, overlap_fraction=0.2)
    # Print outputs to screen
    print(f"\nSize {chosen_size} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {chunks[i]}")

Note that the smaller chunks of text are very detailed, but they might **not have enough information to be useful for searching**. In contrast, **larger chunks start to contain more information, similar to a typical paragraph in length**. As these chunks become even longer, **their associated vector embeddings become more general**. Eventually, they reach a point where they are no longer effective for information searching.

<a id='3'></a>
## 3 - Variable-size chunking - Recursive Character Splitting

---
Now let's examine variable-size chunking. Unlike fixed-size chunking, the size of each chunk here is a result, not a starting point. In variable-size chunking, text is divided using a specific marker. This marker could be something like a sentence or paragraph break or even a structural element like a markdown header.

<div align="center">
  <img src="images/recursive.png" alt="Recursive Character Splitting" width="80%">
</div>

<a id='3-1'></a>
### 3.1 Pseudo-code for variable-size chunking methods

The simplest approach is to split the text into paragraphs (`\n\n`).

In [None]:
# Split the text into paragraphs
def get_chunks_by_paragraph(source_text: str) -> List[str]:
    return source_text.split("\n\n")

Another way, in this context, is to split into sections. As you can see inspecting the text, sections are divided with `\n==` markers.

In [None]:
# Split the text by Asciidoc section markers
def get_chunks_by_asciidoc_sections(source_text: str) -> List[str]:
    return source_text.split("\n==")

In [None]:
for marker in ["\n\n", "\n=="]:
    chunks = source_text.split(marker)
    # Print outputs to screen
    print(f"\nUsing the marker: {repr(marker)} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {repr(chunks[i])}")

One noticeable issue with simple marker-based chunking is that **headings often become separate chunks**, which might not be ideal. In practice, you might use a mixed strategy by attaching short chunks, like headings, to the following chunk. This way, the heading stays connected to its relevant section. Let's explore this approach further.

<a id='3-2'></a>
### 3.2 Mixing fixed and variable-sized chunking

You can combine fixed-size and variable-size chunking to take advantage of both methods. For instance, use a variable-size chunker to divide text at paragraph markers, and then apply a fixed-size filter. If a chunk is too small, you can merge it with the next one, and if a chunk is too large, you can split it in the middle or at another marker within the chunk.
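
The "too large" case can be handled recursively: split at the coarsest marker first, and split any piece that is still too long at the next finer marker. The following is a minimal sketch of that idea under stated assumptions: the separator list, the `max_words` limit, and the name `recursive_split` are illustrative choices, and unlike the splitters found in common libraries it does not merge small neighbouring pieces back together.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    max_words: int,
    separators: Sequence[str] = ("\n\n", "\n", ". "),
) -> List[str]:
    """Split text at progressively finer separators until every piece fits max_words."""
    # Base case: the piece already fits (empty pieces are dropped)
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []

    # No separators left: fall back to fixed-size word windows
    if not separators:
        words = text.split()
        return [" ".join(words[i: i + max_words]) for i in range(0, len(words), max_words)]

    # Split at the coarsest remaining separator, then recurse with the finer ones.
    # Note that str.split removes the separator itself from the pieces.
    chunks = []
    for piece in text.split(separators[0]):
        chunks.extend(recursive_split(piece, max_words, separators[1:]))
    return chunks
```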

In [None]:
def mixed_chunking(source_text):
    """
    Splits the given source_text into chunks using a mix of fixed-size and variable-size chunking.
    It first splits the text by Asciidoc markers and then processes the chunks to ensure they are 
    of appropriate size. Smaller chunks are merged with the next chunk, and larger chunks can be 
    further split at the middle or specific markers within the chunk.

    Args:
    - source_text (str): The text to be chunked.

    Returns:
    - list: A list of text chunks.
    """

    # Split the text by Asciidoc marker
    chunks = source_text.split("\n==")

    # Chunking logic
    new_chunks = []
    chunk_buffer = ""
    min_length = 25

    for chunk in chunks:
        new_buffer = chunk_buffer + chunk  # Create new buffer
        new_buffer_words = new_buffer.split(" ")  # Split into words
        if len(new_buffer_words) < min_length:  # Check whether buffer length is too small
            chunk_buffer = new_buffer  # Carry over to the next chunk
        else:
            new_chunks.append(new_buffer)  # Add to chunks
            chunk_buffer = ""

    if len(chunk_buffer) > 0:
        new_chunks.append(chunk_buffer)  # Add last chunk, if necessary

    return new_chunks

In [None]:
mixed_chunks = mixed_chunking(source_text)
for i in range(3):
    print(f"Chunk {i+1}: {repr(mixed_chunks[i])}")

This strategy helps ensure that chunks are not too small while still using syntactic markers, like headings, to define boundaries. After examining chunking strategies on one text, let's explore how they perform on a larger collection of texts.

<a id='4'></a>
## 4 - Chunking on real data

---
In this and the following section, there will be comprehensive examples of chunking in practice. You will process several sections of the [Pro Git book](https://git-scm.com/book/en/v2) using different chunking methods and then compare how well each method performs in search tasks.


<a id='4-1'></a>
### 4.1 Getting the data

Let's get the text of the first two chapters of the book, one file per section.

In [None]:
def get_book_text_objects():
    # Source location
    text_objs = list()
    api_base_url = 'https://api.github.com/repos/progit/progit2/contents/book'  # Book base URL
    chapter_urls = ['/01-introduction/sections', '/02-git-basics/sections']  # Paths to each chapter's sections folder

    # Loop through book chapters
    for chapter_url in chapter_urls:
        response = requests.get(api_base_url + chapter_url)  # Get the JSON data for the section files in the chapter

        # Loop through inner files (sections)
        for file_info in response.json():
            if file_info['type'] == 'file':  # Only process files (not directories)
                file_response = requests.get(file_info['download_url'])

                # Build objects including metadata
                chapter_title = file_info['download_url'].split('/')[-3]
                filename = file_info['download_url'].split('/')[-1]
                text_obj = {
                    "body": file_response.text,
                    "chapter_title": chapter_title,
                    "filename": filename
                }
                text_objs.append(text_obj)
    return text_objs

In [None]:
# This should generate a list with 14 elements, one for each section file in the two chapters
book_text_objs = get_book_text_objects()

In [None]:
print(book_text_objs[0].keys())

<a id='4-2'></a>
### 4.2 Chunking the chapters

The following chunking methods will be applied to each section:

- **Fixed-length chunks with 20% overlap:**
  - Chunks with 25 words each
  - Chunks with 100 words each

- **Variable-length chunks** using paragraph markers

- **Mixed-strategy chunks** using Asciidoc section markers (`\n==`), as implemented in `mixed_chunking`, with a minimum chunk length of 25 words (stored under the name `para_chunks_min_25`)

Additionally, metadata will be added to each chunk, including the filename, chapter name, and chunk number.

In [None]:
def build_chunk_objs(book_text_obj, chunks):
    """
    Constructs a list of chunk objects from a given book text object 
    and its associated chunks.

    Args:
        book_text_obj (dict): A dictionary containing metadata for the book text, 
                              including 'chapter_title' and 'filename'.
        chunks (list): A list of chunks that represent parts of the book text.

    Returns:
        list: A list of dictionaries, each representing a chunk object 
              with 'chapter_title', 'filename', 'chunk', and 'chunk_index'.
    """
    chunk_objs = list()  # Initialize an empty list to store chunk objects
    
    # Iterate over the chunks with an index
    for i, c in enumerate(chunks):
        # Create a dictionary for each chunk with its associated data
        chunk_obj = {
            "chapter_title": book_text_obj["chapter_title"],  # Chapter title from the book text object
            "filename": book_text_obj["filename"],            # Filename from the book text object
            "chunk": c,                                       # The actual chunk of text
            "chunk_index": i                                  # The index of the chunk in the list
        }
        # Append the constructed chunk object to the list
        chunk_objs.append(chunk_obj)

    # Return the list of chunk objects
    return chunk_objs

In [None]:
# Get multiple sets of chunks - according to chunking strategy
chunk_obj_sets = dict()
for book_text_obj in book_text_objs:
    text = book_text_obj["body"]  # Get the object's text body

    # Loop through chunking strategies:
    for strategy_name, chunks in [
        ["fixed_size_25", get_chunks_fixed_size_with_overlap(text, 25, 0.2)],
        ["fixed_size_100", get_chunks_fixed_size_with_overlap(text, 100, 0.2)],
        ["para_chunks", get_chunks_by_paragraph(text)],
        ["para_chunks_min_25", mixed_chunking(text)]
    ]:
        chunk_objs = build_chunk_objs(book_text_obj, chunks)

        if strategy_name not in chunk_obj_sets.keys():
            chunk_obj_sets[strategy_name] = list()

        chunk_obj_sets[strategy_name] += chunk_objs

In [None]:
print(chunk_obj_sets.keys())

In [None]:
chunk_type = 'fixed_size_25' # Change it to check the different chunks!
chunk_obj_sets[chunk_type][0:2]
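
Before loading the chunks anywhere, it can help to compare how long the chunks produced by each strategy actually are. The helper below is a small sketch for that purpose; `summarize_chunk_lengths` is a hypothetical name, and it expects a plain list of chunk strings (for example, the `"chunk"` values of one strategy).

```python
from statistics import mean
from typing import Dict, List


def summarize_chunk_lengths(chunks: List[str]) -> Dict[str, float]:
    """Return count, min, mean and max chunk length, measured in words."""
    lengths = [len(chunk.split()) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "mean": 0.0, "max": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }
```

In this notebook you could call it as `summarize_chunk_lengths([obj["chunk"] for obj in chunk_obj_sets["fixed_size_25"]])`.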

<a id='4-3'></a>
### 4.3 Loading Chunks into a Vector Database

In this section, you'll focus on loading chunks into a vector database. Below, you'll find an outline of how to create and load data into the vector database. However, in this lab, you will work with a pre-loaded collection to save time. If you haven't yet completed the ungraded lab on the Weaviate API, it's highly recommended you do so for a better understanding of the process!

In [None]:
# Loading the client
with suppress_subprocess_output():
    try:
        client = weaviate.connect_to_embedded(
            persistence_data_path="/home/jovyan/data/collections/m3/ungraded_lab_2",
            environment_variables={
                "ENABLE_API_BASED_MODULES": "true", # Enable API based modules 
                "ENABLE_MODULES": 'text2vec-transformers', # We will be using a transformer model 
                "TRANSFORMERS_INFERENCE_API":"http://127.0.0.1:5000/", # The endpoint the weaviate API will be using to vectorize
            }
        )
    except Exception:
        # Embedded startup failed (for example, an instance is already running), so connect to the local instance instead
        client = weaviate.connect_to_local(port=8079, grpc_port=50050)

In [None]:
# Creating the collection
if not client.collections.exists("chunking_example"):
    collection = client.collections.create(
            name='chunking_example',

            vectorizer_config=[Configure.NamedVectors.text2vec_transformers(
                    name="vector", # This is the name you will need to access the vectors of the objects in your collection
                    #source_properties=['chunk'], # which properties should be used to generate a vector, they will be appended to each other when vectorizing
                    vectorize_collection_name = False, # This tells the client to not vectorize the collection name. 
                                                       # If True, it will be appended at the beginning of the text to be vectorized
                    inference_url="http://127.0.0.1:5000", # Since we are using an API based vectorizer, you need to pass the URL used to make the calls 
                                                           # This was setup in our Flask application
                )],

            properties=[  # Define properties
            Property(name="chunk",data_type= DataType.TEXT),
            Property(name="chapter_title", data_type=DataType.TEXT),
            Property(name="filename",data_type=DataType.TEXT),
            Property(name="chunking_strategy",data_type=DataType.TEXT, tokenization = Tokenization.FIELD), # Tokenization.FIELD indexes the whole field value as a single token, so filters match the exact strategy name
            Property(name="chunk_index",data_type=DataType.INT),

        ]
        )
else:
    collection = client.collections.get("chunking_example")

In [None]:
# Adding elements in the collection - this insertion should NOT run as the collection is already vectorized for you. 
if len(collection) == 0:
    with collection.batch.fixed_size(batch_size=1, concurrent_requests=1) as batch:
        for chunking_strategy, chunk_objects in tqdm.tqdm(chunk_obj_sets.items()):
            for chunk_obj in chunk_objects:
                chunk_obj["chunking_strategy"] = chunking_strategy
                batch.add_object(
                    properties=chunk_obj,
                    uuid=generate_uuid5(chunk_obj)
                )
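
The insertion above derives each object's UUID from its content, so re-running the import produces the same IDs instead of new ones. The sketch below illustrates that idea with the standard library only. It is an assumption-laden stand-in, not Weaviate's implementation of `generate_uuid5`, and `deterministic_id` is a hypothetical name.

```python
import json
import uuid


def deterministic_id(obj: dict) -> str:
    """Derive a stable UUID from the content of a JSON-serializable dictionary."""
    # Sorting the keys makes the result independent of key order
    canonical = json.dumps(obj, sort_keys=True)
    # uuid5 hashes a namespace plus a name, so equal input gives an equal UUID
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


first = deterministic_id({"chunk": "Git is fast", "chunk_index": 0})
second = deterministic_id({"chunk_index": 0, "chunk": "Git is fast"})
print(first == second)
```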

In [None]:
print(f"Total count: {collection.aggregate.over_all().total_count}")
for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy) # Filter by chunking strategy
    count = collection.aggregate.over_all(filters = where_filter).total_count # Aggregate with filtering
    print(f"Object count for {chunking_strategy}: {count}")

<a id='5'></a>
## 5 - Searching 
---
In this section, you will run semantic searches against chunks of different sizes to see how chunk size affects information retrieval.
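
Conceptually, a semantic search embeds the query and ranks the stored chunks by how close their vectors are to the query vector. The toy sketch below shows that ranking step with cosine similarity in plain Python. The vectors are invented for illustration and `cosine_similarity` and `top_k` are hypothetical helpers; the actual search in this lab is performed by Weaviate, whose indexing and distance settings may differ.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunk_vecs: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k chunk names most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 2-dimensional vectors, purely illustrative
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_vectors, k=2))
```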

In [None]:
search_string = "history of git"  # Or "available git remote commands"

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy)
    response = collection.query.near_text(search_string, filters = where_filter, limit = 2)
    print(f"RETRIEVED OBJECTS FOR CHUNKING STRATEGY {chunking_strategy.upper()}:\n")
    for i, obj in enumerate(response.objects):
        print(f"===== Object {i} =====")
        print(f"{obj.properties['chunk']}")
        print()

In this example, the query is a broad one focused on the "history of git." For this query, the results show that longer chunks tend to perform better. While the 25-word chunks might closely match the query in terms of semantic similarity, they lack sufficient context to meaningfully improve the reader's understanding of the topic. Conversely, the paragraph chunks retrieved, particularly those with a minimum length of 25 words, provide enough information to explain the history of Git to the reader.

In [None]:
search_string = "how to add the url of a remote repository"  # Or "available git remote commands"

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy)
    response = collection.query.near_text(search_string, filters = where_filter, limit = 2)
    print(f"RETRIEVED OBJECTS FOR CHUNKING STRATEGY {chunking_strategy.upper()}:\n")
    for i, obj in enumerate(response.objects):
        print(f"===== Object {i} =====")
        print(f"{obj.properties['chunk']}")
        print()

In this example, the query was more specific, such as one made by a user looking to find out how to add the URL of a remote repository. Unlike the previous scenario, the 25-word chunks prove more useful here. Because the question was very specific, Weaviate could pinpoint the chunk with the most relevant passage: how to add a remote repository (`git remote add <shortname> <url>`).

Although other result sets contain some of this information, it's important to consider how the result will be used and displayed. Longer results might require more cognitive effort from the user to extract the relevant information.

<a id='6'></a>
## 6 - Incorporating in a RAG system
---
Now that you are familiar with chunking and have a fully working collection, let's see how different chunk sizes impact text generation. Let's use a simple prompt.

In [None]:
PROMPT = "Using this information and only this information, please explain {search_string} in a few short points.\nContext: {context}"

In [None]:
# Set number of chunks to retrieve to compensate for different chunk sizes

n_chunks_by_strat = dict()

# Grab more of shorter chunks
n_chunks_by_strat['fixed_size_25'] = 8
n_chunks_by_strat['para_chunks'] = 8

# Grab fewer of longer chunks
n_chunks_by_strat['fixed_size_100'] = 2
n_chunks_by_strat['para_chunks_min_25'] = 2

# Perform retrieval-augmented generation
search_string = "history of git"  # Or "available git remote commands"

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy)
    response = collection.query.near_text(search_string, filters = where_filter, limit = n_chunks_by_strat[chunking_strategy])
    context_string = ""
    for obj in response.objects:
        context_string += obj.properties['chunk'] + '\n'
    prompt = PROMPT.format(search_string = search_string, context = context_string)
    response = generate_with_single_input(prompt, role = 'assistant')
    print(f"Search string: {search_string}")
    print(f"Chunking Strategy: {chunking_strategy}:")
    print(f"Response:\n\t{response['content']}")
    print()
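
The cell above compensates for chunk size by hard-coding how many chunks to retrieve per strategy. Another option, sketched here as an assumption rather than as part of the lab, is to retrieve a generous number of chunks and keep adding them to the context until a word budget is reached. The function name `build_context` and the `max_words` parameter are hypothetical.

```python
from typing import List


def build_context(chunks: List[str], max_words: int) -> str:
    """Concatenate chunks in ranked order until the word budget would be exceeded."""
    selected = []
    used_words = 0
    for chunk in chunks:
        n_words = len(chunk.split())
        # Stop at the first chunk that does not fit, preserving ranking order
        if used_words + n_words > max_words:
            break
        selected.append(chunk)
        used_words += n_words
    return "\n".join(selected)
```

This keeps the amount of context roughly comparable across strategies, which makes the generated answers easier to compare.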

In [None]:
# Don't forget to close the client!
client.close()

Congratulations! You've finished the ungraded lab on Chunking! Keep it up!