[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb)

In [None]:
!pip install -qU \
    semantic-chunkers \
    datasets==2.19.1

# An Async version of Semantic Chunkers

Semantic chunkers allow us to build more context-aware chunks of information. We can use them for RAG, splitting video and audio, and much more.

In this example, we will stick with a simple RAG-focused example. We will learn about four different types of chunkers available to us: `StatisticalChunker`, `ConsecutiveChunker`, `CumulativeChunker`, and `RegexChunker`. To begin, we need some data.

In [None]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv2", split="train")
data

In [None]:
content = data[3]["content"]
print(content[:1000])

We will keep only a smaller section of the content to speed up the examples and limit their cost.

In [None]:
content = content[:20_000]

We will experiment with different semantic chunking methods on the above text. The semantic chunkers require an _encoder_. We can choose from open source encoders via `HuggingfaceEncoder` or `FastembedEncoder`, or proprietary API encoders like `OpenAIEncoder` or `CohereEncoder`.

We will use the `OpenAIEncoder` with `text-embedding-3-small`:

In [None]:
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "OpenAI API key: "
)

encoder = OpenAIEncoder(name="text-embedding-3-small")

## Statistical Chunking

The statistical chunking method is our most robust chunking method. It uses a varying similarity threshold to identify more dynamic, local similarity splits. It offers a good balance between accuracy and efficiency _but_ can only be used for text documents (unlike the multi-modal `ConsecutiveChunker`).

The `StatisticalChunker` can automatically identify a good threshold value to use while chunking our text, so it tends to require less customization than our other chunkers.
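
To make the idea of a data-driven threshold concrete, here is a minimal sketch. It is not the `StatisticalChunker` implementation: the function name `statistical_split_points`, the parameter `k`, and the mean-minus-standard-deviation rule are all assumptions chosen for illustration. It takes similarity scores between neighbouring sentences and splits wherever a score falls well below what is typical for that document.

```python
import statistics
from typing import List


def statistical_split_points(scores: List[float], k: float = 1.0) -> List[int]:
    """Return indices i where a split falls between sentence i and sentence i + 1.

    `scores[i]` is the similarity between sentence i and sentence i + 1.
    The threshold is derived from the scores themselves (an assumed rule:
    mean minus k population standard deviations), so it adapts per document.
    """
    if len(scores) < 2:
        return []
    threshold = statistics.mean(scores) - k * statistics.pstdev(scores)
    return [i for i, score in enumerate(scores) if score < threshold]


# Two documents with different overall similarity levels but the same shape
high_similarity_doc = [0.80, 0.82, 0.20, 0.79, 0.81]
low_similarity_doc = [0.50, 0.52, 0.10, 0.49, 0.51]

print(statistical_split_points(high_similarity_doc))
print(statistical_split_points(low_similarity_doc))
```

Both documents get a split at the same dip, whereas a single fixed threshold such as `0.6` would split the second document at every position.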

In [None]:
from semantic_chunkers import StatisticalChunker

chunker = StatisticalChunker(encoder=encoder)

In [None]:
chunks_async = await chunker.acall(docs=[content])

In [None]:
# Sync call
chunks = chunker(docs=[content])

Print and compare sync and async chunks.

In [None]:
import random
from html import escape

from IPython.display import display, HTML

# Predefined list of colors; each chunk is drawn in a randomly chosen one
colors = ["000000", "FF0000", "800080", "008000", "0000FF"]


def render_column(title, doc_chunks, style):
    """Build one HTML column: a heading plus one colored paragraph per chunk."""
    column = f'<div style="{style}"><h2>{title}</h2>'
    for chunk in doc_chunks:
        try:
            chunk_text = "".join(chunk.splits)
        except AttributeError:
            # Report anything that is not a Chunk instead of aborting the render
            print(f"Error Chunk: {chunk}")
            chunk_text = ""
        color = random.choice(colors)
        # Escape the text so characters like "<" in the paper do not break the HTML
        column += f'<p style="color: #{color};">{escape(chunk_text)}</p>'
    return column + "</div>"


html_str = '<div style="display: flex;">'
html_str += render_column(
    "Synchronous Chunks",
    chunks[0],
    "flex: 50%; padding: 10px; border-right: 2px solid black;",
)
html_str += render_column("Asynchronous Chunks", chunks_async[0], "flex: 50%; padding: 10px;")
html_str += "</div>"

# Display the HTML
display(HTML(html_str))

In [None]:
chunker.print(chunks_async[0])

## Consecutive Chunking

Consecutive chunking is the simplest version of semantic chunking.
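
As a rough mental model, assume the chunker compares each sentence's embedding with the one before it and starts a new chunk whenever the similarity drops below `score_threshold`. The sketch below illustrates that idea only. The helpers `embed`, `cosine`, and `consecutive_chunks` are hypothetical names, and the bag-of-words `embed` is a toy stand-in for a real encoder such as `OpenAIEncoder`.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consecutive_chunks(sentences: List[str], score_threshold: float = 0.3) -> List[List[str]]:
    """Group sentences, splitting where neighbouring sentences are dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < score_threshold:
            chunks.append([sentence])  # similarity dropped: start a new chunk
        else:
            chunks[-1].append(sentence)  # still on topic: extend the chunk
        previous = current
    return chunks


sentences = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell sharply",
    "stocks rose sharply",
]
for chunk in consecutive_chunks(sentences):
    print(chunk)
```

Because only neighbouring pairs are compared, each sentence is embedded once, which keeps this approach cheap.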

In [None]:
from semantic_chunkers import ConsecutiveChunker

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)

In [None]:
chunks = await chunker.acall(docs=[content])

In [None]:
chunker.print(chunks[0])

## Cumulative Chunking

Cumulative chunking is a more compute-intensive process, but it can often provide more stable results because it is more resistant to noise. However, it is _very expensive_ in both time and (if using APIs) money.
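
The cost is easier to see in a sketch. Assume the chunker compares the embedding of everything accumulated in the current chunk with the embedding of the next sentence, and splits when the similarity falls below `score_threshold`. The code below illustrates that assumption and is not the library's implementation. `embed`, `cosine`, and `cumulative_chunks` are hypothetical names, and the bag-of-words `embed` is a toy stand-in for a real encoder.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cumulative_chunks(sentences: List[str], score_threshold: float = 0.3) -> List[List[str]]:
    """Group sentences by comparing the accumulated chunk with the next sentence."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        if current:
            # The whole accumulated chunk is re-embedded at every step,
            # which is where the extra time and API cost comes from.
            accumulated = embed(" ".join(current))
            if cosine(accumulated, embed(sentence)) < score_threshold:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "cats chase mice",
    "cats eat mice",
    "cats sleep",
    "stocks fell sharply",
]
for chunk in cumulative_chunks(sentences):
    print(chunk)
```

In this sketch the text sent to the encoder grows with the chunk, so the total amount of text embedded grows much faster than the document length.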

In [None]:
from semantic_chunkers import CumulativeChunker

chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)

In [None]:
chunks = await chunker.acall(docs=[content])

In [None]:
chunker.print(chunks[0])

## Regex Chunking
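
The cell below builds a `RegexChunker` from `max_chunk_tokens` and, optionally, custom `delimiters`. As a rough illustration of how delimiter-based splitting with a token budget can work, here is a minimal sketch. It is not the `RegexChunker` implementation: `regex_chunks` is a hypothetical helper, and counting whitespace-separated words is an assumed stand-in for a real tokenizer.

```python
import re
from typing import List


def regex_chunks(text: str, delimiters: List[str], max_chunk_tokens: int = 300) -> List[str]:
    """Split text on regex delimiters, then pack the pieces into chunks.

    A chunk is closed as soon as adding the next piece would exceed
    `max_chunk_tokens`. Tokens are approximated by whitespace-separated words.
    """
    # Non-capturing groups keep re.split from returning the delimiters themselves
    pattern = "|".join(f"(?:{delimiter})" for delimiter in delimiters)
    pieces = [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for piece in pieces:
        piece_tokens = len(piece.split())
        if current and current_tokens + piece_tokens > max_chunk_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += piece_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "one two three\n\nfour five\n\nsix seven eight nine"
for chunk in regex_chunks(text, delimiters=[r"\n\n"], max_chunk_tokens=5):
    print(repr(chunk))
```

A single piece longer than the budget is kept whole in this sketch, so the limit is a target rather than a hard guarantee here.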

In [None]:
from typing import List
from semantic_chunkers import RegexChunker
from semantic_chunkers.schema import Chunk

chunker = RegexChunker(
    # delimiters=["--------------------"],  # optional custom delimiters; if set, the defaults are ignored
    max_chunk_tokens=300,
)
chunks: List[List[Chunk]] = await chunker.acall(docs=[content])
chunker.print(chunks[0])

---