# Exploring Document Splitters and Chunkers in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [None]:
!pip install langchain==0.2.0
!pip install langchain-openai==0.1.7
!pip install langchain-community==0.2.0

In [None]:
# takes 2 - 5 mins to install on Colab
!pip install "unstructured[all-docs]==0.14.0"

After installing `unstructured` above, remember to restart your session when the following popup appears. If it doesn't appear, go to `Runtime` and select `Restart Session`.

![](https://i.imgur.com/UOBaotk.png)

In [None]:
# install OCR dependencies for unstructured
!sudo apt-get install -y tesseract-ocr
!sudo apt-get install -y poppler-utils

In [None]:
!pip install langchain-text-splitters==0.2.0
!pip install tiktoken==0.7.0
!pip install spacy
!pip install sentence-transformers==2.7.0

## Document Splitting and Chunking

After loading documents into LangChain, you might need to transform them for optimal use in your application. One common transformation is splitting a long document into smaller segments to fit within your model's context window. LangChain provides several built-in document transformers to facilitate the splitting, combining, filtering, and manipulating of documents.

#### Process of Document Splitting:
1. **Splitting into Chunks:**
   - Break down the text into small, semantically meaningful units (typically sentences).
   
2. **Combining Chunks:**
   - Assemble these smaller units into larger chunks until they reach a predefined size. This size is determined by a specific measurement function.

3. **Creating Overlapping Chunks:**
   - Once the maximum size is reached, finalize the chunk as an independent text piece.
   - Begin a new chunk, incorporating some overlap with the previous chunk to maintain textual context.

This approach ensures that semantically related text pieces are kept together, which is crucial for maintaining the meaning and continuity of the document.
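The three steps above can be sketched in plain Python. This is a minimal illustration and not LangChain's implementation: the function name `split_merge_overlap`, the naive sentence rule, and measuring overlap in whole sentences are all assumptions made for this example. Note that in this sketch a chunk seeded with overlap can exceed `chunk_size` if the next sentence is long.

```python
import re

def split_merge_overlap(text, chunk_size, overlap_units=1, length_fn=len):
    """Greedy sketch: split into sentences, merge up to chunk_size, overlap by whole sentences."""
    # Step 1: split into small units (naive rule: break after ., ! or ?)
    units = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for unit in units:
        candidate = " ".join(current + [unit])
        # Step 2: keep merging until adding the next unit would exceed the limit
        if current and length_fn(candidate) > chunk_size:
            # Step 3: finalize the chunk, then seed the next one with the overlap
            chunks.append(" ".join(current))
            current = current[-overlap_units:] if overlap_units > 0 else []
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "One fish. Two fish. Red fish. Blue fish."
chunks = split_merge_overlap(sample, chunk_size=20, overlap_units=1)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Each chunk repeats the last sentence of the previous one, which is the overlap described in step 3. The `length_fn` argument plays the role of the measurement function: swapping `len` for a token counter would size chunks in tokens instead of characters.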


### RecursiveCharacterTextSplitter

The `RecursiveCharacterTextSplitter` is a versatile tool within LangChain for splitting text based on a list of characters. This splitter is designed to handle various requirements through adjustable parameters.

This is the recommended text splitter for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs (and then sentences, and then words) together as long as possible, since those are generally the most strongly related pieces of text.

#### Features and Parameters:

- **Character List:** Utilizes a specified list of characters to determine where splits should occur.
- **Chunk Size:** Allows you to set the size of each chunk, helping ensure that chunks are manageable and suit the context window of your model.
- **Overlap:** Configurable overlap between consecutive chunks to maintain context continuity across chunks.

This splitter is particularly useful for texts where precise control over the splitting criteria is needed, allowing for customized chunking strategies based on specific characters.
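To make the "try each separator in order" idea concrete, here is a simplified sketch in plain Python. It is an illustration under stated assumptions, not the library's code: `recursive_split` is a hypothetical helper, it ignores overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the remaining ones for oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep = separators[0] if separators else ""
    if sep == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        if len(part) > chunk_size:
            # This part is still too large: flush what we have, then go one level deeper
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, separators[1:], chunk_size))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # merge small parts back together while they fit
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks

text = "Para one is short.\n\nPara two is a bit longer than the limit allows."
pieces = recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=30)
for piece in pieces:
    print(len(piece), repr(piece))
```

The short paragraph survives intact, while the long one falls through `"\n"` (no match) to `" "` and is rebuilt word by word up to the limit.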


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

doc = """Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colors and joyous energy.
"""



In [None]:
print(doc)

Splitting with a smaller chunk size (measured in total characters) produces more chunks.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=0,
)

In [None]:
texts = text_splitter.split_text(doc)
print(len(texts)) # 5

In [None]:
for text in texts:
    print(text)
    print(len(text))
    print()

Splitting with a larger chunk size (measured in total characters) produces fewer chunks.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=0,
)

texts = text_splitter.split_text(doc)
print(len(texts)) # 3

In [None]:
for text in texts:
    print(text)
    print(len(text))
    print()

`chunk_overlap` helps mitigate the loss of information when context is divided between chunks, especially for very small chunks.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=100,
)

texts = text_splitter.split_text(doc)
print(len(texts)) # 5

In [None]:
for text in texts:
    print(text)
    print(len(text))
    print()

You can create LangChain `Document` chunks with the `create_documents` method.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=100,
)

In [None]:
docs = text_splitter.create_documents([doc])
docs

### CharacterTextSplitter

The `CharacterTextSplitter` is a straightforward tool in LangChain for dividing text based on a specified character. It's designed to be simple yet effective, providing essential controls for customizing how text is segmented.

#### Key Features and Parameters:
- **Separator:** By default, it splits on a blank line (`"\n\n"`), but this can be customized to any string you specify.
- **Chunk Size:** Allows you to define the length of each chunk in terms of the number of characters. This is useful for ensuring each piece of text is of a manageable size for processing.
- **Overlap:** You can set the amount of overlap between consecutive chunks. This helps maintain context and continuity when text is split into separate parts.

This method is the simplest among text splitting tools, focusing on character-based division and providing straightforward measures for chunk length and overlap.

To obtain the string content directly, use `.split_text`.

To create LangChain `Document` objects (e.g., for use in downstream tasks), use `.create_documents`.


In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=200,
    is_separator_regex=False,
)

docs = text_splitter.create_documents([doc])
docs

### Code Splitters

`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.
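As a rough illustration of why language-aware separators help, the sketch below splits Python source just before each top-level `class` or `def`. The separator list and the helper `split_top_level` are assumptions made for this example; they are not read from LangChain, whose own lists are longer.

```python
import re

# Illustrative separators for Python source; an assumption for this sketch,
# not the list shipped with the library.
PYTHON_SEPARATORS = ["\nclass ", "\ndef "]

def split_top_level(code, separators=PYTHON_SEPARATORS):
    """Split *before* each separator so the keyword stays with its block."""
    # Zero-width lookaheads match the position in front of a separator
    pattern = "|".join(f"(?={re.escape(sep)})" for sep in separators)
    return [part.strip("\n") for part in re.split(pattern, code) if part.strip()]

source = (
    "import os\n"
    "\ndef a():\n    return 1\n"
    "\ndef b():\n    return 2\n"
    "\nclass C:\n    pass\n"
)
blocks = split_top_level(source)
for block in blocks:
    print(repr(block))
```

Splitting on code structure first means a function body is only broken apart when the function alone is larger than the chunk size.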

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_code = """
def hello_world():
    print("Hello, World!")
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([python_code])
python_docs

### Markdown Splitters

We might want to chunk a document based on its structure. For example, a markdown file is organized by headers, so creating chunks within specific header groups is an intuitive idea. For this we can use `MarkdownHeaderTextSplitter`, which splits a markdown file by a specified set of headers.

For example, if we want to split this markdown:

```
markdown_document = """
# Team Introductions

## Management Team

Hi, this is Jim, the CEO.  
Hi, this is Joe, the CFO.

## Development Team

Hi, this is Molly, the Lead Developer.
"""
```

We can specify the headers to split on:

```
[("#", "Header 1"),
 ("##", "Header 2")]
```

And content is grouped or split by common headers:

```
Document(page_content='Hi, this is Jim, the CEO.\nHi, this is Joe, the CFO.',
metadata={'Header 1': 'Team Introductions', 'Header 2': 'Management Team'})

Document(page_content='Hi, this is Molly, the Lead Developer.',
metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team'})
```
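The grouping shown above can be approximated in a few lines of plain Python. This is only a sketch of the idea: `split_by_headers` is a hypothetical helper that returns plain dictionaries rather than `Document` objects, and it does not handle code fences or other markdown edge cases.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers that are active when they appear."""
    # Test longer markers first so "##" is not mistaken for "#"
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    sections, meta, body = [], {}, []

    def flush():
        # Emit the accumulated body lines with a copy of the current header metadata
        if body:
            sections.append({"page_content": "\n".join(body), "metadata": dict(meta)})
            body.clear()

    for line in markdown.strip().splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                # A new header invalidates headers at the same or a deeper level
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        meta.pop(other_name, None)
                meta[name] = stripped[len(marker):].strip()
                break
        else:
            if stripped:
                body.append(stripped)
    flush()
    return sections

sample_md = """
# Team Introductions

## Management Team

Hi, this is Jim, the CEO.
Hi, this is Joe, the CFO.

## Development Team

Hi, this is Molly, the Lead Developer.
"""
sections = split_by_headers(sample_md, [("#", "Header 1"), ("##", "Header 2")])
for section in sections:
    print(section)
```

The header text moves into the metadata, so each chunk carries its position in the document outline even after it is separated from its neighbors.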

In [None]:
markdown_document = """
# Team Introductions

## Management Team
Hi, this is Jim, the CEO.
Hi, this is Joe, the CFO.

## Development Team
Hi, this is Molly, the Lead Developer.
"""

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

By default, `MarkdownHeaderTextSplitter` strips the headers being split on from the output chunk's content. This can be disabled by setting `strip_headers=False`.

In [None]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

### Tokenizer-based Splitting

Language models have a token limit that you should not exceed, so it is a good idea to count tokens when you split text into chunks. There are many tokenizers; when counting tokens, use the same tokenizer as the language model. Let's look at how we can chunk documents using different tokenizers.



#### tiktoken splitters

[`tiktoken`](https://github.com/openai/tiktoken) is a fast BPE tokenizer created by OpenAI.

We can use tiktoken to estimate the number of tokens used. It will probably be more accurate for OpenAI models. Here `chunk_size` is measured in tokens rather than characters.

For OpenAI models, roughly 1 token ≈ 3/4 of a word.

Approx: 100 tokens ~= 75 words.
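The rule of thumb above translates into a quick back-of-the-envelope estimate. The helpers below are hypothetical and only apply the 3/4 ratio to whitespace-separated words; for exact counts, use the model's real tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text above; varies with language and content

def estimate_tokens(text):
    """Estimate tokens from the number of whitespace-separated words."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

def estimate_words(token_budget):
    """Estimate how many words fit into a given token budget."""
    return math.floor(token_budget * WORDS_PER_TOKEN)

sample = " ".join(["word"] * 75)  # a synthetic 75-word text
print("Estimated tokens:", estimate_tokens(sample))
print("Words in a 100-token budget:", estimate_words(100))
```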



We can load a [`TokenTextSplitter`](https://api.python.langchain.com/en/latest/base/langchain_text_splitters.base.TokenTextSplitter.html) splitter, which works with `tiktoken` directly and will ensure each split is smaller than chunk size in terms of the number of tokens.
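Conceptually, splitting by tokens is a sliding window over the token sequence: each window holds at most `chunk_size` tokens and the next window starts `chunk_size - chunk_overlap` tokens later. The sketch below shows that idea with whitespace-separated words standing in for real tokens; `sliding_token_windows` is a hypothetical helper, not the library function.

```python
def sliding_token_windows(tokens, chunk_size, chunk_overlap):
    """Return overlapping windows of at most chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the sequence
    return windows

# Stand-in tokens: a real splitter would use token ids from the model's tokenizer
tokens = "a b c d e f g h i j".split()
windows = sliding_token_windows(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

Because the window boundaries depend only on token positions, chunks produced this way can start or end mid-sentence.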

In [None]:
doc = """Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colors and joyous energy.
"""

In [None]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(model_name='gpt-3.5-turbo',
                                  chunk_size=30,
                                  chunk_overlap=10)

docs = text_splitter.create_documents([doc])

In [None]:
len(docs)

In [None]:
docs

In [None]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

As before, a larger chunk size (here measured in tokens) produces fewer chunks.

In [None]:
text_splitter = TokenTextSplitter(model_name='gpt-3.5-turbo',
                                  chunk_size=100,
                                  chunk_overlap=30)

docs = text_splitter.create_documents([doc])

In [None]:
len(docs)

In [None]:
docs

In [None]:
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

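To implement a hard constraint on the chunk size, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, where any split that is still larger than `chunk_size` is split again recursively. Because the splits follow the separator list, the resulting chunks tend to break at paragraph and line boundaries, which makes them more meaningful.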

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-3.5-turbo",
    chunk_size=100,
    chunk_overlap=30,
)

docs = text_splitter.create_documents([doc])

In [None]:
len(docs)

In [None]:
docs

In [None]:
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

#### spaCy

[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

LangChain implements splitters based on the [spaCy tokenizer](https://spacy.io/api/tokenizer). Note that `chunk_size` here is measured in characters, not tokens.

In [None]:
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=500,
                                  chunk_overlap=50)

docs = text_splitter.create_documents([doc])

In [None]:
len(docs)

In [None]:
docs

In [None]:
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Characters:', len(d.page_content),
        'Chunk:', d.page_content)

#### SentenceTransformers

The [`SentenceTransformersTokenTextSplitter`](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html) is a specialized text splitter for use with `sentence-transformers` models.

The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.

In [None]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(model_name="sentence-transformers/all-mpnet-base-v2",
                                                 tokens_per_chunk=100,
                                                 chunk_overlap=30)

In [None]:
docs = splitter.create_documents([doc])

In [None]:
len(docs)

In [None]:
docs

In [None]:
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Characters:', len(d.page_content),
        'Chunk:', d.page_content)

### Section-based Splitting in Unstructured.io

Chunking functions in `unstructured` use metadata and document elements detected with partition functions to split a document into smaller parts for use cases such as Retrieval Augmented Generation (RAG).

Because `unstructured` uses specific knowledge about each document format to partition the document into semantic units (document elements), it only needs to resort to text-splitting when a single element exceeds the desired maximum chunk size. Except in that case, all chunks contain one or more whole elements, preserving the coherence of semantic units established during partitioning.

- Chunking is performed on document elements. It is a separate step performed after partitioning, on the elements produced by partitioning. (Although it can be combined with partitioning in a single step.)

- Chunking combines consecutive elements to form chunks as large as possible without exceeding the maximum chunk size.

- A single element that by itself exceeds the maximum chunk size is divided into two or more chunks using text-splitting.

- Chunking produces a sequence of `CompositeElement`, `Table`, or `TableChunk` elements. Each “chunk” is an instance of one of these three types.

Chunking Options:

The following options are available to tune chunking behaviors. These are keyword arguments that can be used in a partitioning or chunking function call. All these options have defaults and need only be specified when a non-default setting is required. Specific chunking strategies (such as “by-title”) may have additional options.

- `max_characters`: (default=500) - the hard maximum size for a chunk. No chunk will exceed this number of characters. A single element that by itself exceeds this size will be divided into two or more chunks using text-splitting.

- `new_after_n_chars`: (default=max_characters) - the “soft” maximum size for a chunk. A chunk that already exceeds this number of characters will not be extended, even if the next _element_ would fit without exceeding the specified hard maximum. This can be used in conjunction with `max_characters` to set a “preferred” size, like “I prefer chunks of around 1000 characters, but I’d rather have a chunk of 1500 (max_characters) than resort to text-splitting”. This would be specified with `(..., max_characters=1500, new_after_n_chars=1000)`.

- `overlap`: (default=0) - only when using text-splitting to break up an oversized chunk, include this number of characters from the end of the prior chunk as a prefix on the next. This can mitigate the effect of splitting the semantic unit represented by the oversized element at an arbitrary position based on text length.

- `combine_text_under_n_chars`: (default=max_characters, `by_title` strategy only) - sequential small sections are combined to fill the chunking window as fully as possible, producing a logically larger chunk.


There are currently two chunking strategies, `basic` and `by_title`.

The `basic` strategy combines sequential elements to maximally fill each chunk while respecting both the specified max_characters (hard-max) and new_after_n_chars (soft-max) option values.

The `by_title` chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections.
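The difference between the two strategies can be sketched with a toy chunker over `(category, text)` pairs. This is an illustration only and not the `unstructured` implementation: `chunk_elements` is a hypothetical helper, the `"\n\n"` joiner is an assumption, and neither text-splitting of oversized elements nor `combine_text_under_n_chars` is modeled.

```python
def chunk_elements(elements, max_characters=500, new_after_n_chars=None, by_title=False):
    """Combine consecutive (category, text) elements into chunks."""
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks, current = [], ""
    for category, text in elements:
        candidate = text if not current else current + "\n\n" + text
        starts_section = by_title and category == "Title"  # section boundary
        too_big = len(candidate) > max_characters           # hard maximum
        full_enough = len(current) > soft_max               # soft maximum
        if current and (starts_section or too_big or full_enough):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

elements = [
    ("Title", "Intro"),
    ("NarrativeText", "Alpha text."),
    ("Title", "Methods"),
    ("NarrativeText", "Beta text."),
]
basic = chunk_elements(elements, max_characters=100)
by_title = chunk_elements(elements, max_characters=100, by_title=True)
print("basic:", basic)
print("by_title:", by_title)
```

With the same limits, the basic variant packs everything into one chunk, while the title-aware variant closes the chunk whenever a new `Title` element starts a section.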

In [None]:
!wget -O 'layoutparser_paper.pdf' 'http://arxiv.org/pdf/2103.15348.pdf'

In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# takes 3-4 mins on Colab
loader = UnstructuredPDFLoader('./layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000,
                               new_after_n_chars=3800,
                               combine_text_under_n_chars=2000,
                               mode='elements')
data = loader.load()

In [None]:
len(data)

In [None]:
[doc.metadata['category'] for doc in data]

In [None]:
data[0]

In [None]:
print(data[0].page_content)

In [None]:
data[1]

In [None]:
print(data[1].page_content)