# Approaches to Split Documents

1. [Length Based](#length-based)
2. [Text-structure Based](#text-structure-based)
3. [Document-structure Based](#)
4. [Semantic meaning Based](#semantic-meaning-based)

## Length Based

In [59]:
# Let's load the document from pdf
from os import path

pdf_path = "../sample_files/progit.pdf"
# pdf_path = "../sample_files/welcome.pdf"

if not path.exists(pdf_path):
    raise FileNotFoundError(f"Invalid path, file does not exist: {pdf_path}")

In [70]:
from langchain_community.document_loaders import PyPDFLoader


loader = PyPDFLoader(pdf_path)


docs = ""
metadatas = []
list_docs = []


async for doc in loader.alazy_load():
    current_page_number = doc.metadata["page_label"]

    if current_page_number.isdigit():
        current_page_number = int(current_page_number)

        if current_page_number < 8:
            continue

        if current_page_number >= 492:
            break

        docs += doc.page_content

        list_docs.append(doc.page_content)
        metadatas.append(doc.metadata)
    else:
        continue

### Load the document from `PDF`

### 1. Split text by token

In [79]:
from langchain_text_splitters import TokenTextSplitter

# Define the token splitter
token_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=25
)

token_chunks = token_splitter.split_text(docs)

print(f"Token-based: {len(token_chunks)} chunks")
print(token_chunks[0][:500])

Token-based: 442 chunks
Introduction
You’re about to spend several hours of your life reading about Git. Let’s take a minute to explain
what we have in store for you. Here is a quick summary of the ten chapters and three appendices of
this book.
In Chapter 1 , we’re going to cover Version Control Systems (VCSs) and Git basics — no technical
stuff, just what Git is, why it came about in a land full of VCSs, what sets it apart, and why so many
people are using it. Then, we’ll explain how to download Git and set it up for


### Create **`document`**

In [82]:
doc_chunks = token_splitter.create_documents(list_docs, metadatas=metadatas)
print(f"Total Chunk size: {len(doc_chunks)}")
print(f"Content is: {doc_chunks[0].page_content[:200]}")

Total Chunk size: 666
Content is: Introduction
You’re about to spend several hours of your life reading about Git. Let’s take a minute to explain
what we have in store for you. Here is a quick summary of the ten chapters and three app
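
Note that `create_documents` splits each input text on its own and attaches a copy of the matching metadata entry to every chunk it produces, so no chunk spans two pages. That is the likely reason the per-page call above yields more chunks (666) than splitting the concatenated string (442). Below is a minimal pure-Python sketch of that pairing. `Chunk` and `create_chunks` are hypothetical stand-ins (not LangChain APIs), and fixed-width character slicing replaces the real token splitter:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_chunks(texts, metadatas, chunk_size):
    """Split every text separately and copy its metadata onto each chunk."""
    chunks = []
    for text, meta in zip(texts, metadatas):
        for start in range(0, len(text), chunk_size):
            # dict(meta) gives each chunk its own copy of the page metadata
            chunks.append(Chunk(text[start:start + chunk_size], dict(meta)))
    return chunks


# Two toy "pages": 25 and 8 characters long (assumed data)
pages = ["a" * 25, "b" * 8]
page_metas = [{"page": 8}, {"page": 9}]

page_chunks = create_chunks(pages, page_metas, chunk_size=10)
print(len(page_chunks))
print([c.metadata["page"] for c in page_chunks])
```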


- In **`.from_tiktoken_encoder()`**:
  - It takes either an **`encoding_name`** (e.g. **"cl100k_base"**) or a **`model_name`** (e.g. **"gpt-4"**).
- **`chunk_size`** defines the maximum number of tokens in each chunk.
- **`chunk_overlap`** defines how many characters or tokens (depending on how the splitter measures length) from the end of one chunk are repeated at the beginning of the next chunk.

- We can also instantiate **`TokenTextSplitter`** directly. It works directly with **`tiktoken`** and ensures that each chunk is no larger than `chunk_size`.
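
To see how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window sketch. It assumes whitespace-separated words as stand-ins for `tiktoken` tokens, and `window_tokens` is a hypothetical helper, not part of LangChain:

```python
def window_tokens(tokens, chunk_size, chunk_overlap):
    """Return overlapping windows of at most chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return windows


# Words stand in for real tokens here (an assumption for illustration)
tokens = "one two three four five six seven eight nine ten".split()
windows = window_tokens(tokens, chunk_size=4, chunk_overlap=1)

for window in windows:
    print(" ".join(window))
```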

In [78]:
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    model_name="gpt-4",
    chunk_size=10,
    chunk_overlap=0
)

token_chunks = splitter.split_text(docs)
chunk_docs = splitter.create_documents(token_chunks)
print(f"Token-based: {len(token_chunks)} chunks")
print(f"String token: {token_chunks[0]}")  # Show the first chunk as a plain string
print(f"Document tokens: {chunk_docs[0]}")

Token-based: 20954 chunks
String token: Preface by Ben Straub
The first edition
Document tokens: page_content='Preface by Ben Straub
The first edition'


### 2. Split text by characters

In [1]:
from langchain_text_splitters import CharacterTextSplitter

print("Total length of the docs is:- ", len(docs))
# Define the character splitter
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,      # each chunk has at most 1000 characters
    chunk_overlap=200,    # each chunk overlaps up to 200 characters with the previous one
    length_function=len,
    is_separator_regex=False
)

char_chunks = char_splitter.split_text(docs)
print(f"Len chunks: {len(char_chunks)}")
print(f"Char chunk: {char_chunks[0]}")

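
Roughly, `CharacterTextSplitter` splits the text on `separator` and then packs neighbouring pieces back together until `chunk_size` is reached, carrying up to `chunk_overlap` characters of trailing pieces into the next chunk. Here is a simplified sketch of that idea. `pack_with_overlap` is a hypothetical helper, not the library's actual code, and pieces longer than `chunk_size` are left whole:

```python
def pack_with_overlap(text, separator, chunk_size, chunk_overlap):
    """Greedily pack separator-delimited pieces into overlapping chunks."""
    parts = [p for p in text.split(separator) if p]
    sep_len = len(separator)
    chunks, window, window_len = [], [], 0

    for part in parts:
        needed = len(part) + (sep_len if window else 0)
        if window and window_len + needed > chunk_size:
            # The current window is full: emit it as a chunk
            chunks.append(separator.join(window))
            # Drop leading pieces until what remains fits in chunk_overlap
            # and still leaves room for the new piece
            while window and (
                window_len > chunk_overlap
                or window_len + sep_len + len(part) > chunk_size
            ):
                dropped = window.pop(0)
                window_len -= len(dropped) + (sep_len if window else 0)
        window_len += len(part) + (sep_len if window else 0)
        window.append(part)

    if window:
        chunks.append(separator.join(window))
    return chunks


sample = "aaaa\nbbbb\ncccc\ndddd"
packed = pack_with_overlap(sample, "\n", chunk_size=9, chunk_overlap=4)
print(packed)
```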

- If you want to attach metadata to each chunk, use **`.create_documents`**.

In [None]:
doc_chunk = char_splitter.create_documents(
    [char_chunks[0]], metadatas=[metadatas[0]])

print(doc_chunk)

[Document(metadata={'producer': 'Asciidoctor PDF 2.3.17, based on Prawn 2.4.0', 'creator': 'Scott Chacon, Ben Straub', 'creationdate': '2025-04-10T14:40:20+00:00', 'title': 'Pro Git', 'author': 'Scott Chacon, Ben Straub', 'moddate': '2025-04-10T14:40:04+00:00', 'source': '../sample_files/progit.pdf', 'total_pages': 501, 'page': 8, 'page_label': '3'}, page_content='Preface by Ben Straub\nThe first edition of this book is what got me hooked on Git. This was my introduction to a style of\nmaking software that felt more natural than anything I had seen before. I had been a developer for\nseveral years by then, but this was the right turn that sent me down a much more interesting path\nthan the one I was on.\nNow, years later, I’m a contributor to a major Git implementation, I’ve worked for the largest Git\nhosting company, and I’ve traveled the world teaching people about Git. When Scott asked if I’d be\ninterested in working on the second edition, I didn’t even have to think.\nIt’s been a

## Text-structure Based


In [105]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)

texts = splitter.create_documents([docs])
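
By default, `RecursiveCharacterTextSplitter` tries the separators `["\n\n", "\n", " ", ""]` in order, so paragraphs are kept together when they fit, then lines, then words. The sketch below illustrates that recursion in plain Python. It is a simplified approximation under stated assumptions (no `chunk_overlap`, empty pieces dropped), and `recursive_split` is a hypothetical helper rather than the library's implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Last resort: no separator left, cut at fixed character offsets
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]

    # Split on the coarsest separator; recurse into pieces that are too long
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))

    # Greedily merge neighbouring pieces while they still fit
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "alpha beta\n\ngamma delta epsilon\nzeta"
split_chunks = recursive_split(sample_text, chunk_size=12)
print(split_chunks)
```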

In [111]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(
    texts, OpenAIEmbeddings(model="text-embedding-3-small"))

In [117]:
# Use a separate name so the `docs` string loaded earlier is not overwritten
results = vector_store.similarity_search("commit")

print(results)

[Document(id='e671f66b-f921-42f3-b545-f04ef662271d', metadata={}, page_content="all and commit. Staging the files computes a checksum for each one (the SHA-1 hash we mentioned\nin What is Git?), stores that version of the file in the Git repository (Git refers to them as blobs), and\nadds that checksum to the staging area:\n$ git add README test.rb LICENSE\n$ git commit -m 'Initial commit'\nWhen you create the commit by running git commit, Git checksums each subdirectory (in this case,"), Document(id='3360d11c-6d6d-4d3d-b63c-8d04e43b7de2', metadata={}, page_content=')\ncommit = repo.lookup(commit_id) ⑧\n① Create a new blob, which contains the contents of a new file.\n② Populate the index with the head commit’s tree, and add the new file at the path newfile.txt.\n③ This creates a new tree in the ODB, and uses it for the new commit.\n④ We use the same signature for both the author and committer fields.\n⑤ The commit message.\n⑥ When creating a commit, you have to specify the new commit’s
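
Conceptually, the similarity search above embeds the query and returns the stored chunks whose embeddings are closest to it, typically by cosine similarity (`similarity_search` returns 4 results by default). Here is a toy sketch of that ranking step. It uses hand-written 3-dimensional vectors in place of real OpenAI embeddings (an assumption), and `top_k` is a hypothetical helper:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=4):
    """Return the k texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy (text, embedding) pairs; the vectors are made up for illustration
toy_store = [
    ("git commit records a snapshot", [0.9, 0.1, 0.0]),
    ("git branch creates a pointer", [0.2, 0.9, 0.1]),
    ("staging area and commits", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]

ranked = top_k(toy_query, toy_store, k=2)
print(ranked)
```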

## Semantic meaning Based