# Approaches to Split Documents

1. [Length Based](#length-based)
2. [Text-structure Based](#text-structure-based)
3. [Document-structure Based](#document-structure-based)
4. [Semantic-meaning Based](#semantic-meaning-based)

## Length Based

In [2]:
# Check that the sample PDF exists before loading it
from os import path

pdf_path = "../sample_files/progit.pdf"

if not path.exists(pdf_path):
    raise FileNotFoundError(f"Invalid file: {pdf_path} does not exist")

### Load the document from `PDF`

In [80]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(pdf_path)

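docs = ""
metadatas = []
for doc in loader.load():
    # Skip the pages before the preface (page indexes 0-7)
    if doc.metadata["page"] < 8:
        continue

    # Stop at page index 492; later pages are left out
    if doc.metadata["page"] >= 492:
        break

    # Concatenate the page text and keep each page's metadata
    docs += doc.page_content
    metadatas.append(doc.metadata)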

### 1. Split text by token

In [None]:
from langchain_text_splitters import TokenTextSplitter

# Define the token splitter
token_splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=1000,
    chunk_overlap=400
)

token_chunks = token_splitter.split_text(docs)

print(f"Token-based: {len(token_chunks)} chunks")
print(token_chunks[0])

Token-based: 349 chunks
 been staged, it is modified. In Git Basics, you’ll learn more about these states and
how you can either take advantage of them or skip the staged part entirely.
The Command Line
There are a lot of different ways to use Git. There are the original command-line tools, and there
are many graphical user interfaces of varying capabilities. For this book, we will be using Git on the
command line. For one, the command line is the only place you can run all Git commands — most
of the GUIs implement only a partial subset of Git functionality for simplicity. If you know how to
run the command-line version, you can probably also figure out how to run the GUI version, while
the opposite is not necessarily true. Also, while your choice of graphical client is a matter of
personal taste, all users will have the command-line tools installed and available.
So we will expect you to know how to open Terminal in macOS or Command Prompt or PowerShell
in Windows. If you don’t know w

- In **`.from_tiktoken_encoder()`**
  - It takes either **`encoding_name`** (e.g. **"cl100k_base"**) or **`model_name`** (e.g. **"gpt-4"**) to select the tokenizer.
- **`chunk_size`** defines the maximum number of tokens in each chunk.
- **`chunk_overlap`** defines how many tokens from the end of one chunk are repeated at the beginning of the next chunk.

- We can also instantiate **`TokenTextSplitter`** directly. It works on **`tiktoken`** tokens, so every chunk contains at most `chunk_size` tokens.

In [78]:
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    model_name="gpt-4",
    chunk_size=10,
    chunk_overlap=0
)

token_chunks = splitter.split_text(docs)
# Split the full text once; passing the already-split chunks would split them again
chunk_docs = splitter.create_documents([docs])
print(f"Token-based: {len(token_chunks)} chunks")
print(f"String token: {token_chunks[0]}")  # first chunk as a plain string
print(f"Document tokens: {chunk_docs[0]}")  # first chunk wrapped in a Document

Token-based: 20954 chunks
String token: Preface by Ben Straub
The first edition
Document tokens: page_content='Preface by Ben Straub
The first edition'
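
The interaction between `chunk_size` and `chunk_overlap` in the token splitters above can be sketched in plain Python. This is a minimal illustration, not the library's code: `split_tokens` is a hypothetical helper, and whitespace-separated words stand in for real `tiktoken` token ids.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, advancing by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
        start += step
    return chunks


# Assumption: words stand in for tokenizer output, purely for illustration.
tokens = "Git is a free and open source distributed version control system".split()
token_windows = split_tokens(tokens, chunk_size=5, chunk_overlap=2)
for window in token_windows:
    print(" ".join(window))
```

Each window holds at most `chunk_size` tokens, and the last `chunk_overlap` tokens of one window reappear at the start of the next.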


### 2. Split text by characters

In [83]:
from langchain_text_splitters import CharacterTextSplitter

print("Total length of the docs is:- ", len(docs))
# Define the character splitter
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,      # each chunk has at most 1000 characters
    chunk_overlap=200,    # consecutive chunks share up to 200 characters
    length_function=len,
    is_separator_regex=False
)

char_chunks = char_splitter.split_text(docs)
print(f"Len chunks: {len(char_chunks)}")
print(f"Char chunk: {char_chunks[0]}")

Total length of the docs is:-  844051
Len chunks: 1063
Char chunk: Preface by Ben Straub
The first edition of this book is what got me hooked on Git. This was my introduction to a style of
making software that felt more natural than anything I had seen before. I had been a developer for
several years by then, but this was the right turn that sent me down a much more interesting path
than the one I was on.
Now, years later, I’m a contributor to a major Git implementation, I’ve worked for the largest Git
hosting company, and I’ve traveled the world teaching people about Git. When Scott asked if I’d be
interested in working on the second edition, I didn’t even have to think.
It’s been a great pleasure and privilege to work on this book. I hope it helps you as much as it did
me.
3Dedications
To my wife, Becky, without whom this adventure never would have begun. — Ben
This edition is dedicated to my girls. To my wife Jessica who has supported me for all of these years
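
To make the behaviour of the character splitter above more concrete, here is a simplified sketch of "split on a separator, then merge pieces up to `chunk_size`". The helper `split_by_separator` is hypothetical and only approximates what `CharacterTextSplitter` does; it assumes that overlap is carried over in whole pieces (here, whole lines).

```python
def split_by_separator(text, separator, chunk_size, chunk_overlap):
    """Split on separator, then greedily merge pieces into chunks of <= chunk_size chars."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep a tail of the finished chunk as overlap: drop leading pieces
            # until the tail fits chunk_overlap and leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current = current[1:]
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample_text = "line one\nline two\nline three\nline four\nline five"
sample_chunks = split_by_separator(sample_text, "\n", chunk_size=20, chunk_overlap=10)
for chunk in sample_chunks:
    print(repr(chunk))
```

Because the overlap is made of whole lines, the shared text between two chunks can be shorter than `chunk_overlap`, and a single line longer than `chunk_size` would still be emitted as one oversized chunk in this sketch.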


- If you want to attach metadata to each chunk, use **`.create_documents`**.

In [None]:
doc_chunk = char_splitter.create_documents(
    [char_chunks[0]], metadatas=[metadatas[0]])

print(doc_chunk)

[Document(metadata={'producer': 'Asciidoctor PDF 2.3.17, based on Prawn 2.4.0', 'creator': 'Scott Chacon, Ben Straub', 'creationdate': '2025-04-10T14:40:20+00:00', 'title': 'Pro Git', 'author': 'Scott Chacon, Ben Straub', 'moddate': '2025-04-10T14:40:04+00:00', 'source': '../sample_files/progit.pdf', 'total_pages': 501, 'page': 8, 'page_label': '3'}, page_content='Preface by Ben Straub\nThe first edition of this book is what got me hooked on Git. This was my introduction to a style of\nmaking software that felt more natural than anything I had seen before. I had been a developer for\nseveral years by then, but this was the right turn that sent me down a much more interesting path\nthan the one I was on.\nNow, years later, I’m a contributor to a major Git implementation, I’ve worked for the largest Git\nhosting company, and I’ve traveled the world teaching people about Git. When Scott asked if I’d be\ninterested in working on the second edition, I didn’t even have to think.\nIt’s been a
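
The pairing of texts and metadata in `.create_documents` can be pictured with a small sketch. It assumes that the `metadatas` list lines up with the input texts (one entry per text, not per chunk) and that every chunk cut from a text receives a copy of that text's metadata. `create_documents_sketch`, the toy pages, and the page numbers are hypothetical; plain dictionaries stand in for `Document` objects.

```python
from copy import deepcopy


def create_documents_sketch(texts, metadatas, split_fn):
    """Give every chunk of texts[i] its own copy of metadatas[i]."""
    documents = []
    for text, metadata in zip(texts, metadatas):
        for chunk in split_fn(text):
            documents.append(
                {"page_content": chunk, "metadata": deepcopy(metadata)}
            )
    return documents


# Toy input: two "pages", each split into lines (illustrative data only).
pages = ["first page, line A\nfirst page, line B",
         "second page, line A\nsecond page, line B"]
page_metadatas = [{"page": 8}, {"page": 9}]
page_docs = create_documents_sketch(
    pages, page_metadatas, lambda text: text.split("\n")
)
for d in page_docs:
    print(d["metadata"]["page"], repr(d["page_content"]))
```

Splitting page by page in this way keeps the page number attached to each chunk, which is lost once all pages are concatenated into a single string.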

## Text-structure Based


In [105]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)

texts = splitter.create_documents([docs])
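
The idea behind the recursive splitter above is to try coarse separators first (paragraphs), then finer ones (lines, words, characters) only for pieces that are still too long. The sketch below illustrates that idea under simplifying assumptions: `recursive_split` is a hypothetical helper, the separator list `("\n\n", "\n", " ", "")` is assumed to match the usual defaults, and overlap handling is left out entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Git basics.\n\nStaging files computes a checksum for each one.\n\nCommit."
recursive_chunks = recursive_split(sample, chunk_size=30)
for chunk in recursive_chunks:
    print(repr(chunk))
```

Short paragraphs survive intact, while the long middle paragraph falls through to word-level splitting, which is why this strategy tends to keep related sentences together.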

In [111]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(
    texts, OpenAIEmbeddings(model="text-embedding-3-small"))

In [117]:
# Use a new name so the `docs` string loaded earlier is not overwritten
results = vector_store.similarity_search("commit")

print(results)

[Document(id='e671f66b-f921-42f3-b545-f04ef662271d', metadata={}, page_content="all and commit. Staging the files computes a checksum for each one (the SHA-1 hash we mentioned\nin What is Git?), stores that version of the file in the Git repository (Git refers to them as blobs), and\nadds that checksum to the staging area:\n$ git add README test.rb LICENSE\n$ git commit -m 'Initial commit'\nWhen you create the commit by running git commit, Git checksums each subdirectory (in this case,"), Document(id='3360d11c-6d6d-4d3d-b63c-8d04e43b7de2', metadata={}, page_content=')\ncommit = repo.lookup(commit_id) ⑧\n① Create a new blob, which contains the contents of a new file.\n② Populate the index with the head commit’s tree, and add the new file at the path newfile.txt.\n③ This creates a new tree in the ODB, and uses it for the new commit.\n④ We use the same signature for both the author and committer fields.\n⑤ The commit message.\n⑥ When creating a commit, you have to specify the new commit’s
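
Conceptually, the similarity search above embeds the query, compares it with every stored chunk embedding, and returns the closest chunks. The sketch below shows that ranking step with cosine similarity, which is assumed here to be the metric in use. The word-count `embed` function is a toy stand-in for a real embedding model, and all names and sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query, chunks, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


toy_chunks = [
    "git commit records staged changes",
    "git branch lists local branches",
    "commit early and commit often",
]
top_chunks = similarity_search("commit", toy_chunks, k=2)
print(top_chunks)
```

A real embedding model captures meaning rather than exact word overlap, so it can also rank chunks that never contain the literal query word.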

## Semantic-meaning Based