# 1) Chunking
Chunking in Large Language Models (LLMs) refers to the process of dividing input data or text into multiple smaller, manageable pieces, ensuring that each chunk
- retains its meaningful context and coherence without losing any critical information, and
- stays within the model's token limit so that it fits in the model's processing window.

In a RAG pipeline, the goal is to provide the LLM with precisely the information needed for the specific task, and nothing more.

This is one of the most crucial steps for enhancing the efficiency of LLM applications.

“What is the right chunking strategy for my solution?” is one of the first and most fundamental decisions an LLM practitioner must make when building an advanced RAG solution.

In multi-modal applications, splitting also applies to images.

In the video linked below, Greg Kamradt gives an overview of different chunking strategies. These strategies can serve as starting points for developing RAG-based LLM applications. They are classified into five levels based on complexity and effectiveness.

https://www.youtube.com/watch?v=8OJC21T2SL4

### Context Window

We cannot pass unlimited data to our language model, for two reasons:
1. Context Length / Token Limit:
    - A language model accepts only a limited number of tokens as input.
2. Signal to Noise:
    - Language models perform better when the signal-to-noise ratio of their input is higher.
    - Distracting information in the model's context window tends to measurably degrade the performance of the overall application.  

Consider, for example, a chatbot that answers questions about product documentation stored as chunks in a vector database. Retrieving only the relevant chunks significantly reduces the time and computational resources the LLM needs to process large amounts of data, because it works with those chunks instead of the entire documentation.  

It also allows for real-time updates to the database. As the product documentation evolves, the corresponding chunks in the vector database can be easily updated. This ensures that the chatbot always provides the most up-to-date information.  

Finally, by focusing on semantically relevant chunks, the LLM can provide more precise and contextually appropriate responses, leading to improved customer satisfaction.
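
To make the token-limit constraint concrete, here is a minimal sketch that packs chunks into a fixed token budget. The function names and the "roughly 4 characters per token" estimate are assumptions made for illustration only; a real application would count tokens with the model's own tokenizer.

```python
from typing import List


def rough_token_count(text: str) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def pack_chunks(chunks: List[str], token_budget: int) -> List[str]:
    """Keep chunks, in order, until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = rough_token_count(chunk)
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected


# Hypothetical chunks, assumed to be ordered from most to least relevant
ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]
print(len(pack_chunks(ranked_chunks, token_budget=25)))  # prints 2
```

Each 40-character chunk is estimated at 10 tokens, so only the first two fit into a budget of 25.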

*************************
*************************

## 1.1) Levels Of Text Splitting
* **Level 1: [Character Splitting](#CharacterSplitting)** - Simple static character chunks of data
* **Level 2: [Recursive Character Text Splitting](#RecursiveCharacterSplitting)** - Recursive chunking based on a list of separators
* **Level 3: [Document Specific Splitting](#DocumentSpecific)** - Various chunking methods for different document types (PDF, Python, Markdown)
* **Level 4: [Semantic Splitting](#SemanticChunking)** - Embedding walk based chunking
* **Level 5: [Agentic Splitting](#AgenticChunking)** - Experimental method of splitting text with an agent-like system. Good for if you believe that token cost will trend to $0.00
* **\*Bonus Level:\*** **[Alternative Representation Chunking + Indexing](#BonusLevel)** - Derivative representations of your raw text that will aid in retrieval and indexing

**Notebook resources:**
* [Video Overview]() - Walkthrough of this code with commentary
* [ChunkViz.com](https://www.chunkviz.com/) - Visual representation of chunk splitting methods
* [RAGAS](https://github.com/explodinggradients/ragas) - Retrieval evaluation framework

In [44]:
# import required libraries
import fitz # fitz is the legacy name for PyMuPDF library
from pathlib import Path

In [45]:
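# Define function to build source file path
def get_source_file(file_name: str) -> Path:
    """Return the path to `file_name` inside the sibling `99_Datasets` folder."""
    file_path = Path.cwd().parent.joinpath("99_Datasets").joinpath(file_name)

    # Fail early instead of returning a path that does not exist
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

    return file_path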

In [46]:
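# Define function to cleanse the input text
import re
def cleanse_text(text: str) -> str:
    # Remove "RAG" (optionally preceded by one whitespace character) or a run of digits
    # when either is immediately followed by a newline; presumably this targets the
    # page header and page numbers extracted from the PDF
    pattern = r"\s?RAG\n|\d+\n"
    final_text = re.sub(pattern, "", text)
    # Join the remaining lines into one continuous string
    cleaned_text = final_text.replace("\n", " ")

    return cleaned_text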

In [47]:
# Define function to read pdf
def read_pdf(pdf_path: Path) -> str:
    doc = fitz.open(pdf_path)
    pages = []

    for page in doc:
        text = page.get_text()
        pages.append(text)

    combined_text = " ".join(pages)
    final_text = cleanse_text(combined_text)
    return final_text

In [48]:
source_file = get_source_file('rag.pdf')
final_text = read_pdf(source_file)

print(final_text)

What is Retrieval-Augmented Generation? Retrieval-Augmented Generation (RAG) is the process of optimizing the output of  a large language model, so it references an authoritative knowledge base outside  of its training data sources before generating a response. Large Language Models  (LLMs) are trained on vast volumes of data and use billions of parameters to  generate original output for tasks like answering questions, translating languages,  and completing sentences. RAG extends the already powerful capabilities of LLMs  to specific domains or an organization's internal knowledge base, all without the  need to retrain the model. It is a cost-effective approach to improving LLM output  so it remains relevant, accurate, and useful in various contexts. Why is Retrieval-Augmented Generation important? LLMs are a key artificial intelligence (AI) technology powering intelligent chatbots  and other natural language processing (NLP) applications. The goal is to create  bots that can answer u

*************************
*************************

### 1.1.1) Level 1: Character Splitting / Fixed Size Chunking <a id="CharacterSplitting"></a>

Character splitting is the most basic and crude form of splitting up your text. It simply divides your text into chunks of N characters, regardless of their content, form, or structure.  

This method isn't recommended for real applications, but it's a great starting point for understanding the basics.

- **Pros:** Easy & Simple
- **Cons:** Very rigid and doesn't take into account the structure and context of your text

Concepts to know:

- **Chunk Size** - The number of characters you would like in your chunks: 50, 100, 100,000, etc.
- **Chunk Overlap** - The number of characters by which sequential chunks overlap. This helps avoid cutting a single piece of context into multiple pieces, at the cost of duplicating data across chunks.
- **Separator** - The character(s) on which the text is split. LangChain's `CharacterTextSplitter` defaults to `"\n\n"`; the examples below pass `''` to split between individual characters.

LangChain and LlamaIndex offer the `CharacterTextSplitter` and `SentenceSplitter` (which splits on sentences by default) classes, respectively, for this chunking technique.

#### Step 1: get some sample text

In [49]:
text = "Retrieval-Augmented Generation (RAG) is the process of optimizing the output of  a large language model, so it references an authoritative knowledge base outside  of its training data sources before generating a response."

#### Step 2: Implement character splitter using plain Python

In [50]:
# Create a list that will hold the chunks
chunks = []
chunk_size = 25
len_of_text = len(text)
for i in range(0, len_of_text, chunk_size):
    chunk = text[i:i+chunk_size]
    chunks.append(chunk)
chunks

['Retrieval-Augmented Gener',
 'ation (RAG) is the proces',
 's of optimizing the outpu',
 't of  a large language mo',
 'del, so it references an ',
 'authoritative knowledge b',
 'ase outside  of its train',
 'ing data sources before g',
 'enerating a response.']

As the result above shows, the words are not split in a meaningful way.  
"Generation" is split into "Gener" and "ation", so we lose the meaning of the word itself.

*************************
*************************

#### Step 3: Implement character splitting using LangChain `CharacterTextSplitter`

In [51]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=25, chunk_overlap=0, separator='', strip_whitespace=True)

Use the `create_documents` method of `CharacterTextSplitter` to split our text.  

Note: `create_documents` expects a list of texts, so if you just have a string (like we do) you'll need to wrap it in `[]`.

In [52]:
document_chunks = text_splitter.create_documents([text])
document_chunks

[Document(page_content='Retrieval-Augmented Gener', metadata={}),
 Document(page_content='ation (RAG) is the proces', metadata={}),
 Document(page_content='s of optimizing the outpu', metadata={}),
 Document(page_content='t of  a large language mo', metadata={}),
 Document(page_content='del, so it references an', metadata={}),
 Document(page_content='authoritative knowledge b', metadata={}),
 Document(page_content='ase outside  of its train', metadata={}),
 Document(page_content='ing data sources before g', metadata={}),
 Document(page_content='enerating a response.', metadata={})]

Notice how this time we have the same chunks, but they are wrapped in documents. These will play nicely with the rest of the LangChain world. Also notice that the trailing whitespace at the end of the 5th chunk is missing. This is because LangChain removes it; see [this line](https://github.com/langchain-ai/langchain/blob/f36ef0739dbb548cabdb4453e6819fc3d826414f/libs/langchain/langchain/text_splitter.py#L167) for where that happens. You can avoid this with `strip_whitespace=False`.

##### Step 3.1: Try using Chunk Overlap & Separators

**Step 3.1.1: Chunk overlap** blends our chunks together so that the tail of Chunk #1 is the same as the head of Chunk #2, and so on.

This time I'll set the overlap to 4, which means 4 characters of overlap.

In [53]:
text_splitter = CharacterTextSplitter(chunk_size=25, chunk_overlap=4, separator='', strip_whitespace=True)
document_chunks = text_splitter.create_documents([text])
document_chunks

[Document(page_content='Retrieval-Augmented Gener', metadata={}),
 Document(page_content='eneration (RAG) is the pr', metadata={}),
 Document(page_content='e process of optimizing t', metadata={}),
 Document(page_content='ng the output of  a large', metadata={}),
 Document(page_content='arge language model, so i', metadata={}),
 Document(page_content='so it references an autho', metadata={}),
 Document(page_content='uthoritative knowledge ba', metadata={}),
 Document(page_content='e base outside  of its tr', metadata={}),
 Document(page_content='s training data sources b', metadata={}),
 Document(page_content='es before generating a re', metadata={}),
 Document(page_content='a response.', metadata={})]

Notice how the text is split in the same way, but now there is overlap between chunks 1 & 2, 2 & 3, and so on. The 'ener' at the tail of Chunk #1 matches the 'ener' at the head of Chunk #2.
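
The same overlap behaviour can be sketched in plain Python by advancing the window by `chunk_size - chunk_overlap` characters each time. This is a simplified illustration, not LangChain's implementation: the helper name `split_with_overlap` is hypothetical, and unlike `CharacterTextSplitter` it does not strip whitespace.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Fixed-size character chunks where neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "Retrieval-Augmented Generation (RAG) is the process"
for chunk in split_with_overlap(sample, chunk_size=25, chunk_overlap=4):
    print(repr(chunk))
```

With a chunk size of 25 and an overlap of 4, each new chunk starts 21 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next.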

To better visualize this, use [ChunkViz.com](https://www.chunkviz.com/). Here's what the same text looks like.

<div style="text-align: center;">
    <img src="../98_Images/ChunkViz.png" alt="image" style="max-width: 800px;">
</div>


Check out how we have three colors, with two overlapping sections.

**Step 3.1.2: Separators** are character sequences you would like to split on. For example, if you wanted to chunk your data at `ng`, you can specify that.

In [54]:
text_splitter = CharacterTextSplitter(separator='ng', chunk_size=25, chunk_overlap=0)
document_chunks = text_splitter.create_documents([text])
document_chunks

Created a chunk of size 63, which is longer than the specified 25
Created a chunk of size 26, which is longer than the specified 25
Created a chunk of size 83, which is longer than the specified 25
Created a chunk of size 29, which is longer than the specified 25


[Document(page_content='Retrieval-Augmented Generation (RAG) is the process of optimizi', metadata={}),
 Document(page_content='the output of  a large la', metadata={}),
 Document(page_content='uage model, so it references an authoritative knowledge base outside  of its traini', metadata={}),
 Document(page_content='data sources before generati', metadata={}),
 Document(page_content='a response.', metadata={})]

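Note that when `separator` is anything other than `''`, chunks can exceed `chunk_size`. `CharacterTextSplitter` splits only at the separator, so any piece between two separators that is longer than `chunk_size` is kept whole and reported with a warning, as in the output above.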
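
A simplified plain-Python sketch of this "split on the separator, then merge small pieces" behaviour is shown below. It is an illustration under assumptions, not LangChain's actual code, and the helper name `split_on_separator` is hypothetical.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split only at `separator`, then greedily merge neighbouring pieces that fit."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # The merged text would be too long: close the current chunk
            chunks.append(current.strip())
            current = piece
        else:
            # Either the pieces fit together, or this is the first piece of a chunk.
            # A single piece longer than chunk_size is kept whole, never cut.
            current = candidate
    if current:
        chunks.append(current.strip())
    return chunks


sample = "Retrieval-Augmented Generation (RAG) is the process of optimizing the output"
for chunk in split_on_separator(sample, separator="ng", chunk_size=25):
    print(len(chunk), repr(chunk))
```

The first piece is 63 characters long because the text contains no `ng` before "optimizing", which mirrors the first warning in the output above.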

Refer to the URL below for the additional keyword arguments accepted when creating a `CharacterTextSplitter` object.  
https://api.python.langchain.com/en/latest/base/langchain_text_splitters.base.TextSplitter.html#langchain_text_splitters.base.TextSplitter

*************************
*************************

#### Step 4: Implement character splitting using Llama Index `SentenceSplitter`

[Llama Index](https://www.llamaindex.ai/) is a great choice for flexibility in the chunking and indexing process. They provide node relationships out of the box which can aid in retrieval later.

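Let's take a look at their sentence splitter. It is similar to the character splitter, but with its default settings it splits on sentences instead. Note that its `chunk_size` and `chunk_overlap` are measured in tokens rather than characters, which is why the first chunk below is 54 characters long even though `chunk_size=25`.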

In [55]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader

In [61]:
# Create SentenceSplitter object
splitter = SentenceSplitter(chunk_size=25, chunk_overlap=2)

In [62]:
# Load Documents
documents = SimpleDirectoryReader(input_files=["../99_Datasets/simple_rag.txt"]).load_data()

In [63]:
# Create your nodes. Nodes are similar to documents but with more relationship data added to them.
doc_nodes = splitter.get_nodes_from_documents(documents=documents)

Metadata length (11) is close to chunk size (25). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this.


In [66]:
print(len(doc_nodes[0].get_content()))
print(doc_nodes[0].get_content())
print(doc_nodes[1].get_content())
doc_nodes[0]

54
Retrieval-Augmented Generation (RAG) is the process of
process of optimizing the output of  a large language model, so it


TextNode(id_='5d853c82-b7cb-4325-b0b0-8c18f569c1b6', embedding=None, metadata={'file_path': '../99_Datasets/simple_rag.txt', 'file_name': 'simple_rag.txt', 'file_type': 'text/plain', 'file_size': 222, 'creation_date': '2024-04-27', 'last_modified_date': '2024-04-27'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='21392551-1b70-4e21-b41a-5009bf2cd6de', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '../99_Datasets/simple_rag.txt', 'file_name': 'simple_rag.txt', 'file_type': 'text/plain', 'file_size': 222, 'creation_date': '2024-04-27', 'last_modified_date': '2024-04-27'}, hash='788154316878a11c228a2835e9e9140c590c2110984864aa51d752ffdd55a688'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(no

As you can see there is a lot more relationship data held within Llama Index's nodes.
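
To illustrate the idea of relationship data, here is a minimal plain-Python sketch that links chunks to their source and to their neighbours. `SimpleNode` and `link_nodes` are hypothetical names invented for this example; they are not Llama Index classes, and the real `TextNode` stores considerably more information.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleNode:
    """A chunk of text plus pointers to its source document and neighbours."""
    text: str
    source_id: str
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def link_nodes(chunks: List[str], source_id: str) -> List[SimpleNode]:
    """Wrap chunks in nodes and connect each node to the one before and after it."""
    nodes = [SimpleNode(text=chunk, source_id=source_id) for chunk in chunks]
    for left, right in zip(nodes, nodes[1:]):
        left.next_id = right.node_id
        right.prev_id = left.node_id
    return nodes


nodes = link_nodes(["first chunk", "second chunk", "third chunk"], source_id="doc-1")
print(nodes[1].prev_id == nodes[0].node_id)  # prints True
```

Links like these let a retriever pull in the neighbouring chunks of a hit when more surrounding context is needed.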

****
****

### 1.1.2) Level 2: Recursive Character Text Splitting

<a id="RecursiveCharacterSplitting"></a>
Let's jump a level of complexity.

The problem with Level #1 is that we don't take into account the structure of our document at all. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we'll specify a series of separators which will be used to split our docs.

You can see the default separators for LangChain [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L842). Let's take a look at them one by one.

* "\n\n" - Double new line, or most commonly paragraph breaks
* "\n" - New lines
* " " - Spaces
* "" - Characters

I'm not sure why a period (".") isn't included on the list, perhaps it is not universal enough? If you know, let me know.

This is the Swiss Army knife of splitters and my first choice when mocking up a quick application. If you don't know which splitter to start with, this is a good first bet.

Let's try it out

In [67]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

Then let's load up a larger piece of text

In [10]:
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

In [2]:
first_sentence = """One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."""
first_sentence_length = len(first_sentence)
print(f"first_sentence_length: {first_sentence_length}")

second_sentence = """Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business."""
second_sentence_length = len(second_sentence)
print(f"second_sentence_length: {second_sentence_length}")

third_sentence = """It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]"""
third_sentence_length = len(third_sentence)
print(f"third_sentence_length: {third_sentence_length}")

first_sentence_length: 155
second_sentence_length: 313
third_sentence_length: 433


In [11]:
text

'\nOne of the most important things I didn\'t understand about the world when I was a child is the degree to which the returns for performance are superlinear.\n\nTeachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.\n\nIt\'s obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we\'ve invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]\n'

#### Iteration 1:

In the first iteration, the `create_documents` method splits the text on the delimiter `\n\n` (double new line).  

**Note:** Check the output of the `text` variable above. Each paragraph ends with a newline character and is followed by a blank line, so consecutive paragraphs are separated by two newline characters.  

**The output after 1st iteration is:**  

chunk_1 = "One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."  

chunk_2 = "Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business."  

chunk_3 = "It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]"

#### Iteration 2:

Next, the splitter checks each chunk against `chunk_size`. Since every paragraph is longer than 65 characters, it tries the next delimiter, `\n` (new line).  

**The output after 2nd iteration is:**  

Since no chunk contains a new line, the output **remains the same** after this iteration.

#### Iteration 3:

In [13]:
print(first_sentence[:65])
print(len("One of the most important things I didn't understand about the world"))
print(len("world when I was a child is the degree to which the returns for performance"))

One of the most important things I didn't understand about the wo
68
75


The splitter checks `chunk_size` again. Since every chunk is still longer than 65 characters, it tries the next delimiter, `" "` (space).  

**The output after 3rd iteration is:**
  
chunk_1 = "One of the most important things I didn't understand about the"  
chunk_2 = "world when I was a child is the degree to which the returns for"  
chunk_3 = "performance are superlinear."  
.  
.  
chunk_15 = "even benefit to humanity. In all of these, the rich get richer."  
chunk_16 = "[1]"  

**Note:** The first chunk ends at the 62nd character because including the next word (`world`) would make it 68 characters long, which exceeds the 65-character limit.

Now let's make our text splitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)

In [None]:
text_splitter.create_documents([text])

[Document(page_content="One of the most important things I didn't understand about the"),
 Document(page_content='world when I was a child is the degree to which the returns for'),
 Document(page_content='performance are superlinear.'),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(page_content='meant well, but this is rarely true. If your product is only'),
 Document(page_content="half as good as your competitor's, you don't get half as many"),
 Document(page_content='customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are"),
 Document(page_content='superlinear in business. Some think this is a flaw of'),
 Document(page_content='capitalism, and that if we changed the rules it would stop being'),
 Document(page_content='true. But superlinear returns for

Notice how now there are more chunks that end with a period ".". This is because those likely are the end of a paragraph and the splitter first looks for double new lines (paragraph break).

Once the paragraphs are split, it looks at the chunk size. If a chunk is too big, it splits that chunk by the next separator; if the result is still too big, it moves on to the next separator, and so forth.
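
The procedure described above can be sketched in plain Python as follows. This is a simplified illustration that assumes no chunk overlap; it is not LangChain's implementation, and the function name `recursive_split` is hypothetical.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split on the first separator, merge pieces that fit, recurse on pieces that don't."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p.strip()):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with the next separator
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


paragraph = ("One of the most important things I didn't understand about the world "
             "when I was a child is the degree to which the returns for performance "
             "are superlinear.")
for chunk in recursive_split(paragraph, chunk_size=65):
    print(len(chunk), chunk)
```

For this paragraph the sketch falls through `\n\n` and `\n` (neither occurs in it) and splits on spaces, giving chunks of 62, 63, and 28 characters.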

For text of this size, let's split on something bigger.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)
text_splitter.create_documents([text])

[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]

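For this text, 450 splits the paragraphs perfectly. You can even raise the chunk size to 469 and get the same splits. This is because the splitter merges neighboring pieces only when the merged result fits within `chunk_size`, so chunks 'snap' to the nearest separator. The first two paragraphs (155 and 313 characters) plus the separator between them need more than 469 characters, so they stay separate.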

Let's view this visually

<div style="text-align: center;">
    <img src="static/ChunkVizCharacterRecursive.png" alt="image" style="max-width: 800px;">
</div>


### 1.1.3) Level 3: Document Specific Splitting <a id="DocumentSpecific"></a>

Stepping up our levels ladder, let's start to handle document types other than normal prose in a .txt file. What if you have pictures, a PDF, or code snippets?

Our first two levels wouldn't work great for this so we'll need to find a different tactic.

This level is all about making your chunking strategy fit your different data formats. Let's run through a bunch of examples of this in action

The Markdown, Python, and JS splitters will basically be similar to Recursive Character, but with different separators.

See all of LangChain's document splitters [here](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter) and Llama Index's node parsers ([HTML](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#htmlnodeparser), [JSON](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#jsonnodeparser), [Markdown](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#markdownnodeparser)).

#### Markdown

You can see the separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1175).

Separators:
* `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
* ```` ```\n ```` - Code blocks
* `\n\\*\\*\\*+\n` - Horizontal Lines
* `\n---+\n` - Horizontal Lines
* `\n___+\n` - Horizontal Lines
* `\n\n` - Double new lines
* `\n` - New line
* `" "` - Spaces
* `""` - Character
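
As a rough illustration of the first separator in this list, the sketch below splits a Markdown string at every newline that is followed by a header. It uses only the standard library; `split_markdown_by_headers` is a hypothetical helper, and it leaves out the chunk-size check and the fallback to the remaining separators that a full splitter performs.

```python
import re
from typing import List


def split_markdown_by_headers(markdown: str) -> List[str]:
    """Split at newlines followed by a header marker (1 to 6 '#' and a space)."""
    # The lookahead keeps the header text at the start of the following section
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    return [section.strip() for section in sections if section.strip()]


sample_md = "# Title\nintro\n\n## A\ntext a\n\n## B\ntext b"
for section in split_markdown_by_headers(sample_md):
    print(repr(section))
```

Each resulting section starts with its own header, which is why header-aware splits tend to follow the document's structure.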

In [19]:
# import required libraries
from langchain.text_splitter import MarkdownTextSplitter
from pathlib import Path

# create an object of MarkdownTextSplitter class
md_splitter = MarkdownTextSplitter(chunk_size=128, chunk_overlap=0)

# read input markdown file and split it
file_path = Path.cwd().parent.joinpath("99_Datasets").joinpath("nlp_small_feedforward.md")
if not file_path.exists():
    print("Input file not available")
else:
    with open(file=file_path, mode="rt") as input_file:
        text = input_file.read()
        doc_chunks = md_splitter.create_documents([text])

In [20]:
doc_chunks

[Document(page_content='# [NLP with Small Feed-Forward Networks](https://arxiv.org/pdf/1708.00214.pdf)', metadata={}),
 Document(page_content='by: **Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, \nAlex Salcianu, David Weiss, Ryan McDonald, Slav Petrov (Google)**', metadata={}),
 Document(page_content='## tl;dr\nSmall and shallow feedforward networks are memory and speed efficient, and perform surprisingly well.', metadata={}),
 Document(page_content='Using some techniques, they get close to state-of-the-art on structured and unstructured language tasks.', metadata={}),
 Document(page_content='## Notes \n\n#### Language tasks on a budget\n\nrelated work :', metadata={}),
 Document(page_content='* 900000 params only for LSTM-based POS tagging model (Gillick et al 2016)', metadata={}),
 Document(page_content='* 8.8 words / s for two-layered LSTM on Android phone for translation (Kim and Rush 2016)', metadata={}),
 Document(page_content='***\n\n4 tasks adressed here :', metadata={}),
 

Notice how the splits gravitate towards Markdown sections. However, it's still not perfect.  
In the full output (truncated above), check out how there are chunks with just **"scratch"** or **"ngram"** in them. You'll run into this with small chunk sizes.

#### Python

See the Python separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1069):

* `\nclass` - Classes first
* `\ndef` - Functions next
* `\n\tdef` - Indented functions
* `\n\n` - Double New lines
* `\n` - New Lines
* `" "` - Spaces
* `""` - Characters
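
The first two separators in this list can be approximated with a small standard-library sketch that cuts Python source before each top-level `class` or `def`. The helper name `split_python_top_level` is hypothetical, and the sketch ignores chunk size and the remaining separators.

```python
import re
from typing import List


def split_python_top_level(source: str) -> List[str]:
    """Split source code before each top-level `class` or `def` statement."""
    # Indented definitions (methods) are not matched, so a class stays in one block
    blocks = re.split(r"\n(?=class |def )", source.strip())
    return [block.strip() for block in blocks if block.strip()]


sample_code = "class A:\n    def m(self):\n        pass\n\ndef f():\n    return 1\n\nx = f()\n"
for block in split_python_top_level(sample_code):
    print(repr(block))
```

Note that the trailing `x = f()` stays attached to the `def f()` block, because nothing in this pattern separates loose top-level statements.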


Let's load up our splitter

In [27]:
from langchain.text_splitter import PythonCodeTextSplitter
py_splitter = PythonCodeTextSplitter(chunk_size=98, chunk_overlap=0)

In [28]:
python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

In [29]:
doc_chunks = py_splitter.create_documents([python_text])
doc_chunks

[Document(page_content='class Person:\n  def __init__(self, name, age):\n    self.name = name\n    self.age = age', metadata={}),
 Document(page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print (i)', metadata={})]

Check out how the class stays together in a single document (good), then the rest of the code is in a second document (ok).

I needed to play with the chunk size to get a clean result like that. You'll likely need to do the same for yours which is why using evaluations to determine optimal chunk sizes is crucial.

When working with text in the language model world, we rarely deal with raw strings.  
It is more common to work with documents, so the next step is to read `paul_graham_essays.txt`.

Refer to Greg's source notebook:  
https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb

***