## Level 1: Character Splitting

##### Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.
##### This method isn't recommended for any application, but it's a great starting point for understanding the basics.

- **Pros**: Easy & Simple.
- **Cons**: Very rigid and doesn't take into account the structure of your text. 

Concepts to Know

- **Chunk Size** - The number of characters you would like in your chunks: 50, 100, 10000, etc.
- **Chunk Overlap** - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.

First, let's get some sample text



In [1]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

Then let's split this text manually

In [2]:
# Create a list that will hold your chunks
chunks=[]
chunk_size=35 # characters
# Step through the text from start to end, advancing chunk_size characters each time

for i in range(0,len(text),chunk_size):
    chunk=text[i:i+chunk_size]
    chunks.append(chunk)
chunks


['This is the text I would like to ch',
 'unk up. It is the example text for ',
 'this exercise']

Congratulations! You just split your first text. We have a long way to go, but you are already making progress. Feel like a language model practitioner yet?

When working with text in the language model world, we don't deal with raw strings. It is more common to work with documents. Documents are objects that hold the text you are concerned with, but also additional metadata which makes filtering and manipulation easier later.
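To make the idea of a document concrete, here is a minimal sketch. `SimpleDocument` is a hypothetical stand-in written for illustration (it is not LangChain's class), and the `source` and `chunk` metadata keys are assumptions. It only shows why carrying metadata next to the text is handy.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    SimpleDocument("This is the text I would like to ch", {"source": "example.txt", "chunk": 0}),
    SimpleDocument("unk up. It is the example text for ", {"source": "example.txt", "chunk": 1}),
    SimpleDocument("Unrelated text", {"source": "other.txt", "chunk": 0}),
]

# Metadata lets us filter documents later without inspecting the text itself
from_example = [d for d in docs if d.metadata["source"] == "example.txt"]
print(len(from_example))
```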

We could convert our list of strings into documents, but I'd rather start from scratch and create the docs.

Let's load up LangChain's `CharacterTextSplitter` to do this for us

In [3]:
from langchain.text_splitter import CharacterTextSplitter

Then let's load up this text splitter. I need to specify `chunk_overlap` and `separator` or else we will get funky results. We will get into those next.

In [4]:
text_splitter=CharacterTextSplitter(separator='',chunk_size=chunk_size,strip_whitespace=False,chunk_overlap=0)

Then we can actually split our text via `create_documents`. Note: `create_documents` expects a list of texts, so if you have a string (like we do)
you will need to wrap it in `[]`

In [5]:
text_splitter.create_documents([text])

[Document(page_content='This is the text I would like to ch'),
 Document(page_content='unk up. It is the example text for '),
 Document(page_content='this exercise')]

Notice how this time we have the same chunks, but they are in documents. These will play nicely with the rest of the LangChain world. Also notice that the trailing whitespace at the end of the 2nd chunk is still there. By default LangChain strips it, but we passed `strip_whitespace=False` to keep it.

**Chunk Overlap & Separator**

**Chunk Overlap** will blend together our chunks so that the tail of chunk #1 will be the same thing as the head of chunk #2, and so on and so forth.

This time I will load up my overlap with a value of 4, which means 4 characters of overlap.
 

In [6]:
text_splitter=CharacterTextSplitter(chunk_size=35,chunk_overlap=4,separator='')

In [7]:
text_splitter.create_documents([text])

[Document(page_content='This is the text I would like to ch'),
 Document(page_content='o chunk up. It is the example text'),
 Document(page_content='ext for this exercise')]

Notice how we have the same chunks, but now there is overlap between 1 & 2 and 2 & 3. The 'o ch' on the tail of Chunk #1 matches the 'o ch' of the head of Chunk #2.
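The overlap can also be sketched by hand as a sliding window. This is a minimal illustration, not LangChain's implementation: `chunk_with_overlap` is a hypothetical helper that advances by `chunk_size - chunk_overlap` characters each step and strips whitespace from each chunk.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Naive sliding-window chunker (hypothetical helper, not LangChain's code)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        # Take a window of chunk_size characters and trim surrounding whitespace
        chunks.append(text[start:start + chunk_size].strip())
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"
overlap_chunks = chunk_with_overlap(text, chunk_size=35, chunk_overlap=4)
for chunk in overlap_chunks:
    print(repr(chunk))
```

Each window starts 31 characters after the previous one, so the last 4 characters of one chunk are repeated at the start of the next.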


I wanted a better way to visualize this, so I made [ChunkViz.com](www.chunkviz.com) to help show it. Here's what the same text looks like.

<div style="text-align: center;">
    <img src="static/ChunkVizCharacter34_4_w_overlap.png" alt="image" style="max-width: 400px;">
</div>


Check out how we have three colors, with two overlapping sections.

**Separators** are character sequences you would like to split on. Say you wanted to chunk your data at `ch`; you can specify it.
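As a rough sketch of the idea (plain Python, not LangChain's code), splitting on a separator means cutting the text wherever the separator occurs, then merging neighbouring pieces back together while they still fit in the chunk size. `split_on_separator` is a hypothetical helper, and in this sketch a piece longer than `chunk_size` is simply kept whole.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on `separator`, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        # Re-insert the separator only between pieces merged into the same chunk
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"
sep_chunks = split_on_separator(text, separator="ch", chunk_size=35)
for chunk in sep_chunks:
    print(repr(chunk))
```

The only `ch` in the sample text is in "chunk", so the text is cut there and the separator itself is dropped at the chunk boundary.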

## Level 2: Recursive Character Text Splitting

Let's jump up a level of complexity.

The problem with Level #1 is that we don't take into account the structure of our document at all. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we'll specify a series of separators which will be used to split our docs.

You can see the default separators for LangChain here. Let's take a look at them one by one.

* "\n\n" - Double new line, or most commonly paragraph breaks
* "\n" - New lines
* " " - Spaces
* "" - Characters


I'm not sure why a period (".") isn't included on the list, perhaps it is not universal enough? If you know, let me know.

This is the swiss army knife of splitters and my first choice when mocking up a quick application. If you don't know which splitter to start with, this is a good first bet.

Let's try it out

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter 

Then let's load up a larger piece of text

In [9]:
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

Now let's make our text splitter

In [10]:
text_splitter=RecursiveCharacterTextSplitter(
    chunk_size=65,
    chunk_overlap=0  
)

In [11]:
text_splitter.create_documents([text])

[Document(page_content="One of the most important things I didn't understand about the"),
 Document(page_content='world when I was a child is the degree to which the returns for'),
 Document(page_content='performance are superlinear.'),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(page_content='meant well, but this is rarely true. If your product is only'),
 Document(page_content="half as good as your competitor's, you don't get half as many"),
 Document(page_content='customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are"),
 Document(page_content='superlinear in business. Some think this is a flaw of'),
 Document(page_content='capitalism, and that if we changed the rules it would stop being'),
 Document(page_content='true. But superlinear returns for

Notice how now there are more chunks that end with a period ".". This is because those likely are the end of a paragraph and the splitter first looks for double new lines (paragraph break).

Once paragraphs are split, then it looks at the chunk size, if a chunk is too big, then it'll split by the next separator. If the chunk is still too big, then it'll move onto the next one and so forth.
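The procedure described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it does not strip whitespace.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones when a piece is too big."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # This piece alone is too big, so try the next (finer) separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=7))
```

The first paragraph fits as-is, while the second one is too long and falls through to the space separator.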

For text of this size, let's split on something bigger.

In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)
text_splitter.create_documents([text])

[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]

For this text, 450 splits the paragraphs perfectly. You can even switch the chunk size to 469 and get the same splits. This is because the chunk size acts as an upper limit rather than an exact target, which gives the splitter some cushion and wiggle room to let your chunks 'snap' to the nearest separator.
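One way to reason about which chunk sizes keep paragraphs whole is to measure the paragraphs. The sketch below uses a short made-up sample and a hypothetical helper, `paragraph_lengths`; it assumes paragraphs are separated by a double new line and that neighbouring paragraphs get merged only when both, plus the separator, fit in the chunk size.

```python
def paragraph_lengths(text):
    """Return the character length of each paragraph (split on double new lines)."""
    return [len(p) for p in text.strip().split("\n\n") if p.strip()]


sample = "Short one.\n\nA somewhat longer second paragraph.\n\nThird."
lengths = paragraph_lengths(sample)

# A chunk size must be at least as large as the longest paragraph...
smallest_size = max(lengths)
# ...and smaller than the shortest merge of two neighbours (plus the 2-character separator)
first_merge_size = min(a + 2 + b for a, b in zip(lengths, lengths[1:]))

print(lengths, smallest_size, first_merge_size)
```

Under those assumptions, any chunk size from `smallest_size` up to (but not including) `first_merge_size` should give roughly one paragraph per chunk.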

Let's view this visually

<div style="text-align: center;">
    <img src="static/ChunkVizCharacterRecursive.png" alt="image" style="max-width: 300px;">
</div>

Wow - you already made it to level 2, awesome! We're on a roll. 

## Level 3: Document Specific Splitting

Stepping up our levels ladder, let's start to handle document types other than normal prose in a .txt file. What if you have pictures? Or a PDF? Or code snippets?

Our first two levels wouldn't work great for this, so we will need to find a different tactic.

This level is all about making your chunking strategy fit your different data formats. Let's run through a bunch of examples of this in action.

The Markdown, Python, and JS splitters will basically be similar to Recursive Character, but with different separators.

See all of LangChain's document splitters here.

### Markdown

**Separators**

The separators are:

 - \n#{1,6} - Split by new lines followed by a header (H1 through H6)
 - ```\n - Code blocks
 - \n\\*\\*\\*+\n - Horizontal Lines
 - \n---+\n - Horizontal Lines
 - \n__+\n - Horizontal Lines
 - \n\n - Double new lines
 - " " - Spaces
 - "" - Characters
 

In [13]:
from langchain.text_splitter import MarkdownTextSplitter

In [14]:
spliter=MarkdownTextSplitter(chunk_size=40,chunk_overlap=0)

In [15]:
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [16]:
spliter.create_documents([markdown_text])

[Document(page_content='# Fun in California\n\n## Driving'),
 Document(page_content='Try driving on the 1 down to San Diego'),
 Document(page_content='### Food'),
 Document(page_content="Make sure to eat a burrito while you're"),
 Document(page_content='there'),
 Document(page_content='## Hiking\n\nGo to Yosemite')]

Notice how the splits gravitate towards markdown sections. However, it's still not perfect. Check out how there is a chunk with just "there" in it. You'll run into this at low-sized chunks.
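To see why header separators pull the splits towards sections, here is a small sketch that splits on the header separator alone, using Python's `re` module. `split_on_headers` is a hypothetical helper for illustration; unlike the splitter above, it ignores chunk size and the other separators entirely.

```python
import re


def split_on_headers(markdown):
    """Cut the text at every new line that is followed by a Markdown header (H1 to H6)."""
    parts = re.split(r"\n(?=#{1,6} )", markdown.strip())
    return [p.strip() for p in parts if p.strip()]


markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

sections = split_on_headers(markdown_text)
for section in sections:
    print(repr(section))
```

Each header stays attached to the text beneath it, which is the behavior the chunk-size limit then trades off against.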

### Python

See the Python separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1069)

* `\nclass` - Classes first
* `\ndef` - Functions next
* `\n\tdef` - Indented functions
* `\n\n` - Double New lines
* `\n` - New Lines
* `" "` - Spaces
* `""` - Characters


Let's load up our splitter

In [17]:
from langchain.text_splitter import PythonCodeTextSplitter

In [18]:
python_text= """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

In [19]:
python_splitter = PythonCodeTextSplitter(chunk_size=50, chunk_overlap=0)

In [20]:
python_splitter.create_documents([python_text])

[Document(page_content='class Person:\n  def __init__(self, name, age):'),
 Document(page_content='self.name = name\n    self.age = age'),
 Document(page_content='p1 = Person("John", 36)'),
 Document(page_content='for i in range(10):\n    print (i)')]

Check out how, with a chunk size of 50, the class definition is split across the first two documents (not ideal), while the rest of the code lands in documents of its own (ok).

You'll likely need to play with the chunk size to get a clean result, such as a size large enough to keep the whole class in a single document. That is why using evaluations to determine optimal chunk sizes is crucial.
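If keeping each top-level definition intact matters more than hitting a character budget, a different technique is to let Python parse the code. The sketch below uses the standard library `ast` module and a hypothetical helper, `split_top_level`. It is not what `PythonCodeTextSplitter` does, it ignores chunk size, and it drops comments and blank lines between statements.

```python
import ast

python_text = '''
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
'''


def split_top_level(source):
    """Return the source text of each top-level statement (class, function, assignment, loop...)."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


segments = split_top_level(python_text)
for segment in segments:
    print(repr(segment))
```

Here the whole class comes back as one segment, followed by the assignment and the loop.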

### JS

Very similar to Python. See the separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L983).

Separators:
* `\nfunction` - Indicates the beginning of a function declaration
* `\nconst` - Used for declaring constant variables
* `\nlet` - Used for declaring block-scoped variables
* `\nvar` - Used for declaring a variable
* `\nclass` - Indicates the start of a class definition
* `\nif` - Indicates the beginning of an if statement
* `\nfor` - Used for for-loops
* `\nwhile` - Used for while-loops
* `\nswitch` - Used for switch statements
* `\ncase` - Used within switch statements
* `\ndefault` - Also used within switch statements
* `\n\n` - Indicates a larger separation in text or code
* `\n` - Separates lines of code or text
* `" "` - Separates words or tokens in the code
* `""` - Makes every character a separate element

In [21]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

In [22]:
javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

In [23]:
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=65, chunk_overlap=0
)

In [24]:
js_splitter.create_documents([javascript_text])

[Document(page_content='// Function is called, the return value will end up in x'),
 Document(page_content='let x = myFunction(4, 3);'),
 Document(page_content='function myFunction(a, b) {'),
 Document(page_content='// Function returns the product of a and b\n  return a * b;\n}')]

### PDFs w/ tables

Ok now things will get a bit spicier.

PDFs are an extremely common data type for language model work. Often they'll contain tables that contain information.

This could be financial data, studies, academic papers, etc.

Trying to split tables by a character based separator isn't reliable. We need to try out a different method. For a deep dive on this I recommend checking out [Lance Martin's](https://twitter.com/RLanceMartin) [tutorial](https://twitter.com/RLanceMartin/status/1721942636364456336) w/ LangChain.

I'll be going through a text-based method. [Mayo](https://twitter.com/mayowaoshin) has also outlined a GPT-4V method which tries to pull tables via vision rather than text. You can check it out [here](https://twitter.com/mayowaoshin/status/1727399231734886633).

A very convenient way to do this is with [Unstructured](https://unstructured.io/), a library dedicated to making your data LLM ready.

In [25]:
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json
from pathlib import Path

Let's load up our PDF and then partition it. This is a PDF from a [Salesforce earnings report](https://investor.salesforce.com/financials/default.aspx).

In [26]:
filename = Path("static/SalesforceFinancial.pdf")

# Extract the elements from the PDF

elements=partition_pdf(
    filename=filename,
    # Unstructured helpers
    strategy='hi_res',
    infer_table_structure=True,
    model_name='yolox'
)

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?


Let's look at our elements

In [None]:
elements

These are just unstructured objects, we could look at them all but I want to look at the table it parsed.

In [None]:
elements[-4].metadata.text_as_html