# Review

Recall that last time we discussed document loading.

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [None]:
from langchain.document_loaders import TextLoader
loader = TextLoader('docs/cs229_lecture1_whisper_transcript.txt', encoding='utf8')
text = loader.load()

# Text splitting - Characters and Tokens

LLMs have limited context windows, meaning only a limited number of tokens can be passed into the model in a single prompt.

At the larger end, Anthropic's Claude model has a 100k token context window.

But ChatGPT (gpt-3.5) has only 4096 tokens.

Tokens are often [~4 characters](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

So, let's estimate the token count of the PDF we loaded by dividing its character count by 4!

In [None]:
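# Collect the text of every page and join it into one string
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)
# Rough token estimate: ~4 characters per token
print(len(joined_page_text) / 4)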

In [None]:
print(len(text[0].page_content)/4)
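The two cells above apply the ~4 characters per token rule of thumb by hand. As a minimal sketch, the same estimate can be wrapped in a helper and compared against a context window. The names `estimate_tokens` and `fits_in_context` are hypothetical, and the result is only an approximation, not an exact tokenizer count.

```python
# Rule of thumb from the text above: roughly 4 characters per token (approximate).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token count using the ~4 characters/token heuristic."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(text: str, context_window: int) -> bool:
    """True if the estimated token count is within `context_window` tokens."""
    return estimate_tokens(text) <= context_window


sample = "a" * 20000  # stand-in for the text of a loaded document
print(estimate_tokens(sample))           # 5000
print(fits_in_context(sample, 4096))     # False
print(fits_in_context(sample, 100_000))  # True
```

In the notebook, `sample` could be replaced by `joined_page_text` or `text[0].page_content`.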

In [None]:
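from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 1500
chunk_overlap = 150
# Pass both values as keyword arguments; the first positional parameter is not chunk_size
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)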

Note that we define two important parameters:
    
`chunk_size = 1500`

`chunk_overlap = 150`

These are the maximum length of each chunk (in characters) and the number of characters shared between consecutive chunks.

Let's try it!

In [None]:
splits = text_splitter.split_text(text[0].page_content)
print(len(splits))
print(len(splits[0]))

Good, so we can see that the first chunk stays within the expected chunk size.

Later, we will discuss how to pick the best chunk_size and overlap.

For now, the selections shown above (1500 and 150) are reasonable default values.

In [None]:
splits[0]
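To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it also tries to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` is hypothetical and only illustrates what the two parameters mean.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into chunks of at most `chunk_size` characters,
    where consecutive chunks share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(demo, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the overlap that helps keep context across chunk boundaries.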

We can also split on [token count explicitly](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token), if we want.

This can be useful because LLMs often have context windows [designated in tokens](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

In [None]:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(text[0].page_content)
print(texts[0])
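The idea behind token-based splitting can be sketched without any library by counting tokens instead of characters. The example below assumes whitespace-separated words as a stand-in for tokens; a real tokenizer uses subword units, so the chunks it produces will differ. The function name `split_by_tokens` is hypothetical.

```python
from typing import List


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Group whitespace-separated words (a stand-in for real tokens)
    into chunks of at most `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()  # assumption: one word == one token
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks


sentence = "the quick brown fox jumps over the lazy dog again and again"
token_chunks = split_by_tokens(sentence, chunk_size=5)
print(token_chunks)
# ['the quick brown fox jumps', 'over the lazy dog again', 'and again']
```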

# Text splitting - Preserving context

Sometimes we want to preserve the local context of certain text groups.

For example, a markdown file is organized by headers. 

We might want to only split text within certain header groups to keep them together.

[These notes](https://www.pinecone.io/learn/chunking-strategies/) from Pinecone provide additional motivation:

```
When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text. Larger input text sizes, on the other hand, may introduce noise or dilute the significance of individual sentences or phrases, making finding precise matches when querying the index more difficult.
```

In particular, if we mix chunks across header groups, we may degrade the retrieval quality.

We can use `MarkdownHeaderTextSplitter` to [preserve header metadata](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata.ipynb) in our chunks, as shown below.

In [28]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [29]:
loader = TextLoader('docs/Notion_DB/Blendle.md')
text = loader.load()

In [30]:
# Split on the specified headers so that each split keeps its header group as metadata
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(text[0].page_content)
md_header_splits[0]

{'content': "This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've made this document public because we want to learn from you. We'

Within each markdown group we can then apply any splitter we want, e.g., `TokenTextSplitter` or `RecursiveCharacterTextSplitter`.

In [36]:
# Define our text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 100
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
 
# Create splits within each header group and combine them!
all_splits = []
all_metadatas = []
for header_group in md_header_splits:
    _splits = text_splitter.split_text(header_group['content'])
    _metadatas = [header_group['metadata'] for _ in _splits]
    all_splits += _splits
    all_metadatas += _metadatas

In [38]:
all_splits[0]

"This is a living document with everything we've learned working with people while running a startup."

In [40]:
all_metadatas[0]

{'Header 1': "Blendle's Employee Handbook"}
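The header-grouping step can be illustrated with a small pure-Python sketch. This is a simplified stand-in, not the implementation of `MarkdownHeaderTextSplitter`: the function name `split_markdown_by_headers` and the sample document are hypothetical, and the sketch ignores edge cases such as `#` characters inside fenced code blocks.

```python
from typing import Dict, List, Tuple


def split_markdown_by_headers(
    markdown: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[dict]:
    """Group markdown lines under their enclosing headers and
    attach the active headers to each group as metadata."""
    groups: List[dict] = []
    current_meta: Dict[str, str] = {}  # headers currently in effect
    current_lines: List[str] = []      # body lines collected since the last header

    def flush() -> None:
        content = "\n".join(current_lines).strip()
        if content:
            groups.append({"content": content, "metadata": dict(current_meta)})
        current_lines.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        matched = None
        for marker, name in headers_to_split_on:
            # "# " does not match "## Title", so each level is matched exactly
            if stripped.startswith(marker + " "):
                matched = (marker, name)
                break
        if matched is None:
            current_lines.append(line)
            continue

        flush()  # a new header closes the previous group
        marker, name = matched
        # A new header replaces any header at the same or a deeper level
        for other_marker, other_name in headers_to_split_on:
            if len(other_marker) >= len(marker):
                current_meta.pop(other_name, None)
        current_meta[name] = stripped[len(marker):].strip()

    flush()
    return groups


doc = """# Handbook
Intro text.
## Hiring
How we hire.
## Firing
How we part ways.
"""
headers = [("#", "Header 1"), ("##", "Header 2")]
groups = split_markdown_by_headers(doc, headers)
for group in groups:
    print(group)
```

Each group carries the headers it sits under, so a later character or token splitter can copy that metadata onto every sub-chunk, as the loop over `md_header_splits` does above.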