# Chunking

In [None]:
!pip install langchain_community langchain_text_splitters

In [13]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

## Loading Document

In [4]:
loader = TextLoader("/content/data/guidetoinvestors.txt")
document = loader.load()
len(document)

1

In [12]:
text = document[0].page_content
print(text[:1111])

April 2007(This essay is derived from a keynote talk at the 2007 ASES Summit
at Stanford.)The world of investors is a foreign one to most hackers—partly
because investors are so unlike hackers, and partly because they
tend to operate in secret.  I've been dealing with this world for
many years, both as a founder and an investor, and I still don't
fully understand it.In this essay I'm going to list some of the more surprising things
I've learned about investors.  Some I only learned in the past year.Teaching hackers how to deal with investors is probably the second
most important thing we do at Y Combinator.  The most important
thing for a startup is to make something good.  But everyone knows
that's important.  The dangerous thing about investors is that
hackers don't know how little they know about this strange world.1. The investors are what make a startup hub.About a year ago I tried to figure out what you'd need to reproduce
Silicon Valley.  I decided the 
critical ingredients were

## Fixed Size Chunking

In [14]:
fixed_chunking = CharacterTextSplitter(
    separator="",
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False
)

In [16]:
fixed_chunks = fixed_chunking.create_documents([text])
len(fixed_chunks), type(fixed_chunks)

(88, list)

In [24]:
for i, chunk in enumerate(fixed_chunks[:5]):
    print(f"Chunk {i+1} length: {len(chunk.page_content)}")
    print(f"Chunk {i+1} content preview: {chunk.page_content[:200]}...")

Chunk 1 length: 500
Chunk 1 content preview: April 2007(This essay is derived from a keynote talk at the 2007 ASES Summit
at Stanford.)The world of investors is a foreign one to most hackers—partly
because investors are so unlike hackers, and pa...
Chunk 2 length: 498
Chunk 2 content preview: some of the more surprising things
I've learned about investors.  Some I only learned in the past year.Teaching hackers how to deal with investors is probably the second
most important thing we do at ...
Chunk 3 length: 499
Chunk 3 content preview: know about this strange world.1. The investors are what make a startup hub.About a year ago I tried to figure out what you'd need to reproduce
Silicon Valley.  I decided the 
critical ingredients were...
Chunk 4 length: 499
Chunk 4 content preview: Not because they contribute more to the startup, but simply
because they're least willing to move.  They're rich.  They're not
going to move to Albuquerque just because there are some smart
hackers th...
Chun

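*If a non-empty `separator` is passed, the text is first split on that separator and the pieces are then merged up to `chunk_size`. A single piece that is already longer than `chunk_size` is kept whole, so chunks can exceed the limit.*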
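The split-then-merge behaviour can be illustrated with a small pure-Python sketch. This is a simplified imitation written for this notebook, not LangChain's actual implementation: the helper name `split_then_merge` is hypothetical, and overlap handling is left out to keep it short.

```python
def split_then_merge(text, separator, chunk_size):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    Simplified illustration only: overlap is not handled here.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo_text = "aaa,bbb," + "c" * 12 + ",dd"
demo_chunks = split_then_merge(demo_text, separator=",", chunk_size=7)
for c in demo_chunks:
    print(len(c), repr(c))
```

The middle piece has no comma inside it, so it cannot be broken down further and ends up longer than the requested size of 7. The same effect shows up in the 759-character chunk in the output of the next cells.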

In [27]:
fixed_chunking = CharacterTextSplitter(
    separator=",",
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False
)

In [28]:
fixed_chunks = fixed_chunking.create_documents([text])
len(fixed_chunks), type(fixed_chunks)



(90, list)

In [29]:
for i, chunk in enumerate(fixed_chunks[:5]):
    print(f"Chunk {i+1} length: {len(chunk.page_content)}")
    print(f"Chunk {i+1} content preview: {chunk.page_content[:200]}...")

Chunk 1 length: 329
Chunk 1 content preview: April 2007(This essay is derived from a keynote talk at the 2007 ASES Summit
at Stanford.)The world of investors is a foreign one to most hackers—partly
because investors are so unlike hackers, and pa...
Chunk 2 length: 759
Chunk 2 content preview: and I still don't
fully understand it.In this essay I'm going to list some of the more surprising things
I've learned about investors.  Some I only learned in the past year.Teaching hackers how to dea...
Chunk 3 length: 156
Chunk 3 content preview: and all the other people will move.If I had to narrow that down, I'd say investors are the limiting
factor.  Not because they contribute more to the startup...
Chunk 4 length: 486
Chunk 4 content preview: I'd say investors are the limiting
factor.  Not because they contribute more to the startup, but simply
because they're least willing to move.  They're rich.  They're not
going to move to Albuquerque ...
Chunk 5 length: 467
Chunk 5 content preview: and

**So, the points to remember are:**  
- With an empty separator, chunking is based purely on character count, so words may be split in the middle.
- If a separator is passed, splitting on the separator takes priority over the chunk length, so chunks can end up shorter or longer than `chunk_size`.
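The first point is easy to reproduce without any library. The sketch below is a plain sliding window over characters; the function name `fixed_size_chunks` is hypothetical and this is only an approximation of what the splitter does with `separator=""`. LangChain also appears to trim surrounding whitespace by default, which would explain lengths such as 498 and 499 above.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


window_text = "investors are unlike hackers"
window_chunks = fixed_size_chunks(window_text, chunk_size=10, chunk_overlap=3)
for c in window_chunks:
    print(repr(c))
```

The windows ignore word boundaries entirely, so "investors" and "unlike" are both cut apart, and the last three characters of each chunk reappear at the start of the next one.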

## Recursive Chunking

In [30]:
recursive_chunking = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " "],  # Splits on these in order
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

In [34]:
recursive_chunks = recursive_chunking.create_documents([text])
len(recursive_chunks), type(recursive_chunks)

(45, list)

In [35]:
print("\nRecursive chunks:")
for i, chunk in enumerate(recursive_chunks[:10]):
    print(f"Chunk {i+1} length: {len(chunk.page_content)}")
    print(f"Chunk {i+1} content preview: {chunk.page_content[:200]}...")


Recursive chunks:
Chunk 1 length: 973
Chunk 1 content preview: April 2007(This essay is derived from a keynote talk at the 2007 ASES Summit
at Stanford.)The world of investors is a foreign one to most hackers—partly
because investors are so unlike hackers, and pa...
Chunk 2 length: 999
Chunk 2 content preview: Silicon Valley.  I decided the 
critical ingredients were rich people
and nerds—investors and founders.  People are all you need to
make technology, and all the other people will move.If I had to narr...
Chunk 3 length: 976
Chunk 3 content preview: companies that VCs invest in would never have made it that far if angels
hadn't invested first.  VCs say between half and three quarters of
companies that raise series A rounds have taken some outside...
Chunk 4 length: 957
Chunk 4 content preview: to reward.  So the most successful startup of all is likely to have
seemed an extremely risky bet at first, and that is exactly the
kind VCs won't touch.Where do angel investors come from? 

**Now, as you can see:**   
- With these separators, it never cuts a word in the middle, neither at chunk boundaries nor in the overlapping regions.
- It recursively breaks the text on the passed separators, in order, moving to the next separator only when a piece is still too long.
- Thus, it prioritizes the structure of the text *(but not its meaning, obviously)*.
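The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's code: `recursive_split` is a hypothetical helper, overlap is omitted, and every separator is assumed to be a non-empty string.

```python
def recursive_split(text, separators, chunk_size):
    """Try the coarsest separator first; use finer ones only for pieces
    that are still longer than `chunk_size`."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or nothing left to split on: keep the text whole.
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush what we have, then recurse with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


rec_text = "one two three\nfour five six seven\n\neight nine"
rec_chunks = recursive_split(rec_text, ["\n\n", "\n", " "], chunk_size=12)
for c in rec_chunks:
    print(repr(c))
```

Because the finest separator here is a space, the smallest unit the sketch can produce is a whole word. A word longer than `chunk_size` would be returned intact rather than cut.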

In [None]:
type(vars(recursive_chunks[0]))