## Optimal Chunk-size for Large Document Summarization

### Large document summarization

In [2]:
from chunker import naive_chunker, auto_chunker, get_token_size
from utils import get_chunk_summary, get_global_summary
import textwrap

OPENAI_KEY = "your key here"  # replace with your own OpenAI API key
MODEL = 'gpt-3.5-turbo'

# Load the test document and report its size in tokens for the chosen model
with open('./documents/startupideas.txt', 'r') as file:
    test_document = file.read()

print('The document has a size of:', get_token_size(test_document, MODEL))

The document has a size of: 9160


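#### Biased global summarization with the naive chunking method
We intentionally chose a chunk size of 3000 tokens to highlight the problem: the 9160-token document does not divide evenly into chunks of that size, so the final chunk ends up much smaller than the others.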

In [3]:
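CHUNK_SIZE = 3000
# Split the document into fixed-size chunks, summarize each chunk,
# then combine the chunk summaries into one global summary
naive_chunks = naive_chunker(test_document, CHUNK_SIZE, MODEL)
naive_chunk_summaries = [get_chunk_summary(chunk, OPENAI_KEY, MODEL) for chunk in naive_chunks]
naive_global_summary = get_global_summary(naive_chunk_summaries, OPENAI_KEY, MODEL)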

In [9]:
print('chunk size list:',[get_token_size(chunk, MODEL) for chunk in naive_chunks], '\n')
print('last chunk text:',naive_chunks[-1], '\n')

chunk size list: [3000, 3000, 3000, 160] 

last chunk text:  already exist.  Anything that got built this way would be
very promising, because such users are not just the most demanding
but also the perfect point to spread from.I have no idea whether this would work.[17]
And the reason it used a TV for a monitor is that Steve Wozniak
started out by solving his own problems.  He, like most of his
peers, couldn't afford a monitor.Thanks to Sam Altman, Mike Arrington, Paul Buchheit, John Collison,
Patrick Collison, Garry Tan, and Harj Taggar for reading drafts of
this, and Marc Andreessen, Joe Gebbia, Reid Hoffman, Shel Kaphan,
Mike Moritz and Kevin Systrom for answering my questions about
startup history. 



**The final chunk contains only 160 tokens (under 2% of the document), so ideally it should contribute very little to the global summary.**
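
The chunk sizes printed above follow directly from fixed-size slicing. The sketch below reproduces them from the token counts alone. `fixed_chunk_sizes` is a hypothetical helper written for illustration; it is not the `naive_chunker` implementation, which is not shown in this notebook.

```python
def fixed_chunk_sizes(total_tokens: int, chunk_size: int) -> list[int]:
    """Sizes obtained by cutting a document into fixed-size chunks.

    Hypothetical helper for illustration only; it works on token counts,
    not on the text itself.
    """
    if total_tokens < 0 or chunk_size <= 0:
        raise ValueError("total_tokens must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(total_tokens, chunk_size)
    sizes = [chunk_size] * full_chunks
    if remainder:
        # Whatever is left over becomes a (possibly tiny) final chunk
        sizes.append(remainder)
    return sizes


print(fixed_chunk_sizes(9160, 3000))  # [3000, 3000, 3000, 160]
```

Whenever the document size is not a multiple of the chunk size, the leftover tokens form a final chunk that can be arbitrarily small.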

In [5]:
print('global summary:', textwrap.fill(naive_global_summary, 80))

global summary: The document discusses the importance of understanding user needs and creating
products that fulfill those needs in order to build successful startups. It
emphasizes the danger of building products that no one wants and the need to
focus on a specific group of users who urgently need the product. The chunk also
mentions the importance of being at the leading edge of a rapidly changing field
and having a mind that is prepared to notice opportunities for startup ideas. It
suggests that startup ideas should come from the founders' own experiences and
that the most successful startups begin organically. The document also discusses
the process of coming up with startup ideas, emphasizing the importance of
living in the future and identifying what is missing in the present. It advises
against focusing too much on entrepreneurship education and instead recommends
gaining knowledge in different fields to identify problems that software can
solve. The document provides advice an

**However, the generated global summary gives undue emphasis to the content of this final chunk:**

>"It explains why Steve Wozniak used a TV as a monitor when starting out. The document concludes by encouraging individuals to question the status quo and look for things that are missing in order to find startup ideas."
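
One way to reason about this bias: if every chunk summary carries roughly equal weight in the input to the global summary (an assumption, since the individual summary lengths are not shown here), each chunk's share of that input is 1/n regardless of how many tokens it contains. The hypothetical helper `overrepresentation` below compares that share with the chunk's share of the document's tokens.

```python
def overrepresentation(chunk_sizes: list[int]) -> list[float]:
    """Ratio of each chunk's assumed share of the summaries (1/n)
    to its actual share of the document's tokens.

    A ratio above 1 means the chunk is weighted more heavily than its
    size warrants. Hypothetical helper; assumes equal-weight summaries.
    """
    if not chunk_sizes or any(size <= 0 for size in chunk_sizes):
        raise ValueError("chunk_sizes must be non-empty and all positive")
    total = sum(chunk_sizes)
    n = len(chunk_sizes)
    return [(1 / n) / (size / total) for size in chunk_sizes]


sizes = [3000, 3000, 3000, 160]
for size, ratio in zip(sizes, overrepresentation(sizes)):
    print(f"{size:>5} tokens -> weight ratio {ratio:.2f}")
```

Under that assumption, the 160-token chunk is weighted roughly 14 times more heavily than its size warrants, while each 3000-token chunk is slightly underweighted.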

#### Chunking with automatic chunk size determination

We reuse the same value, 3000, as `MAX_CHUNK_SIZE`, which here acts as an upper bound on the chunk size rather than a fixed size.

In [3]:
MAX_CHUNK_SIZE=3000
auto_chunks = auto_chunker(test_document, MAX_CHUNK_SIZE, MODEL)
auto_chunk_summaries = [get_chunk_summary(chunk, OPENAI_KEY, MODEL) for chunk in auto_chunks]
auto_global_summary = get_global_summary(auto_chunk_summaries, OPENAI_KEY, MODEL)

In [4]:
print('chunk size list:',[get_token_size(chunk, MODEL) for chunk in auto_chunks],'\n')

chunk size list: [2290, 2290, 2290, 2290] 



**We can see the chunk sizes are more balanced with our method.**
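
The balanced sizes are consistent with a simple rule: use the fewest chunks that respect `MAX_CHUNK_SIZE`, then spread the tokens evenly across them. The sketch below is an assumption about such a rule, not the `auto_chunker` implementation, which is not shown in this notebook; `balanced_chunk_sizes` is a hypothetical name.

```python
def balanced_chunk_sizes(total_tokens: int, max_chunk_size: int) -> list[int]:
    """Split total_tokens into the fewest chunks of at most max_chunk_size,
    with sizes differing by at most one token.

    Hypothetical helper for illustration only; it works on token counts.
    """
    if total_tokens <= 0 or max_chunk_size <= 0:
        raise ValueError("total_tokens and max_chunk_size must be positive")
    # Integer ceiling division: the minimum number of chunks needed
    n_chunks = (total_tokens + max_chunk_size - 1) // max_chunk_size
    base, extra = divmod(total_tokens, n_chunks)
    # The first `extra` chunks take one additional token each
    return [base + 1] * extra + [base] * (n_chunks - extra)


print(balanced_chunk_sizes(9160, 3000))  # [2290, 2290, 2290, 2290]
```

With this rule no chunk exceeds the maximum, and no chunk is much smaller than the rest, so each chunk summary covers a comparable amount of the source text.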

In [5]:
print('global summary:', textwrap.fill(auto_global_summary, 80))

global summary: The author reflects on their experience starting a company and emphasizes the
importance of paying attention to users' needs. They discuss the danger of
"made-up" startup ideas that sound plausible but do not have a market demand.
The author suggests that good startup ideas are those that address a specific
group or type of user and have a path for growth. They also mention that being
at the leading edge of a field or having experiences that prepare the mind to
notice opportunities can lead to successful startup ideas. The author concludes
by stating that preparation and being in the right mindset are key factors in
generating organic startup ideas.  The document chunk discusses the process of
finding startup ideas. It emphasizes the importance of being at the leading edge
of a rapidly changing field and noticing things that are missing. It suggests
questioning the status quo and focusing on problems that annoy or challenge you.
The document also mentions the benefits o

**The global summary has also improved compared to the naive method!**