In [2]:
%%capture
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

In [3]:
from chunking_evaluation.chunking import RecursiveTokenChunker

In [4]:
import os
import tiktoken  # used by analyze_chunks when use_tokens=True

In [5]:
# Read the whole transcript; the context manager closes the file afterwards
with open('../cleaned_earnings_call.txt', 'r', encoding="utf-8") as file:
    document = file.read()
print("First 1000 Characters", document[:1000])

First 1000 Characters Earnings Call AnalysisQ1-2024 AnalysisAmazon.com IncAmazon Q1 2024 Earnings SnapshotAmazon reported $143.3 billion in revenue, up 13%, and operating income of $15.3 billion, a 221% increase. Free cash flow over the last 12 months was $48.8 billion. North America revenue grew 12%, and international revenue grew 11%. Advertising revenue rose 24%, and AWS revenue jumped 17% to $25 billion. Amazon also anticipates Q2 net sales between $144 billion and $149 billion, with operating income ranging from $10 billion to $14 billion. Despite economic uncertainty, Amazon continues to focus on selection, price, and convenience to drive growth and profitability.Stellar Growth and Operational Excellence Amazon reported impressive financial results for Q1 2024, showcasing its ability to blend revenue growth with operational efficiency. The company recorded $143.3 billion in revenue, a 13% year-over-year increase, and operating income surged by 221% to $15.3 billion. This leap hig

In [6]:
def analyze_chunks(chunks, use_tokens=False):
    # Print the chunks of interest
    print("\nNumber of Chunks:", len(chunks))
    print("\n", "="*50, "39th Chunk", "="*50, "\n", chunks[38])
    print("\n", "="*50, "40th Chunk", "="*50, "\n", chunks[39])

    chunk1, chunk2 = chunks[38], chunks[39]

    if use_tokens:
        encoding = tiktoken.get_encoding("cl100k_base")
        tokens1 = encoding.encode(chunk1)
        tokens2 = encoding.encode(chunk2)
        
        # Find the longest run of tokens that ends chunk1 and starts chunk2
        for i in range(min(len(tokens1), len(tokens2)), 0, -1):
            if tokens1[-i:] == tokens2[:i]:
                overlap = encoding.decode(tokens1[-i:])
                print("\n", "="*50, f"\nOverlapping text ({i} tokens):", overlap)
                return
        print("\nNo token overlap found")
    else:
        # Find overlapping characters
        for i in range(min(len(chunk1), len(chunk2)), 0, -1):
            if chunk1[-i:] == chunk2[:i]:
                print("\n", "="*50, f"\nOverlapping text ({i} chars):", chunk1[-i:])
                return
        print("\nNo character overlap found")

### Chunk size of 600 characters and no overlap

In [7]:
recursive_character_chunker = RecursiveTokenChunker(
    chunk_size=600,  # Character length
    chunk_overlap=0,  # No overlap
    length_function=len,  # Measure length in characters with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""]
)

In [8]:
recursive_character_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(recursive_character_chunks, use_tokens=False)


Number of Chunks: 150

 . And we don't spend the capital without very clear signals that we can monetize it this way.We remain very bullish on AWS. We're at $100 billion-plus annualized revenue run rate, yet 85% or more of the global IT spend remains on-premises. And this is before you even calculate gen AI, most of which will be created over the next 10 to 20 years from scratch and on the cloud. There is a very large opportunity in front of us. We also continue to make strong progress on our newer investments. Our emerging international stores are growing and moving towards profitability

 . Our third-party logistics business offering services like Buy with Prime, Amazon shipping and multichannel fulfillment continues to grow well.We just launched a Prime delivery grocery benefit that lets customers receive free unlimited grocery delivery for just $9.99 a month, which is great value and customers are responding accordingly. Later this year in Manhattan, we're launching a new smaller 
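Printing two chunks shows what the splitter produces, but not whether every chunk respects the `chunk_size` passed to the chunker. The sketch below summarises chunk lengths against a limit. The helper name `chunk_length_stats` and the toy `sample_chunks` list are illustrative assumptions, not part of `chunking_evaluation`; in the notebook you would pass `recursive_character_chunks` instead.

```python
from statistics import mean


def chunk_length_stats(chunks, chunk_size):
    """Summarise character lengths of chunks against a size limit.

    Hypothetical helper for illustration; not part of chunking_evaluation.
    """
    if not chunks:
        raise ValueError("chunks must not be empty")
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        # Number of chunks longer than the configured limit
        "over_limit": sum(1 for n in lengths if n > chunk_size),
    }


# Toy input standing in for recursive_character_chunks
sample_chunks = ["a" * 590, "b" * 412, "c" * 600]
stats = chunk_length_stats(sample_chunks, chunk_size=600)
print(stats)
```

A non-zero `over_limit` would suggest that the separators could not break some span of text down to the requested size.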

### Chunk size of 800 characters and overlap of 100 characters

In [9]:
recursive_character_chunker = RecursiveTokenChunker(
    chunk_size=800,  # Character Length
    chunk_overlap=100,  # Overlap
    length_function=len,  # Character length with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research
)

recursive_character_overlap_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(recursive_character_overlap_chunks, use_tokens=False)


Number of Chunks: 112

 .Additionally, we continue to see the impact of cost optimizations diminish. While there always be a level of ongoing optimization, we think the majority of the recent cycle is behind us, and we're likely closer to a steady state of these optimization efforts. AWS operating income was $9.4 billion, an increase of $4.3 billion year-over-year. As a reminder, these results include the impact from the change in the estimated useful life of our servers, which primarily benefits the AWS segment. We made progress in managing our infrastructure and fixed costs while still growing at a healthy rate, which has resulted in improved leverage.As we've said in the past, over time, we expect the AWS operating margins to fluctuate, driven in part by the level of investments we are making in the business

 . We remain focused on driving efficiencies across the business, which enables us to invest to support the strong growth we're seeing in AWS, including generative AI, which b
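The two chunks printed above do not visibly share text at their boundary, even though `chunk_overlap=100`. One possible explanation, stated here as an assumption about how recursive splitters behave rather than something this output confirms, is that overlap is assembled from whole splits, so a trailing sentence longer than the overlap budget is not carried into the next chunk. Inspecting a single pair cannot settle this, so the sketch below measures the overlap for every adjacent pair. The helper names and the toy `sample_chunks` are illustrative; in the notebook you would pass `recursive_character_overlap_chunks`.

```python
def suffix_prefix_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left[-size:]
    return ""


def overlap_lengths(chunks):
    """Character overlap length for each adjacent pair of chunks."""
    return [
        len(suffix_prefix_overlap(a, b))
        for a, b in zip(chunks, chunks[1:])
    ]


# Toy input standing in for recursive_character_overlap_chunks
sample_chunks = ["alpha beta gamma", "gamma delta", "epsilon zeta"]
lengths = overlap_lengths(sample_chunks)
pairs_with_overlap = sum(1 for n in lengths if n > 0)
print("Overlap per pair:", lengths)
print("Pairs with overlap:", pairs_with_overlap, "of", len(lengths))
```

If most pairs report zero, the configured overlap is rarely taking effect for this document and separator list.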

In [12]:
structured_chunks = []
for i, chunk_text in enumerate(recursive_character_overlap_chunks):
    structured_chunks.append({
        "chunk_id": i,
        "text": chunk_text.strip(),
        "metadata": {
            "company": "Amazon",
            "quarter": "Q1 2024",
            "doc_type": "earnings_call",
            "source": "earnings_call.txt"
        }
    })

In [13]:
structured_chunks[1]

{'chunk_id': 1,
 'text': ".Stellar Growth and Operational Excellence Amazon reported impressive financial results for Q1 2024, showcasing its ability to blend revenue growth with operational efficiency. The company recorded $143.3 billion in revenue, a 13% year-over-year increase, and operating income surged by 221% to $15.3 billion. This leap highlights Amazon's effective cost management and strategic investments across its primary segments. Moreover, the adjusted free cash flow on a trailing 12-month basis reached $48.8 billion, marking significant progress in liquidity .Developing Selection and Speed in RetailAmazon's retail segment emphasized expanding product selection and increasing delivery speeds",
 'metadata': {'company': 'Amazon',
  'quarter': 'Q1 2024',
  'doc_type': 'earnings_call',
  'source': 'earnings_call.txt'}}

In [14]:
import json

# Write with an explicit encoding, matching how the transcript was read
with open("chunk_earnings_call.json", "w", encoding="utf-8") as f:
    json.dump(structured_chunks, f, indent=2)
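Before using the exported file downstream, it can help to reload it and confirm that every record has the expected shape. The following is a minimal sketch under stated assumptions: `validate_chunk_records` and `REQUIRED_KEYS` are hypothetical names, the checks mirror the record layout built above (`chunk_id`, `text`, `metadata`), and a temporary file stands in for `chunk_earnings_call.json`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"chunk_id", "text", "metadata"}


def validate_chunk_records(records):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for position, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"record {position}: missing {sorted(missing)}")
            continue
        if record["chunk_id"] != position:
            problems.append(f"record {position}: chunk_id is {record['chunk_id']}")
        if not record["text"]:
            problems.append(f"record {position}: empty text")
    return problems


# Toy records standing in for structured_chunks
sample = [
    {"chunk_id": 0, "text": "first chunk", "metadata": {"company": "Amazon"}},
    {"chunk_id": 1, "text": "second chunk", "metadata": {"company": "Amazon"}},
]

# Round-trip through a temporary JSON file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

problems = validate_chunk_records(reloaded)
print("Problems found:", problems)
```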

In [16]:
# Sanity-check the first chunk record: keys, start of text, metadata
print(structured_chunks[0].keys())
print(structured_chunks[0]["text"][:100])
print(structured_chunks[0]["metadata"])

dict_keys(['chunk_id', 'text', 'metadata'])
Earnings Call AnalysisQ1-2024 AnalysisAmazon.com IncAmazon Q1 2024 Earnings SnapshotAmazon reported 
{'company': 'Amazon', 'quarter': 'Q1 2024', 'doc_type': 'earnings_call', 'source': 'earnings_call.txt'}
