## Document A: FAQ

### Strategy 

Custom Chunking (chunking by Q&A pair)

### Reason
The document is an FAQ, so it consists of question-and-answer pairs. Chunking by pair gives our LLMs the full context and preserves the semantic relationship between each question and its answer.


### Implement the chunking

In [3]:
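def custom_chunk(text):
    """Split FAQ text into one chunk per Q&A pair."""
    # Each Q&A pair is separated from the next by a blank line
    q_as = text.split("\n\n")
    # Strip surrounding whitespace and drop empty pieces
    return [chunk.strip() for chunk in q_as if chunk.strip()]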


text = """
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
    """

chunks = custom_chunk(text)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)


Number of chunks: 3

Chunk 1 (104 chars):
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.
--------------------------------------------------------------------------------
Chunk 2 (120 chars):
Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.
--------------------------------------------------------------------------------
Chunk 3 (89 chars):
Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
--------------------------------------------------------------------------------
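
`custom_chunk` relies on every Q&A pair being separated by a blank line. As a minimal sketch of an alternative, assuming each question starts a line with `Q:` and each answer starts a line with `A:` (as in the sample text), the pairs can be located by those prefixes instead. The name `chunk_qa_pairs` and the returned dictionary keys are hypothetical choices for this illustration.

```python
import re

# One match per pair: a line starting with "Q:", then a line starting with "A:",
# up to the next "Q:" line or the end of the text.
QA_PATTERN = re.compile(
    r"^[ \t]*Q:(?P<q>.*?)^[ \t]*A:(?P<a>.*?)(?=^[ \t]*Q:|\Z)",
    re.MULTILINE | re.DOTALL,
)

def chunk_qa_pairs(text):
    """Return one dict per Q&A pair, regardless of blank lines between pairs."""
    pairs = []
    for match in QA_PATTERN.finditer(text):
        question = match.group("q").strip()
        answer = match.group("a").strip()
        pairs.append({
            "question": question,
            "answer": answer,
            # The chunk text keeps question and answer together
            "text": f"Q: {question}\nA: {answer}",
        })
    return pairs

faq = (
    "Q: What is the return policy?\n"
    "A: Items can be returned within 30 days.\n"
    "Q: How do I track my order?\n"
    "A: Use the tracking number.\n"
    "It is sent by email.\n"
)
for pair in chunk_qa_pairs(faq):
    print(pair["text"])
    print("-" * 40)
```

Keeping the question and answer as separate fields is a design choice that may help later, for example when only the question should be embedded while the full pair is returned as context.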


## Document B: Technical Documentation

### Strategy

Sentence Chunking

### Reason

Fixed-size (word) chunking might split a sentence in the middle, making it difficult for our LLMs to derive context. Paragraph chunking, on the other hand, produces larger chunks, which might make it harder for our LLMs to pinpoint specific information and might add latency. Sentence chunking with overlapping chunks is the best strategy for this task. Note that the implementation below groups whole sentences up to a character limit and does not add the overlap.
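
To illustrate the mid-sentence split that fixed-size word chunking can produce, here is a small sketch. `chunk_by_words` and `words_per_chunk` are hypothetical names used only for this illustration.

```python
def chunk_by_words(text, words_per_chunk=8):
    """Split text into chunks of a fixed number of words, ignoring sentence ends."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

steps = "Step 2: Run setup.exe as administrator. Follow the on-screen instructions."
print(chunk_by_words(steps))
# ['Step 2: Run setup.exe as administrator. Follow the', 'on-screen instructions.']
```

The second sentence is cut after "Follow the", so neither chunk contains the complete instruction.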





### Implement the chunking

In [4]:
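import re

def chunk_by_sentences(text, max_chunk_size=100):
    """Group consecutive sentences into chunks of roughly max_chunk_size characters.

    A single sentence longer than max_chunk_size becomes its own chunk.
    """
    # Split after '.', '!' or '?' when followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if not sentence:
            continue
        # Check if adding this sentence (plus a joining space) would exceed max size
        if current_chunk and len(current_chunk) + 1 + len(sentence) > max_chunk_size:
            chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += " " + sentence if current_chunk else sentence

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks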

doc = """
        Installation Guide

        Step 1: Download the installer from our website.
        Extract the zip file to your desired location.

        Step 2: Run setup.exe as administrator.
        Follow the on-screen instructions.

        Step 3: Configure your API key in the settings file.
        The settings file is located at config/settings.json.
    """

chunk_by_sentences(doc)

['Installation Guide\n\n        Step 1: Download the installer from our website.',
 'Extract the zip file to your desired location. Step 2: Run setup.exe as administrator.',
 'Follow the on-screen instructions. Step 3: Configure your API key in the settings file.',
 'The settings file is located at config/settings.json.']
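
The chunks above are disjoint, while the reason for this strategy mentions overlapping chunks. A minimal sketch of one way to add overlap, assuming the overlap is measured in whole sentences and that whitespace inside a sentence can be collapsed: sentences are first grouped by size, then each chunk is prefixed with the last sentences of the previous group. `chunk_by_sentences_with_overlap` and its `overlap` parameter are hypothetical names, and a chunk may exceed `max_chunk_size` by the length of the carried sentences.

```python
import re

def chunk_by_sentences_with_overlap(text, max_chunk_size=100, overlap=1):
    """Group sentences by size, then repeat the last `overlap` sentences
    of each group at the start of the next chunk."""
    # Split after '.', '!' or '?' and collapse whitespace inside each sentence
    raw = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [" ".join(s.split()) for s in raw if s.strip()]

    # Step 1: disjoint groups whose joined length stays within max_chunk_size
    groups, current, current_len = [], [], 0
    for sentence in sentences:
        added = len(sentence) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + added > max_chunk_size:
            groups.append(current)
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        groups.append(current)

    # Step 2: prefix each group with the tail of the previous group
    chunks = []
    for i, group in enumerate(groups):
        carried = groups[i - 1][-overlap:] if i > 0 and overlap > 0 else []
        chunks.append(" ".join(carried + group))
    return chunks

print(chunk_by_sentences_with_overlap("One. Two. Three. Four.", max_chunk_size=10))
# ['One. Two.', 'Two. Three.', 'Three. Four.']
```

Carrying the overlap after grouping keeps the grouping logic identical to the non-overlapping version, at the cost of chunks that can run past the size limit.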

## Document C: Article

### Strategy

Paragraph Chunking

### Reason

Fixed-size (word) chunking might split a sentence in the middle, making it difficult for our LLMs to derive context. Sentence chunking, on the other hand, might separate sentences that develop the same point. In an article, each paragraph usually carries one idea, so paragraph chunking is the best strategy for this task.

### Implement the chunking





In [2]:
def chunk_by_paragraphs(text, min_chunk_size=100):
    """
    Split text by paragraphs (double newlines).
    
    Args:
        text: The text to chunk
        min_chunk_size: Paragraphs shorter than this many characters are
            merged into the chunk being built
    
    Returns:
        List of text chunks
    """
    # Split by double newlines (paragraph separator)
    paragraphs = text.split('\n\n')
    
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
            
        # A short paragraph is appended to the chunk being built
        if len(para) < min_chunk_size:
            current_chunk += "\n\n" + para if current_chunk else para
        else:
            # A long paragraph closes the chunk being built, so a short
            # paragraph that precedes it (such as a title) is emitted on its own
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Start new chunk with this paragraph
            current_chunk = para
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks



In [3]:
sample_document = """
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
"""

# Test it
chunks = chunk_by_paragraphs(sample_document, min_chunk_size=100)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 4

Chunk 1 (30 chars):
The Future of Renewable Energy
--------------------------------------------------------------------------------
Chunk 2 (177 chars):
Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.
--------------------------------------------------------------------------------
Chunk 3 (200 chars):
Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.
--------------------------------------------------------------------------------
Chunk 4 (152 chars):
Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
--------------------------------------------------------------------------------
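
In the output above, the 30-character title is emitted as its own chunk even though `min_chunk_size` is 100, because `chunk_by_paragraphs` closes the current chunk whenever a long paragraph arrives. A minimal sketch of one alternative, assuming short paragraphs should be merged forward into the text that follows them. `chunk_by_paragraphs_merge_forward` is a hypothetical name, not part of the notebook.

```python
def chunk_by_paragraphs_merge_forward(text, min_chunk_size=100):
    """Split on blank lines and keep accumulating paragraphs until the
    chunk reaches min_chunk_size characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    pending = ""  # paragraphs collected so far that are still too short
    for para in paragraphs:
        pending = f"{pending}\n\n{para}" if pending else para
        if len(pending) >= min_chunk_size:
            chunks.append(pending)
            pending = ""

    if pending:
        # Trailing short text: attach it to the last chunk if there is one
        if chunks:
            chunks[-1] += "\n\n" + pending
        else:
            chunks.append(pending)
    return chunks

article = "A Short Title\n\n" + "First body paragraph. " * 6 + "\n\n" + "Second body paragraph. " * 6
for chunk in chunk_by_paragraphs_merge_forward(article):
    print(len(chunk), repr(chunk[:30]))
```

Applied to `sample_document`, this approach would attach the title to the first body paragraph and produce three chunks instead of four.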
