# Lab 3.2: Chunking Strategies for RAG

This notebook implements several chunking strategies for the GCN Circulars,
following the best practices from the official Databricks labs.

**Objectives:**
1. Implement character-based chunking (simple)
2. Implement sentence-based chunking (regex, compatible with serverless)
3. Implement semantic chunking (paragraphs)
4. **Compare strategies WITH vs WITHOUT overlap** and evaluate the trade-offs
5. **Analyze the impact of chunk size** on retrieval quality
6. Apply chunking to the dataset and save to Delta Lake

**Exam Topics Covered:**
- Section 2: Data Preparation (14%)
  - Apply chunking strategy for document structure and model constraints
  - Design retrieval systems using advanced chunking strategies
  - Evaluate how chunk size and overlap affect retrieval precision

## Setup

In [0]:
%pip install tiktoken -q
dbutils.library.restartPython()

In [0]:
import re
import tiktoken
from typing import List, Dict, Any
from pyspark.sql.functions import (
    col, udf, explode, lit, length,
    monotonically_increasing_id, concat_ws, size, array
)
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType

# Configuration
CATALOG = "sandbox"
SCHEMA = "nasa_gcn_dev"
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")

print("✅ Setup complete")

In [0]:
# Load the dataset produced by the previous notebook
df_prepared = spark.table("gcn_circulars_prepared")

# Statistics
total_docs = df_prepared.count()
avg_chars = df_prepared.agg({"char_count": "avg"}).collect()[0][0]

print(f"""
📊 Dataset loaded:
  - Total documents: {total_docs:,}
  - Average characters: {avg_chars:,.0f}
""")

# Show a sample
df_prepared.select("circular_id", "event_id", "char_count").show(5)

## 1. Strategy 1: Character-based Chunking

Simple chunking by character count, with overlap between consecutive chunks.
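
With a fixed window, each chunk starts `chunk_size - overlap` characters after the previous one, which makes the expected number of chunks easy to estimate. The sketch below is illustrative only: `expected_chunk_count` is a hypothetical helper, and it assumes idealized windows that are not snapped to word boundaries, so the count from a real chunker may differ slightly.

```python
import math


def expected_chunk_count(n_chars: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size windows needed to cover n_chars (no word snapping)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap  # how far each new window advances
    # The first window covers chunk_size characters; every further window
    # contributes `stride` new characters.
    return 1 + math.ceil((n_chars - chunk_size) / stride)


print(expected_chunk_count(800, chunk_size=200, overlap=50))  # 5
print(expected_chunk_count(800, chunk_size=200, overlap=0))   # 4
```

The extra chunk in the first call is the storage and embedding overhead paid for the overlap.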

In [0]:
def chunk_by_chars(text: str, chunk_size: int = 500, overlap: int = 100) -> List[Dict[str, Any]]:
    """
    Split text into fixed-size chunks with overlap.

    Args:
        text: Text to split
        chunk_size: Maximum size of each chunk, in characters
        overlap: Number of characters shared between consecutive chunks

    Returns:
        List of dicts with chunk_text, chunk_index, start_pos, end_pos, char_count
    """
    if not text:
        return []
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    chunk_idx = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))

        # Try to break on a space so words are not cut in half
        if end < len(text):
            # Find the last space inside the chunk
            last_space = text.rfind(' ', start, end)
            if last_space > start:
                end = last_space

        chunk_text = text[start:end].strip()

        if chunk_text:
            chunks.append({
                "chunk_text": chunk_text,
                "chunk_index": chunk_idx,
                "start_pos": start,
                "end_pos": end,
                "char_count": len(chunk_text)
            })
            chunk_idx += 1

        if end >= len(text):
            break

        # The next chunk starts `overlap` characters before the end of this one.
        # If snapping to a space shortened the chunk so much that this would not
        # move forward, start at `end` instead to avoid an infinite loop.
        next_start = end - overlap
        start = next_start if next_start > start else end

    return chunks

# Test
test_text = "This is a test. " * 50
test_chunks = chunk_by_chars(test_text, chunk_size=200, overlap=50)
print(f"Text of {len(test_text)} chars → {len(test_chunks)} chunks")
print(f"First chunk: '{test_chunks[0]['chunk_text'][:50]}...'")

## 2. Strategy 2: Sentence-based Chunking (Regex)

Smarter chunking that respects sentence boundaries.

> **Note:** We use regex instead of NLTK for compatibility with serverless clusters.
> NLTK requires a data download that is not available on the Spark workers.
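
To see why a regex tokenizer needs to protect abbreviations, here is a minimal sketch of the naive split on its own. `naive_sent_split` is a hypothetical helper, not part of the notebook, and it uses a lookbehind variant of the pattern for brevity.

```python
import re
from typing import List


def naive_sent_split(text: str) -> List[str]:
    """Split after . ! ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [p.strip() for p in parts if p.strip()]


# Decimal numbers are safe: there is no whitespace after the dot in "2.5".
print(naive_sent_split("The duration was 2.5 seconds. Follow-up began at once."))

# An abbreviation followed by a capitalized word is split by mistake.
print(naive_sent_split("Dr. Smith reported the burst. It was bright."))
```

The second call returns `'Dr.'` as a sentence of its own, which is the failure that temporarily replacing abbreviations with placeholders is meant to avoid.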

In [0]:
# Note: we use regex instead of NLTK for compatibility with serverless clusters.
# NLTK requires a data download that is not available on the workers.

def simple_sent_tokenize(text: str) -> List[str]:
    """
    Simple regex-based sentence tokenizer.
    Compatible with serverless clusters (does not require NLTK).

    Works well for English scientific text such as GCN Circulars.
    """
    # Protect common abbreviations by temporarily replacing them with placeholders
    abbreviations = ['Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Prof.', 'Fig.', 'Tab.', 'Eq.', 'et al.', 'i.e.', 'e.g.', 'vs.', 'etc.']
    protected = text
    for i, abbr in enumerate(abbreviations):
        protected = protected.replace(abbr, f"__ABBR{i}__")

    # Split into sentences: . ! ? followed by spaces and an uppercase letter.
    # The capture group keeps each punctuation mark as its own list element.
    # Note: ' +' matches spaces only, so a sentence break at a newline is not split.
    sentences = re.split(r'([.!?]) +(?=[A-Z])', protected)

    # Reattach each punctuation mark to the sentence before it
    result = []
    i = 0
    while i < len(sentences):
        if i + 1 < len(sentences) and sentences[i + 1] in '.!?':
            result.append(sentences[i] + sentences[i + 1])
            i += 2
        else:
            result.append(sentences[i])
            i += 1

    # Restore the abbreviations
    final = []
    for sent in result:
        restored = sent
        for j, abbr in enumerate(abbreviations):
            restored = restored.replace(f"__ABBR{j}__", abbr)
        if restored.strip():
            final.append(restored.strip())

    return final


def chunk_by_sentences(text: str, max_chunk_size: int = 500, overlap_sentences: int = 1) -> List[Dict[str, Any]]:
    """
    Split text into chunks that respect sentence boundaries.

    Args:
        text: Text to split
        max_chunk_size: Approximate maximum size of each chunk, in characters
        overlap_sentences: Number of sentences repeated at the start of the next chunk

    Returns:
        List of dicts with chunk_text, chunk_index, sentence_count, char_count
    """
    if not text:
        return []

    # Tokenize into sentences (regex-based, compatible with serverless)
    sentences = simple_sent_tokenize(text)

    if len(sentences) == 0:
        return [{"chunk_text": text, "chunk_index": 0, "sentence_count": 1, "char_count": len(text)}]

    chunks = []
    current_chunk = []
    current_size = 0  # sum of sentence lengths; the joining spaces are not counted
    chunk_idx = 0

    for sentence in sentences:
        sentence_size = len(sentence)

        # If adding this sentence would exceed the limit
        if current_size + sentence_size > max_chunk_size and current_chunk:
            # Save the current chunk
            chunk_text = ' '.join(current_chunk)
            chunks.append({
                "chunk_text": chunk_text,
                "chunk_index": chunk_idx,
                "sentence_count": len(current_chunk),
                "char_count": len(chunk_text)
            })
            chunk_idx += 1

            # Overlap: carry the last N sentences into the next chunk
            current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []
            current_size = sum(len(s) for s in current_chunk)

        current_chunk.append(sentence)
        current_size += sentence_size

    # Last chunk
    if current_chunk:
        chunk_text = ' '.join(current_chunk)
        chunks.append({
            "chunk_text": chunk_text,
            "chunk_index": chunk_idx,
            "sentence_count": len(current_chunk),
            "char_count": len(chunk_text)
        })

    return chunks

# Test
test_text = "GRB 251208B was detected by Fermi GBM. The burst had a duration of 2.5 seconds. " * 20
test_chunks = chunk_by_sentences(test_text, max_chunk_size=300, overlap_sentences=1)
print(f"Text of {len(test_text)} chars → {len(test_chunks)} chunks")
for i, c in enumerate(test_chunks[:2]):
    print(f"  Chunk {i}: {c['sentence_count']} sentences, {c['char_count']} chars")

## 3. Strategy 3: Semantic/Paragraph Chunking

Chunking that respects paragraphs and the structure of the document.

In [0]:
import re

def chunk_by_paragraphs(text: str, max_chunk_size: int = 800, min_paragraph_size: int = 50) -> List[Dict[str, Any]]:
    """
    Split text into chunks that respect paragraph boundaries.

    Args:
        text: Text to split
        max_chunk_size: Maximum size of each chunk, in characters
        min_paragraph_size: Minimum size for a paragraph to be treated on its own

    Returns:
        List of dicts with chunk_text, chunk_index, paragraph_count, char_count
    """
    if not text:
        return []

    # Split on blank lines (paragraphs)
    paragraphs = re.split(r'\n\s*\n', text)
    paragraphs = [p.strip() for p in paragraphs if p.strip()]

    if len(paragraphs) == 0:
        return [{"chunk_text": text, "chunk_index": 0, "paragraph_count": 1, "char_count": len(text)}]

    chunks = []
    current_chunk = []
    current_size = 0
    chunk_idx = 0

    for para in paragraphs:
        para_size = len(para)

        # Very small paragraph? Merge it into the current chunk,
        # even if that pushes the chunk past max_chunk_size
        if para_size < min_paragraph_size and current_chunk:
            current_chunk.append(para)
            current_size += para_size
            continue

        # If adding this paragraph would exceed the limit
        if current_size + para_size > max_chunk_size and current_chunk:
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append({
                "chunk_text": chunk_text,
                "chunk_index": chunk_idx,
                "paragraph_count": len(current_chunk),
                "char_count": len(chunk_text)
            })
            chunk_idx += 1
            current_chunk = []
            current_size = 0

        current_chunk.append(para)
        current_size += para_size

    # Last chunk
    if current_chunk:
        chunk_text = '\n\n'.join(current_chunk)
        chunks.append({
            "chunk_text": chunk_text,
            "chunk_index": chunk_idx,
            "paragraph_count": len(current_chunk),
            "char_count": len(chunk_text)
        })

    return chunks

# Test
test_text = """GRB 251208B was detected by Fermi GBM on December 8, 2025 at 14:32:15 UT.

The burst showed a complex light curve with multiple peaks. The T90 duration was measured at 2.5 seconds, classifying it as a short GRB.

Follow-up observations were conducted with Swift XRT starting at T+300 seconds. An X-ray afterglow was detected at coordinates RA=123.456, Dec=-45.678.

Optical observations from NOT revealed a fading counterpart with magnitude r=21.5 at T+2 hours."""

test_chunks = chunk_by_paragraphs(test_text, max_chunk_size=400)
print(f"Text of {len(test_text)} chars → {len(test_chunks)} chunks")
for i, c in enumerate(test_chunks):
    print(f"  Chunk {i}: {c['paragraph_count']} paragraphs, {c['char_count']} chars")

## 4. Apply Chunking to the Dataset

In [0]:
# Schema for the chunks
chunk_schema = ArrayType(StructType([
    StructField("chunk_text", StringType(), True),
    StructField("chunk_index", IntegerType(), True),
    StructField("char_count", IntegerType(), True)
]))

# UDF for sentence-based chunking (the best fit for scientific text)
# IMPORTANT: all helper code is defined inline in the UDF so it works on serverless clusters
@udf(returnType=chunk_schema)
def sentence_chunk_udf(text: str) -> List[Dict]:
    """
    UDF that applies sentence-based chunking.
    Helpers are defined inline for serverless compatibility.
    """
    import re
    from typing import List, Dict, Any

    def _sent_tokenize(text: str) -> List[str]:
        """Regex-based sentence tokenizer."""
        # Protect common abbreviations
        abbreviations = ['Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Prof.', 'Fig.', 'Tab.', 'Eq.', 'et al.', 'i.e.', 'e.g.', 'vs.', 'etc.']
        protected = text
        for i, abbr in enumerate(abbreviations):
            protected = protected.replace(abbr, f"__ABBR{i}__")

        # Split into sentences
        sentences = re.split(r'([.!?]) +(?=[A-Z])', protected)

        # Reattach punctuation to the preceding sentence
        result = []
        i = 0
        while i < len(sentences):
            if i + 1 < len(sentences) and sentences[i + 1] in '.!?':
                result.append(sentences[i] + sentences[i + 1])
                i += 2
            else:
                result.append(sentences[i])
                i += 1

        # Restore the abbreviations
        final = []
        for sent in result:
            restored = sent
            for j, abbr in enumerate(abbreviations):
                restored = restored.replace(f"__ABBR{j}__", abbr)
            if restored.strip():
                final.append(restored.strip())

        return final

    def _chunk_by_sentences(text: str, max_chunk_size: int = 500, overlap_sentences: int = 1) -> List[Dict[str, Any]]:
        if not text or len(text) == 0:
            return []

        sentences = _sent_tokenize(text)

        if len(sentences) == 0:
            return [{"chunk_text": text, "chunk_index": 0, "sentence_count": 1, "char_count": len(text)}]

        chunks = []
        current_chunk = []
        current_size = 0
        chunk_idx = 0

        for sentence in sentences:
            sentence_size = len(sentence)

            if current_size + sentence_size > max_chunk_size and current_chunk:
                chunk_text = ' '.join(current_chunk)
                chunks.append({
                    "chunk_text": chunk_text,
                    "chunk_index": chunk_idx,
                    "sentence_count": len(current_chunk),
                    "char_count": len(chunk_text)
                })
                chunk_idx += 1
                current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []
                current_size = sum(len(s) for s in current_chunk)

            current_chunk.append(sentence)
            current_size += sentence_size

        if current_chunk:
            chunk_text = ' '.join(current_chunk)
            chunks.append({
                "chunk_text": chunk_text,
                "chunk_index": chunk_idx,
                "sentence_count": len(current_chunk),
                "char_count": len(chunk_text)
            })

        return chunks

    # Run the chunking and keep only the fields declared in chunk_schema
    if not text:
        return []
    chunks = _chunk_by_sentences(text, max_chunk_size=500, overlap_sentences=1)
    return [{"chunk_text": c["chunk_text"], "chunk_index": c["chunk_index"], "char_count": c["char_count"]}
            for c in chunks]

In [0]:
from pyspark.sql.functions import posexplode

# Apply chunking
df_chunked = df_prepared.withColumn(
    "chunks", sentence_chunk_udf(col("body"))
)

# Explode the chunks into one row per chunk
df_exploded = df_chunked.select(
    col("circular_id"),
    col("event_id"),
    col("subject"),
    col("created_on"),
    posexplode(col("chunks")).alias("chunk_index", "chunk")
).select(
    col("circular_id"),
    col("event_id"),
    col("subject"),
    col("created_on"),
    col("chunk_index"),
    col("chunk.chunk_text").alias("chunk_text"),
    col("chunk.char_count").alias("chunk_char_count")
)

# Add a unique ID for each chunk
df_final = df_exploded.withColumn(
    "chunk_id",
    concat_ws("_", col("circular_id").cast("string"), col("chunk_index").cast("string"))
)

# Show the result
print("📄 Chunking result:")
df_final.select("chunk_id", "circular_id", "event_id", "chunk_index", "chunk_char_count").show(10)

In [0]:
from pyspark.sql import functions as F

# Statistics: compute all chunk-level aggregates in a single pass.
# (A dict passed to agg() cannot repeat the "chunk_char_count" key,
# so only the last aggregate would survive.)
chunk_stats = df_final.agg(
    F.count("*").alias("total_chunks"),
    F.min("chunk_char_count").alias("min_chars"),
    F.avg("chunk_char_count").alias("avg_chars"),
    F.max("chunk_char_count").alias("max_chars"),
).collect()[0]

chunks_per_doc = df_final.groupBy("circular_id").count()
doc_stats = chunks_per_doc.agg(
    F.avg("count").alias("avg_chunks"),
    F.max("count").alias("max_chunks"),
).collect()[0]

print(f"""
📊 Chunking statistics:
────────────────────────────
Total chunks:              {chunk_stats['total_chunks']:,}
Original documents:        {df_prepared.count():,}
Average chunks/doc:        {doc_stats['avg_chunks']:.1f}
Max chunks in one doc:     {doc_stats['max_chunks']}

Chunk size:
  - Minimum: {chunk_stats['min_chars']:,} chars
  - Average: {chunk_stats['avg_chars']:,.0f} chars
  - Maximum: {chunk_stats['max_chars']:,} chars
""")

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.functions import when

# Bucket chunks by size
df_size_dist = df_final.withColumn(
    "size_bucket",
    when(col("chunk_char_count") < 200, "tiny (<200)")
    .when(col("chunk_char_count") < 400, "small (200-400)")
    .when(col("chunk_char_count") < 600, "medium (400-600)")
    .when(col("chunk_char_count") < 800, "large (600-800)")
    .otherwise("very_large (800+)")
)

print("📏 Chunk size distribution:")
# Order by the smallest chunk in each bucket so the rows appear in size order
# (ordering by the label would sort alphabetically: large, medium, small, ...).
(df_size_dist.groupBy("size_bucket")
    .agg(F.count("*").alias("count"), F.min("chunk_char_count").alias("min_chars"))
    .orderBy("min_chars")
    .show())

## 5. Save Chunks

In [0]:
# Build the formatted document used for embedding (including metadata)
df_to_save = df_final.withColumn(
    "document_for_embedding",
    concat_ws(
        "\n",
        concat_ws(": ", lit("EVENT"), col("event_id")),
        concat_ws(": ", lit("SUBJECT"), col("subject")),
        lit("---"),
        col("chunk_text")
    )
)

# Save
TABLE_NAME = "gcn_circulars_chunks"

df_to_save.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(TABLE_NAME)

saved_count = spark.table(TABLE_NAME).count()
print(f"✅ Table {CATALOG}.{SCHEMA}.{TABLE_NAME} created with {saved_count:,} chunks")

In [0]:
# Show examples
print("📄 Examples of saved chunks:")
spark.table(TABLE_NAME).select(
    "chunk_id", "event_id", "chunk_index", "chunk_char_count", "document_for_embedding"
).show(5, truncate=100)

## 6. Token Estimate (for embedding cost)

In [0]:
# Use tiktoken for a closer estimate than a characters-per-token rule of thumb.
# cl100k_base is an OpenAI encoding; BGE models use their own BERT-style
# tokenizer, so treat these counts as an approximation.
encoding = tiktoken.get_encoding("cl100k_base")

# Sample for the estimate (limit() returns an arbitrary subset, not a random sample)
sample_chunks = spark.table(TABLE_NAME).select("chunk_text").limit(1000).collect()
total_tokens = sum(len(encoding.encode(row.chunk_text)) for row in sample_chunks)
avg_tokens = total_tokens / len(sample_chunks) if sample_chunks else 0.0

total_chunks = spark.table(TABLE_NAME).count()
estimated_total_tokens = avg_tokens * total_chunks

print(f"""
🎯 Token estimate:
────────────────────────
Sample size:              {len(sample_chunks):,} chunks
Average tokens/chunk:     {avg_tokens:.0f}
Total chunks:             {total_chunks:,}
Estimated tokens (total): {estimated_total_tokens:,.0f}

💰 Estimated embedding cost (databricks-bge-large-en):
   ~$0.0001 per 1K tokens (assumed rate; check current pricing)
   Estimated cost: ${estimated_total_tokens/1000 * 0.0001:.2f}
""")

## 7. Compare Strategies: With vs Without Overlap

One of the most important decisions in chunking is the **overlap** between consecutive chunks.
Let's compare the results to understand its impact.

### 🎯 Why Does Overlap Matter?

**Without overlap:**
```
Chunk 1: "...the burst was detected at T0."
Chunk 2: "Follow-up observations started at T+300s."
```
A query about "when did observations start after detection" may miss the context, because the detection and the follow-up sit in different chunks.

**With overlap:**
```
Chunk 1: "...the burst was detected at T0. Follow-up observations..."
Chunk 2: "...detected at T0. Follow-up observations started at T+300s."
```
Both chunks now span the boundary, so the detection and the follow-up appear together.
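
The effect is easy to reproduce on a toy input. The sketch below is a simplified, count-based window (a fixed number of sentences per chunk) rather than size-based packing, and `window_sentences` is a hypothetical helper introduced only for illustration.

```python
from typing import List


def window_sentences(sentences: List[str], per_chunk: int = 2, overlap: int = 1) -> List[List[str]]:
    """Group sentences into windows of `per_chunk`, repeating `overlap` sentences."""
    if per_chunk <= 0 or not 0 <= overlap < per_chunk:
        raise ValueError("need per_chunk > 0 and 0 <= overlap < per_chunk")
    step = per_chunk - overlap  # how many new sentences each window adds
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + per_chunk])
        if start + per_chunk >= len(sentences):
            break  # the last sentence is covered; stop before emitting a redundant tail
    return windows


sentences = [
    "The burst was detected at T0.",
    "Follow-up observations started at T+300s.",
    "An X-ray afterglow was detected.",
]

with_overlap = window_sentences(sentences, per_chunk=2, overlap=1)
without_overlap = window_sentences(sentences, per_chunk=2, overlap=0)

# With overlap, the boundary sentence appears in both neighbouring chunks.
print(with_overlap[0][-1] == with_overlap[1][0])  # True
print(without_overlap)
```

With overlap, the second chunk repeats the follow-up sentence from the first; without it, the afterglow sentence ends up alone in the final chunk.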

In [0]:
def chunk_by_sentences_no_overlap(text: str, max_chunk_size: int = 500) -> List[Dict[str, Any]]:
    """Chunking por senten√ßas SEM overlap."""
    if not text or len(text) == 0:
        return []

    sentences = simple_sent_tokenize(text)
    if len(sentences) == 0:
        return [{"chunk_text": text, "chunk_index": 0, "char_count": len(text)}]

    chunks = []
    current_chunk = []
    current_size = 0
    chunk_idx = 0

    for sentence in sentences:
        sentence_size = len(sentence)

        if current_size + sentence_size > max_chunk_size and current_chunk:
            chunk_text = ' '.join(current_chunk)
            chunks.append({
                "chunk_text": chunk_text,
                "chunk_index": chunk_idx,
                "char_count": len(chunk_text)
            })
            chunk_idx += 1
            current_chunk = []  # NO overlap
            current_size = 0

        current_chunk.append(sentence)
        current_size += sentence_size

    if current_chunk:
        chunk_text = ' '.join(current_chunk)
        chunks.append({
            "chunk_text": chunk_text,
            "chunk_index": chunk_idx,
            "char_count": len(chunk_text)
        })

    return chunks

# Compare using a sample document
sample_doc = df_prepared.select("body").limit(1).collect()[0][0]

chunks_with_overlap = chunk_by_sentences(sample_doc, max_chunk_size=500, overlap_sentences=1)
chunks_no_overlap = chunk_by_sentences_no_overlap(sample_doc, max_chunk_size=500)

print(f"""
🔬 Overlap comparison:
─────────────────────────
Original document: {len(sample_doc):,} characters

WITH overlap (1 sentence):
  - Total chunks: {len(chunks_with_overlap)}
  - Total chars: {sum(c['char_count'] for c in chunks_with_overlap):,}
  - Redundancy: {sum(c['char_count'] for c in chunks_with_overlap) - len(sample_doc):,} extra chars

WITHOUT overlap:
  - Total chunks: {len(chunks_no_overlap)}
  - Total chars: {sum(c['char_count'] for c in chunks_no_overlap):,}
  - Redundancy: ~0 chars
""")

In [0]:
# Show the edges of each chunk to make the overlap visible
print("📋 First 3 chunks WITH overlap:")
print("=" * 80)
for i, chunk in enumerate(chunks_with_overlap[:3]):
    print(f"\nChunk {i} ({chunk['char_count']} chars):")
    # Show the start and the end
    text = chunk['chunk_text']
    print(f"  Start: '{text[:80]}...'")
    print(f"  End:   '...{text[-80:]}'")

print("\n\n📋 First 3 chunks WITHOUT overlap:")
print("=" * 80)
for i, chunk in enumerate(chunks_no_overlap[:3]):
    print(f"\nChunk {i} ({chunk['char_count']} chars):")
    text = chunk['chunk_text']
    print(f"  Start: '{text[:80]}...'")
    print(f"  End:   '...{text[-80:]}'")

## 8. Evaluate the Impact of Chunk Size

Chunk size directly affects retrieval quality:

| Size | Pros | Cons |
|------|------|------|
| **Small** (100-200 chars) | High precision, focused | May lose context |
| **Medium** (300-500 chars) | Good balance; the most common choice | None noted |
| **Large** (600-1000 chars) | More context | Lower precision, dilutes relevance |

### 🎯 Recommendations by Use Case:

- **FAQ / direct questions**: smaller chunks (200-300 chars)
- **Technical documents**: medium chunks (400-600 chars)
- **Scientific articles**: larger chunks (600-800 chars)

In [0]:
chunk_sizes = [200, 400, 600, 800]
results = []

for size in chunk_sizes:
    chunks = chunk_by_sentences(sample_doc, max_chunk_size=size, overlap_sentences=1)
    total_chars = sum(c['char_count'] for c in chunks)
    avg_chars = total_chars / len(chunks) if chunks else 0

    results.append({
        "max_size": size,
        "num_chunks": len(chunks),
        "avg_chunk_size": avg_chars,
        "total_chars": total_chars,
        "overhead_pct": ((total_chars - len(sample_doc)) / len(sample_doc)) * 100
    })

print("📊 Impact of Chunk Size:")
print("=" * 80)
print(f"{'Max Size':<12} {'Chunks':<10} {'Avg Size':<12} {'Total':<12} {'Overhead %':<12}")
print("-" * 80)
for r in results:
    print(f"{r['max_size']:<12} {r['num_chunks']:<10} {r['avg_chunk_size']:<12.0f} {r['total_chars']:<12,} {r['overhead_pct']:<12.1f}")

In [0]:
print("""
📈 Trade-off analysis:
═════════════════════════

1. SMALL CHUNKS (200-300 chars):
   ✅ High retrieval precision
   ✅ Fewer tokens per chunk = lower LLM cost
   ❌ May fragment related information
   ❌ More chunks = more embedding calls

2. MEDIUM CHUNKS (400-600 chars):
   ✅ Good balance between precision and context
   ✅ Suitable for most cases
   ✅ Typically holds 1-3 complete sentences
   → RECOMMENDED for GCN Circulars

3. LARGE CHUNKS (800+ chars):
   ✅ Rich context for complex answers
   ❌ Lower precision (dilutes relevance)
   ❌ More tokens = higher LLM cost
   ❌ May include irrelevant information

💡 Our choice: 500 chars with a 1-sentence overlap
   - Preserves scientific context
   - Fits within the embedding model's token limit
   - Overlap helps preserve semantic continuity across chunk boundaries
""")

## 9. Key Concepts for the Exam

### 📚 Chunking Strategies (Section 2: Data Preparation - 14%)

| Strategy | When to Use | Trade-off |
|----------|-------------|-----------|
| **Fixed-size** | Uniform documents | Simple, but may cut context |
| **Sentence-based** | Prose text | Preserves semantics, moderate overhead |
| **Paragraph-based** | Structured docs | Respects structure, variable chunk sizes |
| **Semantic** | Complex docs | Best quality, more complex |

### 🎯 Exam Tips:

1. **Overlap** helps prevent context loss at chunk boundaries
2. **Chunk size** should take into account:
   - The embedding model's token limit (typically 512)
   - The LLM's context window
   - The cost of embedding and inference
3. **Metadata enrichment** improves retrieval (source, date, section)
4. **Delta Lake** is preferred for storing chunks (ACID, versioning)
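
Tip 2 can be turned into a quick sanity check. The sketch below flags chunks whose estimated token count exceeds an embedding limit; `chunks_over_token_limit` is a hypothetical helper, and the 4-characters-per-token ratio is a rough rule of thumb, so a real tokenizer will give different counts.

```python
import math
from typing import List, Tuple


def chunks_over_token_limit(
    chunks: List[str],
    max_tokens: int = 512,
    chars_per_token: float = 4.0,  # assumption: rough average for English text
) -> List[Tuple[int, int]]:
    """Return (chunk_index, estimated_tokens) for chunks above max_tokens."""
    flagged = []
    for idx, chunk in enumerate(chunks):
        # Round up so borderline chunks are not under-counted.
        estimated = math.ceil(len(chunk) / chars_per_token)
        if estimated > max_tokens:
            flagged.append((idx, estimated))
    return flagged


# A 500-char chunk is roughly 125 tokens; a 3000-char chunk is roughly 750.
example_chunks = ["a" * 500, "b" * 3000]
print(chunks_over_token_limit(example_chunks))
```

Chunks that are flagged would need to be split further, or they risk being truncated by the embedding model.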

In [0]:
print("""
📋 Summary: Chunking Decisions for GCN Circulars
═══════════════════════════════════════════════════

┌─────────────────────┬────────────────────────────────────────┐
│ Parameter           │ Chosen value                           │
├─────────────────────┼────────────────────────────────────────┤
│ Strategy            │ Sentence-based (regex)                 │
│ Max chunk size      │ 500 characters (~125 tokens)           │
│ Overlap             │ 1 sentence                             │
│ Min doc size        │ 100 characters (filtered upstream)     │
│ Metadata included   │ event_id, subject, created_on          │
│ Storage             │ Delta Lake (Unity Catalog)             │
└─────────────────────┴────────────────────────────────────────┘

Rationale:
1. Sentence-based chunking preserves the scientific context of GCN Circulars
2. 500 chars (~125 tokens) fits within the BGE embedding limit (max 512 tokens)
3. A 1-sentence overlap helps prevent context loss
4. Regex is used instead of NLTK for serverless compatibility
5. Metadata enriches retrieval with information about the event
""")

## 10. Lab Wrap-Up: Key Learnings

### ✅ What you learned:

| Step | Concept | Application |
|------|---------|-------------|
| **Character-based chunking** | Simple splitting with overlap | Baseline, uniform documents |
| **Sentence-based chunking** | Respects semantic boundaries | Scientific text, prose |
| **Paragraph-based chunking** | Preserves document structure | Docs with clear sections |
| **Overlap comparison** | Trade-off between redundancy and context | Design decision |
| **Chunk size analysis** | Impact on precision and cost | Optimization |

### 🧠 Critical Insights:

1. **Quality > Quantity**: Well-structured chunks beat sheer volume
2. **Overlap is essential**: It helps prevent context loss at chunk boundaries
3. **Size matters**: Too small fragments the content, too large dilutes it
4. **Metadata enriches**: Source, date, and section improve retrieval
5. **Serverless requires adaptation**: NLTK's downloaded data is not available on the workers, so the tokenizer uses regex instead

### 🚀 Next Steps:

1. **Embeddings**: Generate vectors with the BGE model
2. **Vector Search**: Create an index for retrieval
3. **RAG Chain**: Connect the retriever to the LLM
4. **Evaluation**: Measure retrieval quality

## Next Steps

✅ Chunks created and saved
➡️ Next notebook: `03-embeddings-vector-search.py`
   - Generate embeddings with the BGE model
   - Create the Vector Search index
   - Test retrieval