<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width="400px" style="opacity:0.7">
</center>


In [1]:
%run supportvectors-common.ipynb



<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. We therefore request that you do not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



# Contextual and Late Chunking

This demo shows how semantic chunking can lose important context (such as the antecedents of pronouns) and how contextual chunking and late chunking can restore it.

We'll use a sample text whose pronoun references lose their antecedents during semantic chunking. Contextual chunking restores them in the chunk text itself; late chunking preserves the surrounding context in the chunk embeddings.


In [2]:
from rich import print as rprint

In [3]:
# Sample text with pronoun references that will be lost during semantic chunking
sample_text = """
Machine Learning Fundamentals

Neural networks are computational models inspired by biological neural networks. They consist of interconnected nodes called neurons that process information through weighted connections. The basic building block is the perceptron, which takes multiple inputs, applies weights, and produces an output through an activation function.

The learning process in neural networks involves adjusting these weights based on training data. This is typically done through backpropagation, where the network calculates the gradient of the loss function with respect to each weight and updates them accordingly. The learning rate determines how much the weights are adjusted in each iteration.

Deep learning extends this concept by using multiple hidden layers between the input and output layers. Each layer can learn increasingly complex features, with early layers detecting simple patterns like edges and later layers combining these into more complex concepts. This hierarchical feature learning is what makes deep neural networks so powerful for tasks like image recognition and natural language processing.

Advanced Optimization Techniques

While basic gradient descent works for simple problems, more sophisticated optimization algorithms have been developed to improve training efficiency and convergence. These techniques address common challenges like getting stuck in local minima, slow convergence, and handling different scales of gradients across parameters.

One popular approach is adaptive learning rates, where the learning rate is adjusted for each parameter based on its historical gradients. Adam (Adaptive Moment Estimation) combines the benefits of momentum and adaptive learning rates by maintaining exponentially decaying averages of both gradients and squared gradients. This allows the algorithm to automatically adjust the learning rate for each parameter, often leading to faster convergence and better performance.

Another important technique is regularization, which helps prevent overfitting by adding constraints to the model. L1 regularization adds a penalty proportional to the sum of absolute values of weights, encouraging sparsity. L2 regularization adds a penalty proportional to the sum of squared weights, encouraging smaller weights. Dropout randomly sets a fraction of input units to zero during training, forcing the network to not rely on any single neuron and improving generalization.

Batch normalization is another crucial technique that normalizes the inputs to each layer by adjusting and scaling the activations. This helps stabilize training by reducing internal covariate shift, allowing for higher learning rates and making the network less sensitive to initialization. It also acts as a regularizer, reducing the need for dropout in some cases.
"""

rprint("Sample text loaded with pronoun references:")
rprint(sample_text[:200] + "...")

## Step 1: Use Entire Document as Single Parent Chunk

We'll use the entire document as one parent chunk and then apply semantic chunking to it.


In [4]:
# Use entire document as single parent chunk
parent_chunk = {
    "id": 0,
    "text": sample_text.strip(),
    "title": "Complete Document"
}

rprint("Using entire document as single parent chunk:")
rprint(f"Length: {len(parent_chunk['text'])} characters")
rprint(f"Preview: {parent_chunk['text'][:200]}...")

## Step 2: Semantic Chunking

Use `chonkie` to semantically chunk the parent chunk. This breaks the text into smaller, semantically coherent pieces, but a chunk may lose important context, such as the antecedent of a pronoun that appears in an earlier chunk.


In [5]:
from chonkie import SemanticChunker

# Initialize semantic chunker
semantic_chunker = SemanticChunker()

# Process the single parent chunk
semantic_chunks = []
parent_chunk_dict = {
    "id": parent_chunk["id"],
    "text": parent_chunk["text"],
    "title": parent_chunk["title"],
    "semantic_chunks": []
}

# Get semantic chunks for the parent
sem_chunks = semantic_chunker.chunk(parent_chunk["text"])

for i, sc in enumerate(sem_chunks):
    semantic_chunk_dict_item = {
        "id": i,
        "text": sc.text,
        "start_char": sc.start_index,
        "end_char": sc.end_index
    }
    parent_chunk_dict["semantic_chunks"].append(semantic_chunk_dict_item)
    
    # Also maintain a flat list for contextual chunking
    semantic_chunks.append({
        "chunk_id": i,
        "chunk": sc,
        "parent_id": parent_chunk["id"],
        "parent_chunk": parent_chunk
    })

rprint(f"Created {len(semantic_chunks)} semantic chunks from the complete document")
rprint("\nSemantic chunks preview:")
for i, sc in enumerate(semantic_chunks[:5]):  # Show first 5 chunks
    rprint(f"\nSemantic Chunk {i}:")
    rprint(f"Text: {sc['chunk'].text[:100]}...")
    rprint(f"Start: {sc['chunk'].start_index}, End: {sc['chunk'].end_index}")
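
To make the idea behind semantic chunking concrete, here is a minimal, self-contained sketch. It is not `chonkie`'s implementation: the bag-of-words `embed` function, the `toy_semantic_chunks` helper, and the `threshold` value are all assumptions chosen for illustration. The sketch starts a new chunk whenever the cosine similarity between adjacent sentences drops below the threshold.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real sentence encoder)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_semantic_chunks(text, threshold=0.2):
    """Group consecutive sentences; start a new chunk when similarity drops below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])
        else:
            chunks[-1].append(curr)
    return [" ".join(c) for c in chunks]


demo = (
    "Neural networks process information through weighted connections. "
    "Neural networks adjust weighted connections during training. "
    "They consist of layers."
)
toy_chunks = toy_semantic_chunks(demo)
for i, c in enumerate(toy_chunks):
    print(i, c)
```

In this toy run the last sentence shares no words with its neighbour, so it becomes its own chunk. That chunk begins with "They" but no longer contains "Neural networks", which is the kind of lost context this notebook is concerned with.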


## Step 3: Identify Pronoun References Lost in Semantic Chunking

Let's examine some semantic chunks to see where pronoun references like "it", "this", "they" lose their context.


In [6]:
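import re

# Find semantic chunks with pronoun references.
# Word boundaries avoid false matches inside words such as "limit" or "split",
# and also catch pronouns followed by punctuation ("it.", "this,").
pronoun_pattern = re.compile(r"\b(it|this|they|these|that|those)\b")
pronoun_chunks = []

for i, sc in enumerate(semantic_chunks):
    text = sc['chunk'].text.lower()
    found_pronouns = sorted(set(pronoun_pattern.findall(text)))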
    
    if found_pronouns:
        pronoun_chunks.append({
            'chunk_id': i,
            'text': sc['chunk'].text,
            'pronouns': found_pronouns
        })

rprint(f"Found {len(pronoun_chunks)} semantic chunks with pronoun references:")
rprint("\n" + "="*80)

for chunk in pronoun_chunks[:3]:  # Show first 3 examples
    rprint(f"\nChunk {chunk['chunk_id']}:")
    rprint(f"Pronouns found: {chunk['pronouns']}")
    rprint(f"Text: {chunk['text']}")
    rprint("-" * 60)
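
A note on the detection heuristic: a plain substring test such as `'it ' in text` also fires inside longer words ("limit ", "split ") and misses a pronoun followed by punctuation. The sketch below contrasts it with a word-boundary regular expression. `find_pronouns` and the pronoun list are illustrative assumptions, not a full coreference detector; words like "this" and "that" can also act as determiners, so matches are only candidates.

```python
import re

PRONOUN_RE = re.compile(r"\b(it|this|they|these|that|those)\b", re.IGNORECASE)


def find_pronouns(text):
    """Return the distinct pronouns in text, lowercased, in order of first appearance."""
    seen = []
    for match in PRONOUN_RE.finditer(text):
        word = match.group(1).lower()
        if word not in seen:
            seen.append(word)
    return seen


sentence_a = "We split the limit evenly."   # no pronouns, but contains the substring 'it '
sentence_b = "The network learns from it."  # pronoun followed by punctuation

# For each sentence: result of the substring test, then the regex matches
print("it " in sentence_a.lower(), find_pronouns(sentence_a))
print("it " in sentence_b.lower(), find_pronouns(sentence_b))
```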


## Step 4: Contextual Chunking

Now we'll use an LLM to enrich each semantic chunk with context from its parent chunk, resolving pronoun references and making each chunk self-standing.


In [7]:
from openai import OpenAI

# OpenAI-compatible client pointed at a local Ollama server (the API key is required by the client but ignored by Ollama)
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

# Contextual chunking prompt
contextual_prompt = """
Here is the parent section: {parent_text}

Now here is the semantic chunk: {semantic_text}

Please produce an enriched chunk which retains the semantic chunk but adds any necessary context from the parent so that the chunk is self-standing. Pay special attention to resolving any pronoun references (it, this, they, etc.) by replacing them with their proper antecedents.

Do not add anything that is not needed to make the chunk self-standing.
"""

# Process each semantic chunk
contextual_chunks = []

rprint("Processing semantic chunks with contextual chunking...")
rprint("="*60)

for i, sc in enumerate(semantic_chunks):
    parent_text = sc['parent_chunk']['text']
    semantic_text = sc['chunk'].text
    
    try:
        response = client.chat.completions.create(
            model="gpt-oss:20b",
            messages=[
                {"role": "user", "content": contextual_prompt.format(parent_text=parent_text, semantic_text=semantic_text)},
            ],
        )
        
        contextual_chunk = response.choices[0].message.content
        
        contextual_chunks.append({
            "chunk_id": sc['chunk_id'],
            "parent_id": sc['parent_id'],
            "original_semantic_chunk": semantic_text,
            "contextual_chunk": contextual_chunk,
            "parent_title": sc['parent_chunk']['title']
        })
        
        # Show progress for first few chunks
        if i < 3:
            rprint(f"\nChunk {i} (from {sc['parent_chunk']['title']}):")
            rprint(f"Original: {semantic_text[:100]}...")
            rprint(f"Contextual: {contextual_chunk[:100]}...")
            rprint("-" * 40)
            
    except Exception as e:
        print(f"Error processing chunk {i}: {e}")
        contextual_chunks.append({
            "chunk_id": sc['chunk_id'],
            "parent_id": sc['parent_id'],
            "original_semantic_chunk": semantic_text,
            "contextual_chunk": semantic_text,  # fallback to original
            "parent_title": sc['parent_chunk']['title']
        })

rprint(f"\nCompleted contextual chunking for {len(contextual_chunks)} chunks")
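
Because the loop above depends on a running Ollama server, it can help to isolate the enrichment step behind a plain function that can be exercised offline. The sketch below is one possible arrangement, not part of the original notebook: `enrich_chunk`, `PROMPT_TEMPLATE`, and the `complete` callable are hypothetical names, and `fake_complete` and `failing_complete` are stubs standing in for the real model call.

```python
PROMPT_TEMPLATE = (
    "Here is the parent section: {parent_text}\n\n"
    "Now here is the semantic chunk: {semantic_text}\n\n"
    "Rewrite the chunk so it is self-standing, resolving pronoun references."
)


def enrich_chunk(parent_text, semantic_text, complete):
    """Return an enriched chunk; fall back to the original text if the call fails or is empty."""
    prompt = PROMPT_TEMPLATE.format(parent_text=parent_text, semantic_text=semantic_text)
    try:
        enriched = complete(prompt)
    except Exception:
        return semantic_text
    return enriched.strip() if enriched and enriched.strip() else semantic_text


def fake_complete(prompt):
    """Stub model (assumption): pretends the LLM resolved 'They' to its antecedent."""
    return "Neural networks consist of interconnected nodes called neurons."


def failing_complete(prompt):
    """Stub that simulates an unreachable server."""
    raise ConnectionError("server not reachable")


parent = "Neural networks are computational models. They consist of interconnected nodes called neurons."
chunk = "They consist of interconnected nodes called neurons."

print(enrich_chunk(parent, chunk, fake_complete))     # enriched text from the stub
print(enrich_chunk(parent, chunk, failing_complete))  # falls back to the original chunk
```

With the real client, `complete` could be a small wrapper that calls `client.chat.completions.create(...)` and returns `response.choices[0].message.content`. The fallback also covers an empty or missing response, not only a raised exception.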


## Step 5: Compare Original vs Contextual Chunks

Let's compare some of the original semantic chunks with their contextual versions to see how pronoun references were resolved.


In [8]:
# Compare original vs contextual chunks, focusing on those with pronoun references
rprint("COMPARISON: Original Semantic Chunks vs Contextual Chunks")
rprint("="*80)

# Focus on chunks that originally had pronoun references
for pronoun_chunk in pronoun_chunks[:2]:  # Show first 2 examples
    chunk_id = pronoun_chunk['chunk_id']
    
    # Find the corresponding contextual chunk
    contextual_chunk = next((cc for cc in contextual_chunks if cc['chunk_id'] == chunk_id), None)
    
    if contextual_chunk:
        rprint(f"\nChunk {chunk_id}:")
        rprint(f"Pronouns found in original: {pronoun_chunk['pronouns']}")
        rprint("\nORIGINAL SEMANTIC CHUNK:")
        rprint(f"'{contextual_chunk['original_semantic_chunk']}'")
        rprint("\nCONTEXTUAL CHUNK:")
        rprint(f"'{contextual_chunk['contextual_chunk']}'")
        rprint("\n" + "="*60)
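
When the chunks are long, reading the two versions side by side can hide what actually changed. A word-level diff is one way to surface the edits. The sketch below uses Python's standard `difflib` module on two made-up strings; the `original` and `contextual` values are illustrative, not model output, and `word_diff` is a hypothetical helper.

```python
import difflib


def word_diff(original, contextual):
    """Return (removed_words, added_words) between two texts, compared word by word."""
    removed, added = [], []
    for token in difflib.ndiff(original.split(), contextual.split()):
        if token.startswith("- "):
            removed.append(token[2:])
        elif token.startswith("+ "):
            added.append(token[2:])
    return removed, added


original = "They consist of interconnected nodes called neurons."
contextual = "Neural networks consist of interconnected nodes called neurons."

removed, added = word_diff(original, contextual)
print("Removed:", removed)
print("Added:", added)
```

A pronoun appearing in the removed list with a noun phrase in the added list is a quick signal that the reference was resolved; it does not verify that the chosen antecedent is the right one.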


## Step 6: Late Chunking

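Now let's implement late chunking as an alternative approach. Late chunking embeds the entire parent chunk in a single forward pass, so every token embedding is conditioned on the full document, and only then mean-pools the token embeddings that fall within each semantic chunk's character span. Each chunk's vector therefore carries context from the parent, even though the chunk text itself is unchanged.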


In [9]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load embedding model
model_name = "jinaai/jina-embeddings-v2-base-en"
rprint(f"Loading embedding model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, output_hidden_states=True)

def late_chunk_parent(parent_chunk):
    """
    parent_chunk: dict with keys {id, text, semantic_chunks: List[{start_char, end_char, text, id}]}
    Returns enriched semantic chunks with embeddings from parent context.
    """
    text = parent_chunk["text"]
    sem_chunks = parent_chunk["semantic_chunks"]

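    # Tokenize the whole parent text in one pass so every token embedding is
    # conditioned on the full document. truncation=False assumes the parent fits
    # in the model's context window (8192 tokens for jina-embeddings-v2).
    inputs = tokenizer(text, return_tensors="pt", truncation=False, return_offsets_mapping=True)
    # offset_mapping gives each token's (start_char, end_char); the model does not accept it as input
    offsets = inputs.pop("offset_mapping")[0].tolist()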
    
    with torch.no_grad():
        outputs = model(**inputs)
    token_embs = outputs.last_hidden_state.squeeze(0)

    enriched_semantics = []
    for sc in sem_chunks:
        s, e = sc["start_char"], sc["end_char"]
        indices = [i for i, (ts, te) in enumerate(offsets) if te > s and ts < e]
        if not indices:
            # If no tokens found, use mean of all parent embeddings
            emb = token_embs.mean(dim=0).cpu().numpy()
        else:
            emb = token_embs[indices].mean(dim=0).cpu().numpy()
            
        enriched_semantics.append({
            "semantic_id": sc["id"],
            "parent_id": parent_chunk["id"],
            "embedding": emb.tolist(),
            "text": sc["text"],
            "parent_text": text,
            "num_tokens": len(indices)
        })
    return enriched_semantics

rprint("Late chunking implementation ready!")
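
The core of `late_chunk_parent` is the span-to-token mapping followed by mean pooling. The toy below reproduces just that step in plain Python so it can be inspected without downloading a model. The whitespace tokenizer, the two-dimensional `token_embs` values, and the helper names `token_offsets` and `pool_span` are all made up for illustration.

```python
import re


def token_offsets(text):
    """Whitespace 'tokenizer' (assumption) returning (start, end) character offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def pool_span(token_embs, offsets, start_char, end_char):
    """Mean-pool the embeddings of tokens overlapping [start_char, end_char)."""
    indices = [i for i, (ts, te) in enumerate(offsets) if te > start_char and ts < end_char]
    if not indices:
        return None, indices
    dim = len(token_embs[0])
    pooled = [sum(token_embs[i][d] for i in indices) / len(indices) for d in range(dim)]
    return pooled, indices


text = "Neural networks learn. They adapt."
offsets = token_offsets(text)
# Made-up per-token vectors; in late chunking these come from one forward pass
# over the whole parent text, so each already reflects the full context.
token_embs = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [0.0, 4.0], [0.0, 2.0]]

# Pool only the tokens belonging to the second sentence ("They adapt.")
second_start = text.index("They")
pooled, indices = pool_span(token_embs, offsets, second_start, len(text))
print(indices, pooled)
```

The overlap test `te > start_char and ts < end_char` is the same condition used in `late_chunk_parent`. With a real tokenizer it also skips special tokens, whose offsets are typically `(0, 0)`.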


In [10]:
# Process the single parent chunk with late chunking
rprint("Processing parent chunk with late chunking...")
rprint("="*50)

rprint(f"Processing parent chunk: {parent_chunk_dict['title']}")
enriched_semantics = late_chunk_parent(parent_chunk_dict)
all_late_chunks = enriched_semantics

rprint(f"\nExample late chunks from {parent_chunk_dict['title']}:")
for i, es in enumerate(enriched_semantics[:3]):
    rprint(f"\nLate Chunk {i}:")
    rprint(f"Text: {es['text'][:100]}...")
    rprint(f"Embedding dimension: {len(es['embedding'])}")
    rprint(f"Tokens used: {es['num_tokens']}")

rprint(f"\nCompleted late chunking for {len(all_late_chunks)} chunks")
