<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width="400px" style="opacity:0.7">
</center>

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks for educational purposes; copying the notebooks or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from the textbook on Data Science that Asif Qamar is writing. We therefore request that you not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



# Chunking Pipeline
Here, we explore the chunking pipeline in three stages. First, we chunk the document along section boundaries. Next, we break each section further into semantic chunks. Finally, we bring back any context that may have been lost during semantic chunking, either by creating contextualized chunks with an LLM or by leveraging late interactions to perform "late chunking".

## Document ingestion → load document (PDF, Word, markdown).
Use Docling to parse the document and split it into parent chunks (each section/chapter).

In [2]:
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
import os
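# Single quotes inside the f-string keep this valid on Python versions before 3.12
doc = DocumentConverter().convert(source=f"{os.getenv('BOOTCAMP_ROOT_DIR')}/data/technicalReport1.pdf").document
chunker = HybridChunker()

# Chunk the document once; these chunks serve as the parent chunks below
parent_chunks = list(chunker.chunk(dl_doc=doc))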


2025-10-21 20:57:56,570 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-21 20:57:56,578 - INFO - Going to convert document batch...
2025-10-21 20:57:56,579 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 4f2edc0f7d9bb60b38ebfecf9a2609f5
2025-10-21 20:57:56,585 - INFO - Loading plugin 'docling_defaults'
2025-10-21 20:57:56,586 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-21 20:57:56,590 - INFO - Loading plugin 'docling_defaults'
2025-10-21 20:57:56,592 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-21 20:57:56,756 - INFO - Auto OCR model selected ocrmac.
2025-10-21 20:57:56,767 - INFO - Accelerator device: 'mps'
2025-10-21 20:57:57,930 - INFO - Accelerator device: 'mps'
2025-10-21 20:57:58,075 - INFO - Processing document technicalReport1.pdf
2025-10-21 20:58:03,162 - INFO - Finished converting document technicalReport1.pdf in 6.59 sec.
Token indices sequence length i

## Semantic chunking 
Use `chonkie` to split each of the `parent` chunks above into semantic chunks.
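
To make the idea of semantic splitting concrete, here is a toy sketch. It is not how `chonkie` works internally: it assumes a crude bag-of-words vector as a stand-in for a real embedding model, and `bow_vector`, `cosine`, `toy_semantic_chunks`, and the `threshold` value are hypothetical names chosen for illustration. Consecutive sentences stay in the same chunk while their similarity is at or above the threshold; a drop below it starts a new chunk.

```python
import math
import re
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Crude stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; start a new chunk when similarity to the previous sentence drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(curr)) >= threshold:
            chunks[-1].append(curr)   # same topic: extend the current chunk
        else:
            chunks.append([curr])     # topic shift: open a new chunk
    return [" ".join(c) for c in chunks]


sample = (
    "The pump moves coolant through the loop. "
    "The pump speed controls the coolant flow. "
    "Quarterly revenue grew by ten percent. "
    "Revenue grew because of new customers."
)
for chunk in toy_semantic_chunks(sample):
    print(chunk)
```

With this sample, the two pump sentences end up in one chunk and the two revenue sentences in another. A real semantic chunker replaces the word counts with learned embeddings, which is why the cell below loads an embedding model.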

In [3]:
from chonkie import SemanticChunker

semantic_chunker = SemanticChunker() # Use defaults


2025-10-21 20:58:07,947 - INFO - No cached model found for minishlab/potion-base-32M, loading from local or hub.
2025-10-21 20:58:07,947 - INFO - Folder does not exist locally, attempting to use huggingface hub.


In [4]:
semantic_chunks = []  # A running list of all semantic chunks (used during contextual chunking)
parent_chunk_dict_list = []  # A running list of all parent chunks (used during late chunking)

for parent_chunk_id, p in enumerate(parent_chunks):
    semantic_chunk_dict_list = []
    parent_chunk_dict_list.append({
        "id": parent_chunk_id,
        "text": p.text,
        "semantic_chunks": semantic_chunk_dict_list,
    })
    for semantic_chunk_id, sc in enumerate(semantic_chunker.chunk(p.text)):
        # Maintain a list of semantic chunks for each parent chunk
        semantic_chunk_dict_list.append({
            "id": semantic_chunk_id,
            "text": sc.text,
            "start_char": sc.start_index,
            "end_char": sc.end_index,
        })

        # Maintain a list of all semantic chunks with their parent chunk.
        # chunk_id restarts at 0 for every parent, so (parent_id, chunk_id) is the unique key.
        semantic_chunks.append({
            "chunk_id": semantic_chunk_id,
            "chunk": sc,
            "parent_id": parent_chunk_id,
            "parent_chunk": p,
        })

## Contextual Chunking
For each semantic chunk, gather the parent text (`p.text`) and the semantic chunk text (`sc.text`), then call an LLM with a prompt like:

“Here is context from the parent section: {parent_text}. Now here is the semantic chunk: {sc_text}. Please produce an enriched chunk which retains the semantic chunk but adds any necessary context from the parent so that the chunk is self-standing.”

Store the output as `contextual_chunk` and write each record as one line of a JSON Lines (`.jsonl`) file.
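
The flow of one record through this step can be sketched without a running model. In the sketch below, `build_prompt`, `contextualize`, and `fake_llm` are hypothetical helpers, `fake_llm` is a stand-in for a real model call, and an in-memory buffer stands in for the output file; only the record layout (one JSON object per line) mirrors what the notebook does.

```python
import io
import json
from typing import Callable

PROMPT_TEMPLATE = (
    "Here is context from the parent section: {parent_text}. "
    "Now here is the semantic chunk: {sc_text}. "
    "Please produce an enriched chunk which retains the semantic chunk but adds "
    "any necessary context from the parent so that the chunk is self-standing."
)


def build_prompt(parent_text: str, sc_text: str) -> str:
    """Fill the template with one parent/semantic-chunk pair."""
    return PROMPT_TEMPLATE.format(parent_text=parent_text, sc_text=sc_text)


def contextualize(pairs, llm: Callable[[str], str], out) -> list[dict]:
    """Call `llm` once per pair and write one JSON object per line to `out`."""
    records = []
    for parent_text, sc_text in pairs:
        record = {
            "contextual_chunk": llm(build_prompt(parent_text, sc_text)),
            "semantic_chunk": sc_text,
            "parent_chunk": parent_text,
        }
        records.append(record)
        out.write(json.dumps(record) + "\n")
    return records


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it only reports the prompt length."""
    return f"[enriched chunk for a prompt of {len(prompt)} characters]"


buffer = io.StringIO()  # stands in for an open .jsonl file
pairs = [("Section 2 describes the cooling loop.", "It runs at 40 psi.")]
records = contextualize(pairs, fake_llm, buffer)
print(buffer.getvalue())
```

Passing the model call in as a function keeps the record-writing logic testable on its own; the cells below wire the same idea to an OpenAI-compatible client.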

In [5]:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama")

system_prompt = """
Here is the parent section: {parent_text}. Now here is the semantic chunk: {sc_text}. 
Please produce an enriched chunk which retains the semantic chunk but adds any necessary context from the parent so that the chunk is self-standing.
Do not add anything that is not needed to make the chunk self-standing.
"""



In [9]:
import json


In [None]:
# Single quotes inside the f-strings keep this valid on Python versions before 3.12
os.makedirs(f"{os.getenv('BOOTCAMP_ROOT_DIR')}/output", exist_ok=True)
contextual_chunks = []
with open(f"{os.getenv('BOOTCAMP_ROOT_DIR')}/output/semantic_chunks.jsonl", "w") as f:
    for sc in semantic_chunks:
        contextual_chunk_dict = {}
        parent_text = sc["parent_chunk"].text
        sc_text = sc["chunk"].text
        response = client.chat.completions.create(
            model="gpt-oss:20b",
            messages=[
                {"role": "user", "content": system_prompt.format(parent_text=parent_text, sc_text=sc_text)},
            ],
        )
        contextual_chunk_dict["contextual_chunk"] = response.choices[0].message.content
        contextual_chunk_dict["semantic_chunk_id"] = sc["chunk_id"]
        contextual_chunk_dict["parent_id"] = sc["parent_id"]
        contextual_chunk_dict["semantic_chunk"] = sc["chunk"].text
        contextual_chunk_dict["parent_chunk"] = sc["parent_chunk"].text
        contextual_chunks.append(contextual_chunk_dict)
        f.write(json.dumps(contextual_chunk_dict) + "\n")

## Late Chunking
Alternatively, use the late chunking approach, as explained by [Jina AI](https://github.com/jina-ai/late-chunking).
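
The core mechanic is small enough to show with toy data: embed the whole parent text once, then mean-pool the token vectors that overlap each chunk's character span. The sketch below assumes a whitespace tokenizer and made-up 2-D vectors in place of a real model; `token_offsets` and `pool_span` are hypothetical helper names.

```python
import re


def token_offsets(text: str) -> list[tuple[int, int]]:
    """Character (start, end) offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def pool_span(token_vectors, offsets, start_char, end_char):
    """Mean-pool the vectors of tokens that overlap [start_char, end_char)."""
    picked = [
        vec for vec, (ts, te) in zip(token_vectors, offsets)
        if te > start_char and ts < end_char
    ]
    if not picked:
        return None  # no token falls inside this span
    dim = len(picked[0])
    return [sum(vec[d] for vec in picked) / len(picked) for d in range(dim)]


text = "late chunking pools tokens"
offsets = token_offsets(text)
# Made-up 2-D vectors, one per token. A real model would produce these
# in a single forward pass over the whole text, so each one carries context.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]

first = pool_span(token_vectors, offsets, 0, 13)           # span of "late chunking"
second = pool_span(token_vectors, offsets, 14, len(text))  # span of "pools tokens"
print(first, second)
```

Because pooling happens after the full text has been embedded, each chunk vector is built from token vectors that were computed with the rest of the parent in view. The cell below applies the same overlap test to the offsets returned by a Hugging Face tokenizer.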

In [12]:
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, output_hidden_states=True)

def late_chunk_parent(parent_chunk):
    """
    parent_chunk: dict with keys {id, text, semantic_chunks: List[{start_char, end_char, text, id}]}
    Returns same structure with added embeddings for each semantic chunk.
    """
    text = parent_chunk["text"]
    sem_chunks = parent_chunk["semantic_chunks"]

    # Tokenize + embed the *parent chunk text only*
    inputs = tokenizer(text, return_tensors="pt", truncation=False, return_offsets_mapping=True)
    # Save offsets separately
    offsets = inputs.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        outputs = model(**inputs)
    token_embs = outputs.last_hidden_state.squeeze(0)

    enriched_semantics = []
    for sc in sem_chunks:
        s, e = sc["start_char"], sc["end_char"]
        indices = [i for i, (ts, te) in enumerate(offsets) if te > s and ts < e]
        if not indices:
            continue
        emb = token_embs[indices].mean(dim=0).cpu().numpy()
        enriched_semantics.append({
            "semantic_id": sc["id"],
            "parent_id": parent_chunk["id"],
            "embedding": emb.tolist(),
            "text": sc["text"],
            "num_tokens": len(indices)
        })
    return enriched_semantics

# Example loop over all parent chunks
all_embeddings = []
# Single quotes inside the f-strings keep this valid on Python versions before 3.12
os.makedirs(f"{os.getenv('BOOTCAMP_ROOT_DIR')}/output", exist_ok=True)
with open(f"{os.getenv('BOOTCAMP_ROOT_DIR')}/output/late_chunks.jsonl", "w") as f:
    for parent in parent_chunk_dict_list:   # parent chunk dicts built during semantic chunking
        enriched_semantics = late_chunk_parent(parent)
        all_embeddings.extend(enriched_semantics)
        for es in enriched_semantics:
            f.write(json.dumps(es) + "\n")
