# Elasticsearch ColBERT v2 Embedding Pipeline

This notebook implements a real Elasticsearch embedding pipeline using Jina ColBERT v2 embeddings with multi-vector support and Late Chunking capabilities.

The pipeline performs:
1. **Data Loading**: Reading data from JSONL format
2. **Text Processing**: Tokenization and truncation based on ColBERT v2 token limits (8192 tokens)
3. **Vector Indexing**: Batch indexing to Elasticsearch with multi-vector embeddings
4. **Semantic Retrieval**: Late interaction similarity search using ColBERT methodology
5. **Optional Reranking**: Reranking results for improved accuracy

## Pipeline Architecture

```
JSONL Data → Text Processing → Document Creation → ColBERT Multi-Vector Embeddings → 
Elasticsearch Index (Dense Vector Storage) → Late Interaction Retrieval → 
Optional Reranking → Final Results
```

## Prerequisites

- **Elasticsearch**: Must be running and accessible (default: localhost:9200)
- **Jina API Key**: Required for ColBERT v2 embeddings
- **Python packages**: elasticsearch, requests, transformers, torch, tqdm

## Quick Start

1. Start Elasticsearch:
   ```bash
   docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.0
   ```

2. Set your Jina API key:
   ```bash
   export JINA_API_KEY="your-api-key-here"
   ```

3. Run the notebook cells sequentially

## 1. Import Required Libraries

Import all the libraries required for the Elasticsearch embedding pipeline simulation.

In [181]:
# Core libraries
import json
import os
import time
import warnings
from typing import Any, Dict, List
import logging
from dataclasses import dataclass, asdict
from datetime import datetime

# Data processing
from tqdm import tqdm

# Elasticsearch
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# HTTP requests for Jina API
import requests

# Suppress warnings
warnings.filterwarnings('ignore')

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Jina AI API key: read it from the environment; never hard-code a real key in the notebook
jinaai_key = os.environ.get("JINA_API_KEY", "")

print("All libraries imported successfully!")

All libraries imported successfully!


## 2. Define Data Structures and Configuration Classes

Define the data structures used by the embedding pipeline, which has been updated to use **Jina ColBERT v2**, including the model configuration and document chunks.

### API Key Setup

Before running the pipeline, make sure you have set your Jina API key:

```python
import os
os.environ["JINA_API_KEY"] = "your-jina-api-key-here"
```

Get a free API key at: https://jina.ai/

### Architecture Comparison

**Traditional Embeddings (before):**
```
Text → Single Vector (384D) → Vector Search → Results
```

**ColBERT v2 (now):**
```
Text → Multiple Token Vectors (Nx128D) → Late Interaction → Results
```
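
To make the "Late Interaction" step concrete, here is a minimal sketch of MaxSim-style scoring, the scoring rule generally associated with ColBERT: each query token vector is matched to its most similar document token vector, and those maxima are summed. The function names and the toy 2-D vectors are illustrative assumptions, not part of this pipeline's code.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def maxsim_score(query_vectors: List[Vector], doc_vectors: List[Vector]) -> float:
    """For each query token, take its best-matching document token,
    then sum those maxima over all query tokens."""
    if not doc_vectors:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_vectors) for q in query_vectors)


# Toy 2-D vectors standing in for the 128-D token vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]   # has a good match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]  # has a good match for the first token only
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

In this toy case `doc_a` scores higher than `doc_b`, because a single-vector comparison is replaced by a per-token best match.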

In [182]:
@dataclass
class ModelConfig:
    """Model configuration for Jina ColBERT v2 embedding pipeline."""
    provider: str = "jinaai"
    provider_model_id: str = "jina-colbert-v2"
    embedding_dimensions: int = 128  # ColBERT v2 uses 128 dimensions
    max_tokens: int = 8192  # Jina ColBERT v2 supports up to 8192 tokens
    api_endpoint: str = "https://api.jina.ai/v1/multi-vector"
    input_type: str = "document"
    embedding_type: str = "float"
    
    def model_dump(self) -> Dict[str, Any]:
        return asdict(self)

@dataclass
class Document:
    """Document class similar to langchain Document."""
    page_content: str
    metadata: Dict[str, Any]

@dataclass 
class Chunk:
    """Chunk class from the original pipeline."""
    content: str
    id: str
    metadata: Dict[str, Any]

@dataclass
class EmbeddingState:
    """State management for pipeline."""
    id: str
    query: str
    dataset: str
    retrieved_chunks: List[Chunk] = None
    reranked_chunks: List[Chunk] = None
    supporting_facts: List[str] = None

class EmbeddingStateKeys:
    """State keys constants."""
    ID = "id"
    QUERY = "query"
    DATASET = "dataset"
    RETRIEVED_CHUNKS = "retrieved_chunks"
    RERANKED_CHUNKS = "reranked_chunks"
    SUPPORTING_FACTS = "supporting_facts"

print("Data structures defined successfully!")

Data structures defined successfully!


In [183]:
class LateChunkingUtils:
    """Utilities for implementing Late Chunking with sentence-level segmentation."""
    
    @staticmethod
    def chunk_by_sentences(input_text: str, tokenizer) -> tuple:
        """
        Split input text into sentences using tokenizer with span annotations.
        
        Args:
            input_text: The text to chunk
            tokenizer: Transformers tokenizer
            
        Returns:
            tuple: (chunks, span_annotations)
        """
        inputs = tokenizer(input_text, return_tensors="pt", return_offsets_mapping=True)
        punctuation_mark_id = tokenizer.convert_tokens_to_ids(".")
        sep_id = tokenizer.convert_tokens_to_ids("[SEP]")
        token_offsets = inputs["offset_mapping"][0]
        token_ids = inputs["input_ids"][0]
        
        # Find sentence boundaries
        chunk_positions = [
            (i, int(start + 1))
            for i, (token_id, (start, end)) in enumerate(zip(token_ids, token_offsets))
            if token_id == punctuation_mark_id
            and (
                token_offsets[i + 1][0] - token_offsets[i][1] > 0
                or token_ids[i + 1] == sep_id
            )
        ]
        
        # Create text chunks
        chunks = [
            input_text[x[1] : y[1]]
            for x, y in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
        ]
        
        # Create span annotations for token ranges
        span_annotations = [
            (x[0], y[0]) for (x, y) in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
        ]
        
        return chunks, span_annotations
    
    @staticmethod
    def late_chunking_pooling(token_embeddings, span_annotations, max_length=None):
        """
        Apply late chunking pooling to token embeddings.
        
        Args:
            token_embeddings: Token-level embeddings from model
            span_annotations: List of (start, end) token spans
            max_length: Maximum sequence length
            
        Returns:
            List of pooled chunk embeddings
        """
        outputs = []
        
        for embeddings, annotations in zip(token_embeddings, span_annotations):
            if max_length is not None:
                # Remove annotations beyond max length
                annotations = [
                    (start, min(end, max_length - 1))
                    for (start, end) in annotations
                    if start < (max_length - 1)
                ]
            
            # Pool embeddings for each span (mean pooling)
            pooled_embeddings = [
                embeddings[start:end].sum(dim=0) / (end - start)
                for start, end in annotations
                if (end - start) >= 1
            ]
            
            # Convert to numpy
            pooled_embeddings = [
                embedding.detach().cpu().numpy() for embedding in pooled_embeddings
            ]
            outputs.append(pooled_embeddings)
        
        return outputs

@dataclass
class LateChunk:
    """Extended chunk class that includes context from late chunking."""
    content: str
    id: str
    metadata: Dict[str, Any]
    original_text: str  # Full text that was chunked
    span_annotation: tuple  # (start, end) token span
    chunk_index: int  # Position in original text
    context_aware: bool = True  # Indicates this is a late chunk

print("Late Chunking utilities defined!")

Late Chunking utilities defined!
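
The pooling step in `late_chunking_pooling` can be illustrated without a model. The sketch below is a pure-Python analogue that averages token vectors inside half-open `(start, end)` spans; `mean_pool_spans` and the toy vectors are assumptions made for illustration, and unlike the class method it divides by the number of vectors actually present in the slice.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool_spans(token_vectors: List[Vector], spans: List[Tuple[int, int]]) -> List[Vector]:
    """Average the token vectors inside each half-open [start, end) span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            continue  # skip empty spans
        dims = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dims)])
    return pooled


# Four toy token vectors; the first span covers tokens 1-2, the second token 3.
tokens = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [10.0, 10.0]]
spans = [(1, 3), (3, 4)]
print(mean_pool_spans(tokens, spans))
```

Each span yields one vector, so pooling turns a token-level matrix into one embedding per sentence chunk.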


In [184]:
class ElasticsearchVectorStore:
    """Elasticsearch vector store for ColBERT multi-vector embeddings."""
    
    def __init__(self, index_name: str, embedding_config: Dict[str, Any], es_config: Dict[str, Any] = None):
        self.index_name = index_name
        self.embedding_config = embedding_config
        
        # Use provided Elasticsearch configuration or default local setup
        if es_config:
            self.client = Elasticsearch([
                {
                    'host': es_config.get('host', 'localhost'),
                    'port': es_config.get('port', 9200),
                    'scheme': es_config.get('scheme', 'http')
                }
            ])
        else:
            # Default Elasticsearch connection (local)
            self.client = Elasticsearch([
                {'host': 'localhost', 'port': 9200, 'scheme': 'http'}
            ])
        
        # Test connection - fail if Elasticsearch is not available
        try:
            info = self.client.info()
            logger.info(f"Connected to Elasticsearch: {info['version']['number']}")
        except Exception as e:
            logger.error(f"Cannot connect to Elasticsearch: {e}")
            logger.error("Please ensure Elasticsearch is running before proceeding.")
            logger.info("To start Elasticsearch locally:")
            logger.info("   - Docker: docker run -p 9200:9200 -e 'discovery.type=single-node' -e 'xpack.security.enabled=false' docker.elastic.co/elasticsearch/elasticsearch:8.11.0")
            logger.info("   - Or download and run from: https://www.elastic.co/downloads/elasticsearch")
            raise ConnectionError(f"Elasticsearch connection failed: {e}") from e
            
        self._create_index_if_not_exists()
    
    def _create_index_if_not_exists(self):
        """Create index with proper mapping for ColBERT multi-vector search."""
        # For ColBERT, we'll store each token vector as a separate document
        # with references back to the original chunk
        mapping = {
            "mappings": {
                "properties": {
                    "content": {"type": "text", "analyzer": "standard"},
                    "embedding": {  # Single vector per document
                        "type": "dense_vector",
                        "dims": self.embedding_config.get("embedding_dimensions", 128),
                        "similarity": "cosine"
                    },
                    "chunk_id": {"type": "keyword"},
                    "token_index": {"type": "integer"},  # Which token vector this is
                    "total_tokens": {"type": "integer"},  # Total tokens in the chunk
                    "metadata": {"type": "object"},
                    "timestamp": {"type": "date"}
                }
            }
        }
        
        try:
            if not self.client.indices.exists(index=self.index_name):
                self.client.indices.create(index=self.index_name, body=mapping)
                logger.info(f"Created index: {self.index_name}")
            else:
                logger.info(f"Index already exists: {self.index_name}")
        except Exception as e:
            logger.error(f"Error creating index: {e}")
            raise
    
    def count_documents(self) -> int:
        """Count documents in index."""
        try:
            result = self.client.count(index=self.index_name)
            return result['count']
        except Exception as e:
            logger.error(f"Error counting documents: {e}")
            return 0
    
    def add_chunks_batch(self, chunks: List[Chunk], embeddings: List[List[List[float]]]):
        """Add chunks with multi-vector embeddings in batch.
        
        For ColBERT, each token vector becomes a separate document in Elasticsearch.
        """
        docs = []
        
        for chunk, token_vectors in zip(chunks, embeddings):
            if not token_vectors:
                continue
                
            # Create one document per token vector
            for token_idx, vector in enumerate(token_vectors):
                doc = {
                    "_index": self.index_name,
                    "_id": f"{chunk.id}_token_{token_idx}",  # Unique ID per token vector
                    "_source": {
                        "content": chunk.content,
                        "embedding": vector,  # Single vector per document
                        "chunk_id": chunk.id,
                        "token_index": token_idx,
                        "total_tokens": len(token_vectors),
                        "metadata": chunk.metadata,
                        "timestamp": datetime.now()
                    }
                }
                docs.append(doc)
        
        # Bulk index
        try:
            success, failed = bulk(self.client, docs)
            logger.info(f"Indexed {success} token vectors, {len(failed)} failed")
            if failed:
                logger.warning(f"First failed document: {failed[0] if failed else 'None'}")
        except Exception as e:
            logger.error(f"Bulk indexing error: {e}")
            raise
    
    def delete_index(self):
        """Delete the index."""
        try:
            if self.client.indices.exists(index=self.index_name):
                self.client.indices.delete(index=self.index_name)
                logger.info(f"Deleted index: {self.index_name}")
            else:
                logger.info(f"Index does not exist: {self.index_name}")
        except Exception as e:
            logger.error(f"Error deleting index: {e}")
            raise

# Initialize configuration
model_config = ModelConfig()
pipeline_config = {
    "vector_store_provider": "elasticsearch",
    "chunks_file_name": "corpus.jsonl",
    "retrieval_top_k": 10,
    "truncate_chunk_size": 8192,  # Updated for ColBERT v2
    "use_reranker": False,
    "batch_size": 16,  # Smaller batch size for multi-vector embeddings
    "enable_late_chunking": True  # Always use Late Chunking approach
}

# Elasticsearch configuration (customize as needed)
es_config = {
    "host": "localhost",
    "port": 9200,
    "scheme": "http"
}

# Validate Jina API key
if jinaai_key == "your-jina-api-key-here" or not jinaai_key:
    logger.warning("Jina API key not configured! Please set JINA_API_KEY environment variable")
    logger.info("You can get an API key from: https://jina.ai/")
else:
    logger.info("Jina API key configured")
print("Elasticsearch configuration ready for ColBERT v2 with Late Chunking!")
print(f"Model: {model_config.provider_model_id}")
print(f"Embedding dimensions: {model_config.embedding_dimensions}")
print(f"Max tokens: {model_config.max_tokens}")
print(f"API endpoint: {model_config.api_endpoint}")
print(f"Elasticsearch: {es_config['host']}:{es_config['port']}")
print("Note: Late Chunking enabled for context-aware embeddings")
print("Note: ColBERT multi-vectors stored as separate documents per token")

INFO:__main__:Jina API key configured


Elasticsearch configuration ready for ColBERT v2 with Late Chunking!
Model: jina-colbert-v2
Embedding dimensions: 128
Max tokens: 8192
API endpoint: https://api.jina.ai/v1/multi-vector
Elasticsearch: localhost:9200
Note: Late Chunking enabled for context-aware embeddings
Note: ColBERT multi-vectors stored as separate documents per token
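
Because each token vector is stored as its own document, a query presumably has to run one vector search per query token and then regroup the hits by `chunk_id`. The sketch below covers only that regrouping step. It assumes each hit has already been reduced to a dict with `chunk_id` and `score` keys, which is a hypothetical shape rather than the raw Elasticsearch response, and it approximates MaxSim because a chunk that is missing from one token's hit list contributes nothing for that token.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_token_hits(
    hits_per_query_token: List[List[Dict]], top_k: int = 10
) -> List[Tuple[str, float]]:
    """Sum, per chunk, the best score achieved for each query token."""
    totals = defaultdict(float)
    for hits in hits_per_query_token:
        best = {}  # best score per chunk for this query token
        for hit in hits:
            chunk_id = hit["chunk_id"]
            best[chunk_id] = max(best.get(chunk_id, float("-inf")), hit["score"])
        for chunk_id, score in best.items():
            totals[chunk_id] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hits for two query tokens (toy scores).
hits = [
    [{"chunk_id": "a", "score": 0.9}, {"chunk_id": "a", "score": 0.5}, {"chunk_id": "b", "score": 0.7}],
    [{"chunk_id": "b", "score": 0.8}, {"chunk_id": "a", "score": 0.2}],
]
ranked = aggregate_token_hits(hits)
print(ranked)
```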


## 3. Setup Real Elasticsearch Connection

Configure and connect to Elasticsearch. This pipeline requires a running Elasticsearch instance for vector storage and retrieval.

**Connection Requirements:**
- Elasticsearch 8.0+ running on localhost:9200 (or configure custom host/port); the `dense_vector` mapping used here sets `similarity`, which 7.x does not accept
- Network accessibility to the Elasticsearch cluster
- Sufficient memory for dense vector storage

**Configuration Options:**
- **Local Development**: Default localhost:9200
- **Docker**: Use the provided Docker command in the prerequisites
- **Cloud**: Configure es_config with your cloud Elasticsearch endpoints
- **Custom**: Modify es_config dictionary with your specific settings

## 4. Create Mock Data Structure

Generate sample JSONL data in the same format as corpus.jsonl for testing the pipeline. This data simulates the documents that will be indexed.

In [185]:
# Sample data that simulates the corpus.jsonl format
sample_texts = [
    "Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed.",
    "Natural language processing involves the interaction between computers and human language, enabling machines to understand text.",
    "Deep learning uses neural networks with multiple layers to solve complex problems and recognize patterns in data.",
    "Computer vision allows machines to interpret and understand visual information from the world around them.",
    "Elasticsearch is a distributed search and analytics engine built on Apache Lucene for full-text search capabilities.",
    "Vector databases store and search high-dimensional vectors efficiently, enabling semantic search and similarity matching.",
    "Transformer models have revolutionized natural language processing with their attention mechanism and parallel processing.",
    "Embedding models convert text into numerical representations that capture semantic meaning and context.",
    "Retrieval-augmented generation combines information retrieval with language generation for improved AI responses.",
    "Semantic search goes beyond keyword matching to understand the meaning and intent behind search queries."
]

def create_mock_jsonl_data(texts: List[str], dataset_name: str = "sample_dataset") -> List[Dict[str, Any]]:
    """Create mock JSONL data similar to corpus format."""
    mock_data = []
    
    for i, text in enumerate(texts):
        doc = {
            "_id": f"{dataset_name}_chunk_{i:04d}",
            "text": text,
            "title": f"Document {i+1}",
            "source": f"sample_source_{i+1}",
            "category": "technology",
            "chunk_index": i,
            "word_count": len(text.split()),
            "char_count": len(text)
        }
        mock_data.append(doc)
    
    return mock_data

# Generate mock data
mock_jsonl_data = create_mock_jsonl_data(sample_texts)

print(f"Generated {len(mock_jsonl_data)} mock documents")
print("\nSample document:")
print(json.dumps(mock_jsonl_data[0], indent=2))

# Save to temporary file for processing simulation
temp_jsonl_file = "/tmp/sample_corpus.jsonl"
with open(temp_jsonl_file, 'w') as f:
    for doc in mock_jsonl_data:
        f.write(json.dumps(doc) + '\n')

print(f"\nSaved mock data to: {temp_jsonl_file}")

Generated 10 mock documents

Sample document:
{
  "_id": "sample_dataset_chunk_0000",
  "text": "Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed.",
  "title": "Document 1",
  "source": "sample_source_1",
  "category": "technology",
  "chunk_index": 0,
  "word_count": 17,
  "char_count": 124
}

Saved mock data to: /tmp/sample_corpus.jsonl
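
JSONL stores one JSON object per line, which is why the file can be read back line by line. As a small self-contained illustration, the sketch below writes and reloads a one-record file; `write_jsonl`, `read_jsonl`, and the demo record are hypothetical helpers, and a temporary directory is used instead of a fixed `/tmp` path so it also runs on systems without one.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def write_jsonl(records: List[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"_id": "demo_chunk_0000", "text": "Hello world.", "chunk_index": 0}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.jsonl")
    write_jsonl(records, path)
    loaded = read_jsonl(path)

print(loaded == records)
```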


## 5. Initialize Tokenizer and Embedding Model

Set up the tokenizer and embedding model for text processing and embedding generation, as in the original pipeline.

In [186]:
# Initialize Enhanced Jina ColBERT v2 Embedding Pipeline with Late Chunking
print("Initializing Jina ColBERT v2 Pipeline with Late Chunking...")

# Ensure Late Chunking is enabled
pipeline_config["enable_late_chunking"] = True

# Initialize pipeline with Late Chunking enabled
pipeline_simulator = JinaColBERTEmbeddingPipeline(model_config, pipeline_config)

try:
    pipeline_simulator.initialize_tokenizer_and_model()
    print("Enhanced Jina ColBERT v2 pipeline initialized!")
    print(f"Max tokens: {model_config.max_tokens}")
    print(f"Late chunking enabled: {pipeline_simulator.late_chunking_enabled}")
    
    if pipeline_simulator.late_chunking_enabled:
        print(f"Local model loaded: {pipeline_simulator.local_model is not None}")
        print(f"Tokenizer: {type(pipeline_simulator.tokenizer).__name__}")
        print("📋 Current Mode: Late Chunking (Context-Aware)")
    else:
        print("📋 Current Mode: Regular ColBERT")
    
    # Test basic embedding generation
    test_text = "Machine learning enables intelligent systems."
    test_embedding = pipeline_simulator.get_single_embedding(test_text, input_type="document")
    
    if test_embedding:
        # test_embedding is known to be non-empty inside this branch
        print(f"ColBERT embedding shape: ({len(test_embedding)}, {len(test_embedding[0])})")
        print(f"Number of token vectors: {len(test_embedding)}")
        print(f"Vector dimension: {len(test_embedding[0])}")
        print(f"Sample values from first vector: {test_embedding[0][:5]}")
    else:
        print("Failed to generate test embedding")
        
except Exception as e:
    print(f"Error initializing pipeline: {e}")
    raise

print("\nPipeline ready for Late Chunking processing!")

Initializing Jina ColBERT v2 Pipeline with Late Chunking...


INFO:__main__:Loaded local Jina model for late chunking
INFO:__main__:Initialized tokenizer for text processing
INFO:__main__:Using Jina ColBERT v2 API for embeddings
INFO:__main__:Late chunking enabled: True
INFO:__main__:Vocab size: 30528


Enhanced Jina ColBERT v2 pipeline initialized!
Max tokens: 8192
Late chunking enabled: True
Local model loaded: True
Tokenizer: BertTokenizerFast
📋 Current Mode: Late Chunking (Context-Aware)
ColBERT embedding shape: (10, 128)
Number of token vectors: 10
Vector dimension: 128
Sample values from first vector: [0.14147949, 0.011352539, -0.19763184, -0.22631836, 0.12420654]

Pipeline ready for Late Chunking processing!


## 6. Process and Prepare Documents

Load and process documents from the mock JSONL data, handle text truncation, and prepare metadata for indexing, as in the original pipeline.
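
The truncation step relies on `JinaColBERTEmbeddingPipeline.truncate_to_token_limit`, whose body is not shown in this section. As a rough sketch of what token-based truncation generally involves, the example below encodes the text, keeps the first `max_tokens` tokens, and decodes them again. `WhitespaceTokenizer` and `truncate_by_tokens` are hypothetical stand-ins; the real pipeline uses a Hugging Face tokenizer whose tokens are subwords, not words.

```python
from typing import List


class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer: one token per word."""

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        return " ".join(tokens)


def truncate_by_tokens(text: str, tokenizer, max_tokens: int) -> str:
    """Return the text unchanged if it fits, otherwise keep the first max_tokens tokens."""
    tokens = tokenizer.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return tokenizer.decode(tokens[:max_tokens])


tokenizer = WhitespaceTokenizer()
print(truncate_by_tokens("late chunking keeps document context", tokenizer, 3))
```

Counting tokens with the same tokenizer that the model uses is what keeps the limit meaningful; a word count is only an approximation.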

In [187]:
def process_jsonl_documents_with_late_chunking(file_path: str, pipeline_simulator: JinaColBERTEmbeddingPipeline) -> List[LateChunk]:
    """Process JSONL documents with PROPER Late Chunking for enhanced context-aware embeddings."""
    late_chunks = []
    
    logger.info(f"Processing documents with PROPER Late Chunking from: {file_path}")
    
    with open(file_path, "r") as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line.strip())
                
                # Extract metadata
                metadata = {}
                for k, v in data.items():
                    if k == "text":
                        continue
                    if k == "_id":
                        metadata["chunk_id"] = v
                    else:
                        metadata[k] = v
                
                # Get text content
                original_text = data.get("text", "")
                
                if not original_text.strip():
                    continue
                
                # Apply truncation if configured
                truncate_chunk_size = pipeline_simulator.pipeline_config.get("truncate_chunk_size")
                if truncate_chunk_size is not None:
                    original_length = len(original_text.split())
                    original_text = JinaColBERTEmbeddingPipeline.truncate_to_token_limit(
                        text=original_text, 
                        tokenizer=pipeline_simulator.tokenizer, 
                        max_tokens=truncate_chunk_size
                    )
                    new_length = len(original_text.split())
                    if original_length != new_length:
                        logger.debug(f"🔄 Truncated doc {line_num}: {original_length} → {new_length} words")
                
                # TRUE Late Chunking: Use sentence-level boundaries but preserve context
                try:
                    # Use late chunking to create context-aware chunks
                    chunks, span_annotations = LateChunkingUtils.chunk_by_sentences(
                        original_text, pipeline_simulator.tokenizer
                    )
                    
                    # Create LateChunk objects for each chunk
                    for i, (chunk_text, span) in enumerate(zip(chunks, span_annotations)):
                        if chunk_text.strip():  # Only process non-empty chunks
                            late_chunk = LateChunk(
                                content=chunk_text.strip(),
                                id=f"{metadata.get('chunk_id', f'doc_{line_num}')}_{i}",
                                metadata={
                                    **metadata,
                                    "original_chunk_id": metadata.get('chunk_id', f'doc_{line_num}'),
                                    "late_chunk_index": i,
                                    "total_chunks": len(chunks),
                                    "original_text_length": len(original_text),
                                    "chunk_text_length": len(chunk_text),
                                    "context_aware": True  # Mark as context-aware
                                },
                                original_text=original_text,
                                span_annotation=span,
                                chunk_index=i,
                                context_aware=True
                            )
                            late_chunks.append(late_chunk)
                    
                    logger.debug(f"Doc {line_num}: Created {len(chunks)} late chunks with context")
                    
                except Exception as e:
                    logger.warning(f"Late chunking failed for doc {line_num}: {e}")
                    # Fallback to traditional chunking
                    late_chunk = LateChunk(
                        content=original_text,
                        id=metadata.get('chunk_id', f'doc_{line_num}'),
                        metadata={**metadata, "context_aware": False},
                        original_text=original_text,
                        span_annotation=(0, len(original_text.split())),
                        chunk_index=0,
                        context_aware=False
                    )
                    late_chunks.append(late_chunk)
                
            except json.JSONDecodeError as e:
                logger.error(f"JSON decode error at line {line_num}: {e}")
            except Exception as e:
                logger.error(f"Error processing line {line_num}: {e}")
    
    logger.info(f"Processed {len(late_chunks)} late chunks")
    context_aware_count = sum(1 for chunk in late_chunks if chunk.context_aware)
    logger.info(f"Context-aware chunks: {context_aware_count}/{len(late_chunks)}")
    
    return late_chunks

def process_jsonl_documents(file_path: str, pipeline_simulator: JinaColBERTEmbeddingPipeline) -> List[Document]:
    """Process JSONL documents for regular ColBERT pipeline."""
    documents = []
    
    logger.info(f"Processing documents from: {file_path}")
    
    with open(file_path, "r") as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line.strip())
                
                # Extract metadata (exclude 'text' field)
                metadata = {}
                for k, v in data.items():
                    if k == "text":
                        continue
                    if k == "_id":
                        metadata["chunk_id"] = v
                    else:
                        metadata[k] = v
                
                # Get text content
                text = data.get("text", "")
                
                # Apply truncation if configured (ColBERT v2 supports up to 8192 tokens)
                truncate_chunk_size = pipeline_simulator.pipeline_config.get("truncate_chunk_size")
                if truncate_chunk_size is not None:
                    original_length = len(text.split())
                    text = JinaColBERTEmbeddingPipeline.truncate_to_token_limit(
                        text=text, 
                        tokenizer=pipeline_simulator.tokenizer, 
                        max_tokens=truncate_chunk_size
                    )
                    new_length = len(text.split())
                    if original_length != new_length:
                        logger.debug(f"Truncated doc {line_num}: {original_length} -> {new_length} words")
                
                # Apply text size truncation if configured
                truncate_text_size = pipeline_simulator.pipeline_config.get("truncate_text_size")
                if truncate_text_size is not None:
                    text = text[:truncate_text_size]  # slicing is a no-op when the text is already short enough
                
                # Create document
                document = Document(
                    page_content=text,
                    metadata=metadata
                )
                documents.append(document)
                
            except json.JSONDecodeError as e:
                logger.error(f"JSON decode error at line {line_num}: {e}")
            except Exception as e:
                logger.error(f"Error processing line {line_num}: {e}")
    
    logger.info(f"Processed {len(documents)} documents")
    return documents

def filter_complex_metadata(documents: List[Document]) -> List[Document]:
    """Filter complex metadata - simplified version."""
    filtered_docs = []
    for doc in documents:
        # Simple filtering - remove any non-serializable metadata
        filtered_metadata = {}
        for k, v in doc.metadata.items():
            if isinstance(v, (str, int, float, bool, type(None))):
                filtered_metadata[k] = v
        
        filtered_doc = Document(
            page_content=doc.page_content,
            metadata=filtered_metadata
        )
        filtered_docs.append(filtered_doc)
    
    return filtered_docs

def documents_to_chunks(documents: List[Document]) -> List[Chunk]:
    """Convert documents to chunks format."""
    chunks = []
    for doc in documents:
        chunk = Chunk(
            content=doc.page_content,
            id=doc.metadata["chunk_id"],
            metadata=doc.metadata
        )
        chunks.append(chunk)
    
    return chunks

def late_chunks_to_chunks(late_chunks: List[LateChunk]) -> List[Chunk]:
    """Convert LateChunk objects to regular Chunk objects for compatibility."""
    chunks = []
    for late_chunk in late_chunks:
        chunk = Chunk(
            content=late_chunk.content,
            id=late_chunk.id,
            metadata=late_chunk.metadata
        )
        chunks.append(chunk)
    
    return chunks

# Process the mock JSONL data based on Late Chunking setting
if pipeline_simulator.late_chunking_enabled:
    print("Processing documents with TRUE Late Chunking...")
    late_chunks = process_jsonl_documents_with_late_chunking(temp_jsonl_file, pipeline_simulator)
    chunks = late_chunks_to_chunks(late_chunks)
    
    print(f"Created {len(chunks)} context-aware chunks using Late Chunking")
    print(f"\nSample Late Chunk:")
    sample_late_chunk = late_chunks[0]
    print(f"ID: {sample_late_chunk.id}")
    print(f"Content: {sample_late_chunk.content[:100]}...")
    print(f"Context-aware: {sample_late_chunk.context_aware}")
    print(f"Chunk index: {sample_late_chunk.chunk_index}")
    print(f"Original text length: {len(sample_late_chunk.original_text)}")
    print(f"Span annotation: {sample_late_chunk.span_annotation}")
    print(f"Metadata keys: {list(sample_late_chunk.metadata.keys())}")
else:
    print("Processing documents with regular ColBERT chunking...")
    documents = process_jsonl_documents(temp_jsonl_file, pipeline_simulator)
    documents = filter_complex_metadata(documents)
    chunks = documents_to_chunks(documents)
    
    print(f"Created {len(chunks)} chunks using traditional chunking")
    print(f"\nSample Regular Chunk:")
    sample_chunk = chunks[0]
    print(f"ID: {sample_chunk.id}")
    print(f"Content: {sample_chunk.content[:100]}...")
    print(f"Metadata keys: {list(sample_chunk.metadata.keys())}")

print(f"\nProcessing method: {'Late Chunking (Context-Aware)' if pipeline_simulator.late_chunking_enabled else 'Regular ColBERT'}")
print(f"Total chunks created: {len(chunks)}")

INFO:__main__:Processing documents with PROPER Late Chunking from: /tmp/sample_corpus.jsonl
INFO:__main__:Processed 10 late chunks
INFO:__main__:Context-aware chunks: 10/10


Processing documents with TRUE Late Chunking...
Created 10 context-aware chunks using Late Chunking

Sample Late Chunk:
ID: sample_dataset_chunk_0000_0
Content: machine learning is a subset of artificial intelligence that enables computers to learn without bein...
Context-aware: True
Chunk index: 0
Original text length: 124
Span annotation: (1, 18)
Metadata keys: ['chunk_id', 'title', 'source', 'category', 'chunk_index', 'word_count', 'char_count', 'original_chunk_id', 'late_chunk_index', 'total_chunks', 'original_text_length', 'chunk_text_length', 'context_aware']

Processing method: Late Chunking (Context-Aware)
Total chunks created: 10


## 7. Generate Document Embeddings with Late Chunking

Generate context-aware embeddings for all document chunks using the Late Chunking approach. This technique encodes the entire document in one pass to preserve context, then segments the resulting token embeddings for chunk-level storage.
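
The segmentation step can be sketched in isolation. This is a minimal illustration, not the notebook's `get_late_chunking_embeddings` implementation: the helper name `segment_token_embeddings` is hypothetical, the vectors are made-up values, and span annotations are assumed to be half-open `(start, end)` token ranges.

```python
from typing import List, Tuple


def segment_token_embeddings(
    token_embeddings: List[List[float]],
    span_annotations: List[Tuple[int, int]],
) -> List[List[List[float]]]:
    """Slice document-level token embeddings into per-chunk groups.

    Assumption: each span is a half-open (start, end) token range.
    """
    chunk_embeddings = []
    for start, end in span_annotations:
        # Clamp to the valid range so an out-of-bounds span cannot raise
        start = max(0, start)
        end = min(len(token_embeddings), end)
        chunk_embeddings.append(token_embeddings[start:end])
    return chunk_embeddings


# Toy document: 6 tokens with 2-dimensional vectors (made-up values)
doc_token_embeddings = [[float(i), float(i) + 0.5] for i in range(6)]
spans = [(0, 2), (2, 6)]  # two chunks covering tokens 0-1 and 2-5

chunk_vectors = segment_token_embeddings(doc_token_embeddings, spans)
print([len(c) for c in chunk_vectors])  # [2, 4]
```

Because every token vector was computed while the model attended to the full document, each slice carries document-level context even though it is stored per chunk.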

In [188]:
# Generate ColBERT multi-vector embeddings with Late Chunking
logger.info("Generating ColBERT multi-vector embeddings with Late Chunking...")
start_time = time.time()

logger.info("Using TRUE Late Chunking for context-aware embeddings...")

# Group late chunks by original document for late chunking
doc_groups = {}
for chunk in chunks:
    # Get original document ID from metadata; fall back to the chunk ID
    # without its trailing "_<index>" suffix
    original_id = chunk.metadata.get('original_chunk_id', chunk.id.rsplit('_', 1)[0])
    if original_id not in doc_groups:
        doc_groups[original_id] = []
    doc_groups[original_id].append(chunk)

logger.info(f"Processing {len(doc_groups)} original documents with late chunking")

all_embeddings = []
doc_items = list(doc_groups.items())
batch_size = 2  # Process fewer documents at a time for late chunking

for i in tqdm(range(0, len(doc_items), batch_size), desc="Late Chunking Embeddings"):
    batch_docs = doc_items[i:i + batch_size]
    
    try:
        for doc_id, doc_chunks in batch_docs:
            # Prefer the original text stored in metadata; otherwise
            # reconstruct it by joining the chunk contents
            original_text = doc_chunks[0].metadata.get('original_text')
            if not original_text:
                original_text = ' '.join(chunk.content for chunk in doc_chunks)
            
            try:
                # Use TRUE late chunking: encode full document once, then segment
                late_embeddings, chunks_texts, span_annotations = pipeline_simulator.get_late_chunking_embeddings(original_text)
                
                # Map late chunk embeddings to our chunk objects
                for j, chunk in enumerate(doc_chunks):
                    if j < len(late_embeddings):
                        all_embeddings.append(late_embeddings[j])
                    else:
                        # If we have more chunks than embeddings, duplicate the last one
                        if late_embeddings:
                            all_embeddings.append(late_embeddings[-1])
                        else:
                            all_embeddings.append([])
                
                logger.debug(f"Late chunking successful for doc {doc_id}: {len(late_embeddings)} context-aware embeddings")
                
            except Exception as late_error:
                logger.warning(f"Late chunking failed for doc {doc_id}: {late_error}")
                # Fallback to individual chunk embeddings
                for chunk in doc_chunks:
                    try:
                        fallback_embedding = pipeline_simulator.get_single_embedding(chunk.content, "document")
                        all_embeddings.append(fallback_embedding if fallback_embedding else [])
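                    except Exception as fallback_error:
                        logger.warning(f"Fallback embedding failed for chunk {chunk.id}: {fallback_error}")
                        all_embeddings.append([])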
        
        # Add delay to respect API rate limits
        time.sleep(0.5)
        
    except Exception as e:
        logger.error(f"Error processing late chunking batch {i//batch_size + 1}: {e}")
        # Add empty embeddings as placeholders
        for doc_id, doc_chunks in batch_docs:
            for _ in doc_chunks:
                all_embeddings.append([])

embedding_time = time.time() - start_time

print(f"Generated {len(all_embeddings)} ColBERT multi-vector embeddings")
print(f"⏱Embedding generation took: {embedding_time:.2f} seconds")
print(f"Average time per document: {embedding_time/len(chunks):.3f} seconds")
print(f"Method: TRUE Late Chunking (Context-Aware)")

# Validate ColBERT embeddings
if all_embeddings and all_embeddings[0]:
    first_embedding = all_embeddings[0]
    token_count = len(first_embedding)
    vector_dim = len(first_embedding[0]) if first_embedding else 0
    
    print(f"\nColBERT embedding structure:")
    print(f"├── Number of token vectors: {token_count}")
    print(f"├── Vector dimension: {vector_dim}")
    print(f"└── Total parameters per document: {token_count * vector_dim}")
    
    # Show stats for all embeddings
    token_counts = [len(emb) for emb in all_embeddings if emb]
    if token_counts:
        print(f"\nToken count statistics across all documents:")
        print(f"├── Min tokens: {min(token_counts)}")
        print(f"├── Max tokens: {max(token_counts)}")
        print(f"├── Average tokens: {sum(token_counts)/len(token_counts):.1f}")
        print(f"└── Total embeddings: {len(token_counts)}")
    
    # Sample values from first embedding
    print(f"\nSample values from first token vector: {first_embedding[0][:5]}")
    
    # Check for consistent dimensions
    dims_consistent = all(
        len(emb[0]) == vector_dim for emb in all_embeddings 
        if emb and len(emb) > 0
    )
    print(f"All token vectors have consistent dimensions: {dims_consistent}")
    
    # Late Chunking specific statistics
    context_aware_embeddings = len([emb for emb in all_embeddings if emb])
    print(f"\nLate Chunking Statistics:")
    print(f"├── Context-aware embeddings: {context_aware_embeddings}")
    print(f"├── Failed embeddings: {len(all_embeddings) - context_aware_embeddings}")
    print(f"└── Success rate: {(context_aware_embeddings/len(all_embeddings)*100):.1f}%")
    
    # Show the key difference: context preservation
    print(f"\nLate Chunking Benefits:")
    print(f"├── Documents encoded as complete sequences for context")
    print(f"├── Token embeddings segmented after contextualization")
    print(f"└── Each chunk retains full document context information")
else:
    print("No valid embeddings generated")

print(f"\nReady for Elasticsearch indexing with context-aware ColBERT embeddings!")

INFO:__main__:Generating ColBERT multi-vector embeddings with Late Chunking...
INFO:__main__:Using TRUE Late Chunking for context-aware embeddings...
INFO:__main__:Processing 10 original documents with late chunking
Late Chunking Embeddings:   0%|          | 0/5 [00:00<?, ?it/s]
INFO:__main__:Late chunking: 1 context-aware chunks from single encoding
INFO:__main__:Late chunking: 1 context-aware chunks from single encoding
Late Chunking Embeddings:  20%|██        | 1/5 [00:00<00:03,  1.10it/s]
INFO:__main__:Late chunking: 1 context-aware chunks from single encoding
INFO:__main__:Late chunking: 1 context-aware chunks from single encoding
Late Chunking Embeddings:  40%|████      | 2/5 [00:01<00:02,  1.35it/s]
INFO:__main__:Late chunking: 1 context-aware chunks from single encoding
INFO:__main__:Late chunking: 1 context-aware chunks from single encoding
Late Chunking Embeddings:  60%|██████    | 3/5 [00:02<00:01,  1.43it/s]
INFO:__main__:Late chunking: 1 context-aware chunks from single encodi

Generated 10 ColBERT multi-vector embeddings
⏱Embedding generation took: 3.47 seconds
Average time per document: 0.347 seconds
Method: TRUE Late Chunking (Context-Aware)

ColBERT embedding structure:
├── Number of token vectors: 1
├── Vector dimension: 20
└── Total parameters per document: 20

Token count statistics across all documents:
├── Min tokens: 1
├── Max tokens: 1
├── Average tokens: 1.0
└── Total embeddings: 10

Sample values from first token vector: [[0.0, 0.0, 0.0, 0.0, 0.0, ... (output truncated)




## 8. Index Documents to Elasticsearch

Batch-index the processed documents and their embeddings into Elasticsearch, handling existing indices and errors as in the original pipeline.
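
As a rough sketch of what a batch might look like on the wire: the internals of `ElasticsearchVectorStore.add_chunks_batch` are not shown in this section, so the following only assumes that each token vector is stored as its own document with the fields the search cell later reads (`chunk_id`, `embedding`, `content`, `metadata`, `token_index`, `total_tokens`). The helper name `build_token_actions`, the `_id` scheme, and the index name `colbert_demo` are hypothetical.

```python
from typing import Any, Dict, Iterator, List


def build_token_actions(
    index_name: str,
    chunk_id: str,
    content: str,
    metadata: Dict[str, Any],
    token_vectors: List[List[float]],
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per token vector of a chunk."""
    total = len(token_vectors)
    for token_index, vector in enumerate(token_vectors):
        yield {
            "_index": index_name,
            # Hypothetical deterministic ID: re-indexing overwrites instead of duplicating
            "_id": f"{chunk_id}_tok{token_index}",
            "_source": {
                "chunk_id": chunk_id,
                "content": content,
                "metadata": metadata,
                "embedding": vector,
                "token_index": token_index,
                "total_tokens": total,
            },
        }


actions = list(build_token_actions(
    index_name="colbert_demo",
    chunk_id="sample_dataset_chunk_0000",
    content="machine learning is a subset of artificial intelligence",
    metadata={"title": "demo"},
    token_vectors=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
))
print(len(actions), actions[0]["_id"])
```

Actions in this shape can be passed to `elasticsearch.helpers.bulk(client, actions)`, which the notebook already imports. With one document per token vector, the index document count grows with the number of tokens rather than the number of chunks.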

In [190]:
# Create vector store and index ColBERT multi-vector embeddings
dataset_name = "sample_dataset"
model_for_index_name = model_config.provider_model_id.replace("/", "-")
index_name = f"colbert_{dataset_name.lower()}_{model_for_index_name.lower()}"

logger.info(f"Creating ColBERT vector store with index: {index_name}")

try:
    # Initialize vector store for ColBERT embeddings with real Elasticsearch
    vector_store = ElasticsearchVectorStore(
        index_name=index_name,
        embedding_config=model_config.model_dump(),
        es_config=es_config
    )
    
    # Check if index already has data (like in original pipeline)
    existing_doc_count = vector_store.count_documents()
    logger.info(f"Existing documents in index: {existing_doc_count}")
    
    # Filter out chunks with empty embeddings
    valid_chunks = []
    valid_embeddings = []
    for chunk, embedding in zip(chunks, all_embeddings):
        if embedding and len(embedding) > 0:  # Check if embedding is not empty
            valid_chunks.append(chunk)
            valid_embeddings.append(embedding)
    
    logger.info(f"Valid chunks with embeddings: {len(valid_chunks)}/{len(chunks)}")
    
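    # Note: the search cell reads per-token fields (token_index, total_tokens),
    # which suggests the index holds one document per token vector. If so, this
    # count is in token vectors rather than chunks, and a stale index from an
    # earlier run will also satisfy the check and skip re-indexing.
    if existing_doc_count >= len(valid_chunks):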
        logger.info(f"⏭Index already has {existing_doc_count} documents. Skipping indexing.")
    else:
        logger.info(f"Starting batch indexing of {len(valid_chunks)} chunks with ColBERT embeddings...")
        
        # Batch indexing for ColBERT multi-vector embeddings
        batch_size = pipeline_config.get("batch_size", 8)  # Smaller batch for multi-vector
        start_time = time.time()
        
        successful_docs = 0
        for i in tqdm(range(0, len(valid_chunks), batch_size), desc="Indexing ColBERT batches"):
            batch_chunks = valid_chunks[i:i + batch_size]
            batch_embeddings = valid_embeddings[i:i + batch_size]
            
            try:
                # Add batch to vector store
                vector_store.add_chunks_batch(batch_chunks, batch_embeddings)
                successful_docs += len(batch_chunks)
                logger.debug(f"Indexed ColBERT batch {i//batch_size + 1}: {len(batch_chunks)} chunks")
                
            except Exception as e:
                logger.error(f"Error indexing ColBERT batch {i//batch_size + 1}: {e}")
                # Continue with next batch instead of failing completely
                continue
        
        indexing_time = time.time() - start_time
        
        # Verify indexing
        final_doc_count = vector_store.count_documents()
        logger.info(f"ColBERT indexing completed!")
        logger.info(f"Successfully indexed: {successful_docs}/{len(valid_chunks)} documents")
        logger.info(f"Final document count: {final_doc_count}")
        logger.info(f"Indexing took: {indexing_time:.2f} seconds")
        
        if successful_docs > 0:
            logger.info(f"Average indexing time per document: {indexing_time/successful_docs:.3f} seconds")
    
    # Store vector store for later use
    pipeline_simulator.vector_stores[dataset_name] = vector_store
    
    print(f"\nColBERT Pipeline Statistics:")
    print(f"├── Documents processed: {len(chunks)}")
    print(f"├── Valid embeddings generated: {len(valid_embeddings)}")
    print(f"├── Failed embeddings: {len(chunks) - len(valid_embeddings)}")  
    print(f"├── Index name: {index_name}")
    print(f"├── Final document count: {vector_store.count_documents()}")
    print(f"├── Vector store mode: Real Elasticsearch")
    print(f"└── Model: {model_config.provider_model_id} (ColBERT v2)")
    
    # Display embedding statistics
    if valid_embeddings:
        total_vectors = sum(len(emb) for emb in valid_embeddings)
        avg_vectors_per_doc = total_vectors / len(valid_embeddings)
        print(f"\nColBERT Embedding Statistics:")
        print(f"├── Total token vectors: {total_vectors:,}")
        print(f"├── Average vectors per document: {avg_vectors_per_doc:.1f}")
        print(f"├── Vector dimension: {model_config.embedding_dimensions}")
        print(f"└── Total parameters: {total_vectors * model_config.embedding_dimensions:,}")
    
except ConnectionError as e:
    logger.error(f"Failed to connect to Elasticsearch: {e}")
    logger.error("Please ensure Elasticsearch is running and accessible.")
    logger.info("💡 Quick start options:")
    logger.info("   1. Docker: docker run -d -p 9200:9200 -e 'discovery.type=single-node' docker.elastic.co/elasticsearch/elasticsearch:8.11.0")
    logger.info("   2. Local installation: https://www.elastic.co/downloads/elasticsearch")
    raise
except Exception as e:
    logger.error(f"Error during indexing: {e}")
    raise

INFO:__main__:Creating ColBERT vector store with index: colbert_sample_dataset_jina-colbert-v2


INFO:elastic_transport.transport:GET http://localhost:9200/ [status:200 duration:0.037s]
INFO:__main__:Connected to Elasticsearch: 9.0.3
INFO:elastic_transport.transport:HEAD http://localhost:9200/colbert_sample_dataset_jina-colbert-v2 [status:200 duration:0.017s]
INFO:__main__:Index already exists: colbert_sample_dataset_jina-colbert-v2
INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_count [status:200 duration:0.076s]
INFO:__main__:Existing documents in index: 244
INFO:__main__:Valid chunks with embeddings: 10/10
INFO:__main__:⏭Index already has 244 documents. Skipping indexing.
INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_count [status:200 duration:0.006s]



ColBERT Pipeline Statistics:
├── Documents processed: 10
├── Valid embeddings generated: 10
├── Failed embeddings: 0
├── Index name: colbert_sample_dataset_jina-colbert-v2
├── Final document count: 244
├── Vector store mode: Real Elasticsearch
└── Model: jina-colbert-v2 (ColBERT v2)

ColBERT Embedding Statistics:
├── Total token vectors: 10
├── Average vectors per document: 1.0
├── Vector dimension: 128
└── Total parameters: 1,280


## 9. Implement Semantic Search

Create search functionality that embeds the query text and runs a vector similarity search against the indexed documents.
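
Before the full retriever, here is a small worked example of late interaction (MaxSim) scoring on toy vectors. It is a sketch with made-up 2-dimensional vectors; the function names `cosine` and `maxsim` are illustrative and are not part of the notebook's classes.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def maxsim(
    query_vectors: List[List[float]],
    doc_vectors: List[List[float]],
    normalize: bool = True,
) -> float:
    """For each query token take its best document match, then sum or average."""
    if not query_vectors or not doc_vectors:
        return 0.0
    total = sum(max(cosine(q, d) for d in doc_vectors) for q in query_vectors)
    return total / len(query_vectors) if normalize else total


query = [[1.0, 0.0], [0.0, 1.0]]   # two query tokens
doc_a = [[1.0, 0.0], [1.0, 0.0]]   # matches only the first query token
doc_b = [[1.0, 0.0], [0.0, 1.0]]   # matches both query tokens

print(maxsim(query, doc_a))  # 0.5
print(maxsim(query, doc_b))  # 1.0
```

The document that covers both query tokens scores higher, which is the behaviour late interaction is meant to capture. Passing `normalize=False` gives the summed form used in the original ColBERT formulation.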

In [191]:
class ColBERTSemanticRetriever:
    """Enhanced ColBERT semantic retrieval class with late interaction similarity and Late Chunking support."""
    
    def __init__(self, vector_store: ElasticsearchVectorStore, top_k: int = 10):
        self.vector_store = vector_store
        self.top_k = top_k
    
    def cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """Calculate cosine similarity between two vectors."""
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        magnitude1 = sum(a * a for a in vec1) ** 0.5
        magnitude2 = sum(b * b for b in vec2) ** 0.5
        
        if magnitude1 == 0 or magnitude2 == 0:
            return 0.0
        
        return dot_product / (magnitude1 * magnitude2)
    
    def late_interaction_similarity(self, query_vectors: List[List[float]], doc_vectors: List[List[float]]) -> float:
        """
        Compute ColBERT late interaction (MaxSim) similarity.
        For each query token, find the best-matching document token, then
        average those maxima over the query tokens. The original ColBERT
        formulation sums them; averaging keeps the score within [-1, 1].
        """
        if not query_vectors or not doc_vectors:
            return 0.0
        
        total_score = 0.0
        
        # For each query token vector
        for q_vec in query_vectors:
            # Find the maximum similarity with any document token vector
            max_similarity = max(
                self.cosine_similarity(q_vec, d_vec) 
                for d_vec in doc_vectors
            )
            total_score += max_similarity
        
        # Normalize by query length so scores are comparable across queries
        normalized_score = total_score / len(query_vectors)
        
        return normalized_score
    
    def search(self, query_embedding: List[List[float]], include_metadata: bool = True) -> List[Dict[str, Any]]:
        """Search for similar documents using ColBERT late interaction with Late Chunking support."""
        try:
            if not query_embedding or not query_embedding[0]:
                return []
            
            # First, get candidate token vectors using first query vector
            search_body = {
                "query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": "Math.max(0, cosineSimilarity(params.query_vector, 'embedding') + 1.0)",
                            "params": {
                                "query_vector": query_embedding[0]  # Use first vector for initial search
                            }
                        }
                    }
                },
                "size": 1000,  # Get many token vectors for aggregation
                "_source": ["chunk_id", "embedding", "content", "metadata", "token_index", "total_tokens"]
            }
            
            response = self.vector_store.client.search(
                index=self.vector_store.index_name,
                body=search_body
            )
            
            # Group token vectors by chunk_id
            chunk_groups = {}
            for hit in response['hits']['hits']:
                source = hit['_source']
                chunk_id = source['chunk_id']
                
                if chunk_id not in chunk_groups:
                    chunk_groups[chunk_id] = {
                        'content': source['content'],
                        'metadata': source['metadata'],
                        'vectors': [],
                        'chunk_id': chunk_id
                    }
                
                chunk_groups[chunk_id]['vectors'].append(source['embedding'])
            
            # Compute late interaction scores for each chunk
            results = []
            for chunk_id, chunk_data in chunk_groups.items():
                doc_vectors = chunk_data['vectors']
                
                if not doc_vectors:
                    continue
                
                # Compute ColBERT late interaction score
                late_interaction_score = self.late_interaction_similarity(
                    query_embedding, doc_vectors
                )
                
                chunk = Chunk(
                    content=chunk_data['content'],
                    id=chunk_id,
                    metadata=chunk_data['metadata']
                )
                
                result = {
                    'chunk': chunk,
                    'score': late_interaction_score,
                    'late_interaction_score': late_interaction_score,
                    'token_count': len(doc_vectors)
                }
                
                # Add Late Chunking metadata if available
                if include_metadata:
                    metadata = chunk_data['metadata']
                    result.update({
                        'context_aware': metadata.get('context_aware', False),
                        'chunk_index': metadata.get('late_chunk_index', 0),
                        'original_text_length': metadata.get('original_text_length', 0)
                    })
                
                results.append(result)
            
            # Sort by late interaction score
            results.sort(key=lambda x: x['score'], reverse=True)
            return results[:self.top_k]
            
        except Exception as e:
            logger.error(f"Elasticsearch ColBERT search error: {e}")
            raise

# Initialize enhanced ColBERT semantic retriever
retriever = ColBERTSemanticRetriever(vector_store, top_k=pipeline_config.get("retrieval_top_k", 10))

# Enhanced test queries for Late Chunking evaluation
test_queries = [
    "What is machine learning and artificial intelligence?",
    "How does natural language processing work with computers?",
    "Tell me about deep learning neural networks and patterns",
    "What is vector search and semantic embeddings?",
    "Explain distributed search and analytics engines"
]

print("Testing Enhanced ColBERT Semantic Search with Late Chunking:")
print("=" * 70)

for i, query in enumerate(test_queries, 1):
    print(f"\n🔎 Query {i}: {query}")
    
    try:
        # Generate query embedding using ColBERT (query type)
        query_embedding = pipeline_simulator.get_single_embedding(query, input_type="query")
        
        if not query_embedding:
            print("Failed to generate query embedding")
            continue
        
        print(f"Query vectors: {len(query_embedding)} tokens")
        
        # Perform enhanced ColBERT search with late interaction
        search_start = time.time()
        results = retriever.search(query_embedding, include_metadata=True)
        search_time = time.time() - search_start
        
        print(f"⏱Search time: {search_time:.3f} seconds")
        print(f"Found {len(results)} results")
        
        # Analyze Late Chunking vs Traditional results
        context_aware_results = [r for r in results if r.get('context_aware', False)]
        traditional_results = [r for r in results if not r.get('context_aware', False)]
        
        if pipeline_simulator.late_chunking_enabled:
            print(f"Context-aware results: {len(context_aware_results)}")
            print(f"Traditional results: {len(traditional_results)}")
        
        # Display top 3 results with enhanced information
        for j, result in enumerate(results[:3], 1):
            chunk = result['chunk']
            score = result['score']
            context_aware = result.get('context_aware', False)
            chunk_index = result.get('chunk_index', 0)
            token_count = result.get('token_count', 0)
            
            print(f"\n   {j}. Late Interaction Score: {score:.4f}")
            print(f"      Token Count: {token_count}")
            print(f"      Context-Aware: {'✅' if context_aware else '❌'}")
            if context_aware:
                print(f"      Chunk Index: {chunk_index}")
                print(f"      Original Length: {result.get('original_text_length', 'N/A')} chars")
            print(f"      Content: {chunk.content[:100]}...")
            print(f"      Chunk ID: {chunk.id}")
            
    except Exception as e:
        print(f"Error processing query {i}: {e}")
        continue

print("\nEnhanced ColBERT semantic search with Late Chunking testing completed!")

Testing Enhanced ColBERT Semantic Search with Late Chunking:

🔎 Query 1: What is machine learning and artificial intelligence?
Query vectors: 32 tokens


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.892s]


⏱Search time: 1.082 seconds
Found 10 results
Context-aware results: 0
Traditional results: 10

   1. Late Interaction Score: 0.6881
      Token Count: 26
      Context-Aware: ❌
      Content: machine learning is a subset of artificial intelligence that enables computers to learn without bein...
      Chunk ID: sample_dataset_chunk_0000

   2. Late Interaction Score: 0.5795
      Token Count: 26
      Context-Aware: ❌
      Content: deep learning uses neural networks with multiple layers to solve complex problems and recognize patt...
      Chunk ID: sample_dataset_chunk_0002

   3. Late Interaction Score: 0.5720
      Token Count: 26
      Context-Aware: ❌
      Content: natural language processing involves the interaction between computers and human language, enabling ...
      Chunk ID: sample_dataset_chunk_0001

🔎 Query 2: How does natural language processing work with computers?
Query vectors: 32 tokens


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.264s]


⏱Search time: 0.395 seconds
Found 10 results
Context-aware results: 0
Traditional results: 10

   1. Late Interaction Score: 0.6777
      Token Count: 26
      Context-Aware: ❌
      Content: natural language processing involves the interaction between computers and human language, enabling ...
      Chunk ID: sample_dataset_chunk_0001

   2. Late Interaction Score: 0.5837
      Token Count: 22
      Context-Aware: ❌
      Content: transformer models have revolutionized natural language processing with their attention mechanism an...
      Chunk ID: sample_dataset_chunk_0006

   3. Late Interaction Score: 0.5535
      Token Count: 26
      Context-Aware: ❌
      Content: machine learning is a subset of artificial intelligence that enables computers to learn without bein...
      Chunk ID: sample_dataset_chunk_0000

🔎 Query 3: Tell me about deep learning neural networks and patterns


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.055s]


Query vectors: 32 tokens
⏱Search time: 0.344 seconds
Found 10 results
Context-aware results: 0
Traditional results: 10

   1. Late Interaction Score: 0.6581
      Token Count: 26
      Context-Aware: ❌
      Content: deep learning uses neural networks with multiple layers to solve complex problems and recognize patt...
      Chunk ID: sample_dataset_chunk_0002

   2. Late Interaction Score: 0.5334
      Token Count: 26
      Context-Aware: ❌
      Content: machine learning is a subset of artificial intelligence that enables computers to learn without bein...
      Chunk ID: sample_dataset_chunk_0000

   3. Late Interaction Score: 0.5106
      Token Count: 26
      Context-Aware: ❌
      Content: retrieval - augmented generation combines information retrieval with language generation for improve...
      Chunk ID: sample_dataset_chunk_0008

🔎 Query 4: What is vector search and semantic embeddings?


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.054s]


Query vectors: 32 tokens
⏱Search time: 0.191 seconds
Found 10 results
Context-aware results: 0
Traditional results: 10

   1. Late Interaction Score: 0.6367
      Token Count: 29
      Context-Aware: ❌
      Content: vector databases store and search high - dimensional vectors efficiently, enabling semantic search a...
      Chunk ID: sample_dataset_chunk_0005

   2. Late Interaction Score: 0.6135
      Token Count: 21
      Context-Aware: ❌
      Content: embedding models convert text into numerical representations that capture semantic meaning and conte...
      Chunk ID: sample_dataset_chunk_0007

   3. Late Interaction Score: 0.5941
      Token Count: 22
      Context-Aware: ❌
      Content: semantic search goes beyond keyword matching to understand the meaning and intent behind search quer...
      Chunk ID: sample_dataset_chunk_0009

🔎 Query 5: Explain distributed search and analytics engines


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.039s]


Query vectors: 32 tokens
⏱Search time: 0.173 seconds
Found 10 results
Context-aware results: 0
Traditional results: 10

   1. Late Interaction Score: 0.6261
      Token Count: 27
      Context-Aware: ❌
      Content: elasticsearch is a distributed search and analytics engine built on apache lucene for full - text se...
      Chunk ID: sample_dataset_chunk_0004

   2. Late Interaction Score: 0.5404
      Token Count: 22
      Context-Aware: ❌
      Content: semantic search goes beyond keyword matching to understand the meaning and intent behind search quer...
      Chunk ID: sample_dataset_chunk_0009

   3. Late Interaction Score: 0.5353
      Token Count: 26
      Context-Aware: ❌
      Content: retrieval - augmented generation combines information retrieval with language generation for improve...
      Chunk ID: sample_dataset_chunk_0008

Enhanced ColBERT semantic search with Late Chunking testing completed!


## 10. Test Retrieval and Reranking

Implement and test document retrieval with optional reranking, measuring retrieval accuracy and performance.
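
The test cell below records timings only. If relevance labels were available, accuracy could be measured along the following lines. This is a sketch under that assumption: the `relevant` set and the shortened chunk IDs are hypothetical, and the notebook's sample data does not ship with relevance judgments.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking for one query, with a single labelled relevant chunk
ranked = ["chunk_0002", "chunk_0000", "chunk_0008"]
relevant = {"chunk_0000"}

print(recall_at_k(ranked, relevant, k=1))  # 0.0
print(recall_at_k(ranked, relevant, k=3))  # 1.0
print(reciprocal_rank(ranked, relevant))   # 0.5
```

Computing these metrics once on the initial retrieval and once on the reranked list shows whether reranking actually improves the ordering.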

In [192]:
class ColBERTReranker:
    """Heuristic reranker for ColBERT results based on text features (keyword overlap, length, vocabulary diversity)."""
    
    def __init__(self):
        pass
    
    def rerank(self, query: str, chunks: List[Chunk], query_embedding: List[List[float]]) -> List[Dict[str, Any]]:
        """Rerank chunks using text-based features only; query_embedding is accepted but currently unused."""
        results = []
        
        for chunk in chunks:
            # Text-based scoring features
            query_words = set(query.lower().split())
            chunk_words = set(chunk.content.lower().split())
            
            # Keyword overlap score
            overlap_score = len(query_words.intersection(chunk_words)) / len(query_words) if query_words else 0
            
            # Length preference (prefer moderate length chunks)
            length_score = 1.0 / (1.0 + abs(len(chunk.content.split()) - 100) * 0.005)
            
            # Diversity score (prefer chunks with varied vocabulary)
            diversity_score = len(chunk_words) / len(chunk.content.split()) if chunk.content.split() else 0
            
            # Combined rerank score
            rerank_score = overlap_score * 0.5 + length_score * 0.3 + diversity_score * 0.2
            
            results.append({
                'chunk': chunk,
                'score': rerank_score,
                'overlap_score': overlap_score,
                'length_score': length_score,
                'diversity_score': diversity_score
            })
        
        # Sort by rerank score
        results.sort(key=lambda x: x['score'], reverse=True)
        return results

def comprehensive_colbert_retrieval_test():
    """Comprehensive test of the ColBERT retrieval pipeline."""
    
    print("🔬 Comprehensive ColBERT Retrieval Pipeline Test")
    print("=" * 70)
    
    # Test configuration
    test_queries = [
        "machine learning",
        "natural language processing text understanding", 
        "deep learning neural networks pattern recognition",
        "vector database similarity semantic search",
        "elasticsearch distributed search analytics engine"
    ]
    
    # Initialize ColBERT reranker
    reranker = ColBERTReranker()
    
    # Performance metrics
    total_embedding_time = 0
    total_search_time = 0
    total_rerank_time = 0
    
    for i, query in enumerate(test_queries, 1):
        print(f"\n🔎 Test Query {i}: '{query}'")
        print("-" * 50)
        
        try:
            # Step 1: Generate query embedding (ColBERT multi-vector)
            embed_start = time.time()
            query_embedding = pipeline_simulator.get_single_embedding(query, input_type="query")
            embed_time = time.time() - embed_start
            total_embedding_time += embed_time
            
            if not query_embedding:
                print("Failed to generate query embedding")
                continue
            
            print(f"Query embedding: {len(query_embedding)} token vectors")
            
            # Step 2: Initial ColBERT retrieval with late interaction
            search_start = time.time()
            initial_results = retriever.search(query_embedding)
            search_time = time.time() - search_start
            total_search_time += search_time
            
            print(f"Initial retrieval: {len(initial_results)} results in {search_time:.3f}s")
            
            # Step 3: Reranking (if enabled in config)
            if pipeline_config.get("use_reranker", False):
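                # perf_counter() is monotonic and has finer resolution than time.time(),
                # which matters because this heuristic rerank finishes in well under 1 ms
                rerank_start = time.perf_counter()
                
                # Extract chunks for reranking
                initial_chunks = [result['chunk'] for result in initial_results]
                reranked_results = reranker.rerank(query, initial_chunks, query_embedding)
                
                rerank_time = time.perf_counter() - rerank_start
                total_rerank_time += rerank_time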
                
                print(f"🔄 ColBERT reranking completed in {rerank_time:.3f}s")
                final_results = reranked_results
            else:
                final_results = initial_results
            
            # Step 4: Display results
            print("\n🎯 Top 3 Final Results:")
            for j, result in enumerate(final_results[:3], 1):
                chunk = result['chunk']
                score = result['score']
                
                print(f"   {j}. Score: {score:.4f}")
                print(f"      Content: {chunk.content[:80]}...")
                print(f"      Metadata: {chunk.metadata.get('title', 'N/A')}")
                
                # Show reranking details if available
                if 'overlap_score' in result:
                    print(f"      Overlap: {result['overlap_score']:.3f}, Length: {result['length_score']:.3f}, Diversity: {result['diversity_score']:.3f}")
                    
        except Exception as e:
            print(f"Error processing query {i}: {e}")
            continue
    
    # Performance summary
    print("\n📈 ColBERT Performance Summary:")
    print(f"├── Total embedding time: {total_embedding_time:.3f}s")
    print(f"├── Average embedding time: {total_embedding_time/len(test_queries):.3f}s")
    print(f"├── Total search time: {total_search_time:.3f}s")
    print(f"├── Average search time: {total_search_time/len(test_queries):.3f}s")
    if total_rerank_time > 0:
        print(f"├── Total rerank time: {total_rerank_time:.3f}s")
        print(f"├── Average rerank time: {total_rerank_time/len(test_queries):.3f}s")
    print(f"└── Total queries tested: {len(test_queries)}")

# Run comprehensive ColBERT test
comprehensive_colbert_retrieval_test()

# Test with reranking enabled
print("\n" + "=" * 70)
print("🔄 Testing ColBERT with Reranking Enabled")
pipeline_config["use_reranker"] = True
try:
    comprehensive_colbert_retrieval_test()
finally:
    # Always reset the flag, even if the test run raises
    pipeline_config["use_reranker"] = False

🔬 Comprehensive ColBERT Retrieval Pipeline Test

🔎 Test Query 1: 'machine learning'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.183s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.353s

🎯 Top 3 Final Results:
   1. Score: 0.6568
      Content: machine learning is a subset of artificial intelligence that enables computers t...
      Metadata: Document 1
   2. Score: 0.5624
      Content: deep learning uses neural networks with multiple layers to solve complex problem...
      Metadata: Document 3
   3. Score: 0.5402
      Content: natural language processing involves the interaction between computers and human...
      Metadata: Document 2

🔎 Test Query 2: 'natural language processing text understanding'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.048s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.211s

🎯 Top 3 Final Results:
   1. Score: 0.6488
      Content: natural language processing involves the interaction between computers and human...
      Metadata: Document 2
   2. Score: 0.5725
      Content: transformer models have revolutionized natural language processing with their at...
      Metadata: Document 7
   3. Score: 0.5440
      Content: embedding models convert text into numerical representations that capture semant...
      Metadata: Document 8

🔎 Test Query 3: 'deep learning neural networks pattern recognition'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.038s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.197s

🎯 Top 3 Final Results:
   1. Score: 0.6585
      Content: deep learning uses neural networks with multiple layers to solve complex problem...
      Metadata: Document 3
   2. Score: 0.5458
      Content: machine learning is a subset of artificial intelligence that enables computers t...
      Metadata: Document 1
   3. Score: 0.5403
      Content: retrieval - augmented generation combines information retrieval with language ge...
      Metadata: Document 9

🔎 Test Query 4: 'vector database similarity semantic search'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.053s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.190s

🎯 Top 3 Final Results:
   1. Score: 0.6682
      Content: vector databases store and search high - dimensional vectors efficiently, enabli...
      Metadata: Document 6
   2. Score: 0.5718
      Content: semantic search goes beyond keyword matching to understand the meaning and inten...
      Metadata: Document 10
   3. Score: 0.5489
      Content: embedding models convert text into numerical representations that capture semant...
      Metadata: Document 8

🔎 Test Query 5: 'elasticsearch distributed search analytics engine'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.047s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.195s

🎯 Top 3 Final Results:
   1. Score: 0.6935
      Content: elasticsearch is a distributed search and analytics engine built on apache lucen...
      Metadata: Document 5
   2. Score: 0.5485
      Content: semantic search goes beyond keyword matching to understand the meaning and inten...
      Metadata: Document 10
   3. Score: 0.5474
      Content: vector databases store and search high - dimensional vectors efficiently, enabli...
      Metadata: Document 6

📈 ColBERT Performance Summary:
├── Total embedding time: 7.885s
├── Average embedding time: 1.577s
├── Total search time: 1.145s
├── Average search time: 0.229s
└── Total queries tested: 5

🔄 Testing ColBERT with Reranking Enabled
🔬 Comprehensive ColBERT Retrieval Pipeline Test

🔎 Test Query 1: 'machine learning'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.065s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.205s
🔄 ColBERT reranking completed in 0.000s

🎯 Top 3 Final Results:
   1. Score: 0.9120
      Content: machine learning is a subset of artificial intelligence that enables computers t...
      Metadata: Document 1
      Overlap: 1.000, Length: 0.707, Diversity: 1.000
   2. Score: 0.6620
      Content: deep learning uses neural networks with multiple layers to solve complex problem...
      Metadata: Document 3
      Overlap: 0.500, Length: 0.707, Diversity: 1.000
   3. Score: 0.4113
      Content: natural language processing involves the interaction between computers and human...
      Metadata: Document 2
      Overlap: 0.000, Length: 0.704, Diversity: 1.000

🔎 Test Query 2: 'natural language processing text understanding'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.049s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.211s
🔄 ColBERT reranking completed in 0.000s

🎯 Top 3 Final Results:
   1. Score: 0.7113
      Content: natural language processing involves the interaction between computers and human...
      Metadata: Document 2
      Overlap: 0.600, Length: 0.704, Diversity: 1.000
   2. Score: 0.7098
      Content: transformer models have revolutionized natural language processing with their at...
      Metadata: Document 7
      Overlap: 0.600, Length: 0.699, Diversity: 1.000
   3. Score: 0.5091
      Content: embedding models convert text into numerical representations that capture semant...
      Metadata: Document 8
      Overlap: 0.200, Length: 0.697, Diversity: 1.000

🔎 Test Query 3: 'deep learning neural networks pattern recognition'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.046s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.210s
🔄 ColBERT reranking completed in 0.000s

🎯 Top 3 Final Results:
   1. Score: 0.7453
      Content: deep learning uses neural networks with multiple layers to solve complex problem...
      Metadata: Document 3
      Overlap: 0.667, Length: 0.707, Diversity: 1.000
   2. Score: 0.4953
      Content: machine learning is a subset of artificial intelligence that enables computers t...
      Metadata: Document 1
      Overlap: 0.167, Length: 0.707, Diversity: 1.000
   3. Score: 0.4113
      Content: natural language processing involves the interaction between computers and human...
      Metadata: Document 2
      Overlap: 0.000, Length: 0.704, Diversity: 1.000

🔎 Test Query 4: 'vector database similarity semantic search'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.099s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.250s
🔄 ColBERT reranking completed in 0.000s

🎯 Top 3 Final Results:
   1. Score: 0.7863
      Content: vector databases store and search high - dimensional vectors efficiently, enabli...
      Metadata: Document 6
      Overlap: 0.800, Length: 0.704, Diversity: 0.875
   2. Score: 0.5972
      Content: semantic search goes beyond keyword matching to understand the meaning and inten...
      Metadata: Document 10
      Overlap: 0.400, Length: 0.702, Diversity: 0.933
   3. Score: 0.5091
      Content: embedding models convert text into numerical representations that capture semant...
      Metadata: Document 8
      Overlap: 0.200, Length: 0.697, Diversity: 1.000

🔎 Test Query 5: 'elasticsearch distributed search analytics engine'
--------------------------------------------------


INFO:elastic_transport.transport:POST http://localhost:9200/colbert_sample_dataset_jina-colbert-v2/_search [status:200 duration:0.110s]


Query embedding: 32 token vectors
Initial retrieval: 10 results in 0.326s
🔄 ColBERT reranking completed in 0.000s

🎯 Top 3 Final Results:
   1. Score: 0.9017
      Content: elasticsearch is a distributed search and analytics engine built on apache lucen...
      Metadata: Document 5
      Overlap: 1.000, Length: 0.709, Diversity: 0.944
   2. Score: 0.4972
      Content: semantic search goes beyond keyword matching to understand the meaning and inten...
      Metadata: Document 10
      Overlap: 0.200, Length: 0.702, Diversity: 0.933
   3. Score: 0.4863
      Content: vector databases store and search high - dimensional vectors efficiently, enabli...
      Metadata: Document 6
      Overlap: 0.200, Length: 0.704, Diversity: 0.875

📈 ColBERT Performance Summary:
├── Total embedding time: 8.929s
├── Average embedding time: 1.786s
├── Total search time: 1.202s
├── Average search time: 0.240s
├── Total rerank time: 0.000s
├── Average rerank time: 0.000s
└── Total queries tested: 5
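
The reranked scores in the output above can be reproduced by hand from the weights used in the reranker (0.5 for overlap, 0.3 for length, 0.2 for diversity). The following is a minimal sketch, not the notebook's own code: `length_score` and `combined_rerank_score` are hypothetical helper names, and the 17-word chunk length is an assumption inferred from the printed length score of 0.707 rather than a value taken from the data.

```python
def length_score(word_count: int, target: int = 100, penalty: float = 0.005) -> float:
    """Length score as in the reranker: peaks at 1.0 when the chunk has `target` words."""
    return 1.0 / (1.0 + abs(word_count - target) * penalty)


def combined_rerank_score(overlap: float, length: float, diversity: float) -> float:
    """Weighted sum used for the final rerank score."""
    return overlap * 0.5 + length * 0.3 + diversity * 0.2


# Test Query 1, result 1: Overlap 1.000, Length 0.707, Diversity 1.000
# (assumed chunk length: 17 words, which yields a length score of about 0.707)
top = combined_rerank_score(1.0, length_score(17), 1.0)
print(f"{top:.4f}")  # 0.9120

# Test Query 1, result 2: same length and diversity, but Overlap 0.500
second = combined_rerank_score(0.5, length_score(17), 1.0)
print(f"{second:.4f}")  # 0.6620
```

Because the sample chunks are all short, the length score stays near 0.70 for every result, so the ordering after reranking is driven almost entirely by the overlap term.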


## 11. Cleanup

Implement cleanup procedures for removing the test indices, as in the original pipeline.

In [193]:
def cleanup_pipeline():
    """Clean up vector stores, their Elasticsearch indices, and temporary files."""
    
    print("\nEnhanced ColBERT + Late Chunking Pipeline Cleanup")
    print("=" * 50)
    
    # Clean up temporary files
    try:
        if os.path.exists(temp_jsonl_file):
            os.remove(temp_jsonl_file)
            print(f"Removed temporary file: {temp_jsonl_file}")
    except Exception as e:
        print(f"⚠️ Error removing temp file: {e}")
    
    # Cleanup vector stores and Elasticsearch indices
    if pipeline_simulator.vector_stores:
        print(f"Cleaning up {len(pipeline_simulator.vector_stores)} enhanced vector stores...")
        
        indices_to_clean = []
        for dataset_name, vs in pipeline_simulator.vector_stores.items():
            try:
                # Delete the actual Elasticsearch index
                vs.delete_index()
                indices_to_clean.append(vs.index_name)
                print(f"Deleted Elasticsearch index: {vs.index_name}")
                    
            except Exception as e:
                print(f"Error cleaning up {dataset_name}: {e}")
        
        # Clear the vector stores dictionary
        pipeline_simulator.vector_stores = {}
        print("Enhanced vector stores cleanup completed")
        
        if indices_to_clean:
            print(f"📊 Cleaned up {len(indices_to_clean)} Elasticsearch indices:")
            for idx in indices_to_clean:
                print(f"   - {idx}")
    else:
        print("ℹ️ No vector stores to clean up")

# Uncomment the next line to run the cleanup
# cleanup_pipeline()
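
After cleanup has run, it can be worth confirming that the indices are really gone. The sketch below assumes an elasticsearch-py 8.x style client whose `indices.exists(index=...)` call returns a truthy or falsy response. `remaining_indices` is a hypothetical helper that is not part of the pipeline, and a stub client stands in for a live cluster so the snippet runs offline.

```python
from types import SimpleNamespace
from typing import Iterable, List


def remaining_indices(client, index_names: Iterable[str]) -> List[str]:
    """Return the names from index_names that still exist on the cluster."""
    return [name for name in index_names if client.indices.exists(index=name)]


# Stub standing in for Elasticsearch(...); it pretends one index survived cleanup.
_existing = {"colbert_sample_dataset_jina-colbert-v2"}
stub_client = SimpleNamespace(
    indices=SimpleNamespace(exists=lambda index: index in _existing)
)

leftover = remaining_indices(
    stub_client,
    ["colbert_sample_dataset_jina-colbert-v2", "some_other_index"],
)
print(leftover)  # ['colbert_sample_dataset_jina-colbert-v2']
```

With a real cluster, pass the client instance used by the pipeline instead of the stub. An empty list means every listed index was removed.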