# Elasticsearch Embedding Pipeline Simulation

This notebook simulates an Elasticsearch embedding pipeline based on the code in `benchmarks/embedding/contrib/pipeline/pipeline.py`.

The pipeline performs the following steps:
1. **Data Loading**: Reads data in JSONL format
2. **Text Processing**: Tokenizes text and truncates it to the token limit
3. **Vector Indexing**: Batch-indexes documents with their embeddings into Elasticsearch
4. **Semantic Retrieval**: Semantic search using vector similarity
5. **Optional Reranking**: Reranks the retrieved results for better accuracy

## Overview Pipeline Architecture

```
JSONL Data → Text Processing → Document Creation → Chunk Conversion → 
Elasticsearch Index (Vector Embeddings + Metadata) → Vector Retrieval → 
Optional Reranking → Final Results
```

## 1. Import Required Libraries

Import all the libraries needed to simulate the Elasticsearch embedding pipeline.

In [108]:
# Core libraries
import json
import os
import time
import uuid
import warnings
from typing import Any, Dict, List, Optional
import logging
from dataclasses import dataclass, asdict
from datetime import datetime

# Data processing
from tqdm import tqdm

# Elasticsearch
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# HTTP requests for Jina API
import requests

# Suppress warnings
warnings.filterwarnings('ignore')

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Jina AI API key: read it from the environment; never hard-code a real key in the notebook
jinaai_key = os.environ.get("JINA_API_KEY", "your-jina-api-key-here")

print("✅ All libraries imported successfully!")
print(f"📅 Timestamp: {datetime.now()}")
print(f"🔑 Jina API Key configured: {'✅' if jinaai_key != 'your-jina-api-key-here' else '❌ Please set JINA_API_KEY environment variable'}")

✅ All libraries imported successfully!
📅 Timestamp: 2025-07-24 09:17:32.846767
🔑 Jina API Key configured: ✅


## 2. Define Data Structures and Configuration Classes

Define the data structures used by the embedding pipeline with **Jina ColBERT v2**, including the model configuration and document chunks.

### 🚀 Model Migration: Jina ColBERT v2

The pipeline has been updated to use **Jina ColBERT v2**, which offers the following advantages:

1. **Multi-Vector Architecture**: Each document is represented as multiple token-level vectors
2. **Late Interaction**: Similarity is computed through more precise token-to-token matching
3. **Higher Token Limit**: Supports up to 8192 tokens (versus 512 tokens previously)
4. **Better Semantic Understanding**: More accurate for long documents and complex queries

### 🔑 API Key Setup

Before running the pipeline, make sure you have set your Jina API key:

```python
import os
os.environ["JINA_API_KEY"] = "your-jina-api-key-here"
```

Get a free API key at: https://jina.ai/
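
To keep a real key out of the notebook, the key can be read strictly from the environment, failing early when it is missing. The helper below is a minimal sketch: `load_jina_api_key` and `PLACEHOLDER_KEY` are illustrative names and are not part of the original pipeline.

```python
import os

# Same placeholder string this notebook uses to mean "no key configured".
PLACEHOLDER_KEY = "your-jina-api-key-here"


def load_jina_api_key(env_var: str = "JINA_API_KEY") -> str:
    """Return the API key from the environment.

    Hypothetical helper for illustration: raises if the variable is unset,
    empty, or still holds the placeholder value.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER_KEY:
        raise RuntimeError(f"{env_var} is not set; export it before running the notebook")
    return key
```

Failing early means a missing key surfaces before any API request is attempted.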

### 🏗️ Architecture Comparison

**Traditional Embeddings (previously):**
```
Text → Single Vector (384D) → Vector Search → Results
```

**ColBERT v2 (now):**
```
Text → Multiple Token Vectors (Nx128D) → Late Interaction → Results
```
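
The "Late Interaction" step can be illustrated with the MaxSim scoring commonly associated with ColBERT-style models: for each query token vector, take the best cosine match among the document's token vectors, then sum those maxima. The sketch below is a pure-Python illustration under that assumption; `maxsim_score` is a hypothetical helper, not a function from the original pipeline.

```python
import math
from typing import List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Vector) -> Vector:
    # Scale to unit length so that the dot product equals cosine similarity.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def maxsim_score(query_vectors: List[Vector], doc_vectors: List[Vector]) -> float:
    """Sum, over query tokens, of the maximum cosine similarity to any document token."""
    if not query_vectors or not doc_vectors:
        return 0.0
    q = [_normalize(v) for v in query_vectors]
    d = [_normalize(v) for v in doc_vectors]
    return sum(max(_dot(qv, dv) for dv in d) for qv in q)


# Toy 2-D example: two query tokens, two document tokens.
toy_score = maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])
```

In the toy example the first query token matches a document token exactly (similarity 1), and the second matches best at 1/√2, so the score is their sum.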

In [109]:
@dataclass
class ModelConfig:
    """Model configuration for Jina ColBERT v2 embedding pipeline."""
    provider: str = "jinaai"
    provider_model_id: str = "jina-colbert-v2"
    embedding_dimensions: int = 128  # ColBERT v2 uses 128 dimensions
    max_tokens: int = 8192  # Jina ColBERT v2 supports up to 8192 tokens
    api_endpoint: str = "https://api.jina.ai/v1/multi-vector"
    input_type: str = "document"
    embedding_type: str = "float"
    
    def model_dump(self) -> Dict[str, Any]:
        return asdict(self)

@dataclass
class Document:
    """Document class similar to langchain Document."""
    page_content: str
    metadata: Dict[str, Any]

@dataclass 
class Chunk:
    """Chunk class from the original pipeline."""
    content: str
    id: str
    metadata: Dict[str, Any]

@dataclass
class EmbeddingState:
    """State management for pipeline."""
    id: str
    query: str
    dataset: str
    retrieved_chunks: Optional[List[Chunk]] = None
    reranked_chunks: Optional[List[Chunk]] = None
    supporting_facts: Optional[List[str]] = None

class EmbeddingStateKeys:
    """State keys constants."""
    ID = "id"
    QUERY = "query"
    DATASET = "dataset"
    RETRIEVED_CHUNKS = "retrieved_chunks"
    RERANKED_CHUNKS = "reranked_chunks"
    SUPPORTING_FACTS = "supporting_facts"

print("✅ Data structures defined successfully!")
print("🔄 Updated to use Jina ColBERT v2 model")
print(f"📏 Embedding dimensions: 128")
print(f"🔤 Max tokens: 8192")

✅ Data structures defined successfully!
🔄 Updated to use Jina ColBERT v2 model
📏 Embedding dimensions: 128
🔤 Max tokens: 8192


## 3. Setup Elasticsearch Connection

Configure and connect to Elasticsearch. For this simulation, we use a local or Docker-hosted Elasticsearch instance.

In [110]:
class ElasticsearchVectorStore:
    """Simulated ElasticsearchVectorDataStore class for ColBERT multi-vector embeddings."""
    
    def __init__(self, index_name: str, embedding_config: Dict[str, Any]):
        self.index_name = index_name
        self.embedding_config = embedding_config
        
        # Elasticsearch connection (adjust to your local setup)
        self.client = Elasticsearch([
            {'host': 'localhost', 'port': 9200, 'scheme': 'http'}
        ])
        
        # For the simulation, fall back to in-memory storage if Elasticsearch is unavailable
        self.use_simulation = False
        self.simulated_docs = []
        
        try:
            # Test connection
            info = self.client.info()
            logger.info(f"✅ Connected to Elasticsearch: {info['version']['number']}")
        except Exception as e:
            logger.warning(f"⚠️ Cannot connect to Elasticsearch: {e}")
            logger.info("🔄 Using in-memory simulation mode")
            self.use_simulation = True
            
        self._create_index_if_not_exists()
    
    def _create_index_if_not_exists(self):
        """Create index with proper mapping for ColBERT multi-vector search."""
        if self.use_simulation:
            return
            
        # Mapping for the vector index. Note: a `dense_vector` field is single-valued,
        # so it holds one vector per document, not a list of token vectors.
        mapping = {
            "mappings": {
                "properties": {
                    "content": {"type": "text", "analyzer": "standard"},
                    "embeddings": {
                        "type": "dense_vector",
                        "dims": self.embedding_config.get("embedding_dimensions", 128),
                        "similarity": "cosine"
                    },
                    "metadata": {"type": "object"},
                    "chunk_id": {"type": "keyword"},
                    "timestamp": {"type": "date"}
                }
            }
        }
        
        try:
            if not self.client.indices.exists(index=self.index_name):
                self.client.indices.create(index=self.index_name, body=mapping)
                logger.info(f"✅ Created index: {self.index_name}")
            else:
                logger.info(f"📋 Index already exists: {self.index_name}")
        except Exception as e:
            logger.error(f"❌ Error creating index: {e}")
    
    def count_documents(self) -> int:
        """Count documents in index."""
        if self.use_simulation:
            return len(self.simulated_docs)
        
        try:
            result = self.client.count(index=self.index_name)
            return result['count']
        except Exception as e:
            logger.warning(f"⚠️ Could not count documents: {e}")
            return 0
    
    def add_chunks_batch(self, chunks: List[Chunk], embeddings: List[List[List[float]]]):
        """Add chunks with multi-vector embeddings in batch."""
        if self.use_simulation:
            for chunk, embedding in zip(chunks, embeddings):
                self.simulated_docs.append({
                    'chunk': chunk,
                    'embedding': embedding  # Multi-vector embedding
                })
            return
        
        # Prepare documents for bulk indexing
        docs = []
        for chunk, embedding in zip(chunks, embeddings):
            doc = {
                "_index": self.index_name,
                "_id": chunk.id,
                "_source": {
                    "content": chunk.content,
                    "embeddings": embedding,  # Store multi-vector embeddings
                    "metadata": chunk.metadata,
                    "chunk_id": chunk.id,
                    "timestamp": datetime.now()
                }
            }
            docs.append(doc)
        
        # Bulk index
        try:
            success, failed = bulk(self.client, docs)
            logger.info(f"✅ Indexed {success} documents, {len(failed)} failed")
        except Exception as e:
            logger.error(f"❌ Bulk indexing error: {e}")

# Initialize configuration
model_config = ModelConfig()
pipeline_config = {
    "vector_store_provider": "elasticsearch",
    "chunks_file_name": "corpus.jsonl",
    "retrieval_top_k": 10,
    "truncate_chunk_size": 8192,  # Updated for ColBERT v2
    "use_reranker": False,
    "batch_size": 16  # Smaller batch size for multi-vector embeddings
}

# Validate Jina API key
if jinaai_key == "your-jina-api-key-here" or not jinaai_key:
    logger.warning("⚠️ Jina API key not configured! Please set JINA_API_KEY environment variable")
    logger.info("💡 You can get an API key from: https://jina.ai/")
else:
    logger.info("✅ Jina API key configured")

print("✅ Elasticsearch configuration ready for ColBERT v2!")
print(f"📝 Model: {model_config.provider_model_id}")
print(f"🔍 Embedding dimensions: {model_config.embedding_dimensions}")
print(f"🔤 Max tokens: {model_config.max_tokens}")
print(f"🌐 API endpoint: {model_config.api_endpoint}")

INFO:__main__:✅ Jina API key configured


✅ Elasticsearch configuration ready for ColBERT v2!
📝 Model: jina-colbert-v2
🔍 Embedding dimensions: 128
🔤 Max tokens: 8192
🌐 API endpoint: https://api.jina.ai/v1/multi-vector
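
Because an Elasticsearch `dense_vector` field stores a single vector per document, indexing a list of token vectors into it is unlikely to work against a real cluster. One possible layout is a `nested` field holding one `dense_vector` per token. The sketch below only builds the mapping and document bodies as plain dictionaries; the field names `token_vectors` and `vector` are hypothetical, and the layout has not been verified against a running cluster (support for vectors inside nested fields depends on the Elasticsearch version).

```python
from typing import Any, Dict, List


def build_multivector_mapping(dims: int = 128) -> Dict[str, Any]:
    """Index mapping with one nested object per token vector (illustrative layout)."""
    return {
        "mappings": {
            "properties": {
                "content": {"type": "text", "analyzer": "standard"},
                "chunk_id": {"type": "keyword"},
                "token_vectors": {
                    "type": "nested",
                    "properties": {
                        "vector": {
                            "type": "dense_vector",
                            "dims": dims,
                            "similarity": "cosine",
                        }
                    },
                },
            }
        }
    }


def to_nested_vectors(embedding: List[List[float]]) -> List[Dict[str, List[float]]]:
    """Wrap each token vector in its own object so it fits the nested field."""
    return [{"vector": vec} for vec in embedding]
```

With this layout, a document's `_source` would carry `"token_vectors": to_nested_vectors(embedding)` instead of the raw list of lists.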


## 4. Create Mock Data Structure

Generate sample JSONL data in the same format as `corpus.jsonl` to test the pipeline. This data simulates the documents that will be indexed.

In [111]:
# Sample data that mimics the corpus.jsonl format
sample_texts = [
    "Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed.",
    "Natural language processing involves the interaction between computers and human language, enabling machines to understand text.",
    "Deep learning uses neural networks with multiple layers to solve complex problems and recognize patterns in data.",
    "Computer vision allows machines to interpret and understand visual information from the world around them.",
    "Elasticsearch is a distributed search and analytics engine built on Apache Lucene for full-text search capabilities.",
    "Vector databases store and search high-dimensional vectors efficiently, enabling semantic search and similarity matching.",
    "Transformer models have revolutionized natural language processing with their attention mechanism and parallel processing.",
    "Embedding models convert text into numerical representations that capture semantic meaning and context.",
    "Retrieval-augmented generation combines information retrieval with language generation for improved AI responses.",
    "Semantic search goes beyond keyword matching to understand the meaning and intent behind search queries."
]

def create_mock_jsonl_data(texts: List[str], dataset_name: str = "sample_dataset") -> List[Dict[str, Any]]:
    """Create mock JSONL data similar to corpus format."""
    mock_data = []
    
    for i, text in enumerate(texts):
        doc = {
            "_id": f"{dataset_name}_chunk_{i:04d}",
            "text": text,
            "title": f"Document {i+1}",
            "source": f"sample_source_{i+1}",
            "category": "technology",
            "chunk_index": i,
            "word_count": len(text.split()),
            "char_count": len(text)
        }
        mock_data.append(doc)
    
    return mock_data

# Generate mock data
mock_jsonl_data = create_mock_jsonl_data(sample_texts)

print(f"✅ Generated {len(mock_jsonl_data)} mock documents")
print("\n📄 Sample document:")
print(json.dumps(mock_jsonl_data[0], indent=2))

# Save to temporary file for processing simulation
temp_jsonl_file = "/tmp/sample_corpus.jsonl"
with open(temp_jsonl_file, 'w') as f:
    for doc in mock_jsonl_data:
        f.write(json.dumps(doc) + '\n')

print(f"\n💾 Saved mock data to: {temp_jsonl_file}")

✅ Generated 10 mock documents

📄 Sample document:
{
  "_id": "sample_dataset_chunk_0000",
  "text": "Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed.",
  "title": "Document 1",
  "source": "sample_source_1",
  "category": "technology",
  "chunk_index": 0,
  "word_count": 17,
  "char_count": 124
}

💾 Saved mock data to: /tmp/sample_corpus.jsonl


## 5. Initialize Tokenizer and Embedding Model

Set up the tokenizer and embedding model for text processing and embedding generation, as in the original pipeline.

In [112]:
class JinaColBERTEmbeddingPipeline:
    """Jina ColBERT v2 embedding pipeline based on the original code and ColBERT reference."""
    
    def __init__(self, model_config: ModelConfig, pipeline_config: Dict[str, Any]):
        self.model_config = model_config
        self.pipeline_config = pipeline_config
        self.api_key = jinaai_key
        self.tokenizer = None
        self.vector_stores = {}
        
        if not self.api_key or self.api_key == "your-jina-api-key-here":
            raise ValueError("Jina API key is required! Please set JINA_API_KEY environment variable")
    
    def initialize_tokenizer_and_model(self):
        """Initialize tokenizer for text processing (no local model needed for Jina API)."""
        try:
            # Load tokenizer for text truncation and processing
            # We'll use a simple tokenizer for text processing, actual embedding is done via API
            from transformers import AutoTokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(
                "sentence-transformers/all-MiniLM-L6-v2",  # Use for tokenization only
                trust_remote_code=True
            )
            
            logger.info(f"✅ Initialized tokenizer for text processing")
            logger.info(f"🌐 Using Jina ColBERT v2 API for embeddings")
            logger.info(f"📊 Vocab size: {len(self.tokenizer)}")
            
        except Exception as e:
            logger.error(f"❌ Error loading tokenizer: {e}")
            raise e
    
    @staticmethod
    def truncate_to_token_limit(text: str, tokenizer, max_tokens: int = 8192) -> str:
        """Truncate text to token limit for ColBERT v2."""
        tokenized = tokenizer(
            text,
            truncation=True,
            max_length=max_tokens,
            return_tensors=None,
            return_attention_mask=False,
            return_token_type_ids=False,
        )
        input_ids = tokenized["input_ids"]
        return tokenizer.decode(input_ids, skip_special_tokens=True)
    
    def get_colbert_embeddings(self, texts: List[str], input_type: str = "document") -> List[List[List[float]]]:
        """Generate ColBERT multi-vector embeddings using Jina API."""
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}",
        }
        
        all_embeddings = []
        
        for text in texts:
            data = {
                "model": self.model_config.provider_model_id,
                "dimensions": self.model_config.embedding_dimensions,
                "input_type": input_type,  # "document" or "query"
                "embedding_type": self.model_config.embedding_type,
                "input": [text],
            }
            
            try:
                response = requests.post(
                    self.model_config.api_endpoint,
                    headers=headers,
                    data=json.dumps(data),
                    timeout=30
                )
                response.raise_for_status()
                
                response_data = response.json()
                embedding = response_data["data"][0]["embeddings"]
                
                if not isinstance(embedding, list):
                    raise ValueError(f"Expected list embedding, got {type(embedding)}")
                
                # Validate embedding structure (should be list of lists for ColBERT)
                if embedding and isinstance(embedding[0], list):
                    all_embeddings.append(embedding)
                else:
                    raise ValueError(f"Invalid embedding structure: {type(embedding[0]) if embedding else 'empty'}")
                    
            except requests.exceptions.RequestException as e:
                logger.error(f"❌ API request error for text '{text[:50]}...': {e}")
                raise e
            except KeyError as e:
                logger.error(f"❌ Unexpected API response structure: {e}")
                raise e
            except Exception as e:
                logger.error(f"❌ Error getting embedding for text '{text[:50]}...': {e}")
                raise e
        
        return all_embeddings
    
    def get_single_embedding(self, text: str, input_type: str = "document") -> List[List[float]]:
        """Get single ColBERT embedding for a text."""
        embeddings = self.get_colbert_embeddings([text], input_type)
        return embeddings[0] if embeddings else None

# Initialize pipeline simulator
pipeline_simulator = JinaColBERTEmbeddingPipeline(model_config, pipeline_config)

try:
    pipeline_simulator.initialize_tokenizer_and_model()
    print("✅ Jina ColBERT v2 pipeline initialized!")
    print(f"🔤 Max tokens: {model_config.max_tokens}")
    
    # Test embedding generation
    test_text = "This is a sample text for ColBERT embedding generation."
    logger.info("🧪 Testing ColBERT embedding generation...")
    test_embedding = pipeline_simulator.get_single_embedding(test_text)
    
    if test_embedding:
        print(f"🧮 ColBERT embedding shape: ({len(test_embedding)}, {len(test_embedding[0])})")
        print(f"📈 Number of token vectors: {len(test_embedding)}")
        print(f"📏 Vector dimension: {len(test_embedding[0])}")
        print(f"📊 Sample values from first vector: {test_embedding[0][:5]}")
    else:
        print("❌ Failed to generate test embedding")
        
except Exception as e:
    logger.error(f"❌ Error initializing pipeline: {e}")
    print("💡 Make sure your Jina API key is valid and you have internet connection")

INFO:__main__:✅ Initialized tokenizer for text processing
INFO:__main__:🌐 Using Jina ColBERT v2 API for embeddings
INFO:__main__:📊 Vocab size: 30522
INFO:__main__:🧪 Testing ColBERT embedding generation...


✅ Jina ColBERT v2 pipeline initialized!
🔤 Max tokens: 8192
🧮 ColBERT embedding shape: (16, 128)
📈 Number of token vectors: 16
📏 Vector dimension: 128
📊 Sample values from first vector: [0.112854004, 0.080444336, -0.15185547, -0.13452148, 0.0670166]


## 6. Process and Prepare Documents

Load and process documents from the mock JSONL data, handle text truncation, and prepare metadata for indexing, as in the original pipeline.

In [113]:
def process_jsonl_documents(file_path: str, pipeline_simulator: JinaColBERTEmbeddingPipeline) -> List[Document]:
    """Process JSONL documents for ColBERT pipeline - same logic as original pipeline."""
    documents = []
    
    logger.info(f"📄 Processing documents from: {file_path}")
    
    with open(file_path, "r") as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line.strip())
                
                # Extract metadata (exclude 'text' field)
                metadata = {}
                for k, v in data.items():
                    if k == "text":
                        continue
                    if k == "_id":
                        metadata["chunk_id"] = v
                    else:
                        metadata[k] = v
                
                # Get text content
                text = data.get("text", "")
                
                # Apply truncation if configured (ColBERT v2 supports up to 8192 tokens)
                truncate_chunk_size = pipeline_simulator.pipeline_config.get("truncate_chunk_size")
                if truncate_chunk_size is not None:
                    original_length = len(text.split())
                    text = JinaColBERTEmbeddingPipeline.truncate_to_token_limit(
                        text=text, 
                        tokenizer=pipeline_simulator.tokenizer, 
                        max_tokens=truncate_chunk_size
                    )
                    new_length = len(text.split())
                    if original_length != new_length:
                        logger.debug(f"🔄 Truncated doc {line_num}: {original_length} → {new_length} words")
                
                # Apply text size truncation if configured
                truncate_text_size = pipeline_simulator.pipeline_config.get("truncate_text_size")
                if truncate_text_size is not None:
                    text = text[:truncate_text_size] if len(text) > truncate_text_size else text
                
                # Create document
                document = Document(
                    page_content=text,
                    metadata=metadata
                )
                documents.append(document)
                
            except json.JSONDecodeError as e:
                logger.error(f"❌ JSON decode error at line {line_num}: {e}")
            except Exception as e:
                logger.error(f"❌ Error processing line {line_num}: {e}")
    
    logger.info(f"✅ Processed {len(documents)} documents")
    return documents

def filter_complex_metadata(documents: List[Document]) -> List[Document]:
    """Filter complex metadata - simplified version."""
    filtered_docs = []
    for doc in documents:
        # Simple filtering - remove any non-serializable metadata
        filtered_metadata = {}
        for k, v in doc.metadata.items():
            if isinstance(v, (str, int, float, bool, type(None))):
                filtered_metadata[k] = v
        
        filtered_doc = Document(
            page_content=doc.page_content,
            metadata=filtered_metadata
        )
        filtered_docs.append(filtered_doc)
    
    return filtered_docs

def documents_to_chunks(documents: List[Document]) -> List[Chunk]:
    """Convert documents to chunks format."""
    chunks = []
    for doc in documents:
        chunk = Chunk(
            content=doc.page_content,
            id=doc.metadata["chunk_id"],
            metadata=doc.metadata
        )
        chunks.append(chunk)
    
    return chunks

# Process the mock JSONL data
documents = process_jsonl_documents(temp_jsonl_file, pipeline_simulator)
documents = filter_complex_metadata(documents)
chunks = documents_to_chunks(documents)

print(f"✅ Created {len(chunks)} chunks ready for ColBERT indexing")
print(f"\n📋 Sample chunk:")
print(f"ID: {chunks[0].id}")
print(f"Content: {chunks[0].content[:100]}...")
print(f"Metadata: {chunks[0].metadata}")
print(f"\n🔄 Ready for ColBERT multi-vector embedding generation...")

INFO:__main__:📄 Processing documents from: /tmp/sample_corpus.jsonl
INFO:__main__:✅ Processed 10 documents


✅ Created 10 chunks ready for ColBERT indexing

📋 Sample chunk:
ID: sample_dataset_chunk_0000
Content: machine learning is a subset of artificial intelligence that enables computers to learn without bein...
Metadata: {'chunk_id': 'sample_dataset_chunk_0000', 'title': 'Document 1', 'source': 'sample_source_1', 'category': 'technology', 'chunk_index': 0, 'word_count': 17, 'char_count': 124}

🔄 Ready for ColBERT multi-vector embedding generation...
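
The sample chunk above comes back lowercased because every text is encoded and decoded through an uncased tokenizer, even when nothing needs truncating. A minimal sketch of a truncation helper that returns the original string untouched when it already fits is shown below. `truncate_preserving_text` is a hypothetical helper; the `encode` and `decode` callables are assumptions standing in for a real tokenizer, and the toy whitespace tokenizer exists only to make the example self-contained.

```python
from typing import Callable, Dict, List


def truncate_preserving_text(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int,
) -> str:
    """Decode only when truncation is needed; otherwise return the text as-is."""
    token_ids = encode(text)
    if len(token_ids) <= max_tokens:
        return text  # untouched: no lowercasing or whitespace changes from decode()
    return decode(token_ids[:max_tokens])


# Toy uncased whitespace tokenizer, used purely for demonstration.
_vocab: Dict[str, int] = {}


def toy_encode(text: str) -> List[int]:
    return [_vocab.setdefault(word.lower(), len(_vocab)) for word in text.split()]


def toy_decode(ids: List[int]) -> str:
    inverse = {i: w for w, i in _vocab.items()}
    return " ".join(inverse[i] for i in ids)


kept = truncate_preserving_text("Machine Learning Rocks", toy_encode, toy_decode, 5)
cut = truncate_preserving_text("Machine Learning Rocks", toy_encode, toy_decode, 2)
```

With a Hugging Face tokenizer, `encode` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. Token counts from a tokenizer other than the embedding model's own are only an approximation of what the API will see.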


## 7. Generate Document Embeddings

Generate embeddings for all document chunks using the configured embedding model.

In [114]:
# Generate ColBERT multi-vector embeddings for all chunks
logger.info("🧮 Generating ColBERT multi-vector embeddings for all chunks...")
start_time = time.time()

# Extract text content from chunks
chunk_texts = [chunk.content for chunk in chunks]

# Generate embeddings in smaller batches for API efficiency
batch_size = 3  # Smaller batch size for API calls to avoid rate limits
all_embeddings = []

for i in tqdm(range(0, len(chunk_texts), batch_size), desc="Generating ColBERT embeddings"):
    batch_texts = chunk_texts[i:i + batch_size]
    
    try:
        # Get multi-vector embeddings from Jina ColBERT v2 API
        batch_embeddings = pipeline_simulator.get_colbert_embeddings(batch_texts, input_type="document")
        all_embeddings.extend(batch_embeddings)
        
        # Add a small delay to respect API rate limits
        time.sleep(0.1)
        
    except Exception as e:
        logger.error(f"❌ Error generating embeddings for batch {i//batch_size + 1}: {e}")
        # For demo purposes, continue with empty embeddings
        # In production, you would want to retry or handle this differently
        for _ in batch_texts:
            all_embeddings.append([])  # Empty embedding as placeholder

embedding_time = time.time() - start_time

print(f"✅ Generated {len(all_embeddings)} ColBERT multi-vector embeddings")
print(f"⏱️ Embedding generation took: {embedding_time:.2f} seconds")
print(f"🎯 Average time per document: {embedding_time/len(chunks):.3f} seconds")

# Validate ColBERT embeddings
if all_embeddings and all_embeddings[0]:  # Check if we have valid embeddings
    first_embedding = all_embeddings[0]
    token_count = len(first_embedding)
    vector_dim = len(first_embedding[0]) if first_embedding else 0
    
    print(f"📏 ColBERT embedding structure:")
    print(f"├── Number of token vectors: {token_count}")
    print(f"├── Vector dimension: {vector_dim}")
    print(f"└── Total parameters per document: {token_count * vector_dim}")
    
    # Show stats for all embeddings
    token_counts = [len(emb) for emb in all_embeddings if emb]
    if token_counts:
        print(f"\n📊 Token count statistics across all documents:")
        print(f"├── Min tokens: {min(token_counts)}")
        print(f"├── Max tokens: {max(token_counts)}")
        print(f"├── Average tokens: {sum(token_counts)/len(token_counts):.1f}")
        print(f"└── Total embeddings: {len(token_counts)}")
    
    # Sample values from first embedding
    print(f"\n🧮 Sample values from first token vector: {first_embedding[0][:5]}")
    
    # Check for consistent dimensions
    dims_consistent = all(
        len(emb[0]) == vector_dim for emb in all_embeddings 
        if emb and len(emb) > 0
    )
    print(f"✅ All token vectors have consistent dimensions: {dims_consistent}")
else:
    print("❌ No valid embeddings generated!")
    print("💡 Please check your Jina API key and internet connection")

INFO:__main__:🧮 Generating ColBERT multi-vector embeddings for all chunks...
Generating ColBERT embeddings: 100%|██████████| 4/4 [00:16<00:00,  4.00s/it]

✅ Generated 10 ColBERT multi-vector embeddings
⏱️ Embedding generation took: 16.01 seconds
🎯 Average time per document: 1.601 seconds
📏 ColBERT embedding structure:
├── Number of token vectors: 26
├── Vector dimension: 128
└── Total parameters per document: 3328

📊 Token count statistics across all documents:
├── Min tokens: 19
├── Max tokens: 29
├── Average tokens: 24.4
└── Total embeddings: 10

🧮 Sample values from first token vector: [0.16882324, 0.009887695, -0.17614746, -0.26831055, 0.08026123]
✅ All token vectors have consistent dimensions: True
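
The embedding loop above falls back to empty placeholders when a batch fails and notes that production code would retry instead. One common approach is retrying with exponential backoff, sketched below. `call_with_retries` and its parameters are hypothetical and not part of the original pipeline; the `sleep` argument is injectable so the function can be exercised without real delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with delays of base_delay * 2**n seconds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

In the batch loop, the API call could be wrapped as `call_with_retries(lambda: pipeline_simulator.get_colbert_embeddings(batch_texts, input_type="document"))`. Catching a narrower exception type, such as `requests.exceptions.RequestException`, would avoid retrying on programming errors.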





## 8. Index Documents to Elasticsearch

Batch-index the processed documents with their embeddings into Elasticsearch, handling existing indices and errors as in the original pipeline.

In [115]:
# Create vector store and index ColBERT multi-vector embeddings
dataset_name = "sample_dataset"
model_for_index_name = model_config.provider_model_id.replace("/", "-")
index_name = f"colbert_{dataset_name.lower()}_{model_for_index_name.lower()}"

logger.info(f"📋 Creating ColBERT vector store with index: {index_name}")

# Initialize vector store for ColBERT embeddings
vector_store = ElasticsearchVectorStore(
    index_name=index_name,
    embedding_config=model_config.model_dump()
)

# Check if index already has data (like in original pipeline)
existing_doc_count = vector_store.count_documents()
logger.info(f"📊 Existing documents in index: {existing_doc_count}")

# Filter out chunks with empty embeddings
valid_chunks = []
valid_embeddings = []
for chunk, embedding in zip(chunks, all_embeddings):
    if embedding and len(embedding) > 0:  # Check if embedding is not empty
        valid_chunks.append(chunk)
        valid_embeddings.append(embedding)

logger.info(f"📋 Valid chunks with embeddings: {len(valid_chunks)}/{len(chunks)}")

if existing_doc_count >= len(valid_chunks):
    logger.info(f"⏭️ Index already has {existing_doc_count} documents. Skipping indexing.")
else:
    logger.info(f"🚀 Starting batch indexing of {len(valid_chunks)} chunks with ColBERT embeddings...")
    
    # Batch indexing for ColBERT multi-vector embeddings
    batch_size = pipeline_config.get("batch_size", 8)  # Smaller batch for multi-vector
    start_time = time.time()
    
    for i in tqdm(range(0, len(valid_chunks), batch_size), desc="Indexing ColBERT batches"):
        batch_chunks = valid_chunks[i:i + batch_size]
        batch_embeddings = valid_embeddings[i:i + batch_size]
        
        try:
            # Add batch to vector store
            vector_store.add_chunks_batch(batch_chunks, batch_embeddings)
            logger.debug(f"✅ Indexed ColBERT batch {i//batch_size + 1}: {len(batch_chunks)} chunks")
            
        except Exception as e:
            logger.error(f"❌ Error indexing ColBERT batch {i//batch_size + 1}: {e}")
            # In real pipeline, this would raise the exception
            # raise e
    
    indexing_time = time.time() - start_time
    
    # Verify indexing
    final_doc_count = vector_store.count_documents()
    logger.info(f"✅ ColBERT indexing completed!")
    logger.info(f"📊 Final document count: {final_doc_count}")
    logger.info(f"⏱️ Indexing took: {indexing_time:.2f} seconds")
    logger.info(f"🎯 Average indexing time per document: {indexing_time/len(valid_chunks):.3f} seconds")

# Store vector store for later use
pipeline_simulator.vector_stores[dataset_name] = vector_store

print(f"\n📈 ColBERT Pipeline Statistics:")
print(f"├── Documents processed: {len(chunks)}")
print(f"├── Valid embeddings generated: {len(valid_embeddings)}")
print(f"├── Failed embeddings: {len(chunks) - len(valid_embeddings)}")  
print(f"├── Index name: {index_name}")
print(f"├── Final document count: {vector_store.count_documents()}")
print(f"├── Vector store mode: {'Simulation' if vector_store.use_simulation else 'Elasticsearch'}")
print(f"└── Model: {model_config.provider_model_id} (ColBERT v2)")

# Display embedding statistics
if valid_embeddings:
    total_vectors = sum(len(emb) for emb in valid_embeddings)
    avg_vectors_per_doc = total_vectors / len(valid_embeddings)
    print(f"\n🧮 ColBERT Embedding Statistics:")
    print(f"├── Total token vectors: {total_vectors:,}")
    print(f"├── Average vectors per document: {avg_vectors_per_doc:.1f}")
    print(f"├── Vector dimension: {model_config.embedding_dimensions}")
    print(f"└── Total parameters: {total_vectors * model_config.embedding_dimensions:,}")

INFO:__main__:📋 Creating ColBERT vector store with index: colbert_sample_dataset_jina-colbert-v2
INFO:elastic_transport.transport:GET http://localhost:9200/ [status:N/A duration:0.002s]
Traceback (most recent call last):
  File "/Users/azmyaryarizaldi/Desktop/GDP/ElasticSearch/Handson/env/lib/python3.13/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
        (self._dns_host, self.port),
    ...<2 lines>...
        socket_options=self.socket_options,
    )
  File "/Users/azmyaryarizaldi/Desktop/GDP/ElasticSearch/Handson/env/lib/python3.13/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/Users/azmyaryarizaldi/Desktop/GDP/ElasticSearch/Handson/env/lib/python3.13/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
    ~~~~~~~~~~~~^^^^
ConnectionRefusedError: [Errno 61] Connection refused

The above exception was the direct cause of the following exc


📈 ColBERT Pipeline Statistics:
├── Documents processed: 10
├── Valid embeddings generated: 10
├── Failed embeddings: 0
├── Index name: colbert_sample_dataset_jina-colbert-v2
├── Final document count: 10
├── Vector store mode: Simulation
└── Model: jina-colbert-v2 (ColBERT v2)

🧮 ColBERT Embedding Statistics:
├── Total token vectors: 244
├── Average vectors per document: 24.4
├── Vector dimension: 128
└── Total parameters: 31,232
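
The statistics above can be turned into a rough storage estimate. The sketch below assumes each vector component is stored as a 4-byte float32 (an assumption, since the notebook does not state the storage type) and ignores any index overhead. The helper name `estimate_vector_bytes` is hypothetical.

```python
# Rough storage estimate for the multi-vector index.
# Assumption: 4 bytes per component (float32); index overhead is ignored.
BYTES_PER_FLOAT32 = 4


def estimate_vector_bytes(total_vectors: int, dimensions: int,
                          bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw size in bytes of `total_vectors` vectors of the given dimension."""
    return total_vectors * dimensions * bytes_per_value


# Figures taken from the statistics printed above: 244 token vectors of 128 dimensions.
raw_bytes = estimate_vector_bytes(total_vectors=244, dimensions=128)
print(f"{raw_bytes:,} bytes (~{raw_bytes / 1024:.0f} KiB)")  # 124,928 bytes (~122 KiB)
```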


## 9. Implement Semantic Search

Create search functionality that embeds the query text and runs a vector similarity search against the indexed documents.
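
Late interaction scoring can be illustrated on toy data. The sketch below uses made-up 2-dimensional vectors (hypothetical values, not real Jina embeddings) and averages the per-query-token maxima, which matches the normalization used by the retriever in this notebook. The original ColBERT formulation sums the maxima instead of averaging them.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def max_sim(query_vecs: List[List[float]], doc_vecs: List[List[float]]) -> float:
    """Average, over query tokens, of the best cosine match among document tokens."""
    if not query_vecs or not doc_vecs:
        return 0.0
    best = [max(cosine(q, d) for d in doc_vecs) for q in query_vecs]
    return sum(best) / len(best)


# Toy vectors (hypothetical): two query tokens, two document tokens each.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]   # one token aligned with each query token
doc_b = [[1.0, 0.0], [-1.0, 0.0]]  # nothing aligned with the second query token

print(max_sim(query, doc_a))  # 1.0
print(max_sim(query, doc_b))  # 0.5
```

Because every query token contributes its own best match, `doc_a` outranks `doc_b` even though both contain a token identical to the first query token.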

In [116]:
class ColBERTSemanticRetriever:
    """ColBERT semantic retrieval class with late interaction similarity."""
    
    def __init__(self, vector_store: ElasticsearchVectorStore, top_k: int = 10):
        self.vector_store = vector_store
        self.top_k = top_k
    
    def cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """Calculate cosine similarity between two vectors."""
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        magnitude1 = sum(a * a for a in vec1) ** 0.5
        magnitude2 = sum(b * b for b in vec2) ** 0.5
        
        if magnitude1 == 0 or magnitude2 == 0:
            return 0.0
        
        return dot_product / (magnitude1 * magnitude2)
    
    def late_interaction_similarity(self, query_vectors: List[List[float]], doc_vectors: List[List[float]]) -> float:
        """
        Compute ColBERT late interaction similarity.
        For each query token, find the best matching document token (by cosine
        similarity), then average the maxima over all query tokens.
        """
        if not query_vectors or not doc_vectors:
            return 0.0
        
        total_score = 0.0
        
        # For each query token vector
        for q_vec in query_vectors:
            # Find the maximum similarity with any document token vector
            max_similarity = max(
                self.cosine_similarity(q_vec, d_vec) 
                for d_vec in doc_vectors
            )
            total_score += max_similarity
        
        # Normalize by query length
        return total_score / len(query_vectors)
    
    def search(self, query_embedding: List[List[float]]) -> List[Dict[str, Any]]:
        """Search for similar documents using ColBERT late interaction."""
        if self.vector_store.use_simulation:
            # Simulation mode: compute late interaction similarities in memory
            similarities = []
            
            for doc in self.vector_store.simulated_docs:
                doc_embedding = doc['embedding']
                if not doc_embedding:  # Skip empty embeddings
                    continue
                    
                similarity = self.late_interaction_similarity(query_embedding, doc_embedding)
                similarities.append({
                    'chunk': doc['chunk'],
                    'score': similarity
                })
            
            # Sort by similarity score (descending)
            similarities.sort(key=lambda x: x['score'], reverse=True)
            
            # Return top-k results
            return similarities[:self.top_k]
        
        else:
            # Real Elasticsearch mode. Simplified approach: late interaction is not scored
            # inside Elasticsearch. Candidates are retrieved with a script_score query on
            # the first query token vector only, then rescored client-side with late
            # interaction. A full ColBERT implementation would need custom scoring.
            try:
                # Nothing to search with if the query produced no token vectors
                if not query_embedding or not query_embedding[0]:
                    return []
                
                search_body = {
                    "query": {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "Math.max(0, cosineSimilarity(params.query_vector, 'embeddings') + 1.0)",
                                "params": {
                                    "query_vector": query_embedding[0]  # Use first vector for ES search
                                }
                            }
                        }
                    },
                    "size": self.top_k,
                    "_source": ["content", "metadata", "chunk_id", "embeddings"]
                }
                
                response = self.vector_store.client.search(
                    index=self.vector_store.index_name,
                    body=search_body
                )
                
                results = []
                for hit in response['hits']['hits']:
                    # Get document embeddings for late interaction scoring
                    doc_embeddings = hit['_source'].get('embeddings', [])
                    
                    # Compute late interaction score
                    late_interaction_score = self.late_interaction_similarity(
                        query_embedding, doc_embeddings
                    )
                    
                    chunk = Chunk(
                        content=hit['_source']['content'],
                        id=hit['_source']['chunk_id'],
                        metadata=hit['_source']['metadata']
                    )
                    results.append({
                        'chunk': chunk,
                        'score': late_interaction_score
                    })
                
                # Re-sort by late interaction score
                results.sort(key=lambda x: x['score'], reverse=True)
                return results
                
            except Exception as e:
                logger.error(f"❌ Elasticsearch ColBERT search error: {e}")
                return []

# Initialize ColBERT semantic retriever
retriever = ColBERTSemanticRetriever(vector_store, top_k=pipeline_config.get("retrieval_top_k", 10))

# Test queries for ColBERT
test_queries = [
    "What is machine learning and artificial intelligence?",
    "How does natural language processing work?",
    "Tell me about deep learning and neural networks",
    "What is vector search and embeddings?"
]

print("🔍 Testing ColBERT Semantic Search with Late Interaction:")
print("=" * 60)

for i, query in enumerate(test_queries, 1):
    print(f"\n🔎 Query {i}: {query}")
    
    try:
        # Generate query embedding using ColBERT (query type)
        query_embedding = pipeline_simulator.get_single_embedding(query, input_type="query")
        
        if not query_embedding:
            print("❌ Failed to generate query embedding")
            continue
        
        print(f"📊 Query vectors: {len(query_embedding)} tokens")
        
        # Perform ColBERT search with late interaction
        search_start = time.time()
        results = retriever.search(query_embedding)
        search_time = time.time() - search_start
        
        print(f"⏱️ Search time: {search_time:.3f} seconds")
        print(f"📊 Found {len(results)} results")
        
        # Display top 3 results
        for j, result in enumerate(results[:3], 1):
            chunk = result['chunk']
            score = result['score']
            print(f"\n   {j}. Late Interaction Score: {score:.4f}")
            print(f"      Content: {chunk.content[:100]}...")
            print(f"      Chunk ID: {chunk.id}")
            
    except Exception as e:
        print(f"❌ Error processing query {i}: {e}")
        continue

print("\n✅ ColBERT semantic search testing completed!")
print("🎯 Late interaction scoring provides more nuanced similarity matching")

🔍 Testing ColBERT Semantic Search with Late Interaction:

🔎 Query 1: What is machine learning and artificial intelligence?
📊 Query vectors: 32 tokens
⏱️ Search time: 0.118 seconds
📊 Found 10 results

   1. Late Interaction Score: 0.6881
      Content: machine learning is a subset of artificial intelligence that enables computers to learn without bein...
      Chunk ID: sample_dataset_chunk_0000

   2. Late Interaction Score: 0.5795
      Content: deep learning uses neural networks with multiple layers to solve complex problems and recognize patt...
      Chunk ID: sample_dataset_chunk_0002

   3. Late Interaction Score: 0.5720
      Content: natural language processing involves the interaction between computers and human language, enabling ...
      Chunk ID: sample_dataset_chunk_0001

🔎 Query 2: How does natural language processing work?
📊 Query vectors: 32 tokens
⏱️ Search time: 0.118 seconds
📊 Found 10 results

   1. Late Interaction Score: 0.6881
      Content: machine learning is 

## 10. Test Retrieval and Reranking

Implement and test document retrieval with optional reranking, measuring retrieval accuracy and performance.
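
The test in this section reports timings and scores; measuring accuracy additionally requires relevance labels. A minimal sketch of two common metrics follows, assuming such labels exist. The `relevant` judgments and the `ranked` list are hypothetical and only borrow the chunk ID format seen in the notebook output; the function names are illustrative.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked result list and relevance judgments for one query.
ranked = [
    "sample_dataset_chunk_0002",
    "sample_dataset_chunk_0000",
    "sample_dataset_chunk_0001",
]
relevant = {"sample_dataset_chunk_0000", "sample_dataset_chunk_0005"}

print(recall_at_k(ranked, relevant, k=3))  # 0.5
print(reciprocal_rank(ranked, relevant))   # 0.5
```

Averaging `reciprocal_rank` over all test queries gives the mean reciprocal rank, which makes runs with and without reranking directly comparable.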

In [117]:
class ColBERTReranker:
    """Heuristic reranker based on text features (keyword overlap, length, vocabulary diversity)."""
    
    def __init__(self):
        pass
    
    def rerank(self, query: str, chunks: List[Chunk], query_embedding: List[List[float]]) -> List[Dict[str, Any]]:
        """Rerank chunks using text-based features only.

        `query_embedding` is accepted but currently unused, so the late
        interaction score does not influence the final ranking.
        """
        results = []
        
        for chunk in chunks:
            # Text-based scoring features
            query_words = set(query.lower().split())
            chunk_words = set(chunk.content.lower().split())
            
            # Keyword overlap score
            overlap_score = len(query_words.intersection(chunk_words)) / len(query_words) if query_words else 0
            
            # Length preference (prefer moderate length chunks)
            length_score = 1.0 / (1.0 + abs(len(chunk.content.split()) - 100) * 0.005)
            
            # Diversity score (prefer chunks with varied vocabulary)
            diversity_score = len(chunk_words) / len(chunk.content.split()) if chunk.content.split() else 0
            
            # Combined rerank score
            rerank_score = overlap_score * 0.5 + length_score * 0.3 + diversity_score * 0.2
            
            results.append({
                'chunk': chunk,
                'score': rerank_score,
                'overlap_score': overlap_score,
                'length_score': length_score,
                'diversity_score': diversity_score
            })
        
        # Sort by rerank score
        results.sort(key=lambda x: x['score'], reverse=True)
        return results

def comprehensive_colbert_retrieval_test():
    """Comprehensive test of the ColBERT retrieval pipeline."""
    
    print("🔬 Comprehensive ColBERT Retrieval Pipeline Test")
    print("=" * 70)
    
    # Test configuration
    test_queries = [
        "machine learning artificial intelligence algorithms",
        "natural language processing text understanding", 
        "deep learning neural networks pattern recognition",
        "vector database similarity semantic search",
        "elasticsearch distributed search analytics engine"
    ]
    
    # Initialize ColBERT reranker
    reranker = ColBERTReranker()
    
    # Performance metrics
    total_embedding_time = 0
    total_search_time = 0
    total_rerank_time = 0
    
    for i, query in enumerate(test_queries, 1):
        print(f"\n🔎 Test Query {i}: '{query}'")
        print("-" * 50)
        
        try:
            # Step 1: Generate query embedding (ColBERT multi-vector)
            embed_start = time.time()
            query_embedding = pipeline_simulator.get_single_embedding(query, input_type="query")
            embed_time = time.time() - embed_start
            total_embedding_time += embed_time
            
            if not query_embedding:
                print("❌ Failed to generate query embedding")
                continue
            
            print(f"🧮 Query embedding: {len(query_embedding)} token vectors")
            
            # Step 2: Initial ColBERT retrieval with late interaction
            search_start = time.time()
            initial_results = retriever.search(query_embedding)
            search_time = time.time() - search_start
            total_search_time += search_time
            
            print(f"📊 Initial retrieval: {len(initial_results)} results in {search_time:.3f}s")
            
            # Step 3: Reranking (if enabled in config)
            if pipeline_config.get("use_reranker", False):
                rerank_start = time.time()
                
                # Extract chunks for reranking
                initial_chunks = [result['chunk'] for result in initial_results]
                reranked_results = reranker.rerank(query, initial_chunks, query_embedding)
                
                rerank_time = time.time() - rerank_start
                total_rerank_time += rerank_time
                
                print(f"🔄 ColBERT reranking completed in {rerank_time:.3f}s")
                final_results = reranked_results
            else:
                final_results = initial_results
            
            # Step 4: Display results
            print(f"\n🎯 Top 3 Final Results:")
            for j, result in enumerate(final_results[:3], 1):
                chunk = result['chunk']
                score = result['score']
                
                print(f"   {j}. Score: {score:.4f}")
                print(f"      Content: {chunk.content[:80]}...")
                print(f"      Metadata: {chunk.metadata.get('title', 'N/A')}")
                
                # Show reranking details if available
                if 'overlap_score' in result:
                    print(f"      Overlap: {result['overlap_score']:.3f}, Length: {result['length_score']:.3f}, Diversity: {result['diversity_score']:.3f}")
                    
        except Exception as e:
            print(f"❌ Error processing query {i}: {e}")
            continue
    
    # Performance summary
    print(f"\n📈 ColBERT Performance Summary:")
    print(f"├── Total embedding time: {total_embedding_time:.3f}s")
    print(f"├── Average embedding time: {total_embedding_time/len(test_queries):.3f}s")
    print(f"├── Total search time: {total_search_time:.3f}s")
    print(f"├── Average search time: {total_search_time/len(test_queries):.3f}s")
    if total_rerank_time > 0:
        print(f"├── Total rerank time: {total_rerank_time:.3f}s")
        print(f"├── Average rerank time: {total_rerank_time/len(test_queries):.3f}s")
    print(f"└── Total queries tested: {len(test_queries)}")

# Run comprehensive ColBERT test
comprehensive_colbert_retrieval_test()

# Test with reranking enabled
print(f"\n" + "="*70)
print("🔄 Testing ColBERT with Reranking Enabled")
pipeline_config["use_reranker"] = True
comprehensive_colbert_retrieval_test()
pipeline_config["use_reranker"] = False  # Reset

print(f"\n🎯 ColBERT vs Traditional Embeddings:")
print("├── ✅ Token-level interactions for better semantic matching")
print("├── ✅ Handles long documents more effectively") 
print("├── ✅ Late interaction preserves fine-grained relevance")
print("├── ⚠️  Higher computational cost during search")
print("├── ⚠️  More complex indexing and storage requirements")
print("└── 📊 Trades extra compute and storage for finer-grained relevance")

🔬 Comprehensive ColBERT Retrieval Pipeline Test

🔎 Test Query 1: 'machine learning artificial intelligence algorithms'
--------------------------------------------------
🧮 Query embedding: 32 token vectors
📊 Initial retrieval: 10 results in 0.112s

🎯 Top 3 Final Results:
   1. Score: 0.6302
      Content: machine learning is a subset of artificial intelligence that enables computers t...
      Metadata: Document 1
   2. Score: 0.5850
      Content: retrieval - augmented generation combines information retrieval with language ge...
      Metadata: Document 9
   3. Score: 0.5730
      Content: deep learning uses neural networks with multiple layers to solve complex problem...
      Metadata: Document 3

🔎 Test Query 2: 'natural language processing text understanding'
--------------------------------------------------
🧮 Query embedding: 32 token vectors
📊 Initial retrieval: 10 results in 0.112s

🎯 Top 3 Final Results:
   1. Score: 0.6302
      Content: machine learning is a subset of arti

## 11. Cleanup and Performance Monitoring

Monitor index performance and document counts, and implement cleanup procedures for removing test indices, as in the original pipeline.
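
A minimal sketch of index cleanup against a real cluster, assuming an `elasticsearch-py` client object. The helper name `delete_test_indices` is hypothetical and not part of the original pipeline; it relies only on the client's `indices.exists` and `indices.delete` methods.

```python
from typing import Any, Iterable, List


def delete_test_indices(client: Any, index_names: Iterable[str]) -> List[str]:
    """Delete each index that exists and return the names actually deleted.

    `client` is assumed to expose the elasticsearch-py style
    `indices.exists(index=...)` and `indices.delete(index=...)` methods.
    """
    deleted = []
    for name in index_names:
        # Skip names that are not present so a repeated cleanup does not fail
        if client.indices.exists(index=name):
            client.indices.delete(index=name)
            deleted.append(name)
    return deleted
```

With a live cluster this could be called with the notebook's own client and index name, for example `delete_test_indices(vs.client, [vs.index_name])`. In simulation mode there is no index to delete, so clearing the in-memory documents is enough.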

In [118]:
def performance_monitoring():
    """Monitor performance and index statistics for ColBERT implementation."""
    
    print("📊 ColBERT Performance Monitoring & Index Statistics")
    print("=" * 70)
    
    # Vector store statistics
    for dataset_name, vs in pipeline_simulator.vector_stores.items():
        print(f"\n📋 Dataset: {dataset_name}")
        print(f"├── Index name: {vs.index_name}")
        print(f"├── Document count: {vs.count_documents()}")
        print(f"├── Storage mode: {'Simulation' if vs.use_simulation else 'Elasticsearch'}")
        print(f"├── Embedding model: {vs.embedding_config.get('provider_model_id', 'Unknown')}")
        print(f"├── Vector dimensions: {vs.embedding_config.get('embedding_dimensions', 'Unknown')}")
        print(f"└── Multi-vector architecture: ColBERT late interaction")
        
        if not vs.use_simulation and hasattr(vs, 'client'):
            try:
                # Get index stats from Elasticsearch
                stats = vs.client.indices.stats(index=vs.index_name)
                index_stats = stats['indices'][vs.index_name]['total']
                
                print(f"├── Index size: {index_stats['store']['size_in_bytes']} bytes")
                print(f"├── Documents indexed: {index_stats['docs']['count']}")
                print(f"└── Search operations: {index_stats['search']['query_total']}")
                
            except Exception as e:
                print(f"├── ⚠️ Cannot retrieve ES stats: {e}")

def cleanup_pipeline():
    """Cleanup vector stores and temporary files."""
    
    print("\n🧹 ColBERT Pipeline Cleanup Procedures")
    print("=" * 50)
    
    # Clean up temporary files
    try:
        if os.path.exists(temp_jsonl_file):
            os.remove(temp_jsonl_file)
            print(f"✅ Removed temporary file: {temp_jsonl_file}")
    except Exception as e:
        print(f"⚠️ Error removing temp file: {e}")
    
    # Cleanup vector stores (similar to original pipeline)
    if pipeline_simulator.vector_stores:
        print(f"🗑️ Cleaning up {len(pipeline_simulator.vector_stores)} ColBERT vector stores...")
        
        for dataset_name, vs in pipeline_simulator.vector_stores.items():
            try:
                if not vs.use_simulation:
                    # In real implementation, this would delete the index
                    # vs.client.indices.delete(index=vs.index_name)
                    print(f"🗑️ Would delete ColBERT index: {vs.index_name}")
                else:
                    vs.simulated_docs.clear()
                    print(f"🗑️ Cleared simulated ColBERT data for: {dataset_name}")
                    
            except Exception as e:
                print(f"⚠️ Error cleaning up {dataset_name}: {e}")
        
        # Clear the vector stores dictionary
        pipeline_simulator.vector_stores = {}
        print("✅ ColBERT vector stores cleanup completed")

def pipeline_summary():
    """Generate final ColBERT pipeline summary."""
    
    print("\n📈 ColBERT Pipeline Execution Summary")
    print("=" * 70)
    
    print(f"🏗️ ColBERT Pipeline Configuration:")
    print(f"├── Model: {model_config.provider_model_id}")
    print(f"├── Provider: {model_config.provider}")
    print(f"├── API endpoint: {model_config.api_endpoint}")
    print(f"├── Vector dimensions: {model_config.embedding_dimensions}")
    print(f"├── Max tokens: {model_config.max_tokens}")
    print(f"├── Batch size: {pipeline_config.get('batch_size', 16)}")
    print(f"├── Top-K retrieval: {pipeline_config.get('retrieval_top_k', 10)}")
    print(f"└── Reranking enabled: {pipeline_config.get('use_reranker', False)}")
    
    print(f"\n📊 Data Processing Results:")
    print(f"├── Documents processed: {len(chunks)}")
    print(f"├── Valid embeddings generated: {len([e for e in all_embeddings if e])}")
    print(f"├── Failed embeddings: {len([e for e in all_embeddings if not e])}")
    print(f"├── Index name: {index_name}")
    print(f"├── Storage mode: {'Simulation' if vector_store.use_simulation else 'Elasticsearch'}")
    print(f"└── Architecture: ColBERT multi-vector late interaction")
    
    # ColBERT specific statistics
    if valid_embeddings:
        total_vectors = sum(len(emb) for emb in valid_embeddings)
        avg_vectors_per_doc = total_vectors / len(valid_embeddings)
        total_params = total_vectors * model_config.embedding_dimensions
        
        print(f"\n🧮 ColBERT Embedding Statistics:")
        print(f"├── Total token vectors: {total_vectors:,}")
        print(f"├── Average vectors per document: {avg_vectors_per_doc:.1f}")
        print(f"├── Vector dimension: {model_config.embedding_dimensions}")
        print(f"├── Total parameters: {total_params:,}")
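        # Ratio against an assumed baseline of 512 token vectors x 384 dimensions per document
        print(f"└── Stored values vs. assumed 512x384-per-document baseline: {total_params/(len(valid_embeddings)*512*384):.2f}x")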
    
    print(f"\n✅ ColBERT pipeline completed successfully!")
    print(f"🎯 This simulation demonstrates the complete ColBERT embedding pipeline")
    print(f"📝 Key improvements over traditional embeddings:")
    print(f"   ├── Token-level semantic matching")
    print(f"   ├── Late interaction scoring")
    print(f"   ├── Better handling of long documents")
    print(f"   └── More nuanced relevance assessment")

# Run monitoring and summary
performance_monitoring()

print("\n" + "="*70)
print("🎯 FINAL COLBERT PIPELINE SUMMARY")
pipeline_summary()

# Optional: Run cleanup (uncomment if you want to clean up)
# cleanup_pipeline()

print(f"\n🏁 ColBERT Elasticsearch Embedding Pipeline Simulation Complete!")
print(f"📅 Completed at: {datetime.now()}")
print("🔗 This notebook successfully simulates the entire ColBERT pipeline using Jina AI API")
print("🚀 ColBERT v2 provides superior semantic search through late interaction architecture")

📊 ColBERT Performance Monitoring & Index Statistics

📋 Dataset: sample_dataset
├── Index name: colbert_sample_dataset_jina-colbert-v2
├── Document count: 10
├── Storage mode: Simulation
├── Embedding model: jina-colbert-v2
├── Vector dimensions: 128
└── Multi-vector architecture: ColBERT late interaction

🎯 FINAL COLBERT PIPELINE SUMMARY

📈 ColBERT Pipeline Execution Summary
🏗️ ColBERT Pipeline Configuration:
├── Model: jina-colbert-v2
├── Provider: jinaai
├── API endpoint: https://api.jina.ai/v1/multi-vector
├── Vector dimensions: 128
├── Max tokens: 8192
├── Batch size: 16
├── Top-K retrieval: 10
└── Reranking enabled: False

📊 Data Processing Results:
├── Documents processed: 10
├── Valid embeddings generated: 10
├── Failed embeddings: 0
├── Index name: colbert_sample_dataset_jina-colbert-v2
├── Storage mode: Simulation
└── Architecture: ColBERT multi-vector late interaction

🧮 ColBERT Embedding Statistics:
├── Total token vectors: 244
├── Average vectors per document: 24.4
├── Vect