# Document Ingestion Demo for Pommeline Product Knowledge Base

This notebook demonstrates how to ingest product documents into the Pinecone vector store for the Pommeline knowledge base.

## Features:
- Checks for an existing 'curator-pommeline' index in Pinecone
- Creates the index if it doesn't exist, using the HNSW algorithm and dotproduct similarity
- Ingests product documents with proper chunking and embedding
- Normalizes embeddings before storage for optimal retrieval
- Provides detailed logging and progress tracking
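
The normalization step matters because the index uses the dotproduct metric: once vectors are scaled to unit length, their dot product equals the cosine similarity of the original vectors. The following is a minimal sketch of that idea in plain Python. The helper names `l2_normalize`, `dot`, and `cosine` are illustrative assumptions and are not part of the project's embedder.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
doc = [1.0, 2.0]
unit_query, unit_doc = l2_normalize(query), l2_normalize(doc)

# For unit vectors, the dot product and the cosine similarity coincide.
print(dot(unit_query, unit_doc), cosine(query, doc))
```

Normalizing once at ingestion time means a dotproduct index can rank by cosine similarity without recomputing norms at query time, provided query vectors are normalized the same way.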

## Install Required Packages

Uncomment the command below if the packages are not already installed.

In [1]:
# !uv add pinecone-client sentence-transformers python-dotenv

In [2]:
import os
import sys
import pathlib
import logging
from typing import List, Dict, Any
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Add the project root and its src directory to sys.path so the project modules can be imported
sys.path.append(str(pathlib.Path().absolute().parent))
sys.path.append(str(pathlib.Path().absolute().parent / "src"))

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("ingestion_demo")

# Import our modules
from src.ingestion.vector_store import get_vector_store
from src.ingestion.chunker import SemanticChunker, DocumentChunk
from src.ingestion.embedder import EmbeddingGenerator
from src.utils.file_loader import load_documents_from_directory
from src.config import settings

logger.info("Successfully imported all required modules")

2025-10-30 03:28:51,987 - pinecone_index_client - INFO - Initialized PineconeIndexClient for dense index 'curator-pommeline' (dim: 768, metric: dotproduct)


2025-10-30 03:28:52,003 - pinecone_vector_store - INFO - Connected to Pinecone Index container: {'namespaces': {'curator-pommeline-12fa085f': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, '': {'vectorCount': 0}, 'curator-pommeline-f03bab83': {'vectorCount': 0}, 'curator-pommeline-a9b4d456': {'vectorCount': 0}, 'curator-pommeline': {'vectorCount': 92}}, 'dimension': 768, 'indexFullness': 0.0, 'totalVectorCount': 92}
2025-10-30 03:28:52,013 - ingestion_demo - INFO - Successfully imported all required modules


## Configuration

Set up the index configuration for the Pommeline knowledge base. Each run generates a short unique ID, which names the namespace that this run ingests into.

In [3]:
# Index configuration with UUID for unique identification
import uuid

# Generate a short run ID (the first 8 characters of a UUID4) for this notebook run
index_uuid = str(uuid.uuid4())[:8]
INDEX_NAME = f"curator-pommeline-{index_uuid}"
DIMENSION = 768
METRIC = "dotproduct"

# Update settings for our specific index
settings.pinecone_index_name = INDEX_NAME
settings.pinecone_dimension = DIMENSION
settings.pinecone_metric = METRIC

print(f"Generated unique index UUID: {index_uuid}")
print(f"Index configuration: {INDEX_NAME}")
print(f"Dimension: {DIMENSION}, Metric: {METRIC}")
print("Note: This index will be automatically cleaned up at the end of the notebook.")

Generated unique index UUID: ccb01621
Index configuration: curator-pommeline-ccb01621
Dimension: 768, Metric: dotproduct
Note: This index will be automatically cleaned up at the end of the notebook.


## Initialize Vector Store

Connect to Pinecone and inspect the shared 'curator-pommeline' index. Vectors from this run are written to their own namespace inside that index.

In [4]:
# Initialize vector store with our unique configuration
vector_store = get_vector_store()

# Check current status
stats = vector_store.get_stats()
print("Vector Store Status:")
for key, value in stats.items():
    print(f"  {key}: {value}")

print("IMPORTANT: Using unified index architecture")
print(f"Index configured as: '{stats['index_name']}'")
print(f"Ingestion will use namespace: '{INDEX_NAME}'")
print("This creates proper unified index with dense+sparse vectors")
print("This index will be automatically cleaned up at the end of the notebook.")

Vector Store Status:
  total_documents: 0
  embedding_dimension: 768
  index_name: curator-pommeline
  index_fullness: 0
  index_type: pinecone_index_container
  namespaces: {'': {'vectorCount': 0}, 'curator-pommeline-a9b4d456': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}, 'curator-pommeline': {'vectorCount': 92}, 'curator-pommeline-f03bab83': {'vectorCount': 0}, 'curator-pommeline-12fa085f': {'vectorCount': 0}}
IMPORTANT: Using unified index architecture
Index configured as: 'curator-pommeline'
Ingestion will use namespace: 'curator-pommeline-ccb01621'
This creates proper unified index with dense+sparse vectors
This index will be automatically cleaned up at the end of the notebook.


## Load Product Documents

Load all product and policy documents from the data directory.
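
The loader itself lives in `src.utils.file_loader` and is not shown in this notebook. As a rough illustration of what the later cells rely on (a list of dicts with `source` and `content` keys), here is a hypothetical stand-in. The function name `load_markdown_documents` and the choice to read only `*.md` files are assumptions, not the project's actual behavior.

```python
import pathlib
from typing import Dict, List


def load_markdown_documents(directory: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in for load_documents_from_directory.

    Reads every *.md file directly inside `directory` and returns one dict
    per file with the keys the notebook accesses: 'source' and 'content'.
    """
    documents = []
    # Sort the paths so the document order is stable across runs.
    for path in sorted(pathlib.Path(directory).glob("*.md")):
        documents.append(
            {"source": str(path), "content": path.read_text(encoding="utf-8")}
        )
    return documents
```

Calling `load_markdown_documents(str(products_dir))` would then return entries shaped like the ones printed in the preview loop of the loading cell.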

In [5]:
# Define data directories
data_dir = pathlib.Path().absolute().parent / "data"
products_dir = data_dir / "products"
policies_dir = data_dir / "policies"

print(f"Loading documents from: {data_dir}")
print(f"Products directory: {products_dir}")
print(f"Policies directory: {policies_dir}")

# Check if directories exist
if not products_dir.exists():
    logger.warning(f"Products directory not found: {products_dir}")
if not policies_dir.exists():
    logger.warning(f"Policies directory not found: {policies_dir}")

Loading documents from: /Users/aamirsyedaltaf/Documents/curator-pommeline/data
Products directory: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products
Policies directory: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies


In [6]:
# Load documents from both directories
all_documents = []

if products_dir.exists():
    product_docs = load_documents_from_directory(str(products_dir))
    all_documents.extend(product_docs)
    logger.info(f"Loaded {len(product_docs)} product documents")

if policies_dir.exists():
    policy_docs = load_documents_from_directory(str(policies_dir))
    all_documents.extend(policy_docs)
    logger.info(f"Loaded {len(policy_docs)} policy documents")

logger.info(f"Total documents loaded: {len(all_documents)}")

# Display document information
for i, doc in enumerate(all_documents[:3]):
    print(f"\nDocument {i+1}:")
    print(f"  Source: {doc.get('source', 'Unknown')}")
    print(f"  Content length: {len(doc.get('content', ''))}")
    print(f"  Preview: {doc.get('content', '')[:200]}...")

2025-10-30 03:28:52,045 - file_loader - INFO - Loaded 3 documents from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products
2025-10-30 03:28:52,046 - ingestion_demo - INFO - Loaded 3 product documents


2025-10-30 03:28:52,047 - file_loader - INFO - Loaded 2 documents from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies
2025-10-30 03:28:52,047 - ingestion_demo - INFO - Loaded 2 policy documents
2025-10-30 03:28:52,047 - ingestion_demo - INFO - Total documents loaded: 5



Document 1:
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
  Content length: 2364
  Preview: # iPhone 16 Pro

The iPhone 16 Pro represents Apple's latest flagship smartphone, combining cutting-edge technology with premium design and exceptional performance.

## Key Features

### Display and D...

Document 2:
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/airpods_pro_2.md
  Content length: 3271
  Preview: # AirPods Pro (2nd Generation)

The second generation AirPods Pro represent Apple's commitment to premium wireless audio with advanced noise cancellation and spatial audio capabilities.

## Key Featur...

Document 3:
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
  Content length: 4708
  Preview: # MacBook Air with M3 Chip

The MacBook Air with M3 chip combines exceptional performance with incredible portability, featuring a stunning Liquid Retina display and all-day bat

## Document Chunking

Split documents into smaller chunks for better retrieval.
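
To make the `chunk_size` and `chunk_overlap` parameters concrete, the sketch below splits text with a fixed-size sliding window. This is not how `SemanticChunker` works internally (its implementation is not shown here, and the output suggests it splits on document structure); `sliding_window_chunks` is a hypothetical helper that ignores `min_chunk_size` and only illustrates how overlap carries context across chunk boundaries.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a boundary still appears in full context in one of the two chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])
```

With the defaults above, each window starts 450 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.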

In [7]:
# Initialize document chunker
chunker = SemanticChunker(
    chunk_size=500,
    chunk_overlap=50,
    min_chunk_size=50,
)

# Chunk all documents
all_chunks = []

for doc in all_documents:
    chunks = chunker.chunk_text(
        text=doc['content'],
        source=doc['source']
    )
    all_chunks.extend(chunks)

logger.info(f"Created {len(all_chunks)} chunks from {len(all_documents)} documents")

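# Display chunk information (guard against an empty chunk list to avoid dividing by zero)
print(f"Total chunks created: {len(all_chunks)}")
if all_chunks:
    avg_chunk_length = sum(len(chunk.content) for chunk in all_chunks) / len(all_chunks)
    print(f"Average chunk length: {avg_chunk_length:.1f} characters")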

# Show first few chunks
for i, chunk in enumerate(all_chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"  ID: {chunk.chunk_id}")
    print(f"  Source: {chunk.source_file}")
    print(f"  Length: {len(chunk.content)} characters")
    print(f"  Preview: {chunk.content[:150]}...")

2025-10-30 03:28:52,055 - chunker - INFO - Created 11 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
2025-10-30 03:28:52,056 - chunker - INFO - Created 15 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/airpods_pro_2.md
2025-10-30 03:28:52,057 - chunker - INFO - Created 20 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
2025-10-30 03:28:52,058 - chunker - INFO - Created 30 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
2025-10-30 03:28:52,059 - chunker - INFO - Created 30 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
2025-10-30 03:28:52,059 - ingestion_demo - INFO - Created 106 chunks from 5 documents


Total chunks created: 106
Average chunk length: 240.2 characters

Chunk 1:
  ID: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md_chunk_0
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
  Length: 164 characters
  Preview: # iPhone 16 Pro

The iPhone 16 Pro represents Apple's latest flagship smartphone, combining cutting-edge technology with premium design and exceptiona...

Chunk 2:
  ID: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md_chunk_1
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
  Length: 344 characters
  Preview: ### Display and Design
- **6.3-inch Super Retina XDR display** with ProMotion technology
- **Titanium construction** for enhanced durability and reduc...

Chunk 3:
  ID: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md_chunk_2
  Source: /Users/aamirsyedaltaf/Documents/curator-pommelin

## Unified Index Ingestion

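Use the UnifiedIndexIngestion system to store both dense embeddings and sparse BM25 vectors in the same index. Both vector types share 768 dimensions: the BM25 vectorizer is configured with a fixed dimension of 768 to match the embedding model.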
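
For readers unfamiliar with BM25, the sketch below shows one way keyword weights can be packed into a fixed-length vector. It is an illustration under stated assumptions, not the project's `BM25Vectorizer`: the function names, the whitespace-free pre-tokenized input, and the rule of keeping the terms with the highest document frequency are all hypothetical. Only `k1=1.2`, `b=0.75`, and the fixed dimension of 768 come from the notebook's log output.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def fit_bm25(
    corpus: List[List[str]], fixed_dim: int = 768
) -> Tuple[Dict[str, int], Dict[str, float], float]:
    """Build a vocabulary of at most fixed_dim terms, IDF weights, and avgdl."""
    if not corpus:
        raise ValueError("corpus must contain at least one document")
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    # Assumption: keep the terms found in the most documents (ties broken
    # alphabetically) so every vector fits in fixed_dim slots.
    kept = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))[:fixed_dim]
    vocab = {term: index for index, term in enumerate(kept)}
    n_docs = len(corpus)
    idf = {
        t: math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
        for t in kept
    }
    # Fall back to 1.0 if every document is empty, to avoid dividing by zero.
    avgdl = (sum(len(doc) for doc in corpus) / n_docs) or 1.0
    return vocab, idf, avgdl


def bm25_vector(
    doc: List[str],
    vocab: Dict[str, int],
    idf: Dict[str, float],
    avgdl: float,
    fixed_dim: int = 768,
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Return a fixed_dim-length list of BM25 term weights for one document."""
    vector = [0.0] * fixed_dim
    length_norm = k1 * (1 - b + b * len(doc) / avgdl)
    for term, freq in Counter(doc).items():
        if term in vocab:
            weight = idf[term] * freq * (k1 + 1) / (freq + length_norm)
            vector[vocab[term]] = weight
    return vector


corpus = [
    ["iphone", "pro", "camera"],
    ["airpods", "pro", "audio"],
    ["macbook", "air", "battery", "battery"],
]
vocab, idf, avgdl = fit_bm25(corpus, fixed_dim=8)
vec = bm25_vector(corpus[0], vocab, idf, avgdl, fixed_dim=8)
```

Terms that appear in fewer documents receive a higher IDF, so a rare product name contributes more to a match than a word shared across many chunks.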

In [8]:
# Import the UnifiedIndexIngestion system
from src.ingestion.unified_index_ingestion import UnifiedIndexIngestion

# Use the same index name as the vector store
actual_index_name = vector_store.index_name

print("Unified Index Configuration:")
print(f"  Vector Store Index: {vector_store.index_name}")
print(f"  Settings Index: {settings.pinecone_index_name}")
print(f"  Using Index Name: {actual_index_name}")

# Initialize the unified index ingestion system
unified_ingestion = UnifiedIndexIngestion(
    index_name=actual_index_name,
    ingestion_id=index_uuid,
    vector_dimension=DIMENSION
)

print("Initialized UnifiedIndexIngestion:")
print(f"  Index: {unified_ingestion.index_name}")
print(f"  Ingestion ID: {unified_ingestion.ingestion_id}")
print(f"  Vector Dimension: {unified_ingestion.vector_dimension}")
print(f"  BM25 Vectorizer: {'Initialized' if unified_ingestion.bm25_vectorizer else 'Not initialized'}")

Unified Index Configuration:
  Vector Store Index: curator-pommeline
  Settings Index: curator-pommeline-ccb01621
  Using Index Name: curator-pommeline
2025-10-30 03:28:52,075 - pinecone_index_client - INFO - Initialized PineconeIndexClient for dense index 'curator-pommeline' (dim: 768, metric: dotproduct)


2025-10-30 03:28:52,080 - pinecone_vector_store - INFO - Connected to Pinecone Index container: {'namespaces': {'curator-pommeline': {'vectorCount': 92}, 'pommeline': {'vectorCount': 0}, '': {'vectorCount': 0}, 'curator-pommeline-f03bab83': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, 'curator-pommeline-a9b4d456': {'vectorCount': 0}}, 'dimension': 768, 'indexFullness': 0.0, 'totalVectorCount': 92}


2025-10-30 03:28:52,080 - bm25_vectorizer - INFO - Initialized BM25Vectorizer with k1=1.2, b=0.75, fixed_dim=768
2025-10-30 03:28:52,081 - unified_index_ingestion - INFO - Initialized UnifiedIndexIngestion: index='curator-pommeline', dim=768, id='ccb01621'


Initialized UnifiedIndexIngestion:
  Index: curator-pommeline
  Ingestion ID: ccb01621
  Vector Dimension: 768
  BM25 Vectorizer: Initialized


In [9]:
# Perform unified index ingestion with both dense and sparse vectors
logger.info(f"Starting unified index ingestion for {len(all_chunks)} chunks")

ingestion_result = unified_ingestion.ingest_documents(all_chunks)

# Extract results from the dictionary
dense_count = ingestion_result.get("dense_vectors", 0)
sparse_count = ingestion_result.get("sparse_vectors", 0)
failed_count = ingestion_result.get("failed", 0)

logger.info("Unified index ingestion completed successfully!")
logger.info(f"  Dense vectors stored: {dense_count}")
logger.info(f"  Sparse vectors stored: {sparse_count}")
logger.info(f"  Failed chunks: {failed_count}")
logger.info(f"  Total vectors: {dense_count + sparse_count}")

# Set the current ingestion ID so BM25 retrieval works
settings.current_ingestion_id = unified_ingestion.ingestion_id
logger.info(f"Set current_ingestion_id to: {settings.current_ingestion_id}")
logger.info("This enables BM25 keyword search functionality")

# Verify the BM25 vectorizer was registered
from src.retrieval.bm25_vectorizer import get_bm25_vectorizer
test_vectorizer = get_bm25_vectorizer(unified_ingestion.ingestion_id)
if test_vectorizer:
    logger.info("BM25 vectorizer successfully registered and retrievable")
else:
    logger.error("BM25 vectorizer registration failed")

2025-10-30 03:28:52,088 - ingestion_demo - INFO - Starting unified index ingestion for 106 chunks


2025-10-30 03:28:52,089 - unified_index_ingestion - INFO - Starting unified index ingestion of 106 chunks into 768-dim space
2025-10-30 03:28:52,091 - unified_index_ingestion - INFO - Fitting BM25 vectorizer on document corpus
2025-10-30 03:28:52,091 - bm25_vectorizer - INFO - Fitting BM25Vectorizer on 106 documents
2025-10-30 03:28:52,098 - bm25_vectorizer - INFO - BM25Vectorizer fitted with vocabulary size: 768 (fixed_dim: 768)


2025-10-30 03:28:52,129 - embedder - INFO - Loading embedding model: google/embeddinggemma-300m
2025-10-30 03:28:52,130 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: google/embeddinggemma-300m
2025-10-30 03:28:59,493 - sentence_transformers.SentenceTransformer - INFO - 14 prompts are loaded, with the keys: ['query', 'document', 'BitextMining', 'Clustering', 'Classification', 'InstructionRetrieval', 'MultilabelClassification', 'PairClassification', 'Reranking', 'Retrieval', 'Retrieval-query', 'Retrieval-document', 'STS', 'Summarization']


2025-10-30 03:29:00,265 - embedder - INFO - Model loaded successfully. Embedding dimension: 768


2025-10-30 03:29:00,559 - metrics - INFO - Performance: embedding_texts_count took 10.000
2025-10-30 03:29:00,560 - metrics - INFO - Performance: embedding_generation took 8461.768
2025-10-30 03:29:00,597 - metrics - INFO - Performance: vectors_upserted took 20.000
2025-10-30 03:29:00,597 - metrics - INFO - Performance: dense_vectors_ingested took 10.000
2025-10-30 03:29:00,598 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000
2025-10-30 03:29:00,598 - unified_index_ingestion - INFO - Processed batch 1: dense=10, sparse=10


2025-10-30 03:29:00,835 - metrics - INFO - Performance: embedding_texts_count took 10.000
2025-10-30 03:29:00,836 - metrics - INFO - Performance: embedding_generation took 237.117
2025-10-30 03:29:00,852 - metrics - INFO - Performance: vectors_upserted took 20.000
2025-10-30 03:29:00,853 - metrics - INFO - Performance: dense_vectors_ingested took 10.000
2025-10-30 03:29:00,853 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000
2025-10-30 03:29:00,853 - unified_index_ingestion - INFO - Processed batch 2: dense=10, sparse=10


2025-10-30 03:29:01,087 - metrics - INFO - Performance: embedding_texts_count took 10.000
2025-10-30 03:29:01,088 - metrics - INFO - Performance: embedding_generation took 234.182
2025-10-30 03:29:01,105 - metrics - INFO - Performance: vectors_upserted took 20.000
2025-10-30 03:29:01,106 - metrics - INFO - Performance: dense_vectors_ingested took 10.000
2025-10-30 03:29:01,106 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000
2025-10-30 03:29:01,107 - unified_index_ingestion - INFO - Processed batch 3: dense=10, sparse=10


2025-10-30 03:29:01,315 - metrics - INFO - Performance: embedding_texts_count took 10.000
2025-10-30 03:29:01,315 - metrics - INFO - Performance: embedding_generation took 208.395
2025-10-30 03:29:01,332 - metrics - INFO - Performance: vectors_upserted took 20.000
2025-10-30 03:29:01,333 - metrics - INFO - Performance: dense_vectors_ingested took 10.000
2025-10-30 03:29:01,334 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000
2025-10-30 03:29:01,335 - unified_index_ingestion - INFO - Processed batch 4: dense=10, sparse=10


2025-10-30 03:29:01,707 - metrics - INFO - Performance: embedding_texts_count took 10.000
2025-10-30 03:29:01,708 - metrics - INFO - Performance: embedding_generation took 371.904
2025-10-30 03:29:01,722 - metrics - INFO - Performance: vectors_upserted took 20.000
2025-10-30 03:29:01,723 - metrics - INFO - Performance: dense_vectors_ingested took 10.000
2025-10-30 03:29:01,723 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000
2025-10-30 03:29:01,723 - unified_index_ingestion - INFO - Processed batch 5: dense=10, sparse=10


2025-10-30 03:29:01,956 - metrics - INFO - Performance: embedding_texts_count took 10.000
2025-10-30 03:29:01,957 - metrics - INFO - Performance: embedding_generation took 233.387
2025-10-30 03:29:01,972 - metrics - INFO - Performance: vectors_upserted took 20.000
2025-10-30 03:29:01,973 - metrics - INFO - Performance: dense_vectors_ingested took 10.000
2025-10-30 03:29:01,973 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000
2025-10-30 03:29:01,974 - unified_index_ingestion - INFO - Processed batch 6: dense=10, sparse=10


2025-10-30 03:29:02,146 - metrics - INFO - Performance: embedding_texts_count took 10.000
2025-10-30 03:29:02,146 - metrics - INFO - Performance: embedding_generation took 171.792
2025-10-30 03:29:02,165 - metrics - INFO - Performance: vectors_upserted took 20.000
2025-10-30 03:29:02,166 - metrics - INFO - Performance: dense_vectors_ingested took 10.000
2025-10-30 03:29:02,166 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000
2025-10-30 03:29:02,167 - unified_index_ingestion - INFO - Processed batch 7: dense=10, sparse=10


{"asctime": "2025-10-30 03:29:02,436", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 10.000", "operation": "embedding_texts_count", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,436 - metrics - INFO - Performance: embedding_texts_count took 10.000


{"asctime": "2025-10-30 03:29:02,436", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 268.572", "operation": "embedding_generation", "value": 268.5718536376953, "latency_ms": "268.572", "latency_info": "268.6ms"}


2025-10-30 03:29:02,436 - metrics - INFO - Performance: embedding_generation took 268.572


{"asctime": "2025-10-30 03:29:02,451", "name": "metrics", "levelname": "INFO", "message": "Performance: vectors_upserted took 20.000", "operation": "vectors_upserted", "value": 20, "latency_ms": "20.000"}


2025-10-30 03:29:02,451 - metrics - INFO - Performance: vectors_upserted took 20.000


{"asctime": "2025-10-30 03:29:02,452", "name": "metrics", "levelname": "INFO", "message": "Performance: dense_vectors_ingested took 10.000", "operation": "dense_vectors_ingested", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,452 - metrics - INFO - Performance: dense_vectors_ingested took 10.000


{"asctime": "2025-10-30 03:29:02,452", "name": "metrics", "levelname": "INFO", "message": "Performance: sparse_vectors_ingested took 10.000", "operation": "sparse_vectors_ingested", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,452 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000


{"asctime": "2025-10-30 03:29:02,452", "name": "unified_index_ingestion", "levelname": "INFO", "message": "Processed batch 8: dense=10, sparse=10"}


2025-10-30 03:29:02,452 - unified_index_ingestion - INFO - Processed batch 8: dense=10, sparse=10


{"asctime": "2025-10-30 03:29:02,693", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 10.000", "operation": "embedding_texts_count", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,693 - metrics - INFO - Performance: embedding_texts_count took 10.000


{"asctime": "2025-10-30 03:29:02,694", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 240.743", "operation": "embedding_generation", "value": 240.74292182922363, "latency_ms": "240.743", "latency_info": "240.7ms"}


2025-10-30 03:29:02,694 - metrics - INFO - Performance: embedding_generation took 240.743


{"asctime": "2025-10-30 03:29:02,707", "name": "metrics", "levelname": "INFO", "message": "Performance: vectors_upserted took 20.000", "operation": "vectors_upserted", "value": 20, "latency_ms": "20.000"}


2025-10-30 03:29:02,707 - metrics - INFO - Performance: vectors_upserted took 20.000


{"asctime": "2025-10-30 03:29:02,708", "name": "metrics", "levelname": "INFO", "message": "Performance: dense_vectors_ingested took 10.000", "operation": "dense_vectors_ingested", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,708 - metrics - INFO - Performance: dense_vectors_ingested took 10.000


{"asctime": "2025-10-30 03:29:02,708", "name": "metrics", "levelname": "INFO", "message": "Performance: sparse_vectors_ingested took 10.000", "operation": "sparse_vectors_ingested", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,708 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000


{"asctime": "2025-10-30 03:29:02,709", "name": "unified_index_ingestion", "levelname": "INFO", "message": "Processed batch 9: dense=10, sparse=10"}


2025-10-30 03:29:02,709 - unified_index_ingestion - INFO - Processed batch 9: dense=10, sparse=10


{"asctime": "2025-10-30 03:29:02,910", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 10.000", "operation": "embedding_texts_count", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,910 - metrics - INFO - Performance: embedding_texts_count took 10.000


{"asctime": "2025-10-30 03:29:02,910", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 201.117", "operation": "embedding_generation", "value": 201.11727714538574, "latency_ms": "201.117", "latency_info": "201.1ms"}


2025-10-30 03:29:02,910 - metrics - INFO - Performance: embedding_generation took 201.117


{"asctime": "2025-10-30 03:29:02,926", "name": "metrics", "levelname": "INFO", "message": "Performance: vectors_upserted took 20.000", "operation": "vectors_upserted", "value": 20, "latency_ms": "20.000"}


2025-10-30 03:29:02,926 - metrics - INFO - Performance: vectors_upserted took 20.000


{"asctime": "2025-10-30 03:29:02,929", "name": "metrics", "levelname": "INFO", "message": "Performance: dense_vectors_ingested took 10.000", "operation": "dense_vectors_ingested", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,929 - metrics - INFO - Performance: dense_vectors_ingested took 10.000


{"asctime": "2025-10-30 03:29:02,931", "name": "metrics", "levelname": "INFO", "message": "Performance: sparse_vectors_ingested took 10.000", "operation": "sparse_vectors_ingested", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:02,931 - metrics - INFO - Performance: sparse_vectors_ingested took 10.000


{"asctime": "2025-10-30 03:29:02,935", "name": "unified_index_ingestion", "levelname": "INFO", "message": "Processed batch 10: dense=10, sparse=10"}


2025-10-30 03:29:02,935 - unified_index_ingestion - INFO - Processed batch 10: dense=10, sparse=10


{"asctime": "2025-10-30 03:29:03,113", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 6.000", "operation": "embedding_texts_count", "value": 6, "latency_ms": "6.000"}


2025-10-30 03:29:03,113 - metrics - INFO - Performance: embedding_texts_count took 6.000


{"asctime": "2025-10-30 03:29:03,114", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 170.260", "operation": "embedding_generation", "value": 170.25995254516602, "latency_ms": "170.260", "latency_info": "170.3ms"}


2025-10-30 03:29:03,114 - metrics - INFO - Performance: embedding_generation took 170.260


{"asctime": "2025-10-30 03:29:03,127", "name": "metrics", "levelname": "INFO", "message": "Performance: vectors_upserted took 12.000", "operation": "vectors_upserted", "value": 12, "latency_ms": "12.000"}


2025-10-30 03:29:03,127 - metrics - INFO - Performance: vectors_upserted took 12.000


{"asctime": "2025-10-30 03:29:03,128", "name": "metrics", "levelname": "INFO", "message": "Performance: dense_vectors_ingested took 6.000", "operation": "dense_vectors_ingested", "value": 6, "latency_ms": "6.000"}


2025-10-30 03:29:03,128 - metrics - INFO - Performance: dense_vectors_ingested took 6.000


{"asctime": "2025-10-30 03:29:03,128", "name": "metrics", "levelname": "INFO", "message": "Performance: sparse_vectors_ingested took 6.000", "operation": "sparse_vectors_ingested", "value": 6, "latency_ms": "6.000"}


2025-10-30 03:29:03,128 - metrics - INFO - Performance: sparse_vectors_ingested took 6.000


{"asctime": "2025-10-30 03:29:03,129", "name": "unified_index_ingestion", "levelname": "INFO", "message": "Processed batch 11: dense=6, sparse=6"}


2025-10-30 03:29:03,129 - unified_index_ingestion - INFO - Processed batch 11: dense=6, sparse=6


{"asctime": "2025-10-30 03:29:03,131", "name": "bm25_vectorizer", "levelname": "INFO", "message": "BM25Vectorizer saved to data/models/bm25_ccb01621.pkl"}


2025-10-30 03:29:03,131 - bm25_vectorizer - INFO - BM25Vectorizer saved to data/models/bm25_ccb01621.pkl


{"asctime": "2025-10-30 03:29:03,131", "name": "unified_index_ingestion", "levelname": "INFO", "message": "BM25 vectorizer saved for ingestion ID: ccb01621"}


2025-10-30 03:29:03,131 - unified_index_ingestion - INFO - BM25 vectorizer saved for ingestion ID: ccb01621


{"asctime": "2025-10-30 03:29:03,131", "name": "unified_index_ingestion", "levelname": "INFO", "message": "Unified index ingestion completed: {'dense_vectors': 106, 'sparse_vectors': 106, 'failed': 0}"}


2025-10-30 03:29:03,131 - unified_index_ingestion - INFO - Unified index ingestion completed: {'dense_vectors': 106, 'sparse_vectors': 106, 'failed': 0}


{"asctime": "2025-10-30 03:29:03,132", "name": "metrics", "levelname": "INFO", "message": "Performance: unified_index_ingestion_total took 212.000", "operation": "unified_index_ingestion_total", "value": 212, "latency_ms": "212.000"}


2025-10-30 03:29:03,132 - metrics - INFO - Performance: unified_index_ingestion_total took 212.000


{"asctime": "2025-10-30 03:29:03,132", "name": "metrics", "levelname": "INFO", "message": "Performance: unified_index_ingestion took 11043.119", "operation": "unified_index_ingestion", "value": 11043.118715286255, "latency_ms": "11043.119", "latency_info": "11043.1ms"}


2025-10-30 03:29:03,132 - metrics - INFO - Performance: unified_index_ingestion took 11043.119
2025-10-30 03:29:03,133 - ingestion_demo - INFO - Unified index ingestion completed successfully!
2025-10-30 03:29:03,133 - ingestion_demo - INFO -   Dense vectors stored: 106
2025-10-30 03:29:03,133 - ingestion_demo - INFO -   Sparse vectors stored: 106
2025-10-30 03:29:03,134 - ingestion_demo - INFO -   Failed chunks: 0
2025-10-30 03:29:03,134 - ingestion_demo - INFO -   Total vectors: 212
2025-10-30 03:29:03,134 - ingestion_demo - INFO - Set current_ingestion_id to: ccb01621
2025-10-30 03:29:03,134 - ingestion_demo - INFO - This enables BM25 keyword search functionality
2025-10-30 03:29:03,135 - ingestion_demo - INFO - BM25 vectorizer successfully registered and retrievable


## Verification

Verify that the documents were ingested successfully by checking the index statistics and running a few test searches.

In [10]:
# Get final index statistics
final_stats = vector_store.get_stats()
print("\nFinal Index Statistics:")
for key, value in final_stats.items():
    print(f"  {key}: {value}")

# Check what's in the index directly
print("\nDetailed namespace analysis:")
namespaces = final_stats.get("namespaces", {})
for ns_name, ns_data in namespaces.items():
    vector_count = ns_data.get("vectorCount", 0)
    if vector_count > 0:
        print(f"  Namespace '{ns_name}': {vector_count} vectors")


Final Index Statistics:
  total_documents: 0
  embedding_dimension: 768
  index_name: curator-pommeline
  index_fullness: 0
  index_type: pinecone_index_container
  namespaces: {'curator-pommeline': {'vectorCount': 304}, 'curator-pommeline-f03bab83': {'vectorCount': 0}, '': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, 'curator-pommeline-a9b4d456': {'vectorCount': 0}}

Detailed namespace analysis:
  Namespace 'curator-pommeline': 304 vectors
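
The statistics above report `total_documents: 0` even though the `curator-pommeline` namespace holds 304 vectors, so the per-namespace counts look like the more reliable signal here. The following is a minimal sketch of a sanity check, assuming the stats dictionary has the shape printed above. `count_vectors` is a hypothetical helper, not part of the project. The baseline of 92 is taken from the connection log at the start of the notebook, and 212 from the ingestion summary.

```python
from typing import Any, Dict


def count_vectors(stats: Dict[str, Any]) -> int:
    """Sum vectorCount across all namespaces in a stats dictionary."""
    namespaces = stats.get("namespaces", {})
    return sum(ns.get("vectorCount", 0) for ns in namespaces.values())


# Stats copied from the output above (some empty namespaces omitted for brevity).
sample_stats = {
    "total_documents": 0,
    "namespaces": {
        "curator-pommeline": {"vectorCount": 304},
        "pommeline": {"vectorCount": 0},
        "": {"vectorCount": 0},
    },
}

vectors_before = 92   # 'curator-pommeline' count logged when the store connected
vectors_added = 212   # 106 dense + 106 sparse, from the ingestion summary

total = count_vectors(sample_stats)
print(f"Vectors in index: {total}")
print(f"Matches expected total: {total == vectors_before + vectors_added}")
```

If the two numbers disagree after a run, that would be a hint that some upserts overwrote existing IDs or landed in a different namespace.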


In [11]:
# Test search functionality
test_queries = [
    "iPhone 16 Pro features",
    "MacBook Air M3 performance",
    "Student discount policy",
    "Return policy for electronics"
]

print("\nTesting Search Functionality:")
print("=" * 50)

for query in test_queries:
    results = vector_store.search(query, top_k=3)
    print(f"\nQuery: '{query}'")
    print(f"Results found: {len(results)}")
    
    for i, (doc, score) in enumerate(results):
        print(f"  {i+1}. Score: {score:.4f}")
        print(f"     Source: {doc['source_file']}")
        print(f"     Preview: {doc['content'][:100]}...")


Testing Search Functionality:
{"asctime": "2025-10-30 03:29:03,151", "name": "embedder", "levelname": "INFO", "message": "Loading embedding model: google/embeddinggemma-300m"}


2025-10-30 03:29:03,151 - embedder - INFO - Loading embedding model: google/embeddinggemma-300m
2025-10-30 03:29:03,153 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: google/embeddinggemma-300m
2025-10-30 03:29:10,425 - sentence_transformers.SentenceTransformer - INFO - 14 prompts are loaded, with the keys: ['query', 'document', 'BitextMining', 'Clustering', 'Classification', 'InstructionRetrieval', 'MultilabelClassification', 'PairClassification', 'Reranking', 'Retrieval', 'Retrieval-query', 'Retrieval-document', 'STS', 'Summarization']


{"asctime": "2025-10-30 03:29:10,605", "name": "embedder", "levelname": "INFO", "message": "Model loaded successfully. Embedding dimension: 768"}


2025-10-30 03:29:10,605 - embedder - INFO - Model loaded successfully. Embedding dimension: 768


{"asctime": "2025-10-30 03:29:10,724", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:10,724 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:10,724", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 7573.320", "operation": "embedding_generation", "value": 7573.319911956787, "latency_ms": "7573.320", "latency_info": "7573.3ms"}


2025-10-30 03:29:10,724 - metrics - INFO - Performance: embedding_generation took 7573.320


{"asctime": "2025-10-30 03:29:10,795", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:10,795 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:10,795", "name": "metrics", "levelname": "INFO", "message": "Performance: search_results_count took 3.000", "operation": "search_results_count", "value": 3, "latency_ms": "3.000"}


2025-10-30 03:29:10,795 - metrics - INFO - Performance: search_results_count took 3.000


{"asctime": "2025-10-30 03:29:10,796", "name": "metrics", "levelname": "INFO", "message": "Performance: vector_search took 7644.737", "operation": "vector_search", "value": 7644.736766815186, "latency_ms": "7644.737", "latency_info": "7644.7ms"}


2025-10-30 03:29:10,796 - metrics - INFO - Performance: vector_search took 7644.737



Query: 'iPhone 16 Pro features'
Results found: 3
  1. Score: 0.6314
     Source: data/products/iphone_16_pro.md
     Preview: # iPhone 16 Pro

The iPhone 16 Pro represents Apple's latest flagship smartphone, combining cutting-...
  2. Score: 0.6314
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
     Preview: # iPhone 16 Pro

The iPhone 16 Pro represents Apple's latest flagship smartphone, combining cutting-...
  3. Score: 0.5547
     Source: data/products/iphone_16_pro.md
     Preview: ## Pricing and Availability

The iPhone 16 Pro is available starting at $999 for the 128GB model, wi...
{"asctime": "2025-10-30 03:29:10,833", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:10,833 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:10,834", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 37.307", "operation": "embedding_generation", "value": 37.3072624206543, "latency_ms": "37.307", "latency_info": "37.3ms"}


2025-10-30 03:29:10,834 - metrics - INFO - Performance: embedding_generation took 37.307


{"asctime": "2025-10-30 03:29:10,838", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:10,838 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:10,839", "name": "metrics", "levelname": "INFO", "message": "Performance: search_results_count took 3.000", "operation": "search_results_count", "value": 3, "latency_ms": "3.000"}


2025-10-30 03:29:10,839 - metrics - INFO - Performance: search_results_count took 3.000


{"asctime": "2025-10-30 03:29:10,839", "name": "metrics", "levelname": "INFO", "message": "Performance: vector_search took 43.162", "operation": "vector_search", "value": 43.161630630493164, "latency_ms": "43.162", "latency_info": "43.2ms"}


2025-10-30 03:29:10,839 - metrics - INFO - Performance: vector_search took 43.162



Query: 'MacBook Air M3 performance'
Results found: 3
  1. Score: 0.6526
     Source: data/products/macbook_air_m3.md
     Preview: # MacBook Air with M3 Chip

The MacBook Air with M3 chip combines exceptional performance with incre...
  2. Score: 0.6526
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
     Preview: # MacBook Air with M3 Chip

The MacBook Air with M3 chip combines exceptional performance with incre...
  3. Score: 0.5635
     Source: data/products/macbook_air_m3.md
     Preview: ### M3 Chip
- **8-core CPU** with 4 performance cores and 4 efficiency cores
- **8-core GPU** for sm...
{"asctime": "2025-10-30 03:29:11,181", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,181 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,181", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 341.395", "operation": "embedding_generation", "value": 341.39513969421387, "latency_ms": "341.395", "latency_info": "341.4ms"}


2025-10-30 03:29:11,181 - metrics - INFO - Performance: embedding_generation took 341.395


{"asctime": "2025-10-30 03:29:11,188", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,188 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,189", "name": "metrics", "levelname": "INFO", "message": "Performance: search_results_count took 3.000", "operation": "search_results_count", "value": 3, "latency_ms": "3.000"}


2025-10-30 03:29:11,189 - metrics - INFO - Performance: search_results_count took 3.000


{"asctime": "2025-10-30 03:29:11,189", "name": "metrics", "levelname": "INFO", "message": "Performance: vector_search took 349.491", "operation": "vector_search", "value": 349.4911193847656, "latency_ms": "349.491", "latency_info": "349.5ms"}


2025-10-30 03:29:11,189 - metrics - INFO - Performance: vector_search took 349.491



Query: 'Student discount policy'
Results found: 3
  1. Score: 0.6160
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
     Preview: # Student Discount Program

We offer exclusive educational pricing for students, teachers, and educa...
  2. Score: 0.5259
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
     Preview: ### iPad and iPhone
- **iPad Pro**: Up to $100 discount
- **iPad Air**: Up to $50 discount
- **iPad*...
  3. Score: 0.5233
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
     Preview: available throughout the year, with special promotions during back-to-school season.

**Q: Can I com...
{"asctime": "2025-10-30 03:29:11,272", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,272 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,272", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 82.817", "operation": "embedding_generation", "value": 82.81683921813965, "latency_ms": "82.817", "latency_info": "82.8ms"}


2025-10-30 03:29:11,272 - metrics - INFO - Performance: embedding_generation took 82.817


{"asctime": "2025-10-30 03:29:11,278", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,278 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,279", "name": "metrics", "levelname": "INFO", "message": "Performance: search_results_count took 3.000", "operation": "search_results_count", "value": 3, "latency_ms": "3.000"}


2025-10-30 03:29:11,279 - metrics - INFO - Performance: search_results_count took 3.000


{"asctime": "2025-10-30 03:29:11,279", "name": "metrics", "levelname": "INFO", "message": "Performance: vector_search took 89.825", "operation": "vector_search", "value": 89.82491493225098, "latency_ms": "89.825", "latency_info": "89.8ms"}


2025-10-30 03:29:11,279 - metrics - INFO - Performance: vector_search took 89.825



Query: 'Return policy for electronics'
Results found: 3
  1. Score: 0.6778
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
     Preview: ### Return Conditions
- **Items must be in original condition** with all original packaging and acce...
  2. Score: 0.6381
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
     Preview: ### In-Store Returns
1. **Bring the item** to any retail store location
2. **Present original receip...
  3. Score: 0.6176
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
     Preview: # Return and Refund Policy

We want you to be completely satisfied with your purchase. If you're not...
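
In the results above, the first two hits for the iPhone and MacBook queries have identical scores and previews and differ only in whether `source_file` is a relative or an absolute path. This suggests the same chunk was stored twice across ingestion runs. Below is a sketch of collapsing such duplicates at query time, assuming results are `(doc, score)` tuples with a `content` key as in the cell above. `dedupe_results` is a hypothetical helper, not part of the project, and the sample data is illustrative.

```python
from typing import Any, Dict, List, Tuple

SearchResult = Tuple[Dict[str, Any], float]


def dedupe_results(results: List[SearchResult]) -> List[SearchResult]:
    """Keep the highest-scoring hit for each distinct chunk text."""
    best: Dict[str, SearchResult] = {}
    for doc, score in results:
        key = doc["content"].strip()
        # On equal scores the first hit seen is kept.
        if key not in best or score > best[key][1]:
            best[key] = (doc, score)
    # Return hits in descending score order.
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)


# Illustrative data shaped like the output above; "/abs/" is a placeholder prefix.
sample = [
    ({"source_file": "data/products/iphone_16_pro.md", "content": "# iPhone 16 Pro"}, 0.6314),
    ({"source_file": "/abs/data/products/iphone_16_pro.md", "content": "# iPhone 16 Pro"}, 0.6314),
    ({"source_file": "data/products/iphone_16_pro.md", "content": "## Pricing and Availability"}, 0.5547),
]

for doc, score in dedupe_results(sample):
    print(f"{score:.4f}  {doc['source_file']}")
```

Deduplicating at query time only hides the symptom. Clearing the namespace before re-ingesting, or deriving vector IDs from a normalized path, would likely address it at the source.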


In [12]:
# Test the retrieve tool with detailed debugging
from src.tools.retrieve import retrieve_documents
from src.config import settings

print("\nTesting Retrieve Tool:")
print("=" * 40)

# Check current configuration
print("Current Configuration:")
print(f"  Current ingestion ID: '{settings.current_ingestion_id}'")
print(f"  Expected ingestion ID: '{unified_ingestion.ingestion_id}'")
print(f"  Match: {settings.current_ingestion_id == unified_ingestion.ingestion_id}")

# Check index stats
vector_store_stats = vector_store.get_stats()
print(f"  Index stats: {vector_store_stats}")

for query in test_queries[:2]:
    print(f"\nQuery: '{query}'")
    print("-" * 30)
    
    # Test with semantic mode first
    response = retrieve_documents(query, top_k=3, search_mode="semantic")
    
    print(f"Semantic search results: {response.total_results}")
    
    if response.total_results > 0:
        print(f"  Top result: {response.results[0].score:.4f} - {response.results[0].source_file}")
    else:
        print("  No semantic results found")
    
    # Test hybrid mode if semantic works
    if response.total_results > 0:
        hybrid_response = retrieve_documents(query, top_k=5, search_mode="hybrid")
        print(f"Hybrid search results: {hybrid_response.total_results}")
        print(f"  Components used: {hybrid_response.search_metadata.get('components_used', {})}")

{"asctime": "2025-10-30 03:29:11,297", "name": "cache", "levelname": "INFO", "message": "Started cache cleanup task with 300s interval"}


2025-10-30 03:29:11,297 - cache - INFO - Started cache cleanup task with 300s interval



Testing Retrieve Tool:
Current Configuration:
  Current ingestion ID: 'ccb01621'
  Expected ingestion ID: 'ccb01621'
  Match: True
  Index stats: {'total_documents': 0, 'embedding_dimension': 768, 'index_name': 'curator-pommeline', 'index_fullness': 0, 'index_type': 'pinecone_index_container', 'namespaces': {'': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'curator-pommeline': {'vectorCount': 304}, 'curator-pommeline-f03bab83': {'vectorCount': 0}, 'curator-pommeline-a9b4d456': {'vectorCount': 0}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}}}

Query: 'iPhone 16 Pro features'
------------------------------
{"asctime": "2025-10-30 03:29:11,347", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,347 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,347", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 37.222", "operation": "embedding_generation", "value": 37.22190856933594, "latency_ms": "37.222", "latency_info": "37.2ms"}


2025-10-30 03:29:11,347 - metrics - INFO - Performance: embedding_generation took 37.222


{"asctime": "2025-10-30 03:29:11,354", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,354 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,355", "name": "metrics", "levelname": "INFO", "message": "Performance: tool_retrieve took 45.0ms", "operation": "tool_retrieve", "value": 45.03011703491211, "latency_ms": "45.0ms", "latency_info": "45.0ms"}


2025-10-30 03:29:11,355 - metrics - INFO - Performance: tool_retrieve took 45.0ms


Semantic search results: 3
  Top result: 0.6314 - data/products/iphone_16_pro.md
{"asctime": "2025-10-30 03:29:11,387", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,387 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,388", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 32.301", "operation": "embedding_generation", "value": 32.30118751525879, "latency_ms": "32.301", "latency_info": "32.3ms"}


2025-10-30 03:29:11,388 - metrics - INFO - Performance: embedding_generation took 32.301


{"asctime": "2025-10-30 03:29:11,394", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,394 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,404", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,404 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,404", "name": "metrics", "levelname": "INFO", "message": "Performance: bm25_search_results took 10.000", "operation": "bm25_search_results", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:11,404 - metrics - INFO - Performance: bm25_search_results took 10.000


{"asctime": "2025-10-30 03:29:11,405", "name": "metrics", "levelname": "INFO", "message": "Performance: tool_retrieve took 49.4ms", "operation": "tool_retrieve", "value": 49.43680763244629, "latency_ms": "49.4ms", "latency_info": "49.4ms"}


2025-10-30 03:29:11,405 - metrics - INFO - Performance: tool_retrieve took 49.4ms


Hybrid search results: 2
  Components used: {'dense': True, 'bm25': True}

Query: 'MacBook Air M3 performance'
------------------------------
{"asctime": "2025-10-30 03:29:11,440", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,440 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,440", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 34.913", "operation": "embedding_generation", "value": 34.912824630737305, "latency_ms": "34.913", "latency_info": "34.9ms"}


2025-10-30 03:29:11,440 - metrics - INFO - Performance: embedding_generation took 34.913


{"asctime": "2025-10-30 03:29:11,447", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,447 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,448", "name": "metrics", "levelname": "INFO", "message": "Performance: tool_retrieve took 42.4ms", "operation": "tool_retrieve", "value": 42.38319396972656, "latency_ms": "42.4ms", "latency_info": "42.4ms"}


2025-10-30 03:29:11,448 - metrics - INFO - Performance: tool_retrieve took 42.4ms


Semantic search results: 3
  Top result: 0.6526 - data/products/macbook_air_m3.md
{"asctime": "2025-10-30 03:29:11,482", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,482 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,483", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 34.594", "operation": "embedding_generation", "value": 34.594058990478516, "latency_ms": "34.594", "latency_info": "34.6ms"}


2025-10-30 03:29:11,483 - metrics - INFO - Performance: embedding_generation took 34.594


{"asctime": "2025-10-30 03:29:11,490", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,490 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,494", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,494 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,495", "name": "metrics", "levelname": "INFO", "message": "Performance: bm25_search_results took 10.000", "operation": "bm25_search_results", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:11,495 - metrics - INFO - Performance: bm25_search_results took 10.000


{"asctime": "2025-10-30 03:29:11,495", "name": "metrics", "levelname": "INFO", "message": "Performance: tool_retrieve took 47.0ms", "operation": "tool_retrieve", "value": 47.041893005371094, "latency_ms": "47.0ms", "latency_info": "47.0ms"}


2025-10-30 03:29:11,495 - metrics - INFO - Performance: tool_retrieve took 47.0ms


Hybrid search results: 3
  Components used: {'dense': True, 'bm25': True}


In [13]:
# Test each search mode separately for debugging
print("\nTesting Search Modes Separately:")
print("=" * 50)

test_query = "iPhone 16 Pro features"

# Test 1: Dense search only
print("\n1. Testing DENSE search only:")
response = retrieve_documents(test_query, top_k=5, search_mode="semantic")
print(f"   Results: {response.total_results}")
print(f"   Components used: {response.search_metadata.get('components_used', {})}")
if response.results:
    for i, doc in enumerate(response.results):
        print(f"     {i+1}. Score: {doc.score:.4f} - {doc.source_file}")

# Test 2: BM25 search only  
print("\n2. Testing BM25 search only:")
response = retrieve_documents(test_query, top_k=5, search_mode="keyword")
print(f"   Results: {response.total_results}")
print(f"   Components used: {response.search_metadata.get('components_used', {})}")
if response.results:
    for i, doc in enumerate(response.results):
        print(f"     {i+1}. Score: {doc.score:.4f} - {doc.source_file}")

# Test 3: Hybrid search
print("\n3. Testing HYBRID search:")
response = retrieve_documents(test_query, top_k=5, search_mode="hybrid")
if response.results:
    for i, doc in enumerate(response.results):
        print(f"     {i+1}. Score: {doc.score:.4f} - {doc.source_file}")
else:
    print("   No results from hybrid search")

# Check ingestion ID status
print("\nCurrent Configuration:")
print(f"   Current ingestion ID: '{settings.current_ingestion_id}'")
print(f"   Expected ingestion ID: '{unified_ingestion.ingestion_id}'")
print(f"   Match: {settings.current_ingestion_id == unified_ingestion.ingestion_id}")


Testing Search Modes Separately:

1. Testing DENSE search only:
{"asctime": "2025-10-30 03:29:11,541", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,541 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,542", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 39.720", "operation": "embedding_generation", "value": 39.72005844116211, "latency_ms": "39.720", "latency_info": "39.7ms"}


2025-10-30 03:29:11,542 - metrics - INFO - Performance: embedding_generation took 39.720


{"asctime": "2025-10-30 03:29:11,548", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,548 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,549", "name": "metrics", "levelname": "INFO", "message": "Performance: tool_retrieve took 47.2ms", "operation": "tool_retrieve", "value": 47.151803970336914, "latency_ms": "47.2ms", "latency_info": "47.2ms"}


2025-10-30 03:29:11,549 - metrics - INFO - Performance: tool_retrieve took 47.2ms


   Results: 5
   Components used: {'dense': True, 'bm25': False}
     1. Score: 0.6314 - data/products/iphone_16_pro.md
     2. Score: 0.5547 - data/products/iphone_16_pro.md
     3. Score: 0.5359 - /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
     4. Score: 0.5185 - data/products/iphone_16_pro.md
     5. Score: 0.4913 - data/products/iphone_16_pro.md

2. Testing BM25 search only:
{"asctime": "2025-10-30 03:29:11,554", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,554 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,555", "name": "metrics", "levelname": "INFO", "message": "Performance: bm25_search_results took 10.000", "operation": "bm25_search_results", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:11,555 - metrics - INFO - Performance: bm25_search_results took 10.000


{"asctime": "2025-10-30 03:29:11,556", "name": "metrics", "levelname": "INFO", "message": "Performance: tool_retrieve took 5.5ms", "operation": "tool_retrieve", "value": 5.536079406738281, "latency_ms": "5.5ms", "latency_info": "5.5ms"}


2025-10-30 03:29:11,556 - metrics - INFO - Performance: tool_retrieve took 5.5ms


   Results: 4
   Components used: {'dense': False, 'bm25': True}
     1. Score: 0.7811 - /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
     2. Score: 0.4184 - /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
     3. Score: 0.1712 - /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
     4. Score: 0.1635 - data/products/macbook_air_m3.md

3. Testing HYBRID search:
{"asctime": "2025-10-30 03:29:11,600", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_texts_count took 1.000", "operation": "embedding_texts_count", "value": 1, "latency_ms": "1.000"}


2025-10-30 03:29:11,600 - metrics - INFO - Performance: embedding_texts_count took 1.000


{"asctime": "2025-10-30 03:29:11,601", "name": "metrics", "levelname": "INFO", "message": "Performance: embedding_generation took 44.738", "operation": "embedding_generation", "value": 44.738054275512695, "latency_ms": "44.738", "latency_info": "44.7ms"}


2025-10-30 03:29:11,601 - metrics - INFO - Performance: embedding_generation took 44.738


{"asctime": "2025-10-30 03:29:11,607", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,607 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,612", "name": "metrics", "levelname": "INFO", "message": "Performance: queries_performed took 1.0ms", "operation": "queries_performed", "value": 1, "latency_ms": "1.0ms"}


2025-10-30 03:29:11,612 - metrics - INFO - Performance: queries_performed took 1.0ms


{"asctime": "2025-10-30 03:29:11,613", "name": "metrics", "levelname": "INFO", "message": "Performance: bm25_search_results took 10.000", "operation": "bm25_search_results", "value": 10, "latency_ms": "10.000"}


2025-10-30 03:29:11,613 - metrics - INFO - Performance: bm25_search_results took 10.000


{"asctime": "2025-10-30 03:29:11,613", "name": "metrics", "levelname": "INFO", "message": "Performance: tool_retrieve took 57.4ms", "operation": "tool_retrieve", "value": 57.39283561706543, "latency_ms": "57.4ms", "latency_info": "57.4ms"}


2025-10-30 03:29:11,613 - metrics - INFO - Performance: tool_retrieve took 57.4ms


     1. Score: 0.9836 - data/products/iphone_16_pro.md
     2. Score: 0.9677 - /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md

Current Configuration:
   Current ingestion ID: 'ccb01621'
   Expected ingestion ID: 'ccb01621'
   Match: True
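
The result lists above mix relative and absolute `source_file` values for what appears to be the same document: `data/products/iphone_16_pro.md` shows up both with and without an absolute project-directory prefix. A minimal sketch for comparing sources across search modes, assuming every path contains a `data` directory component. `normalize_source` and `unique_sources` are hypothetical helpers, not part of the project's modules:

```python
from pathlib import PurePosixPath
from typing import List


def normalize_source(source_file: str, anchor: str = "data") -> str:
    """Return the path starting at the first `anchor` component.

    Assumption: source paths contain a directory named `anchor`.
    Paths without it are returned unchanged.
    """
    parts = PurePosixPath(source_file).parts
    if anchor in parts:
        return str(PurePosixPath(*parts[parts.index(anchor):]))
    return source_file


def unique_sources(source_files: List[str]) -> List[str]:
    """Deduplicate normalized paths while preserving first-seen order."""
    return list(dict.fromkeys(normalize_source(s) for s in source_files))


# The two hybrid results from the run above
hybrid_sources = [
    "data/products/iphone_16_pro.md",
    "/Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md",
]
print(unique_sources(hybrid_sources))
```

This only groups results by file. Two results from the same file may still be different chunks, so it is not a substitute for chunk-level deduplication.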


## Cleanup

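Clear the vectors that this notebook run wrote to the index.

**WARNING**: This cell permanently deletes every vector in each non-empty namespace of the index by sending a `deleteAll` request per namespace. The index itself is not deleted; on Pinecone cloud you would call `pc.delete_index(...)` for that. Clearing the namespaces prevents leftover data from accumulating in your local Pinecone instance.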
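
Because the deletion cannot be undone, it can help to preview what would be cleared first. A minimal dry-run sketch, assuming the stats dictionary has the `{'namespaces': {name: {'vectorCount': n}}}` shape seen in the index statistics logged earlier in this notebook. `plan_cleanup` is a hypothetical helper and makes no network calls:

```python
from typing import Any, Dict, List, Tuple


def plan_cleanup(stats: Dict[str, Any]) -> Tuple[List[str], int]:
    """Return (non-empty namespace names, total vectors) without deleting anything."""
    namespaces = stats.get("namespaces", {})
    targets = [
        name for name, data in namespaces.items()
        if data.get("vectorCount", 0) > 0
    ]
    total = sum(namespaces[name]["vectorCount"] for name in targets)
    return targets, total


# Example stats, shaped like the statistics logged earlier in this notebook
example_stats = {
    "namespaces": {
        "curator-pommeline": {"vectorCount": 92},
        "pommeline": {"vectorCount": 0},
        "": {"vectorCount": 0},
    }
}
targets, total = plan_cleanup(example_stats)
print(f"Would clear {len(targets)} namespace(s), {total} vectors: {targets}")
```

In the notebook itself, the same function could be called as `plan_cleanup(vector_store.get_stats())`, provided `get_stats()` returns that shape.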

In [14]:
# Clean up the unique index created in this notebook
import pathlib
import sys

import requests
from dotenv import load_dotenv

load_dotenv()
sys.path.append(str(pathlib.Path().absolute().parent / "src"))
from src.config import settings

# Get the final index statistics before cleanup
final_stats = vector_store.get_stats()
index_name = final_stats["index_name"]
namespaces = final_stats.get("namespaces", {})

print(f"Cleaning up unique index '{index_name}'...")
print(f"Final statistics before cleanup:")
print(f"   Total vectors: {final_stats.get('total_documents', 'unknown')}")
print(f"   Namespaces: {namespaces}")

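# Calculate total vectors to delete from the per-namespace counts.
# Note: the 'total_documents' stat printed above can disagree with these
# counts (in the recorded run it was 0 while one namespace held 304 vectors).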
total_vectors_to_delete = sum(
    ns_data.get('vectorCount', 0) 
    for ns_data in namespaces.values()
)

print(f"Cleanup Summary:")
print(f"   Index: {index_name}")
print(f"   Total vectors to delete: {total_vectors_to_delete}")

for ns_name, ns_data in namespaces.items():
    vector_count = ns_data.get('vectorCount', 0)
    if vector_count > 0:
        print(f"   Namespace '{ns_name}': {vector_count} vectors")

# For the local Pinecone index container, clear every non-empty namespace
cleared_namespaces = []
failed_namespaces = []

for ns_name, ns_data in namespaces.items():
    vector_count = ns_data.get('vectorCount', 0)
    if vector_count > 0:
        print(f"Attempting to clear namespace '{ns_name}' ({vector_count} vectors)...")
        
        # Use the delete API with namespace and deleteAll flag
        delete_request = {
            "namespace": ns_name,
            "deleteAll": True
        }
        
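        try:
            response = requests.post(
                f"{settings.pinecone_host}/vectors/delete",
                json=delete_request,
                timeout=30
            )
        except requests.RequestException as exc:
            # Connection errors and timeouts count as failed namespaces
            failed_namespaces.append(ns_name)
            print(f"   Failed to clear namespace '{ns_name}': {exc}")
            continue

        if response.status_code == 200:
            cleared_namespaces.append(ns_name)
            # The delete endpoint may return an empty body, so parse defensively
            result = response.json() if response.content else None
            print(f"   Namespace '{ns_name}' cleared successfully")
            if result:
                print(f"   Response: {result}")
        else:
            failed_namespaces.append(ns_name)
            print(f"   Failed to clear namespace '{ns_name}': {response.status_code}")
            print(f"   Response: {response.text}")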

# Cleanup Summary
print(f"\nCleanup Summary:")
print(f"   Successfully cleared namespaces: {len(cleared_namespaces)}")
for ns in cleared_namespaces:
    print(f"      - {ns}")

if failed_namespaces:
    print(f"   Failed to clear namespaces: {len(failed_namespaces)}")
    for ns in failed_namespaces:
        print(f"      - {ns}")
    print(f"   Note: Manual cleanup may be required for failed namespaces")

print("\nCleanup process completed")
print("In production with Pinecone cloud, you would use:")
print(f"   pc.delete_index('{index_name}') to permanently delete the index")
print(f"Unique index UUID '{index_uuid}' has been processed for cleanup")
print("Resources have been freed from your local Pinecone instance")

# Clear the ingestion ID to prevent conflicts
settings.current_ingestion_id = ""
print("Cleared current_ingestion_id from settings")

Cleaning up unique index 'curator-pommeline'...
Final statistics before cleanup:
   Total vectors: 0
   Namespaces: {'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'curator-pommeline-f03bab83': {'vectorCount': 0}, '': {'vectorCount': 0}, 'curator-pommeline-a9b4d456': {'vectorCount': 0}, 'curator-pommeline': {'vectorCount': 304}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}}
Cleanup Summary:
   Index: curator-pommeline
   Total vectors to delete: 304
   Namespace 'curator-pommeline': 304 vectors
Attempting to clear namespace 'curator-pommeline' (304 vectors)...


   Namespace 'curator-pommeline' cleared successfully

Cleanup Summary:
   Successfully cleared namespaces: 1
      - curator-pommeline

Cleanup process completed
In production with Pinecone cloud, you would use:
   pc.delete_index('curator-pommeline') to permanently delete the index
Unique index UUID 'ccb01621' has been processed for cleanup
Resources have been freed from your local Pinecone instance
Cleared current_ingestion_id from settings
