# Document Ingestion Demo for Pommeline Product Knowledge Base

This notebook demonstrates how to ingest product and policy documents into the Pinecone vector store for the Pommeline knowledge base.

## Features:
- Checks for an existing 'pommeline' index in Pinecone
- Creates the index if it doesn't exist, using the HNSW algorithm and dotproduct similarity
- Ingests product documents with proper chunking and embedding
- Normalizes embeddings before storage for optimal retrieval
- Provides detailed logging and progress tracking
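
The pairing of normalized embeddings with the dotproduct metric can be illustrated with a small sketch: once vectors are scaled to unit length, their dot product equals their cosine similarity. The helpers `l2_normalize`, `dot`, and `cosine` below are hypothetical names written for illustration in plain Python; they are not part of the project's `EmbeddingGenerator`.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)

# For unit-length vectors the dot product and cosine similarity coincide
print(f"cosine(a, b)        = {cosine(a, b):.4f}")
print(f"dot(unit_a, unit_b) = {dot(unit_a, unit_b):.4f}")
```

With unit-length vectors stored in the index, a dotproduct metric ranks results the same way cosine similarity would.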

In [1]:
# Install required packages if not already installed
# !uv add pinecone-client sentence-transformers python-dotenv

In [2]:
import os
import sys
import pathlib
import logging
from typing import List, Dict, Any

# Add the project root (the notebook's parent directory) to sys.path so `src.*` package imports resolve
sys.path.append(str(pathlib.Path().absolute().parent))
# Add src to path as well
sys.path.append(str(pathlib.Path().absolute().parent / "src"))

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("ingestion_demo")

In [3]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Import our modules using package imports
sys.path.insert(0, str(pathlib.Path().absolute().parent))
from src.ingestion.vector_store import get_vector_store
from src.ingestion.chunker import SemanticChunker, DocumentChunk
from src.ingestion.embedder import EmbeddingGenerator
from src.utils.file_loader import load_documents_from_directory
from src.config import settings

logger.info("Successfully imported all required modules")



2025-10-29 13:41:17,959 - pinecone_index_client - INFO - Initialized PineconeIndexClient for dense index 'curator-pommeline' (dim: 768, metric: dotproduct)


2025-10-29 13:41:17,968 - pinecone_vector_store - INFO - Connected to Pinecone Index container: {'namespaces': {'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'curator-pommeline': {'vectorCount': 106}, 'pommeline': {'vectorCount': 0}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, '': {'vectorCount': 109}}, 'dimension': 768, 'indexFullness': 0.0, 'totalVectorCount': 215}
2025-10-29 13:41:17,978 - ingestion_demo - INFO - Successfully imported all required modules


## Configuration

Set up the index configuration for the Pommeline knowledge base.

In [4]:
# Index configuration with UUID for unique identification
import uuid

# Generate a unique UUID for this notebook run
index_uuid = str(uuid.uuid4())[:8]  # Use first 8 characters for brevity
INDEX_NAME = f"curator-pommeline-{index_uuid}"
DIMENSION = 768  # Dimension for google/embeddinggemma-300m model
METRIC = "dotproduct"  # Use dotproduct similarity for normalized embeddings

# Update settings for our specific index
settings.pinecone_index_name = INDEX_NAME
settings.pinecone_dimension = DIMENSION
settings.pinecone_metric = METRIC

print(f"Generated unique index UUID: {index_uuid}")
print(f"Index configuration: {INDEX_NAME}")
print(f"Dimension: {DIMENSION}, Metric: {METRIC}")
print(f"\n📝 Note: This index will be automatically cleaned up at the end of the notebook.")

Generated unique index UUID: f316c5d1
Index configuration: curator-pommeline-f316c5d1
Dimension: 768, Metric: dotproduct

📝 Note: This index will be automatically cleaned up at the end of the notebook.


## Initialize Vector Store

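Connect to the vector store and print its current status. In the recorded output below, `get_vector_store()` reports `index_name: curator-pommeline` rather than the UUID-suffixed name configured above; the import-time log lines suggest the store client was created before `settings` was updated, so the messages about a "unique index" should be read with that in mind.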

In [5]:
# Initialize vector store with our unique configuration
vector_store = get_vector_store()

# Check current status
stats = vector_store.get_stats()
print("Vector Store Status:")
for key, value in stats.items():
    print(f"  {key}: {value}")

print(f"\n🎯 Using unique index: '{stats['index_name']}'")
print(f"📝 Documents will be stored in the '{stats['index_name']}' namespace for traceability.")
print(f"🧹 This index will be automatically cleaned up at the end of the notebook.")

Vector Store Status:
  total_documents: 0
  embedding_dimension: 768
  index_name: curator-pommeline
  index_fullness: 0
  index_type: local_in_memory
  namespaces: {'curator-pommeline': {'vectorCount': 106}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}, '': {'vectorCount': 109}, 'curator-pommeline-12fa085f': {'vectorCount': 0}}

🎯 Using unique index: 'curator-pommeline'
📝 Documents will be stored in the 'curator-pommeline' namespace for traceability.
🧹 This index will be automatically cleaned up at the end of the notebook.


## Load Product Documents

Load all product and policy documents from the data directory.
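
As a rough sketch of what such a loader might do, the snippet below reads every Markdown file in a directory into a dict with `source` and `content` keys, which are the keys the cells below read from each document. The function name `load_markdown_documents` and the restriction to `*.md` files are assumptions made for illustration; the project's `load_documents_from_directory` may behave differently.

```python
import pathlib
import tempfile

def load_markdown_documents(directory):
    """Return one {'source', 'content'} dict per .md file, sorted by path."""
    docs = []
    for path in sorted(pathlib.Path(directory).glob("*.md")):
        docs.append({
            "source": str(path),
            "content": path.read_text(encoding="utf-8"),
        })
    return docs

# Demonstrate on a throwaway directory so the example does not depend on project data
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "example.md").write_text("# Example\n\nBody text.", encoding="utf-8")
    loaded = load_markdown_documents(tmp)

print(f"Loaded {len(loaded)} document(s)")
```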

In [6]:
# Define data directories
data_dir = pathlib.Path().absolute().parent / "data"
products_dir = data_dir / "products"
policies_dir = data_dir / "policies"

print(f"Loading documents from: {data_dir}")
print(f"Products directory: {products_dir}")
print(f"Policies directory: {policies_dir}")

# Check if directories exist
if not products_dir.exists():
    logger.warning(f"Products directory not found: {products_dir}")
if not policies_dir.exists():
    logger.warning(f"Policies directory not found: {policies_dir}")

Loading documents from: /Users/aamirsyedaltaf/Documents/curator-pommeline/data
Products directory: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products
Policies directory: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies


In [7]:
# Load documents from both directories
all_documents = []

# Load product documents
if products_dir.exists():
    product_docs = load_documents_from_directory(str(products_dir))
    all_documents.extend(product_docs)
    logger.info(f"Loaded {len(product_docs)} product documents")

# Load policy documents  
if policies_dir.exists():
    policy_docs = load_documents_from_directory(str(policies_dir))
    all_documents.extend(policy_docs)
    logger.info(f"Loaded {len(policy_docs)} policy documents")

logger.info(f"Total documents loaded: {len(all_documents)}")

# Display document information
for i, doc in enumerate(all_documents[:3]):  # Show first 3 documents
    print(f"\nDocument {i+1}:")
    print(f"  Source: {doc.get('source', 'Unknown')}")
    print(f"  Content length: {len(doc.get('content', ''))}")
    print(f"  Preview: {doc.get('content', '')[:200]}...")

2025-10-29 13:41:17,999 - file_loader - INFO - Loaded 3 documents from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products
2025-10-29 13:41:17,999 - ingestion_demo - INFO - Loaded 3 product documents


2025-10-29 13:41:18,000 - file_loader - INFO - Loaded 2 documents from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies
2025-10-29 13:41:18,000 - ingestion_demo - INFO - Loaded 2 policy documents
2025-10-29 13:41:18,000 - ingestion_demo - INFO - Total documents loaded: 5



Document 1:
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
  Content length: 2364
  Preview: # iPhone 16 Pro

The iPhone 16 Pro represents Apple's latest flagship smartphone, combining cutting-edge technology with premium design and exceptional performance.

## Key Features

### Display and D...

Document 2:
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/airpods_pro_2.md
  Content length: 3271
  Preview: # AirPods Pro (2nd Generation)

The second generation AirPods Pro represent Apple's commitment to premium wireless audio with advanced noise cancellation and spatial audio capabilities.

## Key Featur...

Document 3:
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
  Content length: 4708
  Preview: # MacBook Air with M3 Chip

The MacBook Air with M3 chip combines exceptional performance with incredible portability, featuring a stunning Liquid Retina display and all-day bat

## Document Chunking

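Split each document into smaller chunks (`chunk_size=500`, `chunk_overlap=50`, `min_chunk_size=50`) so that retrieval can return focused passages rather than whole files.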
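
To show how a chunk size and an overlap interact, here is a minimal fixed-window sketch. It is not the `SemanticChunker` implementation: `chunk_with_overlap` is a hypothetical helper, and treating the three parameters as character counts is an assumption.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50, min_chunk_size=50):
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Drop a trailing fragment that is too short, unless it is the only chunk
        if len(piece) < min_chunk_size and chunks:
            break
        chunks.append(piece)
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_with_overlap(sample)
print([len(p) for p in pieces])
```

The overlap means that a sentence cut at a window boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.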

In [8]:
# Initialize document chunker
chunker = SemanticChunker(
    chunk_size=500,
    chunk_overlap=50,
    min_chunk_size=50,
)

# Chunk all documents
all_chunks = []

for doc in all_documents:
    chunks = chunker.chunk_text(
        text=doc['content'],
        source=doc['source']
    )
    all_chunks.extend(chunks)

logger.info(f"Created {len(all_chunks)} chunks from {len(all_documents)} documents")

# Display chunk information
print(f"Total chunks created: {len(all_chunks)}")
if all_chunks:  # guard against division by zero when no documents were loaded
    print(f"Average chunk length: {sum(len(chunk.content) for chunk in all_chunks) / len(all_chunks):.1f} characters")

# Show first few chunks
for i, chunk in enumerate(all_chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"  ID: {chunk.chunk_id}")
    print(f"  Source: {chunk.source_file}")
    print(f"  Length: {len(chunk.content)} characters")
    print(f"  Preview: {chunk.content[:150]}...")

2025-10-29 13:41:18,005 - chunker - INFO - Created 11 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md


2025-10-29 13:41:18,006 - chunker - INFO - Created 15 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/airpods_pro_2.md


2025-10-29 13:41:18,007 - chunker - INFO - Created 20 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md


2025-10-29 13:41:18,008 - chunker - INFO - Created 30 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md


2025-10-29 13:41:18,008 - chunker - INFO - Created 30 chunks from /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
2025-10-29 13:41:18,009 - ingestion_demo - INFO - Created 106 chunks from 5 documents


Total chunks created: 106
Average chunk length: 240.2 characters

Chunk 1:
  ID: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md_chunk_0
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
  Length: 164 characters
  Preview: # iPhone 16 Pro

The iPhone 16 Pro represents Apple's latest flagship smartphone, combining cutting-edge technology with premium design and exceptiona...

Chunk 2:
  ID: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md_chunk_1
  Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
  Length: 344 characters
  Preview: ### Display and Design
- **6.3-inch Super Retina XDR display** with ProMotion technology
- **Titanium construction** for enhanced durability and reduc...

Chunk 3:
  ID: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md_chunk_2
  Source: /Users/aamirsyedaltaf/Documents/curator-pommelin

## Embedding Generation and Storage

Generate embeddings for chunks and store them in the Pinecone index.
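
The upload later in this section proceeds in fixed-size batches. The sketch below isolates that bookkeeping: ceiling division gives the batch count, and progress is computed from the number of items actually processed so it cannot exceed 100% on a short final batch. `iter_batches` is a hypothetical helper, and the list of integers stands in for real chunks.

```python
def iter_batches(items, batch_size):
    """Yield (batch_number, total_batches, batch) tuples, numbering from 1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = (len(items) + batch_size - 1) // batch_size  # ceiling division
    for n, start in enumerate(range(0, len(items), batch_size), start=1):
        yield n, total, items[start:start + batch_size]

chunks = list(range(106))  # stand-in for 106 document chunks
done = 0
progress_log = []
for n, total, batch in iter_batches(chunks, 10):
    done += len(batch)  # count real items, not the nominal batch size
    progress_log.append((n, total, len(batch), round(done / len(chunks) * 100, 1)))

print(progress_log[-1])
```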

In [9]:
import numpy as np

# Initialize embedder
embedder = EmbeddingGenerator(
    model_name=settings.embedding_model,
)

# Test embedding generation on a sample
sample_text = "This is a sample text to test embedding generation."
sample_embedding = embedder.generate_single_embedding(sample_text)

print(f"Sample embedding shape: {sample_embedding.shape}")
print(f"Sample embedding norm: {np.linalg.norm(sample_embedding):.4f}")

2025-10-29 13:41:18,022 - embedder - INFO - Loading embedding model: google/embeddinggemma-300m
2025-10-29 13:41:18,024 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: google/embeddinggemma-300m
2025-10-29 13:41:25,115 - sentence_transformers.SentenceTransformer - INFO - 14 prompts are loaded, with the keys: ['query', 'document', 'BitextMining', 'Clustering', 'Classification', 'InstructionRetrieval', 'MultilabelClassification', 'PairClassification', 'Reranking', 'Retrieval', 'Retrieval-query', 'Retrieval-document', 'STS', 'Summarization']


2025-10-29 13:41:25,631 - embedder - INFO - Model loaded successfully. Embedding dimension: 768


{"asctime": "2025-10-29 13:41:25,711", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 1}


2025-10-29 13:41:25,711 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:25,712", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 7699.0978717803955}


2025-10-29 13:41:25,712 - metrics - INFO - Metric recorded: embedding_generation


Sample embedding shape: (768,)
Sample embedding norm: 1.0000


In [10]:
# Process chunks in batches to avoid memory issues
batch_size = 10
total_processed = 0
failed_chunks = []

logger.info(f"Processing {len(all_chunks)} chunks in batches of {batch_size}")

for i in range(0, len(all_chunks), batch_size):
    batch = all_chunks[i:i + batch_size]
    
    try:
        # Add documents to vector store (this handles embedding generation internally)
        doc_ids = vector_store.add_documents(batch)
        total_processed += len(batch)
        
        logger.info(f"Processed batch {i//batch_size + 1}/{(len(all_chunks)-1)//batch_size + 1}: {len(batch)} chunks")
        
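        # Show progress every 50 chunks and after the final batch
        processed_upto = min(i + batch_size, len(all_chunks))  # clamp so the last batch cannot exceed 100%
        if processed_upto % 50 == 0 or processed_upto == len(all_chunks):
            progress = processed_upto / len(all_chunks) * 100
            logger.info(f"Progress: {progress:.1f}% ({processed_upto}/{len(all_chunks)} chunks)")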
            
    except Exception as e:
        logger.error(f"Failed to process batch {i//batch_size + 1}: {e}")
        failed_chunks.extend([(chunk, str(e)) for chunk in batch])

logger.info("Ingestion completed!")
logger.info(f"Successfully processed: {total_processed} chunks")
logger.info(f"Failed chunks: {len(failed_chunks)}")

if failed_chunks:
    logger.warning("Failed chunks:")
    for chunk, error in failed_chunks[:5]:  # Show first 5 failures
        logger.warning(f"  {chunk.chunk_id}: {error}")

2025-10-29 13:41:25,721 - ingestion_demo - INFO - Processing 106 chunks in batches of 10


2025-10-29 13:41:25,722 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


2025-10-29 13:41:25,723 - embedder - INFO - Loading embedding model: google/embeddinggemma-300m
2025-10-29 13:41:25,725 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: google/embeddinggemma-300m
2025-10-29 13:41:33,317 - sentence_transformers.SentenceTransformer - INFO - 14 prompts are loaded, with the keys: ['query', 'document', 'BitextMining', 'Clustering', 'Classification', 'InstructionRetrieval', 'MultilabelClassification', 'PairClassification', 'Reranking', 'Retrieval', 'Retrieval-query', 'Retrieval-document', 'STS', 'Summarization']


2025-10-29 13:41:33,386 - embedder - INFO - Model loaded successfully. Embedding dimension: 768


{"asctime": "2025-10-29 13:41:33,726", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:33,726 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:33,727", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 8003.880977630615}


2025-10-29 13:41:33,727 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:33,749", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:33,749 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:33,749 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:33,750", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:33,750 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:33,751 - ingestion_demo - INFO - Processed batch 1/11: 10 chunks


2025-10-29 13:41:33,751 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:34,000", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:34,000 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:34,001", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 249.38201904296875}


2025-10-29 13:41:34,001 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:34,018", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:34,018 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:34,019 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:34,019", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:34,019 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:34,020 - ingestion_demo - INFO - Processed batch 2/11: 10 chunks


2025-10-29 13:41:34,020 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:34,260", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:34,260 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:34,261", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 240.18502235412598}


2025-10-29 13:41:34,261 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:34,277", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:34,277 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:34,278 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:34,279", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:34,279 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:34,279 - ingestion_demo - INFO - Processed batch 3/11: 10 chunks


2025-10-29 13:41:34,279 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:34,479", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:34,479 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:34,480", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 200.71101188659668}


2025-10-29 13:41:34,480 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:34,500", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:34,500 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:34,501 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:34,501", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:34,501 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:34,502 - ingestion_demo - INFO - Processed batch 4/11: 10 chunks


2025-10-29 13:41:34,502 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:34,874", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:34,874 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:34,875", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 372.29204177856445}


2025-10-29 13:41:34,875 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:34,896", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:34,896 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:34,897 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:34,898", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:34,898 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:34,898 - ingestion_demo - INFO - Processed batch 5/11: 10 chunks
2025-10-29 13:41:34,899 - ingestion_demo - INFO - Progress: 47.2% (50/106 chunks)


2025-10-29 13:41:34,899 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:35,113", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:35,113 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:35,116", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 215.5168056488037}


2025-10-29 13:41:35,116 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:35,138", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:35,138 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:35,139 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:35,140", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:35,140 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:35,140 - ingestion_demo - INFO - Processed batch 6/11: 10 chunks


2025-10-29 13:41:35,141 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:35,321", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:35,321 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:35,322", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 180.6631088256836}


2025-10-29 13:41:35,322 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:35,336", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:35,336 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:35,338 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:35,338", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:35,338 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:35,339 - ingestion_demo - INFO - Processed batch 7/11: 10 chunks


2025-10-29 13:41:35,339 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:35,613", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:35,613 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:35,614", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 274.7378349304199}


2025-10-29 13:41:35,614 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:35,633", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:35,633 - metrics - INFO - Metric recorded: vectors_upserted


2025-10-29 13:41:35,634 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:35,635", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:35,635 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:35,635 - ingestion_demo - INFO - Processed batch 8/11: 10 chunks


2025-10-29 13:41:35,635 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:35,888", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:35,888 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:35,890", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 253.65090370178223}


2025-10-29 13:41:35,890 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:35,907", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:35,907 - metrics - INFO - Metric recorded: vectors_upserted


{"asctime": "2025-10-29 13:41:35,908", "name": "pinecone_vector_store", "levelname": "INFO", "message": "Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'"}


2025-10-29 13:41:35,908 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:35,908", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:35,908 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:35,909 - ingestion_demo - INFO - Processed batch 9/11: 10 chunks


{"asctime": "2025-10-29 13:41:35,909", "name": "pinecone_vector_store", "levelname": "INFO", "message": "Adding 10 document chunks to local Pinecone vector store"}


2025-10-29 13:41:35,909 - pinecone_vector_store - INFO - Adding 10 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:36,129", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 10}


2025-10-29 13:41:36,129 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,130", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 220.11804580688477}


2025-10-29 13:41:36,130 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,143", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 10}


2025-10-29 13:41:36,143 - metrics - INFO - Metric recorded: vectors_upserted


{"asctime": "2025-10-29 13:41:36,143", "name": "pinecone_vector_store", "levelname": "INFO", "message": "Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'"}


2025-10-29 13:41:36,143 - pinecone_vector_store - INFO - Successfully upserted 10 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:36,144", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 10}


2025-10-29 13:41:36,144 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:36,144 - ingestion_demo - INFO - Processed batch 10/11: 10 chunks
2025-10-29 13:41:36,145 - ingestion_demo - INFO - Progress: 94.3% (100/106 chunks)


{"asctime": "2025-10-29 13:41:36,145", "name": "pinecone_vector_store", "levelname": "INFO", "message": "Adding 6 document chunks to local Pinecone vector store"}


2025-10-29 13:41:36,145 - pinecone_vector_store - INFO - Adding 6 document chunks to local Pinecone vector store


{"asctime": "2025-10-29 13:41:36,318", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 6}


2025-10-29 13:41:36,318 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,319", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 173.37393760681152}


2025-10-29 13:41:36,319 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,331", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vectors_upserted", "operation": "vectors_upserted", "value": 6}


2025-10-29 13:41:36,331 - metrics - INFO - Metric recorded: vectors_upserted


{"asctime": "2025-10-29 13:41:36,331", "name": "pinecone_vector_store", "levelname": "INFO", "message": "Successfully upserted 6 vectors to local store in namespace 'curator-pommeline'"}


2025-10-29 13:41:36,331 - pinecone_vector_store - INFO - Successfully upserted 6 vectors to local store in namespace 'curator-pommeline'


{"asctime": "2025-10-29 13:41:36,332", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: documents_added", "operation": "documents_added", "value": 6}


2025-10-29 13:41:36,332 - metrics - INFO - Metric recorded: documents_added
2025-10-29 13:41:36,332 - ingestion_demo - INFO - Processed batch 11/11: 6 chunks
2025-10-29 13:41:36,332 - ingestion_demo - INFO - Progress: 103.8% (110/106 chunks)
2025-10-29 13:41:36,333 - ingestion_demo - INFO - 
Ingestion completed!
2025-10-29 13:41:36,334 - ingestion_demo - INFO - Successfully processed: 106 chunks
2025-10-29 13:41:36,335 - ingestion_demo - INFO - Failed chunks: 0
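
The final progress line above reports 103.8% (110/106 chunks), while the completion message reports 106 chunks. That pattern is consistent with a counter that advances by the nominal batch size (11 × 10 = 110) rather than by the actual length of each batch. The ingestion loop itself is not shown in this part of the notebook, so the following is only a minimal sketch of a progress calculation that counts actual batch lengths; `iter_batches` and `progress_after_each_batch` are hypothetical helper names, not part of the project.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(chunks: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


def progress_after_each_batch(
    chunks: Sequence, batch_size: int
) -> List[Tuple[int, float]]:
    """Return (processed_count, percent_complete) after every batch."""
    total = len(chunks)
    processed = 0
    history: List[Tuple[int, float]] = []
    for batch in iter_batches(chunks, batch_size):
        # Add the real batch length so a short final batch counts as 6, not 10.
        processed += len(batch)
        history.append((processed, 100.0 * processed / total))
    return history


# 106 placeholder chunks in batches of 10, mirroring the run above.
history = progress_after_each_batch(list(range(106)), batch_size=10)
print(history[-1])
```

With this counting, the last entry stays at the true total instead of exceeding 100%.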


## Verification

Verify that the documents were successfully ingested by checking the index statistics and performing a test search.

In [11]:
# Get final index statistics
final_stats = vector_store.get_stats()
print("\nFinal Index Statistics:")
for key, value in final_stats.items():
    print(f"  {key}: {value}")


Final Index Statistics:
  total_documents: 0
  embedding_dimension: 768
  index_name: curator-pommeline
  index_fullness: 0
  index_type: local_in_memory
  namespaces: {'': {'vectorCount': 109}, 'curator-pommeline': {'vectorCount': 106}, 'pommeline': {'vectorCount': 0}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}}
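
In the statistics above, `total_documents` reads 0 even though the `namespaces` entry lists vectors. A minimal sketch for deriving the counts from the per-namespace data instead, assuming the stats dictionary has the shape printed above (`count_vectors` is a hypothetical helper, not part of the vector store API):

```python
from typing import Any, Dict


def count_vectors(stats: Dict[str, Any]) -> Dict[str, int]:
    """Map each namespace name to its vectorCount (0 when missing)."""
    namespaces = stats.get("namespaces", {})
    return {
        name: int(info.get("vectorCount", 0))
        for name, info in namespaces.items()
    }


# Stats copied from the output above.
stats = {
    "total_documents": 0,
    "namespaces": {
        "": {"vectorCount": 109},
        "curator-pommeline": {"vectorCount": 106},
        "pommeline": {"vectorCount": 0},
        "curator-pommeline-12fa085f": {"vectorCount": 0},
        "curator-pommeline-7b1a7bbb": {"vectorCount": 0},
    },
}

per_namespace = count_vectors(stats)
total = sum(per_namespace.values())
print(per_namespace["curator-pommeline"], total)
```

The count for `curator-pommeline` matches the 106 chunks reported by the ingestion step; the remaining vectors sit in the default namespace `''`.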


In [12]:
# Test search functionality
test_queries = [
    "iPhone 16 Pro features",
    "MacBook Air M3 performance",
    "Student discount policy",
    "Return policy for electronics"
]

print("\nTesting Search Functionality:")
print("=" * 50)

for query in test_queries:
    try:
        results = vector_store.search(query, top_k=3)
        print(f"\nQuery: '{query}'")
        print(f"Results found: {len(results)}")
        
        for i, (doc, score) in enumerate(results):
            print(f"  {i+1}. Score: {score:.4f}")
            print(f"     Source: {doc['source_file']}")
            print(f"     Preview: {doc['content'][:100]}...")
            
    except Exception as e:
        print(f"\nQuery: '{query}' - ERROR: {e}")


Testing Search Functionality:
{"asctime": "2025-10-29 13:41:36,473", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 1}


2025-10-29 13:41:36,473 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,474", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 123.04997444152832}


2025-10-29 13:41:36,474 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,480", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: queries_performed", "operation": "queries_performed", "value": 1}


2025-10-29 13:41:36,480 - metrics - INFO - Metric recorded: queries_performed


{"asctime": "2025-10-29 13:41:36,480", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: search_results_count", "operation": "search_results_count", "value": 3}


2025-10-29 13:41:36,480 - metrics - INFO - Metric recorded: search_results_count


{"asctime": "2025-10-29 13:41:36,481", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vector_search", "operation": "vector_search", "value": 130.01275062561035}


2025-10-29 13:41:36,481 - metrics - INFO - Metric recorded: vector_search



Query: 'iPhone 16 Pro features'
Results found: 3
  1. Score: 0.6314
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
     Preview: # iPhone 16 Pro

The iPhone 16 Pro represents Apple's latest flagship smartphone, combining cutting-...
  2. Score: 0.5547
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/iphone_16_pro.md
     Preview: ## Pricing and Availability

The iPhone 16 Pro is available starting at $999 for the 128GB model, wi...
  3. Score: 0.5359
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
     Preview: ### Hearing
- **Mono Audio** for balanced listening
- **Live Listen** with Made for iPhone hearing a...
{"asctime": "2025-10-29 13:41:36,527", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 1}


2025-10-29 13:41:36,527 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,527", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 46.29802703857422}


2025-10-29 13:41:36,527 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,533", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: queries_performed", "operation": "queries_performed", "value": 1}


2025-10-29 13:41:36,533 - metrics - INFO - Metric recorded: queries_performed


{"asctime": "2025-10-29 13:41:36,534", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: search_results_count", "operation": "search_results_count", "value": 3}


2025-10-29 13:41:36,534 - metrics - INFO - Metric recorded: search_results_count


{"asctime": "2025-10-29 13:41:36,535", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vector_search", "operation": "vector_search", "value": 53.94387245178223}


2025-10-29 13:41:36,535 - metrics - INFO - Metric recorded: vector_search



Query: 'MacBook Air M3 performance'
Results found: 3
  1. Score: 0.6526
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
     Preview: # MacBook Air with M3 Chip

The MacBook Air with M3 chip combines exceptional performance with incre...
  2. Score: 0.5635
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
     Preview: ### M3 Chip
- **8-core CPU** with 4 performance cores and 4 efficiency cores
- **8-core GPU** for sm...
  3. Score: 0.5355
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/products/macbook_air_m3.md
     Preview: ## Technical Specifications

| Component | Specification |
|-----------|---------------|
| Processor...
{"asctime": "2025-10-29 13:41:36,713", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 1}


2025-10-29 13:41:36,713 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,714", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 177.63805389404297}


2025-10-29 13:41:36,714 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,719", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: queries_performed", "operation": "queries_performed", "value": 1}


2025-10-29 13:41:36,719 - metrics - INFO - Metric recorded: queries_performed


{"asctime": "2025-10-29 13:41:36,720", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: search_results_count", "operation": "search_results_count", "value": 3}


2025-10-29 13:41:36,720 - metrics - INFO - Metric recorded: search_results_count


{"asctime": "2025-10-29 13:41:36,721", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vector_search", "operation": "vector_search", "value": 184.79585647583008}


2025-10-29 13:41:36,721 - metrics - INFO - Metric recorded: vector_search



Query: 'Student discount policy'
Results found: 3
  1. Score: 0.6160
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
     Preview: # Student Discount Program

We offer exclusive educational pricing for students, teachers, and educa...
  2. Score: 0.5259
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
     Preview: ### iPad and iPhone
- **iPad Pro**: Up to $100 discount
- **iPad Air**: Up to $50 discount
- **iPad*...
  3. Score: 0.5233
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/student_discount.md
     Preview: available throughout the year, with special promotions during back-to-school season.

**Q: Can I com...
{"asctime": "2025-10-29 13:41:36,809", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 1}


2025-10-29 13:41:36,809 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,809", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 87.97907829284668}


2025-10-29 13:41:36,809 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,816", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: queries_performed", "operation": "queries_performed", "value": 1}


2025-10-29 13:41:36,816 - metrics - INFO - Metric recorded: queries_performed


{"asctime": "2025-10-29 13:41:36,816", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: search_results_count", "operation": "search_results_count", "value": 3}


2025-10-29 13:41:36,816 - metrics - INFO - Metric recorded: search_results_count


{"asctime": "2025-10-29 13:41:36,817", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vector_search", "operation": "vector_search", "value": 95.27587890625}


2025-10-29 13:41:36,817 - metrics - INFO - Metric recorded: vector_search



Query: 'Return policy for electronics'
Results found: 3
  1. Score: 0.6778
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
     Preview: ### Return Conditions
- **Items must be in original condition** with all original packaging and acce...
  2. Score: 0.6381
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
     Preview: ### In-Store Returns
1. **Bring the item** to any retail store location
2. **Present original receip...
  3. Score: 0.6176
     Source: /Users/aamirsyedaltaf/Documents/curator-pommeline/data/policies/return_policy.md
     Preview: # Return and Refund Policy

We want you to be completely satisfied with your purchase. If you're not...
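
The visual check above can also be expressed as an assertion: for each query, the top hit should come from the expected source file. This is a minimal sketch, assuming results are `(doc, score)` pairs with a `source_file` key as in the cell above. `top_hit_source`, the `expected` mapping, and the sample data with shortened paths are illustrative, not part of the project.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

Result = Tuple[Dict[str, Any], float]


def top_hit_source(results: List[Result]) -> Optional[str]:
    """Return the file name of the highest-ranked result, or None if empty."""
    if not results:
        return None
    doc, _score = results[0]
    return Path(doc["source_file"]).name


# Expected top source per query (file names taken from the output above).
expected = {
    "iPhone 16 Pro features": "iphone_16_pro.md",
    "Return policy for electronics": "return_policy.md",
}

# Illustrative stand-in for vector_store.search(query, top_k=3) results.
sample_results = {
    "iPhone 16 Pro features": [
        ({"source_file": "data/products/iphone_16_pro.md"}, 0.6314),
        ({"source_file": "data/products/macbook_air_m3.md"}, 0.5359),
    ],
    "Return policy for electronics": [
        ({"source_file": "data/policies/return_policy.md"}, 0.6778),
    ],
}

mismatches = [
    query
    for query, filename in expected.items()
    if top_hit_source(sample_results[query]) != filename
]
print(mismatches)
```

In the notebook, `sample_results[query]` would be replaced by the live call to `vector_store.search`.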


In [13]:
# Test the retrieve tool
from src.tools.retrieve import retrieve_documents

print("\nTesting Retrieve Tool:")
print("=" * 30)

for query in test_queries[:2]:  # Test first 2 queries
    try:
        response = retrieve_documents(query, top_k=2)
        print(f"\nQuery: '{query}'")
        print(f"Total results: {response.total_results}")
        print(f"Search metadata: {response.search_metadata}")
        
        for i, doc in enumerate(response.results):
            print(f"  {i+1}. Score: {doc.score:.4f}")
            print(f"     Source: {doc.source_file}")
            print(f"     Preview: {doc.content[:100]}...")
            
    except Exception as e:
        print(f"\nQuery: '{query}' - ERROR: {e}")

{"asctime": "2025-10-29 13:41:36,831", "name": "pinecone_index_client", "levelname": "INFO", "message": "Initialized PineconeIndexClient for dense index 'curator-pommeline-f316c5d1' (dim: 768, metric: dotproduct)"}


2025-10-29 13:41:36,831 - pinecone_index_client - INFO - Initialized PineconeIndexClient for dense index 'curator-pommeline-f316c5d1' (dim: 768, metric: dotproduct)


{"asctime": "2025-10-29 13:41:36,838", "name": "pinecone_vector_store", "levelname": "INFO", "message": "Connected to Pinecone Index container: {'namespaces': {'': {'vectorCount': 109}, 'curator-pommeline': {'vectorCount': 106}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}}, 'dimension': 768, 'indexFullness': 0.0, 'totalVectorCount': 215}"}


2025-10-29 13:41:36,838 - pinecone_vector_store - INFO - Connected to Pinecone Index container: {'namespaces': {'': {'vectorCount': 109}, 'curator-pommeline': {'vectorCount': 106}, 'curator-pommeline-12fa085f': {'vectorCount': 0}, 'pommeline': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}}, 'dimension': 768, 'indexFullness': 0.0, 'totalVectorCount': 215}


{"asctime": "2025-10-29 13:41:36,843", "name": "cache", "levelname": "INFO", "message": "Started cache cleanup task with 300s interval"}


2025-10-29 13:41:36,843 - cache - INFO - Started cache cleanup task with 300s interval



Testing Retrieve Tool:
{"asctime": "2025-10-29 13:41:36,847", "name": "retrieve_tool", "levelname": "INFO", "message": "Retrieving documents for query: 'iPhone 16 Pro features'"}


2025-10-29 13:41:36,847 - retrieve_tool - INFO - Retrieving documents for query: 'iPhone 16 Pro features'


{"asctime": "2025-10-29 13:41:36,848", "name": "retrieve_tool", "levelname": "INFO", "message": "Using hybrid searcher with index: curator-pommeline"}


2025-10-29 13:41:36,848 - retrieve_tool - INFO - Using hybrid searcher with index: curator-pommeline


{"asctime": "2025-10-29 13:41:36,888", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 1}


2025-10-29 13:41:36,888 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,888", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 40.36998748779297}


2025-10-29 13:41:36,888 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,894", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: queries_performed", "operation": "queries_performed", "value": 1}


2025-10-29 13:41:36,894 - metrics - INFO - Metric recorded: queries_performed


{"asctime": "2025-10-29 13:41:36,894", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: search_results_count", "operation": "search_results_count", "value": 8}


2025-10-29 13:41:36,894 - metrics - INFO - Metric recorded: search_results_count


{"asctime": "2025-10-29 13:41:36,895", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vector_search", "operation": "vector_search", "value": 46.4630126953125}


2025-10-29 13:41:36,895 - metrics - INFO - Metric recorded: vector_search


{"asctime": "2025-10-29 13:41:36,895", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: hybrid_search_results", "operation": "hybrid_search_results", "value": 4}


2025-10-29 13:41:36,895 - metrics - INFO - Metric recorded: hybrid_search_results


{"asctime": "2025-10-29 13:41:36,895", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: hybrid_search", "operation": "hybrid_search", "value": 47.08719253540039}


2025-10-29 13:41:36,895 - metrics - INFO - Metric recorded: hybrid_search


{"asctime": "2025-10-29 13:41:36,896", "name": "retrieve_tool", "levelname": "INFO", "message": "Retrieved 0 documents"}


2025-10-29 13:41:36,896 - retrieve_tool - INFO - Retrieved 0 documents


{"asctime": "2025-10-29 13:41:36,896", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: retrieve_documents_count", "operation": "retrieve_documents_count", "value": 0}


2025-10-29 13:41:36,896 - metrics - INFO - Metric recorded: retrieve_documents_count


{"asctime": "2025-10-29 13:41:36,896", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: tool_retrieve", "operation": "tool_retrieve", "value": 48.84195327758789}


2025-10-29 13:41:36,896 - metrics - INFO - Metric recorded: tool_retrieve



Query: 'iPhone 16 Pro features'
Total results: 0
Search metadata: {'query_length': 22, 'search_method': 'hybrid', 'components_used': {'bm25': False, 'dense': False}, 'similarity_threshold': 0.15, 'original_results': 0, 'filtered_results': 0}
{"asctime": "2025-10-29 13:41:36,897", "name": "retrieve_tool", "levelname": "INFO", "message": "Retrieving documents for query: 'MacBook Air M3 performance'"}


2025-10-29 13:41:36,897 - retrieve_tool - INFO - Retrieving documents for query: 'MacBook Air M3 performance'


{"asctime": "2025-10-29 13:41:36,897", "name": "retrieve_tool", "levelname": "INFO", "message": "Using hybrid searcher with index: curator-pommeline"}


2025-10-29 13:41:36,897 - retrieve_tool - INFO - Using hybrid searcher with index: curator-pommeline


{"asctime": "2025-10-29 13:41:36,934", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_texts_count", "operation": "embedding_texts_count", "value": 1}


2025-10-29 13:41:36,934 - metrics - INFO - Metric recorded: embedding_texts_count


{"asctime": "2025-10-29 13:41:36,935", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: embedding_generation", "operation": "embedding_generation", "value": 37.7500057220459}


2025-10-29 13:41:36,935 - metrics - INFO - Metric recorded: embedding_generation


{"asctime": "2025-10-29 13:41:36,941", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: queries_performed", "operation": "queries_performed", "value": 1}


2025-10-29 13:41:36,941 - metrics - INFO - Metric recorded: queries_performed


{"asctime": "2025-10-29 13:41:36,941", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: search_results_count", "operation": "search_results_count", "value": 8}


2025-10-29 13:41:36,941 - metrics - INFO - Metric recorded: search_results_count


{"asctime": "2025-10-29 13:41:36,942", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: vector_search", "operation": "vector_search", "value": 44.573307037353516}


2025-10-29 13:41:36,942 - metrics - INFO - Metric recorded: vector_search


{"asctime": "2025-10-29 13:41:36,942", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: hybrid_search_results", "operation": "hybrid_search_results", "value": 4}


2025-10-29 13:41:36,942 - metrics - INFO - Metric recorded: hybrid_search_results


{"asctime": "2025-10-29 13:41:36,943", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: hybrid_search", "operation": "hybrid_search", "value": 45.408010482788086}


2025-10-29 13:41:36,943 - metrics - INFO - Metric recorded: hybrid_search


{"asctime": "2025-10-29 13:41:36,943", "name": "retrieve_tool", "levelname": "INFO", "message": "Retrieved 0 documents"}


2025-10-29 13:41:36,943 - retrieve_tool - INFO - Retrieved 0 documents


{"asctime": "2025-10-29 13:41:36,943", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: retrieve_documents_count", "operation": "retrieve_documents_count", "value": 0}


2025-10-29 13:41:36,943 - metrics - INFO - Metric recorded: retrieve_documents_count


{"asctime": "2025-10-29 13:41:36,944", "name": "metrics", "levelname": "INFO", "message": "Metric recorded: tool_retrieve", "operation": "tool_retrieve", "value": 47.25384712219238}


2025-10-29 13:41:36,944 - metrics - INFO - Metric recorded: tool_retrieve



Query: 'MacBook Air M3 performance'
Total results: 0
Search metadata: {'query_length': 26, 'search_method': 'hybrid', 'components_used': {'bm25': False, 'dense': False}, 'similarity_threshold': 0.15, 'original_results': 0, 'filtered_results': 0}
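
Both queries return 0 results with `original_results: 0`, which indicates that no candidates reached the threshold step at all, so the `similarity_threshold` of 0.15 is not what removed them. How the retrieve tool applies its threshold is not shown in this notebook. As a rough sketch of what such a filter could look like, assuming results are `(doc, score)` pairs (`filter_by_threshold` is a hypothetical function):

```python
from typing import Any, Dict, List, Tuple

Result = Tuple[Dict[str, Any], float]


def filter_by_threshold(
    results: List[Result], threshold: float
) -> Tuple[List[Result], Dict[str, Any]]:
    """Keep results scoring at or above threshold and report both counts."""
    kept = [(doc, score) for doc, score in results if score >= threshold]
    metadata = {
        "similarity_threshold": threshold,
        "original_results": len(results),
        "filtered_results": len(kept),
    }
    return kept, metadata


# Illustrative candidates: two above the threshold, one below.
candidates = [
    ({"source_file": "iphone_16_pro.md"}, 0.63),
    ({"source_file": "iphone_16_pro.md"}, 0.55),
    ({"source_file": "macbook_air_m3.md"}, 0.10),
]

kept, metadata = filter_by_threshold(candidates, threshold=0.15)
empty_kept, empty_metadata = filter_by_threshold([], threshold=0.15)
print(metadata)
print(empty_metadata)
```

The second call reproduces the counts seen above: an empty input yields 0 original and 0 filtered results regardless of the threshold.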


## Summary

This notebook has successfully:
1. ✅ Generated a unique index UUID for this notebook run
2. ✅ Connected to Pinecone and configured the unique index
3. ✅ Loaded product and policy documents from the data directory
4. ✅ Chunked documents into optimal sizes for retrieval
5. ✅ Generated normalized embeddings and stored them in the unique index
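6. ✅ Verified the ingestion with direct test searches against the vector store (note: the retrieve tool returned 0 results in the run shown above)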

The unique index `curator-pommeline-{index_uuid}` is now ready for use with the retrieval tools. The cleanup cell at the end of this notebook clears it.

## 🗑️ Cleanup

Clean up the unique index created in this notebook.

> **⚠️ WARNING**: This cell will permanently delete the unique index created in this notebook run.
>
> This ensures clean resource management and prevents leftover data in your Pinecone instance.

In [14]:
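# Clean up the unique index created in this notebook
import pathlib
import sys

import requests
from dotenv import load_dotenv

load_dotenv()
sys.path.append(str(pathlib.Path().absolute().parent / "src"))
from src.config import settings

# Fallback so the outer except handler can always reference index_name,
# even if get_stats() fails before the name is assigned
index_name = "unknown"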
try:
    # Get the final index statistics before cleanup
    final_stats = vector_store.get_stats()
    index_name = final_stats["index_name"]
    namespaces = final_stats.get("namespaces", {})

    print(f"🗑️  Cleaning up unique index '{index_name}'...")
    print(f"📊 Final statistics before cleanup:")
    print(f"   Total vectors: {final_stats.get('total_documents', 'unknown')}")
    print(f"   Namespaces: {namespaces}")

    # Sum vector counts across all namespaces
    # (note: only the namespace named after the index is cleared below)
    total_vectors_to_delete = sum(
        ns_data.get('vectorCount', 0) 
        for ns_data in namespaces.values()
    )

    print(f"📋 Cleanup Summary:")
    print(f"   Index: {index_name}")
    print(f"   Total vectors to delete: {total_vectors_to_delete}")

    for ns_name, ns_data in namespaces.items():
        vector_count = ns_data.get('vectorCount', 0)
        if vector_count > 0:
            print(f"   Namespace '{ns_name}': {vector_count} vectors")

    # For Pinecone index container, we need to clear the namespace
    # Using deleteAll API to clear the specific namespace
    try:
        print(f"🧹 Attempting to clear namespace '{index_name}'...")
        
        # Use the delete API with namespace and deleteAll flag
        delete_request = {
            "namespace": index_name,
            "deleteAll": True
        }
        
        response = requests.post(
            f"{settings.pinecone_host}/vectors/delete",
            json=delete_request,
            timeout=30
        )
        
        if response.status_code == 200:
            result = response.json()
            print(f"   ✅ Namespace clear request sent successfully")
            if result:
                print(f"   📝 Response: {result}")
        else:
            print(f"   ⚠️  Namespace clear request failed: {response.status_code}")
            print(f"   📝 Response: {response.text}")
            print(f"   📝 Note: Manual cleanup may be required")
            
    except Exception as e:
        print(f"   ⚠️  Namespace cleanup failed: {e}")
        print(f"   📝 Note: Index cleanup may require manual intervention")

    print(f"✅ Cleanup process completed")
    print(f"💡 In production with Pinecone cloud, you would use:")
    print(f"   pc.delete_index('{index_name}') to permanently delete the index")
    print(f"🎯 Unique index UUID '{index_uuid}' has been processed for cleanup")
    print(f"🧹 Resources have been freed from your local Pinecone instance")

except Exception as e:
    print(f"❌ Cleanup failed: {e}")
    print(f"📝 You may need to manually clean up index: {index_name}")


🗑️  Cleaning up unique index 'curator-pommeline'...
📊 Final statistics before cleanup:
   Total vectors: 0
   Namespaces: {'pommeline': {'vectorCount': 0}, 'curator-pommeline-7b1a7bbb': {'vectorCount': 0}, 'curator-pommeline': {'vectorCount': 106}, '': {'vectorCount': 109}, 'curator-pommeline-12fa085f': {'vectorCount': 0}}
📋 Cleanup Summary:
   Index: curator-pommeline
   Total vectors to delete: 215
   Namespace 'curator-pommeline': 106 vectors
   Namespace '': 109 vectors
🧹 Attempting to clear namespace 'curator-pommeline'...
   ✅ Namespace clear request sent successfully
✅ Cleanup process completed
💡 In production with Pinecone cloud, you would use:
   pc.delete_index('curator-pommeline') to permanently delete the index
🎯 Unique index UUID 'f316c5d1' has been processed for cleanup
🧹 Resources have been freed from your local Pinecone instance
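
The cell above clears only the namespace named after the index, while the statistics also list 109 vectors in the default namespace `''`. A minimal sketch, assuming the same `deleteAll` payload shape used above, that builds one delete request per non-empty namespace (`build_delete_requests` is a hypothetical helper; whether the default namespace should be cleared is a decision for the notebook's owner):

```python
from typing import Any, Dict, List


def build_delete_requests(stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one deleteAll payload for every namespace that holds vectors."""
    namespaces = stats.get("namespaces", {})
    return [
        {"namespace": name, "deleteAll": True}
        for name, info in namespaces.items()
        if info.get("vectorCount", 0) > 0
    ]


# Namespace statistics copied from the cleanup output above.
stats = {
    "index_name": "curator-pommeline",
    "namespaces": {
        "pommeline": {"vectorCount": 0},
        "curator-pommeline-7b1a7bbb": {"vectorCount": 0},
        "curator-pommeline": {"vectorCount": 106},
        "": {"vectorCount": 109},
        "curator-pommeline-12fa085f": {"vectorCount": 0},
    },
}

payloads = build_delete_requests(stats)
for payload in payloads:
    print(payload)
```

Each payload could then be sent with `requests.post(f"{settings.pinecone_host}/vectors/delete", json=payload, timeout=30)`, as in the cell above.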
