# Task 2: Rigid Verification Suite

## Checklist
1. Embedding Model Consistency (384 dims, all-MiniLM-L6-v2)
2. Chunking Quality Verification
3. Embedding Integrity (No NaNs, correct shape)
4. Vector Store Persistence
5. Semantic Retrieval Sanity Tests (5 queries)
6. Metadata Completeness
7. Filtering Readiness
8. Performance Reality Check

In [None]:
import chromadb
from sentence_transformers import SentenceTransformer
import numpy as np
import time
import pandas as pd

# Config
VECTOR_STORE_DIR = '../vector_store'
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'
EXPECTED_DIM = 384

## 1. Embedding Model Consistency
**Verify:** Embedding dimension = 384. Model = all-MiniLM-L6-v2.

In [None]:
print(f"Loading model: {EMBEDDING_MODEL_NAME}...")
model = SentenceTransformer(EMBEDDING_MODEL_NAME)

test_text = "This is a test sentence for dimension check."
embedding = model.encode(test_text)

print(f"Model Output Shape: {embedding.shape}")
if embedding.shape[0] == EXPECTED_DIM:
    print("✅ PASS: Embedding dimension is 384.")
else:
    print(f"❌ FAIL: Expected 384, got {embedding.shape[0]}")
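
A dimension match is necessary but not sufficient: any other 384-dimensional model would pass the check above. One stricter option is to re-encode a document that is already in the store and compare the fresh vector with the stored one; with the same model, the cosine similarity should be close to 1.0. The sketch below is dependency-free. The helper names `cosine_similarity` and `same_model_likely` and the 0.99 threshold are illustrative assumptions, not part of the original pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def same_model_likely(fresh_vec, stored_vec, threshold=0.99):
    """True when a re-encoded vector closely matches the stored one.

    The threshold is an assumed tolerance, not a value from the pipeline.
    """
    return cosine_similarity(fresh_vec, stored_vec) >= threshold
```

In this notebook, `fresh_vec` would be `model.encode(doc)` for a `doc` returned by `collection.get(...)`, and `stored_vec` the matching entry fetched with `include=['embeddings']`.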

## 2 & 6. Chunking Quality & Metadata Completeness
**Verify:** Chunks are readable (size ~500), Metadata contains all fields.

In [None]:
client = chromadb.PersistentClient(path=VECTOR_STORE_DIR)
collection = client.get_collection("complaints_rag")

# Fetch a small sample (get() with a limit returns the first stored items, not a random draw)
results = collection.get(limit=5, include=['documents', 'metadatas'])

required_metadata = ['complaint_id', 'product', 'issue', 'company', 'date_received', 'chunk_index', 'total_chunks']

print(f"INSPECTING {len(results['documents'])} CHUNKS:\n")

for i, (doc, meta) in enumerate(zip(results['documents'], results['metadatas'])):
    print(f"[Chunk {i+1}]")
    print(f"Length: {len(doc)} chars")
    print(f"Text (First 100 chars): {doc[:100]}...")
    
    # Check Metadata
    missing = [key for key in required_metadata if key not in meta]
    if not missing:
        print("✅ Metadata Complete")
    else:
        print(f"❌ Metadata Missing: {missing}")
    print("-" * 50)
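
The cell above prints each chunk's length but leaves the size judgment to the reader. A minimal sketch of an automated check follows. It assumes chunk size is measured in characters, that `chunk_index` is zero-based, and that 50 to 600 characters is an acceptable band; `check_chunk` and all three assumptions are hypothetical and should be adjusted to the actual chunking settings.

```python
def check_chunk(doc, meta, min_chars=50, max_chars=600):
    """Return a list of problems found for one chunk (empty list means OK)."""
    problems = []

    # Text checks: non-empty and within the assumed character band.
    if not doc or not doc.strip():
        problems.append("empty text")
    elif not (min_chars <= len(doc) <= max_chars):
        problems.append(f"length {len(doc)} outside [{min_chars}, {max_chars}]")

    # Position checks: chunk_index must fall inside total_chunks.
    idx = meta.get("chunk_index")
    total = meta.get("total_chunks")
    if idx is None or total is None:
        problems.append("chunk_index/total_chunks missing")
    elif not (0 <= idx < total):
        # Assumes zero-based chunk_index.
        problems.append(f"chunk_index {idx} out of range for total_chunks {total}")

    return problems
```

Calling `check_chunk(doc, meta)` inside the inspection loop would surface outliers directly. The final chunk of a complaint may legitimately be shorter than `min_chars`, so a short-length report is a prompt to look, not an automatic failure.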

## 3. Embedding Integrity
**Verify:** No NaNs and correct dimension, checked on a sample of 100 stored embeddings.

In [None]:
# Fetching every embedding at once could use a lot of memory on a large collection,
# so validate a sample batch of 100 instead.
vec_results = collection.get(limit=100, include=['embeddings'])
embeddings_sample = np.array(vec_results['embeddings'])

if np.isnan(embeddings_sample).any():
    print("❌ FAIL: NaNs detected in embeddings.")
else:
    print("✅ PASS: No NaNs in sample batch.")

if embeddings_sample.shape[1] == EXPECTED_DIM:
    print("✅ PASS: Stored embeddings have correct dimension (384).")
else:
    print(f"❌ FAIL: Stored dimension {embeddings_sample.shape[1]}.")
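
NaNs are not the only way an embedding can be corrupt: infinite values and all-zero vectors also break similarity search. The sketch below counts each kind of problem per vector. It uses only the standard library so it can be tried without the vector store; `check_embeddings` is a hypothetical helper, not part of the original notebook.

```python
import math


def check_embeddings(vectors, expected_dim=384):
    """Summarize integrity problems in a batch of embedding vectors."""
    report = {"count": 0, "wrong_dim": 0, "non_finite": 0, "zero_norm": 0}
    for vec in vectors:
        report["count"] += 1
        if len(vec) != expected_dim:
            report["wrong_dim"] += 1
        if any(not math.isfinite(x) for x in vec):
            # Covers NaN as well as +inf and -inf.
            report["non_finite"] += 1
        elif all(x == 0 for x in vec):
            # An all-zero vector has no direction, so cosine similarity is undefined.
            report["zero_norm"] += 1
    return report
```

Passing `vec_results['embeddings']` from the cell above would give a per-problem count for the sampled batch; a clean batch has zeros in every field except `count`.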

## 5. Semantic Retrieval Sanity Tests (5 Queries)
**Verify:** The top result for each query is relevant and matches intuition.

In [None]:
queries = [
    "Credit card billing disputes",
    "Unauthorized transactions",
    "Delayed money transfers",
    "High interest on personal loans",
    "Account closure issues"
]

for q in queries:
    print(f"\n🔍 Query: '{q}'")
    start_time = time.time()
    q_vec = model.encode([q]).tolist()
    results = collection.query(query_embeddings=q_vec, n_results=1)
    latency = (time.time() - start_time) * 1000
    
    doc = results['documents'][0][0]
    meta = results['metadatas'][0][0]
    
    print(f"   ⏱️ Latency: {latency:.2f} ms")
    print(f"   Product: {meta.get('product')}")
    print(f"   Issue: {meta.get('issue')}")
    print(f"   Snippet: {doc[:150]}...")
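
The loop above prints one latency per query, which also provides raw data for the performance item in the checklist. A minimal sketch for summarizing those numbers follows. `summarize_latencies` is a hypothetical helper, and the 200 ms budget is an assumed placeholder, not a requirement stated anywhere in this notebook.

```python
import statistics


def summarize_latencies(latencies_ms, budget_ms=200.0):
    """Return median, max, and a pass flag for a list of latencies in ms."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    median = statistics.median(latencies_ms)
    worst = max(latencies_ms)
    return {
        "median_ms": median,
        "max_ms": worst,
        # The budget is compared against the median so that a single slow
        # query (for example the first one) does not decide the result alone.
        "within_budget": median <= budget_ms,
    }
```

To use it, the loop would append each `latency` value to a list and pass that list in afterwards. Note that the measured time includes both query encoding and the vector search, and the first query may be slower than the rest.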

## 7. Filtering Readiness
**Verify:** Filtering by product works.

In [None]:
print("Testing Filter: Product = 'Credit card'...")
filter_query = "fees"
q_vec_f = model.encode([filter_query]).tolist()

results_f = collection.query(
    query_embeddings=q_vec_f,
    n_results=5,
    where={"product": "Credit card"}
)

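returned = results_f['metadatas'][0]
# all() is True for an empty list, so require at least one result before passing
all_credit_cards = bool(returned) and all(meta['product'] == 'Credit card' for meta in returned)
if all_credit_cards:
    print("✅ PASS: All results match filter 'Credit card'.")
elif not returned:
    print("❌ FAIL: Filter returned no results.")
else:
    print("❌ FAIL: Filter leaked other products.")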
    
print("Sample Products Returned:", [m['product'] for m in results_f['metadatas'][0]])