# Vector Database for Croatian RAG System

## Learning Objectives

This notebook explains the vector database components of our Croatian RAG system:

1. **What are vector databases and why do we need them?**
2. **Text embeddings and multilingual models**
3. **ChromaDB storage and operations**
4. **Similarity search methods**
5. **Croatian language considerations**

## 1. Introduction to Vector Databases

### What is a Vector Database?

A vector database is a specialized database designed to store, index, and query high-dimensional vectors (embeddings). Unlike traditional databases that store structured data, vector databases work with:

- **Embeddings**: Dense numerical representations of text, images, or other data
- **Similarity Search**: Finding similar items based on vector distance
- **High Performance**: Indexes optimized for nearest-neighbor search, typically approximate (ANN) rather than exhaustive

### Why Do We Need Vector Databases for RAG?

In a RAG system, we need to:
1. **Store document chunks** as searchable vectors
2. **Find relevant information** for user queries
3. **Handle semantic similarity** (not just keyword matching)
4. **Scale to thousands of documents**

Traditional databases are built for exact and range matches on structured fields, not for nearest-neighbor search over high-dimensional vectors. That is the gap vector databases fill.
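
To make the nearest-neighbor idea concrete, here is a minimal brute-force sketch in plain Python. It is not how ChromaDB works internally (vector databases use index structures to avoid scanning every vector). The helper names `normalize` and `top_k_similar` and the tiny 3-dimensional vectors are hypothetical, chosen only for illustration.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k_similar(query, stored, k=2):
    """Return the k (score, doc_id) pairs most similar to the query.

    Brute force: every stored vector is scored, so each query costs O(n).
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), doc_id)
        for doc_id, vec in stored.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy "embeddings" (real ones have hundreds of dimensions).
stored_vectors = {
    "zagreb": [0.9, 0.1, 0.0],
    "dubrovnik": [0.7, 0.3, 0.1],
    "cevapi": [0.0, 0.2, 0.9],
}

for score, doc_id in top_k_similar([1.0, 0.2, 0.0], stored_vectors):
    print(f"{doc_id}: {score:.3f}")
```

Scanning every vector is fine for a handful of documents. The per-query cost grows linearly with the collection, which is why a dedicated index becomes worthwhile at scale.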

In [None]:
# Let's start by importing our vector database components
import sys
sys.path.append('../src')

from vectordb.embeddings import CroatianEmbeddingModel, EmbeddingConfig, get_recommended_model
from vectordb.storage import ChromaDBStorage, StorageConfig
from vectordb.search import SemanticSearchEngine, SearchConfig, SearchMethod

import numpy as np
import logging

# Set up logging to see what's happening
logging.basicConfig(level=logging.INFO)
print("✅ Vector database components imported successfully!")

## 2. Text Embeddings and Multilingual Models

### What are Text Embeddings?

Text embeddings are dense vector representations of text that capture semantic meaning. For example:

- **"Zagreb je glavni grad"** → [0.1, 0.8, 0.3, ...] (384 dimensions)
- **"Glavni grad Hrvatske"** → [0.12, 0.82, 0.31, ...] (similar vector)
- **"Banana je žuta"** → [0.9, 0.1, 0.7, ...] (very different vector)
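
The "similar" and "very different" labels above can be checked with cosine similarity. The sketch below treats the three truncated example vectors as if they were complete 3-dimensional embeddings, which is an assumption made purely for illustration, so the numbers say nothing about the real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


zagreb = [0.1, 0.8, 0.3]        # "Zagreb je glavni grad"
capital = [0.12, 0.82, 0.31]    # "Glavni grad Hrvatske"
banana = [0.9, 0.1, 0.7]        # "Banana je žuta"

sim_related = cosine_similarity(zagreb, capital)
sim_unrelated = cosine_similarity(zagreb, banana)

print(f"Zagreb vs. capital: {sim_related:.3f}")
print(f"Zagreb vs. banana:  {sim_unrelated:.3f}")
```

The first pair scores very close to 1.0 and the second scores far lower, which is the pattern semantic search relies on.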

### Why Multilingual Models?

The Croatian language has characteristics that matter for embedding models:
- **Diacritics**: č, ć, š, ž, đ
- **Morphology**: Complex word forms
- **Limited training data**: Compared to English

Multilingual models such as `paraphrase-multilingual-MiniLM-L12-v2` are trained on many languages, including Croatian.

In [None]:
# Let's explore different multilingual models available
from vectordb.embeddings import get_recommended_model  # the helper this cell actually uses

print("🌍 Available multilingual models for Croatian:")
print()

for use_case in ["general", "fast", "accurate", "cross-lingual"]:
    model_name = get_recommended_model(use_case)
    print(f"📚 {use_case.capitalize():<12}: {model_name}")

print()
print("💡 Recommendations:")
print("   • General learning: Use 'general' (good balance of speed/accuracy)")
print("   • Production: Use 'accurate' for best Croatian understanding")
print("   • Resource limited: Use 'fast' for quick experiments")

In [None]:
# Let's create an embedding model and explore Croatian text encoding
config = EmbeddingConfig(
    model_name="paraphrase-multilingual-MiniLM-L12-v2",
    device="cpu",  # Use CPU for compatibility
    batch_size=4
)

# Create the embedding model
embedding_model = CroatianEmbeddingModel(config)

print("🧠 Created Croatian embedding model with configuration:")
print(f"   • Model: {config.model_name}")
print(f"   • Device: {config.device}")
print(f"   • Max sequence length: {config.max_seq_length}")
print(f"   • Batch size: {config.batch_size}")

# Note: Model loading will happen when we first encode text

In [None]:
# Croatian test texts with various characteristics
croatian_texts = [
    "Zagreb je glavni grad Republike Hrvatske.",  # Standard Croatian
    "Dubrovnik je poznat kao 'biser Jadrana'.",   # Cultural reference
    "Plitvička jezera su najljepši nacionalni park.",  # Nature/tourism
    "Ćevapi su popularna balkanska hrana.",       # Food culture
    "Hajduk Split je poznati nogometni klub.",    # Sports
    "Šuma Brijuni nalazi se u Istri.",           # Geography with diacritics
]

print("🇭🇷 Croatian test texts:")
for i, text in enumerate(croatian_texts, 1):
    print(f"   {i}. {text}")

# Let's encode these texts (this will download the model on first run)
print("\n⏳ Encoding Croatian texts... (this may take a moment on first run)")

try:
    embeddings = embedding_model.encode_text(croatian_texts)
    
    print(f"\n✅ Successfully encoded {len(croatian_texts)} Croatian texts!")
    print(f"   • Embedding shape: {embeddings.shape}")
    print(f"   • Each text becomes a {embeddings.shape[1]}-dimensional vector")
    
    # Show first few dimensions of first text as example
    print(f"\n🔢 First text embedding (first 10 dimensions):")
    print(f"   '{croatian_texts[0][:30]}...'")
    print(f"   → {embeddings[0][:10]}")
    
except Exception as e:
    print(f"❌ Error encoding texts: {e}")
    print("   This might be due to missing model files or network issues")

In [None]:
# Let's demonstrate semantic similarity between Croatian texts
if 'embeddings' in locals():
    print("🔍 Semantic Similarity Analysis:")
    print()
    
    # Compare similarity between different texts
    zagreb_embedding = embeddings[0]  # "Zagreb je glavni grad..."
    
    print("📊 Similarity to 'Zagreb je glavni grad Republike Hrvatske.':")
    print()
    
    for i, text in enumerate(croatian_texts[1:], 1):
        similarity = embedding_model.compute_similarity(
            zagreb_embedding, 
            embeddings[i], 
            metric="cosine"
        )
        
        # Create visual bar (clamped, since cosine similarity can be negative)
        bar_length = max(0, min(20, int(similarity * 20)))
        bar = "█" * bar_length + "░" * (20 - bar_length)
        
        print(f"   {similarity:.3f} {bar} {text[:40]}...")
    
    print()
    print("💡 Observations:")
    print("   • Higher scores = more semantically similar")
    print("   • Geographic/city texts should score higher")
    print("   • Different topics (food, sports) should score lower")
else:
    print("⚠️  Skipping similarity analysis - embeddings not available")

## 3. ChromaDB Storage and Operations

### What is ChromaDB?

ChromaDB is a popular open-source vector database that provides:

- **Local Storage**: No cloud dependencies
- **Persistent Data**: Data survives restarts
- **Collections**: Organize documents into groups
- **Metadata Filtering**: Search with additional conditions
- **Automatic Indexing**: Builds an HNSW index for fast approximate similarity search

### Key Concepts:

- **Collection**: A group of documents (like a database table)
- **Document**: Text content + metadata + embedding
- **Metadata**: Additional information (source, language, date, etc.)
- **Distance Metrics**: How similarity is measured (cosine, squared Euclidean/L2, inner product)
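
The relationship between these concepts can be sketched without ChromaDB at all. The toy in-memory "collection" below is an illustration only: the `Document` class and the `matches` and `get_documents` helpers are hypothetical names, and the filter supports only exact equality. ChromaDB's own `where` filters support additional operators.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One stored item: text content, metadata, and its embedding."""
    doc_id: str
    text: str
    metadata: dict
    embedding: list = field(default_factory=list)


def matches(metadata, where):
    """True if every key/value pair in `where` is present in the metadata."""
    return all(metadata.get(key) == value for key, value in where.items())


def get_documents(collection, where):
    """Return the documents whose metadata satisfies the `where` filter."""
    return [doc for doc in collection if matches(doc.metadata, where)]


# A "collection" here is just a list; the embeddings are hypothetical toy values.
collection = [
    Document("doc-1", "Zagreb je glavni grad Republike Hrvatske.",
             {"topic": "geography", "language": "hr"}, [0.9, 0.1]),
    Document("doc-2", "Ćevapi su popularna balkanska hrana.",
             {"topic": "food", "language": "hr"}, [0.1, 0.9]),
]

geography = get_documents(collection, {"topic": "geography"})
print([doc.doc_id for doc in geography])
```

Filtering on metadata first and ranking by vector distance second is what lets a query such as "only geography documents" stay cheap even in a large collection.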

In [None]:
# Let's set up ChromaDB storage for our Croatian documents
import tempfile
import os

# Create a temporary directory for this demo
temp_db_path = tempfile.mkdtemp(prefix="croatian_rag_demo_")

storage_config = StorageConfig(
    db_path=temp_db_path,
    collection_name="croatian_learning_demo",
    distance_metric="cosine",  # Good for text similarity
    persist=True,              # Save data to disk
    allow_reset=True           # Allow clearing data for demos
)

print("💾 Setting up ChromaDB storage:")
print(f"   • Database path: {temp_db_path}")
print(f"   • Collection: {storage_config.collection_name}")
print(f"   • Distance metric: {storage_config.distance_metric}")

# Create storage client
try:
    storage = ChromaDBStorage(storage_config)
    collection = storage.create_collection()
    
    print("\n✅ ChromaDB storage initialized successfully!")
    print(f"   Collection '{collection.name}' ready for documents")
    
except Exception as e:
    print(f"❌ Error initializing storage: {e}")

In [None]:
# Let's add our Croatian documents to the vector database
if 'storage' in locals() and 'embeddings' in locals():
    
    # Prepare document metadata
    document_metadata = [
        {"source": "zagreb_info.txt", "topic": "geography", "city": "Zagreb", "language": "hr"},
        {"source": "dubrovnik_guide.txt", "topic": "tourism", "city": "Dubrovnik", "language": "hr"},
        {"source": "parks_croatia.txt", "topic": "nature", "region": "Lika", "language": "hr"},
        {"source": "balkan_food.txt", "topic": "food", "cuisine": "balkan", "language": "hr"},
        {"source": "sports_croatia.txt", "topic": "sports", "team": "Hajduk", "language": "hr"},
        {"source": "istria_nature.txt", "topic": "geography", "region": "Istria", "language": "hr"},
    ]
    
    print("📤 Adding Croatian documents to vector database...")
    
    try:
        # Convert numpy embeddings to lists for ChromaDB
        embeddings_list = embeddings.tolist()
        
        # Add documents with embeddings and metadata
        document_ids = storage.add_documents(
            documents=croatian_texts,
            metadatas=document_metadata,
            embeddings=embeddings_list
        )
        
        print(f"\n✅ Successfully added {len(document_ids)} documents!")
        print("   Document IDs:", document_ids[:3], "...")
        
        # Get collection info
        info = storage.get_collection_info()
        print(f"\n📊 Collection Statistics:")
        print(f"   • Total documents: {info['count']}")
        print(f"   • Distance metric: {info['distance_metric']}")
        
    except Exception as e:
        print(f"❌ Error adding documents: {e}")
        
else:
    print("⚠️  Skipping document storage - prerequisites not available")

In [None]:
# Let's explore metadata filtering capabilities
if 'storage' in locals():
    print("🔍 Exploring Metadata Filtering:")
    print()
    
    # Get all documents about geography
    try:
        geography_docs = storage.get_documents(
            where={"topic": "geography"},
            include=["documents", "metadatas"]
        )
        
        print(f"📍 Geography documents ({len(geography_docs['ids'])} found):")
        for i, (doc, meta) in enumerate(zip(geography_docs['documents'], geography_docs['metadatas'])):
            city_or_region = meta.get('city', meta.get('region', 'Unknown'))
            print(f"   {i+1}. {city_or_region}: {doc[:50]}...")
        
        print()
        
        # Get documents from specific cities
        zagreb_docs = storage.get_documents(
            where={"city": "Zagreb"},
            include=["documents", "metadatas"]
        )
        
        print(f"🏛️ Zagreb documents ({len(zagreb_docs['ids'])} found):")
        for doc in zagreb_docs['documents']:
            print(f"   • {doc}")
            
    except Exception as e:
        print(f"❌ Error filtering documents: {e}")
        
else:
    print("⚠️  Skipping metadata filtering - storage not available")

## 4. Similarity Search Methods

Our search engine supports three different search methods:

### 🧠 Semantic Search
- Uses embeddings to find semantically similar content
- Great for: Concepts, synonyms, related topics
- Example: "glavni grad" matches "Zagreb je glavni grad"

### 🔤 Keyword Search  
- Uses traditional text matching
- Great for: Specific terms, names, exact phrases
- Example: "Dubrovnik" matches documents containing "Dubrovnik"

### 🎯 Hybrid Search
- Combines semantic and keyword approaches
- Great for: Best of both worlds
- Balances meaning and exact matches
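
One common way to combine the two signals is a weighted sum of a semantic score and a keyword-overlap score. The sketch below assumes that approach. The notebook's `SemanticSearchEngine` may blend scores differently, and the `alpha` weight, the helper names, and the semantic scores are hypothetical placeholders rather than model output.

```python
def keyword_score(query, document):
    """Fraction of distinct query terms that appear verbatim in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().replace(".", " ").split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def hybrid_score(semantic, keyword, alpha=0.7):
    """Weighted blend: alpha weights the semantic score, (1 - alpha) the keyword score."""
    return alpha * semantic + (1 - alpha) * keyword


query = "Zagreb glavni grad"
# (text, semantic score) pairs; the scores are made up for illustration.
candidates = [
    ("Zagreb je glavni grad Republike Hrvatske.", 0.82),
    ("Dubrovnik je poznat kao 'biser Jadrana'.", 0.41),
]

ranked = sorted(
    ((hybrid_score(sem, keyword_score(query, text)), text) for text, sem in candidates),
    reverse=True,
)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

Raising `alpha` favors meaning over exact wording. Lowering it favors literal term matches, which tends to help with names such as "Dubrovnik".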

In [None]:
# Let's set up our search engine and test different search methods
if 'embedding_model' in locals() and 'storage' in locals():
    
    # Create search engine
    search_config = SearchConfig(
        method=SearchMethod.SEMANTIC,
        top_k=3,
        similarity_threshold=0.1,  # Low threshold to see more results
        rerank=True,               # Enable result reranking
        include_metadata=True      # Include document metadata
    )
    
    search_engine = SemanticSearchEngine(embedding_model, storage, search_config)
    
    print("🔍 Search engine ready with configuration:")
    print(f"   • Method: {search_config.method.value}")
    print(f"   • Top results: {search_config.top_k}")
    print(f"   • Similarity threshold: {search_config.similarity_threshold}")
    print(f"   • Reranking: {search_config.rerank}")
    
else:
    print("⚠️  Cannot create search engine - missing prerequisites")

In [None]:
# Test semantic search with Croatian queries
if 'search_engine' in locals():
    
    croatian_queries = [
        "Koji je glavni grad Hrvatske?",     # Question about capital city
        "Turistička mjesta u Dalmaciji",    # Tourism query
        "Nacionalni parkovi Hrvatske",      # Nature/parks query
        "Hrvatska nogometna momčad",        # Sports query
    ]
    
    print("🧠 SEMANTIC SEARCH RESULTS:")
    print("=" * 60)
    
    for query in croatian_queries:
        print(f"\n🔍 Query: '{query}'")
        print("-" * 40)
        
        try:
            response = search_engine.search(query, method=SearchMethod.SEMANTIC)
            
            print(f"⏱️  Search time: {response.total_time:.3f}s")
            print(f"📊 Results found: {response.total_results}")
            print()
            
            for result in response.results:
                # Create relevance bar visualization (clamped to 0-20 blocks)
                bar_length = max(0, min(20, int(result.score * 20)))
                relevance_bar = "█" * bar_length + "░" * (20 - bar_length)
                
                print(f"   #{result.rank} [{result.score:.3f}] {relevance_bar}")
                print(f"       📄 {result.content}")
                
                # Show relevant metadata
                topic = result.metadata.get('topic', 'unknown')
                location = result.metadata.get('city', result.metadata.get('region', 'unknown'))
                print(f"       🏷️  Topic: {topic}, Location: {location}")
                print()
            
            if not response.results:
                print("   ❌ No results found above similarity threshold")
                
        except Exception as e:
            print(f"   ❌ Search error: {e}")
            
else:
    print("⚠️  Skipping semantic search - search engine not available")

In [None]:
# Compare different search methods on the same query
if 'search_engine' in locals():
    
    test_query = "Zagreb glavni grad"
    
    print(f"🔬 SEARCH METHOD COMPARISON:")
    print(f"Query: '{test_query}'")
    print("=" * 60)
    
    methods = [
        (SearchMethod.SEMANTIC, "🧠 Semantic", "Understands meaning and context"),
        (SearchMethod.KEYWORD, "🔤 Keyword", "Matches exact terms"),
        (SearchMethod.HYBRID, "🎯 Hybrid", "Combines semantic + keyword")
    ]
    
    for method, icon_name, description in methods:
        print(f"\n{icon_name} Search - {description}")
        print("-" * 50)
        
        try:
            response = search_engine.search(test_query, method=method, top_k=2)
            
            print(f"⏱️  Time: {response.total_time:.3f}s | Results: {response.total_results}")
            
            for result in response.results:
                print(f"   📊 Score: {result.score:.3f} | {result.content[:60]}...")
                
        except Exception as e:
            print(f"   ❌ Error: {e}")
    
    print("\n💡 Observations:")
    print("   • Semantic: Should find Zagreb content even if query doesn't match exactly")
    print("   • Keyword: Should strongly match 'Zagreb' and 'glavni' terms")
    print("   • Hybrid: Should balance both approaches for best results")
            
else:
    print("⚠️  Skipping method comparison - search engine not available")

In [None]:
# Test metadata filtering in search
if 'search_engine' in locals():
    
    print("🏷️  METADATA FILTERING IN SEARCH:")
    print("=" * 60)
    
    # Search only in geography-related documents
    print("\n🌍 Searching only geography documents:")
    print("-" * 40)
    
    try:
        geo_response = search_engine.search(
            query="beautiful places",
            filters={"topic": "geography"},
            top_k=3
        )
        
        for result in geo_response.results:
            location = result.metadata.get('city', result.metadata.get('region', 'Unknown'))
            print(f"   📍 {location}: {result.content}")
            print(f"      Score: {result.score:.3f}")
            
    except Exception as e:
        print(f"   ❌ Error: {e}")
    
    # Search only in tourism-related documents
    print("\n🏖️ Searching only tourism documents:")
    print("-" * 40)
    
    try:
        tourism_response = search_engine.search(
            query="beautiful coast",
            filters={"topic": "tourism"},
            top_k=3
        )
        
        for result in tourism_response.results:
            print(f"   🏛️ {result.content}")
            print(f"      Score: {result.score:.3f}")
            
    except Exception as e:
        print(f"   ❌ Error: {e}")
    
    print("\n💡 Metadata filtering allows you to:")
    print("   • Search within specific document types")
    print("   • Filter by language, date, source, etc.")
    print("   • Create domain-specific search experiences")
            
else:
    print("⚠️  Skipping metadata filtering - search engine not available")

## 5. Croatian Language Considerations

### Challenges with Croatian Text Processing:

1. **Diacritics (dijakritički znakovi)**:
   - č, ć, š, ž, đ must be preserved
   - Different from similar letters (c ≠ č ≠ ć)

2. **Morphology (morfologija)**:
   - Complex word forms: "grad" vs "grada" vs "gradu"
   - Rich case system (7 cases)

3. **Cultural Context**:
   - References like "biser Jadrana" (pearl of the Adriatic)
   - Dialect groups (Kajkavian, Čakavian, Štokavian)

### How Our System Handles Croatian:

✅ **Multilingual embeddings** understand Croatian semantics  
✅ **UTF-8 encoding** preserves all diacritics  
✅ **Semantic search** tolerates many word-form variations (e.g. "grad", "grada", "gradu")  
✅ **Metadata filtering** allows language-specific searches  
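
One subtlety behind "preserving diacritics" is that Unicode allows the same letter to be encoded in more than one way. The sketch below uses Python's standard `unicodedata` module to show the issue and a possible remedy. Whether the notebook's own pipeline normalizes text this way is not stated, so treat the `nfc` and `strip_diacritics` helpers as hypothetical.

```python
import unicodedata

composed = "Plitvi\u010dka"                           # "č" as one code point (U+010D)
decomposed = unicodedata.normalize("NFD", composed)   # "c" + combining caron (U+030C)

print(composed == decomposed)          # visually identical strings compare unequal
print(len(composed), len(decomposed))  # the decomposed form is one code point longer


def nfc(text):
    """Normalize to composed form so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


print(nfc(composed) == nfc(decomposed))


def strip_diacritics(text):
    """Drop combining marks. Lossy: 'č' and 'ć' both collapse to 'c'."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


print(strip_diacritics("čćšž"))
```

Normalizing documents and queries to the same form (NFC here) before embedding or keyword matching avoids silent mismatches. Stripping diacritics, by contrast, discards exactly the distinctions listed above, and it does not affect "đ", which has no decomposed form.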

In [None]:
# Let's test Croatian-specific challenges
if 'search_engine' in locals():
    
    print("🇭🇷 CROATIAN LANGUAGE TESTING:")
    print("=" * 60)
    
    # Test diacritic handling
    print("\n✏️ Testing diacritic preservation:")
    print("-" * 40)
    
    diacritic_tests = [
        ("Zagreb", "Should match Zagreb content"),
        ("Plitvička", "Should match Plitvička jezera with č preserved"),
        ("Ćevapi", "Should match ćevapi with ć preserved")
    ]
    
    for query, explanation in diacritic_tests:
        print(f"\n🔍 Query: '{query}' - {explanation}")
        
        try:
            response = search_engine.search(query, top_k=1)
            if response.results:
                result = response.results[0]
                print(f"   ✅ Match: {result.content}")
                print(f"   📊 Score: {result.score:.3f}")
            else:
                print(f"   ❌ No matches found")
                
        except Exception as e:
            print(f"   ❌ Error: {e}")
    
    # Test morphological variations (different word forms)
    print("\n📝 Testing morphological understanding:")
    print("-" * 40)
    
    morphology_tests = [
        ("grad", "Basic form"),
        ("grada", "Genitive case"), 
        ("gradu", "Dative/Locative case"),
        ("gradovi", "Plural form")
    ]
    
    print("All forms should semantically match Zagreb content:")
    
    for query, case_type in morphology_tests:
        try:
            response = search_engine.search(query, top_k=1)
            if response.results:
                score = response.results[0].score
                filled = max(0, min(10, int(score * 10)))  # clamp: scores can fall outside 0-1
                bar = "█" * filled + "░" * (10 - filled)
                print(f"   {query:<8} ({case_type:<15}): {score:.3f} {bar}")
            else:
                print(f"   {query:<8} ({case_type:<15}): No match")
                
        except Exception as e:
            print(f"   {query:<8}: Error - {e}")
    
    print("\n💡 Semantic embeddings help handle word variations!")
            
else:
    print("⚠️  Skipping Croatian language tests - search engine not available")

In [None]:
# Test result formatting and context extraction for RAG
if 'search_engine' in locals():
    
    from vectordb.search import SearchResultFormatter
    
    print("📋 RESULT FORMATTING FOR RAG:")
    print("=" * 60)
    
    # Search for context about Croatian cities
    query = "Tell me about Croatian cities and their characteristics"
    
    try:
        response = search_engine.search(query, top_k=4)
        
        print(f"\n🔍 Query: '{query}'")
        print("-" * 50)
        
        # Show formatted display
        formatted_display = SearchResultFormatter.format_for_display(response)
        print("📊 Formatted Display Output:")
        print(formatted_display)
        
        print("\n" + "=" * 60)
        
        # Extract context chunks for RAG generation
        context_chunks = SearchResultFormatter.extract_context_chunks(response, max_length=500)
        
        print("📝 Context Chunks for RAG Generation:")
        print(f"Total chunks: {len(context_chunks)}")
        print(f"Total length: {sum(len(chunk) for chunk in context_chunks)} characters")
        print()
        
        for i, chunk in enumerate(context_chunks, 1):
            print(f"Chunk {i}: {chunk}")
            print()
        
        print("💡 These chunks would be sent to the LLM for answer generation!")
        
    except Exception as e:
        print(f"❌ Error: {e}")
        
else:
    print("⚠️  Skipping result formatting - search engine not available")

## 6. Summary and Key Takeaways

### What We've Learned:

🎯 **Vector Database Fundamentals**:
- Vector databases store embeddings for semantic search
- ChromaDB provides local, persistent storage
- Metadata filtering enables sophisticated queries

🧠 **Text Embeddings**:
- Multilingual models handle Croatian text well
- Semantic similarity captures meaning beyond keywords
- Embeddings convert text to searchable vectors

🔍 **Search Methods**:
- Semantic: Best for concepts and related topics
- Keyword: Best for exact terms and names
- Hybrid: Combines both for optimal results

🇭🇷 **Croatian Language Support**:
- Diacritics are preserved and handled correctly
- Morphological variations are understood semantically
- Cultural context is captured in embeddings

### Next Steps in RAG Pipeline:

1. ✅ **Document Processing** - Done
2. ✅ **Vector Database** - Just completed!
3. ⏳ **Retrieval System** - Next up
4. ⏳ **Generation** - Final step
5. ⏳ **Complete Pipeline** - Integration

The vector database is the core of our RAG system - it enables fast, semantic search through Croatian documents to find relevant context for answer generation.

In [None]:
# Clean up temporary database
import os  # re-imported so this cell also runs if the storage cell was skipped
import shutil

try:
    if 'temp_db_path' in locals() and os.path.exists(temp_db_path):
        shutil.rmtree(temp_db_path)
        print(f"🧹 Cleaned up temporary database: {temp_db_path}")
except Exception as e:
    print(f"⚠️  Warning: Could not clean up temp directory: {e}")

print("\n🎉 Vector Database Learning Complete!")
print("Ready to move on to the Retrieval System components.")