# 📖 Chapter 04 - Vector Database Setup

## 🎯 Objectives

In this chapter, we set up ChromaDB as a local vector database and ingest the processed documents together with their embeddings.

We create a persistent collection, generate embeddings for all 5,086 documents, and verify that the database works correctly.

## 📦 Step 01 - Install and Initialize ChromaDB
Install ChromaDB and create a persistent client.

In [24]:
import chromadb
import json
import time
import google.generativeai as genai

from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
from src.config import CHROMA_DB_DIR, PROCESSED_DATA_DIR, EMBEDDING_MODEL
from src.utils.emoji_log import task, success, data, info, error

In [25]:
task("Initializing ChromaDB...")

client = chromadb.PersistentClient(path=CHROMA_DB_DIR)

success("Chroma_db initialized successfully")
data(f"Database path: {CHROMA_DB_DIR}")

🚀 Initializing ChromaDB...
✅ Chroma_db initialized successfully
📊 Database path: c:\Users\dinni\OneDrive\桌面\Travel_rag\chroma_db


In [26]:
collections = client.list_collections()
info(f"Existing collections: {len(collections)}")

if collections:
    for col in collections:
        print(f"  - {col.name}")
else:
    info("No collections yet")

💬 Existing collections: 0
💬 No collections yet


## 🗂️ Step 02 - Create Collection
Create a collection to store our Taiwan attraction documents.

In [27]:
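# Core elements of a collection record:
# - ids: unique identifier for each record, similar to a primary key
# - documents: optional, the original text content
# - embeddings: the vectors used for similarity search (supplied explicitly in this chapter)
# - metadatas: optional, structured fields that can be used for filtering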

task("Creating ChromaDB collection...")

collection = client.create_collection(
    name="taiwan_attractions",
    metadata={
        "description": "Taiwan tourist attractions with embeddings",
        # Must match the model used in Step 03; all-MiniLM-L6-v2 outputs 384-dimensional vectors
        "embedding_model": "all-MiniLM-L6-v2",
        "embedding_dimension": 384
    }
)

success("Collection created successfully!")
data(f"Collection name: {collection.name}")
data(f"Document count: {collection.count()}")

🚀 Creating ChromaDB collection...
✅ Collection created successfully!
📊 Collection name: taiwan_attractions
📊 Document count: 0
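
`create_collection` raises an error when a collection with the same name already exists, so rerunning the cell above fails on a second pass. A minimal sketch of a rerun-safe alternative, assuming Chroma's `get_or_create_collection`, `list_collections`, and `delete_collection` client methods; the helper name `open_collection` and its `reset` flag are illustrative and not part of the original notebook:

```python
def open_collection(client, name, metadata=None, reset=False):
    """Return the named collection, creating it if it does not exist yet.

    `client` is expected to be a chromadb client, for example the
    PersistentClient created in Step 01. This helper is an illustrative
    assumption, not part of the original notebook.
    """
    if reset:
        # Some Chroma releases list Collection objects, others list plain
        # names, so accept both when checking for an existing collection.
        existing = {getattr(c, "name", c) for c in client.list_collections()}
        if name in existing:
            # Drop the previous copy so ingestion starts from an empty collection.
            client.delete_collection(name=name)
    return client.get_or_create_collection(name=name, metadata=metadata)
```

With the objects from this chapter, the call would look like `open_collection(client, "taiwan_attractions", reset=True)`. Passing `reset=True` discards previously stored documents, so it suits a full re-ingestion rather than an incremental update.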


In [28]:
all_collections = client.list_collections()
info(f"Total collections in database: {len(all_collections)}")
for col in all_collections:
    print(f"  - {col.name} ({col.count()} documents)")

💬 Total collections in database: 1
  - taiwan_attractions (0 documents)


## 🧮 Step 03 - Generate and Store Embeddings
Load processed documents, generate embeddings, and store them in ChromaDB.

In [29]:
task("Loading processed documents...")

with open(PROCESSED_DATA_DIR / "documents.json", "r", encoding="utf-8") as f:
    documents = json.load(f)

success(f"Loaded {len(documents)} documents")

🚀 Loading processed documents...
✅ Loaded 5086 documents
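
The ingestion loop in the next cell reads `id`, `content`, and `metadata` from every record and drops metadata entries whose value is `None`. A small sketch of a pre-flight check along the same lines, assuming Chroma expects metadata values to be strings, numbers, or booleans; the function name `prepare_record`, the sample record, and the rule of joining lists into a comma-separated string are assumptions for illustration, not the notebook's own code:

```python
def prepare_record(doc):
    """Validate one record and return (id, content, clean_metadata)."""
    missing = [key for key in ("id", "content") if not doc.get(key)]
    if missing:
        raise ValueError(f"record is missing required field(s): {missing}")

    clean = {}
    for key, value in (doc.get("metadata") or {}).items():
        if value is None:
            continue  # same rule as the notebook: skip empty values
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        elif isinstance(value, (list, tuple)):
            # Assumed convention: flatten sequences into one string.
            clean[key] = ", ".join(str(item) for item in value)
        else:
            clean[key] = str(value)
    return str(doc["id"]), doc["content"], clean


# Hypothetical sample record; the real schema lives in documents.json.
record = {
    "id": "demo-001",
    "content": "A night market known for street food.",
    "metadata": {"city": "Taipei", "rating": 4.5, "phone": None, "tags": ["food", "night"]},
}
print(prepare_record(record))
```

Running a check like this over `documents` before embedding surfaces malformed records early, instead of partway through a long ingestion run.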


In [30]:
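# Embed each document and store it in ChromaDB
task("Generating embeddings and storing in ChromaDB...")

batch_size = 100  # progress is logged once every batch_size documents
total = len(documents)
success_count = 0
error_count = 0

# Local sentence-transformers model used to embed the document text
model = SentenceTransformer("all-MiniLM-L6-v2")

# Process documents one at a time so a single failure does not stop the run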
for i, doc in enumerate(documents, 1):
    try:
        # Create the embedding for the document text
        embedding = model.encode(doc["content"])

        # Drop metadata entries whose value is None
        clean_metadata = {
            k: v for k, v in doc["metadata"].items() if v is not None
        }

        # Store the record in ChromaDB
        collection.add(
            ids=[doc["id"]],
            documents=[doc["content"]],
            embeddings=[embedding.tolist()],
            metadatas=[clean_metadata]
        )

        success_count += 1

        if i % batch_size == 0:
            data(f"Progress: {i}/{total} ({i/total*100:.1f}%)")

    except Exception as e:
        # Log the failure and move on to the next document
        error_count += 1
        error(f"Error processing document {doc['id']}: {e}")

success("Embedding generation complete!")
data(f"Successfully processed: {success_count}/{total}")

if error_count > 0:
    error(f"Errors: {error_count}")

final_count = collection.count()
data(f"Total documents in collection: {final_count}")

🚀 Generating embeddings and storing in ChromaDB...
📊 Progress: 100/5086 (2.0%)
📊 Progress: 200/5086 (3.9%)
📊 Progress: 300/5086 (5.9%)
📊 Progress: 400/5086 (7.9%)
📊 Progress: 500/5086 (9.8%)
📊 Progress: 600/5086 (11.8%)
📊 Progress: 700/5086 (13.8%)
📊 Progress: 800/5086 (15.7%)
📊 Progress: 900/5086 (17.7%)
📊 Progress: 1000/5086 (19.7%)
📊 Progress: 1100/5086 (21.6%)
📊 Progress: 1200/5086 (23.6%)
📊 Progress: 1300/5086 (25.6%)
📊 Progress: 1400/5086 (27.5%)
📊 Progress: 1500/5086 (29.5%)
📊 Progress: 1600/5086 (31.5%)
📊 Progress: 1700/5086 (33.4%)
📊 Progress: 1800/5086 (35.4%)
📊 Progress: 1900/5086 (37.4%)
📊 Progress: 2000/5086 (39.3%)
📊 Progress: 2100/5086 (41.3%)
📊 Progress: 2200/5086 (43.3%)
📊 Progress: 2300/5086 (45.2%)
📊 Progress: 2400/5086 (47.2%)
📊 Progress: 2500/5086 (49.2%)
📊 Progress: 2600/5086 (51.1%)
📊 Progress: 2700/5086 (53.1%)
📊 Progress: 2800/5086 (55.1%)
📊 Progress: 2900/5086 (57.0%)
[output truncated]
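
The loop above encodes and inserts one document per call; `batch_size` only sets how often progress is logged. A hedged sketch of true batching, assuming that `SentenceTransformer.encode` accepts a list of texts and returns an array with a `tolist()` method, and that `collection.add` accepts parallel lists; the helper names `chunked` and `add_in_batches` are hypothetical:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def add_in_batches(collection, model, documents, batch_size=100):
    """Embed and store documents batch by batch; return (stored, failed)."""
    stored = failed = 0
    for batch in chunked(documents, batch_size):
        try:
            # One encode call per batch returns one vector per text.
            vectors = model.encode([doc["content"] for doc in batch])
            collection.add(
                ids=[doc["id"] for doc in batch],
                documents=[doc["content"] for doc in batch],
                embeddings=vectors.tolist(),
                metadatas=[
                    {k: v for k, v in doc["metadata"].items() if v is not None}
                    for doc in batch
                ],
            )
            stored += len(batch)
        except Exception as exc:
            # A failure rejects the whole batch, so count all of its documents.
            failed += len(batch)
            print(f"Batch starting at {batch[0]['id']} failed: {exc}")
    return stored, failed
```

With the objects from this chapter, the call would be `add_in_batches(collection, model, documents)`. The trade-off is coarser error handling: one bad record fails its entire batch, so retrying failed batches one document at a time is a reasonable refinement.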

## 🔍 Step 04 - Test Retrieval
Test similarity search to verify the database works correctly.
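
Because the stored vectors were produced outside Chroma, the query text has to be embedded with the same model and passed through `query_embeddings`. A minimal sketch, assuming the `collection` and `model` objects from the earlier steps; the helper name `search` and the sample query in the usage note are illustrative:

```python
def search(collection, model, query_text, n_results=3):
    """Return the closest documents as a list of dicts, nearest first."""
    # Embed the query with the same model that embedded the documents.
    query_vector = model.encode(query_text).tolist()
    result = collection.query(
        query_embeddings=[query_vector],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )
    # Chroma returns one inner list per query; a single query was sent.
    hits = zip(
        result["ids"][0],
        result["documents"][0],
        result["metadatas"][0],
        result["distances"][0],
    )
    return [
        {"id": doc_id, "content": content, "metadata": meta, "distance": dist}
        for doc_id, content, meta, dist in hits
    ]
```

A call such as `search(collection, model, "hot springs near Taipei")` should return attractions whose text relates to the query. Smaller distances indicate closer matches, so unrelated top hits usually point to a mismatch between the model used for ingestion and the one used for the query.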

## 📊 Step 05 - Database Statistics
Check collection statistics and verify all documents are stored.
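
One way to do this is to compare `collection.count()` with the number of loaded documents and to tally a metadata field. The sketch below assumes the `collection` from Step 02 and Chroma's `collection.get(include=["metadatas"])` call; the helper name `collection_stats` and the `city` field in the usage note are hypothetical, since the metadata schema is not shown in this chapter:

```python
from collections import Counter


def collection_stats(collection, expected_total, field):
    """Summarize document count, missing documents, and one metadata field."""
    stored = collection.count()
    # Fetch metadata only; documents and embeddings are not needed here.
    records = collection.get(include=["metadatas"])
    values = Counter(
        (meta or {}).get(field, "<missing>") for meta in records["metadatas"]
    )
    return {
        "stored": stored,
        "missing": expected_total - stored,
        "top_values": values.most_common(5),
    }
```

For example, `collection_stats(collection, len(documents), "city")` would report how many of the loaded documents reached the collection and the five most frequent values of the assumed `city` field. A non-zero `missing` value means some documents failed during ingestion in Step 03.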