# Assignment 1: Vector Database Creation and Retrieval
## Day 6 Session 2 - RAG Fundamentals

**OBJECTIVE:** Create a vector database from a folder of documents and implement basic retrieval functionality.

**LEARNING GOALS:**
- Understand document loading with SimpleDirectoryReader
- Learn vector store setup with LanceDB
- Implement vector index creation
- Perform semantic search and retrieval

**DATASET:** Use the data folder at `Day_6/session_2/data/`, which contains multiple file types

**INSTRUCTIONS:**
1. Complete each function by replacing the TODO comments with actual implementation
2. Run each cell after completing the function to test it
3. The answers can be found in the existing notebooks in the `llamaindex_rag/` folder

---
## 📝 Setup: Configure Your API Key (Optional)

**IMPORTANT:** This assignment primarily uses **local embeddings** (no API key required).

However, if you want to use OpenAI or OpenRouter for LLM operations later:

### Option 1: OpenAI API Key
Get your API key from: https://platform.openai.com/api-keys

### Option 2: OpenRouter API Key (Recommended - cheaper!)
Get your API key from: https://openrouter.ai/keys

### How to Enter Your API Key:
Run the cell below and enter your API key when prompted. The input is hidden as you type, and the key is stored in an environment variable for this session only.

In [None]:
# API Key Configuration (Optional - for future LLM operations)
import os
from getpass import getpass

# Check if API key is already set in environment
if not os.getenv("OPENROUTER_API_KEY") and not os.getenv("OPENAI_API_KEY"):
    print("\n🔑 API Key Setup (Optional)")
    print("=" * 50)
    print("This assignment uses LOCAL embeddings (no API key required).")
    print("\nHowever, you can optionally configure an API key for future LLM operations:")
    print("  1. OpenAI API Key - https://platform.openai.com/api-keys")
    print("  2. OpenRouter API Key - https://openrouter.ai/keys (cheaper option)")
    print("\nPress Enter to skip, or paste your API key below:")
    
    api_key = getpass("API Key (or press Enter to skip): ").strip()
    
    if api_key:
        # Detect which type of key it is
        if api_key.startswith("sk-or-"):
            os.environ["OPENROUTER_API_KEY"] = api_key
            print("✅ OpenRouter API key configured!")
        elif api_key.startswith("sk-"):
            os.environ["OPENAI_API_KEY"] = api_key
            print("✅ OpenAI API key configured!")
        else:
            print("⚠️  Warning: API key format not recognized. Setting as OPENROUTER_API_KEY.")
            os.environ["OPENROUTER_API_KEY"] = api_key
    else:
        print("ℹ️  Skipping API key setup - using local embeddings only (perfect for this assignment!)")
else:
    print("✅ API key already configured in environment")

---
## 📚 Step 1: Import Required Libraries

**What this does:**
- Imports LlamaIndex components for document loading, vector storage, and indexing
- Imports LanceDB for local vector database storage
- Imports HuggingFace embeddings for converting text to numerical vectors

**Key Libraries:**
- `SimpleDirectoryReader`: Loads documents from folders
- `VectorStoreIndex`: Creates searchable index from documents
- `LanceDBVectorStore`: Local vector database (fast, no API needed)
- `HuggingFaceEmbedding`: Free, local text embeddings

In [None]:
# Import required libraries
import os
from pathlib import Path
from typing import List
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

print("✅ Libraries imported successfully!")

---
## ⚙️ Step 2: Configure LlamaIndex Settings

**What this does:**
- Configures LlamaIndex to use **local embeddings** (no API calls, completely free!)
- Uses the BAAI/bge-small-en-v1.5 model for converting text to 384-dimensional vectors
- The model runs on your computer, so no API key is needed; an internet connection is required only for the one-time model download

**Why local embeddings?**
- ✅ Completely free (no API costs)
- ✅ Fast (runs on your machine)
- ✅ Private (your documents never leave your computer)
- ✅ Good quality for learning and many applications

**Model Details:**
- BAAI/bge-small-en-v1.5: 384-dimensional embeddings, ~133MB model size
- First run will download the model (one-time, ~1-2 minutes)
- Subsequent runs load the cached model from disk (no download needed)

In [None]:
# Configure LlamaIndex Settings (Using local embeddings - No API key needed)
def setup_llamaindex_settings():
    """
    Configure LlamaIndex with local embeddings.
    This assignment focuses on vector database operations using free, local models.
    """
    # Check for API keys (optional, for future use)
    has_openrouter = bool(os.getenv("OPENROUTER_API_KEY"))
    has_openai = bool(os.getenv("OPENAI_API_KEY"))
    
    if not has_openrouter and not has_openai:
        print("ℹ️  No API key configured - that's OK for this assignment!")
        print("   This assignment only uses local embeddings for vector operations.")
    else:
        print("✅ API key found (for optional future LLM operations)")
    
    # Configure local embeddings (no API key required)
    print("\n🔄 Loading local embedding model...")
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-small-en-v1.5",
        trust_remote_code=True
    )
    
    print("✅ LlamaIndex configured with local embeddings")
    print("   Using BAAI/bge-small-en-v1.5 for document embeddings (384 dimensions)")
    print("   First run may take 1-2 minutes to download model (~133MB)")

# Setup the configuration
setup_llamaindex_settings()

---
## 📂 Function 1: Load Documents from Folder

**Your Task:** Complete the `load_documents_from_folder()` function below.

**What this function does:**
- Takes a folder path as input
- Uses `SimpleDirectoryReader` to automatically detect and load various file types
- Supports: PDFs, text files, Word docs, HTML, CSVs, and more
- Returns a list of Document objects that can be indexed

**Key Concept - Document Loading:**
Document ingestion is the first step in any RAG system. We need to load various file types (PDFs, text, HTML, etc.) into memory before we can create embeddings and search them.

**Parameters:**
- `input_dir`: Path to the folder containing documents
- `recursive=True`: Also load files from subdirectories
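
To see what recursive loading means in practice, here is a minimal standard-library sketch that lists the files a recursive scan would visit. It only illustrates directory traversal and is not how `SimpleDirectoryReader` is implemented; the helper name `list_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def list_files(input_dir: str, recursive: bool = True) -> List[Path]:
    """Return the files under input_dir (hypothetical helper for illustration)."""
    root = Path(input_dir)
    # rglob("*") also walks subdirectories; glob("*") stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file())
```

Calling `list_files("data")` and `list_files("data", recursive=False)` side by side shows which files live only in subfolders.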

**TODO:** Replace the `pass` statement with your implementation using `SimpleDirectoryReader`

In [None]:
def load_documents_from_folder(folder_path: str):
    """
    Load documents from a folder using SimpleDirectoryReader.
    
    TODO: Complete this function to load documents from the given folder path.
    HINT: Use SimpleDirectoryReader with recursive parameter to load all files
    
    Args:
        folder_path (str): Path to the folder containing documents
        
    Returns:
        List of documents loaded from the folder
    """
    # TODO: Your code here
    # Create SimpleDirectoryReader instance with recursive loading
    # Load and return documents
    pass

# Test the function after you complete it
test_folder = "data"
documents = load_documents_from_folder(test_folder)
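# Guard against None so this cell does not crash before the function is completed
print(f"Loaded {len(documents) if documents else 0} documents")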

---
## 📖 Understanding Vector Stores and Embeddings

Before we create the vector store, let's understand some key concepts:

### 🏗️ 1. What is an "Instance"?

**Instance** = A concrete object created from a class

Think of it like:
- **Class** (LanceDBVectorStore) = Blueprint for a house
- **Instance** (vector_store) = An actual house built from that blueprint

```python
vector_store = LanceDBVectorStore(...)  # Creating an instance
```

Now `vector_store` is a working object you can use:
- `vector_store.add()` - Add documents
- `vector_store.query()` - Search documents

---

### 🗂️ 2. Where is the Database Created?

When you specify `db_path = "./vectordb"`, it means:
- `./` = Current working directory (where Jupyter is running)
- For example, if Jupyter is running in `D:\Claude\Bootcamp\Day- 6\`, the database will be created at:
  - `D:\Claude\Bootcamp\Day- 6\vectordb\`

After running, you'll see a new folder with files like:
- `documents.lance` (the actual database)
- Index files
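
If you are unsure where the folder will end up on your machine, a short check like the one below prints the absolute location. This is a small sketch using only the standard library; the variable name `absolute_db_path` is just for illustration.

```python
from pathlib import Path

db_path = "./vectordb"

# resolve() turns the relative path into an absolute one based on the
# current working directory, which is where the database folder will appear.
absolute_db_path = Path(db_path).resolve()

print(f"Current working directory: {Path.cwd()}")
print(f"Database will be created at: {absolute_db_path}")
```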

---

### 🧠 3. What are Document Embeddings? (MOST IMPORTANT!)

**Embeddings** = Converting text to numbers that capture meaning

**Example:**

Original Documents (text):
- Doc 1: "Python is a programming language"
- Doc 2: "JavaScript is used for web development"
- Doc 3: "I love cooking pasta"

Document Embeddings (numbers):
- Doc 1: `[0.8, 0.9, 0.1, 0.05, ...]` (384 numbers)
- Doc 2: `[0.75, 0.85, 0.15, 0.1, ...]` (384 numbers)
- Doc 3: `[0.1, 0.05, 0.9, 0.95, ...]` (384 numbers)

**Why numbers?**
- Computers can't directly compare the meaning of words like "Python" or "programming"
- BUT computers CAN measure distance between numbers!
- Similar meanings → Similar numbers (vectors that sit close together)
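
One common way to measure how close two embeddings are is cosine similarity. The sketch below applies it to the first four toy values of each document from the example above; these numbers are illustrative, not real model output, and the function is a plain-Python stand-in for what a vector database does internally.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Truncated toy vectors from the example above (real embeddings have 384 values).
doc1 = [0.8, 0.9, 0.1, 0.05]    # "Python is a programming language"
doc2 = [0.75, 0.85, 0.15, 0.1]  # "JavaScript is used for web development"
doc3 = [0.1, 0.05, 0.9, 0.95]   # "I love cooking pasta"

print(f"Doc 1 vs Doc 2: {cosine_similarity(doc1, doc2):.3f}")
print(f"Doc 1 vs Doc 3: {cosine_similarity(doc1, doc3):.3f}")
```

The two programming documents score much higher against each other than the programming document does against the cooking one.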

---

### 📊 4. Storing Documents vs Storing Embeddings

**Traditional Search (Keyword Matching):**
```
Database stores: "Python is a programming language"
Search: "coding languages"
Result: ❌ NO MATCH ("coding" ≠ "programming")
```

**Vector Search (Semantic/Meaning-Based):**
```
Database stores: [0.8, 0.9, 0.1, ...] ← Python doc embedding
Your query: "coding languages" → [0.78, 0.88, 0.12, ...]
Computer calculates: Distance = 0.05 (VERY CLOSE!)
Result: ✅ Returns Python doc (understands "coding" ≈ "programming")
```
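
The contrast above can be reproduced in a few lines. This is a rough sketch that uses only the three visible toy values of each embedding, so the distance it prints will not equal the illustrative 0.05 shown above; the keyword check is a deliberately simple whole-word comparison.

```python
import math

stored_text = "Python is a programming language"
stored_embedding = [0.8, 0.9, 0.1]    # toy values from the example above
query_text = "coding languages"
query_embedding = [0.78, 0.88, 0.12]  # toy values from the example above

# Keyword matching: look for any whole word shared by the query and the document.
shared_words = set(query_text.lower().split()) & set(stored_text.lower().split())
print(f"Keyword match found: {bool(shared_words)}")

# Vector search: measure how far apart the two embeddings are.
distance = math.dist(stored_embedding, query_embedding)
print(f"Euclidean distance: {distance:.3f}")
```

The keyword check finds nothing in common ("languages" is not the same word as "language"), while the embeddings are nearly identical.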

---

### 🎯 The Magic of Embeddings:

Embeddings understand **MEANING**, not just words:

| Your Search | Traditional DB | Embedding DB |
|-------------|----------------|-------------|
| "king" | ❌ No match for "queen" | ✅ Finds "queen" (similar concept) |
| "happy" | ❌ No match for "joyful" | ✅ Finds "joyful" (same sentiment) |
| "Python tutorial" | ❌ No match for "learn programming" | ✅ Finds "learn programming" (same intent) |

---

### 🔍 Why RAG Uses Embeddings:

When you ask: **"How do AI agents work?"**

RAG system:
1. Converts your question to embedding: `[0.6, 0.7, 0.3, ...]`
2. Compares to ALL document embeddings in database
3. Finds documents with similar embeddings (similar meaning)
4. Returns: "AI_Agent_Frameworks.pdf" (even if it never says "how do they work")

**That's the "Retrieval" in Retrieval-Augmented Generation!**

---
## 🗄️ Function 2: Create Vector Store

**Your Task:** Complete the `create_vector_store()` function below.

**What this function does:**
- Creates a local LanceDB vector database
- LanceDB stores document embeddings (numerical vectors) on your disk
- No API calls needed - everything runs locally

**Key Concept - Vector Store:**
A vector store is a specialized database optimized for storing and searching high-dimensional vectors (embeddings). Unlike traditional databases that search by exact matches, vector databases find similar vectors using distance calculations.

**Parameters:**
- `uri`: Path where the database files will be stored
- `table_name`: Name of the table to store document vectors (like a table in SQL)

**Why LanceDB?**
- ✅ Works completely offline (no API calls)
- ✅ Fast similarity search
- ✅ Lightweight (typically a few MB on disk for small document collections)
- ✅ Perfect for learning and small-to-medium projects

**TODO:** Complete the function by creating a `LanceDBVectorStore` instance

In [None]:
def create_vector_store(db_path: str = "./vectordb", table_name: str = "documents"):
    """
    Create a LanceDB vector store for storing document embeddings.
    
    TODO: Complete this function to create a LanceDB vector store.
    HINT: Create the directory first, then instantiate LanceDBVectorStore with uri and table_name
    
    Args:
        db_path (str): Path where the vector database will be stored
        table_name (str): Name of the table in the vector database
        
    Returns:
        LanceDBVectorStore: Configured vector store
    """
    # TODO: Your code here
    # Create the directory if it doesn't exist (use Path from pathlib)
    # Create and return LanceDBVectorStore instance
    pass

# Test the function after you complete it
vector_store = create_vector_store("./assignment_vectordb")
print(f"Vector store created: {vector_store is not None}")

---
## 🔗 Function 3: Create Vector Index

**Your Task:** Complete the `create_vector_index()` function below.

**What this function does:**
- Takes your loaded documents and the vector store
- Creates embeddings for ALL documents (converts text to 384-dimensional vectors)
- Stores these embeddings in the vector database
- Returns an index that can be used for searching

**Key Concept - Vector Index:**
The vector index is the searchable structure that connects your original documents with their embeddings. When you search, the index:
1. Converts your query to an embedding
2. Finds the closest document embeddings in the vector store
3. Returns the original document text

**What happens during index creation:**
1. For each document → Generate embedding using BAAI/bge-small-en-v1.5
2. Store embedding in LanceDB vector store
3. Create searchable index structure

**Time taken:**
- ~1-2 seconds per document (first time)
- For 39 documents ≈ 30-60 seconds
- Subsequent runs are faster because the embedding model is already downloaded and cached

**TODO:** Complete the function by:
1. Creating a StorageContext with the vector store
2. Creating a VectorStoreIndex from documents using that storage context

In [None]:
def create_vector_index(documents: List, vector_store):
    """
    Create a vector index from documents using the provided vector store.
    
    TODO: Complete this function to create a searchable vector index.
    HINT: Create StorageContext first, then use VectorStoreIndex.from_documents()
    
    Args:
        documents: List of documents to index
        vector_store: LanceDB vector store to use for storage
        
    Returns:
        VectorStoreIndex: The created vector index
    """
    # TODO: Your code here
    # Create storage context with vector store
    # Create index from documents
    # This will: 1) Generate embeddings for all documents
    #           2) Store embeddings in the vector store
    pass

# Test the function after you complete it
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print(f"Vector index created: {index is not None}")
    print(f"Indexed {len(documents)} documents successfully!")
else:
    print("Complete previous functions first to test this one")

---
## 🔍 Function 4: Search Documents

**Your Task:** Complete the `search_documents()` function below.

**What this function does:**
- Takes a search query (plain English text)
- Converts query to an embedding
- Finds the most similar document embeddings in the vector store
- Returns the actual document text (not the embeddings)

**Key Concept - Semantic Search:**
Unlike keyword search (exact word matching), semantic search finds documents with similar **meaning**:
- Query: "machine learning tutorials" → Finds: "AI and deep learning guides"
- Query: "Italian food recipes" → Finds: "Cooking pasta and pizza"
- Query: "financial analysis" → Finds: "Investment and stock market data"

**How it works:**
1. Query "What are AI agents?" → Embedding: `[0.65, 0.73, 0.32, ...]`
2. Compare to all document embeddings using distance calculation
3. Find closest matches:
   - Document A: Distance = 0.08 (VERY SIMILAR) ✅
   - Document B: Distance = 0.15 (SIMILAR) ✅
   - Document C: Distance = 0.89 (NOT SIMILAR) ❌
4. Return top-k closest documents (e.g., top 3)
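
The final "top-k" step is just a selection of the smallest distances. Here is a minimal sketch using the illustrative distances from the list above; the helper name `top_k_closest` is hypothetical, and a real retriever computes the distances itself rather than receiving them in a dictionary.

```python
import heapq
from typing import Dict, List, Tuple

# Illustrative distances from the list above (smaller means more similar).
distances = {
    "Document A": 0.08,
    "Document B": 0.15,
    "Document C": 0.89,
}


def top_k_closest(distances: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k (name, distance) pairs with the smallest distance, closest first."""
    return heapq.nsmallest(k, distances.items(), key=lambda item: item[1])


for name, distance in top_k_closest(distances, k=2):
    print(f"{name}: distance = {distance}")
```

With `k=2`, Document C is dropped because its distance is the largest of the three.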

**Parameters:**
- `similarity_top_k`: How many results to return (e.g., 3 means "return 3 most similar documents")

**TODO:** Complete the function by:
1. Creating a retriever from the index with similarity_top_k parameter
2. Using the retriever to search for the query

In [None]:
def search_documents(index, query: str, top_k: int = 3):
    """
    Search for relevant documents using the vector index.
    
    TODO: Complete this function to perform semantic search on the index.
    HINT: Use index.as_retriever() with similarity_top_k parameter, then retrieve(query)
    
    Args:
        index: Vector index to search
        query (str): Search query
        top_k (int): Number of top results to return
        
    Returns:
        List of retrieved document nodes
    """
    # TODO: Your code here
    # Create retriever from index with similarity_top_k
    # Retrieve documents for the query
    pass

# Test the function after you complete it
if 'index' in locals() and index is not None:
    test_query = "What are AI agents?"
    results = search_documents(index, test_query, top_k=2)
    print(f"Found {len(results)} results for query: '{test_query}'")
    print("\n🔎 Search Results:")
    for i, result in enumerate(results, 1):
        text_preview = result.text[:100] if hasattr(result, 'text') else 'No text'
        score = f" (Similarity: {result.score:.4f})" if hasattr(result, 'score') else ""
        print(f"  {i}. {text_preview}...{score}")
else:
    print("Complete all previous functions first to test this one")

---
## 🚀 Final Test: Complete RAG Pipeline

**What this cell does:**
Once you've completed all 4 functions above, this cell will:
1. Run the complete vector database pipeline from start to finish
2. Test with multiple diverse search queries
3. Show similarity scores for each result
4. Verify that all components work together

**Test Queries:**
We'll test with 4 different topics to demonstrate semantic search:
- AI and technology
- Agent evaluation
- Cooking and recipes
- Financial analysis

This proves your vector database can handle diverse topics and find relevant results!

**What to look for:**
- ✅ All 4 functions complete successfully
- ✅ Documents load (should see ~39 documents)
- ✅ Vector store and index created
- ✅ Search returns relevant results with similarity scores
- ✅ Higher scores (closer to 1.0) = more similar documents

In [None]:
# Final test of the complete pipeline
print("🚀 Testing Complete Vector Database Pipeline")
print("=" * 50)

# Re-run the complete pipeline to ensure everything works
data_folder = "data"
vector_db_path = "./assignment_vectordb"

# Step 1: Load documents
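print("\n📂 Step 1: Loading documents...")
documents = load_documents_from_folder(data_folder)
# Guard against None so the status report below still runs if Function 1 is incomplete
print(f"   Loaded {len(documents) if documents else 0} documents")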

# Step 2: Create vector store
print("\n🗄️ Step 2: Creating vector store...")
vector_store = create_vector_store(vector_db_path)
print("   Vector store status:", "✅ Created" if vector_store else "❌ Failed")

# Step 3: Create vector index
print("\n🔗 Step 3: Creating vector index...")
print("   (This may take 30-60 seconds for ~39 documents...)")
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print("   Index status:", "✅ Created" if index else "❌ Failed")
else:
    index = None
    print("   ❌ Cannot create index - missing documents or vector store")

# Step 4: Test multiple search queries
print("\n🔍 Step 4: Testing search functionality...")
if index:
    search_queries = [
        "What are AI agents?",
        "How to evaluate agent performance?", 
        "Italian recipes and cooking",
        "Financial analysis and investment"
    ]
    
    for query in search_queries:
        print(f"\n   🔎 Query: '{query}'")
        results = search_documents(index, query, top_k=2)
        
        if results:
            for i, result in enumerate(results, 1):
                text_preview = result.text[:100] if hasattr(result, 'text') else "No text available"
                score = f" (Score: {result.score:.4f})" if hasattr(result, 'score') else ""
                print(f"      {i}. {text_preview}...{score}")
        else:
            print("      No results found")
else:
    print("   ❌ Cannot test search - index not created")

print("\n" + "=" * 50)
print("🎯 Assignment Status:")
print(f"   Documents loaded: {'✅' if documents else '❌'}")
print(f"   Vector store created: {'✅' if vector_store else '❌'}")
print(f"   Index created: {'✅' if index else '❌'}")
print(f"   Search working: {'✅' if index else '❌'}")

if documents and vector_store and index:
    print("\n🎉 Congratulations! You've successfully completed the assignment!")
    print("   You've built a complete vector database with semantic search functionality!")
    print("\n📚 What you learned:")
    print("   ✅ Document loading from folders")
    print("   ✅ Vector store setup with LanceDB")
    print("   ✅ Document embedding and indexing")
    print("   ✅ Semantic search (meaning-based, not keyword-based)")
    print("\n🚀 You're ready for Assignment 2: Advanced RAG techniques!")
else:
    print("\n📝 Please complete the TODO functions above to finish the assignment.")