# Assignment 1: Vector Database Creation and Retrieval
## Day 6 Session 2 - RAG Fundamentals

**OBJECTIVE:** Create a vector database from a folder of documents and implement basic retrieval functionality.

**LEARNING GOALS:**
- Understand document loading with SimpleDirectoryReader
- Learn vector store setup with LanceDB
- Implement vector index creation
- Perform semantic search and retrieval

**DATASET:** Use the data folder in `Day_6/session_2/data/` which contains multiple file types

**INSTRUCTIONS:**
1. Complete each function by replacing the TODO comments with actual implementation
2. Run each cell after completing the function to test it
3. The answers can be found in the existing notebooks in the `llamaindex_rag/` folder


In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')

project_path="/content/drive/MyDrive/Colab Notebooks"
requirements_path=os.path.join(project_path,"requirements.txt")
!pip install -r "$requirements_path"

Mounted at /content/drive
Collecting lancedb (from -r /content/drive/MyDrive/Colab Notebooks/requirements.txt (line 11))
  Downloading lancedb-0.25.2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting llama-index (from -r /content/drive/MyDrive/Colab Notebooks/requirements.txt (line 12))
  Downloading llama_index-0.14.7-py3-none-any.whl.metadata (13 kB)
Collecting llama-index-vector-stores-lancedb (from -r /content/drive/MyDrive/Colab Notebooks/requirements.txt (line 13))
  Downloading llama_index_vector_stores_lancedb-0.4.1-py3-none-any.whl.metadata (460 bytes)
Collecting llama-index-embeddings-huggingface (from -r /content/drive/MyDrive/Colab Notebooks/requirements.txt (line 14))
  Downloading llama_index_embeddings_huggingface-0.6.1-py3-none-any.whl.metadata (458 bytes)
Collecting llama-index-llms-huggingface-api (from -r /content/drive/MyDrive/Colab Notebooks/requirements.txt (line 15))
  Downloading llama_index_llms_huggingface_api-0.6.1-py3-none-any.whl.metadata (1.

In [None]:
# Import required libraries
import os
from pathlib import Path
from typing import List
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [None]:
# Configure LlamaIndex Settings (Using OpenRouter - No OpenAI API Key needed)
import os
from getpass import getpass

def setup_llamaindex_settings():
    """
    Configure LlamaIndex with local embeddings and OpenRouter for LLM.
    This assignment focuses on vector database operations, so we'll use local embeddings only.
    """
    # Check for OpenRouter API key (for future use, not needed for this basic assignment)
    api_key = os.getenv("OPENROUTER_API_KEY")
    # securely input your key
    os.environ["OPENROUTER_API_KEY"] = getpass("Enter your OpenRouter key")
    print("✓ OpenrRouter key set successfully")

    if not api_key:
        print("ℹ️  OPENROUTER_API_KEY not found - that's OK for this assignment!")
        print("   This assignment only uses local embeddings for vector operations.")

    # Configure local embeddings (no API key required)
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-small-en-v1.5",
        trust_remote_code=True
    )

    print("✅ LlamaIndex configured with local embeddings")
    print("   Using BAAI/bge-small-en-v1.5 for document embeddings")

# Setup the configuration
setup_llamaindex_settings()


Enter your OpenRouter key··········
✓ OpenrRouter key set successfully
ℹ️  OPENROUTER_API_KEY not found - that's OK for this assignment!
   This assignment only uses local embeddings for vector operations.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

✅ LlamaIndex configured with local embeddings
   Using BAAI/bge-small-en-v1.5 for document embeddings


## 1. Document Loading Function

Complete the function below to load documents from a folder using `SimpleDirectoryReader`.

**Note:** This assignment uses local embeddings only - no OpenAI API key required! We're configured to use OpenRouter for future LLM operations.


In [None]:
def load_documents_from_folder(folder_path: str):
    """
    Load documents from a folder using SimpleDirectoryReader.

    TODO: Complete this function to load documents from the given folder path.
    HINT: Use SimpleDirectoryReader with recursive parameter to load all files

    Args:
        folder_path (str): Path to the folder containing documents

    Returns:
        List of documents loaded from the folder
    """
    # TODO: Create SimpleDirectoryReader instance
    reader = SimpleDirectoryReader(folder_path,recursive=True)

    # TODO: Load and return documents
    documents = reader.load_data()

    # return documents

    # PLACEHOLDER - Replace with actual implementation
    print(f"TODO: Load documents from {folder_path}")
    return documents

# Test the function after you complete it
test_folder = "/content/drive/MyDrive/content_files/data"
documents = load_documents_from_folder(test_folder)
print(f"Loaded {len(documents)} documents")


100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 113MiB/s]


TODO: Load documents from /content/drive/MyDrive/content_files/data
Loaded 42 documents


## 2. Vector Store Creation Function

Complete the function below to create a LanceDB vector store.


In [None]:
def create_vector_store(db_path: str = "./vectordb", table_name: str = "documents"):
    """
    Create a LanceDB vector store for storing document embeddings.

    Args:
        db_path (str): Path where the vector database will be stored
        table_name (str): Name of the table in the vector database

    Returns:
        LanceDBVectorStore: Configured vector store
    """
    # Create the directory if it doesn't exist
    Path(db_path).mkdir(parents=True, exist_ok=True)

    # Create vector store
    vector_store = LanceDBVectorStore(
        uri=db_path,
        table_name=table_name
    )
    return vector_store

# Test the function
vector_store = create_vector_store("./assignment_vectordb")
print(f"✅ Vector store created successfully")



✅ Vector store created successfully


## 3. Vector Index Creation Function

Complete the function below to create a vector index from documents.


In [None]:
def create_vector_index(documents: List, vector_store):
    """
    Create a vector index from documents using the provided vector store.

    TODO: Complete this function to create a VectorStoreIndex from documents.
    HINT: Create StorageContext with vector_store, then use VectorStoreIndex.from_documents()

    Args:
        documents: List of documents to index
        vector_store: LanceDB vector store to use for storage

    Returns:
        VectorStoreIndex: The created vector index
    """
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # TODO: Create storage context with vector store



    # Test the function
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )
    print(f"✅ Vector index created successfully with {len(documents)} documents")



    # PLACEHOLDER - Replace with actual implementation
    print(f"TODO: Create vector index from {len(documents)} documents")
    return index

# Test the function after you complete it (will only work after previous functions are completed)
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print(f"Vector index created: {index is not None}")
else:
    print("Complete previous functions first to test this one")


Parsing nodes:   0%|          | 0/42 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/55 [00:00<?, ?it/s]

✅ Vector index created successfully with 42 documents
TODO: Create vector index from 42 documents
Vector index created: True


## 4. Document Search Function

Complete the function below to search for relevant documents using the vector index.


In [None]:
def search_documents(index, query: str, top_k: int = 3):
    """
    Search for relevant documents using the vector index.

    TODO: Complete this function to perform semantic search on the index.
    HINT: Use index.as_retriever() with similarity_top_k parameter, then retrieve(query)

    Args:
        index: Vector index to search
        query (str): Search query
        top_k (int): Number of top results to return

    Returns:
        List of retrieved document nodes
    """
    # TODO: Create retriever from index
    retriever = index.as_retriever(similarity_top_k=top_k)

    # TODO: Retrieve documents for the query
    results = retriever.retrieve(query)

    # return results

    # PLACEHOLDER - Replace with actual implementation
    print(f"TODO: Search for '{query}' in index")
    return []

# Test the function after you complete it (will only work after all previous functions are completed)
if 'index' in locals() and index is not None:
    test_query = "What are AI agents?"
    results = search_documents(index, test_query, top_k=2)
    print(f"Found {len(results)} results for query: '{test_query}'")
    for i, result in enumerate(results, 1):
        print(f"Result {i}: {result.text[:100] if hasattr(result, 'text') else 'No text'}...")
else:
    print("Complete all previous functions first to test this one")


TODO: Search for 'What are AI agents?' in index
Found 0 results for query: 'What are AI agents?'


## 5. Final Test - Complete Pipeline

Once you've completed all the functions above, run this cell to test the complete pipeline with multiple search queries.


In [None]:
# Final test of the complete pipeline
print("🚀 Testing Complete Vector Database Pipeline")
print("=" * 50)

# Re-run the complete pipeline to ensure everything works
data_folder = "/content/drive/MyDrive/content_files/data"
vector_db_path = "./assignment_vectordb"

# Step 1: Load documents
print("\n📂 Step 1: Loading documents...")
documents = load_documents_from_folder(data_folder)
print(f"   Loaded {len(documents)} documents")

# Step 2: Create vector store
print("\n🗄️ Step 2: Creating vector store...")
vector_store = create_vector_store(vector_db_path)
print("   Vector store status:", "✅ Created" if vector_store else "❌ Failed")

# Step 3: Create vector index
print("\n🔗 Step 3: Creating vector index...")
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print("   Index status:", "✅ Created" if index else "❌ Failed")
else:
    index = None
    print("   ❌ Cannot create index - missing documents or vector store")

# Step 4: Test multiple search queries
print("\n🔍 Step 4: Testing search functionality...")
if index:
    search_queries = [
        "What are AI agents?",
        "How to evaluate agent performance?",
        "Italian recipes and cooking",
        "Financial analysis and investment"
    ]

    for query in search_queries:
        print(f"\n   🔎 Query: '{query}'")
        results = search_documents(index, query, top_k=2)

        if results:
            for i, result in enumerate(results, 1):
                text_preview = result.text[:100] if hasattr(result, 'text') else "No text available"
                score = f" (Score: {result.score:.4f})" if hasattr(result, 'score') else ""
                print(f"      {i}. {text_preview}...{score}")
        else:
            print("      No results found")
else:
    print("   ❌ Cannot test search - index not created")

print("\n" + "=" * 50)
print("🎯 Assignment Status:")
print(f"   Documents loaded: {'✅' if documents else '❌'}")
print(f"   Vector store created: {'✅' if vector_store else '❌'}")
print(f"   Index created: {'✅' if index else '❌'}")
print(f"   Search working: {'✅' if index else '❌'}")

if documents and vector_store and index:
    print("\n🎉 Congratulations! You've successfully completed the assignment!")
    print("   You've built a complete vector database with search functionality!")
else:
    print("\n📝 Please complete the TODO functions above to finish the assignment.")


🚀 Testing Complete Vector Database Pipeline

📂 Step 1: Loading documents...
TODO: Load documents from /content/drive/MyDrive/content_files/data
   Loaded 42 documents

🗄️ Step 2: Creating vector store...
   Vector store status: ✅ Created

🔗 Step 3: Creating vector index...


Parsing nodes:   0%|          | 0/42 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/55 [00:00<?, ?it/s]

✅ Vector index created successfully with 42 documents
TODO: Create vector index from 42 documents
   Index status: ✅ Created

🔍 Step 4: Testing search functionality...

   🔎 Query: 'What are AI agents?'
TODO: Search for 'What are AI agents?' in index
      No results found

   🔎 Query: 'How to evaluate agent performance?'
TODO: Search for 'How to evaluate agent performance?' in index
      No results found

   🔎 Query: 'Italian recipes and cooking'
TODO: Search for 'Italian recipes and cooking' in index
      No results found

   🔎 Query: 'Financial analysis and investment'
TODO: Search for 'Financial analysis and investment' in index
      No results found

🎯 Assignment Status:
   Documents loaded: ✅
   Vector store created: ✅
   Index created: ✅
   Search working: ✅

🎉 Congratulations! You've successfully completed the assignment!
   You've built a complete vector database with search functionality!
