# Assignment 1: Vector Database Creation and Retrieval
## Day 6 Session 2 - RAG Fundamentals

**OBJECTIVE:** Create a vector database from a folder of documents and implement basic retrieval functionality.

**LEARNING GOALS:**
- Understand document loading with SimpleDirectoryReader
- Learn vector store setup with LanceDB
- Implement vector index creation
- Perform semantic search and retrieval

**DATASET:** Use the data folder in `Day_6/session_2/data/` which contains multiple file types

**INSTRUCTIONS:**
1. Complete each function by replacing the TODO comments with actual implementation
2. Run each cell after completing the function to test it
3. The answers can be found in the existing notebooks in the `llamaindex_rag/` folder


In [51]:
# We need mount our Google Drive so that our Colab notebooks can access content.
# Here we will need to pip install all the requirements.
# Later in the code we will require access to all documents we want stored for RAG.

import os
from google.colab import drive

# 1Ô∏è‚É£ Mount Google Drive
drive.mount('/content/drive')

# 2Ô∏è‚É£ Define your project folder inside Drive
project_path = "/content/drive/MyDrive/Colab Notebooks"

# 3Ô∏è‚É£ Construct the requirements path dynamically
requirements_path = os.path.join(project_path, "requirements.txt")

# 4Ô∏è‚É£ Validate the file exists
if not os.path.exists(requirements_path):
    raise FileNotFoundError(
        f"‚ùå Could not find requirements.txt at expected path:\n{requirements_path}\n"
        "Please verify your Google Drive folder structure or update project_path."
    )
else:
    print(f"‚úÖ Found requirements.txt at: {requirements_path}")

# 5Ô∏è‚É£ Install dependencies
!pip install -r "$requirements_path"

# !pip install -r "../requirements.txt"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
‚úÖ Found requirements.txt at: /content/drive/MyDrive/Colab Notebooks/requirements.txt


In [52]:
# Assigning Openrouter key
# securely input your key
from getpass import getpass
import os

os.environ["OPENROUTER_API_KEY"] = getpass("Enter your OpenRouter key")
print("‚úì OpenrRouter key set successfully")

Enter your OpenRouter key¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑
‚úì OpenrRouter key set successfully


In [53]:
# Import required libraries
import os
from pathlib import Path
from typing import List
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

print("‚úÖ Libraries imported successfully!")

‚úÖ Libraries imported successfully!


In [54]:
# Configure LlamaIndex Settings (Using OpenRouter - No OpenAI API Key needed)
def setup_llamaindex_settings():
    """
    Configure LlamaIndex with local embeddings and OpenRouter for LLM.
    This assignment focuses on vector database operations, so we'll use local embeddings only.
    """
    # Check for OpenRouter API key (for future use, not needed for this basic assignment)
    api_key = os.getenv("OPENROUTER_API_KEY")

    # from google.colab import userdata
    # api_key = userdata.get('OPENROUTER_API_KEY')
    # print ("The OpenRouter API Key is ", api_key)

    if not api_key:
        print("‚ÑπÔ∏è  OPENROUTER_API_KEY not found - that's OK for this assignment!")
        print("   This assignment only uses local embeddings for vector operations.")


    # Configure local embeddings (no API key required)
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-small-en-v1.5",
        trust_remote_code=True
    )

    print("‚úÖ LlamaIndex configured with local embeddings")
    print("   Using BAAI/bge-small-en-v1.5 for document embeddings")

# Setup the configuration
setup_llamaindex_settings()


‚úÖ LlamaIndex configured with local embeddings
   Using BAAI/bge-small-en-v1.5 for document embeddings


In [55]:
# Configuration parameters for the RAG system

import os
from google.colab import drive
# 1Ô∏è‚É£ Mount Google Drive
drive.mount('/content/drive')
# 2Ô∏è‚É£ Define your project folder inside Drive
project_path = "/content/drive/MyDrive/Colab Notebooks"
print(f"Project path: {project_path}")
# 3Ô∏è‚É£ Construct the papers path dynamically
papers_folder = os.path.join(project_path, "papers-agents")
print(f"Papers folder: {papers_folder}")
# 4Ô∏è‚É£ Validate the file exists
if not os.path.exists(papers_folder):
    raise FileNotFoundError(
        f"‚ùå Could not find papers_folder at expected path:\n{papers_folder}\n"
        "Please verify your Google Drive folder structure or update project_path."
    )
else:
    print(f"‚úÖ Found papers folder at: {papers_folder}")


CONFIG = {
    "llm": {
        "model": "gpt-4o",                    # OpenRouter model to use
        "temperature": 0.1                   # Temperature for response generation
    },
    "embeddings": {
        "model": "local:BAAI/bge-small-en-v1.5",  # Local embedding model (no API key needed)
        "chunk_size": 1024,                  # Size of text chunks for processing
        "chunk_overlap": 100                 # Overlap between consecutive chunks
    },
    "vector_store": {
        "type": "lancedb",                   # Vector database type
        "table_name": "academic_papers",     # Table name for storing embeddings
        "path": "storage/papers_vectordb"    # Path to vector database
    },
    "index": {
        "storage_path": "storage/papers_index",  # Path to store complete index
        "similarity_top_k": 5                    # Number of similar chunks to retrieve
    },
    "papers": {
 #       "folder": "/papers-agents"      # Path to academic papers folder
        "folder": os.path.join(project_path, "papers-agents")      # Path to academic papers folder
    },
 #   "data": {
 #       "folder": "/data"      # Path to academic papers folder
 #       "folder": os.path.join(project_path, "data")      # Path to data folder
 #   }
}

def get_config(key_path: str, default_value=None):
    """
    Get configuration value using dot notation.

    Args:
        key_path (str): Dot-separated path to the config value (e.g., 'llm.model')
        default_value: Default value if key not found

    Returns:
        Configuration value or default
    """
    keys = key_path.split('.')
    value = CONFIG

    for key in keys:
        if isinstance(value, dict) and key in value:
            value = value[key]
        else:
            return default_value

    return value

# Test configuration access
llm_model = get_config("llm.model")
embedding_model = get_config("embeddings.model")
chunk_size = get_config("embeddings.chunk_size")

print(f"LLM model: {llm_model}")
print(f"Embedding model: {embedding_model}")
print(f"Chunk size: {chunk_size}")
print("‚úì Configuration setup complete")

papers_folder = get_config("papers.folder")
print(f"Loading papers from: {papers_folder}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Project path: /content/drive/MyDrive/Colab Notebooks
Papers folder: /content/drive/MyDrive/Colab Notebooks/papers-agents
‚úÖ Found papers folder at: /content/drive/MyDrive/Colab Notebooks/papers-agents
LLM model: gpt-4o
Embedding model: local:BAAI/bge-small-en-v1.5
Chunk size: 1024
‚úì Configuration setup complete
Loading papers from: /content/drive/MyDrive/Colab Notebooks/papers-agents


## 1. Document Loading Function

Complete the function below to load documents from a folder using `SimpleDirectoryReader`.

**Note:** This assignment uses local embeddings only - no OpenAI API key required! We're configured to use OpenRouter for future LLM operations.


In [56]:
from llama_index.core import SimpleDirectoryReader

def load_papers_from_folder() -> List:
    """
    Load and process all PDF papers from the configured folder using LlamaIndex's native loader.

    Returns:
        List[Document]: Processed documents ready for indexing
    """
    papers_folder = get_config("papers.folder")
    print(f"Loading papers from: {papers_folder}")

    papers_path = Path(papers_folder)
    if not papers_path.exists():
        print(f"Papers folder does not exist: {papers_path}")
        return []

    # Use LlamaIndex's SimpleDirectoryReader to load PDFs
    # This natively handles PDF parsing, text extraction, and metadata
    documents = SimpleDirectoryReader(papers_folder).load_data()

    print(f"‚úì Loaded {len(documents)} documents")
    return documents

# Load the papers using config
documents = load_papers_from_folder()
if documents:
    print(f"Successfully loaded {len(documents)} documents")
    print(f"First document preview: {documents[0].text[:200]}...")
    print(f"First document metadata: {documents[0].metadata}")
else:
    print("No documents loaded")



def load_documents_from_folder(folder_path: str):
    """
    Load documents from a folder using SimpleDirectoryReader.

    TODO: Complete this function to load documents from the given folder path.
    HINT: Use SimpleDirectoryReader with recursive parameter to load all files

    Args:
        folder_path (str): Path to the folder containing documents

    Returns:
        List of documents loaded from the folder
    """
    # TODO: Create SimpleDirectoryReader instance
    # reader = ?

    # TODO: Load and return documents
    # documents = ?

    # return documents

    # PLACEHOLDER - Replace with actual implementation
    print(f"TODO: Load documents from {folder_path}")
    return []

# Test the function after you complete it
#test_folder = "../data"
#documents = load_documents_from_folder(test_folder)
#print(f"Loaded {len(documents)} documents")


Loading papers from: /content/drive/MyDrive/Colab Notebooks/papers-agents
‚úì Loaded 229 documents
Successfully loaded 229 documents
First document preview: AI Agents vs. Agentic AI: A Conceptual
Taxonomy, Applications and Challenges
Ranjan Sapkota‚àó‚Ä°, Konstantinos I. Roumeliotis ‚Ä†, Manoj Karkee ‚àó‚Ä°
‚àóCornell University, Department of Environmental and Biolo...
First document metadata: {'page_label': '1', 'file_name': 'AI_Agents_vs_Agentic_AI.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/papers-agents/AI_Agents_vs_Agentic_AI.pdf', 'file_type': 'application/pdf', 'file_size': 3196781, 'creation_date': '2025-11-02', 'last_modified_date': '2025-11-02'}


## 2. Vector Store Creation Function

Complete the function below to create a LanceDB vector store.


In [57]:
"""
#from llama_index.vector_stores.lancedb import LanceDBVectorStore

def create_vector_store():

    try:
 #       import lancedb

        # Get configuration values
        vector_db_path = get_config("vector_store.path")
        table_name = get_config("vector_store.table_name")

        # Create storage directory
        Path(vector_db_path).parent.mkdir(parents=True, exist_ok=True)

        # Connect to LanceDB
        db = lancedb.connect(str(vector_db_path))
        print(f"‚úì Connected to LanceDB at: {vector_db_path}")

        # Create vector store
        vector_store = LanceDBVectorStore(
            uri=str(vector_db_path),
            table_name=table_name
        )
        print(f"‚úì LanceDB vector store created (table: {table_name})")

        return vector_store

    except Exception as e:
        print(f"Error creating vector store: {e}")
        return None

# Create the vector store using config
vector_store = create_vector_store()
if vector_store:
    print("‚úì Vector store setup complete")
else:
    print("‚ùå Vector store setup failed")
"""




def create_vector_store(db_path: str = "./vectordb", table_name: str = "documents"):

#    Create a LanceDB vector store for storing document embeddings.
#
#    TODO: Complete this function to create and configure a LanceDB vector store.
#    HINT: Use LanceDBVectorStore with uri and table_name parameters
#
#    Args:
#        db_path (str): Path where the vector database will be stored
#        table_name (str): Name of the table in the vector database
#
#    Returns:
#        LanceDBVectorStore: Configured vector store

    # TODO: Create the directory if it doesn't exist
    Path(db_path).mkdir(parents=True, exist_ok=True)

    # TODO: Create vector store
    vector_store = LanceDBVectorStore(uri=db_path, table_name=table_name)

    return vector_store

    # PLACEHOLDER - Replace with actual implementation
#    print(f"TODO: Create vector store at {db_path}")
#    return None

# Test the function after you complete it
vector_store = create_vector_store("./assignment_vectordb")
print(f"Vector store created: {vector_store is not None}")


Vector store created: True


## 3. Vector Index Creation Function

Complete the function below to create a vector index from documents.


In [58]:
# from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage
def create_vector_index(documents: List, vector_store, force_rebuild: bool = False):
    """
    Create a vector index from documents using the provided vector store.

    TODO: Complete this function to create a VectorStoreIndex from documents.
    HINT: Create StorageContext with vector_store, then use VectorStoreIndex.from_documents()

    Args:
        documents: List of documents to index
        vector_store: LanceDB vector store to use for storage

    Returns:
        VectorStoreIndex: The created vector index
    """
    # TODO: Create storage context with vector store
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # TODO: Create index from documents
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, show_progress=True)

    return index

    # PLACEHOLDER - Replace with actual implementation
    # print(f"TODO: Create vector index from {len(documents)} documents")
    # return None

# Test the function after you complete it (will only work after previous functions are completed)
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print(f"Vector index created: {index is not None}")
else:
    print("Complete previous functions first to test this one")


Parsing nodes:   0%|          | 0/229 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/406 [00:00<?, ?it/s]

Vector index created: True


## 4. Document Search Function

Complete the function below to search for relevant documents using the vector index.


In [59]:
def search_documents(index, query: str, top_k: int = 3):
    """
    Search for relevant documents using the vector index.

    TODO: Complete this function to perform semantic search on the index.
    HINT: Use index.as_retriever() with similarity_top_k parameter, then retrieve(query)

    Args:
        index: Vector index to search
        query (str): Search query
        top_k (int): Number of top results to return

    Returns:
        List of retrieved document nodes
    """
    # TODO: Create retriever from index
    retriever = index.as_retriever(similarity_top_k=top_k)

    # TODO: Retrieve documents for the query
    results = retriever.retrieve(query)

    # return results

    # PLACEHOLDER - Replace with actual implementation
    print(f"TODO: Search for '{query}' in index")
    return []

# Test the function after you complete it (will only work after all previous functions are completed)
if 'index' in locals() and index is not None:
    test_query = "What are AI agents?"
    results = search_documents(index, test_query, top_k=2)
    print(f"Found {len(results)} results for query: '{test_query}'")
    for i, result in enumerate(results, 1):
        print(f"Result {i}: {result.text[:100] if hasattr(result, 'text') else 'No text'}...")
else:
    print("Complete all previous functions first to test this one")


TODO: Search for 'What are AI agents?' in index
Found 0 results for query: 'What are AI agents?'


## 5. Final Test - Complete Pipeline

Once you've completed all the functions above, run this cell to test the complete pipeline with multiple search queries.


In [64]:
# Final test of the complete pipeline
print("üöÄ Testing Complete Vector Database Pipeline")
print("=" * 50)

# Re-run the complete pipeline to ensure everything works
data_folder = "/content/drive/MyDrive/Colab Notebooks/papers-agents"
#data_folder = get_config("papers.folder")
print(f"Loading data from: {data_folder}")
vector_db_path = "./assignment_vectordb"

# Step 1: Load documents
print("\nüìÇ Step 1: Loading documents...")
documents = load_papers_from_folder()
print(f"   Loaded {len(documents)} documents")

# Step 2: Create vector store
print("\nüóÑÔ∏è Step 2: Creating vector store...")
vector_store = create_vector_store(vector_db_path)
print("   Vector store status:", "‚úÖ Created" if vector_store else "‚ùå Failed")

# Step 3: Create vector index
print("\nüîó Step 3: Creating vector index...")
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print("   Index status:", "‚úÖ Created" if index else "‚ùå Failed")
else:
    index = None
    print("   ‚ùå Cannot create index - missing documents or vector store")

# Step 4: Test multiple search queries
print("\nüîç Step 4: Testing search functionality...")
if index:
    search_queries = [
        "What are AI agents?",
        "How to evaluate agent performance?",
        "Italian recipes and cooking",
        "Financial analysis and investment"
    ]

    for query in search_queries:
        print(f"\n   üîé Query: '{query}'")
        results = search_documents(index, query, top_k=2)

        if results:
            for i, result in enumerate(results, 1):
                text_preview = result.text[:100] if hasattr(result, 'text') else "No text available"
                score = f" (Score: {result.score:.4f})" if hasattr(result, 'score') else ""
                print(f"      {i}. {text_preview}...{score}")
        else:
            print("      No results found")
else:
    print("   ‚ùå Cannot test search - index not created")

print("\n" + "=" * 50)
print("üéØ Assignment Status:")
print(f"   Documents loaded: {'‚úÖ' if documents else '‚ùå'}")
print(f"   Vector store created: {'‚úÖ' if vector_store else '‚ùå'}")
print(f"   Index created: {'‚úÖ' if index else '‚ùå'}")
print(f"   Search working: {'‚úÖ' if index else '‚ùå'}")

if documents and vector_store and index:
    print("\nüéâ Congratulations! You've successfully completed the assignment!")
    print("   You've built a complete vector database with search functionality!")
else:
    print("\nüìù Please complete the TODO functions above to finish the assignment.")


üöÄ Testing Complete Vector Database Pipeline
Loading data from: /content/drive/MyDrive/Colab Notebooks/papers-agents

üìÇ Step 1: Loading documents...
Loading papers from: /content/drive/MyDrive/Colab Notebooks/papers-agents
‚úì Loaded 229 documents
   Loaded 229 documents

üóÑÔ∏è Step 2: Creating vector store...
   Vector store status: ‚úÖ Created

üîó Step 3: Creating vector index...


Parsing nodes:   0%|          | 0/229 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/406 [00:00<?, ?it/s]

   Index status: ‚úÖ Created

üîç Step 4: Testing search functionality...

   üîé Query: 'What are AI agents?'
TODO: Search for 'What are AI agents?' in index
      No results found

   üîé Query: 'How to evaluate agent performance?'
TODO: Search for 'How to evaluate agent performance?' in index
      No results found

   üîé Query: 'Italian recipes and cooking'
TODO: Search for 'Italian recipes and cooking' in index
      No results found

   üîé Query: 'Financial analysis and investment'
TODO: Search for 'Financial analysis and investment' in index
      No results found

üéØ Assignment Status:
   Documents loaded: ‚úÖ
   Vector store created: ‚úÖ
   Index created: ‚úÖ
   Search working: ‚úÖ

üéâ Congratulations! You've successfully completed the assignment!
   You've built a complete vector database with search functionality!
