# Assignment 1: Vector Database Creation and Retrieval
## Day 6 Session 2 - RAG Fundamentals

**OBJECTIVE:** Create a vector database from a folder of documents and implement basic retrieval functionality.

**LEARNING GOALS:**
- Understand document loading with SimpleDirectoryReader
- Learn vector store setup with LanceDB
- Implement vector index creation
- Perform semantic search and retrieval

**DATASET:** Use the data folder in `Day_6/session_2/data/` which contains multiple file types

**INSTRUCTIONS:**
1. Complete each function by replacing the TODO comments with actual implementation
2. Run each cell after completing the function to test it
3. The answers can be found in the existing notebooks in the `llamaindex_rag/` folder


In [41]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [42]:
!pip install -r /content/drive/MyDrive/Outskill/rag_day7/assignments/requirements.txt



In [43]:
import os
from getpass import getpass

os.environ["OPENROUTER_API_KEY"] = getpass("Enter OpenRouter Key")

Enter OpenRouter Key¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑


In [44]:
# Import required libraries
import os
from pathlib import Path
from typing import List
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

print("‚úÖ Libraries imported successfully!")

‚úÖ Libraries imported successfully!


In [45]:
import os
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openrouter import OpenRouter # Import OpenRouter LLM

# Configure LlamaIndex Settings (Using OpenRouter - No OpenAI API Key needed)
def setup_llamaindex_settings():
    """
    Configure LlamaIndex with local embeddings and OpenRouter for LLM.
    This assignment focuses on vector database operations, so we'll use local embeddings only.
    """
    # Check for OpenRouter API key
    api_key = os.getenv("OPENROUTER_API_KEY")
    if not api_key:
        print("‚ÑπÔ∏è  OPENROUTER_API_KEY not found. For querying with an LLM, please provide it.")
        # If no API key, Settings.llm will not be configured, and querying will likely fail later.
        Settings.llm = None # Explicitly set to None if key is missing
    else:
        # Configure LLM to use OpenRouter
        # You can choose a different model if desired, e.g., 'openai/gpt-3.5-turbo'
        Settings.llm = OpenRouter(api_key=api_key, model="mistralai/mistral-7b-instruct-v0.2")

    # Configure local embeddings (no API key required)
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-small-en-v1.5",
        trust_remote_code=True
    )

    print("‚úÖ LlamaIndex configured with local embeddings")
    print("   Using BAAI/bge-small-en-v1.5 for document embeddings")
    if Settings.llm:
        print(f"   LLM configured to use OpenRouter with model: {Settings.llm.model}")
    else:
        print("   LLM not configured (OpenRouter API key missing or not set).")

# Setup the configuration
setup_llamaindex_settings()

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertModel LOAD REPORT from: BAAI/bge-small-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


‚úÖ LlamaIndex configured with local embeddings
   Using BAAI/bge-small-en-v1.5 for document embeddings
   LLM configured to use OpenRouter with model: mistralai/mistral-7b-instruct-v0.2


## 1. Document Loading Function

Complete the function below to load documents from a folder using `SimpleDirectoryReader`.

**Note:** This assignment uses local embeddings only - no OpenAI API key required! We're configured to use OpenRouter for future LLM operations.


In [46]:
def load_documents_from_folder(folder_path: str):
    """
    Load documents from a folder using SimpleDirectoryReader.

    TODO: Complete this function to load documents from the given folder path.
    HINT: Use SimpleDirectoryReader with recursive parameter to load all files

    Args:
        folder_path (str): Path to the folder containing documents

    Returns:
        List of documents loaded from the folder
    """
    # TODO: Create SimpleDirectoryReader instance
    reader = SimpleDirectoryReader(folder_path, recursive=True)

    # TODO: Load and return documents
    documents = reader.load_data()

    # PLACEHOLDER - Replace with actual implementation
    print(f"TODO: Load documents from {folder_path}")
    return documents

# Test the function after you complete it
test_folder = "/content/drive/MyDrive/Outskill/rag_day7/assignments/papers/agents"
documents = load_documents_from_folder(test_folder)
print(f"Loaded {len(documents)} documents")

if documents:
    print(f"Successfully loaded {len(documents)} documents")
    print(f"First document preview: {documents[0].text[:200]}...")
    print(f"First document metadata: {documents[0].metadata}")
else:
    print("No documents loaded")


TODO: Load documents from /content/drive/MyDrive/Outskill/rag_day7/assignments/papers/agents
Loaded 229 documents
Successfully loaded 229 documents
First document preview: AI Agents vs. Agentic AI: A Conceptual
Taxonomy, Applications and Challenges
Ranjan Sapkota ‚àó‚Ä°, Konstantinos I. Roumeliotis ‚Ä†, Manoj Karkee ‚àó‚Ä°
‚àóCornell University, Department of Environmental and Biol...
First document metadata: {'page_label': '1', 'file_name': 'AI_Agents_vs_Agentic_AI.pdf', 'file_path': '/content/drive/MyDrive/Outskill/rag_day7/assignments/papers/agents/AI_Agents_vs_Agentic_AI.pdf', 'file_type': 'application/pdf', 'file_size': 3196781, 'creation_date': '2026-02-08', 'last_modified_date': '2026-02-08'}


## 2. Vector Store Creation Function

Complete the function below to create a LanceDB vector store.


In [47]:
def create_vector_store(db_path: str = "./vectordb", table_name: str = "documents"):
  """
  Create a LanceDB vector store for storing document embeddings.

  TODO: Complete this function to create and configure a LanceDB vector store.
  HINT: Use LanceDBVectorStore with uri and table_name parameters

  Args:
      db_path (str): Path where the vector database will be stored
      table_name (str): Name of the table in the vector database

  Returns:
      LanceDBVectorStore: Configured vector store
  """
  try:
    import lancedb

    # Basic validation
    if not table_name or not table_name.strip():
        raise ValueError("vector_store.table_name is empty")

    # Only ensure local directories; skip for s3:// or gs://
    # Create storage directory
    if "://" not in db_path:
        Path(db_path).parent.mkdir(parents=True, exist_ok=True)

    # Connect (creates DB dir/files if needed)
    db = lancedb.connect(db_path)
    print(f"‚úì Connected to LanceDB at: {db_path}")

    # Create (instantiate) vector store
    vector_store = LanceDBVectorStore(uri=str(db_path), table_name=table_name)
    print(f"‚úì LanceDB vector store created (table: {table_name})")

    return vector_store
  except Exception as e:
      print(f"Error creating vector store: {e}")
      return None

# Test the function after you complete it
vector_store = create_vector_store("./assignment_vectordb", "documents")
print(f"Vector store created: {vector_store is not None}")


‚úì Connected to LanceDB at: ./assignment_vectordb
‚úì LanceDB vector store created (table: documents)
Vector store created: True


## 3. Vector Index Creation Function

Complete the function below to create a vector index from documents.


In [48]:
from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage
import time
from pathlib import Path
import lancedb # Import lancedb here for table checks

def create_vector_index(documents: List, vector_store):
    """
    Create a vector index from documents using the provided vector store.

    TODO: Complete this function to create a VectorStoreIndex from documents.
    HINT: Create StorageContext with vector_store, then use VectorStoreIndex.from_documents()

    Args:
        documents: List of documents to index
        vector_store: LanceDB vector store to use for storage

    Returns:
        VectorStoreIndex: The created vector index
    """
    index_path = Path("/content/index_store")
    index_path.mkdir(parents=True, exist_ok=True)

    index = None
    table_exists_and_has_data = False

    try:
        # Check if the LanceDB table actually exists and has data
        db_conn = lancedb.connect(vector_store.uri)
        if vector_store.table_name in db_conn.table_names():
            table_obj = db_conn.open_table(vector_store.table_name)
            if table_obj.count_rows() > 0:
                table_exists_and_has_data = True
        else:
            print(f"LanceDB table '{vector_store.table_name}' does not exist.")
    except Exception as e:
        print(f"Error checking LanceDB table status: {e}. Assuming table needs to be created/populated.")
        table_exists_and_has_data = False


    # Only attempt to load an existing index if the LanceDB table has data AND LlamaIndex metadata exists
    if table_exists_and_has_data and (index_path / "index_store.json").exists():
        print("Loading existing index from disk and LanceDB...")
        try:
            storage_context = StorageContext.from_defaults(
                persist_dir=str(index_path),
                vector_store=vector_store
            )
            index = load_index_from_storage(storage_context)
            print("‚úì Successfully loaded existing index")
        except Exception as e:
            print(f"Error loading existing index metadata (might be inconsistent with LanceDB data): {e}")
            print("Creating new index from documents...")
            index = None # Force recreation if loading fails
    else:
        print("LanceDB table is empty/non-existent or LlamaIndex metadata not found. Creating new index...")

    # If index is still None (either no existing data/metadata, or loading failed/skipped)
    if index is None:
        if not documents:
            print("x No documents to index")
            return None

        start_time = time.time()

        # Create storage context with vector store
        storage_context = StorageContext.from_defaults(vector_store=vector_store)

        # Create index with progress bar
        index = VectorStoreIndex.from_documents(
            documents,
            storage_context=storage_context,
            show_progress=True
        )

        end_time = time.time()
        print(f"‚úì Index created in {end_time - start_time:.2f} seconds")

        # Save index to storage (metadata only, actual vectors are in LanceDB)
        print("Saving index metadata to storage...")
        index.storage_context.persist(persist_dir=str(index_path))
        print("‚úì Index metadata saved successfully")

    return index

# Test the function after you complete it (will only work after previous functions are completed)
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print(f"Vector index created: {index is not None}")
else:
    print("Complete previous functions first to test this one")


Error checking LanceDB table status: 'LanceDBVectorStore' object has no attribute 'table_name'. Assuming table needs to be created/populated.
LanceDB table is empty/non-existent or LlamaIndex metadata not found. Creating new index...


Parsing nodes:   0%|          | 0/229 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/408 [00:00<?, ?it/s]

‚úì Index created in 258.40 seconds
Saving index metadata to storage...
‚úì Index metadata saved successfully
Vector index created: True


## 4. Document Search Function

Complete the function below to search for relevant documents using the vector index.


In [54]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

def search_documents(index, query: str, top_k: int = 3):
    """
    Search for relevant documents using the vector index.

    TODO: Complete this function to perform semantic search on the index.
    HINT: Use index.as_retriever() with similarity_top_k parameter, then retrieve(query)

    Args:
        index: Vector index to search
        query (str): Search query
        top_k (int): Number of top results to return

    Returns:
        List of retrieved document nodes
    """
    # TODO: Create retriever from index
    retriever = VectorIndexRetriever(index=index,similarity_top_k=top_k)
    print(f"‚úì Retriever configured to find top {top_k} similar chunks")

    query_engine = RetrieverQueryEngine(retriever=retriever)
    print("‚úì Query engine setup successfully")

    # TODO: Retrieve documents for the query
    results = query_engine.query(query)

    # PLACEHOLDER - Replace with actual implementation
    print(f"TODO: Search for '{query}' in index")
    return results.source_nodes # Return the list of source nodes


# Removed redundant API key settings. OpenRouter key should be set via getpass and LLM configured in setup_llamaindex_settings.

# Test the function after you complete it (will only work after all previous functions are completed)
if 'index' in locals() and index is not None:
    test_query = "What are AI agents?"
    results = search_documents(index, test_query, top_k=2)
    # print(results)
    print(f"Found {len(results)} results for query: '{test_query}'")
    for i, result in enumerate(results, 1):
        print(f"Result {i}: {result.text[:100] if hasattr(result, 'text') else 'No text'}...")
else:
    print("Complete all previous functions first to test this one")

‚úì Retriever configured to find top 2 similar chunks
‚úì Query engine setup successfully
TODO: Search for 'What are AI agents?' in index
Found 1 results for query: 'What are AI agents?'
Result 1: AI Agents vs. Agentic AI: A Conceptual
Taxonomy, Applications and Challenges
Ranjan Sapkota ‚àó‚Ä°, Kons...


## 5. Final Test - Complete Pipeline

Once you've completed all the functions above, run this cell to test the complete pipeline with multiple search queries.


In [55]:
# Final test of the complete pipeline
print("üöÄ Testing Complete Vector Database Pipeline")
print("=" * 50)

# Re-run the complete pipeline to ensure everything works
data_folder = "/content/drive/MyDrive/Outskill/rag_day7/assignments/papers/agents"
vector_db_path = "./assignment_vectordb"

# Step 1: Load documents
print("\nüìÇ Step 1: Loading documents...")
documents = load_documents_from_folder(data_folder)
print(f"   Loaded {len(documents)} documents")

# Step 2: Create vector store
print("\nüóÑÔ∏è Step 2: Creating vector store...")
vector_store = create_vector_store(vector_db_path)
print("   Vector store status:", "‚úÖ Created" if vector_store else "‚ùå Failed")

# Step 3: Create vector index
print("\nüîó Step 3: Creating vector index...")
if documents and vector_store:
    index = create_vector_index(documents, vector_store)
    print("   Index status:", "‚úÖ Created" if index else "‚ùå Failed")
else:
    index = None
    print("   ‚ùå Cannot create index - missing documents or vector store")

# Step 4: Test multiple search queries
print("\nüîç Step 4: Testing search functionality...")
if index:
    search_queries = [
        "What are AI agents?",
        "How to evaluate agent performance?",
        "Italian recipes and cooking",
        "Financial analysis and investment"
    ]

    for query in search_queries:
        print(f"\n   üîé Query: '{query}'")
        results = search_documents(index, query, top_k=2)

        if results:
            for i, result in enumerate(results, 1):
                text_preview = result.text[:100] if hasattr(result, 'text') else "No text available"
                score = f" (Score: {result.score:.4f})" if hasattr(result, 'score') else ""
                print(f"      {i}. {text_preview}...{score}")
        else:
            print("      No results found")
else:
    print("   ‚ùå Cannot test search - index not created")

print("\n" + "=" * 50)
print("üéØ Assignment Status:")
print(f"   Documents loaded: {'‚úÖ' if documents else '‚ùå'}")
print(f"   Vector store created: {'‚úÖ' if vector_store else '‚ùå'}")
print(f"   Index created: {'‚úÖ' if index else '‚ùå'}")
print(f"   Search working: {'‚úÖ' if index else '‚ùå'}")

if documents and vector_store and index:
    print("\nüéâ Congratulations! You've successfully completed the assignment!")
    print("   You've built a complete vector database with search functionality!")
else:
    print("\nüìù Please complete the TODO functions above to finish the assignment.")


üöÄ Testing Complete Vector Database Pipeline

üìÇ Step 1: Loading documents...
TODO: Load documents from /content/drive/MyDrive/Outskill/rag_day7/assignments/papers/agents
   Loaded 229 documents

üóÑÔ∏è Step 2: Creating vector store...
‚úì Connected to LanceDB at: ./assignment_vectordb
‚úì LanceDB vector store created (table: documents)
   Vector store status: ‚úÖ Created

üîó Step 3: Creating vector index...
Error checking LanceDB table status: 'LanceDBVectorStore' object has no attribute 'table_name'. Assuming table needs to be created/populated.
LanceDB table is empty/non-existent or LlamaIndex metadata not found. Creating new index...


Parsing nodes:   0%|          | 0/229 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/408 [00:00<?, ?it/s]

‚úì Index created in 256.73 seconds
Saving index metadata to storage...
‚úì Index metadata saved successfully
   Index status: ‚úÖ Created

üîç Step 4: Testing search functionality...

   üîé Query: 'What are AI agents?'
‚úì Retriever configured to find top 2 similar chunks
‚úì Query engine setup successfully
TODO: Search for 'What are AI agents?' in index
      1. AI Agents vs. Agentic AI: A Conceptual
Taxonomy, Applications and Challenges
Ranjan Sapkota ‚àó‚Ä°, Kons... (Score: 0.6715)

   üîé Query: 'How to evaluate agent performance?'
‚úì Retriever configured to find top 2 similar chunks
‚úì Query engine setup successfully
TODO: Search for 'How to evaluate agent performance?' in index
      1. steps, but the answers are limited to Yes/No responses [7]. As the industry continues to pivot towar... (Score: 0.6765)

   üîé Query: 'Italian recipes and cooking'
‚úì Retriever configured to find top 2 similar chunks
‚úì Query engine setup successfully
TODO: Search for 'Italian recipes a