# Multimodal RAG System Tutorial

This notebook extends our basic RAG system to handle multiple data types including PDFs, CSV files, JSON, Markdown, HTML, images, and audio files. We'll demonstrate the advanced capabilities of LlamaIndex's `SimpleDirectoryReader` for multimodal data processing.

## What's New in This Tutorial

Building upon our previous RAG system, we now add:
- **Multimodal Document Loading**: CSV, JSON, Markdown, HTML, Images, Audio
- **Advanced SimpleDirectoryReader Features**: File filtering, metadata extraction, custom processors
- **Cross-Modal Queries**: Search across different data types simultaneously
- **Structured Data Integration**: Combine tabular data with unstructured text
- **Visual Content Processing**: Extract information from images and charts

## Supported File Types (Per LlamaIndex Documentation)

According to the [SimpleDirectoryReader documentation](https://developers.llamaindex.ai/python/framework/module_guides/loading/simpledirectoryreader/), the following formats are automatically supported:

- **.csv** - comma-separated values
- **.docx** - Microsoft Word  
- **.epub** - EPUB ebook format
- **.hwp** - Hangul Word Processor
- **.ipynb** - Jupyter Notebook
- **.jpeg, .jpg** - JPEG image
- **.mbox** - MBOX email archive
- **.md** - Markdown
- **.mp3, .mp4** - audio and video
- **.pdf** - Portable Document Format
- **.png** - Portable Network Graphics
- **.ppt, .pptm, .pptx** - Microsoft PowerPoint
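
Before pointing the reader at a directory, it can help to see which files fall inside the list above. The sketch below uses only the standard library: `DOCUMENTED_EXTS` is copied from the list, and `split_by_support` is a hypothetical helper, not part of LlamaIndex. Files outside the list are not necessarily rejected by the reader (the HTML files in this notebook's dataset still load), so the sketch only reports them separately.

```python
from pathlib import Path
from typing import Dict, List

# Extensions copied from the documented list above.
DOCUMENTED_EXTS = {
    ".csv", ".docx", ".epub", ".hwp", ".ipynb", ".jpeg", ".jpg", ".mbox",
    ".md", ".mp3", ".mp4", ".pdf", ".png", ".ppt", ".pptm", ".pptx",
}

def split_by_support(file_names: List[str]) -> Dict[str, List[str]]:
    """Group file names by whether their extension is in DOCUMENTED_EXTS."""
    groups: Dict[str, List[str]] = {"documented": [], "other": []}
    for name in file_names:
        # Compare case-insensitively so "report.PDF" counts as a PDF.
        key = "documented" if Path(name).suffix.lower() in DOCUMENTED_EXTS else "other"
        groups[key].append(name)
    return groups

groups = split_by_support(["report.PDF", "notes.md", "page.html", "clip.mp3"])
print(groups)
```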


# Mount Google Drive for data and files

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Environment Setup and Configuration

First, let's set up our environment with hardcoded configurations. We'll use OpenRouter for the LLM and local embeddings for cost-effective processing.


In [2]:
!pip install -q -r "/content/requirements.txt"

In [3]:
import os
from getpass import getpass

# Securely input your key
os.environ["OPENROUTER_API_KEY"] = getpass("Enter your OpenRouter key: ")
print("✓ OpenRouter key set successfully")

Enter your OpenRouter key: ··········
✓ OpenRouter key set successfully


In [4]:
# Environment setup with hardcoded configurations
import time
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import pandas as pd
import json

from dotenv import load_dotenv

# Hardcoded configuration
CONFIG = {
    "llm_model": "gpt-5-mini",
    "embedding_model": "local:BAAI/bge-small-en-v1.5",
    "chunk_size": 1024,
    "chunk_overlap": 100,
    "similarity_top_k": 5,
    "data_path": "/content/drive/MyDrive/Outskill RAG/data",
    "vector_db_path": "/content/storage/multimodal_vectordb",
    "index_storage_path": "/content/storage/multimodal_index"
}

def setup_environment():
    """
    Setup environment variables and basic configuration.

    Returns:
        bool: Success status
    """
    # Load environment variables from .env file
    load_dotenv()

    # Disable tokenizer warning
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    # Check for required API key
    api_key = os.getenv("OPENROUTER_API_KEY")
    if not api_key:
        print("OPENROUTER_API_KEY not found in environment variables")
        print("Enter it with getpass in the previous cell, or add it to a .env file")
        return False

    print("✓ Environment variables loaded successfully")
    print(f"✓ LLM Model: {CONFIG['llm_model']}")
    print(f"✓ Embedding Model: {CONFIG['embedding_model']}")
    return True

# Run the setup
success = setup_environment()
if success:
    print("Environment setup complete!")
else:
    print("Environment setup failed!")


✓ Environment variables loaded successfully
✓ LLM Model: gpt-5-mini
✓ Embedding Model: local:BAAI/bge-small-en-v1.5
Environment setup complete!


## 2. LlamaIndex Configuration for Multimodal Data

Let's configure LlamaIndex with our hardcoded settings for OpenRouter LLM and local embeddings.


## Understanding Multimodal vs. Unimodal Index Creation in LlamaIndex

In the previous notebook, we built a unimodal RAG system that handled only PDFs (text documents).
In this notebook, we extend that system into a multimodal RAG system in which multiple file types, such as CSV, JSON, Markdown, HTML, images, and audio, can all be indexed and searched together.

This requires a slightly different approach to how indexes are created and reloaded, even though the core indexing process remains the same.

### 1. Key Concept

Both unimodal and multimodal systems use the same indexing API:

`index = VectorStoreIndex.from_documents(documents)`

However, they differ in how the index is reconstructed later.
This difference is subtle but important when designing persistent systems.

### 2. Index Loading Methods Compared

| Aspect | Unimodal (PDFs) | Multimodal (Mixed Files) | Explanation |
|--------|----------------------------|----------------------------|---------|
| **Document Types** | Single type (PDF papers) | Multiple types (PDF, CSV, HTML, Images, Audio) | Affects how metadata is structured |
| **Index Creation** | `VectorStoreIndex.from_documents()` | `VectorStoreIndex.from_documents()` | **Identical** |
| **Storage Context** | Full StorageContext (docstore, graph, index, vector) | Also Full StorageContext | **Same foundation** |
| **Index Reloading** | `load_index_from_storage()` | `VectorStoreIndex.from_vector_store()` | Multimodal often reloads via vector store only |
| **Metadata Complexity** | Simple (title, file path, pages) | Rich cross-modal metadata (data type, modality, structure) | Multimodal requires flexible schema |

### 3. Why the Loading Difference Exists

In unimodal systems, the data structure is predictable: almost every document is PDF text.

```python
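# ROBUST: Complete index reconstruction
# (index_path and vector_store are assumed to exist, as in Section 5)
from llama_index.core import StorageContext, load_index_from_storage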
storage_context = StorageContext.from_defaults(
    persist_dir=str(index_path),
    vector_store=vector_store
)
index = load_index_from_storage(storage_context)
# Perfect restoration with all metadata and relationships
```

LlamaIndex can perfectly restore every node, relationship, and metadata field.

In multimodal systems, however, some nodes may represent non-textual content (like images or tables). These relationships are harder to serialize and reconstruct identically, so a lighter approach is often used:

```python
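# BASIC: Vector-only reconstruction
# (index_path and vector_store are assumed to exist, as in Section 5)
from llama_index.core import StorageContext, VectorStoreIndex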
storage_context = StorageContext.from_defaults(
    persist_dir=str(index_path),
    vector_store=vector_store
)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    storage_context=storage_context
)
# May lose some complex relationships between file types
```

This reloads only the vector embeddings and their metadata, skipping the more complex graph reconstruction.
It's faster and usually sufficient for multimodal retrieval, but may omit fine-grained document links.

### 4. Why This Difference Matters

**For Unimodal Systems:**
- Documents are homogeneous (all PDFs)
- `load_index_from_storage()` ensures perfect reconstruction
- Critical for academic reproducibility

**For Multimodal Systems:**
- Documents are heterogeneous (PDFs, images, audio, CSV)
- `from_vector_store()` focuses on vector similarity
- Cross-modal relationships handled differently
- May prioritize performance over perfect metadata preservation
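
The choice between the two reload paths can be made explicit by checking what was persisted. The following is a sketch under assumptions: `choose_reload_strategy` is a hypothetical helper, and the file names `docstore.json` and `index_store.json` are assumed to be the defaults written when the storage context is persisted (the notebook itself only checks for `index_store.json`). It returns a label instead of calling LlamaIndex, so it runs without the library installed.

```python
import tempfile
from pathlib import Path

def choose_reload_strategy(index_dir: str) -> str:
    """Return which reload path fits the files found in index_dir.

    "full"        -> docstore and index store exist; load_index_from_storage() is an option
    "vector_only" -> the directory exists but is incomplete; fall back to from_vector_store()
    "rebuild"     -> the directory is missing; re-index from the documents
    """
    path = Path(index_dir)
    if not path.is_dir():
        return "rebuild"
    has_docstore = (path / "docstore.json").exists()        # assumed default file name
    has_index_store = (path / "index_store.json").exists()  # checked by the notebook too
    if has_docstore and has_index_store:
        return "full"
    return "vector_only"

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    print(choose_reload_strategy(tmp))  # vector_only (directory exists but is empty)
    for name in ("docstore.json", "index_store.json"):
        (Path(tmp) / name).write_text("{}")
    print(choose_reload_strategy(tmp))  # full
```

Returning a label keeps the decision separate from the loading call, which makes the fallback order easy to see and to change.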


### 5. The Multimodal Advantage

Despite the loading difference, multimodal indexing provides unique capabilities:

1. **Cross-Modal Search**: Find information across PDFs, images, and data files
2. **Rich Content Types**: Handle structured (CSV) and unstructured (text) data together  
3. **Audio Integration**: Include transcribed audio content in searches
4. **Visual Content**: Extract information from charts and diagrams
5. **Unified Knowledge Base**: Single search across all organizational content


In [5]:
# LlamaIndex configuration with hardcoded settings
from llama_index.core import Settings
from llama_index.llms.openrouter import OpenRouter
from llama_index.core.embeddings import resolve_embed_model
from llama_index.core.node_parser import SentenceSplitter

def configure_llamaindex_settings():
    """Configure LlamaIndex global settings using hardcoded configuration."""

    # Set up LLM with OpenRouter using hardcoded model
    Settings.llm = OpenRouter(
        api_key=os.getenv("OPENROUTER_API_KEY"),
        model=CONFIG["llm_model"]
    )
    print(f"✓ LLM configured: {CONFIG['llm_model']}")

    # Set up local embedding model (downloads locally first time, then cached)
    Settings.embed_model = resolve_embed_model(CONFIG["embedding_model"])
    print(f"✓ Embedding model configured: {CONFIG['embedding_model']}")

    # Set up node parser for chunking with hardcoded settings
    Settings.node_parser = SentenceSplitter(
        chunk_size=CONFIG["chunk_size"],
        chunk_overlap=CONFIG["chunk_overlap"]
    )
    print(f"✓ Text chunking configured: {CONFIG['chunk_size']} tokens with {CONFIG['chunk_overlap']} overlap")

# Configure the settings
configure_llamaindex_settings()
print("✓ LlamaIndex settings configured for multimodal processing")


✓ LLM configured: gpt-5-mini


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

✓ Embedding model configured: local:BAAI/bge-small-en-v1.5
✓ Text chunking configured: 1024 tokens with 100 overlap
✓ LlamaIndex settings configured for multimodal processing
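
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is an illustration only, not how `SentenceSplitter` works internally: `SentenceSplitter` measures chunks in tokens and prefers sentence boundaries, while this hypothetical `window_chunks` helper treats whitespace-separated words as tokens and cuts at fixed positions.

```python
from typing import List

def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in window_chunks(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With the notebook's values (1024 and 100), each window in this simplified scheme would advance by 924 positions, so neighbouring chunks share 100 of them.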


## 3. Exploring Our Multimodal Dataset

Let's examine the different types of files we have available for processing. This will show the diversity of data types that SimpleDirectoryReader can handle.


In [6]:
def explore_dataset(data_path: str = None):
    """
    Explore and categorize the files in our dataset by type.

    Args:
        data_path (str): Path to the data directory
    """
    if data_path is None:
        data_path = CONFIG["data_path"]

    data_dir = Path(data_path)
    if not data_dir.exists():
        print(f"Data directory not found: {data_dir}")
        return {}, []  # keep the (file_types, all_files) shape so callers can unpack it

    # Categorize files by type
    file_types = {}
    all_files = []

    # Walk through all files recursively
    for file_path in data_dir.rglob("*"):
        if file_path.is_file():
            suffix = file_path.suffix.lower()
            file_size = file_path.stat().st_size

            if suffix not in file_types:
                file_types[suffix] = []

            file_info = {
                "path": str(file_path),
                "name": file_path.name,
                "size_mb": round(file_size / (1024 * 1024), 2),
                "size_bytes": file_size
            }

            file_types[suffix].append(file_info)
            all_files.append(file_info)

    # Display summary
    print("---Dataset Overview---")
    print(f"Total files found: {len(all_files)}")

    print(f"\nFile Types Distribution:")
    for file_type, files in sorted(file_types.items()):
        if file_type:  # Skip files without extension
            total_size = sum(f["size_mb"] for f in files)
            print(f"  {file_type}: {len(files)} files ({total_size:.2f} MB)")

            # Show file details
            for file_info in files[:3]:  # Show first 3 files of each type
                print(f"    - {file_info['name']} ({file_info['size_mb']} MB)")
            if len(files) > 3:
                print(f"    ... and {len(files) - 3} more")

            print()

    return file_types, all_files

# Explore our dataset
file_types, all_files = explore_dataset()
print(f"✓ Found {len(all_files)} files across {len(file_types)} different file types")


---Dataset Overview---
Total files found: 21

File Types Distribution:
  .csv: 4 files (0.00 MB)
    - agent_performance_benchmark.csv (0.0 MB)
    - agent_evaluation_metrics.csv (0.0 MB)
    - investment_portfolio.csv (0.0 MB)
    ... and 1 more

  .html: 2 files (0.00 MB)
    - fitness_tracker.html (0.0 MB)
    - agent_tutorial.html (0.0 MB)

  .md: 4 files (0.00 MB)
    - market_analysis.md (0.0 MB)
    - recipe_instructions.md (0.0 MB)
    - agent_framework_comparison.md (0.0 MB)
    ... and 1 more

  .mp3: 3 files (2.95 MB)
    - in_the_end.mp3 (0.6 MB)
    - ai_agents.mp3 (1.54 MB)
    - rags.mp3 (0.81 MB)

  .pdf: 2 files (1.92 MB)
    - Emerging_Agent_Architectures.pdf (1.58 MB)
    - AI_Agent_Frameworks.pdf (0.34 MB)

  .png: 6 files (0.55 MB)
    - agent_performance_comparison.png (0.17 MB)
    - agent_types_comparison.png (0.1 MB)
    - stock_performance.png (0.13 MB)
    ... and 3 more

✓ Found 21 files across 6 different file types


## 4. Basic Multimodal Document Loading

Now let's use SimpleDirectoryReader to load all files from our data directory. This demonstrates the core multimodal capability.


### 🔍 Index Creation Implementation Note

The following implementation uses **multimodal-optimized loading** - notice the difference from the academic papers notebook:

#### Key Implementation Differences:

1. **Index Loading Method**: Uses `VectorStoreIndex.from_vector_store()` instead of `load_index_from_storage()`
2. **Reasoning**: Optimized for cross-modal search performance over perfect metadata preservation  
3. **Trade-off**: Slightly less metadata fidelity but better handling of diverse file types
4. **Benefit**: More flexible loading for heterogeneous document collections

This approach prioritizes the core multimodal capability while maintaining good performance and reliability.


In [7]:
from llama_index.core import SimpleDirectoryReader

def load_multimodal_documents(data_path: str = None, recursive: bool = True):
    """
    Load documents from multiple file types using SimpleDirectoryReader.

    Args:
        data_path (str): Path to directory containing multimodal data
        recursive (bool): Whether to search subdirectories

    Returns:
        List of Document objects
    """
    if data_path is None:
        data_path = CONFIG["data_path"]

    print(f"Loading multimodal documents from: {data_path}")

    # Create SimpleDirectoryReader with recursive search
    reader = SimpleDirectoryReader(
        input_dir=data_path,
        recursive=recursive,
        # Let SimpleDirectoryReader handle all supported file types automatically
    )

    print("Processing files...")
    start_time = time.time()

    # Load all documents
    documents = reader.load_data()

    end_time = time.time()

    print(f"Successfully loaded {len(documents)} documents in {end_time - start_time:.2f} seconds")

    # Analyze loaded documents by file type
    doc_types = {}
    for doc in documents:
        file_type = doc.metadata.get('file_type', 'unknown')
        if file_type not in doc_types:
            doc_types[file_type] = []
        doc_types[file_type].append(doc)

    print(f"\nDocuments by MIME type:")
    for mime_type, docs in sorted(doc_types.items()):
        print(f"  {mime_type}: {len(docs)} documents")

    return documents

# Load all multimodal documents
documents = load_multimodal_documents()

# Show sample document information
if documents:
    print(f"\nSample Document Analysis:")
    sample_doc = documents[0]
    print(f"File: {sample_doc.metadata.get('file_name', 'Unknown')}")
    print(f"Type: {sample_doc.metadata.get('file_type', 'Unknown')}")
    print(f"Size: {sample_doc.metadata.get('file_size', 0)} bytes")
    print(f"Text preview: {sample_doc.text[:200]}...")
    print(f"Metadata keys: {list(sample_doc.metadata.keys())}")


Loading multimodal documents from: /content/drive/MyDrive/Outskill RAG/data
Processing files...




Successfully loaded 42 documents in 50.80 seconds

Documents by MIME type:
  application/pdf: 23 documents
  audio/mpeg: 3 documents
  image/png: 6 documents
  text/csv: 4 documents
  text/html: 2 documents
  text/markdown: 4 documents

Sample Document Analysis:
File: AI_Agent_Frameworks.pdf
Type: application/pdf
Size: 360523 bytes
Text preview: A Comprehensive Survey of AI Agent Frameworks
and Their Applications in Financial Services
Satyadhar Joshi
Independent
Alumnus, International MBA, Bar-Ilan University, Israel
satyadhar.joshi@gmail.com...
Metadata keys: ['page_label', 'file_name', 'file_path', 'file_type', 'file_size', 'creation_date', 'last_modified_date']
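
The output above reports 42 documents for 21 files because some readers emit several `Document` objects per file: the two PDFs account for 23 documents, and the `page_label` metadata key suggests one document per page. A small sketch for checking this follows. It uses plain dictionaries as stand-ins for `doc.metadata`, the sample values are made up for illustration, and `documents_per_file` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable

def documents_per_file(metadata_list: Iterable[Dict[str, str]]) -> Counter:
    """Count how many documents share each file_name."""
    return Counter(meta.get("file_name", "unknown") for meta in metadata_list)

# Stand-in metadata; in the notebook this would be [doc.metadata for doc in documents].
sample_metadata = [
    {"file_name": "paper.pdf", "page_label": "1"},
    {"file_name": "paper.pdf", "page_label": "2"},
    {"file_name": "paper.pdf", "page_label": "3"},
    {"file_name": "table.csv"},
]

for name, count in documents_per_file(sample_metadata).most_common():
    print(f"{name}: {count} document(s)")
```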


## 5. Creating Multimodal Vector Index

Now let's create a vector index that can handle our multimodal documents using LanceDB for efficient storage and retrieval.


In [8]:
# Vector store and index creation
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

def create_multimodal_vector_store(vector_db_path: str = None):
    """Create and configure LanceDB vector store for multimodal data."""
    if vector_db_path is None:
        vector_db_path = CONFIG["vector_db_path"]

    try:
        import lancedb

        # Create storage directory
        Path(vector_db_path).parent.mkdir(parents=True, exist_ok=True)

        # Connect to LanceDB
        db = lancedb.connect(str(vector_db_path))
        print(f"✓ Connected to LanceDB at: {vector_db_path}")

        # Create vector store
        vector_store = LanceDBVectorStore(
            uri=str(vector_db_path),
            table_name="multimodal_documents"
        )
        print("✓ LanceDB vector store created for multimodal data")

        return vector_store

    except Exception as e:
        print(f"Error creating vector store: {e}")
        return None

def create_multimodal_index(documents: List,
                           vector_store,
                           index_storage_path: str = None,
                           force_rebuild: bool = False):
    """Create or load a multimodal vector index."""

    if index_storage_path is None:
        index_storage_path = CONFIG["index_storage_path"]

    index_path = Path(index_storage_path)
    index_path.mkdir(parents=True, exist_ok=True)

    # Check if index already exists
    index_store_file = index_path / "index_store.json"

    if not force_rebuild and index_store_file.exists():
        print("Loading existing multimodal index...")
        try:
            storage_context = StorageContext.from_defaults(
                persist_dir=str(index_path),
                vector_store=vector_store
            )

            index = VectorStoreIndex.from_vector_store(
                vector_store=vector_store,
                storage_context=storage_context
            )
            print("✓ Successfully loaded existing multimodal index")
            return index

        except Exception as e:
            print(f"Error loading existing index: {e}")
            print("Creating new index...")

    if not documents:
        print("x No documents to index")
        return None

    print("Creating new multimodal vector index...")
    start_time = time.time()

    # Create storage context with vector store
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Create index with progress bar
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )

    end_time = time.time()
    print(f"✓ Multimodal index created in {end_time - start_time:.2f} seconds")

    # Save index to storage
    print("Saving multimodal index to storage...")
    index.storage_context.persist(persist_dir=str(index_path))
    print("✓ Index saved successfully")

    return index

# Create vector store and index for multimodal data
print("Setting up multimodal vector storage...")
multimodal_vector_store = create_multimodal_vector_store()

if multimodal_vector_store and documents:
    multimodal_index = create_multimodal_index(
        documents=documents,
        vector_store=multimodal_vector_store,
        force_rebuild=False
    )

    if multimodal_index:
        print("✓ Multimodal RAG system ready for cross-modal queries!")
    else:
        print("x Failed to create multimodal index")
else:
    print("x Vector store creation failed or no documents available")


Setting up multimodal vector storage...




✓ Connected to LanceDB at: /content/storage/multimodal_vectordb
✓ LanceDB vector store created for multimodal data
Creating new multimodal vector index...


Parsing nodes:   0%|          | 0/42 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/54 [00:00<?, ?it/s]

✓ Multimodal index created in 27.16 seconds
Saving multimodal index to storage...
✓ Index saved successfully
✓ Multimodal RAG system ready for cross-modal queries!


## 6. Multimodal Query Engine and Cross-Modal Search

Now let's create a query engine that can search across all our different data types and demonstrate cross-modal queries.
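
Conceptually, retrieval with `similarity_top_k` amounts to ranking the stored chunk embeddings by their similarity to the query embedding and keeping the best few. The sketch below shows that idea with cosine similarity on tiny hand-written vectors. It is a conceptual illustration, not LanceDB's or LlamaIndex's implementation: `cosine` and `top_k` are hypothetical helpers, cosine similarity is assumed as the metric, and the three-dimensional vectors are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: Sequence[float], chunks: Dict[str, Sequence[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k chunk ids with the highest similarity to the query."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up embeddings standing in for chunks from different file types.
chunks = {
    "recipe.md": [0.9, 0.1, 0.0],
    "recipes.csv": [0.7, 0.3, 0.1],
    "paper.pdf": [0.0, 0.2, 0.9],
}
for chunk_id, score in top_k([1.0, 0.0, 0.0], chunks, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

Because every file type ends up as text embedded in the same vector space, a single ranking like this can return Markdown, CSV, and PDF chunks side by side.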


In [9]:
# Query engine setup
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

def setup_multimodal_query_engine(index, similarity_top_k: int = None):
    """Setup query engine for multimodal search."""
    if similarity_top_k is None:
        similarity_top_k = CONFIG["similarity_top_k"]

    if not index:
        print("x Index not available. Please create index first.")
        return None

    try:
        # Create retriever for multimodal search
        retriever = VectorIndexRetriever(
            index=index,
            similarity_top_k=similarity_top_k,
        )
        print(f"✓ Multimodal retriever configured to find top {similarity_top_k} similar chunks")

        # Create query engine
        query_engine = RetrieverQueryEngine(retriever=retriever)
        print("✓ Multimodal query engine setup successfully")

        return query_engine

    except Exception as e:
        print(f"x Error setting up query engine: {e}")
        return None

def search_multimodal_documents(query_engine, query: str, include_metadata: bool = True) -> Dict[str, object]:
    """Search across multimodal documents and return detailed results."""
    if not query_engine:
        return {
            "success": False,
            "error": "Query engine not initialized.",
            "response": "",
            "sources": [],
        }

    try:
        print(f"Searching across multimodal data: '{query}'")
        start_time = time.time()

        # Query the multimodal RAG system
        response = query_engine.query(query)

        end_time = time.time()

        # Extract source information from retrieved nodes
        sources = []
        if hasattr(response, "source_nodes"):
            for node in response.source_nodes:
                source_info = {
                    "text": (
                        node.text[:300] + "..."
                        if len(node.text) > 300
                        else node.text
                    ),
                    "score": getattr(node, "score", 0.0),
                }

                # Add metadata if available and requested
                if include_metadata and hasattr(node, "metadata"):
                    metadata = node.metadata
                    source_info.update({
                        "file_name": metadata.get("file_name", "Unknown"),
                        "file_type": metadata.get("file_type", "Unknown"),
                        "file_path": metadata.get("file_path", "Unknown"),
                        "file_size": metadata.get("file_size", 0),
                    })

                sources.append(source_info)

        result = {
            "success": True,
            "response": str(response),
            "sources": sources,
            "query": query,
            "search_time": end_time - start_time,
            "num_sources": len(sources),
        }

        print(f"✓ Search completed in {end_time - start_time:.2f} seconds")
        print(f"Found {len(sources)} relevant sources across different file types")

        return result

    except Exception as e:
        print(f"x Error during search: {e}")
        return {"success": False, "error": str(e), "response": "", "sources": []}

# Setup multimodal query engine
if 'multimodal_index' in locals() and multimodal_index:
    multimodal_query_engine = setup_multimodal_query_engine(multimodal_index)

    if multimodal_query_engine:
        print("✓ Multimodal query engine ready for cross-modal search!")
    else:
        print("x Failed to setup multimodal query engine")
else:
    print("x Multimodal index not available")


✓ Multimodal retriever configured to find top 5 similar chunks
✓ Multimodal query engine setup successfully
✓ Multimodal query engine ready for cross-modal search!


## 7. Interactive Multimodal Query Examples

Let's demonstrate the power of our multimodal RAG system with cross-modal queries that search across different data types simultaneously.


In [10]:
def ask_multimodal_question(query_engine, question: str, show_sources: bool = True):
    """
    Ask a custom question to the multimodal RAG system and display results.

    Args:
        query_engine: The configured multimodal query engine
        question (str): Your question about the multimodal data
        show_sources (bool): Whether to display source information
    """

    result = search_multimodal_documents(query_engine, question, include_metadata=True)

    if result["success"]:
        print(f"Answer: {result['response']}")
        print(f"\nSearch completed in {result['search_time']:.2f} seconds")
        print(f"Found {result['num_sources']} relevant sources across different data types")

        if show_sources and result["sources"]:
            # Show file type distribution
            file_types = {}
            for source in result["sources"]:
                file_type = source.get("file_type", "unknown")
                if file_type not in file_types:
                    file_types[file_type] = 0
                file_types[file_type] += 1

            print(f"\nSource File Types: {dict(file_types)}")

            print(f"\nTop Sources:")
            for i, source in enumerate(result["sources"][:3], 1):
                print(f"{i}. {source.get('file_name', 'Unknown')} ({source.get('file_type', 'Unknown')})")
                print(f"   Score: {source.get('score', 0):.3f}")
                print(f"   Content: {source['text'][:150]}...\n")

    else:
        print(f"x Error: {result['error']}")

# Demo queries for diverse data types
diverse_queries = [
    "What is the prep time for Spaghetti Carbonara?",  # Should hit cooking CSV
    "Which stock had the highest return in my portfolio?",  # Should hit finance CSV
    "What is the best time to visit Tokyo?",  # Should hit travel markdown
    "How many calories did I burn on Tuesday?",  # Should hit health HTML
    "What are the steps to make Carbonara?",  # Should hit cooking markdown
    "What was NVIDIA's performance?",  # Should hit finance data
]

# Run the first three demo queries
for i, question in enumerate(diverse_queries[:3], 1):
    print(f"\n{'-'*15} Query {i}: {question[:30]}... {'-'*15}")
    ask_multimodal_question(multimodal_query_engine, question, show_sources=True)

# Custom question area
print(f"\n{'-'*10} Custom Question {'-'*10}")
custom_question = "What is the prep time for Italian recipes?"
ask_multimodal_question(multimodal_query_engine, custom_question, show_sources=True)



--------------- Query 1: What is the prep time for Spag... ---------------
Searching across multimodal data: 'What is the prep time for Spaghetti Carbonara?'
✓ Search completed in 5.07 seconds
Found 5 relevant sources across different file types
Answer: 15 minutes.

Search completed in 5.07 seconds
Found 5 relevant sources across different data types

Source File Types: {'text/markdown': 1, 'text/csv': 2, 'audio/mpeg': 1, 'text/html': 1}

Top Sources:
1. recipe_instructions.md (text/markdown)
   Score: 0.649
   Content: # 🍝 Classic Spaghetti Carbonara Recipe

## Ingredients
- 400g spaghetti pasta
- 4 large egg yolks
- 100g pecorino romano cheese (grated)
- 150g guanci...

2. italian_recipes.csv (text/csv)
   Score: 0.481
   Content: Spaghetti Carbonara, Italian, 20, Easy, Pasta, 450
Margherita Pizza, Italian, 45, Medium, Tomato, 320
Risotto Milanese, Italian, 35, Hard, Rice, 380
T...

3. agent_performance_benchmark.csv (text/csv)
   Score: 0.394
   Content: ReAct-GPT4, reasoning,

In [11]:
custom_question = "in the end it doesn't even matter"  # a song lyric; tests retrieval from the audio transcripts
ask_multimodal_question(multimodal_query_engine, custom_question, show_sources=True)

Searching across multimodal data: 'in the end it doesn't even matter'
✓ Search completed in 10.46 seconds
Found 5 relevant sources across different file types
Answer: Empty Response

Search completed in 10.46 seconds
Found 5 relevant sources across different data types

Source File Types: {'audio/mpeg': 1, 'application/pdf': 4}

Top Sources:
1. in_the_end.mp3 (audio/mpeg)
   Score: 0.400
   Content: I tried so hard and got so far In the end, it doesn't even matter I had to fall to lose it all In the end, it doesn't even matter...

2. Emerging_Agent_Architectures.pdf (application/pdf)
   Score: 0.376
   Content: complete problems [16, 23, 32]. They often do this by breaking a larger problem into smaller subproblems, and then
solving each one with the appropria...

3. Emerging_Agent_Architectures.pdf (application/pdf)
   Score: 0.375
   Content: Message subscribing or filtering improves multi-agent
performance by ensuring agents only receive information relevant to their tasks.
In vert

## Conclusion

🎉 **Congratulations!** You have successfully built an advanced **Multimodal RAG System** using LlamaIndex's `SimpleDirectoryReader` with comprehensive cross-modal capabilities.

### What We Accomplished

This tutorial demonstrated building a RAG system that can handle multiple data types:

#### 1. **Multimodal Document Loading**
- ✅ **PDF Documents**: Academic research papers on AI agents
- ✅ **CSV Files**: Agent performance benchmarks and evaluation metrics
- ✅ **Markdown Files**: Framework comparisons and documentation
- ✅ **HTML Files**: Tutorial and instructional content
- ✅ **Image Files**: Charts, diagrams, and visual content
- ✅ **Audio Files**: Supplementary audio content

#### 2. **Key Features Implemented**
- ✅ **Hardcoded Configuration**: No external config files needed
- ✅ **Cross-Modal Search**: Query across all file types simultaneously
- ✅ **Semantic Similarity**: Find relevant content regardless of source format
- ✅ **Source Attribution**: Track which file types contributed to answers
- ✅ **LanceDB Vector Store**: Efficient multimodal document storage
- ✅ **OpenRouter Integration**: Using `gpt-5-mini` for response generation
- ✅ **Local Embeddings**: `BAAI/bge-small-en-v1.5` for cost-effective embedding

#### 3. **SimpleDirectoryReader Capabilities**
According to the [official documentation](https://developers.llamaindex.ai/python/framework/module_guides/loading/simpledirectoryreader/), `SimpleDirectoryReader` supports the basic call used in this notebook as well as several more advanced options:

```python
# Basic multimodal loading
SimpleDirectoryReader(input_dir=CONFIG["data_path"], recursive=True)

# Advanced features available
SimpleDirectoryReader(
    input_dir="path/to/directory",
    recursive=True,                    # Search subdirectories
    required_exts=[".pdf", ".csv"],    # Filter file types
    exclude=["file1.txt"],            # Exclude specific files
    file_metadata=custom_func,         # Custom metadata extraction
    num_files_limit=100,              # Limit number of files
    encoding="utf-8"                  # Specify encoding
)
```
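
The `file_metadata` argument in the snippet above expects a callable that takes a file path and returns a metadata dictionary. A minimal sketch of what `custom_func` could look like follows. The `modality` and `extension` keys and the extension-to-modality mapping are assumptions chosen for this dataset, not LlamaIndex conventions. Because a custom function appears to replace the default metadata rather than extend it, the sketch also sets `file_name` itself.

```python
from pathlib import Path
from typing import Dict

# Assumed mapping from extension to a coarse modality label.
MODALITY_BY_EXT = {
    ".pdf": "text", ".md": "text", ".html": "text",
    ".csv": "table",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp3": "audio",
}

def custom_func(file_path: str) -> Dict[str, str]:
    """Build a metadata dict for one file, keyed on its extension."""
    path = Path(file_path)
    ext = path.suffix.lower()
    return {
        "file_name": path.name,
        "extension": ext,
        "modality": MODALITY_BY_EXT.get(ext, "unknown"),
    }

print(custom_func("/data/finance/stock_performance.png"))
```

A label like `modality` could later be used to filter or group retrieved sources by kind of content instead of by MIME type.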

### Real-World Applications

This multimodal RAG system can be applied to:

- **Research and Academia**: Query across papers, datasets, and supplementary materials
- **Documentation Systems**: Search technical docs, tutorials, configs, and examples
- **Business Intelligence**: Combine reports, spreadsheets, presentations, and recordings
- **Content Management**: Organize and search diverse content libraries
- **Knowledge Bases**: Build comprehensive Q&A systems with diverse source materials

### Next Steps

1. **Extend File Types**: Add `.docx`, `.pptx`, `.epub` support
2. **Custom Metadata**: Implement domain-specific metadata extraction
3. **Hybrid Search**: Combine vector search with keyword search
4. **Performance Optimization**: Use iterative loading for large datasets
5. **Multi-Language Support**: Test with international documents
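
For the hybrid search item, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). This technique is swapped in for illustration; the notebook does not implement it, and the helper name `reciprocal_rank_fusion`, the constant `k=60`, and the sample rankings are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked id lists; each appearance adds 1 / (k + rank) to an id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Made-up rankings standing in for vector-search and keyword-search results.
vector_hits = ["recipe.md", "recipes.csv", "tutorial.html"]
keyword_hits = ["recipes.csv", "benchmark.csv", "recipe.md"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF only needs rank positions, so it sidesteps the problem that vector scores and keyword scores live on different scales.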

### Usage Tips

- **Query Optimization**: Use specific queries that benefit from cross-modal information
- **File Organization**: Structure data directories logically
- **Custom Questions**: Modify the `custom_question` variable to test your own queries
- **Monitor Sources**: Check file type distribution in results to understand retrieval patterns
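
To monitor sources systematically, the `sources` list returned by `search_multimodal_documents` can be summarized after each query. The sketch below assumes each source is a dictionary with `file_type` and `score` keys, as built in the notebook. `summarize_sources` and the `min_score` threshold of 0.45 are hypothetical, and the sample entries mirror the top sources of the first demo query.

```python
from collections import Counter
from typing import Dict, List, Tuple

def summarize_sources(sources: List[Dict], min_score: float = 0.0) -> Tuple[Counter, List[Dict]]:
    """Return file-type counts and the sources whose score is at least min_score."""
    # A missing or None score is treated as 0.0.
    kept = [s for s in sources if (s.get("score") or 0.0) >= min_score]
    counts = Counter(s.get("file_type", "unknown") for s in kept)
    return counts, kept

sources = [
    {"file_name": "recipe_instructions.md", "file_type": "text/markdown", "score": 0.649},
    {"file_name": "italian_recipes.csv", "file_type": "text/csv", "score": 0.481},
    {"file_name": "agent_performance_benchmark.csv", "file_type": "text/csv", "score": 0.394},
]
counts, kept = summarize_sources(sources, min_score=0.45)
print(dict(counts))
print([s["file_name"] for s in kept])
```

A threshold like this makes weak matches visible, for example unrelated files that appear only because the retriever always returns the top five chunks.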

Happy building with multimodal RAG! 🚀📚🔍

---

**Ready to explore?** Run the cells above and try your own questions with the interactive query interface!
