# 17. Multimodal RAG - Images + Text 🖼️

**Complexity:** ⭐⭐⭐⭐ | **Duration:** ~25-30 minutes

---

## Overview

**Multimodal RAG** extends traditional text-based RAG to handle **images, diagrams, charts, and visual content** alongside text. This is essential for:

- 📊 **Technical documentation** with diagrams
- 📄 **PDF reports** with charts and tables
- 🏗️ **Architectural drawings** and blueprints
- 📸 **Visual Q&A** systems
- 🎨 **Design documents** with mockups

### Key Technologies

1. **Vision-capable LLM (GPT-4o, the successor to GPT-4 Vision / GPT-4V)**: Multimodal model that understands images
2. **OCR (Tesseract)**: Extract text from images
3. **PDF Processing**: Render PDF pages as images (`pdf2image`, backed by Poppler)
4. **Image Embeddings**: Vector representations of images

### Architecture Pattern

```
Document (PDF/HTML)
  ├─→ Extract Text → Embed → Vector Store
  └─→ Extract Images → OCR/Vision → Embed → Vector Store
                              ↓
Query → Retrieve (Text + Images) → GPT-4V → Answer
```

### When to Use

- ✅ Documents contain critical visual information
- ✅ Charts/graphs convey key data
- ✅ Diagrams explain complex processes
- ✅ Tables contain structured information
- ❌ Purely textual documents (use standard RAG)
- ❌ Real-time video analysis (different architecture)

---

## Prerequisites

```bash
pip install pillow pytesseract pdf2image
```

**System Requirements:**
- Tesseract OCR installed: `brew install tesseract` (macOS) or `apt-get install tesseract-ocr` (Linux)
- Poppler installed (for PDF processing): `brew install poppler` (macOS) or `apt-get install poppler-utils` (Linux)
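
Because `pytesseract` and `pdf2image` are thin wrappers that shell out to system executables, a missing binary only surfaces when the first image is processed. The sketch below is one way to check up front. The function name `check_system_dependencies` is hypothetical, and it assumes the executables are named `tesseract` and `pdftoppm` (the latter ships with Poppler) and are expected on `PATH`.

```python
import shutil
from typing import Dict

# Executables the Python wrappers call. "pdftoppm" is part of Poppler.
REQUIRED_BINARIES = {"tesseract": "Tesseract OCR", "pdftoppm": "Poppler"}


def check_system_dependencies() -> Dict[str, bool]:
    """Return {binary_name: found_on_PATH} for each required executable."""
    return {name: shutil.which(name) is not None for name in REQUIRED_BINARIES}


if __name__ == "__main__":
    for name, found in check_system_dependencies().items():
        status = "found" if found else "MISSING"
        print(f"{REQUIRED_BINARIES[name]} ({name}): {status}")
```

Running this before the notebook makes an installation problem visible immediately rather than as an OCR failure later on.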

---

## Setup

Import dependencies and configure environment.

In [None]:
import sys
import base64
from io import BytesIO
from pathlib import Path
from typing import List, Dict, Any

# Add project root to path
sys.path.append('../..')

# Core dependencies
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Image processing
from PIL import Image
import pytesseract

# Shared utilities
from shared import (
    load_vector_store,
    save_vector_store,
    format_docs,
    print_section_header,
    print_results,
    VECTOR_STORE_DIR,
    SECTION_WIDTH
)

print("=" * SECTION_WIDTH)
print("MULTIMODAL RAG SETUP")
print("=" * SECTION_WIDTH)
print("\n✅ Imports successful")
print(f"✅ Vector store directory: {VECTOR_STORE_DIR}")

## 1. Image Processing Utilities

Functions to handle image extraction, OCR, and encoding.

In [None]:
def encode_image_to_base64(image: Image.Image) -> str:
    """
    Encode PIL Image to base64 string for API transmission.
    
    Args:
        image: PIL Image object
        
    Returns:
        Base64 encoded string
    """
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode('utf-8')


def extract_text_from_image(image: Image.Image) -> str:
    """
    Extract text from image using Tesseract OCR.
    
    Args:
        image: PIL Image object
        
    Returns:
        Extracted text string
    """
    try:
        text = pytesseract.image_to_string(image)
        return text.strip()
    except Exception as e:
        print(f"⚠️  OCR failed: {e}")
        return ""


def describe_image_with_vision(image: Image.Image, prompt: str = "Describe this image in detail") -> str:
    """
    Use GPT-4 Vision to describe image content.
    
    Args:
        image: PIL Image object
        prompt: Description prompt
        
    Returns:
        Image description from GPT-4V
    """
    # Encode image
    base64_image = encode_image_to_base64(image)
    
    # Initialize a vision-capable model (GPT-4o)
    vision_model = ChatOpenAI(model="gpt-4o", temperature=0)
    
    # Create message with image
    message = HumanMessage(
        content=[
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{base64_image}"}
            }
        ]
    )
    
    # Get description
    response = vision_model.invoke([message])
    return response.content


print("✅ Image processing utilities defined")
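
To make the message format used by `describe_image_with_vision` easier to inspect, here is a minimal standard-library sketch of the same structure: a text part followed by an `image_url` part carrying a base64 data URL. The helper names (`to_data_url`, `build_vision_content`, `decode_data_url`) are hypothetical, and raw bytes stand in for a PIL image so the sketch runs without Pillow or an API key.

```python
import base64
from typing import Any, Dict, List


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw image bytes in a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_content(prompt: str, image_bytes: bytes) -> List[Dict[str, Any]]:
    """Build the content list: one text part followed by one image part."""
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": to_data_url(image_bytes)}},
    ]


def decode_data_url(data_url: str) -> bytes:
    """Recover the original bytes from a base64 data URL."""
    header, encoded = data_url.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    return base64.b64decode(encoded)


# The 8-byte PNG signature stands in for a real encoded image.
sample_bytes = b"\x89PNG\r\n\x1a\n"
content = build_vision_content("Describe this image in detail", sample_bytes)
```

Decoding the URL again is a quick way to confirm that the payload sent to the model is byte-for-byte the image you encoded.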

## 2. Create Sample Documents with Images

For demonstration, we'll create synthetic documents that combine text and image descriptions.

In [None]:
# Sample documents with embedded image descriptions
multimodal_docs = [
    {
        "text": """LangChain Expression Language (LCEL) Architecture
        
        LCEL is a declarative way to compose chains. The architecture follows a pipe operator pattern
        where components are connected using the | operator. This enables:
        - Composability: Chain components together
        - Streaming: Stream tokens as they're generated
        - Async: Run chains concurrently
        - Fallbacks: Add error handling easily
        
        The typical LCEL chain looks like: prompt | model | output_parser
        """,
        "image_description": """[DIAGRAM] The diagram shows three boxes connected by pipe operators:
        Box 1: 'ChatPromptTemplate' with input variables (context, input)
        Box 2: 'ChatOpenAI' (gpt-4o-mini) with temperature=0
        Box 3: 'StrOutputParser' outputting final string
        Arrows flow left to right showing data transformation at each stage.""",
        "metadata": {"source": "langchain_docs", "type": "architecture_diagram"}
    },
    {
        "text": """RAG Performance Benchmarks
        
        Performance comparison across different RAG architectures shows significant trade-offs
        between speed and quality. Simple RAG offers the fastest response times at ~2 seconds,
        while Agentic RAG provides the highest quality at the cost of 20-40 second latency.
        
        Cost per query ranges from $0.00036 for Simple RAG to $0.00360 for Agentic RAG,
        representing a 10x difference.
        """,
        "image_description": """[CHART] Bar chart showing:
        X-axis: Architecture names (Simple RAG, Memory RAG, Branched RAG, HyDe, Adaptive RAG, CRAG, Self-RAG, Agentic RAG)
        Y-axis Left: Latency in seconds (bars in blue)
        Y-axis Right: Cost per query in dollars (line in red)
        The chart clearly shows latency increasing from 2s to 40s, and cost increasing from $0.00036 to $0.00360.
        Agentic RAG has the highest bars and highest cost point.""",
        "metadata": {"source": "performance_docs", "type": "performance_chart"}
    },
    {
        "text": """Vector Store Architecture
        
        FAISS (Facebook AI Similarity Search) is used for efficient vector similarity search.
        The architecture consists of:
        1. Document Embedding: Convert text to 1536-dimensional vectors (OpenAI)
        2. Index Building: Create FAISS index for fast nearest neighbor search
        3. Persistence: Save index to disk for reuse
        4. Retrieval: Query the index with embedded questions
        
        FAISS supports billions of vectors with millisecond query times.
        """,
        "image_description": """[FLOWCHART] Process flow diagram:
        Step 1: 'Documents' (blue box) → 'Text Splitter' (green box) produces 'Chunks'
        Step 2: 'Chunks' → 'Embeddings' (orange box) produces 'Vectors (1536d)'
        Step 3: 'Vectors' → 'FAISS Index' (purple box) with note 'Save to disk'
        Step 4: 'Query' → 'Embed Query' → 'Search Index' → 'Top-k Documents'
        Dashed line shows persistence path from FAISS Index to disk storage.""",
        "metadata": {"source": "architecture_docs", "type": "flowchart"}
    },
    {
        "text": """Contextual RAG Results
        
        Anthropic's Contextual RAG technique shows impressive improvements in retrieval quality.
        By prepending document-level context to each chunk before embedding, we see:
        - 15-30% improvement in retrieval precision
        - Minimal query-time overhead (context added during indexing)
        - Better semantic matching for technical documents
        
        The technique is especially effective for code documentation and API references.
        """,
        "image_description": """[TABLE] Comparison table with 3 columns:
        Column 1: Metric | Column 2: Simple RAG | Column 3: Contextual RAG
        Row 1: Precision@5 | 68% | 87% (+19%)
        Row 2: Recall@10 | 72% | 89% (+17%)
        Row 3: Query Time | 1.8s | 2.1s (+0.3s)
        Row 4: Index Size | 12MB | 15MB (+25%)
        Green highlighting on Contextual RAG improvements.""",
        "metadata": {"source": "evaluation_docs", "type": "comparison_table"}
    }
]

# Match the bracketed tag: a bare 'CHART' substring test would also count '[FLOWCHART]'
print(f"✅ Created {len(multimodal_docs)} multimodal documents")
print(f"   - {sum(1 for d in multimodal_docs if '[DIAGRAM]' in d['image_description'])} diagrams")
print(f"   - {sum(1 for d in multimodal_docs if '[CHART]' in d['image_description'])} charts")
print(f"   - {sum(1 for d in multimodal_docs if '[FLOWCHART]' in d['image_description'])} flowcharts")
print(f"   - {sum(1 for d in multimodal_docs if '[TABLE]' in d['image_description'])} tables")
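
Each image description starts with a bracketed type tag such as `[CHART]` or `[FLOWCHART]`. A plain substring test like `'CHART' in description` also matches `[FLOWCHART]`, so one option is to parse the leading tag and tally it. The sketch below assumes the tag is always the first token and consists of uppercase letters; `visual_type` and `count_visual_types` are hypothetical helper names.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Leading tag such as "[CHART]", allowing whitespace before it.
TAG_PATTERN = re.compile(r"^\s*\[([A-Z]+)\]")


def visual_type(description: str) -> Optional[str]:
    """Return the leading tag name, or None when the description has no tag."""
    match = TAG_PATTERN.match(description)
    return match.group(1) if match else None


def count_visual_types(descriptions: Iterable[str]) -> Counter:
    """Tally descriptions by their leading tag, skipping untagged ones."""
    return Counter(t for t in map(visual_type, descriptions) if t is not None)


sample_descriptions = [
    "[DIAGRAM] The diagram shows three boxes",
    "[CHART] Bar chart showing latency",
    "[FLOWCHART] Process flow diagram",
    "[TABLE] Comparison table with 3 columns",
]
counts = count_visual_types(sample_descriptions)
```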

## 3. Build Multimodal Vector Store

Combine text and image descriptions into a unified retrieval system.

In [None]:
# Create LangChain Document objects
langchain_docs = []

for doc in multimodal_docs:
    # Combine text and image description for embedding
    combined_content = f"{doc['text']}\n\n[VISUAL CONTENT]: {doc['image_description']}"
    
    langchain_docs.append(
        Document(
            page_content=combined_content,
            metadata=doc['metadata']
        )
    )

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(langchain_docs, embeddings)

# Save for reuse
save_vector_store(vectorstore, VECTOR_STORE_DIR / "multimodal")

print("\n✅ Multimodal vector store created")
print(f"   Total documents: {len(langchain_docs)}")
print(f"   Saved to: {VECTOR_STORE_DIR / 'multimodal'}")
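
To see why the image description is appended to the text before embedding, the toy sketch below compares a text-only index with a combined one. It is not how `OpenAIEmbeddings` or FAISS work internally: word-count vectors and cosine similarity are stand-ins chosen so the example runs offline, and the two sample documents and all function names are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs: List[Dict[str, str]], include_images: bool) -> List[Tuple[str, Counter]]:
    """Embed each document, optionally appending its image description."""
    index = []
    for doc in docs:
        content = doc["text"]
        if include_images:
            content += "\n\n[VISUAL CONTENT]: " + doc["image_description"]
        index.append((doc["id"], embed(content)))
    return index


def search(index: List[Tuple[str, Counter]], query: str) -> str:
    """Return the id of the most similar document."""
    query_vec = embed(query)
    return max(index, key=lambda item: cosine(query_vec, item[1]))[0]


docs = [
    {"id": "benchmarks",
     "text": "Latency and cost differ across RAG architectures.",
     "image_description": "[CHART] Bar chart with blue latency bars and a red cost line."},
    {"id": "faiss",
     "text": "FAISS builds an index for nearest neighbor search.",
     "image_description": "[FLOWCHART] Boxes for documents, chunks, vectors, and a purple index box."},
]

query = "which figure has a purple box"
text_only = build_index(docs, include_images=False)
combined = build_index(docs, include_images=True)
```

With the text-only index, the query shares no words with either document, so nothing distinguishes them. Once the descriptions are included, the flowchart document becomes the best match because the visual details are now part of the embedded content.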

## 4. Standard Multimodal RAG

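Query the multimodal vector store and generate answers using GPT-4o. Although GPT-4o has vision capabilities, this chain sends it text only: each image is represented by the description embedded alongside its document.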

In [None]:
from shared.prompts import RAG_PROMPT_TEMPLATE

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# Build multimodal RAG chain with GPT-4o (supports vision context)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

multimodal_chain = (
    {"context": retriever | format_docs, "input": RunnablePassthrough()}
    | RAG_PROMPT_TEMPLATE
    | llm
    | StrOutputParser()
)

print("✅ Multimodal RAG chain created with GPT-4o")
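
The dictionary at the head of the chain fans the query out: one copy goes through the retriever and `format_docs` to become `context`, the other is passed through unchanged as `input`. The plain-Python sketch below mirrors that data flow with ordinary functions. Every component here is a hypothetical stand-in (the real `format_docs` and `RAG_PROMPT_TEMPLATE` live in `shared` and may behave differently), and the stub model returns a string directly, so no output parser is needed.

```python
from typing import Callable, Dict, List


def make_chain(
    retrieve: Callable[[str], List[str]],
    format_docs: Callable[[List[str]], str],
    render_prompt: Callable[[Dict[str, str]], str],
    llm: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose the stages into one callable, mirroring the piped chain."""

    def chain(query: str) -> str:
        # Fan-out: the query feeds the retriever and is also passed through as-is.
        inputs = {"context": format_docs(retrieve(query)), "input": query}
        prompt = render_prompt(inputs)
        return llm(prompt)

    return chain


# Hypothetical stand-ins so the sketch runs without an API key.
def fake_retrieve(query: str) -> List[str]:
    return [
        "LCEL connects components with the | operator.",
        "[VISUAL CONTENT]: [DIAGRAM] three boxes connected by pipe operators",
    ]


def join_docs(docs: List[str]) -> str:
    return "\n\n".join(docs)


def simple_prompt(inputs: Dict[str, str]) -> str:
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['input']}"


def echo_llm(prompt: str) -> str:
    # Echo the prompt so the exact model input is visible.
    return f"(model saw {len(prompt)} characters)\n{prompt}"


chain = make_chain(fake_retrieve, join_docs, simple_prompt, echo_llm)
answer = chain("How does LCEL chain composition work?")
```

Printing `answer` shows that the diagram description reaches the model as ordinary text inside the context.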

### Test Query 1: LCEL Architecture

Ask about LCEL chain composition (references diagram).

In [None]:
query1 = "How does LCEL chain composition work? What are the main components?"

print_section_header("Query 1: LCEL Architecture")
response1 = multimodal_chain.invoke(query1)
print_results(query1, response1)

### Test Query 2: Performance Comparison

Ask about performance trade-offs (references chart).

In [None]:
query2 = "What are the performance trade-offs between Simple RAG and Agentic RAG?"

print_section_header("Query 2: Performance Trade-offs")
response2 = multimodal_chain.invoke(query2)
print_results(query2, response2)

### Test Query 3: Technical Details

Ask about Contextual RAG improvements (references table).

In [None]:
query3 = "What improvement metrics does Contextual RAG achieve compared to Simple RAG?"

print_section_header("Query 3: Contextual RAG Metrics")
response3 = multimodal_chain.invoke(query3)
print_results(query3, response3)

## 5. Advanced: Direct Image Analysis

For scenarios where you have actual image files, use GPT-4 Vision directly.

In [None]:
def analyze_image_with_rag(image_path: str, question: str) -> str:
    """
    Analyze an image using GPT-4 Vision, then use RAG for additional context.
    
    Args:
        image_path: Path to image file
        question: Question about the image
        
    Returns:
        Combined answer from vision + RAG
    """
    # Load image
    image = Image.open(image_path)
    
    # Get vision description
    vision_prompt = f"""Analyze this image and answer the question: {question}
    
    Provide a detailed technical description of what you see, focusing on elements
    relevant to the question."""
    
    vision_description = describe_image_with_vision(image, vision_prompt)
    
    # Use vision description to retrieve relevant context
    relevant_docs = retriever.invoke(vision_description)
    context = format_docs(relevant_docs)
    
    # Generate final answer combining vision + RAG
    final_prompt = f"""Based on the image analysis and retrieved context, answer the question.
    
    Image Analysis:
    {vision_description}
    
    Retrieved Context:
    {context}
    
    Question: {question}
    
    Provide a comprehensive answer that combines insights from both the image and the context."""
    
    response = llm.invoke(final_prompt)
    return response.content

print("✅ Advanced image analysis function defined")
print("   Use: analyze_image_with_rag('path/to/image.png', 'your question')")

## 6. Production Optimizations

Best practices for multimodal RAG in production.

### 6.1 Image Preprocessing Pipeline

```python
def preprocess_image(image: Image.Image) -> Image.Image:
    """
    Optimize image for vision model:
    - Resize to max 2048x2048 (GPT-4V limit), preserving aspect ratio
    - Convert to RGB if needed

    Returns a new image; the input is left unmodified.
    Compression for payloads over 20MB is not handled here.
    """
    max_size = 2048

    # Work on a copy: thumbnail() resizes in place
    image = image.copy()

    # Resize if too large
    if max(image.size) > max_size:
        image.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)

    # Convert to RGB
    if image.mode != 'RGB':
        image = image.convert('RGB')

    return image
```
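
The resize step comes down to a small piece of arithmetic: scale both sides by the same factor so the longer side lands on the limit. The standard-library sketch below shows that calculation. `fit_within` is a hypothetical helper, and Pillow's own `thumbnail()` may round slightly differently, so treat the result as an estimate of the output size rather than a guarantee.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Aspect ratio is preserved and images that already fit are returned unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Never round a side down to zero for extreme aspect ratios
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4096x2048 scan comes out at 2048x1024, while an 800x600 screenshot is left alone.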

### 6.2 Caching Strategy

```python
# Cache vision descriptions to avoid redundant API calls
import hashlib

vision_cache = {}

def cached_describe_image(image: Image.Image, prompt: str) -> str:
    # Generate cache key from image + prompt
    image_bytes = BytesIO()
    image.save(image_bytes, format='PNG')
    image_hash = hashlib.md5(image_bytes.getvalue()).hexdigest()
    cache_key = f"{image_hash}_{prompt}"
    
    if cache_key in vision_cache:
        return vision_cache[cache_key]
    
    description = describe_image_with_vision(image, prompt)
    vision_cache[cache_key] = description
    
    return description
```
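
An in-memory dictionary is lost whenever the process restarts, so a natural extension is to persist descriptions to disk. The sketch below is one possible approach, not part of the notebook: `DiskDescriptionCache` is a hypothetical class, it stores entries in a single JSON file (fine for small collections, not for concurrent writers), it uses SHA-256 over the image bytes and prompt as the key, and it takes the describing function as a parameter so it can be exercised with a stub.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class DiskDescriptionCache:
    """Persist image descriptions in a JSON file keyed by a content hash."""

    def __init__(self, cache_path: Path):
        self.cache_path = Path(cache_path)
        self._entries: Dict[str, str] = {}
        if self.cache_path.exists():
            self._entries = json.loads(self.cache_path.read_text(encoding="utf-8"))

    @staticmethod
    def make_key(image_bytes: bytes, prompt: str) -> str:
        """Hash image bytes and prompt together, separated by a NUL byte."""
        digest = hashlib.sha256()
        digest.update(image_bytes)
        digest.update(b"\x00")
        digest.update(prompt.encode("utf-8"))
        return digest.hexdigest()

    def get_or_compute(
        self, image_bytes: bytes, prompt: str, describe: Callable[[bytes, str], str]
    ) -> str:
        """Return the cached description, calling describe() only on a miss."""
        key = self.make_key(image_bytes, prompt)
        if key not in self._entries:
            self._entries[key] = describe(image_bytes, prompt)
            self.cache_path.write_text(json.dumps(self._entries), encoding="utf-8")
        return self._entries[key]


# Demo with a stub in place of the vision call.
calls: List[str] = []


def fake_describe(image_bytes: bytes, prompt: str) -> str:
    calls.append(prompt)
    return "a bar chart"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "vision_cache.json"
    first = DiskDescriptionCache(cache_file).get_or_compute(b"img", "Describe", fake_describe)
    # A fresh instance reads the file back, simulating a process restart.
    second = DiskDescriptionCache(cache_file).get_or_compute(b"img", "Describe", fake_describe)
```

In the demo the stub is called once even though two cache instances are created, because the second one loads the entry from disk.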

### 6.3 Batch Processing

```python
from concurrent.futures import ThreadPoolExecutor

def batch_process_images(image_paths: List[str], max_workers: int = 4) -> List[str]:
    """
    Process multiple images in parallel. Results keep the order of image_paths.
    """
    def describe_path(path: str) -> str:
        # Open inside the worker and close the file handle when done
        with Image.open(path) as image:
            return describe_image_with_vision(image)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(describe_path, image_paths))
```
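
With `executor.map`, the first exception raised by any worker propagates when results are collected, so one unreadable file discards the whole batch. The sketch below shows one way to isolate failures per image using `submit` and `as_completed`. The name `batch_describe` is hypothetical, and the describing function is passed in so the example runs with a stub instead of a vision call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def batch_describe(
    image_paths: List[str],
    describe_path: Callable[[str], str],
    max_workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Describe images in parallel, returning (descriptions, errors) keyed by path."""
    descriptions: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_path = {executor.submit(describe_path, p): p for p in image_paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                descriptions[path] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the remaining images
                errors[path] = f"{type(exc).__name__}: {exc}"
    return descriptions, errors


def fake_describe_path(path: str) -> str:
    """Stub for the vision call: rejects anything that is not a .png."""
    if not path.endswith(".png"):
        raise ValueError("not an image")
    return f"description of {path}"


descriptions, errors = batch_describe(["a.png", "notes.txt", "b.png"], fake_describe_path)
```

Returning two dictionaries keyed by path keeps successes and failures separate, which makes it straightforward to retry only the failed items.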

### 6.4 Cost Optimization

**GPT-4 Vision Costs (as of 2025):**
- Low detail: $0.00765 per image
- High detail: ~$0.01590 per image (depends on resolution)

**Optimization strategies:**
1. Use OCR for text-heavy images (cheaper than vision)
2. Cache descriptions aggressively
3. Use low-detail mode when possible
4. Batch process during off-peak hours
5. Consider GPT-4o-mini for simpler images
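
As a rough worked example of how caching and detail level interact, the sketch below multiplies the per-image figures quoted above by the number of images that actually reach the API. The function `estimate_vision_cost` and the `cache_hit_rate` parameter are illustrative assumptions, and the prices are simply the ones listed in this section, so re-check them against current pricing before budgeting.

```python
# Per-image prices as quoted in this section (USD)
COST_PER_IMAGE = {"low": 0.00765, "high": 0.01590}


def estimate_vision_cost(
    num_images: int, detail: str = "low", cache_hit_rate: float = 0.0
) -> float:
    """Estimate vision spend in USD; cached images are assumed to cost nothing."""
    if detail not in COST_PER_IMAGE:
        raise ValueError(f"detail must be one of {sorted(COST_PER_IMAGE)}")
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_images = num_images * (1.0 - cache_hit_rate)
    return billable_images * COST_PER_IMAGE[detail]


low_detail_cost = estimate_vision_cost(1000, "low")
high_detail_half_cached = estimate_vision_cost(1000, "high", cache_hit_rate=0.5)
```

At these prices, 1,000 low-detail images cost about $7.65, and 1,000 high-detail images with half served from cache cost about $7.95, so a 50% hit rate roughly offsets the price of switching to high detail.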

### 6.5 Error Handling

```python
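def robust_image_analysis(image_path: str, question: str) -> str:
    """
    Analyze image with fallbacks:
    1. Try GPT-4 Vision
    2. Fall back to OCR if vision fails
    3. Return error message if both fail
    """
    # Open the image first so the OCR fallback always has it available
    try:
        image = Image.open(image_path)
        image.load()  # force decoding so corrupt files fail here
    except Exception as load_error:
        return f"❌ Could not open image: {load_error}"

    try:
        return analyze_image_with_rag(image_path, question)
    except Exception as vision_error:
        print(f"⚠️  Vision failed: {vision_error}, trying OCR...")

    # extract_text_from_image returns "" when OCR fails or finds no text
    text = extract_text_from_image(image)
    if not text:
        return "❌ Image analysis failed: vision and OCR both produced no result"
    try:
        return multimodal_chain.invoke(f"{question}\n\nExtracted text: {text}")
    except Exception as chain_error:
        return f"❌ Image analysis failed: {chain_error}"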
```

## 7. Summary & Best Practices

### Key Takeaways

✅ **Multimodal RAG combines:**
- Text retrieval (standard RAG)
- Image understanding (GPT-4 Vision)
- OCR for text extraction
- Unified vector store

✅ **Use cases:**
- Technical documentation with diagrams
- PDF reports with charts/tables
- Visual Q&A systems
- Architectural/design documents

✅ **Production considerations:**
- Image preprocessing (resize, convert, compress)
- Caching vision descriptions
- Batch processing for efficiency
- Error handling with fallbacks
- Cost optimization (OCR vs Vision)

### Performance Comparison

| Approach | Latency | Cost/Query | Accuracy |
|---|---|---|---|
| **Text-only RAG** | ~2s | $0.0004 | Good for text |
| **OCR + RAG** | ~3s | $0.0005 | Good for text-heavy images |
| **Vision + RAG** | ~5-8s | $0.012 | Excellent for all images |
| **Hybrid (OCR + Vision)** | ~6-10s | $0.008 | Best overall |

### When to Use Each Approach

**OCR + RAG:**
- Text-heavy images (scanned documents, screenshots)
- Simple charts with labels
- Cost-sensitive applications

**Vision + RAG:**
- Complex diagrams, flowcharts
- Charts with visual patterns
- Images with spatial relationships
- High-accuracy requirements

**Hybrid:**
- Production systems (use OCR first, vision as fallback)
- Mixed document types
- Balance cost and quality
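
The hybrid strategy (OCR first, vision as fallback) can be sketched as a small router. The version below is a simplified illustration under stated assumptions: `hybrid_describe` is a hypothetical name, the OCR and vision functions are passed in as callables, and the 40-character threshold for deciding that OCR found enough text is an arbitrary placeholder that would need tuning on real documents.

```python
from typing import Any, Callable, Tuple


def hybrid_describe(
    image: Any,
    ocr: Callable[[Any], str],
    vision: Callable[[Any], str],
    min_ocr_chars: int = 40,
) -> Tuple[str, str]:
    """Return (method, text): use OCR when it yields enough text, else vision."""
    text = ocr(image).strip()
    if len(text) >= min_ocr_chars:
        return "ocr", text
    # Too little text recovered: the image is probably mostly visual
    return "vision", vision(image)


# Stubs standing in for Tesseract and the vision model.
def fake_ocr(image: dict) -> str:
    return image.get("text", "")


def fake_vision(image: dict) -> str:
    return "diagram with three connected boxes"


scanned_page = {"text": "Quarterly revenue rose 12 percent year over year across all regions."}
flowchart = {"text": "Step 1"}

page_result = hybrid_describe(scanned_page, fake_ocr, fake_vision)
flowchart_result = hybrid_describe(flowchart, fake_ocr, fake_vision)
```

Character count is a crude signal. A production router might also consider OCR confidence or the document type before deciding to pay for a vision call.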

### Next Steps

1. **Integrate with PDF processing**: Use `pdf2image` to render PDF pages as images, then run OCR or vision on each page
2. **Add image embeddings**: Use CLIP for semantic image search
3. **Build UI**: Streamlit/Gradio interface for uploading images
4. **Deploy**: Package as FastAPI endpoint with image upload
5. **Monitor costs**: Track vision API usage and implement budgets

---

**📚 Related Notebooks:**
- [03_simple_rag.ipynb](../fundamentals/03_simple_rag.ipynb) - Text-only RAG baseline
- [12_contextual_rag.ipynb](12_contextual_rag.ipynb) - Context-augmented retrieval
- [18_finetuning_embeddings.ipynb](18_finetuning_embeddings.ipynb) - Custom embeddings

**🔗 External Resources:**
- [GPT-4 Vision API](https://platform.openai.com/docs/guides/vision)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [PIL/Pillow Docs](https://pillow.readthedocs.io/)

---

🎉 **Multimodal RAG Complete!**

You now have a working multimodal RAG pipeline that handles both text and images, plus a set of patterns for hardening it in production. Experiment with your own documents and images!