# Enhanced Multimodal RAGAS Implementation - Complete Tutorial

This notebook provides a comprehensive, step-by-step implementation of the Enhanced Multimodal RAGAS system. It combines traditional RAGAS (Retrieval-Augmented Generation Assessment) with custom multimodal metrics using ImageBind embeddings.

## Learning Objectives

By the end of this notebook, you will understand:
1. **How to set up multimodal embeddings** using ImageBind for text, vision, and audio
2. **How to integrate RAGAS with custom multimodal metrics** for comprehensive evaluation
3. **How to evaluate retrieval systems across different modalities** (text-only, vision-only, audio-only, multimodal)
4. **How to implement comprehensive evaluation workflows** with detailed metrics and visualizations
5. **How to interpret evaluation results** and understand what each metric tells us

## System Architecture Overview

The system consists of several key components:
- **ImageBind Model**: Creates unified embeddings across text, vision, and audio modalities
- **FAISS Indices**: Enables efficient similarity search for each modality
- **RAGAS Metrics**: Provides standard RAG evaluation (answer relevancy, faithfulness, context precision/recall)
- **Custom Multimodal Metrics**: Adds modality-specific evaluation capabilities
- **SageMaker Integration**: Uses AWS SageMaker endpoints for LLM-based evaluation

## What We'll Evaluate

We'll test four different retrieval strategies:
1. **Text-only retrieval**: Using only text embeddings
2. **Vision-only retrieval**: Using only image embeddings
3. **Audio-only retrieval**: Using only audio embeddings
4. **Full multimodal retrieval**: Combining all three modalities

Each strategy will be evaluated using both traditional RAGAS metrics and our custom multimodal metrics.

## Step 1: Import Required Libraries

Let's start by importing all the necessary libraries. Each group serves a specific purpose in our multimodal evaluation system.

In [None]:
!git clone https://github.com/facebookresearch/ImageBind.git
%cd ImageBind
# Skip the requirements.txt as it has incompatible PyTorch versions
%pip install pandas numpy torch torchaudio faiss-cpu scikit-learn boto3 langchain-aws langchain-core ragas datasets tqdm matplotlib seaborn ipython soundfile decord ftfy regex Pillow pytorchvideo==0.1.5
%cd ../

In [None]:
# Path to the ImageBind repository cloned by the cell above
IMAGE_BIND_PATH = ''
assert IMAGE_BIND_PATH != "", "Please set IMAGE_BIND_PATH to the ImageBind folder cloned by the cell above"

# Core Python libraries for data manipulation and system operations
import os
import sys
import json
import warnings
from pathlib import Path
from typing import Dict, List, Tuple, Any, Optional, Union
import asyncio

# Data manipulation and numerical computing
import pandas as pd
import numpy as np

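# Machine learning and similarity search
# Compatibility shim: recent torchvision releases no longer ship
# `torchvision.transforms.functional_tensor`, which some dependencies
# (such as pytorchvideo) still import. Alias it to
# `torchvision.transforms.functional` before ImageBind is loaded.
import torchvision.transforms.functional as F
sys.modules['torchvision.transforms.functional_tensor'] = F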

import torch
import faiss
from sklearn.preprocessing import normalize
from sklearn.metrics import ndcg_score
from sklearn.metrics.pairwise import cosine_similarity

# ImageBind for multimodal embeddings
# This is the core of our multimodal system - it creates unified embeddings
sys.path.append(IMAGE_BIND_PATH)
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# AWS and SageMaker for LLM integration
import boto3
from botocore.response import StreamingBody

# LangChain for LLM interaction
from langchain_aws.chat_models.sagemaker_endpoint import ChatSagemakerEndpoint, ChatModelContentHandler
from langchain_core.messages import HumanMessage, AIMessageChunk, SystemMessage
from langchain_core.embeddings import Embeddings

# RAGAS for traditional RAG evaluation
import ragas
from ragas.run_config import RunConfig
from ragas.metrics.base import MetricWithLLM, MetricWithEmbeddings
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision, context_recall
from ragas.dataset_schema import SingleTurnSample
from datasets import Dataset

# Visualization and progress tracking
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Custom utilities (our helper functions)

from utils import *
from utils import MultimodalShowcase

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print(" All libraries imported successfully!")
print(f" PyTorch version: {torch.__version__}")
print(f" CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f" GPU: {torch.cuda.get_device_name(0)}")

## Step 2: Deploy SageMaker Endpoint

Before we proceed with the evaluation, we need to deploy a SageMaker endpoint for LLM-based evaluation. This step will deploy the Qwen2.5-1.5B-Instruct model and automatically capture the endpoint name for use throughout the notebook.

In [None]:
# Import additional SageMaker libraries for deployment
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.utils import name_from_base
from sagemaker import get_execution_role

print(" Deploying Qwen2.5-1.5B-Instruct Model to SageMaker Endpoint...")
print(" This may take 5-10 minutes for the endpoint to be ready.")

# Get the Hugging Face LLM image URI
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="3.0.1"
)

# Get the execution role
role = get_execution_role()
print(f" Using IAM role: {role}")

# Configure the model
hub = {
    'HF_TASK': 'text-generation', 
    'HF_MODEL_ID': 'Qwen/Qwen2.5-1.5B-Instruct'
}

# Create the HuggingFace model
model_for_deployment = HuggingFaceModel(
    role=role,
    env=hub,
    image_uri=llm_image,
)

# Generate a unique endpoint name
endpoint_name = name_from_base("qwen25")
print(f" Generated endpoint name: {endpoint_name}")
# Deployment configuration
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Deploy the model
print(f" Deploying to {instance_type} instance...")
predictor = model_for_deployment.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    routing_config={
        "RoutingStrategy": sagemaker.enums.RoutingStrategy.LEAST_OUTSTANDING_REQUESTS
    }
)

# Automatically assign the endpoint name to our configuration variable
SAGEMAKER_ENDPOINT_NAME = endpoint_name

print(f" SageMaker endpoint deployed successfully!")
print(f" Endpoint name: {SAGEMAKER_ENDPOINT_NAME}")
print(f" Predictor is attached to endpoint: {predictor.endpoint_name}")

## Step 3: Configuration and Setup

Now that our SageMaker endpoint is deployed, let's configure the key parameters for our evaluation system. The endpoint name has been automatically captured from the deployment above.
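To make the role of the endpoint name concrete, here is a minimal sketch of calling the deployed endpoint directly with `boto3`. It assumes the container speaks the Text Generation Inference request format (`inputs` plus `parameters`) and returns a list containing a `generated_text` field. The helper names `build_tgi_payload` and `invoke_endpoint` are hypothetical and are not part of the notebook's `utils` module.

```python
import json


def build_tgi_payload(prompt: str, max_new_tokens: int = 256, temperature: float = 0.1) -> bytes:
    """Serialize a prompt into the JSON body assumed for a TGI-style container."""
    body = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature},
    }
    return json.dumps(body).encode("utf-8")


def invoke_endpoint(endpoint_name: str, prompt: str) -> str:
    """Send one prompt to a SageMaker endpoint and return the generated text."""
    import boto3  # imported lazily so the payload helper works without AWS access

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_tgi_payload(prompt),
    )
    result = json.loads(response["Body"].read())
    # Assumption: the container answers with [{"generated_text": "..."}]
    return result[0]["generated_text"]


payload = build_tgi_payload("Say hello.")
print(json.loads(payload)["parameters"]["max_new_tokens"])  # 256
```

Calling `invoke_endpoint(SAGEMAKER_ENDPOINT_NAME, "Say hello.")` requires valid AWS credentials and a running endpoint, so it is left out of the sketch.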

### Step 3.1: Download and Extract the Cinepile Dataset

Before we can begin our multimodal RAGAS evaluation, we need to download the Cinepile dataset, which contains the multimodal movie data (text, images, and audio) that our system will use for retrieval and evaluation.

The Cinepile dataset is a curated collection of movie-related content that includes:

- **Text descriptions** and dialogue from movies
- **Video frames** extracted from movie clips
- **Audio clips** from movie scenes

**Total size**: 487MB compressed, ~584MB when extracted

In [None]:
# Download the Cinepile dataset (487MB)
!wget "https://d22xjg1p9prwde.cloudfront.net/cinepile-dataset.tar.gz"

# Extract the dataset
!tar -xzf cinepile-dataset.tar.gz

In [None]:
#  Configuration Parameters
# The SAGEMAKER_ENDPOINT_NAME has been automatically set from the deployment above
# If you need to use an existing endpoint, uncomment and modify the line below:
# SAGEMAKER_ENDPOINT_NAME = ""

# Data paths - these point to our multimodal dataset
CINEPILE_DATA_PATH = ""
assert CINEPILE_DATA_PATH != "", "Please Provide Your Path To The Cinepile Dataset"
set_cinepile_data_path(CINEPILE_DATA_PATH)

# Device configuration - use GPU if available for faster processing
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f" Using device: {device}")

# Initialize global variables that will store our models and data
# These will be populated as we progress through the notebook
model = None  # ImageBind model
data_entries = None  # Our multimodal dataset
question_data = None  # Questions for evaluation
text_index = None  # FAISS index for text embeddings
vision_index = None  # FAISS index for vision embeddings
audio_index = None  # FAISS index for audio embeddings
multimodal_indices = None  # Combined multimodal indices
normalized_embeddings = None  # Normalized embedding vectors

print(" Configuration complete!")
print(f" SageMaker endpoint: {SAGEMAKER_ENDPOINT_NAME}")
print(f" Data path: {CINEPILE_DATA_PATH}")

## Step 4: Load and Explore the Dataset

Our dataset contains multimodal movie data from the Cinepile dataset. Each entry includes:
- **Text**: Descriptions and dialogue
- **Vision**: Movie frames/images
- **Audio**: Sound clips from movies

Let's load the data and examine its structure.

In [None]:
# Load the multimodal dataset
print(" Loading cinepile dataset...")
data_entries = load_cinepile_data()

print(f" Loaded {len(data_entries)} multimodal entries")
print("\n Let's examine the first entry to understand our data structure:")
print("=" * 60)

# Display the structure of the first data entry
first_entry = data_entries[0]
print(f" Keys in each entry: {list(first_entry.keys())}")
print("\n Sample entry details:")
for key, value in first_entry.items():
    if isinstance(value, str):
        # Truncate long strings for readability
        display_value = value[:100] + "..." if len(value) > 100 else value
        print(f"  {key}: {display_value}")
    else:
        print(f"  {key}: {type(value)} - {value}")

print("\n Understanding the data structure:")
print("   • Each entry represents a multimodal movie scene")
print("   • Text contains dialogue or descriptions")
print("   • Vision contains image file paths")
print("   • Audio contains audio file paths")
print("   • This rich multimodal data allows us to test different retrieval strategies")

## Step 5: Initialize ImageBind Model

ImageBind is a model from Meta AI that creates unified embeddings across multiple modalities. This notebook uses three of them:
- Text descriptions
- Images/video frames
- Audio clips

ImageBind also supports depth, thermal, and IMU data, which we do not use here.

The key insight is that ImageBind learns a shared embedding space where semantically similar content from different modalities is close together.

In [None]:
# Initialize the ImageBind model
print(" Loading ImageBind model...")
print("   This may take a moment as we load the pre-trained weights")

# Load the huge version of ImageBind for best performance
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

print(" ImageBind model loaded successfully!")
print(f" Model device: {next(model.parameters()).device}")
print("\n What makes ImageBind special:")
print("   • Creates unified embeddings across text, vision, and audio")
print("   • Semantically similar content from different modalities is close in embedding space")
print("   • Enables cross-modal retrieval (e.g., find images using text queries)")
print("   • Pre-trained on massive multimodal datasets")

# Test that the model is working
print("\n Quick model test...")
try:
    # Create a simple test embedding
    test_text = ["A simple test"]
    test_inputs = {ModalityType.TEXT: data.load_and_transform_text(test_text, device)}
    with torch.no_grad():
        test_embeddings = model(test_inputs)
    print(f" Model test successful! Embedding shape: {test_embeddings[ModalityType.TEXT].shape}")
except Exception as e:
    print(f" Model test failed: {e}")

## Step 6: Create Multimodal ImageBind Embeddings

Now we'll create embeddings for all our data across three modalities. This is where the magic happens - ImageBind will create vector representations that capture the semantic meaning of our text, images, and audio.

### What are embeddings?
Embeddings are dense vector representations that capture semantic meaning. Similar content will have similar embeddings (measured by cosine similarity).
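As a small illustration of "similar content has similar embeddings", the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real ImageBind embeddings, and the variable names are purely illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for high-dimensional embeddings.
dog_text = [0.9, 0.1, 0.0]
dog_image = [0.8, 0.2, 0.1]
car_audio = [0.0, 0.1, 0.9]

print(cosine_similarity(dog_text, dog_image))  # close to 1: similar content
print(cosine_similarity(dog_text, car_audio))  # close to 0: unrelated content
```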

### Processing Strategy
We process data in batches to manage memory efficiently, especially when dealing with large datasets and GPU memory constraints.
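The batching idea can be sketched with a small generator. This is only an illustration of the pattern, not the code inside `create_imagebind_embeddings` (which lives in `utils` and is not shown here); `iter_batches` and the file names are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names standing in for dataset entries.
entries = [f"clip_{i}.mp4" for i in range(14)]
batch_sizes = [len(batch) for batch in iter_batches(entries, batch_size=6)]
print(batch_sizes)  # [6, 6, 2]
```

Only one batch of inputs needs to sit in GPU memory at a time, which is why a smaller `batch_size` helps when memory is tight.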

In [None]:
# Create embeddings for all modalities
print(" Creating ImageBind embeddings for all modalities...")
print(f" Processing {len(data_entries)} entries in batches")

# Create embeddings with batch processing for memory efficiency
embeddings = create_imagebind_embeddings(
    model=model, 
    data_entries=data_entries, 
    batch_size=6
)
print("\n Embeddings created successfully!")
print("\n Embedding Statistics:")
print(f"    Text embeddings: {embeddings['text'].shape}")
print(f"    Vision embeddings: {embeddings['vision'].shape}")
print(f"    Audio embeddings: {embeddings['audio'].shape}")

# Show sample embeddings to understand the data
print("\n Sample embedding vectors (first 10 dimensions):")
print(f"    Text sample: {embeddings['text'][0][:10]}")
print(f"    Vision sample: {embeddings['vision'][0][:10]}")
print(f"    Audio sample: {embeddings['audio'][0][:10]}")

print("\n Key insights about these embeddings:")
print("   • Each embedding is a 1024-dimensional vector")
print("   • Vectors are L2-normalized to unit length in Step 7 so that inner product equals cosine similarity")
print("   • Similar content across modalities will have similar embeddings")
print("   • These embeddings enable cross-modal retrieval")

## Step 7: Create FAISS Indices for Efficient Similarity Search

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. We'll create separate indices for each modality to enable fast retrieval.

### Why FAISS?
- **Speed**: Highly optimized exact search, plus approximate index types for collections too large to scan exhaustively
- **Scalability**: Can handle millions of vectors
- **Memory efficiency**: Optimized data structures
- **GPU support**: Can leverage GPU acceleration
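To show what an exact inner-product index computes, the sketch below reproduces the idea with NumPy alone: normalize the vectors, take inner products against the query, and keep the top `k`. It deliberately does not call FAISS, and `l2_normalize` and `exact_top_k` are hypothetical helper names used only for illustration.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def exact_top_k(index_vectors: np.ndarray, query: np.ndarray, k: int):
    """Brute-force top-k by inner product over every stored vector."""
    scores = index_vectors @ query
    top = np.argsort(-scores)[:k]
    return top, scores[top]


rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(100, 8)).astype("float32"))
ids, scores = exact_top_k(corpus, corpus[0], k=3)
print(ids[0], float(scores[0]))  # the query itself ranks first with score ~1.0
```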

In [None]:
# Create separate FAISS indices for each modality
print(" Creating FAISS indices for efficient similarity search...")

indices, normalized_embeddings = create_separate_indices(embeddings)

# Extract individual indices for easy access
text_index = indices['text']
vision_index = indices['vision']
audio_index = indices['audio']

print(" FAISS indices created successfully!")
print("\n Index Statistics:")
print(f"    Text index: {text_index.ntotal} vectors, dimension {text_index.d}")
print(f"    Vision index: {vision_index.ntotal} vectors, dimension {vision_index.d}")
print(f"    Audio index: {audio_index.ntotal} vectors, dimension {audio_index.d}")

print("\n Understanding FAISS indices:")
print("   • Each index stores normalized embeddings for fast cosine similarity")
print("   • IndexFlatIP performs exact inner product search")
print("   • Normalized vectors make inner product equivalent to cosine similarity")
print("   • These indices enable sub-second retrieval even with large datasets")

# Test the indices with a simple search
print("\n Testing index functionality...")
try:
    # Search for the most similar items to the first text embedding
    test_query = normalized_embeddings['text'][0:1]  # First embedding as query
    distances, indices_result = text_index.search(test_query, k=3)
    print(f" Index test successful! Found {len(indices_result[0])} similar items")
    print(f"   Similarity scores: {distances[0]}")
except Exception as e:
    print(f" Index test failed: {e}")

## Step 8: Create Multimodal Combined Indices

Beyond single-modality retrieval, we can combine embeddings from multiple modalities. This enables more sophisticated retrieval strategies that leverage the strengths of different modalities.
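One simple way to combine modalities is to average the per-item embeddings and rescale the result to unit length. The sketch below assumes that approach; the real `create_multimodal_indices` helper is defined in `utils` and may differ, and the re-normalization step in particular is an assumption.

```python
import numpy as np


def combine_modalities(embeddings: dict, names: list) -> np.ndarray:
    """Average the embeddings of the chosen modalities, then re-normalize each row."""
    stacked = np.stack([embeddings[name] for name in names], axis=0)
    mean = stacked.mean(axis=0)
    norms = np.linalg.norm(mean, axis=1, keepdims=True)
    return mean / np.clip(norms, 1e-12, None)


# Two items with 2-dimensional toy embeddings per modality (already unit length).
toy = {
    "text": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "vision": np.array([[0.0, 1.0], [0.0, 1.0]]),
}
combined = combine_modalities(toy, ["text", "vision"])
print(combined)
```

The first item ends up halfway between its text and vision directions, while the second item, whose modalities already agree, is unchanged.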

In [None]:
# Create multimodal combined indices
print(" Creating multimodal combined indices...")
print("   This enables retrieval using multiple modalities simultaneously")

multimodal_indices, multimodal_embeddings = create_multimodal_indices(normalized_embeddings)

print(" Multimodal indices created successfully!")
print("\n Multimodal Index Statistics:")
for combo_name, index in multimodal_indices.items():
    print(f"    {combo_name}: {index.ntotal} vectors, dimension {index.d}")

print("\n Understanding multimodal combinations:")
print("   • text_vision: Averages text and vision embeddings")
print("   • text_audio: Averages text and audio embeddings")
print("   • full_multimodal: Averages all three modalities")
print("   • Each combination captures different aspects of semantic similarity")
print("   • Enables queries that consider multiple types of information")

# Show the available multimodal combinations
print(f"\n Available retrieval strategies: {list(multimodal_indices.keys())}")

print("\n When to use each strategy:")
print("    Text-only: When you have text queries and want text-based results")
print("    Vision-only: When you have image queries or want visually similar results")
print("    Audio-only: When you have audio queries or want acoustically similar results")
print("    Multimodal: When you want to leverage multiple types of information")

## Step 9: Demonstrate Retrieval Capabilities

Let's see our multimodal retrieval system in action! We'll demonstrate how different modalities can be used for retrieval and show the actual results.

### Interactive Retrieval Demo
We'll use our MultimodalShowcase class to demonstrate retrieval across different modalities with real examples from our movie dataset.

In [None]:
# Initialize the showcase system for interactive demonstrations
print(" Setting up multimodal retrieval showcase...")

showcase = MultimodalShowcase(
    data_entries=data_entries,
    question_data=question_data,
    text_index=text_index,
    vision_index=vision_index,
    audio_index=audio_index,
    multimodal_indices=multimodal_indices,
    normalized_embeddings=normalized_embeddings
)

# Setup the showcase with question data
showcase.setup_showcase()

print(" Showcase system ready!")
print(f" Loaded {len(showcase.question_data['questions'])} evaluation questions")

### Text-Based Retrieval Demo

Let's start with text-based retrieval. This shows how our system finds relevant content using only textual information.

In [None]:
# Demonstrate text-based retrieval with formatted text display
print(" DEMONSTRATING TEXT-BASED RETRIEVAL")
print("=" * 60)

# Select a few questions to demonstrate
demo_questions = [0, 1, 2]

for question_idx in demo_questions:
    question = showcase.question_data['questions'].iloc[question_idx]
    movie_name = showcase.question_data['dataframe'].iloc[question_idx]['movie_name']
    
    print("\n" + "-" * 50)
    print(f"DEMO QUESTION {question_idx}: {movie_name}")
    print(f"Question: {question}")
    print("-" * 50)
    
    # Text retrieval - shows FULL formatted text content
    print("\n" + "-" * 15 + " TEXT WITH CHOICES " + "-" * 15)
    showcase.show_text_retrieval_with_choices(question_idx, top_k=3)
    
    print("\n" + "="*80)

print("\n Text Retrieval Insights:")
print("   • Uses semantic similarity between question text and document text")
print("   • Good for finding conceptually related content")
print("   • May miss visual or auditory cues that could be relevant")

### Vision-Based Retrieval Demo

Now let's see how vision-based retrieval works. This is particularly interesting because it can find visually similar scenes even when the text descriptions are different.

In [None]:
# Demonstrate vision-based retrieval with actual image display
print(" DEMONSTRATING VISION-BASED RETRIEVAL")
print("=" * 60)

# Use the first two questions for the vision demonstration
demo_questions = [0, 1]

for question_idx in demo_questions:
    question = showcase.question_data['questions'].iloc[question_idx]
    movie_name = showcase.question_data['dataframe'].iloc[question_idx]['movie_name']
    
    print("\n" + "-" * 50)
    print(f"DEMO QUESTION {question_idx}: {movie_name}")
    print(f"Question: {question}")
    print("-" * 50)
    
    # Vision retrieval - displays actual images/frames
    print("\n" + "-" * 15 + " VISION RETRIEVAL " + "-" * 15)
    showcase.show_vision_retrieval_with_choices(question_idx, top_k=3)
    
    print("\n" + "="*80)

print("\n Vision Retrieval Insights:")
print("   • Finds visually similar scenes and compositions")
print("   • Can identify similar lighting, colors, and visual elements")
print("   • Particularly useful for visual question answering")
print("   • May capture visual context that text descriptions miss")

### Audio-Based Retrieval Demo

Audio-based retrieval can find content based on acoustic similarity, including music, sound effects, and speech patterns.

In [None]:
# Demonstrate audio-based retrieval with audio playback
print(" DEMONSTRATING AUDIO-BASED RETRIEVAL")
print("=" * 60)

# Use selected questions for audio demonstration
demo_questions = [0, 1]

for question_idx in demo_questions:
    question = showcase.question_data['questions'].iloc[question_idx]
    movie_name = showcase.question_data['dataframe'].iloc[question_idx]['movie_name']
    
    print("\n" + "-" * 50)
    print(f"DEMO QUESTION {question_idx}: {movie_name}")
    print(f"Question: {question}")
    print("-" * 50)
    
    # Audio retrieval - plays actual audio files
    print("\n" + "-" * 15 + " AUDIO RETRIEVAL " + "-" * 15)
    showcase.show_audio_retrieval_with_choices(question_idx, top_k=3)
    
    print("\n" + "="*80)

print("\n Audio Retrieval Insights:")
print("   • Finds acoustically similar content")
print("   • Can identify similar music, sound effects, or speech patterns")
print("   • Useful for questions about audio content or emotional tone")
print("   • Captures auditory context that other modalities might miss")

### Multimodal Retrieval Demo

Finally, let's see the full multimodal retrieval in action, combining text, vision, and audio for comprehensive content understanding.

In [None]:
# Demonstrate full multimodal retrieval with all modalities
print(" DEMONSTRATING FULL MULTIMODAL RETRIEVAL")
print("=" * 60)

# Use selected questions for multimodal demonstration
demo_questions = [0, 1]

for question_idx in demo_questions:
    question = showcase.question_data['questions'].iloc[question_idx]
    movie_name = showcase.question_data['dataframe'].iloc[question_idx]['movie_name']
    
    print("\n" + "-" * 50)
    print(f"DEMO QUESTION {question_idx}: {movie_name}")
    print(f"Question: {question}")
    print("-" * 50)
    
    # Multimodal retrieval - shows text, images, and plays audio
    print("\n" + "-" * 15 + " MULTIMODAL RETRIEVAL " + "-" * 15)
    showcase.show_multimodal_retrieval_with_choices(question_idx, top_k=3)
    
    print("\n" + "="*80)

print("\n Multimodal Retrieval Insights:")
print("   • Combines information from text, vision, and audio")
print("   • Provides the most comprehensive understanding of content")
print("   • Can capture nuances that single modalities might miss")
print("   • Best for complex questions requiring multiple types of understanding")
print("   • May be computationally more expensive but often more accurate")

## Step 10: Comprehensive Multimodal RAGAS Evaluation

Now we'll run our comprehensive evaluation system that combines traditional RAGAS metrics with our custom multimodal metrics.

### What We're Evaluating

**Traditional RAGAS Metrics:**
- **Answer Relevancy**: How relevant is the generated answer to the question?
- **Faithfulness**: Is the answer faithful to the retrieved context?
- **Context Precision**: How precise is the retrieved context?
- **Context Recall**: How much of the relevant context was retrieved?

**Custom Multimodal Metrics:**
- **Multimodal Faithfulness**: Cross-modal consistency of retrieved content
- **Multimodal Relevancy**: Semantic relevance across modalities
- **Average Similarity**: Overall embedding similarity scores

**Standard Retrieval Metrics:**
- **Precision@1**: Is the top result correct?
- **MRR@5**: Mean Reciprocal Rank of correct answers in top 5
- **NDCG@5**: Normalized Discounted Cumulative Gain
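The three retrieval metrics above can be sketched in a few lines under one simplifying assumption: each question has exactly one relevant item (binary relevance). The notebook's evaluator may compute them differently, for example with `sklearn.metrics.ndcg_score`, so the function names here are illustrative only.

```python
import math
from typing import Sequence


def precision_at_1(ranked_ids: Sequence[int], relevant_id: int) -> float:
    """1.0 if the top-ranked item is the relevant one, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] == relevant_id else 0.0


def reciprocal_rank_at_k(ranked_ids: Sequence[int], relevant_id: int, k: int = 5) -> float:
    """1/rank of the relevant item within the top k, or 0.0 if it is absent."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[int], relevant_id: int, k: int = 5) -> float:
    """NDCG with one relevant item: the ideal DCG is 1.0, so NDCG equals the DCG."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


ranked = [7, 3, 9, 1, 4]  # retrieved item ids, best first
print(precision_at_1(ranked, 3))        # 0.0
print(reciprocal_rank_at_k(ranked, 3))  # 0.5
print(ndcg_at_k(ranked, 3))             # 1 / log2(3), about 0.631
```

MRR@5 is then the mean of `reciprocal_rank_at_k` over all evaluated questions.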

In [None]:
# Import our comprehensive evaluation system
# (pandas, numpy, matplotlib, seaborn, and display were already imported in Step 1)
from utils import EnhancedMultimodalEvaluator

# Configuration for evaluation
NUM_QUESTIONS_TO_TEST = 5  # Adjust this based on your needs

print(" Starting Enhanced Multimodal Evaluation")
print(f" Testing {NUM_QUESTIONS_TO_TEST} questions across 4 retrieval strategies")
print("\n Evaluation Strategies:")
print("   1.  Text-only retrieval")
print("   2.  Vision-only retrieval")
print("   3.  Audio-only retrieval")
print("   4.  Full multimodal retrieval")

# Initialize and setup the evaluator
evaluator = EnhancedMultimodalEvaluator(SAGEMAKER_ENDPOINT_NAME, NUM_QUESTIONS_TO_TEST)
evaluator.setup_evaluation_components()

print("\n Evaluation system initialized!")
print(" Ready to run comprehensive multimodal evaluation...")

### Running the Evaluation

This is where we run our comprehensive evaluation across all strategies and metrics. The evaluation will:

1. **Test each retrieval strategy** on our question set
2. **Calculate traditional RAGAS metrics** using the SageMaker LLM
3. **Compute custom multimodal metrics** using ImageBind embeddings
4. **Generate detailed performance comparisons** across all strategies

⏱ **Note**: This may take several minutes as we're running LLM evaluations for each question and strategy.

In [None]:
# Run the comprehensive evaluation
print("RUNNING COMPREHENSIVE MULTIMODAL EVALUATION")
print("=" * 70)
print("⏱ This may take several minutes as we evaluate each strategy...")
print("\n Progress will be shown for each question and strategy")

# Execute the evaluation
evaluation_results = evaluator.run_comprehensive_evaluation()

print("\n Evaluation completed successfully!")
print(f" Evaluated {len(evaluation_results)} different strategies")
print(f" Each strategy tested on {NUM_QUESTIONS_TO_TEST} questions")

## Step 11: Results Analysis and Visualization

Now let's analyze our results! We'll create comprehensive visualizations and tables to understand how each retrieval strategy performs.

### Performance Comparison Table

This table shows the performance of each strategy across all metrics. **Green highlighting** indicates the best performing strategy for each metric.
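The "best strategy per metric" selection behind the highlighting can be sketched as a small pure-Python function. The scores below are made-up placeholders rather than results from this notebook, and `best_per_metric` is a hypothetical helper, not the evaluator's own method.

```python
from typing import Dict, Tuple


def best_per_metric(results: Dict[str, Dict[str, float]]) -> Dict[str, Tuple[str, float]]:
    """For every metric, return the (strategy, score) pair with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for strategy, metrics in results.items():
        for metric, score in metrics.items():
            if metric not in best or score > best[metric][1]:
                best[metric] = (strategy, score)
    return best


# Placeholder scores for illustration only.
toy_results = {
    "text_only": {"Precision@1": 0.6, "MRR@5": 0.70},
    "full_multimodal": {"Precision@1": 0.4, "MRR@5": 0.75},
}
winners = best_per_metric(toy_results)
print(winners)
```

This assumes higher is better for every metric, which holds for the metrics listed in this tutorial.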

In [None]:
print(" COMPREHENSIVE EVALUATION RESULTS")
print("=" * 70)

# Create styled table with green highlighting for best performance
styled_table = evaluator.create_styled_comparison_table()
if styled_table is not None:
    print("\n Performance Comparison Table (Best values highlighted in green):")
    display(styled_table)
else:
    print(" Could not create styled table, showing basic results...")
    
# Get the raw data for additional analysis
df, best_metrics, modality_effectiveness = evaluator.create_comparison_table()

print("\n BEST PERFORMING STRATEGIES BY METRIC:")
print("=" * 50)
for metric, value in best_metrics.items():
    if isinstance(value, tuple):
        strategy, score = value
        print(f" {metric}: {strategy} ({score:.3f})")
    else:
        # Find the best strategy for this metric
        best_strategy = df[metric].idxmax()
        best_value = df[metric].max()
        print(f" {metric}: {best_strategy} ({best_value:.3f})")

print("\n OVERALL MODALITY EFFECTIVENESS RANKING:")
print("=" * 50)

# Calculate average scores for each strategy
strategy_averages = []
for strategy, metrics in modality_effectiveness.items():
    avg_score = sum(metrics.values()) / len(metrics)
    strategy_averages.append((strategy, avg_score))

# Sort by average score (highest first)
strategy_averages.sort(key=lambda x: x[1], reverse=True)

# Display the ranking
for i, (strategy, avg_score) in enumerate(strategy_averages, 1):
    print(f"{i}. {strategy}: {avg_score:.3f} (average across all metrics)")

### Performance Visualization

Let's create visualizations to better understand the performance patterns across different strategies and metrics.

In [None]:
# Create comprehensive visualizations
print(" CREATING PERFORMANCE VISUALIZATIONS")
print("=" * 50)

# Set up the plotting style
plt.style.use('default')
sns.set_palette("husl")

# Create a heatmap visualization
plt.figure(figsize=(15, 8))
sns.heatmap(df, annot=True, fmt='.3f', cmap='RdYlGn', center=0.5)
plt.title('Multimodal Evaluation Results\n(Green = Better Performance)', fontweight='bold')
plt.xlabel('Evaluation Metrics')
plt.ylabel('Retrieval Strategies')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print(" Visualizations created successfully!")

## Step 12: Understanding the Results - Key Insights

Let's dive deep into what our evaluation results tell us about multimodal retrieval performance.

### Metric Interpretation Guide

Understanding what each metric means is crucial for interpreting our results:

In [None]:
# Provide detailed analysis of the results
print(" DETAILED RESULTS ANALYSIS")
print("=" * 50)

print("\n METRIC INTERPRETATION GUIDE:")
print("-" * 40)

metric_explanations = {
    "Precision@1": "Is the top retrieved result correct? (1.0 = perfect, 0.0 = wrong)",
    "Recall@1": "Did we retrieve the correct answer in the top result?",
    "MRR@5": "Mean Reciprocal Rank - how quickly do we find the right answer?",
    "NDCG@5": "Normalized Discounted Cumulative Gain - quality of ranking",
    "RAGAS Answer Relevancy": "How relevant is the generated answer to the question?",
    "RAGAS Faithfulness": "Is the answer faithful to the retrieved context?",
    "RAGAS Context Precision": "How precise is the retrieved context?",
    "RAGAS Context Recall": "How much relevant context was retrieved?",
    "Multimodal Faithfulness": "Cross-modal consistency of retrieved content",
    "Multimodal Relevancy": "Semantic relevance across modalities",
    "Multimodal Avg Similarity": "Overall embedding similarity scores"
}

for metric, explanation in metric_explanations.items():
    print(f" {metric}:")
    print(f"   {explanation}")
    print()

print("\n PRACTICAL RECOMMENDATIONS:")
print("-" * 35)
recommendations = [
    " Use text-only for purely conceptual questions",
    " Use vision-only for visual scene analysis",
    " Use audio-only for questions about sound, music, or emotion",
    " Use multimodal for complex questions requiring multiple types of understanding",
    " Monitor both traditional and multimodal metrics for comprehensive evaluation",
    " Consider computational cost vs. performance trade-offs in production"
]

for rec in recommendations:
    print(rec)

## Step 13: Save Results and Export Data

Let's save our evaluation results for future reference and analysis.

In [None]:
# Save all results for future reference
print(" SAVING EVALUATION RESULTS")
print("=" * 40)

# Save the detailed results using the evaluator's save function
evaluator.save_results()

# Save the comparison table as CSV for easy analysis
df.to_csv("multimodal_evaluation_comparison.csv")
print(" Comparison table saved as 'multimodal_evaluation_comparison.csv'")

# Calculate the best overall strategy
strategy_averages = []
for strategy, metrics in modality_effectiveness.items():
    avg_score = sum(metrics.values()) / len(metrics)
    strategy_averages.append((strategy, avg_score))

# Sort by average score (highest first)
strategy_averages.sort(key=lambda x: x[1], reverse=True)
best_strategy, best_score = strategy_averages[0]

# Save a summary report
summary_report = {
    "evaluation_summary": {
        "num_questions_tested": NUM_QUESTIONS_TO_TEST,
        "strategies_evaluated": list(df.index),
        "metrics_calculated": list(df.columns),
        "best_overall_strategy": best_strategy,
        "best_overall_score": best_score
    },
    "best_metrics": best_metrics,
    "modality_effectiveness_ranking": dict(modality_effectiveness)
}

with open("evaluation_summary_report.json", "w") as f:
    json.dump(summary_report, f, indent=2, default=str)

print(" Summary report saved as 'evaluation_summary_report.json'")
print(" Detailed results saved as 'enhanced_multimodal_evaluation_results.json'")

print("\n FILES CREATED:")
print("    multimodal_evaluation_comparison.csv - Performance comparison table")
print("    evaluation_summary_report.json - High-level summary")
print("    enhanced_multimodal_evaluation_results.json - Detailed results")

print("\n EVALUATION COMPLETE!")
print("All results have been saved and are ready for further analysis.")

## Conclusion and Next Steps

Congratulations! You've successfully implemented and run a comprehensive multimodal RAGAS evaluation system. Here's what you've accomplished:

### What You've Learned

1. **Multimodal Embeddings**: How to create unified embeddings across text, vision, and audio using ImageBind
2. **Efficient Retrieval**: How to build FAISS indices for fast similarity search across modalities
3. **Comprehensive Evaluation**: How to combine traditional RAGAS metrics with custom multimodal metrics
4. **Performance Analysis**: How to interpret and visualize multimodal retrieval performance
5. **Strategic Insights**: When to use different retrieval strategies based on your use case

### Next Steps and Extensions

Here are some ways you can extend this work:

1. **Scale Up**: Test with larger datasets and more questions
2. **Fine-tune Weights**: Experiment with different weighting schemes for multimodal combinations
3. **Add More Modalities**: Extend to include other modalities like depth, thermal, etc.
4. **Custom Metrics**: Develop domain-specific evaluation metrics
5. **Production Deployment**: Optimize for real-time retrieval in production systems
6. **A/B Testing**: Compare against other multimodal retrieval approaches
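As a starting point for the weighting experiments in item 2, here is a sketch of weighted score fusion: each modality is searched separately and the per-item similarity scores are combined with user-chosen weights. This is a different technique from the embedding averaging used earlier in the notebook, and `fuse_scores` along with the toy scores are hypothetical.

```python
from typing import Dict


def fuse_scores(
    per_modality_scores: Dict[str, Dict[int, float]],
    weights: Dict[str, float],
) -> Dict[int, float]:
    """Combine per-modality similarity scores into one weighted score per item."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused: Dict[int, float] = {}
    for modality, scores in per_modality_scores.items():
        weight = weights.get(modality, 0.0) / total  # normalize so weights sum to 1
        for item_id, score in scores.items():
            fused[item_id] = fused.get(item_id, 0.0) + weight * score
    return fused


# Toy similarity scores for two items (ids 0 and 1).
scores = {"text": {0: 0.9, 1: 0.2}, "vision": {0: 0.1, 1: 0.8}}
text_heavy = fuse_scores(scores, {"text": 3.0, "vision": 1.0})
vision_heavy = fuse_scores(scores, {"text": 1.0, "vision": 3.0})
print(max(text_heavy, key=text_heavy.get))      # 0
print(max(vision_heavy, key=vision_heavy.get))  # 1
```

Changing the weights flips which item ranks first, which is the effect a weighting experiment would measure against the evaluation metrics.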

### Key Takeaways

- **No Single Best Strategy**: Different retrieval strategies excel in different scenarios
- **Multimodal is Powerful**: Combining modalities can provide richer understanding
- **Evaluation is Complex**: Comprehensive evaluation requires multiple metrics and perspectives
- **Context Matters**: The best approach depends on your specific use case and data

### Resources for Further Learning

- [ImageBind Paper](https://arxiv.org/abs/2305.05665) - The foundational research
- [RAGAS Documentation](https://docs.ragas.io/) - Traditional RAG evaluation
- [FAISS Documentation](https://faiss.ai/) - Efficient similarity search
- [Multimodal AI Research](https://paperswithcode.com/task/multimodal-learning) - Latest developments

Thank you for following this comprehensive tutorial! You now have a solid foundation for building and evaluating multimodal retrieval systems. 