# VSS 2026 RDM Analysis Pipeline

This notebook provides a complete, reproducible pipeline for computing and analyzing Representational Dissimilarity Matrices (RDMs) for CLIP and DINOv3 category embeddings.

## Pipeline Overview

1. **Compute RDM Matrices** - Average embeddings per category and compute pairwise cosine-distance RDMs
2. **Filter and Reorganize RDMs** (Optional) - Filter out low-quality categories and reorganize by type
3. **Correlate RDM Matrices** - Compare RDMs across models using Pearson and Spearman correlations
4. **Correlate Category Embeddings** (Optional) - Compare category-level embeddings directly
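
To make the first step of the overview concrete, here is a minimal sketch on toy data: average the embeddings in each category, L2-normalize the averages, and take one minus the cosine similarity. The function name `toy_rdm` and the 2-D vectors are hypothetical and only illustrate the idea; the notebook's own helpers appear later.

```python
import numpy as np

def toy_rdm(embeddings_by_category):
    """Average embeddings per category, then return (categories, cosine-distance RDM)."""
    categories = sorted(embeddings_by_category)
    # One mean vector per category, stacked into an (n_categories, dim) array
    means = np.array([np.mean(embeddings_by_category[c], axis=0) for c in categories])
    # L2-normalize rows so that the dot product equals cosine similarity
    unit = means / (np.linalg.norm(means, axis=1, keepdims=True) + 1e-10)
    rdm = 1.0 - unit @ unit.T
    np.fill_diagonal(rdm, 0.0)
    return categories, rdm

# Hypothetical 2-D embeddings, two "images" per category
toy = {
    "cat": [np.array([1.0, 0.0]), np.array([1.0, 0.0])],
    "dog": [np.array([0.0, 1.0]), np.array([0.0, 1.0])],
    "cup": [np.array([-1.0, 0.0]), np.array([-1.0, 0.0])],
}
categories, rdm = toy_rdm(toy)
print(categories)
print(np.round(rdm, 3))
```

Cosine distance ranges from 0 (same direction) to 2 (opposite direction), which is why the heatmaps in this notebook use `vmin=0` and `vmax=2`.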

## Prerequisites

Make sure you have the following packages installed:
- `numpy`
- `pandas`
- `scipy`
- `matplotlib`
- `seaborn`
- `scikit-learn`
- `tqdm`
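
A quick way to check the list above before running the notebook is to ask Python whether each module can be found. This is only a sketch: the `REQUIRED` mapping and `find_missing` helper are hypothetical names, and note that `scikit-learn` is imported as `sklearn`.

```python
import importlib.util

# Distribution names from the list above mapped to their import names
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "tqdm": "tqdm",
}

def find_missing(required):
    """Return the distribution names whose top-level module cannot be located."""
    return [dist for dist, module in required.items()
            if importlib.util.find_spec(module) is None]

missing = find_missing(REQUIRED)
print("Missing packages:", missing or "none")
```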

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from tqdm import tqdm
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor
from scipy.stats import pearsonr, spearmanr
from scipy.cluster.hierarchy import linkage, dendrogram, optimal_leaf_ordering
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Use the non-interactive Agg backend: figures are written to disk, not displayed
import matplotlib
matplotlib.use('Agg')

print("All imports successful!")

## Configuration

**Please update the paths below according to your setup:**

In [None]:
# ============================================================================
# CONFIGURATION - UPDATE THESE PATHS FOR YOUR SETUP
# ============================================================================
#
# IMPORTANT: If you already have average embeddings saved as .npz files,
# you can skip Step 1 and start directly from Step 2! Just set CLIP_OUTPUT_DIR
# (or DINOV3_OUTPUT_DIR) to point to the directory containing your 
# category_average_embeddings.npz file.
#
# Example: If your file is at "./clip_rdm_results_26/category_average_embeddings.npz",
# set CLIP_OUTPUT_DIR = "./clip_rdm_results_26"
# ============================================================================

# Paths for CLIP embeddings
# For Step 1: Set these to compute RDMs from individual embeddings
CLIP_EMBEDDING_LIST = None  # Path to text file with CLIP embedding paths (one per line), or None to scan directory
CLIP_EMBEDDINGS_DIR = "/path/to/clip_embeddings"  # Base directory for CLIP embeddings (only needed for Step 1; set to None to skip Step 1.1)

# For Step 2+: Point this to directory containing category_average_embeddings.npz
# Example: "./clip_rdm_results_26" if your .npz file is in that directory
CLIP_OUTPUT_DIR = "./clip_rdm_results"  # Directory containing category_average_embeddings.npz (or output dir for Step 1)

# Paths for DINOv3 embeddings (optional)
# For Step 1: Set these to compute RDMs from individual embeddings
DINOV3_EMBEDDING_LIST = None  # Path to text file with DINOv3 embedding paths, or None to scan directory
DINOV3_EMBEDDINGS_DIR = "/path/to/dinov3_embeddings"  # Base directory for DINOv3 embeddings (only needed for Step 1; set to None to skip Step 1.2)

# For Step 2+: Point this to directory containing category_average_embeddings.npz
# Example: "./dinov3_rdm_results_26" if your .npz file is in that directory
DINOV3_OUTPUT_DIR = "./dinov3_rdm_results"  # Directory containing category_average_embeddings.npz (or output dir for Step 1)
MATCH_FROM_CLIP_LIST = True  # If True and CLIP_EMBEDDING_LIST is set, match DINOv3 filenames from the CLIP list (ensures same images)

# CDI words CSV file (required for category type organization in Step 2)
CDI_PATH = "./data/cdi_words.csv"

# Filtering options (optional, used in Step 2)
EXCLUSION_FILE = None  # Path to text file with categories to exclude (one per line), or None
INCLUSION_FILE = None  # Path to text file with categories to include (one per line), or None
FILTERED_OUTPUT_DIR = "./clip_rdm_results_filtered"  # Output directory for filtered results

# Processing options
NUM_WORKERS = None  # Number of parallel workers (None = auto-detect, max 16)
USE_PARALLEL = True  # Enable parallel loading (only used in Step 1)
USE_CLUSTERING = True  # Enable hierarchical clustering within category groups (used in Step 2)
SAVE_DENDROGRAMS = False  # Save dendrogram plots for each category group (used in Step 2)

# Correlation options
CORRELATE_RDMS = False  # Set to True to correlate RDM matrices (requires both CLIP and DINOv3 results)
CORRELATE_EMBEDDINGS = False  # Set to True to correlate category embeddings

print("Configuration loaded. Please review and update paths as needed.")

## Helper Functions

These functions are used throughout the pipeline.

In [None]:
def load_embedding_paths(txt_path):
    """Load embedding file paths from text file"""
    print(f"Loading embedding paths from {txt_path}...")
    with open(txt_path, 'r') as f:
        paths = [line.strip() for line in f if line.strip()]
    print(f"Found {len(paths)} embedding paths")
    return paths

def scan_embedding_directory(embeddings_dir):
    """Scan directory for all .npy embedding files"""
    embeddings_dir = Path(embeddings_dir)
    print(f"Scanning {embeddings_dir} for .npy files...")
    
    npy_files = list(embeddings_dir.rglob("*.npy"))
    
    if len(npy_files) == 0:
        raise ValueError(f"No .npy files found in {embeddings_dir}")
    
    paths = [str(f.relative_to(embeddings_dir)) for f in npy_files]
    paths.sort()
    
    print(f"Found {len(paths)} embedding files")
    return paths

def match_embedding_paths_from_list(reference_list_path, target_embeddings_dir):
    """Match embedding paths from a reference list to target directory"""
    target_embeddings_dir = Path(target_embeddings_dir)
    
    print(f"Loading reference embedding list from {reference_list_path}...")
    with open(reference_list_path, 'r') as f:
        reference_paths = [line.strip() for line in f if line.strip()]
    
    print(f"Found {len(reference_paths)} reference paths")
    
    reference_mapping = {}
    for ref_path in reference_paths:
        ref_path_obj = Path(ref_path)
        if len(ref_path_obj.parts) >= 2:
            category = ref_path_obj.parts[-2]
            filename = ref_path_obj.name
            reference_mapping[(category, filename)] = ref_path
    
    target_files = {}
    if target_embeddings_dir.exists():
        for npy_file in target_embeddings_dir.rglob("*.npy"):
            rel_path = npy_file.relative_to(target_embeddings_dir)
            if len(rel_path.parts) >= 2:
                # Use the immediate parent folder, matching how the reference list is parsed
                category = rel_path.parts[-2]
                filename = rel_path.name
                if category not in target_files:
                    target_files[category] = {}
                target_files[category][filename] = str(rel_path)
    
    matched_paths = []
    for (category, filename), ref_path in reference_mapping.items():
        if category in target_files and filename in target_files[category]:
            matched_paths.append(target_files[category][filename])
    
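    # Guard against an empty reference list to avoid dividing by zero
    match_pct = len(matched_paths) / len(reference_mapping) * 100 if reference_mapping else 0.0
    print(f"Matched {len(matched_paths)} files ({match_pct:.1f}%)")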
    matched_paths.sort()
    
    return matched_paths

def load_single_embedding(args):
    """Load a single embedding file (worker function for parallel processing)"""
    path, embeddings_dir, is_absolute = args
    
    try:
        if is_absolute:
            full_path = Path(path)
        else:
            full_path = Path(embeddings_dir) / path
        
        if not full_path.exists():
            return None, None
        
        path_parts = full_path.parts
        if len(path_parts) < 2:
            return None, None
        category = path_parts[-2]
        
        embedding = np.load(full_path)
        if embedding.ndim > 1:
            embedding = embedding.flatten()
        
        return category, embedding
    except Exception:
        return None, None

def load_embeddings_by_category(embedding_paths, embeddings_dir, num_workers=None, use_parallel=True):
    """Load embeddings grouped by category"""
    print("Loading embeddings by category...")
    embeddings_by_category = defaultdict(list)
    
    embeddings_dir = Path(embeddings_dir)
    
    if num_workers is None:
        num_workers = min(16, mp.cpu_count())
    
    path_args = []
    for path in embedding_paths:
        is_absolute = Path(path).is_absolute()
        path_args.append((path, str(embeddings_dir), is_absolute))
    
    if use_parallel and len(embedding_paths) > 100:
        print(f"Using {num_workers} parallel workers...")
        print(f"Processing {len(path_args):,} embedding files...")
        
        chunk_size = max(5000, num_workers * 200)
        total_chunks = (len(path_args) + chunk_size - 1) // chunk_size
        
        successful = 0
        failed = 0
        
        with ThreadPoolExecutor(max_workers=num_workers) as executor:
            for chunk_idx in range(0, len(path_args), chunk_size):
                chunk = path_args[chunk_idx:chunk_idx + chunk_size]
                chunk_num = (chunk_idx // chunk_size) + 1
                
                results = list(executor.map(load_single_embedding, chunk))
                
                for category, embedding in results:
                    if category is not None and embedding is not None:
                        embeddings_by_category[category].append(embedding)
                        successful += 1
                    else:
                        failed += 1
                
                if chunk_num % 50 == 0 or chunk_num == total_chunks:
                    progress = (chunk_num / total_chunks) * 100
                    print(f"Progress: {progress:.1f}% ({chunk_num}/{total_chunks} chunks, {successful:,} loaded)")
        
        if failed > 0:
            print(f"\nWarning: Failed to load {failed:,} embeddings")
        print(f"Successfully loaded {successful:,} embeddings")
    else:
        print("Using sequential loading...")
        for args in tqdm(path_args, desc="Loading embeddings"):
            category, embedding = load_single_embedding(args)
            if category is not None and embedding is not None:
                embeddings_by_category[category].append(embedding)
    
    print(f"\nLoaded embeddings for {len(embeddings_by_category)} categories:")
    for category, emb_list in sorted(embeddings_by_category.items()):
        print(f"  {category}: {len(emb_list)} embeddings")
    
    return embeddings_by_category

def compute_category_averages(embeddings_by_category):
    """Compute average embedding for each category"""
    print("\nComputing category average embeddings...")
    category_averages = {}
    categories = []
    
    for category, emb_list in sorted(embeddings_by_category.items()):
        if len(emb_list) == 0:
            continue
        
        emb_array = np.array(emb_list)
        avg_embedding = np.mean(emb_array, axis=0)
        category_averages[category] = avg_embedding
        categories.append(category)
        
        print(f"  {category}: {len(emb_list)} embeddings -> shape {avg_embedding.shape}")
    
    return category_averages, categories

def compute_similarity_matrix(category_averages, categories):
    """Compute pairwise cosine similarity matrix"""
    print("\nComputing cosine similarity matrix...")
    
    embeddings = np.array([category_averages[cat] for cat in categories])
    
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized_embeddings = embeddings / (norms + 1e-10)
    
    similarity_matrix = np.dot(normalized_embeddings, normalized_embeddings.T)
    
    return similarity_matrix

def compute_distance_matrix(similarity_matrix):
    """Compute cosine distance matrix from similarity matrix"""
    print("Computing cosine distance matrix...")
    
    distance_matrix = 1 - similarity_matrix
    np.fill_diagonal(distance_matrix, 0)
    distance_matrix = (distance_matrix + distance_matrix.T) / 2
    
    return distance_matrix

def load_category_types(cdi_path):
    """Load category type information from CDI words CSV"""
    print(f"\nLoading category types from {cdi_path}...")
    cdi_df = pd.read_csv(cdi_path)
    
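    def _flag(row, column):
        # Missing columns and NaN cells count as False (bool(NaN) would be True)
        value = row.get(column, 0)
        return bool(value) if pd.notna(value) else False
    
    category_types = {}
    for _, row in cdi_df.iterrows():
        category_types[row['uni_lemma']] = {
            'is_animate': _flag(row, 'is_animate'),
            'is_bodypart': _flag(row, 'is_bodypart'),
            'is_small': _flag(row, 'is_small'),
            'is_big': _flag(row, 'is_big')
        }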
    
    print(f"Loaded type information for {len(category_types)} categories")
    return category_types

def organize_categories_by_type(categories, category_types):
    """Organize categories into groups; the first matching type wins: animate, bodyparts, small, big, others"""
    print("\nOrganizing categories by type...")
    
    organized = {
        'animate': [],
        'small': [],
        'bodyparts': [],
        'big': [],
        'others': []
    }
    
    for cat in categories:
        if cat not in category_types:
            organized['others'].append(cat)
            continue
        
        types = category_types[cat]
        if types['is_animate']:
            organized['animate'].append(cat)
        elif types['is_bodypart']:
            organized['bodyparts'].append(cat)
        elif types['is_small']:
            organized['small'].append(cat)
        elif types['is_big']:
            organized['big'].append(cat)
        else:
            organized['others'].append(cat)
    
    for key in organized:
        organized[key] = sorted(organized[key])
        print(f"  {key}: {len(organized[key])} categories")
    
    return organized

print("Helper functions loaded!")

## Step 1: Compute RDM Matrices

This step averages the embeddings within each category, then computes pairwise cosine similarity and cosine distance (RDM) matrices from the category averages. You can run it for CLIP, DINOv3, or both.
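
Step 1 also saves the category averages to `category_average_embeddings.npz` with two arrays, `embeddings` and `categories`, and Step 2 reads the same file back. If you are supplying your own file, the sketch below shows that layout with made-up data written to a temporary directory; the category names and the 8-dimensional vectors are placeholders.

```python
import tempfile
from pathlib import Path

import numpy as np

# Placeholder data: 3 categories with 8-dimensional average embeddings
categories = ["apple", "ball", "cat"]
embeddings = np.random.default_rng(0).normal(size=(3, 8))

with tempfile.TemporaryDirectory() as tmp:
    npz_path = Path(tmp) / "category_average_embeddings.npz"
    # Row i of `embeddings` must correspond to categories[i]
    np.savez(npz_path, embeddings=embeddings, categories=np.array(categories))

    # Read it back the same way the filtering step does
    with np.load(npz_path) as data:
        loaded_embeddings = data["embeddings"]
        loaded_categories = [str(c) for c in data["categories"]]

print(loaded_embeddings.shape)
print(loaded_categories)
```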

### 1.1 Compute CLIP RDM

In [None]:
# Compute CLIP RDM
if CLIP_EMBEDDINGS_DIR:
    print("="*60)
    print("COMPUTING CLIP RDM")
    print("="*60)
    
    # Load embedding paths
    if CLIP_EMBEDDING_LIST:
        embedding_paths = load_embedding_paths(CLIP_EMBEDDING_LIST)
    else:
        embedding_paths = scan_embedding_directory(CLIP_EMBEDDINGS_DIR)
    
    # Load embeddings by category
    embeddings_by_category = load_embeddings_by_category(
        embedding_paths, 
        CLIP_EMBEDDINGS_DIR,
        num_workers=NUM_WORKERS,
        use_parallel=USE_PARALLEL
    )
    
    # Compute category averages
    category_averages, categories = compute_category_averages(embeddings_by_category)
    
    # Save category averages
    output_dir = Path(CLIP_OUTPUT_DIR)
    output_dir.mkdir(exist_ok=True, parents=True)
    
    print("\nSaving category average embeddings...")
    embeddings = np.array([category_averages[cat] for cat in categories])
    npz_path = output_dir / 'category_average_embeddings.npz'
    np.savez(npz_path, 
             embeddings=embeddings, 
             categories=np.array(categories))
    print(f"  Saved to {npz_path}")
    
    # Compute similarity and distance matrices
    similarity_matrix = compute_similarity_matrix(category_averages, categories)
    distance_matrix = compute_distance_matrix(similarity_matrix)
    
    # Save matrices
    print("\nSaving data files...")
    np.save(output_dir / 'similarity_matrix.npy', similarity_matrix)
    np.save(output_dir / 'distance_matrix.npy', distance_matrix)
    
    sim_df = pd.DataFrame(similarity_matrix, index=categories, columns=categories)
    sim_df.to_csv(output_dir / 'similarity_matrix.csv')
    
    dist_df = pd.DataFrame(distance_matrix, index=categories, columns=categories)
    dist_df.to_csv(output_dir / 'distance_matrix.csv')
    
    # Load category types and create organized RDM
    cdi_path = Path(CDI_PATH)
    if cdi_path.exists():
        category_types = load_category_types(cdi_path)
        organized_categories = organize_categories_by_type(categories, category_types)
        
        # Create ordered list
        ordered_categories = (
            organized_categories['animate'] +
            organized_categories['bodyparts'] +
            organized_categories['small'] +
            organized_categories['big'] +
            organized_categories['others']
        )
        
        # Reorder matrices
        cat_to_idx = {cat: idx for idx, cat in enumerate(categories)}
        ordered_indices = [cat_to_idx[cat] for cat in ordered_categories if cat in cat_to_idx]
        reordered_matrix = distance_matrix[np.ix_(ordered_indices, ordered_indices)]
        reordered_categories = [cat for cat in ordered_categories if cat in cat_to_idx]
        
        # Create heatmap
        n_categories = len(reordered_categories)
        fig_size = max(20, n_categories * 0.5)
        
        plt.figure(figsize=(fig_size, fig_size))
        ax = sns.heatmap(reordered_matrix, 
                    xticklabels=reordered_categories,
                    yticklabels=reordered_categories,
                    cmap='RdYlBu_r',
                    vmin=0,
                    vmax=2,
                    square=True,
                    cbar_kws={'label': 'Cosine Distance', 'shrink': 0.8})
        
        plt.title('CLIP Category RDM (Organized by Type)', fontsize=24, pad=20)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.yticks(rotation=0, fontsize=10)
        plt.tight_layout()
        plt.savefig(output_dir / 'rdm_organized_by_type.png', dpi=300, bbox_inches='tight')
        plt.close()
        print(f"Saved organized RDM to {output_dir / 'rdm_organized_by_type.png'}")
    
    # Create full RDM
    n_categories = len(categories)
    fig_size = max(20, n_categories * 0.5)
    
    plt.figure(figsize=(fig_size, fig_size))
    ax = sns.heatmap(distance_matrix, 
                xticklabels=categories,
                yticklabels=categories,
                cmap='RdYlBu_r',
                vmin=0,
                vmax=2,
                square=True,
                cbar_kws={'label': 'Cosine Distance', 'shrink': 0.8})
    
    plt.title('CLIP Category RDM (Full)', fontsize=24, pad=20)
    plt.xticks(rotation=45, ha='right', fontsize=10)
    plt.yticks(rotation=0, fontsize=10)
    plt.tight_layout()
    plt.savefig(output_dir / 'rdm_full.png', dpi=300, bbox_inches='tight')
    plt.close()
    print(f"Saved full RDM to {output_dir / 'rdm_full.png'}")
    
    print(f"\nCLIP RDM computation complete! Results saved to {output_dir}")
    print(f"Total categories: {len(categories)}")
    print(f"Mean similarity: {similarity_matrix.mean():.4f}")
    print(f"Mean distance: {distance_matrix.mean():.4f}")
else:
    print("Skipping CLIP RDM computation (CLIP_EMBEDDINGS_DIR not set)")

### 1.2 Compute DINOv3 RDM

In [None]:
# Compute DINOv3 RDM
if DINOV3_EMBEDDINGS_DIR:
    print("="*60)
    print("COMPUTING DINOv3 RDM")
    print("="*60)
    
    # Load embedding paths
    if MATCH_FROM_CLIP_LIST and CLIP_EMBEDDING_LIST:
        embedding_paths = match_embedding_paths_from_list(CLIP_EMBEDDING_LIST, DINOV3_EMBEDDINGS_DIR)
    elif DINOV3_EMBEDDING_LIST:
        embedding_paths = load_embedding_paths(DINOV3_EMBEDDING_LIST)
    else:
        embedding_paths = scan_embedding_directory(DINOV3_EMBEDDINGS_DIR)
    
    # Load embeddings by category
    embeddings_by_category = load_embeddings_by_category(
        embedding_paths, 
        DINOV3_EMBEDDINGS_DIR,
        num_workers=NUM_WORKERS,
        use_parallel=USE_PARALLEL
    )
    
    # Compute category averages
    category_averages, categories = compute_category_averages(embeddings_by_category)
    
    # Save category averages
    output_dir = Path(DINOV3_OUTPUT_DIR)
    output_dir.mkdir(exist_ok=True, parents=True)
    
    print("\nSaving category average embeddings...")
    embeddings = np.array([category_averages[cat] for cat in categories])
    npz_path = output_dir / 'category_average_embeddings.npz'
    np.savez(npz_path, 
             embeddings=embeddings, 
             categories=np.array(categories))
    print(f"  Saved to {npz_path}")
    
    # Compute similarity and distance matrices
    similarity_matrix = compute_similarity_matrix(category_averages, categories)
    distance_matrix = compute_distance_matrix(similarity_matrix)
    
    # Save matrices
    print("\nSaving data files...")
    np.save(output_dir / 'similarity_matrix.npy', similarity_matrix)
    np.save(output_dir / 'distance_matrix.npy', distance_matrix)
    
    sim_df = pd.DataFrame(similarity_matrix, index=categories, columns=categories)
    sim_df.to_csv(output_dir / 'similarity_matrix.csv')
    
    dist_df = pd.DataFrame(distance_matrix, index=categories, columns=categories)
    dist_df.to_csv(output_dir / 'distance_matrix.csv')
    
    # Load category types and create organized RDM
    cdi_path = Path(CDI_PATH)
    if cdi_path.exists():
        category_types = load_category_types(cdi_path)
        organized_categories = organize_categories_by_type(categories, category_types)
        
        # Create ordered list
        ordered_categories = (
            organized_categories['animate'] +
            organized_categories['bodyparts'] +
            organized_categories['small'] +
            organized_categories['big'] +
            organized_categories['others']
        )
        
        # Reorder matrices
        cat_to_idx = {cat: idx for idx, cat in enumerate(categories)}
        ordered_indices = [cat_to_idx[cat] for cat in ordered_categories if cat in cat_to_idx]
        reordered_matrix = distance_matrix[np.ix_(ordered_indices, ordered_indices)]
        reordered_categories = [cat for cat in ordered_categories if cat in cat_to_idx]
        
        # Create heatmap
        n_categories = len(reordered_categories)
        fig_size = max(20, n_categories * 0.5)
        
        plt.figure(figsize=(fig_size, fig_size))
        ax = sns.heatmap(reordered_matrix, 
                    xticklabels=reordered_categories,
                    yticklabels=reordered_categories,
                    cmap='RdYlBu_r',
                    vmin=0,
                    vmax=2,
                    square=True,
                    cbar_kws={'label': 'Cosine Distance', 'shrink': 0.8})
        
        plt.title('DINOv3 Category RDM (Organized by Type)', fontsize=24, pad=20)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.yticks(rotation=0, fontsize=10)
        plt.tight_layout()
        plt.savefig(output_dir / 'rdm_organized_by_type.png', dpi=300, bbox_inches='tight')
        plt.close()
        print(f"Saved organized RDM to {output_dir / 'rdm_organized_by_type.png'}")
    
    # Create full RDM
    n_categories = len(categories)
    fig_size = max(20, n_categories * 0.5)
    
    plt.figure(figsize=(fig_size, fig_size))
    ax = sns.heatmap(distance_matrix, 
                xticklabels=categories,
                yticklabels=categories,
                cmap='RdYlBu_r',
                vmin=0,
                vmax=2,
                square=True,
                cbar_kws={'label': 'Cosine Distance', 'shrink': 0.8})
    
    plt.title('DINOv3 Category RDM (Full)', fontsize=24, pad=20)
    plt.xticks(rotation=45, ha='right', fontsize=10)
    plt.yticks(rotation=0, fontsize=10)
    plt.tight_layout()
    plt.savefig(output_dir / 'rdm_full.png', dpi=300, bbox_inches='tight')
    plt.close()
    print(f"Saved full RDM to {output_dir / 'rdm_full.png'}")
    
    print(f"\nDINOv3 RDM computation complete! Results saved to {output_dir}")
    print(f"Total categories: {len(categories)}")
    print(f"Mean similarity: {similarity_matrix.mean():.4f}")
    print(f"Mean distance: {distance_matrix.mean():.4f}")
else:
    print("Skipping DINOv3 RDM computation (DINOV3_EMBEDDINGS_DIR not set)")

## Step 2: Filter and Reorganize RDMs

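This step filters out low-quality categories (via `INCLUSION_FILE` or `EXCLUSION_FILE`) and reorganizes the RDM by category type, with optional hierarchical clustering inside each type group. Unlike Step 1, the code in this step z-scores each embedding dimension across categories before computing cosine similarity, so the filtered distances are not directly comparable to the Step 1 matrices.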
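
The within-group clustering can be sketched on toy data as follows: z-score the embeddings, compute cosine distances, build a linkage tree, and read off the leaf order. This is an illustration, not the notebook's exact code. The helper name `order_by_clustering` and the toy vectors are hypothetical, and the sketch uses average linkage, whereas the notebook cell uses Ward linkage (which SciPy documents as correctly defined only for Euclidean distances).

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage, optimal_leaf_ordering
from scipy.spatial.distance import pdist

def order_by_clustering(names, embeddings):
    """Return names reordered so that similar categories sit next to each other."""
    if len(names) <= 2:
        return list(names)  # nothing meaningful to reorder
    # z-score each dimension across the categories in this group
    z = (embeddings - embeddings.mean(axis=0)) / (embeddings.std(axis=0) + 1e-10)
    condensed = pdist(z, metric="cosine")          # condensed cosine distances
    tree = linkage(condensed, method="average")    # average linkage (swapped in for Ward)
    tree = optimal_leaf_ordering(tree, condensed)  # reorder leaves so neighbors are close
    return [names[i] for i in leaves_list(tree)]

# Two tight pairs (a1, a2) and (b1, b2), deliberately interleaved
names = ["a1", "b1", "a2", "b2"]
embeddings = np.array([
    [1.0, 0.0, 0.1],
    [0.0, 1.0, 0.9],
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 1.0],
])
order = order_by_clustering(names, embeddings)
print(order)
```

In the printed order, each pair should end up adjacent, which is what makes block structure visible in the reorganized heatmap.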

In [None]:
# Filter and reorganize CLIP RDM
if EXCLUSION_FILE or INCLUSION_FILE:
    print("="*60)
    print("FILTERING AND REORGANIZING CLIP RDM")
    print("="*60)
    
    # Load category averages
    npz_path = Path(CLIP_OUTPUT_DIR) / 'category_average_embeddings.npz'
    if not npz_path.exists():
        print(f"Error: {npz_path} not found. Please run Step 1 first.")
    else:
        print(f"Loading category averages from {npz_path}...")
        data = np.load(npz_path)
        embeddings = data['embeddings']
        categories = [str(cat) for cat in data['categories']]
        
        # Load exclusion/inclusion lists
        if INCLUSION_FILE:
            print(f"Loading included categories from {INCLUSION_FILE}...")
            with open(INCLUSION_FILE, 'r') as f:
                included_categories = set(line.strip() for line in f if line.strip())
            print(f"Found {len(included_categories)} categories to include")
            excluded_categories = set()
        else:
            print(f"Loading excluded categories from {EXCLUSION_FILE}...")
            with open(EXCLUSION_FILE, 'r') as f:
                excluded_categories = set(line.strip() for line in f if line.strip())
            print(f"Found {len(excluded_categories)} categories to exclude")
            included_categories = None
        
        # Filter categories
        if included_categories is not None:
            filtered_indices = [i for i, cat in enumerate(categories) if cat in included_categories]
            filtered_categories = [categories[i] for i in filtered_indices]
            filtered_embeddings = embeddings[filtered_indices]
            print(f"After filtering: {len(filtered_categories)} categories")
        else:
            filtered_indices = [i for i, cat in enumerate(categories) if cat not in excluded_categories]
            filtered_categories = [categories[i] for i in filtered_indices]
            filtered_embeddings = embeddings[filtered_indices]
            print(f"After filtering: {len(filtered_categories)} categories (excluded {len(categories) - len(filtered_categories)})")
        
        # Load category types
        cdi_path = Path(CDI_PATH)
        if cdi_path.exists():
            category_types = load_category_types(cdi_path)
            
            # Organize by type
            organized = {
                'animals': [],
                'bodyparts': [],
                'big_objects': [],
                'small_objects': [],
                'others': []
            }
            
            cat_to_embedding = {cat: emb for cat, emb in zip(filtered_categories, filtered_embeddings)}
            
            for cat in filtered_categories:
                if cat not in category_types:
                    organized['others'].append(cat)
                    continue
                
                types = category_types[cat]
                if types['is_animate']:
                    organized['animals'].append(cat)
                elif types['is_bodypart']:
                    organized['bodyparts'].append(cat)
                elif types['is_big']:
                    organized['big_objects'].append(cat)
                elif types['is_small']:
                    organized['small_objects'].append(cat)
                else:
                    organized['others'].append(cat)
            
            # Optional hierarchical clustering within groups
            if USE_CLUSTERING:
                def cluster_categories_within_group(group_categories, cat_to_embedding):
                    if len(group_categories) <= 1:
                        return group_categories
                    
                    group_embeddings = np.array([cat_to_embedding[cat] for cat in group_categories])
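                    # z-score each embedding dimension across the categories in this group
                    normalized_embeddings = (group_embeddings - group_embeddings.mean(axis=0)) / (group_embeddings.std(axis=0) + 1e-10)
                    similarity_matrix = cosine_similarity(normalized_embeddings)
                    distance_matrix = 1 - similarity_matrix
                    # squareform expects an exactly symmetric matrix with a zero diagonal
                    distance_matrix = (distance_matrix + distance_matrix.T) / 2
                    np.fill_diagonal(distance_matrix, 0)
                    
                    condensed_distances = squareform(distance_matrix)
                    # Note: Ward linkage formally assumes Euclidean distances; here it is applied to cosine distances
                    linkage_matrix = linkage(condensed_distances, method='ward')
                    
                    try:
                        linkage_matrix = optimal_leaf_ordering(linkage_matrix, condensed_distances)
                    except Exception:
                        # Fall back to the unoptimized leaf order
                        pass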
                    
                    dendro_dict = dendrogram(linkage_matrix, no_plot=True)
                    leaf_order = dendro_dict['leaves']
                    
                    return [group_categories[i] for i in leaf_order]
                
                for key in organized:
                    if len(organized[key]) > 1:
                        organized[key] = cluster_categories_within_group(organized[key], cat_to_embedding)
            else:
                for key in organized:
                    organized[key] = sorted(organized[key])
            
            # Create ordered list
            ordered_categories = (
                organized['animals'] +
                organized['bodyparts'] +
                organized['big_objects'] +
                organized['small_objects'] +
                organized['others']
            )
            
            ordered_embeddings = np.array([cat_to_embedding[cat] for cat in ordered_categories])
        else:
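            # Sort names and embeddings together so each row stays aligned with its label
            sort_order = sorted(range(len(filtered_categories)), key=lambda i: filtered_categories[i])
            ordered_categories = [filtered_categories[i] for i in sort_order]
            ordered_embeddings = filtered_embeddings[sort_order]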
            organized = {'animals': [], 'bodyparts': [], 'big_objects': [], 'small_objects': [], 'others': ordered_categories}
        
        # Compute similarity and distance matrices
        normalized_embeddings = (ordered_embeddings - ordered_embeddings.mean(axis=0)) / (ordered_embeddings.std(axis=0) + 1e-10)
        similarity_matrix = cosine_similarity(normalized_embeddings)
        distance_matrix = 1 - similarity_matrix
        np.fill_diagonal(distance_matrix, 0)
        distance_matrix = (distance_matrix + distance_matrix.T) / 2
        
        # Save filtered data
        output_dir = Path(FILTERED_OUTPUT_DIR)
        output_dir.mkdir(exist_ok=True, parents=True)
        
        print("\nSaving filtered data files...")
        np.save(output_dir / 'similarity_matrix_filtered.npy', similarity_matrix)
        np.save(output_dir / 'distance_matrix_filtered.npy', distance_matrix)
        
        sim_df = pd.DataFrame(similarity_matrix, index=ordered_categories, columns=ordered_categories)
        sim_df.to_csv(output_dir / 'similarity_matrix_filtered.csv')
        
        dist_df = pd.DataFrame(distance_matrix, index=ordered_categories, columns=ordered_categories)
        dist_df.to_csv(output_dir / 'distance_matrix_filtered.csv')
        
        # Create organized RDM heatmap
        n_categories = len(ordered_categories)
        fig_size = max(20, n_categories * 0.5)
        
        plt.figure(figsize=(fig_size, fig_size))
        ax = sns.heatmap(distance_matrix, 
                    xticklabels=ordered_categories,
                    yticklabels=ordered_categories,
                    cmap='viridis',
                    vmin=0,
                    vmax=2,
                    square=True,
                    cbar_kws={'label': 'Distance (1 - Cosine Similarity)', 'shrink': 0.8})
        
        plt.title('CLIP Category RDM (Filtered and Organized)', fontsize=24, pad=20)
        plt.xticks(rotation=45, ha='right', fontsize=8)
        plt.yticks(rotation=0, fontsize=8)
        plt.tight_layout()
        plt.savefig(output_dir / 'rdm_organized_filtered.png', dpi=300, bbox_inches='tight')
        plt.close()
        print(f"Saved filtered RDM to {output_dir / 'rdm_organized_filtered.png'}")
        
        print(f"\nFiltering complete! Results saved to {output_dir}")
        print(f"Original categories: {len(categories)}")
        print(f"Filtered categories: {len(ordered_categories)}")
        print(f"Mean distance: {distance_matrix.mean():.4f}")
else:
    print("Skipping filtering step (EXCLUSION_FILE and INCLUSION_FILE not set)")

## Step 3: Correlate RDM Matrices

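This step correlates two RDMs (e.g., CLIP vs DINOv3) using Pearson and Spearman correlations computed over the lower triangle of each matrix, excluding the diagonal. If the two matrices differ in shape, they are first matched on their shared category names.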
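
As a minimal sketch of the comparison this step performs, the snippet below aligns two labeled RDMs on their shared category names, vectorizes the lower triangle (diagonal excluded), and computes both correlations. The helper name `correlate_labeled_rdms` and the toy category labels are illustrative assumptions, not part of the pipeline.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr


def correlate_labeled_rdms(rdm_a: pd.DataFrame, rdm_b: pd.DataFrame) -> dict:
    """Correlate two labeled RDMs over their shared categories (hypothetical helper)."""
    # Align both matrices on the same categories in the same order
    common = sorted(set(rdm_a.index) & set(rdm_b.index))
    if len(common) < 3:
        raise ValueError("Need at least 3 shared categories to correlate RDMs")
    a = rdm_a.loc[common, common].to_numpy(dtype=float)
    b = rdm_b.loc[common, common].to_numpy(dtype=float)

    # Lower triangle without the diagonal: each category pair is counted once
    rows, cols = np.tril_indices_from(a, k=-1)
    vec_a, vec_b = a[rows, cols], b[rows, cols]

    # Drop pairs where either RDM has a NaN/Inf entry
    mask = np.isfinite(vec_a) & np.isfinite(vec_b)
    vec_a, vec_b = vec_a[mask], vec_b[mask]

    pearson_r, _ = pearsonr(vec_a, vec_b)
    spearman_r, _ = spearmanr(vec_a, vec_b)
    return {"n_categories": len(common), "n_pairs": int(mask.sum()),
            "pearson_r": float(pearson_r), "spearman_r": float(spearman_r)}


# Toy data (assumed labels): the second RDM has an extra category and a different order
cats_a = ["cat", "dog", "car", "bus"]
dist_a = np.array([[0.0, 0.2, 0.9, 0.8],
                   [0.2, 0.0, 0.7, 0.95],
                   [0.9, 0.7, 0.0, 0.1],
                   [0.8, 0.95, 0.1, 0.0]])
rdm_a = pd.DataFrame(dist_a, index=cats_a, columns=cats_a)

cats_b = ["bus", "car", "tree", "dog", "cat"]
rdm_b = pd.DataFrame(np.zeros((5, 5)), index=cats_b, columns=cats_b)
# Fill the shared entries of rdm_b with a monotonic transform of rdm_a
for r in cats_a:
    for c in cats_a:
        rdm_b.loc[r, c] = rdm_a.loc[r, c] ** 2

result = correlate_labeled_rdms(rdm_a, rdm_b)
print(result)
```

Aligning on labels before vectorizing guards against the case where two matrices have the same shape but list their categories in a different order.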

In [None]:
# Correlate RDM matrices
if CORRELATE_RDMS:
    print("="*60)
    print("CORRELATING RDM MATRICES")
    print("="*60)
    
    # Determine which RDM files to use (filtered if available, otherwise full)
    if Path(FILTERED_OUTPUT_DIR).exists() and (Path(FILTERED_OUTPUT_DIR) / 'distance_matrix_filtered.npy').exists():
        rdm1_path = Path(FILTERED_OUTPUT_DIR) / 'distance_matrix_filtered.npy'
        rdm2_path = Path(DINOV3_OUTPUT_DIR) / 'distance_matrix.npy'  # Assuming DINOv3 not filtered
        print("Using filtered CLIP RDM and full DINOv3 RDM")
    else:
        rdm1_path = Path(CLIP_OUTPUT_DIR) / 'distance_matrix.npy'
        rdm2_path = Path(DINOV3_OUTPUT_DIR) / 'distance_matrix.npy'
        print("Using full CLIP and DINOv3 RDMs")
    
    if not rdm1_path.exists():
        print(f"Error: {rdm1_path} not found. Please run Step 1 first.")
    elif not rdm2_path.exists():
        print(f"Error: {rdm2_path} not found. Please run Step 1.2 first.")
    else:
        # Load matrices
        print(f"Loading RDM 1 from {rdm1_path}...")
        matrix1 = np.load(rdm1_path)
        print(f"  Shape: {matrix1.shape}")
        
        print(f"Loading RDM 2 from {rdm2_path}...")
        matrix2 = np.load(rdm2_path)
        print(f"  Shape: {matrix2.shape}")
        
        if matrix1.shape != matrix2.shape:
            print(f"Warning: Matrices have different shapes: {matrix1.shape} vs {matrix2.shape}")
            print("  Attempting to match by category names...")
            
            # Load the labeled CSV saved next to each .npy file, so the labels
            # always belong to the matrix that was actually loaded above
            # (the filtered directory can exist without the filtered RDM in it)
            cat1_df = pd.read_csv(rdm1_path.with_suffix('.csv'), index_col=0)
            cat2_df = pd.read_csv(rdm2_path.with_suffix('.csv'), index_col=0)
            
            common_cats = sorted(set(cat1_df.index) & set(cat2_df.index))
            print(f"  Found {len(common_cats)} common categories")
            
            if len(common_cats) > 0:
                matrix1 = cat1_df.loc[common_cats, common_cats].values
                matrix2 = cat2_df.loc[common_cats, common_cats].values
                print(f"  Matched matrices to shape: {matrix1.shape}")
            else:
                raise ValueError("No common categories found between matrices")
        
        # Extract lower triangle (excluding diagonal)
        vec1 = matrix1[np.tril_indices_from(matrix1, k=-1)]
        vec2 = matrix2[np.tril_indices_from(matrix2, k=-1)]
        
        print(f"\nExtracted lower triangle: {len(vec1)} elements")
        
        # Remove NaN/Inf values
        mask = np.isfinite(vec1) & np.isfinite(vec2)
        vec1_clean = vec1[mask]
        vec2_clean = vec2[mask]
        
        print(f"Valid elements: {len(vec1_clean)} / {len(vec1)}")
        
        # Compute correlations
        pearson_r, pearson_p = pearsonr(vec1_clean, vec2_clean)
        spearman_r, spearman_p = spearmanr(vec1_clean, vec2_clean)
        
        # Print results
        print("\n" + "="*60)
        print("CORRELATION RESULTS")
        print("="*60)
        print(f"Pearson r:  {pearson_r:.6f} (p = {pearson_p:.2e})")
        print(f"Spearman r: {spearman_r:.6f} (p = {spearman_p:.2e})")
        print(f"\nMatrix 1 stats: Mean={vec1_clean.mean():.6f}, Std={vec1_clean.std():.6f}")
        print(f"Matrix 2 stats: Mean={vec2_clean.mean():.6f}, Std={vec2_clean.std():.6f}")
else:
    print("Skipping RDM correlation (CORRELATE_RDMS = False)")

## Step 4: Correlate Category Embeddings

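This step compares the category-level average embeddings stored in two embedding files. For each category present in both files, it computes the Pearson correlation, Spearman correlation, and cosine similarity between the two embedding vectors. If the embedding dimensions differ, both are truncated to the smaller dimension; because the comparison is element-wise, it assumes that each dimension index means the same thing in both files.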
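
The per-category comparison can be sketched as follows, assuming toy embeddings and a hypothetical helper named `compare_category_vectors`. Cosine similarity is computed with plain NumPy here rather than `sklearn`'s `cosine_similarity`, purely to keep the example small.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def compare_category_vectors(vec1, vec2) -> dict:
    """Pearson, Spearman, and cosine similarity for one category (hypothetical helper)."""
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    # Keep only dimensions that are finite in both vectors
    mask = np.isfinite(vec1) & np.isfinite(vec2)
    v1, v2 = vec1[mask], vec2[mask]

    result = {"pearson_r": np.nan, "spearman_r": np.nan, "cosine_similarity": np.nan}
    # Correlations need at least 3 valid dimensions, as in the notebook cell
    if v1.size >= 3:
        result["pearson_r"] = float(pearsonr(v1, v2)[0])
        result["spearman_r"] = float(spearmanr(v1, v2)[0])
    # Cosine similarity is undefined when either vector has zero norm
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm > 0:
        result["cosine_similarity"] = float(np.dot(v1, v2) / norm)
    return result


# Toy embeddings (assumed values) for two files that share only the category "dog"
categories1 = ["cat", "dog"]
embeddings1 = np.array([[1.0, 0.0, 2.0, 1.0],
                        [0.5, 1.5, -1.0, 2.0]])
categories2 = ["dog", "bird"]
embeddings2 = np.array([[1.0, 3.0, -2.0, 4.0],
                        [0.0, 1.0, 0.0, 1.0]])

# Map each category name to its row, then compare rows for shared names only
idx1 = {cat: i for i, cat in enumerate(categories1)}
idx2 = {cat: i for i, cat in enumerate(categories2)}
matching = sorted(set(idx1) & set(idx2))

per_category = {
    cat: compare_category_vectors(embeddings1[idx1[cat]], embeddings2[idx2[cat]])
    for cat in matching
}
print(per_category)
```

In this toy case the second "dog" vector is exactly twice the first, so all three measures equal 1 up to floating-point error.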

In [None]:
# Correlate category embeddings
if CORRELATE_EMBEDDINGS:
    print("="*60)
    print("CORRELATING CATEGORY EMBEDDINGS")
    print("="*60)
    
    embeddings1_path = Path(CLIP_OUTPUT_DIR) / 'category_average_embeddings.npz'
    embeddings2_path = Path(DINOV3_OUTPUT_DIR) / 'category_average_embeddings.npz'
    
    if not embeddings1_path.exists():
        print(f"Error: {embeddings1_path} not found. Please run Step 1 first.")
    elif not embeddings2_path.exists():
        print(f"Error: {embeddings2_path} not found. Please run Step 1.2 first.")
    else:
        # Load embeddings
        print(f"Loading embeddings 1 from {embeddings1_path}...")
        data1 = np.load(embeddings1_path)
        embeddings1 = data1['embeddings']
        categories1 = [str(cat) for cat in data1['categories']]
        print(f"  Categories: {len(categories1)}, Embedding dim: {embeddings1.shape[1]}")
        
        print(f"Loading embeddings 2 from {embeddings2_path}...")
        data2 = np.load(embeddings2_path)
        embeddings2 = data2['embeddings']
        categories2 = [str(cat) for cat in data2['categories']]
        print(f"  Categories: {len(categories2)}, Embedding dim: {embeddings2.shape[1]}")
        
        # Check embedding dimensions
        if embeddings1.shape[1] != embeddings2.shape[1]:
            print(f"Warning: Embedding dimensions differ: {embeddings1.shape[1]} vs {embeddings2.shape[1]}")
            min_dim = min(embeddings1.shape[1], embeddings2.shape[1])
            embeddings1 = embeddings1[:, :min_dim]
            embeddings2 = embeddings2[:, :min_dim]
            print(f"  Using first {min_dim} dimensions")
        
        # Find matching categories
        categories1_set = set(categories1)
        categories2_set = set(categories2)
        matching_categories = sorted(categories1_set & categories2_set)
        
        print(f"\nMatching categories: {len(matching_categories)}")
        
        if len(matching_categories) == 0:
            print("Error: No matching categories found!")
        else:
            # Create mapping from category to index
            cat_to_idx1 = {cat: idx for idx, cat in enumerate(categories1)}
            cat_to_idx2 = {cat: idx for idx, cat in enumerate(categories2)}
            
            # Compute correlations for each matching category
            per_category_results = []
            all_pearson_rs = []
            all_spearman_rs = []
            all_cosine_sims = []
            
            for cat in matching_categories:
                idx1 = cat_to_idx1[cat]
                idx2 = cat_to_idx2[cat]
                
                vec1 = embeddings1[idx1]
                vec2 = embeddings2[idx2]
                
                # Remove NaN/Inf
                mask = np.isfinite(vec1) & np.isfinite(vec2)
                vec1_clean = vec1[mask]
                vec2_clean = vec2[mask]
                
                if len(vec1_clean) >= 3:
                    pearson_r, pearson_p = pearsonr(vec1_clean, vec2_clean)
                    spearman_r, spearman_p = spearmanr(vec1_clean, vec2_clean)
                else:
                    pearson_r, pearson_p = np.nan, np.nan
                    spearman_r, spearman_p = np.nan, np.nan
                
                # Cosine similarity
                if len(vec1_clean) > 0:
                    vec1_2d = vec1_clean.reshape(1, -1)
                    vec2_2d = vec2_clean.reshape(1, -1)
                    cosine_sim = cosine_similarity(vec1_2d, vec2_2d)[0, 0]
                else:
                    cosine_sim = np.nan
                
                per_category_results.append({
                    'category': cat,
                    'pearson_r': pearson_r,
                    'spearman_r': spearman_r,
                    'cosine_similarity': cosine_sim
                })
                
                if not np.isnan(pearson_r):
                    all_pearson_rs.append(pearson_r)
                if not np.isnan(spearman_r):
                    all_spearman_rs.append(spearman_r)
                if not np.isnan(cosine_sim):
                    all_cosine_sims.append(cosine_sim)
            
            # Summary statistics
            print("\n" + "="*60)
            print("SUMMARY STATISTICS")
            print("="*60)
            print(f"Categories analyzed: {len(matching_categories)}")
            print(f"\nPearson Correlation:")
            print(f"  Mean:   {np.nanmean(all_pearson_rs):.6f}")
            print(f"  Std:    {np.nanstd(all_pearson_rs):.6f}")
            print(f"  Median: {np.nanmedian(all_pearson_rs):.6f}")
            print(f"  Min:    {np.nanmin(all_pearson_rs):.6f}")
            print(f"  Max:    {np.nanmax(all_pearson_rs):.6f}")
            print(f"\nSpearman Correlation:")
            print(f"  Mean:   {np.nanmean(all_spearman_rs):.6f}")
            print(f"  Std:    {np.nanstd(all_spearman_rs):.6f}")
            print(f"  Median: {np.nanmedian(all_spearman_rs):.6f}")
            print(f"\nCosine Similarity:")
            print(f"  Mean:   {np.nanmean(all_cosine_sims):.6f}")
            print(f"  Std:    {np.nanstd(all_cosine_sims):.6f}")
            print(f"  Median: {np.nanmedian(all_cosine_sims):.6f}")
            
            # Top and bottom categories
            sorted_results = sorted(per_category_results, 
                                  key=lambda x: x['pearson_r'] if not np.isnan(x['pearson_r']) else -np.inf, 
                                  reverse=True)
            
            print(f"\n\nTop 10 categories by Pearson correlation:")
            for i, result in enumerate(sorted_results[:10], 1):
                print(f"  {i:2d}. {result['category']:<30} r={result['pearson_r']:.6f}, cos={result['cosine_similarity']:.6f}")
            
            print(f"\nBottom 10 categories by Pearson correlation:")
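            # Derive the starting rank from the actual slice length so the
            # numbering stays correct when fewer than 10 categories matched
            bottom_results = sorted_results[-10:]
            for i, result in enumerate(bottom_results, len(sorted_results) - len(bottom_results) + 1):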
                print(f"  {i:2d}. {result['category']:<30} r={result['pearson_r']:.6f}, cos={result['cosine_similarity']:.6f}")
else:
    print("Skipping category embedding correlation (CORRELATE_EMBEDDINGS = False)")

## Summary

This notebook provides a complete pipeline for RDM analysis. The main steps are:

1. **Compute RDM Matrices** - Computes pairwise distance matrices from embeddings
2. **Filter and Reorganize** - Filters categories and reorganizes by type
3. **Correlate RDMs** - Compares RDM matrices between models
4. **Correlate Embeddings** - Compares category-level embeddings

Results are saved to the configured output directories, which contain:
- Category average embeddings (`.npz` and `.csv`)
- Similarity and distance matrices (`.npy` and `.csv`)
- RDM heatmap visualizations (`.png`)

The correlation results from Steps 3 and 4 are printed to the console rather than saved to disk.
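
When reading the saved matrices back, note that the `.npy` files hold only the values, while the `.csv` files also carry the category labels. The sketch below round-trips a toy matrix through both formats in a temporary directory; the category names and values are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-in for a pipeline output directory (assumed contents)
categories = ["cat", "dog", "car"]
distance_matrix = np.array([[0.0, 0.2, 0.9],
                            [0.2, 0.0, 0.7],
                            [0.9, 0.7, 0.0]])

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp)
    # Write the matrix in the same two formats the notebook uses
    np.save(output_dir / "distance_matrix.npy", distance_matrix)
    pd.DataFrame(distance_matrix, index=categories, columns=categories).to_csv(
        output_dir / "distance_matrix.csv")

    # Reload: the .npy file gives values only, the .csv restores the labels too
    loaded_npy = np.load(output_dir / "distance_matrix.npy")
    loaded_csv = pd.read_csv(output_dir / "distance_matrix.csv", index_col=0)

labels = list(loaded_csv.index)
values_match = bool(np.allclose(loaded_npy, loaded_csv.to_numpy()))
is_symmetric = bool(np.allclose(loaded_npy, loaded_npy.T))
print(labels, values_match, is_symmetric)
```

Reading the CSV is the safer choice whenever category order matters, for example when comparing a filtered and reorganized RDM against an unfiltered one.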

For more details, see the README.md file in this directory.