# Developmental Trajectory RDM Analysis

This notebook computes two Representational Dissimilarity Matrices (RDMs) per subject: one from recordings at or below a median age threshold and one from recordings above it. The threshold is a single median computed over all subject-by-month observations pooled across participants.
Comparing the two RDMs shows how object representations change with age within each subject.

## Overview

This analysis:
1. Loads grouped embeddings (averaged by category, subject, and age_mo)
2. Calculates the overall median age over all (subject, age_mo) observations pooled across participants
3. For each subject, splits data into "younger" (age_mo <= median) and "older" (age_mo > median) bins
4. Computes one RDM per subject per age bin (2 RDMs per subject)
5. Handles uneven data density by requiring a minimum number of categories per age bin
6. Visualizes developmental trajectories
7. Compares RDMs between younger and older periods within subjects

## Key Features

- **Median split**: Uses one median age, pooled over all subjects' recorded months, to split each subject's data
- **Two RDMs per subject**: One for "younger" period, one for "older" period
- **Data density handling**: A bin needs at least `min_categories_per_age_bin` categories before its RDM is computed
- **Trajectory analysis**: Compare RDMs between younger and older periods to see developmental changes
- **Missing data handling**: Only includes subjects with sufficient data in both bins
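
The steps above can be sketched on synthetic data. This is a minimal illustration rather than the notebook's pipeline: the `toy` dictionary, the 4-dimensional random vectors, and the helper `bin_rdm` are made up for the example, and cosine distance is computed directly with NumPy instead of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
categories = ["ball", "cup", "dog"]

# Hypothetical data: toy[subject][age_mo][category] -> embedding vector
toy = {
    "s1": {age: {c: rng.normal(size=4) for c in categories} for age in (8, 12, 20, 24)},
    "s2": {age: {c: rng.normal(size=4) for c in categories} for age in (10, 18, 30)},
}

# One median over all pooled (subject, age_mo) observations
all_ages = [age for ages in toy.values() for age in ages]
median_age = np.median(all_ages)


def bin_rdm(age_data, keep):
    """Average each category over the ages selected by `keep`,
    then return the cosine-distance matrix (1 - cosine similarity)."""
    ages = [a for a in age_data if keep(a)]
    mat = np.array([np.mean([age_data[a][c] for a in ages], axis=0) for c in categories])
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    rdm = 1 - unit @ unit.T
    np.fill_diagonal(rdm, 0)
    return rdm


# Two RDMs per subject: ages <= median and ages > median
rdms = {
    subj: {
        "younger": bin_rdm(age_data, lambda a: a <= median_age),
        "older": bin_rdm(age_data, lambda a: a > median_age),
    }
    for subj, age_data in toy.items()
}
```

In this toy case both subjects have data on each side of the threshold. In the real data a subject can fall entirely on one side, which is why the notebook drops subjects lacking either bin.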


## Setup and Imports


In [112]:
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr, pearsonr
from scipy.cluster.hierarchy import linkage, dendrogram, optimal_leaf_ordering
from scipy.spatial.distance import squareform
from collections import defaultdict
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set matplotlib backend
import matplotlib
matplotlib.use('Agg')

print("All imports successful!")


All imports successful!


## Configuration


In [113]:
# Paths
# Path to normalized embeddings from notebook 05 (age-month level normalized embeddings)
# These are saved in category folders: {normalized_embeddings_dir}/{category}/{subject_id}_{age_mo}_month_level_avg.npy
normalized_embeddings_dir = Path("/data2/dataset/babyview/868_hours/outputs/yoloe_cdi_embeddings/facebook_dinov3-vitb16-pretrain-lvd1689m_grouped_by_age-mo_normalized")

# Detect embedding type from path
normalized_embeddings_dir_str = str(normalized_embeddings_dir).lower()
if "dinov" in normalized_embeddings_dir_str:  # also matches "dinov3"; any DINO variant in the path is labeled "dinov3"
    embedding_type = "dinov3"
elif "clip" in normalized_embeddings_dir_str:
    embedding_type = "clip"
else:
    embedding_type = "unknown"

# Create output directory with embedding type in name
output_dir = Path(f"developmental_trajectory_rdms_{embedding_type}")
output_dir.mkdir(exist_ok=True, parents=True)

# Subject to exclude from analyses (should match notebook 06)
excluded_subject = "00270001"

# Categories file (optional - to filter to specific categories)
categories_file = Path("../../data/things_bv_overlap_categories_exclude_zero_precisions.txt")

# CDI words CSV file (required for category type organization)
cdi_path = Path("../../data/cdi_words.csv")

# Hierarchical clustering options
use_clustering = True  # Enable hierarchical clustering within category groups
save_dendrograms = True  # Save dendrogram plots for each category group

# Predefined category list for consistent RDM ordering (optional)
# Set to None to use automatic organization, or provide path to category order file
# This allows comparing RDMs across subjects with the same category ordering
USE_PREDEFINED_CATEGORY_LIST = True  # If True, load category order from PREDEFINED_CATEGORY_LIST_PATH
PREDEFINED_CATEGORY_LIST_PATH = "../vss-2026/bv_things_comp_12252025/bv_clip_filtered_zscored_hierarchical_163cats/category_order_reorganized.txt"  # Path to text file with category order (one category per line), or None
# Example: PREDEFINED_CATEGORY_LIST_PATH = "../vss-2026/bv_things_comp_12252025/bv_clip_filtered_zscored_hierarchical_163cats/category_order_reorganized.txt"

# Minimum categories required per age bin to compute RDM
min_categories_per_age_bin = 8

print(f"Normalized embeddings directory: {normalized_embeddings_dir}")
print(f"Detected embedding type: {embedding_type}")
print(f"Output directory: {output_dir}")
print(f"Excluded subject: {excluded_subject}")
print(f"CDI path: {cdi_path}")
print(f"Use clustering: {use_clustering}")
print(f"Use predefined category list: {USE_PREDEFINED_CATEGORY_LIST}")
if USE_PREDEFINED_CATEGORY_LIST and PREDEFINED_CATEGORY_LIST_PATH:
    print(f"Predefined category list path: {PREDEFINED_CATEGORY_LIST_PATH}")
print(f"Min categories per age bin: {min_categories_per_age_bin}")
print("\nNote: Using pre-normalized embeddings from notebook 05 (no normalization performed here)")


Normalized embeddings directory: /data2/dataset/babyview/868_hours/outputs/yoloe_cdi_embeddings/facebook_dinov3-vitb16-pretrain-lvd1689m_grouped_by_age-mo_normalized
Output directory: developmental_trajectory_rdms
Excluded subject: 00270001
CDI path: ../../data/cdi_words.csv
Use clustering: True
Use predefined category list: True
Predefined category list path: ../vss-2026/bv_things_comp_12252025/bv_clip_filtered_zscored_hierarchical_163cats/category_order_reorganized.txt
Min categories per age bin: 8

Note: Using pre-normalized embeddings from notebook 05 (no normalization performed here)


## Helper Functions


In [114]:
def load_category_types(cdi_path):
    """Load category type information from CDI words CSV"""
    print(f"Loading category types from {cdi_path}...")
    cdi_df = pd.read_csv(cdi_path)
    
    category_types = {}
    for _, row in cdi_df.iterrows():
        category_types[row['uni_lemma']] = {
            'is_animate': bool(row.get('is_animate', 0)),
            'is_bodypart': bool(row.get('is_bodypart', 0)),
            'is_small': bool(row.get('is_small', 0)),
            'is_big': bool(row.get('is_big', 0))
        }
    
    print(f"Loaded type information for {len(category_types)} categories")
    return category_types

def cluster_categories_within_group(group_categories, cat_to_embedding, save_dendrogram=False, output_dir=None, group_name=None):
    """
    Perform hierarchical clustering within a group of categories.
    
    Args:
        group_categories: List of category names in the group
        cat_to_embedding: Dictionary mapping category names to embeddings
        save_dendrogram: Whether to save dendrogram plot (default: False)
        output_dir: Output directory for saving dendrogram (required if save_dendrogram=True)
        group_name: Name of the group for saving dendrogram (required if save_dendrogram=True)
    
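    Returns:
        Tuple (clustered_categories, linkage_matrix):
            clustered_categories: category names reordered by dendrogram leaf order
            linkage_matrix: SciPy linkage matrix, or None if the group has fewer than two categories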
    """
    if len(group_categories) <= 1:
        return group_categories, None
    
    # Get embeddings for this group
    group_embeddings = np.array([cat_to_embedding[cat].flatten() for cat in group_categories])
    
    # Z-score each embedding dimension across the categories in this group
    normalized_embeddings = (group_embeddings - group_embeddings.mean(axis=0)) / (group_embeddings.std(axis=0) + 1e-10)
    
    # Compute distance matrix (1 - cosine similarity)
    similarity_matrix = cosine_similarity(normalized_embeddings)
    distance_matrix = 1 - similarity_matrix
    np.fill_diagonal(distance_matrix, 0)
    
    # Convert to condensed form for linkage
    condensed_distances = squareform(distance_matrix)
    
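    # Perform hierarchical clustering.
    # Note: SciPy documents Ward linkage as correctly defined only for Euclidean
    # distances. Here it is applied to cosine distances, so treat the resulting
    # tree as a heuristic ordering rather than an exact Ward solution.
    linkage_matrix = linkage(condensed_distances, method='ward')
    
    # Get optimal leaf ordering for better visualization
    try:
        linkage_matrix = optimal_leaf_ordering(linkage_matrix, condensed_distances)
    except Exception:
        # If optimal leaf ordering fails, fall back to the original linkage
        pass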
    
    # Extract the order from the dendrogram
    dendro_dict = dendrogram(linkage_matrix, no_plot=True)
    leaf_order = dendro_dict['leaves']
    
    # Reorder categories according to clustering
    clustered_categories = [group_categories[i] for i in leaf_order]
    
    # Save dendrogram if requested
    if save_dendrogram and output_dir is not None and group_name is not None:
        output_dir = Path(output_dir)
        output_dir.mkdir(exist_ok=True, parents=True)
        
        plt.figure(figsize=(12, 8))
        dendrogram(linkage_matrix, 
                  labels=group_categories,
                  leaf_rotation=90,
                  leaf_font_size=10)
        plt.title(f'Hierarchical Clustering Dendrogram: {group_name.upper()}\n({len(group_categories)} categories)',
                 fontsize=16, pad=20)
        plt.xlabel('Category', fontsize=12)
        plt.ylabel('Distance', fontsize=12)
        plt.tight_layout()
        
        # Save as PNG
        output_path_png = output_dir / f'dendrogram_{group_name}.png'
        plt.savefig(output_path_png, dpi=300, bbox_inches='tight', pad_inches=0.2)
        print(f"    Saved dendrogram to {output_path_png}")
        
        # Save as PDF
        output_path_pdf = output_dir / f'dendrogram_{group_name}.pdf'
        plt.savefig(output_path_pdf, bbox_inches='tight', pad_inches=0.2)
        print(f"    Saved dendrogram to {output_path_pdf}")
        
        plt.close()
    
    return clustered_categories, linkage_matrix

print("Helper functions loaded!")

Helper functions loaded!
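
The helper above feeds cosine distances into Ward linkage. As a point of comparison, the sketch below uses average linkage, which accepts arbitrary dissimilarities. This swaps in a different linkage method and is not what the notebook runs; the category names and the two-cluster toy embeddings are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
names = ["cat", "dog", "cup", "spoon"]  # hypothetical categories

# Toy embeddings: two tight clusters around two random base vectors
base_a, base_b = rng.normal(size=16), rng.normal(size=16)
emb = np.stack([
    base_a + 0.05 * rng.normal(size=16),  # cat
    base_a + 0.05 * rng.normal(size=16),  # dog
    base_b + 0.05 * rng.normal(size=16),  # cup
    base_b + 0.05 * rng.normal(size=16),  # spoon
])

# Condensed cosine-distance vector, then average linkage with optimal leaf ordering
condensed = pdist(emb, metric="cosine")
Z = linkage(condensed, method="average", optimal_ordering=True)

# Category names in dendrogram leaf order
order = [names[i] for i in leaves_list(Z)]
```

Passing `optimal_ordering=True` to `linkage` performs the leaf reordering in the same call, so no separate `optimal_leaf_ordering` step or `try`/`except` is needed in this variant.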


In [115]:
# Load allowed categories if file exists
allowed_categories = None
if categories_file.exists():
    print(f"Loading categories from {categories_file}...")
    with open(categories_file, 'r') as f:
        allowed_categories = set(line.strip() for line in f if line.strip())
    print(f"Loaded {len(allowed_categories)} categories")
else:
    print("Categories file not found, using all categories")


Loading categories from ../../data/things_bv_overlap_categories_exclude_zero_precisions.txt...
Loaded 163 categories


## Load Embeddings by Age


In [116]:
def load_embeddings_by_age(embeddings_dir, allowed_categories=None, excluded_subject=None, age_binning_strategy='exact', age_bin_size=3):
    """
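    Load pre-normalized embeddings organized by subject, age_mo, and category.
    These embeddings are already normalized from notebook 05.
    
    Args:
        embeddings_dir: Path containing one folder per category
        allowed_categories: optional set of category names to keep (None keeps all)
        excluded_subject: optional subject_id to skip
        age_binning_strategy: 'binned' floors age_mo to a multiple of age_bin_size;
            any other value keeps the exact age_mo
        age_bin_size: bin width in months (only used when the strategy is 'binned')
    
    Returns:
        subject_age_embeddings: dict[subject_id][age_mo_bin][category] = embedding array (already normalized)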
    """
    subject_age_embeddings = defaultdict(lambda: defaultdict(dict))
    
    # Get all category folders
    category_folders = [f for f in embeddings_dir.iterdir() if f.is_dir()]
    
    if allowed_categories:
        category_folders = [f for f in category_folders if f.name in allowed_categories]
    
    print(f"Loading pre-normalized embeddings from {len(category_folders)} categories...")
    
    for category_folder in tqdm(category_folders, desc="Loading categories"):
        category = category_folder.name
        
        # Get all embedding files in this category
        embedding_files = list(category_folder.glob("*.npy"))
        
        for emb_file in embedding_files:
            # Parse filename: {subject_id}_{age_mo}_month_level_avg.npy
            filename = emb_file.stem  # without .npy
            parts = filename.split('_')
            
            if len(parts) < 2:
                continue
            
            # Extract subject_id and age_mo
            subject_id = parts[0]
            
            # Exclude subject if specified
            if excluded_subject and subject_id == excluded_subject:
                continue
            
            age_mo = int(parts[1]) if parts[1].isdigit() else None
            
            if age_mo is None:
                continue
            
            # Apply age binning strategy
            if age_binning_strategy == 'binned':
                age_mo_bin = (age_mo // age_bin_size) * age_bin_size  # Round down to bin
            else:
                age_mo_bin = age_mo  # Use exact age
            
            try:
                embedding = np.load(emb_file)
                subject_age_embeddings[subject_id][age_mo_bin][category] = embedding
            except Exception as e:
                print(f"Error loading {emb_file}: {e}")
                continue
    
    return subject_age_embeddings

# Load pre-normalized embeddings from notebook 05 (using exact ages - we'll do median split later)
subject_age_embeddings = load_embeddings_by_age(
    normalized_embeddings_dir,  # Use normalized embeddings from notebook 05
    allowed_categories,
    excluded_subject=excluded_subject,  # Exclude specified subject
    age_binning_strategy='exact',  # Use exact ages
    age_bin_size=1  # Not used when strategy is 'exact'
)

print(f"\nLoaded embeddings for {len(subject_age_embeddings)} subjects")

# Show age bin distribution
all_age_bins = set()
for subject_id, age_data in subject_age_embeddings.items():
    all_age_bins.update(age_data.keys())

print(f"Age bins found: {sorted(all_age_bins)}")
print(f"Age range: {min(all_age_bins)} to {max(all_age_bins)} months")


Loading pre-normalized embeddings from 163 categories...


Loading categories: 100%|██████████| 163/163 [00:01<00:00, 105.46it/s]


Loaded embeddings for 31 subjects
Age bins found: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37]
Age range: 6 to 37 months
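
The loader splits each file stem on underscores and skips any file whose second token is not numeric. A stricter alternative is to match the whole stem against the documented layout `{subject_id}_{age_mo}_month_level_avg`. The sketch below is an assumption-laden illustration: the helper name `parse_embedding_stem` is hypothetical, and the pattern assumes subject IDs contain no underscores.

```python
import re

# Full-stem pattern for "{subject_id}_{age_mo}_month_level_avg"
# (assumes subject_id has no underscores and age_mo is a non-negative integer)
_STEM_PATTERN = re.compile(r"^(?P<subject>[^_]+)_(?P<age>\d+)_month_level_avg$")


def parse_embedding_stem(stem):
    """Return (subject_id, age_mo) for a matching stem, or None otherwise."""
    match = _STEM_PATTERN.match(stem)
    if match is None:
        return None
    return match.group("subject"), int(match.group("age"))


examples = ["00220001_14_month_level_avg", "00220001_x_month_level_avg", "notes"]
parsed = [parse_embedding_stem(stem) for stem in examples]
```

Unlike the split-based parser, this rejects stray `.npy` files that happen to start with `something_123`, at the cost of also rejecting any legitimate file that deviates from the expected suffix.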





## Calculate Overall Median Age

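We compute a single median age over all (subject, age_mo) observations pooled across participants, then use it to split each subject's data into younger and older periods. Because the median is pooled, subjects recorded over more months carry more weight in the threshold. The embeddings were already normalized in notebook 05, so no normalization is performed here.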
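
To make the choice of threshold concrete, the sketch below contrasts a median over pooled subject-month observations with a median of per-subject medians. The recording months are invented for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical recording months per subject
ages_by_subject = {
    "s1": [6, 7, 8, 9, 10, 11, 12],  # densely sampled, young
    "s2": [20, 24],
    "s3": [28, 30],
}

# Pooled median: every (subject, month) observation counts once
pooled = [age for ages in ages_by_subject.values() for age in ages]
pooled_median = np.median(pooled)

# Alternative: each subject counts once, via its own median age
per_subject_median = np.median([np.median(ages) for ages in ages_by_subject.values()])
```

Here the pooled median is 11 months while the median of per-subject medians is 22 months, because the densely sampled subject dominates the pooled list. Either definition is defensible; the point is that the split, and therefore which subjects end up with data in both bins, depends on it.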


In [117]:
## Calculate Overall Median Age Across All Participants

# Collect all age_mo values across all subjects to compute overall median
all_ages = []
for subject_id, age_data in subject_age_embeddings.items():
    all_ages.extend(age_data.keys())

overall_median_age = np.median(all_ages)
print(f"Overall median age across all participants: {overall_median_age:.1f} months")
print(f"Age range: {min(all_ages)} to {max(all_ages)} months")
print(f"Total age observations: {len(all_ages)}")

## Use Pre-Normalized Embeddings from Notebook 05

# Embeddings are already normalized from notebook 05, so we use them directly
# Rename for consistency with rest of code
subject_age_embeddings_normalized = subject_age_embeddings

print(f"\nUsing pre-normalized embeddings for {len(subject_age_embeddings_normalized)} subjects")
print("  Note: Embeddings were normalized in notebook 05 (within each subject across all age bins)")
print("  No additional normalization performed here")

## Aggregate Embeddings by Median Split and Compute RDMs

def aggregate_embeddings_by_bin(age_embeddings_dict, age_bin_name):
    """
    Aggregate embeddings for a bin by averaging across all ages in that bin.
    Each month contributes equally to the mean (unweighted average).
    
    Args:
        age_embeddings_dict: dict[age_mo][category] = embedding array
        age_bin_name: 'younger' or 'older' (label only; not used in the computation)
    
    Returns:
        aggregated_embeddings: dict[category] = averaged embedding array
    """
    # Collect all embeddings for each category across ages in this bin
    category_embeddings = defaultdict(list)
    
    for age_mo, categories in age_embeddings_dict.items():
        for cat, embedding in categories.items():
            category_embeddings[cat].append(embedding)
    
    # Average embeddings for each category
    aggregated = {}
    for cat, embeddings_list in category_embeddings.items():
        if len(embeddings_list) > 0:
            aggregated[cat] = np.mean(embeddings_list, axis=0)
    
    return aggregated

def compute_rdm_for_bin_with_na(bin_embeddings_dict, ordered_categories_list):
    """
    Compute RDM for a single age bin (younger or older) with NA for missing categories.
    This ensures consistent ordering across bins using the predefined category order.
    
    Args:
        bin_embeddings_dict: dict[category] = embedding array (should be normalized and aggregated)
        ordered_categories_list: list of all categories in desired order (may include categories not present for this bin)
    
    Returns:
        rdm: numpy array of shape (n_categories, n_categories) with np.nan for missing categories
        mask: boolean array of shape (n_categories, n_categories) where True indicates NA (missing category)
        available_categories: list of categories actually present for this bin
    """
    n_categories = len(ordered_categories_list)
    
    # Find available categories (categories that exist for this bin)
    available_categories = [cat for cat in ordered_categories_list if cat in bin_embeddings_dict]
    
    if len(available_categories) < min_categories_per_age_bin:
        # Too few categories (threshold is the notebook-level min_categories_per_age_bin):
        # return an all-NaN RDM with every cell masked
        rdm = np.full((n_categories, n_categories), np.nan)
        mask = np.ones((n_categories, n_categories), dtype=bool)
        return rdm, mask, available_categories
    
    # Build embedding matrix for available categories (already normalized)
    embedding_matrix = np.array([bin_embeddings_dict[cat].flatten() for cat in available_categories])
    
    # Ensure 2D shape: (n_available_categories, embedding_dim)
    if embedding_matrix.ndim != 2:
        raise ValueError(f"Expected 2D embedding matrix, got shape {embedding_matrix.shape}")
    
    # Compute cosine similarity for available categories
    similarity_matrix_available = cosine_similarity(embedding_matrix)
    
    # Convert to distance (RDM) for available categories
    distance_matrix_available = 1 - similarity_matrix_available
    np.fill_diagonal(distance_matrix_available, 0)  # Ensure diagonal is 0
    
    # Make symmetric (in case of numerical errors)
    distance_matrix_available = (distance_matrix_available + distance_matrix_available.T) / 2
    
    # Create full RDM with NaN for missing categories
    rdm = np.full((n_categories, n_categories), np.nan)
    mask = np.ones((n_categories, n_categories), dtype=bool)
    
    # Map available categories to their row/column positions in ordered_categories_list
    category_index = {cat: i for i, cat in enumerate(ordered_categories_list)}
    available_indices = [category_index[cat] for cat in available_categories]
    
    # Fill in the block for available categories; mask False means data present
    block = np.ix_(available_indices, available_indices)
    rdm[block] = distance_matrix_available
    mask[block] = False
    
    return rdm, mask, available_categories

# Get all unique categories across all subjects and ages
all_categories = set()
for subject_id, age_data in subject_age_embeddings_normalized.items():
    for age_mo, categories in age_data.items():
        all_categories.update(categories.keys())

all_categories = sorted(list(all_categories))
print(f"\nTotal unique categories across all subjects and ages: {len(all_categories)}")
print("Note: RDMs will be computed with predefined category order after organization step.")

# First, identify which subjects have sufficient data in both bins
# We'll compute actual RDMs after category organization
print(f"\nIdentifying subjects with sufficient data in both age bins...")
subject_age_rdms = {}  # Temporary storage - will be recomputed with predefined order
subject_age_rdm_categories = {}  # Temporary storage
excluded_subjects = []  # Track excluded subjects and reasons

for subject_id, age_data in tqdm(subject_age_embeddings_normalized.items(), desc="Checking subjects"):
    # Split ages into younger and older bins
    younger_ages = {age_mo: categories for age_mo, categories in age_data.items() 
                    if age_mo <= overall_median_age}
    older_ages = {age_mo: categories for age_mo, categories in age_data.items() 
                  if age_mo > overall_median_age}
    
    subject_age_rdms[subject_id] = {}
    subject_age_rdm_categories[subject_id] = {}
    
    # Process younger bin
    younger_has_rdm = False
    younger_n_cats = 0
    if len(younger_ages) > 0:
        younger_aggregated = aggregate_embeddings_by_bin(younger_ages, 'younger')
        younger_n_cats = len(younger_aggregated)
        if younger_n_cats >= min_categories_per_age_bin:
            younger_has_rdm = True
            subject_age_rdms[subject_id]['younger'] = True  # Placeholder
            subject_age_rdm_categories[subject_id]['younger'] = list(younger_aggregated.keys())
    else:
        excluded_subjects.append({
            'subject_id': subject_id,
            'reason': 'no younger ages',
            'younger_n_cats': 0,
            'older_n_cats': len(aggregate_embeddings_by_bin(older_ages, 'older')) if len(older_ages) > 0 else 0
        })
    
    # Process older bin
    older_has_rdm = False
    older_n_cats = 0
    if len(older_ages) > 0:
        older_aggregated = aggregate_embeddings_by_bin(older_ages, 'older')
        older_n_cats = len(older_aggregated)
        if older_n_cats >= min_categories_per_age_bin:
            older_has_rdm = True
            subject_age_rdms[subject_id]['older'] = True  # Placeholder
            subject_age_rdm_categories[subject_id]['older'] = list(older_aggregated.keys())
    else:
        excluded_subjects.append({
            'subject_id': subject_id,
            'reason': 'no older ages',
            'younger_n_cats': younger_n_cats,
            'older_n_cats': 0
        })
    
    # Filter out subjects without both bins
    if not younger_has_rdm or not older_has_rdm:
        if subject_id not in [s['subject_id'] for s in excluded_subjects]:
            # Determine specific reason
            if not younger_has_rdm and not older_has_rdm:
                reason = 'both bins insufficient'
            elif not younger_has_rdm:
                reason = f'younger bin insufficient ({younger_n_cats} < {min_categories_per_age_bin} cats)'
            else:
                reason = f'older bin insufficient ({older_n_cats} < {min_categories_per_age_bin} cats)'
            
            excluded_subjects.append({
                'subject_id': subject_id,
                'reason': reason,
                'younger_n_cats': younger_n_cats,
                'older_n_cats': older_n_cats
            })
        
        del subject_age_rdms[subject_id]
        del subject_age_rdm_categories[subject_id]

print(f"\nIdentified {len(subject_age_rdms)} subjects with sufficient data in both bins")
print(f"  Excluded {len(excluded_subjects)} subjects without sufficient data in both bins")

# Show excluded subjects details
if len(excluded_subjects) > 0:
    print(f"\nExcluded subjects ({len(excluded_subjects)}):")
    excluded_df = pd.DataFrame(excluded_subjects)
    excluded_df = excluded_df.sort_values('subject_id')
    for _, row in excluded_df.iterrows():
        print(f"  {row['subject_id']}: {row['reason']} (younger: {row['younger_n_cats']} cats, older: {row['older_n_cats']} cats)")


Overall median age across all participants: 16.0 months
Age range: 6 to 37 months
Total age observations: 266

Using pre-normalized embeddings for 31 subjects
  Note: Embeddings were normalized in notebook 05 (within each subject across all age bins)
  No additional normalization performed here

Total unique categories across all subjects and ages: 163
Note: RDMs will be computed with predefined category order after organization step.

Identifying subjects with sufficient data in both age bins...


Checking subjects: 100%|██████████| 31/31 [00:00<00:00, 381.54it/s]


Identified 18 subjects with sufficient data in both bins
  Excluded 13 subjects without sufficient data in both bins

Excluded subjects (13):
  00220001: no older ages (younger: 155 cats, older: 0 cats)
  00230001: no older ages (younger: 143 cats, older: 0 cats)
  00340002: no older ages (younger: 99 cats, older: 0 cats)
  00350001: no older ages (younger: 111 cats, older: 0 cats)
  00350002: no older ages (younger: 123 cats, older: 0 cats)
  00360001: no older ages (younger: 153 cats, older: 0 cats)
  00390001: no older ages (younger: 141 cats, older: 0 cats)
  00430002: no older ages (younger: 112 cats, older: 0 cats)
  00440001: no older ages (younger: 134 cats, older: 0 cats)
  00460001: no older ages (younger: 140 cats, older: 0 cats)
  00550001: no older ages (younger: 132 cats, older: 0 cats)
  00720001: no younger ages (younger: 0 cats, older: 154 cats)
  00820001: no younger ages (younger: 0 cats, older: 160 cats)
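
Because `compute_rdm_for_bin_with_na` pads missing categories with NaN, comparing a subject's younger and older RDMs has to be restricted to cells present in both. The sketch below shows one way to do that with a Spearman correlation over the shared upper-triangle cells. The helper `compare_rdms`, the minimum of three shared cells, and the toy matrices are assumptions for illustration, not the notebook's own comparison code.

```python
import numpy as np
from scipy.stats import spearmanr


def compare_rdms(rdm_a, rdm_b, min_pairs=3):
    """Spearman correlation over upper-triangle cells that are finite in both RDMs.

    Returns (rho, n_pairs); rho is NaN when fewer than `min_pairs` cells are shared.
    """
    upper = np.triu_indices_from(rdm_a, k=1)
    a, b = rdm_a[upper], rdm_b[upper]
    valid = np.isfinite(a) & np.isfinite(b)
    n_pairs = int(valid.sum())
    if n_pairs < min_pairs:
        return np.nan, n_pairs
    rho, _ = spearmanr(a[valid], b[valid])
    return rho, n_pairs


# Toy "younger" RDM over four categories
rdm_younger = np.array([
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.7, 0.4],
    [0.5, 0.7, 0.0, 0.3],
    [0.9, 0.4, 0.3, 0.0],
])

# Toy "older" RDM: same values, but the fourth category is missing (NaN row/column)
rdm_older = rdm_younger.copy()
rdm_older[3, :] = np.nan
rdm_older[:, 3] = np.nan

rho, n_pairs = compare_rdms(rdm_younger, rdm_older)
```

Only the three category pairs among the first three categories survive the mask, and since their distances are identical the correlation is 1. Reporting `n_pairs` alongside `rho` helps flag comparisons that rest on very few shared cells.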





In [118]:
# Organize Categories (with Predefined List Option)
# This section organizes categories either by loading a predefined category list (for consistent ordering across subjects) or by automatic organization.

# Get all unique categories across all subjects and ages (needed for organization)
all_categories = set()
for subject_id, age_data in subject_age_embeddings_normalized.items():
    for age_mo, categories in age_data.items():
        all_categories.update(categories.keys())

all_categories = sorted(list(all_categories))
print(f"Total unique categories across all subjects and ages: {len(all_categories)}")

# Organize categories: either load predefined list or organize automatically
print("\nOrganizing categories...")

if USE_PREDEFINED_CATEGORY_LIST and PREDEFINED_CATEGORY_LIST_PATH is not None:
    # Load predefined category list
    predefined_path = Path(PREDEFINED_CATEGORY_LIST_PATH)
    if not predefined_path.exists():
        raise FileNotFoundError(f"Predefined category list file not found: {predefined_path}")
    
    print(f"  Loading predefined category order from {predefined_path}...")
    with open(predefined_path, 'r') as f:
        # Skip comment lines (lines starting with #)
        ordered_categories = [line.strip() for line in f if line.strip() and not line.strip().startswith('#')]
    
    # Check that the predefined list and the data contain the same set of categories
    predefined_set = set(ordered_categories)
    all_categories_set = set(all_categories)
    
    if predefined_set != all_categories_set:
        missing_in_predefined = all_categories_set - predefined_set
        extra_in_predefined = predefined_set - all_categories_set
        if missing_in_predefined:
            print(f"  Warning: {len(missing_in_predefined)} categories in data but not in predefined list: {sorted(missing_in_predefined)[:5]}...")
        if extra_in_predefined:
            print(f"  Warning: {len(extra_in_predefined)} categories in predefined list but not in data: {sorted(extra_in_predefined)[:5]}...")
        # Use intersection: only categories that exist in both
        ordered_categories = [cat for cat in ordered_categories if cat in all_categories_set]
        print(f"  Using intersection: {len(ordered_categories)} categories")
    
    print(f"  Loaded {len(ordered_categories)} categories in predefined order")
    
    # Still organize into groups for visualization boundaries (even though order is predefined)
    # Load category types for grouping
    if cdi_path.exists():
        category_types = load_category_types(cdi_path)
    else:
        print(f"Warning: CDI path {cdi_path} not found. Cannot compute group boundaries.")
        category_types = {}
    
    # Organize predefined categories into groups for visualization boundaries
    organized = {
        'animals': [],
        'bodyparts': [],
        'big_objects': [],
        'small_objects': [],
        'others': []
    }
    
    for cat in ordered_categories:
        if cat not in category_types:
            organized['others'].append(cat)
            continue
        
        types = category_types[cat]
        if types['is_animate']:
            organized['animals'].append(cat)
        elif types['is_bodypart']:
            organized['bodyparts'].append(cat)
        elif types['is_big']:
            organized['big_objects'].append(cat)
        elif types['is_small']:
            organized['small_objects'].append(cat)
        else:
            organized['others'].append(cat)
    
else:
    # Automatic organization by type (similar to notebook 02)
    print(f"  Organizing categories by type...")
    
    # Load category types for organization
    if cdi_path.exists():
        category_types = load_category_types(cdi_path)
    else:
        print(f"Warning: CDI path {cdi_path} not found. Skipping category organization.")
        category_types = {}
    
    # Get a representative set of embeddings for clustering (average across all subjects and ages)
    representative_embeddings = {}
    for cat in all_categories:
        cat_embeddings = []
        for subject_id, age_data in subject_age_embeddings_normalized.items():
            for age_mo, categories in age_data.items():
                if cat in categories:
                    cat_embeddings.append(categories[cat])
        if len(cat_embeddings) > 0:
            # Average across all subjects and ages for this category
            representative_embeddings[cat] = np.mean(cat_embeddings, axis=0)
    
    # Organize by type
    organized = {
        'animals': [],
        'bodyparts': [],
        'big_objects': [],
        'small_objects': [],
        'others': []
    }
    
    for cat in all_categories:
        if cat not in category_types:
            organized['others'].append(cat)
            continue
        
        types = category_types[cat]
        if types['is_animate']:
            organized['animals'].append(cat)
        elif types['is_bodypart']:
            organized['bodyparts'].append(cat)
        elif types['is_big']:
            organized['big_objects'].append(cat)
        elif types['is_small']:
            organized['small_objects'].append(cat)
        else:
            organized['others'].append(cat)
    
    print(f"  Organized into: {len(organized['animals'])} animals, {len(organized['bodyparts'])} bodyparts, "
          f"{len(organized['big_objects'])} big objects, {len(organized['small_objects'])} small objects, "
          f"{len(organized['others'])} others")
    
    # Apply hierarchical clustering if enabled
    if use_clustering:
        print(f"  Applying hierarchical clustering within groups...")
        for group_name in ['animals', 'bodyparts', 'big_objects', 'small_objects', 'others']:
            if len(organized[group_name]) > 1:
                # Filter to categories that have representative embeddings
                group_cats = [cat for cat in organized[group_name] if cat in representative_embeddings]
                if len(group_cats) > 1:
                    print(f"    Clustering {group_name} ({len(group_cats)} categories)...")
                    organized[group_name], _ = cluster_categories_within_group(
                        group_cats,
                        representative_embeddings,
                        save_dendrogram=save_dendrograms,
                        output_dir=output_dir,
                        group_name=group_name
                    )
                else:
                    organized[group_name] = group_cats
            else:
                organized[group_name] = [cat for cat in organized[group_name] if cat in representative_embeddings]
    else:
        for group_name in organized:
            organized[group_name] = sorted([cat for cat in organized[group_name] if cat in representative_embeddings])
    
    # Create ordered list
    ordered_categories = (
        organized['animals'] +
        organized['bodyparts'] +
        organized['big_objects'] +
        organized['small_objects'] +
        organized['others']
    )

print(f"\nFinal ordered category list: {len(ordered_categories)} categories")

Total unique categories across all subjects and ages: 163

Organizing categories...
  Loading predefined category order from ../vss-2026/bv_things_comp_12252025/bv_clip_filtered_zscored_hierarchical_163cats/category_order_reorganized.txt...
  Loaded 163 categories in predefined order
Loading category types from ../../data/cdi_words.csv...
Loaded type information for 295 categories

Final ordered category list: 163 categories
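
With a predefined order, the type groups are used only to place visual separators on the RDM. One way to derive separator positions is to find where the group label changes along the ordered list. The sketch below assumes a hypothetical `group_of` mapping from category to group name and a small invented category list; it is not the notebook's boundary code.

```python
from itertools import groupby

# Hypothetical ordered categories and their type groups
ordered = ["dog", "cat", "hand", "sofa", "table", "cup"]
group_of = {
    "dog": "animals", "cat": "animals",
    "hand": "bodyparts",
    "sofa": "big_objects", "table": "big_objects",
    "cup": "small_objects",
}

# Walk runs of consecutive categories sharing a group and record where each run ends
boundaries = []
position = 0
for label, run in groupby(ordered, key=group_of.get):
    position += len(list(run))
    boundaries.append((label, position))

# The end of the last run is the matrix edge, so it needs no separator
boundaries = boundaries[:-1]
```

Each boundary is the index at which a separator line would be drawn. Because `groupby` works on consecutive runs, a predefined order that interleaves groups simply yields more boundaries rather than failing.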


In [119]:
# No age binning needed - we're using median split (younger/older)

In [120]:
# Recompute RDMs using predefined category order with NaN for missing categories
print("\nRecomputing RDMs with predefined category order (including NaN for missing categories)...")
subject_age_rdms_reorganized = {}
subject_age_rdm_masks = {}  # Store masks indicating NA cells
subject_age_rdm_categories_reorganized = {}
subject_age_group_boundaries = {}  # Store group boundaries for visual separators

for subject_id in tqdm(subject_age_rdms.keys(), desc="Recomputing RDMs"):
    subject_age_rdms_reorganized[subject_id] = {}
    subject_age_rdm_masks[subject_id] = {}
    subject_age_rdm_categories_reorganized[subject_id] = {}
    subject_age_group_boundaries[subject_id] = {}
    
    # Get original data for this subject
    original_rdms = subject_age_rdms[subject_id]
    original_categories = subject_age_rdm_categories[subject_id]
    
    # Recompute each bin's RDM using predefined order
    for bin_name in ['younger', 'older']:
        if bin_name not in original_rdms:
            continue
        
        # Get aggregated embeddings for this bin
        if bin_name == 'younger':
            relevant_ages = {age_mo: cats for age_mo, cats in subject_age_embeddings_normalized[subject_id].items() 
                           if age_mo <= overall_median_age}
        else:  # older
            relevant_ages = {age_mo: cats for age_mo, cats in subject_age_embeddings_normalized[subject_id].items() 
                            if age_mo > overall_median_age}
        
        # Aggregate embeddings for this bin
        bin_embeddings = aggregate_embeddings_by_bin(relevant_ages, bin_name)
        
        # Compute RDM with NaN for missing categories using predefined order
        rdm, mask, available_cats = compute_rdm_for_bin_with_na(bin_embeddings, ordered_categories)
        
        if rdm is not None:
            subject_age_rdms_reorganized[subject_id][bin_name] = rdm
            subject_age_rdm_masks[subject_id][bin_name] = mask
            subject_age_rdm_categories_reorganized[subject_id][bin_name] = available_cats
            
            # Compute group boundaries based on full ordered_categories (for visualization)
            group_boundaries = []
            current_idx = 0
            for group_name in ['animals', 'bodyparts', 'big_objects', 'small_objects', 'others']:
                group_cats = [cat for cat in organized[group_name] if cat in ordered_categories]
                if len(group_cats) > 0:
                    group_start = current_idx
                    group_end = current_idx + len(group_cats)
                    group_boundaries.append({
                        'name': group_name,
                        'start': group_start,
                        'end': group_end,
                        'categories': group_cats
                    })
                    current_idx = group_end
            
            subject_age_group_boundaries[subject_id][bin_name] = group_boundaries

# Update the main dictionaries
subject_age_rdms = subject_age_rdms_reorganized
subject_age_rdm_categories = subject_age_rdm_categories_reorganized

print(f"Recomputed RDMs for {len(subject_age_rdms)} subjects using predefined category order")
print(f"  All RDMs now use the same {len(ordered_categories)}-category order with NaN for missing categories")


Recomputing RDMs with predefined category order (including NaN for missing categories)...


Recomputing RDMs: 100%|██████████| 18/18 [00:00<00:00, 57.72it/s]

Recomputed RDMs for 18 subjects using predefined category order
  All RDMs now use the same 163-category order with NaN for missing categories





## Save RDMs for Each Subject (Younger and Older Bins)


In [121]:
# Save RDMs for each subject (younger and older bins)
print("Saving developmental trajectory RDMs...")

for subject_id, bin_rdms in tqdm(subject_age_rdms.items(), desc="Saving RDMs"):
    subject_output_dir = output_dir / subject_id
    subject_output_dir.mkdir(exist_ok=True, parents=True)
    
    for bin_name, rdm in bin_rdms.items():
        available_cats = subject_age_rdm_categories[subject_id][bin_name]
        
        # Save as numpy array (includes NaN for missing categories)
        np.save(subject_output_dir / f"rdm_{bin_name}.npy", rdm)
        
        # Save as CSV with category labels (use ordered_categories for full order)
        rdm_df = pd.DataFrame(rdm, index=ordered_categories, columns=ordered_categories)
        rdm_df.to_csv(subject_output_dir / f"rdm_{bin_name}.csv")
        
        # Save metadata
        # Compute statistics only on valid (non-NaN) values
        valid_rdm = rdm[~np.isnan(rdm)]
        valid_rdm_positive = valid_rdm[valid_rdm > 0]
        
        metadata = {
            'subject_id': subject_id,
            'age_bin': bin_name,
            'median_age_threshold': overall_median_age,
            'n_categories_total': len(ordered_categories),
            'n_categories_available': len(available_cats),
            'n_categories_missing': len(ordered_categories) - len(available_cats),
            'categories_available': available_cats,
            'mean_distance': float(np.nanmean(rdm)),
            'std_distance': float(np.nanstd(rdm)),
            'min_distance': float(valid_rdm_positive.min()) if len(valid_rdm_positive) > 0 else np.nan,
            'max_distance': float(np.nanmax(rdm))
        }
        
        metadata_df = pd.DataFrame([metadata])
        metadata_df.to_csv(subject_output_dir / f"metadata_{bin_name}.csv", index=False)

        # Create and save individual dendrogram for this bin (using available categories only)
        if len(available_cats) > 1:
            # Get aggregated embeddings for this bin's available categories
            # We need to reconstruct from the original normalized embeddings
            bin_embeddings = {}
            for cat in available_cats:
                cat_embeddings = []
                # Get all ages in this bin for this subject
                if bin_name == 'younger':
                    relevant_ages = {age_mo: cats for age_mo, cats in subject_age_embeddings_normalized[subject_id].items() 
                                   if age_mo <= overall_median_age}
                else:  # older
                    relevant_ages = {age_mo: cats for age_mo, cats in subject_age_embeddings_normalized[subject_id].items() 
                                    if age_mo > overall_median_age}
                
                for age_mo, age_cats in relevant_ages.items():
                    if cat in age_cats:
                        cat_embeddings.append(age_cats[cat])
                
                if len(cat_embeddings) > 0:
                    bin_embeddings[cat] = np.mean(cat_embeddings, axis=0)
            
            if len(bin_embeddings) > 1:
                # Build embedding matrix
                embedding_matrix = np.array([bin_embeddings[cat].flatten() for cat in available_cats])
                
                # Z-score each embedding dimension across this bin's categories
                normalized_embeddings = (embedding_matrix - embedding_matrix.mean(axis=0)) / (embedding_matrix.std(axis=0) + 1e-10)
                
                # Compute distance matrix
                similarity_matrix = cosine_similarity(normalized_embeddings)
                distance_matrix = 1 - similarity_matrix
                np.fill_diagonal(distance_matrix, 0)
                
                # Convert to condensed form for linkage
                condensed_distances = squareform(distance_matrix)
                
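                # Perform hierarchical clustering
                # (Ward linkage formally assumes Euclidean distances, so with cosine
                # distances the resulting tree is a heuristic ordering)
                linkage_matrix = linkage(condensed_distances, method='ward')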
                
                # Get optimal leaf ordering
                try:
                    linkage_matrix = optimal_leaf_ordering(linkage_matrix, condensed_distances)
                except Exception:
                    # Leaf ordering is cosmetic; keep the unordered linkage if it fails
                    pass
                
                # Create dendrogram
                plt.figure(figsize=(max(16, len(available_cats) * 0.5), 10))
                dendrogram(linkage_matrix, 
                          labels=available_cats,
                          leaf_rotation=90,
                          leaf_font_size=max(8, min(14, 200 // len(available_cats))))
                plt.title(f'Dendrogram: {subject_id} {bin_name.capitalize()} (≤{overall_median_age:.0f}mo vs >{overall_median_age:.0f}mo)\n({len(available_cats)}/{len(ordered_categories)} categories)',
                         fontsize=16, pad=20)
                plt.xlabel('Category', fontsize=14)
                plt.ylabel('Distance', fontsize=14)
                plt.tight_layout()
                
                # Save dendrogram
                dendrogram_dir = subject_output_dir / "dendrograms"
                dendrogram_dir.mkdir(exist_ok=True, parents=True)
                dendrogram_path = dendrogram_dir / f"dendrogram_{bin_name}.png"
                plt.savefig(dendrogram_path, dpi=300, bbox_inches='tight', pad_inches=0.2)
                plt.close()

print(f"\nSaved RDMs to {output_dir}")


Saving developmental trajectory RDMs...


Saving RDMs: 100%|██████████| 18/18 [01:27<00:00,  4.87s/it]


Saved RDMs to developmental_trajectory_rdms





## Analyze Developmental Trajectories


## Detailed Explanation: RDM Correlation Logic

### Overview
This section explains in detail how we compute correlations between younger and older RDMs for each subject, including how we handle missing categories (NaN values) and whether correlations are comparable across subjects.

### RDM Structure
Each subject has two RDMs (younger and older), both with shape (163, 163) corresponding to the full predefined category order:
- **Diagonal elements**: Always 0 (distance from category to itself)
- **Off-diagonal elements**: Distance values (0-2 range) for category pairs that exist in that age bin
- **Missing categories**: Represented as NaN (white cells in visualization)
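
A minimal sketch of how such a NaN-padded RDM could be laid out is shown below. The helper `build_padded_rdm` and the toy embeddings are hypothetical and only illustrate the structure; the notebook's own `compute_rdm_for_bin_with_na` may differ in detail (for example, in how it normalizes embeddings).

```python
import numpy as np

def build_padded_rdm(bin_embeddings, ordered_categories):
    """Hypothetical helper: cosine-distance RDM in a fixed category order.

    Rows and columns of categories absent from `bin_embeddings` stay NaN.
    """
    n = len(ordered_categories)
    rdm = np.full((n, n), np.nan)
    available = [c for c in ordered_categories if c in bin_embeddings]
    if len(available) < 2:
        return rdm, available

    idx = [ordered_categories.index(c) for c in available]
    X = np.stack([np.asarray(bin_embeddings[c], dtype=float).ravel() for c in available])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    dist = 1.0 - X @ X.T                               # cosine distance, in [0, 2]
    np.fill_diagonal(dist, 0.0)
    rdm[np.ix_(idx, idx)] = dist                       # write into the full-size matrix
    return rdm, available

# Toy data (made up): 'cup' is missing from this bin
order = ['dog', 'cat', 'cup', 'hand']
emb = {'dog': [1.0, 0.0], 'cat': [0.8, 0.6], 'hand': [0.0, 1.0]}
rdm, available = build_padded_rdm(emb, order)
print(np.round(rdm, 2))
```

In this toy case the row and column for `cup` are entirely NaN, while the remaining cells hold ordinary cosine distances.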

### Step-by-Step Correlation Process

#### Step 1: Identify Common Categories
- **Input**: Two lists of available categories (`available_cats1` for younger, `available_cats2` for older) and the full `ordered_categories_list` (predefined order)
- **Process**: Find categories that are in BOTH available lists, preserving the predefined order (NOT alphabetical)
- **Output**: `common_categories` - categories present in BOTH age bins, in predefined order
- **Example**: If younger has 150 categories and older has 155 categories, they might share 140 categories
- **Key Point**: Order matters! We use the predefined order to ensure submatrices are aligned correctly

#### Step 2: Map to Full Category Order
- **Input**: `common_categories` and the full `ordered_categories` list (163 categories)
- **Process**: Find the indices of common categories in the full ordered list
- **Output**: `common_indices` - positions in the 163x163 RDM matrices
- **Purpose**: This ensures we extract the correct submatrices from the full RDMs

#### Step 3: Extract Submatrices
- **Input**: Full 163x163 RDMs and `common_indices`
- **Process**: Extract square submatrices using `rdm[np.ix_(common_indices, common_indices)]`
- **Output**: Two smaller square matrices (e.g., 140x140 if 140 common categories)
- **Key Point**: These submatrices contain ONLY the common categories, but may still have NaN if there are any data issues
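
Steps 1 to 3 can be sketched on a toy example as follows. The category names and the placeholder matrix are made up for illustration; only the indexing pattern mirrors the process described above.

```python
import numpy as np

ordered_categories = ['dog', 'cat', 'cup', 'hand', 'ball']   # toy predefined order
available_younger = ['dog', 'cat', 'cup', 'hand']
available_older = ['cat', 'cup', 'hand', 'ball']

# Step 1: categories present in both bins, kept in predefined (not alphabetical) order
younger_set, older_set = set(available_younger), set(available_older)
common_categories = [c for c in ordered_categories if c in younger_set and c in older_set]

# Step 2: positions of the common categories in the full order
position = {c: i for i, c in enumerate(ordered_categories)}
common_indices = [position[c] for c in common_categories]

# Step 3: square submatrix restricted to the common categories
full_rdm = np.arange(25, dtype=float).reshape(5, 5)   # placeholder values, not real distances
sub_rdm = full_rdm[np.ix_(common_indices, common_indices)]

print(common_categories)
print(common_indices)
print(sub_rdm.shape)
```

`np.ix_` selects the same rows and columns at once, so the result stays square and aligned with `common_categories`.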

#### Step 4: Extract Upper Triangle
- **Process**: Use a triangular mask to extract only the upper triangle (excluding diagonal)
- **Why**: RDMs are symmetric, so we only need half the values to avoid double-counting
- **Output**: Two flattened arrays of pairwise distances
- **Size**: If n common categories, we get n×(n-1)/2 distance values
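
As a small illustration of Step 4, the sketch below extracts the upper triangle of a toy symmetric RDM (values are made up) and checks the pair count.

```python
import numpy as np

# Toy symmetric RDM for 4 categories (made-up values)
rdm = np.array([
    [0.0, 0.2, 0.9, 0.7],
    [0.2, 0.0, 0.8, 0.6],
    [0.9, 0.8, 0.0, 0.3],
    [0.7, 0.6, 0.3, 0.0],
])

n = rdm.shape[0]
upper = np.triu(np.ones((n, n), dtype=bool), k=1)   # k=1 skips the diagonal
flat = rdm[upper]                                    # upper-triangle values in row-major order

print(flat)
print(len(flat) == n * (n - 1) // 2)
```

Because the matrix is symmetric, the lower triangle would contribute exactly the same values a second time.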

#### Step 5: Filter NaN Values
- **Process**: Create a boolean mask identifying valid (non-NaN) values in BOTH arrays
- **Filter**: Keep only pairs where BOTH RDMs have valid values
- **Output**: Two arrays of the same length with only valid distance pairs
- **Safety Check**: Even though we only use common categories, this ensures no NaN values slip through
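
The joint filter in Step 5 can be sketched like this, using made-up flattened distances. A pair is kept only when neither array is NaN at that position, which keeps the two arrays aligned.

```python
import numpy as np

# Toy flattened upper triangles (made-up values) with NaN in different positions
younger_flat = np.array([0.2, np.nan, 0.7, 0.8, 0.6])
older_flat = np.array([0.3, 0.9, np.nan, 0.7, 0.5])

# Valid only where BOTH arrays have a value
valid = ~(np.isnan(younger_flat) | np.isnan(older_flat))
younger_valid = younger_flat[valid]
older_valid = older_flat[valid]

print(valid)
print(younger_valid, older_valid)
```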

#### Step 6: Compute Spearman Correlation
- **Method**: Spearman rank correlation (non-parametric, robust to outliers)
- **Input**: Two arrays of valid distance values (same length, same category pairs)
- **Output**: Correlation coefficient (-1 to 1)
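- **Interpretation** (rough guides, not formal cutoffs):
  - High correlation (>0.7): Similar representational structure across age bins
  - Moderate correlation (0.5-0.7): Partially preserved structure
  - Low correlation (<0.5): Representational structure differs substantially between age bins
  - Near 0: No relationship between structures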
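
The sketch below illustrates why a rank-based measure suits distance data: Spearman's coefficient depends only on the ordering of the values, so a monotone rescaling of one RDM leaves it unchanged. The simulated distances are synthetic and are not drawn from the study data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
d_younger = rng.uniform(0, 2, size=50)                 # synthetic distances
d_older = d_younger + rng.normal(0, 0.2, size=50)      # noisy copy of the same structure

rho, _ = spearmanr(d_younger, d_older)
# exp() is strictly increasing, so the ranks (and hence rho) are preserved
rho_rescaled, _ = spearmanr(d_younger, np.exp(d_older))

print(f"rho = {rho:.3f}, after monotone rescaling = {rho_rescaled:.3f}")
```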

### Handling NaN Values

**Where NaN values come from:**
1. Categories not present in a particular age bin (expected)
2. Categories present but with insufficient data (rare, but possible)

**How we handle them:**
1. **Pre-filtering**: We only use categories present in BOTH bins (common categories)
2. **Submatrix extraction**: We extract only the common category submatrices
3. **Post-filtering**: We filter out any remaining NaN values before correlation
4. **Result**: The correlation is computed only on valid distance pairs

**Why this works:**
- By using only common categories, we ensure we're comparing the same category pairs
- The correlation reflects how similarly those common categories are organized in younger vs older periods
- Missing categories don't affect the correlation (they're simply excluded)

### Comparability Across Subjects

**Are correlations comparable across subjects?**

**YES, with important caveats:**

1. **Same correlation metric**: All subjects use Spearman correlation on the same type of data (distance matrices)

2. **Different category sets**: Each subject may have different numbers of common categories:
   - Subject A: 140 common categories → 9,730 distance pairs
   - Subject B: 150 common categories → 11,175 distance pairs
   - Subject C: 130 common categories → 8,385 distance pairs

3. **Interpretation considerations**:
   - **Correlation values share a common scale**: A correlation of 0.8 indicates strong similarity between age bins for any subject, although it is computed over that subject's own set of category pairs
   - **Statistical power varies**: Subjects with more common categories have more data points, so their correlations may be more reliable
   - **Missing categories do not enter the computation**: Because only common categories are used, missing categories cannot change the value; the correlation describes the common categories only

4. **What makes correlations comparable**:
   - Same age split (median = 16 months for all)
   - Same normalization (within-subject z-score normalization)
   - Same distance metric (cosine distance)
   - Same correlation method (Spearman)
   - Only common categories used (fair comparison)

5. **What to consider when comparing**:
   - **Number of common categories**: Tracked in `n_common_categories` - more categories = more reliable
   - **Category composition**: Different subjects may have different sets of common categories
   - **Data density**: Subjects with more data in both bins may have more stable RDMs
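
The pair counts listed in item 2 follow from n×(n-1)/2 and can be checked directly. The subject labels A, B, and C are the illustrative ones used above.

```python
from math import comb

# Number of unique category pairs for each illustrative subject
pair_counts = {subject: comb(n_common, 2)
               for subject, n_common in [('A', 140), ('B', 150), ('C', 130)]}

for subject, n_pairs in pair_counts.items():
    print(f"Subject {subject}: {n_pairs:,} distance pairs")
```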

### Example Walkthrough

**Subject 00240001:**
- Younger bin: 155 categories available
- Older bin: 160 categories available
- Common categories: 150 categories
- Extracted submatrices: 150×150 (22,500 cells)
- Upper triangle: 11,175 distance pairs
- After NaN filtering: 11,175 valid pairs (assuming all common categories have data)
- Spearman correlation: 0.756
- **Interpretation**: Strong similarity (0.756) between younger and older representational structures, based on 150 common categories

**Subject 00320001:**
- Younger bin: 140 categories available
- Older bin: 145 categories available
- Common categories: 135 categories
- Extracted submatrices: 135×135 (18,225 cells)
- Upper triangle: 9,045 distance pairs
- After NaN filtering: 9,045 valid pairs
- Spearman correlation: 0.682
- **Interpretation**: Moderate similarity (0.682) between age bins, based on 135 common categories

**Comparison**: Subject 00240001 has a higher correlation (0.756 vs 0.682), suggesting more stable representational structure across development. However, we should also consider that Subject 00240001 has more common categories (150 vs 135), which provides more data for the correlation.

### Summary

The correlation logic:
1. ✅ Uses only categories present in BOTH age bins (fair comparison)
2. ✅ Filters out all NaN values before correlation
3. ✅ Uses Spearman correlation (robust, non-parametric)
4. ✅ Produces comparable values across subjects
5. ⚠️ But correlations should be interpreted with awareness of the number of common categories

**Key insight**: The correlation tells us how similarly categories are organized in younger vs older periods, but only for the categories that exist in both periods. This is appropriate for developmental trajectory analysis because we want to know: "For the categories this child experienced at both ages, how stable was their representational structure?"

In [122]:
# Demonstration: How RDM Correlation Works with NaN Values
# This cell demonstrates the correlation logic with a simple example

print("="*70)
print("DEMONSTRATION: RDM Correlation Logic with NaN Values")
print("="*70)

# Example: Simple 5-category case
print("\n1. SETUP: Full category order (5 categories)")
ordered_cats = ['cat1', 'cat2', 'cat3', 'cat4', 'cat5']
print(f"   Ordered categories: {ordered_cats}")

print("\n2. EXAMPLE SUBJECT:")
print("   Younger bin has: cat1, cat2, cat3, cat4 (4 categories)")
print("   Older bin has:   cat2, cat3, cat4, cat5 (4 categories)")
available_younger = ['cat1', 'cat2', 'cat3', 'cat4']
available_older = ['cat2', 'cat3', 'cat4', 'cat5']

print("\n3. COMMON CATEGORIES:")
# Preserve predefined order, not alphabetical
common = [cat for cat in ordered_cats if cat in available_younger and cat in available_older]
print(f"   Common categories: {common} ({len(common)} categories)")
print("   Note: cat1 only in younger, cat5 only in older - these are excluded")
print("   IMPORTANT: Categories are in predefined order, not alphabetical!")

print("\n4. RDM STRUCTURE:")
print("   Full RDMs are 5×5 (one row/column per category in ordered_cats)")
print("   Missing categories have NaN in their rows/columns")
print("\n   Younger RDM structure:")
print("   " + " ".join([f"{c:>6}" for c in ordered_cats]))
for cat in ordered_cats:
    status = "  data" if cat in available_younger else "   NaN"
    print(f"   {cat:>6}: {status}")

print("\n5. SUBMATRIX EXTRACTION:")
common_indices = [ordered_cats.index(c) for c in common]
print(f"   Common category indices in full RDM: {common_indices}")
print(f"   Extract {len(common)}×{len(common)} submatrix using these indices")
print(f"   This gives us only the common categories: {common}")

print("\n6. UPPER TRIANGLE:")
n_common = len(common)
n_pairs = n_common * (n_common - 1) // 2
print(f"   For {n_common} categories, we get {n_pairs} unique pairs")
print(f"   (excluding diagonal: {n_common} self-pairs)")
print(f"   Example pairs: (cat2-cat3), (cat2-cat4), (cat3-cat4)")

print("\n7. CORRELATION:")
print("   - Extract distance values for these pairs from both RDMs")
print("   - Filter out any NaN values (shouldn't be any for common categories)")
print("   - Compute Spearman correlation on the paired distance values")
print("   - Result: Single correlation coefficient (-1 to 1)")

print("\n8. WHY THIS WORKS:")
print("   ✓ Only uses categories present in BOTH bins (fair comparison)")
print("   ✓ Same category pairs compared in both RDMs")
print("   ✓ NaN values are excluded (don't affect correlation)")
print("   ✓ Correlation reflects structural similarity, not data availability")

print("\n" + "="*70)
print("For actual subjects, this process uses 163 categories")
print("Common categories typically range from 130-160 per subject")
print("="*70)

DEMONSTRATION: RDM Correlation Logic with NaN Values

1. SETUP: Full category order (5 categories)
   Ordered categories: ['cat1', 'cat2', 'cat3', 'cat4', 'cat5']

2. EXAMPLE SUBJECT:
   Younger bin has: cat1, cat2, cat3, cat4 (4 categories)
   Older bin has:   cat2, cat3, cat4, cat5 (4 categories)

3. COMMON CATEGORIES:
   Common categories: ['cat2', 'cat3', 'cat4'] (3 categories)
   Note: cat1 only in younger, cat5 only in older - these are excluded
   IMPORTANT: Categories are in predefined order, not alphabetical!

4. RDM STRUCTURE:
   Full RDMs are 5×5 (one row/column per category in ordered_cats)
   Missing categories have NaN in their rows/columns

   Younger RDM structure:
     cat1   cat2   cat3   cat4   cat5
     cat1:   data
     cat2:   data
     cat3:   data
     cat4:   data
     cat5:    NaN

5. SUBMATRIX EXTRACTION:
   Common category indices in full RDM: [1, 2, 3]
   Extract 3×3 submatrix using these indices
   This gives us only the common categories: ['cat2', 'cat3',

In [123]:
def compute_rdm_correlation(rdm1, rdm2, ordered_categories_list, available_cats1, available_cats2):
    """
    Compute correlation between two RDMs that use the full ordered_categories list with NaN for missing categories.
    Only uses categories present in both RDMs (non-NaN in both).
    
    Args:
        rdm1: numpy array of shape (n_categories, n_categories) with NaN for missing categories
        rdm2: numpy array of shape (n_categories, n_categories) with NaN for missing categories
        ordered_categories_list: full list of categories in order (used for indexing)
        available_cats1: list of categories actually present in rdm1
        available_cats2: list of categories actually present in rdm2
    
    Returns:
        corr: correlation coefficient (or np.nan if insufficient data)
        n_common: number of common categories
    """
    # Find common categories (categories present in both RDMs)
    # IMPORTANT: Preserve predefined order from ordered_categories_list, NOT alphabetical order
    # This ensures submatrices are extracted in the same order for both RDMs
    # and maintains consistency with visualizations which use the predefined order
    set1, set2 = set(available_cats1), set(available_cats2)
    common_categories = [cat for cat in ordered_categories_list
                        if cat in set1 and cat in set2]
    
    if len(common_categories) < 2:
        return np.nan, len(common_categories)
    
    # Get indices for common categories in the ordered_categories_list
    index_of = {cat: i for i, cat in enumerate(ordered_categories_list)}
    common_indices = [index_of[cat] for cat in common_categories]
    
    # Extract submatrices for common categories
    rdm1_subset = rdm1[np.ix_(common_indices, common_indices)]
    rdm2_subset = rdm2[np.ix_(common_indices, common_indices)]
    
    # Get upper triangle (excluding diagonal) for both RDMs
    mask = np.triu(np.ones_like(rdm1_subset, dtype=bool), k=1)
    rdm1_flat = rdm1_subset[mask]
    rdm2_flat = rdm2_subset[mask]
    
    # Filter out NaN values (shouldn't be any if categories are truly common, but check anyway)
    valid_mask = ~(np.isnan(rdm1_flat) | np.isnan(rdm2_flat))
    rdm1_valid = rdm1_flat[valid_mask]
    rdm2_valid = rdm2_flat[valid_mask]
    
    # Compute Spearman correlation (rank-based, so less sensitive to outliers).
    # At least two valid pairs are needed for the coefficient to be defined.
    if len(rdm1_valid) >= 2:
        corr, _ = spearmanr(rdm1_valid, rdm2_valid)
        return corr, len(common_categories)
    else:
        return np.nan, len(common_categories)

# Compute RDM correlations between younger and older bins for each subject
trajectory_data = []

for subject_id, bin_rdms in tqdm(subject_age_rdms.items(), desc="Analyzing trajectories"):
    if 'younger' not in bin_rdms or 'older' not in bin_rdms:
        continue
    
    rdm_younger = bin_rdms['younger']
    rdm_older = bin_rdms['older']
    cats_younger = subject_age_rdm_categories[subject_id]['younger']
    cats_older = subject_age_rdm_categories[subject_id]['older']
    
    # Use ordered_categories as reference and available categories for filtering
    corr, n_common = compute_rdm_correlation(
        rdm_younger, rdm_older, 
        ordered_categories,  # Full ordered list for indexing
        cats_younger,  # Available categories in younger bin
        cats_older     # Available categories in older bin
    )
    
    trajectory_data.append({
        'subject_id': subject_id,
        'age_bin_1': 'younger',
        'age_bin_2': 'older',
        'median_age_threshold': overall_median_age,
        'rdm_correlation': corr,
        'n_common_categories': n_common,
        'n_categories_younger': len(cats_younger),
        'n_categories_older': len(cats_older)
    })

trajectory_df = pd.DataFrame(trajectory_data)
trajectory_df.to_csv(output_dir / "trajectory_correlations.csv", index=False)

print(f"\nTrajectory analysis:")
print(f"  Total subjects analyzed: {len(trajectory_df)}")
print(f"  Mean RDM correlation (younger vs older): {trajectory_df['rdm_correlation'].mean():.3f}")
print(f"  Std RDM correlation: {trajectory_df['rdm_correlation'].std():.3f}")
print(f"  Median age threshold: {overall_median_age:.1f} months")
print(f"\nSaved trajectory correlations to {output_dir / 'trajectory_correlations.csv'}")


Analyzing trajectories: 100%|██████████| 18/18 [00:00<00:00, 384.13it/s]


Trajectory analysis:
  Total subjects analyzed: 18
  Mean RDM correlation (younger vs older): 0.771
  Std RDM correlation: 0.080
  Median age threshold: 16.0 months

Saved trajectory correlations to developmental_trajectory_rdms/trajectory_correlations.csv





## Category-Based Correlations

Compute correlations between younger and older RDMs separately for each broad semantic category group (animals, bodyparts, big_objects, small_objects, others). This allows us to examine whether developmental stability varies across different semantic domains.

In [124]:
# Compute category-based correlations for each semantic group
category_correlation_data = []

# Get category groups from organized structure
category_groups = {
    'animals': organized['animals'],
    'bodyparts': organized['bodyparts'],
    'big_objects': organized['big_objects'],
    'small_objects': organized['small_objects'],
    'others': organized['others']
}

print("Computing category-based correlations...")
print(f"Category group sizes: {[(name, len(cats)) for name, cats in category_groups.items()]}")

for subject_id, bin_rdms in tqdm(subject_age_rdms.items(), desc="Category correlations"):
    if 'younger' not in bin_rdms or 'older' not in bin_rdms:
        continue
    
    rdm_younger = bin_rdms['younger']
    rdm_older = bin_rdms['older']
    cats_younger = subject_age_rdm_categories[subject_id]['younger']
    cats_older = subject_age_rdm_categories[subject_id]['older']
    
    # Compute correlation for each category group
    for group_name, group_categories in category_groups.items():
        # Find common categories in this group that are present in both age bins
        common_in_group = [cat for cat in group_categories 
                          if cat in cats_younger and cat in ordered_categories and cat in cats_older]
        
        if len(common_in_group) < 2:
            # Not enough categories in this group for correlation
            category_correlation_data.append({
                'subject_id': subject_id,
                'category_group': group_name,
                'n_common_categories': len(common_in_group),
                'correlation': np.nan,
                'n_categories_younger': len([c for c in group_categories if c in cats_younger]),
                'n_categories_older': len([c for c in group_categories if c in cats_older])
            })
            continue
        
        # Get indices for common categories in this group
        common_indices = [ordered_categories.index(cat) for cat in common_in_group]
        
        # Extract submatrices for this group
        rdm_younger_group = rdm_younger[np.ix_(common_indices, common_indices)]
        rdm_older_group = rdm_older[np.ix_(common_indices, common_indices)]
        
        # Get upper triangle (excluding diagonal)
        mask = np.triu(np.ones_like(rdm_younger_group, dtype=bool), k=1)
        rdm_younger_flat = rdm_younger_group[mask]
        rdm_older_flat = rdm_older_group[mask]
        
        # Filter out NaN values
        valid_mask = ~(np.isnan(rdm_younger_flat) | np.isnan(rdm_older_flat))
        rdm_younger_valid = rdm_younger_flat[valid_mask]
        rdm_older_valid = rdm_older_flat[valid_mask]
        
        # Compute Spearman correlation (needs at least two valid pairs;
        # a two-category group such as 'others' yields only one pair)
        if len(rdm_younger_valid) >= 2:
            corr, _ = spearmanr(rdm_younger_valid, rdm_older_valid)
        else:
            corr = np.nan
        
        category_correlation_data.append({
            'subject_id': subject_id,
            'category_group': group_name,
            'n_common_categories': len(common_in_group),
            'correlation': corr,
            'n_categories_younger': len([c for c in group_categories if c in cats_younger]),
            'n_categories_older': len([c for c in group_categories if c in cats_older])
        })

category_corr_df = pd.DataFrame(category_correlation_data)
category_corr_df.to_csv(output_dir / "category_group_correlations.csv", index=False)

print(f"\nCategory-based correlation analysis:")
print(f"  Total subject-group combinations: {len(category_corr_df)}")
print(f"\nMean correlations by category group:")
for group_name in category_groups.keys():
    group_data = category_corr_df[category_corr_df['category_group'] == group_name]
    valid_corrs = group_data['correlation'].dropna()
    if len(valid_corrs) > 0:
        print(f"  {group_name:15s}: {valid_corrs.mean():.3f} (n={len(valid_corrs)} valid, {len(group_data)} total)")
    else:
        print(f"  {group_name:15s}: No valid correlations (n={len(group_data)} total)")

print(f"\nSaved category group correlations to {output_dir / 'category_group_correlations.csv'}")

Computing category-based correlations...
Category group sizes: [('animals', 19), ('bodyparts', 14), ('big_objects', 32), ('small_objects', 96), ('others', 2)]


Category correlations: 100%|██████████| 18/18 [00:00<00:00, 414.16it/s]


Category-based correlation analysis:
  Total subject-group combinations: 90

Mean correlations by category group:
  animals        : 0.522 (n=18 valid, 18 total)
  bodyparts      : 0.847 (n=18 valid, 18 total)
  big_objects    : 0.752 (n=18 valid, 18 total)
  small_objects  : 0.775 (n=18 valid, 18 total)
  others         : No valid correlations (n=18 total)

Saved category group correlations to developmental_trajectory_rdms/category_group_correlations.csv





## Visualize Category-Based Correlations

Create visualizations to examine how developmental stability (correlation between younger and older RDMs) varies across different semantic category groups.

In [None]:
# Visualize category-based correlations
print("Creating visualizations for category-based correlations...")

# Filter out NaN correlations for plotting
valid_category_corr_df = category_corr_df[category_corr_df['correlation'].notna()].copy()

# Create figure with multiple subplots
fig = plt.figure(figsize=(18, 12))

# 1. Box plot comparing correlations across category groups
ax1 = plt.subplot(2, 3, 1)
category_order = ['animals', 'bodyparts', 'big_objects', 'small_objects', 'others']
# Collect correlations per group, skipping groups with no valid values
box_data_filtered = []
labels_filtered = []
for group in category_order:
    group_data = valid_category_corr_df[valid_category_corr_df['category_group'] == group]['correlation'].values
    if len(group_data) > 0:
        box_data_filtered.append(group_data)
        labels_filtered.append(group.replace('_', ' ').title())

bp = ax1.boxplot(box_data_filtered, labels=labels_filtered, patch_artist=True)
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', '#98D8C8']
for patch, color in zip(bp['boxes'], colors[:len(bp['boxes'])]):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax1.set_ylabel('RDM Correlation (Spearman)', fontsize=12)
ax1.set_title('Distribution of Correlations by Category Group', fontsize=13, pad=10)
ax1.grid(True, alpha=0.3, axis='y')
ax1.set_ylim([0, 1])
plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45, ha='right')

# 2. Bar plot of mean correlations by group
ax2 = plt.subplot(2, 3, 2)
mean_corrs = []
std_corrs = []
group_labels = []
for group in category_order:
    group_data = valid_category_corr_df[valid_category_corr_df['category_group'] == group]['correlation']
    if len(group_data) > 0:
        mean_corrs.append(group_data.mean())
        std_corrs.append(group_data.std())
        group_labels.append(group.replace('_', ' ').title())

bars = ax2.bar(range(len(group_labels)), mean_corrs, yerr=std_corrs, 
               color=colors[:len(group_labels)], alpha=0.7, capsize=5, edgecolor='black')
ax2.set_xticks(range(len(group_labels)))
ax2.set_xticklabels(group_labels, rotation=45, ha='right')
ax2.set_ylabel('Mean RDM Correlation', fontsize=12)
ax2.set_title('Mean Correlations by Category Group', fontsize=13, pad=10)
ax2.grid(True, alpha=0.3, axis='y')
ax2.set_ylim([0, 1])
ax2.axhline(y=valid_category_corr_df['correlation'].mean(), color='red', 
           linestyle='--', linewidth=2, label=f'Overall mean: {valid_category_corr_df["correlation"].mean():.3f}')
ax2.legend()

# 3. Heatmap: subjects x category groups
ax3 = plt.subplot(2, 3, 3)
pivot_data = valid_category_corr_df.pivot(index='subject_id', columns='category_group', values='correlation')
# Reorder columns
pivot_data = pivot_data[[col for col in category_order if col in pivot_data.columns]]
# Sort subjects by overall correlation (average across groups)
pivot_data['mean_corr'] = pivot_data.mean(axis=1)
pivot_data = pivot_data.sort_values('mean_corr', ascending=False)
pivot_data = pivot_data.drop('mean_corr', axis=1)

im = ax3.imshow(pivot_data.values, aspect='auto', cmap='RdYlBu_r', vmin=0, vmax=1)
ax3.set_xticks(range(len(pivot_data.columns)))
ax3.set_xticklabels([col.replace('_', ' ').title() for col in pivot_data.columns], 
                    rotation=45, ha='right')
ax3.set_yticks(range(len(pivot_data.index)))
ax3.set_yticklabels(pivot_data.index, fontsize=8)
ax3.set_title('Correlation Heatmap: Subjects × Category Groups', fontsize=13, pad=10)
plt.colorbar(im, ax=ax3, label='RDM Correlation')

# 4. Violin plot for better distribution visualization
ax4 = plt.subplot(2, 3, 4)
violin_data = []
violin_labels = []
for group in category_order:
    group_data = valid_category_corr_df[valid_category_corr_df['category_group'] == group]['correlation'].values
    if len(group_data) > 0:
        violin_data.append(group_data)
        violin_labels.append(group.replace('_', ' ').title())

parts = ax4.violinplot(violin_data, positions=range(len(violin_labels)), showmeans=True, showmedians=True)
for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i % len(colors)])
    pc.set_alpha(0.7)
ax4.set_xticks(range(len(violin_labels)))
ax4.set_xticklabels(violin_labels, rotation=45, ha='right')
ax4.set_ylabel('RDM Correlation (Spearman)', fontsize=12)
ax4.set_title('Distribution of Correlations (Violin Plot)', fontsize=13, pad=10)
ax4.grid(True, alpha=0.3, axis='y')
ax4.set_ylim([0, 1])

# 5. Scatter plot: correlation vs number of common categories
ax5 = plt.subplot(2, 3, 5)
for group in category_order:
    group_data = valid_category_corr_df[valid_category_corr_df['category_group'] == group]
    if len(group_data) > 0:
        ax5.scatter(group_data['n_common_categories'], group_data['correlation'], 
                   label=group.replace('_', ' ').title(), alpha=0.6, s=60)

ax5.set_xlabel('Number of Common Categories', fontsize=12)
ax5.set_ylabel('RDM Correlation', fontsize=12)
ax5.set_title('Correlation vs Category Count', fontsize=13, pad=10)
ax5.legend(loc='best', fontsize=9)
ax5.grid(True, alpha=0.3)
ax5.set_ylim([0, 1])

# 6. Individual subject trajectories (bar plot for each subject)
ax6 = plt.subplot(2, 3, 6)
# Get top 10 subjects by overall correlation for cleaner visualization
subject_means = valid_category_corr_df.groupby('subject_id')['correlation'].mean().sort_values(ascending=False)
top_subjects = subject_means.head(10).index

x_pos = np.arange(len(top_subjects))
width = 0.15
for i, group in enumerate(category_order):
    if group in valid_category_corr_df['category_group'].values:
        group_corrs = []
        for subj in top_subjects:
            subj_group_data = valid_category_corr_df[
                (valid_category_corr_df['subject_id'] == subj) & 
                (valid_category_corr_df['category_group'] == group)
            ]
            if len(subj_group_data) > 0:
                group_corrs.append(subj_group_data['correlation'].values[0])
            else:
                group_corrs.append(np.nan)
        
        # Only plot if we have data
        if not all(np.isnan(group_corrs)):
            ax6.bar(x_pos + i*width, group_corrs, width, 
                   label=group.replace('_', ' ').title(), alpha=0.7, color=colors[i % len(colors)])

ax6.set_xlabel('Subject ID', fontsize=12)
ax6.set_ylabel('RDM Correlation', fontsize=12)
ax6.set_title('Top 10 Subjects: Correlations by Category Group', fontsize=13, pad=10)
ax6.set_xticks(x_pos + width * 2)
ax6.set_xticklabels(top_subjects, rotation=45, ha='right', fontsize=8)
ax6.legend(loc='upper left', fontsize=8, ncol=2)
ax6.grid(True, alpha=0.3, axis='y')
ax6.set_ylim([0, 1])

plt.suptitle('Category-Based RDM Correlations: Younger vs Older Age Bins', 
             fontsize=16, y=0.995, fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.99])
plt.savefig(output_dir / "category_group_correlations_visualization.png", dpi=200, bbox_inches='tight')
print(f"Saved category correlation visualization to {output_dir / 'category_group_correlations_visualization.png'}")
plt.close()

# Create a separate detailed heatmap with all subjects
fig, ax = plt.subplots(figsize=(10, 14))
pivot_data_all = valid_category_corr_df.pivot(index='subject_id', columns='category_group', values='correlation')
pivot_data_all = pivot_data_all[[col for col in category_order if col in pivot_data_all.columns]]
# Sort by overall mean correlation
pivot_data_all['mean_corr'] = pivot_data_all.mean(axis=1)
pivot_data_all = pivot_data_all.sort_values('mean_corr', ascending=False)
pivot_data_all = pivot_data_all.drop('mean_corr', axis=1)

im = ax.imshow(pivot_data_all.values, aspect='auto', cmap='RdYlBu_r', vmin=0, vmax=1)
ax.set_xticks(range(len(pivot_data_all.columns)))
ax.set_xticklabels([col.replace('_', ' ').title() for col in pivot_data_all.columns], 
                   rotation=45, ha='right', fontsize=11)
ax.set_yticks(range(len(pivot_data_all.index)))
ax.set_yticklabels(pivot_data_all.index, fontsize=9)
ax.set_title('Category-Based RDM Correlations: All Subjects\n(Younger vs Older Age Bins)', 
             fontsize=14, pad=15, fontweight='bold')
cbar = plt.colorbar(im, ax=ax, label='RDM Correlation (Spearman)', fraction=0.046, pad=0.04)

# Add text annotations for correlation values
for i in range(len(pivot_data_all.index)):
    for j in range(len(pivot_data_all.columns)):
        val = pivot_data_all.iloc[i, j]
        if not np.isnan(val):
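            # RdYlBu_r is dark at both ends and pale in the middle, so use white text only near 0 or 1
            ax.text(j, i, f'{val:.2f}', ha='center', va='center', 
                   fontsize=7, color='white' if abs(val - 0.5) > 0.35 else 'black', fontweight='bold')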

plt.tight_layout()
plt.savefig(output_dir / "category_group_correlations_heatmap.png", dpi=200, bbox_inches='tight')
print(f"Saved detailed heatmap to {output_dir / 'category_group_correlations_heatmap.png'}")
plt.close()

print("\nVisualization complete!")

## Visualize Developmental Trajectories
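
The plotting cell in this section relies on two ideas: a color scale shared by the younger and older panels, and a masked array so that missing categories render as white cells. The block here is a minimal sketch of both on toy data. The helper name `shared_color_limits` and the toy matrices are assumptions for illustration; the percentile bounds and the `(0, 2)` fallback mirror the values used in the cell.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the notebook setup
import matplotlib.pyplot as plt


def shared_color_limits(rdms, lower=1, upper=99, fallback=(0.0, 2.0)):
    """Percentile-based (vmin, vmax) pooled over several RDMs, ignoring NaN.

    Hypothetical helper; expects a non-empty list of arrays.
    """
    values = np.concatenate([np.asarray(r, dtype=float).ravel() for r in rdms])
    values = values[~np.isnan(values)]
    if values.size == 0:
        return fallback
    vmin, vmax = np.percentile(values, [lower, upper])
    return float(vmin), float(vmax)


# Toy RDMs over three categories; the middle category is missing in `older`
younger = np.array([[0.0, 0.8, 1.2],
                    [0.8, 0.0, 1.0],
                    [1.2, 1.0, 0.0]])
older = np.array([[0.0, np.nan, 1.4],
                  [np.nan, np.nan, np.nan],
                  [1.4, np.nan, 0.0]])

vmin, vmax = shared_color_limits([younger, older])

# Copy the colormap before changing its "bad" color so the registered one is untouched
cmap = plt.get_cmap("viridis").copy()
cmap.set_bad(color="white", alpha=1.0)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, (name, rdm) in zip(axes, [("younger", younger), ("older", older)]):
    masked = np.ma.masked_invalid(rdm)  # NaN cells become masked and are drawn white
    im = ax.imshow(masked, cmap=cmap, vmin=vmin, vmax=vmax)
    ax.set_title(name)
    fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
plt.close(fig)
```

Pooling both bins before taking percentiles keeps the two panels visually comparable; clipping at the 1st and 99th percentiles stops a few extreme distances from compressing the rest of the color range.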


In [125]:
# Create side-by-side RDM visualization for each subject (younger vs older)
print("Creating RDM visualizations for all subjects (younger vs older)...")

for subject_id in tqdm(subject_age_rdms.keys(), desc="Creating RDM plots"):
    bin_rdms = subject_age_rdms[subject_id]
    bin_masks = subject_age_rdm_masks[subject_id]
    
    if 'younger' not in bin_rdms or 'older' not in bin_rdms:
        continue
    
    # Create figure with 2 subplots side by side
    fig, axes = plt.subplots(1, 2, figsize=(16, 7))
    
    # Find global min/max for consistent color scale (excluding NaN)
    all_rdm_values = []
    for rdm in bin_rdms.values():
        valid_values = rdm[~np.isnan(rdm)]
        if len(valid_values) > 0:
            all_rdm_values.extend(valid_values)
    vmin = np.percentile(all_rdm_values, 1) if len(all_rdm_values) > 0 else 0
    vmax = np.percentile(all_rdm_values, 99) if len(all_rdm_values) > 0 else 2
    
    for idx, bin_name in enumerate(['younger', 'older']):
        rdm = bin_rdms[bin_name]
        mask = bin_masks[bin_name]
        available_cats = subject_age_rdm_categories[subject_id][bin_name]
        group_boundaries = subject_age_group_boundaries[subject_id][bin_name]
        
        ax = axes[idx]
        
        # Pick the tick font size from the total number of categories in the predefined order
        n_cats_total = len(ordered_categories)
        n_cats_available = len(available_cats)
        
        if n_cats_total <= 50:
            tick_fontsize = 12
        elif n_cats_total <= 100:
            tick_fontsize = 10
        else:
            tick_fontsize = 8
        
        # Create masked array for visualization (white cells for missing categories)
        rdm_masked = np.ma.masked_where(mask, rdm)
        # plt.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap works across versions.
        # Copy the colormap so the registered one is not modified.
        cmap = plt.get_cmap('viridis').copy()
        cmap.set_bad(color='white', alpha=1.0)  # White for NA cells
        im = ax.imshow(rdm_masked, cmap=cmap, aspect='auto', vmin=vmin, vmax=vmax)
        
        # Add dashed separators between category groups (none before the first group)
        for boundary in group_boundaries:
            if boundary["start"] > 0:
                edge = boundary["start"] - 0.5
                ax.axvline(x=edge, color="white", linewidth=1.5, linestyle="--", alpha=0.7)
                ax.axhline(y=edge, color="white", linewidth=1.5, linestyle="--", alpha=0.7)
        
        # Set category names as axis labels (use full predefined order)
        ax.set_xticks(range(len(ordered_categories)))
        ax.set_yticks(range(len(ordered_categories)))
        # Vertical labels are centered on their ticks; ha="right" would shift them off the cell they name
        ax.set_xticklabels(ordered_categories, rotation=90, ha="center", fontsize=tick_fontsize)
        ax.set_yticklabels(ordered_categories, fontsize=tick_fontsize)
        
        # Create title with age range info and category count
        if bin_name == 'younger':
            title = f"Younger (≤{overall_median_age:.0f} months)\n({n_cats_available}/{n_cats_total} categories)"
        else:
            title = f"Older (>{overall_median_age:.0f} months)\n({n_cats_available}/{n_cats_total} categories)"
        
        ax.set_title(title, fontsize=12, pad=10)
        
        # Add colorbar
        plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    
    plt.suptitle(f"Developmental Trajectory: {subject_id}\n(Median split at {overall_median_age:.1f} months)", 
                 fontsize=14, y=0.995)
    plt.tight_layout(rect=[0, 0, 1, 0.99])
    plt.savefig(output_dir / f"trajectory_{subject_id}.png", dpi=200, bbox_inches='tight')
    plt.close()

print(f"\nSaved RDM visualizations for {len(subject_age_rdms)} subjects")

Creating RDM visualizations for all subjects (younger vs older)...


Creating RDM plots: 100%|██████████| 18/18 [00:36<00:00,  2.05s/it]


Saved RDM visualizations for 18 subjects





## Plot RDM Stability Across Development


In [126]:
# Plot RDM correlation distribution between younger and older bins
# Filter out NaN correlations
valid_correlations = trajectory_df['rdm_correlation'].dropna()

if len(valid_correlations) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Histogram of RDM correlations
    axes[0].hist(valid_correlations, bins=20, alpha=0.7, edgecolor='black')
    mean_corr = valid_correlations.mean()
    axes[0].axvline(mean_corr, color='red', linestyle='--', 
                    label=f'Mean: {mean_corr:.3f}')
    axes[0].set_xlabel('RDM Correlation (Spearman)')
    axes[0].set_ylabel('Number of Subjects')
    axes[0].set_title(f'Distribution of Younger vs Older RDM Correlations\n(n={len(valid_correlations)} valid)')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Box plot
    axes[1].boxplot(valid_correlations, vert=True)
    axes[1].set_ylabel('RDM Correlation (Spearman)')
    axes[1].set_title(f'RDM Correlation: Younger vs Older\n(n={len(valid_correlations)} valid)')
    axes[1].set_xticklabels(['All Subjects'])
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(output_dir / "rdm_stability_analysis.png", dpi=150, bbox_inches='tight')
    print(f"Saved RDM stability analysis to {output_dir / 'rdm_stability_analysis.png'}")
    plt.close()
else:
    print("Warning: No valid correlations to plot (all are NaN)")


Saved RDM stability analysis to developmental_trajectory_rdms/rdm_stability_analysis.png


## Summary Statistics
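
The cell below takes the mean and standard deviation over the full RDM, so the zero diagonal is included and every pair is counted twice. As a hedged alternative, the sketch here summarizes only the off-diagonal upper triangle. It assumes a symmetric RDM with a zero diagonal and NaN for missing categories; the helper name `offdiag_summary` is hypothetical.

```python
import numpy as np


def offdiag_summary(rdm):
    """Summary statistics over valid upper-triangle entries of an RDM.

    Hypothetical helper: the diagonal is skipped and NaN pairs are dropped.
    """
    rdm = np.asarray(rdm, dtype=float)
    iu = np.triu_indices(rdm.shape[0], k=1)  # k=1 excludes the diagonal
    pairs = rdm[iu]
    pairs = pairs[~np.isnan(pairs)]
    if pairs.size == 0:
        return {"n_pairs": 0, "mean": np.nan, "std": np.nan,
                "min": np.nan, "max": np.nan}
    return {
        "n_pairs": int(pairs.size),
        "mean": float(pairs.mean()),
        "std": float(pairs.std()),
        "min": float(pairs.min()),
        "max": float(pairs.max()),
    }


# Toy RDM over three categories with pairwise distances 0.5, 1.0 and 1.5
toy = np.array([[0.0, 0.5, 1.0],
                [0.5, 0.0, 1.5],
                [1.0, 1.5, 0.0]])

stats = offdiag_summary(toy)
full_matrix_mean = float(np.nanmean(toy))  # includes the three diagonal zeros
print(stats["mean"], full_matrix_mean)
```

On this toy matrix the off-diagonal mean is the average of the three pairwise distances, while the full-matrix mean is pulled down by the diagonal. With around 150 categories per RDM the diagonal is a much smaller share of the cells, so the difference in practice is small but systematic.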


In [127]:
# Create summary statistics
summary_data = []

for subject_id, bin_rdms in subject_age_rdms.items():
    for bin_name in ['younger', 'older']:
        if bin_name not in bin_rdms:
            continue
            
        rdm = bin_rdms[bin_name]
        categories = subject_age_rdm_categories[subject_id][bin_name]
        
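        # Drop NaN entries (missing categories) to check whether any data remain
        valid_rdm = rdm[~np.isnan(rdm)]
        # Exclude the zero diagonal so the minimum reflects an actual pairwise distance.
        # Note: the mean, std, and max below are taken over the full matrix, so the
        # mean and std still include the diagonal zeros.
        valid_rdm_positive = valid_rdm[valid_rdm > 0]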
        
        summary_data.append({
            'subject_id': subject_id,
            'age_bin': bin_name,
            'median_age_threshold': overall_median_age,
            'n_categories': len(categories),
            'mean_distance': float(np.nanmean(rdm)) if len(valid_rdm) > 0 else np.nan,
            'std_distance': float(np.nanstd(rdm)) if len(valid_rdm) > 0 else np.nan,
            'min_distance': float(valid_rdm_positive.min()) if len(valid_rdm_positive) > 0 else np.nan,
            'max_distance': float(np.nanmax(rdm)) if len(valid_rdm) > 0 else np.nan
        })

summary_df = pd.DataFrame(summary_data)
summary_df.to_csv(output_dir / "summary_statistics.csv", index=False)

print("Summary statistics:")
print(summary_df.describe())
print(f"\nSaved summary to {output_dir / 'summary_statistics.csv'}")


Summary statistics:
       median_age_threshold  n_categories  mean_distance  std_distance  \
count                  36.0     36.000000      36.000000     36.000000   
mean                   16.0    150.916667       0.961607      0.195848   
std                     0.0     12.748950       0.018295      0.010685   
min                    16.0     88.000000       0.882214      0.175946   
25%                    16.0    149.000000       0.958460      0.188916   
50%                    16.0    154.500000       0.965709      0.196768   
75%                    16.0    157.000000       0.972139      0.203361   
max                    16.0    161.000000       0.983759      0.215448   

       min_distance  max_distance  
count     36.000000     36.000000  
mean       0.051624      1.455407  
std        0.021439      0.059208  
min        0.017231      1.229801  
25%        0.037562      1.432026  
50%        0.048999      1.466502  
75%        0.070066      1.492405  
max        0.102304      