# Official UROP Data Analysis Workflow

This notebook provides a complete, reproducible pipeline for analyzing schema experiment data as part of the UROP project.

## Setup Instructions
1. Click **Runtime** > **Run all** to process the default dataset.
2. To analyze your own data, upload CSV files to the `data/raw` folder or mount Google Drive.
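For step 2, the Drive route could look like the sketch below. `google.colab.drive.mount` is the standard Colab call. The helpers `mount_drive_if_available` and `choose_data_path`, and any Drive folder name you pass in, are hypothetical and not part of this repository.

```python
import os


def mount_drive_if_available(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True on success."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False  # not running in Colab, nothing to mount
    drive.mount(mount_point)
    return True


def choose_data_path(candidate, fallback):
    """Use the candidate folder if it exists, otherwise fall back to the default."""
    return candidate if os.path.isdir(candidate) else fallback
```

After calling `mount_drive_if_available()`, something like `choose_data_path('/content/drive/MyDrive/urop_data', DATA_PATH)` would select your own folder when it exists and keep the default dataset otherwise (the `urop_data` folder name is only an example).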

## 1. Environment Setup

Clone the repository and install the analysis package (`schema_analysis`) from it.

> **⚠️ REQUIRED ACTION**: If you see `KeyError: 'mean'`, an older version of the package is still loaded in memory. Run this cell, then go to **Runtime > Restart session** so the updated package takes effect.
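The restart is needed because Python keeps every imported module cached in `sys.modules` for the life of the session, so reinstalling files on disk does not change code that is already loaded. A minimal check, assuming the package's import name is `schema_analysis` (the helper `needs_restart` is hypothetical):

```python
import sys


def needs_restart(module_name: str) -> bool:
    """Return True if the module is already cached in this session.

    A cached module keeps running its old code even after pip
    replaces the files on disk, so a restart is the safe option.
    """
    return module_name in sys.modules
```

If `needs_restart("schema_analysis")` is true right after the install cell, the session imported the package before the upgrade and should be restarted.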

In [None]:
# Clean up any previous clone to ensure a fresh pull
!rm -rf data-analysis-urop-standardized

# Clone the repository
!git clone https://github.com/dazubanator/data-analysis-urop-standardized.git

# Install the package (forcing upgrade)
!pip install -q --upgrade ./data-analysis-urop-standardized

print("✓ Environment setup complete!")
print("\n--- ACTION REQUIRED ---")
print("If this is your first run today or you just pulled an update:")
print("Go to top menu: Runtime > Restart session (or Restart runtime) NOW.")

## 2. Imports and Configuration

Import required libraries and set the analysis parameters.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from schema_analysis.data_loader import load_and_merge_csvs
from schema_analysis import TubeTrials

# Set high-quality plotting style
sns.set_theme(style="whitegrid", palette="muted")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

# Standard experimental parameters
DATA_PATH = '/content/data-analysis-urop-standardized/data/raw'
MIN_ANGLE = 3
MAX_ANGLE = 43
MAX_INVALID_TRIALS = 2

print("✓ Configuration loaded.")

## 3. Data Processing Pipeline

Load, clean, and filter the experimental data, then compute D-values and per-face statistics.
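How `TubeTrials` implements the validity steps is not shown in this notebook. As a rough illustration only, the two filters (angle bounds per trial, then a cap on invalid trials per subject) could be approximated in plain pandas as follows. The column names `subject_id` and `angle`, the inclusive bounds, and the helper `mark_validity` are all assumptions, not the package's API.

```python
import pandas as pd


def mark_validity(df, min_angle, max_angle, max_invalid_trials):
    """Flag in-range trials, then flag subjects with too many out-of-range trials."""
    out = df.copy()
    # Trial-level check: angle must lie within [min_angle, max_angle] (assumed inclusive)
    out["valid_angle"] = out["angle"].between(min_angle, max_angle)
    # Subject-level check: count invalid trials per subject
    invalid_per_subject = (~out["valid_angle"]).groupby(out["subject_id"]).sum()
    bad_subjects = invalid_per_subject[invalid_per_subject > max_invalid_trials].index
    out["valid_subject"] = ~out["subject_id"].isin(bad_subjects)
    return out


# Toy data: subject "a" has two out-of-range trials, subject "b" has one
demo = pd.DataFrame({
    "subject_id": ["a", "a", "a", "b", "b", "b"],
    "angle": [10, 50, 1, 20, 30, 2],
})
marked = mark_validity(demo, min_angle=3, max_angle=43, max_invalid_trials=1)
clean = marked[marked["valid_angle"] & marked["valid_subject"]]
```

With a cap of one invalid trial, subject "a" is excluded entirely, while subject "b" keeps only its two in-range trials.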

In [None]:
print("Starting Pipeline...")

# 1. Load & Merge Data
merged_df = load_and_merge_csvs(DATA_PATH)
print(f"[LOAD] Loaded {len(merged_df)} raw trials.")

# 2. Clean Data (drop rows with a missing session_group)
merged_df = merged_df.dropna(subset=['session_group'])
print(f"[CLEAN] {len(merged_df)} trials remain after removing rows with no session_group.")

# 3. Process Angles and Validity
trials = TubeTrials(merged_df)
trials.process_angles()
trials.mark_valid_angles(min_angle=MIN_ANGLE, max_angle=MAX_ANGLE)
trials.mark_valid_subjects(max_invalid_trials=MAX_INVALID_TRIALS)

# 4. Select valid trials, then compute D-values and per-face statistics
clean_trials = trials.select(valid_only=True)
results = clean_trials.calc_d_values()
stats_df = clean_trials.calc_stats()

# 5. Attrition Report
trials.get_validity_stats()

print(f"\n[FINALIZE] {len(results)} valid pairs identified for analysis.")

## 4. Results Summary

Aggregated statistics per Face ID.
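As a point of reference for the `mean`, `std`, and `sem` columns, the sketch below shows how such per-face summaries can be computed with plain pandas. It assumes a long table with `face_id` and `d` columns and ordinary sample statistics; whether `calc_stats()` computes them this way is not confirmed here. The test behind `p_value` is not specified in this notebook, so it is left out.

```python
import numpy as np
import pandas as pd

# Toy D-values for two faces (made-up numbers, for illustration only)
demo = pd.DataFrame({
    "face_id": [1, 1, 1, 2, 2, 2, 2],
    "d": [1.0, 2.0, 3.0, -1.0, 0.0, 1.0, 2.0],
})

# Per-face mean, sample standard deviation, standard error, and count
summary = (
    demo.groupby("face_id")["d"]
    .agg(["mean", "std", "sem", "count"])
    .reset_index()
)

# SEM is the sample standard deviation divided by sqrt(n)
manual_sem = summary["std"] / np.sqrt(summary["count"])
```

The `sem` column from pandas matches `manual_sem`, the same formula the fallback in the next cell uses when an older package version lacks the column.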

In [None]:
print("STATISTICS BY FACE ID")
print("=" * 30)
if not stats_df.empty:
    # Robust column mapping if using old version in memory
    if 'mean' not in stats_df.columns and 'mean_D' in stats_df.columns:
        print("\n⚠️ NOTICE: Using old column names from memory. Please restart runtime for official fix.")
        mapper = {'mean_D': 'mean', 'std_D': 'std'}
        stats_df = stats_df.rename(columns=mapper)
        if 'sem' not in stats_df.columns:
            # Estimate SEM if on old version
            stats_df['sem'] = stats_df['std'] / np.sqrt(stats_df['n_subjects'])
    
    display(stats_df)
    if 'd' in results.columns:
        print("\nGLOBAL D-VALUE MEAN: {:.4f}°".format(results['d'].mean()))
        print("GLOBAL D-VALUE SEM:  {:.4f}°".format(results['d'].sem()))
else:
    print("No valid data to display statistics.")

## 5. Visualizations

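A 2×2 figure for publication and presentation: the overall D-value distribution, the D-value spread per face, the mean D-value per face (±SEM), and the per-face p-values on a log scale.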

In [None]:
if not stats_df.empty:
    # Final check for required columns to avoid crash
    if 'mean' in stats_df.columns:
        fig, axes = plt.subplots(2, 2, figsize=(16, 12))

        # Plot 1: D-value Distribution
        sns.histplot(results['d'], bins=20, kde=True, ax=axes[0, 0], color='skyblue')
        axes[0, 0].axvline(results['d'].mean(), color='red', linestyle='--', label=f"Mean: {results['d'].mean():.2f}")
        axes[0, 0].set_title('Distribution of D-values (All Faces)')
        axes[0, 0].set_ylabel('Count')
        axes[0, 0].legend()

        # Plot 2: Boxplot by Face ID
        order = sorted(results['face_id'].unique())
        sns.boxplot(data=results, x='face_id', y='d', order=order, ax=axes[0, 1], palette='pastel')
        axes[0, 1].set_title('D-value Spread per Face ID')

        # Plot 3: Mean D-value with SEM
        axes[1, 0].bar(stats_df['face_id'].astype(str), stats_df['mean'], yerr=stats_df['sem'].fillna(0), capsize=5, color='teal', alpha=0.7)
        axes[1, 0].set_title('Mean D-value per Face (±SEM)')
        axes[1, 0].set_ylabel('D-value (degrees)')

        # Plot 4: P-values
        colors = ['#2ecc71' if p < 0.05 else '#e74c3c' for p in stats_df['p_value']]
        axes[1, 1].bar(stats_df['face_id'].astype(str), stats_df['p_value'], color=colors, alpha=0.8)
        axes[1, 1].axhline(0.05, color='black', linestyle=':', label='p=0.05 Threshold')
        axes[1, 1].set_title('Statistical Significance (P-value)')
        axes[1, 1].set_yscale('log')
        axes[1, 1].legend()

        plt.tight_layout()
        plt.show()
    else:
        print("Error: 'mean' column still missing. YOU MUST RESTART THE RUNTIME (Runtime > Restart session).")
else:
    print("No valid data to visualize.")

## 6. Official Verification

Verify the pipeline logic using a standardized verification script and dummy data.

In [None]:
!python /content/data-analysis-urop-standardized/verification/run_verification.py

---
**Project Repository**: [dazubanator/data-analysis-urop-standardized](https://github.com/dazubanator/data-analysis-urop-standardized)