# Part 1: Dataset Exploration

**Duration**: ~10 minutes

**Objective**: Understand the Mario dataset organization and experimental protocol

In this notebook, we'll explore:
- BIDS organization and data structure
- Experimental protocol and task details
- Behavioral annotations (actions, events, scenes)
- Replay data and game state information

In [None]:
# Import libraries
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add scripts directory to path
scripts_dir = Path('..') / 'scripts'
sys.path.insert(0, str(scripts_dir))

from utils import (
    get_sourcedata_path,
    get_derivatives_path,
    load_events,
    get_session_runs
)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Environment setup complete!")

## 1. BIDS Organization

The CNeuroMod Mario dataset follows BIDS (Brain Imaging Data Structure) conventions:

**Data hierarchy**:
- `sourcedata/mario/` - Raw fMRI data
- `sourcedata/mario.fmriprep/` - Preprocessed BOLD and anatomical images
- `sourcedata/mario.annotations/` - Behavioral event annotations
- `sourcedata/mario.replays/` - Frame-by-frame game state recordings
- `sourcedata/cneuromod.processed/smriprep/` - Anatomical templates

**Subject/Session structure**: `sub-XX/ses-YYY/func/`

**File naming**: `sub-XX_ses-YYY_task-mario_run-ZZ_*`
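
The naming pattern above is a sequence of `key-value` entities joined by underscores, followed by a suffix. As a minimal sketch, the snippet below splits such a name into its parts. `parse_bids_name` is a hypothetical helper (it is not part of the tutorial's `utils` module), and the `_events.tsv` ending in the example is an assumption about how the annotation files are named.

```python
# Hypothetical helper for illustration; not part of the tutorial's utils module.
def parse_bids_name(filename: str) -> dict:
    """Split a BIDS-style filename into its key-value entities and suffix."""
    stem = filename.split('.', 1)[0]      # drop the extension(s)
    *pairs, suffix = stem.split('_')      # the last chunk is the suffix
    entities = {}
    for pair in pairs:
        key, _, value = pair.partition('-')
        entities[key] = value
    entities['suffix'] = suffix
    return entities

# The example filename is assumed, following the pattern shown above.
info = parse_bids_name('sub-01_ses-010_task-mario_run-01_events.tsv')
print(info)
```

For real projects, a dedicated BIDS library is usually a better choice than hand-rolled parsing; this sketch only illustrates how the entities map onto the filename.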

In [None]:
# Define our subject and session
SUBJECT = 'sub-01'
SESSION = 'ses-010'

# Get paths
sourcedata_path = get_sourcedata_path()
derivatives_path = get_derivatives_path()

print(f"Analyzing: {SUBJECT}, {SESSION}")
print(f"Source data location: {sourcedata_path}")
print(f"Derivatives location: {derivatives_path}")

# Check if sourcedata exists
if not sourcedata_path.exists():
    print("\n⚠️  WARNING: Sourcedata directory not found!")
    print(f"Expected location: {sourcedata_path}")
    print("Please ensure the data is downloaded and in the correct location.")
else:
    print("\n✓ Sourcedata directory found")

In [None]:
# Get list of runs for this session
try:
    runs = get_session_runs(SUBJECT, SESSION, sourcedata_path)
    print(f"Found {len(runs)} runs for {SESSION}:")
    for run in runs:
        print(f"  - {run}")
except FileNotFoundError as e:
    print(f"Error: {e}")
    print("\nUsing default run list...")
    runs = ['run-01', 'run-02', 'run-03', 'run-04', 'run-05']

## 2. Experimental Protocol

**Task**: Play Super Mario Bros. (NES) naturally while undergoing fMRI scanning

**Levels**:
- **Training levels** (6): w1l1, w1l2, w4l1, w4l2, w5l1, w5l2
- **Out-of-distribution (OOD) levels** (2): w2l1, w3l1

**Session structure**:
- ~5 runs per session
- ~5 minutes per run (~300 seconds)
- TR = 1.49 s (multiband fMRI acquisition)
- ~200 volumes per run
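
The timing figures above are linked by simple arithmetic: run duration divided by TR gives the expected number of volumes. The sketch below checks that relationship and maps an event onset to a volume index. It assumes volumes are acquired back to back starting at t = 0, and it ignores slice timing and hemodynamic lag. `onset_to_volume` is an illustrative name, not a function from the tutorial's scripts.

```python
import math

TR = 1.49            # seconds per volume (from the protocol above)
RUN_DURATION = 300   # approximate run length in seconds

# Number of complete volumes that fit in one run
n_volumes = math.floor(RUN_DURATION / TR)

def onset_to_volume(onset_s: float, tr: float = TR) -> int:
    """Zero-based index of the volume being acquired at time `onset_s`."""
    return int(onset_s // tr)

print(f"Expected volumes per run: {n_volumes}")
print(f"An event at 10.0 s falls in volume {onset_to_volume(10.0)}")
```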

**Participants**: Experienced players, free to play naturally (no constraints on strategy)

## 3. Behavioral Annotations

The `mario.annotations` dataset provides rich behavioral event coding:

**Action events** (button presses):
- A, B, LEFT, RIGHT, UP, DOWN
- Onset, duration, trial_type columns

**Game events**:
- Kill/stomp, Kill/kick (enemy defeats)
- Hit/life_lost (player damage)
- Powerup_collected, Coin_collected (rewards)
- Flag_reached (level completion)

**Scene information**:
- Level segmentation with unique scene codes
- Tracks progression through game levels
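
To make the event categories concrete, here is a minimal sketch that reads a small events table and counts rows per category using only the standard library. The rows are made up for illustration, and `scene-example` is a hypothetical placeholder for a scene code. The real annotation files may use different labels and extra columns.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; real annotation files will differ.
EXAMPLE_TSV = (
    "onset\tduration\ttrial_type\n"
    "0.50\t0.30\tRIGHT\n"
    "0.80\t0.10\tA\n"
    "2.10\t0.00\tCoin_collected\n"
    "3.40\t0.00\tKill/stomp\n"
    "5.00\t0.20\tscene-example\n"   # hypothetical scene label
)

BUTTONS = {'A', 'B', 'LEFT', 'RIGHT', 'UP', 'DOWN'}
GAME_EVENTS = {'Kill/stomp', 'Kill/kick', 'Hit/life_lost',
               'Powerup_collected', 'Coin_collected', 'Flag_reached'}

def categorize(trial_type: str) -> str:
    """Assign a trial_type to one of three coarse categories."""
    if trial_type in BUTTONS:
        return 'button'
    if trial_type in GAME_EVENTS:
        return 'game'
    return 'other'

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter='\t'))
category_counts = Counter(categorize(row['trial_type']) for row in rows)
print(dict(category_counts))
```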

In [None]:
# Load events for first run
try:
    events_run1 = load_events(SUBJECT, SESSION, runs[0], sourcedata_path)
    print(f"Loaded events for {runs[0]}")
    print(f"Shape: {events_run1.shape}")
    print(f"\nColumns: {list(events_run1.columns)}")
    print(f"\nFirst few events:")
    display(events_run1.head(10))
except Exception as e:
    print(f"Error loading events: {e}")
    print("Creating dummy data for demonstration...")
    events_run1 = pd.DataFrame({
        'onset': [0.0, 1.5, 2.0, 3.5, 4.0],
        'duration': [0.5, 0.3, 0.4, 0.3, 0.5],
        'trial_type': ['LEFT', 'A', 'RIGHT', 'B', 'LEFT']
    })
    display(events_run1)

In [None]:
# Event type frequency analysis
event_counts = events_run1['trial_type'].value_counts()

print(f"Event type frequencies for {runs[0]}:")
print(event_counts.head(15))

# Plot event frequencies
fig, ax = plt.subplots(figsize=(14, 6))
event_counts.head(20).plot(kind='bar', ax=ax, color='steelblue')
ax.set_xlabel('Event Type', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title(f'Event Frequencies - {SUBJECT} {SESSION} {runs[0]}', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Categorize events
button_events = ['A', 'B', 'LEFT', 'RIGHT', 'UP', 'DOWN']
# Include Flag_reached so level completions are not counted as "other" events
game_events = ['Kill/stomp', 'Kill/kick', 'Hit/life_lost', 'Powerup_collected', 'Coin_collected', 'Flag_reached']

# Count by category
buttons = events_run1[events_run1['trial_type'].isin(button_events)]
game_evts = events_run1[events_run1['trial_type'].isin(game_events)]

print(f"\nEvent breakdown:")
print(f"  Button presses: {len(buttons)}")
print(f"  Game events: {len(game_evts)}")
print(f"  Other events: {len(events_run1) - len(buttons) - len(game_evts)}")
print(f"  Total events: {len(events_run1)}")

## 4. Event Timeline Visualization

Let's visualize when different types of events occur during gameplay.

In [None]:
# Create timeline plot
fig, axes = plt.subplots(3, 1, figsize=(16, 10), sharex=True)

# Axis 1: Button presses
ax1 = axes[0]
for idx, button in enumerate(button_events):
    button_data = events_run1[events_run1['trial_type'] == button]
    if len(button_data) > 0:
        ax1.scatter(button_data['onset'], [idx] * len(button_data), 
                   label=button, alpha=0.6, s=20)

ax1.set_ylabel('Button Type', fontsize=12)
ax1.set_yticks(range(len(button_events)))
ax1.set_yticklabels(button_events)
ax1.set_title('Button Press Timeline', fontsize=14, fontweight='bold')
ax1.grid(alpha=0.3)
ax1.legend(loc='upper right', ncol=6)

# Axis 2: Game events
ax2 = axes[1]
available_game_events = [e for e in game_events if e in events_run1['trial_type'].values]
for idx, event_type in enumerate(available_game_events):
    event_data = events_run1[events_run1['trial_type'] == event_type]
    if len(event_data) > 0:
        ax2.scatter(event_data['onset'], [idx] * len(event_data),
                   label=event_type, alpha=0.7, s=50, marker='D')

ax2.set_ylabel('Game Event', fontsize=12)
if available_game_events:
    ax2.set_yticks(range(len(available_game_events)))
    ax2.set_yticklabels(available_game_events)
ax2.set_title('Game Events Timeline', fontsize=14, fontweight='bold')
ax2.grid(alpha=0.3)
if available_game_events:
    ax2.legend(loc='upper right')

# Axis 3: Event density over time
ax3 = axes[2]
bin_size = 10  # seconds
max_time = events_run1['onset'].max()
bins = np.arange(0, max_time + bin_size, bin_size)

button_hist, _ = np.histogram(buttons['onset'], bins=bins)
game_hist, _ = np.histogram(game_evts['onset'], bins=bins)

# align='edge' keeps each pair of bars inside its own time bin
ax3.bar(bins[:-1], button_hist, width=bin_size*0.4, align='edge', label='Button presses', alpha=0.7)
ax3.bar(bins[:-1] + bin_size*0.4, game_hist, width=bin_size*0.4, align='edge', label='Game events', alpha=0.7)
ax3.set_xlabel('Time (seconds)', fontsize=12)
ax3.set_ylabel('Event Count', fontsize=12)
ax3.set_title('Event Density Over Time', fontsize=14, fontweight='bold')
ax3.grid(axis='y', alpha=0.3)
ax3.legend()

plt.tight_layout()
plt.show()

## 5. Session-Level Event Statistics

Let's aggregate events across all runs in the session.

In [None]:
# Load all runs and aggregate
all_events = []
run_durations = []

for run in runs:
    try:
        events = load_events(SUBJECT, SESSION, run, sourcedata_path)
        all_events.append(events)
        # Run duration = latest event end time (robust to unsorted or overlapping rows)
        run_durations.append((events['onset'] + events['duration']).max())
        print(f"✓ Loaded {run}: {len(events)} events, {run_durations[-1]:.1f}s duration")
    except Exception as e:
        print(f"✗ Could not load {run}: {e}")

if all_events:
    # Concatenate all events
    session_events = pd.concat(all_events, ignore_index=True)
    
    print(f"\n{'='*60}")
    print(f"Session summary: {SUBJECT} {SESSION}")
    print(f"{'='*60}")
    print(f"Total runs: {len(all_events)}")
    print(f"Total duration: {sum(run_durations):.1f} seconds ({sum(run_durations)/60:.1f} minutes)")
    print(f"Total events: {len(session_events)}")
    print(f"Average events per run: {len(session_events) / len(all_events):.1f}")
else:
    print("No events loaded successfully.")

In [None]:
# Session-level event type breakdown
if all_events:
    session_event_counts = session_events['trial_type'].value_counts()
    
    # Top 15 most frequent events
    print("\nTop 15 most frequent event types across session:")
    print(session_event_counts.head(15))
    
    # Visualize
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Top events
    session_event_counts.head(15).plot(kind='barh', ax=ax1, color='steelblue')
    ax1.set_xlabel('Count', fontsize=12)
    ax1.set_ylabel('Event Type', fontsize=12)
    ax1.set_title(f'Top 15 Events - Session Level ({SESSION})', fontsize=14, fontweight='bold')
    ax1.grid(axis='x', alpha=0.3)
    
    # Category breakdown
    button_count = len(session_events[session_events['trial_type'].isin(button_events)])
    game_count = len(session_events[session_events['trial_type'].isin(game_events)])
    other_count = len(session_events) - button_count - game_count
    
    categories = ['Button\nPresses', 'Game\nEvents', 'Other\nEvents']
    counts = [button_count, game_count, other_count]
    colors = ['#3498db', '#e74c3c', '#95a5a6']
    
    ax2.bar(categories, counts, color=colors, alpha=0.8)
    ax2.set_ylabel('Count', fontsize=12)
    ax2.set_title('Event Category Breakdown', fontsize=14, fontweight='bold')
    ax2.grid(axis='y', alpha=0.3)
    
    # Add value labels
    for i, (cat, count) in enumerate(zip(categories, counts)):
        ax2.text(i, count, f'{count}\n({count/len(session_events)*100:.1f}%)',
                ha='center', va='bottom', fontsize=11, fontweight='bold')
    
    plt.tight_layout()
    plt.show()

## 6. Game Replay Data

The `mario.replays` dataset contains frame-by-frame game state recordings saved from the BizHawk emulator (`.bk2` files).

**Replay data includes**:
- RAM variables: player position (x, y), lives, score, time, power-up state
- Button states at 60 Hz frame rate
- Can extract individual frames for visualization or RL training

**Usage**: These replays are used to:
1. Extract game frames for RL agent training (Notebook 04)
2. Validate behavioral annotations
3. Visualize gameplay moments
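
Because replay frames tick at 60 Hz while fMRI volumes arrive every 1.49 s, relating the two requires a small unit conversion. The sketch below assumes the replay starts at scan onset and that the emulator runs at exactly 60 frames per second. Both are simplifying assumptions, and the function names are illustrative rather than taken from the tutorial's scripts.

```python
import math

FRAME_RATE_HZ = 60.0   # replay frame rate stated above
TR = 1.49              # seconds per fMRI volume (from the protocol section)

def frame_to_seconds(frame_idx: int, frame_rate: float = FRAME_RATE_HZ) -> float:
    """Time in seconds at which a given replay frame occurs."""
    return frame_idx / frame_rate

def frames_in_volume(volume_idx: int, tr: float = TR,
                     frame_rate: float = FRAME_RATE_HZ) -> range:
    """Replay frame indices whose timestamps fall within one volume's window."""
    start = math.ceil(volume_idx * tr * frame_rate)
    stop = math.ceil((volume_idx + 1) * tr * frame_rate)
    return range(start, stop)

frames_per_tr = TR * FRAME_RATE_HZ
print(f"Frames per TR: {frames_per_tr:.1f}")
print(f"Frame 600 occurs at {frame_to_seconds(600):.1f} s")
print(f"Volume 0 covers frames {frames_in_volume(0)[0]} to {frames_in_volume(0)[-1]}")
```

Since the number of frames per TR is not an integer, consecutive volumes cover slightly different numbers of frames. Any per-volume summary of game state should account for this.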

In [None]:
# Check for replay files
replay_dir = sourcedata_path / 'mario.replays' / SUBJECT / SESSION / 'beh'

print(f"Replay directory: {replay_dir}")

if replay_dir.exists():
    replay_files = sorted(replay_dir.glob('*.bk2'))
    print(f"\nFound {len(replay_files)} replay files:")
    for rf in replay_files:
        file_size_mb = rf.stat().st_size / (1024 * 1024)
        print(f"  - {rf.name} ({file_size_mb:.1f} MB)")
    
    if len(replay_files) > 0:
        print("\n✓ Replay files available for frame extraction and RL training")
    else:
        print("\n⚠️  No replay files found")
else:
    print(f"\n⚠️  Replay directory not found: {replay_dir}")
    print("Replay data is optional for this tutorial.")

## Summary

In this notebook, we explored:

✅ **BIDS organization**: Understood the data structure and file naming conventions

✅ **Experimental protocol**: Session structure, timing, and task design

✅ **Behavioral annotations**: Rich event coding including button presses and game events

✅ **Event statistics**: Frequency distributions and temporal patterns

✅ **Replay data**: Frame-by-frame recordings for detailed analysis

### Key findings:
- Session duration: ~25 minutes of gameplay across 5 runs
- Hundreds of behavioral events per run
- Mix of button presses (motor actions) and game events (rewards/punishments)
- Rich temporal structure for fMRI analysis

### Next steps:
In **Notebook 02**, we'll use these behavioral annotations to build GLM models and identify brain regions associated with different actions and game events.