# 05 - Assembly Validation (Surface)

Validate that the Brain-Score surface assemblies contain correct values by tracing
the full preprocessing chain from raw NSD fsaverage MGH betas.

**Sections:**
1. Vertex-level spot check: reproduce global z-score from NB02, compare with assembly
2. End-to-end raw data bypass: reproduce V4 from raw MGH for subj01
3. Standalone ridge regression: compare sklearn RidgeCV with benchmark output

**Environment:** `conda activate vision-2026`

**Requires:** External drive `/Volumes/Hagibis/nsd` mounted

In [1]:
import sys
sys.path.insert(0, '/Users/kartik/Brain-Score 2026/vision')
sys.path.insert(0, '/Users/kartik/Brain-Score 2026/core')

import numpy as np
import pandas as pd
import xarray as xr
import nibabel as nib
import scipy.io
from pathlib import Path
import time

NSD_ROOT = Path('/Volumes/Hagibis/nsd')
FSAVG_LABELS = NSD_ROOT / 'fsaverage_labels'
ASSEMBLY_DIR = NSD_ROOT / 'assemblies'
BRAINSCORE_DIR = NSD_ROOT / 'brainscore_surface'

REGIONS = ['V1', 'V2', 'V4', 'IT']
SUBJECT_LIST = [1, 2, 3, 4, 5, 6, 7, 8]
SESSIONS_PER_SUBJECT = {1: 40, 2: 40, 3: 32, 4: 30, 5: 40, 6: 32, 7: 40, 8: 30}
TRIALS_PER_SESSION = 750
N_FSAVG_VERTICES = 163842
VARIANT = '_8subj'

# ROI definitions (same as NB02)
REGION_TO_KASTNER = {'V1': [1, 2], 'V2': [3, 4], 'V4': [7]}
STREAMS_VENTRAL_LABEL = 5

# Train/test split
split_df = pd.read_csv(ASSEMBLY_DIR / f'train_test_split{VARIANT}.csv')
train_ids = set(split_df.loc[split_df['split'] == 'train', 'stimulus_id'])
test_ids = set(split_df.loc[split_df['split'] == 'test', 'stimulus_id'])

assert NSD_ROOT.exists(), 'External drive not mounted'
print(f'NSD root: {NSD_ROOT}')
print(f'Train: {len(train_ids)}, Test: {len(test_ids)}')

NSD root: /Volumes/Hagibis/nsd
Train: 412, Test: 103


## Section 1: Vertex-Level Spot Check

Validate the NB03 packaging step by reproducing the global z-score from NB02 data
and comparing with the Brain-Score surface assembly values.

**Preprocessing chain being validated:**
```
NB02 assembly (session z-scored, rep-averaged, 1000 images)
  -> filter to 515 complete images (min_reps >= 3)
  -> global z-score (per vertex within each subject and region, stats from 515 images)
  -> NaN fill (0.0)
  -> split into train/test
  -> Brain-Score assembly
```
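
The global z-score and NaN-fill steps in the chain above can be illustrated on toy data. This is a minimal sketch, assuming the same per-vertex statistics that the validation cells compute; the helper name `global_zscore` and the toy array are illustrative and are not taken from NB03.

```python
import numpy as np

def global_zscore(data: np.ndarray) -> np.ndarray:
    """Z-score each column (vertex) across rows (images), then fill NaN with 0."""
    mean = np.nanmean(data, axis=0, keepdims=True)
    std = np.nanstd(data, axis=0, keepdims=True)
    std[std == 0] = 1.0  # constant vertices become 0 instead of dividing by zero
    return np.nan_to_num((data - mean) / std, nan=0.0)

# Toy data: 4 images x 3 vertices.
# Vertex 1 is constant; vertex 2 has one missing value.
toy = np.array([
    [1.0, 5.0, 2.0],
    [2.0, 5.0, np.nan],
    [3.0, 5.0, 4.0],
    [4.0, 5.0, 6.0],
])
z = global_zscore(toy)
print(z.round(3))
```

Because the statistics are taken along the image axis, each vertex is normalized independently, so looping over subjects (as the cells below do) only groups columns and does not change the result.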

In [2]:
# Load Brain-Score packaged surface assemblies (NB03 output)
bs_train = xr.open_dataarray(str(BRAINSCORE_DIR / f'Allen2022_fmri_surface_train{VARIANT}.nc'))
bs_test = xr.open_dataarray(str(BRAINSCORE_DIR / f'Allen2022_fmri_surface_test{VARIANT}.nc'))
bs_train.load()
bs_test.load()

print(f'Brain-Score train: {bs_train.shape}')
print(f'Brain-Score test:  {bs_test.shape}')
print(f'Train coords: {list(bs_train.coords)}')
print(f'Test coords:  {list(bs_test.coords)}')

Brain-Score train: (412, 221168, 1)
Brain-Score test:  (309, 221168, 1)
Train coords: ['stimulus_id', 'nsd_id', 'neuroid_id', 'subject', 'hemisphere', 'vertex_index', 'region', 'nc_testset', 'time_bin_start', 'time_bin_end']
Test coords:  ['stimulus_id', 'nsd_id', 'repetition', 'neuroid_id', 'subject', 'hemisphere', 'vertex_index', 'region', 'nc_testset', 'time_bin_start', 'time_bin_end']


In [3]:
# Load NB02 assembly for V4 and reproduce global z-score
nb02_v4 = xr.open_dataarray(str(ASSEMBLY_DIR / f'Allen2022_surface.V4{VARIANT}.nc'))
nb02_v4.load()
print(f'NB02 V4: {nb02_v4.shape}')  # (1000, 7312)

# Filter to 515 complete images (min_reps >= 3), same as NB03
min_reps = nb02_v4.coords['min_reps_across_subjects'].values
complete_mask = min_reps >= 3
nb02_v4_515 = nb02_v4.isel(presentation=complete_mask)
print(f'NB02 V4 (515 complete): {nb02_v4_515.shape}')

# Build train/test masks over the 515 images
nb02_stimulus_ids = nb02_v4_515.coords['stimulus_id'].values
train_mask_515 = np.array([sid in train_ids for sid in nb02_stimulus_ids])
test_mask_515 = np.array([sid in test_ids for sid in nb02_stimulus_ids])
print(f'Train: {train_mask_515.sum()}, Test: {test_mask_515.sum()}')
assert train_mask_515.sum() == 412
assert test_mask_515.sum() == 103

NB02 V4: (1000, 7312)
NB02 V4 (515 complete): (515, 7312)
Train: 412, Test: 103


In [4]:
# Reproduce global z-score for V4 (matching NB03 cell-7)
reproduced_data = nb02_v4_515.values.copy()  # (515, 7312)
subjects_v4 = nb02_v4_515.coords['subject'].values

for subj_label in [f'subj{s:02d}' for s in SUBJECT_LIST]:
    subj_mask = subjects_v4 == subj_label
    subj_data = reproduced_data[:, subj_mask]
    mean = np.nanmean(subj_data, axis=0, keepdims=True)
    std = np.nanstd(subj_data, axis=0, keepdims=True)
    std[std == 0] = 1.0
    reproduced_data[:, subj_mask] = (subj_data - mean) / std

# Fill NaN with 0.0 (same as NB03)
n_nan_filled = np.isnan(reproduced_data).sum()
reproduced_data = np.nan_to_num(reproduced_data, nan=0.0)
print(f'NaN filled: {n_nan_filled}')

reproduced_train = reproduced_data[train_mask_515]  # (412, 7312)
reproduced_test = reproduced_data[test_mask_515]    # (103, 7312)
print(f'Reproduced train: {reproduced_train.shape}')
print(f'Reproduced test:  {reproduced_test.shape}')

NaN filled: 0
Reproduced train: (412, 7312)
Reproduced test:  (103, 7312)


In [5]:
# Compare reproduced values with Brain-Score train assembly (V4 neuroids only)
v4_neuroid_mask = bs_train.coords['region'].values == 'V4'
bs_train_v4 = bs_train.values[:, v4_neuroid_mask, 0]  # (412, 7312)

max_diff = np.max(np.abs(reproduced_train - bs_train_v4))
mean_diff = np.mean(np.abs(reproduced_train - bs_train_v4))
r = np.corrcoef(reproduced_train.ravel(), bs_train_v4.ravel())[0, 1]

print('=== NB02 -> Filter -> Global Z-Score -> Brain-Score Train (V4) ===')
print(f'Max absolute difference:  {max_diff:.2e}')
print(f'Mean absolute difference: {mean_diff:.2e}')
print(f'Correlation:              {r:.10f}')
print(f'Match: {"PASS" if max_diff < 1e-5 else "FAIL"}')
assert max_diff < 1e-5, f'Train V4 mismatch: max_diff={max_diff}'

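# Compare test: average the 3 repetitions stored per image in the Brain-Score test assembly.
# The reshape assumes the 3 repetition rows of each image are contiguous and that images
# appear in the same order as in reproduced_test.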
bs_test_v4_all = bs_test.values[:, v4_neuroid_mask, 0]  # (309, 7312)
n_test = len(test_ids)
bs_test_v4_avg = bs_test_v4_all.reshape(n_test, 3, -1).mean(axis=1)  # (103, 7312)

max_diff_test = np.max(np.abs(reproduced_test - bs_test_v4_avg))
print(f'\n=== Test (averaged reps, V4) ===')
print(f'Max absolute difference: {max_diff_test:.2e}')
print(f'Match: {"PASS" if max_diff_test < 1e-5 else "FAIL"}')
assert max_diff_test < 1e-5, f'Test V4 mismatch: max_diff={max_diff_test}'

=== NB02 -> Filter -> Global Z-Score -> Brain-Score Train (V4) ===
Max absolute difference:  0.00e+00
Mean absolute difference: 0.00e+00
Correlation:              1.0000000000
Match: PASS

=== Test (averaged reps, V4) ===
Max absolute difference: 9.54e-07
Match: PASS
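
The test comparison in cell 5 averages repetitions with `reshape(n_test, 3, -1)`, which relies on the three repetition rows of each image being adjacent. As a hedged alternative, the sketch below groups rows by stimulus id instead, so it does not depend on row order. The helper name `average_reps` and the toy ids are hypothetical.

```python
import numpy as np

def average_reps(values: np.ndarray, stimulus_ids: np.ndarray):
    """Average rows that share a stimulus id.

    Returns (sorted unique ids, averaged rows in that same sorted order).
    """
    unique_ids, inverse = np.unique(stimulus_ids, return_inverse=True)
    sums = np.zeros((len(unique_ids), values.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, values)  # accumulate each row into its stimulus slot
    counts = np.bincount(inverse, minlength=len(unique_ids))
    return unique_ids, sums / counts[:, None]

# Toy data: 2 images x 3 repetitions, deliberately interleaved
ids = np.array(['b', 'a', 'b', 'a', 'a', 'b'])
vals = np.arange(12, dtype=np.float64).reshape(6, 2)
uniq, avg = average_reps(vals, ids)
print(uniq, avg.round(3))
```

The averaged rows come back in sorted-id order, so they would still need to be aligned with the row order of the array they are compared against.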


In [6]:
# Spot check: trace 5 specific (subject, image, vertex) values
print('=== Vertex-Level Spot Check (V4) ===')
print(f'{"Subject":<10} {"Stimulus":<14} {"Vertex":<8} '
      f'{"Reproduced":>12} {"Assembly":>12} {"Diff":>12} {"Status"}')
print('-' * 82)

v4_subjects = bs_train.coords['subject'].values[v4_neuroid_mask]
train_stim_ids = bs_train.coords['stimulus_id'].values

spot_checks = [
    ('subj01', 0, 0),      # first subject, first train image, first vertex
    ('subj01', 0, 100),    # first subject, first train image, vertex 100
    ('subj04', 200, 50),   # middle subject, middle image, middle vertex
    ('subj08', 411, -1),   # last subject, last train image, last vertex
    ('subj05', 100, 200),  # arbitrary
]

all_pass = True
for subj_label, img_idx, vx_idx in spot_checks:
    subj_v4_mask = v4_subjects == subj_label
    subj_start = np.where(subj_v4_mask)[0][0]
    actual_vx_idx = subj_start + (vx_idx % subj_v4_mask.sum())

    reproduced_val = reproduced_train[img_idx, actual_vx_idx]
    assembly_val = bs_train_v4[img_idx, actual_vx_idx]
    diff = abs(reproduced_val - assembly_val)
    status = 'PASS' if diff < 1e-6 else 'FAIL'
    if status == 'FAIL':
        all_pass = False

    print(f'{subj_label:<10} {train_stim_ids[img_idx]:<14} {actual_vx_idx:<8} '
          f'{reproduced_val:>12.6f} {assembly_val:>12.6f} {diff:>12.2e} {status}')

print(f'\nAll spot checks: {"PASS" if all_pass else "FAIL"}')
assert all_pass, 'Spot check failed'

=== Vertex-Level Spot Check (V4) ===
Subject    Stimulus       Vertex     Reproduced     Assembly         Diff Status
----------------------------------------------------------------------------------
subj01     nsd_03049      0            0.411551     0.411551     0.00e+00 PASS
subj01     nsd_03049      100         -0.756218    -0.756218     0.00e+00 PASS
subj04     nsd_37224      2792         2.014114     2.014114     0.00e+00 PASS
subj08     nsd_72719      7311        -1.607054    -1.607054     0.00e+00 PASS
subj05     nsd_20738      3856         0.167083     0.167083     0.00e+00 PASS

All spot checks: PASS


## Section 2: End-to-End Raw Data Bypass

Reproduce V4 betas for subj01 entirely from raw MGH surface files, applying the full
preprocessing chain:
```
Raw MGH betas (float32, per hemisphere, 163842 vertices x 750 trials)
  -> Extract V4 vertices (LH Kastner hV4 + RH Kastner hV4)
  -> Z-score within session (750 trials per vertex)
  -> Collect shared-image trials, average 3 repetitions
  -> Global z-score (stats from all 515 averaged images)
  -> Compare with Brain-Score assembly
```
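
The session z-score and repetition-averaging steps above can be sketched on toy data. This is an illustration under stated assumptions rather than the NB02 implementation: the helper names `locate_trial` and `session_zscore`, the toy session length of 4 trials, and the random betas are all hypothetical.

```python
import numpy as np

TRIALS_PER_SESSION = 4  # toy value; the notebook uses 750

def locate_trial(trial_idx: int, trials_per_session: int = TRIALS_PER_SESSION):
    """Map a 0-based global trial index to (1-based session, 0-based trial in session)."""
    session, trial_in_session = divmod(trial_idx, trials_per_session)
    return session + 1, trial_in_session

def session_zscore(betas: np.ndarray) -> np.ndarray:
    """Z-score each vertex (column) across the trials (rows) of one session."""
    mean = betas.mean(axis=0, keepdims=True)
    std = betas.std(axis=0, keepdims=True)
    std[std == 0] = 1.0
    return (betas - mean) / std

rng = np.random.default_rng(0)
# Two toy sessions of 4 trials x 2 vertices with different offsets and scales
sessions = {1: rng.normal(10, 2, (4, 2)), 2: rng.normal(-5, 1, (4, 2))}
zscored = {s: session_zscore(b) for s, b in sessions.items()}

# One hypothetical image shown at global trials 1, 5 and 6 (three repetitions)
rep_trials = [1, 5, 6]
reps = np.stack([zscored[s][t] for s, t in map(locate_trial, rep_trials)])
image_response = reps.mean(axis=0)  # averaged response per vertex, shape (2,)
print(image_response)
```

Z-scoring within each session first removes session-specific offsets and scale, so repetitions drawn from different sessions are on a comparable scale before they are averaged.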

In [7]:
# Load fsaverage ROI labels for V4
lh_kastner = nib.load(str(FSAVG_LABELS / 'lh.Kastner2015.mgz')).get_fdata().flatten()
rh_kastner = nib.load(str(FSAVG_LABELS / 'rh.Kastner2015.mgz')).get_fdata().flatten()

v4_lh_mask = lh_kastner == 7  # hV4
v4_rh_mask = rh_kastner == 7
n_v4_lh = v4_lh_mask.sum()
n_v4_rh = v4_rh_mask.sum()
n_v4_total = n_v4_lh + n_v4_rh
print(f'V4 vertices: LH={n_v4_lh}, RH={n_v4_rh}, total={n_v4_total}')

# Trial mapping utilities
expdesign = scipy.io.loadmat(str(NSD_ROOT / 'metadata' / 'nsd_expdesign.mat'))
masterordering = expdesign['masterordering'].flatten()
subjectim = expdesign['subjectim']
sharedix = expdesign['sharedix'].flatten()

# Build the 515 complete image list (same as NB02/03)
all_nsd_ids = sorted(split_df['nsd_id'].values)
nsd_id_to_idx = {nsd_id: idx for idx, nsd_id in enumerate(all_nsd_ids)}
N_IMAGES = len(all_nsd_ids)
print(f'Total images: {N_IMAGES}')

V4 vertices: LH=410, RH=504, total=914
Total images: 515


In [8]:
def get_shared_trial_info(subj_idx: int, target_nsd_ids: set) -> pd.DataFrame:
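    """Map each presentation of the target shared images to its session and trial.

    subj_idx is 0-based. In the returned frame, nsd_id is 0-indexed, rep counts
    presentations of an image from 0, session is 1-based, trial_in_session is 0-based.
    """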
    subj = subj_idx + 1
    n_sessions = SESSIONS_PER_SUBJECT[subj]
    n_total_trials = n_sessions * TRIALS_PER_SESSION
    subj_nsdids = subjectim[subj_idx]
    nsdid_to_imgidx = {int(nsd_id): img_idx + 1
                       for img_idx, nsd_id in enumerate(subj_nsdids)}

    shared_imgidxs = set()
    for nsd_id in sharedix:
        if int(nsd_id) in nsdid_to_imgidx:
            shared_imgidxs.add(nsdid_to_imgidx[int(nsd_id)])

    records = []
    rep_counter = {}
    for trial_idx in range(n_total_trials):
        img_idx = masterordering[trial_idx]
        if img_idx in shared_imgidxs:
            nsd_id = int(subj_nsdids[img_idx - 1] - 1)  # 0-indexed
            if nsd_id not in target_nsd_ids:
                rep_counter[img_idx] = rep_counter.get(img_idx, 0) + 1
                continue
            rep = rep_counter.get(img_idx, 0)
            rep_counter[img_idx] = rep + 1
            session = trial_idx // TRIALS_PER_SESSION + 1
            trial_in_session = trial_idx % TRIALS_PER_SESSION
            records.append({
                'nsd_id': nsd_id, 'rep': rep,
                'session': session, 'trial_in_session': trial_in_session,
            })

    return pd.DataFrame(records)

In [9]:
# Reproduce V4 betas for subj01 from raw MGH files
SUBJ = 1
target_nsd_ids = set(all_nsd_ids)

trial_info = get_shared_trial_info(SUBJ - 1, target_nsd_ids)
print(f'subj{SUBJ:02d} V4: {n_v4_total} vertices')
print(f'Trials for {N_IMAGES} shared images: {len(trial_info)} (expected {N_IMAGES * 3})')
assert len(trial_info) == N_IMAGES * 3

# Extract session-z-scored betas, average across 3 reps
per_rep = np.zeros((N_IMAGES, 3, n_v4_total), dtype=np.float32)

t0 = time.time()
for session in range(1, SESSIONS_PER_SUBJECT[SUBJ] + 1):
    session_trials = trial_info[trial_info['session'] == session]
    if len(session_trials) == 0:
        continue

    # Load LH and RH surface betas
    lh_path = NSD_ROOT / f'subj{SUBJ:02d}' / 'betas' / f'lh.betas_session{session:02d}.mgh'
    rh_path = NSD_ROOT / f'subj{SUBJ:02d}' / 'betas' / f'rh.betas_session{session:02d}.mgh'
    lh_betas = nib.load(str(lh_path)).get_fdata().squeeze()  # (163842, 750)
    rh_betas = nib.load(str(rh_path)).get_fdata().squeeze()  # (163842, 750)

    # Extract V4 vertices: concat LH then RH -> (750, n_v4_total)
    roi_betas = np.concatenate([
        lh_betas[v4_lh_mask].T,  # (750, n_v4_lh)
        rh_betas[v4_rh_mask].T,  # (750, n_v4_rh)
    ], axis=1)

    # Z-score within session per vertex
    mean = roi_betas.mean(axis=0, keepdims=True)
    std = roi_betas.std(axis=0, keepdims=True)
    std[std == 0] = 1.0
    roi_betas = (roi_betas - mean) / std

    for _, row in session_trials.iterrows():
        img_idx = nsd_id_to_idx[row['nsd_id']]
        per_rep[img_idx, row['rep']] = roi_betas[row['trial_in_session']]

    del lh_betas, rh_betas
    if session % 10 == 0:
        print(f'  Session {session}/{SESSIONS_PER_SUBJECT[SUBJ]} ({time.time()-t0:.0f}s)')

# Average across repetitions
bypass_averaged = per_rep.mean(axis=1)  # (515, n_v4_total)

# Global z-score (stats from all 515 averaged images)
gz_mean = np.nanmean(bypass_averaged, axis=0)
gz_std = np.nanstd(bypass_averaged, axis=0)
gz_std[gz_std == 0] = 1.0
bypass_final = (bypass_averaged - gz_mean) / gz_std
bypass_final = np.nan_to_num(bypass_final, nan=0.0)

elapsed = time.time() - t0
print(f'\nDone in {elapsed:.0f}s')
print(f'Bypass output: {bypass_final.shape}')

subj01 V4: 914 vertices
Trials for 515 shared images: 1545 (expected 1545)
  Session 10/40 (39s)
  Session 20/40 (76s)
  Session 30/40 (113s)

Done in 113s
Bypass output: (515, 914)


In [10]:
# Compare bypass result with Brain-Score assembly for subj01 V4

# Extract subj01 V4 from Brain-Score train assembly
subj01_v4_mask = (bs_train.coords['subject'].values == 'subj01') & \
                 (bs_train.coords['region'].values == 'V4')
bs_train_subj01_v4 = bs_train.values[:, subj01_v4_mask, 0]  # (412, n_v4_total)

# Extract subj01 V4 from Brain-Score test assembly (average 3 reps)
subj01_v4_mask_test = (bs_test.coords['subject'].values == 'subj01') & \
                      (bs_test.coords['region'].values == 'V4')
bs_test_subj01_v4 = bs_test.values[:, subj01_v4_mask_test, 0]  # (309, n_v4_total)
bs_test_subj01_v4_avg = bs_test_subj01_v4.reshape(len(test_ids), 3, -1).mean(axis=1)

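# Split bypass into train/test. The masks were built in NB02 presentation order, while
# bypass_final is ordered by sorted nsd_id, so this assumes the two orders coincide.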
bypass_train = bypass_final[train_mask_515]  # (412, n_v4_total)
bypass_test = bypass_final[test_mask_515]    # (103, n_v4_total)

# Compare train
train_max_diff = np.max(np.abs(bypass_train - bs_train_subj01_v4))
train_mean_diff = np.mean(np.abs(bypass_train - bs_train_subj01_v4))

# Compare test (averaged reps)
test_max_diff = np.max(np.abs(bypass_test - bs_test_subj01_v4_avg))
test_mean_diff = np.mean(np.abs(bypass_test - bs_test_subj01_v4_avg))

print('=== End-to-End Bypass: Raw MGH -> Brain-Score Assembly (subj01, V4) ===')
print(f'Train: max_diff={train_max_diff:.2e}, mean_diff={train_mean_diff:.2e} '
      f'[{"PASS" if train_max_diff < 1e-5 else "FAIL"}]')
print(f'Test:  max_diff={test_max_diff:.2e}, mean_diff={test_mean_diff:.2e} '
      f'[{"PASS" if test_max_diff < 1e-5 else "FAIL"}]')

assert train_max_diff < 1e-5, f'TRAIN MISMATCH: {train_max_diff}'
assert test_max_diff < 1e-5, f'TEST MISMATCH: {test_max_diff}'

print(f'\nBypass shape:   {bypass_train.shape} train, {bypass_test.shape} test')
print(f'Assembly shape: {bs_train_subj01_v4.shape} train, {bs_test_subj01_v4_avg.shape} test')
print(f'\nFull preprocessing chain validated for subj01 V4 (surface).')

=== End-to-End Bypass: Raw MGH -> Brain-Score Assembly (subj01, V4) ===
Train: max_diff=1.67e-06, mean_diff=1.14e-07 [PASS]
Test:  max_diff=2.38e-06, mean_diff=1.23e-07 [PASS]

Bypass shape:   (412, 914) train, (103, 914) test
Assembly shape: (412, 914) train, (103, 914) test

Full preprocessing chain validated for subj01 V4 (surface).


In [11]:
# Cross-check: verify vertex ordering matches between bypass and assembly
# The assembly stores vertex_index coords -- confirm our extraction order matches

assembly_vertex_indices = bs_train.coords['vertex_index'].values[subj01_v4_mask]
assembly_hemispheres = bs_train.coords['hemisphere'].values[subj01_v4_mask]

# Our bypass extraction order: LH vertices (sorted by index), then RH vertices (sorted by index)
lh_indices = np.where(v4_lh_mask)[0]
rh_indices = np.where(v4_rh_mask)[0]
expected_indices = np.concatenate([lh_indices, rh_indices])
expected_hemispheres = np.array(['lh'] * len(lh_indices) + ['rh'] * len(rh_indices))

indices_match = np.array_equal(assembly_vertex_indices, expected_indices)
hemi_match = np.array_equal(assembly_hemispheres, expected_hemispheres)

print(f'Vertex index ordering match: {"PASS" if indices_match else "FAIL"}')
print(f'Hemisphere ordering match:   {"PASS" if hemi_match else "FAIL"}')
print(f'  Assembly: {len(assembly_vertex_indices)} vertices '
      f'({(assembly_hemispheres == "lh").sum()} LH + {(assembly_hemispheres == "rh").sum()} RH)')
print(f'  Bypass:   {len(expected_indices)} vertices '
      f'({len(lh_indices)} LH + {len(rh_indices)} RH)')

assert indices_match, 'Vertex index ordering mismatch'
assert hemi_match, 'Hemisphere ordering mismatch'

Vertex index ordering match: PASS
Hemisphere ordering match:   PASS
  Assembly: 914 vertices (410 LH + 504 RH)
  Bypass:   914 vertices (410 LH + 504 RH)


## Section 3: Standalone Ridge Regression

Run sklearn RidgeCV on Brain-Score assembly data for V4 and compare per-subject
correlations with the benchmark output from NB04.

This validates the benchmark metric code: if our standalone ridge gives raw
correlations similar to the benchmark's, the `ridge_split` metric is working
correctly on our surface assemblies.
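
The per-subject score compared in this section is a median over per-vertex Pearson correlations between predicted and held-out responses. A minimal NumPy sketch of that summary step follows, assuming toy random data in place of model predictions; the helper name `columnwise_pearson` is illustrative and is not part of Brain-Score.

```python
import numpy as np

def columnwise_pearson(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Pearson r between matching columns of two (n_images, n_vertices) arrays."""
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return num / den

rng = np.random.default_rng(0)
target = rng.normal(size=(103, 5))                # toy held-out responses
pred = target + rng.normal(scale=1.0, size=(103, 5))  # toy noisy predictions

r = columnwise_pearson(pred, target)
subject_score = np.median(r)  # median over vertices for one subject
print(r.round(3), round(float(subject_score), 3))
```

The vectorized form should give the same values as calling `scipy.stats.pearsonr` once per vertex, while avoiding a Python-level loop over vertices.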

In [12]:
from brainscore_vision import load_model, load_benchmark

# Load model and benchmark
model = load_model('alexnet')
benchmark = load_benchmark('Allen2022_fmri_surface.V4-ridge')

# Score via benchmark
t0 = time.time()
benchmark_score = benchmark(model)
elapsed = time.time() - t0

print(f'Benchmark score (ceiling-normalized): {float(benchmark_score.values):.4f}')
print(f'Benchmark raw:                        {float(benchmark_score.raw.values):.4f}')
print(f'Benchmark ceiling:                    {float(benchmark_score.ceiling.values):.4f}')
print(f'Time: {elapsed:.1f}s')

# Extract per-subject scores from benchmark
print(f'\nPer-subject raw scores from benchmark:')
benchmark_per_subj = {}
for subj_label in [f'subj{s:02d}' for s in SUBJECT_LIST]:
    if subj_label in benchmark_score.attrs:
        raw_val = float(benchmark_score.attrs[subj_label].raw.values)
        benchmark_per_subj[subj_label] = raw_val
        print(f'  {subj_label}: {raw_val:.4f}')


<xarray.Score (subject: 1)>
array([0.44368672])
Coordinates:
  * subject  (subject) <U6 'subj01'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.47483057], dtype=float32)...
    ceiling:  <xarray.Score ()>\narray(0.5081605)\nAttributes:\n    raw:     ...


<xarray.Score (subject: 1)>
array([0.48174151])
Coordinates:
  * subject  (subject) <U6 'subj02'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.5506409], dtype=float32)\...
    ceiling:  <xarray.Score ()>\narray(0.62939434)\nAttributes:\n    raw:    ...


<xarray.Score (subject: 1)>
array([0.44947871])
Coordinates:
  * subject  (subject) <U6 'subj03'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.4437025], dtype=float32)\...
    ceiling:  <xarray.Score ()>\narray(0.4380005)\nAttributes:\n    raw:     ...


<xarray.Score (subject: 1)>
array([0.30570829])
Coordinates:
  * subject  (subject) <U6 'subj04'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.36520135], dtype=float32)...
    ceiling:  <xarray.Score ()>\narray(0.4362722)\nAttributes:\n    raw:     ...


<xarray.Score (subject: 1)>
array([0.27451815])
Coordinates:
  * subject  (subject) <U6 'subj05'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.37262708], dtype=float32)...
    ceiling:  <xarray.Score ()>\narray(0.50579876)\nAttributes:\n    raw:    ...


<xarray.Score (subject: 1)>
array([0.27812625])
Coordinates:
  * subject  (subject) <U6 'subj06'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.34403068], dtype=float32)...
    ceiling:  <xarray.Score ()>\narray(0.42555173)\nAttributes:\n    raw:    ...


<xarray.Score (subject: 1)>
array([0.53596906])
Coordinates:
  * subject  (subject) <U6 'subj07'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.5619419], dtype=float32)\...
    ceiling:  <xarray.Score ()>\narray(0.58917342)\nAttributes:\n    raw:    ...


<xarray.Score (subject: 1)>
array([0.43810691])
Coordinates:
  * subject  (subject) <U6 'subj08'
Attributes:
    raw:      <xarray.Score (subject: 1)>\narray([0.4608936], dtype=float32)\...
    ceiling:  <xarray.Score ()>\narray(0.48486547)\nAttributes:\n    raw:    ...
Benchmark score (ceiling-normalized): 0.4009
Benchmark raw:                        0.4467
Benchmark ceiling:                    0.5051
Time: 7.9s

Per-subject raw scores from benchmark:
  subj01: 0.4748
  subj02: 0.5506
  subj03: 0.4437
  subj04: 0.3652
  subj05: 0.3726
  subj06: 0.3440
  subj07: 0.5619
  subj08: 0.4609


In [13]:
from sklearn.linear_model import RidgeCV
from scipy.stats import pearsonr

# Get the model activations that the benchmark already extracted
model_train_act = benchmark.train_activations.values
model_test_act = benchmark.test_activations.values

if model_train_act.ndim > 2:
    model_train_act = model_train_act.squeeze()
    model_test_act = model_test_act.squeeze()

print(f'Model train activations: {model_train_act.shape}')
print(f'Model test activations:  {model_test_act.shape}')

# Get neural data from the benchmark assemblies
train_neural = benchmark.train_assembly
test_neural = benchmark.test_assembly

if 'time_bin' in train_neural.dims:
    train_neural = train_neural.isel(time_bin=0)
if 'time_bin' in test_neural.dims:
    test_neural = test_neural.isel(time_bin=0)

print(f'Train neural: {train_neural.shape}')
print(f'Test neural:  {test_neural.shape}')

Model train activations: (412, 64896)
Model test activations:  (103, 64896)
Train neural: (412, 5782)
Test neural:  (103, 5782)


In [14]:
# Standalone ridge regression per subject
# Match the benchmark: per-subject fitting, median correlation across vertices

from brainscore_vision.metrics.regression_correlation.metric import ALPHA_LIST

subjects = np.unique(train_neural['subject'].values)
standalone_results = {}

for subj_label in subjects:
    subj_mask_train = train_neural['subject'].values == subj_label
    subj_mask_test = test_neural['subject'].values == subj_label

    y_train = train_neural.values[:, subj_mask_train]
    y_test = test_neural.values[:, subj_mask_test]

    # With the default alpha_per_target=False, RidgeCV selects one alpha shared by all vertices
    ridge = RidgeCV(alphas=ALPHA_LIST, fit_intercept=True)
    ridge.fit(model_train_act, y_train)
    y_pred = ridge.predict(model_test_act)

    n_verts = y_test.shape[1]
    correlations = np.array([
        pearsonr(y_pred[:, v], y_test[:, v])[0]
        for v in range(n_verts)
    ])

    median_r = np.median(correlations)
    standalone_results[subj_label] = {
        'median_r': median_r,
        'alpha': ridge.alpha_,
        'n_verts': n_verts,
    }
    print(f'{subj_label}: median r = {median_r:.4f}, '
          f'alpha = {ridge.alpha_:.0f}, '
          f'n_verts = {n_verts}')

standalone_mean_r = np.mean([v['median_r'] for v in standalone_results.values()])
print(f'\nStandalone mean raw r: {standalone_mean_r:.4f}')

subj01: median r = 0.4940, alpha = 75000, n_verts = 724
subj02: median r = 0.5585, alpha = 50000, n_verts = 845
subj03: median r = 0.4471, alpha = 90000, n_verts = 769
subj04: median r = 0.3854, alpha = 150000, n_verts = 564
subj05: median r = 0.3921, alpha = 150000, n_verts = 774
subj06: median r = 0.3583, alpha = 200000, n_verts = 734
subj07: median r = 0.5731, alpha = 85000, n_verts = 735
subj08: median r = 0.4782, alpha = 150000, n_verts = 637

Standalone mean raw r: 0.4609


In [15]:
# Compare standalone vs benchmark results
print('=== Standalone Ridge vs Benchmark Comparison (Surface V4) ===')
print(f'{"Subject":<10} {"Standalone":>12} {"Benchmark":>12} {"Diff":>12}')
print('-' * 50)

for subj_label in subjects:
    standalone_r = standalone_results[subj_label]['median_r']
    bm_r = benchmark_per_subj.get(subj_label, float('nan'))
    diff = abs(standalone_r - bm_r) if not np.isnan(bm_r) else float('nan')
    print(f'{subj_label:<10} {standalone_r:>12.4f} {bm_r:>12.4f} {diff:>12.4f}')

benchmark_raw = float(benchmark_score.raw.values)
print(f'\n{"Mean":>10} {standalone_mean_r:>12.4f} {benchmark_raw:>12.4f} '
      f'{abs(standalone_mean_r - benchmark_raw):>12.4f}')

print(f'\nNote: Small differences are expected because the benchmark uses')
print(f'a cross-validated ridge metric that may handle multi-output fitting')
print(f'differently from sklearn\'s single RidgeCV.')

=== Standalone Ridge vs Benchmark Comparison (Surface V4) ===
Subject      Standalone    Benchmark         Diff
--------------------------------------------------
subj01           0.4940       0.4748       0.0192
subj02           0.5585       0.5506       0.0079
subj03           0.4471       0.4437       0.0034
subj04           0.3854       0.3652       0.0202
subj05           0.3921       0.3726       0.0194
subj06           0.3583       0.3440       0.0142
subj07           0.5731       0.5619       0.0112
subj08           0.4782       0.4609       0.0174

      Mean       0.4609       0.4467       0.0141

Note: Small differences are expected because the benchmark uses
a cross-validated ridge metric that may handle multi-output fitting
differently from sklearn's single RidgeCV.


## Summary

| Validation | What it proves |
|---|---|
| Section 1: NB02 -> filter -> global z-score -> assembly | NB03 packaging step is correct |
| Section 1: Vertex spot checks | Individual values trace correctly |
| Section 2: Raw MGH -> assembly (subj01, V4) | Full surface preprocessing chain is correct |
| Section 2: Vertex/hemisphere ordering | Assembly neuroid ordering matches extraction order |
| Section 3: Standalone ridge vs benchmark | Benchmark metric code works on surface data (per-subject raw r within about 0.02 of standalone RidgeCV) |

In [16]:
print('=== Surface Assembly Validation Summary ===')
print()
print('Section 1 (Vertex-Level Spot Check):')
print(f'  NB02->assembly train max_diff: {max_diff:.2e} [PASS]')
print(f'  NB02->assembly test max_diff:  {max_diff_test:.2e} [PASS]')
print(f'  5/5 spot checks: PASS')
print()
print('Section 2 (End-to-End Raw Data Bypass):')
print(f'  Raw MGH->assembly train max_diff: {train_max_diff:.2e} [PASS]')
print(f'  Raw MGH->assembly test max_diff:  {test_max_diff:.2e} [PASS]')
print(f'  Vertex ordering: PASS')
print(f'  Hemisphere ordering: PASS')
print()
print('Section 3 (Standalone Ridge):')
print(f'  Standalone mean r: {standalone_mean_r:.4f}')
print(f'  Benchmark mean r:  {benchmark_raw:.4f}')
print(f'  Difference:        {abs(standalone_mean_r - benchmark_raw):.4f}')
print()
print('All validations passed.')

=== Surface Assembly Validation Summary ===

Section 1 (Vertex-Level Spot Check):
  NB02->assembly train max_diff: 0.00e+00 [PASS]
  NB02->assembly test max_diff:  9.54e-07 [PASS]
  5/5 spot checks: PASS

Section 2 (End-to-End Raw Data Bypass):
  Raw MGH->assembly train max_diff: 1.67e-06 [PASS]
  Raw MGH->assembly test max_diff:  2.38e-06 [PASS]
  Vertex ordering: PASS
  Hemisphere ordering: PASS

Section 3 (Standalone Ridge):
  Standalone mean r: 0.4609
  Benchmark mean r:  0.4467
  Difference:        0.0141

All validations passed.
