# Analysis of human and model behavior across physical domains on subset of all trials
- Adversarial (<50% accuracy with 95% CI)
- Easy (0% <= accuracy <33.3%)
- By chance (33.3% <= accuracy <66.7%)
- Hard (66.7% <= accuracy <=100%)

(*accuracy as overall human accuracy on each trial)

**The purpose of this notebook is to:** 
* Apply preprocessing to human behavioral data
* Visualize distribution and compute summary statistics over **human** physical judgments
* Visualize distribution and compute summary statistics over **model** physical judgments
* Conduct human-model comparisons
* Output CSV that can be re-loaded into R notebook for statistical modeling & fancy visualizations

**This notebook depends on:**
* Running `./generate_dataframes.py` (INTERNAL USE ONLY)
* Running `./upload_results.py` (INTERNAL USE ONLY)
* Running `./download_results.py` (PUBLIC USE)

## setup

#### Load packages

In [1]:
import os
import sys
import urllib, io

sys.path.append('./analysis_helpers')
from importlib import reload

import numpy as np
import scipy.stats as stats
import pandas as pd

import analysis_helpers as h

import pymongo as pm
from collections import Counter
import json
import re
import ast

from PIL import Image, ImageOps, ImageDraw, ImageFont 

from io import BytesIO
import base64

from tqdm.notebook import tqdm

import  matplotlib
from matplotlib import pylab, mlab, pyplot
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.pylabtools import figsize, getfigs
plt = pyplot
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
plt.style.use('seaborn-white')

import seaborn as sns

%matplotlib inline
import scipy.stats
import sklearn.metrics
import random

from IPython.display import clear_output

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

#### options

In [2]:
# display all columns
pd.set_option('display.max_columns', None)

# seaborn plotting themes
sns.set_context('talk')
sns.set_style("whitegrid")

#### set up paths and directories

In [3]:
## directory & file hierarchy
proj_dir = os.path.abspath('..')
datavol_dir = os.path.join(proj_dir,'data')
analysis_dir =  os.path.abspath('.')
results_dir = os.path.join(proj_dir,'results')
plot_dir = os.path.join(results_dir,'plots')
csv_dir = os.path.join(results_dir,'csv')
json_dir = os.path.join(results_dir,'json')
exp_dir = os.path.abspath(os.path.join(proj_dir,'behavioral_experiments'))
png_dir = os.path.abspath(os.path.join(datavol_dir,'png'))

## add helpers to python path
if os.path.join(proj_dir,'stimuli') not in sys.path:
    sys.path.append(os.path.join(proj_dir,'stimuli'))
    
if not os.path.exists(results_dir):
    os.makedirs(results_dir)
    
if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)   
    
if not os.path.exists(csv_dir):
    os.makedirs(csv_dir)       
    
## add helpers to python path
if os.path.join(analysis_dir,'utils') not in sys.path:
    sys.path.append(os.path.join(analysis_dir,'utils'))   

def make_dir_if_not_exists(dir_name):   
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    return dir_name

## create directories that don't already exist        
result = [make_dir_if_not_exists(x) for x in [results_dir,plot_dir,csv_dir]]

### load human data

In [4]:
from experiment_meta import *
HEM = pd.DataFrame(NEURIPS2021_EXPS) # HEM = "human experiment metadata"
HEM

Unnamed: 0,study,bucket_name,stim_version,iterationName
0,dominoes_pilot,human-physics-benchmarking-dominoes-pilot,production_1,production_1_testing
1,collision_pilot,human-physics-benchmarking-collision-pilot,production_2,production_2_testing
2,towers_pilot,human-physics-benchmarking-towers-pilot,production_2,production_2_testing
3,linking_pilot,human-physics-benchmarking-linking-pilot,production_2,production_2_testing
4,containment_pilot,human-physics-benchmarking-containment-pilot,production_2,production_2_testing
5,rollingsliding_pilot,human-physics-benchmarking-rollingsliding-pilot,production_2,production_2_testing
6,drop_pilot,human-physics-benchmarking-drop-pilot,production_2,production_2_testing
7,clothiness_pilot,human-physics-benchmarking-clothiness-pilot,production_2,production_2_testing


In [5]:
SCENARIOS = sorted([n.split("_")[0] for n in HEM['study'].unique()])

In [None]:
## get paths to all human response data
data_paths = [os.path.join(csv_dir,'humans',i) for i in os.listdir(os.path.join(csv_dir,'humans'))]
resp_paths = [i for i in data_paths if i.split('/')[-1].split('-')[0]=='human_responses']
assert len(resp_paths)==8

In [7]:
## tally up all flagged sessions


In [8]:
## also load all human data into a big dataframe
HD = pd.concat([h.apply_exclusion_criteria(h.load_and_preprocess_data(p), verbose=True) for p in resp_paths])
print("Loaded {} lines".format(len(HD)))

97.5th percentile for streak length is 13.0.
There are 12 flagged IDs so far due to long streaks.
There are 0 flagged IDs so far due to alternating sequences.
TODO: Still need to flag familiarization trial failures!!!!
There are 2 flagged IDs so far due to low accuracy.
There are 2 flagged IDs so far due to high RTs.
There are a total of 13 flagged IDs.
There are a total of 89 valid and complete sessions for towers.
97.5th percentile for streak length is 12.0.
There are 16 flagged IDs so far due to long streaks.
There are 0 flagged IDs so far due to alternating sequences.
TODO: Still need to flag familiarization trial failures!!!!
There are 3 flagged IDs so far due to low accuracy.
There are 1 flagged IDs so far due to high RTs.
There are a total of 18 flagged IDs.
There are a total of 83 valid and complete sessions for containment.
97.5th percentile for streak length is 12.0.
There are 3 flagged IDs so far due to long streaks.
There are 0 flagged IDs so far due to alternating sequence

#### exclude subjects from familiarization
Run `familiariarization_exclusion.ipynb` to generate `excluded_games.csv`

In [9]:
bad_games = pd.read_csv(os.path.join(csv_dir,"humans/excluded_games.csv")).values[:,1]

In [10]:
bad_games

array(['0720-d5f527dc-d86a-4d88-af8f-b70ac9264fef',
       '1685-8963fea0-0d21-454b-8bbe-e9cbc792aa11',
       '4917-538725a5-383f-462b-9ab7-43b9473c9dcc',
       '7411-987b0a97-8a67-41a3-a3d8-d8f792c35ab5',
       '8383-e0582a4e-6498-4d91-bb29-2b6a363cc2e9',
       '8731-67e86658-28ff-4cc6-b722-9620e3b3ce43',
       '9784-7a67e88b-0416-4b55-8a72-9a0d99038c49',
       '9808-e983d3b8-75c3-428a-8182-f57fd645abb1',
       '9930-aa52e4be-e5e3-441a-9cb4-f1144d9e233f'], dtype=object)

In [11]:
print("Excluding {} rows for {} games".format(sum(HD['gameID'].isin(bad_games)), len(bad_games)))
HD = HD[~HD['gameID'].isin(bad_games)]

Excluding 900 rows for 9 games


### load model data

In [12]:
## get paths to all model data
model_data_paths = [os.path.join(csv_dir,'models',i) for i in os.listdir(os.path.join(csv_dir,'models'))]
model_res_paths = [i for i in model_data_paths if i.split('.')[-1] == "csv"]

In [13]:
## load all model results into a single dataframe
MD = pd.concat([pd.read_csv(p).assign(filename=p.split('/')[-1]) for p in model_res_paths])
print("Loaded {} rows".format(len(MD)))

Loaded 248011 rows


In [14]:
# a couple of import steps (restore original scenario names, add single prediction value, add correctness column)
MD = h.process_model_dataframe(MD)

In [15]:
#check for duplicated rows
if len(MD.duplicated()) > 0:
    print("⚠️There are {} duplicated rows!".format(np.sum(MD.duplicated())))
    MD = MD[~MD.duplicated(h.MODEL_COLS+["Stimulus Name"],keep="first")]
    print("Removed duplicates, {} rows left".format(len(MD)))

⚠️There are 0 duplicated rows!
Removed duplicates, 248011 rows left


In [16]:
# save model kinds to variable
MODELS = list(MD["Model Kind"].unique())

In [17]:
print("We get the following kinds of models:")
display(MODELS)

We get the following kinds of models:


['OP3_OP3 encoder_0.0_Image Reconstruction_all_but_this_Image Reconstruction_0_same',
 'OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_same',
 'OP3_OP3 encoder_0.0_Image Reconstruction_same_Image Reconstruction_0_same',
 'SVG_VGG_1.0_VAE_all_but_this_VAE_1_same',
 'SVG_VGG_2.0_VAE_all_but_this_VAE_2_same',
 'SVG_VGG_0.0_VAE_all_but_this_VAE_0_same',
 'SVG_VGG_1.0_VAE_all_VAE_1_same',
 'SVG_VGG_2.0_VAE_all_VAE_2_same',
 'SVG_VGG_0.0_VAE_all_VAE_0_same',
 'SVG_VGG_1.0_VAE_same_VAE_1_same',
 'SVG_VGG_2.0_VAE_same_VAE_2_same',
 'SVG_VGG_0.0_VAE_same_VAE_0_same',
 'DEITFrozenMLP_DEIT_nan_nan_nan_L2 on latent_0_same',
 'VGGFrozenLSTM_VGG_nan_nan_nan_L2 on latent_0_same',
 'RPIN_R-CNN_0.0_L2 on 2D position_all_but_this_L2 on 2D position_0_same',
 'RPIN_R-CNN_0.0_L2 on 2D position_all_L2 on 2D position_0_same',
 'RPIN_R-CNN_0.0_L2 on 2D position_same_L2 on 2D position_0_same',
 'GNS-ransac_nan_nan_nan_nan_L2 on particle 3D positions_1_same',
 'RPIN_R-CNN_1.0_L2 on 2D posit

#### exclude bad stims (where model/human stims mismatched)

In [18]:
stim_comparision = pd.merge(pd.DataFrame(MD.groupby('Canon Stimulus Name')['Actual Outcome'].first()).reset_index(),pd.DataFrame(HD.groupby('stim_ID')['target_hit_zone_label'].first()).reset_index(),left_on='Canon Stimulus Name',right_on='stim_ID')

bad_stims = stim_comparision[stim_comparision['Actual Outcome'] != stim_comparision['target_hit_zone_label']]['Canon Stimulus Name']
print("There are {} bad stims".format(len(bad_stims)))

There are 37 bad stims


In [19]:
#Exclude bad stims
HD = HD[~HD['stim_ID'].isin(bad_stims)]
MD = MD[~MD['Canon Stimulus Name'].isin(bad_stims)]

In [20]:
#Also exclude stims from the rollingsliding ledge subset
HD = HD[~HD['stim_ID'].str.contains("rollingSliding_simple_ledge")]
MD = MD[~MD['Canon Stimulus Name'].str.contains("rollingSliding_simple_ledge")]

#### exclude familiriza stims (in order to do model/human stims comparison)

In [21]:
# human data trial accuracy
HD_accu = HD.groupby('stim_ID').agg({'correct':np.mean})
HD_accu

Unnamed: 0_level_0,correct
stim_ID,Unnamed: 1_level_1
pilot-containment-bowl_0001,0.566265
pilot-containment-bowl_0002,0.939759
pilot-containment-bowl_0003,0.939759
pilot-containment-bowl_0005,0.951807
pilot-containment-bowl_0007,0.819277
...,...
test19_0013,0.810811
test19_0015,0.351351
test19_0016,0.418919
test19_0017,0.932432


In [22]:
# remove all familiarization trials data because human data don't have those
MD = MD[~MD['Canon Stimulus Name'].str.contains("familiarization")]

In [23]:
# model data trial accuracy
MD_accu = MD.groupby(['Canon Stimulus Name','Model']).agg({'correct':np.mean}).reset_index()
MD_models = MD.groupby(['Canon Stimulus Name','Model']).first()['Readout Train Data'].reset_index()

MD_accu = MD_accu.join(MD_models['Readout Train Data'])
MD_accu

Unnamed: 0,Canon Stimulus Name,Model,correct,Readout Train Data
0,pilot-containment-bowl_0000,CSWM,0.777778,containment
1,pilot-containment-bowl_0000,DEITFrozenLSTM,1.000000,containment
2,pilot-containment-bowl_0000,DEITFrozenMLP,1.000000,containment
3,pilot-containment-bowl_0000,DPI,1.000000,containment
4,pilot-containment-bowl_0000,GNS,1.000000,containment
...,...,...,...,...
17819,test19_0019,OP3,0.444444,clothiness
17820,test19_0019,RPIN,0.527778,clothiness
17821,test19_0019,SVG,0.075000,clothiness
17822,test19_0019,VGGFrozenLSTM,0.333333,clothiness


In [24]:
out = MD[MD['Canon Stimulus Name'].isin(HD['stim_ID'].tolist())]
yr = out[out['Canon Stimulus Name'].str.contains('yellow')]
MD

Unnamed: 0,Model,Readout Train Data,Readout Test Data,Train Accuracy,Test Accuracy,Readout Type,Predicted Prob_false,Predicted Prob_true,Predicted Outcome,Actual Outcome,Stimulus Name,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,filename,Encoder_Pre-training Dataset,correct,Canon Stimulus Name,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type,ModelID,Model Kind
0,OP3,linking,linking,1.000000,0.480583,A,9.999980e-01,0.000002,0,1,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
1,OP3,linking,linking,1.000000,0.480583,A,9.999776e-01,0.000022,0,1,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
2,OP3,linking,linking,1.000000,0.480583,A,3.345129e-10,1.000000,1,1,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
3,OP3,linking,linking,1.000000,0.480583,A,9.999989e-01,0.000001,0,0,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
4,OP3,linking,linking,1.000000,0.480583,A,2.348308e-02,0.976517,1,0,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20755,SVG,towers,towers,0.649723,0.446281,D,7.632269e-01,0.236773,0,0,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20756,SVG,towers,towers,0.649723,0.446281,D,5.133288e-02,0.948667,1,1,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20757,SVG,towers,towers,0.649723,0.446281,D,6.525511e-01,0.347449,0,1,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20758,SVG,towers,towers,0.649723,0.446281,D,4.173050e-01,0.582695,1,0,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same


## Identify trials where humans are consistently below 50% accuracy (with α = 0.05 with classical parametric binomial proportion) as adversial trials

In [25]:
print('Minimum number of participants in any stimulus is ', HD.groupby(["scenarioName","stim_ID"]).count().min()['trialNum'])
# minimum number of experiment participant count

Minimum number of participants in any stimulus is  74


In [26]:
# Calculate unbiased estimators of correctness for each stimulus under out presumption (using average rate of correctness).
correctness = HD.groupby('stim_ID').agg({'correct':np.mean})
scenarioName = HD.groupby('stim_ID').first()['scenarioName']
correctness = correctness.join(scenarioName)
correctness_count = HD.groupby('stim_ID').count()['trialNum']
correct_below_50 =  correctness[(correctness['correct'] + 1.96*(((correctness['correct'].multiply(1-correctness['correct']))).divide(correctness_count).apply(np.sqrt))) < 0.5].reset_index()
correct_below_50 

Unnamed: 0,stim_ID,correct,scenarioName
0,pilot-containment-bowl_0030,0.192771,containment
1,pilot-containment-box_0014,0.108434,containment
2,pilot-containment-box_0017,0.337349,containment
3,pilot-containment-box_0022,0.253012,containment
4,pilot-containment-cone-plate_0005,0.084337,containment
...,...,...,...
160,test17_0011,0.378378,clothiness
161,test17_0017,0.378378,clothiness
162,test18_0014,0.040541,clothiness
163,test19_0008,0.378378,clothiness


In [27]:
# how many trials' accuracy confidence interval are below 50% accuracy 
correct_below_50_dist = correct_below_50.groupby('scenarioName').count()
correct_below_50_dist

Unnamed: 0_level_0,stim_ID,correct
scenarioName,Unnamed: 1_level_1,Unnamed: 2_level_1
clothiness,19,19
collision,16,16
containment,18,18
dominoes,28,28
drop,24,24
linking,34,34
rollingsliding,5,5
towers,21,21


## Categorized trilas as hard, by chance, easy - (0% <= accu < 33.3%; 33.3% <= accu < 66.7%; ;  66.7% <= accu <= 100%)

In [28]:
# Calculate unbiased estimators of correctness for each stimulus under out presumption (using average rate of correctness).
correct_below_33 =  correctness[correctness['correct'] < 0.333].reset_index()
correct_below_33 

Unnamed: 0,stim_ID,correct,scenarioName
0,pilot-containment-bowl_0030,0.192771,containment
1,pilot-containment-box_0014,0.108434,containment
2,pilot-containment-box_0022,0.253012,containment
3,pilot-containment-cone-plate_0005,0.084337,containment
4,pilot-containment-cone-plate_0007,0.048193,containment
...,...,...,...
112,test16_0003,0.270270,clothiness
113,test16_0004,0.310811,clothiness
114,test17_0001,0.256757,clothiness
115,test17_0010,0.229730,clothiness


In [29]:
# how many trials' accuracy confidence interval are below 50% accuracy 
correct_below_33_dist = correct_below_33.groupby('scenarioName').count()
correct_below_33_dist

Unnamed: 0_level_0,stim_ID,correct
scenarioName,Unnamed: 1_level_1,Unnamed: 2_level_1
clothiness,15,15
collision,10,10
containment,13,13
dominoes,19,19
drop,15,15
linking,24,24
rollingsliding,4,4
towers,17,17


In [30]:
# Calculate unbiased estimators of correctness for each stimulus under out presumption (using average rate of correctness).
correct_below_67 =  correctness[(correctness['correct'] >= 0.333) & (correctness['correct'] < 0.667)].reset_index()
correct_below_67 

Unnamed: 0,stim_ID,correct,scenarioName
0,pilot-containment-bowl_0001,0.566265,containment
1,pilot-containment-bowl_0015,0.481928,containment
2,pilot-containment-bowl_0021,0.650602,containment
3,pilot-containment-bowl_0027,0.578313,containment
4,pilot-containment-bowl_0031,0.481928,containment
...,...,...,...
244,test19_0007,0.621622,clothiness
245,test19_0008,0.378378,clothiness
246,test19_0010,0.500000,clothiness
247,test19_0015,0.351351,clothiness


In [31]:
# how many trials' accuracy confidence interval are below 50% accuracy 
correct_below_67_dist = correct_below_67.groupby('scenarioName').count()
correct_below_67_dist

Unnamed: 0_level_0,stim_ID,correct
scenarioName,Unnamed: 1_level_1,Unnamed: 2_level_1
clothiness,48,48
collision,22,22
containment,26,26
dominoes,39,39
drop,28,28
linking,50,50
rollingsliding,8,8
towers,28,28


In [32]:
# Calculate unbiased estimators of correctness for each stimulus under out presumption (using average rate of correctness).
correct_below_100 =  correctness[(correctness['correct'] <= 1) & (correctness['correct'] >= 0.667)].reset_index()
correct_below_100

Unnamed: 0,stim_ID,correct,scenarioName
0,pilot-containment-bowl_0002,0.939759,containment
1,pilot-containment-bowl_0003,0.939759,containment
2,pilot-containment-bowl_0005,0.951807,containment
3,pilot-containment-bowl_0007,0.819277,containment
4,pilot-containment-bowl_0008,0.987952,containment
...,...,...,...
771,test19_0011,0.824324,clothiness
772,test19_0012,0.878378,clothiness
773,test19_0013,0.810811,clothiness
774,test19_0017,0.932432,clothiness


In [33]:
# how many trials' accuracy confidence interval are below 50% accuracy 
correct_below_100_dist = correct_below_100.groupby('scenarioName').count()
correct_below_100_dist

Unnamed: 0_level_0,stim_ID,correct
scenarioName,Unnamed: 1_level_1,Unnamed: 2_level_1
clothiness,86,86
collision,118,118
containment,111,111
dominoes,92,92
drop,107,107
linking,76,76
rollingsliding,82,82
towers,104,104


### filter to get trials on subsets (adversrial, easy, by chance, hard)

In [34]:
# adv trails in Model Data
MD_adv = MD[MD['Canon Stimulus Name'].isin(correct_below_50['stim_ID'].tolist())]
MD_adv

Unnamed: 0,Model,Readout Train Data,Readout Test Data,Train Accuracy,Test Accuracy,Readout Type,Predicted Prob_false,Predicted Prob_true,Predicted Outcome,Actual Outcome,Stimulus Name,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,filename,Encoder_Pre-training Dataset,correct,Canon Stimulus Name,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type,ModelID,Model Kind
10,OP3,linking,linking,1.000000,0.480583,A,1.000000,2.945348e-09,0,1,pilot_linking_nl1-8_ms03_aCylcap_bCyl_occ1_dis...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl1-8_ms03_aCylcap_bCyl_occ1_dis...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
19,OP3,linking,linking,1.000000,0.480583,A,0.209647,7.903533e-01,1,0,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
21,OP3,linking,linking,1.000000,0.480583,A,1.000000,4.075896e-09,0,1,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
31,OP3,linking,linking,1.000000,0.480583,A,0.413014,5.869864e-01,1,0,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
33,OP3,linking,linking,1.000000,0.480583,A,1.000000,1.171479e-12,0,0,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom-re...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom_0020,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20732,SVG,towers,towers,0.649723,0.446281,D,0.163365,8.366347e-01,1,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20733,SVG,towers,towers,0.649723,0.446281,D,0.951054,4.894563e-02,0,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20739,SVG,towers,towers,0.649723,0.446281,D,0.409041,5.909593e-01,1,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20741,SVG,towers,towers,0.649723,0.446281,D,0.821008,1.789917e-01,0,0,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same


In [35]:
# adv trails in Merged Data
MD_hard = MD[MD['Canon Stimulus Name'].isin(correct_below_33['stim_ID'].tolist())]
MD_hard

Unnamed: 0,Model,Readout Train Data,Readout Test Data,Train Accuracy,Test Accuracy,Readout Type,Predicted Prob_false,Predicted Prob_true,Predicted Outcome,Actual Outcome,Stimulus Name,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,filename,Encoder_Pre-training Dataset,correct,Canon Stimulus Name,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type,ModelID,Model Kind
10,OP3,linking,linking,1.000000,0.480583,A,1.000000,2.945348e-09,0,1,pilot_linking_nl1-8_ms03_aCylcap_bCyl_occ1_dis...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl1-8_ms03_aCylcap_bCyl_occ1_dis...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
19,OP3,linking,linking,1.000000,0.480583,A,0.209647,7.903533e-01,1,0,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
21,OP3,linking,linking,1.000000,0.480583,A,1.000000,4.075896e-09,0,1,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
31,OP3,linking,linking,1.000000,0.480583,A,0.413014,5.869864e-01,1,0,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
33,OP3,linking,linking,1.000000,0.480583,A,1.000000,1.171479e-12,0,0,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom-re...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom_0020,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20672,SVG,towers,towers,0.649723,0.446281,D,0.236706,7.632936e-01,1,0,pilot_towers_nb4_fr015_SJ000_gr01_mono1_dis0_o...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb4_fr015_SJ000_gr01_mono1_dis0_o...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20732,SVG,towers,towers,0.649723,0.446281,D,0.163365,8.366347e-01,1,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20733,SVG,towers,towers,0.649723,0.446281,D,0.951054,4.894563e-02,0,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20741,SVG,towers,towers,0.649723,0.446281,D,0.821008,1.789917e-01,0,0,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same


In [36]:
# adv trails in Merged Data
MD_chance = MD[MD['Canon Stimulus Name'].isin(correct_below_67['stim_ID'].tolist())]
MD_chance

Unnamed: 0,Model,Readout Train Data,Readout Test Data,Train Accuracy,Test Accuracy,Readout Type,Predicted Prob_false,Predicted Prob_true,Predicted Outcome,Actual Outcome,Stimulus Name,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,filename,Encoder_Pre-training Dataset,correct,Canon Stimulus Name,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type,ModelID,Model Kind
0,OP3,linking,linking,1.000000,0.480583,A,9.999980e-01,0.000002,0,1,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl4-8_mg-005_aCyl_bCyl_occ1_dis1...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
1,OP3,linking,linking,1.000000,0.480583,A,9.999776e-01,0.000022,0,1,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
6,OP3,linking,linking,1.000000,0.480583,A,2.729372e-12,1.000000,1,1,pilot_linking_nl1-8_ms03_aCylcap_bCyl_occ1_dis...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl1-8_ms03_aCylcap_bCyl_occ1_dis...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
7,OP3,linking,linking,1.000000,0.480583,A,9.967214e-01,0.003279,0,1,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom-re...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom_0000,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
27,OP3,linking,linking,1.000000,0.480583,A,9.830368e-01,0.016963,0,0,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20728,SVG,towers,towers,0.649723,0.446281,D,8.524217e-01,0.147578,0,1,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20729,SVG,towers,towers,0.649723,0.446281,D,2.880553e-01,0.711945,1,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20734,SVG,towers,towers,0.649723,0.446281,D,3.301845e-01,0.669815,1,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20739,SVG,towers,towers,0.649723,0.446281,D,4.090407e-01,0.590959,1,0,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb5_fr015_SJ030_mono0_dis0_occ0_b...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same


In [37]:
# adv trails in Merged Data
MD_easy = MD[MD['Canon Stimulus Name'].isin(correct_below_100['stim_ID'].tolist())]
MD_easy

Unnamed: 0,Model,Readout Train Data,Readout Test Data,Train Accuracy,Test Accuracy,Readout Type,Predicted Prob_false,Predicted Prob_true,Predicted Outcome,Actual Outcome,Stimulus Name,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,filename,Encoder_Pre-training Dataset,correct,Canon Stimulus Name,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type,ModelID,Model Kind
5,OP3,linking,linking,1.000000,0.480583,A,0.221151,7.788493e-01,1,0,pilot_linking_nl1-5_aNone_bCube_occ1_dis1_tdwr...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl1-5_aNone_bCube_occ1_dis1_tdwr...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
8,OP3,linking,linking,1.000000,0.480583,A,0.999999,6.344417e-07,0,1,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl1-6_ms03-7_aCylcap_bCyl_tdwroo...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
11,OP3,linking,linking,1.000000,0.480583,A,0.989635,1.036536e-02,0,0,pilot_linking_nl1-8_mg000_aCylcap_bCyl_tdwroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl1-8_mg000_aCylcap_bCyl_tdwroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
13,OP3,linking,linking,1.000000,0.480583,A,1.000000,3.855496e-10,0,0,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,True,pilot_linking_nl6_aCyl_bCube_occ1_dis1_boxroom...,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
16,OP3,linking,linking,1.000000,0.480583,A,0.999546,4.542668e-04,0,1,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom-re...,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_linking,0.0,Image Reconstruction,no_linking,0,OP3_results.csv,,False,pilot_linking_nl2-3_mg01_aCone_bCyl_boxroom_0003,all_but_this,all_but_this,same,OP3_OP3 encoder_0.0_Image Reconstruction_no_li...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,SVG,towers,towers,0.649723,0.446281,D,0.342623,6.573774e-01,1,1,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20754,SVG,towers,towers,0.649723,0.446281,D,0.205786,7.942137e-01,1,0,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20756,SVG,towers,towers,0.649723,0.446281,D,0.051333,9.486671e-01,1,1,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,True,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same
20758,SVG,towers,towers,0.649723,0.446281,D,0.417305,5.826950e-01,1,0,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,VGG,LSTM,,,,VAE,no_towers,1.0,VAE,no_towers,1,per_example_svg.csv,,False,pilot_towers_nb4_fr015_SJ000_gr-01_mono1_dis1_...,all_but_this,all_but_this,same,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same


### generate labels for regression analysis
* Comparison 1: Visual encoder architecture (ConvNet [SVG/VGGFrozenLSTM] vs. transformer [DEITFrozenLSTM] … DEITFrozenMLP vs. SVG/VGGFrozenMLP)
* Comparison 2: Dynamics model RNN vs. MLP (LSTM vs. MLP for above)
* Comparison 3: Among unsupervised models, object-centric vs. non-object-centric
        * {CSWM, OP3} vs. {SVG}
* Comparison 4: Latent vs. pixel reconstruction loss
        * CSWM vs. OP3
* Comparison 5: RPIN vs. CSWM/OP3 (“supervised explicit object-centric” vs. “unsupervised implicit object-centric”)

Dimensions: 
* “Visual encoder architecture” : [“ConvNet” “Transformer” “Neither”]
* “Dynamics model architecture” : [“LSTM”, “MLP”, “Neither”]
* “ObjectCentric”: [TRUE, FALSE, NA]
* “Supervised”: [TRUE, FALSE]
* “SelfSupervisedLoss”: [“latent”, “pixel”, “NA”]


In [None]:
#“Visual encoder architecture” : [“ConvNet” “Transformer” “Neither”]
MD_adv['Visual encoder architecture'] = "Neither"
MD_easy['Visual encoder architecture'] = "Neither"
MD_chance['Visual encoder architecture'] = "Neither"
MD_hard['Visual encoder architecture'] = "Neither"

MD_adv.loc[(MD_adv['Model'].str.contains("SVG")) | (MD_adv['Model'].str.contains("VGG")),'Visual encoder architecture'] = "ConvNet"
MD_easy.loc[(MD_easy['Model'].str.contains("SVG")) | (MD_easy['Model'].str.contains("VGG")),'Visual encoder architecture'] = "ConvNet"
MD_chance.loc[(MD_chance['Model'].str.contains("SVG")) | (MD_chance['Model'].str.contains("VGG")),'Visual encoder architecture'] = "ConvNet"
MD_hard.loc[(MD_hard['Model'].str.contains("SVG")) | (MD_hard['Model'].str.contains("VGG")),'Visual encoder architecture'] = "ConvNet"

MD_adv.loc[(MD_adv['Model'].str.contains("DEIT")) | (MD_adv['Model'].str.contains("VGG")),'Visual encoder architecture'] = "Transformer"
MD_easy.loc[(MD_easy['Model'].str.contains("DEIT")) | (MD_easy['Model'].str.contains("VGG")),'Visual encoder architecture'] = "Transformer"
MD_chance.loc[(MD_chance['Model'].str.contains("DEIT")) | (MD_chance['Model'].str.contains("VGG")),'Visual encoder architecture'] = "Transformer"
MD_hard.loc[(MD_hard['Model'].str.contains("DEIT")) | (MD_hard['Model'].str.contains("VGG")),'Visual encoder architecture'] = "Transformer"

In [None]:
# “Dynamics model architecture” : [“LSTM”, “MLP”, “Neither”]
MD_adv['Dynamics model architecture'] = "Neither"
MD_easy['Dynamics model architecture'] = "Neither"
MD_chance['Dynamics model architecture'] = "Neither"
MD_hard['Dynamics model architecture'] = "Neither"

MD_adv.loc[(MD_adv['Model'].str.contains("LSTM")),'Dynamics model architecture'] = "LSTM"
MD_easy.loc[(MD_easy['Model'].str.contains("LSTM")),'Dynamics model architecture'] = "LSTM"
MD_chance.loc[(MD_chance['Model'].str.contains("LSTM")),'Dynamics model architecture'] = "LSTM"
MD_hard.loc[(MD_hard['Model'].str.contains("LSTM")),'Dynamics model architecture'] = "LSTM"

MD_adv.loc[(MD_adv['Model'].str.contains("MLP")),'Dynamics model architecture'] = "MLP"
MD_easy.loc[(MD_easy['Model'].str.contains("MLP")),'Dynamics model architecture'] = "MLP"
MD_chance.loc[(MD_chance['Model'].str.contains("MLP")),'Dynamics model architecture'] = "MLP"
MD_hard.loc[(MD_hard['Model'].str.contains("MLP")),'Dynamics model architecture'] = "MLP"

In [None]:
# ObjectCentric”: [TRUE, FALSE, NA]
MD_adv['ObjectCentric'] = np.nan
MD_easy['ObjectCentric'] = np.nan
MD_chance['ObjectCentric'] = np.nan
MD_hard['ObjectCentric'] = np.nan

MD_adv.loc[(MD_adv['Model'].str.contains("CSWM")) | (MD_adv['Model'].str.contains("OP3")) | (MD_adv['Model'].str.contains("DPI")),'ObjectCentric'] = True
MD_easy.loc[(MD_easy['Model'].str.contains("CSWM")) | (MD_easy['Model'].str.contains("OP3")) | (MD_easy['Model'].str.contains("DPI")),'ObjectCentric'] = True
MD_chance.loc[(MD_chance['Model'].str.contains("CSWM")) | (MD_chance['Model'].str.contains("OP3")) | (MD_chance['Model'].str.contains("DPI")),'ObjectCentric'] = True
MD_hard.loc[(MD_hard['Model'].str.contains("CSWM")) | (MD_hard['Model'].str.contains("OP3")) | (MD_hard['Model'].str.contains("DPI")),'ObjectCentric'] = True

MD_adv.loc[(MD_adv['Model'].str.contains("SVG")),'ObjectCentric'] = False
MD_easy.loc[(MD_easy['Model'].str.contains("SVG")),'ObjectCentric'] = False
MD_chance.loc[(MD_chance['Model'].str.contains("SVG")),'ObjectCentric'] = False
MD_hard.loc[(MD_hard['Model'].str.contains("SVG")),'ObjectCentric'] = False
# MD['ObjectCentric'] = MD['ObjectCentric'].astype(bool)

In [None]:
# Supervised”: [TRUE, FALSE]
MD_adv['Supervised'] = np.nan
MD_easy['Supervised'] = np.nan
MD_chance['Supervised'] = np.nan
MD_hard['Supervised'] = np.nan

MD_adv.loc[(MD_adv['Model'].str.contains("RPIN")) | (MD_adv['Model'].str.contains("DPI")),'Supervised'] = True
MD_easy.loc[(MD_easy['Model'].str.contains("RPIN")) | (MD_easy['Model'].str.contains("DPI")),'Supervised'] = True
MD_chance.loc[(MD_chance['Model'].str.contains("RPIN")) | (MD_chance['Model'].str.contains("DPI")),'Supervised'] = True
MD_hard.loc[(MD_hard['Model'].str.contains("RPIN")) | (MD_hard['Model'].str.contains("DPI")),'Supervised'] = True

MD_adv.loc[(MD_adv['Model'].str.contains("CSWM")) | (MD_adv['Model'].str.contains("OP3")) | (MD_adv['Model'].str.contains("SVG") | (MD_adv['Model'].str.contains("VGG"))),'Supervised'] = False
MD_easy.loc[(MD_easy['Model'].str.contains("CSWM")) | (MD_easy['Model'].str.contains("OP3")) | (MD_easy['Model'].str.contains("SVG") | (MD_easy['Model'].str.contains("VGG"))),'Supervised'] = False
MD_chance.loc[(MD_chance['Model'].str.contains("CSWM")) | (MD_chance['Model'].str.contains("OP3")) | (MD_chance['Model'].str.contains("SVG") | (MD_chance['Model'].str.contains("VGG"))),'Supervised'] = False
MD_hard.loc[(MD_hard['Model'].str.contains("CSWM")) | (MD_hard['Model'].str.contains("OP3")) | (MD_hard['Model'].str.contains("SVG") | (MD_hard['Model'].str.contains("VGG"))),'Supervised'] = False
# MD_adv['Supervised'] = MD_adv['Supervised'].astype(bool)

In [None]:
# SelfSupervisedLoss”: [“latent”, “pixel”, “NA”]
MD_adv['SelfSupervisedLossSelfSupervisedLoss'] = "NA"
MD_easy['SelfSupervisedLossSelfSupervisedLoss'] = "NA"
MD_chance['SelfSupervisedLossSelfSupervisedLoss'] = "NA"
MD_hard['SelfSupervisedLossSelfSupervisedLoss'] = "NA"

MD_adv.loc[(MD_adv['Model'].str.contains("CSWM")),'SelfSupervisedLoss'] = "latent"
MD_easy.loc[(MD_easy['Model'].str.contains("CSWM")),'SelfSupervisedLoss'] = "latent"
MD_chance.loc[(MD_chance['Model'].str.contains("CSWM")),'SelfSupervisedLoss'] = "latent"
MD_hard.loc[(MD_hard['Model'].str.contains("CSWM")),'SelfSupervisedLoss'] = "latent"

MD_adv.loc[(MD_adv['Model'].str.contains("OP3")) | (MD_adv['Model'].str.contains("VGG")) | (MD_adv['Model'].str.contains("SVG") | (MD_adv['Model'].str.contains("VGG"))),'SelfSupervisedLoss'] = "pixel"
MD_easy.loc[(MD_easy['Model'].str.contains("OP3")) | (MD_easy['Model'].str.contains("VGG")) | (MD_easy['Model'].str.contains("SVG") | (MD_easy['Model'].str.contains("VGG"))),'SelfSupervisedLoss'] = "pixel"
MD_chance.loc[(MD_chance['Model'].str.contains("OP3")) | (MD_chance['Model'].str.contains("VGG")) | (MD_chance['Model'].str.contains("SVG") | (MD_chance['Model'].str.contains("VGG"))),'SelfSupervisedLoss'] = "pixel"
MD_hard.loc[(MD_hard['Model'].str.contains("OP3")) | (MD_hard['Model'].str.contains("VGG")) | (MD_hard['Model'].str.contains("SVG") | (MD_hard['Model'].str.contains("VGG"))),'SelfSupervisedLoss'] = "pixel"

In [43]:
#save as model identifying column
MODEL_COLS = h.MODEL_COLS + ['Visual encoder architecture','Dynamics model architecture','ObjectCentric','Supervised','SelfSupervisedLossSelfSupervisedLoss']

In [None]:
## save out 
MD_adv.to_csv(os.path.join(csv_dir, 'summary', 'allModels_results_adv.csv'))
MD_easy.to_csv(os.path.join(csv_dir, 'summary', 'allModels_results_easy.csv'))
MD_chance.to_csv(os.path.join(csv_dir, 'summary', 'allModels_results_chance.csv'))
MD_hard.to_csv(os.path.join(csv_dir, 'summary', 'allModels_results_hard.csv'))

### generate summary table of human 95% CIs for accuracy across all scenarios

In [None]:
## init human_bootstrapped_accuracy_adv for plotting
human_bootstrapped_accuracy_adv = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)
    
    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_50['stim_ID'].tolist())]
    
    D = D.sort_values('stim_ID') #ensure same stim order
    humans = np.array(D['gameID'].unique())

    ## compute bootstrapped sampling distributions of accuracy
    Dacc = D.groupby('prolificIDAnon').agg({'correct':np.mean})
    bootmeans = h.bootstrap_mean(Dacc, col='correct', nIter=1000)

    obsmean = np.mean(Dacc.correct.values)
    bootmean = np.mean(bootmeans)
    lb = np.percentile(bootmeans,2.5)
    ub = np.percentile(bootmeans,97.5)
    pct25 = np.percentile(Dacc,2.5)
    pct975 = np.percentile(Dacc,97.5)
    ## merge bootstrapped accuracy estimates
    if len(human_bootstrapped_accuracy_adv)==0:
        human_bootstrapped_accuracy_adv = pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()
    else:
        human_bootstrapped_accuracy_adv = pd.concat([human_bootstrapped_accuracy_adv, pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()],axis=0)
        
## add column names        
human_bootstrapped_accuracy_adv.columns=['agent','scenario','obs_mean', 'boot_mean', 'ci_lb', 'ci_ub', 'pct_2.5', 'pct_97.5']

## save out human_bootstrapped_accuracy_adv to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_bootstrapped_accuracy_adv.to_csv(os.path.join(csv_dir, 'summary','human_accuracy_by_scenario_adv.csv'), index=False)
print('Saved to file! Done.')

In [46]:
human_bootstrapped_accuracy_adv

Unnamed: 0,agent,scenario,obs_mean,boot_mean,ci_lb,ci_ub,pct_2.5,pct_97.5
0,human,towers,0.255462,0.255217,0.230798,0.279552,0.0952381,0.47619
0,human,containment,0.21419,0.214077,0.191432,0.237617,0.0555556,0.444444
0,human,collision,0.241356,0.241289,0.21742,0.264628,0.0203125,0.479687
0,human,rollingsliding,0.267368,0.267303,0.225263,0.307368,0.0,0.73
0,human,drop,0.263441,0.2638,0.24328,0.284509,0.0833333,0.458333
0,human,linking,0.250342,0.250959,0.228454,0.275308,0.0882353,0.470588
0,human,dominoes,0.27381,0.274002,0.249575,0.296344,0.0741071,0.497321
0,human,clothiness,0.240398,0.240565,0.216216,0.263887,0.0526316,0.439474


In [None]:
## init human_bootstrapped_accuracy_hard for plotting
human_bootstrapped_accuracy_hard = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)
    
    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_33['stim_ID'].tolist())]
    
    D = D.sort_values('stim_ID') #ensure same stim order
    humans = np.array(D['gameID'].unique())

    ## compute bootstrapped sampling distributions of accuracy
    Dacc = D.groupby('prolificIDAnon').agg({'correct':np.mean})
    bootmeans = h.bootstrap_mean(Dacc, col='correct', nIter=1000)

    obsmean = np.mean(Dacc.correct.values)
    bootmean = np.mean(bootmeans)
    lb = np.percentile(bootmeans,2.5)
    ub = np.percentile(bootmeans,97.5)
    pct25 = np.percentile(Dacc,2.5)
    pct975 = np.percentile(Dacc,97.5)
    ## merge bootstrapped accuracy estimates
    if len(human_bootstrapped_accuracy_hard)==0:
        human_bootstrapped_accuracy_hard = pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()
    else:
        human_bootstrapped_accuracy_hard = pd.concat([human_bootstrapped_accuracy_hard, pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()],axis=0)
        
## add column names        
human_bootstrapped_accuracy_hard.columns=['agent','scenario','obs_mean', 'boot_mean', 'ci_lb', 'ci_ub', 'pct_2.5', 'pct_97.5']

## save out human_bootstrapped_accuracy_hard to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_bootstrapped_accuracy_hard.to_csv(os.path.join(csv_dir, 'summary','human_accuracy_by_scenario_hard.csv'), index=False)
print('Saved to file! Done.')

In [48]:
human_bootstrapped_accuracy_hard

Unnamed: 0,agent,scenario,obs_mean,boot_mean,ci_lb,ci_ub,pct_2.5,pct_97.5
0,human,towers,0.22699,0.226808,0.201384,0.253287,0.00588235,0.411765
0,human,containment,0.16126,0.161538,0.139018,0.186284,0.0,0.457692
0,human,collision,0.168085,0.168088,0.140426,0.196809,0.0,0.5
0,human,rollingsliding,0.239474,0.239158,0.194737,0.284211,0.0,0.75
0,human,drop,0.205735,0.205637,0.184229,0.226541,0.0666667,0.446667
0,human,linking,0.199128,0.19962,0.177326,0.224806,0.0416667,0.53125
0,human,dominoes,0.227444,0.227616,0.20235,0.25188,0.0526316,0.469737
0,human,clothiness,0.205405,0.205851,0.18018,0.231532,0.0,0.411667


In [None]:
## init human_bootstrapped_accuracy_chance for plotting
human_bootstrapped_accuracy_chance = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)
    
    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_67['stim_ID'].tolist())]
    
    D = D.sort_values('stim_ID') #ensure same stim order
    humans = np.array(D['gameID'].unique())

    ## compute bootstrapped sampling distributions of accuracy
    Dacc = D.groupby('prolificIDAnon').agg({'correct':np.mean})
    bootmeans = h.bootstrap_mean(Dacc, col='correct', nIter=1000)

    obsmean = np.mean(Dacc.correct.values)
    bootmean = np.mean(bootmeans)
    lb = np.percentile(bootmeans,2.5)
    ub = np.percentile(bootmeans,97.5)
    pct25 = np.percentile(Dacc,2.5)
    pct975 = np.percentile(Dacc,97.5)
    ## merge bootstrapped accuracy estimates
    if len(human_bootstrapped_accuracy_chance)==0:
        human_bootstrapped_accuracy_chance = pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()
    else:
        human_bootstrapped_accuracy_chance = pd.concat([human_bootstrapped_accuracy_chance, pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()],axis=0)
        
## add column names        
human_bootstrapped_accuracy_chance.columns=['agent','scenario','obs_mean', 'boot_mean', 'ci_lb', 'ci_ub', 'pct_2.5', 'pct_97.5']

## save out human_bootstrapped_accuracy_chance to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_bootstrapped_accuracy_chance.to_csv(os.path.join(csv_dir, 'summary','human_accuracy_by_scenario_chance.csv'), index=False)
print('Saved to file! Done.')

In [50]:
human_bootstrapped_accuracy_chance

Unnamed: 0,agent,scenario,obs_mean,boot_mean,ci_lb,ci_ub,pct_2.5,pct_97.5
0,human,towers,0.534454,0.534659,0.513015,0.555063,0.357143,0.746429
0,human,containment,0.493513,0.493151,0.471258,0.514365,0.307692,0.692308
0,human,collision,0.499516,0.498799,0.471954,0.52419,0.272727,0.727273
0,human,rollingsliding,0.567105,0.566664,0.528947,0.602632,0.25,0.875
0,human,drop,0.486943,0.487482,0.468126,0.509601,0.321429,0.667857
0,human,linking,0.51907,0.519262,0.496279,0.541174,0.3425,0.7175
0,human,dominoes,0.495726,0.495739,0.478327,0.513126,0.358974,0.641026
0,human,clothiness,0.525338,0.525698,0.505631,0.548423,0.350521,0.708333


In [None]:
## init human_bootstrapped_accuracy_easy for plotting
human_bootstrapped_accuracy_easy = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)
    
    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_100['stim_ID'].tolist())]
    
    D = D.sort_values('stim_ID') #ensure same stim order
    humans = np.array(D['gameID'].unique())

    ## compute bootstrapped sampling distributions of accuracy
    Dacc = D.groupby('prolificIDAnon').agg({'correct':np.mean})
    bootmeans = h.bootstrap_mean(Dacc, col='correct', nIter=1000)

    obsmean = np.mean(Dacc.correct.values)
    bootmean = np.mean(bootmeans)
    lb = np.percentile(bootmeans,2.5)
    ub = np.percentile(bootmeans,97.5)
    pct25 = np.percentile(Dacc,2.5)
    pct975 = np.percentile(Dacc,97.5)
    ## merge bootstrapped accuracy estimates
    if len(human_bootstrapped_accuracy_easy)==0:
        human_bootstrapped_accuracy_easy = pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()
    else:
        human_bootstrapped_accuracy_easy = pd.concat([human_bootstrapped_accuracy_easy, pd.DataFrame(['human', scenarioName, obsmean,bootmean,lb,ub, pct25, pct975]).transpose()],axis=0)
        
## add column names        
human_bootstrapped_accuracy_easy.columns=['agent','scenario','obs_mean', 'boot_mean', 'ci_lb', 'ci_ub', 'pct_2.5', 'pct_97.5']

## save out human_bootstrapped_accuracy_easy to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_bootstrapped_accuracy_easy.to_csv(os.path.join(csv_dir, 'summary','human_accuracy_by_scenario_easy.csv'), index=False)
print('Saved to file! Done.')

In [52]:
human_bootstrapped_accuracy_easy

Unnamed: 0,agent,scenario,obs_mean,boot_mean,ci_lb,ci_ub,pct_2.5,pct_97.5
0,human,towers,0.917195,0.916996,0.905877,0.927152,0.784615,0.971154
0,human,containment,0.901986,0.901909,0.890047,0.912518,0.775225,0.972973
0,human,collision,0.920393,0.920494,0.910023,0.92995,0.810593,0.974576
0,human,rollingsliding,0.930039,0.930276,0.920536,0.939798,0.817073,0.987805
0,human,drop,0.886243,0.886208,0.875085,0.896995,0.750467,0.953271
0,human,linking,0.865667,0.865608,0.850673,0.878519,0.699013,0.960526
0,human,dominoes,0.874741,0.874569,0.862189,0.887684,0.750815,0.955707
0,human,clothiness,0.848366,0.848607,0.832652,0.863926,0.693605,0.943895


### Human-human consistency across stimuli (within scenario)
We will analyze human-human consistency by computing the mean correlation between (binary) response vectors produced by each human participant across all stimuli within each scenario. 



#### Correlation

##### Adversarial

In [None]:
## init human_boot_corr for plotting
human_boot_corr = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_50['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
    
    ## get pairwise correlations
    dists = 1-scipy.spatial.distance.pdist(respMat, metric='correlation')
    corrMat = scipy.spatial.distance.squareform(dists)
    
    ## get percentiles over pairwise corrs
    pairwiseCorrs = corrMat[np.triu_indices(n=len(corrMat), k=1)]
    lb = np.percentile(pairwiseCorrs, 2.5)
    med = np.percentile(pairwiseCorrs, 50)
    ub = np.percentile(pairwiseCorrs, 97.5)  
    
    ## get pearsons r by splitting the subject pool in half and comparing mean responses
    humans = np.array(D['gameID'].unique())
    pearsons_rs = []
    for i in range(1000):
        # shuffle human indices
        shuffled_humans = humans.copy()
        np.random.shuffle(shuffled_humans)
        # get group A
        humans_A = shuffled_humans[:int(len(shuffled_humans)/2)]
        mask_A = D['gameID'].isin(humans_A)
        #get responses for the two groups
        resp_A = D[mask_A].groupby('stim_ID')['responseBool'].mean()
        resp_B = D[~mask_A].groupby('stim_ID')['responseBool'].mean()
        assert np.all(resp_A.index == resp_B.index)
        # calc r
        r,_ = scipy.stats.pearsonr(resp_A.values, resp_B.values)
        pearsons_rs.append(r)
    # get mean and intervals
    r_mean = np.mean(pearsons_rs)
    r_lb = np.percentile(pearsons_rs,2.5)
    r_ub = np.percentile(pearsons_rs,97.5)
    r_med = np.percentile(pearsons_rs,50)
        
    if len(human_boot_corr)==0:
        human_boot_corr = pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()
    else:
        human_boot_corr = pd.concat([human_boot_corr, pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()],axis=0)
        
## add column names        
human_boot_corr.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub', 'r_mean', 'r_lb', 'r_ub', 'r_med']

## save out human_boot_corr to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_corr.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCorrs_by_scenario_adv.csv'), index=False)
print('Saved to file! Done.')

##### Hard

In [None]:
## init human_boot_corr for plotting
human_boot_corr = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_33['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
    
    ## get pairwise correlations
    dists = 1-scipy.spatial.distance.pdist(respMat, metric='correlation')
    corrMat = scipy.spatial.distance.squareform(dists)
    
    ## get percentiles over pairwise corrs
    pairwiseCorrs = corrMat[np.triu_indices(n=len(corrMat), k=1)]
    lb = np.percentile(pairwiseCorrs, 2.5)
    med = np.percentile(pairwiseCorrs, 50)
    ub = np.percentile(pairwiseCorrs, 97.5)  
    
    ## get pearsons r by splitting the subject pool in half and comparing mean responses
    humans = np.array(D['gameID'].unique())
    pearsons_rs = []
    for i in range(1000):
        # shuffle human indices
        shuffled_humans = humans.copy()
        np.random.shuffle(shuffled_humans)
        # get group A
        humans_A = shuffled_humans[:int(len(shuffled_humans)/2)]
        mask_A = D['gameID'].isin(humans_A)
        #get responses for the two groups
        resp_A = D[mask_A].groupby('stim_ID')['responseBool'].mean()
        resp_B = D[~mask_A].groupby('stim_ID')['responseBool'].mean()
        assert np.all(resp_A.index == resp_B.index)
        # calc r
        r,_ = scipy.stats.pearsonr(resp_A.values, resp_B.values)
        pearsons_rs.append(r)
    # get mean and intervals
    r_mean = np.mean(pearsons_rs)
    r_lb = np.percentile(pearsons_rs,2.5)
    r_ub = np.percentile(pearsons_rs,97.5)
    r_med = np.percentile(pearsons_rs,50)
        
    if len(human_boot_corr)==0:
        human_boot_corr = pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()
    else:
        human_boot_corr = pd.concat([human_boot_corr, pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()],axis=0)
        
## add column names        
human_boot_corr.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub', 'r_mean', 'r_lb', 'r_ub', 'r_med']

## save out human_boot_corr to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_corr.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCorrs_by_scenario_hard.csv'), index=False)
print('Saved to file! Done.')

##### By chance

In [None]:
## init human_boot_corr for plotting
human_boot_corr = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_67['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
    
    ## get pairwise correlations
    dists = 1-scipy.spatial.distance.pdist(respMat, metric='correlation')
    corrMat = scipy.spatial.distance.squareform(dists)
    
    ## get percentiles over pairwise corrs
    pairwiseCorrs = corrMat[np.triu_indices(n=len(corrMat), k=1)]
    lb = np.percentile(pairwiseCorrs, 2.5)
    med = np.percentile(pairwiseCorrs, 50)
    ub = np.percentile(pairwiseCorrs, 97.5)  
    
    ## get pearsons r by splitting the subject pool in half and comparing mean responses
    humans = np.array(D['gameID'].unique())
    pearsons_rs = []
    for i in range(1000):
        # shuffle human indices
        shuffled_humans = humans.copy()
        np.random.shuffle(shuffled_humans)
        # get group A
        humans_A = shuffled_humans[:int(len(shuffled_humans)/2)]
        mask_A = D['gameID'].isin(humans_A)
        #get responses for the two groups
        resp_A = D[mask_A].groupby('stim_ID')['responseBool'].mean()
        resp_B = D[~mask_A].groupby('stim_ID')['responseBool'].mean()
        assert np.all(resp_A.index == resp_B.index)
        # calc r
        r,_ = scipy.stats.pearsonr(resp_A.values, resp_B.values)
        pearsons_rs.append(r)
    # get mean and intervals
    r_mean = np.mean(pearsons_rs)
    r_lb = np.percentile(pearsons_rs,2.5)
    r_ub = np.percentile(pearsons_rs,97.5)
    r_med = np.percentile(pearsons_rs,50)
        
    if len(human_boot_corr)==0:
        human_boot_corr = pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()
    else:
        human_boot_corr = pd.concat([human_boot_corr, pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()],axis=0)
        
## add column names        
human_boot_corr.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub', 'r_mean', 'r_lb', 'r_ub', 'r_med']

## save out human_boot_corr to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_corr.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCorrs_by_scenario_chance.csv'), index=False)
print('Saved to file! Done.')

##### Easy

In [None]:
## init human_boot_corr for plotting
human_boot_corr = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_100['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
    
    ## get pairwise correlations
    dists = 1-scipy.spatial.distance.pdist(respMat, metric='correlation')
    corrMat = scipy.spatial.distance.squareform(dists)
    
    ## get percentiles over pairwise corrs
    pairwiseCorrs = corrMat[np.triu_indices(n=len(corrMat), k=1)]
    lb = np.percentile(pairwiseCorrs, 2.5)
    med = np.percentile(pairwiseCorrs, 50)
    ub = np.percentile(pairwiseCorrs, 97.5)  
    
    ## get pearsons r by splitting the subject pool in half and comparing mean responses
    humans = np.array(D['gameID'].unique())
    pearsons_rs = []
    for i in range(1000):
        # shuffle human indices
        shuffled_humans = humans.copy()
        np.random.shuffle(shuffled_humans)
        # get group A
        humans_A = shuffled_humans[:int(len(shuffled_humans)/2)]
        mask_A = D['gameID'].isin(humans_A)
        #get responses for the two groups
        resp_A = D[mask_A].groupby('stim_ID')['responseBool'].mean()
        resp_B = D[~mask_A].groupby('stim_ID')['responseBool'].mean()
        assert np.all(resp_A.index == resp_B.index)
        # calc r
        r,_ = scipy.stats.pearsonr(resp_A.values, resp_B.values)
        pearsons_rs.append(r)
    # get mean and intervals
    r_mean = np.mean(pearsons_rs)
    r_lb = np.percentile(pearsons_rs,2.5)
    r_ub = np.percentile(pearsons_rs,97.5)
    r_med = np.percentile(pearsons_rs,50)
        
    if len(human_boot_corr)==0:
        human_boot_corr = pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()
    else:
        human_boot_corr = pd.concat([human_boot_corr, pd.DataFrame(['human', scenarioName, lb, med, ub, r_mean, r_lb, r_ub, r_med]).transpose()],axis=0)
        
## add column names        
human_boot_corr.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub', 'r_mean', 'r_lb', 'r_ub', 'r_med']

## save out human_boot_corr to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_corr.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCorrs_by_scenario_easy.csv'), index=False)
print('Saved to file! Done.')

#### Cohen's $\kappa$

##### Adversarial

In [None]:
## init human_boot_cohenk for plotting
human_boot_cohenk = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_50['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
      
    ## compute Cohen's kappa
    ## with a horrific double loop
    kappas = []
    for i in range(respMat.shape[0]): # for each participant
        for j in range(i+1,respMat.shape[0]): # compare to every participant after them
            assert i != j
            kappa = sklearn.metrics.cohen_kappa_score(respMat[i],respMat[j])
            kappas.append(kappa)
    
    ## get percentiles over pairwise corrs
    lb = np.percentile(kappas, 2.5)
    med = np.percentile(kappas, 50)
    ub = np.percentile(kappas, 97.5)  
        
    if len(human_boot_cohenk)==0:
        human_boot_cohenk = pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()
    else:
        human_boot_cohenk = pd.concat([human_boot_cohenk, pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()],axis=0)
        
## add column names        
human_boot_cohenk.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub']

## save out human_boot_cohenk to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_cohenk.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCohensKs_by_scenario_adv.csv'), index=False)
print('Saved to file! Done.')

In [58]:
human_boot_cohenk

Unnamed: 0,agent,scenario,corr_lb,corr_med,corr_ub
0,human,towers,-0.166667,0.252964,0.704225
0,human,containment,-0.111111,0.298701,0.727273
0,human,collision,-0.19403,0.283582,0.737705
0,human,rollingsliding,,,
0,human,drop,-0.166667,0.25,0.666667
0,human,linking,-0.117647,0.294118,0.647059
0,human,dominoes,-0.176464,0.180851,0.564767
0,human,clothiness,-0.202532,0.278481,0.688525


##### Hard

In [None]:
## init human_boot_cohenk for plotting
human_boot_cohenk = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_33['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
      
    ## compute Cohen's kappa
    ## with a horrific double loop
    kappas = []
    for i in range(respMat.shape[0]): # for each participant
        for j in range(i+1,respMat.shape[0]): # compare to every participant after them
            assert i != j
            kappa = sklearn.metrics.cohen_kappa_score(respMat[i],respMat[j])
            kappas.append(kappa)
    
    ## get percentiles over pairwise corrs
    lb = np.percentile(kappas, 2.5)
    med = np.percentile(kappas, 50)
    ub = np.percentile(kappas, 97.5)  
        
    if len(human_boot_cohenk)==0:
        human_boot_cohenk = pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()
    else:
        human_boot_cohenk = pd.concat([human_boot_cohenk, pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()],axis=0)
        
## add column names        
human_boot_cohenk.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub']

## save out human_boot_cohenk to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_cohenk.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCohensKs_by_scenario_hard.csv'), index=False)
print('Saved to file! Done.')

In [60]:
human_boot_cohenk

Unnamed: 0,agent,scenario,corr_lb,corr_med,corr_ub
0,human,towers,-0.148649,0.295858,0.763889
0,human,containment,-0.114286,0.41791,0.843373
0,human,collision,-0.315789,0.411765,1.0
0,human,rollingsliding,,,
0,human,drop,-0.153846,0.347826,0.736842
0,human,linking,-0.130435,0.4,0.813227
0,human,dominoes,-0.217949,0.231214,0.728571
0,human,clothiness,-0.216216,0.347826,0.842105


##### By chance

In [None]:
## init human_boot_cohenk for plotting
human_boot_cohenk = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_67['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
      
    ## compute Cohen's kappa
    ## with a horrific double loop
    kappas = []
    for i in range(respMat.shape[0]): # for each participant
        for j in range(i+1,respMat.shape[0]): # compare to every participant after them
            assert i != j
            kappa = sklearn.metrics.cohen_kappa_score(respMat[i],respMat[j])
            kappas.append(kappa)
    
    ## get percentiles over pairwise corrs
    lb = np.percentile(kappas, 2.5)
    med = np.percentile(kappas, 50)
    ub = np.percentile(kappas, 97.5)  
        
    if len(human_boot_cohenk)==0:
        human_boot_cohenk = pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()
    else:
        human_boot_cohenk = pd.concat([human_boot_cohenk, pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()],axis=0)
        
## add column names        
human_boot_cohenk.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub']

## save out human_boot_cohenk to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_cohenk.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCohensKs_by_scenario_chance.csv'), index=False)
print('Saved to file! Done.')

In [62]:
human_boot_cohenk

Unnamed: 0,agent,scenario,corr_lb,corr_med,corr_ub
0,human,towers,-0.293478,0.030303,0.377778
0,human,containment,-0.328286,0.0225564,0.38921
0,human,collision,-0.363636,0.0364964,0.454545
0,human,rollingsliding,,,
0,human,drop,-0.343434,0.0277778,0.419689
0,human,linking,-0.24,0.0243902,0.333333
0,human,dominoes,-0.268293,0.0277008,0.326693
0,human,clothiness,-0.303337,0.0234742,0.358025


##### Easy

In [None]:
## init human_boot_cohenk for plotting
human_boot_cohenk = pd.DataFrame()

for exp_ind, exp_name in enumerate(resp_paths):
    
    ## get path to response data
    path_to_data = resp_paths[exp_ind]

    ## load data and apply preprocessing
    _D = h.load_and_preprocess_data(path_to_data)
    scenarioName = _D.scenarioName.values[0]
    print('Currently analyzing the {} experiment.'.format(_D.scenarioName.values[0]))
    clear_output(wait=True)

    ## apply exclusion criteria
    D = h.apply_exclusion_criteria(_D)
    D = D[~D['gameID'].isin(bad_games)]
    D = D[D['stim_ID'].isin(correct_below_100['stim_ID'].tolist())]
    
    ## create response feature matrix (numSubs x numTrialsPerSub)
    D2 = D.sort_values(by=['prolificIDAnon','stim_ID']).reset_index(drop=True)
    numSubs = len(np.unique(D['prolificIDAnon'].values))
    numTrialsPerSub = int(len(D)/numSubs)
    respMat = np.reshape(D2['responseBool'].values, (numSubs,numTrialsPerSub)) 

    ## sanity check that the reshape operation happened correctly
    assert len([i for (i,j) in list(zip(respMat[0],D2[:150]['responseBool'].values)) if i!=j])==0    
      
    ## compute Cohen's kappa
    ## with a horrific double loop
    kappas = []
    for i in range(respMat.shape[0]): # for each participant
        for j in range(i+1,respMat.shape[0]): # compare to every participant after them
            assert i != j
            kappa = sklearn.metrics.cohen_kappa_score(respMat[i],respMat[j])
            kappas.append(kappa)
    
    ## get percentiles over pairwise corrs
    lb = np.percentile(kappas, 2.5)
    med = np.percentile(kappas, 50)
    ub = np.percentile(kappas, 97.5)  
        
    if len(human_boot_cohenk)==0:
        human_boot_cohenk = pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()
    else:
        human_boot_cohenk = pd.concat([human_boot_cohenk, pd.DataFrame(['human', scenarioName, lb, med, ub]).transpose()],axis=0)
        
## add column names        
human_boot_cohenk.columns=['agent','scenario','corr_lb', 'corr_med', 'corr_ub']

## save out human_boot_cohenk to re-plot in R
if not os.path.exists(os.path.join(csv_dir, 'summary')):
    os.makedirs(os.path.join(csv_dir, 'summary'))    
human_boot_cohenk.to_csv(os.path.join(csv_dir, 'summary','human_pairwiseCohensKs_by_scenario_easy.csv'), index=False)
print('Saved to file! Done.')

In [64]:
human_boot_cohenk

Unnamed: 0,agent,scenario,corr_lb,corr_med,corr_ub
0,human,towers,0.393504,0.747195,0.883234
0,human,containment,0.367367,0.677013,0.871294
0,human,collision,0.470162,0.755982,0.897362
0,human,rollingsliding,0.501759,0.766898,0.924401
0,human,drop,0.344248,0.638963,0.829648
0,human,linking,0.235782,0.600281,0.815534
0,human,dominoes,0.340312,0.602687,0.800962
0,human,clothiness,0.159152,0.508571,0.754286


## Conduct human-model comparisons
We will compare human and model behavior in two ways: **absolute performance** and **response pattern.**

#### **Absolute Performance** 
We will compare the accuracy of each model to the mean accuracy of humans, for each scenario. 
To do this, we will first compute estimates of mean human accuracy for each scenario and construct 95% confidence intervals for each of these estimates. 
These confidence intervals will be constructed by bootstrapping: specifically, for an experiment with N participants, we will resample N participants with replacement and compute the proportion correct for that bootstrapped sample. We will take repeat this resampling procedure 1000 times to generate a sampling distribution for the mean proportion correct. The 2.5th and 97.5th percentile will be extracted from this sampling distribution to provide the lower and upper bounds of the 95% confidence interval.

For each model, we will then compare their proportion correct (a point estimate) to the human confidence interval. 

##### Adversarial

In [66]:
# group model data by scenario
MD_by_scenario = MD_adv.groupby(['Readout Test Data','ModelID']).agg(
        {**{ 'correct':'mean' },
         **{ col:'first' for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model identifying data as well
        })

In [None]:
accuracies = {}

for scenario in sorted(MD_adv['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD_by_scenario = MD_by_scenario.loc[[scenario]]
    for _,model_row in list(_MD_by_scenario.iterrows()):
        #each model is one row of MD_adv
        human_row = human_bootstrapped_accuracy_adv.query("scenario == @scenario")
#         assert len(model_row) == len(human_row) == 1
        correct_ratio = model_row['correct']/human_row['obs_mean']
        correct_diff = model_row['correct'] - human_row['obs_mean']
        accuracies[(scenario,model_row.name[1])] = {**{
                                                    'scenario': scenario,
                                                    'ratio': float(correct_ratio), 
                                                    'diff': float(correct_diff),
                                                    'human_correct': float(human_row['obs_mean']),
                                                    'model_correct': float(model_row['correct']),
                                                    },**{col: model_row[col] for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS}} # save information for model identification
    clear_output(wait=True)

model_human_accuracies = pd.DataFrame(accuracies).transpose()  
model_human_accuracies.to_csv(os.path.join(csv_dir, 'summary','model_human_accuracies_adv.csv'), index=False)
print('Saved to file. Done!')

In [68]:
model_human_accuracies

Unnamed: 0,Unnamed: 1,scenario,ratio,diff,human_correct,model_correct,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_A_clothiness_CSWM_results.csv,clothiness,1.75148,0.180654,0.240398,0.421053,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_B_clothiness_CSWM_results.csv,clothiness,1.97041,0.233286,0.240398,0.473684,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_C_clothiness_CSWM_results.csv,clothiness,2.62722,0.391181,0.240398,0.631579,CSWM,clothiness,C,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_A_clothiness_CSWM_results.csv,clothiness,1.53254,0.128023,0.240398,0.368421,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_B_clothiness_CSWM_results.csv,clothiness,2.40828,0.338549,0.240398,0.578947,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,2.05044,0.268347,0.255462,0.52381,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_C_towers_VGGFrozenMLP_results.csv,towers,2.05044,0.268347,0.255462,0.52381,VGGFrozenMLP,towers,C,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_A_towers_VGGFrozenMLP_results.csv,towers,2.60965,0.411204,0.255462,0.666667,VGGFrozenMLP,towers,A,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,2.42325,0.363585,0.255462,0.619048,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same


##### Hard

In [69]:
# group model data by scenario
MD_by_scenario = MD_hard.groupby(['Readout Test Data','ModelID']).agg(
        {**{ 'correct':'mean' },
         **{ col:'first' for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model identifying data as well
        })

In [None]:
accuracies = {}

for scenario in sorted(MD_hard['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD_by_scenario = MD_by_scenario.loc[[scenario]]
    for _,model_row in list(_MD_by_scenario.iterrows()):
        #each model is one row of MD_adv
        human_row = human_bootstrapped_accuracy_hard.query("scenario == @scenario")
#         assert len(model_row) == len(human_row) == 1
        correct_ratio = model_row['correct']/human_row['obs_mean']
        correct_diff = model_row['correct'] - human_row['obs_mean']
        accuracies[(scenario,model_row.name[1])] = {**{
                                                    'scenario': scenario,
                                                    'ratio': float(correct_ratio), 
                                                    'diff': float(correct_diff),
                                                    'human_correct': float(human_row['obs_mean']),
                                                    'model_correct': float(model_row['correct']),
                                                    },**{col: model_row[col] for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS}} # save information for model identification
    clear_output(wait=True)

model_human_accuracies = pd.DataFrame(accuracies).transpose()  
model_human_accuracies.to_csv(os.path.join(csv_dir, 'summary','model_human_accuracies_hard.csv'), index=False)
print('Saved to file. Done!')

In [71]:
model_human_accuracies

Unnamed: 0,Unnamed: 1,scenario,ratio,diff,human_correct,model_correct,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_A_clothiness_CSWM_results.csv,clothiness,1.62281,0.127928,0.205405,0.333333,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_B_clothiness_CSWM_results.csv,clothiness,2.59649,0.327928,0.205405,0.533333,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_C_clothiness_CSWM_results.csv,clothiness,3.24561,0.461261,0.205405,0.666667,CSWM,clothiness,C,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_A_clothiness_CSWM_results.csv,clothiness,2.27193,0.261261,0.205405,0.466667,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_B_clothiness_CSWM_results.csv,clothiness,3.24561,0.461261,0.205405,0.666667,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,2.33232,0.302422,0.22699,0.529412,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_C_towers_VGGFrozenMLP_results.csv,towers,2.33232,0.302422,0.22699,0.529412,VGGFrozenMLP,towers,C,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_A_towers_VGGFrozenMLP_results.csv,towers,2.85061,0.420069,0.22699,0.647059,VGGFrozenMLP,towers,A,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,2.59146,0.361246,0.22699,0.588235,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same


##### Chance

In [72]:
# group model data by scenario
MD_by_scenario = MD_chance.groupby(['Readout Test Data','ModelID']).agg(
        {**{ 'correct':'mean' },
         **{ col:'first' for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model identifying data as well
        })

In [None]:
accuracies = {}

for scenario in sorted(MD_chance['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD_by_scenario = MD_by_scenario.loc[[scenario]]
    for _,model_row in list(_MD_by_scenario.iterrows()):
        #each model is one row of MD_adv
        human_row = human_bootstrapped_accuracy_chance.query("scenario == @scenario")
#         assert len(model_row) == len(human_row) == 1
        correct_ratio = model_row['correct']/human_row['obs_mean']
        correct_diff = model_row['correct'] - human_row['obs_mean']
        accuracies[(scenario,model_row.name[1])] = {**{
                                                    'scenario': scenario,
                                                    'ratio': float(correct_ratio), 
                                                    'diff': float(correct_diff),
                                                    'human_correct': float(human_row['obs_mean']),
                                                    'model_correct': float(model_row['correct']),
                                                    },**{col: model_row[col] for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS}} # save information for model identification
    clear_output(wait=True)

model_human_accuracies = pd.DataFrame(accuracies).transpose()  
model_human_accuracies.to_csv(os.path.join(csv_dir, 'summary','model_human_accuracies_chance.csv'), index=False)
print('Saved to file. Done!')

In [74]:
model_human_accuracies

Unnamed: 0,Unnamed: 1,scenario,ratio,diff,human_correct,model_correct,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_A_clothiness_CSWM_results.csv,clothiness,1.34834,0.182995,0.525338,0.708333,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_B_clothiness_CSWM_results.csv,clothiness,1.03108,0.0163288,0.525338,0.541667,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_C_clothiness_CSWM_results.csv,clothiness,0.951768,-0.0253378,0.525338,0.5,CSWM,clothiness,C,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_A_clothiness_CSWM_results.csv,clothiness,0.991426,-0.0045045,0.525338,0.520833,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_B_clothiness_CSWM_results.csv,clothiness,0.912111,-0.0461712,0.525338,0.479167,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,0.735063,-0.141597,0.534454,0.392857,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_C_towers_VGGFrozenMLP_results.csv,towers,0.735063,-0.141597,0.534454,0.392857,VGGFrozenMLP,towers,C,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_A_towers_VGGFrozenMLP_results.csv,towers,1.06918,0.0369748,0.534454,0.571429,VGGFrozenMLP,towers,A,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,0.868711,-0.0701681,0.534454,0.464286,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same


##### Easy

In [75]:
# group model data by scenario
MD_by_scenario = MD_easy.groupby(['Readout Test Data','ModelID']).agg(
        {**{ 'correct':'mean' },
         **{ col:'first' for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model identifying data as well
        })

In [None]:
accuracies = {}

for scenario in sorted(MD_easy['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD_by_scenario = MD_by_scenario.loc[[scenario]]
    for _,model_row in list(_MD_by_scenario.iterrows()):
        #each model is one row of MD_adv
        human_row = human_bootstrapped_accuracy_easy.query("scenario == @scenario")
#         assert len(model_row) == len(human_row) == 1
        correct_ratio = model_row['correct']/human_row['obs_mean']
        correct_diff = model_row['correct'] - human_row['obs_mean']
        accuracies[(scenario,model_row.name[1])] = {**{
                                                    'scenario': scenario,
                                                    'ratio': float(correct_ratio), 
                                                    'diff': float(correct_diff),
                                                    'human_correct': float(human_row['obs_mean']),
                                                    'model_correct': float(model_row['correct']),
                                                    },**{col: model_row[col] for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS}} # save information for model identification
    clear_output(wait=True)

model_human_accuracies = pd.DataFrame(accuracies).transpose()  
model_human_accuracies.to_csv(os.path.join(csv_dir, 'summary','model_human_accuracies_easy.csv'), index=False)
print('Saved to file. Done!')

In [77]:
model_human_accuracies

Unnamed: 0,Unnamed: 1,scenario,ratio,diff,human_correct,model_correct,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_A_clothiness_CSWM_results.csv,clothiness,0.973143,-0.0227844,0.848366,0.825581,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_B_clothiness_CSWM_results.csv,clothiness,0.575662,-0.359994,0.848366,0.488372,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_all_Contrastive_0_all_readout_C_clothiness_CSWM_results.csv,clothiness,0.630487,-0.313482,0.848366,0.534884,CSWM,clothiness,C,CSWM encoder,CSWM dynamics,,,,Contrastive,all,0,Contrastive,all,0,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,CSWM_CSWM encoder_0.0_Contrastive_all_Contrast...,Neither,Neither,True,False,,all,all,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_A_clothiness_CSWM_results.csv,clothiness,0.794962,-0.173947,0.848366,0.674419,CSWM,clothiness,A,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
clothiness,CSWM_CSWM encoder_0.0_Contrastive_clothiness_Contrastive_0_clothiness_readout_B_clothiness_CSWM_results.csv,clothiness,0.644193,-0.301854,0.848366,0.546512,CSWM,clothiness,B,CSWM encoder,CSWM dynamics,,,,Contrastive,clothiness,0,Contrastive,clothiness,0,CSWM_CSWM encoder_0.0_Contrastive_clothiness_C...,CSWM_CSWM encoder_0.0_Contrastive_same_Contras...,Neither,Neither,True,False,,same,same,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,0.828194,-0.157579,0.917195,0.759615,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no_towers_readout_C_towers_VGGFrozenMLP_results.csv,towers,0.849161,-0.138348,0.917195,0.778846,VGGFrozenMLP,towers,C,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,no_towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_no...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,all_but_this,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_A_towers_VGGFrozenMLP_results.csv,towers,0.912062,-0.0806561,0.917195,0.836538,VGGFrozenMLP,towers,A,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same
towers,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_towers_readout_B_towers_VGGFrozenMLP_results.csv,towers,0.733843,-0.244118,0.917195,0.673077,VGGFrozenMLP,towers,B,VGG,MLP,ImageNet classification,ImageNet,,,,,L2 on latent,towers,0,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_to...,VGGFrozenMLP_VGG_nan_nan_nan_L2 on latent_0_same,Transformer,MLP,,False,,,same,same


#### **Response Pattern**
We will compare the pattern of predictions generated by each model to the pattern of predictions generated by humans. 

We will do this by using two standard inter-rater reliability metrics:

##### **Correlation between average-human and model responses**
For each stimulus, we will compute the proportion of "hit" responses by humans. 
For each stimulus, we will extract the hit probability generated by models.
For each scenario (i.e., domain), we will compute the root-mean-squared-deviation between the human proportion-hit vector and the model probability-hit vector. 
To estimate variability across human samples, we will conduct bootstrap resampling (i.e., resampling data from individual participants with replacement), where for each bootstrap sample we will re-compute the correlation between the model probability-hit vector and the (bootstrapped) human proportion-hit vector.

##### **Correlation** (DEPRECATED, SUPERSEDED BY COHEN's KAPPA BELOW, WHICH CORRECTS FOR CHANCE AGREEMENT RATE)
For each pair of human participants, we will compute the correlation between their (binary) response vectors, yielding a distribution of pairwise human-human correlations. 
For each model, we will compute the correlation between its response vector and every human participant, as well as every other model. 
A model's response pattern will be considered more similar to humans' insofar as the mean model-human correlation (across humans) lies closer to the mean human-human correlation (for all pairs of humans).


#### Correlation

##### Adversarial

In [None]:
out_dict = {}

for scenario in sorted(MD_adv['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD = MD_adv[MD_adv['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario].sort_values('stim_ID')
    for model in _MD['ModelID'].unique():
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order 
        
        ## get average human response vector
        _HD_resp = _HD.groupby('stim_ID')['responseBool'].mean().reset_index()
        #in case the models have more or less responses compared to humans
        human_stim_names = set(list(_HD['stim_ID']))
        model_stim_names = set(list(_MD_model['Canon Stimulus Name']))
        joint_stim_names = human_stim_names.intersection(model_stim_names)
        if len(joint_stim_names) == 0:
            print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
            continue #ignore and move on
        if len(human_stim_names) > len(joint_stim_names):
            print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")

        #subset both models to ensure only common stims are used
        _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
        _HD_resp = _HD_resp[_HD_resp['stim_ID'].isin(joint_stim_names)]           
        ## make sure order is exactly the same
        assert len([i for (i,j) in zip(_MD_model['Canon Stimulus Name'].values, _HD_resp['stim_ID'].values) if i!=j])==0
        
        ## extract human & model responses as arrays
        model_responses = _MD_model['Predicted Prob_true'].values
        human_responses = _HD_resp['responseBool'].values

        ## compute RMSE per stimulus
        RMSE = scipy.spatial.distance.euclidean(model_responses, human_responses) / len(model_responses)
        correlation,p = scipy.stats.pearsonr(model_responses, human_responses)
        
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                          'modelID': model,
                                          'RMSE':RMSE,
                                          'pearsons_r':correlation,
                                          'p_pearsons_r':p,
                                          'num_datapoints':len(model_responses)},
                                           **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
        clear_output(wait=True)        

model_human_rmse = pd.DataFrame(out_dict).transpose()  
model_human_rmse.columns = model_human_rmse.columns.get_level_values(0) ## flatten multi-level index
model_human_rmse.reset_index(drop=True) ## get rid of multi-level index
model_human_rmse = model_human_rmse.assign(RMSE = pd.to_numeric(model_human_rmse['RMSE']))
model_human_rmse.to_csv(os.path.join(csv_dir, 'summary','model_human_pearsonsr_rmse_adv.csv'), index=False)
print('Saved to file. Done!')        

In [79]:
model_human_rmse

Unnamed: 0,Unnamed: 1,scenario,modelID,RMSE,pearsons_r,p_pearsons_r,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.118487,0.380399,0.108135,19,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.123630,0.405666,0.0848506,19,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.123831,0.2679,0.267474,19,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.136683,-0.089385,0.715934,19,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.120581,0.175214,0.473083,19,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,0.065366,0.283357,0.213242,21,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.104984,-0.160462,0.487163,21,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.076610,-0.224383,0.328156,21,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.074207,0.0827166,0.721499,21,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same


##### Hard

In [None]:
out_dict = {}

for scenario in sorted(MD_hard['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD = MD_hard[MD_hard['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario].sort_values('stim_ID')
    for model in _MD['ModelID'].unique():
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order 
        
        ## get average human response vector
        _HD_resp = _HD.groupby('stim_ID')['responseBool'].mean().reset_index()
        #in case the models have more or less responses compared to humans
        human_stim_names = set(list(_HD['stim_ID']))
        model_stim_names = set(list(_MD_model['Canon Stimulus Name']))
        joint_stim_names = human_stim_names.intersection(model_stim_names)
        if len(joint_stim_names) == 0:
            print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
            continue #ignore and move on
        if len(human_stim_names) > len(joint_stim_names):
            print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")

        #subset both models to ensure only common stims are used
        _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
        _HD_resp = _HD_resp[_HD_resp['stim_ID'].isin(joint_stim_names)]           
        ## make sure order is exactly the same
        assert len([i for (i,j) in zip(_MD_model['Canon Stimulus Name'].values, _HD_resp['stim_ID'].values) if i!=j])==0
        
        ## extract human & model responses as arrays
        model_responses = _MD_model['Predicted Prob_true'].values
        human_responses = _HD_resp['responseBool'].values

        ## compute RMSE per stimulus
        RMSE = scipy.spatial.distance.euclidean(model_responses, human_responses) / len(model_responses)
        correlation,p = scipy.stats.pearsonr(model_responses, human_responses)
        
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                          'modelID': model,
                                          'RMSE':RMSE,
                                          'pearsons_r':correlation,
                                          'p_pearsons_r':p,
                                          'num_datapoints':len(model_responses)},
                                           **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
        clear_output(wait=True)        

model_human_rmse = pd.DataFrame(out_dict).transpose()  
model_human_rmse.columns = model_human_rmse.columns.get_level_values(0) ## flatten multi-level index
model_human_rmse.reset_index(drop=True) ## get rid of multi-level index
model_human_rmse = model_human_rmse.assign(RMSE = pd.to_numeric(model_human_rmse['RMSE']))
model_human_rmse.to_csv(os.path.join(csv_dir, 'summary','model_human_pearsonsr_rmse_hard.csv'), index=False)
print('Saved to file. Done!')        

In [81]:
model_human_rmse

Unnamed: 0,Unnamed: 1,scenario,modelID,RMSE,pearsons_r,p_pearsons_r,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.139677,0.358468,0.189514,15,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.141088,0.385026,0.156433,15,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.141361,0.224354,0.421476,15,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.159232,-0.202462,0.469279,15,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.138231,0.123176,0.661861,15,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,0.075665,0.287968,0.262357,17,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.121755,-0.197367,0.447675,17,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.090175,-0.273312,0.288492,17,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.089987,-0.0207676,0.936943,17,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same


##### By chance

In [None]:
out_dict = {}

for scenario in sorted(MD_chance['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD = MD_chance[MD_chance['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario].sort_values('stim_ID')
    for model in _MD['ModelID'].unique():
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order 
        
        ## get average human response vector
        _HD_resp = _HD.groupby('stim_ID')['responseBool'].mean().reset_index()
        #in case the models have more or less responses compared to humans
        human_stim_names = set(list(_HD['stim_ID']))
        model_stim_names = set(list(_MD_model['Canon Stimulus Name']))
        joint_stim_names = human_stim_names.intersection(model_stim_names)
        if len(joint_stim_names) == 0:
            print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
            continue #ignore and move on
        if len(human_stim_names) > len(joint_stim_names):
            print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")

        #subset both models to ensure only common stims are used
        _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
        _HD_resp = _HD_resp[_HD_resp['stim_ID'].isin(joint_stim_names)]           
        ## make sure order is exactly the same
        assert len([i for (i,j) in zip(_MD_model['Canon Stimulus Name'].values, _HD_resp['stim_ID'].values) if i!=j])==0
        
        ## extract human & model responses as arrays
        model_responses = _MD_model['Predicted Prob_true'].values
        human_responses = _HD_resp['responseBool'].values

        ## compute RMSE per stimulus
        RMSE = scipy.spatial.distance.euclidean(model_responses, human_responses) / len(model_responses)
        correlation,p = scipy.stats.pearsonr(model_responses, human_responses)
        
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                          'modelID': model,
                                          'RMSE':RMSE,
                                          'pearsons_r':correlation,
                                          'p_pearsons_r':p,
                                          'num_datapoints':len(model_responses)},
                                           **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
        clear_output(wait=True)        

model_human_rmse = pd.DataFrame(out_dict).transpose()  
model_human_rmse.columns = model_human_rmse.columns.get_level_values(0) ## flatten multi-level index
model_human_rmse.reset_index(drop=True) ## get rid of multi-level index
model_human_rmse = model_human_rmse.assign(RMSE = pd.to_numeric(model_human_rmse['RMSE']))
model_human_rmse.to_csv(os.path.join(csv_dir, 'summary','model_human_pearsonsr_rmse_chance.csv'), index=False)
print('Saved to file. Done!')        

In [83]:
model_human_rmse

Unnamed: 0,Unnamed: 1,scenario,modelID,RMSE,pearsons_r,p_pearsons_r,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.069734,-0.0706479,0.63325,48,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.071340,-0.210341,0.151293,48,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.071628,-0.141392,0.337762,48,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.070063,-0.125417,0.395676,48,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.073004,-0.172294,0.241596,48,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,0.038250,0.175423,0.371919,28,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.067701,0.0711914,0.718853,28,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.038770,-0.26059,0.180465,28,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.040916,0.240188,0.218275,28,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same


##### Easy

In [None]:
out_dict = {}

for scenario in sorted(MD_easy['Readout Test Data'].unique()):
    print("Now running scenario",scenario)
    _MD = MD_easy[MD_easy['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario].sort_values('stim_ID')
    for model in _MD['ModelID'].unique():
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order 
        
        ## get average human response vector
        _HD_resp = _HD.groupby('stim_ID')['responseBool'].mean().reset_index()
        #in case the models have more or less responses compared to humans
        human_stim_names = set(list(_HD['stim_ID']))
        model_stim_names = set(list(_MD_model['Canon Stimulus Name']))
        joint_stim_names = human_stim_names.intersection(model_stim_names)
        if len(joint_stim_names) == 0:
            print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
            continue #ignore and move on
        if len(human_stim_names) > len(joint_stim_names):
            print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")

        #subset both models to ensure only common stims are used
        _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
        _HD_resp = _HD_resp[_HD_resp['stim_ID'].isin(joint_stim_names)]           
        ## make sure order is exactly the same
        assert len([i for (i,j) in zip(_MD_model['Canon Stimulus Name'].values, _HD_resp['stim_ID'].values) if i!=j])==0
        
        ## extract human & model responses as arrays
        model_responses = _MD_model['Predicted Prob_true'].values
        human_responses = _HD_resp['responseBool'].values

        ## compute RMSE per stimulus
        RMSE = scipy.spatial.distance.euclidean(model_responses, human_responses) / len(model_responses)
        correlation,p = scipy.stats.pearsonr(model_responses, human_responses)
        
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                          'modelID': model,
                                          'RMSE':RMSE,
                                          'pearsons_r':correlation,
                                          'p_pearsons_r':p,
                                          'num_datapoints':len(model_responses)},
                                           **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
        clear_output(wait=True)        

model_human_rmse = pd.DataFrame(out_dict).transpose()  
model_human_rmse.columns = model_human_rmse.columns.get_level_values(0) ## flatten multi-level index
model_human_rmse.reset_index(drop=True) ## get rid of multi-level index
model_human_rmse = model_human_rmse.assign(RMSE = pd.to_numeric(model_human_rmse['RMSE']))
model_human_rmse.to_csv(os.path.join(csv_dir, 'summary','model_human_pearsonsr_rmse_easy.csv'), index=False)
print('Saved to file. Done!')        

In [85]:
model_human_rmse

Unnamed: 0,Unnamed: 1,scenario,modelID,RMSE,pearsons_r,p_pearsons_r,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.061254,0.197765,0.0679655,86,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.066813,-0.00410664,0.970065,86,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,0.063056,0.143126,0.188627,86,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.065923,0.0713734,0.513732,86,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,0.066505,0.0545443,0.617925,86,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,0.037897,0.425219,6.81259e-06,104,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.040684,0.426632,6.30141e-06,104,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.043098,0.11687,0.237407,104,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,0.045958,0.0705898,0.476424,104,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same


##### **Cohen's kappa**

##### Adversarial

In [None]:
import time
start_time = time.time()

out_dict = {}

for scenario in sorted(MD_adv['Readout Test Data'].unique()):
    _MD = MD_adv[MD_adv['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario]
    for model in _MD['ModelID'].unique():
        measures_for_model = []
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order
        #iterate over the 100 or so participants
        for gameID in _HD['gameID'].unique():
            #get one game
            _HD_game = _HD[_HD['gameID']==gameID]
            #ensure stim order
            _HD_game = _HD_game.sort_values('stim_ID')
            #in case the models have more or less responses compared to humans
            human_stim_names = list(_HD_game['stim_ID'])
            model_stim_names = list(_MD_model['Canon Stimulus Name'])
            joint_stim_names = set(human_stim_names).intersection(set(model_stim_names))
            if len(joint_stim_names) == 0:
                print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
                continue #ignore and move on
            if len(human_stim_names) > len(joint_stim_names):
                print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")
            #subset both models to ensure only common stims are used
            _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
            _HD_game = _HD_game[_HD_game['stim_ID'].isin(joint_stim_names)]
            #pull response vector
            human_responses = np.array(_HD_game['responseBool'].astype(int)) #get human response and cast to int
            model_responses = np.array(_MD_model['Predicted Outcome'])
#             assert list(model_stim_names) == list(human_stim_names), "experimental and test stims don't match"
            assert len(model_responses) == len(human_responses), "More than 1 observation per stimulus"
            # compute Cohen's kappa
            measure = sklearn.metrics.cohen_kappa_score(model_responses,human_responses)
            measures_for_model.append(measure)
        if len(measures_for_model) == 0:
            print("⛔️ {} is missing all datapoints on human responses".format(model))
            continue
        # get percentiles over the range of measures
        lb = np.percentile(measures_for_model, 2.5)
        med = np.percentile(measures_for_model, 50)
        ub = np.percentile(measures_for_model, 97.5)
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                       'Cohens_k_lb':lb,
                                       'Cohens_k_med':med,
                                       'Cohens_k_ub':ub,
                                        'num_datapoints':len(measures_for_model)},
                                      **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
    
    elapsed_time = np.round(time.time() - start_time,1)
    print("Now running: scenario {} | model {}| elapsed time {} seconds".format(scenario, model, elapsed_time))
    clear_output(wait=True)        

model_human_CohensK = pd.DataFrame(out_dict).transpose()    
model_human_CohensK.to_csv(os.path.join(csv_dir, 'summary','model_human_CohensK_adv.csv'), index=False)
print('Saved to file. Done!')

In [87]:
model_human_CohensK

Unnamed: 0,Unnamed: 1,scenario,Cohens_k_lb,Cohens_k_med,Cohens_k_ub,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,-0.1886,0.216495,0.440824,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,-0.195039,0.182796,0.387097,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,-0.176309,0.105882,0.425245,74,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,-0.386014,-0.0106383,0.2612,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,-0.253807,0.0756757,0.454088,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,-0.108937,0.222222,0.522235,89,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,-0.392727,-0.0547945,0.270085,89,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,-0.438356,-0.145455,0.240466,89,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,-0.233078,0.0493827,0.384211,89,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same


##### Hard

In [None]:
import time
start_time = time.time()

out_dict = {}

for scenario in sorted(MD_hard['Readout Test Data'].unique()):
    _MD = MD_hard[MD_hard['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario]
    for model in _MD['ModelID'].unique():
        measures_for_model = []
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order
        #iterate over the 100 or so participants
        for gameID in _HD['gameID'].unique():
            #get one game
            _HD_game = _HD[_HD['gameID']==gameID]
            #ensure stim order
            _HD_game = _HD_game.sort_values('stim_ID')
            #in case the models have more or less responses compared to humans
            human_stim_names = list(_HD_game['stim_ID'])
            model_stim_names = list(_MD_model['Canon Stimulus Name'])
            joint_stim_names = set(human_stim_names).intersection(set(model_stim_names))
            if len(joint_stim_names) == 0:
                print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
                continue #ignore and move on
            if len(human_stim_names) > len(joint_stim_names):
                print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")
            #subset both models to ensure only common stims are used
            _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
            _HD_game = _HD_game[_HD_game['stim_ID'].isin(joint_stim_names)]
            #pull response vector
            human_responses = np.array(_HD_game['responseBool'].astype(int)) #get human response and cast to int
            model_responses = np.array(_MD_model['Predicted Outcome'])
#             assert list(model_stim_names) == list(human_stim_names), "experimental and test stims don't match"
            assert len(model_responses) == len(human_responses), "More than 1 observation per stimulus"
            # compute Cohen's kappa
            measure = sklearn.metrics.cohen_kappa_score(model_responses,human_responses)
            measures_for_model.append(measure)
        if len(measures_for_model) == 0:
            print("⛔️ {} is missing all datapoints on human responses".format(model))
            continue
        # get percentiles over the range of measures
        lb = np.percentile(measures_for_model, 2.5)
        med = np.percentile(measures_for_model, 50)
        ub = np.percentile(measures_for_model, 97.5)
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                       'Cohens_k_lb':lb,
                                       'Cohens_k_med':med,
                                       'Cohens_k_ub':ub,
                                        'num_datapoints':len(measures_for_model)},
                                      **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
    
    elapsed_time = np.round(time.time() - start_time,1)
    print("Now running: scenario {} | model {}| elapsed time {} seconds".format(scenario, model, elapsed_time))
    clear_output(wait=True)        

model_human_CohensK = pd.DataFrame(out_dict).transpose()    
model_human_CohensK.to_csv(os.path.join(csv_dir, 'summary','model_human_CohensK_hard.csv'), index=False)
print('Saved to file. Done!')

In [89]:
model_human_CohensK

Unnamed: 0,Unnamed: 1,scenario,Cohens_k_lb,Cohens_k_med,Cohens_k_ub,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,-0.301566,0.233577,0.501028,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,-0.301566,0.233577,0.501028,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,-0.332321,0.146341,0.5,74,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,-0.49101,-0.0909091,0.336283,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,-0.388889,0.0366972,0.446212,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,-0.17789,0.28169,0.539282,89,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,-0.43662,-0.0699301,0.271429,89,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,-0.45121,-0.206452,0.23167,89,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,-0.204724,0.0129032,0.367994,89,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same


##### By chance

In [None]:
import time
start_time = time.time()

out_dict = {}

for scenario in sorted(MD_chance['Readout Test Data'].unique()):
    _MD = MD_chance[MD_chance['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario]
    for model in _MD['ModelID'].unique():
        measures_for_model = []
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order
        #iterate over the 100 or so participants
        for gameID in _HD['gameID'].unique():
            #get one game
            _HD_game = _HD[_HD['gameID']==gameID]
            #ensure stim order
            _HD_game = _HD_game.sort_values('stim_ID')
            #in case the models have more or less responses compared to humans
            human_stim_names = list(_HD_game['stim_ID'])
            model_stim_names = list(_MD_model['Canon Stimulus Name'])
            joint_stim_names = set(human_stim_names).intersection(set(model_stim_names))
            if len(joint_stim_names) == 0:
                print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
                continue #ignore and move on
            if len(human_stim_names) > len(joint_stim_names):
                print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")
            #subset both models to ensure only common stims are used
            _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
            _HD_game = _HD_game[_HD_game['stim_ID'].isin(joint_stim_names)]
            #pull response vector
            human_responses = np.array(_HD_game['responseBool'].astype(int)) #get human response and cast to int
            model_responses = np.array(_MD_model['Predicted Outcome'])
            #assert list(model_stim_names) == list(human_stim_names), "experimental and test stims don't match"
            assert len(model_responses) == len(human_responses), "More than 1 observation per stimulus"
            # compute Cohen's kappa
            measure = sklearn.metrics.cohen_kappa_score(model_responses,human_responses)
            measures_for_model.append(measure)
        if len(measures_for_model) == 0:
            print("⛔️ {} is missing all datapoints on human responses".format(model))
            continue
        # get percentiles over the range of measures
        lb = np.percentile(measures_for_model, 2.5)
        med = np.percentile(measures_for_model, 50)
        ub = np.percentile(measures_for_model, 97.5)
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                       'Cohens_k_lb':lb,
                                       'Cohens_k_med':med,
                                       'Cohens_k_ub':ub,
                                        'num_datapoints':len(measures_for_model)},
                                      **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
    
    elapsed_time = np.round(time.time() - start_time,1)
    print("Now running: scenario {} | model {}| elapsed time {} seconds".format(scenario, model, elapsed_time))
    clear_output(wait=True)        

model_human_CohensK = pd.DataFrame(out_dict).transpose()    
model_human_CohensK.to_csv(os.path.join(csv_dir, 'summary','model_human_CohensK_chance.csv'), index=False)
print('Saved to file. Done!')

In [91]:
model_human_CohensK

Unnamed: 0,Unnamed: 1,scenario,Cohens_k_lb,Cohens_k_med,Cohens_k_ub,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,-0.260617,-0.015873,0.225543,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,-0.300756,-0.026738,0.19918,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,-0.202605,-0.0253165,0.170334,74,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,-0.235294,-0.0320184,0.255546,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,-0.302886,-0.0253165,0.183333,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,-0.327915,0.0421053,0.373717,89,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,-0.322462,0.0344828,0.296482,89,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,-0.356557,-0.0821256,0.236099,89,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,-0.260362,0.048,0.415584,89,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same


##### Easy

In [None]:
import time
start_time = time.time()

out_dict = {}

for scenario in sorted(MD_easy['Readout Test Data'].unique()):
    _MD = MD_easy[MD_easy['Readout Test Data'] == scenario]
    _HD = HD[HD['scenarioName'] == scenario]
    for model in _MD['ModelID'].unique():
        measures_for_model = []
        #get responses of model        
        _MD_model = _MD[_MD['ModelID'] == model]
        _MD_model = _MD_model.sort_values('Canon Stimulus Name') #ensure same stim order
        #iterate over the 100 or so participants
        for gameID in _HD['gameID'].unique():
            #get one game
            _HD_game = _HD[_HD['gameID']==gameID]
            #ensure stim order
            _HD_game = _HD_game.sort_values('stim_ID')
            #in case the models have more or less responses compared to humans
            human_stim_names = list(_HD_game['stim_ID'])
            model_stim_names = list(_MD_model['Canon Stimulus Name'])
            joint_stim_names = set(human_stim_names).intersection(set(model_stim_names))
            if len(joint_stim_names) == 0:
                print("⛔️ {} is missing all datapoints on {} human responses".format(model, len(human_stim_names)),end="\r")
                continue #ignore and move on
            if len(human_stim_names) > len(joint_stim_names):
                print("⚠️ {} is missing {} datapoints on {} human responses".format(model,len(human_stim_names) - len(joint_stim_names), len(human_stim_names)),end="\r")
            #subset both models to ensure only common stims are used
            _MD_model = _MD_model[_MD_model['Canon Stimulus Name'].isin(joint_stim_names)]            
            _HD_game = _HD_game[_HD_game['stim_ID'].isin(joint_stim_names)]
            #pull response vector
            human_responses = np.array(_HD_game['responseBool'].astype(int)) #get human response and cast to int
            model_responses = np.array(_MD_model['Predicted Outcome'])
#             assert list(model_stim_names) == list(human_stim_names), "experimental and test stims don't match"
            assert len(model_responses) == len(human_responses), "More than 1 observation per stimulus"
            # compute Cohen's kappa
            measure = sklearn.metrics.cohen_kappa_score(model_responses,human_responses)
            measures_for_model.append(measure)
        if len(measures_for_model) == 0:
            print("⛔️ {} is missing all datapoints on human responses".format(model))
            continue
        # get percentiles over the range of measures
        lb = np.percentile(measures_for_model, 2.5)
        med = np.percentile(measures_for_model, 50)
        ub = np.percentile(measures_for_model, 97.5)
        out_dict[(scenario, model)] = {**{'scenario':scenario,
                                       'Cohens_k_lb':lb,
                                       'Cohens_k_med':med,
                                       'Cohens_k_ub':ub,
                                        'num_datapoints':len(measures_for_model)},
                                      **{col:_MD_model.head(1)[col].item() for col in MODEL_COLS+h.DATASET_ABSTRACTED_COLS} #save model ID info
                                      }
    
    elapsed_time = np.round(time.time() - start_time,1)
    print("Now running: scenario {} | model {}| elapsed time {} seconds".format(scenario, model, elapsed_time))
    clear_output(wait=True)        

model_human_CohensK = pd.DataFrame(out_dict).transpose()    
model_human_CohensK.to_csv(os.path.join(csv_dir, 'summary','model_human_CohensK_easy.csv'), index=False)
print('Saved to file. Done!')

In [93]:
model_human_CohensK

Unnamed: 0,Unnamed: 1,scenario,Cohens_k_lb,Cohens_k_med,Cohens_k_ub,num_datapoints,Model,Readout Train Data,Readout Type,Encoder Type,Dynamics Type,Encoder Pre-training Task,Encoder Pre-training Dataset,Encoder Pre-training Seed,Encoder Training Task,Encoder Training Dataset,Encoder Training Seed,Dynamics Training Task,Dynamics Training Dataset,Dynamics Training Seed,ModelID,Model Kind,Visual encoder architecture,Dynamics model architecture,ObjectCentric,Supervised,SelfSupervisedLossSelfSupervisedLoss,Encoder Training Dataset Type,Dynamics Training Dataset Type,Readout Train Data Type
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_A_clothiness_OP3_results.csv,clothiness,-0.0463587,0.125344,0.245514,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_B_clothiness_OP3_results.csv,clothiness,-0.114394,-0.0248705,0.127017,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_all_Image Reconstruction_0_all_readout_C_clothiness_OP3_results.csv,clothiness,-0.0599417,0.0992491,0.238141,74,OP3,clothiness,C,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,all,0,Image Reconstruction,all,0,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,OP3_OP3 encoder_0.0_Image Reconstruction_all_I...,Neither,Neither,True,False,,all,all,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_A_clothiness_OP3_results.csv,clothiness,-0.0831245,0.0345676,0.127798,74,OP3,clothiness,A,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
clothiness,OP3_OP3 encoder_0.0_Image Reconstruction_no_clothiness_Image Reconstruction_0_no_clothiness_readout_B_clothiness_OP3_results.csv,clothiness,-0.0922432,0.0227802,0.140175,74,OP3,clothiness,B,OP3 encoder,OP3 dynamics,,,,Image Reconstruction,no_clothiness,0,Image Reconstruction,no_clothiness,0,OP3_OP3 encoder_0.0_Image Reconstruction_no_cl...,OP3_OP3 encoder_0.0_Image Reconstruction_all_b...,Neither,Neither,True,False,,all_but_this,all_but_this,same
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
towers,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_towers_per_example_svg.csv,towers,0.210721,0.30538,0.399248,89,SVG,towers,D,VGG,LSTM,,,,VAE,towers,1,VAE,towers,1,SVG_VGG_1.0_VAE_towers_VAE_1_towers_readout_D_...,SVG_VGG_1.0_VAE_same_VAE_1_same,ConvNet,Neither,False,False,,same,same,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_A_towers_per_example_svg.csv,towers,0.205471,0.323383,0.403674,89,SVG,towers,A,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_B_towers_per_example_svg.csv,towers,-0.0446429,0.0545455,0.144421,89,SVG,towers,B,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
towers,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_readout_C_towers_per_example_svg.csv,towers,-0.0863394,-0.015015,0.087588,89,SVG,towers,C,VGG,LSTM,,,,VAE,no_towers,1,VAE,no_towers,1,SVG_VGG_1.0_VAE_no_towers_VAE_1_no_towers_read...,SVG_VGG_1.0_VAE_all_but_this_VAE_1_same,ConvNet,Neither,False,False,,all_but_this,all_but_this,same
