# Analysis of human error rate under different experiment parameters across scenarios

**The purpose of this notebook is to:** 
* Apply preprocessing to human behavioral data
* Visualize distribution and compute summary statistics over **human** physical judgments
* Conduct error analysis to identify the scenarios/instances on which humans made many errors
* Conduct more detailed error analysis to identify the scenarios/instances on which humans and models diverged the most

**This notebook depends on:**
* Running `./download_results.py` (PUBLIC USE)
* Downloading all mp4 files from this [link](https://physics-benchmarking-neurips2021-dataset.s3.amazonaws.com/Physion.zip) and placing them in the `results/videos` directory at the project root (`video_dir` below)

## setup

#### Load packages

In [None]:
import os
import sys
import urllib, io

sys.path.append('./analysis_helpers')
from importlib import reload
from analysis_helpers import *
from display_trials import *

import numpy as np
import scipy.stats as stats
import pandas as pd
from IPython.display import Video
from ipywidgets import Output, GridspecLayout

from IPython import display

import pymongo as pm
from collections import Counter
import json
import re
import ast

from PIL import Image, ImageOps, ImageDraw, ImageFont

from io import BytesIO
import base64

from tqdm.notebook import tqdm

import matplotlib
from matplotlib import pylab, mlab, pyplot
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.pylabtools import figsize, getfigs
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
# NOTE: matplotlib >= 3.6 renames this style to 'seaborn-v0_8-white'
plt.style.use('seaborn-white')

import seaborn as sns

import scipy.stats
import random

from IPython.display import clear_output

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

#### options

In [None]:
# display all columns
pd.set_option('display.max_columns', None)

# seaborn plotting themes
sns.set_context('talk')
sns.set_style("whitegrid")

#### set up paths and directories

In [None]:
## directory & file hierarchy
proj_dir = os.path.abspath('..')
datavol_dir = os.path.join(proj_dir,'data')
analysis_dir =  os.path.abspath('.')
results_dir = os.path.join(proj_dir,'results')
plot_dir = os.path.join(results_dir,'plots')
csv_dir = os.path.join(results_dir,'csv')
video_dir = os.path.join(results_dir,'videos')
json_dir = os.path.join(results_dir,'json')
exp_dir = os.path.abspath(os.path.join(proj_dir,'behavioral_experiments'))
png_dir = os.path.abspath(os.path.join(datavol_dir,'png'))

## add helpers to python path
if os.path.join(proj_dir,'stimuli') not in sys.path:
    sys.path.append(os.path.join(proj_dir,'stimuli'))
    
if not os.path.exists(results_dir):
    os.makedirs(results_dir)
    
if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)   
    
if not os.path.exists(csv_dir):
    os.makedirs(csv_dir)   
    
if not os.path.exists(video_dir):
    os.makedirs(video_dir) 
    
## add helpers to python path
if os.path.join(analysis_dir,'utils') not in sys.path:
    sys.path.append(os.path.join(analysis_dir,'utils'))   

def make_dir_if_not_exists(dir_name):   
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    return dir_name

## create directories that don't already exist        
result = [make_dir_if_not_exists(x) for x in [results_dir,plot_dir,csv_dir]]

### load in data

In [None]:
from experiment_meta import *
HEM = pd.DataFrame(NEURIPS2021_EXPS) # HEM = "human experiment metadata"
HEM

In [None]:
## get paths to all human response data
data_paths = [os.path.join(csv_dir,'humans',i) for i in os.listdir(os.path.join(csv_dir,'humans'))]
resp_paths = [i for i in data_paths if os.path.basename(i).split('-')[0]=='human_responses']
assert len(resp_paths)==8

In [None]:
## load and concatenate the human response dataframes from all response files
exp_ind = 0
d = pd.concat([pd.read_csv(p) for p in resp_paths])

## some utility vars
d['scenarioName'] = d['study'].apply(lambda x:x.split('_')[0])
# colnames_with_variable_entries = [col for col in sorted(d.columns) if len(np.unique(d[col]))>1]
colnames = ['scenarioName','study','gameID','trialNum','prolificIDAnon','stim_ID','response','target_hit_zone_label','correct','choices','rt']
# colnames = ['gameID','trialNum','stim_ID','response','target_hit_zone_label','correct','choices','rt']

## subset dataframe by colnames of interest
_D = d[colnames]

## preprocess RTs (subtract 2500ms presentation time, log transform)
_D = _D.assign(RT = _D['rt'] - 2500) 
_D = _D.assign(logRT = np.log(_D['RT']))
_D = _D.drop(columns=['rt'])

## convert responses to boolean
binary_mapper = {'YES':True, 'NO':False}
## (no `axis` keyword here: `assign` would create a spurious column named 'axis')
_D = _D.assign(responseBool = _D['response'].apply(lambda x: binary_mapper[x]))

# print('Currently analyzing the {} experiment.'.format(scenarioName))
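
The response-time preprocessing above takes the log of `rt - 2500`. Any response logged at or before the 2500 ms mark gives a non-positive value, for which `np.log` returns `-inf` or `NaN` with a runtime warning. A minimal sketch of one way to make that case explicit, using made-up values and a hypothetical helper `preprocess_rt` (not part of `analysis_helpers`):

```python
import numpy as np
import pandas as pd

PRESENTATION_MS = 2500  # presentation time subtracted in the cell above

def preprocess_rt(df, rt_col="rt"):
    """Return a copy with RT (ms after presentation) and logRT columns.

    Hypothetical helper for illustration only: non-positive RTs get
    logRT = NaN instead of -inf, so they are easy to find and drop later.
    """
    out = df.copy()
    out["RT"] = out[rt_col] - PRESENTATION_MS
    # mask non-positive values before taking the log
    out["logRT"] = np.log(out["RT"].where(out["RT"] > 0))
    return out.drop(columns=[rt_col])

# toy data (not real responses)
toy = pd.DataFrame({"rt": [3500.0, 2500.0, 2400.0]})
processed = preprocess_rt(toy)
print(processed)
```

Whether such trials should be dropped or kept is a separate decision; the sketch only makes them visible as missing values.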

### Data exclusion criteria (from `preregistration_neurips2021.md`)

Data from an entire experimental session will be excluded if:
* the responses contain a sequence with an unusually long streak, defined as occurring less than 2.5% of the time under random responding
* the responses contain a sequence of at least 24 trials alternating between "yes" and "no" responses
* fewer than 4 out of 10 familiarization trials are correct (i.e., 30% correct or lower)
* the mean accuracy for that participant is more than 3 standard deviations below the median accuracy across all participants for that scenario
* the mean log-transformed response time for that participant is more than 3 standard deviations above the median log-transformed response time across all participants for that scenario
 
Excluded sessions will be flagged. Flagged sessions will not be included in the main analyses. We will also conduct our planned analyses with the flagged sessions included to investigate the extent to which the outcomes of the main analyses change when these sessions are included. Specifically, we will fit a statistical model to all sessions and estimate the effect of a session being flagged on accuracy. 
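
The criteria above can be illustrated with a small sketch. This is not the code in `apply_exclusion_criteria`, whose internals are not shown here. All function names below are hypothetical. The sketch assumes that "streak" means a run of identical responses, that "random responding" means independent 50/50 yes/no choices, and that a session has 150 trials (an arbitrary illustrative number).

```python
import numpy as np
import pandas as pd

def longest_streak(responses):
    """Length of the longest run of identical consecutive responses."""
    best = run = 0
    prev = object()  # sentinel that compares unequal to any response
    for r in responses:
        run = run + 1 if r == prev else 1
        best = max(best, run)
        prev = r
    return best

def longest_alternation(responses):
    """Length of the longest run in which every response differs from the one before."""
    responses = list(responses)
    if not responses:
        return 0
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur != prev else 1
        best = max(best, run)
    return best

def streak_cutoff(n_trials, alpha=0.025, n_sim=2000, seed=0):
    """Smallest streak length reached in fewer than `alpha` of simulated random sessions."""
    rng = np.random.default_rng(seed)
    sims = rng.integers(0, 2, size=(n_sim, n_trials))  # assumed 50/50 responding
    longest = np.array([longest_streak(row.tolist()) for row in sims])
    for length in range(1, n_trials + 1):
        if np.mean(longest >= length) < alpha:
            return length
    return n_trials + 1

def flag_low(values, n_sd=3):
    """True where a value is more than n_sd standard deviations below the median."""
    return values < values.median() - n_sd * values.std()

def flag_high(values, n_sd=3):
    """True where a value is more than n_sd standard deviations above the median."""
    return values > values.median() + n_sd * values.std()

# toy sessions (not real data)
alternating = ["YES", "NO"] * 12
streaky = ["YES"] * 10 + ["NO"] * 3
print(longest_alternation(alternating), longest_streak(streaky))

cutoff = streak_cutoff(n_trials=150)  # 150 trials is an assumption
print("streak cutoff for 150 trials:", cutoff)

# per-participant mean accuracy within one scenario (toy values)
acc = pd.Series([0.8] * 20 + [0.1])
print(flag_low(acc).sum(), "participant(s) flagged for low accuracy")
```

In practice `flag_low` and `flag_high` would be applied per scenario, for example on per-participant means computed within each `scenarioName` group.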

In [None]:
from analysis_helpers import *
D = apply_exclusion_criteria(_D)

## Visualize distribution and compute summary statistics over human physical judgments

### Human accuracy by scenario

#### Dominoes

In [None]:
# get all dominoes trials
dominoes = D[D["scenarioName"] == "dominoes"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
# draw the accuracy distribution
Dacc = dominoes.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for dominoes')

Dacc['answer'] = dominoes.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth
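
The three-way accuracy split used in the subsections for each scenario can be sketched with `pd.cut`. The helper `label_accuracy_bins` and the bin labels are hypothetical, and this is not necessarily how `plot_by_3` partitions the stimuli; the toy data below is made up.

```python
import pandas as pd

def label_accuracy_bins(acc):
    """Assign each per-stimulus accuracy to one of three bins.

    (a) fail:    0   <= p <= 1/3
    (b) chance:  1/3 <  p <= 2/3
    (c) succeed: 2/3 <  p <= 1
    """
    return pd.cut(
        acc,
        bins=[0, 1 / 3, 2 / 3, 1],
        labels=["fail", "chance", "succeed"],
        include_lowest=True,  # so that p == 0 falls into the first bin
    )

# toy trials: five stimuli, four responses each (not real data)
toy = pd.DataFrame({
    "stim_ID": [s for s in "abcde" for _ in range(4)],
    "correct": [False] * 4
             + [True, False, False, False]
             + [True, True, False, False]
             + [True, True, True, False]
             + [True] * 4,
})
toy_acc = toy.groupby("stim_ID")["correct"].mean()
labels = label_accuracy_bins(toy_acc)
print(labels.value_counts())
```

Filtering `Dacc` on the resulting labels would give the stimuli in each bin, for instance those in the "fail" bin for manual inspection.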

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

Problematic instances with low accuracy:
- Total/ bad occlusion
- Block stuck at weird positions


##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

#### Support

In [None]:
# get all support trials (stored under scenarioName "towers")
support = D[D["scenarioName"] == "towers"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
Dacc = support.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for tower')

Dacc['answer'] = support.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

#### Collide

In [None]:
# get all collide trials
collide = D[D["scenarioName"] == "collision"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
Dacc = collide.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for collision')

Dacc['answer'] = collide.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

Problematic instances with low accuracy:
- ...



##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

#### Contain

In [None]:
# get all contain trials
contain = D[D["scenarioName"] == "containment"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
Dacc = contain.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for containment')

Dacc['answer'] = contain.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth

In [None]:
Dacc[Dacc["correct"] <= 1/3]

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

Problematic instances with low accuracy:
- ...
- Physics broke

##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

#### Drop

In [None]:
# get all drop trials
drop = D[D["scenarioName"] == "drop"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
Dacc = drop.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for drop')

Dacc['answer'] = drop.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

Problematic instances with low accuracy:
- ...



##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

#### Link

In [None]:
# get all link trials
link = D[D["scenarioName"] == "linking"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
Dacc = link.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for linking')

Dacc['answer'] = link.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

Problematic instances with low accuracy:
- ...



##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

#### Roll

In [None]:
# get all roll trials
roll = D[D["scenarioName"] == "rollingsliding"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
Dacc = roll.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for rolling & sliding')

Dacc['answer'] = roll.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

Problematic instances with low accuracy:
- ...



##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

#### Drape

In [None]:
# get all drape trials
drape = D[D["scenarioName"] == "clothiness"]

#### 33.3%/ 66.7%/ 100% Accuracy Split 

In [None]:
Dacc = drape.groupby('stim_ID').agg({'correct':np.mean})
h = sns.histplot(data=Dacc, x='correct', bins=30, stat='probability')
t = plt.title('Accuracy distribution across stimuli for drape')

Dacc['answer'] = drape.groupby('stim_ID')['target_hit_zone_label'].first() # add ground truth

##### (a) systematically fail (0 <= p <= 33.3%)

In [None]:
# plot all the trials with accuracies lower than 33.3%

plot_by_3(Dacc, video_dir)

Problematic instances with low accuracy:
- ...



##### (b) are close to chance (33.3% < p <= 66.7%)

##### (c) consistently succeed (66.7% < p <= 100%)

## Conclusion

Here is a list of experiment parameters that are problematic:
- Dominoes
    - Bad occlusion (can't see any domino, or portion important for inference is blocked)