In [None]:
import sys, os
pth = os.path.abspath("../lib")
sys.path.insert(0, pth)

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import analyses.util as util
import plots.util as putil
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

In [None]:
%%bash
scons

In [None]:
data = util.load_all()

# Inferring mass in complex physical scenes

Jessica B. Hamrick, Peter W. Battaglia, Thomas L. Griffiths, Joshua B. Tenenbaum

---

## IPE

We ran 100 IPE samples for each stimulus for 27 different values of $\kappa=\log_{10}(r)$:

In [None]:
kappas = np.asarray(data['ipe']['B'].data['kappa'].drop_duplicates())
print("Unique kappa values: {}".format(kappas))

From these IPE samples, we can compute estimates of $p(\mathrm{fall}\ \vert\ \kappa, S)$ for each stimulus $S$. In order to enforce that this curve is continuous in the domain of $\kappa$, we apply a kernel smoothing procedure. For example, the following plot shows the raw estimates of $p(\mathrm{fall}\ \vert\ \kappa, S)$ for one particular $S$ (blue dots), as well as the final smoothed estimate (black line).
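
As a rough illustration of what such a smoothing step might look like, here is a minimal sketch of a Nadaraya-Watson kernel smoother. The Gaussian kernel, the bandwidth, the $\kappa$ grid, and the function name `kernel_smooth` are all assumptions made for illustration; the text does not say which kernel or bandwidth the analysis actually uses.

```python
import numpy as np

def kernel_smooth(kappa_obs, p_obs, kappa_query, bandwidth=0.2):
    """Nadaraya-Watson estimate with a Gaussian kernel (kernel and bandwidth are assumed)."""
    kappa_obs = np.asarray(kappa_obs, dtype=float)
    p_obs = np.asarray(p_obs, dtype=float)
    kappa_query = np.asarray(kappa_query, dtype=float)
    # Pairwise differences between query points and observed kappa values
    diffs = kappa_query[:, None] - kappa_obs[None, :]
    weights = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    # Weighted average of the raw estimates, normalized per query point
    return (weights * p_obs[None, :]).sum(axis=1) / weights.sum(axis=1)

# Hypothetical raw estimates: fraction of 100 samples in which the tower fell
rng = np.random.RandomState(0)
kappas = np.linspace(-1.3, 1.3, 27)           # illustrative grid, not the real values
true_p = 1.0 / (1.0 + np.exp(-3.0 * kappas))  # made-up underlying curve
raw = rng.binomial(100, true_p) / 100.0
smooth = kernel_smooth(kappas, raw, np.linspace(-1.3, 1.3, 200))
```

Because each smoothed value is a weighted average of raw proportions, the result stays within $[0, 1]$ and can be evaluated at any $\kappa$, not only the sampled ones.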

In [None]:
i = 'tower_00029_1010011010'
fig, (ax1, ax2) = plt.subplots(1, 2)
data['ipe']['C'].plot_fall(ax1, i, color=putil.colors[0])
data['empirical']['C'].plot_fall(ax2, i, color=putil.colors[0])
fig.set_figwidth(10)
plt.tight_layout()

In [None]:
i = 0
fig, (ax1, ax2) = plt.subplots(1, 2, subplot_kw=dict(polar=True))
data['ipe']['C'].plot_direction(ax1, i, -1.0, color=putil.colors[0])
data['ipe']['C'].plot_direction(ax2, i, 1.0, color=putil.colors[2])
ymax = max(ax1.get_ylim()[1], ax2.get_ylim()[1])
ax1.set_ylim(0, ymax)
ax2.set_ylim(0, ymax)
fig.set_figwidth(10)
plt.tight_layout()

---

## Methods

We ran two experiments in which participants made judgments about the stability and mass of towers of building blocks. Both experiments were organized as follows:

* Pretest: "will it fall?" judgments on 6 original (non-mass) towers
* Block A: "will it fall?" judgments on 10 red/blue mass towers, with visual feedback
* Block B: "will it fall?" judgments on 20 red/blue mass towers, with no feedback
* Block C: "which is heavier?" judgments on the same towers as in Block B, with visual feedback
* Posttest: same as pretest, but different order

The difference between Experiment 1 and Experiment 2 was in Block C. In Experiment 1, the colors of the blocks were different on every trial. In Experiment 2, the colors were always green and purple. Thus, in Experiment 1, participants had to make inferences based on information from just a single trial, whereas in Experiment 2, participants had to integrate information about the mass over multiple trials.

The mass ratio was always the same for Blocks A and B, but could be different for Block C, giving four conditions:

* Condition 0: A, B: $r=0.1$, C: $r=0.1$
* Condition 1: A, B: $r=0.1$, C: $r=10$
* Condition 2: A, B: $r=10$, C: $r=0.1$
* Condition 3: A, B: $r=10$, C: $r=10$
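
The four conditions listed above can be enumerated programmatically, together with the corresponding $\kappa=\log_{10}(r)$ values. This is only a sketch: the variable names and the `0`/`1` coding of the color counterbalancing are hypothetical and are not taken from the experiment code.

```python
import itertools
import math

ratios = [0.1, 10.0]

# Each condition pairs the mass ratio used in Blocks A/B with the one used in Block C
conditions = list(itertools.product(ratios, ratios))
for idx, (r_ab, r_c) in enumerate(conditions):
    print("Condition {}: A, B: kappa={:+.1f}, C: kappa={:+.1f}".format(
        idx, math.log10(r_ab), math.log10(r_c)))

# Crossing with two color assignments (hypothetical 0/1 coding) doubles the count
full_design = list(itertools.product(conditions, [0, 1]))
print(len(full_design))
```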

Additionally, we counterbalanced the colors, resulting in a total of eight conditions. People were distributed into the conditions as follows:

In [None]:
cond_counts = pd.read_csv("results/condition_counts.csv")\
    .set_index(['version', 'condition', 'counterbalance'])\
    .unstack('version')
cond_counts

In [None]:
pd.read_csv("results/num_participants.csv", index_col="version").T

Participants were paid either \$1.25 (Experiment 1), \$1.00 (Experiment 2), or \$0.70 (Experiment 3):

In [None]:
pd.read_csv("results/payrate.csv").set_index('version')

---

## Results

### Can people reason with mass?

Previous research \cite{Battaglia2013} indicated that people can take information about the mass of objects into account when reasoning about physical properties like stability. The first parts of our experiment (Block A and Block B) used essentially the same design as \citeA{Battaglia2013}, so we should see the same trends here.

First, we can take a look at how well the IPE estimates of $p(\mathrm{fall}\ \vert\ \kappa_0, S)$ predict human judgments of stability. In the following plots, the $x$-axis is the IPE's estimate of $p(\mathrm{fall}\ \vert\ \kappa, S)$ for $\kappa=\kappa_0$ (center subplots) and $\kappa=0.0$ (right subplots). The $y$-axis is people's judgments of stability on a scale from 1 to 7, with 1 being less stable and 7 being more stable. The black lines connect stimuli with the same geometry.

These plots illustrate the same trend that was previously found: people are sensitive to information about mass. In the left plots, we see that people give different judgments for stimuli with identical geometry when they are told that the mass ratio is different. The IPE does the same when it is given information about mass (center plots). As such, IPE predictions that were generated without knowledge of mass are poor predictors of human judgments (right plots).

---

#### "Will it fall?" responses from block A

In [None]:
# plot "will it fall?" responses for block A
Image("figures/fall_responses_GH_A.png")

These are the corresponding Pearson correlations for the above plots (`ModelIS` is the mass-insensitive IPE, and `ModelS` is the mass-sensitive IPE).
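
For reference, a Pearson correlation of this kind could be computed as in the following sketch. The per-stimulus numbers are invented for illustration, and `pearson_r` is a hypothetical helper, not a function from the analysis code; the sign of a real correlation would also depend on how the responses are coded.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    # Covariance divided by the product of the standard deviations
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical per-stimulus values (one entry per stimulus)
model_p_fall = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
human_judgment = np.array([1.8, 3.1, 3.9, 5.2, 6.4])
r = pearson_r(model_p_fall, human_judgment)
```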

In [None]:
pd.read_csv("results/fall_response_corrs.csv").set_index(['block', 'X', 'Y']).loc['A']

---

#### "Will it fall?" responses from block B

In [None]:
# plot "will it fall?" responses for block B
Image("figures/fall_responses_GH_B.png")

These are the corresponding Pearson correlations for the above plots (`ModelIS` is the mass-insensitive IPE, and `ModelS` is the mass-sensitive IPE).

In [None]:
pd.read_csv("results/fall_response_corrs.csv").set_index(['block', 'X', 'Y']).loc['B']

---

### Can people infer mass?

Before we can argue that people infer mass in a manner consistent with the "noisy Newton" approach, we must first demonstrate that people can make inferences about mass at all. Recent research (Sanborn2013) has shown that previous results suggesting people do not make sophisticated inferences about mass (Todd1982, Gilden1994) can be explained using Bayesian inference. However, no one has yet shown whether this also holds true in more complicated, realistic scenes.

We can examine the accuracy of "which is heavier?" judgments from Block C of the experiment to see whether people actually judge the heavier color to be heavier. If they are guessing randomly, then we should see around 50% accuracy. If they are correctly inferring the mass, then their accuracy should be above 50%.
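
One way to make the comparison against 50% concrete is a one-sided exact binomial test. The sketch below is an assumption about how such a test could be run; the text does not state which test was actually used, and the counts are hypothetical.

```python
from math import comb

def binomial_p_above_chance(num_correct, num_trials, chance=0.5):
    """One-sided exact binomial p-value: P(X >= num_correct) if responses are at chance."""
    return sum(
        comb(num_trials, k) * chance ** k * (1 - chance) ** (num_trials - k)
        for k in range(num_correct, num_trials + 1)
    )

# Hypothetical counts: 28 correct "which is heavier?" judgments out of 40
p_value = binomial_p_above_chance(28, 40)
```

A small p-value here would indicate that the observed accuracy is unlikely under random guessing.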

---

#### Overall accuracy

As shown by the following table, people are (across stimuli) above chance at determining the heavier color, regardless of the mass ratio:

In [None]:
pd.read_csv("results/mass_accuracy.csv")\
    .groupby(['species', 'version'])\
    .get_group(('human', 'H'))\
    .set_index('kappa0')\
    .drop(['species', 'class', 'version'], axis=1)

---

#### Per-stimulus accuracy

We see that people are also above chance on many of the individual stimuli, though there are a few stimuli for which people are at chance:

In [None]:
Image("figures/mass_accuracy_by_stimulus.png")

Specifically, people are significantly above chance on 31 of these stimuli, and not significantly above chance on 9 of them (using Bonferroni correction for multiple comparisons):
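
As a sketch of the correction: with 31 + 9 = 40 stimuli and an assumed base significance level of 0.05 (the base level is not stated in the text), the Bonferroni-corrected per-stimulus threshold is 0.05 / 40 = 0.00125, which is consistent with the `'0.00125'` column used in the next cell. The p-values below are invented for illustration.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / num_tests

# Assumed base level of 0.05 across 40 per-stimulus tests
threshold = bonferroni_threshold(0.05, 40)

# Hypothetical per-stimulus p-values
p_values = [0.0001, 0.0010, 0.0020, 0.0400]
significant = [p < threshold for p in p_values]
num_significant = sum(significant)
```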

In [None]:
pd.read_csv("results/num_chance.csv").groupby('version')['0.00125'].sum()

---

### How do people infer mass?

#### Human vs. model accuracy

Are the stimuli from which people are better at inferring mass the same stimuli from which the model is good at inferring mass? We see that the original IPE is not well correlated with people, but the IPE based on people's "fall?" judgments is:

In [None]:
Image("figures/mass_responses_by_stimulus.png")

In [None]:
Image("figures/model_results.png")

In [None]:
pd.read_csv("results/mass_responses_by_stimulus_corrs.csv").groupby('version').get_group('H').set_index(['X', 'Y'])

In [None]:
pd.read_csv("results/mass_accuracy_by_stimulus_corrs.csv").groupby('version').get_group('H').set_index(['X', 'Y'])

In [None]:
Image("figures/fall_responses_best_parameters.png")

In [None]:
Image("figures/mass_accuracy_best_parameters.png")

In [None]:
Image("figures/best_parameters.png")

---

### Do people integrate information over time?

The following plot does not show a clear effect of learning over time -- just that people are above chance.

In [None]:
Image("figures/mass_accuracy_by_trial.png")

In [None]:
pd.read_csv("results/mass_accuracy_by_trial_corrs.csv")\
    .groupby('kappa0')\
    .get_group('all')\
    .drop('kappa0', axis=1)\
    .set_index(['version', 'num_mass_trials'])\
    .sort_index()

If we look more closely at individual participants, we see that the majority of them did eventually figure out which color was heavier. The following plot shows the fraction of participants who gave the correct answer on trial $t$ and all trials afterwards. In both conditions, the majority of participants eventually figured out which color was heavier; however, there were also some participants who never settled on the correct answer. It is possible that these participants were confused by the instructions and did not realize that the heavier color was always the same.
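
The quantity described above (correct on trial $t$ and on every later trial) can be computed with a reversed running logical AND. This is a minimal sketch on a made-up response matrix; `fraction_learned_by_trial` is a hypothetical helper name, not part of the analysis code.

```python
import numpy as np

def fraction_learned_by_trial(correct):
    """correct: (participants x trials) array of 0/1 or booleans.

    Entry t of the result is the fraction of participants who were correct
    on trial t and on every trial after it.
    """
    correct = np.asarray(correct, dtype=bool)
    # Reverse the trial axis, take a running AND, then reverse back
    correct_from_here_on = np.logical_and.accumulate(correct[:, ::-1], axis=1)[:, ::-1]
    return correct_from_here_on.mean(axis=0)

# Hypothetical responses for three participants over four trials
responses = np.array([
    [0, 1, 1, 1],  # correct from the second trial onwards
    [1, 0, 1, 1],  # correct from the third trial onwards
    [1, 1, 1, 0],  # wrong on the last trial, so never counted as settled
])
learned = fraction_learned_by_trial(responses)
```

By construction this curve can only stay flat or rise across trials, so it shows when participants settled on the correct answer rather than their trial-by-trial accuracy.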

In [None]:
Image("figures/num_learned_by_trial.png")

In [None]:
Image("figures/mass_accuracy_by_trial_with_model.png")

To look at the data from a slightly different angle, we can compare three different inference models:

* `chance` -- guesses uniformly at random, reflecting the hypothesis that people do not make any inferences about mass
* `learning` -- updates its beliefs according to Bayes' rule, using physical knowledge from the IPE
* `static` -- uses physical knowledge from the IPE, but only considers information from the most recent trial (does not update beliefs)
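
To make the distinction between the three models concrete, here is a minimal sketch over two mass hypotheses ($\kappa=-1$ and $\kappa=1$). It assumes that each trial supplies a likelihood of the observed outcome under each hypothesis; in the actual analysis these would come from the IPE, whereas the numbers below are invented, and the function names are hypothetical.

```python
import numpy as np

def learning_belief(likelihoods, prior=(0.5, 0.5)):
    """Belief after each trial, accumulating every trial so far via Bayes' rule."""
    log_post = np.log(np.asarray(prior, dtype=float)) + np.cumsum(
        np.log(np.asarray(likelihoods, dtype=float)), axis=0)
    log_post -= log_post.max(axis=1, keepdims=True)  # for numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def static_belief(likelihoods, prior=(0.5, 0.5)):
    """Belief on each trial using only that trial's likelihood (no carry-over)."""
    post = np.asarray(prior, dtype=float) * np.asarray(likelihoods, dtype=float)
    return post / post.sum(axis=1, keepdims=True)

def chance_belief(num_trials):
    """Uniform belief on every trial."""
    return np.full((num_trials, 2), 0.5)

# Hypothetical p(observation on trial t | kappa) for kappa = -1 and kappa = +1
likelihoods = np.array([[0.4, 0.6], [0.3, 0.7], [0.5, 0.5]])
b_learning = learning_belief(likelihoods)
b_static = static_belief(likelihoods)
b_chance = chance_belief(len(likelihoods))
```

In this toy example the third trial is uninformative, so the `static` belief returns to 0.5 while the `learning` belief retains what it gained from the first two trials.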

Both models that use knowledge from the IPE are better explanations of people's behavior than a model that guesses randomly:
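
A comparison of this kind can be expressed as a log likelihood ratio against chance. The sketch below assumes the ratio is defined as $\log p(\mathrm{responses} \mid \mathrm{model}) - \log p(\mathrm{responses} \mid \mathrm{chance})$ for binary responses; the exact definition used in the analysis is not given in the text, and the numbers are hypothetical.

```python
import numpy as np

def log_lh_ratio(p_model, responses):
    """Log likelihood ratio of a model versus uniform guessing.

    p_model: model probability of response 1 on each trial.
    responses: observed binary responses (0 or 1).
    """
    p = np.asarray(p_model, dtype=float)
    y = np.asarray(responses, dtype=float)
    # Bernoulli log likelihood of the observed responses under the model
    log_lh_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Under chance, every binary response has probability 0.5
    log_lh_chance = len(y) * np.log(0.5)
    return float(log_lh_model - log_lh_chance)

# Hypothetical model probabilities and one participant's responses
ratio = log_lh_ratio([0.8, 0.7, 0.6], [1, 1, 0])
```

Positive values favor the model over chance, and negative values favor chance.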

In [None]:
Image("figures/model_log_lh_ratio_by_trial.png")

In [None]:
pd.read_csv("results/model_log_lh_ratios.csv")

In [None]:
Image("figures/model_params.png")

In [None]:
store_pth = "results/model_belief_by_trial.h5"
store = pd.HDFStore(store_pth, mode="r")

In [None]:
keys = store.keys()

In [None]:
len(keys)

In [None]:
keys