
In [1]:
# Import analysis packages
%matplotlib inline
import stan as ps
import numpy as np
import pandas as pd
import seaborn as sns
import arviz as az
import matplotlib.pyplot as plt
import scipy.stats as ss
from patsy import dmatrix
from patsy.contrasts import Treatment, Sum

# Importing nest_asyncio is only necessary to run pystan in Jupyter Notebooks.
import nest_asyncio
nest_asyncio.apply();

In [2]:
from IPython.core.display import HTML as Center

Center(""" <style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style> """)

# Steps of Bayesian data analysis

<font size = "3"> Kruschke (2015) offers a step-by-step formulation of how to conduct a Bayesian analysis:

1. Identify the relevant data for the question under investigation.

2. Define the descriptive (mathematical) model for the data.

3. Specify the priors for the model. If the goal is scientific publication, the priors need to be acceptable to a skeptical audience. Prior predictive checks can be used to establish that the priors are reasonable.

4. Using Bayes' rule, estimate the posterior for the parameters of the model from the likelihood and the priors. Then use the posterior to conduct your inferences.

5. Conduct model checks, i.e. posterior predictive checks.</font> 

<font size = "1">This notebook will follow this approach generally.</font> 

#  Step 1 - Identify the relevant data for the question under investigation

## Study data/description

The psychological literature suggests that both physical distance and psychological factors can influence people's perceptions of distance. For example, the perceived distance to walk to a location can differ depending on whether a person has recently eaten or is in a caloric deficit.

The data analysed below are from Maglio and Polman's (2014) study of this general distance perception phenomenon, which specifically investigated the effect of spatial orientation on perceived distance. The authors recruited 202 participants on the green line of the Toronto underground. Half of the passengers were headed eastbound and the other half westbound. The participants were then independently sub-divided to give their perceived distance from 4 different locations. For the analysis below, two of these stations have been dropped, both for simplicity of exposition and to demonstrate the most common experimental design in psychology, the 2x2. The Bayesian model is equivalent to a 2x2 between-subjects ANOVA and uses a priori contrasts in a linear model (Schad, Vasishth, Hohenstein and Kliegl, 2020).

In [3]:
url = 'https://raw.githubusercontent.com/ebrlab/Statistical-methods-for-research-workers-bayes-for-psychologists-and-neuroscientists/master/wip/Data/Maglio%20and%20Polman%202014.csv'
df = pd.read_csv(url)

In [4]:
%%capture
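# Reduce the dataset to two stations to demonstrate a 2x2 design, for simplicity's sake.
# .copy() makes dfReduced an independent DataFrame, so later assignments
# do not act on a slice of df.
dfReduced = df[df.station < 3].copy()

# Convert the grouping variables to type 'category' for building the design matrix
dfReduced['direction'] = dfReduced['direction'].astype('category')
dfReduced['orientation'] = dfReduced['orientation'].astype('category')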

In [32]:
dfReduced

Unnamed: 0,direction,orientation,station,subjective_distance
0,EAST,1,1,5
1,EAST,1,1,4
2,EAST,1,1,3
3,EAST,1,1,3
4,EAST,1,1,4
...,...,...,...,...
97,WEST,2,2,1
98,WEST,2,2,1
99,WEST,2,2,2
100,WEST,2,2,1


In [5]:
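# Reorder categories so that 2 comes first (the first level is the reference level).
# The inplace argument was removed in pandas 2.0, so assign the result back.
dfReduced['orientation'] = dfReduced['orientation'].cat.reorder_categories([2, 1])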

In [26]:
TbTBA = """
data{

int N; // Number of observations
vector[N] y;
int K; // Number of predictors
matrix[N, K] X; // The design matrix

}

transformed data{
real SD;
real Mean = mean(y);
SD = sd(y);
}

parameters{

vector[K] beta;
real<lower = 0> sigma;

}

model{

// Priors
beta[1] ~ normal(Mean, 2.5 * SD);
beta[2:K] ~ normal(0, 10);
sigma ~ normal(0, 10);

// Likelihood
y ~ normal(X * beta, sigma);
}

generated quantities{
vector[K - 1] effect_sizes = beta[2:K]/sigma;

// Posterior predictive draws (array syntax expected by recent Stan releases)
array[N] real yrep;
yrep = normal_rng(X * beta, sigma);
}


"""
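One way to judge whether the priors in the model above are reasonable is a prior predictive simulation: draw parameters from the priors, push them through a design matrix, and inspect the data they imply. The sketch below does this in NumPy rather than Stan. The values `y_mean` and `y_sd` are hypothetical placeholders for `Mean` and `SD`, and the balanced 2x2 design with 25 observations per cell is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the Mean and SD computed in transformed data
y_mean, y_sd = 3.0, 1.5

# Assumed balanced 2x2 design with +/-0.5 effect coding, 25 observations per cell
a = np.repeat([-0.5, 0.5], 50)
b = np.tile(np.repeat([-0.5, 0.5], 25), 2)
X = np.column_stack([np.ones(100), a, b, a * b])

n_sims = 1000
# Draw from the priors used in the Stan model block
beta = np.column_stack([
    rng.normal(y_mean, 2.5 * y_sd, n_sims),   # beta[1] ~ normal(Mean, 2.5 * SD)
    rng.normal(0, 10, (n_sims, 3)),           # beta[2:K] ~ normal(0, 10)
])
sigma = np.abs(rng.normal(0, 10, n_sims))     # half-normal, since sigma >= 0

# One simulated data set per prior draw
y_sim = rng.normal(beta @ X.T, sigma[:, None])

print("Simulated y, 2.5% to 97.5% range:", np.percentile(y_sim, [2.5, 97.5]))
```

If most simulated values fall far outside the range the response can take, that suggests the priors are wider than the problem warrants.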

In [27]:
# Generate design matrix for the general linear model 
X = np.asarray(dmatrix("~ direction + orientation", data = dfReduced))

# Convert contrasts to effect coding
X[:,1] = X[:,1] - .5
X[:,2] = (X[:,2] - .5) 

# Generate the interaction term
int_t = X[:,1] * X[:,2]

# Add interaction term to the design matrix
X =  np.c_[X, int_t]

# Generate python dictionary 
data = {'N': len(dfReduced),
        'y': dfReduced['subjective_distance'].values,
         'X': X,
         'K': np.shape(X)[1]}
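To see what the ±0.5 effect coding does to the coefficients, the toy example below fits a noiseless, balanced 2x2 design by least squares. The cell means are hypothetical and chosen only to keep the arithmetic easy to follow; this is a sketch of the coding scheme, not a re-analysis of the study data.

```python
import numpy as np

# Hypothetical cell means for a balanced 2x2 design (factor A x factor B)
cell_means = {(0, 0): 2.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 7.0}

# One noiseless observation per cell
A = np.array([k[0] for k in cell_means], dtype=float)
B = np.array([k[1] for k in cell_means], dtype=float)
y = np.array(list(cell_means.values()))

# Shift the 0/1 dummies to -0.5/+0.5 and build the interaction column
a, b = A - 0.5, B - 0.5
X = np.column_stack([np.ones(4), a, b, a * b])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)

# A full-rank design matrix is needed for the coefficients to be identified
rank = np.linalg.matrix_rank(X)
print(rank)
```

With these cell means the intercept is the grand mean (4), the two slopes are the main-effect differences between level means (3 and 2), and the interaction coefficient is the difference of differences (2). If two predictor columns were perfectly confounded, the rank would fall below the number of columns and the posterior in the unidentified directions would be governed by the priors alone.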

In [28]:
# Compile stan model into C++ code.
sm = ps.build(TbTBA, data = data)

Building...



Building: found in cache, done.

In [29]:
fit = sm.sample(num_chains = 4)

Sampling: 100% (8000/8000), done.
Messages received during sampling:
  Gradient evaluation took 2.7e-05 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 1.6e-05 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 9e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 1.3e-05 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
  Adjust your expectations accordingly!


In [30]:
az.summary(fit, var_names=['beta', 'sigma', 'effect_sizes'])

| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta[0] | 2.528 | 1.927 | -0.684 | 6.531 | 0.045 | 0.034 | 1807.0 | 1999.0 | 1.0 |
| beta[1] | -0.536 | 7.052 | -13.461 | 13.178 | 0.182 | 0.143 | 1495.0 | 1350.0 | 1.0 |
| beta[2] | 0.531 | 7.054 | -13.482 | 13.097 | 0.182 | 0.147 | 1492.0 | 1346.0 | 1.0 |
| beta[3] | -0.59 | 7.697 | -13.512 | 15.217 | 0.181 | 0.147 | 1809.0 | 1918.0 | 1.0 |
| sigma | 1.104 | 0.075 | 0.968 | 1.25 | 0.002 | 0.001 | 2221.0 | 2301.0 | 1.0 |
| effect_sizes[0] | -0.498 | 6.426 | -13.348 | 10.899 | 0.166 | 0.129 | 1502.0 | 1353.0 | 1.0 |
| effect_sizes[1] | 0.473 | 6.426 | -12.404 | 11.762 | 0.166 | 0.133 | 1503.0 | 1387.0 | 1.0 |
| effect_sizes[2] | -0.547 | 7.011 | -13.532 | 12.555 | 0.165 | 0.133 | 1815.0 | 1989.0 | 1.0 |
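The `effect_sizes` rows are standardised draw by draw (`beta / sigma` in the generated quantities block), not computed from the posterior means, and the `hdi_3%` and `hdi_97%` columns bound a 94% highest density interval. The sketch below illustrates both ideas on synthetic draws. The draws are made up for illustration and are not the draws from the fit above, and the `hdi` helper is a simple empirical version written for this example, not ArviZ's own implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-ins for posterior draws; not the draws from the fit above
beta_draws = rng.normal(0.5, 0.2, 4000)
sigma_draws = np.abs(rng.normal(1.1, 0.08, 4000))

# Standardise draw by draw, as in the generated quantities block
effect_draws = beta_draws / sigma_draws

def hdi(draws, prob=0.94):
    """Narrowest interval containing `prob` of the draws."""
    x = np.sort(draws)
    n_in = int(np.ceil(prob * len(x)))
    # Width of every candidate interval that spans n_in sorted draws
    widths = x[n_in - 1:] - x[:len(x) - n_in + 1]
    start = int(np.argmin(widths))
    return x[start], x[start + n_in - 1]

low, high = hdi(effect_draws)
print(f"mean = {effect_draws.mean():.3f}, 94% HDI = [{low:.3f}, {high:.3f}]")
```

Working draw by draw carries the uncertainty in both `beta` and `sigma` through to the standardised effect.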


In [None]:
# Convert the PyStan fit object to an ArviZ InferenceData object.
idata = az.from_pystan(posterior=fit, posterior_model=sm,
                       posterior_predictive=['yrep'], observed_data=['y'])

## Posterior predictive checks

In [None]:
# Plot posterior simulated data sets for posterior predictive check
az.plot_ppc(idata, data_pairs = {"y" : "yrep"}, num_pp_samples= 100);

As the posterior predictive check above shows, there is quite a drastic misfit of the model. What is happening here? The reason for such a poor fit is that the data analysed are ordinal. The normal likelihood used here, which is also one of the assumptions of classical ANOVA, is therefore inappropriate for this type of data. However, the use of such inappropriate models is common in the psychological sciences (Liddell & Kruschke, 2018). There are arguments in the literature that ANOVA can be robust when analysing ordinal data. The authors of this notebook make no counter-arguments on this issue. We only direct readers to more appropriate and better fitting alternatives, generalised (ordinal) linear models (Liddell & Kruschke, 2018), which we demonstrate in another notebook.
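The mismatch between ordinal data and a normal likelihood can be illustrated without Stan. The sketch below draws hypothetical ratings on an assumed 1 to 7 scale, then simulates replicated data from a normal distribution with the same mean and standard deviation. The response probabilities and the scale are assumptions for illustration, not values taken from the study.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical ordinal responses on an assumed 1-7 rating scale
levels = np.arange(1, 8)
probs = np.array([0.30, 0.30, 0.20, 0.10, 0.05, 0.03, 0.02])
y = rng.choice(levels, size=500, p=probs)

# Replicated data from a normal model matched to the sample mean and SD
y_rep = rng.normal(y.mean(), y.std(ddof=1), size=500)

# Share of replicated values that no respondent could have produced
outside = np.mean((y_rep < 1) | (y_rep > 7))
print(f"replicated values outside 1-7: {outside:.1%}")

# Observed data sit on integers only; replicated data almost never do
print("observed values:", np.unique(y))
print("first replicated values:", np.round(y_rep[:5], 2))
```

The normal model spreads probability between and beyond the response categories, so its replicated data cannot reproduce the spiked, skewed shape of the observed ratings.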

<font size = "3">As Kruschke (2015) correctly points out, there is no standard formula or presentation method for reporting Bayesian results in journal articles comparable to the APA guide for reporting frequentist analyses. It is likely there never will be because, as McElreath (2020) explains, Bayesian data analysis is more like an engineering approach to the problem, and the resulting model will be specific to the analysis. In addition, as Gabry et al. (2019) argue, visualisations may be even more important. So all the visualisations above would have to be included with any write-up. The write-up below generally follows the advice of Kruschke (2015), chapter 25. In any application, though, it comes down to the problem to be described and the audience that needs to be convinced.</font><br/>

<h2>Write up of categorical regression</h2><br/>

Note that, having fitted the model and observed the poor fit, the following write-up is likely to have little utility, given that the data are ordinal and a normal likelihood was inappropriate for this analysis. Of course, throughout the scientific literature there are countless examples of inappropriate models applied to data (McElreath, 2020). Bearing that in mind, the following write-up is still a useful example of how to report a 2x2 design analysed with a Bayesian regression equivalent of factorial ANOVA.

# Bonus Content

In [35]:
# Map direction labels to plain integer codes for use as Stan index variables
dfReduced['direction_n'] = dfReduced['direction'].map({'WEST': 1, 'EAST': 2}).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Stan model parameterised with index variables and a matrix of cell means

In [45]:
matIndex = """
data{

int N; // Number of observations
vector[N] y; // Dependent variable
int K; // Number of stations
int J; // Number of orientations

array[N] int<lower=1, upper = K> stations_id;
array[N] int<lower=1, upper = J> orientations_id;
}

parameters{

matrix[K,J] mu;
real<lower = 0> sigma;

}

model{

y ~ normal(mu[stations_id, orientations_id], sigma);
}
"""

In [46]:
# Generate python dictionary 
index_data = {'N': len(dfReduced),
              'y': dfReduced['subjective_distance'].values,
              'K': len(np.unique(dfReduced['station'])),
              'J': len(np.unique(dfReduced['orientation'])),
              'stations_id': dfReduced['station'].values,
              'orientations_id': dfReduced['direction_n'].values}

index_sm = ps.build(matIndex)

Building...


Building: Semantic error:   -------------------------------------------------
    20:  model{
    21:  
    22:  y ~ normal(mu[stations_id, orientations_id], sigma);
         ^
    23:  }
   -------------------------------------------------

Ill-typed arguments to '~' statement. No distribution 'normal' was found with the correct signature.

ValueError: Semantic error
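The error arises because indexing a Stan matrix with two integer arrays returns a sub-matrix, with one row per entry of the first array and one column per entry of the second, rather than one element per observation. The `normal` statement therefore receives a matrix where the compiler finds no matching signature. The NumPy sketch below, using hypothetical values and 0-based indices, contrasts that behaviour (mimicked with `np.ix_`) with the element-wise lookup the model needs. In Stan, one common remedy is to loop over observations, for example `for (n in 1:N) y[n] ~ normal(mu[stations_id[n], orientations_id[n]], sigma);`, though that change has not been run here.

```python
import numpy as np

# Hypothetical 2x2 matrix of cell means and index arrays for 5 observations
# (0-based here; Stan indices are 1-based)
mu = np.array([[2.0, 3.0],
               [4.0, 7.0]])
stations_id = np.array([0, 0, 1, 1, 0])
orientations_id = np.array([0, 1, 0, 1, 1])

# Analogue of Stan's mu[stations_id, orientations_id]:
# every row index is paired with every column index
cross = mu[np.ix_(stations_id, orientations_id)]
print(cross.shape)   # (5, 5)

# What the likelihood needs: one mean per observation
per_obs = mu[stations_id, orientations_id]
print(per_obs)       # [2. 3. 4. 7. 3.]
```

The per-observation means are the diagonal of the crossed matrix, which is why the loop form picks out exactly the elements the likelihood needs.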

# References

Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328-348.

Maglio, S. J., & Polman, E. (2014). Spatial orientation shrinks and expands psychological distance. Psychological Science, 25, 1345-1352.

Schad, D. J., Vasishth, S., Hohenstein, S., & Kliegl, R. (2020). How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. Journal of Memory and Language, 110, 104038.