In this notebook, I use Stan to fit a very simple fixed effects model to the first 40 answers of each Toppr user. We assume that each Toppr user $i$ has a fixed probability $\alpha_{i,0}$ of answering each of the first 20 questions correctly and a fixed probability $\alpha_{i,1}$ of answering each of the second 20 questions correctly. Let $Y_{it}$ be an indicator for whether student $i$ answers question $t$ correctly, and let $d[t]$ be a function that equals 0 if $t \le 20$ and 1 if $t > 20$. Then the model we fit is

$$ P(Y_{it}=1) = \alpha_{i,d[t]} $$

$$ \begin{pmatrix}\alpha_{i,0}\\\alpha_{i,1} \end{pmatrix} \sim  N 
\begin{bmatrix}
\begin{pmatrix}
0\\
0
\end{pmatrix},
\begin{pmatrix}
1 & .2 \\
.2 & 1 
\end{pmatrix}
\end{bmatrix}
$$

Our goal is to compare the variance across students of $\alpha_{i,0}$ with that of $\alpha_{i,1}$. If the variance of $\alpha_{i,1}$ is lower than that of $\alpha_{i,0}$, this would be tentative evidence of adaptivity.
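
Before fitting anything, a rough empirical version of this comparison can be sketched by computing each student's share of correct answers in each block and comparing the across-student variance of those shares. The snippet below is only an illustration: the `toy` data frame and the helper `block_variances` are made up for this example (the values are not Toppr data), and the column names `student_id`, `d`, and `correct` are assumed to match the ones used in this notebook. Raw shares are noisy with few questions per block, so this is a sanity check rather than a substitute for the model.

```python
import pandas as pd

# Toy data (hypothetical values): 3 students, 2 questions per block
toy = pd.DataFrame({
    'student_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'd':          [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    'correct':    [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1],
})

def block_variances(frame):
    """Across-student sample variance of per-student accuracy, by block."""
    # mean of `correct` per (student, block) is the empirical share correct
    shares = frame.groupby(['student_id', 'd'])['correct'].mean().unstack('d')
    # variance down each column, i.e. across students within a block
    return shares.var()

variances = block_variances(toy)
print(variances)
```

A lower value for block 1 than for block 0 would point in the same direction as the model-based comparison described above.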

In [1]:
# import packages
import numpy as np
import pandas as pd
import pystan

# import the data
import os
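data_path = os.path.join(os.environ["HOME"], 'Desktop', 'Toppr Dummy Data (First 40 Qns).dta')
df = pd.read_stata(data_path)
# generate variables for t and d
# sort_values returns a new frame, so assign the result before numbering the questions
df = df.sort_values(by=['student_id','question_start'])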
df['t']=df.groupby('student_id').cumcount()+1
# d = 0 for the first 20 questions and 1 for the second 20
df['d']=(df['t']>20)*1
df['student_num']=df['student_id'].astype('category').cat.codes+1

In [2]:
# save the dataframe columns as a dict to pass to pystan
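# convert the columns to plain integer arrays so they match the int declarations in the Stan data block
toppr_data = {'N': len(df), 'num_students': int(df['student_num'].max()),
              'student_num': df['student_num'].astype(int).values,
              'correct': df['correct'].astype(int).values,
              't': df['t'].astype(int).values, 'd': df['d'].astype(int).values}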

In [3]:
# save the Stan model as a string to pass to pystan
toppr_model = """
data {
    int<lower=0> N;  // number of observations
    int<lower=0> num_students; // total number of students in the sample
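    int<lower=1,upper=num_students> student_num[N]; // student number for each observation
    int<lower=0,upper=1> correct[N];  // whether question answered correctly
    int<lower=1,upper=40> t[N];  // question sequence variable
    int<lower=0,upper=1> d[N]; // question batch
}
parameters {
    // alpha1 and alpha2 are probabilities, so they must lie in [0,1] for bernoulli()
    real<lower=0,upper=1> alpha1[num_students];
    real<lower=0,upper=1> alpha2[num_students];
}
model {
    alpha1 ~ normal(0,10);  // our prior for alpha1, we use a pretty uninformative prior (nearly flat on [0,1])
    alpha2 ~ normal(0,10); // our prior for alpha2, we use a pretty uninformative prior (nearly flat on [0,1])
    for (n in 1:N)
        // index by observation n: pick the student's alpha1 or alpha2 depending on the batch
        correct[n] ~ bernoulli(alpha1[student_num[n]]*(1-d[n]) + alpha2[student_num[n]]*d[n]);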
} 
"""

In [4]:
# pass the data to the model and run the model
t_model = pystan.StanModel(model_code=toppr_model)
fit = t_model.sampling(data=toppr_data, iter=1000, chains=4)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_8c862ed856801632dfa19b0a2f06335a NOW.


CompileError: command 'gcc' failed with exit status 1

In [6]:
! gcc --version

Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
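
Once the compiler problem is resolved and the model samples, the variance comparison that motivates this notebook could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the helper `variance_comparison` is hypothetical, the draws used here are simulated stand-ins rather than model output, and the commented lines at the end assume PyStan 2.x, where `fit.extract(permuted=True)` returns a dict of arrays with one row per posterior draw.

```python
import numpy as np

def variance_comparison(alpha1_draws, alpha2_draws):
    """Compare across-student variances draw by draw.

    Both inputs are arrays of shape (n_draws, num_students).
    """
    # variance across students within each posterior draw
    var1 = np.var(alpha1_draws, axis=1, ddof=1)
    var2 = np.var(alpha2_draws, axis=1, ddof=1)
    return {
        'mean_var_alpha1': var1.mean(),
        'mean_var_alpha2': var2.mean(),
        # share of draws in which the second block is less dispersed
        'prob_var2_lower': np.mean(var2 < var1),
    }

# Stand-in draws (simulated for illustration, not model output)
rng = np.random.default_rng(0)
fake_alpha1 = rng.uniform(0.1, 0.9, size=(200, 50))
fake_alpha2 = rng.uniform(0.4, 0.6, size=(200, 50))
summary = variance_comparison(fake_alpha1, fake_alpha2)
print(summary)

# With a real fit this would be (assuming PyStan 2.x):
# draws = fit.extract(permuted=True)
# summary = variance_comparison(draws['alpha1'], draws['alpha2'])
```

A value of `prob_var2_lower` close to 1 would be the kind of tentative evidence of adaptivity described at the top of the notebook.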
