# Effective Sample Size (Alpha)

According to Wikipedia, *In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure*. Proper sampling is one of the most challenging parts of experimental design in statistics, and one of the pitfalls is correlated data points within a sample.

For example, suppose we are testing a drug designed to lower blood pressure and are only able to take 10 readings. Consider 3 designs.

1. 10 readings from 1 person
2. 10 readings from 10 people in one family
3. 10 readings from 10 random people

In each of these designs the number of data points remains the same, 10 readings, but an argument can be made that the **effective sample size** is not the same. In the first design, 10 readings from 1 person is a poor sample for judging whether the drug works: there was effectively only one test of whether the drug had an effect. The second design is an improvement, but given what is known about genetics, family members will likely have similar responses to a drug. The effective sample size is no longer one, but it could be argued that it is less than 10. In the last design, 10 readings from 10 random people would probably give an effective sample size close to 10.

The key notion is that effective samples are ones that present new information about the objective of the experiment.
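
To make the three designs concrete, here is a minimal sketch under a textbook simplification that is not part of the original example: every pair of readings is assumed to share the same correlation `rho`, in which case the effective sample size is `n / (1 + (n - 1) * rho)`. The function name `effective_sample_size` and the correlation values are illustrative assumptions. In particular, treating readings from one person as perfectly correlated and giving family members a correlation of 0.3 are idealizations, not measured quantities.

```python
def effective_sample_size(n, rho):
    """Effective sample size of n equally correlated readings.

    Assumes every pair of readings has the same correlation rho
    (an idealization used only for illustration).
    """
    return n / (1 + (n - 1) * rho)


# Assumed correlations for the three designs (illustrative values only)
designs = {
    "10 readings from 1 person": 1.0,           # idealized: perfectly correlated
    "10 readings from one family": 0.3,         # assumed moderate correlation
    "10 readings from 10 random people": 0.0,   # independent readings
}

for name, rho in designs.items():
    print(f"{name}: n_eff = {effective_sample_size(10, rho):.2f}")
# 10 readings from 1 person: n_eff = 1.00
# 10 readings from one family: n_eff = 2.70
# 10 readings from 10 random people: n_eff = 10.00
```

Under these assumptions the family design lands between the two extremes, which matches the argument above.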

## Effective Sample Size in Markov Chain Monte Carlo
In MCMC, samples are draws from the Markov chain. Successive draws are typically correlated, even after the chain has converged, meaning the nth draw may not be independent of the (n-1)th draw.

**TODO**
Wait for the result of this discussion
https://github.com/arviz-devs/arviz/issues/422

To estimate the effective sample size, Gelman et al. provide the following calculation:

$$\hat{n}_{eff} = \frac{mn}{1 + 2 \sum_{t=1}^T \hat{\rho}_t}$$

where 

\begin{aligned} 
m &= \text{Number of chains} \\
n &= \text{Number of draws per chain} \\
t &= \text{Autocorrelation lag} \\
T &= \text{Stopping point (for autocorrelation lag)} \\
\hat{\rho}_t &= \text{Estimate of the autocorrelation at lag } t
\end{aligned} 


The full derivation can be found in *Bayesian Data Analysis 3rd ed* equations *11.5* to *11.8*. However, the intuition is as follows. If the draws from a Markov chain are uncorrelated, then

$$ \hat{\rho}_t = 0 $$

and therefore

$$ \hat{n}_{eff} = \frac{mn}{1 + 2 \sum_{t=1}^T 0} = mn $$
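
Conversely, positive autocorrelation makes the denominator larger than 1 and shrinks the effective sample size. The sketch below illustrates this for a single chain ($m = 1$) using only the Python standard library. It is a deliberately simplified estimator, not the multi-chain calculation from *Bayesian Data Analysis* and not the ArviZ implementation. The helper names `autocorrelation` and `simple_ess`, the rule of stopping at the first negative autocorrelation, and the AR(1) test chain with coefficient 0.9 are all assumptions made for illustration.

```python
import random


def autocorrelation(draws, lag):
    """Sample autocorrelation of a single chain at the given lag."""
    n = len(draws)
    mean = sum(draws) / n
    centered = [d - mean for d in draws]
    variance = sum(c * c for c in centered) / n
    covariance = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
    return covariance / variance


def simple_ess(draws, max_lag=100):
    """Compute n / (1 + 2 * sum of rho_t) for one chain.

    Simplified stopping rule (an assumption): stop summing at the first
    lag whose estimated autocorrelation is negative.
    """
    total = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorrelation(draws, lag)
        if rho < 0:
            break
        total += rho
    return len(draws) / (1 + 2 * total)


random.seed(0)
n = 2000

# Independent draws: autocorrelations are close to zero
independent = [random.gauss(0, 1) for _ in range(n)]

# AR(1) chain: each draw depends strongly on the previous one
phi = 0.9
correlated = [random.gauss(0, 1)]
for _ in range(n - 1):
    correlated.append(phi * correlated[-1] + random.gauss(0, 1))

ess_independent = simple_ess(independent)
ess_correlated = simple_ess(correlated)
print(f"independent draws: n_eff is about {ess_independent:.0f} of {n}")
print(f"AR(1) draws:       n_eff is about {ess_correlated:.0f} of {n}")
```

For an AR(1) chain the autocorrelation at lag $t$ is $\phi^t$, so in the limit of many lags the denominator approaches $(1 + \phi) / (1 - \phi)$, which is 19 for $\phi = 0.9$. The correlated chain should therefore report an effective sample size far below the number of draws, while the independent draws should stay close to it.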



## Examples

### Eight Schools Model Formulation
Effective sample size is a helpful diagnostic when formulating the Eight Schools Model. More on the Eight Schools Model can be found **here**

**TODO** Add reference to Eight Schools model

While the finished inference runs can be loaded with `az.load_arviz_data("non_centered_eight")` and `az.load_arviz_data("centered_eight")`, in this example Stan will be used with ArviZ for completeness.

In [1]:
import pystan
import arviz as az
import numpy as np

### Eight School Models

We will be using the models from Michael Betancourt's essay *Diagnosing Biased Inference with Divergences*.

In [9]:
data = {
    "J": 8,
    "y": np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]),
    "sigma": np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]),
    }

### Centered Eight Schools Model
In the centered example, `theta` is drawn from a normal distribution with the parameters `mu` and `tau`. Without prior knowledge this is a reasonable formulation.

In [19]:
# Centered parameterization: theta is sampled directly from normal(mu, tau)
centered_model = """
    data {
      int<lower=0> J;
      real y[J];
      real<lower=0> sigma[J];
    }

    parameters {
      real mu;
      real<lower=0> tau;
      real theta[J];
    }

    model {
      mu ~ normal(0, 5);
      tau ~ cauchy(0, 5);
      theta ~ normal(mu, tau);
      y ~ normal(theta, sigma);
    }
"""

In [22]:
sm = pystan.StanModel(model_code=centered_model)
fit = sm.sampling(data=data, iter=500, chains=2, algorithm="NUTS", verbose=False)

pystan - INFO - COMPILING THE C++ CODE FOR MODEL anon_model_dadcdb82e8768088aeb9c143cf1f3a28 NOW.


ArviZ can then be used to calculate the effective sample size of the inference run.

In [23]:
az.effective_n(fit)

<xarray.Dataset>
Dimensions:      (theta_dim_0: 8)
Coordinates:
  * theta_dim_0  (theta_dim_0) int64 0 1 2 3 4 5 6 7
Data variables:
    mu           float64 41.88
    tau          float64 60.13
    theta        (theta_dim_0) float64 79.34 116.7 83.64 ... 84.2 131.0 110.2

When running 500 sampling iterations, the effective sample size of `mu` is only about 42.

### Non-Centered Eight Schools Model
In the non-centered example, `theta_tilde` is drawn from a standard normal distribution and `theta` is computed as `mu + tau * theta_tilde`.

In [13]:
# Non-centered parameterization: theta is a deterministic transform of theta_tilde
non_centered_model = """
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"""

In [14]:
sm = pystan.StanModel(model_code=non_centered_model)
fit = sm.sampling(data=data, iter=500, chains=2, algorithm="NUTS", verbose=False)

pystan - INFO - COMPILING THE C++ CODE FOR MODEL anon_model_3f8f9e8bb354ab461436bb51d935571d NOW.

In [15]:
az.effective_n(fit)


<xarray.Dataset>
Dimensions:            (theta_dim_0: 8, theta_tilde_dim_0: 8)
Coordinates:
  * theta_dim_0        (theta_dim_0) int64 0 1 2 3 4 5 6 7
  * theta_tilde_dim_0  (theta_tilde_dim_0) int64 0 1 2 3 4 5 6 7
Data variables:
    mu                 float64 437.7
    tau                float64 394.5
    theta_tilde        (theta_tilde_dim_0) float64 486.0 376.5 ... 466.5 392.2
    theta              (theta_dim_0) float64 392.3 537.4 502.2 ... 436.2 388.9
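
To compare the two runs on a common scale, the reported values can be divided by the number of post-warmup draws. The sketch below does this arithmetic for `mu` and `tau`, using the numbers printed above for the first run (which samples `theta` directly) and the second run (which samples `theta_tilde`). It assumes PyStan's default of using half of `iter` as warmup, which would leave 250 draws per chain and 500 in total. If a different warmup was in effect, the fractions change accordingly. The variable names are illustrative.

```python
# Assumption: PyStan 2 default warmup of iter // 2, with iter=500 and chains=2
iterations, chains = 500, 2
post_warmup_draws = chains * (iterations - iterations // 2)

# Effective sample sizes copied from the two outputs above
first_run = {"mu": 41.88, "tau": 60.13}    # theta sampled directly
second_run = {"mu": 437.7, "tau": 394.5}   # theta built from theta_tilde

first_fraction = {k: v / post_warmup_draws for k, v in first_run.items()}
second_fraction = {k: v / post_warmup_draws for k, v in second_run.items()}

for name in ("mu", "tau"):
    print(
        f"{name}: first run {first_fraction[name]:.3f}, "
        f"second run {second_fraction[name]:.3f}"
    )
# mu: first run 0.084, second run 0.875
# tau: first run 0.120, second run 0.789
```

Under that assumption, each draw in the second run carries roughly six to ten times as much information about `mu` and `tau` as a draw in the first run.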

## PyMC3 (to show what I'm trying to get at)

In [None]:
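import pymc3 as pm

# SD and obs are not defined in the original notebook. The values below are
# assumptions chosen only so that the cell runs.
SD = 1.0
obs = np.random.normal(loc=-5, scale=SD, size=100)

# Attempt to use PyMC3 to estimate the mean of the distribution
with pm.Model() as model:
    mu = pm.Normal("mu", mu=-5000, sd=1)
    y = pm.Normal("y", mu=mu, sd=SD, observed=obs)
    step = pm.Metropolis()
    trace = pm.sample(500, step, chains=2)

print(az.effective_n(trace))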


# Same model with the prior mean moved from -5000 to -5
with pm.Model() as model:
    mu = pm.Normal("mu", mu=-5, sd=1)
    y = pm.Normal("y", mu=mu, sd=SD, observed=obs)
    step = pm.Metropolis()
    trace = pm.sample(500, step, chains=2)
    
az.effective_n(trace)

#### The nice pystan output goes to the terminal (Possible to reverse outputs?)
<img src="PystanOutputInTerminal.png">

### Resources
[Effective Sample Size](https://www.youtube.com/watch?v=67zCIqdeXpo&t=227s) - Josh Starmer

### See also
Autocorrelation (TODO, Add link)

In [None]:
%load_ext watermark

In [None]:
%watermark -m

In [None]:
%watermark