### Definitions, Assumptions

For notational ease, let our data set $X$ be a list of vectors in $R^3$, where the number of times a vector $x_i$ appears in the list equals the synapse value recorded for that vector in our original data set. A vector whose synapse value is zero, or that is absent from the original data, does not appear in the list.

As the 3D scatter plots showed, the data form slices parallel to the y-z plane at fixed values of x. Let $L = \{l_1, l_2, \ldots, l_M\}$ be the set of these x values. We can now turn our data set $X$, an unlabeled list of vectors in $R^3$, into labeled data by treating the x value of each vector as its label, so the set of possible labels is $L$.

For test purposes we assume that all $X_i$ are iid and will treat the $X_i$ as labeled/categorical data as discussed above.
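As a rough illustration of the construction above, the sketch below expands a few made-up rows of the form (x, y, z, synapses) into the repeated-vector list and reads off the labels. The array `rows` and the names `X_toy`, `labels`, and `L_toy` are hypothetical and exist only for this example; the values are not taken from the real data set.

```python
import numpy as np

# Hypothetical toy rows in the form (x, y, z, synapses); not the real data.
rows = np.array([
    [19, 5, 1, 2],
    [19, 7, 1, 0],
    [58, 5, 1, 3],
    [97, 9, 2, 1],
])

coords = rows[:, :3]             # the vectors in R^3
counts = rows[:, 3].astype(int)  # synapse value of each vector

# Repeat each vector once per synapse; rows with a count of 0 drop out.
X_toy = np.repeat(coords, counts, axis=0)

# The label of each vector is its x value, and L is the set of distinct labels.
labels = X_toy[:, 0]
L_toy = np.unique(labels)
```

Here the second row has a synapse value of zero, so it contributes nothing to `X_toy`, while the third row appears three times.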

### Statistical test

We will test whether the synapses are distributed uniformly across the 'slices'. In other words, do the labels $X_i$ follow a discrete uniform distribution over $L$?

Null hypothesis: $X_i \sim U(1, M)$, where $M = |L|$ and $U(1, M)$ is the discrete uniform distribution on $\{1, 2, \ldots, M\}$ (each label $l_k$ is identified with its index $k$).

Alternative: $X_i \not\sim U(1, M)$.

### Test statistic 
We'll use Pearson's chi-squared test to determine whether to reject the null. First, define $E_i=\frac{|X|}{|L|}$ (recall $X$ is our data and $L$ is the set of labels), and define $O_i=|\{X_j \mid X_j \textrm{ has label } l_i\}|$. In other words, $E_i$ is the number of data points with label $l_i$ expected under the null, and $O_i$ is the number actually observed. Our test statistic is
$$
T = \sum_{i = 1}^M \frac{(O_i - E_i)^2}{E_i}
$$
and, under the null, it approximately follows a chi-squared distribution with $M-1$ degrees of freedom. Therefore, given a significance level $\alpha$, we can use the inverse CDF of the chi-squared distribution, evaluated at $1-\alpha$, to determine a critical value. When $T$ is greater than the critical value, we reject the null.
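To make the decision rule concrete, here is a minimal sketch that computes $T$ and the critical value for a small set of made-up counts. The observed counts `O` are hypothetical and chosen only so the arithmetic is easy to follow; `stats.chi2.ppf` is SciPy's inverse CDF of the chi-squared distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for M = 4 labels (illustrative only).
O = np.array([30, 20, 25, 25], dtype=float)
M = len(O)

# Under the null every label is equally likely, so E_i = |X| / |L| = 25.
E = np.full(M, O.sum() / M)

# Pearson's statistic: (25 + 25 + 0 + 0) / 25 = 2.0
T = np.sum((O - E) ** 2 / E)

# Critical value: the (1 - alpha) quantile of chi-squared with M - 1 df.
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df=M - 1)

# Reject the null only when T exceeds the critical value.
reject_null = T > critical_value

# Equivalent view: the p-value is the upper tail probability at T.
p_value = stats.chi2.sf(T, df=M - 1)
```

With these counts $T = 2.0$, which is well below the critical value of roughly 7.81 for 3 degrees of freedom, so the null would not be rejected at $\alpha = 0.05$.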

In [30]:
# preliminary code: 
# imports 
# define some global variables and functions

import itertools
from urllib.request import urlopen  # Python 3 replacement for urllib2

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

%matplotlib inline

np.random.seed(1)
alpha = .05
url = ('https://raw.githubusercontent.com/Upward-Spiral-Science'
       '/data/master/syn-density/output.csv')
data = urlopen(url)
csv = np.genfromtxt(data, delimiter=",")[1:] # don't want first row (labels)
num_samples = 100 # how many different sized samples to draw
N = np.sum(csv[:, -1]) # total data set size
L = np.unique(csv[:, 0]) # list of labels


# make data set of synapse labels (x coordinates)
# repeat each row's x coordinate once per synapse; the counts are cast to
# int because genfromtxt returns floats, which cannot be used as repeat counts
X = np.repeat(csv[:, 0].astype(int), csv[:, -1].astype(int))
# sample sizes to iterate over
sample_sizes = np.linspace(100, 1000000, num_samples, dtype='int32')

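# function to get the chi-squared test statistic and its p-value
def chi_sq_stat(X, L):
    # expected count E_i = |X| / |L| is the same for every label
    E_i = np.full(len(L), float(len(X)) / len(L))
    # observed count O_i = number of entries of X equal to label l_i
    # (count the boolean mask directly; counting the indices returned by
    # np.where would miss a match at index 0)
    O_i = np.array([np.count_nonzero(X == int(l)) for l in L])
    # stats.chisquare uses len(L) - 1 degrees of freedom by default
    return stats.chisquare(O_i, E_i)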

### Null model