# Regression (predicting unmasked value given (x, y, z, synapses))

### Null distribution

Under the null hypothesis there is no relationship between the variables: they are all independent, so the joint distribution factors into the product of its marginals. Let's let each marginal be uniform between that variable's min and max in the actual dataset. The target variable Y, i.e. unmasked, is therefore uniformly distributed and independent of the features (x, y, z, synapses).
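Written out, the null model amounts to the factorization below. The notation is introduced here for illustration and is not from the original notebook: $s$ is the synapse count, $Y$ is the unmasked target (distinct from the coordinate $y$), and $a_v$, $b_v$ are the observed min and max of variable $v$. Each variable is treated as continuous for simplicity.

$$p(x, y, z, s, Y) = p(x)\,p(y)\,p(z)\,p(s)\,p(Y), \qquad p(v) = \frac{1}{b_v - a_v} \quad \text{for } a_v \le v \le b_v .$$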

In [1]:
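import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

%matplotlib inline

sample_size = 10000
np.random.seed(1)
url = ('https://raw.githubusercontent.com/Upward-Spiral-Science'
       '/data/master/syn-density/output.csv')
data = urlopen(url)
csv = np.genfromtxt(data, delimiter=",")[1:]  # skip the first row (column labels)

# Per-column (min, max) over the 5 columns of the dataset
mins = [np.min(csv[:, i]) for i in range(5)]
maxs = [np.max(csv[:, i]) for i in range(5)]
domains = list(zip(mins, maxs))
Y_range = domains[3]  # column 3 is the target, Y (unmasked)
del domains[3]        # the remaining 4 columns are the features

# Draw every variable independently; np.random.randint samples integers
# uniformly from [low, high), so the upper bound itself is never drawn
null_X = np.array([[np.random.randint(*domains[i]) for i in range(4)]
                   for k in range(sample_size)])
null_Y = np.array([[np.random.randint(*Y_range)] for k in range(sample_size)])

print("%s %s" % (null_X.shape, null_Y.shape))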


(10000, 4) (10000, 1)


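Just to check that this synthetic data is a good null model, fit a linear regression: with no relationship between the features and the target, the $R^2$ score should be close to 0.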

In [2]:
lin_reg = sklearn.linear_model.LinearRegression()
lin_reg.fit(null_X, null_Y)
# score() returns the coefficient of determination, R^2
print(lin_reg.score(null_X, null_Y))


0.000279194974088
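To see why a score this small is what independence predicts, here is a minimal self-contained sketch. It does not use the dataset: the ranges (0 to 100) are hypothetical, and `r_squared` is a helper written for this illustration with `np.linalg.lstsq` rather than scikit-learn. With 4 independent features and 10000 samples, the in-sample $R^2$ should be on the order of (number of features) / (number of samples).

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack((np.ones(len(X)), X))      # prepend an intercept column
    coef = np.linalg.lstsq(A, y, rcond=None)[0]    # least-squares coefficients
    ss_res = np.sum((y - A.dot(coef)) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return 1.0 - ss_res / ss_tot


rng = np.random.RandomState(0)
n = 10000

# Hypothetical ranges: every variable is uniform on [0, 100) and independent
X = rng.uniform(0.0, 100.0, size=(n, 4))
y = rng.uniform(0.0, 100.0, size=n)

null_r2 = r_squared(X, y)
print(null_r2)
```

The exact value depends on the seed, but it should stay very close to 0, since the features carry no information about the target.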


### Alternate distribution

Here we want a strong relationship between the variables. Let's keep x, y, z uniformly distributed across the sample space, but let the number of synapses, $s$, be a deterministic function, $f$, of x, y, z. Let $s=f(x,y,z)=\frac{x+y+z}{3}$. Now let's say our random variable $Y=(s/4)+\epsilon$, where $\epsilon$ is Gaussian noise with variance equal to the average of $s$, as in the code below (just to make this synthetic data slightly more realistic).

In [3]:
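# Keep the x, y, z columns of null_X; the 4th column becomes s = (x + y + z) / 3
xyz = null_X[:, 0:3]
alt_X = np.column_stack((xyz, xyz.mean(axis=1)))
# Noise standard deviation is sqrt(average(s)), i.e. the variance is average(s)
std_dev = np.sqrt(np.average(alt_X[:, 3]))
alt_Y = alt_X[:, 3]/4 + np.random.normal(scale=std_dev, size=(sample_size,))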

Just to check that this synthetic data is a good alternate model, fit the regression again: the $R^2$ score should now be high.

In [4]:
lin_reg.fit(alt_X, alt_Y)
# R^2 should be well above the null-model score
print(lin_reg.score(alt_X, alt_Y))

0.884357772213
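The size of this score can be reasoned about directly. Because $Y = s/4 + \epsilon$ with $\epsilon$ independent of $s$, a linear fit can explain at most the fraction $\mathrm{Var}(s/4) / (\mathrm{Var}(s/4) + \sigma^2)$ of the variance, where $\sigma^2$ is the noise variance. The sketch below checks that under stated assumptions: the coordinate range (0 to 1000) is hypothetical rather than taken from the dataset, the noise variance is set to the average of $s$ as in the `In [3]` code, and the fit uses `np.linalg.lstsq` in place of scikit-learn.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 10000

# Hypothetical coordinate ranges: x, y, z uniform on [0, 1000)
xyz = rng.uniform(0.0, 1000.0, size=(n, 3))
s = xyz.mean(axis=1)                      # s = (x + y + z) / 3

noise_var = s.mean()                      # noise variance = average(s)
Y = s / 4 + rng.normal(scale=np.sqrt(noise_var), size=n)

# Fraction of the variance of Y attributable to the signal s / 4
signal_var = np.var(s / 4)
expected_r2 = signal_var / (signal_var + noise_var)

# Least-squares fit on x, y, z with an intercept; s is a linear
# combination of x, y, z, so adding it as a feature would change nothing
A = np.column_stack((np.ones(n), xyz))
coef = np.linalg.lstsq(A, Y, rcond=None)[0]
ss_res = np.sum((Y - A.dot(coef)) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
empirical_r2 = 1.0 - ss_res / ss_tot

print(expected_r2)
print(empirical_r2)
```

The two printed values should agree closely. With the real dataset's ranges the ratio differs, but the same reasoning applies.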
