### ANOVAs, multiple comparisons, post-hoc tests, normality tests

### Correlation in Python

$\text{cov}(X,Y) = \frac{1}{N} \sum\limits_{i=1}^{N} (x_i-\bar{X}) (y_i-\bar{Y})$

$r = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y}$

cov(*X*,*Y*) : covariance of *X* and *Y*  
*r* : Pearson r  
*X*,*Y* : arrays/vectors of same length  
$N$ : number of elements in *X*/*Y*  
$\bar{X}, \bar{Y}$ : mean of *X*/*Y*  
$x_i,y_i$ : *i*'th element of *X*/*Y*  
$\sigma_X$,$\sigma_Y$ : standard deviation of *X*/*Y*

In [1]:
# write a function that accepts 2 arguments: x and y
# and returns pearson r of x and y
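One possible way to write this function, as a minimal sketch that follows the two formulas above directly. The name `pearson_r`, the seed, and the test data are illustrative choices, not part of the original exercise.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r computed from the covariance and the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    # population covariance (divide by N), matching the formula above
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    # np.std defaults to ddof=0, so it also divides by N
    return cov_xy / (x.std() * y.std())

rng = np.random.default_rng(0)  # fixed seed (arbitrary) for reproducibility
a = rng.normal(size=1000)
b = 2 * a + rng.normal(size=1000)  # made-up data correlated with a

print(pearson_r(a, b))
print(np.corrcoef(a, b)[0, 1])  # NumPy's built-in result, for comparison
```

The two printed values should agree up to floating-point error, because the choice of dividing by N or N-1 cancels out in the ratio as long as it is applied consistently.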

In [9]:
import numpy as np
# the scipy.stats.stats namespace is deprecated; import from scipy.stats instead
from scipy.stats import pearsonr

In [19]:
x = np.random.normal(loc=0, scale=1, size=10000)
y = np.random.normal(loc=0, scale=1, size=10000)


# x and y are independent draws, so r should be close to 0
r,p = pearsonr(x,y)



### Linear models

In [49]:
# intro to linear relationships
import matplotlib.pyplot as pl  # assumed: `pl` is the pyplot alias used in this notebook

x = np.arange(1000)
y = 32*x**.8 + 91 + np.random.normal(0,10000,size=len(x))

fig,ax = pl.subplots()

ax.scatter(x,y)

# np.cov divides by N-1 by default, so pass ddof=1 to np.var to match
beta = np.cov(x,y)[0,1] / np.var(x, ddof=1)
A = np.mean(y) - beta*np.mean(x)

x0 = np.min(x)
x1 = np.max(x)
y0 = beta*x0 + A
y1 = beta*x1 + A
ax.plot([x0,x1], [y0,y1], color='red', lw=4)

r,p = pearsonr(x,y)

ax.set_xlabel('Temperature (degrees blah)')
ax.set_ylabel('Firing rate of neuron')

ax.set_title(r'$\beta={:0.3f}$  $A={:0.3f}$  $r={:0.3f}$'.format(beta,A,r))

print(beta)
print(A)

7.184110339147569
1532.7595442976449


### the form of a linear model

$y = \beta X + A$

X : the "independent" variable (regressor/s)  
y : the "dependent" variable

In [33]:
x = np.arange(100)

beta = 30
A = 6

y = beta*x + A

pl.scatter(x, y)


## definition of best line?

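The best line is the one that minimizes the sum of squared vertical distances (the *residuals*) between the points and the line. This is what "least squares" means.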
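A small sketch of this idea, assuming the usual least-squares criterion (sum of squared vertical distances). The data, the seed, and the helper name `sse` are made up for illustration. The fitted line should give a smaller error than any nudged version of it.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (arbitrary) for reproducibility
x = np.arange(100, dtype=float)
y = 30 * x + 6 + rng.normal(0, 50, size=len(x))  # made-up noisy line

def sse(beta, A):
    """Sum of squared vertical distances between the points and the line."""
    residuals = y - (beta * x + A)
    return np.sum(residuals ** 2)

# np.polyfit with deg=1 returns the least-squares slope and intercept
beta_hat, A_hat = np.polyfit(x, y, deg=1)

print(sse(beta_hat, A_hat))        # error of the fitted line
print(sse(beta_hat + 0.5, A_hat))  # slope nudged up: larger error
print(sse(beta_hat, A_hat + 20))   # intercept nudged up: larger error
```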

# the solution to the linear model

$\beta = \frac{\text{cov}(X,y)}{\text{var}(X)}$

$A = \bar{y} - \beta \bar{X}$
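These two formulas can be checked against a library fit. The sketch below uses made-up data and compares the closed-form result with `np.polyfit`. One detail to watch is that `np.cov` divides by N-1 by default while `np.var` divides by N, so the same `ddof` has to be passed to both.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed (arbitrary) for reproducibility
x = np.arange(200, dtype=float)
y = 3.0 * x + 10.0 + rng.normal(0, 25, size=len(x))  # made-up noisy line

# same normalization (ddof=1) for the covariance and the variance
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
A = np.mean(y) - beta * np.mean(x)

# least-squares fit from NumPy, for comparison
slope, intercept = np.polyfit(x, y, deg=1)

print(beta, slope)
print(A, intercept)
```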

### Some assumptions of linear regression

* The data used in fitting the model are representative of the population

* The true underlying relationship between the two variables is linear

<hr/>

* The variance of the *residuals* is constant (homoscedastic, not heteroscedastic)

* The *residuals* are independent

* The *residuals* are normally distributed
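The three residual assumptions can be inspected numerically once a line has been fit. The following is a rough sketch on simulated data, not a full diagnostic procedure. The split-half comparison and the lag-1 autocorrelation are simple illustrative checks chosen here, and `scipy.stats.shapiro` is one of several normality tests that could be used.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)  # fixed seed (arbitrary) for reproducibility
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=len(x))  # made-up data

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# constant variance: the spread should be similar for low and high x
half = len(x) // 2
print(residuals[:half].std(), residuals[half:].std())

# independence: the lag-1 autocorrelation should be near 0
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(lag1)

# normality: Shapiro-Wilk test (a small p-value suggests non-normal residuals)
stat, p_value = shapiro(residuals)
print(stat, p_value)
```

In practice, plotting the residuals against `x` and looking at their histogram is often as informative as any single test.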

#### how to assess how well the model did

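We can use $r$ or $r^2$. For a model with a single regressor and a constant offset, $r^2$ is the fraction of the variance in $y$ that the fitted line explains.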
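As a sketch of how $r^2$ relates to the fit, the example below computes it two ways on made-up data: as one minus the ratio of residual to total sum of squares, and as the square of Pearson r. For a single regressor with an intercept the two should agree.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed (arbitrary) for reproducibility
x = np.arange(100, dtype=float)
y = 5.0 * x + 20.0 + rng.normal(0, 60, size=len(x))  # made-up noisy line

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)     # variation the line does not explain
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```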

In [None]:
# multiple regressors

$X_0$ : temperature  
$X_1$ : humidity

$y = \beta_0 X_0 + \beta_1 X_1 +  A$

Our model: "We believe the firing rate of the neuron can be explained/predicted as a sum of (scaled) temperature and humidity values (plus a constant offset)."
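To make the two-regressor model concrete, here is a sketch that simulates firing rates from known coefficients and then recovers them with `np.linalg.lstsq`. All numbers (coefficients, ranges, noise level) are invented for illustration. The column of ones is what lets the fit estimate the constant offset $A$.

```python
import numpy as np

rng = np.random.default_rng(5)  # fixed seed (arbitrary) for reproducibility
n = 500
temperature = rng.uniform(10, 30, size=n)  # X_0
humidity = rng.uniform(0, 100, size=n)     # X_1

# simulated firing rate with known (made-up) coefficients
true_beta0, true_beta1, true_A = 4.0, -0.5, 100.0
firing_rate = (true_beta0 * temperature
               + true_beta1 * humidity
               + true_A
               + rng.normal(0, 1.0, size=n))

# one row per observation; the last column of ones carries the offset A
X = np.column_stack([temperature, humidity, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, firing_rate, rcond=None)
beta0, beta1, A = coefs

print(beta0, beta1, A)  # should land close to 4.0, -0.5, 100.0
```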

In [52]:
# python implementation of a linear model
from statsmodels.regression.linear_model import OLS

# OLS: ordinary least squares
dependent = np.array([500, 520, 513, 520])

independent = np.array([
                    [10, 5, 1],
                    [11, 3, 1],
                    [12, 6, 1],
                    [13, 4, 1],
])

# in this example:
# dependent variable is firing rate of neuron
# independent variables are temperature and humidity
# the first column of independent represents temperature
# the second column of independent represents humidity
# each row in independent is an "observation"
# for example:
# in the first "observation" or "trial," the temperature
# was 10, the humidity was 5, and we recorded a firing rate of 500
OLS

statsmodels.regression.linear_model.OLS

In [54]:
# interpreting the output of a linear model

model = OLS(dependent, independent)
fit = model.fit()

fit.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.842 |
| Model: | OLS | Adj. R-squared: | 0.525 |
| Method: | Least Squares | F-statistic: | 2.657 |
| Date: | Thu, 30 Aug 2018 | Prob (F-statistic): | 0.398 |
| Time: | 14:23:33 | Log-Likelihood: | -10.39 |
| No. Observations: | 4 | AIC: | 26.78 |
| Df Residuals: | 1 | BIC: | 24.94 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 5.3000 | 2.907 | 1.823 | 0.319 | -31.636 | 42.236 |
| x2 | -4.1000 | 2.907 | -1.410 | 0.393 | -41.036 | 32.836 |
| const | 470.7500 | 36.044 | 13.060 | 0.049 | 12.764 | 928.736 |

| | | | |
|---|---|---|---|
| Omnibus: | | Durbin-Watson: | 2.0 |
| Prob(Omnibus): | | Jarque-Bera (JB): | 0.667 |
| Skew: | 0.0 | Prob(JB): | 0.717 |
| Kurtosis: | 1.0 | Cond. No. | 138.0 |
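To see where the main numbers in the summary come from, the sketch below refits the same four observations with plain NumPy instead of statsmodels. The `coef` column holds one fitted coefficient per column of `independent` (x1 is temperature, x2 is humidity, const is the column of ones), and R-squared is one minus the ratio of residual to total sum of squares. With 4 observations and 3 fitted coefficients only 1 residual degree of freedom is left, which is why the confidence intervals in the summary are so wide.

```python
import numpy as np

dependent = np.array([500, 520, 513, 520], dtype=float)
independent = np.array([
    [10, 5, 1],
    [11, 3, 1],
    [12, 6, 1],
    [13, 4, 1],
], dtype=float)

# least-squares coefficients: temperature, humidity, constant
coefs, *_ = np.linalg.lstsq(independent, dependent, rcond=None)

predicted = independent @ coefs
ss_res = np.sum((dependent - predicted) ** 2)
ss_tot = np.sum((dependent - dependent.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(coefs)      # compare with the coef column: 5.3, -4.1, 470.75
print(r_squared)  # compare with R-squared: 0.842
```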
