### Week03 - Feb 21, 2016 ###

__Parametric test__
- Based on parameters that summarize a distribution, such as mean and standard deviation
- Generally assume a normal distribution of samples

__Non-parametric test__
- Advantage: No assumptions about parent population (more robust)
- Disadvantage: Less power in situations where parametric assumptions are satisfied (more samples are needed to draw conclusions at the same confidence level)

__Testing for normality__

<img src='images/norm_dist_week3.png',width=500>

Blue: Sample distribution ($O_i$)<br>
Red: Normal distribution with same mean and standard deviation, expected value ($E_i$)

__Chi-squared test for normality__

### $$ X^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}$$ ###

Compare this test statistic to the critical value of the chi-squared distribution, $\chi_{\nu, 1-\alpha}^2$ <br>
- If the test statistic is larger than the critical chi-squared value, the null hypothesis that the observed and expected counts come from the same distribution can be rejected (the result is sensitive to bin size!)
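
A minimal sketch of the chi-squared normality test on synthetic data. The sample, the choice of 10 bins, and the degrees of freedom ($\nu = k - 3$, assuming the mean and standard deviation are both estimated from the sample) are illustrative assumptions, not values from the lecture. Bins in the tails can have very small expected counts, which is one reason the result depends on bin size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=500)  # synthetic data (assumption)

# Observed counts O_i in k bins
k = 10
observed, edges = np.histogram(sample, bins=k)

# Expected counts E_i from a normal distribution with the sample mean and std
mu, sigma = sample.mean(), sample.std(ddof=1)
cdf = stats.norm.cdf(edges, loc=mu, scale=sigma)
expected = len(sample) * np.diff(cdf)

# Test statistic X^2 = sum (O_i - E_i)^2 / E_i
chi2_stat = np.sum((observed - expected) ** 2 / expected)

# Critical value at alpha = 0.05; nu = k - 1 - 2 estimated parameters (assumption)
nu = k - 3
alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, nu)

reject_normality = chi2_stat > chi2_crit
print(chi2_stat, chi2_crit, reject_normality)
```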

__Probability Plot__

<img src='images/prob_dens.png',width=500>
Y-Axis: Probability density, i.e. Count / (Total Count × Bin Width)

<img src='images/prob_plot.png',width=500>
X-Axis: Quantiles - if the normal distribution is split into some discrete number of equal-probability pieces, the quantiles are the z-scores at the boundaries between those pieces

If values are normally distributed, the Quantiles should plot linearly with the Ordered Values. That is, most values are clustered around the mean.

Examples of a Non-Normal distribution
<img src='images/non_norm_dist.png',width=500>
<img src='images/non_norm_prob_plot.png',width=500>
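
A short sketch of how a probability plot can be computed with `scipy.stats.probplot`. The two synthetic samples (one normal, one exponential) are assumptions for illustration. The returned correlation coefficient `r` measures how close the ordered values are to a straight line against the theoretical quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=200)       # synthetic normal data (assumption)
skewed_sample = rng.exponential(size=200)  # synthetic non-normal data (assumption)

# probplot returns (theoretical quantiles, ordered values), (slope, intercept, r)
(q_n, ordered_n), (slope_n, icpt_n, r_n) = stats.probplot(normal_sample, dist="norm")
(q_s, ordered_s), (slope_s, icpt_s, r_s) = stats.probplot(skewed_sample, dist="norm")

# r close to 1 means the points fall close to a straight line
print("normal sample r:", r_n)
print("skewed sample r:", r_s)
```

Passing `plot=plt` (with `import matplotlib.pyplot as plt`) to `probplot` draws the figure directly.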



__Kolmogorov-Smirnov test__

Can be used to compare two sample distributions, or a sample distribution and a reference distribution (normal, exponential, etc.)

Null Hypothesis: Samples are drawn from the same distribution (in the two-sample case)

<img src='images/km_dist.png',width=500>
Source: Durkin et al (2009), Chitin in diatoms and its association with the cell wall, Eukaryotic Cell


__Example__
<img src='images/ks_Wiki.png',width=400>
Source: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test <br>

Illustration of the two-sample Kolmogorov–Smirnov statistic. Red and blue lines each correspond to an empirical distribution function, and the black arrow is the two-sample KS statistic.
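
A minimal sketch of the two-sample and one-sample KS tests using `scipy.stats.ks_2samp` and `scipy.stats.kstest`. The samples are synthetic, and `ks_statistic` is a hypothetical helper written here only to show that the statistic is the largest vertical gap between the two empirical distribution functions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=300)  # synthetic samples (assumption)
b = rng.normal(0.0, 1.0, size=300)  # same distribution as a
c = rng.normal(1.0, 1.0, size=300)  # shifted mean

def ks_statistic(x, y):
    """Largest absolute difference between two empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

# Two-sample tests
stat_ab, p_ab = stats.ks_2samp(a, b)
stat_ac, p_ac = stats.ks_2samp(a, c)

# One-sample test against a standard normal reference distribution
stat_norm, p_norm = stats.kstest(a, "norm")

print(stat_ab, p_ab)
print(stat_ac, p_ac, ks_statistic(a, c))
```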

__Other tests__

__Shapiro-Wilk__:
- High Power
- Biased at __Large__ sample size

__Anderson-Darling__


```python
from scipy import stats
help(stats)
```
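
A hedged sketch of how the Shapiro-Wilk and Anderson-Darling tests could be run with `scipy.stats.shapiro` and `scipy.stats.anderson`. The synthetic samples are assumptions. Note that `anderson` reports a statistic and a table of critical values rather than a single p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_sample = rng.normal(size=100)       # synthetic normal data (assumption)
skewed_sample = rng.exponential(size=100)  # synthetic non-normal data (assumption)

# Shapiro-Wilk: W near 1 and a large p-value are consistent with normality
w_n, p_n = stats.shapiro(normal_sample)
w_s, p_s = stats.shapiro(skewed_sample)

# Anderson-Darling: compare the statistic with the tabulated critical values
ad_n = stats.anderson(normal_sample, dist="norm")
ad_s = stats.anderson(skewed_sample, dist="norm")

print("Shapiro-Wilk:", (w_n, p_n), (w_s, p_s))
print("Anderson-Darling:", ad_n.statistic, ad_s.statistic)
print("critical values:", ad_n.critical_values)
```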

### Wilcoxon signed-rank test ###

__$H_0$__: the median difference between pairs of observations is zero

- Rank the absolute values of the differences (smallest = 1)
- Sum the ranks of the positive differences, and sum the ranks of the negative differences separately
- The smaller of the two sums is the test statistic T
- Low values of T required for significance
- Use __Mann-Whitney__ test for unpaired data
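
The ranking steps above can be sketched by hand and checked against `scipy.stats.wilcoxon`. The paired values are made up for illustration and chosen so that there are no zero differences and no tied absolute differences.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (assumption)
x = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.7, 10.4, 12.6])
y = np.array([9.1, 11.9, 8.3, 10.0, 11.6, 9.9, 10.1, 9.7])

d = x - y

# Rank the absolute differences (smallest = 1)
ranks = stats.rankdata(np.abs(d))

# Sum ranks of positive and negative differences separately
t_plus = ranks[d > 0].sum()
t_minus = ranks[d < 0].sum()

# Test statistic T is the smaller of the two sums
T = min(t_plus, t_minus)

# Library version for comparison (two-sided by default)
result = stats.wilcoxon(x, y)
print(T, result.statistic, result.pvalue)
```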

__Kruskal-Wallis ANOVA__

__$H_0$__: Means of ranks of groups are the same <br>
__$H_0 (II)$__: Medians of groups are the same (assuming they come from distributions with the same shape)

- Related to the Mann-Whitney rank-sum test (two groups)
- Does not assume normality, but...
- According to [McDonald](http://www.biostathandbook.com), Fisher's classic ANOVA is not actually very sensitive to non-normal distributions
- Like Fisher's classic ANOVA, testing $H_0 (II)$ does assume that different groups have the same variance (homoscedasticity)
- Welch's ANOVA is another alternative to Fisher's ANOVA that does not assume homoscedasticity (like Welch's t-test)
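
A minimal sketch comparing the Kruskal-Wallis test with Fisher's one-way ANOVA and the two-group Mann-Whitney test, using `scipy.stats.kruskal`, `scipy.stats.f_oneway` and `scipy.stats.mannwhitneyu`. The three synthetic groups are assumptions. Welch's ANOVA is not shown here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g1 = rng.normal(0.0, 1.0, size=40)  # synthetic groups (assumption)
g2 = rng.normal(0.0, 1.0, size=40)
g3 = rng.normal(1.5, 1.0, size=40)  # shifted group

# Kruskal-Wallis: rank-based test across all groups
h_stat, p_kw = stats.kruskal(g1, g2, g3)

# Fisher's classic one-way ANOVA for comparison
f_stat, p_anova = stats.f_oneway(g1, g2, g3)

# Mann-Whitney: the related rank-sum test for two groups
u_stat, p_mw = stats.mannwhitneyu(g1, g3, alternative="two-sided")

print("Kruskal-Wallis:", h_stat, p_kw)
print("ANOVA:", f_stat, p_anova)
print("Mann-Whitney:", u_stat, p_mw)
```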

### Correlation $\neq$ Causation ###

<img src='images/everest.png', width=500>

<img src='images/correlation.png', width=500>
[source](http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/interpret-the-results/)


### $$ Variance: s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x}) $$ ###
- This cannot be negative

### $$ Covariance: s_{xy}^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y}) $$ ###
- Can be positive or negative

### $$ r_{xy} = \frac{s_{xy}^2}{s_x s_y} $$ ###
Where s is the standard deviation of the x or y values (depending on the subscript)
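
A short sketch that evaluates the variance, covariance and correlation coefficient directly from the sums above and checks them against NumPy. The data values are made up for illustration.

```python
import numpy as np

# Hypothetical data (assumption)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(x)

# Sample variances (always >= 0)
var_x = np.sum((x - x.mean()) * (x - x.mean())) / (N - 1)
var_y = np.sum((y - y.mean()) * (y - y.mean())) / (N - 1)

# Sample covariance (can be positive or negative)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (N - 1)

# Correlation coefficient: covariance divided by the product of standard deviations
r_xy = cov_xy / (np.sqrt(var_x) * np.sqrt(var_y))

print(var_x, cov_xy, r_xy)
print(np.var(x, ddof=1), np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
```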

__Varying degrees of correlation__

<img src='images/correlation_cases.png',width=600>
Figure: Bendat and Piersol

#### There may be a strong nonlinear relationship between variables, so a linear model may not be appropriate ####

<img src='images/correlation_cases2.png',width=600>

[source](http://wikipedia.org/wiki/Correlation_and_dependence)

__Parametric statistics (like Pearson's correlation) are sensitive to outliers__
<img src='images/outlier1.png',width=600>
[source: Stackexchange](http://stats.stackexchange.com/questions/11746/what-could-cause-big-differences-in-correlation-coefficient-between-pearsons-an)

__Testing Significance of linear Correlation__

$$ t = |r|\frac{\sqrt{N-2}}{\sqrt{1-r^2}} $$ 

Calculate the t statistic and compare it with the critical t value $t_{\nu, \alpha/2}$,
where $ \nu = N-2 $
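
A minimal sketch of this significance test on synthetic data, assuming a two-sided test at $\alpha = 0.05$. The p-value computed from the t statistic is compared with the one returned by `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N = 30
x = rng.normal(size=N)                       # synthetic data (assumption)
y = 0.8 * x + rng.normal(scale=0.5, size=N)

r, p_scipy = stats.pearsonr(x, y)

# t statistic for the correlation coefficient
t_stat = abs(r) * np.sqrt(N - 2) / np.sqrt(1 - r**2)

# Critical value with nu = N - 2 degrees of freedom
nu = N - 2
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, nu)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(t_stat, nu)

significant = t_stat > t_crit
print(r, t_stat, t_crit, significant)
print(p_manual, p_scipy)
```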



__Non-parametric test__
<img src='images/non_para_test.png',width=600>
[source: Minitab](http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/interpret-the-results/)

- Test for a relationship between the rank-ordered data (lowest value in x corresponds to lowest value in y, highest value in x corresponds to highest value in y, etc.)

- Rho = 1 if  y increases __monotonically__ with x

__Not useful for non-linear relationships that are not monotonic, like those below__:
<img src='images/bad_rank.png',width=600>
[source: jpktd.blogspot.com](http://jpktd.blogspot.com/2012/06/non-linear-dependence-measures-distance.html)
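
A small sketch of the rank-based correlation, assuming the statistic meant here is Spearman's rho (`scipy.stats.spearmanr`). The curves are synthetic: an exponential is nonlinear but monotonic, while a parabola is not monotonic.

```python
import numpy as np
from scipy import stats

x = np.linspace(-2.0, 2.0, 41)
y_monotonic = np.exp(x)  # nonlinear, but increases monotonically with x
y_parabola = x**2        # nonlinear and not monotonic

# Rank correlation is 1 for any monotonically increasing relationship
rho_mono, _ = stats.spearmanr(x, y_monotonic)

# Pearson's r is below 1 because the relationship is not linear
r_mono, _ = stats.pearsonr(x, y_monotonic)

# For the parabola the rank correlation is near zero despite a perfect dependence
rho_par, _ = stats.spearmanr(x, y_parabola)

print(rho_mono, r_mono, rho_par)
```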


__Test Cases__

<img src='images/test_cases.png', width=800>

__Other options__

Log transform t-test <br>
$H_0$: no difference between geometric means

$ G.M. = \sqrt[n]{x_1 x_2 \cdots x_n} $
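
A hedged sketch of the log-transform t-test. The geometric mean is the n-th root of the product of the values, which equals the exponential of the mean of the logs, so a t-test on log-transformed data compares geometric means. The lognormal samples and the helper name `geometric_mean` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
a = rng.lognormal(mean=1.0, sigma=0.5, size=50)  # synthetic positive data (assumption)
b = rng.lognormal(mean=1.5, sigma=0.5, size=50)

def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return np.exp(np.mean(np.log(values)))

gm_a = geometric_mean(a)
gm_b = geometric_mean(b)

# t-test on the log-transformed data
t_stat, p_value = stats.ttest_ind(np.log(a), np.log(b))

print(gm_a, gm_b, t_stat, p_value)
```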



### Linear regression (Type I) ###
<img src='images/linear_reg.png',width= 500>
source: Emery and Thomson, Fig. 3.11 <br>

__Type I regression minimizes the variance along one axis__
- x is the independent variable
- y is the dependent variable

__Type I regression assumes that the x variable is exactly known (error free)__

```python
from scipy import stats
help(stats.linregress)
```

Examples of Error Free Measurements:
- Time
- Chemical Standard
- Distance

<img src='images/sst_trends.png',width= 500>
[source: c2es](www.c2es.org)
Careful when choosing a subset!

__Linear Regression Assumptions__

<img src='images/regres_assumptions.png',width= 500>
[Source](http://matplotlib.org/examples/pylab_examples/anscombe.html)
- Validity of linear model

- Constant variance: same variance regardless of x value (homoscedastic)

- Independence of errors (errors are uncorrelated)

<br>
<br>
<br>

<img src='images/hetero.png',width= 500>

[source:Wikipedia](https://en.wikipedia.org/wiki/Heteroscedasticity)

__Minimizing the Sum of Squared Errors, SSE__
### $$SSE  = \sum_{i=1} ^N (y_i - \hat{y_i})^2  $$ ###

Where:
- $y_i$ = value of data point
- $\hat{y_i}$ = predicted y-value

__Regression slope__

$$ \hat{a} = \frac{ \sum_{i=1} ^N (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1} ^N (x_i - \bar{x})^2} $$

__Regression Intercept__
$$ \hat{b} = \bar{y} - \hat{a}\bar{x} $$
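
A minimal sketch that evaluates the slope and intercept formulas directly and compares them with `scipy.stats.linregress`. The data values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data; x is treated as error free (assumption)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

# Slope and intercept from the least-squares formulas
a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - a_hat * x.mean()

# Predicted values and the sum of squared errors that the fit minimizes
y_hat = a_hat * x + b_hat
sse = np.sum((y - y_hat) ** 2)

# Library fit for comparison
fit = stats.linregress(x, y)
print(a_hat, b_hat, sse)
print(fit.slope, fit.intercept)
```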

__Standard Error__
$$ Se = ( \frac{SSE}{N-2} )^{1/2} $$

__Confidence Intervals__

$$ \hat{a} - t_{\nu,\alpha /2} \frac{Se}{\sqrt{\sum_{i=1}^N (x_i-\bar{x})^2}} < a < \hat{a} + t_{\nu,\alpha /2} \frac{Se}{\sqrt{\sum_{i=1}^N (x_i-\bar{x})^2}} $$

where $ \nu = N-2 $
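
A hedged sketch of a confidence interval for the slope, assuming a 95% level. The half-width is the critical t value times the standard error of the slope, which is the standard error of the regression divided by the square root of $\sum (x_i - \bar{x})^2$. The data are made up, and the result is checked against the `stderr` attribute returned by `scipy.stats.linregress`.

```python
import numpy as np
from scipy import stats

# Hypothetical data (assumption)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])
N = len(x)

fit = stats.linregress(x, y)
y_hat = fit.slope * x + fit.intercept

# Standard error of the regression
sse = np.sum((y - y_hat) ** 2)
s_e = np.sqrt(sse / (N - 2))

# Standard error of the slope
se_slope = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))

# Two-sided 95% confidence interval with nu = N - 2
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, N - 2)
ci_low = fit.slope - t_crit * se_slope
ci_high = fit.slope + t_crit * se_slope

print(se_slope, fit.stderr)
print(ci_low, ci_high)
```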

__Type II regression__

Case where there are potentially errors in both the x and y variables

<img src='images/type_2_reg.png',width= 700>
Reference for Geometric Mean Function Regression (GMFR, a.k.a. neutral regression): <br>
Ricker, W. E. (1984). Computation and uses of central trend lines. Can. J. Zool., 62, 1897-1905.

__Calculating Geometric Mean__

- $\hat{a}_{yx}$ : slope of regression of y on x
- $\hat{a}_{xy}$ : slope of regression of x on y
- Geometric mean: $ \hat{a}_{GM} =   \sqrt{\frac {\hat{a}_{yx}} {\hat{a}_{xy}} }$
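
A minimal sketch of the geometric mean slope on synthetic data with noise in both variables. Two choices here are assumptions not stated in the notes: the sign of the slope is taken from the y-on-x regression, and the intercept is set so the line passes through the point of means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
truth = rng.uniform(0.0, 10.0, size=200)              # underlying variable (assumption)
x = truth + rng.normal(scale=1.0, size=200)           # x measured with error
y = 2.0 * truth + rng.normal(scale=2.0, size=200)     # y measured with error

a_yx = stats.linregress(x, y).slope  # slope of regression of y on x
a_xy = stats.linregress(y, x).slope  # slope of regression of x on y

# Geometric mean slope; sign taken from a_yx (assumption)
a_gm = np.sign(a_yx) * np.sqrt(a_yx / a_xy)

# Intercept through the point of means (assumption)
b_gm = y.mean() - a_gm * x.mean()

# The GM slope lies between the y-on-x slope and the inverted x-on-y slope
print(a_yx, a_gm, 1.0 / a_xy)
```

Algebraically the geometric mean slope reduces to the ratio of standard deviations, $s_y / s_x$, which gives a quick check on the result.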


__General Linear Regression - higher order polynomials__

```python
import numpy as np
help(np.polyfit)
```
<img src='images/poly_reg.png',width= 600>

[Source](http://www.turingfinance.com/regression-analysis-using-python-statsmodels-and-quandl/)
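
A short sketch of a second-order polynomial fit with `numpy.polyfit`. The quadratic coefficients and noise level are made up for illustration. The model is nonlinear in x but still linear in its coefficients, so ordinary least squares applies.

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(-3.0, 3.0, 25)

# Synthetic quadratic with small noise (assumption)
true_coeffs = np.array([0.5, -1.0, 2.0])  # highest power first
y = np.polyval(true_coeffs, x) + rng.normal(scale=0.1, size=x.size)

# Least-squares fit of a degree-2 polynomial; coefficients are highest power first
coeffs = np.polyfit(x, y, deg=2)
y_fit = np.polyval(coeffs, x)

print(coeffs)
```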

__Nonlinear least squares__<br>
__Iterative procedure__: parameters of the function are adjusted until the fit to data does not get any better <br>
See: http://www.turingfinance.com/regression-analysis-using-python-statsmodels-and-quandl/

__Fitting an Exponential curve__ (See Glover, Jenkins, and Doney Fig 3.9):

$ P = P_o e^{\mu t} $

or

$ P =P(0) + P_o e^{\mu t} $


```python
from scipy.optimize import curve_fit
help(curve_fit)
```
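
A minimal sketch of fitting the first exponential form, $P = P_o e^{\mu t}$, with `scipy.optimize.curve_fit`. The true parameter values, noise level and starting guess are assumptions for illustration. Because the procedure is iterative, a reasonable initial guess `p0` helps it converge.

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential(t, p_o, mu):
    """Exponential growth model P = P_o * exp(mu * t)."""
    return p_o * np.exp(mu * t)

rng = np.random.default_rng(9)
t = np.linspace(0.0, 5.0, 40)

# Synthetic observations with additive noise (assumption)
p_obs = exponential(t, 2.0, 0.6) + rng.normal(scale=0.2, size=t.size)

# Iterative nonlinear least-squares fit starting from an initial guess
popt, pcov = curve_fit(exponential, t, p_obs, p0=(1.0, 0.5))

# Approximate one-sigma parameter uncertainties from the covariance matrix
perr = np.sqrt(np.diag(pcov))

print(popt, perr)
```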
