# Choosing the Correct Statistical Test

References:<br>
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/<br>
http://www-users.cs.umn.edu/~ludford/stat_guide.html<br>
http://wise.cgu.edu/

___

# Short answer

  &nbsp; | Categorical Dependent Variable | Continuous Dependent Variable
  ------------- | ------------- | -------------
  **Categorical<br>Independent Variable** | Chi square | t-test or ANOVA
  **Continuous<br>Independent Variable** | LDA or QDA | Regression


# Long answer

<br>
___
# 0 Independent variables
___

Comparing a dataset against a hypothesized value (e.g., average age of people is 30 years old).

##  1 continuous dependent variable with normal distribution
### One sample t-test

Null hypothesis is that the mean value of the dataset is equal to the test/hypothesized value ($\mu_0$).
$$ H_0: \mu = \mu_0 $$
$$ H_A: \mu < \mu_0 \ \text{ or } \ \mu > \mu_0 \ \text{ or } \ \mu \ne \mu_0 $$

The test determines if there is a statistically significant difference between $\mu$ and $\mu_0$. 

This requires calculating the sample mean ($\bar{x}$), the sample standard deviation ($s$), the degrees of freedom ($df$), the corresponding t-value, and the p-value, which is then compared with the predetermined significance level (most commonly $\alpha=0.05$).

If the p-value calculated from the t-distribution at the given t-value is smaller than the selected significance level (e.g., $p<0.05$), then the null hypothesis can be rejected in favor of the alternative hypothesis.

Important calculations,

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \ , \quad df = n-1 \ , \quad t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \ , \quad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

Typically, $p$ is found for a given t-distribution in a table or computer software as it requires integrating the t-distribution to find the probability from the area under the curve.

If the alternative hypothesis includes $\ne$, then a ***two-tail test*** is used where the significance level is split at the two extremes of the probability density curve. Otherwise for $<$ or $>$, a ***one-tail test*** is used where the significance level is all at the left or right side of the curve, respectively.

***This test is not appropriate if the number of data points is below 30 ($n<30$) and the data is not normally distributed.*** With larger values of $n$, the test is more robust to departures from normality.

In Python using scipy
```
scipy.stats.ttest_1samp(data, popmean, axis=0, nan_policy='propagate')
```
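As a rough illustration of the calculations above, the sketch below computes $\bar{x}$, $s$, $df$, $t$ and the two-tail p-value by hand and compares them with `scipy.stats.ttest_1samp`. The `ages` sample and the variable names are made up for this example; only $\mu_0 = 30$ comes from the text.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of ages (illustrative data, not from a real study)
ages = np.array([31.0, 28.5, 35.2, 29.9, 33.1, 27.4, 36.0, 30.8, 32.5, 34.3])
mu_0 = 30.0                       # hypothesized mean from the text

n = ages.size
x_bar = ages.mean()               # sample mean
s = ages.std(ddof=1)              # sample standard deviation (n - 1 in the denominator)
df = n - 1                        # degrees of freedom
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Two-tail p-value: area in both tails of the t-distribution beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# The same test using scipy
t_scipy, p_scipy = stats.ttest_1samp(ages, mu_0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

For a one-tail test with $H_A: \mu > \mu_0$, the p-value would instead be `stats.t.sf(t_manual, df)`.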

## 1 continuous dependent variable with non-normal distribution

### One sample median test

This test is non-parametric (it does not assume the data follow a specific distribution), so it doesn't have the same requirements as the t-test. Typically a ***one-sample Wilcoxon Signed Rank Test*** is used.
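One possible way to run the one-sample version in Python is to subtract the hypothesized median from the data and pass the differences to `scipy.stats.wilcoxon`. The `waits` sample and the hypothesized median `m_0` below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample (illustrative values only)
waits = np.array([1.2, 0.7, 3.6, 2.3, 7.9, 0.3, 4.5, 1.9, 12.3, 2.9, 1.0, 5.7])
m_0 = 2.0                          # hypothesized median (assumption for this sketch)

# With a single array, wilcoxon tests whether the values are
# symmetrically distributed about zero, so we pass data - m_0
stat, p_value = stats.wilcoxon(waits - m_0)

print("statistic:", stat)
print("p-value:  ", p_value)
```

For the default two-sided alternative, the reported statistic is the smaller of the positive and negative rank sums.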

## 1 binary categorical dependent variable

### Binomial test

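For $n$ success/failure trials with $k$ observed successes, this tests whether the underlying probability of success $p$ equals a hypothesized value $p_0$. The p-value is the probability, assuming $H_0$ is true, of observing a number of successes at least as extreme as $k$ (in either direction for the two-sided test, or greater/less for the one-sided tests).
$$ H_0: p = p_0 $$
$$ H_A: p \ne p_0 $$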

In Python using scipy,
```
# binomtest replaces binom_test (removed in SciPy 1.12); alternative may also be 'greater' or 'less'
scipy.stats.binomtest(k, n, p=0.5, alternative='two-sided')
```

For example, consider observing $k$ heads in $n$ coin flips. For a fair coin there are only two outcomes and each is equally likely ($p=0.5$ for heads). The binomial test determines whether observing $k$ heads is plausible under that assumption. If the null hypothesis is rejected at the pre-determined significance level, this can be interpreted as evidence that the coin is not fair.
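The coin example can be sketched directly from the binomial distribution, which avoids depending on a particular SciPy version of the test function. The counts below (16 heads in 20 flips) are hypothetical, and the two-sided rule used here (sum the probabilities of all outcomes no more likely than the observed one) is one common convention.

```python
import numpy as np
from scipy import stats

n, k, p_0 = 20, 16, 0.5            # hypothetical: 16 heads in 20 flips, fair coin under H_0

# Probability of each possible number of heads (0..n) under H_0
outcomes = np.arange(n + 1)
pmf = stats.binom.pmf(outcomes, n, p_0)

# Two-sided p-value: total probability of outcomes that are no more
# likely than the observed one (small tolerance guards against round-off)
p_two_sided = pmf[pmf <= pmf[k] * (1 + 1e-7)].sum()

# One-sided p-value for H_A: p > p_0, i.e. P(X >= k)
p_greater = stats.binom.sf(k - 1, n, p_0)

alpha = 0.05
print("two-sided p:", p_two_sided, "reject H_0:", p_two_sided < alpha)
print("one-sided p:", p_greater)
```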

## 1 categorical dependent variable

### Chi square goodness of fit test (Pearson's)

This is similar to the binomial test, except that the dependent variable can have more than two categories. ***This test will determine if the observed proportions of each category differ significantly from the hypothesized proportions.***

For example, testing the number of people over age 65 in a jury vs the proportion in the local population. This example includes 2 groups (below 65, and 65 or above), but the test could also be used on multiple intervals (less than 35, 35-45, above 45, etc).

In other words, it tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events (or categories) must be mutually exclusive and have total probability 1. 

A simple example is testing the outcome from a six-sided die to determine if the die is fair (all 6 outcomes equally likely to occur).

This assumes the observations are independent of each other (i.e., unpaired) and that each observation falls into exactly one category.

Here the test statistic $\chi^2$ follows the chi-squared distribution (*from wikipedia*):
<img src="https://upload.wikimedia.org/wikipedia/commons/8/8e/Chi-square_distributionCDF-English.png" width="350">

#### Procedure
1. Calculate the test statistic $\chi^2$
$$ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $$
1. Determine the degrees of freedom $df = n-p$ where $n$ is the number of outcomes/categories and $p=s+1$ where $s$ is the number of parameters in the distribution (s=2 for normal, mean and standard deviation; s=0 for discrete uniform, no parameters)
1. Select a significance level $\alpha$ ($\alpha = 0.05$ typically)
1. Compare the calculated $\chi^2$ with its critical value from the chi-squared distribution with the appropriate $df$ (a one-sided comparison, since we only check whether $\chi^2$ is greater than the critical value)
1. Reject the null hypothesis that the observed frequency distribution matches the theoretical distribution if the test statistic exceeds the critical value of $\chi^2$, otherwise fail to reject it, ***or*** report the corresponding p-value and compare it with the pre-determined significance level.

In Python using scipy,
```
scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)
```
This returns the calculated chi-square and the corresponding p-value, which can be compared with the desired significance level. By default (`f_exp=None`), the test assumes all categories are equally likely (i.e., they come from a discrete uniform distribution).

A rule of thumb states that the expected count in each category should be at least 5 for this test to be valid.

A good example can be found at: https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#Examples.
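The die example and the procedure above can be sketched as follows. The roll counts are hypothetical; since a fair die has no fitted parameters, $s=0$ and $df = 6 - 1 = 5$.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for faces 1..6 from 120 rolls
observed = np.array([25, 17, 15, 23, 24, 16])
expected = np.full(6, observed.sum() / 6)      # fair die: 20 expected per face

# Step 1: test statistic
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Step 2: degrees of freedom (n - p with p = s + 1 and s = 0)
df = observed.size - 1

# Steps 3-4: critical value at alpha = 0.05 and the p-value
alpha = 0.05
critical = stats.chi2.ppf(1 - alpha, df)
p_manual = stats.chi2.sf(chi2_manual, df)

# Same test via scipy (uniform expected frequencies by default)
chi2_scipy, p_scipy = stats.chisquare(observed)

# Step 5: decision
print("chi2:", chi2_manual, "critical:", critical)
print("reject H_0 (die is fair):", chi2_manual > critical)
```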

<br>
___
# 1 Independent binary categorical variable
___
Comparing two samples such as in an experiment, control vs. treatment groups or male vs. female.

## 1 continuous dependent variable with normal distribution

### Two sample independent t-test

This t-test assesses whether the means of two groups/categories of data are statistically different.

It is very similar to the one-sample t-test with a few exceptions:
1. The t-statistic is calculated from,
$$ t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$

1. And the degrees of freedom are,
$$ df = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)} $$

The calculation for the t-statistic is appropriate for equal or unequal sample sizes and unequal variances between the two independent categories. This variation is known as ***Welch's t-test***.

In Python using scipy,
```
# equal_var=False gives Welch's t-test; the default equal_var=True pools the variances
scipy.stats.ttest_ind(a, b, axis=0, equal_var=False, nan_policy='propagate')
```

Or if only the descriptive statistics are known,
```
scipy.stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, equal_var=False)
```
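To make the connection between the formulas and scipy concrete, here is a small sketch of Welch's t-test computed by hand. The `control` and `treatment` samples are invented; note that `equal_var=False` must be passed to scipy to get Welch's version.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two independent groups
control = np.array([5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1])
treatment = np.array([6.8, 7.4, 5.9, 7.1, 8.2, 6.5, 7.7, 9.0, 7.3, 6.9])

n1, n2 = control.size, treatment.size
v1 = control.var(ddof=1) / n1      # s_1^2 / n_1
v2 = treatment.var(ddof=1) / n2    # s_2^2 / n_2

# Welch's t-statistic and degrees of freedom
t_manual = (control.mean() - treatment.mean()) / np.sqrt(v1 + v2)
df_manual = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Two-tail p-value from the t-distribution (df need not be an integer)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

t_scipy, p_scipy = stats.ttest_ind(control, treatment, equal_var=False)

print("manual:", t_manual, df_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```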


### Paired t-test

For ***paired samples*** (e.g., experiments looking at before/after treatment values to find a statistically significant difference), the t-statistic is calculated from,
$$ t = \frac{\bar{x}_D - \mu_0}{s_D / \sqrt{n}} $$

where the $D$ subscript represents the difference between pairs, so the mean and standard deviation are calculated from the paired differences and not the measured values. Here the degrees of freedom is the same as the one-sample t-test $df = n-1$.

The paired test can help reduce the influence of confounding variables. For an explanation see wikipedia: https://en.wikipedia.org/wiki/Paired_difference_test#Use_in_reducing_confounding.

When there are more than two paired conditions (e.g., more than two time points), the *repeated measures ANOVA* test can be applied.

In Python using scipy,
```
scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate')
```
where `a` and `b` are the two equal-length variables before calculating the paired difference.
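A short sketch, using made-up before/after values, showing that the paired t-test is the same as a one-sample t-test on the paired differences with $\mu_0 = 0$:

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements on the same 8 subjects
before = np.array([140.0, 152.0, 138.0, 147.0, 160.0, 155.0, 149.0, 143.0])
after = np.array([135.0, 150.0, 139.0, 140.0, 151.0, 150.0, 147.0, 138.0])

d = after - before                           # paired differences
n = d.size
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # mu_0 = 0
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)      # two-tail, df = n - 1

t_rel, p_rel = stats.ttest_rel(after, before)        # paired test
t_one, p_one = stats.ttest_1samp(d, 0.0)             # one-sample test on differences

print("manual:   ", t_manual, p_manual)
print("ttest_rel:", t_rel, p_rel)
```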

## 1 continuous dependent variable with non-normal distribution

### Wilcoxon signed ranks test

This is a non-parametric type of paired difference test that is appropriate for non-normal population distributions and small sample sizes.

It requires that the data are paired and each pair is chosen randomly and independently.

#### Procedure

Let $N$ be the number of pairs and $2N$ the number of data points, $x_{1,i}$ and $x_{2,i}$ for $i=1,...,N$.

The hypotheses are,

$H_0:$ *difference between pairs follows a symmetric distribution about zero*<br>
$H_A:$ *difference between pairs does not follow a symmetric distribution about zero*

1. For all N pairs, calculate $|x_{2,i}-x_{1,i}|$ and $sgn(x_{2,i}-x_{1,i})$ where $sgn$ is the sign function ($ x = sgn(x)\cdot|x|$)
1. Exclude pairs where the difference is zero, let $N_r$ be the reduced pair sample size
1. Order the remaining $N_r$ pairs in ascending order by $|x_{2,i}-x_{1,i}|$
1. Rank the pairs $R_i$ with the smallest as 1. Ties receive a rank equal to the average of the ranks they span.
1. Calculate the test statistic $$W = \sum_{i=1}^{N_r} \ [sgn(x_{2,i}-x_{1,i})\cdot R_i]$$
1. The distribution of $W$ is not easily described, so look up the critical value in a reference table for the pre-determined significance level. For the two-tail test, reject $H_0$ if $|W| \ge W_{critical, N_r}$.
1. As $N_r$ increases, the sampling distribution of $W$ approaches a normal distribution so a typical z score can be computed where,
$$ z = \frac{W}{\sigma_W} \ , \quad \sigma_W = \sqrt{\frac{N_r(N_r+1)(2N_r+1)}{6}} $$
Then compare $z$ with $z_{critical}$ for the appropriate significance level at one or two tail test or calculate the p-value.

An ***effect size*** (quantitative measure of the strength of a phenomenon, similar to coefficient of determination) can be calculated from the ***rank correlation***, $r=W/S$, where $S$ is the sum of the ranks for the $N_r$ data pairs.

In Python using scipy,
```
scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False)
```
For this application, the $W$ distribution is assumed to be close to normal, so the rule of thumb is to require the number of pairs to be > 20.
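The procedure above can be followed step by step with numpy, as in the sketch below. The paired values are illustrative only. Note that this computes the signed-rank sum $W$ as defined in this section, whereas `scipy.stats.wilcoxon` reports the smaller of the positive and negative rank sums for the two-sided test, so the two statistics are not expected to be equal.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (10 pairs)
x1 = np.array([125, 115, 130, 140, 140, 115, 140, 125, 140, 135])
x2 = np.array([110, 122, 125, 120, 140, 124, 123, 137, 135, 145])

d = x2 - x1
d = d[d != 0]                       # exclude pairs with zero difference
n_r = d.size                        # reduced number of pairs

# Rank |d| in ascending order; tied values get the average rank
ranks = stats.rankdata(np.abs(d))

# Signed-rank sum W
w = np.sum(np.sign(d) * ranks)

# Normal approximation (only reasonable for larger n_r)
sigma_w = np.sqrt(n_r * (n_r + 1) * (2 * n_r + 1) / 6)
z = w / sigma_w
p_two_tail = 2 * stats.norm.sf(abs(z))

# Effect size: rank correlation r = W / S
r = w / ranks.sum()

print("N_r:", n_r, "W:", w, "z:", z, "p:", p_two_tail, "r:", r)
```

With only 9 usable pairs, the normal approximation here is shown for illustration; a table of critical values would be the safer choice at this sample size.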

One **example** for using this test would be analyzing paired survey results from a multiple-point Likert scale (treated as an interval scale) to questions such as "I am comfortable using this product".

## 1 categorical dependent variable

### Chi square test for independence

This technique ***tests the null hypothesis that there is no statistical difference between the effect of categories in the independent variable*** on the frequency of occurrence of the categories in the dependent variable. The alternative hypothesis is that the independent and dependent variables do have a relationship.

For example, we could measure the percentage of left-handed men and compare it with the percentage of left-handed women. I would expect the null hypothesis to not be rejected, which would be consistent with gender having no effect on left-handedness.

This test can also be applied when there are ***more than two possible categories in the independent variable***. However, as a rule of thumb it requires the expected count in each cell of the contingency table to be at least 5.

When applying this test, it is also possible to use a *two sample independent t-test*, where the categorical dependent variable is expressed in proportions (50% in category A, 50% in category B, etc).

If there are more than two categories in the independent variable, another appropriate choice would be to apply *ANOVA* and again express the data in terms of proportions.

An example from wikipedia that tests for independence (equal occurrence of independent variable categories in each dependent variable category): https://en.wikipedia.org/wiki/Chi-squared_test#Example_chi-squared_test_for_categorical_data

In this test, if the $\chi^2$ value exceeds the critical value for the pre-determined level of significance, or equivalently the p-value is below the significance level, then the null hypothesis of no relationship (independence) can be rejected.

In Python with scipy,
```
scipy.stats.chi2_contingency(observed, correction=True, lambda_=None)
```
Here the `observed` argument is an RxC (simplest case is 2x2) contingency table containing observed frequencies.
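As an illustration, the handedness example might look like the sketch below. The counts are invented. `correction=False` is passed so that the result matches the plain $\chi^2$ formula; by default scipy applies Yates' continuity correction to 2x2 tables.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of observed frequencies
#                     left-handed  right-handed
observed = np.array([[9, 43],      # men
                     [4, 44]])     # women

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals @ col_totals / observed.sum()
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print("expected counts:\n", expected)
print("chi2:", chi2, "dof:", dof, "p:", p)
```

All expected counts in this made-up table are above 5, so the rule of thumb is satisfied.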


### Fisher's exact test

This test is valid for all sample sizes, including small ones where a category has < 5 entries. It is exact: unlike the chi-square test, it does not rely on an approximation that only becomes exact as the sample size approaches infinity.

There is no test statistic or degrees of freedom here; the p-value is calculated directly and compared with the predetermined significance level (typically $\alpha = 0.05$). The probability of obtaining any set of values (a,b,c,d) for two-level independent and dependent variables, given fixed row and column totals, is given by the ***hypergeometric distribution***. This is what allows exact determination of the p-value.

In Python using scipy a 2x2 contingency table can be analyzed,
```
scipy.stats.fisher_exact(table, alternative='two-sided')
```

See wikipedia for an example: https://en.wikipedia.org/wiki/Fisher%27s_exact_test#Example
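The link to the hypergeometric distribution can be sketched as follows. The 2x2 counts are illustrative, and the two-sided rule used in the manual calculation (sum the probabilities of all tables no more likely than the observed one) is one common convention.

```python
import numpy as np
from scipy import stats

# Illustrative 2x2 table [[a, b], [c, d]] with small cell counts
table = np.array([[1, 9],
                  [11, 3]])

odds_ratio, p_value = stats.fisher_exact(table, alternative='two-sided')

# With row and column totals fixed, the top-left cell a follows a
# hypergeometric distribution
a = table[0, 0]
row1 = table[0].sum()              # a + b
col1 = table[:, 0].sum()           # a + c
total = table.sum()

support = np.arange(max(0, row1 + col1 - total), min(row1, col1) + 1)
probs = stats.hypergeom.pmf(support, total, col1, row1)
p_observed = stats.hypergeom.pmf(a, total, col1, row1)

# Two-sided p-value: sum over all tables no more likely than the observed one
p_manual = probs[probs <= p_observed * (1 + 1e-7)].sum()

print("scipy p:", p_value, "manual p:", p_manual)
```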

<br>
___
# 1 Independent multi-valued categorical variable

___

## 1 continuous dependent variable with normal distribution

### One-way ANOVA

The null hypothesis ($H_0$) tested by one-way ANOVA is that two or more population means are equal. A statistically significant test indicates that observed data sampled from each of the populations would be unlikely if the null hypothesis were true.

#### Assumptions

The data from each independent category must be:
1. Random and independent
1. Normally distributed
1. Equal variances

The last two assumptions are relaxed for large samples.

In Python using scipy,
```
scipy.stats.f_oneway(*args)
```
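A minimal sketch of what `f_oneway` computes, using three invented groups: the F-statistic is the ratio of the between-group mean square to the within-group mean square.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three independent categories
groups = [
    np.array([4.1, 5.0, 4.7, 5.3, 4.9]),
    np.array([5.8, 6.1, 5.5, 6.4, 6.0, 5.7]),
    np.array([4.8, 5.2, 5.9, 5.1, 5.4]),
]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k = len(groups)                    # number of groups
n_total = all_data.size            # total number of observations

# Sums of squares between and within groups
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

A significant result only says that at least one mean differs; it does not identify which one.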

### One-way repeated measures ANOVA (rANOVA)

Here, the independent variable represents measurements collected under a variety of conditions or at several different time points from the same group. 

One example would be collecting blood pressure at a variety of time points from the same group of people, or perhaps in a taste test for different types of cake being sampled by the same group of people.

SciPy does not provide a repeated measures ANOVA. One possibility is the `AnovaRM` class in statsmodels (`statsmodels.stats.anova.AnovaRM`); another is to use [rpy2](http://rpy2.bitbucket.org/) to call the R-package for this method.

<br>
___
# 2 Independent multi-valued categorical variables
___

<br>
## 1 continuous dependent variable with normal distribution

### Two-way ANOVA

This test will assess the ***main effect*** of each independent variable (its effect on the DV averaging across the levels of other IVs) and the ***interaction*** between independent variables. An interaction between two independent variables indicates that their combined effect on the dependent variable is not strictly additive: the effect of one independent variable depends on the level of the other.
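The idea of main effects and interaction can be illustrated with a table of cell means for a 2x2 design. The numbers and factor names (A, B) below are hypothetical, and this sketch only shows the arithmetic behind the concepts; it is not a significance test.

```python
import numpy as np

# Hypothetical mean of the DV in each cell
# rows = levels of factor A, columns = levels of factor B
cell_means = np.array([[10.0, 14.0],
                       [12.0, 22.0]])

# Marginal means: average over the levels of the other factor
marginal_a = cell_means.mean(axis=1)       # one value per level of A
marginal_b = cell_means.mean(axis=0)       # one value per level of B

# Main effects: difference between the marginal means
main_effect_a = marginal_a[1] - marginal_a[0]
main_effect_b = marginal_b[1] - marginal_b[0]

# Effect of B at each level of A; if these differ, the effects are not additive
effect_of_b_at_each_a = cell_means[:, 1] - cell_means[:, 0]
interaction_contrast = effect_of_b_at_each_a[1] - effect_of_b_at_each_a[0]

print("main effect of A:", main_effect_a)
print("main effect of B:", main_effect_b)
print("effect of B at each level of A:", effect_of_b_at_each_a)
print("interaction contrast:", interaction_contrast)
```

A contrast of zero would mean the effect of B is the same at both levels of A (purely additive effects); two-way ANOVA tests whether a non-zero contrast like this one is larger than expected from sampling noise.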

