# Intermediate - Statistical Tests & Regression

Intermediate Section Activities (Statistical Testing / Regression):
- Statistical libraries
    * Scipy, statsmodels
- Stats simulations
    * Use sims to determine desired sample size for some power of test
    * Alternatively evaluate power based on given sample size
- Create user defined functions
    * Write function for simulation
- Statistical Tests
    * Test for statistically significant difference in 1 gene between two cancer types
        * Do this for a specific gene identified as important for treatment
    * ANOVA on one gene versus all other cancer types
- Regression using library for 2 cancer types for specific gene(s)
    * Linear / Logistic regression


# Overview

This is the intermediate-level notebook for the Data Science (DS) and Machine Learning (ML) FredHutch.io tutorial, in which we work from beginning to end through different aspects and techniques of DS for research and analysis.

In this notebook we will work through the process of data analysis for the [gene expression cancer RNA-Seq Data Set](https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq#). **We will be using some findings from the Beginner Tutorial Notebook.**

In this intermediate notebook we will focus specifically on statistical testing and regression models in **python**. We will keep working with the *python libraries* introduced in the Beginner Tutorial and introduce some new libraries built specifically for statistics.
> **Libraries Used in This Tutorial**
* Data Manipulation and Processing
    - [pandas](https://pandas.pydata.org/)
    - [numpy](https://numpy.org/)
* Data Visualization
    - [Matplotlib](https://matplotlib.org/)
    - [Seaborn](https://seaborn.pydata.org/)
    - [Altair](https://altair-viz.github.io/)
* Statistics
    - [Scipy](https://www.scipy.org/)
    - [Statsmodels](https://www.statsmodels.org/stable/index.html)

## Questions

In this notebook, we focus on identifying statistically significant differences in gene expression between cancer groups. We are also concerned with determining the statistical power of our experiment given the PANCAN data.

# Table of Contents

[1. Statistical Background](#1.-Statistical-Background)

* [1.1 Power Calculations](#1.1-Power-Calculations)


[2. Setup](#2.-Setup)

* [2.1 Importing Libraries](#2.1-Importing-Libraries)


## 1. Statistical Background

### 1.1 Power Calculations

When designing an experiment, one of the most important decisions is the sample size: too small and the experiment will not yield useful information, too large and we waste time and resources.

To answer the main questions in this notebook, and in research generally, we must decide which particular alternative *hypotheses*, *$H_{1}$*, are important to be able to detect with high ***power***.

In statistics, the **power** of a test is the probability of rejecting the null hypothesis when the alternative is true, so it reflects our control over the *type II* error rate:

> **Power = *P* (Reject *$H_{0}$* given that the alternative *$H_{1}$* holds)**  
Also written as  
**Power = 1 - *P* (Type II error) = 1 - $\beta$**

Power calculations are an important aspect of experimental design, because they tell us how likely a study is to detect an effect of a given size. They can also help us judge whether a non-significant result in a previous study may simply reflect a sample that was too small.

We can perform the calculations in a variety of ways:
* formulas
* simulations
* on-line calculators, *like this [one]( https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html)*
* commercial software

In this notebook we’ll work with both simulations and formulas. These formulas are based on our familiar assumptions such as:
> independence  
normality of errors  
constant variance 

so are often thought of as an initial rough calculation of power.

The formulas we will be using are then derived from the general formula for the Z test statistic
> $$
Z=\frac{\overline X - \mu_{0}}{\frac{\sigma}{\sqrt{n}}}
$$
$\overline X$ - sample mean  
$\mu_0$ - population mean (Null Hypothesis)  
$\sigma$ - standard deviation  
$n$ - sample size  
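
As a quick numeric check of the Z statistic, here is a minimal sketch using only the Python standard library (`scipy.stats.norm` offers the same normal-distribution functions). The function name `z_statistic` and the input numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu_0, sigma, n):
    """Z = (sample_mean - mu_0) / (sigma / sqrt(n)), with sigma assumed known."""
    return (sample_mean - mu_0) / (sigma / sqrt(n))


# Hypothetical inputs, not taken from the PANCAN data.
z = z_statistic(sample_mean=10.5, mu_0=10.0, sigma=2.0, n=64)

# Two-sided p-value from the standard normal distribution N(0, 1).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

With these inputs the standard error is 2 / 8 = 0.25, so Z = 2 and the two-sided p-value is about 0.046.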


Rearranging this formula, and replacing $Z$ with the quantile of the standard normal distribution, $N(0,1)$, that corresponds to the desired significance level $\alpha$, gives the power formulas below. The power of the test for a mean is _increased_ by:

1. Increasing the difference between the means under the
null and alternative hypotheses ($\mu_1 - \mu_0$).
2. Increasing the significance level ($\alpha$).
3. Decreasing the standard deviation ($\sigma$).
4. Increasing the sample size ($n$).
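
These effects can also be checked by simulation. The sketch below estimates the power of a two-sided Z test by repeatedly drawing samples under the alternative and counting how often $H_0$ is rejected. The function `simulate_power`, its parameters, and all numeric values are assumptions made for illustration. It uses the standard library's `random` and `statistics` modules, though `numpy` would be a natural substitute for larger simulations.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_power(mu_0, mu_1, sigma, n, alpha=0.05, n_sims=2000, seed=0):
    """Monte Carlo estimate of power for a two-sided Z test with known sigma."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Draw one sample of size n under the alternative (true mean mu_1).
        sample = [rng.gauss(mu_1, sigma) for _ in range(n)]
        z = (fmean(sample) - mu_0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    # Fraction of simulated experiments that rejected H0.
    return rejections / n_sims


# Hypothetical scenario: H0 mean 10, true mean 11, sigma 2.
estimates = {n: simulate_power(10.0, 11.0, 2.0, n) for n in (16, 32, 64)}
for n, power in estimates.items():
    print(f"n = {n:3d}: estimated power = {power:.3f}")
```

The estimated power should rise as $n$ goes from 16 to 64, in line with point 4. The same function can be rerun with a larger `mu_1`, a larger `alpha`, or a smaller `sigma` to check the other three points.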

#### 1.1.1 One-Sample Population Calculations

> $$
\hbox{Power} = P\left( N(0,1) < -Z_{1 - \alpha / 2} + \frac{|\mu_1 - \mu_0|}{\sigma / \sqrt{n}} \right) = \Phi\left( -Z_{1 - \alpha / 2} + \frac{|\mu_1 - \mu_0|}{\sigma / \sqrt{n}} \right),
$$
where $\Phi$ is the cdf of the $N(0,1)$ distribution and $Z_{1 - \alpha / 2}$ is its $(1 - \alpha/2)$ quantile.
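
A minimal sketch of this power formula in Python follows. It uses `statistics.NormalDist` from the standard library for $\Phi$ and the normal quantile (`scipy.stats.norm.cdf` and `scipy.stats.norm.ppf` are the SciPy equivalents). The function name `power_one_sample` and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sample(mu_0, mu_1, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample Z test with known sigma."""
    std_normal = NormalDist()  # N(0, 1)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # Z_{1 - alpha/2}
    shift = abs(mu_1 - mu_0) / (sigma / sqrt(n))  # standardized effect
    return std_normal.cdf(-z_crit + shift)


# Hypothetical scenario: H0 mean 10, alternative mean 11, sigma 2, n = 32.
power = power_one_sample(mu_0=10.0, mu_1=11.0, sigma=2.0, n=32)
print(f"Power = {power:.3f}")
```

For these illustrative values the power comes out at roughly 0.81.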

The sample size required to achieve power $1-\beta$ is:
> $$
n = \frac{ \sigma^2 (Z_{1 - \beta} + Z_{1 - \alpha / 2})^2}{ (\mu_1 - \mu_0)^2 }.
$$
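
The sample size formula can be sketched the same way. The function name `sample_size_one_sample` and the inputs are assumptions for illustration; the result is rounded up because a sample size must be a whole number.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_sample(mu_0, mu_1, sigma, power=0.80, alpha=0.05):
    """Smallest n giving at least the requested power for a two-sided Z test."""
    std_normal = NormalDist()  # N(0, 1)
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # Z_{1 - alpha/2}
    z_beta = std_normal.inv_cdf(power)  # Z_{1 - beta}, since power = 1 - beta
    n = sigma**2 * (z_beta + z_alpha) ** 2 / (mu_0 - mu_1) ** 2
    return ceil(n)  # round up to a whole observation


# Hypothetical scenario: detect a shift from 10 to 11 with sigma = 2.
n_required = sample_size_one_sample(mu_0=10.0, mu_1=11.0, sigma=2.0)
print(f"Required sample size: {n_required}")
```

With these values the formula gives about 31.4, which rounds up to 32 observations.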

#### 1.1.2 Comparing Two Samples Calculations

When comparing two samples, we consider the test of $H_0:\mu_A=\mu_B$ versus
$H_1:\mu_A\neq\mu_B$, where $\mu_A$ and $\mu_B$ are the
means of two populations. Assuming known population
variances $\sigma_A^2$ and $\sigma_B^2$ and sample sizes $n_A$ and $n_B$,
the test statistic is
> $$
Z=\frac{|\bar X_A - \bar X_B|}{\sqrt{\sigma_A^2/n_A+\sigma_B^2/n_B} },
$$

As a result, the power formula and, for equal group sizes $n_A = n_B = n$, the per-group sample size formula become
> $$
\hbox{Power} = \Phi\left( -Z_{1 - \alpha / 2} + \frac{|\Delta|}{\sqrt{\sigma_A^2/n_A+\sigma_B^2/n_B}} \right),
$$  
$$
n = \frac{ (\sigma_A^2+\sigma_B^2) (Z_{1 - \beta} + Z_{1 - \alpha/2})^2}{ \Delta^2 }.
$$

where $|\Delta|=|\mu_A - \mu_B|$ is the absolute difference between the two population means.
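
The two-sample formulas can be sketched as follows, again with `statistics.NormalDist` standing in for `scipy.stats.norm`. The function names and all numeric inputs are hypothetical, and the sample size function assumes equal group sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def power_two_sample(delta, sigma_a, sigma_b, n_a, n_b, alpha=0.05):
    """Approximate power of a two-sided two-sample Z test with known variances."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    std_error = sqrt(sigma_a**2 / n_a + sigma_b**2 / n_b)
    return STD_NORMAL.cdf(-z_crit + abs(delta) / std_error)


def n_per_group(delta, sigma_a, sigma_b, power=0.80, alpha=0.05):
    """Per-group sample size for the requested power, assuming n_A = n_B."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    n = (sigma_a**2 + sigma_b**2) * (z_beta + z_alpha) ** 2 / delta**2
    return ceil(n)  # round up to a whole observation


# Hypothetical scenario: difference in means of 1, both standard deviations 2.
n = n_per_group(delta=1.0, sigma_a=2.0, sigma_b=2.0)
achieved = power_two_sample(1.0, 2.0, 2.0, n_a=n, n_b=n)
print(f"n per group = {n}, achieved power = {achieved:.3f}")
```

Here the formula gives about 62.8, so 63 observations per group, and plugging that back into the power formula returns a value just above the 0.80 target.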

## 2. Setup



### 2.1 Importing Libraries
