### DNA Microarray
DNA microarrays (DNA chips) are used to measure the expression levels of a large number of genes simultaneously.

##### Gene expression:
Represents the activity of a gene in the cell.
##### How to get gene expression?
* Cellular functions are mainly determined by the synthesized proteins, so measuring protein activity (expression levels) is the most direct way to describe gene expression. Microarrays instead measure mRNA abundance, which serves as a proxy for that activity.


#### What can we do with Microarray data?
* Clustering: grouping homogeneous genes that share common characteristics.
* Classification: predicting gene function.
* Inferring differentially expressed genes (DEGs): genes whose activity changes under different conditions, as in the case of a diseased tissue.

Most of these analyses require formulating statistical hypotheses and testing them, so that each result or conclusion is backed by a measure of statistical confidence.
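As a minimal sketch of how DEG inference might look in code, the block below simulates a small expression matrix (genes in rows, samples in columns) and runs one unpaired t-test per gene. The matrix, the group sizes, the shifted genes, and the 0.05 threshold are all assumptions made for illustration, not real microarray data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_genes, n_per_group = 100, 8   # hypothetical sizes, chosen for illustration
healthy = rng.normal(loc=8.0, scale=1.0, size=(n_genes, n_per_group))
diseased = rng.normal(loc=8.0, scale=1.0, size=(n_genes, n_per_group))
diseased[:5] += 5.0             # simulate strong up-regulation of the first 5 genes

# One unpaired t-test per gene: axis=1 compares the samples within each row
t_stats, p_values = ttest_ind(diseased, healthy, axis=1)

alpha = 0.05                    # assumed significance threshold
deg_idx = np.flatnonzero(p_values < alpha)
print('Genes flagged as differentially expressed:', deg_idx)
```

When many genes are tested at once, some will fall below the threshold by chance alone, so in practice the p-values are usually adjusted with a multiple-testing correction before genes are called DEGs.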

<img src="1D-microarray.PNG" />

<img src="2D-microarray.PNG" />

##### Inferring differentially expressed genes (DEGs)
<img src="DEGS.PNG" />

## Hypothesis Testing
A hypothesis is an educated guess about something in the world around you. It should be testable, either by experiment or observation. 
Hypothesis examples:
* Is the Wnt pathway implicated in colorectal cancer?
* Is the AMACR gene correlated with the PTEN gene in tissues of prostate cancer?
* Is TP53 the cause of breast cancer?


<img src="Hypothesis_Testing_image.PNG" />

### Skewness of a graph:
<img src="skewness.jpg" /> 
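Skewness can also be estimated numerically. The sketch below applies `scipy.stats.skew` to two simulated samples, both made up for illustration: a symmetric normal sample and a right-skewed exponential one.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical samples: one symmetric, one with a long right tail
symmetric = rng.normal(loc=0.0, scale=1.0, size=5000)
right_skewed = rng.exponential(scale=1.0, size=5000)

skew_symmetric = skew(symmetric)
skew_right = skew(right_skewed)
print('skewness (normal sample):      %.3f' % skew_symmetric)
print('skewness (exponential sample): %.3f' % skew_right)
```

A value near 0 indicates a roughly symmetric sample, while a clearly positive value indicates a longer right tail. Strong skewness is a hint that the normality assumption behind the parametric tests may not hold.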
  

##### Parametric tests: may involve one, two, or more samples
* One-sample test: compares the sample mean with the population mean (t-test or z-test). The t-test is used when the population variance is unknown, the z-test when it is known.
* Two-sample test (paired or unpaired): for example, compares a diseased sample with a healthy ("normal") sample.
* Test for more than two samples (for example, one-way ANOVA).
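To make the z-test versus t-test distinction concrete, here is a hedged sketch of a one-sample z-test computed by hand. The sample values, the hypothesized mean `mu0`, and the known population standard deviation `sigma` are made-up numbers for illustration.

```python
import math
from scipy.stats import norm

sample = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.0]  # hypothetical measurements
mu0 = 5.0     # hypothesized population mean (assumed)
sigma = 0.3   # population standard deviation, assumed to be known

n = len(sample)
sample_mean = sum(sample) / n

# z = (sample mean - mu0) / (sigma / sqrt(n))
z = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Two-sided p-value from the standard normal distribution
p = 2 * norm.sf(abs(z))
print('z=%.3f, p=%.3f' % (z, p))
```

When `sigma` is not known, it is replaced by the sample standard deviation and the statistic is compared against a t distribution with `n - 1` degrees of freedom, which is what `ttest_1samp` does.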

## Example of a one-sample test:
### Calculate the T-test for the mean of ONE group of scores.

    

In [45]:
import math
import numpy as np
from scipy.stats import norm, ttest_1samp, ttest_ind, ttest_rel

np.random.seed(12)
# Generate 20 observations from a normal distribution with mean 5 and standard deviation 10
rvs = norm.rvs(loc=5, scale=10, size=20)
# Inputs: the data and the hypothesized population mean
t_stat, p = ttest_1samp(rvs, 5.0)
print('stat=%f, p=%.3f' % (t_stat, p))




stat=0.147644, p=0.884


<img src="One-Sample_T-test.png"/>

In [46]:
# Calculate the t-statistic manually
s = np.std(rvs, ddof=1)  # sample standard deviation (n - 1 in the denominator), as used by ttest_1samp
mean = np.mean(rvs)
n = rvs.size
t_stat_manual = (mean - 5.0) / (s / math.sqrt(n))
print('t_stat_manual=%f' % t_stat_manual)

t_stat_manual=0.147644


## Independent (unpaired) samples
### Calculate the T-test for the means of TWO INDEPENDENT samples of scores.
The null hypothesis is that the two independent samples have identical average (expected) values.

In [47]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
# By default, ttest_ind assumes that both populations have identical variances
t_stat, p = ttest_ind(data1, data2)
print('stat=%.3f, p=%.3f' % (t_stat, p))
# The t-test compares means, so the conclusion is stated in terms of means
if p > 0.05:
    print('Fail to reject H0: no evidence that the means differ')
else:
    print('Reject H0: the means probably differ')

stat=-0.326, p=0.748
Fail to reject H0: no evidence that the means differ
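
If the equal-variance assumption is doubtful, `ttest_ind` accepts `equal_var=False`, which performs Welch's t-test instead. The sketch below runs both variants on the same data. It is an illustrative variation, not part of the original analysis.

```python
from scipy.stats import ttest_ind

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# Pooled-variance (Student) t-test: the default, assumes equal variances
t_pooled, p_pooled = ttest_ind(data1, data2)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(data1, data2, equal_var=False)

print('pooled: stat=%.3f, p=%.3f' % (t_pooled, p_pooled))
print('Welch:  stat=%.3f, p=%.3f' % (t_welch, p_welch))
```

Because both samples have the same size here, the two statistics coincide. Only the degrees of freedom, and therefore the p-value, can differ slightly.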


## Paired two samples
### Calculate the t-test on TWO RELATED samples of scores, a and b.
The null hypothesis is that the two related (paired) samples have identical average (expected) values.

In [48]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
# ttest_rel pairs the observations by position, so both lists must have the same length
stat, p = ttest_rel(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Fail to reject H0: no evidence that the means differ')
else:
    print('Reject H0: the means probably differ')

stat=-0.334, p=0.746
Fail to reject H0: no evidence that the means differ
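
A paired t-test is equivalent to a one-sample t-test on the pairwise differences against a mean of zero. The short sketch below checks this on the same data. The variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

data1 = np.array([0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869])
data2 = np.array([1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169])

# Difference within each pair
diff = data1 - data2

# One-sample t-test on the differences, with H0: mean difference = 0
t_diff, p_diff = ttest_1samp(diff, 0.0)

# Paired t-test on the original samples
t_rel, p_rel = ttest_rel(data1, data2)

print('one-sample on differences: stat=%.3f, p=%.3f' % (t_diff, p_diff))
print('paired t-test:             stat=%.3f, p=%.3f' % (t_rel, p_rel))
```

Both lines print the same statistic and p-value, which shows that the paired test only looks at how each pair differs and not at the spread of the two samples individually.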
