In [1]:
import scipy.stats as t
import pandas as pd

# To run a hypothesis test, we start by formulating the null and alternative hypotheses
- H0 (null hypothesis): there is no association
- HA (alternative hypothesis): there is an association

# To compare two means and judge whether they come from the same distribution, we can run a t-test and compute the `t-statistic`. From the `t-statistic` and the degrees of freedom (df = n1 + n2 - 2 for two independent samples) we obtain the `p-value`: the probability of observing a difference between the two means at least this extreme, assuming H0 is true.
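As a rough illustration of where the `p-value` comes from, the sketch below computes the pooled two-sample t-statistic by hand and compares it with `scipy.stats.ttest_ind`. The arrays `x` and `y` are hypothetical values made up for this example, not the notebook's data. `scipy.stats` is imported here under the name `stats` so that the `t` alias used elsewhere in the notebook is left untouched.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, for illustration only (not the notebook's data)
x = np.array([100, 96, 101, 95, 103, 99], dtype=float)
y = np.array([97, 104, 102, 98, 105, 100], dtype=float)

n1, n2 = len(x), len(y)
dof = n1 + n2 - 2  # degrees of freedom for the pooled two-sample test

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / dof

# t-statistic: difference in means divided by its standard error
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value: probability of |T| >= |t_manual| when H0 is true
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# The library call should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_ind(x, y)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The two printed lines should match up to floating-point error, since `ttest_ind` uses the pooled (equal-variance) form by default.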

### For example, we can compare gene expression under two therapies, A and B, and test whether their means come from the same distribution.

In [2]:
df = pd.read_csv("../data/genetherapy.csv")
df.head()

Unnamed: 0,expr,Therapy
0,100,A
1,96,A
2,101,A
3,95,A
4,103,A


In [3]:
a = df[df['Therapy'] == "A"]['expr']
b = df[df['Therapy'] == "B"]['expr']

### Using `scipy.stats.ttest_ind` we calculate the `t-statistic` and the `p-value`

In [4]:
tstat, pval = t.ttest_ind(a, b)
pval

0.6204107442268376

### The `p-value` is greater than 0.05, so we fail to reject the null hypothesis: the data show no significant difference in mean expression between therapies A and B
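By default `scipy.stats.ttest_ind` assumes that both groups have equal variances. If that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below runs both versions on hypothetical, randomly generated samples; the names `sample_1` and `sample_2`, the group sizes, and the spreads are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with different sizes and spreads (not the notebook's data)
rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=100, scale=2, size=15)
sample_2 = rng.normal(loc=100, scale=8, size=25)

# Student's t-test: assumes equal variances in both groups (the default)
t_student, p_student = stats.ttest_ind(sample_1, sample_2)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(sample_1, sample_2, equal_var=False)

print("Student:", t_student, p_student)
print("Welch:  ", t_welch, p_welch)
```

When group sizes and variances are similar, the two tests give nearly the same answer; they diverge as the groups become more unbalanced.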

# Using `ANOVA` (Analysis of Variance) we can compare not only two groups but also run a single test across three or more groups.

### `scipy.stats.f_oneway` comes in handy for such tasks
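To make the test less of a black box, here is a minimal sketch of the one-way ANOVA calculation: the F-statistic is the ratio of between-group variance to within-group variance. The three small groups in `demo_groups` are hypothetical values invented for this example, and `scipy.stats` is imported as `stats` to avoid clashing with the notebook's `t` alias.

```python
import numpy as np
from scipy import stats

# Hypothetical groups, for illustration only (not the notebook's data)
demo_groups = [
    np.array([100.0, 96.0, 101.0, 95.0, 103.0]),
    np.array([98.0, 102.0, 99.0, 104.0, 97.0]),
    np.array([92.0, 94.0, 91.0, 95.0, 93.0]),
]

all_values = np.concatenate(demo_groups)
grand_mean = all_values.mean()
k = len(demo_groups)       # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: how far group means sit from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in demo_groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in demo_groups)

# F-statistic: ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value: upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom
p_manual_f = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy_f = stats.f_oneway(*demo_groups)
print(f_manual, p_manual_f)
print(f_scipy, p_scipy_f)
```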

In [5]:
df['Therapy'].value_counts()

A    15
B    15
C    15
D    15
Name: Therapy, dtype: int64

### Let's test whether the means of the 4 groups differ from each other

In [7]:
c = df[df['Therapy'] == "C"]['expr']
d = df[df['Therapy'] == 'D']['expr']
fstat, pval = t.f_oneway(a, b, c, d)

In [8]:
pval

0.00015249722895229536

### `pval` is smaller than 0.05, so we reject H0: at least one group mean differs from the others (ANOVA alone does not tell us which one)
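A significant ANOVA result is usually followed by a post-hoc comparison to find which groups differ. One simple option, sketched below, is pairwise t-tests with a Bonferroni correction (each raw p-value is multiplied by the number of comparisons). This is only one of several possible post-hoc procedures, and the data in `demo_groups` are hypothetical random samples, not the notebook's data.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical groups, for illustration only (not the notebook's data)
rng = np.random.default_rng(1)
demo_groups = {
    "A": rng.normal(100, 4, 15),
    "B": rng.normal(99, 4, 15),
    "C": rng.normal(94, 4, 15),
    "D": rng.normal(92, 4, 15),
}

# All unordered pairs of group labels: 6 pairs for 4 groups
pairs = list(combinations(demo_groups, 2))

adjusted = {}
for g1, g2 in pairs:
    _, p_raw = stats.ttest_ind(demo_groups[g1], demo_groups[g2])
    # Bonferroni correction: scale by the number of comparisons, cap at 1
    adjusted[(g1, g2)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    print(pair, round(p_adj, 4))
```

Pairs whose adjusted p-value falls below 0.05 would be reported as significantly different.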

# Hypothesis testing to check whether our distribution is normal

### The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

In [10]:
shapiro_test = t.shapiro(df['expr'])

In [11]:
shapiro_test

ShapiroResult(statistic=0.977378249168396, pvalue=0.32793858647346497)

### Our `p-value` is greater than 0.05, so we fail to reject H0: the test finds no evidence that the data deviate from a normal distribution.
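Because the t-test and ANOVA assume normality within each group, it can also be useful to run the Shapiro-Wilk test per group rather than on the pooled column. The sketch below does this with `groupby` on a hypothetical DataFrame named `demo`, which reuses the column names `expr` and `Therapy` but contains randomly generated values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data with the same column names as the notebook's DataFrame
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "expr": np.concatenate([rng.normal(100, 4, 15), rng.normal(95, 4, 15)]),
    "Therapy": ["A"] * 15 + ["B"] * 15,
})

# Shapiro-Wilk p-value for each therapy group separately
shapiro_by_group = {
    name: stats.shapiro(values).pvalue
    for name, values in demo.groupby("Therapy")["expr"]
}

for name, p in shapiro_by_group.items():
    print(name, round(p, 4))
```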