Source: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/

In [6]:
import pandas as pd

In [7]:
experimentDF = pd.read_csv('https://raw.githubusercontent.com/rhiever/ipython-notebook-workshop/master/parasite_data.csv',
                        na_values=[' '])

In [8]:
experimentDF.head()

Unnamed: 0,Virulence,Replicate,ShannonDiversity
0,0.5,1,0.059262
1,0.5,2,1.0936
2,0.5,3,1.13939
3,0.5,4,0.547651
4,0.5,5,0.065928


## Preprocessing impact

In [9]:
print(experimentDF["Virulence"].dropna().mean())
print(experimentDF["Virulence"].fillna(0.0).mean())

0.75
0.642857142857
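
The two means differ because of how missing values are handled: `dropna()` leaves them out of the average, while `fillna(0.0)` counts each one as a zero and pulls the mean down. A minimal sketch on a made-up series (the values are illustrative assumptions, not taken from `parasite_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one missing value (not from the parasite data)
toy = pd.Series([0.5, 1.0, np.nan, 1.5])

mean_dropped = toy.dropna().mean()    # averages the 3 observed values
mean_filled = toy.fillna(0.0).mean()  # averages 4 values, one of them 0.0

print("dropna mean:", mean_dropped)
print("fillna mean:", mean_filled)
```

Which treatment is appropriate depends on whether a missing entry really means "zero" or simply "not recorded".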


## Standard Error of the Mean (SEM)

In [10]:
from scipy import stats

print('SEM of Shannon Diversity w/ 0.8 Parasite Virulence =',
      stats.sem(experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]))

SEM of Shannon Diversity w/ 0.8 Parasite Virulence = 0.110547585529
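
For reference, the SEM is the sample standard deviation divided by the square root of the sample size. The sketch below checks that definition against `stats.sem` on a made-up array (the sample values are assumptions chosen for illustration only):

```python
import numpy as np
from scipy import stats

# Hypothetical sample (illustrative values, not from the parasite data)
sample = np.array([1.2, 0.8, 1.9, 0.4, 1.1])

# SEM = sample standard deviation (ddof=1) / sqrt(n)
manual_sem = sample.std(ddof=1) / np.sqrt(len(sample))

# stats.sem uses ddof=1 by default, so the two values should agree
scipy_sem = stats.sem(sample)

print(manual_sem, scipy_sem)
```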


## Mann-Whitney-Wilcoxon RankSum test

The RankSum test checks whether two samples come from significantly different distributions.
Unlike the **t-test**, the RankSum test does not assume that the data are normally distributed,
so it can give a more reliable assessment when that assumption is doubtful.

In [11]:
# select two treatment data sets from the parasite data  
treatment1 = experimentDF[experimentDF["Virulence"] == 0.5]["ShannonDiversity"]  
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"] 

print("Data set 1:\n", treatment1.head())
print("Data set 2:\n", treatment2.head())

Data set 1:
 0    0.059262
1    1.093600
2    1.139390
3    0.547651
4    0.065928
Name: ShannonDiversity, dtype: float64
Data set 2:
 150    1.433800
151    2.079700
152    0.892139
153    2.384740
154    0.006980
Name: ShannonDiversity, dtype: float64


In [12]:
z_stat, p_val = stats.ranksums(treatment1, treatment2)

print("MWW Ranksum P for treatment 1 and 2 = ", p_val)

MWW Ranksum P for treatment 1 and 2 =  0.000983355902735


The two distributions are significantly different (P ≈ 0.001, well below the conventional 0.05 threshold).
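
One way to see why the RankSum test needs no normality assumption is that it only uses the ranks of the observations. The sketch below, on synthetic data (the random samples and the seed are assumptions for illustration), applies a monotone transformation that distorts the shape of the data: the RankSum P-value is unchanged, while the t-test P-value generally is not.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumed for illustration, not the parasite data)
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(1.0, 1.0, size=30)

# exp() preserves the ordering of the values but makes them strongly skewed
_, p_rank_raw = stats.ranksums(a, b)
_, p_rank_exp = stats.ranksums(np.exp(a), np.exp(b))

# The t-test works on the values themselves, so it reacts to the transformation
_, p_t_raw = stats.ttest_ind(a, b)
_, p_t_exp = stats.ttest_ind(np.exp(a), np.exp(b))

print("RankSum P (raw, exp):", p_rank_raw, p_rank_exp)
print("t-test  P (raw, exp):", p_t_raw, p_t_exp)
```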

## One-way analysis of variance (ANOVA)

In [13]:
treatment1 = experimentDF[experimentDF["Virulence"] == 0.7]["ShannonDiversity"]  
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]  
treatment3 = experimentDF[experimentDF["Virulence"] == 0.9]["ShannonDiversity"] 

# f_oneway returns the F statistic (not a z statistic) and the P-value
f_stat, p_val = stats.f_oneway(treatment1, treatment2, treatment3)

print("One-way ANOVA P = ", p_val)

One-way ANOVA P =  0.381509481874


The means of the three treatments are not significantly different (P ≈ 0.38).
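
The statistic behind one-way ANOVA is the ratio of between-group to within-group variance. As a rough sketch, the following computes it by hand for three made-up groups (the numbers are assumptions for illustration) and compares it with `stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Hypothetical groups (illustrative values, not the parasite data)
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f_manual, f_scipy, p_scipy)
```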

## Bootstrapped 95% confidence intervals

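When the sample is small (fewer than about 20 observations), the standard error of the mean may not describe the uncertainty accurately; bootstrapped confidence intervals are an alternative.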

In [17]:
import numpy as np
import scikits.bootstrap as bootstrap

# scipy.mean is no longer available in recent SciPy releases; np.mean is the equivalent
CIs = bootstrap.ci(data=treatment1, statfunction=np.mean)

print("Bootstrapped 95% confidence intervals\n Low:", CIs[0], "\nHigh", CIs[1])

Bootstrapped 95% confidence intervals
 Low: 0.9208756094 
High 1.340388464


In [18]:
import numpy as np
import scikits.bootstrap as bootstrap

# alpha = 0.2 requests an 80% interval (1 - alpha), not a 95% one
CIs = bootstrap.ci(data=treatment1, statfunction=np.mean, alpha=0.2)

print("Bootstrapped 80% confidence intervals\n Low:", CIs[0], "\nHigh", CIs[1])

Bootstrapped 80% confidence intervals
 Low: 0.9861036594 
High 1.2627904322


In [20]:
import numpy as np
import scikits.bootstrap as bootstrap

# alpha = 0.6 requests a 40% interval (1 - alpha), not a 95% one
CIs = bootstrap.ci(data=treatment1, statfunction=np.mean, alpha=0.6, n_samples=20000)

print("Bootstrapped 40% confidence intervals\n Low:", CIs[0], "\nHigh", CIs[1])

Bootstrapped 40% confidence intervals
 Low: 1.0728948554 
High 1.1831079314
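
The intervals above narrow as `alpha` grows, because the coverage level is `1 - alpha`. The sketch below illustrates that relationship with a plain percentile bootstrap written in NumPy. It is a simplified stand-in and not necessarily the method `bootstrap.ci` uses; the helper name `percentile_bootstrap_ci`, the seed, and the sample values are all assumptions made for illustration.

```python
import numpy as np

def percentile_bootstrap_ci(data, statfunction=np.mean, alpha=0.05,
                            n_samples=10000, seed=0):
    """Hypothetical helper: percentile bootstrap CI with coverage 1 - alpha."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Draw n_samples resamples of the data, with replacement
    idx = rng.integers(0, len(data), size=(n_samples, len(data)))
    # Recompute the statistic on every resample
    boot_stats = np.array([statfunction(data[row]) for row in idx])
    # The interval bounds are the alpha/2 and 1 - alpha/2 percentiles
    low, high = np.percentile(boot_stats,
                              [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high

# Made-up sample (not the parasite data)
sample = np.array([0.9, 1.3, 1.1, 0.7, 1.6, 1.2, 1.0, 1.4])

ci_95 = percentile_bootstrap_ci(sample, alpha=0.05)  # 95% interval
ci_80 = percentile_bootstrap_ci(sample, alpha=0.20)  # 80% interval

print("95% CI:", ci_95)
print("80% CI:", ci_80)
```

Because both calls use the same seed, the 80% interval sits inside the 95% interval.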
