# Hypothesis tests
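Hypothesis tests are used to compare data samples and to test the validity of a claim about the data.
The default assumption of a statistical test is called the null hypothesis, or H0; the test measures whether the data give enough evidence to reject it.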

Hypothesis tests can be parametric or nonparametric.
Parametric tests assume that the sample comes from a specific distribution, usually a Gaussian.
Nonparametric tests are used to compare data samples where the data does not come from a Gaussian distribution.
Many nonparametric methods work on the ranks of the data rather than the raw values, so the data is converted into a rank format first (the SciPy functions used below do this internally).
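As a small illustration of the rank conversion, the sketch below uses scipy.stats.rankdata. The sample values are made up for the example, and this is only one way of ranking data.

```python
from scipy.stats import rankdata

# Illustrative sample (made-up values)
sample = [12, 20, 13, 44, 5, 16, 17, 8]

# rankdata gives rank 1 to the smallest value; tied values would share the average rank
ranks = rankdata(sample)
print(ranks)
```

The smallest value (5) gets rank 1 and the largest (44) gets rank 8, so the test only sees the ordering of the values, not their size.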

Examples of parametric tests:
- t test: tests whether the means of two independent samples are significantly different.
- Paired t test: tests whether the means of two paired samples are significantly different.
- ANOVA: tests whether the means of two or more independent samples are significantly different.

Examples of nonparametric tests:
- Mann-Whitney U test: tests whether the distributions of two independent samples are equal or not.
- Wilcoxon Signed Rank test: tests whether the distributions of two paired samples are equal or not.
- Kruskal-Wallis H test: tests whether the distributions of two or more independent samples are equal or not.
- Friedman test: tests whether the distributions of two or more paired samples are equal or not.

## p value
The p value is used to decide whether a result is statistically significant, by comparing it to a chosen significance level (alpha). Alpha is usually 0.05, or 5%.
It is the probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true.
I.e. if we assume that the result happened by chance (the null hypothesis is true), what is the probability of obtaining a result greater than or equal to the one we already observed?

- **If p-value > alpha**: Fail to reject the null hypothesis (not significant).
- **If p-value <= alpha**: Reject the null hypothesis (significant result).
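The decision rule can be written as a tiny helper. This is only a sketch: the function name is_significant is made up for illustration, and the default alpha of 0.05 is the usual convention rather than a fixed rule.

```python
def is_significant(p_value, alpha=0.05):
    """Return True when the null hypothesis should be rejected at level alpha."""
    # is_significant is a hypothetical helper name, not part of any library
    return p_value <= alpha


print(is_significant(0.03))  # p <= alpha, reject H0
print(is_significant(0.20))  # p > alpha, fail to reject H0
```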

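## Covariance
Covariance quantifies the direction of the linear relationship between a pair of variables. Its magnitude depends on the units of the variables, so on its own it is hard to read as a strength.

cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) / (n - 1)

Covariance is easiest to interpret when the data has a Gaussian or Gaussian-like distribution.
If the two variables are independent, the covariance is zero. The reverse does not hold: a covariance of zero only rules out a linear relationship.
The two variables change in the same direction if the value is positive, and in opposite directions if it is negative.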
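The formula can be checked by hand against NumPy. The sketch below uses made-up data and hypothetical variable names (x_demo, y_demo) so it does not clash with anything else in the notebook.

```python
import numpy as np

# Made-up illustrative data: y_demo is exactly 2 * x_demo
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Sample covariance: sum of products of deviations, divided by n - 1
n = len(x_demo)
cov_manual = np.sum((x_demo - x_demo.mean()) * (y_demo - y_demo.mean())) / (n - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(x_demo, y_demo)
cov_numpy = np.cov(x_demo, y_demo)[0, 1]
print(cov_manual, cov_numpy)
```

Both values come out as 5.0, and the positive sign shows that the two variables move in the same direction.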

In [1]:
# %load imports.py
import pandas as pd
import scipy
import statsmodels
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

NumPy's cov() function can be used to find the covariance matrix.
The diagonal entries of the covariance matrix are the variances (top left is the variance of x, bottom right is the variance of y) and the off-diagonal entries are the covariance of x and y (top right and bottom left, which are equal).

In [2]:
x= np.random.randn(1000) +100
y= x + (np.random.randn(1000)+50)

In [3]:
covariance_matrix=np.cov(x, y)
print(covariance_matrix)

[[ 1.00125551  0.9992882 ]
 [ 0.9992882   2.03402505]]


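In comparison, variance measures the variation of a single variable. The values below differ slightly from the diagonal of the covariance matrix because var() divides by n by default, whereas cov() divides by n - 1.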

In [4]:
print(x.var())
print(y.var())

1.00025425038
2.0319910244


# Correlation Coefficient
The correlation coefficient r is the covariance normalised to lie between -1 and 1.
- r > 0: positive correlation
- r < 0: negative correlation
- r = 1: maximum value, perfect positive linear relationship
- r = -1: minimum value, perfect negative linear relationship
- r = 0: no linear correlation

## Pearson Correlation Coefficient
Used to summarise the strength of the linear relationship between 2 data samples. It is the covariance of the two samples divided by the product of their standard deviations.

In [5]:
correlation_pear, p = scipy.stats.pearsonr(x,y)
print('correlation: ', correlation_pear, '\np value: ', p)

correlation:  0.700229093369 
p value:  3.14587168724e-148
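To show how Pearson's r relates to the covariance, the sketch below computes it by hand and compares it with np.corrcoef. The data is randomly generated for illustration and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary fixed seed for reproducibility
a = rng.normal(size=200)
b = a + rng.normal(size=200)          # b depends linearly on a, plus noise

# Pearson r = cov(a, b) / (std(a) * std(b)), using the same n - 1 divisor throughout
cov_ab = np.cov(a, b)[0, 1]
r_manual = cov_ab / (a.std(ddof=1) * b.std(ddof=1))

r_numpy = np.corrcoef(a, b)[0, 1]
print(r_manual, r_numpy)
```

Dividing by the standard deviations removes the units, which is why r always lies between -1 and 1 while the covariance does not.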


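## Spearman's Rank Correlation Coefficient
Used to summarise the strength of a monotonic relationship between two samples, which may be nonlinear. It is computed on the ranks of the values, so it does not assume a Gaussian distribution.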

In [6]:
correlation_spear, p = scipy.stats.spearmanr(x,y)
print('correlation: ', correlation_spear, '\np value: ', p)

correlation:  0.684461076461 
p value:  3.96161071215e-139
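One way to see what the rank correlation does: Spearman's coefficient is Pearson's coefficient applied to the ranks of the data. The sketch below checks this on made-up data with an arbitrary seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                 # arbitrary fixed seed
u = rng.normal(size=100)
v = np.exp(u) + 0.1 * rng.normal(size=100)     # nonlinear, roughly monotonic in u

# Spearman on the raw values
rho, _ = stats.spearmanr(u, v)

# Pearson on the ranks of the same values
r_on_ranks, _ = stats.pearsonr(stats.rankdata(u), stats.rankdata(v))
print(rho, r_on_ranks)
```

The two numbers agree, which is why Spearman's coefficient is not thrown off by a curved but monotonic relationship.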


## Chi-squared test
Used to test whether 2 categorical variables are related or independent. If the variables are independent, the feature may be irrelevant and can be considered for removal from the dataset.
As a rule of thumb, the expected frequency in each cell should be at least 5 for the test to be valid.
It compares the observed frequencies with the expected frequencies in the cells of a contingency table.


- If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
- If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.


- If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent.
- If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent.


The degrees of freedom for the chi-squared distribution are calculated as:
- degrees of freedom: (rows - 1) * (cols - 1)  
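The expected frequencies, the statistic and the degrees of freedom can also be worked out by hand. The sketch below uses the same small table as the SciPy example in this section; the variable names are made up for illustration.

```python
import numpy as np

observed = np.array([[1, 2, 3], [12, 14, 16]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / total

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
stat_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print('stat=%.3f, dof=%d' % (stat_manual, dof_manual))
```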

Using the SciPy chi2_contingency function, we obtain the statistic for the test, the p value, the degrees of freedom and the expected frequencies.

In [7]:
table= [[1,2,3],[12,14,16]]
stat, p, dof, expected = scipy.stats.chi2_contingency(table)

Using a significance level (alpha) of 5%:

In [8]:
probability= 0.95
critical = scipy.stats.chi2.ppf(probability, dof)
print('critical=%.3f, stat=%.3f' % (critical, stat))

critical=5.991, stat=0.463


In [9]:
if abs(stat) >= critical:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')

Independent (fail to reject H0)


In [10]:
alpha = 1.0 - probability
if p <= alpha:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')

Independent (fail to reject H0)


# Parametric Hypothesis Tests

## T test
Tests whether the means of two independent samples are significantly different. The one-sample variant tests whether the mean of a single sample differs from a given value. It should only be used for Gaussian distributed data.

In [11]:
# One-sample t test: is the population mean of the data likely to be equal to a given value?
# alpha (0.05) was defined in the chi-squared cell above
datax = [2, 2, 2, 4, 4, 4]
stat, p = scipy.stats.ttest_1samp(datax, 3)
print('stat= %.3f, p= %.3f' % (stat, p))
if p > alpha:
    print('likely to be equal')
else:
    print('likely to be unequal')

stat= 0.000, p= 1.000
likely to be equal


In [12]:
# Two-sample t test on 2 independent data samples
datax = [1, 2, 3, 4, 5, 6, 7, 8]
datay = [2, 4, 6, 8, 10]
alpha = 0.05
stat, p = scipy.stats.ttest_ind(datax, datay)

In [13]:
print('stat= %.3f, p= %.3f' % (stat, p))
if p > alpha:
    print('Probably the same mean (fail to reject H0)')
else:
    print('Probably different means (reject H0)')

stat= -0.964, p= 0.356
Probably the same mean (fail to reject H0)
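To see where the t statistic comes from, the sketch below recomputes the two-sample test by hand on the same two samples, assuming equal population variances (the pooled-variance form). The variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

sample_a = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
sample_b = np.array([2, 4, 6, 8, 10], dtype=float)
n_a, n_b = len(sample_a), len(sample_b)
dof = n_a + n_b - 2

# Pooled variance (assumes both populations share the same variance)
pooled_var = ((n_a - 1) * sample_a.var(ddof=1) + (n_b - 1) * sample_b.var(ddof=1)) / dof

# t = difference in means / standard error of that difference
t_manual = (sample_a.mean() - sample_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)
print('t=%.3f, p=%.3f' % (t_manual, p_manual))
```

The statistic is negative only because the first sample has the smaller mean; the two-sided p value ignores the sign.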


## Paired t test
Similar to the t test, but used for paired samples, e.g. samples obtained from the same group at different times or under different conditions.

In [14]:
datax = [1, 2, 3, 4, 5, 6, 7, 8]
datay = [1.1, 2.2, 3, 4.1, 5, 6, 7.1, 8.8]
alpha = 0.05
stat, p = scipy.stats.ttest_rel(datax, datay)

In [15]:
print('stat= %.3f, p= %.3f' % (stat, p))
if p > alpha:
    print('Probably the same mean (fail to reject H0)')
else:
    print('Probably different means (reject H0)')

stat= -1.722, p= 0.129
Probably the same mean (fail to reject H0)


## ANOVA - Analysis of Variance test
Tests whether the means of two or more independent samples are significantly different

The F statistic is the value you get when you run an ANOVA test or a regression analysis to find out if the means of the populations are significantly different. It is the ratio of the variance between the groups to the variance within the groups. If the calculated F value is larger than the critical F value for the chosen alpha, you can reject the null hypothesis.

In [16]:
datax = [1, 2, 3, 4, 5, 6, 7, 8]
datay = [1.1, 2.2, 3, 4.1, 5, 6, 7.1, 8.8]
dataz = [0.1, 1.1, 2.5, 3.2, 4.5, 5.8, 6.7, 7.7]
alpha = 0.05
stat, p = scipy.stats.f_oneway(datax, datay, dataz)

In [17]:
print('stat= %.3f, p= %.3f' % (stat, p))
if p > alpha:
    print('Probably the same means (fail to reject H0)')
else:
    print('Probably different means (reject H0)')

stat= 0.168, p= 0.847
Probably the same means (fail to reject H0)
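The comparison against a critical F value can be sketched as follows, using the same three samples and assuming an alpha of 0.05. The critical value comes from scipy.stats.f.ppf, with k - 1 and N - k degrees of freedom for k groups and N observations in total.

```python
from scipy import stats

groups = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1.1, 2.2, 3, 4.1, 5, 6, 7.1, 8.8],
    [0.1, 1.1, 2.5, 3.2, 4.5, 5.8, 6.7, 7.7],
]
alpha_level = 0.05  # assumed significance level

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
dfn, dfd = k - 1, n_total - k            # between-group and within-group degrees of freedom

f_stat, p_value = stats.f_oneway(*groups)
f_critical = stats.f.ppf(1 - alpha_level, dfn, dfd)

print('F=%.3f, critical=%.3f' % (f_stat, f_critical))
print('reject H0' if f_stat > f_critical else 'fail to reject H0')
```

Comparing F with the critical value gives the same decision as comparing the p value with alpha.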


## Z test
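An alternative to the t test, to be used when the sample size is more than 30, the data points are independent, the data is normally distributed (unless the sample is very large), the data is randomly selected and, when two samples are compared, the sample sizes are equal if possible. Strictly, it also assumes that the population variance is known; with large samples the sample variance is used as an estimate.
It can also be used to test hypotheses about proportions.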

In [18]:
# One-Sample Z test- comparing a sample mean with the population mean.
squares=[]
for x in range(1,32):
    squares.append(x**2)



In [19]:
from statsmodels.stats import weightstats as ss
stat, p = ss.ztest(squares, value=156)
print('stat= %.3f ' % stat)
if p < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

stat= 3.342 
reject null hypothesis
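The one-sample z statistic can be reproduced by hand with NumPy and SciPy. The sketch below rebuilds the same sample (the squares of 1 to 31) under made-up variable names and uses the sample standard deviation as the estimate of the population one.

```python
import numpy as np
from scipy import stats

sample = np.array([i ** 2 for i in range(1, 32)], dtype=float)
hypothesised_mean = 156

# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

# z = (sample mean - hypothesised mean) / standard error
z = (sample.mean() - hypothesised_mean) / standard_error

# Two-sided p value from the standard normal distribution
p_two_sided = 2 * stats.norm.sf(abs(z))
print('z=%.3f, p=%.4f' % (z, p_two_sided))
```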


In [20]:
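# A second sample: the squares of 32 to 63
squares2 = []
for x in range(32, 64):
    squares2.append(x**2)

If we want to compare the means of 2 samples, use the two-sample Z test,
e.g.
stat, p = ss.ztest(data1, data2, value=10, alternative='two-sided')
Here value is the difference between the two means under the null hypothesis, so value=10 tests whether the means differ by 10; use value=0 to test for equal means.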


# Non Parametric Hypothesis Tests

## Mann Whitney U test
-- Tests whether the distributions of two independent samples are equal or not.

-- Determines whether 2 independent samples were drawn from populations with the same distribution.
- Fail to Reject H0: Sample distributions are equal.
- Reject H0: Sample distributions are not equal.

-- As a rule of thumb, at least 20 observations in each data sample are needed for the test to be effective.


In [21]:
from scipy.stats import mannwhitneyu
datax= [12,20,13,44,5,16,17,8]
datay=[2,22,31,4,14,16,71,88]
stat , p = mannwhitneyu(datax, datay)
print('stat= %.3f , p= %0.3f' % (stat,p))
if p>0.05:
    print("Same distribution (fail to reject H0)")
else:
    print("Different distribution (reject H0)")

stat= 25.500 , p= 0.264
Same distribution (fail to reject H0)


## Wilcoxon Signed rank
-- Tests whether the distributions of two paired samples are equal or not.

-- The samples are not independent: they are related or matched in some way, e.g. repeated measurements on the same subjects.
- Fail to Reject H0: Sample distributions are equal.
- Reject H0: Sample distributions are not equal.

-- As a rule of thumb, at least 20 observations in each data sample are needed for the test to be effective.

-- The p-value suggests whether the samples are drawn from different distributions.


In [22]:
from scipy.stats import wilcoxon
datax = [12, 20, 13, 44, 5, 16, 17, 8]
datay = [12, 22, 13, 41, 6, 16, 18, 8]
# wilcoxon works on the paired differences between datax and datay
stat, p = wilcoxon(datax, datay)
print('stat= %.3f , p= %0.3f' % (stat, p))
if p > 0.05:
    print("Same distribution (fail to reject H0)")
else:
    print("Different distribution (reject H0)")
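The signed ranks behind the Wilcoxon test can be computed by hand. The sketch below uses the same paired data as this section, drops the zero differences (one common convention, assumed here) and sums the ranks of the positive and negative differences. The variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import rankdata

before = np.array([12, 20, 13, 44, 5, 16, 17, 8], dtype=float)
after = np.array([12, 22, 13, 41, 6, 16, 18, 8], dtype=float)

diffs = before - after
diffs = diffs[diffs != 0]          # zero differences carry no sign, so drop them
ranks = rankdata(np.abs(diffs))    # tied absolute differences share the average rank

w_plus = ranks[diffs > 0].sum()    # rank sum of the positive differences
w_minus = ranks[diffs < 0].sum()   # rank sum of the negative differences
print(w_plus, w_minus)
```

If the two samples come from the same distribution, the two rank sums should be similar; a large imbalance is evidence against H0.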


## Kruskal Wallis H Test
-- Tests whether the distributions of two or more independent samples are equal or not.

-- Nonparametric version of ANOVA. Determines whether two or more independent samples have the same distribution.
- Fail to Reject H0: All sample distributions are equal.
- Reject H0: One or more sample distributions are not equal.

-- Each data sample must be independent and have 5 or more observations. The data samples can differ in size.

In [23]:
from scipy.stats import kruskal
datax= [12,20,13,44,5,16,17,8]
datay=[12,22,13,41,6,16,18,8]
dataz= [14,20,3,44,5,16,18,8]
stat , p = kruskal(datax, datay, dataz)
print('stat= %.3f , p= %0.3f' % (stat,p))
if p>0.05:
    print("Same distributions (fail to reject H0)")
else:
    print("Different distributions (reject H0)")

stat= 0.072 , p= 0.965
Same distributions (fail to reject H0)


## Friedman Test
-- Tests whether the distributions of two or more paired samples are equal or not.

-- Used if the samples are paired in some way, e.g. repeated measurements on the same subjects. It is the nonparametric version of the repeated measures ANOVA.


- Fail to Reject H0: Paired sample distributions are equal.
- Reject H0: Paired sample distributions are not equal.

-- Assumes paired data samples with 10 or more observations per group. SciPy's friedmanchisquare needs at least three samples.

In [24]:
from scipy.stats import friedmanchisquare
datax= [46,12,13,44,5,16,10,8]
datay=[12,7,13,16,7,16,18,10]
dataz= [10,20,3,44,11,5,18,5]
stat , p = friedmanchisquare(datax, datay, dataz)
print('stat= %.3f , p= %0.3f' % (stat,p))
if p>0.05:
    print("Same distributions (fail to reject H0)")
else:
    print("Different distributions (reject H0)")

stat= 0.214 , p= 0.898
Same distributions (fail to reject H0)


# Normality Tests

Normality tests are used to check whether data has a Gaussian (normal) distribution. Many measurements are assumed to follow a normal distribution, which is symmetric, with half of the values above the mean and half below. If a distribution is normal, then the mean, median, and mode are the same.
 
Types of test:
- Shapiro Wilk
- D'Agostino's K^2
- Anderson Darling

## Shapiro Wilk

Assumes the observations in each sample are independent and identically distributed. It works well if every value is unique, and less well when several values are identical; use D'Agostino's K^2 for that case. The test does not show whether the distribution is skewed or heavy tailed (heavy tailed means that the extreme portion of the distribution, the part farthest from the median, spreads out further relative to the width of the middle 50% of the data). Sample size must be greater than 2.

In [25]:
from scipy.stats import shapiro
datax= [46,12,13,44,5,16,10,8]
stat, p = shapiro(datax)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')

stat=0.754, p=0.009
Probably not Gaussian


## D’Agostino’s K^2

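Assumes the observations in each sample are independent and identically distributed. It combines a test for skewness (non-normality due to a lack of symmetry) with a test for kurtosis (non-normality due to the weight of the tails). Sample size must be at least 8; SciPy warns that the kurtosis part is only valid for 20 or more observations.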

In [26]:
from scipy.stats import normaltest
datax= [46,12,13,44,5,16,10,8,2,46,12,13,44,5,16,10,8,2,20,21]
stat, p = normaltest(datax)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')

stat=5.011, p=0.082
Probably Gaussian


## Anderson-Darling Test

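Assumes the observations in each sample are independent and identically distributed. Instead of a p value, it returns the test statistic together with critical values for a range of significance levels; the data is probably not Gaussian at a given level if the statistic exceeds the critical value for that level.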

In [27]:
from scipy.stats import anderson
datax= [46,12,13,44,5,16,10,8]
result = anderson(datax)
print(result)
print('stat=%.3f' % (result.statistic))
for i in range(len(result.critical_values)):
	sl, cv = result.significance_level[i], result.critical_values[i]
	if result.statistic < cv:
		print('Probably Gaussian at the %.1f%% level' % (sl))
	else:
		print('Probably not Gaussian at the %.1f%% level' % (sl))

AndersonResult(statistic=0.95512235994592132, critical_values=array([ 0.519,  0.591,  0.709,  0.827,  0.984]), significance_level=array([ 15. ,  10. ,   5. ,   2.5,   1. ]))
stat=0.955
Probably not Gaussian at the 15.0% level
Probably not Gaussian at the 10.0% level
Probably not Gaussian at the 5.0% level
Probably not Gaussian at the 2.5% level
Probably Gaussian at the 1.0% level
