# Statistical hypothesis tests
Statistical hypothesis tests are based on a statement called the null hypothesis, which assumes that nothing interesting is going on between the **variables** you are testing.

The exact form of the null hypothesis varies from one type of test to another: if you are testing whether groups differ, the null hypothesis states that the groups are the same. For instance, if you wanted to test whether the average age of voters in your home state differs from the national average, the null hypothesis would be that there is no difference between the average ages.

The purpose of a hypothesis test is to determine whether the sample data are consistent with the null hypothesis. If there is little evidence against the null hypothesis given the data, you fail to reject the null hypothesis (which is not the same as proving it true). If the observed data would be unlikely under the null hypothesis, you might reject the null in favor of the alternative hypothesis: that something interesting is going on. The exact form of the alternative hypothesis will depend on the specific test you are carrying out. Continuing with the example above, the alternative hypothesis would be that the average age of voters in your state does in fact differ from the national average.

Once you have the null and alternative hypotheses in hand, you choose a significance level (often denoted by the Greek letter α). The significance level is a probability threshold that determines when you reject the null hypothesis. After carrying out a test, if the probability of getting a result at least as extreme as the one you observed, assuming the null hypothesis is true, is lower than the significance level, you reject the null hypothesis in favor of the alternative. This probability of seeing a result as extreme or more extreme than the one observed is known as the p-value.
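To make the decision rule concrete, here is a minimal sketch that computes a two-sided p-value by hand and compares it with α. It uses a one-sample t statistic as the example; the sample values, `pop_mean`, and the variable names are made up for illustration and are not part of the original notes.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesized population mean (illustrative values only)
sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 6.1])
pop_mean = 5.0
alpha = 0.05  # significance level

n = len(sample)
# t statistic: distance of the sample mean from pop_mean in standard-error units
t_stat = (sample.mean() - pop_mean) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value: probability of a statistic at least this extreme if H0 is true
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print('t=%.3f, p=%.3f' % (t_stat, p_value))
print('reject H0' if p_value < alpha else 'fail to reject H0')
```

The hand-computed values should agree with what `stats.ttest_1samp(sample, pop_mean)` returns, which is a quick way to check the arithmetic.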

**The t-test is a statistical test used to determine whether a numeric data sample differs significantly from the population or whether two samples differ from one another.**

**Note: the comparison is between samples (or a sample and a population), NOT between the real output and the prediction.**

# One-Sample T-Test

A one-sample t-test checks whether a sample mean differs from the population mean. For example, given age data for the population of voters in the entire country and a sample of voters in Minnesota, you can test whether the average age of voters in Minnesota differs from that of the population.
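A minimal sketch of the voter-age example using `scipy.stats.ttest_1samp`. The ages and the national mean below are invented for illustration; they are not real voter data.

```python
import numpy as np
from scipy import stats

# Hypothetical values: the national mean age and a small Minnesota sample
national_mean_age = 43.0
minnesota_ages = np.array([38, 45, 29, 51, 33, 40, 36, 47, 31, 42, 39, 35])

# H0: the mean age of Minnesota voters equals the national mean
stat, p = stats.ttest_1samp(minnesota_ages, popmean=national_mean_age)
print('stat=%.3f, p=%.3f' % (stat, p))

alpha = 0.05
if p > alpha:
    print('Fail to reject H0')
else:
    print('Reject H0')
```

A negative statistic means the sample mean lies below the hypothesized population mean.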

# Two-Sample T-Test (Student's t-test)
A two-sample t-test investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same. Unlike the one-sample test, where we test against a known population parameter, the two-sample test only involves sample means. You can conduct a two-sample t-test by passing both samples to the **stats.ttest_ind()** function.
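As a small sketch, the call below runs `stats.ttest_ind()` twice: once with its default equal-variance assumption (Student's t-test) and once with `equal_var=False`, which gives Welch's t-test and is often preferred when the two groups may have different variances. The two groups are made-up numbers used only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical independent samples with different sizes and spreads
group_a = np.array([23.1, 25.4, 22.8, 27.0, 24.3, 26.1, 25.0, 23.7])
group_b = np.array([20.2, 28.9, 18.5, 30.1, 21.7, 19.4])

# Student's t-test: assumes both groups share the same variance
stat_student, p_student = stats.ttest_ind(group_a, group_b)
# Welch's t-test: does not assume equal variances
stat_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print('Student: stat=%.3f, p=%.3f' % (stat_student, p_student))
print('Welch:   stat=%.3f, p=%.3f' % (stat_welch, p_welch))
```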

# Paired T-Test
The basic two-sample t-test is designed for testing differences between independent groups. In some cases, you might be interested in testing differences between samples of the same group at different points in time. For instance, a hospital might want to test whether a weight-loss drug works by checking the weights of the same group of patients before and after treatment. A paired t-test lets you check whether the means of samples from the same group differ.

We can conduct a paired t-test using the scipy function **stats.ttest_rel()**.
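A minimal sketch of the weight-loss example follows. The before and after weights are invented for illustration. It also shows that a paired t-test is equivalent to a one-sample t-test on the per-patient differences against zero.

```python
import numpy as np
from scipy import stats

# Hypothetical weights (kg) of the same eight patients before and after treatment
before = np.array([92.4, 88.1, 101.3, 95.0, 79.8, 110.2, 86.5, 98.7])
after = np.array([90.1, 87.5, 97.8, 94.2, 80.3, 105.9, 84.0, 96.1])

# Paired t-test: H0 is that the mean difference between pairs is zero
stat, p = stats.ttest_rel(before, after)
print('paired:      stat=%.3f, p=%.3f' % (stat, p))

# Same test expressed as a one-sample t-test on the differences
stat_diff, p_diff = stats.ttest_1samp(before - after, popmean=0.0)
print('differences: stat=%.3f, p=%.3f' % (stat_diff, p_diff))
```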

# Type I and Type II Error
The result of a statistical hypothesis test and the corresponding decision of whether to reject or fail to reject the null hypothesis is not infallible. A test provides evidence against the null hypothesis, and you decide whether to reject it based on that evidence, but the evidence may lack the strength to arrive at the correct conclusion. Incorrect conclusions made from hypothesis tests fall into one of two categories: type I error and type II error.

Type I error describes a situation where you reject the null hypothesis when it is actually true. This type of error is also known as a "false positive" or "false hit". The type I error rate is equal to the significance level α, so setting a higher confidence level (and therefore a lower α) reduces the chance of getting a false positive.

Type II error describes a situation where you fail to reject the null hypothesis when it is actually false. Type II error is also known as a "false negative" or "miss". The higher your confidence level (that is, the lower your α), the more likely you are to make a type II error, all else being equal.
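Both error rates can be estimated by simulation. The sketch below is only an illustration: the number of trials, the sample size, and the 0.5 shift in the mean are arbitrary choices. When H0 is true, the fraction of rejections should land near α. When H0 is false, the fraction of non-rejections estimates the type II error rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials, n_obs = 2000, 30

# Case 1: both groups come from the same distribution, so H0 is true in every trial
a = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_obs))
b = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_obs))
_, p_null = stats.ttest_ind(a, b, axis=1)
type1_rate = np.mean(p_null < alpha)  # rejections of a true H0

# Case 2: the second group's mean is shifted by 0.5, so H0 is false in every trial
c = rng.normal(loc=0.5, scale=1.0, size=(n_trials, n_obs))
_, p_alt = stats.ttest_ind(a, c, axis=1)
type2_rate = np.mean(p_alt >= alpha)  # failures to reject a false H0

print('estimated type I error rate:  %.3f' % type1_rate)
print('estimated type II error rate: %.3f' % type2_rate)
```

Lowering `alpha` in this sketch should shrink the first rate and grow the second, which is the trade-off described above.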

In [1]:
# Is there a gap in pay between males and females?
# The values in the male sample reveal no information about the female sample, so the samples are independent.
# H0: the average male salary is equal to the average female salary
# H1: the average salaries are not equal

In [3]:
import pandas as pd

In [4]:
df= pd.read_excel(r'D:/5_11_practical.xlsx')

In [5]:
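# Deal with NaN values: fill missing salaries with the maximum salary.
# Note: this pulls the group means upward; dropping those rows or filling with the median is a more neutral choice.
df['Salary'] = df['Salary'].fillna(df['Salary'].max())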

In [6]:
df.head(6)

|   | Surname | Name | Age | Gender | Country | hispanic | Start_date | Department | Position | Salary | Unnamed: 10 | Unnamed: 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bold | Caroline | 63.0 | Female | United States | White | 2012-07-02 | Executive Office | President & CEO | 166400.0 | | |
| 1 | Zamora | Jennifer | 38.0 | Female | United States | White | 2010-04-10 | IT/IS | CIO | 135200.0 | | |
| 2 | Houlihan | Debra | 51.0 | Female | United States | White | 2014-05-05 | Sales | Director of Sales | 124800.0 | | |
| 3 | Bramante | Elisa | 34.0 | Female | United States | Black or African American | 2009-01-05 | Production | Director of Operations | 124800.0 | | 1.0 |
| 4 | Del Bosque | Keyla | 38.0 | Female | United States | Black or African American | 2012-01-09 | Software Engineering | Software Engineer | 118809.6 | | 1.0 |
| 5 | Onque | Jasmine | 27.0 | Female | United States | White | 2013-09-30 | Sales | Area Sales Manager | 118560.0 | | 1.0 |


In [7]:
#female = df.iloc[0:98, 9].values
#male = df.iloc[98:, 9].values

In [8]:
# Create a separate DataFrame for each gender
male=df[df['Gender']=='Male']
female=df[df['Gender']=='Female']

In [9]:
# Keep only the salary column from each one
male = male['Salary']
female = female['Salary']

In [17]:
import numpy as np
sync = np.array([94. , 84.9, 82.6, 69.5, 80.1, 79.6, 81.4, 77.8, 81.7, 78.8, 73.2,
                   87.9, 87.9, 93.5, 82.3, 79.3, 78.3, 71.6, 88.6, 74.6, 74.1, 80.6])
asyncr =np.array([77.1, 71.7, 91. , 72.2, 74.8, 85.1, 67.6, 69.9, 75.3, 71.7, 65.7, 72.6, 71.5, 78.2])

In [None]:
# The Student's t-test is a statistical hypothesis test of whether two independent
# data samples, each assumed to be Gaussian, come from populations with the same mean.
# It tests whether the means of two independent samples are significantly different.

# Assumptions
# Observations in each sample are independent and identically distributed (iid).
# Observations in each sample are normally distributed.
# Observations in each sample have the same variance.

# Interpretation
# H0: the means of the samples are equal (fail to reject H0 when p > alpha).
# H1: the means of the samples are unequal (reject H0 when p < alpha).

In [27]:
# Student's t-test
from scipy.stats import ttest_ind

# Two independent samples: male and female salaries
# Compare the sample means
stat, p = ttest_ind(male, female)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# Interpret the result at the 5% significance level
alpha = 0.05
if p > alpha:
    print('Same distributions (fail to reject H0)')
else:
    print('Different distributions (reject H0)')

Statistics=1.261, p=0.209
Same distributions (fail to reject H0)


In [11]:
# Could not reject H0, so the data give no significant evidence that the average salaries differ.
# This does not prove that the two averages are equal.
# No significant gap in salaries was detected.

# /////////////////////////////////////////////////////////////////////////////////////////////////////////////////

# Parametric Statistical Hypothesis Tests
Parametric statistical tests assume that a data sample was drawn from a specific population distribution.

The term often refers to statistical tests that assume a Gaussian distribution. Because it is so common for data to fit this distribution, parametric statistical methods are more commonly used.
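Because parametric tests rest on distributional assumptions, it can help to check them before running the test. The sketch below is one possible approach and not part of the original notes: it uses `scipy.stats.shapiro` as a normality check and `scipy.stats.levene` as an equal-variance check, applied to the same example data used in the sections that follow.

```python
from scipy import stats

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# Shapiro-Wilk: H0 is that the sample was drawn from a normal distribution
shapiro_stat1, shapiro_p1 = stats.shapiro(data1)
shapiro_stat2, shapiro_p2 = stats.shapiro(data2)
print('Shapiro data1: stat=%.3f, p=%.3f' % (shapiro_stat1, shapiro_p1))
print('Shapiro data2: stat=%.3f, p=%.3f' % (shapiro_stat2, shapiro_p2))

# Levene: H0 is that the samples have equal variances
levene_stat, levene_p = stats.levene(data1, data2)
print('Levene: stat=%.3f, p=%.3f' % (levene_stat, levene_p))
```

A small p-value in either check suggests the corresponding assumption may not hold, in which case a nonparametric test may be a safer choice.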

# 1 Student’s t-test
Tests whether the means of two independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

In [1]:
# Example of the Student's t-test
from scipy.stats import ttest_ind
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_ind(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

stat=-0.326, p=0.748
Probably the same distribution


# 2 Paired Student’s t-test
Tests whether the means of two paired (dependent) samples are significantly different.

Assumptions

- Pairs of observations are independent of one another and identically distributed.
- The differences between paired observations are approximately normally distributed.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

In [6]:
# Example of the Paired Student's t-test
from scipy.stats import ttest_rel
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_rel(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

stat=-0.334, p=0.746
Probably the same distribution


# 3 Analysis of Variance Test (ANOVA)
Tests whether the means of **two or more** independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

In [7]:
# Example of the Analysis of Variance Test
from scipy.stats import f_oneway
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = f_oneway(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

stat=0.096, p=0.908
Probably the same distribution
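With exactly two groups, one-way ANOVA and the equal-variance Student's t-test are the same test: the F statistic equals the square of the t statistic and the p-values match. A short sketch of that relationship, reusing `data1` and `data2` from the examples above:

```python
from scipy.stats import f_oneway, ttest_ind

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# One-way ANOVA restricted to two groups
f_stat, f_p = f_oneway(data1, data2)
# Student's t-test (equal variances assumed) on the same two groups
t_stat, t_p = ttest_ind(data1, data2)

print('F=%.3f, t^2=%.3f' % (f_stat, t_stat ** 2))
print('ANOVA p=%.3f, t-test p=%.3f' % (f_p, t_p))
```

ANOVA becomes the natural choice once there are three or more groups, where running every pairwise t-test would inflate the overall type I error rate.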


# Nonparametric Statistical Hypothesis Tests
# 1 Mann-Whitney U Test
Tests whether the distributions of two independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

In [8]:
# Example of the Mann-Whitney U Test
from scipy.stats import mannwhitneyu
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = mannwhitneyu(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

stat=40.000, p=0.236
Probably the same distribution
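The default value of the `alternative` argument of `mannwhitneyu` appears to have changed across SciPy releases, so the p-value you get from the call above may depend on your installed version. A hedged sketch that states the alternative explicitly, so the result does not depend on the default:

```python
from scipy.stats import mannwhitneyu

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# Two-sided test: H1 is that the distributions differ in either direction
u1, p_two_sided = mannwhitneyu(data1, data2, alternative='two-sided')
# Swapping the samples gives the complementary U statistic
u2, _ = mannwhitneyu(data2, data1, alternative='two-sided')

print('U1=%.1f, U2=%.1f, p=%.3f' % (u1, u2, p_two_sided))
```

The two U statistics always sum to the product of the sample sizes, which gives a quick sanity check on the output.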


# 2 Wilcoxon Signed-Rank Test
Tests whether the distributions of two paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

In [9]:
# Example of the Wilcoxon Signed-Rank Test
from scipy.stats import wilcoxon
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = wilcoxon(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

stat=21.000, p=0.557
Probably the same distribution


# 3 Kruskal-Wallis H Test
Tests whether the distributions of two or more independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

In [10]:
# Example of the Kruskal-Wallis H Test
from scipy.stats import kruskal
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = kruskal(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

stat=0.571, p=0.450
Probably the same distribution


# 4 Friedman Test
Tests whether the distributions of two or more paired samples are equal or not. Note that SciPy's **friedmanchisquare()** requires at least three samples.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

In [12]:
# Example of the Friedman Test
from scipy.stats import friedmanchisquare
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = friedmanchisquare(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

stat=0.800, p=0.670
Probably the same distribution
