## Introduction to Hypothesis Testing

Hypothesis testing is a statistical method for making decisions about a population using experimental (sample) data. A hypothesis is an assumption that we make about a population parameter, and the test checks whether the sample data are consistent with that assumption.

**Hypothesis testing is an essential procedure in statistics. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. When we say that a finding is statistically significant, it’s thanks to a hypothesis test.**

![image.png](attachment:image.png)

### Level of significance:
The level of significance is the threshold we use to decide whether to reject the null hypothesis. A decision based on a sample can never be 100% certain, so we choose in advance how much risk of a wrong rejection we are willing to tolerate. The usual choice is 5%.

It is normally denoted by alpha (α) and is generally set to 0.05, or 5%. This means that if the null hypothesis is true, we accept a 5% chance of rejecting it by mistake, which corresponds to a 95% confidence level.

**Type I error**: Rejecting the null hypothesis when it is actually true. The probability of a Type I error is denoted by alpha. On the curve of the sampling distribution, the critical (rejection) region is called the alpha region.

**Type II error**: Failing to reject the null hypothesis when it is actually false. The probability of a Type II error is denoted by beta. On the curve, the part of the acceptance region that lies under the alternative distribution is called the beta region.
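
The link between alpha and the Type I error rate can be checked with a small simulation. The sketch below is only an illustration on synthetic data: the seed, the sample size of 30 and the 2000 repetitions are arbitrary assumptions. Every sample is drawn from a population where the null hypothesis is true, so each rejection is a Type I error.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
alpha = 0.05
n_trials = 2000                 # number of simulated experiments (arbitrary)

# Every row is one sample of 30 values from a population with true mean 0,
# so the null hypothesis "mean == 0" is true in every experiment.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, 30))
_, pvals = ttest_1samp(samples, popmean=0.0, axis=1)

# Each rejection here is a Type I error; the fraction should be close to alpha.
type1_rate = np.mean(pvals < alpha)
print("Fraction of true null hypotheses rejected:", type1_rate)
```

The printed fraction should land near 0.05, which is what choosing a 5% level of significance means in practice.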

### One tailed test :

A test of a statistical hypothesis in which the region of rejection lies on only one side of the sampling distribution is called a one-tailed test.

**Example: directional claims such as "a college has at least 4000 students" or "at most 80% of organisations have adopted data science".**

### Two-tailed test :

A two-tailed test is a statistical test in which the critical area of a distribution is two-sided, so it tests whether a sample statistic is either greater than or less than a certain range of values. If the test statistic falls into either of the critical areas, the null hypothesis is rejected in favour of the alternative hypothesis.

**Example: non-directional claims such as "the number of students in a college is not 4000" or "the share of organisations that have adopted data science is not 80%".**

![image.png](attachment:image.png)
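
The difference between the two kinds of test shows up directly in the p-value. The following sketch uses a made-up sample (the numbers and the seed are assumptions, not data from this notebook) and relies on the `alternative` argument of `ttest_1samp`, which is available in SciPy 1.6 and later.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
# Hypothetical sample: 40 yearly student counts, used only for illustration.
sample = rng.normal(loc=4100, scale=300, size=40)

# Two-tailed: H1 is "mean != 4000".
t_two, p_two = ttest_1samp(sample, popmean=4000, alternative='two-sided')

# One-tailed: H1 is "mean > 4000".
t_one, p_greater = ttest_1samp(sample, popmean=4000, alternative='greater')

print("two-sided p-value:", p_two)
print("one-sided p-value:", p_greater)
```

The test statistic is the same in both cases. When it lies on the side named by the one-tailed alternative, the one-sided p-value is half of the two-sided one.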

## P-value :

The P value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true. The definition of ‘extreme’ depends on how the hypothesis is being tested.

If your P value is less than the chosen significance level, then you reject the null hypothesis, i.e. you accept that your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a “meaningful” or “important” difference; that is for you to decide when considering the real-world relevance of your result.
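
To make the definition concrete, the sketch below computes a two-sided p-value by hand for a one-sample t-test and compares it with SciPy's result. The sample values and the hypothesised mean `mu0` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and hypothesised population mean.
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7])
mu0 = 5.0

n = sample.size
# t statistic: distance of the sample mean from mu0 in standard-error units.
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability, under H0, of a t value at least this extreme
# in either direction (t distribution with n - 1 degrees of freedom).
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

t_ref, p_ref = stats.ttest_1samp(sample, mu0)
print("manual p-value:", p_manual)
print("scipy  p-value:", p_ref)
```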

## Degree of freedom

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. More generally, it is the number of independent ways in which a system can vary without violating any constraint imposed on it.

So if a data set has 10 values, the sum of the 10 values must equal the mean × 10. If the mean of the 10 values is 3.5 (you could pick any number), this constraint requires that the sum of the 10 values must equal 10 × 3.5 = 35.
With that constraint, the first value in the data set is free to vary.

Whatever value it is, it’s still possible for the sum of all 10 numbers to have a value of 35. The second value is also free to vary, because whatever value you choose, it still allows for the possibility that the sum of all the values is 35. The same holds up to the ninth value. The tenth value, however, is not free: once the first nine are chosen, it must be whatever makes the sum equal 35. The data set therefore has 10 − 1 = 9 degrees of freedom.

**The calculation of the correct number of degrees of freedom is important because it determines the critical value at which you reject or fail to reject a hypothesis. Using the right degrees of freedom is therefore what lets a test achieve the desired confidence level.**
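
Both ideas can be sketched in a few lines. The nine "free" values below are arbitrary choices for illustration, and the list of degrees of freedom is likewise an assumption picked to show the trend.

```python
from scipy import stats

# Part 1: with the mean fixed at 3.5, nine values are free and the tenth is forced.
free_values = [2.0, 4.5, 3.0, 5.0, 1.5, 3.5, 4.0, 2.5, 6.0]  # any nine numbers
last_value = 10 * 3.5 - sum(free_values)  # the only value that keeps the sum at 35
print("Forced tenth value:", last_value)

# Part 2: the two-sided 5% critical value of the t distribution depends on
# the degrees of freedom and shrinks towards the normal value as they grow.
alpha = 0.05
critical_values = {df: stats.t.ppf(1 - alpha / 2, df) for df in (2, 9, 29, 99)}
z_critical = stats.norm.ppf(1 - alpha / 2)

for df, crit in critical_values.items():
    print(f"df = {df:3d}  critical t = {crit:.3f}")
print(f"normal    critical z = {z_critical:.3f}")
```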

## Types of Hypothesis Testing

* **1. T Test ( Student T test)**
* **2. Z Test**
* **3. ANOVA Test**
* **4. Chi-Square Test**

## T Test

A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. It is mostly used when the data, such as the outcomes recorded from flipping a coin 100 times, would follow a normal distribution and the population variance is unknown. The t-test is a hypothesis testing tool that allows an assumption about a population to be tested.

### Types of T Test:

* 1. **One Sample T test**
* 2. **Two Sampled T test**
* 3. **Paired Sampled T test**

### One Sample T Test

The One Sample t Test determines whether the sample mean is statistically different from a known or hypothesised population mean. The One Sample t Test is a parametric test.


### Two Sample T Test

The Independent Samples t Test or 2-sample t-test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test. This test is also known as: Independent t Test.
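
By default `ttest_ind` pools the variances of the two groups, which assumes they are equal. Passing `equal_var=False` runs Welch's t-test instead, which drops that assumption. The sketch below uses two synthetic groups (sizes, means and spreads are arbitrary assumptions) to show both calls side by side.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
# Two hypothetical independent groups with different sizes and spreads.
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=15, size=25)

# Student's t-test: assumes both populations share the same variance.
t_pooled, p_pooled = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances.
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print("pooled: t =", t_pooled, " p =", p_pooled)
print("welch : t =", t_welch, " p =", p_welch)
```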

### Paired Sampled T Test

The paired sample t-test is also called the dependent sample t-test. It is a univariate test that checks for a significant difference between two related variables. An example of this is if you were to collect the blood pressure of an individual before and after some treatment, condition, or time point.

* H0: the mean difference between the two samples is 0
* H1: the mean difference between the two samples is not 0

In [1]:
import numpy as np
import pandas as pd

from scipy.stats import ttest_1samp  # one-sample t-test
from scipy.stats import ttest_ind    # two-sample (independent) t-test
from scipy.stats import ttest_rel    # paired t-test

In [2]:
# lets read the data
data = pd.read_csv('Datasets/train.csv')

# lets check the shape of the dataset
data.shape

(1460, 81)

In [3]:
# lets check the head of the dataset
pd.set_option('display.max_columns', 1000)  # full option name; the short form is ambiguous in recent pandas
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## One Sampled T Test

**The one-sample t-test checks whether the mean of a single sample differs significantly from a hypothesised value, or whether the observed difference could be due to random chance. Here we test whether the mean sale price of the houses in the dataset is 180000.**


In [4]:
## Hypothesis Testing

prices_mean = data['SalePrice'].mean()
print(prices_mean)

tset, pval = ttest_1samp(data['SalePrice'], 180000)
print("p-values", pval)

# alpha value is 0.05 or 5%
if pval < 0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

180921.19589041095
p-values 0.6577784607522807
we fail to reject the null hypothesis


In [5]:
data['SalePrice'].mean()

180921.19589041095

## Two Sampled T Test

### We want to check whether there is a significant difference between the mean first-floor area and the mean second-floor area of the houses, both measured in square feet.

In [6]:
# Two Sample t test
# Note: 1stFlrSF and 2ndFlrSF hold floor areas in square feet, not prices.

firstfloor_mean = data['1stFlrSF'].mean()
secondfloor_mean = data['2ndFlrSF'].mean()

print("Mean first floor area (sq ft) :", firstfloor_mean)
print("Mean second floor area (sq ft) :", secondfloor_mean)

firstfloor_std = data['1stFlrSF'].std()
secondfloor_std = data['2ndFlrSF'].std()

print("\nStandard Deviation of first floor area (sq ft) :", firstfloor_std)
print("Standard Deviation of second floor area (sq ft) :", secondfloor_std)

ttest, pval = ttest_ind(data['1stFlrSF'], data['2ndFlrSF'])
print("\np-value",pval)

if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we fail to reject null hypothesis")

Mean first floor area (sq ft) : 1162.626712328767
Mean second floor area (sq ft) : 346.99246575342465

Standard Deviation of first floor area (sq ft) : 386.5877380410744
Standard Deviation of second floor area (sq ft) : 436.5284358862568

p-value 0.0
we reject null hypothesis


## Paired T Test

* It is generally used to check whether the same variable differs between two points in time, for example before and after a treatment.
* It is mostly used in the clinical industry to evaluate tests and health check-ups.

In [7]:
control = [8.0, 7.1, 6.5, 6.7, 7.2, 5.4, 4.7, 8.1, 6.3, 4.8]
treatment = [9.9, 7.9, 7.6, 6.8, 7.1, 9.9, 10.5, 9.7, 10.9, 8.2]

ttest, pval = ttest_rel(control, treatment)
print("P Value: ",pval)

if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we fail to reject null hypothesis")

P Value:  0.0055329408161001415
we reject null hypothesis
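
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, with a hypothesised mean difference of 0. The sketch below illustrates this with the same `control` and `treatment` values as the cell above; the variable `diff` is introduced here only for the demonstration.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

control = np.array([8.0, 7.1, 6.5, 6.7, 7.2, 5.4, 4.7, 8.1, 6.3, 4.8])
treatment = np.array([9.9, 7.9, 7.6, 6.8, 7.1, 9.9, 10.5, 9.7, 10.9, 8.2])

# Paired test on the two related samples.
t_paired, p_paired = ttest_rel(control, treatment)

# One-sample test on the per-subject differences against a mean of 0.
diff = control - treatment
t_diff, p_diff = ttest_1samp(diff, 0.0)

print("paired test       : t =", t_paired, " p =", p_paired)
print("test on differences: t =", t_diff, " p =", p_diff)
```

Both calls return the same statistic and p-value, which is why the paired test only makes sense when each value in one sample has a natural partner in the other.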


## Z test

A z test is appropriate when the following conditions hold:

* 1. Your sample size is greater than 30. Otherwise, use a t test.
* 2. Data points should be independent of each other. In other words, one data point is not related to and does not affect another data point.
* 3. Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always matter.
* 4. Your data should be randomly selected from a population, where each item has an equal chance of being selected.
* 5. Sample sizes should be equal if at all possible.

In [10]:
from statsmodels.stats import weightstats as stests
help(stests.ztest)

Help on function ztest in module statsmodels.stats.weightstats:

ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)
    test for mean based on normal distribution, one or two samples
    
    In the case of two samples, the samples are assumed to be independent.
    
    Parameters
    ----------
    x1 : array_like, 1-D or 2-D
        first of the two independent samples
    x2 : array_like, 1-D or 2-D
        second of the two independent samples
    value : float
        In the one sample case, value is the mean of x1 under the Null
        hypothesis.
        In the two sample case, value is the difference between mean of x1 and
        mean of x2 under the Null hypothesis. The test statistic is
        `x1_mean - x2_mean - value`.
    alternative : str
        The alternative hypothesis, H1, has to be one of the following
    
           'two-sided': H1: difference in means not equal to value (default)
           'larger' :   H1: difference in means la

### One Sample Z Test

The one-sample z-test is used to test whether the mean of a population is greater than, less than, or not equal to a specific value. Because the standard normal distribution is used to calculate critical values for the test, this test is often called the one-sample z-test.
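
As a rough sketch of what the one-sample z-test computes, the statistic and its two-sided p-value can be written out by hand with `scipy.stats.norm`. The `prices` array below is synthetic stand-in data (its mean, spread and size are assumptions), not the housing dataset, and the sample standard deviation is used in place of the unknown population value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
# Synthetic stand-in for a large sample of prices.
prices = rng.normal(loc=181000, scale=79000, size=1460)
mu0 = 180000  # hypothesised population mean

# z statistic: distance of the sample mean from mu0 in standard-error units.
std_error = prices.std(ddof=1) / np.sqrt(prices.size)
z_stat = (prices.mean() - mu0) / std_error

# Two-sided p-value from the standard normal distribution.
p_two_sided = 2 * norm.sf(abs(z_stat))

print("z statistic:", z_stat)
print("p-value    :", p_two_sided)
```

With a sample this large, the t distribution is almost identical to the normal distribution, so a one-sample t-test on the same data gives nearly the same p-value.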

In [None]:
## We are testing whether the mean of house prices is 180000 or not

In [11]:
from scipy import stats

ztest ,pval = stests.ztest(x1 = data['SalePrice'], x2=None, value=180000)
print("P Value :",float(pval))

if pval<0.05:
    print(" We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

P Value : 0.6577127747949538
We fail to reject the null hypothesis


### Two sample Z Test

In the two-sample z-test, as in the two-sample t-test, we compare two independent groups of data and decide whether their means are equal or not.

In [12]:
# Two Sample z test
# Note: 1stFlrSF and 2ndFlrSF hold floor areas in square feet, not prices.

firstfloor_mean = data['1stFlrSF'].mean()
secondfloor_mean = data['2ndFlrSF'].mean()

print("Mean first floor area (sq ft) :", firstfloor_mean)
print("Mean second floor area (sq ft) :", secondfloor_mean)

firstfloor_std = data['1stFlrSF'].std()
secondfloor_std = data['2ndFlrSF'].std()

print("\nStandard Deviation of first floor area (sq ft) :", firstfloor_std)
print("Standard Deviation of second floor area (sq ft) :", secondfloor_std)

# alternative='smaller' tests H1: mean(1stFlrSF) - mean(2ndFlrSF) < 0
zstat, pval = stests.ztest(data['1stFlrSF'], data['2ndFlrSF'], value = 0, alternative = 'smaller')
print("\np-value",pval)

if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we fail to reject null hypothesis")

Mean first floor area (sq ft) : 1162.626712328767
Mean second floor area (sq ft) : 346.99246575342465

Standard Deviation of first floor area (sq ft) : 386.5877380410744
Standard Deviation of second floor area (sq ft) : 436.5284358862568

p-value 1.0
we fail to reject null hypothesis


### ANOVA (F-Test)

The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time.

* **For example,**
if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level, or group, of the variable.

We could carry out a separate t-test for each pair of groups, but when you conduct many tests you increase the chances of false positives. The analysis of variance or ANOVA is a statistical inference test that lets you compare multiple groups at the same time.

![image.png](attachment:image.png)

**Unlike the z and t-distributions, the F-distribution does not take negative values, because between-group and within-group variability are sums of squared deviations and so can never be negative.**

### One Way ANOVA (F Test)

It tells whether two or more groups are similar or not, based on the similarity of their means and the F-score.
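
The F-score is the ratio of between-group variability to within-group variability. The sketch below computes it by hand for three small made-up groups and checks the result against `scipy.stats.f_oneway`; the group values are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Three hypothetical groups of measurements.
groups = [
    np.array([23.0, 25.0, 21.0, 22.0, 24.0]),
    np.array([30.0, 28.0, 29.0, 31.0, 27.0]),
    np.array([24.0, 26.0, 25.0, 27.0, 23.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = all_values.size  # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the values around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = f.sf(f_manual, k - 1, n - k)

f_ref, p_ref = f_oneway(*groups)
print("manual: F =", f_manual, " p =", p_manual)
print("scipy : F =", f_ref, " p =", p_ref)
```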


### Checking significance between Overall Condition and Sale Price of houses.

In [16]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

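# Note: this call treats the SalePrice column and the OverallCond column as two
# groups and compares their means; it does not split SalePrice by OverallCond level.
F, p = stats.f_oneway(data['SalePrice'], data['OverallCond'])
print("p-value for significance is: ", p)

if p < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")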

p-value for significance is:  0.0
reject null hypothesis
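
To compare the mean of a numeric column across the levels of a categorical column, one common pattern is to split the numeric column by level and pass each piece to `f_oneway` as its own group. The sketch below shows that pattern on a synthetic DataFrame called `toy`. It reuses the column names `OverallCond` and `SalePrice` only for readability, and the values are randomly generated, so its result says nothing about the real housing data.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(4)
# Synthetic data: 300 rows, a condition rating from 3 to 7 and a random price.
toy = pd.DataFrame({
    'OverallCond': rng.integers(3, 8, size=300),
    'SalePrice': rng.normal(loc=180000, scale=40000, size=300),
})

# One array of prices per condition level.
groups = [grp['SalePrice'].to_numpy() for _, grp in toy.groupby('OverallCond')]

F, p = f_oneway(*groups)
print("number of groups:", len(groups))
print("F =", F, " p =", p)
```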



### Two Way F-test 

The two-way F-test is an extension of the one-way F-test. It is used when we have two independent variables (factors) and two or more groups. The two-way F-test does not tell us which variable is dominant; if we need to check individual significance, post-hoc testing needs to be performed.