# Introduction to Hypothesis Testing

## What is a Hypothesis?
- A hypothesis is a proposed explanation for a phenomenon: an educated guess about something in the world around you.
- It should be objectively testable, either by experiment or by observation.
- Singular: hypothesis
- Plural: hypotheses

__Examples__

- A new medicine you think might work.
- A way of teaching you think might be better.
- A possible location of a new species.




## What is a Hypothesis Statement?
“If I (do this to an **independent variable**), then (this will happen to the **dependent variable**).”

__Examples__

- If I (decrease the amount of water given to herbs) then (the herbs will increase in size).
- If I (give patients counseling in addition to medication) then (their overall depression scale will decrease).
- If I (give exams at noon instead of 7) then (student test scores will improve).

A good hypothesis statement should:
- Include an “if” and “then” statement (according to the University of California).
- Include both the independent and dependent variables.
- Be testable by experiment, survey or other scientifically sound technique.
- Be based on information in prior research (either yours or someone else’s).
- Have design criteria (for engineering or programming projects).


## What is Hypothesis Testing?
Hypothesis testing in statistics is a way to check whether the results of a survey or experiment are meaningful. You estimate how likely results like yours would be if chance alone were at work. If chance could plausibly explain them, the experiment is unlikely to be repeatable and so has little use.

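## Null and Alternative Hypotheses
- Null hypothesis (H0): The default assumption of the test holds, usually that there is no effect or no relationship. If the data give too little evidence against it, we fail to reject it at the chosen level of significance.
- Alternative hypothesis (Ha, also written H1): The assumption of the test does not hold. It is supported when H0 is rejected at the chosen level of significance.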

## Errors in Statistical Tests
- **Type I Error:** Incorrectly rejecting a true null hypothesis (a false positive).
- **Type II Error:** Incorrectly failing to reject a false null hypothesis (a false negative).

![Type I and Type II errors](../img/errors.png)

## Alpha ($\alpha$)
- $\alpha$ is the probability of rejecting H0 when H0 is true.
- $\alpha$ = probability of a Type I error.
- Ranges from 0 to 1.
- **A high $\alpha$ is not good:** it means a greater chance of a false positive.
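
The link between $\alpha$ and the Type I error rate can be checked with a small simulation. The sketch below is illustrative only: it assumes SciPy is available, uses synthetic data for which H0 is true by construction, and picks a one-sample t-test as the example test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 2000

# Every sample comes from a population whose true mean is 0,
# so H0 ("the mean is 0") is true in every experiment.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, 30))

# One-sample t-test per row against the true mean.
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Each rejection is a Type I error, because H0 really holds.
type_i_rate = np.mean(p_values <= alpha)
print(f"Observed Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```

The observed rate should land close to the chosen $\alpha$, which is why raising $\alpha$ raises the share of false positives.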

## p-value
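- The p-value is the probability of observing a result at least as extreme as the one obtained, assuming H0 is true.
- If p-value > $\alpha$: Fail to reject the null hypothesis (i.e., the result is not significant).
- If p-value <= $\alpha$: Reject the null hypothesis (i.e., the result is significant).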
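
The decision rule can be written as a tiny helper. `decide` is a hypothetical name used only for this sketch; it is not part of any library.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 only when p_value <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))        # reject H0
print(decide(0.20))        # fail to reject H0
print(decide(0.03, 0.01))  # fail to reject H0 (stricter alpha)
```

The same p-value can lead to different decisions depending on the significance level chosen beforehand.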

## Hypothesis Testing Process 

- **Step-1:** Null Hypothesis H0
    - Assumed true until the evidence says otherwise
    - Usually posits no relationship

- **Step-2:** Select Test
    - Pick from a vast library of tests
    - Know which one fits your data and question

- **Step-3:** Significance Level
    - Usually 1% or 5%
    - What threshold for luck?

- **Step-4:** Alternative Hypothesis
    - Negation of the null hypothesis
    - Usually asserts a specific relationship

- **Step-5:** Test Statistic
    - Convert to a p-value
    - How likely is it that the result was just luck?

- **Step-6:** Reject or Fail to Reject
    - Small p-value? Reject H0
    - Small: at or below the significance level

![images](../img/hypothesis.png)
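
The six steps can be followed end to end in a few lines. This is a minimal sketch under stated assumptions: the measurements are hypothetical (not taken from any data set in this document), the hypothesized mean of 5.0 is made up, and a one-sample t-test from SciPy stands in for "the selected test".

```python
from scipy import stats

# Hypothetical measurements, for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.7, 5.3, 5.4, 5.6]

# Step 1: H0 says the population mean equals 5.0 (assumed value).
hypothesized_mean = 5.0

# Step 2: select the test. A one-sample t-test compares a sample mean
# with a fixed value.

# Step 3: significance level.
alpha = 0.05

# Step 4: Ha says the population mean differs from 5.0 (two-sided).

# Step 5: compute the test statistic and convert it to a p-value.
result = stats.ttest_1samp(sample, popmean=hypothesized_mean)

# Step 6: reject H0 only if the p-value is at or below alpha.
decision = "reject H0" if result.pvalue <= alpha else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f} -> {decision}")
```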

## Common Statistical Tests
### Variable Distribution Type Tests (Gaussian)
- Shapiro-Wilk Test
- D’Agostino’s $K^2$ Test
- Anderson-Darling Test
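
A hedged sketch of how these three normality tests might be called, assuming SciPy is installed. The data are synthetic and drawn from a Gaussian, so H0 ("the sample comes from a normal distribution") is true by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=5, size=200)  # synthetic Gaussian sample

# Shapiro-Wilk: returns a statistic and a p-value.
shapiro_stat, shapiro_p = stats.shapiro(data)
print(f"Shapiro-Wilk: W={shapiro_stat:.3f}, p={shapiro_p:.3f}")

# D'Agostino's K^2: scipy exposes it as normaltest.
k2_stat, k2_p = stats.normaltest(data)
print(f"D'Agostino K^2: K2={k2_stat:.3f}, p={k2_p:.3f}")

# Anderson-Darling: called this way it reports critical values per
# significance level (in percent) instead of a single p-value.
anderson_result = stats.anderson(data, dist="norm")
print(f"Anderson-Darling: A2={anderson_result.statistic:.3f}")
for crit, level in zip(anderson_result.critical_values,
                       anderson_result.significance_level):
    verdict = "reject H0" if anderson_result.statistic > crit else "fail to reject H0"
    print(f"  {level:>4}% level: critical value {crit:.3f} -> {verdict}")
```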

### Variable Relationship Tests (correlation)
- Pearson’s Correlation Coefficient
- Spearman’s Rank Correlation
- Kendall’s Rank Correlation
- Chi-Squared Test
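
The relationship tests listed above map onto `scipy.stats` functions roughly as follows. This is a sketch on synthetic data: `y` is built to depend linearly on `x`, and the 2x2 contingency table is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # y depends linearly on x

# Numeric vs numeric: H0 is "no correlation".
pearson_r, pearson_p = stats.pearsonr(x, y)       # linear relationship
spearman_rho, spearman_p = stats.spearmanr(x, y)  # monotonic, rank-based
kendall_tau, kendall_p = stats.kendalltau(x, y)   # rank-based, pairwise agreement
print(f"Pearson r={pearson_r:.3f} (p={pearson_p:.3g})")
print(f"Spearman rho={spearman_rho:.3f} (p={spearman_p:.3g})")
print(f"Kendall tau={kendall_tau:.3f} (p={kendall_p:.3g})")

# Categorical vs categorical: chi-squared test of independence on a
# hypothetical table of counts (rows: group, columns: outcome).
table = np.array([[30, 10],
                  [15, 25]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)
print(f"Chi-squared={chi2:.3f}, dof={dof}, p={chi2_p:.3g}")
```

The first three tests take two numeric variables, while the chi-squared test works on counts of two categorical variables.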

### Compare Sample Means (parametric)
- Student’s t-test
- Paired Student’s t-test
- Analysis of Variance Test (ANOVA)
- Repeated Measures ANOVA Test
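
A minimal sketch of the first three parametric tests with `scipy.stats`, on synthetic groups whose means are chosen for illustration. Repeated measures ANOVA is not shown here; it is usually run with another package, for example `AnovaRM` in `statsmodels`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic groups: group_b has a higher true mean than group_a and group_c.
group_a = rng.normal(loc=100, scale=10, size=40)
group_b = rng.normal(loc=115, scale=10, size=40)
group_c = rng.normal(loc=100, scale=10, size=40)

# Paired data: the same 40 hypothetical subjects measured twice.
before = group_a
after = before + rng.normal(loc=5, scale=3, size=40)

results = {
    "Student's t-test (2 independent samples)": stats.ttest_ind(group_a, group_b),
    "Paired Student's t-test (2 paired samples)": stats.ttest_rel(before, after),
    "One-way ANOVA (3+ independent samples)": stats.f_oneway(group_a, group_b, group_c),
}
for name, res in results.items():
    print(f"{name}: statistic={res.statistic:.3f}, p={res.pvalue:.3g}")
```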

### Compare Sample Means (nonparametric)
- Mann-Whitney U Test
- Wilcoxon Signed-Rank Test
- Kruskal-Wallis H Test
- Friedman Test
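
The nonparametric counterparts can be called in much the same way. The sketch below assumes SciPy and uses synthetic, skewed (exponential) data, the kind of sample for which a normality assumption would be doubtful.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Independent, skewed samples with shifted locations.
group_a = rng.exponential(scale=1.0, size=30)
group_b = rng.exponential(scale=1.0, size=30) + 2.0
group_c = rng.exponential(scale=1.0, size=30) + 4.0

# Three repeated measurements on the same 30 hypothetical subjects.
visit_1 = group_a
visit_2 = visit_1 + rng.normal(loc=1.0, scale=0.3, size=30)
visit_3 = visit_2 + rng.normal(loc=1.0, scale=0.3, size=30)

results = {
    "Mann-Whitney U (2 independent samples)": stats.mannwhitneyu(group_a, group_b),
    "Wilcoxon signed-rank (2 paired samples)": stats.wilcoxon(visit_1, visit_2),
    "Kruskal-Wallis H (3+ independent samples)": stats.kruskal(group_a, group_b, group_c),
    "Friedman (3+ paired samples)": stats.friedmanchisquare(visit_1, visit_2, visit_3),
}
for name, res in results.items():
    print(f"{name}: statistic={res.statistic:.3f}, p={res.pvalue:.3g}")
```

Each test here mirrors one of the parametric tests in the previous list, but works on ranks rather than on the raw values.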

In [3]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 

In [5]:
df = pd.read_excel('../data/height_weight.xlsx')
df 

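|   | Gender | Age Group | Height(m) | Weight(kg) |
|---|--------|-----------|-----------|------------|
| 0 | female | adult     | 1.4       | 60         |
| 1 | male   | child     | 1.2       | 15         |
| 2 | male   | adult     | 1.5       | 85         |
| 3 | female | adult     | 1.3       | 74         |
| 4 | male   | adult     | 1.6       | 77         |
| 5 | female | elderly   | 1.5       | 65         |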


In [6]:
df.dtypes

Gender         object
Age Group      object
Height(m)     float64
Weight(kg)      int64
dtype: object

In [7]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Gender      6 non-null      object 
 1   Age Group   6 non-null      object 
 2   Height(m)   6 non-null      float64
 3   Weight(kg)  6 non-null      int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 320.0+ bytes


In [8]:
df['Gender'].value_counts() 

male       3
female     2
female     1
Name: Gender, dtype: int64
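
The output above lists `female` twice, which usually means the labels are not identical as strings. A common cause is stray whitespace or inconsistent capitalisation, but that is an assumption here because the raw cell values are not shown. A minimal sketch of the clean-up on a hypothetical Series (the values are made up to reproduce the symptom):

```python
import pandas as pd

# Hypothetical labels: one "female" carries a trailing space (assumed cause).
gender = pd.Series(["female", "male", "male", "female ", "male", "female"])

print(gender.value_counts())  # "female" and "female " are counted separately

# Strip surrounding whitespace and normalise case before counting.
cleaned = gender.str.strip().str.lower()
counts = cleaned.value_counts()
print(counts)
```

Cleaning category labels before testing matters because a test on a categorical variable would otherwise treat the two spellings as different groups.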

## Statistical Test Selection 
- 1 categorical variable => one-sample proportion test
- 2 categorical variables => chi-squared test
- 1 numeric variable => one-sample t-test
- 1 numeric and 1 categorical variable => t-test (2 groups) or ANOVA (more than 2 groups)
- 1 numeric variable and several categorical variables => ANOVA with multiple factors
- 2 numeric variables => correlation test
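
As an illustration of the rule for two numeric variables, the sketch below runs a correlation test on the height and weight columns. The six rows are retyped from the DataFrame output shown earlier so that the block does not depend on the Excel file. `sample_df` is a name introduced for this sketch, `scipy.stats.pearsonr` is an assumed choice of correlation test, and six observations are far too few for a dependable conclusion.

```python
import pandas as pd
from scipy import stats

# Rows copied from the DataFrame output shown earlier in this document.
sample_df = pd.DataFrame({
    "Height(m)": [1.4, 1.2, 1.5, 1.3, 1.6, 1.5],
    "Weight(kg)": [60, 15, 85, 74, 77, 65],
})

alpha = 0.05  # significance level chosen before looking at the data

# H0: height and weight are not linearly correlated.
r, p_value = stats.pearsonr(sample_df["Height(m)"], sample_df["Weight(kg)"])

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"Pearson r = {r:.3f}, p = {p_value:.3f} -> {decision}")
```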

## References
- https://machinelearningmastery.com/statistical-hypothesis-tests/
- https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/