# Hypothesis Testing

- NOTE: still feel not very clear with this section
- https://vitalflux.com/when-to-use-z-test-vs-t-test-differences-examples/
- https://youtu.be/ChLO7wwt7h0?si=8h8uSCAVVAuBtpD6

## Terms of Hypothesis Testing

- <big>p-value</big>
  - **Probability of observing an effect in the samples as extreme as, or more extreme than, the one observed**
    - **Given the Null hypothesis is True (for the population)**
  - When _`p-value` is less than `Significance Level`_ (e.g. 0.05)
    - Null is Rejected
    - Because, given Null is True
      - **The probability of observing such an effect is so low that the data do not support the claim of "no difference"**
- <big>One-Sample t-test vs Two-Sample t-test</big>
  - One-Sample is to compare a single population value to a standard value
  - Two-Sample is to compare a value of population A with the same value of population B
  - They use different formulas to calculate the t-score
  - ...
- <big>Paired t-test</big>
  - To compare a single population before and after some experimental intervention
    - I.e. compare a single population at two different points in time
  - ...
- <big>One-tailed Test vs Two-tailed Test</big>
  - To divide the significance level into two parts or keep it as just one part
  - E.g. with Significance Level = 0.05 (alpha)
    - Two-tailed means lower-part is 0.025, upper-part is also 0.025
    - One-tailed means either the lower-part is 0.05 or upper-part is 0.05 (only one of the two)
  - This is set with the Alternative hypothesis
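
As a minimal sketch of how the choice of tails changes the p-value, the example below runs a one-sample t-test three times through the `alternative` parameter of `scipy.stats.ttest_1samp`. The `sample` values and the `reference_value` of 50 are made up for illustration and are not data from these notes.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, invented for illustration only
sample = np.array([52.1, 49.8, 53.4, 51.2, 50.9, 54.0, 52.7, 51.5])
reference_value = 50  # hypothetical standard value under the Null

# Two-tailed: Alternative is "mean != 50", alpha is split across both tails
two_tailed = stats.ttest_1samp(sample, popmean=reference_value, alternative="two-sided")

# One-tailed (upper): Alternative is "mean > 50", alpha sits in the upper tail only
upper_tailed = stats.ttest_1samp(sample, popmean=reference_value, alternative="greater")

# One-tailed (lower): Alternative is "mean < 50", alpha sits in the lower tail only
lower_tailed = stats.ttest_1samp(sample, popmean=reference_value, alternative="less")

print(two_tailed.pvalue, upper_tailed.pvalue, lower_tailed.pvalue)
```

Because the sample mean here is above 50, the upper-tailed p-value is half of the two-tailed one, and the lower-tailed p-value is its complement.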

In [15]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px

from scipy import stats
sns.set()

## Steps for Testing

1. **State the Null and Alternative hypothesis**
   - Null:
     - assumed to be True
     - observed event happened only by chance and no effect from treatment
   - Alternative:
     - observed event does not happen by chance, having effect from treatment
   - E.g. Coin Toss 
2. **Choose a significance level**
   - I.e. alpha, usually at 0.05 or lower
   - Probability of falsely rejecting the null hypothesis when it's True, (False Positive)
   - Risk of getting False Positive result on the test
3. **Find the `p-value`**
   - Probability of observing results "as or more" extreme than those observed
     - Given the Null hypothesis is True
   - E.g. a p-value of 0.0237 (2.37%)
     - Given the null is True (no difference) 
     - The probability of observing the effect (or more extreme) from the sample is only 2.37%
4. **"Reject" or "Fail to reject" the null hypothesis**
   - Saying "Fail to Reject" instead of "accept", because "accept" implies certainty
     - But probability is not about certainty
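
The four steps can be sketched with the coin toss example. The counts below (60 heads in 100 tosses) are hypothetical, and the exact binomial test `scipy.stats.binomtest` (available in SciPy 1.7 and later) is one possible choice of test here, not necessarily the one the linked material uses.

```python
from scipy import stats

# Step 1: Null = the coin is fair (P(heads) = 0.5); Alternative = the coin is not fair
null_prob_heads = 0.5

# Hypothetical observation, invented for illustration: 60 heads in 100 tosses
n_tosses = 100
n_heads = 60

# Step 2: choose a significance level
significance_level = 0.05

# Step 3: find the p-value with an exact two-sided binomial test
result = stats.binomtest(n_heads, n=n_tosses, p=null_prob_heads, alternative="two-sided")

# Step 4: reject or fail to reject the Null
reject_null = result.pvalue < significance_level
print(f"p-value = {result.pvalue:.4f}, reject null: {reject_null}")
```

With these made-up counts the p-value lands slightly above 0.05, so we fail to reject the Null even though 60 heads looks lopsided.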

## P-Value

- What is p-value
  - **Given the Null hypothesis is True**
    - **<mark>Probability of observing a difference from the samples "as or more extreme"</mark>**
  - **Probability to obtain the effect observed from the given sample (or more extreme effect)**
    - **Given the Null hypothesis is True for the populations**
  - So given a p-value of 0.03 (3%):
    - If the Null is True, there is only a 3% probability of observing a difference as or more extreme
    - A very low probability
- What is Significance Level
  - **Given Null hypothesis is True**
    - **<mark>Probability of rejecting the Null hypothesis (i.e. committing a Type I Error)</mark>**
  - Or, the chance (in percentage) of being wrong that you're willing to accept when rejecting the Null hypothesis
    - E.g. with 0.05, a 5% chance of rejecting the Null when it is actually True
  - Also called "α", "alpha" or "False Positive Rate"
  - Usually at 0.05 or 0.01
- IF "P-value" is smaller than the "Significance Level" (0.05)
  - Given the Null is True, the probability of observing a difference as or more extreme is less than 5%
  - Willing to accept a 5% probability of committing a Type I Error (False Positive)
  - THEN you can "Reject Null Hypothesis"
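
To make the link between the significance level and the False Positive Rate concrete, here is a small simulation sketch. All numbers (2000 experiments, 30 observations per group, standard normal data) are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
significance_level = 0.05
n_experiments = 2000
sample_size = 30

# Both groups come from the SAME distribution, so the Null is True by construction
group_a = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))
group_b = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# One two-sample t-test per row (i.e. per simulated experiment)
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# Share of experiments where a true Null is rejected = empirical Type I error rate
false_positive_rate = np.mean(p_values < significance_level)
print(false_positive_rate)
```

The printed rate should sit close to the chosen alpha of 0.05, which is what "willing to accept a 5% probability of a Type I Error" means in practice.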

### Calculations

- With **Test Statistic**
  - Under the null hypothesis (no difference)
    - A value that shows how closely the observed data matches the distribution expected
- **Conducting Z-test**, so the Test Statistic is **"Z-score"**
  - Measure of how many standard deviations the observed data is below or above the population mean
  - I.e. where the value lies on a normal distribution
  - Formula for Z-score with One Sample Test:
    - (`Sample Mean` - `Population Mean`), divided by (`Population std` / `square root of Sample size`)
    - Requires known Population Mean and Standard Deviation
  - Formula for Z-score with Two Sample Test on Proportions:
    - ...
- **Conducting t-test**, so the Test Statistic is **"t-score"**
  - Formula for t-score with Two Sample Test on Means:
    - (`Sample 1 Mean` - `Sample 2 Mean`), divided by
       - Square Root of (`Sample 1 Std Squared` / `Sample 1 Size` +  `Sample 2 Std Squared` / `Sample 2 Size`)
- Getting `p-value`
  - The `p-value` is the area under the curve of the null distribution at or beyond the observed test statistic
  - Left-tailed, Right-tailed or Two-tailed tests
  
      
> https://www.coursera.org/learn/the-power-of-statistics/lecture/Kv9dl/one-sample-test-for-means  
> https://www.coursera.org/learn/the-power-of-statistics/lecture/PmaS3/two-sample-tests-proportions  
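
The two formulas above can be sketched numerically. The inputs are hypothetical (a sample mean of 103 against a population mean of 100 with known std 15, plus two synthetic samples), and the manual t-score is cross-checked against `scipy.stats.ttest_ind` with `equal_var=False`, which uses the same unequal-variance formula.

```python
import numpy as np
from scipy import stats

# --- One-sample Z-test (population std assumed known) ---
sample_mean = 103.0      # hypothetical
population_mean = 100.0  # hypothetical value under the Null
population_std = 15.0    # hypothetical, assumed known
sample_size = 100

z_score = (sample_mean - population_mean) / (population_std / np.sqrt(sample_size))
# Two-tailed p-value: area in both tails beyond |z| under the standard normal curve
z_p_value = 2 * stats.norm.sf(abs(z_score))

# --- Two-sample t-test on means (unequal variances, as in the formula above) ---
rng = np.random.default_rng(seed=1)
sample_1 = rng.normal(loc=10.0, scale=2.0, size=40)  # synthetic data
sample_2 = rng.normal(loc=11.0, scale=3.0, size=50)  # synthetic data

t_score = (sample_1.mean() - sample_2.mean()) / np.sqrt(
    sample_1.var(ddof=1) / sample_1.size + sample_2.var(ddof=1) / sample_2.size
)

# Cross-check the manual t-score against SciPy's Welch t-test
scipy_result = stats.ttest_ind(sample_1, sample_2, equal_var=False)
print(z_score, z_p_value)
print(t_score, scipy_result.statistic, scipy_result.pvalue)
```

For a left-tailed or right-tailed test, the p-value would instead be `stats.norm.cdf(z_score)` or `stats.norm.sf(z_score)` respectively.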

## Type of Errors

- https://www.coursera.org/learn/the-power-of-statistics/supplement/Scyf1/type-i-and-type-ii-errors
- <big>**Type I Error**</big>
  - **False Positive**
    - Falsely classified as positive, while true label is negative (classification problem)
    - **<mark>Falsely considered Alternative as True (positive)</mark>, while it should be false**; and rejected null
  - Reject the null hypothesis when it’s actually true
  - significance level, or **alpha (α)**, represents the probability of making a Type I error
  - _NOTE: "I" and then "II", same as "Positive" then "Negative"_
  - To Minimize
    - Choosing a lower significance level, e.g. 1%
- <big>**Type II Error**</big>
  - **False Negative**
    - Falsely classified as negative, while true label is positive (classification problem)
    - **<mark>Falsely considered Alternative as False (negative)</mark>, while it should be true**; and did not reject null
  - Fail to reject the null hypothesis when it’s actually false 
  - Probability of making a Type II error is called **beta (β)**
    - beta is related to the power of a hypothesis test (power = 1- β)
    - **Power** refers to the likelihood that a test can correctly detect a real effect when there is one
  - To Minimize
    - ensuring your test has enough power, by increasing the sample size or the significance level (a higher significance level also raises the Type I error rate)
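
A rough simulation sketch of how sample size affects power and beta. The effect size (0.5 standard deviations), the sample sizes, and the helper `estimate_power` are all assumptions introduced here for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
significance_level = 0.05
n_experiments = 2000
true_difference = 0.5  # hypothetical real effect, so the Null is False by construction


def estimate_power(sample_size):
    """Share of simulated experiments that correctly reject the (false) Null."""
    control = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))
    treated = rng.normal(loc=true_difference, scale=1.0, size=(n_experiments, sample_size))
    p_values = stats.ttest_ind(control, treated, axis=1).pvalue
    return np.mean(p_values < significance_level)


power_small = estimate_power(sample_size=20)
power_large = estimate_power(sample_size=100)

# beta (Type II error rate) is the complement of power
print(f"n=20:  power={power_small:.2f}, beta={1 - power_small:.2f}")
print(f"n=100: power={power_large:.2f}, beta={1 - power_large:.2f}")
```

The larger sample should reject the false Null far more often, i.e. it has higher power and a lower Type II error rate.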

## One & Two Sample Test


### One-Sample Test

> Testing a population against a standard value, e.g. Data Scientist average income vs. the whole population's average income

- To test whether a population parameter is equal to a specific value or not
  - E.g. whether the average sales revenue (estimated from samples) equals 50,000 or not
  - E.g. whether the stock portfolio average rate equals 56% or not
- **One Sample Z-test**
  - Assumptions
    - Data is a random _sample from a normally distributed population_
    - Known _population standard deviation_ 
      - (usually this is unknown, so t-test is often used)
- **One Sample t-test**
  - Assumptions
    - Observations are independent of each other
    - Data are randomly sampled from the target population
    - The population distribution is approximately normal
  - Steps in conducting a test
    1. State the null hypothesis and the alternative hypothesis
       - Null: population mean equals the hypothesized value
       - Alt: "not equal to",  "less than", or "greater than"
    2. Choose a significance level
       - Probability of rejecting the null when it is true (False Positive)
    3. Find the `p-value`
       - Using t-score with the one-sample formula
       - $$t = \frac{\overline{x}-\mu_{0}}{s_{\overline{x}}}, \quad s_{\overline{x}} = \frac{s}{\sqrt{n}}$$
    4. Reject or fail to reject the null hypothesis
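
A minimal sketch of these steps, using the sales revenue example. The `revenue_sample` values are synthetic (drawn from a normal distribution with arbitrary parameters) and `50_000` plays the role of $\mu_0$; neither comes from real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
# Synthetic sample of 25 revenue figures, invented for illustration
revenue_sample = rng.normal(loc=52_000, scale=6_000, size=25)

hypothesized_mean = 50_000  # mu_0 from the Null hypothesis
significance_level = 0.05

# Manual t-score: (sample mean - mu_0) / standard error of the mean
standard_error = revenue_sample.std(ddof=1) / np.sqrt(revenue_sample.size)
t_manual = (revenue_sample.mean() - hypothesized_mean) / standard_error

# Same test via SciPy (two-sided by default)
result = stats.ttest_1samp(revenue_sample, popmean=hypothesized_mean)

print(t_manual, result.statistic, result.pvalue)
print("Reject null" if result.pvalue < significance_level else "Fail to reject null")
```

The manual t-score and `result.statistic` should agree, which confirms that SciPy applies the same one-sample formula.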

### Two-Sample Test

> Testing of the difference between Population A and Population B, e.g. Data Scientist average income vs. Software Engineer average income

- **Two Sample t-test**
  - To Test if two population means (parameters) are equal to each other or not
    - E.g. in A/B Testing, Group A vs Group B
  - Hypotheses
    ... 
  - Assumptions
    - Two samples are independent of each other
    - Samples are drawn randomly from a normally distributed population
    - Population standard deviation is _Unknown_ (thus using t-test)
  - P-value
- **Two Sample Z-test**
  - To test if two population **proportions** are equal to each other or not
    - _NOTE: t-tests are NOT applicable to proportions_
    - E.g. Side effects of medicine between two trial groups
    - E.g. Percentage of support for a new law in two districts
    - E.g. Proportion of satisfaction with the work environment in two different locations
  - Hypotheses
    - Null: no difference between the two proportions
    - Alternative: there is a difference between the two proportions
  - Assumptions
    - ...
  - P-value
    - Z-Statistic for proportions
      - (Difference between two sample proportions), divided by 
      - Square Root of (pooled proportion times (1 - pooled proportion) times (1/sample 1 size + 1/ sample 2 size))
      - Pooled Proportion
        - Weighted average of the proportions
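
The Z-statistic for proportions described above can be sketched by hand. The counts (45 of 500 vs. 30 of 500 patients with side effects) are hypothetical numbers chosen only to illustrate the arithmetic.

```python
import numpy as np
from scipy import stats

# Hypothetical trial counts, invented for illustration
side_effects_1, size_1 = 45, 500  # group 1: events, sample size
side_effects_2, size_2 = 30, 500  # group 2: events, sample size

prop_1 = side_effects_1 / size_1
prop_2 = side_effects_2 / size_2

# Pooled proportion: all events over all observations (size-weighted average)
pooled = (side_effects_1 + side_effects_2) / (size_1 + size_2)

standard_error = np.sqrt(pooled * (1 - pooled) * (1 / size_1 + 1 / size_2))
z_stat = (prop_1 - prop_2) / standard_error

# Two-tailed p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z_stat))
print(z_stat, p_value)
```

With these made-up counts the p-value falls between 0.05 and 0.10, so at a 0.05 significance level we would fail to reject the Null of equal proportions.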

### Other

- <font color='crimson'>TODO: add hypothesis testing to your data analyst arsenal</font>
  - https://mverbakel.github.io/2021-02-13/two-sample-proportions

In [10]:
df = px.data.gapminder()
print(df.shape)
df.head()

(1704, 8)


| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |


In [14]:
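(
    # Passing a string to Series.rename sets the Series name, which becomes the column name
    df.groupby("continent").gdpPercap.mean()
    .rename("GDP_per_capita").reset_index()
)

| | continent | GDP_per_capita |
|---|---|---|
| 0 | Africa | 2193.754578 |
| 1 | Americas | 7136.110356 |
| 2 | Asia | 7902.150428 |
| 3 | Europe | 14469.475533 |
| 4 | Oceania | 18621.609223 |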


> **Q: Is the difference in average GDP per capita between Asia and the Americas statistically significant?**

- $H_0$: There is no difference in the mean GDP per capita between countries in the Americas and countries in Asia
- $H_A$: There is a difference in the mean GDP per capita between countries in the Americas and countries in Asia

We're comparing the means of two independent samples, so this will be a **"Two Sample t-test"**

In [5]:
significance_level = 0.05
significance_level

0.05

In [7]:
# Two-sample t-test; equal_var=False selects Welch's t-test (no equal-variance assumption)
tmp = stats.ttest_ind(
    a=df.query("continent == 'Americas'").gdpPercap,
    b=df.query("continent == 'Asia'").gdpPercap,
    equal_var=False,
)
tmp

TtestResult(statistic=-0.9616471732473416, pvalue=0.3366254286406626, df=583.1581429498881)

In [8]:
tmp.pvalue < significance_level

False

Test Result:

- Since the p-value is not smaller than the significance level, we **Fail to reject the null hypothesis**
  - I.e. there is not enough evidence of a difference between the mean GDP per capita of Asia and the Americas (this does not prove the means are equal)
- NOTE: we're actually testing on the whole data available, but if we subset on only some years, that could be a sample of the GDP

## Chi-Squared Test

> Chi-Squared Tests (χ²)

- **Examine the relationship between "Categorical" Variables**
- Hypothesis Testing with Categorical Variables
- Test whether one or more categorical variables follow expected distributions
- https://www.coursera.org/learn/regression-analysis-simplify-complex-data-relationships/lecture/Zbmic/hypothesis-testing-with-chi-squared
- https://www.coursera.org/learn/regression-analysis-simplify-complex-data-relationships/supplement/DMwWu/chi-squared-tests-goodness-of-fit-versus-independence

### Goodness of Fit test

- Whether an observed categorical variable follows an expected distribution
- Hypotheses
  - NULL: The categorical variable does follow the expected distribution
  - ALT: The categorical variable does NOT follow the expected distribution
- E.g. "the number of website visitors to be the same for each day of the week"
  1. Identify the Null and Alternative Hypotheses
     - NULL: the number of website visitors is equal on any given day
     - ALT: the number of website visitors is not equal across the days of the week
  2. Calculate the chi-square test statistic (χ²)
     - Sum of
       - Squared (`observed` - `expected`), divided by `expected`
  3. Calculate the p-value
     - `stats.chisquare(f_obs = Observations, f_exp = Expectations)`
  4. Make a conclusion

### Test for Independence

- Whether or not two categorical variables are associated with each other
- Hypotheses
  - NULL: the two categorical variables are independent.
  - ALT: the two categorical variables are NOT independent
- E.g. "if the device used to visit your clothing store (Mac or PC) is independent from the visitor’s membership status (Member or Guest)"
  1. Hypotheses
     - H0: The type of device a website visitor uses to visit the website is independent of the visitor’s membership status.
     - Ha: The type of device a website visitor uses to visit the website is not independent of the visitor’s membership status.
  2. Test Statistics
     - TODO: add it and why it is defined like that
  3. p-value
     - `stats.contingency.chi2_contingency(Observations)`
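
A hedged sketch of the device vs. membership example. The contingency table is hypothetical. The expected counts are computed by hand as (row total × column total) / grand total, which is the count each cell would have if the two variables were independent, and then compared with SciPy's output.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table, invented for illustration
# rows: device (Mac, PC); columns: membership (Member, Guest)
observed = np.array([
    [850, 450],
    [1300, 900],
])

# Expected count per cell if the variables were independent:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()
chi2_manual = np.sum((observed - expected_manual) ** 2 / expected_manual)

# correction=False turns off Yates' continuity correction (applied by default
# for 2x2 tables) so the result matches the manual formula
chi2, p_value, dof, expected = stats.contingency.chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)
```

The degrees of freedom are (rows - 1) × (columns - 1), so a 2x2 table has 1 degree of freedom.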

## ANOVA

> ANOVA, Analysis of Variance

- **Examine the relationship between "Categorical" and "Continuous" Variables**
- Test the difference of means between three or more groups
  - Further extension from "t-test" (test between means of two groups)
- Unlike Regression Test, which helps understand **"How X impacts y"**
  - ANOVA allows zooming in on the relationship between variables in a pair-wise fashion

### One-Way ANOVA

- One-way ANOVA
  - Compare the means of one Continuous `y` across three or more groups of **one categorical variable**
  - E.g. Is "Lifespan" (`y`) of butterflies related to "Species" (categorical `X1`) of 3 types
- Hypotheses
  - Null: Mean Lifespan is not different between 3 types, i.e. $\mu_a$ = $\mu_b$ = $\mu_c$
    - I.e. the means of all groups are equal to each other
  - Alternative: Mean Lifespan is NOT all equal, i.e. NOT $\mu_a$ = $\mu_b$ = $\mu_c$
    - I.e. a single differing mean lifespan is enough to reject the null
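
A minimal sketch of the butterfly example with `scipy.stats.f_oneway`. The lifespans are synthetic, and the third species is deliberately generated with a higher mean so that the Null is false by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
# Synthetic lifespans (in days) for three hypothetical butterfly species
species_a = rng.normal(loc=10.0, scale=2.0, size=30)
species_b = rng.normal(loc=10.0, scale=2.0, size=30)
species_c = rng.normal(loc=14.0, scale=2.0, size=30)  # built to differ from a and b

# Null: mu_a = mu_b = mu_c; Alternative: at least one mean differs
result = stats.f_oneway(species_a, species_b, species_c)
print(result.statistic, result.pvalue)
```

A small p-value here only says that at least one mean differs; it does not say which species is the odd one out.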
  
### Two-Way ANOVA

- Two-way ANOVA
  - Compare the means of one Continuous `y` across the groups of **two categorical variables**
  - E.g. Is "Lifespan" (`y`) related to "Species" (categorical `X1`) and "Size" (categorical `X2`)?
- 3 Hypotheses
  - Test on Species
    - Null: No difference between 3 means of lifespan by Species
    - Alternative: There is a difference between 3 means of lifespan by Species
  - Test on Size
    - Null: No difference between 3 means of lifespan by Size
    - Alternative: There is a difference between 3 means of lifespan by Size
  - Test on **Interactions** of above two
    - Null: Effect of Species on Lifespan is independent of Size, and vice versa
    - Alternative: There's an interaction effect between Size and Species on Lifespan
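
The three tests can be sketched by hand for a balanced design (the same number of observations in every Species × Size cell). This is a hand-rolled illustration on synthetic data with a built-in Species effect only; it is not the method used in the linked course, and in practice a statistics library would usually do this work.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
n_species, n_sizes, n_per_cell = 3, 3, 10

# Synthetic lifespans with a built-in Species effect and no Size or interaction effect
species_effect = np.array([0.0, 2.0, 4.0]).reshape(n_species, 1, 1)
lifespan = 10.0 + species_effect + rng.normal(scale=1.5, size=(n_species, n_sizes, n_per_cell))

grand_mean = lifespan.mean()
species_means = lifespan.mean(axis=(1, 2))  # one mean per Species
size_means = lifespan.mean(axis=(0, 2))     # one mean per Size
cell_means = lifespan.mean(axis=2)          # one mean per Species x Size cell

# Sums of squares for a balanced design
ss_species = n_sizes * n_per_cell * np.sum((species_means - grand_mean) ** 2)
ss_size = n_species * n_per_cell * np.sum((size_means - grand_mean) ** 2)
ss_interaction = n_per_cell * np.sum(
    (cell_means - species_means[:, None] - size_means[None, :] + grand_mean) ** 2
)
ss_error = np.sum((lifespan - cell_means[:, :, None]) ** 2)

df_species, df_size = n_species - 1, n_sizes - 1
df_interaction = df_species * df_size
df_error = n_species * n_sizes * (n_per_cell - 1)
ms_error = ss_error / df_error

# One F-test per hypothesis: Species, Size, and their interaction
p_values = {}
for name, ss, df in [
    ("Species", ss_species, df_species),
    ("Size", ss_size, df_size),
    ("Species x Size", ss_interaction, df_interaction),
]:
    f_stat = (ss / df) / ms_error
    p_values[name] = stats.f.sf(f_stat, df, df_error)
    print(f"{name}: F={f_stat:.2f}, p={p_values[name]:.4f}")
```

Each F-statistic compares the variation explained by one factor (or the interaction) with the unexplained within-cell variation, so each of the three hypotheses gets its own p-value.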

### Others

- **ANCOVA, Analysis of CoVariance**
  - Statistical technique that
    - tests the difference of means between three or more groups
    - while controlling for the effects of **"covariates"** (i.e. variables that are not of direct interest to your test but may still affect `y`)
  - E.g. Book Sales as `y`, and Book Genre as `X`, with Publication Year as `Covariate`
  - Hypotheses
    - Null: Book Sales are equal across all Book Genres, regardless of the Publication Year
    - Alternative: Book Sales are NOT equal across all Book Genres, regardless of the Publication Year
  - Compare with Linear Regression
    - ANCOVA does not focus on the covariates, but Regression is interested in all of the independent variables (`X`)
    - ANCOVA focuses on the Categorical `X`, while Regression is more about predicting `y`
- **MANOVA, Multivariate Analysis of Variance**
  - Compare how **two or more continuous dependent variables (`y`)**
    - vary according to categorical `X` variables
  - One-way MANOVA
    - Continuous dependent variables vs "one" categorical X variable
  - Two-way MANOVA
    - Continuous dependent variables vs "two" categorical X variables
  - E.g. Book Sales and Book Profit as continuous dependent variables,
    - and Book Genre as the categorical independent variable
  - Hypotheses
    - Nulls
      - Mean Book Sales is the same for each Book Genre
      - Mean Book Profit is the same for each Book Genre
    - Alternatives
      - Mean Book Sales is NOT the same for each Book Genre
      - Mean Book Profit is NOT the same for each Book Genre
      - (a difference between any two genre means is enough to reject the null)
- MANCOVA
  - Compare how **two or more continuous dependent variables (`y`)**
    - vary according to categorical `X` variables,
    - while controlling for the effect of covariates
  - E.g. Book Sales and Book Profit as continuous dependent variables,
    - and Book Genre as the categorical independent variable
    - but control for the author's popularity, "Author N Followers"
  - Hypotheses
    - Null: Book Sales and Book Profit are equal across all Book Genres,
      - regardless of Author N Followers
    - Alternative: Book Sales and Book Profit are NOT equal across all Book Genres,
      - regardless of Author N Followers
- https://www.coursera.org/learn/regression-analysis-simplify-complex-data-relationships/lecture/IuRtP/more-dependent-variables-manova-and-mancova


In [2]:
# Load in diamonds data set from seaborn package
diamonds = sns.load_dataset("diamonds")
diamonds.head()


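| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |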


## Test by Type of Variables

## References

- https://libguides.library.kent.edu/SPSS/OneSampletTest
- https://statisticsbyjim.com/hypothesis-testing/comparing-hypothesis-tests-data-types/
- https://towardsdatascience.com/how-to-know-which-statistical-test-to-use-for-hypothesis-testing-744c91685a5d