# Hypothesis Testing

- NOTE: still feel not very clear with this section
- https://vitalflux.com/when-to-use-z-test-vs-t-test-differences-examples/

## Terms of Hypothesis Testing

- ...

## Steps for Testing

1. **State the null hypothesis and the alternative hypothesis**
   - Null:
     - assumed to be True
     - observed event happened only by chance and no effect from treatment
   - Alternative:
     - observed event does not happen by chance, having effect from treatment
   - E.g. Coin Toss 
2. **Choose a significance level**
3. **Find the `p-value`**
   - Probability of observing results "as or more" extreme than those observed when the null hypothesis is true
   - ...
4. **"Reject" or "Fail to reject" the null hypothesis**
   - "Fail to reject" instead of "accept", because "accept" implies certainty
     - But the conclusion is only about probability
   - ...
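
A minimal sketch of the four steps above, using the coin toss example. The counts (100 tosses, 60 heads) are made up for illustration, and the exact two-sided binomial p-value is computed with the standard library only.

```python
from math import comb

# Hypothetical data: 100 tosses, 60 heads (made up for illustration).
n_tosses = 100
n_heads = 60

# Step 1: H0: the coin is fair (P(heads) = 0.5); HA: the coin is not fair.
p_null = 0.5

# Step 2: choose a significance level.
significance_level = 0.05


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Step 3: two-sided p-value = total probability, under H0, of every outcome
# that is no more likely than the observed one ("as or more extreme").
observed_prob = binom_pmf(n_heads, n_tosses, p_null)
p_value = sum(
    binom_pmf(k, n_tosses, p_null)
    for k in range(n_tosses + 1)
    if binom_pmf(k, n_tosses, p_null) <= observed_prob * (1 + 1e-9)
)

# Step 4: reject or fail to reject H0.
decision = "reject H0" if p_value < significance_level else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

With these made-up counts the p-value lands slightly above 5%, so the null is not rejected even though 60 heads looks lopsided.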

## P-Value

- **Probability of observing a difference in the results**
  - **as or more extreme than the difference observed when null is true**
- Significance Level, e.g. 5%
  - the probability of rejecting the null hypothesis when it is true (i.e. committing a Type I Error)
  - willing to accept a five percent chance you are wrong when you reject the null hypothesis
- IF "P-Value < Significance Level",
  - I.e. the probability of observing a difference as or more extreme than the one observed, if the null is true, is less than 5%
  - I.e. willing to accept a 5% probability of committing a Type I Error (False Positive)
  - THEN "Reject Null Hypothesis"
- E.g. p-value is the probability of observing a difference
  - that is two minutes or greater if the null hypothesis is true
- TO-DO: How do you rephrase this?
- Calculate `p-value`
  - With **Test Statistic**
    - Under the null hypothesis (no difference)
      - A value that shows how closely the observed data matches the distribution expected
  - **Conducting Z-test**, so the Test Statistic is **"Z-score"**
    - Measure of how many standard deviations the observed data is below or above the population mean
    - I.e. where the value lies on a normal distribution
    - Formula for Z-score with One Sample Test:
      - (`Sample Mean` - `Population Mean`), divided by (`Population std` / `square root of Sample size`)
      - Requires a known Population Mean and Standard Deviation
    - Formula for Z-score with Two Sample Test on Proportions:
      - ...
  - **Conducting t-test**, so the Test Statistic is **"t-score"**
    - Formula for t-score with Two Sample Test on Means:
      - (`Sample 1 Mean` - `Sample 2 Mean`), divided by
         - Square Root of (`Sample 1 Std Squared` / `Sample 1 Size` +  `Sample 2 Std Squared` / `Sample 2 Size`)
  - Getting `p-value`
    - Area under the curve is `p-value`
    - Left-tailed, Right-tailed or Two-tailed tests
  
      
> https://www.coursera.org/learn/the-power-of-statistics/lecture/Kv9dl/one-sample-test-for-means  
> https://www.coursera.org/learn/the-power-of-statistics/lecture/PmaS3/two-sample-tests-proportions  
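
A rough sketch of the one-sample Z-score formula and the three kinds of tails, using only the standard library (`statistics.NormalDist`). All numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers, made up for illustration.
sample_mean = 52.0
population_mean = 50.0   # value claimed by the null hypothesis
population_std = 8.0     # assumed known, as the Z-test requires
sample_size = 64

# Z-score: (sample mean - population mean) / (population std / sqrt(sample size))
standard_error = population_std / sqrt(sample_size)
z_score = (sample_mean - population_mean) / standard_error

# The p-value is the area under the standard normal curve in the tail(s).
std_normal = NormalDist()  # mean 0, standard deviation 1
p_left = std_normal.cdf(z_score)                 # HA: mean is less than 50
p_right = 1 - std_normal.cdf(z_score)            # HA: mean is greater than 50
p_two = 2 * (1 - std_normal.cdf(abs(z_score)))   # HA: mean is not equal to 50

print(f"z = {z_score:.2f}")
print(f"left-tailed  p = {p_left:.4f}")
print(f"right-tailed p = {p_right:.4f}")
print(f"two-tailed   p = {p_two:.4f}")
```

Which tail to use follows from how the alternative hypothesis is stated, not from the data.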

## Type of Errors

- https://www.coursera.org/learn/the-power-of-statistics/supplement/Scyf1/type-i-and-type-ii-errors
- <big>**Type I Error**</big>
  - **False Positive**
    - Falsely classified as positive, while true label is negative (classification problem)
    - <mark>Falsely considered alternative as true (positive)</mark>, while it should be false; and rejected null
  - Reject the null hypothesis when it’s actually true
  - significance level, or **alpha (α)**, represents the probability of making a Type I error
  - _NOTE: "I" and then "II", same as "Positive" then "Negative"_
  - To Minimize
    - Choosing a lower significance level, e.g. 1%
- <big>**Type II Error**</big>
  - **False Negative**
    - Falsely classified as negative, while true label is positive (classification problem)
    - <mark>Falsely considered alternative as false (negative)</mark>, while it should be true; and did not reject null
  - Fail to reject the null hypothesis when it’s actually false 
  - Probability of making a Type II error is called **beta (β)**
    - beta is related to the power of a hypothesis test (power = 1- β)
    - **Power** refers to the likelihood that a test can correctly detect a real effect when there is one
  - To Minimize
    - ensuring your test has enough power, by increasing sample size or significance level (a higher significance level comes at the cost of a higher Type I error rate)
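
To make the trade-off between alpha, beta and power concrete, here is a small sketch that computes the power of a two-sided one-sample Z-test analytically. The helper `z_test_power` and the effect size and standard deviation are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_test_power(effect: float, sigma: float, n: int, alpha: float) -> float:
    """Power of a two-sided one-sample Z-test when the true mean shift is `effect`."""
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # reject H0 when |z| > z_crit
    shift = effect / (sigma / sqrt(n))          # expected z-score under the alternative
    # Probability that the z-score lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Hypothetical scenario: true effect of 0.5 units, population std of 2.
for n, alpha in [(30, 0.05), (100, 0.05), (30, 0.01)]:
    power = z_test_power(effect=0.5, sigma=2.0, n=n, alpha=alpha)
    print(f"n={n:>3}, alpha={alpha}: power={power:.3f}, beta={1 - power:.3f}")
```

A larger sample raises power without touching alpha, while lowering alpha to reduce Type I errors lowers power (raises beta) for the same sample size.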

## One & Two Sample Test

- One Sample Test
  - A population parameter is equal to a specific value or not
    - E.g. average sales revenue (of samples) equals a target value or not
    - E.g. stock portfolio average rate equals the benchmark or not
  - **One Sample Z-test**
    - Assumptions
      - Data is a random _sample from a normally distributed population_
      - Known _population standard deviation_ 
        - (usually this is unknown, so t-test is often used)
  - **One Sample t-test**
  - Steps.
    1. State the null hypothesis and the alternative hypothesis
       - Null: population mean equals the hypothesized value (e.g. the target or benchmark)
       - Alt: "not equal to",  "less than", or "greater than"
    2. Choose a significance level.
       - Probability of rejecting the null when it is true (False Positive)
    3. Find the `p-value` (using t-score)
    4. Reject or fail to reject the null hypothesis.
- **Two Sample t-test**
  - Two population means (parameters) are equal to each other or not
    - E.g. in A/B Testing, Group A vs Group B
  - Assumptions
    - Two samples are independent of each other
    - Samples are drawn randomly from a normally distributed population
    - Population standard deviation is unknown (thus using t-test)
- **Two Sample Z-test**
  - Two population **proportions** are equal to each other or not
    - NOTE: t-tests do NOT apply to proportions
    - E.g. Side effects of medicine between two trial groups
    - E.g. Percentage of support for a new law in two districts
    - E.g. Proportion of employees satisfied with the work environment in two different locations
  - Hypothesis
    - Null: no difference between the two proportions
    - Alternative: there is difference between the two proportions
  - P-value
    - Probability of observing a difference in sample proportions as or more extreme
      - than the difference observed, when Null is True
    - Z-Statistic for proportions
      - (Difference between two sample proportions), divided by 
      - Square Root of (pooled proportion times (1 - pooled proportion) times (1/sample 1 size + 1/ sample 2 size))
      - Pooled Proportion
        - Weighted average of the proportions
- <mark>TODO: add hypothesis testing to your data analyst arsenal</mark>
  - https://mverbakel.github.io/2021-02-13/two-sample-proportions
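
A minimal sketch of the Z-statistic for two proportions described above, with the pooled proportion spelled out. The trial counts are hypothetical (loosely following the medicine side effects example), and only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical trial counts, made up for illustration.
side_effects_a, size_a = 45, 500   # group A: 45 of 500 report side effects
side_effects_b, size_b = 30, 500   # group B: 30 of 500 report side effects

prop_a = side_effects_a / size_a
prop_b = side_effects_b / size_b

# Pooled proportion: average of the two proportions weighted by sample size.
pooled = (side_effects_a + side_effects_b) / (size_a + size_b)

# Z-statistic: difference in proportions over its standard error under H0.
standard_error = sqrt(pooled * (1 - pooled) * (1 / size_a + 1 / size_b))
z_stat = (prop_a - prop_b) / standard_error

# Two-tailed p-value, since HA only says the proportions differ.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

significance_level = 0.05
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < significance_level else "fail to reject H0")
```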

In [5]:
import numpy as np
import pandas as pd

import plotly
import plotly.express as px

from scipy import stats

In [6]:
df = px.data.gapminder()
df.shape

(1704, 8)

In [7]:
df.head()

|   | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |


In [8]:
df.groupby("continent", as_index=False).gdpPercap.mean()

|   | continent | gdpPercap |
|---|---|---|
| 0 | Africa | 2193.754578 |
| 1 | Americas | 7136.110356 |
| 2 | Asia | 7902.150428 |
| 3 | Europe | 14469.475533 |
| 4 | Oceania | 18621.609223 |


> Is the difference in average GDP per capita between Asia and the Americas statistically significant?

*   $H_0$: There is no difference in mean GDP per capita between countries in the Americas and countries in Asia
*   $H_A$: There is a difference in mean GDP per capita between countries in the Americas and countries in Asia

We're comparing the means of two independent samples, so this calls for a **"Two Sample t-test"**

In [9]:
significance_level = 0.05
significance_level

0.05

In [10]:
tmp = stats.ttest_ind(
    a=df.query("continent == 'Americas'").gdpPercap,
    b=df.query("continent == 'Asia'").gdpPercap,
    equal_var=False
)
tmp

TtestResult(statistic=-0.9616471732473416, pvalue=0.3366254286406626, df=583.1581429498881)

In [11]:
tmp.pvalue < significance_level

False

- Since the p-value is not smaller than the significance level, we fail to reject the null hypothesis
- There is not enough evidence to conclude that mean GDP per capita differs between the two continents (this is not proof that there is no difference)
- NOTE: we're actually testing on the whole data available, but if we subset to only some years, that could be a sample of the GDP
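
As a sanity check on what `stats.ttest_ind(..., equal_var=False)` computes, the sketch below rebuilds the two-sample t-score by hand and compares it with SciPy. It runs on synthetic data (not the gapminder values), and the group means, spreads and sizes are arbitrary assumptions. The degrees of freedom use the Welch-Satterthwaite approximation, which is what the unequal-variance (Welch) t-test relies on.

```python
import numpy as np
from scipy import stats

# Synthetic data standing in for the two groups (not the gapminder values).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=7000, scale=6000, size=300)
group_b = rng.normal(loc=7900, scale=14000, size=396)

# Sample variance (ddof=1) divided by sample size, for each group.
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size

# t-score: difference of sample means over the square root of the summed terms.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite degrees of freedom, then the two-tailed area under the t curve.
df_manual = (var_a + var_b) ** 2 / (
    var_a**2 / (group_a.size - 1) + var_b**2 / (group_b.size - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# Same test through SciPy, for comparison.
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

The two lines should agree, which confirms that the p-value is just the two-tailed area under the t distribution beyond the t-score.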