In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from math import sqrt
from scipy import stats

np.random.seed(123)

from pydataset import data

from env import host, user, password


## Hypothesis Testing
<font color= 'red'>What is Hypothesis Testing?</font>


* Hypothesis tests are used to draw conclusions, answer questions, or interpret beliefs we have about a population using sample data.
* A hypothesis test evaluates two mutually exclusive statements about a population and informs us which statement is best supported by our sample data.

![null_hypothesis.jpg](attachment:null_hypothesis.jpg)
$H_0$ : There is no difference between smokers' tips and the overall population's tip average.

$H_a$ : There is a difference between smokers' tips and the overall population's tip average.

   * After running and interpreting the values returned by the appropriate statistical test for my data, I will either fail to reject or reject the Null Hypothesis.
   
<a href="https://online.stat.psu.edu/statprogram/reviews/statistical-concepts/hypothesis-testing/p-value-approach">simple, yet detailed explanation of this process</a>

<font color="orange">So What?</font>
- "There are two possible outcomes; 
    1. if the result confirms the hypothesis, then you've made a measurement. 
    2. If the result is contrary to the hypothesis, then you've made a discovery." - Enrico Fermi


<font color ='green'>Now What?</font>
![hytpothesis_testing_process.jpg](attachment:hytpothesis_testing_process.jpg)

### Important Terms
* `Confidence level`:  the proportion of the time that, if a poll/test/survey were repeated over and over again, the interval estimated from each repetition would contain the true population value.
    * It conveys how confident we are in our results.
    * Raising your confidence level lowers your chances of Type I Errors, or False Positives. Common choices are 90%, 95%, and 99%.
* `alpha -->  α  = 1 - confidence level`:  If the resulting p-value from the hypothesis test is less than α, then the test findings are statistically significant.
    * If the p-value is slightly above your chosen cutoff, how you treat the result is at your discretion; it may still be worth reporting as borderline.
    * Also, alpha is the maximum probability of a Type I Error that you are willing to accept.
<blockquote> For a 95% confidence level, the value of alpha is .05, which means there is a 5% chance that you will make a Type I Error (False Positive), that is, reject the Null hypothesis when it is actually True.</blockquote>

* `t-statistic` : the calculated difference represented in units of standard error. The greater the magnitude of t, the greater the evidence against the null hypothesis.
    * Why does it matter? Short Answer: It allows us to calculate our p-value!
* `p-values` : values we obtain from hypothesis testing. A p-value is the probability of observing a result at least as extreme as the one in our sample, assuming the Null hypothesis is true.
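
To make the link between the t-statistic and the p-value concrete, here is a minimal sketch that computes both by hand for a one-sample test and checks them against `scipy.stats.ttest_1samp`. The sample values and `pop_mean` are made up for illustration; they are not taken from any dataset used in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 40 observations and an assumed population mean.
rng = np.random.default_rng(123)
sample = rng.normal(loc=3.2, scale=1.0, size=40)
pop_mean = 3.0

n = sample.size
# Standard error of the mean, using the sample standard deviation (ddof=1).
standard_error = sample.std(ddof=1) / np.sqrt(n)

# t-statistic: the observed difference expressed in units of standard error.
t_manual = (sample.mean() - pop_mean) / standard_error

# Two-sided p-value: probability of a t at least this extreme under the null.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test run through scipy, for comparison.
t_scipy, p_scipy = stats.ttest_1samp(sample, pop_mean)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The two lines printed should agree, which shows that the p-value is derived directly from the t-statistic and the degrees of freedom.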

https://towardsdatascience.com/everything-you-need-to-know-about-hypothesis-testing-part-i-4de9abebbc8a



### Hypothesis Testing Errors
![hyp_testing_errors.jpg](attachment:hyp_testing_errors.jpg)

<a href='https://online.stat.psu.edu/stat500/lesson/6a/6a.1#:~:text=When%20we%20fail%20to%20reject,the%20likelihood%20of%20these%20events.'>Hypothesis Testing Errors</a>

<font color='red'>Type I Error?</font>
* A Type I Error is a False Positive, which I can think of as sounding a false alarm.
* I predict there is a difference or a relationship when in reality there is no difference or no relationship.
* I reject the Null hypothesis when the Null hypothesis is actually True.
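
One way to see what a Type I Error rate means is to simulate many experiments in which the Null hypothesis really is true and count how often the test rejects it anyway. The sketch below assumes normally distributed data with a true mean of 0; the sample size and number of experiments are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
alpha = 0.05
n_experiments = 2000
sample_size = 30

# Each row is one experiment drawn from a population whose true mean is 0,
# so the null hypothesis (mean == 0) is true in every experiment.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# Run a one-sample t-test on every row at once.
_, p_values = stats.ttest_1samp(samples, 0.0, axis=1)

# Every rejection here is a false alarm, i.e. a Type I Error.
type_i_rate = float(np.mean(p_values < alpha))
print(f"Share of false alarms: {type_i_rate:.3f}")
```

The share of false alarms should land close to the chosen alpha, though it will vary a little from run to run with a different seed.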


<font color='orange'>So What?</font>

* If I am trying to determine whether a customer will churn, and my model predicts that they will churn (positive for churn), but they do not end up churning (we made a False prediction), this is a False Positive. I may have wasted time and money trying to woo a customer who was not going to leave anyway.


<font color='red'>Type II Error?</font>
* A Type II Error is a __False Negative__, which I can think of as a _miss_; I missed identifying a real difference or relationship that exists in reality.
* I predict that there is no difference or relationship when in reality there is a difference or relationship.
* I fail to reject the Null hypothesis when the Null hypothesis is actually false.


<font color='orange'>So What?</font>

* If I am trying to determine whether a customer will churn, and my model predicts that they will not churn (negative for churn), but they end up churning (we made a False Prediction), this is a False Negative. I may have lost the opportunity to woo a customer that was going to leave, before they left.

<font color ='green'>Now What?</font>

* In practice, if I create a classification model to predict customer churn, I can decide how to balance Type I and Type II Errors to have a model that suits my needs. 
    * In the instance of churn, which type of error would be more costly to the Telco company: a Type I Error or a Type II Error? Which would be more costly if I were trying to determine whether a shadow in an x-ray is cancer?
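
The trade-off between the two error types can be sketched by counting each kind of mistake and attaching a cost to it. Everything below is hypothetical: the labels, the predictions, and the dollar amounts in `cost_false_positive` and `cost_false_negative` are invented only to show the bookkeeping.

```python
import numpy as np

# Hypothetical labels: 1 = customer churned, 0 = customer stayed.
actual = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
predicted = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])

# Type I Error (False Positive): predicted churn, customer actually stayed.
false_positives = int(np.sum((predicted == 1) & (actual == 0)))

# Type II Error (False Negative): predicted no churn, customer actually left.
false_negatives = int(np.sum((predicted == 0) & (actual == 1)))

# Assumed costs: a wasted retention offer vs. the revenue of a lost customer.
cost_false_positive = 20
cost_false_negative = 500

total_cost = (false_positives * cost_false_positive
              + false_negatives * cost_false_negative)

print(f"False positives: {false_positives}, false negatives: {false_negatives}")
print(f"Total cost of errors: {total_cost}")
```

With costs this lopsided, a single False Negative outweighs many False Positives, which is the kind of reasoning that would push a churn model toward tolerating more false alarms.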

## T-Test - Continuous v Categorical Variables
<font color='red'>What is a T-Test?</font> 
* A type of inferential statistic used to determine if there is a significant difference between the means of two groups which may be related in certain features.
* It compares a categorical and a continuous variable by comparing the mean of the continuous variable by subgroups or the mean of a subgroup to the mean of the population.


<font color='orange'>So What?</font>
* `One Sample T-test` : when I compare the mean of a subgroup to the population mean.

    <blockquote>Are sales for group A higher when we run a promotion?</blockquote>

* `Two Sample T-test` : when I compare the mean of one subgroup to the mean of another subgroup.
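
As a rough illustration of the two kinds of test, the sketch below runs each one with `scipy.stats`. The groups are synthetic and `overall_mean` is an assumed population mean. Passing `equal_var=False` makes the two-sample call a Welch's t-test, which is my choice here rather than something the notes above prescribe.

```python
import numpy as np
from scipy import stats

# Synthetic sales figures for two hypothetical subgroups.
rng = np.random.default_rng(123)
group_a = rng.normal(loc=105, scale=15, size=50)
group_b = rng.normal(loc=100, scale=15, size=50)
overall_mean = 100  # assumed population mean

# One-sample t-test: mean of a subgroup vs. the population mean.
t_one, p_one = stats.ttest_1samp(group_a, overall_mean)

# Two-sample t-test: mean of one subgroup vs. mean of another subgroup.
# equal_var=False does not assume the two groups share the same variance.
t_two, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"One-sample: t = {t_one:.3f}, p = {p_one:.4f}")
print(f"Two-sample: t = {t_two:.3f}, p = {p_two:.4f}")
```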

<font color= 'green'>Now What?</font>
<blockquote>
<font color='cadetblue'># Set confidence level.</font>
confidence_level = .95
</blockquote>
<blockquote>
<font color='cadetblue'># Set alpha.</font>
alpha = 1 - confidence_level
</blockquote>

* If the p-value is higher than the alpha, I fail to reject the Null Hypothesis.

* If the p-value is lower than the alpha, I reject the Null Hypothesis.
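
The decision rule can be written as a small helper. `decide` is a hypothetical function name, and treating a p-value exactly equal to alpha as "fail to reject" is an assumption, since the rule as stated only covers the higher and lower cases.

```python
def decide(p_value, alpha):
    """Return the hypothesis-test decision for a given p-value and alpha."""
    if p_value < alpha:
        return "reject the Null Hypothesis"
    # Assumption: a p-value equal to alpha is treated as not significant.
    return "fail to reject the Null Hypothesis"


# Set confidence level.
confidence_level = .95

# Set alpha.
alpha = 1 - confidence_level

# Two made-up p-values, one on each side of the cutoff.
print(decide(0.01, alpha))
print(decide(0.20, alpha))
```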