# Variability in Point Estimates

* Point Estimates are not exact
* Sampling distribution: represents the distribution of the point estimates based on samples of a fixed size from a certain population. 
* Standard Error of an estimate, $SE$: describes the typical error or uncertainty associated with the estimate. 


**Standard Error of the Mean**    
Given $n$ independent observations from a population with standard deviation $\sigma$, the standard error of the sample mean is equal to: 

$$
SE = \frac{\sigma}{\sqrt{n}}\approx  \frac{s}{\sqrt{n}}
$$

In detail:
* To help ensure that sample observations are **independent**, a reliable method is to take a simple random sample consisting of ***less than 10%*** of the population. 
* The population standard deviation $\sigma$ is typically unknown, but it can be estimated with the sample standard deviation $s$. 
    * This is sufficiently good when the ***sample size*** is ***at least 30*** AND the population distribution is ***not strongly skewed***. 
    * If the sample size is $<30$, we need a method to account for the extra uncertainty in the SE. 
    * If the skew condition is not met, a larger sample is needed to compensate for the extra skew.
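
As a minimal sketch of the formula above, the snippet below estimates the standard error of the mean from a sample by plugging in $s$ for $\sigma$. The data are simulated (an assumption made purely for illustration), and `scipy.stats.sem` is used only as a cross-check.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 100 simulated measurements (not from the original notes).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=50.0, scale=10.0, size=100)

n = sample.size
s = sample.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)
se = s / np.sqrt(n)      # SE approximated by s / sqrt(n), since sigma is unknown

print(f"n={n}, s={s:.2f}, SE={se:.2f}")
# scipy.stats.sem computes the same quantity (ddof=1 by default)
print(f"scipy sem={stats.sem(sample):.2f}")
```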


**Central Limit Theorem**  
If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by the normal model. 
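
The statement above can be checked with a small simulation. This is a sketch under assumed settings (a Uniform(0, 1) population, samples of size 30, 10,000 repetitions), none of which come from the original notes: it compares the spread of the simulated sample means with $\sigma/\sqrt{n}$ and with what a normal model predicts.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, reps = 30, 10_000

# Population: Uniform(0, 1), which is clearly non-normal but not skewed.
pop_mean, pop_sd = 0.5, 1 / np.sqrt(12)

# Draw `reps` independent samples of size n and take each sample's mean.
sample_means = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)

se_theory = pop_sd / np.sqrt(n)
se_observed = sample_means.std(ddof=1)

# Under a normal model about 68% of sample means fall within one SE of the mean.
within_one_se = np.mean(np.abs(sample_means - pop_mean) <= se_theory)

print(f"theoretical SE={se_theory:.4f}, observed SD of means={se_observed:.4f}")
print(f"fraction within 1 SE={within_one_se:.3f}")
```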

**Confidence Intervals**  
A plausible range of values for a population parameter.  

$$
\text{point estimate} \pm \text{(no. of } SE\text{s}) \times SE
$$

A popular example is $\text{(no. of } SE\text{s})\approx2$, which yields a Confidence Interval in which we have 
95% confidence that the true value is within the range.  

Correct interpretation:  
> We are XX% confident that the population parameter is between ...

Incorrect interpretations:  
* It is incorrect to describe the CI as capturing the population parameter with a certain probability. 
* The CI says nothing about the confidence of capturing individual observations, a proportion of the observations or about capturing point estimates.
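
One way to make the confidence level concrete is to repeat the sampling many times and count how often the interval captures the true parameter. The sketch below assumes a normal population with known mean (so coverage can be checked) and uses $\pm 1.96 \times SE$ with $s$ in place of $\sigma$; all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
mu, sigma = 10.0, 2.0      # hypothetical population parameters
n, reps = 50, 5_000

# Each row is one independent sample of size n.
samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Interval for each sample: point estimate +/- 1.96 * SE.
lower = means - 1.96 * ses
upper = means + 1.96 * ses
coverage = np.mean((lower <= mu) & (mu <= upper))

print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The confidence level describes this long-run behaviour of the procedure, not the probability that one particular computed interval contains the parameter.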


**Confidence Intervals For Normal Distributions**   
For point estimates that follow the normal model:
$$
\text{point estimate} \pm z^* \times SE
$$
* $z^*$ - the critical Z-score, chosen to match the selected confidence level. 
* $z^* \times SE$ is called the *margin of error*.

$z^*$   | Confidence Level
------------- | -------------
1.96  | 95%
2.58  | 99%
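
The $z^*$ values in the table can be recovered from the standard normal quantile function. The following is a minimal sketch using the standard library's `statistics.NormalDist`; the helper names `z_star` and `normal_ci` and the example numbers are hypothetical.

```python
from statistics import NormalDist

def z_star(confidence_level):
    """Critical value z* such that the central area under N(0, 1) equals confidence_level."""
    tail = (1.0 - confidence_level) / 2.0
    return NormalDist().inv_cdf(1.0 - tail)

def normal_ci(point_estimate, se, confidence_level=0.95):
    """Return (lower, upper) for: point estimate +/- z* x SE."""
    margin_of_error = z_star(confidence_level) * se
    return point_estimate - margin_of_error, point_estimate + margin_of_error

for level in (0.95, 0.99):
    print(f"{level:.0%}: z*={z_star(level):.2f}")

# Hypothetical numbers: sample mean 100 with SE 2.
lower, upper = normal_ci(100.0, 2.0, confidence_level=0.95)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```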

Conditions to help ensure the sampling distribution is nearly normal and the estimate of $SE$ is sufficiently accurate:  
* The sample observations are independent
* The sample size is large: $n\ge 30$ is a good rule of thumb 
* The population distribution is not strongly skewed (difficult to evaluate, best to use judgement). The larger the sample size, the more lenient we can be with the sample's skew.   

To verify sample observations are independent: 
* If the observations are from a simple random sample and consist of fewer than 10% of the population they may be considered independent. 
* Subjects in an experiment are considered independent if they undergo random assignment to the treatment groups. 

Checking for strong skew usually means checking for obvious outliers  
* One needs at least 100 observations and in some cases much more. 
* Studentized bootstrap (bootstrap-t) method might be useful


# Null Hypothesis Testing
Also called *Classical Hypothesis Testing*. 

The *Null Hypothesis* $H_0$ often represents either a skeptical perspective or a claim to be tested (e.g., no difference between samples).  
The *Alternative Hypothesis* $H_\text{A}$ represents an alternative claim under consideration. This is often represented by a range of possible parameter values.  

General Tip of the Hypothesis testing framework:  
The skeptic will not reject $H_0$ unless the evidence in favor of $H_A$ is very strong.  

**Decision Errors** 

*Type 1* - Rejecting $H_0$ when it is true.   
*Type 2* - Failing to reject $H_0$ when $H_\text{A}$  is true.  

The significance level $\alpha$ should reflect the consequences associated with Type 1 and Type 2 Errors.   
E.g., if a Type 1 Error is dangerous/costly, we might want to lower $\alpha$.  
If a Type 2 Error is more dangerous/costly, we might want a higher $\alpha$.  

**Test Statistic**  
A summary statistic that is useful for evaluating a hypothesis test or identifying the p-value.  
For a point estimate that is nearly normal we use the Z-score.  


Main disadvantage:  
Does not incorporate prior knowledge.  

## p-value
A way of quantifying the strength of the evidence against the null hypothesis and in favor of the alternative.  

It is the probability of getting a result at least as extreme as the observed if the null hypothesis (and all other modelling assumptions) were true. 

The p-value is used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.  
A result is said to be *statistically significant* if it allows us to reject the null hypothesis.  

$$
\text{right tail:   Pr}(X\ge x|H_0) \\ \text{left tail:   Pr}(X\le x|H_0) \\
\text{double tail:   2min[Pr}(X\ge x|H_0),\text{Pr}(X\le x|H_0)]
$$
* $X$ is the random variable (the test statistic) and $x$ is its observed value.   

The null hypothesis $H_0$ is rejected if any of these probabilities is less than or equal to a small, fixed but arbitrarily pre-defined threshold value of level of significance, $\alpha$.  

The smaller the $p$-value the stronger the data favor $H_\text{A}$ over $H_0$.   
Or in other words: A small $p$-value means that if the null hypothesis is true, there is a low probability of seeing a point estimate at least as extreme as the one observed. This is interpreted as strong evidence in favor of $H_\text{A}$.  

We reject $H_0$ if the $p$-value is smaller than the significance level $\alpha$. Otherwise we fail to reject $H_0$. 

The $p$-value is compared to the significance level $\alpha$ in a way to ensure that the Type 1 Error rate does not exceed the significance level standard.  
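
For a nearly normal point estimate, the tail probabilities above can be computed from the Z-score. The sketch below assumes the test statistic is $Z = (\text{point estimate} - \text{null value}) / SE$ and uses `statistics.NormalDist`; the function name and the example numbers are hypothetical, and the two-sided value is capped at 1.

```python
from statistics import NormalDist

def z_test_p_values(point_estimate, null_value, se):
    """Return the Z-score and the right-, left- and two-tailed p-values."""
    z = (point_estimate - null_value) / se
    std_normal = NormalDist()
    right = 1.0 - std_normal.cdf(z)      # Pr(Z >= z | H0)
    left = std_normal.cdf(z)             # Pr(Z <= z | H0)
    two_sided = min(1.0, 2.0 * min(right, left))
    return z, right, left, two_sided

# Hypothetical numbers: observed mean 52, null value 50, SE 1.
z, p_right, p_left, p_two = z_test_p_values(52.0, 50.0, 1.0)
alpha = 0.05
print(f"z={z:.2f}, right={p_right:.4f}, left={p_left:.4f}, two-sided={p_two:.4f}")
print("reject H0" if p_two < alpha else "fail to reject H0")
```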

**One-sided and two-sided tests**  
Always use a two-sided test unless it was made clear prior to data collection that the test should be one-sided. Switching a two-sided test to a one-sided test after observing the data is dangerous because it can inflate the Type 1 Error rate.  

Use a one-sided test when you are interested in checking for an increase or a decrease - but not both.  
Use a two-sided test when you are interested in any difference from the null value.  



## $\chi^2$ Test Of Association
Useful for comparing counts of categorical data.  

```python
from scipy.stats import chisquare
import numpy as np

null_hypothesis = np.array([10., 10., 15., 20., 30., 15.])  # hypothesised frequencies
observed = np.array([30., 14., 34., 45., 57., 20.])  # observed counts

if null_hypothesis is None:  # i.e., assuming even frequencies
    null_hypothesis = np.array([1. / len(observed)] * len(observed))

# prep: scale the hypothesised frequencies to the observed total
expected = null_hypothesis * observed.sum() / null_hypothesis.sum()
print(f'observed:{observed}')
print(f'expected:{expected}')
assert (expected >= 5).all()  # large counts condition

# calculations
# degrees of freedom: -1 because the categories are not independent
# (the last one can be worked out from the rest, since they add up to 100%)
dof = len(null_hypothesis) - 1
chi2, pval = chisquare(observed, f_exp=expected)

print(f'chi^2={chi2:0.2f} of p={pval:0.3f} for {dof} dof')
```

```
observed:[30. 14. 34. 45. 57. 20.]
expected:[20. 20. 30. 40. 60. 30.]
chi^2=11.44 of p=0.043 for 5 dof
```
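
The example above compares one set of observed counts against hypothesised frequencies (a goodness-of-fit setup). For testing association between two categorical variables, a contingency table is the usual input. The sketch below uses `scipy.stats.chi2_contingency` on a made-up 2 x 3 table; the counts are an assumption for illustration only.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 contingency table (rows: two groups, columns: three categories).
table = np.array([[30, 20, 10],
                  [20, 30, 40]])

# H0: the row variable and the column variable are independent.
chi2_stat, p_value, dof, expected_counts = chi2_contingency(table)

assert (expected_counts >= 5).all()  # large counts condition
print(f"chi^2={chi2_stat:.2f}, p={p_value:.5f}, dof={dof}")
print(f"expected counts under independence:\n{expected_counts}")
```

Here the degrees of freedom are (rows - 1) x (columns - 1), because the expected counts are derived from the row and column totals.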


## Error Types

**False Positive** (Type 1 error)  
Falsely detecting a positive when it is actually a negative.

E.g, 
* The boy who cried wolf

$\alpha$ = False Positive Rate  
Equal (or also referred to as):  
* Significance Level
* $\alpha = 1 -$ specificity = $1 -$ TNR


**False Negative** (Type 2 error)   
Falsely detecting a negative when it is actually a positive.

E.g., 
* In cancer screening a false negative is a missed diagnosis, so it is important to minimise the FNR (we prefer high recall).

$\beta$ = False Negative Rate  
Equal (or also referred to as):  
* $1 - \beta$ is a measure of Power
* $\beta = 1 -$ sensitivity = $1 -$ TPR
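
The relationships between these rates can be illustrated with confusion-matrix counts. This is a sketch with hypothetical counts and a hypothetical helper `error_rates`; it shows the empirical analogues of $\alpha$ (false positive rate) and $\beta$ (false negative rate).

```python
def error_rates(tp, fp, tn, fn):
    """Compute error rates from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "FPR": fp / (fp + tn),     # = 1 - specificity
        "FNR": fn / (fn + tp),     # = 1 - sensitivity
        "sensitivity": sensitivity,
        "specificity": specificity,
    }

# Hypothetical counts: 100 actual positives, 900 actual negatives.
rates = error_rates(tp=80, fp=30, tn=870, fn=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```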

## Statistical Power

The power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true.
Alternatively it is the probability of accepting the alternative hypothesis when it is true, i.e, the ability of a test to detect a specific effect that actually exists.

Power $= 1 - \beta = P(\text{reject }H_0|H_A \text{ is true} )$ 
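
For a two-sided Z-test, power can be computed directly from the normal model. The sketch below assumes the point estimate is nearly normal with a known $SE$ and that the true parameter sits `true_shift` away from the null value; the function name and the inputs are hypothetical.

```python
from statistics import NormalDist

def z_test_power(true_shift, se, alpha=0.05):
    """Power of a two-sided Z-test when the true parameter is `true_shift` away from the null value."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift_in_se = true_shift / se
    # Reject when |Z| > z_crit; under H_A, Z ~ N(shift_in_se, 1).
    return std_normal.cdf(shift_in_se - z_crit) + std_normal.cdf(-shift_in_se - z_crit)

power_at_shift = z_test_power(2.8, 1.0)
power_at_null = z_test_power(0.0, 1.0)
print(f"power at a shift of 2.8 SE: {power_at_shift:.3f}")
print(f"power at a shift of 0 (equals alpha): {power_at_null:.3f}")
```

A true effect of about 2.8 standard errors gives roughly 80% power at $\alpha = 0.05$, and with no true effect the rejection probability reduces to $\alpha$.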


**Power Analysis**  

Effect Size $$d = \frac{\text{estimated difference between means}}{\text{pooled estimated standard deviation}}$$


**Sample Size**

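Rule of thumb for the sample size per group (it corresponds to a two-sided test at $\alpha = 0.05$ with 80% power; I'm not sure of other limitations ...)

$$n=16\frac{\sigma^2}{d^2}$$

* $d$ - minimum effect we wish to detect, as an absolute difference in the same units as $\sigma$ (not the standardised effect size above)
* $\sigma^2$ - expected sample variance. If it is not known, for a binomial experiment we may use $\sigma^2 = p(1-p)$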
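
A quick worked example of the rule of thumb, as a sketch. The baseline rate and minimum effect are hypothetical, and the derivation of the constant assumes a two-sided $\alpha = 0.05$ and 80% power, for which $2(z_{\alpha/2} + z_{\beta})^2 \approx 15.7$.

```python
from statistics import NormalDist

def sample_size_per_group(sigma_sq, d):
    """Rule of thumb n = 16 * sigma^2 / d^2."""
    return 16.0 * sigma_sq / d ** 2

# Where the 16 plausibly comes from: 2 * (z_{alpha/2} + z_{beta})^2
# for a two-sided alpha = 0.05 and power = 0.80.
std_normal = NormalDist()
factor = 2.0 * (std_normal.inv_cdf(0.975) + std_normal.inv_cdf(0.80)) ** 2

# Hypothetical binomial experiment: baseline rate p = 0.10, minimum effect d = 0.02.
p = 0.10
n = sample_size_per_group(sigma_sq=p * (1 - p), d=0.02)
print(f"factor={factor:.2f} (rounded up to 16), n per group ~ {n:.0f}")
```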

**Rules of Thumb**

* Don’t report significance levels until an experiment is over
* Don't use significance levels to decide whether an experiment should stop or continue. 

Instead of reporting significance of ongoing experiments, report how large of an effect can be detected given the current sample size. That can be calculated with:

$$d=(t_{\alpha/2} + t_{\beta} )\sigma \sqrt{2/n}$$

Where the two $t$'s are the t-statistics for a given significance level $\alpha/2$ and power $(1-\beta)$.
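
The formula above can be evaluated as follows. This is a sketch that substitutes standard normal quantiles for the two $t$ values, which is an approximation that is reasonable for large samples; the function name and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(sigma, n, alpha=0.05, power=0.80):
    """d = (t_{alpha/2} + t_{beta}) * sigma * sqrt(2 / n), with normal quantiles standing in for t."""
    std_normal = NormalDist()
    t_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    t_beta = std_normal.inv_cdf(power)
    return (t_alpha + t_beta) * sigma * sqrt(2.0 / n)

# Hypothetical ongoing experiment: sigma = 0.3, n = 3600 observations per group so far.
d = minimum_detectable_effect(sigma=0.3, n=3600)
print(f"smallest detectable effect ~ {d:.4f}")
```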

**Useful Resources**  
* [Evan Miller's: How Not to Run an A/B test](https://www.evanmiller.org/how-not-to-run-an-ab-test.html)

## Likelihood Ratios
[TBD](https://en.wikipedia.org/wiki/Likelihood_principle#The_law_of_likelihood)

# Bayesian Hypothesis Testing

## Bayes Factor
[TBD](https://en.wikipedia.org/wiki/Bayes_factor)   
Used in Bayesian model comparison. 
The aim of the Bayes factor is to quantify the support for a model over another, regardless of whether these models are correct.

$$
K = \frac{\text{Pr}(D|M_1)}{\text{Pr}(D|M_2)}=\frac{\text{Pr}(M_1|D)\text{Pr}(M_2)}{\text{Pr}(M_2|D)\text{Pr}(M_1)}
$$
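
As an illustration of both forms of the identity, the sketch below compares two simple models of a coin on made-up data. The models, the data, and the priors are assumptions chosen for illustration; with point hypotheses like these, $\text{Pr}(D|M)$ is just the binomial likelihood.

```python
from math import comb, isclose

def binomial_likelihood(k, n, p):
    """Pr(D | M): probability of k successes in n trials under success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical data D: 14 heads in 20 tosses.
# M1: fair coin (p = 0.5); M2: biased coin (p = 0.7).
k, n = 14, 20
like_m1 = binomial_likelihood(k, n, 0.5)
like_m2 = binomial_likelihood(k, n, 0.7)
K = like_m1 / like_m2

# Second form of the identity: posterior odds divided by prior odds.
prior_m1, prior_m2 = 0.8, 0.2   # arbitrary priors, chosen for illustration
evidence = like_m1 * prior_m1 + like_m2 * prior_m2
post_m1 = like_m1 * prior_m1 / evidence
post_m2 = like_m2 * prior_m2 / evidence
K_from_posteriors = (post_m1 * prior_m2) / (post_m2 * prior_m1)

print(f"K={K:.3f}, from posteriors={K_from_posteriors:.3f}")
```

A value of $K$ below 1 means the data support $M_2$ over $M_1$, and the result does not depend on the priors chosen.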

Harold Jeffreys' scale (TBD)

# Resources

* OpenIntro Statistics, Diez, Barr, Çetinkaya-Rundel ([site](https://www.openintro.org/), [book](https://www.amazon.co.uk/OpenIntro-Statistics-Third-David-Diez/dp/194345003X), [coursera](https://www.coursera.org/learn/inferential-statistics-intro))
* The Art of Statistics, Spiegelhalter ([book](https://www.amazon.co.uk/Art-Statistics-Learning-Pelican-Books/dp/0241398630))
* John K. Kruschke
    * Bayesian Estimation Supersedes the t Test ([YouTube](https://www.youtube.com/watch?v=fhw1j1Ru2i0), [pdf](https://pdfs.semanticscholar.org/dea6/0927efbd1f284b4132eae3461ea7ce0fb62a.pdf), [PyMC3](https://docs.pymc.io/notebooks/BEST.html))
* Visual Website Optimizer
    * [Bayesian A/B Testing at VWO](https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf)
    * [What Is Smart Decision?](https://help.vwo.com/hc/en-us/articles/360033991153-What-Is-Smart-Decision-)
* Bayesian A/B Testing: a step-by-step guide ([blog](http://www.claudiobellei.com/2017/11/02/bayesian-AB-testing/), [GitHub](https://github.com/cbellei/abyes))