# DATA 5600: Introduction to Regression and Machine Learning for Analytics

## __Section 4.4: Statistical Significance, Hypothesis Testing, and Statistical Error__ <br>

Author:  Tyler J. Brough <br>
Updated: October 20, 2021 <br>

---

<br>

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Default figure size for plots in this notebook
plt.rcParams['figure.figsize'] = [10, 8]

# Fix the random seed so results are reproducible
np.random.seed(7)
```

## Introduction

<br>

* One concern: the possibility of mistakenly coming to strong conclusions that ___do not replicate___ or do not reflect real patterns in the underlying population


* Theories of hypothesis testing and error analysis have been developed to quantify these possibilities in the context of inference and decision making

<br>


## __Statistical Significance__

<br>

* A common practice that is __NOT RECOMMENDED__ is to consider a result stable or real if it is "statistically significant" and to take "non-significant" results to be noisy and to be treated with skepticism

<br>

### __Statistical Significance__

---

___Statistical Significance___ is conventionally defined as a $p$-value less than 0.05, relative to some _null hypothesis_ or prespecified value that would indicate no effect present. For fitted regressions, this roughly corresponds to coefficient estimates being labeled as statistically significant if they are at least two standard errors from zero, or not statistically significant otherwise.

---


<br>
<br>

More generally:

* An estimate is said to be _not_ statistically significant if the observed value could reasonably be explained by simple chance variation, much in the way that a sequence of 20 coin tosses might happen to come up 8 heads and 12 tails

* We would say that this outcome is not statistically significantly different from random chance

* In this case the observed proportion of heads is $0.40$ but with a standard error of $0.11$ - thus the data are less than two standard errors away from the null hypothesis of $0.50$


<br>
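The coin-toss example can be checked numerically. The sketch below assumes the null value is a fair coin ($0.5$) and computes the standard error under that null; the exact binomial $p$-value is an added illustration, not part of the original example.

```python
import numpy as np
from scipy import stats

n_tosses = 20
n_heads = 8
p_null = 0.5  # assumed null hypothesis: a fair coin

# Observed proportion and its standard error under the null
p_hat = n_heads / n_tosses
se_null = np.sqrt(p_null * (1 - p_null) / n_tosses)

# Distance from the null value, measured in standard errors
z_score = (p_hat - p_null) / se_null

# Exact two-sided p-value: the Binomial(20, 0.5) distribution is symmetric,
# so doubling the lower tail gives the two-sided probability
p_value = min(1.0, 2 * stats.binom.cdf(n_heads, n_tosses, p_null))

print(f"observed proportion: {p_hat:.2f}")
print(f"standard error:      {se_null:.2f}")
print(f"z-score:             {z_score:.2f}")
print(f"two-sided p-value:   {p_value:.2f}")
```

The estimate sits less than one standard error from $0.5$, so 8 heads in 20 tosses is well within what chance variation alone would produce.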

## __Hypothesis Testing for Simple Comparisons__

<br>

* A randomized experiment is performed to compare the effectiveness of two drugs for lowering cholesterol

* The mean and standard deviation of the post-treatment cholesterol levels are $\bar{y}_{T}$ and $s_{T}$ for the $n_{T}$ people in the treatment group, and $\bar{y}_{C}$ and $s_{C}$ for the $n_{C}$ people in the control group.

<br>

* The parameter of interest is $\theta = \theta_{T} - \theta_{C}$ - the expectation of the post-test difference in cholesterol between the two groups

* The estimate is: $\hat{\theta} = \bar{y}_{T} - \bar{y}_{C}$

* The standard error is: $se(\hat{\theta}) = \sqrt{s^{2}_{C}/n_{C} + s^{2}_{T}/n_{T}}$

* The approximate $95\%$ interval is then $[\hat{\theta} \pm t^{0.975}_{n_{C} + n_{T} - 2} \ast se(\hat{\theta})]$

* Where $t^{0.975}_{n_{C} + n_{T} - 2}$ is the $97.5^{th}$ percentile of the unit $t$ distribution with $df = n_{C} + n_{T} - 2$ degrees of freedom

* In the limit as $df \rightarrow \infty$, this quantity approaches $1.96$, corresponding to the normal-distribution $95\%$ interval of approximately $\pm 2$ standard errors

<br>
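A minimal sketch of the estimate, standard error, and $95\%$ interval described above. The cholesterol values are simulated and purely hypothetical (group means, spread, and sample sizes are assumptions made for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical post-treatment cholesterol levels (mg/dL); simulated, not real data
y_T = rng.normal(loc=190.0, scale=25.0, size=60)  # treatment group
y_C = rng.normal(loc=200.0, scale=25.0, size=50)  # control group

n_T, n_C = y_T.size, y_C.size

# Estimate: difference in sample means
theta_hat = y_T.mean() - y_C.mean()

# Standard error: sqrt(s_C^2 / n_C + s_T^2 / n_T), using sample variances
se = np.sqrt(y_C.var(ddof=1) / n_C + y_T.var(ddof=1) / n_T)

# 97.5th percentile of the t distribution with n_C + n_T - 2 degrees of freedom
df = n_C + n_T - 2
t_crit = stats.t.ppf(0.975, df)

interval = (theta_hat - t_crit * se, theta_hat + t_crit * se)

print(f"estimate: {theta_hat:.2f}, se: {se:.2f}")
print(f"t critical value (df={df}): {t_crit:.3f}")
print(f"normal critical value:      {stats.norm.ppf(0.975):.3f}")
print(f"95% interval: [{interval[0]:.2f}, {interval[1]:.2f}]")
```

With $108$ degrees of freedom the $t$ critical value is already close to the normal value of $1.96$.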

### __Null and Alternative Hypotheses__

<br>

* The null hypothesis is: $H_{0}: \theta = 0$ (i.e., that $\theta_{T} = \theta_{C}$)


* The research hypothesis is: $H_{a}: \theta_{T} \ne \theta_{C}$


* The _test statistic_ summarizes the deviation of the data from what would be expected under the null hypothesis. 


* In this case, $t = \frac{|\hat{\theta}|}{se(\hat{\theta})}$


* The absolute value represents the "two-sided test"  (so that both positive and negative deviations from zero would be noteworthy)

<br>

### __$p$-Value__

<br>


* The deviation of the data from the null hypothesis is summarized by the $p$-value


* The $p$-value is the probability, if the null hypothesis were true, of observing something at least as extreme as the observed test statistic


* In this case, under the null hypothesis, the test statistic has a unit $t$ distribution with $\nu = n_{C} + n_{T} - 2$ degrees of freedom


* If the standard deviation of $\theta$ is known, or if the sample size is large, we can use the normal distribution (also called the $z$-test) instead


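* The $p$-value is $2 \cdot \Pr(T_{\nu} > t)$, where $T_{\nu}$ denotes a draw from that $t$ distribution; the factor of $2$ corresponds to a _two-sided test_ in which the hypothesis is rejected if the observed difference is too much higher or too much lower than the comparison point of $0$.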


* In common practice the hypothesis is "rejected" if the $p$-value is less than $0.05$ - that is, if the $95\%$ confidence interval for the parameter excludes zero.





<br>
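The two-sided $p$-value can be computed directly from summary statistics. This is a sketch under stated assumptions: the helper `two_sided_p_value` and the summary numbers are hypothetical, chosen only to illustrate the calculation.

```python
import numpy as np
from scipy import stats

def two_sided_p_value(theta_hat, se, df=None):
    """Return (t, p) for H0: theta = 0.

    Uses the t distribution when df is given, otherwise the normal (z-test).
    """
    t_stat = abs(theta_hat) / se
    if df is None:
        p = 2 * stats.norm.sf(t_stat)   # sf(x) = Pr(Z > x)
    else:
        p = 2 * stats.t.sf(t_stat, df)  # sf(x) = Pr(T_df > x)
    return t_stat, p

# Hypothetical summary statistics (not real data)
ybar_T, s_T, n_T = 188.0, 24.0, 60
ybar_C, s_C, n_C = 200.0, 26.0, 50

theta_hat = ybar_T - ybar_C
se = np.sqrt(s_C**2 / n_C + s_T**2 / n_T)
df = n_C + n_T - 2

t_stat, p_t = two_sided_p_value(theta_hat, se, df)
_, p_z = two_sided_p_value(theta_hat, se)

# Equivalent check: does the 95% interval exclude zero?
t_crit = stats.t.ppf(0.975, df)
interval_excludes_zero = abs(theta_hat) > t_crit * se

print(f"t = {t_stat:.3f}, p (t-test) = {p_t:.4f}, p (z-test) = {p_z:.4f}")
print(f"95% interval excludes zero: {interval_excludes_zero}")
```

The $t$-based $p$-value is slightly larger than the normal one because the $t$ distribution has heavier tails, and rejecting at the $5\%$ level coincides with the $95\%$ interval excluding zero.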

### __Hypothesis Testing: General Formulation__

<br>

* In its simplest form, the null hypothesis $H_{0}$ represents a particular probability model $p(y)$ with potential replication data $y^{rep}$


* To perform a hypothesis test, we must define a test statistic $T$, which is a function of the data


* For any given data $y$, the $p$-value is then $P(T(y^{rep}) \ge T(y))$


* The probability of observing (under the null) something as or more extreme than the data

<br>
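The general definition $P(T(y^{rep}) \ge T(y))$ lends itself to simulation: draw replicated datasets from the null model, compute $T$ on each, and count how often it is at least as large as the observed value. The sketch below assumes a fair-coin null model with 20 tosses and takes $T$ to be the absolute deviation of the number of heads from 10; both choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

n_tosses = 20

def T_stat(n_heads):
    """Test statistic: absolute deviation of the head count from its null expectation."""
    return np.abs(n_heads - n_tosses / 2)

y_obs = 8          # observed number of heads
n_rep = 100_000    # number of replicated datasets drawn under the null

# Replicated data y_rep drawn from the null model p(y): Binomial(20, 0.5)
y_rep = rng.binomial(n=n_tosses, p=0.5, size=n_rep)

# Monte Carlo estimate of P(T(y_rep) >= T(y))
p_value = np.mean(T_stat(y_rep) >= T_stat(y_obs))

print(f"simulated p-value: {p_value:.3f}")
```

The same recipe works for any null model that can be simulated and any test statistic that can be computed, which is the appeal of the general formulation.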

### __Comparisons of Parameters to Fixed Values and Each Other: Interpreting Confidence Intervals as Hypothesis Tests__

<br>

* The hypothesis that a parameter of interest equals zero (or any other fixed value) can be directly tested by fitting the model that includes the parameter in question


* Examining the corresponding $95\%$ interval


* If the interval excludes zero (or the specified value) then the hypothesis is said to be rejected at the $5\%$ level


* Testing if two parameters are equal amounts to testing if their difference is equal to zero


* We can do this by including both parameters in the model and then examining the $95\%$ interval for their difference


* The confidence interval is commonly of more interest than the hypothesis test


* For example, if support for the death penalty has decreased by $6 \pm 2$ percentage points, then the magnitude of this estimated difference is probably as important as the fact that the confidence interval for the change excludes zero


* The hypothesis of whether a parameter is positive is directly addressed via its confidence interval


* Testing whether one parameter is greater than the other is equivalent to examining the confidence interval for their difference and testing for whether it is entirely positive


* The possible outcomes of a hypothesis test are "reject" or "not reject." 


* It is never possible to "accept" a statistical hypothesis


* Only to find that the data are insufficient to reject it


* The wording may feel cumbersome, but we need to be careful


* It is common for researchers to act as if an effect is negligible or zero, just because this hypothesis cannot be rejected from the data at hand


<br>
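Comparing two parameters through the interval for their difference can be sketched as follows. The helper `difference_interval` and the survey numbers are hypothetical, and the calculation assumes the two estimates are independent so that their variances add.

```python
import numpy as np
from scipy import stats

def difference_interval(est_1, se_1, est_2, se_2, level=0.95):
    """Normal-approximation interval for est_2 - est_1, assuming independent estimates."""
    diff = est_2 - est_1
    se_diff = np.sqrt(se_1**2 + se_2**2)
    z = stats.norm.ppf(0.5 + level / 2)
    return diff, se_diff, (diff - z * se_diff, diff + z * se_diff)

# Hypothetical surveys: support of 70% then 64%, each with n = 1000 respondents
n = 1000
p_1, p_2 = 0.70, 0.64
se_1 = np.sqrt(p_1 * (1 - p_1) / n)
se_2 = np.sqrt(p_2 * (1 - p_2) / n)

diff, se_diff, (lower, upper) = difference_interval(p_1, se_1, p_2, se_2)

print(f"change: {100 * diff:.1f} points, se: {100 * se_diff:.1f} points")
print(f"95% interval: [{100 * lower:.1f}, {100 * upper:.1f}] points")
print(f"interval excludes zero: {upper < 0 or lower > 0}")
```

The interval reports both whether zero is excluded and how large the change plausibly is, which is why it is usually more informative than the bare reject / not reject outcome.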

### __Type I and Type II Errors (And Why We Don't Prefer to Use Them)__

<br>

* Statistical tests are typically understood based on _type 1 error_ - the probability of falsely rejecting a null hypothesis if it is in fact true


* And _type 2 error_ - the probability of _not_ rejecting a null hypothesis that is in fact false


* This paradigm is an ill fit for social science or most science in general


* ___"A fundamental problem with type 1 and type 2 errors is that in many problems we do not think the null hypothesis can be true."___


* Examples:

    - A change in law will produce _some_ changes in behavior - but how do these changes vary across people and situations?!
    
    - A medical intervention will work differently for different people
    
    - A political advertisement will change the opinions of some people but not others
    
    
* One can imagine an average effect that is positive or negative, depending on who is being averaged over


* But there is no particular interest in the null hypothesis of no effect


* When a null hypothesis is rejected (i.e., the study is counted as a successful finding/discovery) researchers report the point estimate of the magnitude and sign of the underlying effect


* In evaluating a statistical test, we should be interested in the properties of the associated effect-size estimate (conditional on it being statistically significantly different from zero)

    - We should ask "How big?"


* The Type 1 and Type 2 error framework is based on a deterministic approach to science that really isn't appropriate in applications with highly variable effects

<br>
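To see why the properties of an estimate conditional on statistical significance matter, consider a small simulation. The numbers are hypothetical: a true effect of $2$ measured with a standard error of $8$, so that most studies are too noisy to detect it.

```python
import numpy as np

rng = np.random.default_rng(5600)

true_effect = 2.0    # hypothetical small true effect
se = 8.0             # hypothetical standard error of each study's estimate
n_studies = 200_000  # number of simulated replications of the study

# Each simulated study yields one estimate, normally distributed around the truth
estimates = rng.normal(loc=true_effect, scale=se, size=n_studies)

# Keep only the estimates that are "statistically significant" at the 5% level
significant = np.abs(estimates) > 1.96 * se

share_significant = significant.mean()
mean_abs_significant = np.abs(estimates[significant]).mean()
share_wrong_sign = (estimates[significant] < 0).mean()

print(f"share of studies reaching significance: {share_significant:.3f}")
print(f"mean |estimate| among significant ones: {mean_abs_significant:.1f}")
print(f"true effect:                            {true_effect:.1f}")
print(f"share of significant with wrong sign:   {share_wrong_sign:.3f}")
```

Under these assumptions, any estimate that clears the significance threshold must be at least $1.96 \times 8 \approx 15.7$ in absolute value, far larger than the true effect, and a nontrivial share have the wrong sign. This is one reason to ask "How big?" rather than only "Is it significant?"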

### ___Hypothesis Testing and Statistical Practice___

<br>

> _"We do not generally use null hypothesis significance testing in our work. In the fields in which we work, we do not generally think null hypotheses can be true: in social science and public health, just about every treatment one might consider will have_ ___some___ _effect, and no comparisons or regression coefficient of interest will be_ ___exactly zero___. _We do not find it particularly helpful to formulate and test null hypotheses that we know ahead of time cannot be true._ ___Testing null hypotheses is just a matter of data collection: with sufficient sample size, any hypothesis can be rejected, and there is no real point to gathering a mountain of data just to reject a hypothesis that we did not believe in the first place."___

<br>
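The claim in the quotation that any hypothesis can be rejected with sufficient sample size can be illustrated with a short calculation. The sketch assumes the estimate is a sample mean with known standard deviation and uses the normal approximation; the helper `rejection_probability` and the effect size of $0.02$ standard deviations are hypothetical.

```python
import numpy as np
from scipy import stats

def rejection_probability(true_effect, sd, n, alpha=0.05):
    """Probability that a two-sided z-test of H0: effect = 0 rejects at level alpha."""
    se = sd / np.sqrt(n)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = true_effect / se  # true effect measured in standard errors
    # Reject when the z-score lands in either tail beyond +/- z_crit
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

sample_sizes = [100, 10_000, 1_000_000]
powers = [rejection_probability(true_effect=0.02, sd=1.0, n=n) for n in sample_sizes]

for n, power in zip(sample_sizes, powers):
    print(f"n = {n:>9,d}: probability of rejecting the null = {power:.3f}")
```

The effect is the same tiny quantity in every row; only the amount of data changes. Rejection therefore says more about the sample size than about whether the effect is large enough to matter.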

* Not all effects and comparisons are detectable from any given study


* So even without a research goal of rejecting null hypotheses, there is value in checking the consistency of a particular dataset with a specified null model


* The idea is that _non-rejection_ tells us that there is not enough information in the data to move beyond the null hypothesis


* The point of _rejection_ is not to disprove the null (we probably disbelieve the null before we even start!)


* Rather, the point is to indicate that there is information in the data to allow a more complex model to be fit

<br>

> _"A use of hypothesis testing that bothers us is when a researcher starts with hypothesis A (for example, that a certain treatment has a generally positive effect), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between treatment assignment and outcome)._ ___Data are found that reject B, and this is taken as evidence in support of A.___ _The problem here is that a_ ___statistical___ _hypothesis_ (for example, $\beta = 0$ or $\beta_{1} = \beta_{2}$) _is much more specific than a_ ___scientific___ _hypothesis (for example, that a certain comparison averages to zero in the population, or that any net effects are too small to be detected). A rejection of the former does not necessarily tell you something useful about the latter, because violations of technical assumptions of the statistical model can lead to high probability of rejection of the null hypothesis even in the absence of any real effect. What the rejection_ ___can___ _do is to motivate the next step of modeling the comparisons of interest."_