# DATA 5600: Introduction to Regression and Machine Learning for Analytics

## __Section 4.5: Problems with the Concept of Statistical Significance__ <br>

Author:  Tyler J. Brough <br>
Updated: October 20, 2021 <br>

---

<br>

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]
```

```python
np.random.seed(7)
```

### __Statistical Significance Is Not the Same as Practical Importance__

<br>


* A result can be statistically significant (not easily explained by chance alone) without being large enough to be important in practice.


* For example, if a treatment is estimated to increase earnings by $\$10$ per year with a standard error of $\$2$, this would be statistically but not practically significant (in the U.S. context). 


* Conversely, an estimate of $\$10,000$ with a standard error of $\$10,000$ would not be statistically significant, but it could still be important in practice.
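
The two earnings examples above can be checked with a few lines of arithmetic. The sketch below assumes a normal approximation (a two-sided test and a $1.96$ multiplier for the $95\%$ interval); the dictionary labels are hypothetical names chosen for illustration.

```python
from scipy import stats

# (estimate, standard error) pairs taken from the two examples above
examples = {
    "small but precise": (10.0, 2.0),
    "large but noisy": (10_000.0, 10_000.0),
}

results = {}
for label, (estimate, se) in examples.items():
    z = estimate / se                          # estimate measured in standard errors
    p = 2 * stats.norm.sf(abs(z))              # two-sided p-value, normal approximation
    ci = (estimate - 1.96 * se, estimate + 1.96 * se)
    results[label] = (z, p, ci)
    print(f"{label}: z = {z:.1f}, p = {p:.2g}, "
          f"95% CI = ({ci[0]:,.0f}, {ci[1]:,.0f})")
```

The first estimate is five standard errors from zero, yet its whole interval covers only a few dollars per year. The second interval includes zero but also includes effects far larger than the point estimate, which is why it cannot be dismissed as unimportant.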

<br>

### __Non-Significance is Not the Same as Zero__

<br>

* In Section 3.5 there is an example of arterial stents for heart patients

* The treatment group outperformed the control group, but the difference was not statistically significant

* The observed average difference in treadmill time was $16.6$ seconds with a standard error of $9.9$

* This corresponds to a $95\%$ confidence interval that included zero and a $p$-value of $0.20$

* A fair summary here is that the results are uncertain: it is unclear whether the net treatment effect is positive or negative in the general population

* It would be inappropriate to say that stents have no effect
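
As a rough check on the interval mentioned above, the sketch below builds a $95\%$ interval from the estimate and standard error quoted in the bullets. It assumes a normal approximation with a $1.96$ multiplier; the analysis in Section 3.5 may have used a different method.

```python
# Estimate and standard error (in seconds) quoted in the bullets above
estimate, se = 16.6, 9.9

# Approximate 95% interval: estimate plus or minus 1.96 standard errors
lower = estimate - 1.96 * se
upper = estimate + 1.96 * se

print(f"95% interval: ({lower:.1f}, {upper:.1f}) seconds")
print("Interval includes zero:", lower < 0 < upper)
```

The interval reaches slightly below zero but extends to more than half a minute of improvement, so the data are consistent with no effect and with a sizeable one.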

<br>

### __The Difference Between "Significant" and "Not Significant" Is Not Itself Statistically Significant__

<br>

* Changes in statistical significance do not themselves necessarily achieve statistical significance


* This is not the same as the observation that any particular threshold is arbitrary


* For example, only a small change is required to move an estimate from a $5.1\%$ significance level to $4.9\%$, thus moving it into statistical significance


* Rather, even large changes in significance levels can correspond to small, nonsignificant changes in the underlying variables


* Consider two independent studies with effect estimates and standard errors of $25 \pm 10$ and $10 \pm 10$. 


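* The first study is statistically significant: its estimate is $2.5$ standard errors from zero (two-sided $p \approx 0.012$, which clears the $1\%$ level only in a one-sided test)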


* The second is not statistically significant: its estimate is only $1$ standard error away from zero


* It is tempting to conclude that there is a large difference between the two studies


* In fact, the difference is not even close to being statistically significant


* The estimated difference is $15$, with a standard error of $\sqrt{10^{2} + 10^{2}} \approx 14$
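
The comparison of the two studies can be reproduced directly. This is a minimal sketch assuming the studies are independent (so their variances add) and using a two-sided normal approximation for the $p$-value; the variable names are illustrative.

```python
import math
from scipy import stats

# Effect estimates and standard errors of the two studies
est_1, se_1 = 25.0, 10.0
est_2, se_2 = 10.0, 10.0

diff = est_1 - est_2
se_diff = math.sqrt(se_1**2 + se_2**2)     # independent studies: variances add
z_diff = diff / se_diff
p_diff = 2 * stats.norm.sf(abs(z_diff))    # two-sided p-value, normal approximation

print(f"difference = {diff:.0f}, se = {se_diff:.1f}, "
      f"z = {z_diff:.2f}, p = {p_diff:.2f}")
```

The difference is only about one standard error from zero, so the data do not support the claim that the two studies found different effects.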

<br>

### __Researcher Degrees of Freedom, $p$-hacking, and Forking Paths__

<br>


* Multiple comparisons make possible the "gaming" or abuse of statistical significance


* When there are many ways that data can be selected, excluded, and analyzed it is not difficult to attain a low $p$-value in the absence of a real effect


* Part of the problem is the "file-drawer effect": we don't know how many non-significant models were left unpublished or unmentioned for each published statistically significant one


* An even worse problem is "researcher degrees of freedom"


* An analysis involves many decisions: how to program it (especially for complex models), which variables to include or exclude, and which statistical procedures to use


* Researchers can use these degrees of freedom to $p$-hack (which is a form of rent-seeking behavior) to achieve a low $p$-value (and thus statistical significance) from otherwise unpromising data
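
To see how quickly multiple comparisons erode a nominal threshold, the sketch below computes the chance that at least one of several tests falls below $0.05$ when there is no real effect. It assumes the tests are independent, which is a simplification: analyses of the same data are usually correlated, so actual rates would differ.

```python
alpha = 0.05   # nominal significance threshold

# Probability that at least one of n_tests independent null tests has p < alpha
rates = {}
for n_tests in (1, 5, 20, 100):
    rates[n_tests] = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>3} tests: P(at least one p < {alpha}) = {rates[n_tests]:.3f}")
```

Under these assumptions, twenty looks at pure noise produce a "significant" result more often than not.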


<br>

__Consider the following testing procedures:__

1. Simple statistical test based on a unique test statistic, $T$, which when applied to the observed data yields $T(y)$

2. Classical test pre-chosen from a set of possible tests: thus, $T(y; \phi)$, with preregistered $\phi$. 
    - $\phi$ does not represent parameters in the model
    - It represents choices in the analysis
    - Control variables in a regression
    - Transformation procedures
    - Data coding and excluding rules
    - Decision about which main effect or interaction to focus on
    
    
3. Researcher degrees of freedom without fishing: computing a single test based on the data
    - But in an environment where a different test would have been performed given different data
    - Thus $T(y; \phi(y))$, where the function $\phi(\cdot)$ is observed only for the data actually observed
    
4. "Fishing": computing $T(y; \phi_j)$, for $j = 1, \ldots, J$
    - That is, performing $J$ tests and then reporting the best result given the data
    - Thus $T(y; \phi^{best}(y))$.
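
Procedures 2 and 4 can be contrasted with a small simulation on pure noise. This is only a sketch under stated assumptions: each analysis choice $\phi_j$ is represented by its own independent column of standard normal data, the test statistic is a one-sample $t$-test of mean zero, and the sizes `n_sims`, `n_obs`, and `n_choices` are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_obs, n_choices = 2000, 50, 20   # hypothetical simulation sizes

# Pure noise: column j stands in for the data as analyzed under choice phi_j
y = rng.normal(size=(n_sims, n_obs, n_choices))

# T(y; phi_j): one-sample t-test of mean zero for every choice j
p_values = stats.ttest_1samp(y, popmean=0.0, axis=1).pvalue   # (n_sims, n_choices)

# Procedure 2: phi is fixed in advance (here, always the first choice)
preregistered_rate = np.mean(p_values[:, 0] < 0.05)

# Procedure 4: run all J tests and report the smallest p-value
fishing_rate = np.mean(p_values.min(axis=1) < 0.05)

print(f"preregistered test: {preregistered_rate:.3f} of null datasets significant")
print(f"best of {n_choices} tests: {fishing_rate:.3f} of null datasets significant")
```

With the choice fixed in advance, the share of false positives stays near the nominal $5\%$; reporting the best of $J$ tests pushes it far higher even though every dataset is noise.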
    
    
<br>
<br>

* It is common for researchers to do #3, but when this is pointed out to them they often think they are being accused of #4


* In other words, they claim that because they are not doing #4 they must be doing #2


* The problem with #3 is that, even without nefarious intent, a researcher can introduce a huge number of researcher degrees of freedom and thus obtain statistical significance from noisy data
    - leading to apparently strong conclusions that do not truly represent the underlying population or target of study
    - and that fail to reproduce in controlled studies
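
Procedure 3 can also be illustrated by simulation. In the sketch below only one test is run per dataset, but which comparison gets tested depends on the data: the hypothetical researcher focuses on whichever comparison looks most striking (largest absolute sample mean). The data are pure noise, and the helper `single_test_rate` and the simulation sizes are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_obs = 4000, 40   # hypothetical simulation sizes

def single_test_rate(n_forks):
    """Share of pure-noise datasets where the one test actually run has p < 0.05."""
    y = rng.normal(size=(n_sims, n_obs, n_forks))      # no real effect anywhere
    # phi(y): pick the comparison with the largest absolute sample mean
    chosen = np.abs(y.mean(axis=1)).argmax(axis=1)     # shape (n_sims,)
    y_chosen = y[np.arange(n_sims), :, chosen]         # shape (n_sims, n_obs)
    # A single t-test per dataset, on the comparison chosen after seeing the data
    p = stats.ttest_1samp(y_chosen, popmean=0.0, axis=1).pvalue
    return np.mean(p < 0.05)

rates = {n_forks: single_test_rate(n_forks) for n_forks in (1, 3, 10)}
for n_forks, rate in rates.items():
    print(f"{n_forks:>2} possible comparisons: {rate:.3f} of null datasets significant")
```

Although no dataset is tested more than once, the false-positive share grows with the number of paths the analysis could have taken, which is the sense in which forking paths matter without any deliberate fishing.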
    
<br>

> ___"Our recommended solution to this problem of "forking paths" is not to compute adjusted $p$-values but rather to directly model the variation that is otherwise hidden in all these possible data coding and analysis choices, and to accept uncertainty and not demand statistical significance in our results."___

<br>