# DATA 5600: Introduction to Regression and Machine Learning for Analytics

## __Chapter 4: Statistical Inference__ <br>

Author:      Tyler J. Brough <br>
Updated: September 19, 2021 <br>

---

<br>

```python
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = [10, 5]
```

<br>

---


<br>

* Statistical inference can be formulated as a set of operations on data that yields estimates and uncertainty statements about predictions and parameters of some underlying process or population.

* Uncertainty statements are derived based on some assumed probability model for observed data

* In this notebook we discuss the basics of probability modeling, estimation, bias and variance, and the interpretation of statistical inference and statistical errors in applied work

* We introduce the theme of uncertainty in statistical inference

* __Important Theme of This Chapter:__ ___It is a mistake to use hypothesis tests or statistical significance to attribute certainty from noisy data___

<br>

## __Section 4.1: Sampling Distributions and Generative Models__

<br>

#### __Sampling, Measurement Error, and Model Error__

<br>

Statistical inference is used to learn from incomplete or imperfect data. There are three standard paradigms for thinking about the role of inference:


1. The __Sampling Model__ - learn something about a population, which we must estimate from a sample, or subset, of that population

2. The __Measurement Error Model__ - we are interested in learning about aspects of some underlying pattern or law 
    - For example the coefficients $a$ and $b$ in $y_{i} = a + b x_{i}$
    - But the data are measured with error so that we work with: $y_{i} = a + b x_{i} + \epsilon_{i}$
    - We can consider measurement error in $x$ as well


3. The __Model Error Model__ - accounts for the inevitable imperfections of the models that we apply to real data


<br>

* These three paradigms are different:
    - The sampling model makes no reference to measurements
    - The measurement model can apply even when complete data are observed
    - The model error model can arise even with perfectly precise observations
* We typically consider all three when building models
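As a rough illustration of how the three paradigms differ, here is a minimal simulation sketch. All numbers in it (the population of scores, the coefficients `a` and `b`, the noise scale, and the curved relationship) are hypothetical values chosen for illustration and do not come from the textbook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# 1. Sampling model: a finite (hypothetical) population, of which we see a subset.
population = rng.normal(loc=70.0, scale=10.0, size=10_000)
sample = rng.choice(population, size=100, replace=False)
print(f"population mean = {population.mean():.2f}, sample mean = {sample.mean():.2f}")

# 2. Measurement error model: an exact law y = a + b*x observed with noise.
a, b = 2.0, 0.5                       # hypothetical "true" coefficients
x = np.linspace(0.0, 10.0, 50)
y_exact = a + b * x
y_observed = y_exact + rng.normal(scale=1.0, size=x.size)  # assumed normal noise

# 3. Model error: observations are perfectly precise, but the assumed
#    linear form is only an approximation to the (hypothetical) true curve.
y_curved = a + b * x + 0.05 * x**2
slope, intercept = np.polyfit(x, y_curved, deg=1)
residuals = y_curved - (intercept + slope * x)
print(f"largest residual of a linear fit to noiseless data = {np.abs(residuals).max():.3f}")
```

In the third case the residuals are nonzero even though nothing was measured with error, which is the sense in which model error can arise with perfectly precise observations.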


<br>

__Example:__


* Consider a regression model predicting students' grades from pre-test scores and other background variables

* There is typically a sampling aspect to such a study, performed on some set of students with the goal of generalizing to a larger population

* The model also includes measurement error, because a student's test score is only an imperfect measure of their abilities

* And also model error because any assumed functional form can only be approximate

* Additionally, any student's ability will vary with time and by circumstances (this could be thought of as measurement error or model error)

<br>

* The textbook takes the measurement-error model as its standard framework:

$$
\large{y_{i} = a + b x_{i} + \epsilon_{i}}
$$

* The $\epsilon_{i}$'s can also be interpreted as model errors

* The sampling interpretation is implicit in that errors $\epsilon_{1}, \ldots, \epsilon_{n}$ can be considered as a random sample from a distribution that represents a "superpopulation"
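To make the framework concrete, the following sketch simulates data from $y_{i} = a + b x_{i} + \epsilon_{i}$ and then recovers the coefficients by least squares. The values of `a_true`, `b_true`, and `sigma`, the uniform distribution of `x`, and the normal distribution of the errors are all assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical "true" parameters of the underlying law
a_true, b_true, sigma = 1.0, 2.0, 0.5
n = 200

x = rng.uniform(0.0, 10.0, size=n)
eps = rng.normal(0.0, sigma, size=n)   # assumed: independent normal errors
y = a_true + b_true * x + eps          # the measurement-error model

# Ordinary least squares fit of y on x
fit = stats.linregress(x, y)
print(f"a_hat = {fit.intercept:.3f} (true {a_true})")
print(f"b_hat = {fit.slope:.3f} (true {b_true})")
print(f"standard error of b_hat = {fit.stderr:.3f}")
```

The estimates differ from the true values only because of the errors $\epsilon_{i}$; a different draw of the errors would give slightly different estimates.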

#### __The Sampling Distribution__

<br>

* The __sampling distribution__ is the set of all possible datasets that could have been observed if the data collection process had been re-done, together with the probabilities of these possible values
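One way to picture a sampling distribution is to re-do the data collection many times on the computer and record the estimate from each replication. The sketch below does this for the slope in $y_{i} = a + b x_{i} + \epsilon_{i}$. The parameter values, the fixed grid of `x` values, and the normal errors are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

a_true, b_true, sigma = 1.0, 2.0, 0.5   # hypothetical true values
n, n_sims = 50, 2000
x = np.linspace(0.0, 10.0, n)           # x held fixed across replications

# Each replication is one dataset that "could have been observed"
b_hats = np.empty(n_sims)
for s in range(n_sims):
    y = a_true + b_true * x + rng.normal(0.0, sigma, size=n)
    b_hats[s] = np.polyfit(x, y, deg=1)[0]   # least squares slope

# Known result for least squares with fixed x: sd(b_hat) = sigma / sqrt(sum((x - x_bar)^2))
theory_sd = sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

print(f"mean of b_hat across replications = {b_hats.mean():.4f}")
print(f"sd of b_hat across replications   = {b_hats.std(ddof=1):.4f}")
print(f"theoretical sd of b_hat           = {theory_sd:.4f}")
```

A histogram of `b_hats` (for example with `sns.histplot(b_hats)`) shows the simulated sampling distribution of the slope estimate. In practice we observe only one dataset, so this distribution has to be inferred from an assumed probability model.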

<br>

## __Section 4.2: Estimates, Standard Errors, and Confidence Intervals__

<br>

<br>

## __Section 4.3: Bias and Unmodeled Uncertainty__

<br>

<br>

## __Section 4.4: Statistical Significance, Hypothesis Testing, and Statistical Errors__

<br>

<br>

## __Section 4.5: Problems With the Concept of Statistical Significance__

<br>

<br>

## __Section 4.6: Example of Hypothesis Testing: 55,000 Residents Need Your Help!__

<br>

<br>

## __Section 4.7: Moving Beyond Hypothesis Testing__

<br>

<br>

## __Bibliographic Note__

<br>