# **Estimators, Bias, and Variance**

## **18.1 Common Random Variables**

We will use many random variables throughout DATA 100: 

#### **Bernoulli**$(p)$
* Takes on value $1$ with probability $p$, and $0$ with probability $(1-p)$
* Also called the *"indicator"* random variable
* **Expectation:** $E[X] = p$
* **Variance:** $\text{Var}(X) = p\cdot (1-p)$

#### **Binomial**$(n, p)$
* Number of $1$s in $n$ independent $\text{Bernoulli}(p)$ trials
* **Expectation:** $E[Y] = np$
* **Variance:** $\text{Var}(Y) = np\cdot (1-p)$

#### **Uniform on a finite set of values**
* The probability of each value is $\frac{1}{\text{(number of possible values)}}$
* Example: a standard / fair die 

#### **Uniform on the unit interval** $(0,1)$
* Density is flat at $1$ on $(0,1)$ and $0$ elsewhere 

#### **Normal**$(\mu, \sigma^2)$, a.k.a. Gaussian
* **Density:** $f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$
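
The expectation and variance formulas for the Bernoulli and Binomial distributions can be checked by simulation. The sketch below is only an illustration: the values of `p`, `n`, and `trials` are arbitrary choices, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the run is reproducible
p, n, trials = 0.3, 10, 200_000   # illustrative values (assumptions)

# A Bernoulli(p) draw is the same as a Binomial(1, p) draw
bernoulli = rng.binomial(1, p, size=trials)
binomial = rng.binomial(n, p, size=trials)

print("Bernoulli mean:", bernoulli.mean(), "theory:", p)
print("Bernoulli var: ", bernoulli.var(), "theory:", p * (1 - p))
print("Binomial mean: ", binomial.mean(), "theory:", n * p)
print("Binomial var:  ", binomial.var(), "theory:", n * p * (1 - p))
```

With this many draws, the empirical values should land close to the theoretical ones, though they will not match exactly.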

## **18.2 Sample Statistics**

* The distribution of a *population* describes how a random variable behaves across *all* individuals of interest 

* The distribution of a *sample* describes how a random variable behaves in a *specific sample* from the population 

* In Data Science, we oftentimes don't have access to the whole population, so we make use of samples in order to make inferences 

* When sampling, we also make the **BIG** assumption that we sample uniformly at random *with replacement* from the population 

* Our sample mean is a random variable, as it depends on our randomly drawn sample 

* The population mean, on the other hand, is a **fixed** number


### **18.2.1 Sample Mean**

We define the sample mean as 

$$ \bar{X_n} = \frac{1}{n} \sum_{i = 1}^n X_i$$

The expectation of the sample mean is 

$$ E[\bar{X_n}] = \mu $$

The variance of the sample mean is 

$$\text{Var}(\bar{X_n}) = \frac{\sigma ^2}{n}$$
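
Both results follow in a few lines, assuming the $X_i$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2$ (which is what sampling uniformly at random with replacement gives us). The expectation uses linearity; the variance uses the fact that variances of independent random variables add:

$$E[\bar{X_n}] = \frac{1}{n} \sum_{i = 1}^n E[X_i] = \frac{1}{n} \cdot n\mu = \mu$$

$$\text{Var}(\bar{X_n}) = \frac{1}{n^2} \sum_{i = 1}^n \text{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$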

### **18.2.2 Central Limit Theorem**
* If an independent and identically distributed (i.i.d.) sample of size $n$ is large, then the probability distribution of the **sample mean** is **roughly normal** with mean $\mu$ and SD $\frac{\sigma}{\sqrt{n}}$
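
A minimal simulation sketch of this statement, under assumptions chosen only for illustration: the population is Exponential with scale $1$ (so $\mu = 1$ and $\sigma = 1$, and the population itself is clearly not normal), and the sample size and number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 10_000      # illustrative sample size and number of samples
mu, sigma = 1.0, 1.0       # Exponential(scale=1) has mean 1 and SD 1

# Each row is one i.i.d. sample of size n; take the mean of every row
samples = rng.exponential(scale=1.0, size=(reps, n))
sample_means = samples.mean(axis=1)

se = sigma / np.sqrt(n)    # SD of the sample mean predicted by the formula above
within_one_se = np.mean(np.abs(sample_means - mu) < se)

print("mean of sample means:", sample_means.mean(), "theory:", mu)
print("SD of sample means:  ", sample_means.std(), "theory:", se)
print("share within 1 SD of mu:", within_one_se, "(about 0.68 for a normal curve)")
```

Even though the population is skewed, the sample means should center on $\mu$ with spread near $\frac{\sigma}{\sqrt{n}}$, and roughly 68% of them should fall within one SD of $\mu$.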

<img src="https://ds100.org/course-notes/probability_2/images/clt.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">


### **18.2.3 Using the Sample Mean to Estimate the Population Mean**
* If we want to use our sample mean to estimate our true population mean, one sample is not enough (what if we get a bad sample?)
* What if our sample size is also too small?

<img src="https://ds100.org/course-notes/probability_2/images/CLTdiff.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">


* We can see that the sample mean is less variable when the sample size is $800$

These questions are addressed by **bootstrapping**, which will be covered in the next note


## **18.3 Prediction and Inference**

**Inference**: The task of using a model to infer the true underlying relationships between the features and response variable 

* One major goal of inference is to draw conclusions about the full population of data by using only one random sample 

We have two important definitions: 
* **parameter**: a numerical function of the *population* (like $\mu$)
* **statistic**: a numerical function of the random *sample* (like $\bar{X_n}$) 

Due to its random nature, we call the statistic an **estimator** of the true population parameter
* Let $\theta$ denote the population parameter 
* $\hat{\theta}$ denotes the estimator

We evaluate an estimator using the following metrics: 
* How close is our answer to the parameter **(Risk / MSE)**

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] $$

* Do we get the right answer for the parameter, on average? **(Bias)**
$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

* How variable is the answer? **(Variance)**

$$\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] $$


<img src="https://ds100.org/course-notes/probability_2/images/bias_v_variance.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">

In an ideal world, we want our estimator to have low bias and low variance
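
To make the three metrics concrete, here is a hedged sketch that estimates them by simulation for two estimators of a population mean $\theta$: the sample mean, and a deliberately shrunken version `0.9 * sample mean`. The shrunken estimator is a hypothetical example introduced only for comparison, and the values of `theta`, `sigma`, `n`, and `reps` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 5.0, 2.0, 25, 50_000   # illustrative values (assumptions)

# Each row is one random sample of size n from a Normal(theta, sigma^2) population
samples = rng.normal(theta, sigma, size=(reps, n))
estimators = {
    "sample mean": samples.mean(axis=1),
    "0.9 * sample mean": 0.9 * samples.mean(axis=1),  # hypothetical biased estimator
}

results = {}
for name, est in estimators.items():
    bias = est.mean() - theta             # E[theta_hat] - theta
    variance = est.var()                  # E[(theta_hat - E[theta_hat])^2]
    mse = np.mean((est - theta) ** 2)     # E[(theta_hat - theta)^2]
    results[name] = (bias, variance, mse)
    print(f"{name}: bias={bias:.4f}, variance={variance:.4f}, MSE={mse:.4f}")
```

The shrunken estimator should show a smaller variance but a clearly nonzero bias, and in both cases the simulated MSE equals the squared bias plus the variance.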



### **18.3.1 Prediction and Inference**

* Let's consider the relationship $Y = g(x)$ where $g$ represents some "universal truth" that defines the underlying relationship between $x$ and $Y$

* We have plotted $g$ with the red line below

* As Data Scientists, we actually never get to see $g$

* When we collect data in order to try and estimate $g$, our process will always involve some inherent error

* As a result, we say that each observation comes with **noise**, a random error term $\epsilon$

* We say that $\epsilon$ is a random variable with $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$, thus making $Y(x)$ also a random variable 


<img src="https://ds100.org/course-notes/probability_2/images/data.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">

We construct the model $\hat{Y}(x)$ to estimate $g$

* $\text{True relationship: } g(x)$
* $\text{Observed relationship: }Y = g(x) + \epsilon$
* $\text{Prediction: }\hat{Y}(x)$
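
The three objects above can be sketched in code. Everything specific here is an assumption made for illustration: the linear form of `g`, the noise level `sigma`, and the choice of a straight-line model fit with `np.polyfit`. In practice $g$ is never observed.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):
    # Hypothetical "true" relationship; real data never reveals this function
    return 1.0 + 2.0 * x

sigma = 0.5                                # assumed noise SD
x = np.linspace(0, 1, 200)
eps = rng.normal(0, sigma, size=x.size)    # noise with E[eps] = 0, Var(eps) = sigma^2
y = g(x) + eps                             # observed relationship: Y = g(x) + eps

# Fit a straight line; np.polyfit returns coefficients from highest degree down
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x              # prediction: Y_hat(x)

print("fitted slope:", slope, "fitted intercept:", intercept)
```

A different random draw of `eps` would give a slightly different fitted line, which is why $\hat{Y}(x)$ is itself random.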

<img src="https://ds100.org/course-notes/probability_2/images/y_hat.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">

The choice of features also significantly impacts our estimation

<img src="https://ds100.org/course-notes/probability_2/images/y_hat2.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">




## **18.4 Bias-Variance Tradeoff** ##

With our model: 
* $\text{True relationship: } g(x)$
* $\text{Observed relationship: }Y = g(x) + \epsilon$
* $\text{Prediction: }\hat{Y}(x)$

We can revisit the bias-variance tradeoff curve: 

<img src="https://ds100.org/course-notes/probability_2/images/bvt_old.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">

and get something that looks more like this: 

<img src="https://ds100.org/course-notes/probability_2/images/bvt.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">



### **18.4.1 Model Risk**

* **Model risk** is defined as the mean square prediction error of the random variable $\hat{Y}$

* It's an expectation across *all* samples we could have possibly gotten when fitting the model 

* It considers the model's performance on any sample that is theoretically possible, rather than the specific data we have collected 

$$\text{model risk }=E\left[(Y-\hat{Y}(x))^2\right]$$

Model risk comes from three sources: 
1) **Observation Variance**: Randomness in new observations $Y$ due to random noise $\epsilon$
2) **Model Variance**: Randomness in the sample we used to train the models, as samples $X_1, X_2, \ldots, X_n, Y$ are random 
3) **Model Bias**: Non-random error due to our model being different from the true underlying function $g$ 

<img src="https://ds100.org/course-notes/probability_2/images/errors.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">

Let's now zoom in on a single point: 

<img src="https://ds100.org/course-notes/probability_2/images/error.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:500px; height=550px;">

* Note that $\hat{Y}(x)$ is a random variable: its prediction for $x$ depends on the specific sample used for training

We can identify three components of error from the graph above: 


<img src="https://ds100.org/course-notes/probability_2/images/decomposition.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:700px; height=750px;">

Putting all these errors into one equation, we get the following decomposition of model risk: 

$$E\left[(Y(x)-\hat{Y}(x))^2\right] = E[\epsilon^2] + \left(g(x)-E\left[\hat{Y}(x)\right]\right)^2 + E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right]$$
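
One way to see where this decomposition comes from, sketched under the assumption (not stated explicitly above) that the noise $\epsilon$ in a new observation is independent of the fitted model $\hat{Y}(x)$. Split the prediction error into three pieces:

$$Y(x) - \hat{Y}(x) = \underbrace{\epsilon}_{A} + \underbrace{\left(g(x) - E\left[\hat{Y}(x)\right]\right)}_{B} + \underbrace{\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)}_{C}$$

Here $B$ is a constant, while $A$ and $C$ are random with $E[A] = 0$ and $E[C] = 0$. Squaring and taking expectations gives

$$E\left[(A + B + C)^2\right] = E[A^2] + B^2 + E[C^2] + 2B\,E[A] + 2\,E[A]\,E[C] + 2B\,E[C]$$

where $E[AC] = E[A]\,E[C]$ by independence. All three cross terms are $0$, which leaves $E[\epsilon^2] + B^2 + E[C^2]$, the three terms in the decomposition.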

Let's now go term by term and see if we can simplify specific terms



#### **18.4.1.1 Observation Variance**

The first term in the above decomposition is $E[\epsilon^2]$
* Recall $E[\epsilon]=0$ and $\text{Var}(\epsilon) = \sigma^2$. 
* As a result, 

$$ \text{Observation variance} = E[\epsilon^2] = \text{Var}(\epsilon) + (E[\epsilon])^2 = \sigma^2$$

* **Observation variance** exists due to randomness in our observations of $Y$
* It is a form of *chance error*

#### **18.4.1.2 Model Variance**
* We now look at the last term: $E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right]$

* This is precisely $\text{Var}(\hat{Y}(x))$, or **model variance**

* It describes how much $\hat{Y}(x)$ tends to vary when we fit the model on different samples 

* It describes the variability due to the randomness in our sampling process

* It is also a form of *chance error* 

$$\text{model variance} = \text{Var}(\hat{Y}(x)) = E\left[\left(\hat{Y}(x) - E\left[\hat{Y}(x)\right]\right)^2\right]$$

* Large model variance is often a result of **overfitting**, where we pay too much attention to the small differences in our sample which lead to large differences in the fitted model 

* In order to remedy this **overfitting**, we can reduce model complexity (take out some features)

#### **18.4.1.3 Model Bias**

* The second term is $\left(g(x)-E\left[\hat{Y}(x)\right]\right)^2$, which is the square of the **model bias**

* **Model Bias** is how far off $g(x)$ and $\hat{Y}(x)$ are on average over all possible samples 

$$\text{model bias} = E\left[\hat{Y}(x) - g(x)\right] = E\left[\hat{Y}(x)\right] - g(x)$$

* Model bias is not random; it's an average measure for a specific individual $x$ 

* If bias is positive, our model tends to overestimate $g(x)$

* If bias is negative, our model tends to underestimate $g(x)$

* If it's $0$, we can say that our model is **unbiased**

There are two main reasons for large model bias: 
* Underfitting: our model is too simple 
* Poor domain knowledge: we don't understand which features are useful for predicting the response 

We can remedy this by making our model more *complex* 

### **18.4.2 The Decomposition**

With this in mind, we can update our decomposition of model risk 

$$E[(Y(x) - \hat{Y}(x))^2] = \sigma^2 + (E[\hat{Y}(x)] - g(x))^2 + \text{Var}(\hat{Y}(x))$$

$$\text{model risk } = \text{observation variance} + (\text{model bias})^2 \text{+ model variance}$$
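
The decomposition can be checked numerically at a single point. The sketch below is an illustration under stated assumptions, not part of the original notes: `g` is a hypothetical quadratic, the model is a straight line fit with `np.polyfit` (so it is deliberately too simple), and `sigma`, `n`, `reps`, and `x0` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def g(x):
    # Hypothetical true relationship, chosen so a straight line is biased
    return x ** 2

sigma, n, reps, x0 = 0.5, 30, 5_000, 0.9   # illustrative values (assumptions)

# Fit the model on many independent training samples; record its prediction at x0
preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, size=n)
    y = g(x) + rng.normal(0, sigma, size=n)
    slope, intercept = np.polyfit(x, y, deg=1)
    preds[r] = intercept + slope * x0

# New observations at x0, drawn independently of the training samples
y_new = g(x0) + rng.normal(0, sigma, size=reps)

risk = np.mean((y_new - preds) ** 2)   # simulated model risk at x0
obs_var = sigma ** 2                   # observation variance
bias = preds.mean() - g(x0)            # model bias at x0
model_var = preds.var()                # model variance at x0

print("simulated risk:           ", risk)
print("sigma^2 + bias^2 + var:   ", obs_var + bias ** 2 + model_var)
```

The two printed numbers should agree up to simulation error; they will not match exactly because the risk is estimated from a finite number of repetitions.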

This decomposition gives rise to the **bias-variance tradeoff**:
* Reducing model bias (by increasing complexity) tends to increase model variance
* Reducing model variance (by decreasing complexity) tends to increase model bias

Remember, 
* High Variance and Low Bias (**overfitting**)
* Low Variance and High Bias (**underfitting**)
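
The tradeoff can be seen in a small simulation that fits polynomials of increasing degree. This is a sketch under several assumptions made only for illustration: `g` is a hypothetical sine curve, the $x$ values are held fixed so that only the noise changes between samples (a simplification of the random-sample setting above), polynomial degree stands in for model complexity, and squared bias and variance are averaged over the $x$ grid.

```python
import numpy as np

rng = np.random.default_rng(4)

def g(x):
    # Hypothetical true relationship, chosen only for illustration
    return np.sin(2 * np.pi * x)

sigma, n, reps = 0.3, 30, 2_000     # illustrative values (assumptions)
x = np.linspace(0, 1, n)            # fixed design: only the noise is random

# One simulated training set per column: shape (n, reps)
Y = g(x)[:, None] + rng.normal(0, sigma, size=(n, reps))

summary = {}
for deg in (1, 3, 7):
    coefs = np.polyfit(x, Y, deg)            # fits every column; shape (deg + 1, reps)
    preds = np.vander(x, deg + 1) @ coefs    # predictions at each x; shape (n, reps)
    bias_sq = np.mean((preds.mean(axis=1) - g(x)) ** 2)   # squared bias, averaged over x
    variance = np.mean(preds.var(axis=1))                 # model variance, averaged over x
    summary[deg] = (bias_sq, variance)
    print(f"degree {deg}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the squared bias should shrink and the model variance should grow as the degree increases, matching the underfitting and overfitting ends of the curve.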


<img src="https://ds100.org/course-notes/probability_2/images/bvt.png" alt="Line Chart" style="display:inline-block; margin-right:10px; width:700px; height=750px;">