## Bias and Regression

Harvard University course, CS109 Data Science.  
Slides: https://github.com/cs109/2015/blob/master/Lectures/07-BiasAndRegression.pdf


### Bias Types
* **Selection Bias**. Where did the data come from?
* **Publication Bias**. What percentage of scientific discoveries are replicable? If you try to reproduce the results, will you succeed? How true are the discoveries?
* **Non-response Bias**. What if the people who didn't answer the survey questions are an important group of people?
* **Length Bias**. For example, suppose you want to measure the average prison sentence. If you show up at a random point in time, you will most likely see prisoners who are serving long sentences, and you will meet few people who are serving only a few days or weeks.
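
The prison example can be checked with a small simulation. This is only a sketch under made-up assumptions: sentence lengths are drawn from an exponential distribution with a mean of 12 months, sentences start uniformly over a long window, and we visit once in the middle of that window. None of these numbers come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical sentence lengths in months (exponential, mean 12).
lengths = rng.exponential(scale=12.0, size=n)
# Hypothetical start times, spread uniformly over a 1000-month window.
starts = rng.uniform(0.0, 1000.0, size=n)

# We visit the prison once, at month 500, and record whoever is inside.
visit = 500.0
present = (starts <= visit) & (visit < starts + lengths)

mean_all = lengths.mean()            # average over every sentence handed out
mean_seen = lengths[present].mean()  # average over the prisoners we meet

print(f"mean sentence, all prisoners:  {mean_all:.1f} months")
print(f"mean sentence, prisoners seen: {mean_seen:.1f} months")
```

Under these assumptions the prisoners we meet have a noticeably longer average sentence than the full population, because a long sentence is more likely to overlap the visit date.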

### Mathematical/Statistical Definition of Bias

The bias of an estimator is how far off it is on average: 

\begin{equation*}
bias(\hat{\theta}) = E(\hat{\theta}) - \theta   
\end{equation*}

*where $\theta$ is what we are trying to estimate, and $\hat{\theta}$ is its estimator*.
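
As an illustration of the definition, the sketch below estimates the bias of two variance estimators by simulation: the one that divides by $n$ and the one that divides by $n - 1$. The normal data, the true variance of 4, and the sample size of 5 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0   # theta: the quantity we are trying to estimate
n = 5            # size of each sample
reps = 200_000   # number of simulated samples

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(reps, n))

# Two estimators of the variance, computed once per simulated sample.
var_n = samples.var(axis=1, ddof=0)    # divides by n
var_n1 = samples.var(axis=1, ddof=1)   # divides by n - 1

# bias = E(theta_hat) - theta, with E approximated by the simulation average.
bias_n = var_n.mean() - true_var     # theory: -true_var / n
bias_n1 = var_n1.mean() - true_var   # theory: 0

print(f"bias when dividing by n:     {bias_n:.3f}")
print(f"bias when dividing by n - 1: {bias_n1:.3f}")
```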

The question may arise: why not subtract the bias? 
Because
* We don't know the bias. We can try to estimate it, but that estimate will have a bias of its own.
* There is a **Bias-Variance Tradeoff**, which is very often formulated the following way:  

\begin{equation*}
MSE(\hat{\theta}) = Var(\hat{\theta}) + bias^2(\hat{\theta})
\end{equation*}

**MSE** is the **Mean Squared Error**, the most common way to measure how good your estimator is (on average, in terms of squared distance, how far off are you from the truth?).

#### So the goal is not to minimize the bias but instead the more appropriate goal is to minimize MSE.
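
The decomposition and the tradeoff can both be seen in a simulation. The sketch below compares the sample mean with a deliberately biased, shrunken version of it. The shrinkage factor 0.8 and the values of `theta`, `sigma`, and `n` are arbitrary assumptions; the shrunken estimator wins here only because of these particular values.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 1.0, 2.0, 10, 200_000

samples = rng.normal(theta, sigma, size=(reps, n))
xbar = samples.mean(axis=1)   # unbiased estimator of theta


def mse_parts(estimates, truth):
    """Return (MSE, variance, bias) of a set of simulated estimates."""
    bias = estimates.mean() - truth
    variance = estimates.var()   # ddof=0, so the identity below is exact
    mse = np.mean((estimates - truth) ** 2)
    return mse, variance, bias


mse_u, var_u, bias_u = mse_parts(xbar, theta)
mse_s, var_s, bias_s = mse_parts(0.8 * xbar, theta)   # biased toward 0

print(f"unbiased: MSE={mse_u:.3f}  Var={var_u:.3f}  bias^2={bias_u**2:.3f}")
print(f"shrunken: MSE={mse_s:.3f}  Var={var_s:.3f}  bias^2={bias_s**2:.3f}")
```

In both rows the variance and squared bias add up to the MSE, and in this setup the biased estimator has the smaller MSE because its reduction in variance outweighs the bias it introduces.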

### Fisher Weighting

How should we combine independent, *unbiased* estimators for a parameter into one estimator?

\begin{equation*}
\hat{\theta} = \sum_{i=1}^{k}w_i\hat{\theta_i}
\end{equation*}

The *weights* should sum to 1 but how should they be chosen?

\begin{equation*}
w_i \propto \frac{1}{Var(\hat{\theta_i})}
\end{equation*}

(Inversely proportional to variance, so more precise estimators get more weight)
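
A minimal sketch of inverse-variance weighting follows. The function name `fisher_weights` and the three example variances are hypothetical. The variance of the combined estimator uses $Var(\sum_i w_i\hat{\theta_i}) = \sum_i w_i^2 Var(\hat{\theta_i})$, which relies on the estimators being independent.

```python
import numpy as np


def fisher_weights(variances):
    """Weights proportional to 1 / variance, normalized to sum to 1."""
    inv = 1.0 / np.asarray(variances, dtype=float)
    return inv / inv.sum()


# Hypothetical variances of three independent, unbiased estimators.
variances = np.array([1.0, 4.0, 0.25])
w = fisher_weights(variances)

# Variance of the weighted combination (independence assumed).
combined_var = np.sum(w ** 2 * variances)
# For comparison: variance of the plain average with equal weights.
equal_var = np.sum((1.0 / len(variances)) ** 2 * variances)

print("weights:", w)
print("variance with Fisher weights:", combined_var)
print("variance with equal weights: ", equal_var)
```

With these weights the combined variance equals $1 / \sum_i (1 / Var(\hat{\theta_i}))$, which is smaller than the variance of any single estimator in the combination.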

### Regression Toward the Mean

Galton investigated the heights of fathers and sons and found that if the father is very tall, the son tends to be tall but not as tall as his father. And if the father is very short, the son tends to be short but not as short as his father. 

This is regression toward the mean, and it is very common in various scenarios.
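
The effect appears whenever two measurements are correlated but not perfectly so. The sketch below simulates father and son heights with the same mean and spread and a correlation of 0.5. The mean, standard deviation, and correlation are hypothetical values, not Galton's data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
mean_height, sd_height, rho = 175.0, 7.0, 0.5   # hypothetical values, in cm

# Standardized heights with correlation rho between father and son.
father_z = rng.standard_normal(n)
son_z = rho * father_z + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

fathers = mean_height + sd_height * father_z
sons = mean_height + sd_height * son_z

tall = fathers > np.quantile(fathers, 0.95)    # tallest 5% of fathers
short = fathers < np.quantile(fathers, 0.05)   # shortest 5% of fathers

print(f"tall fathers:  {fathers[tall].mean():.1f} cm, their sons: {sons[tall].mean():.1f} cm")
print(f"short fathers: {fathers[short].mean():.1f} cm, their sons: {sons[short].mean():.1f} cm")
```

In both groups the sons' average lies between the fathers' average and the population mean.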

### Linear Model

Often called **OLS**, Ordinary Least Squares

\begin{equation*}
y = X\beta + \epsilon
\end{equation*}

where 
* $y$ is $n \times 1$, the response we want to predict
* $X$ is $n \times k$, the matrix of data (predictors)
* $\beta$ is $k \times 1$, the parameters
* $\epsilon$ is $n \times 1$, the error terms

**Note**: what makes a linear model linear is that it is linear in the parameters $\beta$, so we can include $x^2$ or $x^3$ as predictors and the model is still linear.
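
To make the note concrete, the sketch below fits a quadratic curve with ordinary least squares by putting $1$, $x$, and $x^2$ in the columns of $X$. The true coefficients and the noise level are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-2.0, 2.0, 50)
true_beta = np.array([1.0, -2.0, 0.5])   # hypothetical parameters

# The response is nonlinear in x but linear in the parameters.
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x ** 2
y = y + rng.normal(0.0, 0.1, size=x.size)

# Design matrix X (n x k) with columns 1, x, x^2.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Least-squares solution of y = X beta + epsilon.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta:", beta_hat)
```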

* For the **sample** (1 predictor)

\begin{equation*}
\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}
\end{equation*}

\begin{equation*}
\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\end{equation*}
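
The two sample formulas translate directly into code. In the sketch below, the helper name `simple_ols` and the five data points are hypothetical; `np.polyfit` is used only as an independent check.

```python
import numpy as np


def simple_ols(x, y):
    """Intercept and slope for one predictor, using the closed-form formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical data, roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta0, beta1 = simple_ols(x, y)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print("closed form:", beta0, beta1)
print("np.polyfit: ", intercept_ref, slope_ref)
```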

* For the **population** (1 predictor)
\begin{equation*}
y = \beta_0 + \beta_1 x + \epsilon
\end{equation*}

\begin{equation*}
E(y) = \beta_0 + \beta_1 E(x)
\end{equation*}

\begin{equation*}
cov(y, x) = \beta_1 cov(x,x)
\end{equation*}
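
The last two identities follow from the model under two standard assumptions that these notes leave implicit: $E(\epsilon) = 0$ and $cov(x, \epsilon) = 0$. Since $cov(x, x) = var(x)$, solving them for the parameters gives

\begin{equation*}
\beta_1 = \frac{cov(x, y)}{var(x)}, \qquad \beta_0 = E(y) - \beta_1 E(x)
\end{equation*}

Replacing the expectation, covariance, and variance with their sample versions gives the sample formulas for $\hat{\beta_0}$ and $\hat{\beta_1}$ above.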

**"Explained" Variance** 

\begin{equation*}
var(y) = var(X\hat{\beta}) + var(e)
\end{equation*}

*where e are the residuals, the estimators of errors*.

\begin{equation*}
R^2 = \frac{var(X\hat{\beta})}{var(y)} = \frac{\sum_{i=1}^{n}(\hat{y_i} - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
\end{equation*}

* $R^2$ is the variance explained by the fitted model divided by the total variance.
* $R^2$ measures goodness of fit, but it doesn't validate the model.
* Adding more predictors can never decrease $R^2$ (and usually increases it), even when the new predictors are unrelated to $y$.
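
The sketch below computes $R^2$ from the formula above and then adds a pure-noise predictor to show that $R^2$ does not go down. The data-generating model and the helper name `r_squared` are assumptions for illustration, and the formula is applied to models that include an intercept column.

```python
import numpy as np


def r_squared(X, y):
    """Explained variance over total variance for an OLS fit with an intercept."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # hypothetical true model
noise = rng.normal(size=n)               # predictor unrelated to y

X_small = np.column_stack([np.ones(n), x])   # intercept + x
X_big = np.column_stack([X_small, noise])    # intercept + x + noise

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)

print(f"R^2 with x only:        {r2_small:.4f}")
print(f"R^2 with x and a noise: {r2_big:.4f}")
```

The second value is at least as large as the first even though the extra predictor carries no information, which is why $R^2$ alone should not be used to choose between models of different sizes.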

