# Bias-Variance Trade-Off

## Irreducible Error
- Data-generating processes are noisy
- Noise is, by definition, random (not deterministic)
- We can't predict its values, only its statistics (like mean and variance)
- Example:
    - Suppose we are in charge of the data-generating process
    - f(x) = 2x + 1
    - This is a linear regression setting
    - If a machine learning practitioner is working on our data, we could hand over this function and the modeling work would be done
    - But the linear regression model is y = ax + b + $\epsilon$
    - $\epsilon \sim N(0,\sigma^2)$
    - So even the exact function $\hat{f}(x) = 2x + 1$ doesn't achieve 0 error on y = 2x + 1 + $\epsilon$
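
The example above can be checked numerically. The following is a minimal sketch, assuming a noise standard deviation of 0.5, inputs drawn uniformly from [-1, 1], and an arbitrary seed (none of these values come from the notes). It predicts with the true function itself and shows that the mean squared error lands near $\sigma^2$ rather than 0.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

def f(x):
    # the true data-generating function from the example
    return 2 * x + 1

sigma = 0.5      # assumed noise standard deviation (not given in the notes)
n = 100_000      # assumed sample size

x = rng.uniform(-1, 1, size=n)
y = f(x) + rng.normal(0.0, sigma, size=n)

# Predict with the true function itself, i.e. f_hat = f
mse_true_f = np.mean((y - f(x)) ** 2)

# The error is close to sigma**2 (the irreducible error), not 0
print(mse_true_f, sigma ** 2)
```

No choice of model can push the error below this floor, because the residual y - f(x) is exactly the noise term.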

## Bias 
- Bias here refers to the delta between the true f(x) and your average model, where the average is taken over all possible training sets
- Some sources refer to the square of this as bias; we won't
$$
\text{bias} = E[f(x) - \hat{f}(x)] = f(x) - E[\hat{f}(x)]
$$

## Variance
- Variance in statistics = how much a random variable deviates from its mean, in squared units
- Variance in the context of the bias-variance trade-off is more specific
- Variance = statistical variance of the predictor over all possible training sets
- Suppose we have a model that overfits: it fits any training set perfectly
- Then the models fitted to different training sets are probably very different from each other
- Variance has nothing to do with accuracy
- It just measures how inconsistent a predictor is across different training sets
- Remember: the goal is not to achieve the lowest possible error on the training set
- The goal is to find the true f(x)
- Being close to the training points is only a proxy for that
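
To make "variance over training sets" concrete, here is a rough sketch that simulates many training sets and looks at the predictions of two models at a single query point. The true function (sin), the noise level, the polynomial degrees, and the query point `x0` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility

def f(x):
    # assumed "true" function for this sketch
    return np.sin(x)

n_points = 25        # size of each training set (assumption)
n_datasets = 500     # number of simulated training sets (assumption)
sigma = 0.5          # noise standard deviation (assumption)
x0 = 2.0             # query point where the predictor is examined

x = np.linspace(-np.pi, np.pi, n_points)
# One column per training set: same inputs, fresh noise each time
Y = f(x)[:, None] + rng.normal(0.0, sigma, size=(n_points, n_datasets))

def predictions_at_x0(degree):
    # np.polyfit fits every column of Y at once; coefficients come back
    # with the highest power first, shape (degree + 1, n_datasets)
    coefs = np.polyfit(x, Y, degree)
    powers = x0 ** np.arange(degree, -1, -1)
    return powers @ coefs            # one prediction per training set

for degree in (1, 8):
    preds = predictions_at_x0(degree)
    bias = f(x0) - preds.mean()      # delta between true f and the average model
    variance = preds.var()           # spread of the predictor across training sets
    print(f"degree={degree}  bias={bias:+.3f}  variance={variance:.4f}")
```

Under these assumptions the straight line should show the larger bias and the degree-8 polynomial the larger variance, which matches the intuition in the bullets above.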

## Model Complexity
- Variance is a proxy for model complexity
- Complexity is a malleable term
- can mean different things for different classifiers
- e.g. decision trees: deep tree = complex, shallow tree = not complex
- e.g. K-nearest neighbors: K = 1 = complex, K = 50 = not complex
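
The K-nearest-neighbor example can be sketched with a small hand-written 1-D KNN regressor (a hypothetical helper, not a library API). The true function, noise level, and dataset sizes are assumptions. With 50 training points, K = 50 averages over every point, while K = 1 simply memorizes each training target.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

def f(x):
    return np.sin(x)             # assumed true function for this sketch

def knn_predict(x_train, y_train, x_query, k):
    # 1-D K-nearest-neighbor regression: average the targets of the k closest points
    dists = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

n_points = 50        # training-set size, so K = 50 uses every point
n_datasets = 200     # number of simulated training sets
sigma = 0.5          # assumed noise standard deviation
x = np.linspace(-np.pi, np.pi, n_points)

results = {}
for k in (1, 50):
    preds = np.empty((n_datasets, n_points))
    train_mse = np.empty(n_datasets)
    for j in range(n_datasets):
        y = f(x) + rng.normal(0.0, sigma, size=n_points)
        preds[j] = knn_predict(x, y, x, k)       # predict at the training inputs
        train_mse[j] = np.mean((y - preds[j]) ** 2)
    # variance across training sets, averaged over the input points
    variance = preds.var(axis=0).mean()
    results[k] = (train_mse.mean(), variance)
    print(f"K={k:2d}  mean train MSE={train_mse.mean():.3f}  variance={variance:.4f}")
```

K = 1 reaches zero training error but its predictions move with every new noise draw, whereas K = 50 barely changes between training sets. That is the sense in which variance tracks complexity here.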

## Bias-Variance Trade-Off
- In ML we strive to minimize error
- Overall error is a combination of
    - Bias
    - Variance
    - Irreducible error
- The irreducible error is out of our hands, so the goal is to make bias and variance as small as possible
- It's a trade-off: we need to balance the two
- When we lower one, the other typically increases
- Overfit: bias goes down, variance goes up
- Underfit: bias goes up, variance goes down

![](https://cn.bing.com/th?id=OIP.XRW2556DfOJz2EIz31RoOwHaFu&pid=Api&w=1056&h=816&rs=1)
![](https://cn.bing.com/th?id=OIP.-pwwSpcPJcxzDaRH40w95QAAAA&pid=Api&rs=1)

## Bias-Variance Decomposition
- Expected error = $bias^2$ + variance + irreducible error
- The derivation uses mean squared error, for both regression and classification



## Definitions and Derivation


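$$
\begin{aligned}
y &= f(x) + \epsilon\\
\epsilon &\sim N(0,\sigma_{\epsilon}^2)\\
\hat{f}(x) &= \text{estimate of } f(x)\\
\bar{f}(x) &= E[\hat{f}(x)]\\
err &= E[(y-\hat{f}(x))^2]\\
&= E[(f(x)+\epsilon -\hat{f}(x)+\bar{f}(x)-\bar{f}(x))^2]\\
&= E[(\epsilon + [f(x)-\bar{f}(x)] - [\hat{f}(x)-\bar{f}(x)])^2]\\
&= E[\epsilon^2] + [f(x)-\bar{f}(x)]^2 + E[(\hat{f}(x)-\bar{f}(x))^2]\\
&\quad + 2[f(x)-\bar{f}(x)]E[\epsilon] - 2E[\epsilon(\hat{f}(x)-\bar{f}(x))] - 2[f(x)-\bar{f}(x)]E[\hat{f}(x)-\bar{f}(x)]
\end{aligned}
$$

- The three cross terms vanish: f(x) and $\bar{f}(x)$ are constants, and $\epsilon$ is independent of the training set used to fit $\hat{f}(x)$

$$
\begin{aligned}
E[\epsilon] &= 0\\
E[\epsilon^2] &= \sigma_{\epsilon}^2+(E[\epsilon])^2 = \sigma_{\epsilon}^2\\
E[\hat{f}(x) - \bar{f}(x)] &= E[\hat{f}(x)] - E[\hat{f}(x)] = 0\\
E[\epsilon(\hat{f}(x)-\bar{f}(x))] &= E[\epsilon]\,E[\hat{f}(x)-\bar{f}(x)] = 0
\end{aligned}
$$

- What remains is

$$
\begin{aligned}
err &= [f(x) - \bar{f}(x)]^2 + E[(\hat{f}(x) - \bar{f}(x))^2]+E[\epsilon^2]\\
&= bias^2 + variance + \sigma_{\epsilon}^2
\end{aligned}
$$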
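
The identity can be sanity-checked by simulation. The sketch below assumes a sin target, a cubic polynomial model, a noise standard deviation of 0.5, and a single test input `x0`. It estimates the expected error directly and compares it with the sum of the three terms.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

def f(x):
    return np.sin(x)             # assumed true function

n_points = 25
n_datasets = 5000
sigma = 0.5                      # assumed noise standard deviation
degree = 3                       # assumed model: cubic polynomial
x0 = 1.0                         # test input

x = np.linspace(-np.pi, np.pi, n_points)
Y = f(x)[:, None] + rng.normal(0.0, sigma, size=(n_points, n_datasets))

# f_hat(x0) for every training set (np.polyfit fits all columns of Y at once)
coefs = np.polyfit(x, Y, degree)
f_hat = (x0 ** np.arange(degree, -1, -1)) @ coefs

# Independent noisy test targets at x0, one per training set
y0 = f(x0) + rng.normal(0.0, sigma, size=n_datasets)

expected_error = np.mean((y0 - f_hat) ** 2)    # E[(y - f_hat)^2]
bias_sq = (f(x0) - f_hat.mean()) ** 2          # [f - f_bar]^2
variance = f_hat.var()                         # E[(f_hat - f_bar)^2]
noise = sigma ** 2                             # E[eps^2]

# The two numbers agree up to Monte Carlo error
print(expected_error, bias_sq + variance + noise)
```

The test noise is drawn separately from the training noise, which mirrors the independence assumption that makes the cross terms vanish.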

## Summary
- expected error is a combination of bias, variance, and irreducible error
- this is not just the error between the true f(x) and $\hat{f}(x)$
- we never observe f(x), we can only observe y

## In Code

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse

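NUM_DATASETS = 50      # number of simulated training sets
NOISE_VARIANCE = 0.5   # controls the scale of the additive Gaussian noise
MAX_POLY = 12          # highest polynomial degree to try
N = 25                 # number of points per dataset
Ntrain = int(0.9*N)    # 22 points for training, the remaining 3 for testing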

np.random.seed(2)
```

```python
# build polynomial features: columns are x^0, x^1, ..., x^D
def make_poly(x, D):
    N = len(x)
    X = np.empty((N, D+1))
    for d in range(D+1):
        X[:,d] = x**d
        # standardize the higher-order columns so their scales stay comparable;
        # note that the mean/std come from whichever x is passed in
        if d > 1:
            X[:,d] = (X[:,d] - X[:,d].mean()) / X[:,d].std()
    return X
```

```python
X = np.linspace(-np.pi, np.pi, N)
Xpoly = make_poly(X, MAX_POLY)
X
# array([-3.14159265, -2.87979327, -2.61799388, -2.35619449, -2.0943951 ,
#        -1.83259571, -1.57079633, -1.30899694, -1.04719755, -0.78539816,
#        -0.52359878, -0.26179939,  0.        ,  0.26179939,  0.52359878,
#         0.78539816,  1.04719755,  1.30899694,  1.57079633,  1.83259571,
#         2.0943951 ,  2.35619449,  2.61799388,  2.87979327,  3.14159265])
```

```python
Xpoly.shape  # (25, 13)
```


```python

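# the true function we are trying to recover
def f(X):
    return np.sin(X)


# dense grid for drawing the true curve and the fitted curves
x_axis = np.linspace(-np.pi, np.pi, 100)
y_axis = f(x_axis)

x_axis.shape, y_axis.shape  # ((100,), (100,))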
```

```python
train_scores = np.zeros((NUM_DATASETS, MAX_POLY))
test_scores = np.zeros((NUM_DATASETS, MAX_POLY))
# squared_biases = np.zeros((NUM_DATASETS, MAX_POLY))
# test_predictions = np.zeros((N - Ntrain, NUM_DATASETS, MAX_POLY))
train_predictions = np.zeros((Ntrain, NUM_DATASETS, MAX_POLY))
prediction_curves = np.zeros((100, NUM_DATASETS, MAX_POLY))

train_scores.shape, train_predictions.shape  # ((50, 12), (22, 50, 12))
```

```python
# create the model
model = LinearRegression()

# noiseless targets at the sample points
f_X = f(X)

for k in range(NUM_DATASETS):
    # draw a fresh noisy dataset; randn is scaled by the standard deviation,
    # i.e. sqrt(NOISE_VARIANCE)
    Y = f_X + np.random.randn(N) * np.sqrt(NOISE_VARIANCE)

    # no shuffling: the first Ntrain points train, the remaining ones test
    Xtrain = Xpoly[:Ntrain]
    Ytrain = Y[:Ntrain]

    Xtest = Xpoly[Ntrain:]
    Ytest = Y[Ntrain:]

    for d in range(MAX_POLY):
        # columns 0..d+1 give a polynomial of degree d+1
        model.fit(Xtrain[:,:d+2], Ytrain)
        predictions = model.predict(Xpoly[:,:d+2])

        # prediction curve on the dense grid, for plotting
        x_axis_poly = make_poly(x_axis, d+1)
        prediction_axis = model.predict(x_axis_poly)

        prediction_curves[:,k,d] = prediction_axis

        train_prediction = predictions[:Ntrain]
        test_prediction = predictions[Ntrain:]

        train_predictions[:,k,d] = train_prediction # use this to calculate bias/variance later

        # sklearn's signature is mse(y_true, y_pred)
        train_score = mse(Ytrain, train_prediction)
        test_score = mse(Ytest, test_prediction)

        train_scores[k,d] = train_score
        test_scores[k,d] = test_score
```

$$
\begin{aligned}
err &= [f(x) - \bar{f}(x)]^2 + E[(\hat{f}(x) - \bar{f}(x))^2]+E[\epsilon^2]\\
&= bias^2 + variance + \sigma_{\epsilon}^2
\end{aligned}
$$


```python

# true function values at the training inputs
f_Xtrain = f(X)[:Ntrain]

# calculate the squared bias
avg_train_prediction = np.zeros((Ntrain, MAX_POLY))
squared_bias = np.zeros(MAX_POLY)
for d in range(MAX_POLY):
    for i in range(Ntrain):
        # average prediction at training point i over all datasets
        avg_train_prediction[i,d] = train_predictions[i,:,d].mean()
    squared_bias[d] = ((avg_train_prediction[:,d] - f_Xtrain)**2).mean()

# calculate the variance
variances = np.zeros((Ntrain, MAX_POLY))
for d in range(MAX_POLY):
    for i in range(Ntrain):
        delta = train_predictions[i,:,d] - avg_train_prediction[i,d]
        # delta has one entry per dataset, so average over NUM_DATASETS (not N)
        variances[i,d] = delta.dot(delta) / NUM_DATASETS
variance = variances.mean(axis=0)
```
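
The same two quantities can also be computed without explicit loops. Below is a sketch of a vectorized helper. The function name `bias_variance_from_predictions` is hypothetical, and it assumes the predictions are stored with shape (points, datasets, models), as `train_predictions` is above. The helper is checked on hand-built inputs whose squared bias and variance are known in advance.

```python
import numpy as np

def bias_variance_from_predictions(predictions, f_true):
    """predictions: shape (n_points, n_datasets, n_models); f_true: shape (n_points,)."""
    # average model at every point: mean over the dataset axis
    avg_prediction = predictions.mean(axis=1)                 # (n_points, n_models)
    squared_bias = ((avg_prediction - f_true[:, None]) ** 2).mean(axis=0)
    # population variance over datasets (divides by n_datasets), averaged over points
    variance = predictions.var(axis=1).mean(axis=0)
    return squared_bias, variance

# Sanity check on hand-built predictions with a known answer
f_true = np.array([0.0, 1.0, 2.0])
predictions = np.empty((3, 4, 2))
# model 0: always predicts f + 1 -> squared bias 1, variance 0
predictions[:, :, 0] = f_true[:, None] + 1.0
# model 1: predicts f - 1 or f + 1 equally often -> squared bias 0, variance 1
predictions[:, :, 1] = f_true[:, None] + np.array([-1.0, 1.0, -1.0, 1.0])

squared_bias, variance = bias_variance_from_predictions(predictions, f_true)
print(squared_bias, variance)
```

Assuming the arrays defined earlier in this section, it could be called as `bias_variance_from_predictions(train_predictions, f(X)[:Ntrain])`.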