#  Bias & Variance

When fitting a model to data, the error can be split into three components: noise, bias and variance.

* Noise is random and cannot be reduced (irreducible error).
* In contrast, bias and variance are reducible errors:
  * Bias represents a systematic offset from the true value, so it is linked to the concept of accuracy.
  * Variance represents a spread in the modelled values, so it is linked to the concept of precision.
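
For squared error, these three components can be written down explicitly. The following is a sketch under assumed notation: $g(x)$ is the true function, $f(x)$ is a model fitted to a randomly drawn training set, $y = g(x) + \epsilon$ is a new observation, and $\sigma^2$ is the variance of the zero-mean noise $\epsilon$, taken to be independent of the training set. With the expectation taken over training sets and noise, the expected error at a point $x$ is:

$$\mathbb{E}\left[(y - f(x))^2\right] = \underbrace{\left(\mathbb{E}[f(x)] - g(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$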

Because a model rarely fits the data exactly, it can end up either overfitted or underfitted.
To assess a model, we need to fit (train) it on one set of data and then test it on another, unseen set of data.
If the model is overfitted, it has been pushed to match the training data at every point (e.g. by minimising the sum of squared errors, SSE), which leads to a low bias error. However, when we apply the model to new, unseen data it is often a poor fit, because it has tried to model the random noise in the training data. The error on the test data is therefore high, and the model has a high variance.

Conversely, if the model is underfitted it has a poor fit (e.g. high SSE) on the training data, so a high bias, but a similarly poor fit on the test data, so the variance is low.
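
One way to see these two regimes numerically is to refit the same model many times on fresh noisy data and measure how far the average prediction sits from the truth (bias) and how much individual fits scatter around that average (variance). The sketch below is illustrative only and makes several assumptions: the x values are fixed and evenly spaced, only the noise is redrawn on each trial, NumPy's `Polynomial.fit` stands in for the scikit-learn pipeline used later in this notebook, and the helper name `bias_variance` and the degrees 1, 4 and 9 are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_fun(x, A=1.0, B=1.5, C=0.0):
    # same cosine form as the notebook's "true" function
    return A * np.cos(B * np.pi * x + C)


def bias_variance(degree, n_points=30, n_trials=200, noise=0.1, seed=0):
    """Estimate squared bias and variance of a polynomial fit (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)   # assumption: fixed, evenly spaced inputs
    target = true_fun(x)
    predictions = np.empty((n_trials, n_points))
    for t in range(n_trials):
        # redraw the noise, refit, and store the fitted values
        y = target + rng.normal(scale=noise, size=n_points)
        predictions[t] = Polynomial.fit(x, y, degree)(x)
    mean_prediction = predictions.mean(axis=0)
    bias_sq = np.mean((mean_prediction - target) ** 2)   # offset of the average fit
    variance = np.mean(predictions.var(axis=0))          # spread between fits
    return bias_sq, variance


results = {d: bias_variance(d) for d in (1, 4, 9)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.5f}, variance = {v:.5f}")
```

Under these assumptions the straight line (degree 1) should show the largest squared bias and the smallest variance, with the ordering reversed for the degree 9 fit.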

This is best demonstrated visually. The following example demonstrates over- and underfitting a model, and the bias/variance trade-off, where we need to find the best compromise.

In this example we are modelling a cosine function using a polynomial fit. The degree of the polynomial is incremented from 1 to 15. 

The degree of the polynomial here is the model's hyperparameter. 

The Root Mean Squared Error (RMSE) is calculated for each model on the training data and on an independent test data set to assess the goodness of fit.

#### Pseudocode:
<code>    X = generate N random points
    y = apply cosine to X and add random noise
    split X and y in half
    for degree of polynomial from 1 to 15:
        train polynomial model on training data
        plot model vs expected for given degree of polynomial
        calculate RMSE on training data
        calculate RMSE on test data
    plot training and test RMSE against degree of polynomial
</code>

#### Exercises:
1. What order of polynomial do you think gives the "best fit"?
2. Experiment with changing the frequency of the cosine (parameter <code>B</code>) in the input model.
3. Experiment with changing the size (<code>n_samples</code>) of the data set and the test/train <code>split</code>, and observe the effect on the error plots.
4. Experiment with increasing the irreducible error (<code>noise</code>) variable.
5. Extend <code>true_fun</code> to a more complex relationship, e.g. <code>A * np.cos(B * np.pi * X + C) + A/2 * np.cos(2*B * np.pi * X + C)</code>.

#### Maths
$$RMSE=\sqrt{\dfrac{\sum_{i=1}^{n}{(f(x_i) - y_i)^2}}{n}}$$
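
The formula maps directly onto a few NumPy operations. The following is a minimal sketch; the helper name `rmse` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np


def rmse(predicted, observed):
    """Root Mean Squared Error between model predictions and observations."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # mean of the squared residuals, then the square root
    return np.sqrt(np.mean((predicted - observed) ** 2))


# Hypothetical values: the residuals are 3 and 4, so RMSE = sqrt((9 + 16) / 2)
example = rmse([3.0, 4.0], [0.0, 0.0])
print(example)
```

Because of the square root, the RMSE has the same units as $y$, which makes it easier to compare against the noise level than the mean squared error.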

Input equation, where $\epsilon$ is random noise (here normally distributed; other noise distributions exist):
$$y = A\cos{(B\pi x + C)} + \epsilon$$


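Model used for fitting, a polynomial of degree $d$ (written with $d$ to avoid a clash with $n$, the number of samples in the RMSE):
$$f(x) = a_0 + a_1x + a_2x^2 + ... + a_{d-1}x^{d-1}+ a_dx^d$$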
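
To make the coefficients $a_0, a_1, \dots$ concrete, the sketch below fits a polynomial to noise-free data generated from a known quadratic and recovers its coefficients. It uses NumPy's `numpy.polynomial.polynomial.polyfit` instead of the scikit-learn pipeline used in the code cells, and the data and coefficient values are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical noise-free data from y = 1 + 2x + 3x^2
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Least-squares fit of a degree 2 polynomial;
# coefficients are returned from lowest to highest power: a_0, a_1, a_2
coeffs = P.polyfit(x, y, 2)

# Evaluate the fitted polynomial f(x) at the same points
fitted = P.polyval(x, coeffs)

print(np.round(coeffs, 6))
```

With noisy data the recovered coefficients would only approximate the generating values, and raising the degree gives the fit more freedom to follow that noise.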

This example is adapted from <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html">Scikit-learn examples: Underfitting vs. Overfitting</a>

In [None]:
# Imports
import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


In [None]:
# Define the "true" function
def true_fun(X, A=1, B=1.5, C=0):
    return A * np.cos(B * np.pi * X + C)

# random seed - ensures we have same model, change as desired
np.random.seed(0)

# define the size of the dataset - grow or shrink
n_samples = 60

# how to partition the test/train data set, 0.5 = 50% split.
split = 0.5 

# noise weight
noise = 0.1

# define the polynomial degrees to work through
degrees = range(1,16,1)
  
    

# create dataset and split
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * noise
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=split, random_state=None
)

# setup subplot grids
subplot_width = 5
# number of rows needed to hold one subplot per degree
subplot_height = int(np.ceil(len(degrees) / subplot_width))
plt.figure(figsize=(5*subplot_width, 5*subplot_height))

# setup arrays to store the errors
training_error = np.zeros(len(degrees))
test_error = np.zeros(len(degrees))

# iterate over polynomial degrees
for i in range(len(degrees)):
    ax = plt.subplot(subplot_height, subplot_width, i + 1)
    plt.setp(ax, xticks=(), yticks=())

    # create a model
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X_train[:, np.newaxis], y_train)

    # create subplot of fit for each polynomial degree
    X_show = np.linspace(0, 1, 100)
    plt.plot(X_show, pipeline.predict(X_show[:, np.newaxis]),color='firebrick',linewidth=3, label="Model")
    plt.plot(X_show, true_fun(X_show),color='darkgrey',linestyle='dashed', label="True function")
    plt.scatter(X_train, y_train, marker='o',color="black", s=20, label="Train samples")
    plt.scatter(X_test, y_test, marker='o',color="black",facecolors='none', s=20, label="Test samples")
    # formatting for plot
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}".format(degrees[i]))

    # calculate the Root Mean Squared Error for this polynomial degree
    # (square root of the MSE, which works across scikit-learn versions)
    training_error[i] = np.sqrt(
        mean_squared_error(y_train, pipeline.predict(X_train[:, np.newaxis]))
    )
    test_error[i] = np.sqrt(
        mean_squared_error(y_test, pipeline.predict(X_test[:, np.newaxis]))
    )
    
# show grid of fits
plt.show()

# Show plot of training vs test errors
plt.figure(figsize=(20, 7))
plt.plot(degrees,training_error,color='steelblue',linewidth=3, label='Training Error')
plt.plot(degrees,test_error,color='firebrick',linewidth=3,label='Test Error')
plt.xlabel("Degree of polynomial")
plt.ylabel("RMS Error")
plt.yscale('log')
plt.legend(loc="best")
plt.title("Root Mean Squared Error for training and test data, for polynomial fit to cosine function")
plt.show()