###### Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2021 Lorena A. Barba, Tingyu Wang

# Overfitting and Regularization

In [None]:
from matplotlib import pyplot as plt
from autograd import grad
import autograd.numpy as np

## Polynomial Regression

Let's generate some synthetic data from the polynomial $y = x^4 + x^3 - 4x^2$ with added noise.

In [None]:
# synthetic data
np.random.seed(5)
x = np.linspace(-3, 3, 20)
y = x**4 + x**3 - 4*x**2 + 8*np.random.normal(size=len(x))
plt.scatter(x, y);

Suppose now we are given only the data (not the function that generated them), and our goal is to fit a curve to the data. The nonlinear relationship between $x$ and $y$ suggests that a linear regression will fail. Intuitively, a nonlinear model should fit better, and among such models, regression with polynomial functions may be the first to come to mind.

Let's write this basic model as the $d$th-order polynomial of $x$, $x$ being the only feature here:

$$
\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d, 
$$

where $w$ denotes the weights. Keep in mind that in the model fitting context, the objective is always finding the optimal values of these weights given $x$ and $y$. When viewed from a different perspective, the model above is just a linear combination of the weights. In fact, by creating polynomial features of $x$, namely, letting $x_i = x^i$, the model becomes:

$$
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d,
$$

a carbon copy of the multiple linear regression (MLR) model. The matrix form is also $\hat{\mathbf{y}} = X\mathbf{w}$, and the only missing step is to build the matrix $X$ from the powers of $x$.

In [None]:
degree = 3

def polynomial_features(x, degree):
    X = np.empty((len(x), degree+1))
    for i in range(degree+1):
        X[:,i] = x**i
    return X

X = polynomial_features(x, degree)
print(X.shape)

Unsurprisingly, scikit-learn offers a counterpart: `PolynomialFeatures()`, which we will use later:
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree, include_bias=True)
X = poly.fit_transform(x.reshape(-1,1))
```
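
NumPy also offers a one-call equivalent. As a minimal sketch (the helper names `polynomial_features_loop` and `polynomial_features_vander` are hypothetical, chosen here for illustration), `np.vander` with `increasing=True` builds the same matrix of powers, which gives a quick way to check the loop-based version:

```python
import numpy as np

def polynomial_features_loop(x, degree):
    # Same construction as polynomial_features in the lesson: column i holds x**i.
    X = np.empty((len(x), degree + 1))
    for i in range(degree + 1):
        X[:, i] = x**i
    return X

def polynomial_features_vander(x, degree):
    # Hypothetical alternative: Vandermonde matrix with increasing powers.
    return np.vander(x, N=degree + 1, increasing=True)

x_demo = np.linspace(-3, 3, 20)
X_loop = polynomial_features_loop(x_demo, 3)
X_vander = polynomial_features_vander(x_demo, 3)
print(X_loop.shape, np.allclose(X_loop, X_vander))
```

Both versions return one row per sample and `degree + 1` columns, with the first column equal to 1.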

Next, recall that for MLR, we should normalize each feature to the same scale, except for the first column, because $x_0$ is set to 1 for all entries.

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
X_scaled[:,0] = 1   # the column for intercept
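
The reason for resetting the first column can be seen in a small sketch (the `_demo` names are illustrative and not part of the lesson): a constant column has zero range, so `MinMaxScaler` maps it to 0, which would remove the intercept from the model unless we put the ones back.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_demo = np.linspace(-3, 3, 20)
X_demo = np.vander(x_demo, N=4, increasing=True)   # columns: 1, x, x^2, x^3

scaler_demo = MinMaxScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# The constant column has zero range, so it is mapped to 0 rather than 1.
intercept_col_before = X_demo_scaled[:, 0].copy()
print(intercept_col_before.min(), intercept_col_before.max())

# The remaining columns now span [0, 1].
print(X_demo_scaled[:, 1:].min(axis=0), X_demo_scaled[:, 1:].max(axis=0))

X_demo_scaled[:, 0] = 1   # restore the intercept column
```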

Let's reuse the same model and loss functions,

In [None]:
def linear_regression(params, X):
    """
    The linear regression model in matrix form.
    """
    return np.dot(X, params)

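def mse_loss(params, model, X, y):
    """
    The squared error loss function.

    Note: np.sum already reduces the squared errors to a scalar, so the
    outer np.mean has no effect and the value returned is the *sum* of
    squared errors, i.e. N times the MSE. The minimizer is the same;
    only the scale of the loss and of its gradient differs.
    """
    y_pred = model(params, X)
    return np.mean( np.sum((y-y_pred)**2) )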

gradient = grad(mse_loss)

and train our model.

In [None]:
max_iter = 3000
alpha = 0.01
params = np.zeros(X_scaled.shape[1])
descent = np.ones(X_scaled.shape[1])
i = 0

from sklearn.metrics import mean_absolute_error

while np.linalg.norm(descent) > 0.01 and i < max_iter:
    descent = gradient(params, linear_regression, X_scaled, y)
    params = params - descent * alpha
    loss = mse_loss(params, linear_regression, X_scaled, y)
    mae = mean_absolute_error(y, X_scaled@params)
    if i%100 == 0:
        print(f"iteration {i:4}, {loss = :.3f}, {mae = :.3f}")
    i += 1
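
As a sanity check on the gradient-descent result, the least-squares weights can also be computed directly. The sketch below is self-contained (it rebuilds the lesson's data and scaled degree-3 features under `_demo` names, which are illustrative) and uses `np.linalg.lstsq`; it then evaluates the analytic gradient of the sum of squared errors, $2X^T(X\mathbf{w} - \mathbf{y})$, which should vanish at the solution. If the loop above has converged, `params` should be close to these weights; if it stopped at `max_iter` first, some difference is expected.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rebuild the lesson's data and scaled degree-3 features.
np.random.seed(5)
x_demo = np.linspace(-3, 3, 20)
y_demo = x_demo**4 + x_demo**3 - 4*x_demo**2 + 8*np.random.normal(size=len(x_demo))

X_demo = MinMaxScaler().fit_transform(np.vander(x_demo, N=4, increasing=True))
X_demo[:, 0] = 1   # intercept column

# Direct least-squares solution of X w = y (in the least-squares sense).
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Gradient of the sum of squared errors: 2 X^T (X w - y).
grad_at_solution = 2 * X_demo.T @ (X_demo @ w_lstsq - y_demo)
print(w_lstsq)
print(np.linalg.norm(grad_at_solution))
```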

To plot the fitted curve, we need to generate a grid `xgrid` along the $x$-axis and predict $y$ at these locations. But before multiplying by the weights `params` to make predictions, don't forget to repeat on `xgrid` the same procedure we used to prepare $X$: creating polynomial features and normalizing them with the scaler that was already fitted on the training data. Ponder this for a moment to get your mind around it.

In [None]:
xgrid = np.linspace(x.min(), x.max(), 30)
xgrid_poly_feat = polynomial_features(xgrid, degree)
xgrid_scaled = min_max_scaler.transform(xgrid_poly_feat)
xgrid_scaled[:,0] = 1 
plt.scatter(x, y, c='r', label='observed')
plt.plot(xgrid, xgrid_scaled@params, label='predicted');
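
The point about reusing the fitted scaler can be illustrated with a short sketch. The helper `make_features` and the `_demo` names are hypothetical; the idea is only that new points must go through `transform` (training min/max), not `fit_transform` (their own min/max).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def make_features(x, degree, scaler):
    """Hypothetical helper: polynomial features, scaled with a fitted scaler."""
    X = np.vander(x, N=degree + 1, increasing=True)
    X_scaled = scaler.transform(X)   # transform only: reuse training min/max
    X_scaled[:, 0] = 1               # intercept column
    return X_scaled

degree_demo = 3
x_train = np.linspace(-3, 3, 20)
scaler_demo = MinMaxScaler().fit(np.vander(x_train, N=degree_demo + 1, increasing=True))

# New locations covering only part of the training range.
x_new = np.linspace(0, 1, 5)
X_right = make_features(x_new, degree_demo, scaler_demo)

# Refitting the scaler on the new points stretches them to [0, 1] again,
# so the trained weights would no longer apply to these features.
X_wrong = MinMaxScaler().fit_transform(np.vander(x_new, N=degree_demo + 1, increasing=True))
X_wrong[:, 0] = 1

print(X_right[:, 1])   # (x + 3) / 6 under the training scaling
print(X_wrong[:, 1])   # the same x values rescaled to span [0, 1]
```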

## Observe underfitting & overfitting

In the model above, we picked a polynomial degree of $3$ arbitrarily. However, is it good enough, or complex enough, to model our dataset? In other words, should we try higher orders?

Let's repeat our study with degrees varying from 1 to 15, and plot these curves interactively using `ipywidgets`.
To facilitate our experiment, we use `make_pipeline` from scikit-learn to combine our 3-step workflow (creating polynomial features, feature scaling, and linear regression) into one step. The rest of the code is for plotting.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def polyreg_helper(degree, x, y):
    """
    The helper function to plot polynomial linear regression.
    
    Args:
        degree: Polynomial degree.
        x,y: Training data.
    """
    
    x_plot = np.linspace(x.min(), x.max(), 30).reshape(-1,1)
    
    linear = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                           MinMaxScaler(),
                           LinearRegression(fit_intercept=True))
    linear.fit(x.reshape(-1,1), y)
    mae = mean_absolute_error(y, linear.predict(x.reshape(-1,1)))
    fig = plt.figure(figsize=(10,6))
    ax = plt.subplot(111)
    ax.scatter(x, y, c='r', label='observed')
    ax.plot(x_plot, linear.predict(x_plot), label='predicted')
    ax.set_title(f"Polynomial degree = {degree:2}, MAE = {mae:.3f}", fontsize=16)
    ax.legend()
    plt.subplots_adjust(top=0.85)

Drag the slider below to visualize the fitted model using different polynomial orders.

In [None]:
from ipywidgets import widgets, fixed
degree_slider = widgets.IntSlider(min=1, max=15, step=1)
widgets.interact(polyreg_helper, degree=degree_slider, x=fixed(x), y=fixed(y));


### Underfitting

When `degree` is $1$, the straight line clearly fails to capture the underlying relationship between $x$ and $y$. Specifically, the model is too simple to capture the variation in the data, and is thus regarded as **underfitting**. With a degree of $2$, a quadratic curve still fails to "connect" some of the points.

Another term for underfitting is **high bias**. You can think of it this way: with a degree of $1$, we are biased toward assuming a linear relationship between $x$ and $y$.

> Question: Would having more training data help resolve underfitting?

### Overfitting

As we increase the polynomial degree, the training error (MAE in the figure title) keeps decreasing.
Does it mean the curve with the highest degree ($15$) fits the best?
Maybe, but probably not.
Drag the slider to the right, and you will find that the fitted curve passes exactly through many points and looks very strange.
If we are given new data, this model would be unable to generalize because it fits too closely to the old data.
At this point, the model is too flexible and it fits the noise rather than the true relationship.
In this case, the model is **overfitting**, or has **high variance**.
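
To make the contrast between training error and error on new data concrete, here is a minimal sketch. It reuses the lesson's training data and the same pipeline; the "new data" are an assumption of this example: 200 fresh samples drawn from the same generating function with an arbitrarily chosen seed. No particular numbers are claimed here; typically the training error keeps dropping with the degree while the error on fresh data does not.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.metrics import mean_squared_error

def true_function(x):
    return x**4 + x**3 - 4*x**2

# Training data, generated as in the lesson.
np.random.seed(5)
x_train = np.linspace(-3, 3, 20)
y_train = true_function(x_train) + 8*np.random.normal(size=len(x_train))

# Hypothetical "new data": same function and noise level, different samples.
rng = np.random.default_rng(0)   # assumed seed, for illustration only
x_fresh = rng.uniform(-3, 3, size=200)
y_fresh = true_function(x_fresh) + 8*rng.normal(size=x_fresh.size)

train_mse, fresh_mse = {}, {}
for degree_demo in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree_demo, include_bias=False),
                          MinMaxScaler(),
                          LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    train_mse[degree_demo] = mean_squared_error(
        y_train, model.predict(x_train.reshape(-1, 1)))
    fresh_mse[degree_demo] = mean_squared_error(
        y_fresh, model.predict(x_fresh.reshape(-1, 1)))
    print(f"degree {degree_demo:2}: train MSE = {train_mse[degree_demo]:.1f}, "
          f"fresh MSE = {fresh_mse[degree_demo]:.1f}")
```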

Overfitting usually happens when you have a large number of features (e.g., a high polynomial degree in this lesson) but you don't have enough data to constrain them. Common ways to avoid overfitting include:

- limit the number of features
- model selection
- regularization

The first method is problem-specific and usually requires you to decide which features to keep. Later in this module, we will discuss how to select from models by splitting our data into train and test sets. Now let's move on to regularization in the following section.

## Regularization

The idea of regularization is straightforward: keep all the features, but introduce a term in the loss function that penalizes complicated models. In our polynomial regression model:

$$
\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d,
$$

every polynomial term $w_j x^j$ with $j \geq 1$ adds to the overall complexity of the model, while the intercept $w_0$ only shifts the curve up or down.
A simple idea is to constrain the magnitudes of the weights $w_1, \ldots, w_d$.

The MSE loss can be written in matrix form, as a function of $\mathbf{w}$:

$$
L(\mathbf{w})=\frac{1}{N} {\lVert \mathbf{y} - X\mathbf{w} \rVert}^2 
$$

By adding a regularization term, the squared $L_2$-norm of the weights excluding the intercept, $\sum_{j=1}^d w_j^2$, to the loss, the new loss favors smaller weights. As the weights get closer to zero, the model tends to be simpler and is thus less likely to overfit. The regularized loss is:

$$
L(\mathbf{w})=\frac{1}{N} {\lVert \mathbf{y} - X\mathbf{w} \rVert}^2 + \lambda \sum_{j=1}^d w_j^2,
$$

where $\lambda$ denotes the regularization parameter that controls the tradeoff between fitting the data well (the first term) and keeping the model simple to avoid overfitting (the second term). The intercept $w_0$ is not penalized, because shrinking it would only pull the whole curve toward zero without making the model any simpler.

Let's code the regularized MSE loss.

**to do**:

- shall we mention the tradeoff between model error and bias error (variance-bias tradeoff)

In [None]:
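def regularized_loss(params, model, X, y, _lambda=1.0):
    """
    The squared error loss function with an L2 penalty.

    The intercept params[0] is left out of the penalty. As in mse_loss,
    np.sum makes the outer np.mean a no-op, so the data term is the sum
    of squared errors (N times the MSE).
    """
    y_pred = model(params, X)
    return np.mean( np.sum((y-y_pred)**2) ) + _lambda * np.sum( params[1:]**2 )

gradient = grad(regularized_loss)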

In [None]:
no_regularization_params = params.copy()

And train the model using gradient descent.

In [None]:
max_iter = 3000
alpha = 0.01
params = np.zeros(X_scaled.shape[1])
descent = np.ones(X_scaled.shape[1])
i = 0

from sklearn.metrics import mean_absolute_error

while np.linalg.norm(descent) > 0.01 and i < max_iter:
    descent = gradient(params, linear_regression, X_scaled, y)
    params = params - descent * alpha
    loss = mse_loss(params, linear_regression, X_scaled, y)
    mae = mean_absolute_error(y, X_scaled@params)
    if i%100 == 0:
        print(f"iteration {i:4}, {loss = :.3f}, {mae = :.3f}")
    i += 1
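
The regularized problem also has a closed-form minimizer, which gives a cross-check for the gradient-descent result. The sketch below is an illustration under stated assumptions: it uses the loss as coded in `regularized_loss` (sum of squared errors plus $\lambda \sum_{j \geq 1} w_j^2$), rebuilds the lesson's degree-3 features under `_demo` names, and introduces a diagonal matrix `D` with a zero in the intercept position so that $w_0$ is not penalized. Setting the gradient to zero gives $(X^T X + \lambda D)\,\mathbf{w} = X^T \mathbf{y}$. With the same value for its `alpha` parameter, scikit-learn's `Ridge` minimizes the same objective, so it serves as an independent check.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import MinMaxScaler

# Rebuild the lesson's data and scaled degree-3 features.
np.random.seed(5)
x_demo = np.linspace(-3, 3, 20)
y_demo = x_demo**4 + x_demo**3 - 4*x_demo**2 + 8*np.random.normal(size=len(x_demo))

X_demo = MinMaxScaler().fit_transform(np.vander(x_demo, N=4, increasing=True))
X_demo[:, 0] = 1   # intercept column

lam = 1.0
D = np.eye(X_demo.shape[1])
D[0, 0] = 0        # do not penalize the intercept w_0

# Setting the gradient of ||y - Xw||^2 + lam * w^T D w to zero gives
# (X^T X + lam * D) w = X^T y.
w_ridge = np.linalg.solve(X_demo.T @ X_demo + lam * D, X_demo.T @ y_demo)
w_ols = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Cross-check against scikit-learn, which fits the intercept separately.
sk_ridge = Ridge(alpha=lam).fit(X_demo[:, 1:], y_demo)

print("closed form :", w_ridge)
print("scikit-learn:", np.r_[sk_ridge.intercept_, sk_ridge.coef_])
print("unregularized:", w_ols)
```

The penalized weights `w_ridge[1:]` can never have a larger norm than the unregularized ones, since otherwise the unregularized solution would achieve a lower regularized loss.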

With `degree` set to $3$, let's compare the optimal weights before and after regularization. We can reuse the `xgrid` to plot both curves.

In [None]:
print("weights without regularization")
print(no_regularization_params)
print("weights with regularization")
print(params)

plt.scatter(x, y, c='r')
plt.plot(xgrid, xgrid_scaled@no_regularization_params, label='w/o regularization')
plt.plot(xgrid, xgrid_scaled@params, label='with regularization')
plt.legend();

There are many choices of regularization. The one we showed here uses the $L_2$-norm of the weights.

Again, let's plot both models with varying polynomial degree using `ipywidgets`.

In [None]:
def polyreg_regularization_helper(degree, x, y):
    """
    The helper function to plot polynomial linear regression with regularization.
    
    Args:
        degree: Polynomial degree.
        x,y: Training data.
    """
    from sklearn.linear_model import Ridge
    
    x_plot = np.linspace(x.min(), x.max(), 30).reshape(-1,1)
    
    linear = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                           MinMaxScaler(),
                           LinearRegression())
    linear.fit(x.reshape(-1,1), y)
    mae_linear = mean_absolute_error(y, linear.predict(x.reshape(-1,1)))
    
    ridge = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          MinMaxScaler(),
                          Ridge(alpha=1.0))
    ridge.fit(x.reshape(-1,1), y)
    mae_ridge = mean_absolute_error(y, ridge.predict(x.reshape(-1,1)))
    
    fig = plt.figure(figsize=(10,6))
    ax = plt.subplot(111)
    ax.scatter(x, y, c='r', label='observed')
    ax.plot(x_plot, linear.predict(x_plot), label='w/o regularization, predicted')
    ax.plot(x_plot, ridge.predict(x_plot), label='with regularization, predicted')
    ax.set_title(f"Poly degree = {degree:2}, MAE_linear = {mae_linear:.3f}, MAE_ridge = {mae_ridge:.3f}", fontsize=16)
    ax.legend()
    plt.subplots_adjust(top=0.85)

In [None]:
widgets.interact(polyreg_regularization_helper,
                 degree=degree_slider, x=fixed(x), y=fixed(y));
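
To see how the regularization parameter controls the tradeoff between fitting the data and keeping the weights small, one can sweep it at a fixed high degree. This is a minimal sketch: the list of values and the `_demo` names are arbitrary choices for illustration, and `alpha` in scikit-learn's `Ridge` plays the role of $\lambda$. As the penalty grows, the norm of the weights shrinks and the training error rises.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.metrics import mean_squared_error

# Training data, generated as in the lesson.
np.random.seed(5)
x_demo = np.linspace(-3, 3, 20)
y_demo = x_demo**4 + x_demo**3 - 4*x_demo**2 + 8*np.random.normal(size=len(x_demo))
X_in = x_demo.reshape(-1, 1)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]   # illustrative values only
weight_norms, train_mse = [], []
for alpha_demo in alphas:
    model = make_pipeline(PolynomialFeatures(15, include_bias=False),
                          MinMaxScaler(),
                          Ridge(alpha=alpha_demo))
    model.fit(X_in, y_demo)
    coef = model.named_steps["ridge"].coef_   # weights w_1 ... w_15
    weight_norms.append(np.linalg.norm(coef))
    train_mse.append(mean_squared_error(y_demo, model.predict(X_in)))
    print(f"alpha = {alpha_demo:6.3f}, ||w|| = {weight_norms[-1]:9.3f}, "
          f"train MSE = {train_mse[-1]:8.3f}")
```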

In [None]:
# Execute this cell to load the notebook's style sheet, then ignore it
from IPython.display import HTML
css_file = '../style/custom.css'
with open(css_file, "r") as f:
    css = f.read()
HTML(css)