# Regression: underfitting and overfitting

We will look at a simple example of fitting a polynomial function to a set of data points. A polynomial of degree $n$ is defined by its coefficients $a_0, \dots, a_n$ and can be written as: $y = \sum_{k=0}^n a_k x^k$.

The simplest non-constant polynomial, with a degree of $n=1$, is the linear function: $y = a_1x + a_0$. As another example, a third-degree ($n=3$) polynomial has the form $y = a_3x^3 + a_2x^2 + a_1x + a_0$.
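As a small illustration of the formula, the sketch below evaluates a third-degree polynomial with NumPy and checks it against the explicit sum. The coefficient values are made up for the example and are not used elsewhere in this notebook.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Illustrative coefficients in ascending order: a_0, a_1, a_2, a_3
coeffs = [1.0, 0.0, -2.0, 0.5]  # y = 1 - 2x^2 + 0.5x^3

x_values = np.array([0.0, 1.0, 2.0])
y_values = P.polyval(x_values, coeffs)

# The same result written out as the explicit sum over k
y_manual = sum(a_k * x_values**k for k, a_k in enumerate(coeffs))

print(y_values)
```

Note that `polyval` from `numpy.polynomial.polynomial` takes the coefficients in ascending order of power, which matches the index $k$ in the sum.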


Before going further, let's start with the necessary Python imports. Here we will be using the popular [scikit-learn](https://scikit-learn.org/stable/index.html) framework for machine learning.

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

%matplotlib inline

## Generate data points

First, we will create a set of points on a sine curve. We will add a bit of randomness, just to make it more fun.

In [None]:
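np.random.seed(42)  # fix the seed so the results are reproducible
N = 20  # number of data points

# N sorted x values drawn uniformly from [0, 1)
x = np.sort(np.random.rand(N))
# Sine curve plus Gaussian noise with standard deviation 0.1
y = np.sin(1.2 * x * np.pi) + 0.1 * np.random.randn(len(x))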

We create a helper function to plot the points.

In [None]:
def plot_curve(x, y):
    """Plot the data points as black dots on fixed axes."""
    plt.plot(x, y, 'ko')
    plt.ylim(-1, 1.5)
    plt.xlim(0, 1)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

In [None]:
plot_curve(x, y)

## Fitting a polynomial model

First, we will create another helper function that fits a given model to the data and plots its predictions for the whole range of $x$ values, drawn on top of the data points.

In [None]:
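def plot_curve_and_model(x, y, model):
    # scikit-learn expects a 2D feature array, hence the reshape
    model.fit(x.reshape(-1, 1), y)
    # Evaluate the fitted model on a dense grid covering [0, 1]
    x_plot = np.linspace(0, 1, 100)
    y_pred = model.predict(x_plot.reshape(-1, 1))

    plt.plot(x_plot, y_pred)
    plot_curve(x, y)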
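The models in the following sections are pipelines that combine `PolynomialFeatures` with a linear regressor. As a minimal sketch of what the first step does, the example below expands two made-up input values (`x_demo` is purely illustrative) into the powers $x^0, x^1, x^2, x^3$. The linear model then only has to learn one coefficient $a_k$ per column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative input values, reshaped into a column as scikit-learn expects
x_demo = np.array([0.5, 2.0])

# Each row becomes [1, x, x^2, x^3]
features = PolynomialFeatures(degree=3).fit_transform(x_demo.reshape(-1, 1))
print(features)
```

Fitting a polynomial is therefore still a linear regression problem, only on the expanded features rather than on $x$ alone.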


### Underfitting

First, we'll start with the simplest linear model, where the degree is 1. What can you say about the result?

In [None]:
model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())

plot_curve_and_model(x, y, model)

### Overfitting

Next, we will try a very complex degree-13 model. What happens?

In [None]:
model = make_pipeline(PolynomialFeatures(degree=13), LinearRegression())

plot_curve_and_model(x, y, model)
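One way to make the difference between the two fits concrete is to compare the error on the training points with the error on fresh points from the same curve. The sketch below is an illustration only: it regenerates the same training data, draws a hypothetical held-out set (`x_test`, `y_test`, with an arbitrarily chosen seed and size), and prints the mean squared error for degrees 1 and 13.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same training data as generated earlier in the notebook
np.random.seed(42)
N = 20
x = np.sort(np.random.rand(N))
y = np.sin(1.2 * x * np.pi) + 0.1 * np.random.randn(len(x))

# Hypothetical held-out points from the same noisy sine curve (an assumption)
rng = np.random.RandomState(0)
x_test = rng.rand(200)
y_test = np.sin(1.2 * x_test * np.pi) + 0.1 * rng.randn(len(x_test))

errors = {}
for degree in (1, 13):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x.reshape(-1, 1), y)
    train_mse = mean_squared_error(y, model.predict(x.reshape(-1, 1)))
    test_mse = mean_squared_error(y_test, model.predict(x_test.reshape(-1, 1)))
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  held-out MSE={test_mse:.4f}")
```

An underfit model tends to show a high error on both sets, while an overfit model tends to show a very low training error together with a noticeably higher held-out error.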

### Balanced model

Try to find a balanced model.  You can try:

- finding the right degree for the model
- using regularization, such as in [Ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
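To get a feeling for what regularization does to a high-degree fit, the sketch below compares the size of the learned coefficients with and without a Ridge penalty. It is a rough illustration under the same data-generating setup as above. The value `alpha=0.001` is just one possible choice, and the coefficient norm is only a crude summary of model complexity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same training data as generated earlier in the notebook
np.random.seed(42)
N = 20
x = np.sort(np.random.rand(N))
y = np.sin(1.2 * x * np.pi) + 0.1 * np.random.randn(len(x))
X = x.reshape(-1, 1)

plain = make_pipeline(PolynomialFeatures(degree=13), LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(degree=13), Ridge(alpha=0.001)).fit(X, y)

# The last pipeline step holds the fitted linear model and its coefficients
plain_norm = np.linalg.norm(plain[-1].coef_)
ridge_norm = np.linalg.norm(ridge[-1].coef_)

print(f"coefficient norm without regularization:  {plain_norm:.3g}")
print(f"coefficient norm with Ridge(alpha=0.001): {ridge_norm:.3g}")
```

Ridge adds a penalty on the squared size of the coefficients, so it favors smoother curves even when the polynomial degree is high.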

In [None]:
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
# Alternative: keep the high degree but regularize with Ridge
# model = make_pipeline(PolynomialFeatures(degree=13), Ridge(alpha=0.001))

plot_curve_and_model(x, y, model)
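Instead of judging the fit by eye, the degree can also be compared with cross-validation. The following is a minimal sketch, assuming 5 shuffled folds and mean squared error as the criterion. Both are arbitrary choices for illustration, and with only 20 points the estimates are noisy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same training data as generated earlier in the notebook
np.random.seed(42)
N = 20
x = np.sort(np.random.rand(N))
y = np.sin(1.2 * x * np.pi) + 0.1 * np.random.randn(len(x))
X = x.reshape(-1, 1)

# Shuffle the folds because x is sorted (fold count and seed are assumptions)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in range(1, 14):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()  # scikit-learn reports the negated MSE
    print(f"degree={degree:2d}  cross-validated MSE={cv_mse[degree]:.4f}")

best_degree = min(cv_mse, key=cv_mse.get)
print("degree with the lowest cross-validated MSE:", best_degree)
```

The same loop could be run over values of `alpha` for a `Ridge` model to compare regularization strengths.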