<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />

# Underfitting and Overfitting<a id="top"></a>

<i><small>Authors: Alberto Díaz Álvarez<br>Last update: 2023-04-09</small></i></div>

***

## Introduction

Underfitting and overfitting are two critical concepts to master in machine learning.

In deep learning they matter even more, especially overfitting, since the models are generally much more complex than those of traditional machine learning.

## Goals

We will explore both concepts with two simple but illustrative examples of underfitting (associated with high _bias_) and overfitting (associated with high _variance_), one in a regression problem and one in a classification problem.

## Libraries and configuration

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
import tensorflow as tf

We will also configure some parameters to adapt the graphic presentation.

In [None]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams.update({'figure.figsize': (20, 6),'figure.dpi': 64})

***

## Regression

_Underfitting_ occurs when the model is not complex enough to capture the relationships between input and output variables, resulting in low accuracy in predicting output values in both training and test data.

On the other hand, _overfitting_ occurs when the model is too complex and fits the training data too closely, resulting in high accuracy on the training data but low accuracy on the test data. This is because the model captures noise and spurious relationships present in the training data rather than the true underlying relationships.

To avoid these problems, it is necessary to find a balance between the complexity of the model and its ability to generalize to new and unseen data.

The problem we will work on is approximating noisy values sampled from a sine function (in real problems we do not know where the values come from, which is precisely the point of using models).

In [None]:
n_samples = 32
noise_factor = 0.1

f = lambda x: np.sin(2 * np.pi * x)
true_x = np.linspace(0, 1, 100)
x = np.sort(np.random.rand(n_samples))
y = f(x) + np.random.randn(n_samples) * noise_factor

plt.scatter(x, y, label='Sample')
plt.plot(true_x, f(true_x), label='Original', alpha=0.1)
plt.legend();
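
Before moving on to neural networks, the trade-off described above can be sketched with a simpler stand-in: polynomial regression of increasing degree on the same kind of noisy sine data. This is not the Keras model used in this notebook; the seed, the sample sizes, the degrees and the helper names (`sample`, `rmse`) are assumptions chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed (assumption) so the run is repeatable


def f(x):
    # Same underlying function as in the notebook
    return np.sin(2 * np.pi * x)


def sample(n, noise=0.1):
    # Noisy samples of f on [0, 1]
    x = np.sort(rng.random(n))
    return x, f(x) + rng.standard_normal(n) * noise


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


x_train, y_train = sample(16)   # small training set
x_val, y_val = sample(200)      # held-out points the fit never sees

errors = {}
for degree in (1, 3, 12):       # too simple, reasonable, very flexible
    poly = Polynomial.fit(x_train, y_train, degree, domain=[0, 1])
    errors[degree] = (rmse(y_train, poly(x_train)), rmse(y_val, poly(x_val)))
    print(f"degree {degree:2d}: train RMSE {errors[degree][0]:.3f}, "
          f"held-out RMSE {errors[degree][1]:.3f}")
```

The training error can only go down as the degree grows, so it is the held-out error that tells whether the extra flexibility is actually helping.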

### Underfitted model

We will start with a model so small that it is not able to capture all the detail of the examples and therefore not able to fit the values well:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(2, activation='relu', input_dim=1),
    tf.keras.layers.Dense(1, activation='tanh')
])
model.compile(loss='mae', optimizer='adam', metrics = [tf.keras.metrics.RootMeanSquaredError()])
model.summary()

history = model.fit(x, y, epochs=1000, batch_size=len(x), verbose=0)

Let's see how well the training turned out:

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

It can be seen that the error decreases, but it seems to stagnate at values that can be considered bad (the inputs lie in the $[0, 1]$ interval and the targets roughly in $[-1, 1]$).
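
One hedged way to judge whether such a plateau is really "bad" is to compare it with a trivial predictor that always outputs the mean target. The sketch below regenerates data like the notebook's with a fixed seed; the seed and the helper names (`mae`, `rmse`) are assumptions, not part of the original cells.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption)
x = np.sort(rng.random(32))
y = np.sin(2 * np.pi * x) + rng.standard_normal(32) * 0.1


def mae(y_true, y_pred):
    # Mean absolute error, the quantity minimised by loss='mae'
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    # Root mean squared error, the metric tracked during training
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Trivial baseline: ignore x and always predict the mean of the targets
baseline_pred = np.full_like(y, y.mean())
baseline_mae = mae(y, baseline_pred)
baseline_rmse = rmse(y, baseline_pred)
print(f"baseline MAE:  {baseline_mae:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

A model whose training error stays close to these baseline values has learned little more than the average of the data.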

Let's see the shape of the function with which our model describes the points:

In [None]:
plt.scatter(x, y, label='Sample')
plt.plot(true_x, f(true_x), label='Original', alpha=0.2)
plt.plot(true_x, model.predict(true_x), label='Predicted')
plt.legend();

Indeed, it has tried, but one cannot get something out of nothing: two hidden units are not enough to follow the sine curve.

### Overfitted model

Now we will go to the opposite extreme. What happens when our model is extremely large? It learns the training set so closely that it is not able to generalize, and when new examples arrive it fails to predict them. Let's train a large model:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1024, activation='relu', input_dim=1),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='tanh')
])
model.compile(loss='mae', optimizer='adam', metrics = [tf.keras.metrics.RootMeanSquaredError()])
model.summary()

history = model.fit(x, y, epochs=2500, batch_size=len(x), verbose=0)
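
To get a feel for how oversized this network is relative to the 32 training samples, the number of trainable parameters can be counted by hand. This is a minimal sketch assuming every `Dense` layer has a bias term (the Keras default); `dense_param_count` is a hypothetical helper, and `model.summary()` remains the authoritative source.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers.

    layer_sizes lists the input dimension followed by the units of each layer.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


small = [1, 2, 1]                           # the underfitted model
large = [1, 1024, 512, 256, 32, 16, 8, 1]   # the model defined above

print(dense_param_count(small))  # 7
print(dense_param_count(large))  # 667073
```

With hundreds of thousands of parameters and only a few dozen points, the network has more than enough capacity to memorize the noise in the samples.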

Let's see how the training went:

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

Well, it seems to have learned quite a bit: the training error ends up very low. The price is paid on data it has not seen: the model has learned the training set very well, but when new values arrive it has no idea where to fit them.
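
The training call above passes no validation data, so the gap between training and unseen data does not show up in the plot by itself. The sketch below illustrates that gap with a deliberately extreme stand-in for an overfitted model: a predictor that simply memorizes the training points and returns the target of the nearest one. It is not the Keras network; the seed and the helper names (`sample`, `rmse`, `predict_nearest`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption)


def f(x):
    return np.sin(2 * np.pi * x)


def sample(n, noise=0.1):
    x = np.sort(rng.random(n))
    return x, f(x) + rng.standard_normal(n) * noise


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


x_train, y_train = sample(32)   # what the model is allowed to see
x_val, y_val = sample(200)      # fresh points from the same process


def predict_nearest(x_query):
    # Memorizing predictor: return the target of the closest training input
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]


train_rmse = rmse(y_train, predict_nearest(x_train))
val_rmse = rmse(y_val, predict_nearest(x_val))
print(f"train RMSE: {train_rmse:.3f}")
print(f"validation RMSE: {val_rmse:.3f}")
```

With Keras, the same kind of held-out arrays can be passed as `validation_data=(x_val, y_val)` to `model.fit`, which adds `val_`-prefixed curves to `history.history`.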

Let's see what shape the function modeled by our network has:

In [None]:
plt.scatter(x, y, label='Sample')
plt.plot(true_x, f(true_x), label='Original', alpha=0.2)
plt.plot(true_x, model.predict(true_x), label='Predicted')
plt.legend();

Well, it has learned what it could, but the model is too large and is not able to generalize sufficiently.

A good exercise now would be to try to obtain a model that approximates this function as closely as possible from its example values. Admittedly, fitting a model to data whose generating function we already know is not a realistic scenario; the exercise is mostly a way to play with Keras.

## Classification

In classification problems, _underfitting_ and _overfitting_ are similar to what we saw in regression: models either fall short in capturing the relationships between input and output, or they overspecialize and are unable to generalize.

To avoid them, the remedy is the same: select a model with adequate complexity by studying the _loss_ trends in training and testing to evaluate its generalization capability.

In [None]:
colors = np.array(['red', 'blue'])
n_samples = 50

X, y = datasets.make_moons(n_samples=n_samples, noise=.5)

plt.scatter(X[:,0], X[:,1], label='Sample', color=colors[y])
plt.legend();
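
Studying the loss in both training and testing requires setting some samples aside first. A minimal sketch of a random hold-out split over sample indices follows; `split_indices`, the 30% test fraction and the seed are assumptions for illustration and are not used by the cells below.

```python
import numpy as np


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) as disjoint random index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)           # shuffled 0..n_samples-1
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


train_idx, test_idx = split_indices(50)          # 50 samples, as in this section
print(len(train_idx), len(test_idx))             # 35 15
```

The index arrays can then select rows, for example `X[train_idx], y[train_idx]` for training and `X[test_idx], y[test_idx]` for evaluation. scikit-learn's `sklearn.model_selection.train_test_split` offers the same idea with more options.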

### Underfitted model

Let's create a simple model to try to classify these two example sets.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(2, activation='relu', input_dim=2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['binary_accuracy'])
model.summary()

history = model.fit(X, y, epochs=1000, verbose=0)

After training, let's look at the trend of the training:

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

Wow, a similar case to the previous one. The training is stuck at a relatively high error. Let's see which regions the classifier has determined:

In [None]:
gridx1, gridx2 = np.meshgrid(np.linspace(-2.5, 2.5, 500), np.linspace(-2.5, 2.5, 500))
grid = np.c_[gridx1.flatten(), gridx2.flatten()]
probs = model.predict(grid, verbose=0)

plt.xlim(-2.5, 2.5)
plt.ylim(-2.5, 2.5)
plt.pcolor(gridx1, gridx2, probs.reshape(500, 500))
plt.colorbar()
plt.scatter(X[:,0], X[:,1], color=colors[y]);

Nearly a straight line separating the two sets, making quite a lot of mistakes.
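
To put numbers on "quite a lot of mistakes", the two quantities tracked during training can be computed by hand from predicted probabilities. The sketch below is a NumPy approximation of binary cross-entropy and binary accuracy, assuming a 0.5 decision threshold and a small clipping constant; the labels and probabilities are made up for illustration.

```python
import numpy as np


def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip to avoid log(0); eps is an assumption, not taken from the notebook
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def binary_accuracy(y_true, p, threshold=0.5):
    # Fraction of samples whose thresholded probability matches the label
    return float(np.mean((p > threshold) == (y_true == 1)))


y_true = np.array([0, 0, 1, 1, 1, 0])                # made-up labels
p_uninformed = np.full(len(y_true), 0.5)             # "I don't know" everywhere
p_model = np.array([0.2, 0.6, 0.7, 0.9, 0.4, 0.1])   # made-up model outputs

bce_uninformed = binary_crossentropy(y_true, p_uninformed)
bce_model = binary_crossentropy(y_true, p_model)
acc_model = binary_accuracy(y_true, p_model)
print(f"uninformed loss: {bce_uninformed:.4f}")      # ln(2), about 0.6931
print(f"model loss: {bce_model:.4f}, accuracy: {acc_model:.2f}")
```

A classifier that always answers 0.5 has a loss of $\ln 2 \approx 0.693$, so a training loss that stalls not far below that value indicates the model is extracting little information from the inputs.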

### Overfitted model

Now we will try just the opposite. Let's create a model so large that it ends up learning practically all the examples by heart:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1024, activation='relu', input_dim=2),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer="adam")
model.summary()

history = model.fit(X, y, epochs=1000, verbose=0)

Let's take a look at the evolution of the training.

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

We can observe the same problem as with regression: the model fits the training set very well but generalizes poorly to new data. Let us look at the regions it delimits in the input space to tell the two sets apart:

In [None]:
probs = model.predict(grid, verbose=0)

plt.xlim(-2.5, 2.5)
plt.ylim(-2.5, 2.5)
plt.pcolor(gridx1, gridx2, probs.reshape(500, 500))
plt.colorbar()
plt.scatter(X[:,0], X[:,1], color=colors[y]);

We can see that the decision regions are highly specialized, bending to classify each training example correctly.

As an exercise, it is worth playing with different architectures to try to reach a good classification. Besides building intuition about model training, this helps to better understand how `keras` works.

## Conclusions

To conclude, we have worked through two examples illustrating _underfitting_ and _overfitting_ in regression and classification problems, and how these issues impact the predictive ability of the models.

_Underfitting_ results in low prediction accuracy on training and test data, while _overfitting_ results in high accuracy on training data but low accuracy on test data. To avoid these problems, it is important to find a balance between the complexity of the model and its ability to generalize to new and unseen data.

Proper feature selection, cross-validation, hyperparameter tuning and regularization are useful techniques to avoid _overfitting_ and _underfitting_ and improve the predictive ability of the models. These concepts will be discussed later.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>