# Understanding Forgetting in Neural Networks with One Neuron

In the past few years, a number of papers have shown the impact of forgetting in neural networks that learn continually over time, without access to previously encountered data. However, many of the examples used to describe the phenomenon are quite complex and involve thousands, if not millions, of parameters.

In this brief notebook I'll try to build the simplest possible example of catastrophic forgetting in neural networks, with just **one neuron** and **two parameters** (a weight and a bias term), i.e. a linear regression.
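
To make the "one neuron, two parameters" claim concrete, here is a minimal sketch of a single neuron with an identity activation. The function name `neuron` and the numeric values are illustrative assumptions, not part of the original notebook. It also shows that prepending a column of ones to the input lets the bias be folded into a single parameter vector, which is the representation the rest of this notebook relies on.

```python
import numpy as np

def neuron(x, weight, bias):
    """A single neuron with identity activation: a straight line."""
    return weight * x + bias

# Illustrative inputs (living areas) and parameters, chosen arbitrarily.
x = np.array([1000.0, 1500.0, 2000.0])
weight, bias = 100.0, 5000.0

# Same computation in matrix form: a column of ones absorbs the bias,
# so the parameter vector is [bias, weight].
X = np.c_[np.ones_like(x), x]
W = np.array([bias, weight])

direct = neuron(x, weight, bias)
matrix_form = X @ W
print(direct)
print(matrix_form)
```

Both prints show the same predictions, which is why a linear regression and a one-neuron network are the same model here.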

We will build on top of the "*House Prices*" dataset and the example used in the famous Coursera "*Machine Learning*" course by Andrew Ng and we will:

1. Build a continual learning setting
2. Show ideal trained parameters for the linear regression model
3. Show the impact of forgetting when changing the data distribution



In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import rc
import unittest

%matplotlib inline
sns.set(style='whitegrid', palette='muted', font_scale=1.5)
rcParams['figure.figsize'] = 14, 8
rcParams['animation.embed_limit'] = 2**128

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [None]:
!wget https://raw.githubusercontent.com/Data-Science-FMI/ml-from-scratch-2019/master/data/house_prices_train.csv

This is a summary of the dataset we are going to use and some of its main attributes:

In [None]:
df_train = pd.read_csv('house_prices_train.csv')
df_train.describe()

In [None]:
df_train['SalePrice'].describe()

In [None]:
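# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df_train['SalePrice'], kde=True);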

Below we can see how the above-ground living area (`GrLivArea`, in square feet) correlates nicely with the house sale price:

In [None]:
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000), s=32);

Now, to create a continual learning setting, we split the dataset in two: we assume, for example, that the data comes in two distinct batches, the first with houses built up to and including 2000 and the second with the more recent houses.

In [None]:
df_new = df_train[df_train.YearBuilt > 2000]
df_old = df_train[df_train.YearBuilt <= 2000]

In [None]:
df_new.describe()

In [None]:
df_old.describe()

In [None]:
var = 'GrLivArea'
data = pd.concat([df_new['SalePrice'], df_new[var]], axis=1)  # take both columns from the same subset
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000), s=32, color="orange");

In [None]:
var = 'GrLivArea'
data = pd.concat([df_old['SalePrice'], df_old[var]], axis=1)  # take both columns from the same subset
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000), s=32, color="blue");

It makes sense that, on average, more recent houses are sold at higher prices.
Let us now train the linear regression model (our single neuron) on the entire training set to get the best parameters:

In [None]:
cumul_x = df_train['GrLivArea']
cumul_y = df_train['SalePrice']

# x = (x - x.mean()) / x.std()
cumul_x = np.c_[np.ones(cumul_x.shape[0]), cumul_x] 

cumul_x.shape

In [None]:
def loss(h, y):
  sq_error = (h - y)**2
  n = len(y)
  return 1.0 / (2*n) * sq_error.sum()
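
In symbols, the `loss` function above is the halved mean squared error of a linear model. The notation below ($w$ for the weight, $b$ for the bias, $\alpha$ for the learning rate, $n$ for the number of examples) is an assumption introduced here for readability:

$$
h(x) = w\,x + b, \qquad
J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \bigl(h(x_i) - y_i\bigr)^2
$$

Differentiating $J$ gives the gradient that batch gradient descent follows:

$$
\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \bigl(h(x_i) - y_i\bigr)\,x_i, \qquad
\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \bigl(h(x_i) - y_i\bigr)
$$

$$
w \leftarrow w - \alpha\,\frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}
$$

With a column of ones prepended to the inputs, both partial derivatives collapse into the single matrix expression $\frac{1}{n} X^{\top}(XW - y)$, where $W = [b, w]$.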

In [None]:
class LinearRegression:

  def __init__(self):
    
    self._W = np.zeros(2)
    self._cost_history = []
    self._w_history = [self._W.copy()]  # copy: _W is updated in place later
  
  def predict(self, X):

    return np.dot(X, self._W)
  
  def _gradient_descent_step(self, X, targets, lr):

    predictions = self.predict(X)
    
    error = predictions - targets
    gradient = np.dot(X.T,  error) / len(X)

    self._W -= lr * gradient
      
  def fit(self, X, y, n_iter=100000, lr=0.01):

    for i in range(n_iter):
      
        prediction = self.predict(X)
        cost = loss(prediction, y)
        
        self._cost_history.append(cost)
        
        self._gradient_descent_step(X, y, lr)
        
        self._w_history.append(self._W.copy())
        
    return self
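
A note on the `.copy()` calls used when recording the weight history: the update `self._W -= lr * gradient` modifies the array in place, so storing the array itself would store a reference that changes with every step. The snippet below is a small standalone illustration of that NumPy behavior (the variable names are made up for the example).

```python
import numpy as np

w = np.zeros(2)
aliased_history = [w]         # stores a reference to the same array
copied_history = [w.copy()]   # stores an independent snapshot

w -= np.array([1.0, 2.0])     # in-place update, like the gradient step

print(aliased_history[0])     # follows the update: [-1. -2.]
print(copied_history[0])      # keeps the original values: [0. 0.]
```

Without the snapshot, every entry of the history would end up showing the final weights, and the animation would draw the same line in every frame.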
      
        

In [None]:
cumul_clf = LinearRegression()
cumul_clf.fit(cumul_x, cumul_y, n_iter=150, lr=1e-7)

cumul_clf._W
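
The learning rate `lr=1e-7` may look surprisingly small. One likely reason is that the area feature is not standardized: for a quadratic loss, batch gradient descent is only stable when the learning rate is below 2 divided by the largest eigenvalue of `X.T @ X / n`, and that eigenvalue is roughly the mean squared area. The sketch below checks this on synthetic data (the areas and the slope of 100 are assumptions chosen to resemble the dataset's scale, not values taken from it).

```python
import numpy as np

# Synthetic stand-in for the unnormalized feature: areas in square feet.
area = np.linspace(500, 3000, 200)
X = np.c_[np.ones_like(area), area]
y = 100 * area  # noiseless synthetic prices

def half_mse(w):
    """Halved mean squared error, same form as the notebook's loss."""
    return float(((X @ w - y) ** 2).sum() / (2 * len(y)))

def run_gd(lr, n_iter=50):
    """Batch gradient descent from zeros; returns the final weights."""
    w = np.zeros(2)
    for _ in range(n_iter):
        w -= lr * (X.T @ (X @ w - y)) / len(y)
    return w

# Stability threshold for gradient descent on a quadratic loss.
largest_eig = np.linalg.eigvalsh(X.T @ X / len(y)).max()
lr_max = 2.0 / largest_eig

initial_loss = half_mse(np.zeros(2))
stable_loss = half_mse(run_gd(0.5 * lr_max))
unstable_loss = half_mse(run_gd(1.5 * lr_max))

print(f"lr_max        = {lr_max:.2e}")
print(f"initial loss  = {initial_loss:.3e}")
print(f"0.5 * lr_max -> {stable_loss:.3e}")
print(f"1.5 * lr_max -> {unstable_loss:.3e}")
```

On this synthetic data the threshold lands a little below `1e-6`, so a step size like `1e-7` is comfortably inside the stable range, while a slightly larger one makes the loss blow up. Standardizing the feature would allow much larger learning rates.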

In [None]:
plt.title('Cost Function J')
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
plt.plot(cumul_clf._cost_history)
plt.show()

In [None]:
#Animation
def animate(clf, set_x, set_y, frames=150):
    #Set the plot up,
    fig = plt.figure()
    ax = plt.axes()
    plt.title('Sale Price vs Living Area')
    plt.xlabel('Living Area in square feet')
    plt.ylabel('Sale Price ($)')
    if len(set_x) == 1:
        plt.scatter(set_x[0][:,1], set_y[0])
    else:
        plt.scatter(set_x[0][:,1], set_y[0], color="blue")
        plt.scatter(set_x[1][:,1], set_y[1], color="orange")
    line, = ax.plot([], [], lw=2, color='red')
    annotation = ax.text(200, 700000, '')
    # optimal
    x = np.linspace(0, 7000, 1000)
    y = cumul_clf._W[1]*x + cumul_clf._W[0]
    ax.plot(x, y, 'g--')
    annotation.set_animated(True)
    plt.close()

    #Generate the animation data,
    def init():
        line.set_data([], [])
        annotation.set_text('')
        return line, annotation

    # animation function.  This is called sequentially
    def animate(i):
        # x = np.linspace(-5, 20, 1000)
        x = np.linspace(0, 7000, 1000)
        y = clf._w_history[i][1]*x + clf._w_history[i][0]
        line.set_data(x, y)
        annotation.set_text(
            'Cost = %.2f e10\nWeight: %.2f\nBias: %.2f' % 
            (clf._cost_history[i]/1e10, clf._w_history[i][1],
             clf._w_history[i][0]))
        return line, annotation

    anim = animation.FuncAnimation(fig, animate, init_func=init,
                                frames=frames, interval=10, blit=True)

    rc('animation', html='jshtml')

    return anim

In [None]:
anim = animate(cumul_clf, [cumul_x], [cumul_y])
anim

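OK, so the best parameters found by gradient descent on the full dataset are a weight of about 118.07 (1.18069042e+02) and a bias of about 0.0994 (9.94290254e-02); note that `_W` stores the bias first and the weight second. This fit appears as a green dashed line in the following plots. Let's now move to the continual learning scenario.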
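
As a sanity check on parameters obtained by gradient descent, one option is to compare them with the exact least-squares solution, which NumPy provides through `np.linalg.lstsq`. The sketch below does this on synthetic data; the intercept of 20,000, the slope of 100 and the noise level are assumptions made up for the illustration, not properties of the House Prices dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: price = 20000 + 100 * area + noise (all values assumed).
area = np.linspace(500, 3000, 200)
price = 20_000 + 100 * area + rng.normal(0, 10_000, area.size)
X = np.c_[np.ones_like(area), area]

def half_mse(w):
    """Halved mean squared error, same form as the notebook's loss."""
    return float(((X @ w - price) ** 2).sum() / (2 * len(price)))

# Exact least-squares solution: [bias, weight].
w_exact, *_ = np.linalg.lstsq(X, price, rcond=None)

# The same number of steps and learning rate as used in this notebook.
w_gd = np.zeros(2)
for _ in range(150):
    w_gd -= 1e-7 * (X.T @ (X @ w_gd - price)) / len(price)

print("least squares:   ", w_exact, half_mse(w_exact))
print("gradient descent:", w_gd, half_mse(w_gd))
```

With an unnormalized feature and few iterations, gradient descent adjusts the weight quickly but barely moves the bias, so its loss can stay above the least-squares minimum. For the purpose of this notebook that is fine: what matters is comparing models trained with the same procedure on different data.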

In this case we will start with the first batch of data (the batch with all the old houses) and then, starting from the parameters learned at this step, we will try to model the data of the second batch as well (the newest houses).



In [None]:
x_old = df_old['GrLivArea']
y_old = df_old['SalePrice']

x_old = np.c_[np.ones(x_old.shape[0]), x_old] 

x_old.shape

In [None]:
x_new = df_new['GrLivArea']
y_new = df_new['SalePrice']

x_new = np.c_[np.ones(x_new.shape[0]), x_new] 

x_new.shape

In [None]:
cl_clf = LinearRegression()
cl_clf.fit(x_old, y_old, n_iter=150, lr=1e-7)

cl_clf._W

In [None]:
anim = animate(cl_clf, [x_old, x_new], [y_old, y_new])
anim

In the plot above we can see that the model is only fitting the old houses data, as we would expect. Let us now see what happens if we fine-tune the model on the newest houses batch!

In [None]:
# cl_clf = LinearRegression()
cl_clf.fit(x_new, y_new, n_iter=150, lr=1e-7)

cl_clf._W

In [None]:
plt.title('Cost Function J')
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
plt.plot(cl_clf._cost_history)
plt.show()

In [None]:
anim = animate(cl_clf, [x_old, x_new], [y_old, y_new], frames=300)
anim

So what we can see from the plot above is that, even though we start from the best solution found at the previous step, our weight and bias parameters are overwritten to suit only the data distribution of the newest houses.

Here, we are essentially "*forgetting*" how to correctly predict the price of houses built up to 2000 just to better predict the price of the houses built after 2000, even though (here's the point) a better and more general parametrization **does exist** and would have reduced the total prediction error.
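
The effect can be quantified by measuring the loss on the old batch before and after fine-tuning. The following is a minimal, self-contained sketch on synthetic data rather than the real dataset: the two batches are assumed to follow slopes of 100 and 150 dollars per square foot, and all names (`gradient_descent`, `half_mse`, and so on) are introduced only for this illustration.

```python
import numpy as np

def half_mse(X, y, w):
    """Halved mean squared error, same form as the notebook's loss."""
    return float(((X @ w - y) ** 2).sum() / (2 * len(y)))

def gradient_descent(X, y, w_init, n_iter=150, lr=1e-7):
    """Batch gradient descent starting from w_init; returns new weights."""
    w = w_init.copy()
    for _ in range(n_iter):
        w -= lr * (X.T @ (X @ w - y)) / len(y)
    return w

rng = np.random.default_rng(0)
area = np.linspace(500, 3000, 200)
X = np.c_[np.ones_like(area), area]
# Two synthetic batches with different price-per-area trends (assumed).
y_old = 100 * area + rng.normal(0, 10_000, area.size)
y_new = 150 * area + rng.normal(0, 10_000, area.size)

# Continual setting: train on the old batch, then fine-tune on the new one.
w_old = gradient_descent(X, y_old, np.zeros(2))
w_seq = gradient_descent(X, y_new, w_old)

# Reference: train once on both batches together.
X_all = np.vstack([X, X])
y_all = np.concatenate([y_old, y_new])
w_joint = gradient_descent(X_all, y_all, np.zeros(2))

loss_old_before = half_mse(X, y_old, w_old)
loss_old_after = half_mse(X, y_old, w_seq)
loss_all_seq = half_mse(X_all, y_all, w_seq)
loss_all_joint = half_mse(X_all, y_all, w_joint)

print(f"old-batch loss before fine-tuning: {loss_old_before:.3e}")
print(f"old-batch loss after fine-tuning:  {loss_old_after:.3e}")
print(f"total loss, sequential training:   {loss_all_seq:.3e}")
print(f"total loss, joint training:        {loss_all_joint:.3e}")
```

In this sketch the loss on the old batch rises sharply after fine-tuning, and the jointly trained model reaches a lower total loss than the sequentially trained one, mirroring what the animation shows on the real data.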

How to efficiently learn that best parametrization over time, without accessing previously encountered data, is one of the main focuses of Continual Learning.