# Illustrate Bias/Variance Tradeoff and Validation Exercise Using Financial Asset Prediction Example

In this notebook, we will illustrate two concepts for predictive modeling:



1.   Bias/Variance tradeoff
2.   The role of validation exercises

We will do this in a toy example where we try to predict a household's net financial assets using the household income. The data are drawn from the 1991 SIPP and have been used extensively in academic research starting from Poterba, Venti, and Wise (1994) "401(k) Plans and Tax-Deferred savings." We are just using them as a simple, readily accessible set of data to illustrate predictive modeling.
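
For reference, the tradeoff in the title is usually summarized by the following standard decomposition. It is stated here only as background, and its notation is an assumption that is not used elsewhere in this notebook. Suppose $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $\textrm{Var}(\varepsilon) = \sigma^2$, and let $\hat{f}$ be a prediction rule learned from a random training sample that is independent of $\varepsilon$. Then, at a fixed $x$,

$$E\left[(y - \hat{f}(x))^2\right] = \left(E[\hat{f}(x)] - f(x)\right)^2 + \textrm{Var}(\hat{f}(x)) + \sigma^2.$$

The first term is squared bias, the second is variance, and the third is noise that no rule can remove. Very simple rules (like a sample mean) tend to have high bias and low variance; very flexible rules (like KNN with 1 neighbor) tend to have the reverse.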



# Load some python libraries

We are going to do our analysis using Python. We need to make sure that libraries containing the tools we will use are loaded so we can actually make use of them.

We will essentially always use `numpy` and `pandas` (which contain a bunch of useful data manipulation and "algebra" tools). We will also generally use functions taken from `scikit-learn` (aka `sklearn`, the leading library of basic ML algorithms) and `matplotlib` (plotting functions). `seaborn` has additional data visualization tools that we will often use.

In the following code, we import `numpy`, `pandas`, the plotting libraries `matplotlib` and `seaborn`, a function for splitting data into training and test sets, a linear regression function, and a KNN regression function (which you can view just as a black-box prediction algorithm, though you can ask [ChatGPT](https://chatgpt.com/) to explain it to you).





In [1]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor


# Load, examine, and prepare the data

We need to load the data. We will do this directly from a github repository for the course.

In [None]:
file = "https://raw.githubusercontent.com/chansen776/MBA-ML-Course-Materials/main/Data/401k.csv"
data = pd.read_csv(file)

Let's take a look at what's in the data

In [None]:
data.describe()

In [None]:
data.columns

There are a lot of variables here. For our exercise, we're only going to use two of them (`inc` = income and `net_tfa` = net total financial assets). Let's look at those two variables.

In [None]:
data[['inc','net_tfa']].describe()

We have 9915 observations in the data. For our exercise, we're going to subset these into 6000 "training" observations and 3915 "validation" observations. (More on this later.) We're going to focus initially on the 6000 training observations.

In [None]:
# Splitting the data into one subset of train_size = 6000 observations
# with the other subset containing the remaining observations.

# We are splitting at random and want results to be replicable, so we set the
# state of the random number generator. One should generally do this. Of course,
# you don't want your results to depend on the specific arbitrary seed, so it's
# a good idea to try a few and make sure results aren't highly sensitive to it.

train, test = train_test_split(data, train_size=6000, random_state=7224)
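
To make the role of `random_state` concrete, here is a minimal sketch on a made-up 100-row data frame (the `toy` frame and its values are hypothetical, not the SIPP data). It checks that the two pieces have the requested sizes, do not overlap, and are identical when the same seed is reused.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, only used to illustrate the splitting mechanics
toy = pd.DataFrame({'inc': np.arange(100.0), 'net_tfa': np.arange(100.0)**2})

# Two splits with the same seed and the same requested training size
train_a, test_a = train_test_split(toy, train_size=60, random_state=7224)
train_b, test_b = train_test_split(toy, train_size=60, random_state=7224)

print('Training rows:', len(train_a), ' Test rows:', len(test_a))
print('No overlap:', len(set(train_a.index) & set(test_a.index)) == 0)
print('Same seed gives same split:', train_a.index.equals(train_b.index))
```

Changing `random_state` to a different integer would produce a different (but equally valid) split, which is what the stability check later in the notebook exploits.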


Finally, let's look at a scatter plot of the data we are going to use to build our model.

In [None]:
plot_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train, alpha = 0.5)
plot_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
plt.show()

# Let's fit some candidate predictive models

1. Sample mean (baseline)
2. Linear regression
3. KNN with 30 neighbors
4. KNN with 1 neighbor

In [None]:
# Sample mean as prediction rule

ytrmean = np.mean(train['net_tfa'])
print('Sample mean of net financial assets: {m:=.2f}'.format(m=ytrmean))

# Mean squared error
mse0 = np.mean((train['net_tfa'] - ytrmean)**2)
print('MSE of sample mean: {m:=.2f}'.format(m=mse0))

# Root mean squared error
rmse0 = np.sqrt(mse0)
print('RMSE of sample mean: {m:=.2f}'.format(m=rmse0))

# Mean absolute error
mae0 = np.mean(np.abs(train['net_tfa'] - ytrmean))
print('MAE of sample mean: {m:=.2f}'.format(m=mae0))

# Scatter plot with mean line drawn on
plot_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train, alpha = 0.5)
plot_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
plt.axhline(y=ytrmean, color = 'red')
plt.show()


Let's see how linear regression does. Recall that linear regression with one variable is just fitting a prediction rule

$$\widehat{\texttt{net_tfa}} = b_0 + b_1 \texttt{income}$$

Despite its simplicity, linear regression is an ML tool and is often very useful.
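
With a single input, the least squares coefficients have a simple closed form: $b_1$ is the sample covariance of input and outcome divided by the sample variance of the input, and $b_0 = \bar{y} - b_1 \bar{x}$. The sketch below checks this against `LinearRegression` on synthetic data (the data-generating numbers are arbitrary assumptions, not estimates from the SIPP data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): outcome is linear in x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 5.0 + 2.0 * x + rng.normal(0, 10, size=200)

# Closed-form least squares coefficients for one input
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Same fit via scikit-learn (which expects a 2-D input array)
lm = LinearRegression().fit(x.reshape(-1, 1), y)

print('Closed form: b0 = {:.4f}, b1 = {:.4f}'.format(b0, b1))
print('sklearn:     b0 = {:.4f}, b1 = {:.4f}'.format(lm.intercept_, lm.coef_[0]))
```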

In [None]:
# Linear regression as prediction rule

# Fit linear model using training data
# Define the model
lm_nettfa = LinearRegression()

# Fit the model on the training data
lm_nettfa.fit(train[['inc']], train['net_tfa'])

# Predict on the training data
lm_yhat = lm_nettfa.predict(train[['inc']])

# Mean squared error
mselm = np.mean((train['net_tfa'] - lm_yhat)**2)
print('MSE of linear model: {m:=.2f}'.format(m=mselm))

# Root mean squared error
rmselm = np.sqrt(mselm)
print('RMSE of linear model: {m:=.2f}'.format(m=rmselm))

# Mean absolute error
maelm = np.mean(np.abs(train['net_tfa'] - lm_yhat))
print('MAE of linear model: {m:=.2f}'.format(m=maelm))

# R^2 relative to baseline
r2lm = 1 - (mselm/mse0)
print('R^2 of linear model: {m:=.3f}'.format(m=r2lm))

# Scatter plot with linear regression model drawn on
plot_net_tfa = sns.regplot(x='inc', y='net_tfa', data=train,
                           scatter_kws = {'alpha': 0.25},
                           line_kws = {'color': 'red'}, ci = None)
plot_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
plt.show()

Let's try KNN with 30 neighbors

In [None]:
# KNN regression as prediction rule

# Define the model
knn30_nettfa = KNeighborsRegressor(n_neighbors=30)

# Fit the model on the training data
knn30_nettfa.fit(train[['inc']], train['net_tfa'])

# Predict on the training data
knn30_yhat = knn30_nettfa.predict(train[['inc']])

# Create data frame with KNN fits for plotting
knn_plot = pd.DataFrame({'net_tfa': train['net_tfa'], 'inc': train['inc'],
                         'fits': knn30_yhat})

# Mean squared error
mseknn30 = np.mean((train['net_tfa'] - knn30_yhat)**2)
print('MSE of KNN with 30 neighbors: {m:=.2f}'.format(m=mseknn30))

# Root mean squared error
rmseknn30 = np.sqrt(mseknn30)
print('RMSE of KNN with 30 neighbors: {m:=.2f}'.format(m=rmseknn30))

# Mean absolute error
maeknn30 = np.mean(np.abs(train['net_tfa'] - knn30_yhat))
print('MAE of KNN with 30 neighbors: {m:=.2f}'.format(m=maeknn30))

# R^2 relative to baseline
r2knn30 = 1 - (mseknn30/mse0)
print('R^2 of KNN with 30 neighbors: {m:=.3f}'.format(m=r2knn30))

# Scatter plot with KNN fit drawn on
plot_knn_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train,
                                   alpha = 0.5)
plot_knn_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
sns.lineplot(x='inc', y='fits', data=knn_plot, color='red')  # k-NN fit line
plt.show()
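
If you would rather not treat KNN as a black box, the following minimal sketch spells out the rule with a single input: to predict at a point, find the k training observations whose input is closest and average their outcomes. The helper `knn_predict` and the synthetic data are hypothetical illustrations; the sketch only checks that the hand-rolled rule matches `KNeighborsRegressor` with its default settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data (hypothetical), one continuous input
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 10, size=50)
y_train = np.sin(x_train) + rng.normal(0, 0.3, size=50)

def knn_predict(x0, x_train, y_train, k):
    """Average the outcomes of the k training points closest to x0."""
    dist = np.abs(x_train - x0)          # distance from x0 to each training input
    nearest = np.argsort(dist)[:k]       # positions of the k smallest distances
    return y_train[nearest].mean()

# Predict at two new input values, by hand and with scikit-learn
x_new = np.array([2.5, 7.0])
by_hand = np.array([knn_predict(x0, x_train, y_train, k=5) for x0 in x_new])

knn = KNeighborsRegressor(n_neighbors=5).fit(x_train.reshape(-1, 1), y_train)
by_sklearn = knn.predict(x_new.reshape(-1, 1))

print('By hand:', by_hand)
print('sklearn:', by_sklearn)
```

A small k lets the prediction follow individual observations closely; a large k averages over many observations and gives a smoother rule.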


As our final example, let's look at KNN with 1 neighbor.

In [None]:
# Define the model
knn1_nettfa = KNeighborsRegressor(n_neighbors=1)

# Fit the model on the training data
knn1_nettfa.fit(train[['inc']], train['net_tfa'])

# Predict on the training data
knn1_yhat = knn1_nettfa.predict(train[['inc']])

# Create data frame with KNN fits for plotting
knn_plot = pd.DataFrame({'net_tfa': train['net_tfa'], 'inc': train['inc'],
                         'fits': knn1_yhat})

# Mean squared error
mseknn1 = np.mean((train['net_tfa'] - knn1_yhat)**2)
print('MSE of KNN with 1 neighbor: {m:=.2f}'.format(m=mseknn1))

# Root mean squared error
rmseknn1 = np.sqrt(mseknn1)
print('RMSE of KNN with 1 neighbor: {m:=.2f}'.format(m=rmseknn1))

# Mean absolute error
maeknn1 = np.mean(np.abs(train['net_tfa'] - knn1_yhat))
print('MAE of KNN with 1 neighbor: {m:=.2f}'.format(m=maeknn1))

# R^2 relative to baseline
r2knn1 = 1 - (mseknn1/mse0)
print('R^2 of KNN with 1 neighbor: {m:=.3f}'.format(m=r2knn1))

# Scatter plot with KNN fit drawn on
plot_knn_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train,
                                   alpha = 0.5)
plot_knn_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
sns.lineplot(x='inc', y='fits', data=knn_plot, color='red')  # k-NN fit line
plt.show()

# Summary

As we allow the prediction rule to become more "complex," we capture the data used to learn the prediction rule better.

With a continuous input, we can find lots of rules that memorize the data. That is, they perfectly predict each outcome and have 0 loss **in the sample that was used to learn the prediction rule!**
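
The memorization point can be seen in a small simulation. This is a sketch on synthetic data (the `simulate` helper and the sine-plus-noise design are assumptions made for illustration): KNN with 1 neighbor reproduces every training outcome exactly when all training inputs are distinct, yet it is compared on a fresh sample drawn the same way.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

def simulate(n):
    """Draw n observations from a hypothetical noisy sine relationship."""
    x = rng.uniform(0, 10, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(0, 0.5, size=n)
    return x, y

x_tr, y_tr = simulate(300)     # sample used to learn the rules
x_new, y_new = simulate(300)   # fresh sample the rules never saw

results = {}
for k in [1, 30]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
    results[k] = {
        'in': np.mean((y_tr - knn.predict(x_tr))**2),     # in-sample MSE
        'out': np.mean((y_new - knn.predict(x_new))**2),  # out-of-sample MSE
    }
    print('k = {}: in-sample MSE = {:.3f}, out-of-sample MSE = {:.3f}'.format(
        k, results[k]['in'], results[k]['out']))
```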

# Question:

How do we decide whether the rule we have learned generalizes to predict **new** observations?






# Validation exercise

We actually try out our learned rules to predict new observations!

We have already fit our candidate prediction rules. Let's see how they do in
the test data that was not included as part of the training/learning/fitting
process.



In [None]:
# Sample mean as prediction rule

# Validation mean squared error
Vmse0 = np.mean((test['net_tfa'] - ytrmean)**2)
print('Validation MSE of sample mean: {m:=.2f}'.format(m=Vmse0))

# Validation root mean squared error
Vrmse0 = np.sqrt(Vmse0)
print('Validation RMSE of sample mean: {m:=.2f}'.format(m=Vrmse0))

# Validation mean absolute error
Vmae0 = np.mean(np.abs(test['net_tfa'] - ytrmean))
print('Validation MAE of sample mean: {m:=.2f}'.format(m=Vmae0))


In [None]:
# Linear model as prediction rule

# Predict on the validation data
lm_yhat = lm_nettfa.predict(test[['inc']])

# Validation mean squared error
Vmselm = np.mean((test['net_tfa'] - lm_yhat)**2)
print('Validation MSE of linear model: {m:=.2f}'.format(m=Vmselm))

# Validation root mean squared error
Vrmselm = np.sqrt(Vmselm)
print('Validation RMSE of linear model: {m:=.2f}'.format(m=Vrmselm))

# Validation mean absolute error
Vmaelm = np.mean(np.abs(test['net_tfa'] - lm_yhat))
print('Validation MAE of linear model: {m:=.2f}'.format(m=Vmaelm))

# Validation R^2 relative to baseline
Vr2lm = 1 - (Vmselm/Vmse0)
print('Validation R^2 of linear model: {m:=.3f}'.format(m=Vr2lm))

In [None]:
# KNN regression as prediction rule

# Predict on the validation data
knn30_yhat = knn30_nettfa.predict(test[['inc']])

# Validation mean squared error
Vmseknn30 = np.mean((test['net_tfa'] - knn30_yhat)**2)
print('Validation MSE of KNN with 30 neighbors: {m:=.2f}'.format(m=Vmseknn30))

# Validation root mean squared error
Vrmseknn30 = np.sqrt(Vmseknn30)
print('Validation RMSE of KNN with 30 neighbors: {m:=.2f}'.format(m=Vrmseknn30))

# Validation mean absolute error
Vmaeknn30 = np.mean(np.abs(test['net_tfa'] - knn30_yhat))
print('Validation MAE of KNN with 30 neighbors: {m:=.2f}'.format(m=Vmaeknn30))

# Validation R^2 relative to baseline
Vr2knn30 = 1 - (Vmseknn30/Vmse0)
print('Validation R^2 of KNN with 30 neighbors: {m:=.3f}'.format(m=Vr2knn30))

In [None]:
# Predict on the validation data
knn1_yhat = knn1_nettfa.predict(test[['inc']])

# Validation mean squared error
Vmseknn1 = np.mean((test['net_tfa'] - knn1_yhat)**2)
print('Validation MSE of KNN with 1 neighbor: {m:=.2f}'.format(m=Vmseknn1))

# Validation root mean squared error
Vrmseknn1 = np.sqrt(Vmseknn1)
print('Validation RMSE of KNN with 1 neighbor: {m:=.2f}'.format(m=Vrmseknn1))

# Validation mean absolute error
Vmaeknn1 = np.mean(np.abs(test['net_tfa'] - knn1_yhat))
print('Validation MAE of KNN with 1 neighbor: {m:=.2f}'.format(m=Vmaeknn1))

# Validation R^2 relative to baseline
Vr2knn1 = 1 - (Vmseknn1/Vmse0)
print('Validation R^2 of KNN with 1 neighbor: {m:=.3f}'.format(m=Vr2knn1))
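
The four validation cells above repeat the same calculations for each rule. One possible way to avoid the repetition is a small helper like the hypothetical `prediction_metrics` below, shown here on synthetic data rather than the notebook's `train` and `test` frames. Note that the baseline prediction is the training-sample mean, as in the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def prediction_metrics(y, yhat, ybaseline):
    """Return MSE, RMSE, MAE, and R^2 relative to a baseline prediction."""
    y = np.asarray(y, dtype=float)
    mse = np.mean((y - yhat)**2)
    mse_base = np.mean((y - ybaseline)**2)
    return {'MSE': mse,
            'RMSE': np.sqrt(mse),
            'MAE': np.mean(np.abs(y - yhat)),
            'R2': 1 - mse / mse_base}

# Synthetic data (hypothetical), split into training and validation parts
rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=(500, 1))
y = 10 + 3 * x[:, 0] + rng.normal(0, 20, size=500)
x_tr, y_tr, x_val, y_val = x[:300], y[:300], x[300:], y[300:]

# Fit on training data, evaluate on validation data
lm = LinearRegression().fit(x_tr, y_tr)
metrics = prediction_metrics(y_val, lm.predict(x_val), ybaseline=y_tr.mean())
for name, value in metrics.items():
    print('Validation {} of linear model: {:.3f}'.format(name, value))
```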

# Aside: Stability

We should probably make sure our results are not driven by an "unfortunate" train/test split. Let's try a few more times with random splits.

Because we are making random splits, we will get different results each time we run the following code.

Let's start by defining a function so we don't have to retype the same things over and over.

In [None]:
def check_robustness(data):

  # Create a new train and test split; no seed is set, so the split differs on each call
  train, test = train_test_split(data, train_size=6000)

  # Sample mean as prediction rule
  m0 = np.mean(train['net_tfa'])

  # Linear model as prediction rule
  lm0 = LinearRegression()
  lm0.fit(train[['inc']], train['net_tfa'])

  # KNN regression with 30 neighbors as prediction rule
  knn30 = KNeighborsRegressor(n_neighbors=30)
  knn30.fit(train[['inc']], train['net_tfa'])

  # KNN regression with 1 neighbor as prediction rule
  knn1 = KNeighborsRegressor(n_neighbors=1)
  knn1.fit(train[['inc']], train['net_tfa'])

  # Evaluate test data prediction performance
  rmse = [np.sqrt(np.mean((test['net_tfa'] - m0)**2)) ,
          np.sqrt(np.mean((test['net_tfa'] - lm0.predict(test[['inc']]))**2)) ,
          np.sqrt(np.mean((test['net_tfa'] - knn30.predict(test[['inc']]))**2)),
          np.sqrt(np.mean((test['net_tfa'] - knn1.predict(test[['inc']]))**2))]

  mae = [np.mean(np.abs(test['net_tfa'] - m0)) ,
         np.mean(np.abs(test['net_tfa'] - lm0.predict(test[['inc']]))) ,
         np.mean(np.abs(test['net_tfa'] - knn30.predict(test[['inc']]))) ,
         np.mean(np.abs(test['net_tfa'] - knn1.predict(test[['inc']]))) ]

  R2 = 1-(np.array(rmse)**2/np.mean((test['net_tfa'] - m0)**2))

  return rmse, mae, R2



Now let's repeat our exercise K = 10 times and look at the results.

In [None]:
rmse = pd.DataFrame(columns=['Mean', 'Linear', 'KNN30', 'KNN1'])
mae = pd.DataFrame(columns=['Mean', 'Linear', 'KNN30', 'KNN1'])
R2 = pd.DataFrame(columns=['Mean', 'Linear', 'KNN30', 'KNN1'])

for k in range(10):
  rmsek, maek, R2k = check_robustness(data)
  rmse.loc[k] = rmsek
  mae.loc[k] = maek
  R2.loc[k] = R2k


In [None]:
rmse

In [None]:
mae

In [None]:
R2

Broadly speaking, results are fairly consistent. Based on MSE, we bounce back and forth between linear and KNN30. KNN30 is pretty dominant on MAE.
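
One way to put numbers on "fairly consistent" is to summarize each rule's validation RMSE across splits by its mean and standard deviation. The sketch below does this on synthetic data using `ShuffleSplit`, which generates repeated random train/test splits and accepts a seed so the whole exercise is reproducible. The data and the choice of 300 training observations are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (hypothetical): a linear relationship plus noise
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=500)

# Ten random splits with 300 training rows each; the rest are held out
splits = ShuffleSplit(n_splits=10, train_size=300, random_state=0)

models = {'Linear': LinearRegression(),
          'KNN30': KNeighborsRegressor(n_neighbors=30),
          'KNN1': KNeighborsRegressor(n_neighbors=1)}

# One column per rule, one row per split; scores come back as negative MSE
rmse = pd.DataFrame({
    name: np.sqrt(-cross_val_score(model, X, y,
                                   scoring='neg_mean_squared_error', cv=splits))
    for name, model in models.items()})

summary = rmse.agg(['mean', 'std'])
print(summary)
```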

# Cross-validation

Finally, let's see what happens when we try $K$-fold cross-validation.

Recall that cross-validation is another way to structure a validation exercise. We partition the data into $K$ approximately equal sized subsets. We then treat each subset in turn as held out data for a validation exercise, using the remaining subsets to train the prediction rules. We repeat the whole exercise $K$ times, so that each subset of observations gets a turn to be the hold-out set.

`sklearn` has built-in functions to help with the cross-validation process that we're going to make use of.
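
To see what those functions do under the hood, here is a minimal sketch that runs the cross-validation loop by hand on synthetic data (the data are hypothetical) and checks that it gives the same fold-by-fold MSE as `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical)
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(100, 1))
y = 1 + 2 * X[:, 0] + rng.normal(0, 1, size=100)

# Partition the 100 observations into 5 folds of 20
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# By hand: each fold takes a turn as the hold-out set
fold_mse = []
for train_idx, holdout_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[holdout_idx] - model.predict(X[holdout_idx])
    fold_mse.append(np.mean(resid**2))
fold_mse = np.array(fold_mse)

# Same thing in one call (the scorer returns NEGATIVE MSE)
auto_mse = -cross_val_score(LinearRegression(), X, y,
                            scoring='neg_mean_squared_error', cv=cv)

print('By hand:        ', np.round(fold_mse, 4))
print('cross_val_score:', np.round(auto_mse, 4))
```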

In [None]:
# Packages for K-Fold cross-validation
from sklearn.model_selection import cross_val_score, KFold

# Let's get back our original train/test split
train, test = train_test_split(data, train_size=6000, random_state=7224)

# Generate sample partitions for 5-fold cross-validation
cv = KFold(n_splits=5, shuffle = True, random_state = 911)

# Look at CV performance of sample mean using MSE as the metric
# We're tricking the linear regression command to only return an intercept, which
# is equivalent to using the sample mean
mmse = cross_val_score(LinearRegression(), np.zeros_like(train[['inc']]), train['net_tfa'], scoring='neg_mean_squared_error', cv=cv)
print('CV RMSE: Mean')
print(np.sqrt(-mmse))
# We get NEGATIVE of MSE returned

# Look at CV performance of linear model using MSE as the metric
lmmse = cross_val_score(LinearRegression(), train[['inc']], train['net_tfa'], scoring='neg_mean_squared_error', cv=cv)
print('CV RMSE: Linear')
print(np.sqrt(-lmmse))
# We get NEGATIVE of MSE returned

# Look at CV performance of KNN30 using MSE as the metric
knn30mse = cross_val_score(KNeighborsRegressor(n_neighbors=30), train[['inc']], train['net_tfa'], scoring='neg_mean_squared_error', cv=cv)
print('CV RMSE: KNN30')
print(np.sqrt(-knn30mse))
# We get NEGATIVE of MSE returned

# Look at CV performance of KNN1 using MSE as the metric
knn1mse = cross_val_score(KNeighborsRegressor(n_neighbors=1), train[['inc']], train['net_tfa'], scoring='neg_mean_squared_error', cv=cv)
print('CV RMSE: KNN1')
print(np.sqrt(-knn1mse))
# We get NEGATIVE of MSE returned
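
As an alternative to the intercept-only trick used above for the sample mean, `sklearn` also provides `DummyRegressor`, which with `strategy='mean'` always predicts the training-sample mean. The sketch below, on synthetic data (hypothetical), checks that the two approaches give the same fold-level MSE. Either is fine; the dummy version just states the intent more directly.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical)
rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 2, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Trick: regress on a column of zeros, so only the intercept is fit
mse_trick = -cross_val_score(LinearRegression(), np.zeros_like(X), y,
                             scoring='neg_mean_squared_error', cv=cv)

# Explicit baseline: always predict the mean of the training folds
mse_dummy = -cross_val_score(DummyRegressor(strategy='mean'), X, y,
                             scoring='neg_mean_squared_error', cv=cv)

print('Intercept-only trick:', np.round(mse_trick, 4))
print('DummyRegressor:      ', np.round(mse_dummy, 4))
```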

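Let's aggregate the performance across all the folds. Above, `cross_val_score` returned the negative MSE for each fold. With 6000 training observations and 5 folds, each fold contains 1200 observations, so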

$$-MSE_k = -\frac{1}{1200}\sum_{i \in \textrm{Fold} \ k} (y_i - \hat{y}_i)^2.$$

We get overall MSE by just averaging $MSE_k$:

$$MSE = \frac{1}{5}\sum_{k=1}^{5} MSE_k.$$

We can also calculate overall CV $R^2$ relative to the baseline provided by using the sample mean in the usual way.
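
As a sanity check on these formulas, the sketch below uses synthetic data (hypothetical) to confirm that, when the folds are the same size, averaging the fold-level MSEs is the same as pooling all the held-out squared errors. It then computes a CV $R^2$ relative to the sample-mean baseline, using the same intercept-only trick as above. `cross_val_predict` returns, for each observation, the prediction made when that observation was held out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic data (hypothetical): 100 observations, so 5 folds of 20 each
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 1))
y = 1 + 2 * X[:, 0] + rng.normal(0, 1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Average of the five fold-level MSEs
fold_mse = -cross_val_score(LinearRegression(), X, y,
                            scoring='neg_mean_squared_error', cv=cv)
mse_avg = fold_mse.mean()

# Pooled MSE over all held-out predictions
yhat_cv = cross_val_predict(LinearRegression(), X, y, cv=cv)
mse_pooled = np.mean((y - yhat_cv)**2)

# Baseline: sample mean via the intercept-only trick, then CV R^2
mse_base = -cross_val_score(LinearRegression(), np.zeros_like(X), y,
                            scoring='neg_mean_squared_error', cv=cv).mean()
r2_cv = 1 - mse_avg / mse_base

print('Average of fold MSEs: {:.4f}'.format(mse_avg))
print('Pooled MSE:           {:.4f}'.format(mse_pooled))
print('CV R^2:               {:.3f}'.format(r2_cv))
```

With unequal fold sizes the two MSE numbers can differ slightly, because the simple average weights every fold equally.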

In [None]:
MSE_mean = np.mean(-mmse)
MSE_linear = np.mean(-lmmse)
MSE_knn30 = np.mean(-knn30mse)
MSE_knn1 = np.mean(-knn1mse)

print('RMSE mean: {m:=.2f}'.format(m=np.sqrt(MSE_mean)))
print('RMSE linear: {m:=.2f}'.format(m=np.sqrt(MSE_linear)))
print('RMSE knn30: {m:=.2f}'.format(m=np.sqrt(MSE_knn30)))
print('RMSE knn1: {m:=.2f}'.format(m=np.sqrt(MSE_knn1)))

R2_mean = 1-(MSE_mean/MSE_mean)
R2_linear = 1-(MSE_linear/MSE_mean)
R2_knn30 = 1-(MSE_knn30/MSE_mean)
R2_knn1 = 1-(MSE_knn1/MSE_mean)

print('R2 mean: {m:=.3f}'.format(m=R2_mean))
print('R2 linear: {m:=.3f}'.format(m=R2_linear))
print('R2 knn30: {m:=.3f}'.format(m=R2_knn30))
print('R2 knn1: {m:=.3f}'.format(m=R2_knn1))

This more or less aligns with our pure validation exercise. According to MSE, the linear model looks best, with KNN30 close behind.

Of course, given that we have a validation data set, we should double check (but we've already done that).

# Aside

If we want, we can verify that the folds did indeed split our 6000 available observations equally.

In [None]:
for i, (train_index, test_index) in enumerate(cv.split(train)):
    print(f"Fold {i}:")
    print(f"  Number of training observations: {len(train_index)}")
    print(f"  Number of test observations: {len(test_index)}")
