# Illustrate Bias/Variance Tradeoff and Validation Exercise Using Financial Asset Prediction Example

In this notebook, we will illustrate two concepts for predictive modeling:



1.   Bias/Variance tradeoff
2.   The role of validation exercises

We will do this in a toy example where we try to predict a household's net financial assets using the household income. The data are drawn from the 1991 SIPP and have been used extensively in academic research starting from Poterba, Venti, and Wise (1994) "401(k) Plans and Tax-Deferred savings." We are just using them as a simple, readily accessible set of data to illustrate predictive modeling.



# Load some Python libraries

We are going to do our analysis using Python. We need to make sure that libraries containing the tools we will use are loaded so we can actually make use of them.

We will essentially always use `numpy` and `pandas` (which contain a bunch of useful data manipulation and "algebra" tools). We will also generally use functions taken from `scikit-learn` (aka `sklearn`, the leading library of basic ML algorithms) and `matplotlib` (plotting functions). `seaborn` has additional data visualization tools that we will often use.

In the following code, we import `numpy`, `pandas`, plotting tools from `matplotlib` and `seaborn`, a function for splitting data into subsets, a linear regression function, and a KNN regression function (which you can view as a black-box prediction algorithm for now, though you can ask [ChatGPT](https://chatgpt.com/) to explain it to you).





In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
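
For readers who would rather not treat KNN entirely as a black box, here is a minimal sketch of the idea with a single input: to predict at a new point, average the outcomes of the `k` training observations whose inputs are closest. The function name `knn_predict_1d` and the toy numbers are made up for illustration. `KNeighborsRegressor` with its default uniform weights performs the same kind of averaging, though its implementation is more efficient and may break ties differently.

```python
import numpy as np

def knn_predict_1d(x_train, y_train, x_new, k):
    """Predict at x_new by averaging y over the k nearest training x values."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Distance from the new point to every training point
    dist = np.abs(x_train - x_new)
    # Indices of the k smallest distances (stable sort so ties keep input order)
    nearest = np.argsort(dist, kind="stable")[:k]
    return y_train[nearest].mean()

# Toy data (hypothetical numbers, not taken from the SIPP data)
x_toy = [1.0, 2.0, 3.0, 10.0, 11.0]
y_toy = [10.0, 20.0, 30.0, 100.0, 110.0]

# With k=1 the prediction copies the single closest outcome;
# with k=3 it averages the three closest outcomes.
pred_k1 = knn_predict_1d(x_toy, y_toy, x_new=3.4, k=1)
pred_k3 = knn_predict_1d(x_toy, y_toy, x_new=3.4, k=3)
print('k=1 prediction:', pred_k1)
print('k=3 prediction:', pred_k3)
```

A small `k` follows the nearby data very closely, while a large `k` smooths over more observations.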


# Load, examine, and prepare the data

We need to load the data. We will do this directly from a GitHub repository for the course.

In [None]:
file = "https://raw.githubusercontent.com/chansen776/MBA-ML-Course-Materials/main/Data/401k.csv"
data = pd.read_csv(file)

Let's take a look at what's in the data

In [None]:
data.describe()

In [None]:
data.columns

There are a lot of variables here. For our exercise, we're only going to use two of them (`inc` = income and `net_tfa` = net total financial assets). Let's look at those two variables.

In [None]:
data[['inc','net_tfa']].describe()

We have 9915 observations in the data. For our exercise, we're going to subset these into 6000 "training" observations and 3915 "validation" observations. (More on this later.) We're going to focus initially on the 6000 training observations.

In [None]:
# Splitting the data into one subset of train_size = 6000 observations
# with the other subset containing the remaining observations.

# We are splitting at random and want results to be replicable, so we set the
# state of the random number generator. One should generally do this. Of course,
# you don't want your results to depend on the specific arbitrary state, so
# it's a good idea to try a few and make sure results aren't highly sensitive.

train, test = train_test_split(data, train_size=6000, random_state=7224)
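
The comment above suggests trying a few random states. A minimal sketch of such a check follows. It uses simulated data so that it runs on its own; the names `sim`, `x`, `y`, the sample sizes, and the seed values are all made up for illustration. The same loop could be applied to `data` with any summary or fitted model of interest.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Simulated stand-in for the real data (assumption: 1000 rows, one input)
rng = np.random.default_rng(0)
sim = pd.DataFrame({'x': rng.uniform(0, 10, size=1000)})
sim['y'] = 2.0 * sim['x'] + rng.normal(0, 1, size=1000)

# Re-split with several arbitrary seeds and record a summary of each training set
train_means = {}
for seed in [1, 2, 3, 4, 5]:
    tr, te = train_test_split(sim, train_size=600, random_state=seed)
    train_means[seed] = tr['y'].mean()
    print('seed {s}: training mean of y = {m:.3f} (n_train = {a}, n_test = {b})'.format(
        s=seed, m=train_means[seed], a=len(tr), b=len(te)))

# Spread of the summary across seeds; a small spread suggests low sensitivity
spread = max(train_means.values()) - min(train_means.values())
print('Range of training means across seeds: {r:.3f}'.format(r=spread))
```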


Finally, let's look at a scatter plot of the data we are going to use to build our model.

In [None]:
plot_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train, alpha = 0.5)
plot_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
plt.show()

# Let's fit some candidate predictive models

1. Sample mean (baseline)
2. Linear regression
3. KNN with 30 neighbors
4. KNN with 1 neighbor

In [None]:
# Sample mean as prediction rule

ytrmean = np.mean(train['net_tfa'])
print('Sample mean of net financial assets: {m:=.2f}'.format(m=ytrmean))

# Mean squared error
mse0 = np.mean((train['net_tfa'] - ytrmean)**2)
print('MSE of sample mean: {m:=.2f}'.format(m=mse0))

# Root mean squared error
rmse0 = np.sqrt(mse0)
print('RMSE of sample mean: {m:=.2f}'.format(m=rmse0))

# Mean absolute error
mae0 = np.mean(np.abs(train['net_tfa'] - ytrmean))
print('MAE of sample mean: {m:=.2f}'.format(m=mae0))

# Scatter plot with mean line drawn on
plot_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train, alpha = 0.5)
plot_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
plt.axhline(y=ytrmean, color = 'red')
plt.show()
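
The same three loss measures are computed for every model below. As a sketch, they could be wrapped in a small helper. `prediction_errors` is a hypothetical name, not a function used elsewhere in this notebook, and the toy numbers are made up so the arithmetic can be checked by hand.

```python
import numpy as np

def prediction_errors(y, yhat):
    """Return MSE, RMSE, and MAE of predictions yhat for outcomes y."""
    resid = np.asarray(y, dtype=float) - np.asarray(yhat, dtype=float)
    mse = np.mean(resid**2)
    return {'mse': mse, 'rmse': np.sqrt(mse), 'mae': np.mean(np.abs(resid))}

# Toy check: outcomes 1, 2, 6 predicted by their sample mean, which is 3
y_toy = np.array([1.0, 2.0, 6.0])
errs = prediction_errors(y_toy, np.full(3, y_toy.mean()))

# Residuals are -2, -1, 3, so MSE = (4 + 1 + 9)/3 and MAE = (2 + 1 + 3)/3
print('MSE:  {m:.4f}'.format(m=errs['mse']))
print('RMSE: {m:.4f}'.format(m=errs['rmse']))
print('MAE:  {m:.4f}'.format(m=errs['mae']))
```

RMSE and MAE are in the same units as the outcome, which makes them easier to interpret than MSE. Because MSE squares the errors, it puts more weight on large misses than MAE does.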


Let's see how linear regression does. Recall that linear regression with one variable is just fitting a prediction rule

$$\widehat{\texttt{net_tfa}} = b_0 + b_1 \texttt{income}$$

Despite its simplicity, linear regression is an ML tool and is often very useful.

In [None]:
# Linear regression as prediction rule

# Fit linear model using training data
# Define the model
lm_nettfa = LinearRegression()

# Fit the model on the training data
lm_nettfa.fit(train[['inc']], train['net_tfa'])

# Predict on the training data
lm_yhat = lm_nettfa.predict(train[['inc']])

# Mean squared error
mselm = np.mean((train['net_tfa'] - lm_yhat)**2)
print('MSE of linear model: {m:=.2f}'.format(m=mselm))

# Root mean squared error
rmselm = np.sqrt(mselm)
print('RMSE of linear model: {m:=.2f}'.format(m=rmselm))

# Mean absolute error
maelm = np.mean(np.abs(train['net_tfa'] - lm_yhat))
print('MAE of linear model: {m:=.2f}'.format(m=maelm))

# R^2 relative to baseline
r2lm = 1 - (mselm/mse0)
print('R^2 of linear model: {m:=.3f}'.format(m=r2lm))

# Scatter plot with linear regression model drawn on
plot_net_tfa = sns.regplot(x='inc', y='net_tfa', data=train,
                           scatter_kws = {'alpha': 0.25},
                           line_kws = {'color': 'red'}, ci = None)
plot_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
plt.show()
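
With a single input, the least squares coefficients have a simple closed form: $b_1$ is the sample covariance of the input and the outcome divided by the sample variance of the input, and $b_0 = \bar{y} - b_1 \bar{x}$. The sketch below checks this on simulated data (made-up numbers, not the SIPP data) against `np.polyfit`, which also fits by least squares. Applying the same formula to `train` should reproduce `lm_nettfa.coef_` and `lm_nettfa.intercept_` up to rounding.

```python
import numpy as np

# Simulated data (assumption: a noisy linear relationship)
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=500)
y = 5.0 + 0.8 * x + rng.normal(0, 10, size=500)

# Closed-form least squares coefficients for one input
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Compare with a degree-1 polynomial least squares fit
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, deg=1)
print('closed form: b0 = {a:.4f}, b1 = {b:.4f}'.format(a=b0, b=b1))
print('np.polyfit:  b0 = {a:.4f}, b1 = {b:.4f}'.format(a=intercept, b=slope))
```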

Let's try KNN with 30 neighbors

In [None]:
# KNN regression as prediction rule

# Define the model
knn30_nettfa = KNeighborsRegressor(n_neighbors=30)

# Fit the model on the training data
knn30_nettfa.fit(train[['inc']], train['net_tfa'])

# Predict on the training data
knn30_yhat = knn30_nettfa.predict(train[['inc']])

# Create data frame with KNN fits for plotting
knn_plot = pd.DataFrame({'net_tfa': train['net_tfa'], 'inc': train['inc'],
                         'fits': knn30_yhat})

# Mean squared error
mseknn30 = np.mean((train['net_tfa'] - knn30_yhat)**2)
print('MSE of KNN with 30 neighbors: {m:=.2f}'.format(m=mseknn30))

# Root mean squared error
rmseknn30 = np.sqrt(mseknn30)
print('RMSE of KNN with 30 neighbors: {m:=.2f}'.format(m=rmseknn30))

# Mean absolute error
maeknn30 = np.mean(np.abs(train['net_tfa'] - knn30_yhat))
print('MAE of KNN with 30 neighbors: {m:=.2f}'.format(m=maeknn30))

# R^2 relative to baseline
r2knn30 = 1 - (mseknn30/mse0)
print('R^2 of KNN with 30 neighbors: {m:=.3f}'.format(m=r2knn30))

# Scatter plot with KNN fit drawn on
plot_knn_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train,
                                   alpha = 0.5)
plot_knn_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
sns.lineplot(x='inc', y='fits', data=knn_plot, color='red')  # k-NN fit line
plt.show()


As our final example, let's look at KNN with 1 neighbor.

In [None]:
# Define the model
knn1_nettfa = KNeighborsRegressor(n_neighbors=1)

# Fit the model on the training data
knn1_nettfa.fit(train[['inc']], train['net_tfa'])

# Predict on the training data
knn1_yhat = knn1_nettfa.predict(train[['inc']])

# Create data frame with KNN fits for plotting
knn_plot = pd.DataFrame({'net_tfa': train['net_tfa'], 'inc': train['inc'],
                         'fits': knn1_yhat})

# Mean squared error
mseknn1 = np.mean((train['net_tfa'] - knn1_yhat)**2)
print('MSE of KNN with 1 neighbor: {m:=.2f}'.format(m=mseknn1))

# Root mean squared error
rmseknn1 = np.sqrt(mseknn1)
print('RMSE of KNN with 1 neighbor: {m:=.2f}'.format(m=rmseknn1))

# Mean absolute error
maeknn1 = np.mean(np.abs(train['net_tfa'] - knn1_yhat))
print('MAE of KNN with 1 neighbor: {m:=.2f}'.format(m=maeknn1))

# R^2 relative to baseline
r2knn1 = 1 - (mseknn1/mse0)
print('R^2 of KNN with 1 neighbor: {m:=.3f}'.format(m=r2knn1))

# Scatter plot with KNN fit drawn on
plot_knn_net_tfa = sns.scatterplot(x='inc', y='net_tfa', data=train,
                                   alpha = 0.5)
plot_knn_net_tfa.set(xlabel='Income', ylabel='Net Financial Assets')
sns.lineplot(x='inc', y='fits', data=knn_plot, color='red')  # k-NN fit line
plt.show()

# Summary

As we allow the prediction rule to become more "complex," it fits the data used to learn it more and more closely.

With a continuous input, we can find lots of rules that memorize the data. That is, they perfectly predict each outcome and have 0 loss **in the sample that was used to learn the prediction rule!**
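
The pattern can be reproduced in a small simulation. The sketch below uses made-up data (a noisy sine curve, not the SIPP data) and arbitrary values of `k` to show the in-sample MSE of KNN as the rule becomes more flexible. One caveat: if several observations share exactly the same input value, a 1-nearest-neighbor rule cannot fit all of them, so the training loss need not be exactly zero. Here the simulated inputs are distinct.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Simulated data with a continuous input, so the x values are distinct
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(0, 0.5, size=300)

# In-sample MSE for progressively more flexible KNN rules (smaller k)
train_mse = {}
for k in [100, 30, 5, 1]:
    fit = KNeighborsRegressor(n_neighbors=k).fit(x, y)
    train_mse[k] = np.mean((y - fit.predict(x))**2)
    print('k = {k:3d}: in-sample MSE = {m:.4f}'.format(k=k, m=train_mse[k]))
```

With `k = 1`, each training point is its own nearest neighbor, so the rule simply returns the outcome it was trained on.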

# Question:

How do we decide whether the rule we have learned generalizes to predict **new** observations?




