#Model Selection

*Intro to Statistical Learning*: Chapter 6

*Ethical Algorithm*: Chapter 2

#Dataset:

To keep the focus on model selection, we reuse the data set and the modeling steps from the linear model tutorial.

Dataset on Kaggle: https://www.kaggle.com/datasets/saravananselvamohan/freddie-mac-singlefamily-loanlevel-dataset

Freddie Mac Single Family Loan-Level Dataset: https://www.freddiemac.com/research/datasets/sf-loanlevel-dataset

Description of data fields: https://www.freddiemac.com/fmac-resources/research/pdf/file_layout.xlsx

The Federal Home Loan Mortgage Corporation, commonly known as Freddie Mac, is a publicly traded, government-sponsored enterprise, headquartered in Tysons Corner, Virginia. The FHLMC was created in 1970 to expand the secondary market for mortgages in the US: https://en.wikipedia.org/wiki/Freddie_Mac

The Federal National Mortgage Association, commonly known as Fannie Mae, is a United States government-sponsored enterprise and, since 1968, a publicly traded company: https://en.wikipedia.org/wiki/Fannie_Mae

The primary difference between Freddie Mac and Fannie Mae is where they source their mortgages. Fannie Mae buys mortgages mainly from larger commercial banks, while Freddie Mac buys them mainly from smaller banks.

Dataset: Home Mortgage Disclosure Act, National Loan Applications for 2020: https://ffiec.cfpb.gov/data-publication/dynamic-national-loan-level-dataset/2020

Data field definitions/values: https://ffiec.cfpb.gov/documentation/2020/lar-data-fields/

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import pandas as pd
import math
import sklearn
from sklearn import metrics
import seaborn as sns
import researchpy as rp


In [2]:
#First, we quickly rerun the code from the linear regression tutorial.
#We already explored the data in that tutorial, so we do not repeat that here.
#load file as a dataframe
df = pd.read_csv('loan_level_500k.csv')
#encode the columns that include text as numerical categories
from sklearn.preprocessing import LabelEncoder

#create an instance of label encoder
labelencoder = LabelEncoder()

#encode the first-time homebuyer flag column as integer labels
df["First Time Homebuyer Flag_N"] = labelencoder.fit_transform(df["FIRST_TIME_HOMEBUYER_FLAG"])

#identify which columns are our predictors and which is our target - added loan purpose and postal code

cols= ['ORIGINAL_INTEREST_RATE', 'CREDIT_SCORE', 'First Time Homebuyer Flag_N','ORIGINAL_DEBT_TO_INCOME_RATIO', 'NUMBER_OF_BORROWERS','POSTAL_CODE','LOAN_PURPOSE']
selected_df = df[cols]
selected_df = selected_df.dropna()

feature_cols= ['CREDIT_SCORE', 'First Time Homebuyer Flag_N','ORIGINAL_DEBT_TO_INCOME_RATIO', 'NUMBER_OF_BORROWERS']
predictors = selected_df[feature_cols]
target_col = ['ORIGINAL_INTEREST_RATE']
target = selected_df[target_col]

#split data into training set and test set
x_train, x_test, y_train, y_test =  train_test_split(predictors,target,test_size = 0.3)

#initiate the linear regression model
linreg = LinearRegression()
#fit it to the training data
linreg.fit(x_train,y_train)
y_pred = linreg.predict(x_test)

#print the mean absolute error, mean squared error, and root mean squared error to evaluate the model
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

MAE: 0.43031114873595966
MSE: 0.324172733200998
RMSE: 0.5693616892635103


The model above is the full model: it uses all four predictors we are considering. We might not need every one of them, and the full model could be overfitting the data. Next, we use several selection methods to build alternative models, and then we compare them.

In [3]:
#Best Subset
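
One way this cell could be filled in is sketched below. It is a minimal illustration of best subset selection, not the tutorial's own code: the data are synthetic (an assumption standing in for the loan predictors), and the names `rss` and `best_by_size` are hypothetical. For every model size k it fits all models with k predictors by least squares and keeps the one with the smallest training RSS.

```python
from itertools import combinations

import numpy as np

# Synthetic stand-in for the loan data (an assumption): only x0 and x2 drive y.
rng = np.random.default_rng(0)
n, p = 300, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)


def rss(cols):
    """Training residual sum of squares of an OLS fit (with intercept) on `cols`."""
    A = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


# For each model size k, keep the subset of k predictors with the smallest RSS.
best_by_size = {}
for k in range(1, p + 1):
    best_by_size[k] = min(combinations(range(p), k), key=rss)

for k, subset in best_by_size.items():
    print(k, subset, round(rss(subset), 1))
```

Training RSS always falls as predictors are added, so it can only rank models of the same size. Choosing among the sizes needs a criterion such as Cp, BIC, adjusted R squared, or cross-validated error. The number of candidate models doubles with each added predictor, which is why the stepwise methods exist.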

In [4]:
#Forward selection
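
A hedged sketch of forward stepwise selection follows. It assumes synthetic data in place of the loan predictors, and the helper `rss` and the list `path` are hypothetical names. Starting from no predictors, each step adds the single predictor that lowers the training RSS the most, which produces a nested sequence of models.

```python
import numpy as np

# Synthetic stand-in for the loan data (an assumption): only x0 and x2 drive y.
rng = np.random.default_rng(1)
n, p = 300, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)


def rss(cols):
    """Training residual sum of squares of an OLS fit (with intercept) on `cols`."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


selected, remaining = [], list(range(p))
path = []  # one (predictor list, RSS) pair per step
while remaining:
    # Greedy step: try each unused predictor and keep the one with the lowest RSS.
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    path.append((list(selected), rss(selected)))

for cols, value in path:
    print(cols, round(value, 1))
```

The final model is then picked from `path` with a criterion that penalizes model size, or with validation error, because RSS alone always favors the largest model.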



In [5]:
#Backward Selection
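
Backward selection starts from the full model and removes one predictor at a time. The sketch below uses scikit-learn's `SequentialFeatureSelector` with `direction="backward"`, which drops the predictor whose removal hurts the cross-validated score the least. The synthetic data and the choice to stop at two predictors are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the loan data (an assumption): only x0 and x2 drive y.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=300)

# Start with all four predictors and remove them one by one until two remain.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,  # stopping point chosen for illustration
    direction="backward",
    cv=5,
)
selector.fit(X, y)

kept = np.flatnonzero(selector.get_support())
print("kept predictor indices:", kept)
```

Note that this selector scores candidates by cross-validation rather than by training RSS, so it is a variant of the textbook algorithm, not a line-by-line copy of it.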

In [6]:
#AIC, BIC, Mallows' Cp, Adjusted R squared
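
The sketch below computes the four criteria for a least squares fit. It is a hedged illustration on synthetic data (an assumption), and `fit_rss` and `criteria` are hypothetical helper names. Cp and adjusted R squared follow the usual textbook forms. AIC and BIC are written in one common Gaussian-likelihood form with additive constants dropped, so only differences between models are meaningful.

```python
import numpy as np

# Synthetic stand-in for the loan data (an assumption): only x0 and x2 drive y.
rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)


def fit_rss(cols):
    """Training residual sum of squares of an OLS fit (with intercept) on `cols`."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


tss = float(((y - y.mean()) ** 2).sum())
# Error variance estimated from the full model, as Cp requires.
sigma2_hat = fit_rss([0, 1, 2, 3]) / (n - 4 - 1)


def criteria(cols):
    d = len(cols)  # number of predictors in the model
    rss = fit_rss(cols)
    return {
        "Cp": (rss + 2 * d * sigma2_hat) / n,              # lower is better
        "AIC": n * np.log(rss / n) + 2 * (d + 1),          # lower is better
        "BIC": n * np.log(rss / n) + np.log(n) * (d + 1),  # lower is better
        "adj_R2": 1 - (rss / (n - d - 1)) / (tss / (n - 1)),  # higher is better
    }


for cols in ([0], [0, 2], [0, 1, 2, 3]):
    print(cols, {k: round(v, 3) for k, v in criteria(cols).items()})
```

BIC multiplies the parameter count by log(n) instead of 2, so for any sample larger than about seven observations it penalizes extra predictors more heavily than AIC and tends to pick smaller models.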

In [7]:
#Ridge Regression
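
A minimal ridge regression sketch follows, assuming synthetic data in place of the loan predictors. scikit-learn's `Ridge` adds `alpha` times the sum of squared coefficients to the least squares objective, so larger values of `alpha` shrink the coefficients further toward zero. The predictors are standardized first because the penalty depends on their scale. The `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (an assumption): only x0 and x2 drive y.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=300)

# Standardize so that the penalty treats every predictor on the same scale.
X_std = StandardScaler().fit_transform(X)

ols = LinearRegression().fit(X_std, y)
print("OLS  ", np.round(ols.coef_, 3))

norms = []
for alpha in (1.0, 100.0, 10000.0):
    ridge = Ridge(alpha=alpha).fit(X_std, y)
    norms.append(float(np.linalg.norm(ridge.coef_)))
    print(f"alpha={alpha:<8}", np.round(ridge.coef_, 3))
```

Ridge shrinks every coefficient but does not set any of them exactly to zero, so all predictors stay in the model.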

##Least Absolute Shrinkage and Selection Operator (LASSO)

This regression method can improve both the prediction accuracy and the interpretability of a statistical model. It uses shrinkage: an L1 penalty on the coefficients pulls the coefficient estimates toward zero and can set some of them exactly to zero, which encourages simple, sparse models.
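
The sparsity described above can be seen in a small sketch. It assumes synthetic data in which only the first predictor matters, and `alpha=0.5` is an arbitrary penalty strength chosen for illustration. scikit-learn's `Lasso` minimizes the mean squared error scaled by 1/2 plus `alpha` times the sum of absolute coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): y depends on x0 only; x1..x3 are pure noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + rng.normal(size=500)

# Standardize so that the L1 penalty treats every predictor on the same scale.
X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5).fit(X_std, y)
print("coefficients:", np.round(lasso.coef_, 3))
print("predictors kept:", np.flatnonzero(lasso.coef_))
```

The coefficient of the informative predictor is shrunk below its true value of 3, while the noise predictors are dropped entirely. This is the sense in which LASSO performs variable selection and ridge regression does not.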




In [9]:
#Tuning Parameters
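
The penalty strength in ridge regression and LASSO is a tuning parameter that is usually chosen by cross-validation. The sketch below is one hedged way to do that with `GridSearchCV`. The synthetic data and the grid of `alpha` values are assumptions for illustration, not values recommended for the loan data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan data (an assumption): only x0 and x2 drive y.
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=300)

# Candidate penalty strengths, spread over several orders of magnitude.
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation; scikit-learn maximizes scores, hence the negated MSE.
search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
cv_mse = -search.best_score_
print("best alpha:", best_alpha)
print("cross-validated MSE:", round(cv_mse, 3))
```

Swapping `Lasso()` for `Ridge()` tunes the ridge penalty in the same way. After the search, `search.best_estimator_` holds the model refit on all of the data with the selected `alpha`.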