# Capstone 2: Lending Club - Pre-Processing and Training

In the previous EDA notebook, we explored the features: checking for missing data and outliers, and making sure each feature had a consistent data type.

The purpose of this notebook is to take our cleaned up dataset and get it ready for training. Below is a list of steps we'll be performing to get our data ready for training. 

1. Categorical features need to be converted to numerical ones via one-hot encoding, which represents each category as its own binary (0/1) feature. This is important because our models only accept numerical data when making their predictions. (The file imported below was already encoded at the end of the EDA notebook.)

2. The ranges of the individual features still differ considerably from one another. Standardizing the features is important so that features with larger ranges do not receive extra weight in algorithms that are sensitive to scale. We'll be using StandardScaler() to perform this standardization.

3. For training, we'll need to split our dataset into training and testing sets. Using the train_test_split() function, we keep 70% of the data for training and hold out the remaining 30% to evaluate the model's performance. Four CSV files (scaled X and y, for both training and testing) will be generated and saved for subsequent notebooks to use.



### 1. Imports

In [1]:
!python -m pip install --upgrade pip
!python -m pip install autogluon

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting autogluon
  Downloading autogluon-0.7.0-py3-none-any.whl (9.7 kB)
Collecting autogluon.core[all]==0.7.0 (from autogluon)
  Downloading autogluon.core-0.7.0-py3-none-any.whl (218 kB)
Collecting autogluon.features==0.7.0 (from autogluon)
  Downloading autogluon.features-0.7.0-py3-none-any.whl (60 kB)
Collecting autogluon.tabular[all]==0.7.0 (from autogluon)
  Downloading autogluon.tabular-0.7.0-py3-none-any.whl (292 kB)

In [2]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

#Processing functions
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, scale


#statistical measures for model performance
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

#functions to help with building multiple models
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime

from sb_utils import save_file

import autogluon
from autogluon.tabular import TabularDataset, TabularPredictor

#Regression models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
import lightgbm as lgb

from sklearn.utils import all_estimators
from sklearn.base import RegressorMixin

#**1. OneHotEncoding**

Previously, in the "Capstone Two - EDA" notebook, we took our original file "loan_data.csv" and performed the following: 1.) removed features we deemed non-informative to our goal, 2.) removed outliers, 3.) imputed the median for missing data, 4.) one-hot encoded the categorical features so that they are represented by numerical (binary) values.

The file imported below, "loan_data_eda_features.csv", is therefore already one-hot encoded.
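For reference, the encoding step itself can be sketched on a toy frame. This is only an illustration: it assumes pandas' get_dummies was used, which the EDA notebook may or may not have done, and the `toy` frame and its values are made up to mirror the `purpose_*` column names seen below.

```python
import pandas as pd

# Toy data (hypothetical); the `purpose` values mirror column suffixes in this notebook.
toy = pd.DataFrame({
    "fico": [737, 707, 682],
    "purpose": ["debt_consolidation", "credit_card", "debt_consolidation"],
})

# One binary column is created per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["purpose"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the `purpose_*` columns, which is why the means of those columns in describe() can be read as category proportions.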

In [3]:
loan_data_eda = pd.read_csv('loan_data_eda_features.csv')

In [4]:
loan_data_eda.head()

Unnamed: 0,interest_rate,installment,log_annual_income,debt_to_income_ratio,fico,days_with_credit_line,revolving_balance,revolving_utilization,inquiries_last_6months,delinquency_2yrs,public_record,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business,not_fully_paid_0,not_fully_paid_1
0,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0,0,1,0,0,0,0,1,0
1,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0,1,0,0,0,0,0,1,0
2,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0,0,1,0,0,0,0,1,0
3,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0,0,1,0,0,0,0,1,0
4,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0,1,0,0,0,0,0,1,0


In [6]:
loan_data_eda.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7710 entries, 0 to 7709
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   interest_rate               7710 non-null   float64
 1   installment                 7710 non-null   float64
 2   log_annual_income           7710 non-null   float64
 3   debt_to_income_ratio        7710 non-null   float64
 4   fico                        7710 non-null   int64  
 5   days_with_credit_line       7710 non-null   float64
 6   revolving_balance           7710 non-null   int64  
 7   revolving_utilization       7710 non-null   float64
 8   inquiries_last_6months      7710 non-null   int64  
 9   delinquency_2yrs            7710 non-null   int64  
 10  public_record               7710 non-null   int64  
 11  purpose_all_other           7710 non-null   int64  
 12  purpose_credit_card         7710 non-null   int64  
 13  purpose_debt_consolidation  7710 

In [7]:
loan_data_eda.describe()

Unnamed: 0,interest_rate,installment,log_annual_income,debt_to_income_ratio,fico,days_with_credit_line,revolving_balance,revolving_utilization,inquiries_last_6months,delinquency_2yrs,public_record,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business,not_fully_paid_0,not_fully_paid_1
count,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0,7710.0
mean,0.118754,325.0792,10.94268,12.298684,717.356031,4682.468461,13798.40428,45.312677,0.997536,0.143191,0.055123,0.238003,0.132296,0.418029,0.032944,0.066407,0.048119,0.064202,0.868482,0.131518
std,0.025571,205.611447,0.585371,6.627485,36.630697,2429.932117,16878.560424,28.821751,1.15258,0.469033,0.241491,0.425888,0.338834,0.493267,0.178502,0.249009,0.214032,0.245129,0.337987,0.337987
min,0.06,15.69,8.29405,0.0,627.0,1110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0976,166.5,10.596535,7.13,687.0,2970.010417,3334.25,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,0.1189,278.605,10.933107,12.38,712.0,4230.041667,8707.5,44.3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,0.1357,447.7475,11.289819,17.52,742.0,5789.958333,17579.75,68.675,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
max,0.2121,918.02,14.528354,29.42,827.0,17616.0,149527.0,99.9,8.0,6.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [8]:
#splitting the data into X (inputs) and y (output)
#interest_rate is the target we're trying to predict
y = loan_data_eda['interest_rate']
#the other 19 columns are the independent variables we expect to help predict y
X = loan_data_eda.drop(columns='interest_rate')

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7710 entries, 0 to 7709
Data columns (total 19 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   installment                 7710 non-null   float64
 1   log_annual_income           7710 non-null   float64
 2   debt_to_income_ratio        7710 non-null   float64
 3   fico                        7710 non-null   int64  
 4   days_with_credit_line       7710 non-null   float64
 5   revolving_balance           7710 non-null   int64  
 6   revolving_utilization       7710 non-null   float64
 7   inquiries_last_6months      7710 non-null   int64  
 8   delinquency_2yrs            7710 non-null   int64  
 9   public_record               7710 non-null   int64  
 10  purpose_all_other           7710 non-null   int64  
 11  purpose_credit_card         7710 non-null   int64  
 12  purpose_debt_consolidation  7710 non-null   int64  
 13  purpose_educational         7710 

In [10]:
y.shape

(7710,)

#**2. Scaling the Independent Variables**

In [11]:
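#Standard-scale the independent variables so that each one has zero mean and unit variance.
#Note: the scaler is fit on all of X here, before the train/test split, so the test rows also contribute to the fitted means and variances.
#transform() returns a NumPy ndarray rather than a DataFrame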
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
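As a quick check of what StandardScaler computes, the sketch below compares it with a manual z-score on a small made-up array (the name `X_toy` and its values are illustrative only). StandardScaler uses the population standard deviation, which is NumPy's default (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (hypothetical values).
X_toy = np.array([
    [737.0, 28854.0],
    [707.0, 33623.0],
    [682.0, 3511.0],
    [712.0, 33667.0],
])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Manual z-score: subtract the column mean, divide by the column's population std.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))
```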

#**3. Generating Training and Testing Data**

In [12]:
#splitting the data into train-test split 70:30
X_train_scaled_arr, X_test_scaled_arr, y_train_arr, y_test_arr = train_test_split(X_scaled, y, test_size=0.3, train_size=0.7, random_state = 15)

In [13]:
X_train_scaled_arr.shape, X_test_scaled_arr.shape

((5397, 19), (2313, 19))

In [14]:
y_train_arr.shape, y_test_arr.shape

((5397,), (2313,))

In [15]:
X_train_scaled_arr.dtype

dtype('float64')

In [16]:
X_test_scaled_arr.dtype

dtype('float64')

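We've just performed a 70/30 train/test split on our data. At this point the splits are held in the "_arr" variables; they are converted into the DataFrames named below later in the notebook.

Scaled training X = "X_train_scaled" ; (5397 rows, 19 columns)

Training y = "y_train" ; (5397 rows,)

Scaled testing X = "X_test_scaled" ; (2313 rows, 19 columns)

Testing y = "y_test" ; (2313 rows,)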
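An alternative ordering, sketched here on synthetic data rather than the loan data, is to split first and fit the scaler on the training rows only, so that no information from the test rows enters the scaling parameters. All names in this block (`X_demo`, `scaler_demo`, and so on) are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(15)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y_demo = rng.normal(size=1000)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=15
)

# ... then learn the means and variances from the training rows only
# and apply the same transformation to both splits.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the scaled test columns will not have exactly zero mean and unit variance, which is expected: they are scaled with the training statistics.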

#**AutoGluon**

Below we'll use AutoGluon, Amazon's open-source automated machine learning (AutoML) library, to generate leads for which models to prioritize building and optimizing.

In [17]:
#splitting the data into train-test split 70:30
#this is the unscaled data, because we believe AutoGluon may perform its own preprocessing internally
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7, random_state = 15)

In [18]:
#attach the target to the training features; AutoGluon expects the label as a column of the training table
#.copy() keeps X_train itself free of the label column, and .values assigns the labels by position
training_data = X_train.copy()
training_data['interest_rate'] = y_train.values
training_data.head()

Unnamed: 0,installment,log_annual_income,debt_to_income_ratio,fico,days_with_credit_line,revolving_balance,revolving_utilization,inquiries_last_6months,delinquency_2yrs,public_record,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business,not_fully_paid_0,not_fully_paid_1,interest_rate
1020,173.17,11.112448,10.33,727,5189.958333,2410,4.4,1,1,0,0,0,0,0,1,0,0,1,0,0.0863
2931,33.62,10.085809,24.35,687,3480.0,2533,38.4,0,0,1,1,0,0,0,0,0,0,1,0,0.1284
2796,210.01,10.858999,23.35,667,3870.0,7160,62.3,1,0,0,0,0,1,0,0,0,0,1,0,0.1568
5786,285.95,11.112388,17.7,737,6970.041667,17870,42.8,1,0,0,0,1,0,0,0,0,0,1,0,0.0894
2463,159.92,11.066638,16.72,712,7319.958333,9878,56.4,3,1,0,1,0,0,0,0,0,0,1,0,0.1221


In [19]:
training_data.shape

(5397, 20)
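Because X_train and y_train come from the same train_test_split call, they share the same row index, so the label can also be attached by index alignment instead of by position. A minimal sketch on three rows copied from the output above (the names `X_part`, `y_part` and `train_df` are illustrative only):

```python
import pandas as pd

# Three rows from the training table above: original row index and fico score.
X_part = pd.DataFrame({"fico": [727, 687, 667]}, index=[1020, 2931, 2796])

# The matching interest rates, deliberately listed in a different order.
y_part = pd.Series(
    [0.1568, 0.0863, 0.1284],
    index=[2796, 1020, 2931],
    name="interest_rate",
)

# Assigning a Series aligns on the index, so each rate lands on the right row.
train_df = X_part.copy()
train_df["interest_rate"] = y_part
print(train_df)
```

Working on a copy also leaves the feature-only frame untouched, which matters if it is reused later.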

In [20]:
#X_test doesn't include the interest_rate column, so the predictor has to predict it from the features alone
testing_data = X_test.copy()

In [21]:
testing_data.shape

(2313, 19)

In [22]:
i_column = 'interest_rate'

Utilizing AutoGluon

In [23]:
#Fitting AutoGluon, an automated ML model builder, on just the training data (120-second time limit)
predictor_interest_rate = TabularPredictor(label=i_column, path="agModels-predictInterestRate").fit(training_data, time_limit=120)

Beginning AutoGluon training ... Time limit = 120s
AutoGluon will save models to "agModels-predictInterestRate/"
AutoGluon Version:  0.7.0
Python Version:     3.10.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Sat Apr 29 09:15:28 UTC 2023
Train Data Rows:    5397
Train Data Columns: 19
Label Column: interest_rate
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and label-values can't be converted to int).
	Label info (max, min, mean, stddev): (0.2121, 0.06, 0.11892, 0.02604)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11712.38 MB
	Train Data (Original)  Memory Usage: 0.82 MB (0.0% of availabl

Running AutoGluon on this particular dataset only took roughly 1 minute.


In [24]:
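#reloading the fitted predictor from the directory AutoGluon saved it to
#(the cells below keep using predictor_interest_rate, which is still in memory)
predictor = TabularPredictor.load("agModels-predictInterestRate/")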

In [25]:
y_pred_gluon = predictor_interest_rate.predict(testing_data)     

In [26]:
#the error issue shown below is possibly due to a version of a particular package being out of date/not entirely compatible with the gluon package
y_pred_gluon.head()

1093    0.120134
6941    0.082609
1790    0.114956
4660    0.114137
1708    0.076789
Name: interest_rate, dtype: float32

In [27]:
y_pred_gluon.shape

(2313,)

In [28]:
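#caution: this fills the label column with the ensemble's own predictions rather than the true y_test values,
#so score_test in the leaderboard below measures agreement with these predictions, not the true test error
testing_data['interest_rate'] = y_pred_gluon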

In [29]:
predictor_interest_rate.leaderboard(testing_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-0.0,-0.012285,1.080121,0.418703,62.596196,0.009336,0.001281,0.71844,2,True,12
1,LightGBMLarge,-0.002191,-0.012491,0.082878,0.016846,2.694989,0.082878,0.016846,2.694989,1,True,11
2,LightGBM,-0.002432,-0.012507,0.038426,0.00964,1.004905,0.038426,0.00964,1.004905,1,True,4
3,RandomForestMSE,-0.00269,-0.012556,0.413988,0.239011,12.582806,0.413988,0.239011,12.582806,1,True,5
4,CatBoost,-0.002856,-0.012654,0.015023,0.008762,10.162949,0.015023,0.008762,10.162949,1,True,6
5,ExtraTreesMSE,-0.003094,-0.012718,0.410341,0.096392,3.904968,0.410341,0.096392,3.904968,1,True,7
6,XGBoost,-0.003119,-0.012694,0.039201,0.035667,1.405962,0.039201,0.035667,1.405962,1,True,9
7,LightGBMXT,-0.003138,-0.012726,0.083448,0.017808,1.154908,0.083448,0.017808,1.154908,1,True,3
8,NeuralNetFastAI,-0.005013,-0.0131,0.069417,0.016244,6.091812,0.069417,0.016244,6.091812,1,True,8
9,NeuralNetTorch,-0.005193,-0.0133,0.040712,0.030527,25.435328,0.040712,0.030527,25.435328,1,True,10


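Interpreting the results from the AutoGluon leaderboard:

Increasing time_limit gives AutoGluon more time to train, which may allow it to fit and ensemble more models.

What is score_test? AutoGluon inferred that this is a regression problem, and its default evaluation metric for regression is RMSE. Each prediction error is squared before being averaged, so large errors are penalized heavily and a high RMSE means the model performed poorly. Note that in cell 28 the interest_rate column of testing_data was filled with the ensemble's predictions rather than the true y_test values. That is why WeightedEnsemble_L2 scores -0.0, and why the other score_test values show how closely each model agrees with the ensemble rather than its true test error.

The evaluation metric can be changed to MAE or R^2 by passing eval_metric when creating the TabularPredictor.

AutoGluon reports every metric so that higher is better, which is why a negative sign is added in front of the RMSE value. Scores closer to zero are therefore better.

What is score_val?
It is the same metric computed on validation data that AutoGluon holds out from the training data on its own. This is the third split, alongside the training and test sets, and it is what AutoGluon uses to compare models and weight the ensemble.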
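The sign convention and the -0.0 entry can be reproduced with a few lines of NumPy. This is a hand-rolled sketch, not AutoGluon's own scoring code; the `rmse` helper and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average them, take the root."""
    errors = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy interest rates (hypothetical values).
y_true_toy = np.array([0.10, 0.12, 0.15])
y_pred_toy = np.array([0.11, 0.12, 0.13])

# "Higher is better" form: the RMSE is negated.
score = -rmse(y_true_toy, y_pred_toy)

# Scoring predictions against themselves always gives zero error.
self_score = -rmse(y_pred_toy, y_pred_toy)

print(score, self_score)
```

To obtain a true test score from the leaderboard, the label column of the test table would need to hold the actual y_test values.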

#**Next Steps**

Based on the AutoGluon results, the modeling notebook will focus on the following models:

1. Random Forest
2. Linear Regression
3. Gradient Boosting
4. XGBRegressor
5. Decision Tree  
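A first comparison of these candidates could look like the sketch below. It is only an outline under several assumptions: the data is synthetic, the hyperparameters are arbitrary, XGBRegressor is left out to keep the example scikit-learn only, and DummyRegressor is added as a baseline. The scoring string follows the same negated-RMSE convention as the leaderboard.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the loan data.
rng = np.random.default_rng(15)
X_syn = rng.normal(size=(300, 5))
y_syn = X_syn @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(scale=0.05, size=300)

candidates = {
    "DummyRegressor": DummyRegressor(strategy="mean"),
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=15),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=20, random_state=15),
    "GradientBoostingRegressor": GradientBoostingRegressor(n_estimators=20, random_state=15),
}

# Mean 5-fold cross-validated score per model; values are negated RMSE,
# so the model closest to zero is the best.
cv_scores = {
    name: cross_val_score(
        model, X_syn, y_syn, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}

for name, value in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.4f}")
```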


#**Extracting dependent and scaled independent variables**

In [30]:
X.index

RangeIndex(start=0, stop=7710, step=1)

In [31]:
#extracting column values for making dataframes
for col in X.columns:
    print(col)

installment
log_annual_income
debt_to_income_ratio
fico
days_with_credit_line
revolving_balance
revolving_utilization
inquiries_last_6months
delinquency_2yrs
public_record
purpose_all_other
purpose_credit_card
purpose_debt_consolidation
purpose_educational
purpose_home_improvement
purpose_major_purchase
purpose_small_business
not_fully_paid_0
not_fully_paid_1


In [33]:
#transforming nd-arrays into dataframes, reusing the column names from X
X_train_scaled = pd.DataFrame(X_train_scaled_arr, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled_arr, columns=X.columns)
y_train = pd.DataFrame(y_train_arr, columns=['interest_rate'])
y_test = pd.DataFrame(y_test_arr, columns=['interest_rate'])

In [34]:
# Save the data
#save_file arguments, in order: the dataframe, the file name, the directory to write to
datapath = 'data'
save_file(X_train_scaled, 'X_train_scaled.csv', datapath)
save_file(X_test_scaled, 'X_test_scaled.csv', datapath)
save_file(y_train, 'y_train.csv', datapath)
save_file(y_test, 'y_test.csv', datapath)

Directory data was created.
Writing file.  "data/X_train_scaled.csv"
Writing file.  "data/X_test_scaled.csv"
Writing file.  "data/y_train.csv"
Writing file.  "data/y_test.csv"
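The save_file helper comes from the course's sb_utils module, whose source is not shown in this notebook. The stand-in below is a guess at its behavior (create the directory if needed, then write the DataFrame as CSV without the index) and is written against a temporary directory so it does not touch the real "data" folder. The name `save_file_sketch` is hypothetical.

```python
import os
import tempfile

import pandas as pd

def save_file_sketch(df, fname, dname):
    """Hypothetical stand-in for sb_utils.save_file: write df to dname/fname as CSV."""
    os.makedirs(dname, exist_ok=True)  # create the target directory if it is missing
    fpath = os.path.join(dname, fname)
    df.to_csv(fpath, index=False)      # assumption: the index is not written
    return fpath

# Round trip in a throwaway directory: save, then read back for the next notebook.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"interest_rate": [0.0863, 0.1284]})
    path = save_file_sketch(demo, "y_train.csv", os.path.join(tmp, "data"))
    reloaded = pd.read_csv(path)

print(reloaded)
```

A subsequent notebook would load the saved splits the same way, with pd.read_csv on each of the four files.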
