In [1]:
#Run the following code to print multiple outputs from a cell
get_ipython().ast_node_interactivity = 'all'

## Import Data

In [2]:
import pandas as pd
df = pd.read_csv("creditCardDefaultReduced.csv")
df

| | Limit_Bal | Education | Marriage | Age | Pay_0 | Bill_Amt1 | Pay_Amt1 | Payment | Card |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | University | married | 24 | Delay2 | 3913 | 0 | Missed | Normal |
| 1 | 120000 | University | single | 26 | ontime | 2682 | 0 | Missed | Normal |
| 2 | 90000 | University | single | 34 | Delay0 | 29239 | 1518 | Paid | Normal |
| 3 | 50000 | University | married | 37 | Delay0 | 46990 | 2000 | Paid | Normal |
| 4 | 50000 | University | married | 57 | ontime | 8617 | 2000 | Paid | Gold |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | 220000 | High school | married | 39 | Delay0 | 188948 | 8500 | Paid | Gold |
| 29996 | 150000 | High school | single | 43 | ontime | 1683 | 1837 | Paid | Gold |
| 29997 | 30000 | University | single | 37 | Delay4 | 3565 | 0 | Missed | Gold |
| 29998 | 80000 | High school | married | 41 | Delay1 | -1645 | 85900 | Missed | Gold |


## Set-up outcome & features variables

For these models, I'm using `Pay_Amt1` as the outcome variable: we're trying to predict how much someone will pay.

For the features, I'll leave out `Payment`, since it indicates whether or not someone paid their bill. If I wanted to use these models with new data, I likely wouldn't have this `Payment` information yet. Note that the code below also leaves out `Education`; the features are the three numeric columns plus dummy variables for `Marriage`, `Card`, and `Pay_0`.

In [3]:
outcome = df["Pay_Amt1"]
numericFeatures = df[["Limit_Bal", "Age", "Bill_Amt1"]]
dummiesMarriage = pd.get_dummies(df["Marriage"], prefix = "Marriage", drop_first = True)
dummiesCard = pd.get_dummies(df["Card"], prefix = "Card", drop_first = True)
dummiesPay_0 = pd.get_dummies(df["Pay_0"], prefix = "Pay_0", drop_first = True)
features = pd.concat([numericFeatures, dummiesMarriage, dummiesCard, dummiesPay_0], axis = 1)
outcome
features

0            0
1            0
2         1518
3         2000
4         2000
         ...  
29995     8500
29996     1837
29997        0
29998    85900
29999     2078
Name: Pay_Amt1, Length: 30000, dtype: int64

| | Limit_Bal | Age | Bill_Amt1 | Marriage_other | Marriage_single | Marriage_unknown | Card_Normal | Pay_0_Delay1 | Pay_0_Delay2 | Pay_0_Delay3 | Pay_0_Delay4 | Pay_0_Delay5 | Pay_0_Delay6 | Pay_0_Delay7 | Pay_0_Delay8+ | Pay_0_ontime | Pay_0_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | 24 | 3913 | False | False | False | True | False | True | False | False | False | False | False | False | False | False |
| 1 | 120000 | 26 | 2682 | False | True | False | True | False | False | False | False | False | False | False | False | True | False |
| 2 | 90000 | 34 | 29239 | False | True | False | True | False | False | False | False | False | False | False | False | False | False |
| 3 | 50000 | 37 | 46990 | False | False | False | True | False | False | False | False | False | False | False | False | False | False |
| 4 | 50000 | 57 | 8617 | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | 220000 | 39 | 188948 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 29996 | 150000 | 43 | 1683 | False | True | False | False | False | False | False | False | False | False | False | False | True | False |
| 29997 | 30000 | 37 | 3565 | False | True | False | False | False | False | False | True | False | False | False | False | False | False |
| 29998 | 80000 | 41 | -1645 | False | False | False | False | True | False | False | False | False | False | False | False | False | False |
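
To see what `drop_first = True` does in the `get_dummies` calls above, here is a minimal sketch on a made-up column (the `toy` data frame and its values are invented for illustration and are not taken from the dataset). It compares the dummy columns produced with and without `drop_first`.

```python
import pandas as pd

# Toy column, invented for illustration (not taken from the dataset)
toy = pd.DataFrame({"Marriage": ["married", "single", "other", "single"]})

# Without drop_first: one indicator column per category
allDummies = pd.get_dummies(toy["Marriage"], prefix="Marriage")

# With drop_first: the first category in sorted order ("married") is dropped
# and becomes the baseline that the remaining columns are compared against
reducedDummies = pd.get_dummies(toy["Marriage"], prefix="Marriage", drop_first=True)

print(list(allDummies.columns))
print(list(reducedDummies.columns))
```

This is likely why the features table has no `Marriage_married`, `Card_Gold`, or `Pay_0_Delay0` column: each appears to be the dropped baseline category for its variable.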


## Partition the Data

In [4]:
from sklearn.model_selection import train_test_split
featuresTrain, featuresTest, outcomeTrain, outcomeTest = train_test_split(features, 
                                                                          outcome, 
                                                                          test_size = 0.33, 
                                                                          random_state = 42)
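
As a quick check of what `train_test_split` does with `test_size` and `random_state`, here is a minimal sketch on made-up data (all of the `toy...` and `again...` names are hypothetical). It assumes 100 rows, so roughly a third of them should land in the test set, and it repeats the split to show that a fixed `random_state` reproduces the same partition.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration
toyFeatures = pd.DataFrame({"x": range(100)})
toyOutcome = pd.Series(range(100), name="y")

# Same arguments as the cell above, applied to the toy data
toyFeaturesTrain, toyFeaturesTest, toyOutcomeTrain, toyOutcomeTest = train_test_split(
    toyFeatures, toyOutcome, test_size=0.33, random_state=42)

# Repeating the split with the same random_state selects the same rows
againTrain, againTest, againOutcomeTrain, againOutcomeTest = train_test_split(
    toyFeatures, toyOutcome, test_size=0.33, random_state=42)

print(len(toyFeaturesTrain), len(toyFeaturesTest))
print(toyFeaturesTrain.index.equals(againTrain.index))
```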

## Decision Tree

For each of these models, instead of using a "Classifier", we'll be using a "Regressor" since the outcome variable is numeric. 

For Decision Trees, the command is `DecisionTreeRegressor` instead of `DecisionTreeClassifier`. 

In [5]:
# 1. Set-up the model
import sklearn.tree
modelTree = sklearn.tree.DecisionTreeRegressor(random_state = 42)

# 2. Fit the model using the training data
resultTree = modelTree.fit(featuresTrain, outcomeTrain)

# 3. Predict outcomes from the training and testing data
predTreeTrain = modelTree.predict(featuresTrain)
predTreeTest = modelTree.predict(featuresTest)

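To assess the fit of a machine learning regression model, we can't use the classification report. Instead, one of the most commonly used measures of fit is the root mean squared error (RMSE): the square root of the average squared difference between the actual and predicted values. RMSE is in the same units as the outcome (here, the payment amount), and a lower number means the predictions are closer to the actual values.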
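
As a worked example, RMSE can be written as $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is an actual value and $\hat{y}_i$ is the matching prediction. The sketch below computes it by hand on three made-up values (invented for illustration) and compares the result with the `mean_squared_error` approach used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, invented for illustration
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# Errors are -10, 10, -30; squared errors are 100, 100, 900; their mean is 1100 / 3
rmseManual = np.sqrt(np.mean((actual - predicted) ** 2))

# Same calculation using scikit-learn's mean squared error
rmseSklearn = np.sqrt(mean_squared_error(actual, predicted))

print(rmseManual, rmseSklearn)
```

Both numbers should agree, at roughly 19.15. The single error of 30 pulls the RMSE above the average absolute error (about 16.67), because squaring gives large errors more weight.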

In [6]:
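# 4. Assess the fit
import numpy as np
import sklearn.metrics  # imported explicitly so that sklearn.metrics is available
# RMSE for the training data, then for the testing data
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTrain, predTreeTrain))
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTest, predTreeTest))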

648.4096019547605

23323.81196215324

## Random Forest

As with the classification models, all of the steps are the same for Decision Trees and Random Forests. The command to set up the model is `RandomForestRegressor` instead of `RandomForestClassifier`.

In [7]:
# 1. Set-up the model
import sklearn.ensemble
modelForest = sklearn.ensemble.RandomForestRegressor(random_state = 42)

# 2. Fit the model using the training data
resultForest = modelForest.fit(featuresTrain, outcomeTrain)

# 3. Predict outcomes from the training and testing data
predForestTrain = modelForest.predict(featuresTrain)
predForestTest = modelForest.predict(featuresTest)

# 4. Assess the fit
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTrain, predForestTrain))
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTest, predForestTest))

6370.420362793068

17053.19262195185

## Support Vector Machine

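Again, the steps are the same, although it's a good idea to normalize your features first, since support vector machines are sensitive to the scale of each feature. The scaler is fit on the training data only and then applied to both the training and testing data. The command to set up the model is then `SVR` (for support vector regression) instead of `SVC`.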
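
To illustrate the normalization step, here is a minimal sketch of `MinMaxScaler` on a few made-up rows (the `toy...` names and values are invented for illustration). It rescales each column to the range 0 to 1 using the minimum and maximum of the training rows, then reuses those same values for the test rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy rows with two columns on very different scales, invented for illustration
toyTrain = np.array([[20000.0, 24.0], [120000.0, 26.0], [50000.0, 57.0]])
toyTest = np.array([[90000.0, 34.0], [150000.0, 60.0]])

toyScaler = MinMaxScaler()

# fit_transform learns each column's min and max from the training rows
toyTrainNorm = toyScaler.fit_transform(toyTrain)

# transform reuses the training min and max, so test values can fall outside 0 to 1
toyTestNorm = toyScaler.transform(toyTest)

print(toyTrainNorm)
print(toyTestNorm)
```

Fitting the scaler on the training rows only keeps information about the test rows out of the model, which is why the notebook calls `fit_transform` on the training features and plain `transform` on the testing features.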

In [15]:
# Normalize the features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
featuresTrain_norm = scaler.fit_transform(featuresTrain)
featuresTest_norm = scaler.transform(featuresTest)

# 1. Set-up the model
import sklearn.svm
modelSVM = sklearn.svm.SVR()

# 2. Fit the model using the training data
resultSVM = modelSVM.fit(featuresTrain_norm, outcomeTrain)

# 3. Predict outcomes from the training and testing data
predSVMTrain = modelSVM.predict(featuresTrain_norm)
predSVMTest = modelSVM.predict(featuresTest_norm)

# 4. Assess the fit
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTrain, predSVMTrain))
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTest, predSVMTest))

17320.035575152506

16089.650094371218

## Neural Network

Again, steps 2-4 are the same, and we'll reuse the normalized features from the Support Vector Machine section. For step 1, we'll use `MLPRegressor` instead of `MLPClassifier` (and remember to add the `hidden_layer_sizes` parameter to specify how many hidden layers and how many nodes per layer you'd like to include).
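
To show how `hidden_layer_sizes` maps onto the network's structure, here is a minimal sketch on made-up data (the `toy...` names and the data are invented for illustration). With `hidden_layer_sizes = (20, 20)`, the fitted model should hold one weight matrix per connection between layers: inputs to the first hidden layer, first to second hidden layer, and second hidden layer to the single output.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy data with 3 features, invented for illustration
rng = np.random.default_rng(42)
toyX = rng.random((50, 3))
toyY = toyX @ np.array([1.0, 2.0, 3.0])

# Two hidden layers with 20 nodes each, as in the cell below
toyNet = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=200, random_state=42)
toyNet.fit(toyX, toyY)  # may emit a convergence warning on data this small

# One weight matrix per pair of adjacent layers
print([w.shape for w in toyNet.coefs_])
```

A tuple such as `(50,)` would instead request a single hidden layer of 50 nodes.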

In [14]:
# 1. Set-up the model
import sklearn.neural_network
modelNeural = sklearn.neural_network.MLPRegressor(hidden_layer_sizes = (20, 20), random_state = 42)

# 2. Fit the model using the training data
resultNeural = modelNeural.fit(featuresTrain_norm, outcomeTrain)

# 3. Predict outcomes from the training and testing data
predNeuralTrain = modelNeural.predict(featuresTrain_norm)
predNeuralTest = modelNeural.predict(featuresTest_norm)

# 4. Assess the fit
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTrain, predNeuralTrain))
np.sqrt(sklearn.metrics.mean_squared_error(outcomeTest, predNeuralTest))



16555.705275011325

15236.704046609599

## Conclusion

Just as with classification models, we still want to focus on the fit for the Test data. Here are the test RMSEs for each model, rounded to the nearest whole number:

* Decision Tree: 23,324
* Random Forest: 17,053
* Support Vector Machine: 16,090
* Neural Network: 15,237

Because RMSE is a measure of error, the lower the number, the better. In this case, the Neural Network has the lowest test RMSE, so **the Neural Network is the best of the four models at predicting someone's payment amount**.
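
The comparison can also be done in code. The sketch below is one possible way to do it, assuming the test RMSE values are copied by hand from the cell outputs above into a dictionary (the names `testRmse` and `bestModel` are hypothetical).

```python
# Test RMSE values copied from the outputs above
testRmse = {
    "Decision Tree": 23323.81196215324,
    "Random Forest": 17053.19262195185,
    "Support Vector Machine": 16089.650094371218,
    "Neural Network": 15236.704046609599,
}

# The model with the smallest test RMSE
bestModel = min(testRmse, key=testRmse.get)

# Print the models from lowest to highest test RMSE
for name, value in sorted(testRmse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:,.0f}")

print("Best model:", bestModel)
```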