### Using Decision Tree for Loan Default Prediction

### What is our objective ?
* To reliably predict wether a person's loan payment will be defaulted based on features such as Salary, Account Balance etc.

### Getting to know the dataset!
LoanDefault dataset contains historic data for loan defaultees, along with their associated financial background, it has the following features.
* Employed - Employment status of the borrower, (1 - Employed | 0 - Unemployed).
* Bank Balance - Account Balance of the borrower at the time of repayment / default.
* Annual Salary - Per year income of the borrower at the time of repayment / default.
* Default - Target variable, indicated if the borrower repayed the loaned amount within the stipulated time period, (1 - Defaulted | 0 - Re-Paid).

### Approach
* This is an trivial example for dataset containing class imbalance, considering most of the people will be repaying their loan without default.
* So, we have to explore our data to check for imbalance, handle it using various techniques.
* Explore the correlation between various features in the dataset
* Split the preprocessed dataset into train and test sets respectively.
* Train a DecisionTree (Classifier) using mlpack.
* Finally we'll predict on the test set and using various evaluation metrics such as Accuracy, F1-Score, ROC AUC to judge the performance of our model on unseen data.

#### NOTE: In this example we'll be implementing 4 parts i.e modelling on imbalanced, oversampled, SMOTE & undersampled data respectively.

In [None]:
!wget -q http://datasets.mlpack.org/LoanDefault.csv

In [1]:
// Import necessary library headers.
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/methods/decision_tree/decision_tree.hpp>
#include <mlpack/core/data/scaler_methods/standard_scaler.hpp>

In [2]:
// Import utility headers.
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "../utils/preprocess.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [3]:
using namespace mlpack;

In [4]:
using namespace mlpack::data;

In [5]:
using namespace mlpack::tree;

In [6]:
// Utility functions for evaluation metrics.
double Accuracy(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    const size_t correct = arma::accu(yPreds == yTrue);
    return (double)correct / (double)yTrue.n_elem;
}

In [7]:
double Precision(const size_t truePos, const size_t falsePos)
{
    return (double)truePos / (double)(truePos + falsePos);
}

In [8]:
double Recall(const size_t truePos, const size_t falseNeg)
{
    return (double)truePos / (double)(truePos + falseNeg);
}

In [9]:
double F1Score(const size_t truePos, const size_t falsePos, const size_t falseNeg)
{
    double prec = Precision(truePos, falsePos);
    double rec = Recall(truePos, falseNeg);
    return 2 * (prec * rec) / (prec + rec);
}

In [10]:
void ClassificationReport(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    arma::Row<size_t> uniqs = arma::unique(yTrue);
    std::cout << std::setw(29) << "precision" << std::setw(15) << "recall" 
              << std::setw(15) << "f1-score" << std::setw(15) << "support" 
              << std::endl << std::endl;
    
    for(auto val: uniqs)
    {
        size_t truePos = arma::accu(yTrue == val && yPreds == val && yPreds == yTrue);
        size_t falsePos = arma::accu(yPreds == val && yPreds != yTrue);
        size_t trueNeg = arma::accu(yTrue != val && yPreds != val && yPreds == yTrue);
        size_t falseNeg = arma::accu(yPreds != val && yPreds != yTrue);
        
        std::cout << std::setw(15) << val
                  << std::setw(12) << std::setprecision(2) << Precision(truePos, falsePos) 
                  << std::setw(16) << std::setprecision(2) << Recall(truePos, falseNeg) 
                  << std::setw(14) << std::setprecision(2) << F1Score(truePos, falsePos, falseNeg)
                  << std::setw(16) << truePos
                  << std::endl;
    }
}

Create a directory named data to store all preprocessed csv.

In [11]:
!mkdir -p ./data

Drop the dataset header using sed, sed is an unix utility that prases and transforms text.

In [12]:
!cat LoanDefault.csv | sed 1d > ./data/LoanDefault_trim.csv

### Loading the Data

In [13]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat loanData;
data::Load("./data/LoanDefault_trim.csv", loanData);

In [14]:
// Inspect the first 5 examples in the dataset
std::cout << std::setw(12) << "Employed" << std::setw(15) << "Bank Balance" << std::setw(15) << "Annual Salary" 
          << std::setw(12) << "Defaulted" << std::endl;
std::cout << loanData.submat(0, 0, loanData.n_rows-1, 5).t() << std::endl;

    Employed   Bank Balance  Annual Salary   Defaulted
   1.0000e+00   8.7544e+03   5.3234e+05            0
            0   9.8062e+03   1.4527e+05            0
   1.0000e+00   1.2883e+04   3.8121e+05            0
   1.0000e+00   6.3510e+03   4.2845e+05            0
   1.0000e+00   9.4279e+03   4.6156e+05            0
            0   1.1035e+04   8.9899e+04            0



### Part 1 - Modelling using Imbalanced Dataset

In [15]:
// Visualize the distribution of target classes.
CountPlot("LoanDefault.csv", "Defaulted?", "", "Part-1 Distribution of target class");
auto img = xw::image_from_file("./plots/Part-1 Distribution of target class.png").finalize();
img

A Jupyter widget with unique id: f5c8da1cd651478abbd7d60997e230a6

From the above visualization, we can observe that the presence of "0" and "1", so there is a huge class imbalance. For the first part we would not be handling the class imbalance. In order to see how our model performs on the raw imbalanced data

In [16]:
// Visualize the distibution of target classes with respect to Employment.
CountPlot("LoanDefault.csv", "Defaulted?", "Employed", "Part-1 Distribution of target class & Employed");
auto img = xw::image_from_file("./plots/Part-1 Distribution of target class & Employed.png").finalize();
img

A Jupyter widget with unique id: 1244ff4bc7b84e65a7bc8ab6f0e4ec0d

### Visualize Correlation

In [17]:
// Plot the correlation matrix as heatmap.
HeatMapPlot("LoanDefault.csv", "coolwarm", "Part-1 Correlation Heatmap", 1, 5, 5);
auto img = xw::image_from_file("./plots/Part-1 Correlation Heatmap.png").finalize();
img

A Jupyter widget with unique id: 7733405de28f4b488228b6cd126095ee

In [18]:
// Split the data into features (X) and target (y) variables, targets are the last row.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
// Targets are dropped from the loaded matrix.
loanData.shed_row(loanData.n_rows-1);

### Train Test Split
The data set has to be split into a training set and a test set. Here the dataset has 10000 observations and the test Ratio is taken as 25% of the total observations. This indicates the test set should have 25% * 10000 = 2500 observations and trainng test should have 7500 observations respectively. This can be done using the `data::Split()` api from mlpack.

In [19]:
// Split the dataset into train and test sets using mlpack.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

### Training Decision Tree model
Decision trees start with a basic question, From there you can ask a series of questions to determine an answer. These questions make up the decision nodes in the tree, acting as a means to split the data. Each question helps an individual to arrive at a final decision, which would be denoted by the leaf node. Observations that fit the criteria will follow the “Yes” branch and those that don’t will follow the alternate path.  Decision trees seek to find the best split to subset the data. To create the model we'll be using `DecisionTree<>` API from mlpack.

In [20]:
// Create and train Decision Tree model using mlpack.
DecisionTree<> dt(Xtrain, Ytrain, 2);

### Making Predictions on Test set

In [21]:
// Classify the test set using trained model & get the probabilities.
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

### Evaluation metrics

* True Positive - The actual value was true & the model predicted true.
* False Positive - The actual value was false & the model predicted true, Type I error.
* True Negative - The actual value was false & the model predicted false.
* False Negative - The actual value was true & the model predicted false, Type II error.

`Accuracy`: is a metric that generally describes how the model performs across all classes. It is useful when all classes are of equal importance. It is calculated as the ratio between the number of correct predictions to the total number of predictions.

$$Accuracy = \frac{True_{positive} + True_{negative}}{True_{positive} + True_{negative} + False_{positive} + False_{negative}}$$

`Precision`: is calculated as the ratio between the number of positive samples correctly classified to the total number of samples classified as Positive. The precision measures the model's accuracy in classifying a sample as positive.

$$Precision = \frac{True_{positive}}{True_{positive} + False_{positive}}$$

`Recall`: is calulated as the ratio between the number of positive samples correctly classified as Positive to the total number of Positive samples. The recall measures the model's ability to detect Positive samples. The higher the recall, the more positive samples detected.

$$Recall = \frac{True_{positive}}{True_{positive} + False_{negative}}$$

* The decision of whether to use precision or recall depends on the type of problem begin solved.
* If the goal is to detect all positive samples then use recall.
* Use precision if the problem is sensitive to classifying a sample as Positive in general.

* ROC graph has the True Positive rate on the y axis and the False Positive rate on the x axis.
* ROC Area under the curve in the graph is the primary metric to determine if the classifier is doing well, the higher the value the higher the model performance.

In [22]:
// Save the yTest and probabilities into csv for generating ROC AUC plot.
data::Save("./data/probabilities.csv", probs);
data::Save("./data/ytest.csv", Ytest);

In [23]:
// Model evaluation metrics.
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

Accuracy: 0.9708
                    precision         recall       f1-score        support

              0        0.98            0.99          0.99            2399
              1        0.57            0.35          0.43              28


In [24]:
// Plot ROC AUC Curve to visualize the performance of the model on TP & FP.
RocAucPlot("./data/ytest.csv", "./data/probabilities.csv", "Part-1 Imbalanced Targets ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-1 Imbalanced Targets ROC AUC Curve.png").finalize();
img

A Jupyter widget with unique id: b2301534f91b4d7eb9852f7d6f42df48

From the above classification report, we can infer that our model trained on imbalanced data performs well on negative class but not the same for positive class.

### Part 2 - Modelling using Random Oversampling
For this part we would be handling the class imbalance. In order to see how our model performs on the randomly oversampled data. We will be using `Resample()` method to oversample the minority class i.e "1, signifying Defaulted"

In [25]:
// Oversample the minority population.
Resample("LoanDefault.csv", "Defaulted?", 0, 1, "oversample");

In [26]:
// Visualize the distribution of target classes.
CountPlot("./data/LoanDefault_oversampled.csv", "Defaulted?", "", "Part-2 Distribution of target class");
auto img = xw::image_from_file("./plots/Part-2 Distribution of target class.png").finalize();
img

A Jupyter widget with unique id: 3dd293f0687c4feca705eb9ad3814bd1

From the above plot we can see that after resampling the minority class (Yes) is oversampled to be equal to the majority class (No). This solves our imbalanced data issue for this part.

In [27]:
!cat ./data/LoanDefault_oversampled.csv | sed 1d > ./data/LoanDefault_trim.csv

In [29]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat loanData;
data::Load("./data/LoanDefault_trim.csv", loanData);

In [31]:
// Plot the correlation matrix as heatmap.
HeatMapPlot("./data/LoanDefault_oversampled.csv", "coolwarm", "Part-2 Correlation Heatmap", 1, 5, 5);
auto img = xw::image_from_file("./plots/Part-2 Correlation Heatmap.png").finalize();
img

A Jupyter widget with unique id: f8218673289344ebb98d7237df3cc769

In [32]:
// Split the data into features (X) and target (y) variables, targets are the last row.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
// Targets are dropped from the loaded matrix.
loanData.shed_row(loanData.n_rows-1);

### Train Test Split
The dataset has to be split into training and test set. Here the dataset has 19334 observations and the test ratio is taken as 20% of the total observations. This indicates that the test set should have 20% * 19334 = 3866 observations and training set should have 15468 observations respectively. This can be done using the `data::Split()` api from mlpack.

In [33]:
// Split the dataset into train and test sets using mlpack.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

### Training Decision Tree model
We will use `DecisionTree<>` API from mlpack to train the model on oversampled data.

In [34]:
// Create and train Decision Tree model using mlpack.
DecisionTree<> dt(Xtrain, Ytrain, 2);

### Making Predictions on Test set

In [35]:
// Classify the test set using trained model & get the probabilities.
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

In [36]:
// Save the yTest and probabilities into csv for generating ROC AUC plot.
data::Save("./data/probabilities.csv", probs);
data::Save("./data/ytest.csv", Ytest);

In [42]:
// Model evaluation metrics.
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

Accuracy: 0.96
                    precision         recall       f1-score        support

              0           1            0.92          0.96            2238
              1        0.92               1          0.96            2390


In [38]:
// Plot ROC AUC Curve to visualize the performance of the model on TP & FP.
RocAucPlot("./data/ytest.csv", "./data/probabilities.csv", "Part-2 Random Oversampled Targets ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-2 Random Oversampled Targets ROC AUC Curve.png").finalize();
img

A Jupyter widget with unique id: 44d9be945f954550b413117bf7ea1e41

From the above classification report, we can infer that our model trained on oversampled data performs well on both the classes, This proves the fact that imbalanced data has affected the model trained in part one. Also from the ROC AUC Curve, we can infer the True Positive Rate is around 99%, which is a good significance that our model performs well on unseen data.

### Part 3 - Modelling using Synthetic Minority Oversampling Technique
For this part we would be handling the class imbalance. In order to see how our model performs on the oversampled data using SMOTE. We will be using `SMOTE` API from imblearn to oversample the minority class i.e "1, signifying Defaulted"

In [43]:
// Oversample the minority class using SMOTE resampling strategy.
Resample("LoanDefault.csv", "Defaulted?", 0, 1, "smote");

We need to put back the headers manually into the newely sampled dataset for visualization purpose.

In [44]:
!sed -i "1iEmployed,Bank Balance,Annual Salary,Defaulted?" ./data/LoanDefault_smotesampled.csv

In [45]:
// Visualize the distribution of target classes.
CountPlot("./data/LoanDefault_smotesampled.csv", "Defaulted?", "", "Part-3 Distribution of target class");
auto img = xw::image_from_file("./plots/Part-3 Distribution of target class.png").finalize();
img

A Jupyter widget with unique id: 0ac888c2030d41ba9e43cf0f19cca88b

In [46]:
!cat ./data/LoanDefault_smotesampled.csv | sed 1d > ./data/LoanDefault_trim.csv

In [47]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat loanData;
data::Load("./data/LoanDefault_trim.csv", loanData);

In [48]:
// Plot the correlation matrix as heatmap.
HeatMapPlot("./data/LoanDefault_smotesampled.csv", "coolwarm", "Part-3 Correlation Heatmap", 1, 5, 5);
auto img = xw::image_from_file("./plots/Part-3 Correlation Heatmap.png").finalize();
img

A Jupyter widget with unique id: 9e57bd3c232c40d7add6e0a2bb72f897

In [49]:
// Split the data into features (X) and target (y) variables, targets are the last row.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
// Targets are dropped from the loaded matrix.
loanData.shed_row(loanData.n_rows-1);

### Train Test Split
The dataset has to be split into training and test set. The test ratio is taken as 25% of the total observations. This can be done using the `data::Split()` api from mlpack.

In [50]:
// Split the dataset into train and test sets using mlpack.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

### Training Decision Tree model
We will use `DecisionTree<>` API from mlpack to train the model on SMOTE data.

In [51]:
// Create and train Decision Tree model.
DecisionTree<> dt(Xtrain, Ytrain, 2);

### Making Predictions on Test set

In [52]:
// Classify the test set using trained model & get the probabilities.
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

In [53]:
// Save the yTest and probabilities into csv for generating ROC AUC plot.
data::Save("./data/probabilities.csv", probs);
data::Save("./data/ytest.csv", Ytest);

In [55]:
// Model evaluation metrics.
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

Accuracy: 0.91
                    precision         recall       f1-score        support

              0        0.93            0.89          0.91            2172
              1         0.9            0.93          0.91            2243


In [56]:
// Plot ROC AUC Curve to visualize the performance of the model on TP & FP.
RocAucPlot("./data/ytest.csv", "./data/probabilities.csv", "Part-3 SMOTE ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-3 SMOTE ROC AUC Curve.png").finalize();
img

A Jupyter widget with unique id: 47c4989f06a940f58bb4255e316c87c0

From the above classification report, we can infer that our model trained on SMOTE data performs well on both the classes. Also from the ROC AUC Curve, we can infer the True Positive Rate is around 90%, which is a quantifies that our model performs well on unseen data. But it performs slightly lower than the Oversampled data.

### Part 4 - Modelling using Random Undersampling
For this part we would be handling the class imbalance by undersampling the majority class, to see how well our model trains and performs on randomly undersampled data.

Since the size of the data set is quite small, undersampling of majority class would not make much sense here. But still we are going forward with this part to get a sense of how our model performs on less amount of data and it's impact on the learning.

In [57]:
// Undersample the majority class.
Resample("LoanDefault.csv", "Defaulted?", 0, 1, "undersample");

In [59]:
// Visualize the distribution of target classes.
CountPlot("./data/LoanDefault_undersampled.csv", "Defaulted?", "", "Part-4 Distribution of target class");
auto img = xw::image_from_file("./plots/Part-4 Distribution of target class.png").finalize();
img

A Jupyter widget with unique id: 40ef4a99fcab467496642602db73c185

From the above plot we can see that after resampling the majority class (No) is undersampled to be equal to the majority class (Yes). This solves our imbalanced data issue for this part.

In [60]:
!cat ./data/LoanDefault_undersampled.csv | sed 1d > ./data/LoanDefault_trim.csv

In [61]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat loanData;
data::Load("./data/LoanDefault_trim.csv", loanData);

In [63]:
// Plot the correlation matrix as heatmap.
HeatMapPlot("./data/LoanDefault_undersampled.csv", "coolwarm", "Part-4 Correlation Heatmap", 1, 5, 5);
auto img = xw::image_from_file("./plots/Part-4 Correlation Heatmap.png").finalize();
img

A Jupyter widget with unique id: 5381e23c83b84ddb88cec8b87e0d05ee

In [64]:
// Split the data into features (X) and target (y) variables, targets are the last row.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
// Targets are dropped from the loaded matrix.
loanData.shed_row(loanData.n_rows-1);

### Train Test Split
The dataset has to be split into training and test set. Here the dataset has 666 observations and the test ratio is taken as 20% of the total observations. This indicates that the test set should have 20% * 666 = 133 observations and training set should have 533 observations respectively. This can be done using the `data::Split()` api from mlpack.

In [65]:
// Split the dataset into train and test sets using mlpack.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

### Training Decision Tree model
We will use `DecisionTree<>` API from mlpack to train the model on SMOTE data.

In [66]:
// Create and train Decision Tree model.
DecisionTree<> dt(Xtrain, Ytrain, 2);

In [67]:
// Classify the test set using trained model & get the probabilities.
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

In [68]:
// Save the yTest and probabilities into csv for generating ROC AUC plot.
data::Save("./data/probabilities.csv", probs);
data::Save("./data/ytest.csv", Ytest);

In [70]:
// Model evaluation metrics.
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

Accuracy: 0.89
                    precision         recall       f1-score        support

              0        0.87             0.9          0.88              72
              1         0.9            0.87          0.89              75


In [71]:
// Plot ROC AUC Curve to visualize the performance of the model on TP & FP.
RocAucPlot("./data/ytest.csv", "./data/probabilities.csv", "Part-4 Random Undersampled targets ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-4 Random Undersampled targets ROC AUC Curve.png").finalize();
img

A Jupyter widget with unique id: ef733728eeb14a9f9b9d29b35e59f3a9

From the above classification report, we can infer that our model trained on undersampled data performs well on both the classes compared to imbalanced model in Part 1. Also from the ROC AUC Curve, we can infer the True Positive Rate is around 80% although there is a small flatline, but still performs better than imbalanced model.

### Conclusion
Models trained on resampled data performs well, but there is still room for improvement. Feel free to play around with the hyperparameters, training data split ratio etc. 