# Training Neural Nets on Bank Loan Data to Minimize Defaults

## 1- Motivation

When someone applies for a loan at a financial institution, the applicant's ability to repay and credit history are evaluated against the loan amount to see whether the applicant is likely to pay it back. If so, the loan is granted; otherwise, it is rejected.

Since the 2008 financial crisis, regulation of the financial industry has tightened (for good reason). Banks stopped giving out loans to applicants who are unlikely to pay them back (which is what they should have been doing in the first place). This project investigates how well the banks' current methodology performs, and whether machine learning / deep learning approaches can beat the bank's current model.

## 2- Bank's current model

It is difficult to find publicly available loan data from financial institutions. These datasets are typically highly confidential, as they contain sensitive personal information about the loan applicants. However, I managed to find the approved loan datasets of the peer-to-peer lending company Lending Club, covering 2007 to 2015 (https://www.kaggle.com/wendykan/lending-club-loan-data).

#### After removing ongoing loans, there are ~1.3 million samples. Treating all problematic loans (late payments, charge-offs, defaults) as defaults, the bank's current model has a precision of 78.38%: of the loans it approved, 78.38% were fully paid. (There is no way to know whether a rejected loan would have defaulted had it been granted.)
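Because every loan in the dataset was approved, the lender implicitly predicted "will pay back" for all of them, so this precision is simply the share of finished loans that ended as fully paid. A minimal sketch of that calculation follows. The status labels, the `ONGOING` set, and the `baseline_payback_rate` helper are illustrative assumptions, not the dataset's exact values or the code used to obtain the 78.38% figure.

```python
import pandas as pd

# ASSUMPTION: hypothetical status labels; the real dataset's labels may differ.
ONGOING = {"Current", "In Grace Period"}
FULLY_PAID = "Fully Paid"


def baseline_payback_rate(loans: pd.DataFrame, status_col: str = "loan_status") -> float:
    """Share of finished (non-ongoing) loans that were fully paid."""
    finished = loans[~loans[status_col].isin(ONGOING)]
    # Everything that is finished but not fully paid counts as a default.
    return float((finished[status_col] == FULLY_PAID).mean())


# Tiny made-up sample, only to show the mechanics.
toy = pd.DataFrame({"loan_status": [
    "Fully Paid", "Charged Off", "Current", "Fully Paid",
    "Late (31-120 days)", "Fully Paid", "Default", "Fully Paid",
]})
rate = baseline_payback_rate(toy)
print(f"baseline payback rate: {rate:.4f}")
```

On the real data, the same idea would be applied to the full `loan_status` column after deciding which statuses count as ongoing.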

## 3- Preliminary Dataset Analysis

The datasets have many columns, but not all are useful. Some columns duplicate information found in others. For example, "monthly installment" can be computed from "loan amount", "interest rate", and the loan term. There are also columns that are irrelevant to our analysis, such as "next payment date". After removing the extra columns, I ended up with these variables:
    - Income: the applicant's annual income in USD.
    - Verification: whether the applicant's self-declared income has been verified. There are three categories: "Verified", "Source Verified" (verified that the applicant works at the self-declared company), and "Not Verified"
    - Installment: the applicant's monthly payment.
    - Term: the length of the loan period, in this case either "36 months" or "60 months"
    - Homeownership: the collateral status, whether the applicant is the "owner", is "mortgaging", or is "renting"
    - Purpose: the reason for the loan, e.g. "debt consolidation", "education", etc.
    - Location: the home state of the applicant, e.g. "CA", "TX", "NY", etc.
    - Dti: debt-to-income ratio, ranges from 0 to 100.
    - Delinq_2yrs: the number of times the applicant became delinquent (late on a payment) in the past 2 years
    - Revol_util: the percentage of available credit that is being utilized. Usually ranges from 0 to 100, but can sometimes exceed 100 (unlikely in this dataset, since this type of applicant usually won't be granted a loan)
    - loan_status: the final outcome of the loan, either "Fully Paid" or "Default"
    - grade: the loan grade that Lending Club assigns to each loan. Loans with a higher grade are more likely to be fully paid
    - sub_grade: similar to grade, except that it's the sub-grade within each grade
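
The redundancy between installment, loan amount, and interest rate noted above can be illustrated with the standard fixed-rate amortization formula. Whether the lender computes installments exactly this way is an assumption, and the function and parameter names below are hypothetical.

```python
def monthly_installment(principal: float, annual_rate_pct: float, term_months: int) -> float:
    """Fixed monthly payment under standard amortization (assumed formula)."""
    monthly_rate = annual_rate_pct / 100 / 12
    if monthly_rate == 0:
        # No interest: the principal is simply spread evenly over the term.
        return principal / term_months
    growth = (1 + monthly_rate) ** term_months
    return principal * monthly_rate * growth / (growth - 1)


# Example: a 10,000 USD loan at 12% per year over 36 months.
payment = monthly_installment(10_000, 12.0, 36)
print(f"monthly installment: {payment:.2f}")
```

If installments in the data follow this formula, the column carries no information beyond loan amount, rate, and term, which is why such columns can be pruned.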

## 4- Classification Models

### 4.1 Logistic Regression

#### Goal

The goal is to use the features [income, verification, installment, term, home_ownership, purpose, location, dti, delinq_2yrs, Revol_util] to predict [loan_status] using logistic regression.

#### Methodology

The data is split 70:30 into training and test sets with train_test_split, and the split is repeated 10 times. The categorical features are converted to dummy variables.
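
The procedure described above could look roughly like the following sketch, which uses scikit-learn and synthetic stand-in data. The helper name `repeated_split_confusion`, the toy columns, and the choice to sum the confusion matrices over the repeats are assumptions made for illustration; the original training code is not shown in this write-up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def repeated_split_confusion(df, target, categorical, n_repeats=10, test_size=0.3):
    """Sum the test-set confusion matrices over several random 70:30 splits."""
    # Categorical columns become 0/1 dummy variables.
    X = pd.get_dummies(df.drop(columns=[target]), columns=categorical, dtype=float)
    y = df[target]
    total = np.zeros((2, 2), dtype=np.int64)
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        # Rows = actual class, columns = predicted class.
        total += confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])
    return total


# Synthetic data with no real signal (0 = Default, 1 = Fully Paid).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "income_k": rng.normal(60, 15, n),  # income in thousands, made up
    "dti": rng.uniform(0, 40, n),
    "term": rng.choice(["36 months", "60 months"], n),
    "loan_status": rng.choice([0, 1], n, p=[0.2, 0.8]),
})
cm = repeated_split_confusion(toy, "loan_status", ["term"])
print(cm)
```

Summing over repeats is one way to report a single confusion matrix for all 10 splits; averaging per-split metrics would be an equally reasonable choice.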

#### Results

In [1]:
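import pandas as pd
# Confusion matrix. The precision formula in the next cell implies
# rows = actual class, columns = predicted class (0 = Default, 1 = Fully Paid).
lr = pd.DataFrame([[124,861929],[137,3128630]])
lr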

|   | 0 | 1 |
|---|---|---|
| 0 | 124 | 861929 |
| 1 | 137 | 3128630 |


In [2]:
print('precision: {}'.format(lr[1][1]/(lr[1][0]+lr[1][1])))
print('accuracy: {}'.format((lr[0][0]+lr[1][1])/(lr[0][0]+lr[0][1]+lr[1][0]+lr[1][1])))

precision: 0.7840079547752583
accuracy: 0.7839877518905889


The results show a slight improvement over the bank's model (78.40% vs. 78.38% precision), but the improvement is negligible. In effect, the logistic regression found no pattern that separates 'Default' from 'Fully Paid' loans.
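
To make the "no pattern" reading concrete, the recall on the Default class can be derived from the same confusion matrix. This is a small sketch that assumes rows are actual classes and columns are predicted classes, with 0 = Default and 1 = Fully Paid, which is the orientation implied by the precision formula above; the `binary_metrics` helper is hypothetical.

```python
def binary_metrics(cm):
    """Metrics from cm[actual][predicted], with 0 = Default and 1 = Fully Paid."""
    (default_caught, default_missed), (paid_flagged, paid_kept) = cm
    total = default_caught + default_missed + paid_flagged + paid_kept
    return {
        # Of the loans predicted "Fully Paid", how many really were.
        "precision_paid": paid_kept / (paid_kept + default_missed),
        "accuracy": (default_caught + paid_kept) / total,
        # Of the loans that really defaulted, how many the model flagged.
        "recall_default": default_caught / (default_caught + default_missed),
    }


lr_metrics = binary_metrics([[124, 861929], [137, 3128630]])
for name, value in lr_metrics.items():
    print(f"{name}: {value:.6f}")
```

A recall on defaults that is close to zero means the classifier flags almost no actual defaults, even though precision and accuracy look similar to the baseline.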

#### Since we have more than a million data points, training an SVM would take too much computation time, so we will go straight to the neural network approach.

### 4.2 Fully Connected Nets

#### Goal

Train a neural net on the features to see whether non-linear relationships among them produce a better classification than logistic regression.

#### Methodology

Simple fully connected networks are chosen for this dataset. As before, the categorical input features are converted to numeric values through one-hot encoding. I experimented with the following structures:
    - Number of Dense layers:
        - 1
        - 2
        - 3
        - 4
    - Hidden layer activation:
        - sigmoid
        - tanh
        - relu
        - elu
        - selu
    - Dimensionality of the output space of each layer (256, 128, etc.)
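
The search space above can be enumerated with a few lines of Python. The result tables in this section list fewer starting dimensions for deeper networks (8 columns for one hidden layer, down to 5 for four), which would be consistent with each hidden layer being half as wide as the previous one. That halving rule is only a guess read off the tables, not something the write-up states, and the helper names are hypothetical.

```python
from itertools import product

ACTIVATIONS = ["sigmoid", "tanh", "relu", "elu", "selu"]
FIRST_LAYER_DIMS = [256, 128, 64, 32, 16, 8, 4, 2]


def layer_widths(first_dim, n_layers):
    # ASSUMPTION: each hidden layer is half as wide as the one before it.
    return [first_dim // 2**i for i in range(n_layers)]


def configurations(n_layers):
    """Yield (activation, widths) pairs whose last hidden layer has >= 2 units."""
    for activation, first_dim in product(ACTIVATIONS, FIRST_LAYER_DIMS):
        widths = layer_widths(first_dim, n_layers)
        if widths[-1] >= 2:
            yield activation, widths


counts = {n: len(list(configurations(n))) for n in range(1, 5)}
print(counts)
```

Each `(activation, widths)` pair would then be turned into a stack of Dense layers and trained on the same 70:30 split.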

#### Results

All settings with sigmoid and tanh activations produce an accuracy of around 78.4% (the exact value depends on the random train_test_split), which matches the share of fully paid loans and suggests that these networks label every loan as "Fully Paid". Relu, elu, and selu mostly behaved oddly: they marked all loans as "Default", which gives a very low accuracy of around 21.6%. The accuracy usually converges after 1-2 epochs. This shows that the FCN also failed to distinguish "Default" from "Fully Paid" loans.
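
The two accuracy values that appear in the tables below are complementary, which is what one would expect from classifiers that output a single constant label. The following sketch reproduces both numbers on made-up labels with the same class balance; the label list is synthetic and the helper is hypothetical.

```python
def constant_prediction_accuracy(labels, predicted_label):
    """Accuracy of a classifier that always outputs `predicted_label`."""
    return sum(1 for y in labels if y == predicted_label) / len(labels)


# Synthetic labels with roughly the class balance seen in the results.
labels = ["Fully Paid"] * 7843 + ["Default"] * 2157

acc_all_paid = constant_prediction_accuracy(labels, "Fully Paid")
acc_all_default = constant_prediction_accuracy(labels, "Default")
print(acc_all_paid, acc_all_default)
```

Seeing exactly these two values across architectures is a hint that the networks collapsed to a constant output rather than learning a decision boundary.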

##### 1 hidden layer

In [3]:
layer1data = [['sigmoid', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['tanh', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['relu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157],['elu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.7843],['selu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.7843, 0.7843, 0.7843]]
l1 = pd.DataFrame(layer1data, columns = ['Activation', 'dim=256', 'dim=128', 'dim=64','dim=32','dim=16', 'dim=8','dim=4','dim=2'])
l1

|   | Activation | dim=256 | dim=128 | dim=64 | dim=32 | dim=16 | dim=8 | dim=4 | dim=2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | sigmoid | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 1 | tanh | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 2 | relu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 |
| 3 | elu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.7843 |
| 4 | selu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.7843 | 0.7843 | 0.7843 |


##### 2 hidden layers

In [4]:
layer2data = [['sigmoid', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['tanh', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['relu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157],['elu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.7843],['selu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.7843]]
l2 = pd.DataFrame(layer2data, columns = ['Activation', 'dim=256', 'dim=128', 'dim=64','dim=32','dim=16', 'dim=8','dim=4'])
l2

|   | Activation | dim=256 | dim=128 | dim=64 | dim=32 | dim=16 | dim=8 | dim=4 |
|---|---|---|---|---|---|---|---|---|
| 0 | sigmoid | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 1 | tanh | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 2 | relu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 |
| 3 | elu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.7843 |
| 4 | selu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.7843 |


##### 3 hidden layers

In [5]:
layer3data = [['sigmoid', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['tanh', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['relu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157],['elu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157],['selu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157, 0.2157]]
l3 = pd.DataFrame(layer3data, columns = ['Activation', 'dim=256', 'dim=128', 'dim=64','dim=32','dim=16', 'dim=8'])
l3

|   | Activation | dim=256 | dim=128 | dim=64 | dim=32 | dim=16 | dim=8 |
|---|---|---|---|---|---|---|---|
| 0 | sigmoid | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 1 | tanh | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 2 | relu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 |
| 3 | elu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 |
| 4 | selu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 |


##### 4 hidden layers

In [6]:
layer4data = [['sigmoid', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['tanh', 0.7843, 0.7843, 0.7843, 0.7843, 0.7843], ['relu', 0.2157, 0.2157, 0.2157, 0.2157, 0.2157],['elu', 0.2157, 0.2157, 0.2157, 0.2157, 0.7843],['selu', 0.2157, 0.2157, 0.2157, 0.7843, 0.2157]]
l4 = pd.DataFrame(layer4data, columns = ['Activation', 'dim=256', 'dim=128', 'dim=64','dim=32','dim=16'])
l4

|   | Activation | dim=256 | dim=128 | dim=64 | dim=32 | dim=16 |
|---|---|---|---|---|---|---|
| 0 | sigmoid | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 1 | tanh | 0.7843 | 0.7843 | 0.7843 | 0.7843 | 0.7843 |
| 2 | relu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.2157 |
| 3 | elu | 0.2157 | 0.2157 | 0.2157 | 0.2157 | 0.7843 |
| 4 | selu | 0.2157 | 0.2157 | 0.2157 | 0.7843 | 0.2157 |


### 4.3 Logistic Regression + Autoencoder-Trained Categorical Features

#### Goal

Since the first 2 models failed to beat the bank's current model, let's see if we can improve the results by training an autoencoder on the categorical data to discover hidden relationships between the categories.

#### Methodology

The idea of training autoencoder for categorical data is taken from this blog: https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9

Several deprecated Keras calls from the blog post had to be rewritten. Multiple autoencoder structures were tested.

#### Results

In [7]:
ae = pd.DataFrame([[116,862473],[136,3128095]])
ae

|   | 0 | 1 |
|---|---|---|
| 0 | 116 | 862473 |
| 1 | 136 | 3128095 |


In [8]:
print('precision: {}'.format(ae[1][1]/(ae[1][0]+ae[1][1])))
print('accuracy: {}'.format((ae[0][0]+ae[1][1])/(ae[0][0]+ae[0][1]+ae[1][0]+ae[1][1])))

precision: 0.7838721204600447
accuracy: 0.7838516896276956


As we can see, the results did not improve. Whatever output structure is chosen for the autoencoder, the classifier labels nearly everything as "Fully Paid". This leads to the conclusion.

## 5- Conclusion

### After testing the dataset with these three models, I have reached a clear conclusion: among the loans that are approved by banks, the chance of a borrower defaulting is COMPLETELY RANDOM. Regardless of income level and credit history, every applicant has a non-negligible chance of defaulting. In other words, the existing bank model (whatever it uses) is already optimized: it is able to identify the high-risk applicants and reject them.

## 6- Some Additional Analysis

Instead of predicting whether a loan is going to default, the bank assigns each loan a grade and a sub-grade, much like how credit rating agencies such as S&P or Moody's assign bonds an alphabetical rating ("A", "B", "C", etc.). However, this is a tricky process to model, since a lower grade results in a higher interest rate, which in turn increases the chance of default (this creates a cycle, violating the DAG assumption in many models).
Nevertheless, the bank's existing grading system is quite accurate.

In [9]:
gd_df = pd.DataFrame(data = {'grade':['E', 'D', 'B', 'F', 'C', 'A', 'G'],'payback_rate':[0.5976554234203733, 0.675571154819941, 0.8530354337037632, 0.5335076923076923, 0.7565798326644662, 0.9326246273146014, 0.5138755980861244]}).set_index('grade').sort_index()
gd_df

| grade | payback_rate |
|---|---|
| A | 0.932625 |
| B | 0.853035 |
| C | 0.75658 |
| D | 0.675571 |
| E | 0.597655 |
| F | 0.533508 |
| G | 0.513876 |


In [10]:
sgd_df = pd.DataFrame({'sub_grade':['B2', 'G4', 'A1', 'C5', 'F2', 'F3', 'G2', 'E1', 'D5', 'A5', 'E2', 'C3', 'F5', 'C4', 'B1', 'F4', 'D2', 'G5', 'E5', 'C1', 'D1', 'A4', 'B5', 'G3', 'D4', 'C2', 'E3', 'G1', 'F1', 'A3', 'B4', 'D3', 'A2', 'E4', 'B3'],'payback_rate':[0.8754203649594796, 0.5291044776119403, 0.9633297903179051, 0.7178874577076038, 0.5338924790320363, 0.5352112676056338, 0.5016062413951354, 0.6234953655160844, 0.6438809626270316, 0.9084486781958252, 0.6095658230787091, 0.7560621928623218, 0.5075889524757402, 0.7296998462824628, 0.8842627795870934, 0.5063623510401939, 0.6838801903017578, 0.5492839090143218, 0.5603946920721333, 0.7940838632515068, 0.7028887978621725, 0.923348778763842, 0.815936706676854, 0.5226586102719033, 0.6556187487742695, 0.7753298354643167, 0.5939873929206401, 0.5026281208935611, 0.5619047619047619, 0.9375559716681593, 0.8378576489847778, 0.6729629817187421, 0.947410823424825, 0.5814758846492885, 0.8586563211101578]}).set_index('sub_grade').sort_index()
sgd_df

| sub_grade | payback_rate |
|---|---|
| A1 | 0.96333 |
| A2 | 0.947411 |
| A3 | 0.937556 |
| A4 | 0.923349 |
| A5 | 0.908449 |
| B1 | 0.884263 |
| B2 | 0.87542 |
| B3 | 0.858656 |
| B4 | 0.837858 |
| B5 | 0.815937 |


As we can see, there is a clear hierarchy in the percentage of "Fully Paid" loans across the loan grades and sub-grades. There are some mismatches at the lowest sub-grade levels, particularly in the "G" grade. However, it should be noted that "G" grade loans make up only a small percentage of the whole dataset (<0.08% for each sub-grade). Our classification models managed to identify a small, negligible number of bad loans, which are most likely "G" grade loans.
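
The mismatches mentioned above can be located mechanically by looking for adjacent sub-grades where the nominally worse one has the higher payback rate. The sketch below uses the E, F, and G values from the sub-grade cell above, rounded to 4 decimals; the `rank_inversions` helper is a hypothetical name.

```python
# Payback rates for the E, F and G sub-grades, rounded to 4 decimals.
payback_rate = {
    "E1": 0.6235, "E2": 0.6096, "E3": 0.5940, "E4": 0.5815, "E5": 0.5604,
    "F1": 0.5619, "F2": 0.5339, "F3": 0.5352, "F4": 0.5064, "F5": 0.5076,
    "G1": 0.5026, "G2": 0.5016, "G3": 0.5227, "G4": 0.5291, "G5": 0.5493,
}


def rank_inversions(rates):
    """Adjacent sub-grade pairs where the worse grade has the higher rate."""
    ordered = sorted(rates)  # lexicographic order: E1..E5, F1..F5, G1..G5
    return [
        (better, worse)
        for better, worse in zip(ordered, ordered[1:])
        if rates[worse] > rates[better]
    ]


inversions = rank_inversions(payback_rate)
print(inversions)
```

Only adjacent pairs are compared here; a stricter check could compare every pair of sub-grades, at the cost of reporting the same anomaly several times.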

## 7- Future Work

The challenges encountered in this project can be summarized in the following points:
    1. The target bank model to beat is already an optimized one, and this was only discovered after building all the models.
    2. The nature of the data made it impossible to do a regression instead of a "Paid" vs "Default" classification. The only indication of the default rate is the "grade" and "sub-grade" that the bank assigns to each loan, which is subjective rather than objective.
    3. Even if we could train a 100% accurate regression model that reflects the default rate of each loan, it would be equivalent to reproducing the bank's existing grading model, which is useless to any interested stakeholder.
    4. Some papers suggest that banks use a graphical model approach to estimate loan default rates. Upon further review, these most likely concern corporate loans, since some of the data regarding corporate loans is unmeasurable/unreportable (e.g. industry outlook, corporate outlook). However, banks require complete information for personal loans, which means the graphical model approach would not be effective (no need for inference, just train P(Default|features)).

Therefore, future work should focus on acquiring a complete dataset that includes both approved and rejected loans (assuming that all rejected loans would have defaulted), to see how the features relate to one another in causing default. On the other hand, as regulations have tightened since the financial crisis, the existing loan evaluation systems at financial institutions are quite mature. In conclusion, banks have become a lot more responsible in issuing loans since the subprime mortgage crisis.