# Challenge : Predicting Loan pre-delinquency using Deep learning classifier

## Introduction
in this challenge i will be working with Carbo'ns loan disbursement dataset to predict whether or not a loan will be defaulted on or paid back.The dataset contains details about clients, their location ,  income, employment status and other features used to determine whether a loan should be given or not. The goal of this challenge is  to build both a random forest classifier and a neural network model. Once the models have been trained and tested,i will then evaluate and compare them and explain which is best. in this notebook i will be building a deep learning Neural network to predict loan defaults
This notebook is organised as follows :
1. Data Upload
2. Data preparation
3. Building the models
5. Model Evaluation and Comparision
6. Conclusion



In [0]:
#import some packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
% matplotlib inline

In [0]:
#upload data
df = pd.read_csv('cleanData')


In [0]:
df.shape

(159589, 33)

## 2. Data preparation
in this section i will perform several operations to prepare my data for training. i will handle missing values, one-hot-encode categorical data, balance the classes, select relevant features i deem fit and split my data into a training and testing set.

The data has already been cleaned and missing values taken care of in the first notebook, so i will be skipping this action.

### One-hot-encoding
Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. so i will be performing One-hot-encoding on certain categorical features. before this i will select my X features and create my ouput Y(Label).

In [0]:
#select Features
X = df[['clientIncome', 'incomeVerified', 'clientAge',
       'clientGender', 'clientMaritalStatus', 'clientLoanPurpose',
       'clientResidentialStatus', 'clientState', 'clientTimeAtEmployer',
       'clientNumberPhoneContacts', 'clientAvgCallsPerDay','loanNumber','loanAmount',
       'interestRate', 'loanTerm', 'max_amount_taken', 'max_tenor_taken','settleDays', 'firstPaymentRatio','firstPaymentDefault']]
Y = df['loanDefault']

In [0]:
#one hot encode categorical features
X = pd.get_dummies(X,columns=['clientMaritalStatus','incomeVerified','clientResidentialStatus','clientGender','clientState','clientLoanPurpose'])

### Train and Test split
i will Split the data into training set (70%), and test set (30%). Training set will be used to fit the model, and test set will be to evaluate the best model.

In [0]:
# Using Skicit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train,Y_train,X_labels,Y_labels = train_test_split(X, Y, test_size = 0.3, random_state = 42)

### Balance the Classes
Classification problems in most real world applications have imbalanced data sets. In other words, the positive examples (minority class) are a lot less than negative examples (majority class). in our dataset we have 72% (no default)negative values and 28% positive(loan default).class imbalance influences a learning algorithm during training by making the decision rule biased towards the majority class, that optimizes the model to make predictions based on the majority class in the dataset. so i am going  to balance the data set to achieve a model that is able to generalize and make good predictions on the minority class.

## 3. Buliding The Models.
The next step is to build the model. i will build a  neural network, that performs a binary classification on each loan. Also i will perform hyper parameter tuning on the model to improve the performance. The following Metrics, will be used to evaluate my final model.

1.  **Accuracy** :
It’s the ratio of the correctly labeled subjects to the whole pool of subjects.
Accuracy is the most intuitive one.
2. **Precision** :
Precision is the ratio of the correctly +ve labeled by our program to all +ve labeled.
3. **Recall ** :
Recall is the ratio of the correctly +ve labeled by our program to all who are diabetic in reality.
4. ** F1-score**
F1 Score is the weighted average of precision and recall.
5. ** Auc ** :If you randomly chose one positive and one negative observation, AUC represents the likelihood that your classifier will assign a higher predicted probability to the positive observation.

**General Flow**:
I will first build a base model, then do hyper parameter tuning, select the model with the best parameters. then use class balancing techiques  on the best model to build a model that predicts the minority class with higher accuracy(recall) i.e improving the Recall of the model.

### Deep learning model
I will be buidling a NN classifer using Keras
write about the features

In [0]:
#import required packages
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils

#import evaluation metrics
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score



Using TensorFlow backend.


**Build base model**

In [0]:
# create model
model = Sequential()

In [0]:
# Add an input layer 
model.add(Dense(512, activation='relu', input_dim=104))

# Add one hidden layer 
model.add(Dense(8, activation='relu'))

# Add an output layer 
model.add(Dense(1, activation='sigmoid'))

Instructions for updating:
Colocations handled automatically by placer.


In [0]:
# Compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [0]:
                   
#fit model
base = model.fit(X_train, X_labels,epochs=2, batch_size=1, verbose=1)

Epoch 1/2
Epoch 2/2


**Evaluate model**

In [0]:
#predict with model
y_pred = model.predict(Y_train)

In [0]:
#accuracy
accuracy_score(y_pred,Y_labels)

0.7235206884307706

In [0]:
#recall
recall_score(Y_labels,y_pred)

0.0

In [0]:
#Precision
precision_score(Y_labels,y_pred)

  'precision', 'predicted', average, warn_for)


0.0

In [0]:
#f1 score
f1_score(Y_labels,y_pred)

  'precision', 'predicted', average, warn_for)


0.0

### Hyperparameter tuning
i will be using a third party hyperparameter optimization tool to do hyperparameter tuning of my deep learning model. Keras can be combined with Hyperopt to do hyper parameter tuning.
Hyperas is  very simple convenience wrapper around hyperopt for fast prototyping with keras models. Hyperas lets you use the power of hyperopt without having to learn the syntax of it. Instead, just define your keras model as you are used to, but use a simple template notation to define hyper-parameter ranges to tune.

In [0]:
!pip install hyperas

Collecting hyperas
  Downloading https://files.pythonhosted.org/packages/04/34/87ad6ffb42df9c1fa9c4c906f65813d42ad70d68c66af4ffff048c228cd4/hyperas-0.4.1-py3-none-any.whl
Collecting prompt-toolkit<2.1.0,>=2.0.0 (from jupyter-console->jupyter->hyperas)
[?25l  Downloading https://files.pythonhosted.org/packages/f7/a7/9b1dd14ef45345f186ef69d175bdd2491c40ab1dfa4b2b3e4352df719ed7/prompt_toolkit-2.0.9-py3-none-any.whl (337kB)
[K    100% |████████████████████████████████| 337kB 27.1MB/s 
[31mipython 5.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.4, but you'll have prompt-toolkit 2.0.9 which is incompatible.[0m
Installing collected packages: hyperas, prompt-toolkit
  Found existing installation: prompt-toolkit 1.0.15
    Uninstalling prompt-toolkit-1.0.15:
      Successfully uninstalled prompt-toolkit-1.0.15
Successfully installed hyperas-0.4.1 prompt-toolkit-2.0.9


In [0]:
from hyperas import optim
from hyperas.distributions import choice, uniform
from hyperopt import Trials, STATUS_OK, tpe

i would like to reference the source of this code :https://github.com/maxpumperla/hyperas
,without him this would not be possible.

In [0]:
def data():
    '''
    Data providing function:
    This function is separated from model() so that hyperopt
    won't reload data for each evaluation run.
    '''
    X_train,Y_train,X_labels,Y_labels = train_test_split(X, Y, test_size = 0.3, random_state = 42)

In [0]:
#Hyper parameter tunning 
def create_model(X_train, Y_train, X_labels, Y_labels):
    """
    Model providing function:

    Create Keras model with double curly brackets dropped-in as needed.
    Return value has to be a valid python dictionary with two customary keys:
        - loss: Specify a numeric evaluation metric to be minimized
        - status: Just use STATUS_OK and see hyperopt documentation if not feasible
    The last one is optional, though recommended, namely:
        - model: specify the model just created so that we can later use it again.
    """
    model = Sequential()
    model.add(Dense(512,input_dim=104)))
    model.add(Activation('relu'))
    model.add(Dropout({{uniform(0, 1)}}))
    model.add(Dense({{choice([256, 512, 1024])}}))
    model.add(Activation({{choice(['relu', 'sigmoid'])}}))
    model.add(Dropout({{uniform(0, 1)}}))

    # If we choose 'four', add an additional fourth layer
    if {{choice(['three', 'four'])}} == 'four':
        model.add(Dense(100))

        # We can also choose between complete sets of layers

        model.add({{choice([Dropout(0.5), Activation('linear')])}})
        model.add(Activation('relu'))

    model.add(Dense(10))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
                  optimizer={{choice(['rmsprop', 'adam', 'sgd'])}})

    result = model.fit(X_train, X_labels,
              batch_size={{choice([64, 128])}},
              epochs=2,
              verbose=2,
              validation_split=0.1)
    #get the highest validation accuracy of the training epochs
    validation_acc = np.amax(result.history['val_acc']) 
    print('Best validation acc of epoch:', validation_acc)
    return {'loss': -validation_acc, 'status': STATUS_OK, 'model': model}


if __name__ == '__main__':
    best_run, best_model = optim.minimize(model=create_model,
                                          data=data,
                                          algo=tpe.suggest,
                                          max_evals=5,
                                          trials=Trials())
    X_train, Y_train, X_labels, Y_labels = data()
    print("Evalutation of best performing model:")
    print(best_model.evaluate(Y_train, Y_labels))
    print("Best performing model chosen hyper-parameters:")
    print(best_run)

**Balance the classes and train on the best performing model**

Oversampling and undersampling both have major drawbacks like prone to overfitting in oversampling minority class and loss of information by reducing the majority class.
the technique i will be using to balance the data set are:

* **Synthetic Minority Oversampling Technique (SMOTE):** It over-samples the minority class but using synthesized examples. It operates on feature space not the data space. 
i will be using alibrary - imbalanced learn to perform this operations

In [0]:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import RobustScaler

In [0]:
#perform SMOTE)
#first scale the data
scalar = RobustScaler()
X_train = scalar.fit_transform(X_train)
Y_train = scalar.fit_transform(Y_train)

In [0]:
# Build model with Smote
smote = SMOTE(random_state=42,ratio='minority')

In [0]:
#res means resampled
X_train_res, X_labels_res = smote.fit_sample(X_train, X_labels)

In [0]:
#train on resampled data
model.fit(X_train_res, X_labels_res,epochs=3, batch_size=1, verbose=1)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f609a8f6358>

**Evaluate model performance using Metrics**
* this is the performance of the model after balancing the classes

In [0]:
#predict test labels
y_pred = model.predict(Y_train)

In [0]:
#recall
recall_score(Y_labels,y_pred.round())

0.740197930044572

In [0]:
#Precision
precision_score(Y_labels,y_pred.round())

0.7148172466622893

In [0]:
#f1 score
f1_score(Y_labels,y_pred.round())

0.7272862232779097

In [0]:
#AUC
roc_auc_score(y_pred.round(),Y_labels)

0.8070867035184436

In [0]:
#accuracy
accuracy_score(y_pred.round(),Y_labels)

0.8465233828351818

## 4.  Model Evaluation and Comparision

Comparing the performance of models at training and testing.
The business problem being to reduce the risk of Carbon losing money due to clients defaulting on loans, My focus was to build a model that was geared towards having a High recall(predicting positvie values as positive) and High AUC. After building the base model , i noticed that the recall,precision and Auc where all Zeros . i then balanced the data set using SMOTE technique to oversample the minority class. the final model had the highest** Precision : 0.71,** **Accuracy : 0.84**,
**Recall : 0.74**, and an **AUC : of  0.81**. i Know further tuning could be performed on the model but that was not possible for me since, everytime i tried installing my hyperas library to perform hyperparameter tuning, my environment kept crashing.

### Conclusion

The NN classifer out performs the Random forest Classifer in every performance metric except the Auc. further tuning has not been performed but still i was able to get a Recall of 0.74 far better than the 0.64 i got from my Random forest classifier after tuning.with the focus on reducing loan defaults , my NN classifier is the go to model since it has a better recall than the Random forest Classifer.