## Implementation

Now that we have an overview of the data, we will implement a Bayesian neural network using the dropout technique (Monte Carlo dropout) to predict profit rates for the Lending Club data. The model is built with `keras`, a deep learning library for Python. For pre-processing our data we use `sklearn` as well as `imblearn`.

In [7]:
import pandas as pd
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
from keras import Model as mcModel
from keras import Input as mcInput
from keras.models import Sequential
from keras.layers import Dropout
from keras.layers import Dense
from keras.regularizers import l2
from sklearn.metrics import mean_squared_error
import numpy as np
import pickle

We start by reading in our data sets. We have prepared the data by splitting it into a test set with 40k rows and multiple training sets with 60k, 100k, 150k, 200k and 240k rows, respectively. This way, every model is evaluated on the same test data and the results are comparable. For splitting the data, we used the function `train_test_split()` from `sklearn`.
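One way to obtain a fixed test set plus training sets of increasing size is sketched below. This is an illustration only: it uses a plain NumPy permutation instead of `train_test_split()`, scales the row counts down by a factor of 1000, and assumes that each smaller training set is a subset of the larger ones, which the text above does not state. The helper name `nested_split_indices` is hypothetical.

```python
import numpy as np

def nested_split_indices(n_rows, test_size, train_sizes, seed=0):
    """Return (test_idx, {size: train_idx}) drawn from one shuffled index vector."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    test_idx = order[:test_size]   # held-out rows, never used for training
    pool = order[test_size:]       # remaining rows available for training
    # Assumption: every training set is a prefix of the same shuffled pool,
    # so the smaller sets are contained in the larger ones.
    train_idx = {size: pool[:size] for size in train_sizes}
    return test_idx, train_idx

# Toy sizes standing in for 40k test rows and 60k to 240k training rows.
test_idx, train_idx = nested_split_indices(
    n_rows=280, test_size=40, train_sizes=[60, 100, 150, 200, 240])
```

The index arrays could then be used with `DataFrame.iloc` to write each subset to its own CSV file.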

For easy access, we create a list of the file names of the data sets. Then we read in each file using a for loop, and save the data in a dictionary using the file name as a key.

In [2]:
data_sets = ["test_set_40k", "train_set_60k", "train_set_100k", 
             "train_set_150k", "train_set_200k", "train_set_240k"]

data = {} 
for item in data_sets:
    file = pd.read_csv("data/" + item + ".csv")
    data[item] = file

First, we implement a function to pre-process the data according to our findings in the data exploration section. The first step is to drop rows with missing values. As mentioned in the data exploration section, our data set is imbalanced. One way to balance a data set is SMOTE (Synthetic Minority Over-sampling Technique), provided by `imblearn`, which generates synthetic data points to oversample the minority class. However, SMOTE is designed for classification and needs class labels, so it cannot be applied directly to a regression target. For this reason, we temporarily replace our target variable with a binary class: positive profit (class 1) or not (class 0). While oversampling on these classes, we keep the profit rate as an ordinary feature, so each synthetic row also receives a profit rate. After oversampling, we drop the classes and restore the profit rate as the target variable. Since SMOTE is only applied to training data, the function has a boolean parameter `smote`; for test data we set it to `False` and simply select our features and target variable.

Lastly, we scale the data using `StandardScaler()` provided by `sklearn`.
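As a rough illustration of what standardization does to the target, the sketch below reproduces the transformation by hand in NumPy (subtract the mean, divide by the population standard deviation) on made-up profit rates. It also shows how an error measured on the scaled target could be converted back to the original units. The values and the helper `mse_in_original_units` are assumptions for this example and are not part of the notebook.

```python
import numpy as np

# Made-up profit rates, one column like train_data[['profit_rate']].
profit_rate = np.array([[-0.40], [0.05], [0.10], [0.15], [0.20]])

mean = profit_rate.mean(axis=0)
std = profit_rate.std(axis=0)   # population standard deviation (ddof=0)

scaled = (profit_rate - mean) / std   # zero mean, unit variance
restored = scaled * std + mean        # inverse transform

def mse_in_original_units(mse_scaled):
    # Errors on the scaled target shrink by a factor of std,
    # so squared errors shrink by std ** 2.
    return mse_scaled * std[0] ** 2

# A prediction that is off by 0.01 everywhere, expressed on the scaled axis.
scaled_pred = (profit_rate + 0.01 - mean) / std
mse_scaled = np.mean((scaled_pred - scaled) ** 2)
mse_original = mse_in_original_units(mse_scaled)
```

Because the target is standardized, an MSE close to 1 on the scaled data is roughly what predicting the mean for every row would give.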

In [9]:
def preprocess(train_data, smote = True):
    train_data = train_data.dropna()
    
    if smote:
        features = train_data[['int_rate','loan_amnt', 'purpose', 'dti','term','grade', 'profit_rate']]
        features = pd.get_dummies(features)
        target = [1 if i > 0 else 0 for i in features['profit_rate']]
        
        # Current imblearn API; older releases used ratio= and fit_sample()
        sm = SMOTE(random_state = 43, sampling_strategy = 'minority')
        x_feat_1, x_targ_1 = sm.fit_resample(features, target)
        
        x_features = pd.DataFrame(x_feat_1)
        x_features.columns = features.columns
        x_target = x_features[["profit_rate"]]
        x_features = x_features.drop("profit_rate", axis=1)
        
    else:
        x_features = train_data[['int_rate','loan_amnt', 'purpose', 'dti','term','grade']]
        x_features = pd.get_dummies(x_features)
        x_target = train_data[['profit_rate']]

    scaler_features = preprocessing.StandardScaler().fit(x_features)
    x_features = scaler_features.transform(x_features)
  
    scaler_target = preprocessing.StandardScaler().fit(x_target)
    x_target = scaler_target.transform(x_target)
  
    return x_features, x_target
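The oversampling trick in `preprocess()` can be illustrated without `imblearn`. The sketch below is a simplified stand-in for SMOTE, not the library's algorithm: real SMOTE pairs each minority row with one of its k nearest minority neighbours, whereas here the partner row is drawn at random. The data and the helper `interpolate_minority` are made up. The point is that the profit rate, carried along as a feature, is interpolated together with the other columns, so synthetic loss-making loans also get a negative profit rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data with columns [int_rate, profit_rate]:
# 8 profitable loans (majority) and 3 loss-making loans (minority).
majority = np.column_stack([rng.uniform(5, 15, 8), rng.uniform(0.01, 0.3, 8)])
minority = np.column_stack([rng.uniform(15, 25, 3), rng.uniform(-0.9, -0.1, 3)])

def interpolate_minority(points, n_new, rng):
    """Create n_new synthetic rows on line segments between random pairs of rows."""
    a = points[rng.integers(0, len(points), n_new)]
    b = points[rng.integers(0, len(points), n_new)]
    gap = rng.uniform(0, 1, (n_new, 1))   # position along the segment from a to b
    return a + gap * (b - a)

# Add enough synthetic minority rows to match the majority class size.
synthetic = interpolate_minority(minority, len(majority) - len(minority), rng)
balanced = np.vstack([majority, minority, synthetic])

# Split the regression target back out, as preprocess() does after oversampling.
x_features = balanced[:, [0]]
x_target = balanced[:, [1]]
```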

The following function defines the architecture of our model. It takes the following five parameters:
* `model_type`: the type of our model. For now, we only implement the `mcDropout` option, but this could easily be extended with more model types.
* `n_hidden`: a vector containing the number of neurons per hidden layer
* `input_dim`: the number of input dimensions, which is equal to the number of columns in our training set
* `dropout_prob`: the dropout probability for the dropout layers in the neural network
* `reg`: the strength of the L2 weight regularization (weight decay) applied to each dense layer

Next, we instantiate our model. We start with the input layer using the keras `Input()` function and pass the input dimensions. Using `Dropout()`, we apply the dropout technique to our inputs; calling the layer with `training=True` keeps dropout active at prediction time, which is what makes repeated predictions differ. We then instantiate a regular densely-connected layer using the `Dense()` function. Here we pass the dimensionality as an argument using the first entry of the `n_hidden` vector. We use the ReLU (rectified linear unit) activation function and attach an L2 weight regularizer to the layer (the argument is called `W_regularizer` in Keras 1 and `kernel_regularizer` from Keras 2 onward). This penalty corresponds to weight decay; its strength `reg` is computed in `model_runner()` below.

Then, for each remaining entry of the `n_hidden` vector, we add another dropout layer followed by a dense layer, as described above. Lastly, we create the output layer in the same way, with a single unit and no activation function.

In [10]:
def architecture(model_type, n_hidden, input_dim, dropout_prob, reg):
    if model_type == 'mcDropout':
        inputs = mcInput(shape=(input_dim,))
        # training=True keeps dropout active at prediction time as well
        inter = Dropout(dropout_prob)(inputs, training=True)
        # kernel_regularizer is the Keras 2 name for W_regularizer
        inter = Dense(n_hidden[0], activation='relu',
                      kernel_regularizer=l2(reg))(inter)
        for i in range(len(n_hidden) - 1):
            inter = Dropout(dropout_prob)(inter, training=True)
            inter = Dense(n_hidden[i+1], activation='relu',
                          kernel_regularizer=l2(reg))(inter)
        inter = Dropout(dropout_prob)(inter, training=True)
        outputs = Dense(1, kernel_regularizer=l2(reg))(inter)
        model3 = mcModel(inputs, outputs)
        return model3
    raise ValueError('Unknown model_type: ' + str(model_type))
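To make the layer ordering concrete, here is a minimal NumPy sketch of the forward pass that such a network performs: dropout, then a ReLU dense layer for each hidden layer, then dropout and a linear output unit. It is a hand-written illustration with random, untrained weights, not the Keras implementation; the helper names `dropout` and `mc_forward` and the toy input are assumptions. Dropout is written in the "inverted" form, where the surviving units are rescaled by `1 / (1 - rate)`.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def mc_forward(x, weights, biases, rate, rng):
    """Dropout -> Dense(relu) per hidden layer, then Dropout -> Dense(1)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(dropout(h, rate, rng) @ W + b, 0.0)
    return dropout(h, rate, rng) @ weights[-1] + biases[-1]

# Random (untrained) weights for a 6 -> 100 -> 100 -> 1 network.
input_dim, n_hidden = 6, [100, 100]
sizes = [input_dim] + n_hidden + [1]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(4, input_dim))
# Two passes over the same input use different dropout masks.
pass_1 = mc_forward(x, weights, biases, rate=0.2, rng=rng)
pass_2 = mc_forward(x, weights, biases, rate=0.2, rng=rng)
```

Because a fresh mask is drawn on every call, `pass_1` and `pass_2` differ even though the input and the weights are identical.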

In the following step, we define a function that runs our model. Before we do so, we pre-process our test data so we can use it as a default argument for our predictions.

The function `model_runner()` takes the following arguments:
* `X_train`/`y_train`: the training data sets
* `X_test`/`y_test`: the test data sets, using the 40k test set as a default (accepted for convenience; they are not used during fitting)
* `dropout_prob`: the dropout probability, which is then passed on to the model
* `n_epochs`: the number of epochs
* `tau`: the model precision, used in the regularization term
* `batch_size`: the size of the batches used for fitting the model
* `lengthscale`: the prior length scale
* `n_hidden`: a vector containing the number of neurons per hidden layer, which is then passed on to the model

We define the input dimension to equal the number of columns in the training set. We also define a value `N`, which corresponds to the number of rows in the training set. Together with the length scale, the dropout probability and `tau`, it determines the regularization strength `reg`. We then build the model using the `architecture()` function implemented above. `compile()` configures the model for training: we use stochastic gradient descent as the optimizer, mean squared error as the loss, and mean absolute error as an additional metric. We then train our model using the function `fit()`, where we pass our training data, as well as the batch size and the number of epochs. The argument `verbose = 1` results in a progress bar being shown during training.

In [25]:
test_features, test_target = preprocess(data['test_set_40k'], smote=False)
def model_runner(X_train, y_train, X_test=test_features, y_test=test_target,
                 dropout_prob=0.20, n_epochs=100, tau=1.0, batch_size=500,
                 lengthscale=1e-2, n_hidden=[100,100]):

    input_dim = X_train.shape[1]   # number of feature columns
    N = X_train.shape[0]           # number of training rows
    # Weight decay derived from length scale, dropout probability, N and tau
    reg = lengthscale**2 * (1 - dropout_prob) / (2. * N * tau)

    print('McDropout NN fit')

    model_mc_dropout = architecture(model_type = 'mcDropout',
                                    n_hidden=n_hidden, input_dim=input_dim,
                                    dropout_prob=dropout_prob, reg=reg)
    model_mc_dropout.compile(optimizer='sgd', loss='mse', metrics=['mae'])
    # epochs is the Keras 2 name for nb_epoch
    model_mc_dropout.fit(X_train, y_train, batch_size=batch_size, epochs=n_epochs, verbose=1)

    return model_mc_dropout
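As a quick worked example of the regularization formula used in `model_runner()`, the snippet below evaluates `reg` for the default hyperparameters and the nominal training set sizes. The actual `N` will differ, since `preprocess()` drops rows with missing values and adds synthetic rows, so the numbers are only indicative. The function name `weight_decay` is made up for this illustration.

```python
def weight_decay(n_rows, lengthscale=1e-2, dropout_prob=0.20, tau=1.0):
    """Same expression as `reg` in model_runner()."""
    return lengthscale ** 2 * (1 - dropout_prob) / (2.0 * n_rows * tau)

# Nominal training set sizes; larger sets give a smaller penalty.
for n_rows in [60_000, 100_000, 150_000, 200_000, 240_000]:
    print(n_rows, weight_decay(n_rows))

reg_60k = weight_decay(60_000)   # 1e-4 * 0.8 / 120000, roughly 6.7e-10
```

With these defaults the penalty is very small, and it shrinks further as the training set grows or as `tau` increases.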


It is now time to use the model to make predictions. We implement another function called `predictor()`, which takes the following arguments:
* `model_mc_dropout`: the dropout model
* `X_test`/`y_test`: the test data
* `T`: the number of predictions made for each observation

We run the prediction T times in a for loop. Because dropout stays active at prediction time, each pass gives a slightly different result. Every pass is added to a list called `probs_mc_dropout`, so we end up with a list of T arrays, each holding one prediction per row of the test set. In other words, for each row in the test set, we now have a set of T predictions, enabling us to calculate uncertainty. The function prints the mean squared error of the predictive mean and returns the raw predictions.

In [26]:
def predictor(model_mc_dropout,
              X_test=test_features, y_test = test_target, T = 1000):
    probs_mc_dropout = []
    for _ in range(T):
        # One stochastic forward pass over the whole test set
        probs_mc_dropout.append(model_mc_dropout.predict(X_test, verbose=1))
    predictive_mean = np.mean(probs_mc_dropout, axis=0)
    # Spread of the T passes per row (computed here, not returned)
    predictive_variance = np.var(probs_mc_dropout, axis=0)
    mse_mc_dropout = mean_squared_error(y_test, predictive_mean)
    print(mse_mc_dropout)

    return probs_mc_dropout
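The way T stochastic passes turn into a mean and an uncertainty estimate per test row can be sketched as follows. The function `stochastic_predict` is a hypothetical stand-in for `model_mc_dropout.predict()` with dropout active, and the interval assumes that the T predictions for a row are roughly normally distributed, which is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_predict(X):
    """Hypothetical stand-in for a predict() call with dropout active."""
    signal = X.sum(axis=1, keepdims=True)
    return signal + rng.normal(0, 0.1, size=signal.shape)

X_test = rng.normal(size=(5, 3))   # 5 toy test rows with 3 features
T = 200

# Stack the T passes into one array of shape (T, n_rows, 1).
samples = np.stack([stochastic_predict(X_test) for _ in range(T)])

predictive_mean = samples.mean(axis=0)   # point prediction per row
predictive_std = samples.std(axis=0)     # spread of the T passes per row

# Rough 95% band per row under the normality assumption.
lower = predictive_mean - 1.96 * predictive_std
upper = predictive_mean + 1.96 * predictive_std
```

Rows with a wide band are the ones on which the stochastic passes disagree most, and these are the predictions the model is least certain about.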

In order to evaluate the results, we save the predictions for each test run in a pickle file. This is done by the following function. The necessary parameters are:
* `predictions`: the list of T prediction arrays returned by `predictor()`
* `filename`: a name for the file the predictions are saved in; the file is written as `pred_<filename>.pickle`

In [27]:
def pred_to_pickle(predictions, filename):
    # The with-statement closes the file even if pickling fails
    with open('pred_'+filename+'.pickle','wb') as outfile:
        pickle.dump(predictions, outfile)
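For the evaluation step, the saved predictions have to be read back in. A minimal round-trip sketch, assuming the same list-of-arrays layout that `predictor()` returns, is shown below; the toy predictions and the temporary file path are made up for the example.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-in for predictor() output: T = 4 passes over 3 test rows.
predictions = [np.full((3, 1), float(t)) for t in range(4)]

path = os.path.join(tempfile.mkdtemp(), 'pred_example.pickle')
with open(path, 'wb') as outfile:
    pickle.dump(predictions, outfile)

# Read the file back and stack the passes into shape (T, n_rows, 1).
with open(path, 'rb') as infile:
    loaded = np.array(pickle.load(infile))

per_row_mean = loaded.mean(axis=0)   # average over the T passes
```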

We now iterate over the data dictionary and make predictions for the test set, fitting the model on a different data set each time. We save the results, which will be used for evaluation in the following section.

In [28]:
# Note: `data` also contains "test_set_40k", so a model is fitted on it as well.
for filename, training_set in data.items():
    features, target = preprocess(training_set)
    model = model_runner(features, target)
    probs = predictor(model)
    pred_to_pickle(probs, filename)

McDropout NN fit


  


Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

0.985710879949668
