# 2. Modeling

This worksheet focuses on training different kinds of models that predict the outcomes of the loans.

In [1]:
import pandas as pd
import numpy as np
from joblib import dump

np.random.seed(1151)

## Preparing data for modeling

In order to start training models we need to choose which features to use. After that we check for null values and deal with categorical values by one-hot encoding them. The last step is to create the input and target values.

In [2]:
df = pd.read_csv("data/train_data.csv")

# Columns that are used as inputs in the models.
input_cols = ['NewCreditCustomer', 'VerificationType', 'LanguageCode', 'Age', 'Gender',
                 'Amount', 'Interest', 'LoanDuration', 'MonthlyPayment',
                 'Education', 'EmploymentDurationCurrentEmployer', 'HomeOwnershipType', 'IncomeTotal',
                 'ExistingLiabilities', 'LiabilitiesTotal', 'Rating',
                 'CreditScoreEeMini', 'NoOfPreviousLoansBeforeLoan', 'AmountOfPreviousLoansBeforeLoan',
                 'PreviousRepaymentsBeforeLoan', 'PreviousEarlyRepaymentsBefoleLoan', 'PreferLoan']

df = df[input_cols]

In [3]:
# Check for null values.
df.isna().sum()

NewCreditCustomer                    0
VerificationType                     0
LanguageCode                         0
Age                                  0
Gender                               0
Amount                               0
Interest                             0
LoanDuration                         0
MonthlyPayment                       0
Education                            0
EmploymentDurationCurrentEmployer    0
HomeOwnershipType                    0
IncomeTotal                          0
ExistingLiabilities                  0
LiabilitiesTotal                     0
Rating                               0
CreditScoreEeMini                    0
NoOfPreviousLoansBeforeLoan          0
AmountOfPreviousLoansBeforeLoan      0
PreviousRepaymentsBeforeLoan         0
PreviousEarlyRepaymentsBefoleLoan    0
PreferLoan                           0
dtype: int64

In [4]:
# One-hot encoding.
df = pd.get_dummies(df)
df

Unnamed: 0,NewCreditCustomer,VerificationType,LanguageCode,Age,Gender,Amount,Interest,LoanDuration,MonthlyPayment,Education,...,EmploymentDurationCurrentEmployer_UpTo4Years,EmploymentDurationCurrentEmployer_UpTo5Years,Rating_A,Rating_AA,Rating_B,Rating_C,Rating_D,Rating_E,Rating_F,Rating_HR
0,False,1.0,1,35,0.0,4250.0,20.80,36,173.86,4.0,...,0,1,0,0,0,1,0,0,0,0
1,False,1.0,1,38,1.0,1380.0,13.99,36,51.76,4.0,...,0,0,0,0,1,0,0,0,0,0
2,False,4.0,1,33,0.0,1275.0,19.62,36,51.39,4.0,...,1,0,0,0,0,1,0,0,0,0
3,False,1.0,1,55,0.0,635.0,16.58,36,24.62,4.0,...,0,0,0,0,1,0,0,0,0,0
4,True,1.0,3,24,0.0,2126.0,23.22,36,87.31,4.0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42116,False,4.0,1,67,0.0,635.0,11.52,36,23.06,4.0,...,0,0,1,0,0,0,0,0,0,0
42117,True,4.0,1,30,0.0,1063.0,12.72,6,187.06,1.0,...,0,1,1,0,0,0,0,0,0,0
42118,True,4.0,1,31,1.0,1805.0,36.27,60,71.55,4.0,...,0,0,0,0,0,0,0,1,0,0
42119,True,4.0,1,24,1.0,2339.0,22.79,24,127.77,4.0,...,0,1,0,0,0,1,0,0,0,0


In [5]:
# Defining the input and target values.
X = df.drop('PreferLoan', axis=1)
y = df['PreferLoan']

An imbalance of preferred loans and not preferred loans can be seen, but it should not be a problem.

In [6]:
y.value_counts()

0    32135
1     9986
Name: PreferLoan, dtype: int64

## Training the models

We create a dictionary that consists of different models and hyperparameters for those models. After that we use RandomizedSearchCV for finding the optimal hyperparameters. In this instance we prefer it to GridSearchCV because it is a lot faster. However with GridSearchCV it is possible to do a more exhaustive search and therefore potentially get more accurate hyperparameters.

We create a dictionary of 3 different models that are commonly used for such classification tasks. Parameters with different possible values are also included.

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

scale_pos_weight = y.value_counts()[1]/y.value_counts()[0]

model_params = {
    'random_forest' : {
        'model' : RandomForestClassifier(),
        'params' : {
            'max_depth': [50, 75, 100, 125, 150, 175, 200, None],
            'max_features': ['auto', 'sqrt', 'log2', 1, 2, 3, 4, 5, 6],
            'min_samples_leaf': [1, 2, 3, 4, 8, 10, 12, 15],
            'min_samples_split' : [1, 2, 3, 4, 8, 10, 12, 15],
            'n_estimators': [10, 50, 75, 100, 125, 150, 175, 200],
            'criterion': ['gini', 'entropy']
        }
    },
    'logistic_regression' : {
        'model' : LogisticRegression(),
        'params' : {
            'penalty': ['none', 'l1', 'l2', 'elasticnet'],
            'solver' : ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'],
            'C' : [0.01, 0.05, 0.1, 0.25, 0.5, 0,75, 1, 1.25, 2, 5, 10],
            'max_iter' : [10, 50, 75, 100, 150, 200, 250, 300]
        }
    },
    'xgboost' : {
        'model' : xgb.XGBClassifier(),
        'params' : {
            "max_depth": [1, 2, 3, 4, 5, 7],
            "gamma": [0, 0.1, 0.25, 0.5, 1],
            "reg_lambda": [0, 1, 5, 10, 15, 25, 50],
            "subsample": [0.7, 0.8, 0.9],
            "colsample_bytree": [0.1, 0.25, 0.5, 0.75, 1],
            'scale_pos_weight ' : [scale_pos_weight, 1, 3, 5, 10, 15, 20]
        }
    }
}

We create an empty list "scores" in which we append the training results and best parameters. We also use three different scoring functions to train the models as it is interesting to see how models with different scoring functions perform in the validation step. After that models with the best parameters are saved so they can be used in the validation step.

In [8]:
# List for keeping track of training results.
scores = []
score_func = ['precision', 'f1', 'roc_auc']

In [None]:
for s in score_func:
    for m_name, m_params in model_params.items():
        model_name = m_name + "_" + s

        clf = RandomizedSearchCV(m_params['model'], m_params['params'], cv = 3, return_train_score = False, scoring = s, n_iter=25, n_jobs=-1)
        clf.fit(X, y)

        scores.append({
            'model': model_name,
            'best_score': clf.best_score_,
            'best_params': clf.best_params_
        })
        dump(clf, f'models/{model_name}.joblib')

The results can be seen below. The mean cross-validated score of the best model can be seen for each model and scoring function. It is hard to draw any conclusions from these results because each scoring function is calculated differently. And we have yet to seen how these models perform when using different threshold levels for the classification probability.

In [10]:
pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

Unnamed: 0,model,best_score,best_params
0,random_forest_precision,0.762117,"{'n_estimators': 100, 'min_samples_split': 3, ..."
1,logistic_regression_precision,0.640689,"{'solver': 'liblinear', 'penalty': 'l1', 'max_..."
2,xgboost_precision,0.640899,"{'subsample': 0.9, 'scale_pos_weight ': 3, 're..."
3,random_forest_f1,0.319817,"{'n_estimators': 10, 'min_samples_split': 3, '..."
4,logistic_regression_f1,0.204101,"{'solver': 'newton-cg', 'penalty': 'none', 'ma..."
5,xgboost_f1,0.369104,"{'subsample': 0.8, 'scale_pos_weight ': 10, 'r..."
6,random_forest_roc_auc,0.741491,"{'n_estimators': 175, 'min_samples_split': 4, ..."
7,logistic_regression_roc_auc,0.690164,"{'solver': 'newton-cg', 'penalty': 'l2', 'max_..."
8,xgboost_roc_auc,0.739972,"{'subsample': 0.9, 'scale_pos_weight ': 15, 'r..."
