# Goal: predict whether a loan will default or not

---
#### Target variable: `zeroBalCode` 
* Type: **Categorical** 
* Model type: Classification 
* Data: 
    - "0" means "Closed" (i.e. a successful outcome for Fannie Mae)
    - "1" means "Default" (i.e. a negative outcome)

---
#### Inputs that we want to build text boxes/etc for:
* TBD

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#!pip install pycaret
from pycaret.classification import *
#from pycaret.regression import *

from sklearn.feature_selection import VarianceThreshold

# Importing the data

In [51]:
dforig=pd.read_csv("data/MLReady/FM_FULL_EPOCH2_MLReady.csv")
df=dforig.copy()
df.head()
print(f'Epoch 2: {df.origYear.unique().tolist()}')

Epoch 2: [2009, 2010, 2011, 2012, 2013]


In [52]:
# Get the data into the data types you want:
df = df.astype({
    'origLTV':'int'
    , 'numBorrowers':'int'
    , 'origDebtIncRatio':'int'
    , 'borrCreditScore':'int'
    , 'mortInsType':'int'
    , 'bestCreditScore':'int'
    , 'worstCreditScore':'int'
    , 'avgCreditScore':'int'
    , 'zeroBalCode':'object'}
)

# Pre-Processing: Feature Elimination

Work through the following steps to eliminate features:
* Step 1: Run a Pandas Profiling Report
* Step 2: Remove the index
* Step 3: Remove features that have zero to low variance
* Step 4: Remove features that are not part of either a "closed" event or a "default" event
* Step 5: Remove any unique identifiers that are not helpful in predicting the target

### Step 1: Run a Pandas Profiling Report

In [60]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report - Before")

### Step 2: Remove the index

In [53]:
# The only column to remove in this step is the exported index, 'Unnamed: 0'
df.drop(['Unnamed: 0'], axis=1, inplace=True)

### Step 3: Have zero to low variance

In [54]:
X = df.loc[:, df.columns != 'zeroBalCode']
y = df['zeroBalCode']

rows, cols = X.shape
print(f'There are currently {cols} features and {rows} rows')
print(type(X))

There are currently 30 features and 119696 rows
<class 'pandas.core.frame.DataFrame'>
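
The notebook imports `VarianceThreshold` but the cells shown here do not call it. As a minimal sketch of how a zero-variance check could look, assuming a toy frame with a made-up `constant` column (the values and the column are hypothetical, not taken from the loan data):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame (hypothetical values): 'constant' never changes, so its variance is 0.
toy = pd.DataFrame({
    "origLTV": [87, 52, 80, 70, 76],
    "numBorrowers": [1, 2, 1, 1, 2],
    "constant": [1, 1, 1, 1, 1],
})

# threshold=0.0 drops only the features whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)

# get_support() returns a boolean mask of the features that were kept.
kept = toy.columns[selector.get_support()].tolist()
dropped = toy.columns[~selector.get_support()].tolist()
print(f"kept: {kept}")
print(f"dropped: {dropped}")
```

Raising `threshold` above zero would also drop near-constant features; the right cutoff depends on each feature's scale, so it is probably best chosen per data set.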


In [55]:
dfOptimized = df.copy()

dfOptimized.drop([
        'borrCreditScore'
        , 'bestCreditScore'
        , 'avgCreditScore'
        # For Tableau, not model prediction:
        , 'rateDiffAbove'
        , 'rateDiffBelow'
        , 'rateDiffAvg'
        , 'rateDiffBelowPct'
        , 'rateDiffAvgPct'
        , 'fmacRateMax'
        , 'fmacRateMin'
        , 'fmacRateAvg'
        , 'mSA'
        , 'fmacRateVolatility'
        , 'fredRate'
        , 'mortInsType'
    ]
    , axis=1
    , inplace=True
)

dfOptimized.head()

Unnamed: 0,origChannel,origIntRate,origUPB,origLTV,numBorrowers,origDebtIncRatio,loanPurp,zipCode,pMIperct,worstCreditScore,bankNumber,stateNumber,zeroBalCode,rateDiffAbovePct,origYear,origMonth
0,2,5.125,348000,87,1,50,2,51,25.0,689,80,49,1,-0.02381,2009,2
1,3,4.625,195000,52,2,54,1,82,0.0,703,4,32,0,-0.119048,2009,2
2,2,4.875,342000,80,1,54,1,981,0.0,746,3,50,0,-0.071429,2009,2
3,1,5.375,93000,70,1,50,1,496,0.0,780,54,23,1,0.02381,2009,2
4,1,4.875,182000,76,2,22,1,18,0.0,776,45,20,0,-0.071429,2009,2


# Encoding

Label encoding converts each category (like "Texas") to a fixed number (like 30), which you can later map back to the original value with a lookup.

One-hot encoding creates a separate 0/1 column for each distinct value, so it works well for dichotomous *Yes/No* features.

Rule of thumb: Use label encoding when you have a large # of distinct values in your categorical feature, and use one-hot encoding when you have a small # of distinct values
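
To make the difference concrete, here is a small sketch using pandas only. The `state` and `firstTimeBuyer` columns and their values are hypothetical and are not part of the loan data set:

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["Texas", "Ohio", "Texas", "Utah"],  # many distinct values in real data
    "firstTimeBuyer": ["Y", "N", "N", "Y"],       # hypothetical Yes/No column
})

# Label encoding: one integer per distinct value, plus a lookup to reverse it.
codes, labels = pd.factorize(toy["state"])
toy["stateNumber"] = codes
lookup = dict(enumerate(labels))

# One-hot encoding: one 0/1 column per distinct value.
one_hot = pd.get_dummies(toy["firstTimeBuyer"], prefix="firstTimeBuyer", dtype=int)

print(lookup)
print(one_hot.columns.tolist())
```

Label encoding keeps the frame narrow but implies an ordering that may not exist; one-hot encoding avoids that at the cost of one extra column per distinct value.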

In [56]:
# What is the variance?
import statistics as stat
for c in dfOptimized.columns.tolist():
    varianceNb = stat.variance(dfOptimized[c])
    print(f'{c}: {varianceNb}')

origChannel: 0.8790166775255016
origIntRate: 0.3563562677994478
origUPB: 15948893967.594942
origLTV: 241.3171404652037
numBorrowers: 0.2544270766865069
origDebtIncRatio: 107.87451262617431
loanPurp: 0.2500110113673438
zipCode: 103990.91998733362
pMIperct: 41.47753906342591
worstCreditScore: 2400.1981184373512
bankNumber: 607.1906651982089
stateNumber: 258.4066966439463
zeroBalCode: 0.11768339548176236
rateDiffAbovePct: 0.008391784161827086
origYear: 1.794211791486425
origMonth: 11.151953164734818


# Assumptions
1. Assumption 1: Fannie Mae states that they do not buy "jumbo loans" (treated in this notebook as an original loan of $417,000 or more). However, the data set does contain some of those, so they are removed

1. Assumption 2: `loanPurp == 3` has only 1 record in Epoch 2 - removing all `loanPurp == 3`

1. Assumption 3: `numBorrowers == 3` has only 158 rows - removing all `numBorrowers == 3`


In [57]:
# Assumption 1: Keep only loans whose original unpaid balance (origUPB) is below $417,000
rows, cols = dfOptimized.shape
print('#############################################')
print('Assumption 1: Remove jumbo loans')
print(f'   - Before removing: {rows}')

dfAssumption1 = dfOptimized[dfOptimized['origUPB'] < 417000]
rows2, cols2 = dfAssumption1.shape
print(f'   - After removing: {rows2}')
print(f'   - Net removed: {rows - rows2}')
print('#############################################')

#############################################
Assumption 1: Remove jumbo loans
   - Before removing: 119696
   - After removing: 109120
   - Net removed: 10576
#############################################
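
Assumptions 2 and 3 could be applied with the same filtering pattern. The sketch below uses a toy stand-in frame with hypothetical values, and the names `df_assumption2` and `df_assumption3` are illustrative rather than variables from this notebook:

```python
import pandas as pd

# Hypothetical stand-in for dfAssumption1 with only the two relevant columns.
toy = pd.DataFrame({
    "loanPurp":     [1, 2, 3, 1, 2],
    "numBorrowers": [1, 2, 1, 3, 2],
})

# Assumption 2: drop the rare loanPurp == 3 rows.
df_assumption2 = toy[toy["loanPurp"] != 3]

# Assumption 3: drop the rare numBorrowers == 3 rows.
df_assumption3 = df_assumption2[df_assumption2["numBorrowers"] != 3]

print(f"rows before: {len(toy)}, after: {len(df_assumption3)}")
```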


# Normalization
https://pycaret.org/normalization/

> `normalize: bool, default = False` - When set to True, the feature space is transformed using the normalized_method param. **Generally, linear algorithms perform better with normalized data** however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
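
As a rough illustration of what normalization does to the feature space, the sketch below applies z-score scaling by hand to two columns from the `head()` output above. It assumes z-score scaling is the method in use; PyCaret's own implementation may differ in its details:

```python
import pandas as pd

# Values copied from the first five rows shown earlier in this notebook.
toy = pd.DataFrame({
    "origUPB": [348000, 195000, 342000, 93000, 182000],
    "origIntRate": [5.125, 4.625, 4.875, 5.375, 4.875],
})

# z-score: subtract each column's mean and divide by its standard deviation.
# ddof=0 uses the population standard deviation.
zscored = (toy - toy.mean()) / toy.std(ddof=0)

print(zscored.round(3))
```

After scaling, both columns have mean 0 and standard deviation 1, so `origUPB` (hundreds of thousands) no longer dwarfs `origIntRate` (single digits).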

In [58]:
dfAssumption1.dtypes

origChannel           int64
origIntRate         float64
origUPB               int64
origLTV               int32
numBorrowers          int32
origDebtIncRatio      int32
loanPurp              int64
zipCode               int64
pMIperct            float64
worstCreditScore      int32
bankNumber            int64
stateNumber           int64
zeroBalCode          object
rateDiffAbovePct    float64
origYear              int64
origMonth             int64
dtype: object

In [47]:
model_setup = setup(
    dfAssumption1
    , target = 'zeroBalCode' # PyCaret will list this as "Label"
    , pca = False 
    , ignore_low_variance = True # Variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
    , normalize = True
    , ignore_features = None
    , handle_unknown_categorical = True
    , remove_outliers = True # outliers from the training data are removed using PCA linear dimensionality reduction using the Singular Value Decomposition technique.
    , bin_numeric_features = [
            'origIntRate'
            , 'origUPB'
            , 'origLTV'
            , 'origDebtIncRatio'
            , 'worstCreditScore'
        ] # The numeric features in this list are binned into categorical features using K-Means
    , feature_selection = True
    , silent = True
    , profile = False
    , categorical_features = [
            'origChannel'
            , 'numBorrowers'
            , 'loanPurp'
            , 'zipCode'
            , 'bankNumber'
            , 'stateNumber'
            , 'origYear'
            , 'origMonth'
        ]
    , numeric_features = [
            'origIntRate'
            , 'origUPB'
            , 'origLTV'
            , 'pMIperct'
            , 'origDebtIncRatio'
            , 'worstCreditScore'
        ]
)

# session_id - to reproduce these results later, pass a fixed session_id to setup();
#      it acts as the random seed, so setup() will use the same test/train split



# Choice: Light Gradient Boosting Machine 
Other models had slightly better AUC or Recall but took far longer to train. We were able to get `lightgbm` to run much faster and with fewer system resources than XGBoost and others, without losing much accuracy/precision.

In [48]:
gbm = create_model(
    'lightgbm'
    , ensemble = True
    , method = 'Boosting'
)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,0.8494,0.7575,0.1177,0.3991,0.1818,0.1252
1,0.8484,0.7453,0.1316,0.3992,0.1979,0.1371
2,0.852,0.7469,0.1288,0.4306,0.1983,0.1421
3,0.8524,0.7645,0.1163,0.4286,0.183,0.1302
4,0.851,0.7645,0.1151,0.4109,0.1798,0.1255
5,0.8519,0.7395,0.1248,0.4265,0.1931,0.1377
6,0.8529,0.7523,0.1359,0.4414,0.2078,0.1511
7,0.8549,0.7596,0.1318,0.4612,0.205,0.1514
8,0.8508,0.7575,0.126,0.4174,0.1936,0.1367
9,0.8513,0.7561,0.1302,0.4253,0.1994,0.1422


---

# For `zeroBalCode`, we want the highest possible Recall score
Recall ranges from 0 (lowest) to 1 (highest = perfect recall). It is the share of actual defaults that the model correctly flags.

If Recall is low, the deployed model will miss most of the loans that actually default when it is run against newer/incoming data, even if its Accuracy looks good.

Our target variable is categorical/dichotomous (0 / 1), and about 85% of the rows are "0". If we train on a random sample with that imbalance, the model leans toward predicting "0" and ends up with a low Recall score.

To fix this, use an oversampling technique: rebalance the training data to a 50/50 split so the model can learn to differentiate the two classes better.

https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/plot_sampling_strategy_usage.html

```python
from imblearn.over_sampling import RandomOverSampler

```
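
To see why Accuracy alone is misleading on imbalanced data, consider a model that predicts "Closed" for every loan. The sketch below uses a hypothetical set of 100 loans with an 85/15 split and plain Python only:

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of actual positives (defaults) that the model caught."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical labels: 85 "Closed" (0) loans and 15 "Default" (1) loans.
y_true = [0] * 85 + [1] * 15
# A naive model that always predicts "Closed".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall(tp, fn):.2f}")
```

The naive model scores 85% Accuracy while catching none of the defaults, which is why Recall is the metric to watch here.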

In [69]:
# imblearn needs your label to be an int instead of object
dfAssumption1 = dfAssumption1.astype({"zeroBalCode": int})

dfAssumption1.dtypes

origChannel           int64
origIntRate         float64
origUPB               int64
origLTV               int32
numBorrowers          int32
origDebtIncRatio      int32
loanPurp              int64
zipCode               int64
pMIperct            float64
worstCreditScore      int32
bankNumber            int64
stateNumber           int64
zeroBalCode           int32
rateDiffAbovePct    float64
origYear              int64
origMonth             int64
dtype: object

# Oversampling Tests - DO NOT USE

In [106]:
# First, let's split our data into X and y:
X = dfAssumption1.loc[:, dfAssumption1.columns != 'zeroBalCode']
y = dfAssumption1['zeroBalCode']

rows, cols = X.shape
print(f'There are currently {cols} features and {rows} rows')
print(type(X))

There are currently 15 features and 109120 rows
<class 'pandas.core.frame.DataFrame'>


In [107]:
# https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
from imblearn.over_sampling import RandomOverSampler

# sampling_strategy='minority'
# sampling_strategy=0.5
ros = RandomOverSampler(sampling_strategy='minority')
X_res, y_res = ros.fit_resample(X, y)

In [111]:
print('#############################################')
print('Before oversampling: "Closed" crushes "Default" and causes issues:')
print(y.value_counts())
print('')
print('After oversampling: "Closed" and "Default" are equal')
print(y_res.value_counts())

#############################################
Before oversampling: "Closed" crushes "Default" and causes issues:
0    93322
1    15798
Name: zeroBalCode, dtype: int64

After oversampling: "Closed" and "Default" are equal
1    93322
0    93322
Name: zeroBalCode, dtype: int64


# Now set up the test/train

In [121]:
dfAssumption1.head()

Unnamed: 0,origChannel,origIntRate,origUPB,origLTV,numBorrowers,origDebtIncRatio,loanPurp,zipCode,pMIperct,worstCreditScore,bankNumber,stateNumber,zeroBalCode,rateDiffAbovePct,origYear,origMonth
0,2,5.125,348000,87,1,50,2,51,25.0,689,80,49,1,-0.02381,2009,2
1,3,4.625,195000,52,2,54,1,82,0.0,703,4,32,0,-0.119048,2009,2
2,2,4.875,342000,80,1,54,1,981,0.0,746,3,50,0,-0.071429,2009,2
3,1,5.375,93000,70,1,50,1,496,0.0,780,54,23,1,0.02381,2009,2
4,1,4.875,182000,76,2,22,1,18,0.0,776,45,20,0,-0.071429,2009,2


In [123]:
from sklearn.model_selection import train_test_split

training_features, test_features, training_target, test_target = train_test_split(
    dfAssumption1.drop(['zeroBalCode'], axis=1)
    , dfAssumption1['zeroBalCode']
    , test_size = .1
    , random_state=12
)

In [125]:
# Further split the training data into training/validation
x_train, x_val, y_train, y_val = train_test_split(
    training_features
    , training_target
    , test_size = .1
    , random_state=12
)

In [127]:
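# For the training data only, randomly oversample the minority class
# (fit_sample was removed from newer imblearn releases; fit_resample replaces it)
ros = RandomOverSampler(sampling_strategy='minority')
x_train_res, y_train_res = ros.fit_resample(x_train, y_train)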
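
Conceptually, random oversampling duplicates minority-class rows until the classes are balanced. The sketch below imitates that with pandas on a hypothetical ten-row training set; it is an illustration of the idea, not imblearn's implementation:

```python
import pandas as pd

# Hypothetical imbalanced training set: 8 "Closed" (0) rows and 2 "Default" (1) rows.
train = pd.DataFrame({
    "origLTV": [87, 52, 80, 70, 76, 65, 90, 60, 95, 97],
    "zeroBalCode": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train["zeroBalCode"] == 0]
minority = train[train["zeroBalCode"] == 1]

# Draw extra minority rows with replacement to close the gap between the classes.
extra = minority.sample(
    n=len(majority) - len(minority), replace=True, random_state=12
)
train_balanced = pd.concat([train, extra], ignore_index=True)

print(train_balanced["zeroBalCode"].value_counts())
```

Only the training split is resampled. The validation and test splits keep their original class balance so that the reported scores reflect what the model would see on incoming data.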

# Testing Random Forest Classifier

In [130]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(x_train_res, y_train_res)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=25,
                       n_jobs=None, oob_score=False, random_state=12, verbose=0,
                       warm_start=False)

In [167]:
# https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

# https://beckernick.github.io/oversampling-modeling/

# Do your validation results match your test results closely?
from sklearn.metrics import recall_score

def GetResults(classify_result, classifier):
    """Print accuracy and recall for a fitted classifier on the validation and test sets."""
    def pct(value):
        # Multiply before rounding to avoid floating-point noise such as 15.837100000000001
        return round(value * 100, 5)

    print('##############################################')
    print(f'# {classifier}')
    print(f'Validation Results for {classifier}')
    print(f'   - Accuracy: {pct(classify_result.score(x_val, y_val))}%')
    print(f'   - Recall: {pct(recall_score(y_val, classify_result.predict(x_val)))}%')
    print('\nTest Results')
    print(f'   - Accuracy: {pct(classify_result.score(test_features, test_target))}%')
    print(f'   - Recall: {pct(recall_score(test_target, classify_result.predict(test_features)))}%')
    print('##############################################')
    print('')

# Let's try GradientBoostingClassifier

In [148]:
from sklearn.ensemble import GradientBoostingClassifier

clf_gbc = GradientBoostingClassifier(n_estimators=25, random_state=12)
clf_gbc.fit(x_train_res, y_train_res)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=25,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=12, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

# LightGBM Classifier 

In [150]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

# Defaults to accuracy; change w `metric='auc'`
# boosting_type=BOOSTING
clf_lgbc = LGBMClassifier(n_estimators=25, random_state=12)
clf_lgbc.fit(x_train_res, y_train_res)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=25, n_jobs=-1, num_leaves=31, objective=None,
               random_state=12, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [168]:
# View results
GetResults(clf_rf, "Random Forest Classifier")
GetResults(clf_gbc, "Gradient Boosting Classifier")
GetResults(clf_lgbc, "LightGBM Classifier")

##############################################
# Random Forest Classifier
Validation Results for Random Forest Classifier
   - Accuracy: 84.42114%
   - Recall: 17.12864%

Test Results
   - Accuracy: 84.20088%
   - Recall: 15.8371%
##############################################

##############################################
# Gradient Boosting Classifier
Validation Results for Gradient Boosting Classifier
   - Accuracy: 63.88351%
   - Recall: 76.19048%

Test Results
   - Accuracy: 63.74633%
   - Recall: 77.8927%
##############################################

##############################################
# LightGBM Classifier
Validation Results for LightGBM Classifier
   - Accuracy: 66.37817%
   - Recall: 75.47974%

Test Results
   - Accuracy: 66.12903%
   - Recall: 77.69877%
##############################################



## Evaluate Models
F1 (a.k.a. F score) is the harmonic mean of Precision and Recall: the model with the score closest to 1.0 wins
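
As a worked example, F1 can be recomputed from the Precision and Recall columns of fold 0 in the LightGBM score grid earlier in this notebook (Prec. = 0.3991, Recall = 0.1177). The helper name `f1_from` is hypothetical:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Fold 0 of the LightGBM score grid: Prec. = 0.3991, Recall = 0.1177
fold0_f1 = round(f1_from(0.3991, 0.1177), 4)
print(fold0_f1)  # 0.1818, matching the F1 column for fold 0
```

Because the harmonic mean is pulled toward the smaller value, the low Recall drags F1 down even though Precision is about 0.40.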

SHAP - a way to evaluate ML models
* https://pycaret.org/interpret-model/
* https://shap.readthedocs.io/en/latest/

Examples: 
* https://medium.com/@shekharshashank1/2-words-code-to-compare-20-ml-regression-models-with-pycaret-8ed70c62a6b7
* https://www.linkedin.com/pulse/credit-card-fault-detection-using-pycaret-sagar-vasaikar/?articleId=6657385017775321088
* https://www.kaggle.com/frtgnn/pycaret-introduction-classification-regression
* https://prog.world/introducing-pycaret-an-open-low-code-python-machine-learning-library/
* https://towardsdatascience.com/announcing-pycaret-an-open-source-low-code-machine-learning-library-in-python-4a1f1aad8d46

# Calibrate the model
This function takes a trained estimator as input and performs probability calibration with sigmoid or isotonic regression. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 folds). The output of the original estimator and the calibrated estimator (created using this function) might not differ much. To see the calibration differences, use the 'calibration' plot in `plot_model` to compare before and after.

In [None]:
# Calibrate the trained and boosted LightGBM model (`gbm`, created above)
calibrated_gbm = calibrate_model(gbm)
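
For readers without PyCaret at hand, a comparable probability calibration can be sketched with scikit-learn's `CalibratedClassifierCV`. This is an assumption-laden stand-in: it uses synthetic data and a `GaussianNB` base model, and it is not necessarily how `calibrate_model` works internally:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data standing in for the loan data (about 85% class 0).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85], random_state=12
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12
)

# Wrap the base estimator; method can be "sigmoid" or "isotonic".
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Calibrated probability of the positive class for each test row.
proba = calibrated.predict_proba(X_test)[:, 1]
print(proba[:5].round(3))
```

Calibration mainly changes the predicted probabilities rather than the class ranking, which is consistent with the note above that the score grid may not differ much.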