<a href="https://colab.research.google.com/github/evroth/gsb545repo/blob/main/Loan_Default_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loan Default

Leverage deep learning to build an inference model that makes predictions on dictionaries of raw feature values, using FeatureSpace to index, preprocess, and encode the feature variables.

This project walks through an example of structured data classification, modeled on the Keras example "Structured Data Classification with FeatureSpace": https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/

## Libraries

In [101]:
import tensorflow as tf
import pandas as pd
from tensorflow import keras
from keras.utils import FeatureSpace

## Read in the Data

The dataset is enormous and consists of multiple deterministic factors such as the borrower's income, gender, and loan purpose. It is subject to strong multicollinearity and missing values. Each row is a loan that was given out. The file includes many variables, but we will select only a dozen or so that are most accessible and relevant to someone giving out a loan, so they can detect fraud.

In [102]:
df = pd.read_csv("Loan_Default.csv")

In [103]:
df.head()

Unnamed: 0,ID,year,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,Credit_Worthiness,open_credit,business_or_commercial,...,credit_type,Credit_Score,co-applicant_credit_type,age,submission_of_application,LTV,Region,Security_Type,Status,dtir1
0,24890,2019,cf,Sex Not Available,nopre,type1,p1,l1,nopc,nob/c,...,EXP,758,CIB,25-34,to_inst,98.728814,south,direct,1,45.0
1,24891,2019,cf,Male,nopre,type2,p1,l1,nopc,b/c,...,EQUI,552,EXP,55-64,to_inst,,North,direct,1,
2,24892,2019,cf,Male,pre,type1,p1,l1,nopc,nob/c,...,EXP,834,CIB,35-44,to_inst,80.019685,south,direct,0,46.0
3,24893,2019,cf,Male,nopre,type1,p4,l1,nopc,nob/c,...,EXP,587,CIB,45-54,not_inst,69.3769,North,direct,0,42.0
4,24894,2019,cf,Joint,pre,type1,p1,l1,nopc,nob/c,...,CRIF,602,EXP,25-34,not_inst,91.886544,North,direct,0,39.0


In [104]:
df.dtypes

ID                             int64
year                           int64
loan_limit                    object
Gender                        object
approv_in_adv                 object
loan_type                     object
loan_purpose                  object
Credit_Worthiness             object
open_credit                   object
business_or_commercial        object
loan_amount                    int64
rate_of_interest             float64
Interest_rate_spread         float64
Upfront_charges              float64
term                         float64
Neg_ammortization             object
interest_only                 object
lump_sum_payment              object
property_value               float64
construction_type             object
occupancy_type                object
Secured_by                    object
total_units                   object
income                       float64
credit_type                   object
Credit_Score                   int64
co-applicant_credit_type      object
a

Select only the variables of interest. Again, the purpose of this activity is to make use of FeatureSpace.

In [105]:
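# Keep only the columns of interest; .copy() gives an independent frame
# so later column assignments do not act on a view of the original data.
df = df[['loan_limit','Gender','approv_in_adv','loan_type','loan_purpose','business_or_commercial',
         'loan_amount','term','property_value','credit_type','Credit_Score',
         'age','submission_of_application','Region','Status']].copy()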

In [106]:
df.describe()

Unnamed: 0,loan_amount,term,property_value,Credit_Score,Status
count,148670.0,148629.0,133572.0,148670.0,148670.0
mean,331117.7,335.136582,497893.5,699.789103,0.246445
std,183909.3,58.409084,359935.3,115.875857,0.430942
min,16500.0,96.0,8000.0,500.0,0.0
25%,196500.0,360.0,268000.0,599.0,0.0
50%,296500.0,360.0,418000.0,699.0,0.0
75%,436500.0,360.0,628000.0,800.0,0.0
max,3576500.0,360.0,16508000.0,900.0,1.0


In [107]:
df = df.dropna()

In [121]:
df.describe()

Unnamed: 0,loan_amount,term,property_value,Credit_Score,Status
count,129458.0,129458.0,129458.0,129458.0,129458.0
mean,332009.4,335.233497,499532.4,699.637257,0.159774
std,182102.2,58.418701,361343.4,115.906588,0.366398
min,16500.0,96.0,8000.0,500.0,0.0
25%,196500.0,360.0,278000.0,599.0,0.0
50%,296500.0,360.0,418000.0,699.0,0.0
75%,436500.0,360.0,628000.0,800.0,0.0
max,3576500.0,360.0,16508000.0,900.0,1.0
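The two summaries above show the mean of Status falling from about 0.246 to about 0.160 after `dropna()`, which suggests that rows with missing values are disproportionately Status = 1. A minimal sketch of how that could be checked is below. The toy frame and its values are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, values are invented.
toy = pd.DataFrame(
    {
        "property_value": [100.0, None, 250.0, None, 300.0, 180.0],
        "Status": [0, 1, 0, 1, 1, 0],
    }
)

# Positive rate before and after dropping incomplete rows.
rate_before = toy["Status"].mean()
rate_after = toy.dropna()["Status"].mean()

# Share of rows with at least one missing value, split by label.
missing_by_status = toy.isna().any(axis=1).groupby(toy["Status"]).mean()

print(f"positive rate before dropna: {rate_before:.3f}")
print(f"positive rate after dropna:  {rate_after:.3f}")
print(missing_by_status)
```

If the missing share differs a lot between the two labels, dropping rows changes the problem the model sees, and imputation or a missing-value indicator may be worth considering instead.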


## Train and Validation Split

Make a dataset with this data:

We run into an error with the age column, so at first it looked like we would need to preprocess it outside the FeatureSpace, which would seem to defeat the purpose of FeatureSpace.

After playing around some more, I learned that the fix is simpler: the column just needs to be a string type, not an object type.

In [None]:
# Convert the categorical columns that raised errors to string type
df["age"] = df["age"].astype(str)
df["approv_in_adv"] = df["approv_in_adv"].astype(str)
df["loan_limit"] = df["loan_limit"].astype(str)
df["loan_purpose"] = df["loan_purpose"].astype(str)
df["submission_of_application"] = df["submission_of_application"].astype(str)

In [109]:
val_dataframe = df.sample(frac=0.2, random_state=42)
train_dataframe = df.drop(val_dataframe.index)

print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)

Using 103566 samples for training and 25892 for validation
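The split above samples 20% of all rows at random. Because Status is imbalanced, one alternative is to sample within each Status group so that both splits keep the same positive rate. The sketch below is an assumption about how that could be done with pandas, not what this notebook ran; the toy data is invented.

```python
import pandas as pd

# Toy data with an imbalanced label (values are made up).
toy = pd.DataFrame({"Credit_Score": range(500, 600), "Status": [1] * 20 + [0] * 80})

# Sample 20% within each Status group so both splits keep the same positive rate.
val = toy.groupby("Status", group_keys=False).sample(frac=0.2, random_state=42)
train = toy.drop(val.index)

print("train rows:", len(train), "validation rows:", len(val))
print("train positive rate:", train["Status"].mean())
print("validation positive rate:", val["Status"].mean())
```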


In [110]:
def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("Status")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds


train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

In [111]:
for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

Input: {'loan_limit': <tf.Tensor: shape=(), dtype=string, numpy=b'cf'>, 'Gender': <tf.Tensor: shape=(), dtype=string, numpy=b'Joint'>, 'approv_in_adv': <tf.Tensor: shape=(), dtype=string, numpy=b'pre'>, 'loan_type': <tf.Tensor: shape=(), dtype=string, numpy=b'type1'>, 'loan_purpose': <tf.Tensor: shape=(), dtype=string, numpy=b'p1'>, 'business_or_commercial': <tf.Tensor: shape=(), dtype=string, numpy=b'nob/c'>, 'loan_amount': <tf.Tensor: shape=(), dtype=int64, numpy=146500>, 'term': <tf.Tensor: shape=(), dtype=float64, numpy=180.0>, 'property_value': <tf.Tensor: shape=(), dtype=float64, numpy=178000.0>, 'credit_type': <tf.Tensor: shape=(), dtype=string, numpy=b'EXP'>, 'Credit_Score': <tf.Tensor: shape=(), dtype=int64, numpy=842>, 'age': <tf.Tensor: shape=(), dtype=string, numpy=b'25-34'>, 'submission_of_application': <tf.Tensor: shape=(), dtype=string, numpy=b'to_inst'>, 'Region': <tf.Tensor: shape=(), dtype=string, numpy=b'North'>}
Target: tf.Tensor(0, shape=(), dtype=int64)


Batch the datasets:

In [112]:
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
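As a quick sanity check on the batching above, the number of batches per epoch follows from the row counts printed earlier. This is plain arithmetic, assuming the final partial batch is kept, which is the default for `batch()` when `drop_remainder` is not set.

```python
import math

# Row counts printed by the split cell, and the batch size used above.
train_rows, val_rows, batch_size = 103566, 25892, 32

# The last, smaller batch is kept, so round up.
train_steps = math.ceil(train_rows / batch_size)
val_steps = math.ceil(val_rows / batch_size)

# Size of the final training batch.
last_train_batch = train_rows - (train_steps - 1) * batch_size

print(train_steps, val_steps, last_train_batch)
```

The training step count matches the `3237/3237` shown in the training logs further down.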

## Configuring the FeatureSpace

This step determines how each feature is preprocessed. We pass FeatureSpace a dictionary that names each variable and describes how it should be preprocessed.

In [113]:
df

Unnamed: 0,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,business_or_commercial,loan_amount,term,property_value,credit_type,Credit_Score,age,submission_of_application,Region,Status
0,cf,Sex Not Available,nopre,type1,p1,nob/c,116500,360.0,118000.0,EXP,758,25-34,to_inst,south,1
2,cf,Male,pre,type1,p1,nob/c,406500,360.0,508000.0,EXP,834,35-44,to_inst,south,0
3,cf,Male,nopre,type1,p4,nob/c,456500,360.0,658000.0,EXP,587,45-54,not_inst,North,0
4,cf,Joint,pre,type1,p1,nob/c,696500,360.0,758000.0,CRIF,602,25-34,not_inst,North,0
5,cf,Joint,pre,type1,p1,nob/c,706500,360.0,1008000.0,EXP,864,35-44,not_inst,North,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148665,cf,Sex Not Available,nopre,type1,p3,nob/c,436500,180.0,608000.0,CIB,659,55-64,to_inst,south,0
148666,cf,Male,nopre,type1,p1,nob/c,586500,360.0,788000.0,CIB,569,25-34,not_inst,south,0
148667,cf,Male,nopre,type1,p4,nob/c,446500,180.0,728000.0,CIB,702,45-54,not_inst,North,0
148668,cf,Female,nopre,type1,p4,nob/c,196500,180.0,278000.0,EXP,737,55-64,to_inst,North,0


As we can see above, we have several different types of variables: string categoricals such as "loan_limit", "Gender", and "approv_in_adv", and numerical features that we want to normalize, such as "loan_amount" and "property_value".

In [114]:
feature_space = FeatureSpace(
    features={
        # Categorical feature encoded as string
        "loan_limit": FeatureSpace.string_categorical(num_oov_indices=1),
        "Gender": FeatureSpace.string_categorical(num_oov_indices=1),
        "approv_in_adv": FeatureSpace.string_categorical(num_oov_indices=1),
        "loan_type": FeatureSpace.string_categorical(num_oov_indices=1),
        "loan_purpose": FeatureSpace.string_categorical(num_oov_indices=1),
        "business_or_commercial": FeatureSpace.string_categorical(num_oov_indices=0),
        "credit_type": FeatureSpace.string_categorical(num_oov_indices=1),
        "submission_of_application": FeatureSpace.string_categorical(num_oov_indices=1),
        "age": FeatureSpace.string_categorical(num_oov_indices=1),
        "Region": FeatureSpace.string_categorical(num_oov_indices=1),

        # Numerical features to normalize
        "loan_amount": FeatureSpace.float_normalized(),
        "property_value": FeatureSpace.float_normalized(),
        "Credit_Score": FeatureSpace.float_normalized(),
        "term": FeatureSpace.float_normalized()

    },

    output_mode="concat",
)

## Adapt FeatureSpace to Training Data

During adapt(), the FeatureSpace will:

- Index the set of possible values for categorical features.
- Compute the mean and variance for numerical features to normalize.
- Compute the value boundaries for the different bins for numerical features to discretize (not used here, since none of our features are discretized).

In [115]:
train_ds_with_no_labels = train_ds.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)
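To make the adapt step more concrete, here is a minimal pure-Python sketch of the two things it learns for our features: a vocabulary for a categorical column and a mean and variance for a numerical column. This illustrates the idea only and is not the Keras implementation. The helper names are hypothetical, and the alphabetical vocabulary order is a simplification.

```python
from statistics import fmean, pvariance


def build_vocabulary(values, num_oov_indices=1):
    """Map each distinct string to an index, reserving the first slots for unseen values."""
    tokens = sorted(set(values))
    return {token: i + num_oov_indices for i, token in enumerate(tokens)}


def lookup(vocabulary, value, num_oov_indices=1):
    """Return the index of a value; unseen values fall into the out-of-vocabulary slot 0."""
    if value in vocabulary:
        return vocabulary[value]
    if num_oov_indices == 0:
        raise KeyError(f"unseen value with no OOV slot: {value!r}")
    return 0


def fit_normalizer(values):
    """Learn the mean and population variance of a numeric column."""
    mean = fmean(values)
    return mean, pvariance(values, mu=mean)


def normalize(value, mean, variance):
    """Shift and scale a value using the learned statistics."""
    return (value - mean) / variance ** 0.5


# Made-up training values for illustration.
region_vocab = build_vocabulary(["North", "south", "North", "central"])
mean, variance = fit_normalizer([100.0, 200.0, 300.0])

print(region_vocab)
print(lookup(region_vocab, "North-East"))
print(round(normalize(300.0, mean, variance), 4))
```

The statistics are learned from the training data only, which is why the cell above adapts on `train_ds` rather than on the full dataframe.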

In [116]:
for x, _ in train_ds.take(1):
    preprocessed_x = feature_space(x)
    print("preprocessed_x.shape:", preprocessed_x.shape)
    print("preprocessed_x.dtype:", preprocessed_x.dtype)

preprocessed_x.shape: (32, 47)
preprocessed_x.dtype: <dtype: 'float32'>
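The width of 47 can be accounted for by adding up the encoded widths of the features. The sketch below assumes each string categorical is one-hot encoded (which I understand to be the default for `FeatureSpace.string_categorical`) and assumes the number of distinct categories per column listed in the code. Those counts are not shown in full anywhere in this notebook, so treat them as a plausible guess that happens to add up.

```python
# ASSUMED number of distinct categories per column (not read from the notebook output).
assumed_vocab_sizes = {
    "loan_limit": 2,
    "Gender": 4,
    "approv_in_adv": 2,
    "loan_type": 3,
    "loan_purpose": 4,
    "business_or_commercial": 2,
    "credit_type": 4,
    "submission_of_application": 2,
    "age": 7,
    "Region": 4,
}

# num_oov_indices as configured in the FeatureSpace cell above.
oov_slots = {name: 1 for name in assumed_vocab_sizes}
oov_slots["business_or_commercial"] = 0

# Each normalized numerical feature contributes one column.
numeric_features = ["loan_amount", "property_value", "Credit_Score", "term"]

one_hot_width = sum(size + oov_slots[name] for name, size in assumed_vocab_sizes.items())
total_width = one_hot_width + len(numeric_features)

print(one_hot_width, total_width)
```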


In [117]:
preprocessed_train_ds = train_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_train_ds = preprocessed_train_ds.prefetch(tf.data.AUTOTUNE)

preprocessed_val_ds = val_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_val_ds = preprocessed_val_ds.prefetch(tf.data.AUTOTUNE)

## Building the Model

In [123]:
dict_inputs = feature_space.get_inputs()
encoded_features = feature_space.get_encoded_features()

x = keras.layers.Dense(100, activation="relu")(encoded_features)
x = keras.layers.Dropout(0.5)(x)
predictions = keras.layers.Dense(1, activation="sigmoid")(x)

training_model = keras.Model(inputs=encoded_features, outputs=predictions)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["AUC"]
)

inference_model = keras.Model(inputs=dict_inputs, outputs=predictions)

## Training the Model

In [124]:
training_model.fit(
    preprocessed_train_ds, epochs=20, validation_data=preprocessed_val_ds, verbose=2
)

Epoch 1/20
3237/3237 - 8s - loss: 0.4284 - auc: 0.6269 - val_loss: 0.4156 - val_auc: 0.6594 - 8s/epoch - 2ms/step
Epoch 2/20
3237/3237 - 7s - loss: 0.4202 - auc: 0.6535 - val_loss: 0.4134 - val_auc: 0.6680 - 7s/epoch - 2ms/step
Epoch 3/20
3237/3237 - 7s - loss: 0.4180 - auc: 0.6595 - val_loss: 0.4132 - val_auc: 0.6696 - 7s/epoch - 2ms/step
Epoch 4/20
3237/3237 - 6s - loss: 0.4166 - auc: 0.6642 - val_loss: 0.4128 - val_auc: 0.6751 - 6s/epoch - 2ms/step
Epoch 5/20
3237/3237 - 6s - loss: 0.4150 - auc: 0.6697 - val_loss: 0.4109 - val_auc: 0.6745 - 6s/epoch - 2ms/step
Epoch 6/20
3237/3237 - 6s - loss: 0.4142 - auc: 0.6713 - val_loss: 0.4114 - val_auc: 0.6797 - 6s/epoch - 2ms/step
Epoch 7/20
3237/3237 - 7s - loss: 0.4129 - auc: 0.6753 - val_loss: 0.4088 - val_auc: 0.6822 - 7s/epoch - 2ms/step
Epoch 8/20
3237/3237 - 6s - loss: 0.4116 - auc: 0.6787 - val_loss: 0.4069 - val_auc: 0.6870 - 6s/epoch - 2ms/step
Epoch 9/20
3237/3237 - 7s - loss: 0.4104 - auc: 0.6809 - val_loss: 0.4066 - val_auc: 0.6

<keras.callbacks.History at 0x7efda8630bb0>

This is a decent model, with a validation ROC AUC just under 0.7, which is the minimum I would consider acceptable.
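For intuition about what that number means, ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition on made-up labels and scores. It is an exact pairwise calculation for illustration, not the thresholded approximation a metric running during training would use.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.30]

auc = roc_auc(labels, scores)
print(f"AUC: {auc:.4f}")
```

A value of 0.5 corresponds to random ranking, so an AUC near 0.7 means the model orders a positive above a negative roughly 7 times out of 10.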

We can use our model (which includes the FeatureSpace) to make predictions based on dicts of raw feature values, as follows:

In [127]:
sample = {

        "loan_limit": "cf",
        "Gender": "Male",
        "approv_in_adv": "nopre",
        "loan_type": "type2",
        "loan_purpose": "p3",
        "business_or_commercial": "nob/c",
        "credit_type": "EXP",
        "submission_of_application": "to_inst",
        "age": "25-34",
        "Region": "North",

        "loan_amount": 450000,
        "property_value": 500000,
        "Credit_Score": 700,
        "term": 360
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = inference_model.predict(input_dict)

print(
    f"This particular sample has a {100 * predictions[0][0]:.2f}% probability "
    "of being FRAUD, as evaluated by our model."
)

This particular sample has a 37.20% probability of being FRAUD, as evaluated by our model.


This is a fun project. It may take slightly longer to set up initially, but once it is built we save time when evaluating new cases we come across. A bank could use it to evaluate risk by entering the data for someone applying for a loan. An individual or organization could also apply it to their own data to decide which accounts to audit first: start with the riskiest and work down.

I like this because it can easily be expanded to iterate through many inputs, or automated to evaluate new cases as they are created. Something similar could be done with other types of data, such as streaming data.
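A rough sketch of the "many inputs" idea is below. It turns a list of per-applicant dicts into one dict of columns, which is the shape the single-sample cell builds before calling `tf.convert_to_tensor`, and then ranks applicants by predicted probability. The helper names, applicant IDs, and probabilities are all hypothetical; in practice the probabilities would come from `inference_model.predict`.

```python
def records_to_columns(records):
    """Turn a list of per-applicant dicts into one dict of equal-length column lists."""
    if not records:
        return {}
    keys = list(records[0])
    for record in records:
        if set(record) != set(keys):
            raise ValueError("every record must have the same feature names")
    return {key: [record[key] for record in records] for key in keys}


def rank_by_risk(ids, probabilities):
    """Pair each ID with its probability and sort from most to least risky."""
    return sorted(zip(ids, probabilities), key=lambda pair: pair[1], reverse=True)


# Two made-up applicants, using a subset of the notebook's feature names.
applicants = [
    {"Gender": "Male", "Credit_Score": 700, "loan_amount": 450000},
    {"Gender": "Joint", "Credit_Score": 640, "loan_amount": 210000},
]
columns = records_to_columns(applicants)

# Placeholder probabilities standing in for model output.
ranking = rank_by_risk(["A-1", "A-2"], [0.18, 0.41])

print(columns)
print(ranking)
```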

## Model 2

As in our labs, I want to play around with a few other specifications to see if I can create a better model. I will leave this one here for comparison, but I will keep only one so the notebook does not get messy.

In [131]:
dict_inputs = feature_space.get_inputs()
encoded_features = feature_space.get_encoded_features()

x = keras.layers.Dense(96, activation="sigmoid")(encoded_features)
x = keras.layers.Dense(96, activation="sigmoid")(x)
x = keras.layers.Dropout(0.25)(x)
predictions = keras.layers.Dense(1, activation="sigmoid")(x)

training_model = keras.Model(inputs=encoded_features, outputs=predictions)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["AUC"]
)

inference_model = keras.Model(inputs=dict_inputs, outputs=predictions)

In [132]:
training_model.fit(
    preprocessed_train_ds, epochs=20, validation_data=preprocessed_val_ds, verbose=2
)

Epoch 1/20
3237/3237 - 8s - loss: 0.4307 - auc: 0.6125 - val_loss: 0.4220 - val_auc: 0.6466 - 8s/epoch - 3ms/step
Epoch 2/20
3237/3237 - 8s - loss: 0.4228 - auc: 0.6419 - val_loss: 0.4170 - val_auc: 0.6539 - 8s/epoch - 2ms/step
Epoch 3/20
3237/3237 - 6s - loss: 0.4204 - auc: 0.6495 - val_loss: 0.4165 - val_auc: 0.6570 - 6s/epoch - 2ms/step
Epoch 4/20
3237/3237 - 7s - loss: 0.4192 - auc: 0.6535 - val_loss: 0.4169 - val_auc: 0.6592 - 7s/epoch - 2ms/step
Epoch 5/20
3237/3237 - 8s - loss: 0.4175 - auc: 0.6594 - val_loss: 0.4142 - val_auc: 0.6632 - 8s/epoch - 2ms/step
Epoch 6/20
3237/3237 - 8s - loss: 0.4162 - auc: 0.6636 - val_loss: 0.4139 - val_auc: 0.6650 - 8s/epoch - 2ms/step
Epoch 7/20
3237/3237 - 7s - loss: 0.4148 - auc: 0.6680 - val_loss: 0.4121 - val_auc: 0.6714 - 7s/epoch - 2ms/step
Epoch 8/20
3237/3237 - 7s - loss: 0.4139 - auc: 0.6702 - val_loss: 0.4158 - val_auc: 0.6733 - 7s/epoch - 2ms/step
Epoch 9/20
3237/3237 - 7s - loss: 0.4130 - auc: 0.6736 - val_loss: 0.4099 - val_auc: 0.6

<keras.callbacks.History at 0x7efda85c8e20>

In [133]:
sample = {

        "loan_limit": "cf",
        "Gender": "Male",
        "approv_in_adv": "nopre",
        "loan_type": "type2",
        "loan_purpose": "p3",
        "business_or_commercial": "nob/c",
        "credit_type": "EXP",
        "submission_of_application": "to_inst",
        "age": "25-34",
        "Region": "North",

        "loan_amount": 450000,
        "property_value": 500000,
        "Credit_Score": 700,
        "term": 360
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = inference_model.predict(input_dict)

print(
    f"This particular sample has a {100 * predictions[0][0]:.2f}% probability "
    "of being FRAUD, as evaluated by our model."
)



This particular sample has a 37.88% probability of being FRAUD, as evaluated by our model.


## Conclusion

As stated above, I think this approach has many different uses where it would be beneficial.

The second model listed above had the highest overall AUC I was able to put together, but more options should be explored. Additional business knowledge and data exploration could help build the best or most applicable model. The two probabilities for the single sample case show that both models produce a similar likelihood of fraud. Different models could be explored to give greater insight into which variables contribute the most to that number.

Some of the other models I built, with AUC in the 0.67-0.70 range, gave this sample a fraud probability of about 20%, which is quite different from what the two models above produced. Those models must therefore have been picking up on something different. As always, we need to be careful of overfitting, which seemed to be a slight issue at times across the models I built.
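One common guard against overfitting is to stop training once the validation loss stops improving. The sketch below shows that patience logic in plain Python, applied to the nine validation losses visible in the Model 2 log above. It illustrates the idea only, and the function name is hypothetical. Keras offers a `keras.callbacks.EarlyStopping` callback for the same purpose, which was not used in this notebook.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch training stops at, epoch with the best validation loss), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Ran out of epochs without exhausting patience.
    return len(val_losses), best_epoch


# val_loss values from the first nine epochs of the Model 2 log.
model_2_val_losses = [0.4220, 0.4170, 0.4165, 0.4169, 0.4142, 0.4139, 0.4121, 0.4158, 0.4099]

print(early_stopping_epoch(model_2_val_losses, patience=1))
print(early_stopping_epoch(model_2_val_losses, patience=2))
```

With a patience of 1 the run would have stopped at the small bump in epoch 4, while a patience of 2 lets it continue to the lower losses that follow, so the choice of patience matters when the validation loss is this noisy.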