## ***Intro***

Sometimes the simplest models are exactly what we need. Before planning to build very complex and impressive models, it’s worth giving the classics a try.

Logistic regression is one of the foundational models specifically designed for binary classification. The core idea is to enhance a linear regression model, represented by the equation $z = W \cdot X + b$ (where $W$ is a vector of adjustable parameters and $b$ is the intercept or bias), with a smooth function that outputs the probability of $X$ belonging to class 1 (positive). This function is the sigmoid function, defined as $S(z) = \frac{1}{1 + e^{-z}}$.

In this notebook, we tackle a binary classification problem (**predicting loan defaults**) using logistic regression. 
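
To make the formula concrete, here is a minimal NumPy sketch of the linear score followed by the sigmoid. The weights, bias, and inputs are hypothetical toy values chosen for illustration; they are not taken from the dataset or from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy parameters and inputs (illustration only)
W = np.array([0.5, -0.25])   # one weight per feature
b = 0.1                      # intercept / bias
X = np.array([[2.0, 1.0],    # two examples, two features each
              [-1.0, 3.0]])

z = X @ W + b                # linear part: z = W . X + b
p = sigmoid(z)               # estimated probability of class 1 for each example
```

A positive score `z` maps to a probability above 0.5 and a negative score to a probability below 0.5, which is why 0.5 is the natural default decision threshold.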

## ***Imports***

In [1]:
# For data manipulation and linear algebra
import pandas as pd
import numpy as np

# To build a keras model
from keras import Sequential
from keras.layers import Dense, InputLayer, Normalization
from keras.optimizers import Adam

# some utilities from Scikit learn
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight



## ***Functions***

In [2]:
def convert_to_tensor(*args, target=False) -> tuple:
    """Converts one or more DataFrames to tensors (NumPy arrays).

    This function takes one or more DataFrames as input and returns a tuple of arrays.

    Args:
        *args: A variable number of DataFrames (or Series) to be converted.
        target: (Optional) Boolean flag controlling the data type of the output arrays.
            - If True, arrays keep the original data types of the DataFrame columns.
            - If False (default), arrays are cast to float.

    Returns:
        A tuple of arrays, one for each input DataFrame.
    """
    if target:
        # Keep the original dtype (e.g. integer labels)
        return tuple(np.array(df) for df in args)
    # Cast the inputs to float
    return tuple(np.array(df).astype(float) for df in args)

## ***Data preparation***

**We will use a base table of 100 features selected from a set of features generated with Deep Feature Synthesis on the Home Credit - Credit Risk Model Stability competition dataset totaling over 1,500,000 examples.**

In [3]:
dataset = pd.read_parquet("/kaggle/input/deep-feature-synthesis-home-credit-stability/base_100features.parquet")
dataset

| case_id | WEEK_NUM | COUNT(person_1) | MODE(person_1.contaddr_district_15M) | MODE(person_1.contaddr_zipcode_807M) | MODE(person_1.education_927M) | MODE(person_1.empladdr_district_926M) | MODE(person_1.empladdr_zipcode_114M) | MODE(person_1.incometype_1044T) | MODE(person_1.language1_981M) | MODE(person_1.role_1084L) | ... | MODE(static_cb_0.education_88M) | MODE(static_cb_0.maritalst_385M) | MODE(static_cb_0.requesttype_4525192L) | SUM(static_cb_0.days120_123L) | SUM(static_cb_0.days30_165L) | MAX_MIN_DELTA(applprev_2.num_group2) | MONTH(date_decision) | SEASON(date_decision) | WEEKDAY(date_decision) | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 6 | 2 | 2 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 | 1 | 3 | 3 | 0 |
| 1 | 0 | 5 | 259 | 2828 | 5 | 221 | 2093 | 6 | 2 | 0 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 | 1 | 3 | 3 | 0 |
| 2 | 0 | 5 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 0 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 1.0 | 1 | 3 | 4 | 0 |
| 3 | 0 | 3 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 0 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 2.0 | 1 | 3 | 3 | 0 |
| 4 | 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 2 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 1.0 | 1 | 3 | 4 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2703450 | 91 | 1 | 143 | 356 | 5 | 221 | 2093 | 0 | 1 | 0 | ... | 4 | 3 | 0 | 0.0 | 0.0 | 2.0 | 10 | 0 | 0 | 0 |
| 2703451 | 91 | 2 | 403 | 2825 | 5 | 221 | 2093 | 0 | 1 | 0 | ... | 4 | 3 | 0 | 0.0 | 0.0 | 3.0 | 10 | 0 | 0 | 0 |
| 2703452 | 91 | 1 | 273 | 3202 | 5 | 221 | 2093 | 1 | 1 | 0 | ... | 4 | 3 | 0 | 2.0 | 0.0 | 1.0 | 10 | 0 | 0 | 0 |
| 2703453 | 91 | 2 | 148 | 1959 | 5 | 221 | 2093 | 0 | 1 | 0 | ... | 4 | 0 | 0 | 2.0 | 1.0 | 3.0 | 10 | 0 | 0 | 0 |


In [4]:
# Notice that the positive class is extremely underrepresented (about 3% of the entire dataset)
# This complicates training: a model that always predicts class 0 already reaches about 97% classification accuracy
# So we have to find a solution to that problem first
dataset[dataset["target"] == 1]

| case_id | WEEK_NUM | COUNT(person_1) | MODE(person_1.contaddr_district_15M) | MODE(person_1.contaddr_zipcode_807M) | MODE(person_1.education_927M) | MODE(person_1.empladdr_district_926M) | MODE(person_1.empladdr_zipcode_114M) | MODE(person_1.incometype_1044T) | MODE(person_1.language1_981M) | MODE(person_1.role_1084L) | ... | MODE(static_cb_0.education_88M) | MODE(static_cb_0.maritalst_385M) | MODE(static_cb_0.requesttype_4525192L) | SUM(static_cb_0.days120_123L) | SUM(static_cb_0.days30_165L) | MAX_MIN_DELTA(applprev_2.num_group2) | MONTH(date_decision) | SEASON(date_decision) | WEEKDAY(date_decision) | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 2 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 1.0 | 1 | 3 | 4 | 1 |
| 101 | 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 1 | 2 | 2 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 1.0 | 1 | 3 | 3 | 1 |
| 118 | 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 1 | 2 | 2 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 | 1 | 3 | 3 | 1 |
| 129 | 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 2 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 | 1 | 3 | 4 | 1 |
| 148 | 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 1 | 2 | 2 | ... | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 | 1 | 3 | 4 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2702884 | 91 | 1 | 515 | 2177 | 5 | 221 | 2093 | 6 | 0 | 0 | ... | 4 | 4 | 0 | 0.0 | 0.0 | 1.0 | 10 | 0 | 4 | 1 |
| 2702904 | 91 | 1 | 297 | 1304 | 5 | 221 | 2093 | 8 | 1 | 0 | ... | 4 | 0 | 0 | 1.0 | 1.0 | 2.0 | 10 | 0 | 4 | 1 |
| 2703005 | 91 | 1 | 232 | 233 | 5 | 221 | 2093 | 6 | 0 | 0 | ... | 4 | 0 | 0 | 3.0 | 3.0 | 2.0 | 10 | 0 | 5 | 1 |
| 2703400 | 91 | 1 | 798 | 1536 | 5 | 221 | 2093 | 8 | 1 | 0 | ... | 4 | 0 | 0 | 8.0 | 4.0 | 2.0 | 10 | 0 | 0 | 1 |
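
The accuracy trap mentioned in the comments can be illustrated with a small sketch. The labels below are hypothetical (100 examples with 3 positives, mimicking the roughly 3% positive rate), not the actual dataset.

```python
import numpy as np

# Hypothetical labels: 97 negatives and 3 positives (about 3% positive rate)
y_demo = np.array([0] * 97 + [1] * 3)

# A "model" that ignores its inputs and always predicts class 0
always_zero = np.zeros_like(y_demo)

# Accuracy looks excellent...
accuracy = float(np.mean(always_zero == y_demo))

# ...but not a single positive example is found (recall is zero)
true_positives = np.sum((always_zero == 1) & (y_demo == 1))
recall = float(true_positives / np.sum(y_demo == 1))
```

High accuracy together with zero recall is exactly the failure mode that the rest of the notebook tries to avoid.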


In [5]:
# Drop WEEK_NUM and target to form the training entries X (you can use WEEK_NUM as a feature too)
week_num = dataset["WEEK_NUM"]
target = dataset["target"]
dataset.drop(columns=["WEEK_NUM", "target"], inplace=True)
dataset

| case_id | COUNT(person_1) | MODE(person_1.contaddr_district_15M) | MODE(person_1.contaddr_zipcode_807M) | MODE(person_1.education_927M) | MODE(person_1.empladdr_district_926M) | MODE(person_1.empladdr_zipcode_114M) | MODE(person_1.incometype_1044T) | MODE(person_1.language1_981M) | MODE(person_1.role_1084L) | MODE(person_1.sex_738L) | ... | MODE(static_cb_0.education_1103M) | MODE(static_cb_0.education_88M) | MODE(static_cb_0.maritalst_385M) | MODE(static_cb_0.requesttype_4525192L) | SUM(static_cb_0.days120_123L) | SUM(static_cb_0.days30_165L) | MAX_MIN_DELTA(applprev_2.num_group2) | MONTH(date_decision) | SEASON(date_decision) | WEEKDAY(date_decision) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 259 | 2828 | 5 | 221 | 2093 | 6 | 2 | 2 | 0 | ... | 3 | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 | 1 | 3 | 3 |
| 1 | 5 | 259 | 2828 | 5 | 221 | 2093 | 6 | 2 | 0 | 1 | ... | 3 | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 | 1 | 3 | 3 |
| 2 | 5 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 0 | 0 | ... | 3 | 2 | 2 | 0 | 0.0 | 0.0 | 1.0 | 1 | 3 | 4 |
| 3 | 3 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 0 | 0 | ... | 3 | 2 | 2 | 0 | 0.0 | 0.0 | 2.0 | 1 | 3 | 3 |
| 4 | 4 | 259 | 2828 | 5 | 221 | 2093 | 8 | 2 | 2 | 0 | ... | 3 | 2 | 2 | 0 | 0.0 | 0.0 | 1.0 | 1 | 3 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2703450 | 1 | 143 | 356 | 5 | 221 | 2093 | 0 | 1 | 0 | 0 | ... | 4 | 4 | 3 | 0 | 0.0 | 0.0 | 2.0 | 10 | 0 | 0 |
| 2703451 | 2 | 403 | 2825 | 5 | 221 | 2093 | 0 | 1 | 0 | 0 | ... | 4 | 4 | 3 | 0 | 0.0 | 0.0 | 3.0 | 10 | 0 | 0 |
| 2703452 | 1 | 273 | 3202 | 5 | 221 | 2093 | 1 | 1 | 0 | 1 | ... | 4 | 4 | 3 | 0 | 2.0 | 0.0 | 1.0 | 10 | 0 | 0 |
| 2703453 | 2 | 148 | 1959 | 5 | 221 | 2093 | 0 | 1 | 0 | 0 | ... | 1 | 4 | 0 | 0 | 2.0 | 1.0 | 3.0 | 10 | 0 | 0 |


In [6]:
# Split the dataset into training and test sets (randomly shuffling the data)
# Only 5% of the data will be used as the test set (the dataset is large, so 5% should be more than enough)
X_train, X_test, Y_train, Y_test = train_test_split(dataset, target, test_size=0.05, shuffle=True, random_state=123)

In [7]:
# Keep the true test labels in a DataFrame before converting to tensors
true_label_test = pd.DataFrame(Y_test)
true_label_test

| case_id | target |
|---|---|
| 1017805 | 0 |
| 2543910 | 0 |
| 225134 | 0 |
| 955887 | 0 |
| 23690 | 0 |
| ... | ... |
| 2540270 | 0 |
| 2553269 | 0 |
| 947602 | 0 |
| 755641 | 0 |


In [8]:
# Convert Xs and Ys to tensor (np.ndarray)
X_train, X_test = convert_to_tensor(X_train, X_test)
Y_train, Y_test = convert_to_tensor(Y_train, Y_test, target=True)

In [9]:
X_train.shape, X_test.shape

((1450326, 100), (76333, 100))

In [10]:
Y_train.dtype, X_train.dtype

(dtype('int64'), dtype('float64'))

## ***Create a normalization layer***

**This layer acts as a preprocessing layer that shifts and scales the inputs into a distribution centered around 0 with a standard deviation of 1. It accomplishes this by precomputing the mean and variance of the data, then computing `(input - mean) / sqrt(var)` at runtime.**
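
As a rough illustration of that computation, here is a NumPy sketch of per-feature standardization. The data is randomly generated for the example, and the real Keras layer may differ in details such as numerical safeguards for constant columns.

```python
import numpy as np

# Hypothetical data: 1000 examples, 4 features, not centered and not unit-scaled
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

# "adapt" step: precompute per-feature statistics once
mean = X_demo.mean(axis=0)
var = X_demo.var(axis=0)

# runtime step: shift and scale every input with the stored statistics
X_scaled = (X_demo - mean) / np.sqrt(var)
```

After this transformation each column has mean 0 and standard deviation 1 on the data used to compute the statistics. New data (such as the test set) is transformed with the same stored mean and variance.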

In [11]:
normalizer = Normalization()
normalizer.adapt(X_train)

In [12]:
normalizer(X_test[0])

<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 0.04624718, -0.28207028, -1.983003  , -4.6552796 ,  0.24187519,
         0.1974189 ,  0.78313464,  1.1939479 , -0.37999895, -0.7762619 ,
        -1.37151   , -0.24940884, -2.189564  ,  0.6568108 , -1.0199934 ,
        -0.92357796, -1.0008069 ,  0.4900323 , -1.182728  ,  0.6504713 ,
         0.521577  , -0.02339314, -0.6821551 , -0.16142493, -0.9974977 ,
        -0.7085818 ,  0.12254757,  0.11174086, -0.6286927 ,  0.75663817,
        -0.5329664 , -1.0458726 , -0.71103936, -0.11861729, -0.02432021,
        -0.8803327 , -0.38889712, -2.001316  , -0.47380036, -1.327544  ,
        -1.390471  , -1.5428802 , -0.87488216, -0.9901183 , -0.85514   ,
        -1.6092083 , -1.287082  , -1.0557375 , -0.8173162 , -1.3617547 ,
        -0.79203176, -1.0052189 , -0.14500466,  0.01480454,  0.42456862,
        -0.78026426, -0.5758279 ,  0.37803945, -0.7489057 , -0.6271946 ,
        -1.0284046 , -0.9170442 , -0.6473859 , -0.52873063, -0.40075794,
 

## ***Create and compile model***

In [13]:
# Create a logistic regression model
# This is technically a single-layer neural net with only one neuron; with a sigmoid activation and binary cross-entropy loss, it computes the same function as a logistic regression model
model = Sequential([InputLayer(shape=(100,)),
                    normalizer,
                    Dense(units=1, activation="sigmoid", kernel_initializer="glorot_uniform", kernel_regularizer=None)], name="logistic_reg")

In [14]:
# set up the optimizer
optimizer = Adam(learning_rate=0.001)

In [15]:
# Compile the model for training
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy", "auc", "precision", "recall"])

## ***Dealing with imbalance***

As we observed earlier, the dataset is highly imbalanced, so the model needs a strategy to learn to distinguish between the two classes. Without one, it is likely to predict class 0 for almost every example and still achieve a high classification accuracy, so accuracy alone tells us little on this kind of dataset. By weighting the classes, we penalize the model more heavily for misclassifying the minority class than the majority class.

To evaluate the model, we need metrics that assess its ability to **recognize the minority class**, so we will use precision and recall. Precision measures the percentage of true positive predictions among all examples predicted as positive, while recall measures the percentage of actual positive cases that the model correctly identified as positive.
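
The two definitions can be written out directly from the counts of true positives, false positives, and false negatives. The helper `precision_recall` and the toy labels below are hypothetical and only serve as an illustration; they are not part of the notebook's pipeline.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = np.sum((y_true == 1) & (y_pred == 1))  # positives correctly found
    fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives flagged as positive
    fn = np.sum((y_true == 1) & (y_pred == 0))  # positives that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Hypothetical toy labels: 3 actual positives, 4 predicted positives
y_true_demo = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]
precision_demo, recall_demo = precision_recall(y_true_demo, y_pred_demo)
```

In this toy case two of the four predicted positives are correct (precision 0.5) and two of the three actual positives are found (recall about 0.67).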

In [16]:
# Compute class weight before calling model.fit
# Run this cell if the dataset has not been resampled (to train on the whole training set)
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(Y_train), y=Y_train)
class_weights = {0: class_weights[0], 1: class_weights[1]}
class_weights

{0: 0.5162114467827323, 1: 15.921202274573517}
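
For intuition, the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain NumPy on hypothetical labels; the helper name `balanced_class_weights` is made up for this example.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical labels with a 3% positive rate
y_demo = np.array([0] * 97 + [1] * 3)
weights_demo = balanced_class_weights(y_demo)
```

The rarer a class, the larger its weight, which is why class 1 receives a weight roughly 30 times larger than class 0 in the output above.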

## ***Training***

In [17]:
# Train for 7 epochs
model.fit(x=X_train, y=Y_train, batch_size=32, epochs=7, class_weight=class_weights, shuffle=True)

Epoch 1/7
45323/45323 ━━━━━━━━━━━━━━━━━━━━ 63s 1ms/step - accuracy: 0.6811 - auc: 0.7382 - loss: 0.6096 - precision: 0.0649 - recall: 0.6737
Epoch 2/7
45323/45323 ━━━━━━━━━━━━━━━━━━━━ 61s 1ms/step - accuracy: 0.7130 - auc: 0.7607 - loss: 0.5853 - precision: 0.0703 - recall: 0.6701
Epoch 3/7
45323/45323 ━━━━━━━━━━━━━━━━━━━━ 60s 1ms/step - accuracy: 0.7105 - auc: 0.7609 - loss: 0.5864 - precision: 0.0703 - recall: 0.6733
Epoch 4/7
45323/45323 ━━━━━━━━━━━━━━━━━━━━ 60s 1ms/step - accuracy: 0.7088 - auc: 0.7563 - loss: 0.5903 - precision: 0.0695 - recall: 0.6676
Epoch 5/7
45323/45323 ━━━━━━━━━━━━━━━━━━━━ 61s 1ms/step - accuracy: 0.7087 - auc: 0.7578 - loss: 0.5901 - precision: 0.0699 - recall: 0.6707
Epoch 6/7
45323/45323 ━━━━━━━━━━━━━━━━━━━━ 62s 1ms/step - accuracy: 0.7072 - auc: 0.7598 -

<keras.src.callbacks.history.History at 0x7ea387120700>
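
To see how the class weights influence training, here is a sketch of a class-weighted binary cross-entropy. It assumes that each example's loss is multiplied by the weight of its true class and then averaged over the batch; the exact reduction Keras applies internally may differ. The weights are hypothetical round numbers close to the ones computed earlier.

```python
import numpy as np

def weighted_bce(y_true, y_prob, class_weight, eps=1e-7):
    """Binary cross-entropy where each example is scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_example = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_example))

# Hypothetical weights, rounded for readability
demo_weights = {0: 0.5, 1: 16.0}

# Two equally confident mistakes: a missed positive and a false alarm
miss_positive = weighted_bce([1], [0.1], demo_weights)
miss_negative = weighted_bce([0], [0.9], demo_weights)
```

Under these assumptions, a missed positive costs 32 times more than an equally confident false alarm, which pushes the model toward higher recall at the expense of precision. That is consistent with the low precision and fairly high recall seen in the training log.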

In [18]:
model.summary()

## ***Evaluation***

In [19]:
model.evaluate(x=X_test, y=Y_test)

2386/2386 ━━━━━━━━━━━━━━━━━━━━ 4s 1ms/step - accuracy: 0.7002 - auc: 0.7666 - loss: 0.6004 - precision: 0.0704 - recall: 0.6897


[0.6004137396812439,
 0.6996318697929382,
 0.7575650811195374,
 0.06981726735830307,
 0.6791990399360657]
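
The returned list is easier to read when each value is paired with its name. The sketch below assumes the values come back as the loss followed by the metrics in the order passed to `model.compile`; the numbers are copied from the output above.

```python
# Assumed order: loss first, then the metrics in the order given to compile()
metric_names = ["loss", "accuracy", "auc", "precision", "recall"]
metric_values = [
    0.6004137396812439,
    0.6996318697929382,
    0.7575650811195374,
    0.06981726735830307,
    0.6791990399360657,
]

results = dict(zip(metric_names, metric_values))
for name, value in results.items():
    print(f"{name:>9}: {value:.4f}")
```

Calling `model.evaluate(..., return_dict=True)` should give a similar name-to-value mapping directly, which avoids relying on the ordering assumption.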

In [20]:
y_pred_prob = model.predict(x=X_test)

2386/2386 ━━━━━━━━━━━━━━━━━━━━ 3s 1ms/step


In [21]:
y_pred_prob

array([[0.19490084],
       [0.30687913],
       [0.13840008],
       ...,
       [0.4712462 ],
       [0.4813835 ],
       [0.22795266]], dtype=float32)

In [22]:
y_pred = np.where(y_pred_prob > 0.5, 1, 0)
y_pred

array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]])
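
The 0.5 cutoff is a choice, not a requirement. As a sketch, the hypothetical helper `classify` below applies an arbitrary threshold to a column of probabilities; the probabilities are made-up values in the same shape as `y_pred_prob`.

```python
import numpy as np

def classify(probs, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels using a cutoff."""
    return (np.asarray(probs).ravel() > threshold).astype(int)

# Hypothetical probabilities with shape (n_examples, 1), like y_pred_prob
probs_demo = np.array([[0.19], [0.31], [0.68], [0.47], [0.52]])

labels_default = classify(probs_demo)                 # cutoff at 0.5
labels_strict = classify(probs_demo, threshold=0.6)   # fewer positives predicted
```

Raising the threshold generally trades recall for precision, and lowering it does the opposite, so the cutoff can be tuned to whichever error is more costly for the loan default use case.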

In [23]:
true_label_test["predicted probs"] = y_pred_prob
true_label_test["predicted class"] = y_pred
true_label_test

| case_id | target | predicted probs | predicted class |
|---|---|---|---|
| 1017805 | 0 | 0.194901 | 0 |
| 2543910 | 0 | 0.306879 | 0 |
| 225134 | 0 | 0.138400 | 0 |
| 955887 | 0 | 0.287505 | 0 |
| 23690 | 0 | 0.468134 | 0 |
| ... | ... | ... | ... |
| 2540270 | 0 | 0.447218 | 0 |
| 2553269 | 0 | 0.682336 | 1 |
| 947602 | 0 | 0.471246 | 0 |
| 755641 | 0 | 0.481384 | 0 |


## ***Conclusion***

**Feel free to tune the model's hyperparameters or to train with a different set of features. Thanks for reading!**