<h1 align="left"> Use case: Early expired loans </h1>

The objective of this assignment is to evaluate the candidate's technical skills and their approach to a Data Science problem. Specifically, the task is to develop a predictive model that estimates the probability that a customer will repay and close the loan before the contract end date.

**Dataset:**

Inside the **data** folder, you will find the use_case_customer_data.csv and use_case_loans_data.csv files, which contain all the variables described below. The target variable "STATUS" indicates whether the customer closed the loan regularly or early.

**Assignment:**

The assignment is to build a predictive model with satisfactory performance, demonstrating the typical steps of a Data Science project, from data cleaning and preparation to testing the performance of the final model.

The completed notebook should be properly commented and delivered through a personal, accessible GitHub repository that allows the work to be reproduced.


### Dataset details

### *use_case_customer_data.csv*

Variables:
- **CUSTOMER_ID**: customer ID

- **SEX**: gender of the customer
    - M: MAN
    - W: WOMAN
    
- **AGE**: age of the customer

- **ANNUAL_INCOME**: annual salary value of the customer

- **NUMBER_OF_MONTHS**: number of monthly salary payments per year

- **MARITAL_STATUS**: marital status of the customer
    - D: DIVORCED
    - G: SINGLE
    - C: COHABITANT
    - J: MARRIED
    - S: SEPARATED
    - W: WIDOWED
    - X: OTHER

- **LEASE**: type of customer lease
    - P: PROPERTY
    - E: AT THE EMPLOYER
    - R: RENT
    - A: PARENTS/RELATIVES
    - T: THIRD PARTIES
    - X: OTHER


### *use_case_loans_data.csv*

Variables:
- **CUSTOMER_ID**: customer ID

- **STATUS**: loan status (target)
    - CONCLUDED REGULARLY
    - EARLY EXPIRED

- **SECTOR_TYPE**: type of loan
    - CL: CAR LOAN
    - FL: FINALIZED LOAN
    - PL: PERSONAL LOAN

- **GOOD_VALUE**: value of the mortgaged property

- **ADVANCE_VALUE**: advance paid

- **LOAN_VALUE**: value of the loan

- **INSTALLMENT_VALUE**: value of the installment

- **NUMBER_INSTALLMENT**: number of installments

- **GAPR**: Gross Annual Percentage Rate

- **NIR**: Nominal Interest Rate 

- **REFINANCED**: loan subject to refinancing (Y / N)

- **FROM_REFINANCE**: loan from a refinancing (Y / N)




In [108]:
import pandas as pd
from pathlib import Path
import os
import pickle
import numpy as np
import datetime

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_validate, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn import metrics
# from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import shap

ImportError: Numba needs NumPy 2.0 or less. Got NumPy 2.1.

# LOADING DATA

In [43]:
# Build the paths with pathlib so they work on both Windows and POSIX systems
PATH_DF_CDATA = Path("data") / "use_case_customer_data.csv"
PATH_DF_LDATA = Path("data") / "use_case_loans_data.csv"

In [44]:
df_cdata = pd.read_csv(PATH_DF_CDATA)
df_ldata = pd.read_csv(PATH_DF_LDATA)

In [54]:
df_cdata.head()

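| | CUSTOMER_ID | SEX | AGE | ANNUAL_INCOME | NUMBER_OF_MONTHS | MARITAL_STATUS | LEASE |
|---|---|---|---|---|---|---|---|
| 0 | 1088 | M | 36 | 25200.0 | 14 | J | P |
| 1 | 1097 | W | 45 | 26610.78 | 14 | J | P |
| 2 | 1102 | M | 49 | 24700.0 | 13 | J | P |
| 3 | 1104 | W | 45 | 15951.0 | 13 | J | P |
| 4 | 1106 | M | 47 | 28114.45 | 13 | J | P |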


In [55]:
df_ldata.head()

| | CUSTOMER_ID | STATUS | SECTOR_TYPE | GOOD_VALUE | ADVANCE_VALUE | LOAN_VALUE | INSTALLMENT_VALUE | NUMBER_INSTALLMENT | GAPR | NIR | REFINANCED | FROM_REFINANCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1088 | CONCLUDED REGULARLY | FL | 469.0 | 69.0 | 400.0 | 44.45 | 10 | 21.74882 | 19.84123 | N | N |
| 1 | 1088 | CONCLUDED REGULARLY | FL | 794.3 | 0.0 | 794.3 | 70.0 | 12 | 9.422 | 9.03804 | N | N |
| 2 | 1097 | CONCLUDED REGULARLY | FL | 399.0 | 0.0 | 419.0 | 69.85 | 6 | 8.192 | 0.038 | N | N |
| 3 | 1097 | CONCLUDED REGULARLY | FL | 1039.98 | 0.0 | 1039.98 | 52.0 | 20 | 0.0022 | 0.0022 | N | N |
| 4 | 1097 | EARLY EXPIRED | CL | 23500.0 | 3500.0 | 21387.86 | 310.0 | 84 | 6.72602 | 5.7609 | Y | N |


In [45]:
print("Shape of Customer data:", df_cdata.shape)
print("Shape of Loans data:", df_ldata.shape)

Shape of Customer data: (9561, 7)
Shape of Loans data: (37291, 12)


In [58]:
print("Columns of Customer data:", df_cdata.columns.to_list())
print("Columns of Loans data:", df_ldata.columns.to_list())

Columns of Customer data: ['CUSTOMER_ID', 'SEX', 'AGE', 'ANNUAL_INCOME', 'NUMBER_OF_MONTHS', 'MARITAL_STATUS', 'LEASE']
Columns of Loans data: ['CUSTOMER_ID', 'STATUS', 'SECTOR_TYPE', 'GOOD_VALUE', 'ADVANCE_VALUE', 'LOAN_VALUE', 'INSTALLMENT_VALUE', 'NUMBER_INSTALLMENT', 'GAPR', 'NIR', 'REFINANCED', 'FROM_REFINANCE']


In [None]:
TARGET = 'STATUS'
FEATURES = [c for c in df_ldata.columns if c != TARGET]

In [46]:
df_cdata['CUSTOMER_ID'].nunique()  # each ID appears exactly once

9561

In [None]:
df_ldata['CUSTOMER_ID'].nunique()  # some IDs repeat, but the number of unique IDs matches the number of unique IDs in df_cdata

9561
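
Since each `CUSTOMER_ID` appears once in the customer table and possibly several times in the loans table, customer attributes could be attached to every loan with a many-to-one merge. The notebook does not perform this step; the following is a minimal sketch on toy data, where the frames `customers`, `loans` and `merged` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Toy stand-ins for df_cdata and df_ldata (values are illustrative only)
customers = pd.DataFrame({
    "CUSTOMER_ID": [1088, 1097],
    "SEX": ["M", "W"],
    "AGE": [36, 45],
})
loans = pd.DataFrame({
    "CUSTOMER_ID": [1088, 1088, 1097],
    "STATUS": ["CONCLUDED REGULARLY", "CONCLUDED REGULARLY", "EARLY EXPIRED"],
    "LOAN_VALUE": [400.0, 794.3, 21387.86],
})

# validate="many_to_one" raises an error if CUSTOMER_ID is not unique
# on the customer side, which guards against accidental row duplication
merged = loans.merge(customers, on="CUSTOMER_ID", how="left", validate="many_to_one")
print(merged.shape)
```

A left join keeps one row per loan, so the row count of the loans table is preserved.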

# Train test split

### Only *df_ldata* is split into training and test sets, because *df_cdata* contains information for each customer

In [67]:
df_train, df_test = train_test_split(df_ldata, test_size=0.2, random_state=42)

In [70]:
print("Shape of train set:", df_train.shape)
print("Shape of test set:", df_test.shape)

Shape of train set: (29832, 12)
Shape of test set: (7459, 12)


In [69]:
X_train = df_train[FEATURES]
X_test = df_test[FEATURES]
y_train = df_train[TARGET]
y_test = df_test[TARGET]
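
Because a customer can hold several loans, a purely random split may place loans of the same customer in both sets. One possible alternative, shown here only as a sketch on toy data and not as the approach used in this notebook, is a group-aware split with scikit-learn's `GroupShuffleSplit`. The frame `loans` and the names `train_part`, `test_part` and `shared` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy loans table: several customers hold more than one loan
loans = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 2, 2, 3, 4, 4, 5, 6, 7],
    "LOAN_VALUE": [400.0, 794.3, 419.0, 1039.98, 2100.0, 350.0, 900.0, 1200.0, 640.0, 5000.0],
})

# All rows sharing a CUSTOMER_ID end up on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(loans, groups=loans["CUSTOMER_ID"]))
train_part = loans.iloc[train_idx]
test_part = loans.iloc[test_idx]

# Customers present in both parts (empty by construction)
shared = set(train_part["CUSTOMER_ID"]) & set(test_part["CUSTOMER_ID"])
print(len(shared))
```

Note that `test_size` here refers to the share of customers, not of loans, so the resulting row proportions can differ from 80/20.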

# EDA

In [None]:
df_cdata.isna().sum()  # MARITAL_STATUS and LEASE are the only variables that contain NA values

CUSTOMER_ID          0
SEX                  0
AGE                  0
ANNUAL_INCOME        0
NUMBER_OF_MONTHS     0
MARITAL_STATUS      15
LEASE                2
dtype: int64

In [None]:
df_ldata.isna().sum()  # there are no missing values

CUSTOMER_ID           0
STATUS                0
SECTOR_TYPE           0
GOOD_VALUE            0
ADVANCE_VALUE         0
LOAN_VALUE            0
INSTALLMENT_VALUE     0
NUMBER_INSTALLMENT    0
GAPR                  0
NIR                   0
REFINANCED            0
FROM_REFINANCE        0
dtype: int64

### IMPUTE NA VALUES

#### Since the missing values belong to categorical variables, they are replaced with an explicit "Not available" category.

In [50]:
vars_with_na = df_cdata.columns[df_cdata.isna().sum() > 0].tolist()
vars_with_na

['MARITAL_STATUS', 'LEASE']

In [51]:
def impute_na(df, var):
    df_copy = df.copy()
    mask_na = df_copy[var].isna()
    df_copy.loc[mask_na, var] = 'Not available'
    # df_copy['IS_NA_' + var] = mask_na
    return df_copy

In [52]:
for var in vars_with_na:
    df_cdata = impute_na(df_cdata, var)

In [53]:
df_cdata.isna().sum()

CUSTOMER_ID         0
SEX                 0
AGE                 0
ANNUAL_INCOME       0
NUMBER_OF_MONTHS    0
MARITAL_STATUS      0
LEASE               0
dtype: int64
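
The same constant-value imputation can also be expressed with scikit-learn's `SimpleImputer`, which is already imported at the top of the notebook and can later be placed inside a `Pipeline`. This is a sketch on a toy frame, assuming the same "Not available" label; `toy` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing categorical values
toy = pd.DataFrame({
    "MARITAL_STATUS": ["J", np.nan, "G"],
    "LEASE": ["P", "R", np.nan],
})

# strategy="constant" replaces every missing entry with fill_value
imputer = SimpleImputer(strategy="constant", fill_value="Not available")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```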

In [12]:
df_ldata['STATUS'].value_counts(normalize=False)

STATUS
CONCLUDED REGULARLY    30037
EARLY EXPIRED           7254
Name: count, dtype: int64

In [11]:
df_ldata['STATUS'].value_counts(normalize=True)

STATUS
CONCLUDED REGULARLY    0.805476
EARLY EXPIRED          0.194524
Name: proportion, dtype: float64
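
The target is imbalanced: roughly 80% of loans are concluded regularly. As a sketch of what this implies, the snippet below rebuilds a label vector from the counts printed above and computes the weights that `class_weight="balanced"` would assign, together with the accuracy of a trivial majority-class predictor. The variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the value_counts output above
counts = {"CONCLUDED REGULARLY": 30037, "EARLY EXPIRED": 7254}
y = np.repeat(list(counts.keys()), list(counts.values()))

# "balanced" weights are n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))

# Accuracy obtained by always predicting the majority class
majority_baseline = max(counts.values()) / sum(counts.values())
print(class_weights, majority_baseline)
```

Any model should be compared against this baseline, which is why accuracy alone is a weak metric for this problem.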

# MODELS - HYPERPARAMETERS TUNING

In [None]:
def hyperparams_selection(model: str = "logistic", out_folder: str = "./models", n_fold: int = 5, n_iter: int = 15):

    today = datetime.datetime.now()
    date_output = today.strftime("%d_%m_%Y")
    
    if model == "logistic":
        params = {
            "penalty": [None, "l1", "l2"],
            "C": [1e-2, 1e-1, 1, 2, 5, 10],
            "max_iter": [100, 200, 300],
            "random_state": [1],
            "class_weight":[None, "balanced"],
            "solver": ["saga"]
        }

        # gs = RandomizedSearchCV(LogisticRegression(), params, cv = n_fold, n_iter=n_iter)
        gs = GridSearchCV(LogisticRegression(), params, cv = n_fold)
        
    elif model == "tree":
        #Decision Tree classifier
        params = {
            "max_depth": [None, 10, 20, 40, 50],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 2, 5],
            "class_weight":[None, "balanced"],
            "random_state": [1]}

        # gs = RandomizedSearchCV(DecisionTreeClassifier(), params, cv = n_fold, n_iter=n_iter, n_jobs = -1)
        gs = GridSearchCV(DecisionTreeClassifier(), params, cv = n_fold)
    
    elif model == "random_forest":
        params = {
            "n_estimators": [50, 100, 200, 300, 400],
            "max_depth": [None, 10, 20],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 2, 5],
            "max_features":["sqrt", "log2", None],
            "random_state": [1],
            "class_weight":[None, "balanced"]
            }

        # gs = RandomizedSearchCV(RandomForestClassifier(), params, cv = n_fold, n_iter=n_iter, n_jobs=-1)
        gs = GridSearchCV(RandomForestClassifier(), params, cv = n_fold)
    
    elif model == "svc":
        params = {
            "C": [1e-1, 1, 2, 5, 10],
            "kernel":["linear", "poly", "rbf", "sigmoid"],
            "degree":[2,3,4,5],
            "gamma":["scale", "auto"],
            "class_weight":[None, "balanced"],
            "random_state": [1]
            }

        # gs = RandomizedSearchCV(SVC(), params, cv = n_fold, n_iter=n_iter, n_jobs=-1)
        gs = GridSearchCV(SVC(), params, cv = n_fold)

    elif model == "knn":
        # KNeighborsClassifier has no "kernel" or "random_state" parameter
        params = {
            "n_neighbors": [2, 5, 10, 15, 20],
            "weights": ["uniform", "distance"],
            "p":[1,2]
            }

        # gs = RandomizedSearchCV(KNeighborsClassifier(), params, cv =n_fold, n_iter=n_iter, n_jobs=-1)
        gs = GridSearchCV(KNeighborsClassifier(), params, cv = n_fold)
        
    elif model == "extra_tree":
        params = {
            "n_estimators": [50, 100, 200, 300, 400],
            "max_depth": [None, 10, 20],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 2, 5],
            "max_features":["sqrt", "log2", None],
            "random_state": [1],
            "class_weight":[None, "balanced"]
            }

        # gs = RandomizedSearchCV(ExtraTreesClassifier(), params, cv = n_fold, n_iter=n_iter, n_jobs=-1)
        gs = GridSearchCV(ExtraTreesClassifier(), params, cv = n_fold)
    
    elif model == "xgb":
        params = {
            "n_estimators": [50, 100, 200, 300, 400],
            "max_depth": [2, 5, 10, 15],
            "learning_rate": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1],
            "colsample_bytree": [0.6, 0.8, 1.0],
            "subsample": [0.6, 0.8, 1.0],
            "reg_alpha": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1],
            "reg_lambda": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1]
        }

        # gs = RandomizedSearchCV(XGBClassifier(), params, cv = n_fold, n_iter=n_iter, n_jobs=-1)
        gs = GridSearchCV(XGBClassifier(), params, cv = n_fold)
    
    elif model == "light_gbm":
        params = {
            "bagging_fraction": [0.2, 0.4, 0.6, 0.8, 1.0], 
            "feature_fraction": [0.6, 0.8, 1.0], 
            "learning_rate": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1], 
            "max_depth": [2, 5, 10, 15], 
            "n_estimators": [100, 200, 300, 400], 
            # "num_leaves": [5, 10, 15, 20, 30, 35], 
            "class_weight": [None, "balanced"],
            "random_state":[1],
            # "min_data_in_leaf": [5, 10, 15, 20], 
            # "max_bin": [5, 10, 15, 20],
            # "subsample": 1.0,
            # "min_sum_hessian_in_leaf": 0.001, 
            }

        # gs = RandomizedSearchCV(LGBMClassifier(), params, cv = n_fold, n_iter=n_iter, n_jobs=-1)
        gs = GridSearchCV(LGBMClassifier(), params, cv = n_fold)
    
    else:
        raise ValueError("Invalid model name.")
        


    gs = gs.fit(X_train, y_train)
    best_params = gs.best_params_
    best_model = gs.best_estimator_
    y_predicted = best_model.predict(X_test)

    # Create the dated output folder (and any missing parents) before saving
    out_dir = os.path.join(out_folder, date_output)
    os.makedirs(out_dir, exist_ok=True)

    FULL_OUT = os.path.join(out_dir, model)
    with open(FULL_OUT, 'wb') as f:
        pickle.dump(best_model, f)


In [None]:
# test run
hyperparams_selection("logistic")

In [None]:
model_names = ["logistic", "tree", "random_forest", "svc", "knn", "extra_tree", "xgb", "light_gbm"]

In [None]:
for model in model_names:
    hyperparams_selection(model)
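
The loans table contains string-valued columns (`SECTOR_TYPE`, `REFINANCED`, `FROM_REFINANCE`), which the estimators above cannot consume directly. The notebook imports `Pipeline`, `ColumnTransformer`, `OneHotEncoder` and `StandardScaler` but does not use them yet; the following is a sketch of how they could be combined with `GridSearchCV`. The toy data, the column split and the `f1_macro` scoring choice are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data that only mimics the column types of the loans table
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "SECTOR_TYPE": rng.choice(["CL", "FL", "PL"], size=n),
    "REFINANCED": rng.choice(["Y", "N"], size=n),
    "LOAN_VALUE": rng.uniform(300, 25000, size=n),
    "NUMBER_INSTALLMENT": rng.integers(6, 85, size=n),
})
y_toy = rng.choice(["CONCLUDED REGULARLY", "EARLY EXPIRED"], size=n, p=[0.8, 0.2])

categorical = ["SECTOR_TYPE", "REFINANCED"]
numeric = ["LOAN_VALUE", "NUMBER_INSTALLMENT"]

# One-hot encode categorical columns and standardize numeric ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Estimator parameters are addressed through the "<step>__<param>" syntax
param_grid = {"clf__C": [0.1, 1, 10], "clf__class_weight": [None, "balanced"]}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Fitting the preprocessing inside the pipeline means the encoder and scaler are re-fitted on each training fold, which avoids leaking information from the validation folds.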

# MODELS - EVALUATION ON TEST SET

In [None]:
models = []
recall = []
precision = []
f1 = []
roc_auc = []
accuracy = []

In [None]:
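def eval_test_set(model: str = "logistic", out_folder: str = "./models", pos_label: str = "EARLY EXPIRED"):

    # Rebuild the dated path used by hyperparams_selection (assumes both run on the same day)
    date_output = datetime.datetime.now().strftime("%d_%m_%Y")
    FULL_OUT = os.path.join(out_folder, date_output, model)
    with open(FULL_OUT, 'rb') as f:
        best_model = pickle.load(f)
    y_predicted = best_model.predict(X_test)

    # STATUS holds string labels, so the positive class must be named explicitly
    models.append(model)
    recall.append(metrics.recall_score(y_test, y_predicted, pos_label=pos_label))
    precision.append(metrics.precision_score(y_test, y_predicted, pos_label=pos_label))
    f1.append(metrics.f1_score(y_test, y_predicted, pos_label=pos_label))
    # roc_auc_score needs numeric inputs, so labels and hard predictions are binarized
    y_true_bin = (y_test == pos_label).astype(int)
    y_pred_bin = (y_predicted == pos_label).astype(int)
    roc_auc.append(metrics.roc_auc_score(y_true_bin, y_pred_bin))
    accuracy.append(metrics.accuracy_score(y_test, y_predicted))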


In [None]:
for model in model_names:
    eval_test_set(model)

In [None]:
results_df = pd.DataFrame({
    "Models": models,
    "F1": f1,
    "Recall": recall,
    "Precision": precision,
    "Roc_auc": roc_auc,
    "Accuracy": accuracy
})
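
ROC AUC is usually computed from predicted probabilities rather than hard class labels, since a label-based score evaluates only a single threshold. The sketch below contrasts the two on synthetic data; it assumes an estimator that exposes `predict_proba` (for `SVC` this would require `probability=True`), and all variable names are illustrative.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

POSITIVE = "EARLY EXPIRED"

# Synthetic data: the first feature drives the positive class
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
signal = X_toy[:, 0] + rng.normal(scale=0.5, size=300)
y_toy = np.where(signal > 0.8, POSITIVE, "CONCLUDED REGULARLY")

clf = LogisticRegression().fit(X_toy[:200], y_toy[:200])
y_true = (y_toy[200:] == POSITIVE).astype(int)

# Probability of the positive class: pick its column from clf.classes_
pos_col = list(clf.classes_).index(POSITIVE)
proba = clf.predict_proba(X_toy[200:])[:, pos_col]
auc_proba = metrics.roc_auc_score(y_true, proba)

# Same metric computed from hard predictions (a single operating point)
y_pred = (clf.predict(X_toy[200:]) == POSITIVE).astype(int)
auc_labels = metrics.roc_auc_score(y_true, y_pred)
print(auc_proba, auc_labels)
```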

# INTERPRETABILITY OF THE BEST MODEL
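
This section is still empty, and the `shap` import at the top of the notebook failed. One model-agnostic option that needs only scikit-learn is permutation importance. The following is a sketch on synthetic data, not an analysis of the actual models; the feature names are borrowed from the loans table only as labels, and `NOISE` is a made-up column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only LOAN_VALUE determines the target
rng = np.random.default_rng(0)
n = 400
X_toy = pd.DataFrame({
    "LOAN_VALUE": rng.normal(size=n),
    "NIR": rng.normal(size=n),
    "NOISE": rng.normal(size=n),
})
y_toy = (X_toy["LOAN_VALUE"] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(X_toy.iloc[:300], y_toy.iloc[:300])

# Shuffle each feature on held-out data and measure the drop in score
result = permutation_importance(
    model, X_toy.iloc[300:], y_toy.iloc[300:], n_repeats=5, random_state=1
)
importances = pd.Series(result.importances_mean, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Computing the importances on held-out data rather than on the training set reflects what the model relies on to generalize.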

# NO UTILS CODE

In [None]:
today = datetime.datetime.now()
date_output = today.strftime("%d_%m_%Y")
name_output = "stringa"

#Logistic regression
params = {"penalty": [None, "l1", "l2"],
              "C": [1e-2, 1e-1, 1, 2, 5, 10],
              "max_iter": [100, 200, 300],
              "random_state": [1],
              "class_weight":[None, "balanced"],
              "solver": ["saga"]  # the default lbfgs solver does not support the l1 penalty
            }

# gs = RandomizedSearchCV(LogisticRegression(), params, cv = 5, n_jobs = -1)
gs = GridSearchCV(LogisticRegression(), params, cv = 5, n_jobs = -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)


FULL_OUT = os.path.join('./models', date_output, '_LOGISTIC_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)


#Decision Tree classifier
params = {"max_depth": [None, 10, 20, 40, 50],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 5],
              "class_weight":[None, "balanced"],
              "random_state": [1]}

# gs = RandomizedSearchCV(DecisionTreeClassifier(), params, cv = 5, n_jobs = -1)
gs = GridSearchCV(DecisionTreeClassifier(), params, cv = 5, n_jobs= -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)


FULL_OUT = os.path.join('./models', date_output, '_TREE_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)



#Random Forest
params = {"n_estimators": [50, 100, 200, 300, 400],
              "max_depth": [None, 10, 20],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 5],
              "max_features":["sqrt", "log2", None],
              "random_state": [1],
              "class_weight":[None, "balanced"]
              #,"n_jobs":[-1]
              }

# gs = RandomizedSearchCV(RandomForestClassifier(), params, cv = 5)
gs = GridSearchCV(RandomForestClassifier(), params, cv = 5, n_jobs= -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)


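FULL_OUT = os.path.join('./models', date_output, '_RANDOM_FOREST_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing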
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)



#SVM
params = {"C": [1e-1, 1, 2, 5, 10],
              "kernel":["linear", "poly", "rbf", "sigmoid"],
              "degree":[2,3,4,5],
              "gamma":["scale", "auto"],
              "class_weight":[None, "balanced"],
              "random_state": [1]}

# gs = RandomizedSearchCV(SVC(), params, cv = 5, n_iter=5)
gs = GridSearchCV(SVC(), params, cv = 5, n_jobs= -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)


FULL_OUT = os.path.join('./models', date_output, '_SVM_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)

#KNeighborsClassifier
# KNeighborsClassifier has no "kernel" or "random_state" parameter
params = {"n_neighbors": [2, 5, 10, 15, 20],
              "weights": ["uniform", "distance"],
              "p":[1,2]}

# gs = RandomizedSearchCV(KNeighborsClassifier(), params, cv = 5, n_iter=5)
gs = GridSearchCV(KNeighborsClassifier(), params, cv = 5, n_jobs= -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)


FULL_OUT = os.path.join('./models', date_output, '_KNN_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)



#ExtraTreesClassifier

params = {
    "n_estimators": [50, 100, 200, 300, 400],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features":["sqrt", "log2", None],
    "random_state": [1],
    "class_weight":[None, "balanced"]
    #,"n_jobs":[-1]
    }

# gs = RandomizedSearchCV(ExtraTreesClassifier(), params, cv = 5, n_iter=5)
gs = GridSearchCV(ExtraTreesClassifier(), params, cv = 5, n_jobs= -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)

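FULL_OUT = os.path.join('./models', date_output, '_EXTRA_TREES_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing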
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)



#MLPClassifier


#XGBClassifier
params = {
    "n_estimators": [50, 100, 200, 300, 400],
    "max_depth": [2, 5, 10, 15],
    "learning_rate": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "subsample": [0.6, 0.8, 1.0],
    "reg_alpha": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1],
    "reg_lambda": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1]
}

# gs = RandomizedSearchCV(XGBClassifier(), params, cv = 5, n_iter=5)
gs = GridSearchCV(XGBClassifier(), params, cv = 5, n_jobs= -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)



FULL_OUT = os.path.join('./models', date_output, '_XGB_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)



#LGBMClassifier
params = {
    "bagging_fraction": [0.2, 0.4, 0.6, 0.8, 1.0], 
    "feature_fraction": [0.6, 0.8, 1.0], 
    "learning_rate": [1e-3, 1e-2, 1e-1, 3e-1, 5e-1], 
    "max_depth": [2, 5, 10, 15], 
    "n_estimators": [100, 200, 300, 400], 
    # "num_leaves": [5, 10, 15, 20, 30, 35], 
    "class_weight": [None, "balanced"],
    "random_state":[1],
    # "min_data_in_leaf": [5, 10, 15, 20], 
    # "max_bin": [5, 10, 15, 20],
    # "subsample": 1.0,
    # "min_sum_hessian_in_leaf": 0.001, 
    }

# gs = RandomizedSearchCV(LGBMClassifier(), params, cv = 5, n_iter=5)
gs = GridSearchCV(LGBMClassifier(), params, cv = 5, n_jobs= -1)
gs = gs.fit(X_train, y_train)
best_params = gs.best_params_
best_model = gs.best_estimator_
y_predicted = best_model.predict(X_test)

FULL_OUT = os.path.join('./models', date_output, '_LGBM_', name_output)
os.makedirs(os.path.dirname(FULL_OUT), exist_ok=True)  # create the output folder if missing
with open(FULL_OUT, 'wb') as f:
    pickle.dump(best_model, f)
with open(FULL_OUT, 'rb') as f:
    best_model = pickle.load(f)
y_predicted = best_model.predict(X_test)


#VotingClassifier



#StackingClassifier
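
The voting and stacking ensembles are not implemented yet. As a sketch of how the two classes imported above could be used, the snippet below combines three base estimators on synthetic data generated with `make_classification`. The choice of base models, their settings and the final estimator are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 80/20), not the loans dataset
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# Both ensembles clone the estimators, so the list can be shared
base = [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=1)),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=1)),
]

# Soft voting averages the predicted class probabilities
voting = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)

# Stacking trains a final estimator on cross-validated base predictions
stacking = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression(), cv=5
).fit(X_tr, y_tr)

voting_acc = voting.score(X_te, y_te)
stacking_acc = stacking.score(X_te, y_te)
print(voting_acc, stacking_acc)
```

Soft voting requires every base estimator to implement `predict_proba`, so an `SVC` would need `probability=True` to take part.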