# Santander

## Contents
- [1. Import Libraries](#1-Import-Libraries)
- [2. Get Data](#2-Get-Data)
    - [2.1 Read Data](#21-Read-Data)
    - [2.2 Drop an unneeded column](#22-Drop-an-unneeded-column)
    - [2.3 Split 70/30 (stratified sampling)](#23-Split-7030-stratified-sampling)
- [3. Baseline: Predict the Majority Class](#3-Baseline-Predict-the-Majority-Class)
- [4. Train ML Models](#4-Train-ML-Models)
    - [4.1 Random Forest Classifier](#41-Random-Forest-Classifier)
    - [4.2 Logistic Regression](#42-Logistic-Regression)
    - [4.3 XGBoost Classifier](#43-XGBoost-Classifier)
    - [4.4 LightGBM Classifier](#44-LightGBM-Classifier)
    - [4.5 CatBoost Classifier](#45-CatBoost-Classifier)
- [5. Evaluation](#5-Evaluation)
    - [5.1 Metrics which Require Predict Method](#51-Metrics-which-Require-Predict-Method)
        - [5.1.1 Classification metrics](#511-Classification-metrics)
        - [5.1.2 Confusion Matrix](#512-Confusion-Matrix)
    - [5.2 Metrics which Require Predict_Proba Method](#52-Metrics-which-Require-Predict_Proba-Method)
- [6. Export Classification Metrics](#6-Export-Classification-Metrics)
- [7. Theory notes](#7-Theory-notes)

In [14]:
# 1. Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import warnings
import xgboost as xgb
from xgboost import XGBClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score
)
from imblearn.metrics import (
    geometric_mean_score,
    make_index_balanced_accuracy,
)

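# Suppress UserWarning-category warnings raised from sklearn.metrics
# (for example, undefined precision when a model predicts no positives)
warnings.filterwarnings("ignore", category=UserWarning, module='sklearn.metrics')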

# 2. Get Data

# 2.1 Read Data
bank_ds = pd.read_csv('train.csv')
bank_test_ds = pd.read_csv('test.csv')

# 2.2 Drop an unneeded column
column_to_drop = 'ID_code'
bank_ds.drop(column_to_drop, axis=1, inplace=True)

# 2.3 Split 70/30 (stratified sampling)
target = bank_ds['target']
X_train, X_test, y_train, y_test = train_test_split(
    bank_ds.drop(labels=['target'], axis=1),
    bank_ds['target'],
    test_size=0.3,
    stratify=target,
    random_state=42
)
print(X_train.shape, X_test.shape)

# 3. Baseline: Predict the Majority Class
y_train_base = pd.Series(np.zeros(len(y_train)))
y_test_base = pd.Series(np.zeros(len(y_test)))

# 4. Train ML Models

# 4.1 Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=2, n_jobs=4)
rf.fit(X_train, y_train)
y_train_rf = rf.predict_proba(X_train)[:, 1]
y_test_rf = rf.predict_proba(X_test)[:, 1]

# 4.2 Logistic Regression
logit = LogisticRegression(random_state=42, max_iter=2000)
logit.fit(X_train, y_train)
y_train_logit = logit.predict_proba(X_train)[:, 1]
y_test_logit = logit.predict_proba(X_test)[:, 1]

# 4.3 XGB Classifier
xgb_model = XGBClassifier(random_state=42, use_label_encoder=False)
xgb_model.fit(X_train, y_train)
y_train_xgb = xgb_model.predict_proba(X_train)[:, 1]
y_test_xgb = xgb_model.predict_proba(X_test)[:, 1]

# 4.4 Light GBM
lgb_model = LGBMClassifier(random_state=42)
lgb_model.fit(X_train, y_train)
y_train_lgb = lgb_model.predict_proba(X_train)[:, 1]
y_test_lgb = lgb_model.predict_proba(X_test)[:, 1]

# 4.5 Cat Boost
catboost_model = CatBoostClassifier(random_state=42)
catboost_model.fit(X_train, y_train)
y_train_catboost = catboost_model.predict_proba(X_train)[:, 1]
y_test_catboost = catboost_model.predict_proba(X_test)[:, 1]

# 5. Evaluation

# 5.1 Metrics which Require Predict Method

# 5.1.1 Classification metrics
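def dominance(y_true, y_pred):
    # Dominance = TPR - TNR: positive when the minority class is recalled
    # better than the majority class, negative otherwise.
    tpr = recall_score(y_true, y_pred, pos_label=1)
    tnr = recall_score(y_true, y_pred, pos_label=0)
    return tpr - tnr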

def get_classification_metrics(y_true, y_pred):
    report = classification_report(y_true, y_pred, output_dict=True)
    accuracy = accuracy_score(y_true, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
    recall_weighted = recall_score(y_true, y_pred, average='weighted')
    precision_weighted = precision_score(y_true, y_pred, average='weighted')
    f1_weighted = f1_score(y_true, y_pred, average='weighted')
    recall_micro = recall_score(y_true, y_pred, average='micro')
    precision_micro = precision_score(y_true, y_pred, average='micro')
    f1_micro = f1_score(y_true, y_pred, average='micro')
    recall_macro = recall_score(y_true, y_pred, average='macro')
    precision_macro = precision_score(y_true, y_pred, average='macro')
    f1_macro = f1_score(y_true, y_pred, average='macro')
    g_mean_binary = geometric_mean_score(y_true, y_pred, average='binary')
    g_mean_weighted = geometric_mean_score(y_true, y_pred, average='weighted')
    g_mean_micro = geometric_mean_score(y_true, y_pred, average='micro')
    g_mean_macro = geometric_mean_score(y_true, y_pred, average='macro')
    gmean = make_index_balanced_accuracy(alpha=0.5, squared=True)(geometric_mean_score)
    corrected_g_mean = gmean(y_true, y_pred)
    corrected_g_mean_binary = gmean(y_true, y_pred, average='binary')
    corrected_g_mean_weighted = gmean(y_true, y_pred, average='weighted')
    corrected_g_mean_micro = gmean(y_true, y_pred, average='micro')
    corrected_g_mean_macro = gmean(y_true, y_pred, average='macro')
    dominance_score = dominance(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    FPR = fp / (tn + fp)
    FNR = fn / (tp + fn)
    return {
        'Accuracy': accuracy,
        'Recall Majority (TNR)': report['0']['recall'],
        'Recall Minority (TPR)': report['1']['recall'],
        'Balanced Accuracy': balanced_accuracy,
        'FPR': FPR,
        'FNR': FNR,
        'Precision Majority': report['0']['precision'],
        'Precision Minority': report['1']['precision'],
        'F1-Score Majority': report['0']['f1-score'],
        'F1-Score Minority': report['1']['f1-score'],
        'Weighted Precision': precision_weighted,
        'Weighted Recall': recall_weighted,
        'Weighted F1-Score': f1_weighted,
        'Micro Precision': precision_micro,
        'Micro Recall': recall_micro,
        'Micro F1-Score': f1_micro,
        'Macro Precision': precision_macro,
        'Macro Recall': recall_macro,
        'Macro F1-Score': f1_macro,
        'G-mean-binary': g_mean_binary,
        'G-mean-weighted': g_mean_weighted,
        'G-mean-micro': g_mean_micro,
        'G-mean-macro': g_mean_macro,
        'Corrected G-mean': corrected_g_mean,
        'Corrected G-mean-binary': corrected_g_mean_binary,
        'Corrected G-mean-weighted': corrected_g_mean_weighted,
        'Corrected G-mean-micro': corrected_g_mean_micro,
        'Corrected G-mean-macro': corrected_g_mean_macro,
        'Dominance': dominance_score
    }

# Hard class predictions on the held-out split, one entry per model
results = []
models = {
    'Baseline': y_test_base,
    'Random Forest': rf.predict(X_test),
    'Logistic Regression': logit.predict(X_test),
    'XGBoost': xgb_model.predict(X_test),
    'LightGBM': lgb_model.predict(X_test),
    'CatBoost': catboost_model.predict(X_test)
}


confusion_matrices = []
for model_name, y_pred in models.items():
    metrics = get_classification_metrics(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    results.append({
        'Model': model_name,
        **metrics,
    })
    confusion_matrices.append({
        'Model': model_name,
        'Confusion matrix': conf_matrix
    })
df_results_1 = pd.DataFrame(results)
print(df_results_1)


(140000, 200) (60000, 200)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[LightGBM] [Info] Number of positive: 14069, number of negative: 125931
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.158759 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51000
[LightGBM] [Info] Number of data points in the train set: 140000, number of used features: 200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.100493 -> initscore=-2.191760
[LightGBM] [Info] Start training from score -2.191760
Learning rate set to 0.084983
0:	learn: 0.6221520	total: 94.6ms	remaining: 1m 34s
1:	learn: 0.5650402	total: 183ms	remaining: 1m 31s
2:	learn: 0.5182997	total: 268ms	remaining: 1m 28s
3:	learn: 0.4813299	total: 352ms	remaining: 1m 27s
4:	learn: 0.4505648	total: 441ms	remaining: 1m 27s
5:	learn: 0.4262857	total: 528ms	remaining: 1m 27s
6:	learn: 0.4043163	total: 605ms	remaining: 1m 25s
7:	learn: 0.3868582	total: 684ms	remaining: 1m 24s
8:	learn: 0.3728083	total: 762ms	remaining: 1m 23s
9:	learn: 0.

In [15]:
# 5.1.2 Confusion Matrix
df_confusion_matrices = pd.DataFrame(confusion_matrices)

print(df_confusion_matrices)



                 Model              Confusion matrix
0             Baseline       [[53971, 0], [6029, 0]]
1        Random Forest       [[53971, 0], [6029, 0]]
2  Logistic Regression  [[53205, 766], [4439, 1590]]
3              XGBoost  [[53305, 666], [4543, 1486]]
4             LightGBM    [[53876, 95], [5446, 583]]
5             CatBoost  [[53400, 571], [4189, 1840]]


In [21]:
# 5.2 Metrics which Require Predict_Proba Method
models = {
    'Baseline': y_test_base,
    'Random Forest': rf.predict_proba(X_test)[:, 1],
    'Logistic Regression': logit.predict_proba(X_test)[:, 1],
    'XGBoost': xgb_model.predict_proba(X_test)[:, 1],
    'LightGBM': lgb_model.predict_proba(X_test)[:, 1],
    'CatBoost': catboost_model.predict_proba(X_test)[:, 1]
}

results = []
for model_name, y_pred in models.items():
    roc_auc = roc_auc_score(y_test, y_pred)
    pr_auc = average_precision_score(y_test, y_pred)
    results.append({
        'Model': model_name,
        'ROC-AUC': roc_auc,
        'PR-AUC': pr_auc
    })

df_results_2 = pd.DataFrame(results)

print(df_results_2)

# 6. Export Classification Metrics
# Join on 'Model' so the combined table keeps a single Model column
df_combined_results = df_results_2.merge(df_results_1, on='Model')
df_combined_results.to_csv('2_evaluation_results_stratify_sampling.csv', index=False)
df_confusion_matrices.to_csv('2_confusion_matrices_stratify_sampling.csv', index=False)



                 Model   ROC-AUC    PR-AUC
0             Baseline  0.500000  0.100483
1        Random Forest  0.748211  0.277832
2  Logistic Regression  0.858369  0.497739
3              XGBoost  0.854883  0.488493
4             LightGBM  0.863721  0.507548
5             CatBoost  0.893907  0.592283


## 7. Theory notes

Although **predict_proba** is useful for many classification tasks, **predict** is sufficient in some scenarios:

- **Binary Classification with Balanced Classes**: When the classes are balanced and only the final class matters (not the confidence of the prediction), the labels returned by **predict** may be sufficient.
- **Efficiency and Performance**: For some models, producing probabilities costs more than producing labels. In real-time systems where latency is critical, calling **predict** directly may be preferable.
- **Certain Metrics**: Accuracy, precision, recall, and F1 score are computed directly from class labels and do not need probability estimates.
- **Simple Decision Making**: When a plain yes/no decision is needed and the probability or confidence level plays no role.
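
The difference between labels and probabilities can be sketched with a few made-up numbers. In the snippet below, the probabilities are illustrative stand-ins for what `predict_proba(X)[:, 1]` would return, and `to_labels` and `recall` are hypothetical helpers, not scikit-learn functions. For a binary classifier, **predict** typically behaves like a cut at 0.5; keeping the probabilities makes it possible to move that cut, which can matter on imbalanced data.

```python
# Hypothetical positive-class probabilities, standing in for
# model.predict_proba(X)[:, 1]; the values are illustrative only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
proba = [0.05, 0.10, 0.20, 0.15, 0.30, 0.45, 0.35, 0.55, 0.60, 0.90]


def to_labels(scores, threshold=0.5):
    """Convert probabilities to 0/1 labels (hypothetical helper)."""
    return [1 if s >= threshold else 0 for s in scores]


def recall(y, y_hat):
    """Share of actual positives that were predicted positive (TPR)."""
    tp = sum(1 for t, p in zip(y, y_hat) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y, y_hat) if t == 1 and p == 0)
    return tp / (tp + fn)


labels_default = to_labels(proba)        # roughly what predict() would give
labels_lenient = to_labels(proba, 0.3)   # lower cut-off, more positives flagged

print("recall at 0.5:", recall(y_true, labels_default))
print("recall at 0.3:", recall(y_true, labels_lenient))
```

Lowering the threshold raises recall on the positive class at the cost of more false positives; with labels alone, that trade-off cannot be explored after the fact.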


##### Confusion matrix
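For binary labels `[0, 1]`, scikit-learn's `confusion_matrix` puts the actual classes in rows and the predicted classes in columns (see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html):

| | Predicted 0 | Predicted 1 |
|---|---|---|
| **Actual 0** | TN | FP |
| **Actual 1** | FN | TP |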
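
As a worked example of this layout, the sketch below takes the CatBoost confusion matrix printed in section 5.1.2 and derives the rates reported in section 5.1.1 by hand. It uses plain Python rather than scikit-learn, so the variable names are illustrative only.

```python
# CatBoost confusion matrix as printed in section 5.1.2:
# rows = actual class, columns = predicted class.
cm = [[53400, 571],
      [4189, 1840]]

(tn, fp), (fn, tp) = cm

tpr = tp / (tp + fn)   # recall of the minority class (label 1)
tnr = tn / (tn + fp)   # recall of the majority class (label 0)
fpr = fp / (tn + fp)   # false positive rate = 1 - TNR
fnr = fn / (tp + fn)   # false negative rate = 1 - TPR
accuracy = (tp + tn) / (tn + fp + fn + tp)
balanced_accuracy = (tpr + tnr) / 2

print(f"accuracy={accuracy:.4f}  balanced accuracy={balanced_accuracy:.4f}")
print(f"TPR={tpr:.4f}  TNR={tnr:.4f}  FPR={fpr:.4f}  FNR={fnr:.4f}")
```

Accuracy comes out at about 0.92 while the minority-class recall is only about 0.31, which is why the notebook reports balanced accuracy and per-class metrics alongside plain accuracy.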

**Metrics which require predict_proba method**
- ROC-AUC and PR-AUC (`average_precision_score`) are computed from the predicted probability of the positive class, obtained here with `predict_proba(X)[:, 1]`. Both metrics sweep over every possible decision threshold, so they need a continuous score per sample rather than a hard label: the scores define the ROC (or precision-recall) curve, and the AUC (Area Under the Curve) summarizes it.
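
One way to see why ranked scores are needed is the pairwise reading of ROC-AUC: the share of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The sketch below implements that reading in plain Python. `pairwise_auc` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the data is made up.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Illustrative helper only: it compares every pair,
    so it is far too slow for a dataset of realistic size.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
proba = [0.10, 0.60, 0.35, 0.80]                # illustrative probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]  # hard labels at 0.5
constant = [0.0] * len(y_true)                  # like the all-zeros baseline

print(pairwise_auc(y_true, proba))     # 0.75
print(pairwise_auc(y_true, labels))    # 0.5
print(pairwise_auc(y_true, constant))  # 0.5
```

Thresholding to labels discards the ordering within each predicted class, so the score drops from 0.75 to 0.5 here. A constant prediction ties every pair and gives exactly 0.5, which matches the Baseline row in the ROC-AUC table above.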