# Santander

## Contents
- [0. Theory notes](#0-Theory-notes)
- [1. Import Libraries](#1-Import-Libraries)
- [2. Get Data](#2-Get-Data)
    - [2.1 Read Data](#21-Read-Data)
    - [2.2 Drop an unneeded column](#22-Drop-an-unneeded-column)
    - [2.3 Split 70/30](#23-Split-7030)
- [3. Baseline: Predict the Majority Class](#3-Baseline-Predict-the-Majority-Class)
- [4. Train ML Models](#4-Train-ML-Models)
    - [4.1 Random Forest Classifier](#41-Random-Forest-Classifier)
    - [4.2 Logistic Regression](#42-Logistic-Regression)
    - [4.3 XGBoost Classifier](#43-XGBoost-Classifier)
    - [4.4 LightGBM Classifier](#44-LightGBM-Classifier)
    - [4.5 AdaBoost Classifier](#45-AdaBoost-Classifier)
    - [4.6 CatBoost Classifier](#46-CatBoost-Classifier)
- [5. Evaluation](#5-Evaluation)
    - [5.1 Metrics which Require Predict Method](#51-Metrics-which-Require-Predict-Method)
        - [5.1.1 Classification metrics](#511-Classification-metrics)
        - [5.1.2 Confusion Matrix](#512-Confusion-Matrix)
    - [5.2 Metrics which Require Predict_Proba Method](#52-Metrics-which-Require-Predict_Proba-Method)
- [6. Export Classification Metrics](#6-Export-Classification-Metrics)



## 0. Theory notes
[top](#Contents)

While **predict_proba** is useful for many classification tasks, there are certain scenarios where you might choose not to use it:

- **Binary Classification with Balanced Classes**: When the classes are balanced and you only care about the final classification (not the confidence of predictions), using **predict** to get class labels might be sufficient.
- **Efficiency and Performance**: Computing probabilities can cost more than predicting labels for some models (for example, those that need an extra calibration step). In real-time systems where latency is critical, calling **predict** directly may be preferable.
- **Certain Metrics**: Metrics like accuracy, precision, recall, and F1 score do not require probability estimates and can be computed directly from class labels.
- **Simple Decision Making**: When a simple yes/no decision is needed without considering the probability or confidence level.
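
To make the relationship concrete: for many scikit-learn classifiers, **predict** on a binary problem amounts to thresholding the positive-class probability from **predict_proba** at 0.5. The sketch below illustrates this on synthetic data; the dataset, the model, and the 0.2 threshold are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

demo_clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = demo_clf.predict(X_demo)             # hard class labels (0 or 1)
proba = demo_clf.predict_proba(X_demo)[:, 1]  # probability of class 1

# Thresholding the probability at 0.5 reproduces the hard labels
labels_from_proba = (proba > 0.5).astype(int)
print(np.array_equal(labels, labels_from_proba))

# A lower (hypothetical) threshold flags more rows as positive
labels_low_threshold = (proba > 0.2).astype(int)
print(labels.sum(), labels_low_threshold.sum())
```

Keeping the probabilities leaves the threshold open, which matters when the positive class is rare.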

## 1. Import libraries
[top](#Contents)

In [1]:
# pandas and numpy imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import warnings
# Suppress the specific warning
warnings.filterwarnings("ignore", category=UserWarning, module='sklearn.metrics')

In [2]:
# pip install -U imbalanced-learn

## 2. Get data
[top](#Contents)

#### Read data
[top](#Contents)

In [3]:
bank_ds = pd.read_csv('train.csv')

In [4]:
bank_test_ds = pd.read_csv('test.csv')

In [23]:
bank_ds.head(1)

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914


In [24]:
bank_test_ds.head(1)

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,test_0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,...,-2.1556,11.8495,-1.43,2.4508,13.7112,2.4669,4.3654,10.72,15.4722,-8.7197


#### Drop an unneeded column

In [5]:
# 'ID_code' is a row identifier, not a feature, so it is dropped before training
column_to_drop = 'ID_code'

# Drop the column from bank_ds
bank_ds.drop(column_to_drop, axis=1, inplace=True)


#### Split 70/30

In [6]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    bank_ds.drop(labels=['target'], axis=1),  # drop the target
    bank_ds['target'],  # just the target
    test_size=0.3,
    random_state=42)

X_train.shape, X_test.shape

((140000, 200), (60000, 200))
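
Only about 10% of the rows are positive, so a purely random split can shift the class ratio slightly between the two parts. One option is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for the features and the `target` column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))        # stand-in features
y_demo = np.array([1] * 100 + [0] * 900)   # stand-in target, 10% positives

# stratify keeps the share of positives the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```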

## 3. Baseline: predict the majority class

In [7]:
# Baseline prediction: predict the majority class

y_train_base = pd.Series(np.zeros(len(y_train)))
y_test_base = pd.Series(np.zeros(len(y_test)))
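
The same baseline can also be expressed as a scikit-learn estimator, which gives it the usual `predict` and `predict_proba` interface. A minimal sketch with `DummyClassifier` on toy data (the arrays below are illustrative, not the Santander data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X_demo = np.zeros((10, 1))            # features are ignored by the dummy model
y_demo = np.array([0] * 9 + [1])      # majority class is 0

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

base_labels = baseline.predict(X_demo)               # always the majority class
base_proba = baseline.predict_proba(X_demo)[:, 1]    # probability of class 1

print(base_labels)
print(base_proba)
```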

## 4. Train ML models
### Random Forests

In [8]:
rf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=2, n_jobs=4)

rf.fit(X_train, y_train)

y_train_rf = rf.predict_proba(X_train)[:,1]
y_test_rf = rf.predict_proba(X_test)[:,1]

### Logistic Regression

In [9]:
logit = LogisticRegression(random_state=42,  max_iter=2000)

logit.fit(X_train, y_train)

y_train_logit = logit.predict_proba(X_train)[:,1]
y_test_logit = logit.predict_proba(X_test)[:,1]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
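
The warning above suggests scaling the data. One way to do that is to put a `StandardScaler` in front of the model in a pipeline, so the scaler is fitted on the training data only. The sketch below uses synthetic data; the inflated feature scale and the name `scaled_logit` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_demo = X_demo * 50 + 10   # push features to a larger scale (illustrative)

# Standardize features, then fit the same logistic regression settings
scaled_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=2000),
)
scaled_logit.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = scaled_logit.named_steps["logisticregression"].n_iter_[0]
demo_proba = scaled_logit.predict_proba(X_demo)[:, 1]
print(n_iter)
```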


### XGB Classifier

In [10]:
import xgboost as xgb
from xgboost import XGBClassifier

xgb_model = XGBClassifier(random_state=42, use_label_encoder=False)

xgb_model.fit(X_train, y_train)

y_train_xgb = xgb_model.predict_proba(X_train)[:, 1]
y_test_xgb = xgb_model.predict_proba(X_test)[:, 1]


### Light GBM

In [11]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

lgb_model = LGBMClassifier(random_state=42)

lgb_model.fit(X_train, y_train)

y_train_lgb = lgb_model.predict_proba(X_train)[:, 1]
y_test_lgb = lgb_model.predict_proba(X_test)[:, 1]


[LightGBM] [Info] Number of positive: 13954, number of negative: 126046
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.189911 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51000
[LightGBM] [Info] Number of data points in the train set: 140000, number of used features: 200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.099671 -> initscore=-2.200881
[LightGBM] [Info] Start training from score -2.200881


In [12]:
# pip install lightgbm

### AdaBoost

In [13]:
from sklearn.ensemble import AdaBoostClassifier

adaboost_model = AdaBoostClassifier(random_state=42)

adaboost_model.fit(X_train, y_train)

# Predict probabilities for the training and test sets
y_train_adaboost = adaboost_model.predict_proba(X_train)[:, 1]
y_test_adaboost = adaboost_model.predict_proba(X_test)[:, 1]


### CatBoost

In [14]:
from catboost import CatBoostClassifier

catboost_model = CatBoostClassifier(random_state=42)

catboost_model.fit(X_train, y_train)

# Predict probabilities for the training and test sets
y_train_catboost = catboost_model.predict_proba(X_train)[:, 1]
y_test_catboost = catboost_model.predict_proba(X_test)[:, 1]

Learning rate set to 0.084983
0:	learn: 0.6217063	total: 279ms	remaining: 4m 38s
1:	learn: 0.5642701	total: 396ms	remaining: 3m 17s
2:	learn: 0.5172629	total: 510ms	remaining: 2m 49s
3:	learn: 0.4801122	total: 615ms	remaining: 2m 33s
4:	learn: 0.4489135	total: 731ms	remaining: 2m 25s
5:	learn: 0.4244753	total: 839ms	remaining: 2m 18s
6:	learn: 0.4032821	total: 935ms	remaining: 2m 12s
7:	learn: 0.3864991	total: 1.04s	remaining: 2m 8s
8:	learn: 0.3720655	total: 1.14s	remaining: 2m 6s
9:	learn: 0.3599100	total: 1.24s	remaining: 2m 2s
10:	learn: 0.3503661	total: 1.34s	remaining: 2m
11:	learn: 0.3423601	total: 1.45s	remaining: 1m 59s
12:	learn: 0.3354061	total: 1.54s	remaining: 1m 57s
13:	learn: 0.3297183	total: 1.63s	remaining: 1m 55s
14:	learn: 0.3245210	total: 1.73s	remaining: 1m 53s
15:	learn: 0.3203170	total: 1.82s	remaining: 1m 51s
16:	learn: 0.3165151	total: 1.91s	remaining: 1m 50s
17:	learn: 0.3132907	total: 2s	remaining: 1m 49s
18:	learn: 0.3103479	total: 2.09s	remaining: 1m 47s
19

In [15]:
#!pip install catboost

## 5. Evaluation

#### Metrics which require predict method

##### Confusion matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

TN | FP

FN | TP
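
The layout above can be checked on a toy example. With `labels=[0, 1]`, rows are the true classes and columns the predicted classes, so `ravel()` yields the counts in the order TN, FP, FN, TP. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_demo = np.array([0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 1, 0, 1, 0])

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)   # 2 1 1 1
```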

#### Confusion matrix, Recall, Precision, F1-score, G-mean, etc 

In [16]:
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    classification_report,
    confusion_matrix
)

# requires imbalanced-learn (pip install -U imbalanced-learn)
from imblearn.metrics import (
    geometric_mean_score,
    make_index_balanced_accuracy,
)

def dominance(y_true, y_pred):
    # Dominance = TPR - TNR (recall of class 1 minus recall of class 0)
    tpr = recall_score(y_true, y_pred, pos_label=1)
    tnr = recall_score(y_true, y_pred, pos_label=0)
    return tpr - tnr

# Function to get classification metrics
def get_classification_metrics(y_true, y_pred):
    report = classification_report(y_true, y_pred, output_dict=True)
    accuracy          = accuracy_score(y_true, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
    
    # Averaged variants of recall, precision, and f1-score
    # The default is average='binary' (score for pos_label only); average=None returns one score per class
    recall_weighted    = recall_score(y_true, y_pred, average='weighted')
    precision_weighted = precision_score(y_true, y_pred, average='weighted') 
    f1_weighted        = f1_score(y_true, y_pred, average='weighted')

    recall_micro       = recall_score(y_true, y_pred, average='micro')
    precision_micro    = precision_score(y_true, y_pred, average='micro')
    f1_micro           = f1_score(y_true, y_pred, average='micro')
    
    recall_macro       = recall_score(y_true, y_pred, average='macro')
    precision_macro    = precision_score(y_true, y_pred, average='macro')
    f1_macro           = f1_score(y_true, y_pred, average='macro')

    # Averaged variants of G-mean
    # If the average parameter is not specified, imbalanced-learn defaults to average='multiclass'.
    g_mean_binary      = geometric_mean_score(y_true, y_pred, average='binary')
    g_mean_weighted    = geometric_mean_score(y_true, y_pred, average='weighted')
    g_mean_micro       = geometric_mean_score(y_true, y_pred, average='micro')
    g_mean_macro       = geometric_mean_score(y_true, y_pred, average='macro')

    # Index of balanced accuracy: the wrapped score (squared when squared=True) is multiplied by (1 + alpha * dominance),
    # where dominance = TPR - TNR. A larger alpha gives the dominance term more influence; alpha=0 leaves the score unchanged.
    gmean = make_index_balanced_accuracy(alpha=0.5, squared=True)(geometric_mean_score) 
    corrected_g_mean  = gmean(y_true, y_pred)
    # specifying an average might not be necessary or meaningful. Leaving for the time being
    corrected_g_mean_binary  = gmean(y_true, y_pred, average='binary')
    corrected_g_mean_weighted  = gmean(y_true, y_pred, average='weighted')
    corrected_g_mean_micro  = gmean(y_true, y_pred, average='micro')
    corrected_g_mean_macro  = gmean(y_true, y_pred, average='macro')
    
    dominance_score   = dominance(y_true, y_pred) 

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    FPR = fp / (tn + fp)
    FNR = fn / (tp + fn)
    # TPR = tp / (tp + fn) # recallminority
    # TNR = tn / (tn + fp) # recall majority
    
    return {
        'Accuracy': accuracy,
        'Recall Majority (TNR)': report['0']['recall'],
        'Recall Minority (TPR)': report['1']['recall'],
        'Balanced Accuracy': balanced_accuracy,
        'FPR': FPR,
        'FNR': FNR,
        
        'Precision Majority': report['0']['precision'],
        'Precision Minority': report['1']['precision'],
        'F1-Score Majority': report['0']['f1-score'],
        'F1-Score Minority': report['1']['f1-score'], 
        
        # Weighted average (Use 'weighted' if you want to take class imbalance into account)
        'Weighted Precision': precision_weighted,
        'Weighted Recall': recall_weighted,
        'Weighted F1-Score': f1_weighted,

        # Micro average (Use 'micro' if you want a metric that gives equal weight to every individual prediction)
        'Micro Precision': precision_micro,
        'Micro Recall': recall_micro,
        'Micro F1-Score': f1_micro,
        
        # Macro average (Use 'macro' if you want to treat each class equally)
        'Macro Precision': precision_macro,
        'Macro Recall': recall_macro,
        'Macro F1-Score': f1_macro,

        # G-mean = sqrt(sensitivity * specificity): a suitable metric when you want to balance detecting the minority class
        # against avoiding false positives from the majority class
        'G-mean-binary': g_mean_binary, # sensitivity and specificity of the positive class only (pos_label=1)
        'G-mean-weighted': g_mean_weighted, # per-class sensitivity and specificity averaged with weights proportional to class support
        'G-mean-micro': g_mean_micro, # sensitivity and specificity from counts pooled over all classes (every instance weighs the same)
        'G-mean-macro': g_mean_macro, # unweighted mean of per-class sensitivity and specificity (every class weighs the same)

        # Corrected G-means
        'Corrected G-mean': corrected_g_mean,
        'Corrected G-mean-binary': corrected_g_mean_binary,
        'Corrected G-mean-weighted': corrected_g_mean_weighted,
        'Corrected G-mean-micro': corrected_g_mean_micro,
        'Corrected G-mean-macro': corrected_g_mean_macro,
        
        'Dominance': dominance_score
    }

models = {
    'Baseline': y_test_base,
    'Random Forest': rf.predict(X_test),
    'Logistic Regression': logit.predict(X_test),
    'XGBoost': xgb_model.predict(X_test),
    'LightGBM': lgb_model.predict(X_test),
    'AdaBoost': adaboost_model.predict(X_test),
    'CatBoost': catboost_model.predict(X_test)
}

# Collecting results
results = []
confusion_matricies = []
for model_name, y_pred in models.items():
    metrics           = get_classification_metrics(y_test, y_pred)
    conf_matrix       = confusion_matrix(y_test, y_pred)
    results.append({
        'Model': model_name,
        **metrics,
    })
    confusion_matricies.append({
        'Model': model_name,
        'Confusion matrix': conf_matrix
    })

# Create a DataFrame from the results
df_results_1 = pd.DataFrame(results)
df_confusion_matricies = pd.DataFrame(confusion_matricies)

# Print the DataFrame as a table
print(df_results_1)
print(df_confusion_matricies)


                 Model  Accuracy  Recall Majority (TNR)  \
0             Baseline  0.897600               1.000000   
1        Random Forest  0.897600               1.000000   
2  Logistic Regression  0.912817               0.986612   
3              XGBoost  0.911717               0.986650   
4             LightGBM  0.906350               0.997995   
5             AdaBoost  0.905017               0.989268   
6             CatBoost  0.922533               0.990159   

   Recall Minority (TPR)  Balanced Accuracy       FPR       FNR  \
0               0.000000           0.500000  0.000000  1.000000   
1               0.000000           0.500000  0.000000  1.000000   
2               0.265951           0.626281  0.013388  0.734049   
3               0.254883           0.620766  0.013350  0.745117   
4               0.103027           0.550511  0.002005  0.896973   
5               0.166504           0.577886  0.010732  0.833496   
6               0.329753           0.659956  0.009841  0.6
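
Several of the reported metrics follow directly from TPR and TNR. As a worked example, the sketch below recomputes them for the Logistic Regression row of the table above. The index-of-balanced-accuracy line assumes the form (1 + alpha * dominance) * G-mean², which is my reading of how `make_index_balanced_accuracy(alpha=0.5, squared=True)` behaves; treat it as an assumption rather than a reference implementation.

```python
import math

# Values taken from the Logistic Regression row of the results table
tnr = 0.986612   # Recall Majority (TNR)
tpr = 0.265951   # Recall Minority (TPR)

balanced_accuracy = (tpr + tnr) / 2   # mean of the two recalls
g_mean = math.sqrt(tpr * tnr)         # geometric mean of the two recalls
dominance = tpr - tnr                 # negative: the majority class dominates

# Assumed form of the corrected (index-balanced) G-mean
alpha = 0.5
iba_g_mean = (1 + alpha * dominance) * g_mean ** 2

print(balanced_accuracy, g_mean, dominance, iba_g_mean)
```

The balanced accuracy computed this way agrees with the 0.626281 reported in the table.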

#### Metrics which require predict_proba method
ROC AUC and PR AUC (average precision) are computed from predicted scores rather than hard class labels. For a binary problem, pass the probability of the positive class, i.e. the second column returned by **predict_proba**, so that the curve can be traced across all decision thresholds.

In [17]:
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score
)

models = {
    'Baseline': y_test_base,
    'Random Forest': rf.predict_proba(X_test)[:,1],
    'Logistic Regression': logit.predict_proba(X_test)[:,1],
    'XGBoost': xgb_model.predict_proba(X_test)[:,1],
    'LightGBM': lgb_model.predict_proba(X_test)[:,1],
    'AdaBoost': adaboost_model.predict_proba(X_test)[:,1],
    'CatBoost': catboost_model.predict_proba(X_test)[:,1]
}

# Collecting results
results = []
for model_name, y_pred in models.items():
    roc_auc   = roc_auc_score(y_test, y_pred) 
    pr_auc = average_precision_score(y_test, y_pred)
    results.append({
        'Model': model_name,
        'ROC-AUC': roc_auc,
        'PR-AUC': pr_auc
    })

# Create a DataFrame from the results
df_results_2 = pd.DataFrame(results)

# Print the DataFrame as a table
print(df_results_2)

                 Model   ROC-AUC    PR-AUC
0             Baseline  0.500000  0.102400
1        Random Forest  0.747842  0.289410
2  Logistic Regression  0.860041  0.510377
3              XGBoost  0.857861  0.497970
4             LightGBM  0.862282  0.517279
5             AdaBoost  0.803598  0.401087
6             CatBoost  0.893987  0.606566
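
The Baseline row can be understood from a toy case: a constant score cannot rank any positive above any negative, so ROC-AUC is 0.5 and average precision equals the share of positives (0.1024 in the test set above, consistent with the baseline accuracy of 0.8976). A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_demo = np.array([0] * 9 + [1])          # 10% positives
constant_scores = np.zeros(len(y_demo))   # same score for every row

roc = roc_auc_score(y_demo, constant_scores)
ap = average_precision_score(y_demo, constant_scores)

print(roc)   # 0.5
print(ap)    # equals the positive rate, 0.1
```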


## 6. Export classification metrics

In [18]:
# Combine df_results_2 and df_results_1 horizontally, joining on 'Model' to avoid a duplicated 'Model' column
df_combined_results = df_results_2.merge(df_results_1, on='Model')

# Export results to CSV files
df_combined_results.to_csv('1_evaluation_results.csv', index=False)
df_confusion_matricies.to_csv('1_confusion_matricies.csv', index=False)
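
Writing a NumPy array into a CSV cell stores its string representation, which is awkward to parse back. One possible alternative is to flatten each matrix into four count columns before exporting. The sketch below uses made-up matrices; `demo_matrices` and `df_flat` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up confusion matrices in the same shape as the list built above
demo_matrices = [
    {"Model": "model_a", "Confusion matrix": np.array([[50, 4], [3, 7]])},
    {"Model": "model_b", "Confusion matrix": np.array([[52, 2], [6, 4]])},
]

rows = []
for entry in demo_matrices:
    # ravel() on a 2x2 matrix gives TN, FP, FN, TP
    tn, fp, fn, tp = entry["Confusion matrix"].ravel()
    rows.append({"Model": entry["Model"],
                 "TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp)})

df_flat = pd.DataFrame(rows)
csv_text = df_flat.to_csv(index=False)   # pass a file path to write to disk instead
print(csv_text)
```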