# Santander

#### Contents
 - [Readme](#Readme)
 - [1. Import libs](#1.-Import-libs)
 - [2. Get data](#2.-Get-data)
     - [Read data](#Read-data)


### Readme
[top](#Contents)

# 1. Import libs
[top](#Contents)

In [1]:
# pandas and numpy imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import warnings
# Suppress the specific warning
warnings.filterwarnings("ignore", category=UserWarning, module='sklearn.metrics')

In [2]:
# pip install -U imbalanced-learn

# 2. Get data
[top](#Contents)

#### Read data
[top](#Contents)

In [3]:
bank_ds = pd.read_csv('train.csv')

In [4]:
bank_test_ds = pd.read_csv('test.csv')

#### Drop an unneeded column

In [4]:
# Assuming 'column_to_drop' is the name of the column you want to drop
column_to_drop = 'ID_code'

# Drop the column from bank_ds
bank_ds.drop(column_to_drop, axis=1, inplace=True)


### Split 70/30 (stratified sampling)

In [5]:
target = bank_ds['target']

In [6]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    bank_ds.drop(labels=['target'], axis=1),  # drop the target
    bank_ds['target'],  # just the target
    test_size=0.3,
    stratify=target,
    random_state=42)

X_train.shape, X_test.shape

((140000, 200), (60000, 200))

### Scaling

Models like Logistic Regression require scaling for better performance. Tree-based models like Random Forest, XGBoost, and LightGBM do not inherently require scaling but scaling can still be beneficial (can improve performance and convergence in some cases).

In [7]:
from sklearn.preprocessing import MinMaxScaler
# we put the variables in the same scale
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Undersampling

In [8]:
from imblearn.under_sampling import (
    RandomUnderSampler,
    TomekLinks,
    OneSidedSelection,
    EditedNearestNeighbours,
    RepeatedEditedNearestNeighbours,
    NeighbourhoodCleaningRule,
    NearMiss
)

In [9]:
# sampling_strategy='auto' undersamples only majority class
undersampler_dict = {
    'random': RandomUnderSampler(sampling_strategy='auto', random_state=0, replacement=False),
    'ncr':    NeighbourhoodCleaningRule(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=4, threshold_cleaning=0.5),
    'enn':    EditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=4),
    'renn':   RepeatedEditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=4, max_iter=100),   
    'nm1':    NearMiss(sampling_strategy='auto', version=1, n_neighbors=3, n_jobs=4),
    'tomek':  TomekLinks(sampling_strategy='auto', n_jobs=4),
    'oss':    OneSidedSelection(sampling_strategy='auto', random_state=0, n_neighbors=1, n_jobs=4)
}

In [None]:
# from imblearn.under_sampling import (
    CondensedNearestNeighbour,
    AllKNN,
    InstanceHardnessThreshold
)

In [28]:
# sampling_strategy='auto' undersamples only majority class
undersampler_dict = {
    'random': RandomUnderSampler(sampling_strategy='auto', random_state=0, replacement=False),
    'ncr':    NeighbourhoodCleaningRule(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=4, threshold_cleaning=0.5),
    'tomek':  TomekLinks(sampling_strategy='auto', n_jobs=4),
    'oss':    OneSidedSelection(sampling_strategy='auto', random_state=0, n_neighbors=1, n_jobs=4),
    'enn':    EditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=4),
    'renn':   RepeatedEditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=4, max_iter=100),   
    'nm1':    NearMiss(sampling_strategy='auto', version=1, n_neighbors=3, n_jobs=4),

    'allknn': AllKNN(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=4),
    'cnn':    CondensedNearestNeighbour(sampling_strategy='auto', random_state=0, n_neighbors=1, n_jobs=4),
    'nm2':    NearMiss(sampling_strategy='auto', version=2, n_neighbors=3, n_jobs=4)
}

## Evaluation

#### Metrics which require predict method

#### Confusion matrix, Recall, Precision, F1-score, G-mean, etc 

In [12]:
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    classification_report,
    confusion_matrix
)

# require scaling
from imblearn.metrics import (
    geometric_mean_score,
    make_index_balanced_accuracy,
)

def dominance(y_true, y_pred):
    tpr = recall_score(y_test, y_pred, pos_label=1)
    tnr = recall_score(y_test, y_pred, pos_label=0)
    return tpr - tnr

# Function to get classification metrics
def get_classification_metrics(y_true, y_pred):
    report = classification_report(y_true, y_pred, output_dict=True)
    accuracy          = accuracy_score(y_test, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    
    # Adding possible averages to recall, precision, and f1-score calculations
    # Default average is None - scores are returned for each class individually rather than being averaged
    recall_weighted    = recall_score(y_test, y_pred, average='weighted')
    precision_weighted = precision_score(y_test, y_pred, average='weighted') 
    f1_weighted        = f1_score(y_test, y_pred, average='weighted')

    recall_micro       = recall_score(y_test, y_pred, average='micro')
    precision_micro    = precision_score(y_test, y_pred, average='micro')
    f1_micro           = f1_score(y_test, y_pred, average='micro')
    
    recall_macro       = recall_score(y_test, y_pred, average='macro')
    precision_macro    = precision_score(y_test, y_pred, average='macro')
    f1_macro           = f1_score(y_test, y_pred, average='macro')

    # Adding possible averages to G-mean
    # If the average parameter is not specified, it defaults to average='binary'.
    g_mean_binary      = geometric_mean_score(y_test, y_pred, average='binary')
    g_mean_weighted    = geometric_mean_score(y_test, y_pred, average='weighted')
    g_mean_micro       = geometric_mean_score(y_test, y_pred, average='micro')
    g_mean_macro       = geometric_mean_score(y_test, y_pred, average='macro')

    # A lower alpha gives more weight to sensitivity (TPR), while a higher alpha gives more weight to specificity (TNR). 
    # In other words,in case of a lower alpha more emphasis is placed on correctly identifying positive instances
    gmean = make_index_balanced_accuracy(alpha=0.5, squared=True)(geometric_mean_score) 
    corrected_g_mean  = gmean(y_test, y_pred)
    # specifying an average might not be necessary or meaningful. Leaving for the time being
    corrected_g_mean_binary  = gmean(y_test, y_pred,average='binary')
    corrected_g_mean_weighted  = gmean(y_test, y_pred,average='weighted')
    corrected_g_mean_micro  = gmean(y_test, y_pred,average='micro')
    corrected_g_mean_macro  = gmean(y_test, y_pred,average='macro')
    
    dominance_score   = dominance(y_test, y_pred) 

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    FPR = fp / (tn + fp)
    FNR = fn / (tp + fn)
    # TPR = tp / (tp + fn) # recallminority
    # TNR = tn / (tn + fp) # recall majority
    
    return {
        'Accuracy': accuracy,
        'Recall Majority (TNR)': report['0']['recall'],
        'Recall Minority (TPR)': report['1']['recall'],
        'Balanced Accuracy': balanced_accuracy,
        'FPR': FPR,
        'FNR': FNR,
        
        'Precision Majority': report['0']['precision'],
        'Precision Minority': report['1']['precision'],
        'F1-Score Majority': report['0']['f1-score'],
        'F1-Score Minority': report['1']['f1-score'], 
        
        # Weighted average (Use 'weighted' if you want to take class imbalance into account)
        'Weighted Precision': precision_weighted,
        'Weighted Recall': recall_weighted,
        'Weighted F1-Score': f1_weighted,

        # Micro average (Use 'micro' if you want a metric that gives equal weight to every individual prediction)
        'Micro Precision': precision_micro,
        'Micro Recall': recall_micro,
        'Micro F1-Score': f1_micro,
        
        # Macro average (Use 'macro' if you want to treat each class equally)
        'Macro Precision': precision_macro,
        'Macro Recall': recall_macro,
        'Macro F1-Score': f1_macro,

        # Geometric average (a suitable metric when you want to balance the trade-off between detecting the minority class 
        # and avoiding false positives from the majority class)
        'G-mean-binary': g_mean_binary, # (compute two geometric means for each class, then the arithmetic mean of these G-means)
        'G-mean-weighted': g_mean_weighted, # calculates the geometric mean for each class and then computes the weighted arithmetic mean of these G-means
        'G-mean-micro': g_mean_micro, # treats all classes equally, regardless of their size
        'G-mean-macro': g_mean_macro, # use when you want to weight each instance equally, regardless of their class

        # Corrected G-means
        'Corrected G-mean': corrected_g_mean,
        'Corrected G-mean-binary': corrected_g_mean_binary,
        'Corrected G-mean-weighted': corrected_g_mean_weighted,
        'Corrected G-mean-micro': corrected_g_mean_micro,
        'Corrected G-mean-macro': corrected_g_mean_macro,
        
        'Dominance': dominance_score
    }

# Dictionary to hold the models
models_dict = {
    #'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42),
    #'XGBoost': XGBClassifier(random_state=42),
    'LightGBM': LGBMClassifier(random_state=42),
    'CatBoost': CatBoostClassifier(random_state=42, verbose=False) # Disable logging
}
# Collecting results
results = []
confusion_matrices = []

# Iterate over each undersampling technique
for name, undersampler in undersampler_dict.items():
    print(f"Applying undersampling technique: {name}")
    
    # Apply undersampling to the training data
    X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)
    
    # Iterate over each model
    for model_name, model in models_dict.items():
        print(f"Training model: {model_name} with undersampling technique: {name}")
        
        # Train the model
        model.fit(X_resampled, y_resampled)
        
        # Make predictions on the test set
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        metrics = get_classification_metrics(y_test, y_pred)
        
        # Append the results
        results.append({
            'Undersampling Technique': name,
            'Model': model_name,
            **metrics
        })

# Create a DataFrame from the results
df_results_1 = pd.DataFrame(results)

# Print the DataFrame as a table
print(df_results_1)

Applying undersampling technique: random
Training model: Logistic Regression with undersampling technique: random
Training model: LightGBM with undersampling technique: random
[LightGBM] [Info] Number of positive: 14069, number of negative: 14069
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.049050 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51000
[LightGBM] [Info] Number of data points in the train set: 28138, number of used features: 200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Applying undersampling technique: ncr




Training model: Logistic Regression with undersampling technique: ncr
Training model: LightGBM with undersampling technique: ncr
[LightGBM] [Info] Number of positive: 14069, number of negative: 106100
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.210969 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51000
[LightGBM] [Info] Number of data points in the train set: 120169, number of used features: 200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.117077 -> initscore=-2.020408
[LightGBM] [Info] Start training from score -2.020408
Applying undersampling technique: enn
Training model: Logistic Regression with undersampling technique: enn
Training model: LightGBM with undersampling technique: enn
[LightGBM] [Info] Number of positive: 14069, number of negative: 117372
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.239032 seconds.
You can set `force_col_wise=tr

#### Metrics which require predict_proba method
The ROC AUC score is calculated based on the predicted probabilities for each class, which are obtained using the predict_proba method. This method provides the probability estimates for each class, which are necessary to compute the ROC curve and subsequently the AUC (Area Under the Curve).

In [10]:
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score
)

# Dictionary to hold the models
models_dict = {
    #'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42),
    #'XGBoost': XGBClassifier(random_state=42),
    'LightGBM': LGBMClassifier(random_state=42),
    'CatBoost': CatBoostClassifier(random_state=42, verbose=False) # Disable logging
}

# Collecting results
results = []
# Iterate over each undersampling technique
for name, undersampler in undersampler_dict.items():
    print(f"Applying undersampling technique: {name}")
    
    # Apply undersampling to the training data
    X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)
    
    # Iterate over each model
    for model_name, model in models_dict.items():
        print(f"Training model: {model_name} with undersampling technique: {name}")

        # Train the model
        model.fit(X_resampled, y_resampled)
        
        # Make predictions on the test set
        y_pred_proba = model.predict_proba(X_test)[:, 1]  
        
        # Calculate metrics
        roc_auc = roc_auc_score(y_test, y_pred_proba)
        pr_auc = average_precision_score(y_test, y_pred_proba)
        
        # Append the results
        results.append({
            'Undersampling Technique': name,
            'Model': model_name,
            'ROC-AUC': roc_auc,
            'PR-AUC': pr_auc
        })

# Create a DataFrame from the results
df_results_2 = pd.DataFrame(results)

# Print the DataFrame as a table
print(df_results_2)

Applying undersampling technique: random
Training model: Logistic Regression with undersampling technique: random
Training model: LightGBM with undersampling technique: random
[LightGBM] [Info] Number of positive: 14069, number of negative: 14069
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.049135 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51000
[LightGBM] [Info] Number of data points in the train set: 28138, number of used features: 200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Training model: CatBoost with undersampling technique: random
Learning rate set to 0.042833
0:	learn: 0.6896089	total: 248ms	remaining: 4m 7s
1:	learn: 0.6865107	total: 292ms	remaining: 2m 25s
2:	learn: 0.6835010	total: 338ms	remaining: 1m 52s
3:	learn: 0.6806626	total: 383ms	remaining: 1m 35s
4:	learn: 0.6775753	total: 430ms	remaining: 1m 25s
5:	learn: 0.6748614	total: 478ms	rema



Training model: Logistic Regression with undersampling technique: ncr
Training model: LightGBM with undersampling technique: ncr
[LightGBM] [Info] Number of positive: 14069, number of negative: 106100
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.181979 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51000
[LightGBM] [Info] Number of data points in the train set: 120169, number of used features: 200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.117077 -> initscore=-2.020408
[LightGBM] [Info] Start training from score -2.020408
Training model: CatBoost with undersampling technique: ncr
Learning rate set to 0.079617
0:	learn: 0.6340630	total: 114ms	remaining: 1m 53s
1:	learn: 0.5846146	total: 208ms	remaining: 1m 43s
2:	learn: 0.5441372	total: 302ms	remaining: 1m 40s
3:	learn: 0.5115949	total: 397ms	remaining: 1m 38s
4:	learn: 0.4835697	total: 488ms	remaining: 1m 37s
5:	learn: 0.4611917	total: 59

### Export classification metrics

In [13]:
# Concatenate df_results_1 and df_results_2 horisontally
df_combined_results = pd.concat([df_results_2, df_results_1], axis=1)

# Export results to CSV files
df_combined_results.to_csv('4_evaluation_results_undersampling.csv', index=False)