# Santander

### Readme
The focus of this notebook was on detecting and handling outliers in the dataset to improve model training. Key steps included:

- **Outlier Detection**: Used the IQR method to identify outliers in each feature.
- **Outlier Analysis**: Counted outliers in each target class to understand their distribution.
- **Data Cleaning**: Removed outliers from the majority class (target == 0) while retaining all samples from the minority class (target == 1).
- **Dataset Comparison**: Compared the shapes and target distributions of the original and cleaned datasets to ensure data quality and class balance.
- **Preparation for Training**: Prepared cleaned training data and retained the test data for further model training.
This process aimed to enhance the model's performance by reducing noise and potential bias introduced by outliers.

# 1. Import libs
[top](#Contents)

In [1]:
# pandas and numpy imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import warnings
# Suppress the specific warning
warnings.filterwarnings("ignore", category=UserWarning, module='sklearn.metrics')

In [2]:
# pip install -U imbalanced-learn

# 2. Get data
[top](#Contents)

#### Read data
[top](#Contents)

In [3]:
bank_ds = pd.read_csv('train.csv')

In [3]:
bank_test_ds = pd.read_csv('test.csv')

#### Drop an unneeded column

In [4]:
# Assuming 'column_to_drop' is the name of the column you want to drop
column_to_drop = 'ID_code'

# Drop the column from bank_ds
bank_ds.drop(column_to_drop, axis=1, inplace=True)


### Split 70/30 (stratified sampling)

In [5]:
target = bank_ds['target']

In [6]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    bank_ds.drop(labels=['target'], axis=1),  # drop the target
    bank_ds['target'],  # just the target
    test_size=0.3,
    stratify=target,
    random_state=42)

X_train.shape, X_test.shape

((140000, 200), (60000, 200))

### Scaling

Models like Logistic Regression require scaling for better performance. Tree-based models like Random Forest, XGBoost, and LightGBM do not inherently require scaling but scaling can still be beneficial (can improve performance and convergence in some cases).

In [7]:
from sklearn.preprocessing import MinMaxScaler
# we put the variables in the same scale
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Outliers Engineering

#### Count the number of outliers in each class

In [8]:
import pandas as pd
import numpy as np

# Separate the features and target
features = bank_ds.drop(columns='target')
target = bank_ds['target']

# Define a function to detect outliers using IQR
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    is_outlier = ((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR)))
    return is_outlier

# Detect outliers in each feature
outliers = features.apply(detect_outliers_iqr)

# Sum outliers for each sample
outliers['is_outlier'] = outliers.sum(axis=1) > 0

# Add target column back to the outliers DataFrame
outliers['target'] = target

# Count outliers in each class
outliers_count = outliers[outliers['is_outlier']].groupby('target').size()

# Print the results
print(outliers_count)


target
0    21903
1     2993
dtype: int64


In [30]:
import pandas as pd
import numpy as np

# Convert X_train to a pandas DataFrame
X_train_df = pd.DataFrame(X_train, columns=bank_ds.drop(labels=['target'], axis=1).columns)

# Separate the features and target
features = X_train_df
target = y_train

# Define a function to detect outliers using IQR
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    is_outlier = ((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR)))
    return is_outlier

# Detect outliers in each feature
outliers = features.apply(detect_outliers_iqr)

# Sum outliers for each sample
outliers['is_outlier'] = outliers.sum(axis=1) > 0

# Add target column back to the outliers DataFrame
outliers['target'] = target

# Count outliers in each class
outliers_count = outliers[outliers['is_outlier']].groupby('target').size()

# Print the results
print(outliers_count)


target
0.0    11063
1.0     1287
dtype: int64


##### Remove outliers from majority class, but not minority

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Assuming bank_ds is your dataset and has been defined earlier in your code
# Separate the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    bank_ds.drop(labels=['target'], axis=1),  # drop the target
    bank_ds['target'],  # just the target
    test_size=0.3,
    stratify=bank_ds['target'],
    random_state=42)

# Convert X_train to a pandas DataFrame
X_train_df = pd.DataFrame(X_train, columns=bank_ds.drop(labels=['target'], axis=1).columns)

# Ensure y_train is a pandas Series with aligned indices
y_train = y_train.reset_index(drop=True)
X_train_df = X_train_df.reset_index(drop=True)

# Separate the features and target
features = X_train_df
target = y_train

# Separate majority and minority classes
majority_class = features[target == 0]
minority_class = features[target == 1]

# Define a function to detect outliers using IQR
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    is_outlier = ((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR)))
    return is_outlier

# Detect outliers in the majority class
outliers_majority = majority_class.apply(detect_outliers_iqr)

# Sum outliers for each sample
outliers_majority['is_outlier'] = outliers_majority.sum(axis=1) > 0

# Keep only non-outliers in the majority class
cleaned_majority_class = majority_class[~outliers_majority['is_outlier']]

# Align the indices of the target and cleaned_majority_class
cleaned_majority_class_target = target.loc[cleaned_majority_class.index]

# Combine cleaned majority class with minority class
cleaned_data = pd.concat([cleaned_majority_class, minority_class])
cleaned_target = pd.concat([cleaned_majority_class_target, target[target == 1]])

# Ensure the target is properly aligned
cleaned_data = cleaned_data.reset_index(drop=True)
cleaned_target = cleaned_target.reset_index(drop=True)

# Print the shapes of the original and cleaned datasets
print("Original dataset shape:", X_train_df.shape)
print("Cleaned dataset shape:", cleaned_data.shape)

# Print the target distribution before and after cleaning
print("Original target distribution:\n", y_train.value_counts())
print("Cleaned target distribution:\n", cleaned_target.value_counts())

# Optionally, you can return the cleaned training data and test data as follows:
X_train_cleaned = cleaned_data
y_train_cleaned = cleaned_target

# The test set remains unchanged:
X_test_df = pd.DataFrame(X_test, columns=bank_ds.drop(labels=['target'], axis=1).columns)


Original dataset shape: (140000, 200)
Cleaned dataset shape: (123320, 200)
Original target distribution:
 target
0    125931
1     14069
Name: count, dtype: int64
Cleaned target distribution:
 target
0    109251
1     14069
Name: count, dtype: int64


##### Renaming

In [10]:
X_train = X_train_cleaned
y_train = y_train_cleaned
X_test = X_test_df

## Train ML models
### Random Forests

In [11]:
rf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=2, n_jobs=4)

rf.fit(X_train, y_train)

y_train_rf = rf.predict_proba(X_train)[:,1]
y_test_rf = rf.predict_proba(X_test)[:,1]

### Logistic Regression

In [12]:
logit = LogisticRegression(random_state=42,  max_iter=2000)

logit.fit(X_train, y_train)

y_train_logit = logit.predict_proba(X_train)[:,1]
y_test_logit = logit.predict_proba(X_test)[:,1]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### XGB Classifier

In [13]:
import xgboost as xgb
from xgboost import XGBClassifier

xgb_model = XGBClassifier(random_state=42, use_label_encoder=False)

xgb_model.fit(X_train, y_train)

y_train_xgb = xgb_model.predict_proba(X_train)[:, 1]
y_test_xgb = xgb_model.predict_proba(X_test)[:, 1]

### Light GBM

In [14]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

lgb_model = LGBMClassifier(random_state=42)

lgb_model.fit(X_train, y_train)

y_train_lgb = lgb_model.predict_proba(X_train)[:, 1]
y_test_lgb = lgb_model.predict_proba(X_test)[:, 1]

[LightGBM] [Info] Number of positive: 14069, number of negative: 109251
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.137303 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51000
[LightGBM] [Info] Number of data points in the train set: 123320, number of used features: 200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.114085 -> initscore=-2.049674
[LightGBM] [Info] Start training from score -2.049674


### CatBoost

In [15]:
from catboost import CatBoostClassifier

catboost_model = CatBoostClassifier(random_state=42)

catboost_model.fit(X_train, y_train)

# Predict probabilities for the training and test sets
y_train_catboost = catboost_model.predict_proba(X_train)[:, 1]
y_test_catboost = catboost_model.predict_proba(X_test)[:, 1]

Learning rate set to 0.080502
0:	learn: 0.6319535	total: 302ms	remaining: 5m 1s
1:	learn: 0.5810590	total: 442ms	remaining: 3m 40s
2:	learn: 0.5394485	total: 579ms	remaining: 3m 12s
3:	learn: 0.5046726	total: 712ms	remaining: 2m 57s
4:	learn: 0.4766759	total: 872ms	remaining: 2m 53s
5:	learn: 0.4535324	total: 1.01s	remaining: 2m 46s
6:	learn: 0.4334646	total: 1.13s	remaining: 2m 40s
7:	learn: 0.4162499	total: 1.24s	remaining: 2m 33s
8:	learn: 0.4030335	total: 1.38s	remaining: 2m 32s
9:	learn: 0.3907472	total: 1.51s	remaining: 2m 29s
10:	learn: 0.3809563	total: 1.66s	remaining: 2m 29s
11:	learn: 0.3722892	total: 1.8s	remaining: 2m 27s
12:	learn: 0.3648129	total: 1.91s	remaining: 2m 24s
13:	learn: 0.3586594	total: 2.03s	remaining: 2m 23s
14:	learn: 0.3532511	total: 2.15s	remaining: 2m 20s
15:	learn: 0.3486108	total: 2.25s	remaining: 2m 18s
16:	learn: 0.3445799	total: 2.35s	remaining: 2m 16s
17:	learn: 0.3411927	total: 2.46s	remaining: 2m 14s
18:	learn: 0.3382288	total: 2.58s	remaining: 2

## Evaluation

#### Metrics which require predict method

##### Confusion matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

TN | FP

FN | TP

#### Confusion matrix, Recall, Precision, F1-score, G-mean, etc 

In [None]:
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    classification_report,
    confusion_matrix
)

from imblearn.metrics import (
    geometric_mean_score,
    make_index_balanced_accuracy,
)

In [None]:
# Helper functions


def dominance(y_true, y_pred):
    tpr = recall_score(y_test, y_pred, pos_label=1)
    tnr = recall_score(y_test, y_pred, pos_label=0)
    return tpr - tnr

# Function to get classification metrics
def get_classification_metrics(y_true, y_pred):
    report = classification_report(y_true, y_pred, output_dict=True)
    accuracy          = accuracy_score(y_test, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    
    # Adding possible averages to recall, precision, and f1-score calculations
    # Default average is None - scores are returned for each class individually rather than being averaged
    recall_weighted    = recall_score(y_test, y_pred, average='weighted')
    precision_weighted = precision_score(y_test, y_pred, average='weighted') 
    f1_weighted        = f1_score(y_test, y_pred, average='weighted')

    recall_micro       = recall_score(y_test, y_pred, average='micro')
    precision_micro    = precision_score(y_test, y_pred, average='micro')
    f1_micro           = f1_score(y_test, y_pred, average='micro')
    
    recall_macro       = recall_score(y_test, y_pred, average='macro')
    precision_macro    = precision_score(y_test, y_pred, average='macro')
    f1_macro           = f1_score(y_test, y_pred, average='macro')

    # Adding possible averages to G-mean
    # If the average parameter is not specified, it defaults to average='binary'.
    g_mean_binary      = geometric_mean_score(y_test, y_pred, average='binary')
    g_mean_weighted    = geometric_mean_score(y_test, y_pred, average='weighted')
    g_mean_micro       = geometric_mean_score(y_test, y_pred, average='micro')
    g_mean_macro       = geometric_mean_score(y_test, y_pred, average='macro')

    # A lower alpha gives more weight to sensitivity (TPR), while a higher alpha gives more weight to specificity (TNR). 
    # In other words,in case of a lower alpha more emphasis is placed on correctly identifying positive instances
    gmean = make_index_balanced_accuracy(alpha=0.5, squared=True)(geometric_mean_score) 
    corrected_g_mean  = gmean(y_test, y_pred)
    # specifying an average might not be necessary or meaningful. Leaving for the time being
    corrected_g_mean_binary  = gmean(y_test, y_pred,average='binary')
    corrected_g_mean_weighted  = gmean(y_test, y_pred,average='weighted')
    corrected_g_mean_micro  = gmean(y_test, y_pred,average='micro')
    corrected_g_mean_macro  = gmean(y_test, y_pred,average='macro')
    
    dominance_score   = dominance(y_test, y_pred) 

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    FPR = fp / (tn + fp)
    FNR = fn / (tp + fn)
    # TPR = tp / (tp + fn) # recallminority
    # TNR = tn / (tn + fp) # recall majority
    
    return {
        'Accuracy': accuracy,
        'Recall Majority (TNR)': report['0']['recall'],
        'Recall Minority (TPR)': report['1']['recall'],
        'Balanced Accuracy': balanced_accuracy,
        'FPR': FPR,
        'FNR': FNR,
        
        'Precision Majority': report['0']['precision'],
        'Precision Minority': report['1']['precision'],
        'F1-Score Majority': report['0']['f1-score'],
        'F1-Score Minority': report['1']['f1-score'], 
        
        # Weighted average (Use 'weighted' if you want to take class imbalance into account)
        'Weighted Precision': precision_weighted,
        'Weighted Recall': recall_weighted,
        'Weighted F1-Score': f1_weighted,

        # Micro average (Use 'micro' if you want a metric that gives equal weight to every individual prediction)
        'Micro Precision': precision_micro,
        'Micro Recall': recall_micro,
        'Micro F1-Score': f1_micro,
        
        # Macro average (Use 'macro' if you want to treat each class equally)
        'Macro Precision': precision_macro,
        'Macro Recall': recall_macro,
        'Macro F1-Score': f1_macro,

        # Geometric average (a suitable metric when you want to balance the trade-off between detecting the minority class 
        # and avoiding false positives from the majority class)
        'G-mean-binary': g_mean_binary, # (compute two geometric means for each class, then the arithmetic mean of these G-means)
        'G-mean-weighted': g_mean_weighted, # calculates the geometric mean for each class and then computes the weighted arithmetic mean of these G-means
        'G-mean-micro': g_mean_micro, # treats all classes equally, regardless of their size
        'G-mean-macro': g_mean_macro, # use when you want to weight each instance equally, regardless of their class

        # Corrected G-means
        'Corrected G-mean': corrected_g_mean,
        'Corrected G-mean-binary': corrected_g_mean_binary,
        'Corrected G-mean-weighted': corrected_g_mean_weighted,
        'Corrected G-mean-micro': corrected_g_mean_micro,
        'Corrected G-mean-macro': corrected_g_mean_macro,
        
        'Dominance': dominance_score
    }

In [37]:
models = {
    'Random Forest': rf.predict(X_test),
    'Logistic Regression': logit.predict(X_test),
    'XGBoost': xgb_model.predict(X_test),
    'LightGBM': lgb_model.predict(X_test),
    'CatBoost': catboost_model.predict(X_test)
}

# Collecting results
results = []
confusion_matricies = []
for model_name, y_pred in models.items():
    metrics           = get_classification_metrics(y_test, y_pred)
    conf_matrix       = confusion_matrix(y_test, y_pred)
    results.append({
        'Model': model_name,
        **metrics,
    })
    confusion_matricies.append({
        'Model': model_name,
        'Confusion matrix': conf_matrix
    })

# Create a DataFrame from the results
df_results_1 = pd.DataFrame(results)
df_confusion_matricies = pd.DataFrame(confusion_matricies)

# Print the DataFrame as a table
print(df_results_1)
print(df_confusion_matricies)

                 Model  Accuracy  Recall Majority (TNR)  \
0        Random Forest  0.899517               1.000000   
1  Logistic Regression  0.912850               0.981953   
2              XGBoost  0.912117               0.982639   
3             LightGBM  0.909450               0.996146   

   Recall Minority (TPR)  Balanced Accuracy       FPR       FNR  \
0               0.000000           0.500000  0.000000  1.000000   
1               0.294244           0.638099  0.018047  0.705756   
2               0.280809           0.631724  0.017361  0.719191   
3               0.133355           0.564751  0.003854  0.866645   

   Precision Majority  Precision Minority  F1-Score Majority  ...  \
0            0.899517            0.000000           0.947101  ...   
1            0.925679            0.645560           0.952986  ...   
2            0.924420            0.643726           0.952641  ...   
3            0.911423            0.794466           0.951903  ...   

   G-mean-binary  G-me

#### Metrics which require predict_proba method
The ROC AUC score is calculated based on the predicted probabilities for each class, which are obtained using the predict_proba method. This method provides the probability estimates for each class, which are necessary to compute the ROC curve and subsequently the AUC (Area Under the Curve).

In [16]:
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score
)

models = {
    'Random Forest': rf.predict_proba(X_test)[:,1],
    'Logistic Regression': logit.predict_proba(X_test)[:,1],
    'XGBoost': xgb_model.predict_proba(X_test)[:,1],
    'LightGBM': lgb_model.predict_proba(X_test)[:,1],
    'CatBoost': catboost_model.predict_proba(X_test)[:,1]
}

# Collecting results
results = []
for model_name, y_pred in models.items():
    roc_auc   = roc_auc_score(y_test, y_pred) 
    pr_auc = average_precision_score(y_test, y_pred)
    results.append({
        'Model': model_name,
        'ROC-AUC': roc_auc,
        'PR-AUC': pr_auc
    })

# Create a DataFrame from the results
df_results_2 = pd.DataFrame(results)

# Print the DataFrame as a table
print(df_results_2)

                 Model   ROC-AUC    PR-AUC
0        Random Forest  0.736037  0.224972
1  Logistic Regression  0.858266  0.497735
2              XGBoost  0.855851  0.483367
3             LightGBM  0.861808  0.499206
4             CatBoost  0.891671  0.577059


### Export classification metrics

In [40]:
# Concatenate df_results_1 and df_results_2 horisontally
df_combined_results = pd.concat([df_results_2, df_results_1], axis=1)

# Export results to CSV files
df_combined_results.to_csv('5_evaluation_results_outliers.csv', index=False)
df_confusion_matricies.to_csv('5_confusion_matricies_outliers.csv', index=False)