## Contents
- [1. Goal](#1-goal)
- [2. Theory](#2-theory)
    - [Summary Table of Averages and Metrics in Machine Learning](#summary-table-of-averages-and-metrics-in-machine-learning)
    - [Summary Table of Mean-Related Metrics in Machine Learning](#summary-table-of-mean-related-metrics-in-machine-learning)
- [3. Exploratory Analysis Summary](#3-exploratory-analysis-summary)
- [4. Notebook 1 - Santander Step 1: Initial Evaluation](#4-notebook-1---santander-step-1-initial-evaluation)
- [5. Notebook 2 - Santander Step 2: Stratified Sampling](#5-notebook-2---santander-step-2-stratified-sampling)
- [6. Notebook 3 - Santander Step 3: Scaling](#6-notebook-3---santander-step-3-scaling)
- [7. Notebook 4 - Santander Step 4: Undersampling Majority Class](#7-notebook-4---santander-step-4-undersampling-majority-class)
- [8. Notebook 5 - Santander Step 5: Outliers Engineering](#8-notebook-5---santander-step-5-outliers-engineering)
- [9. Notebook 6 - Santander Step 6: Oversampling](#9-notebook-6---santander-step-6-oversampling)
- [10. Notebook 7 - Santander Step 7: Oversampling and Cross-Validation](#10-notebook-7---santander-step-7-oversampling-and-cross-validation)
- [11. Notebook 8 - Santander Step 8: Over and Undersampling and Cross-Validation](#11-notebook-8---santander-step-8-over-and-undersampling-and-cross-validation)
- [12. Notebook 9 - Santander Step 9: Ensemble Methods](#12-notebook-9---santander-step-9-ensemble-methods)
- [13. Notebook 9.1 - Santander Step 9.1: Cost-Sensitive Approach](#13-notebook-91---santander-step-91-cost-sensitive-approach)
- [14. Summary Table of Highest ROC-AUC Values](#14-summary-table-of-highest-roc-auc-values)
- [15. Conclusion](#15-conclusion)


## 1. Goal
The goals of working on the Santander dataset from Kaggle were to:
- Explore and compare a variety of classification metrics, including both popular and lesser-known ones.
- Evaluate different machine learning models using these diverse metrics.
- Apply techniques to address the imbalanced nature of the dataset.
- Use both the `predict` and `predict_proba` methods for comprehensive model analysis.
- Apply undersampling and oversampling strategies to enhance model robustness and accuracy.

## 2. Theory

### Summary Table of Averages and Metrics in Machine Learning

| Average Type                         | Description                                                         | Metric Examples                                 |
|--------------------------------------|---------------------------------------------------------------------|-------------------------------------------------|
| Arithmetic Mean                      | Sum of all values divided by the number of values.                  | N/A                                             |
| Geometric Mean                       | Nth root of the product of N values.                                | G-mean                                          |
| Harmonic Mean                        | N divided by the sum of the reciprocals of the values.              | F1-Score                                        |
| Mode                                 | The value that appears most frequently in a data set.               | N/A                                             |
| Median                               | The middle value in a data set when the values are arranged in ascending order. | N/A                                             |
| Micro Average                        | Computes metrics globally by aggregating across all instances.      | Micro Precision, Micro Recall, Micro F1-Score   |
| Macro Average                        | Computes metrics for each class separately and then averages them.  | Macro Precision, Macro Recall, Macro F1-Score   |
| Weighted Average                     | Computes metrics for each class separately and then averages them, weighted by class support. | Weighted Precision, Weighted Recall, Weighted F1-Score |
| Binary Average                       | Computes metrics based on the positive class in binary classification. | Binary Precision, Binary Recall, Binary F1-Score |
| Micro Average (Geometric Mean)       | Computes the geometric mean globally across all instances.          | Micro G-mean                                    |
| Macro Average (Geometric Mean)       | Computes the geometric mean for each class separately and then averages them. | Macro G-mean                                    |
| Weighted Average (Geometric Mean)    | Computes the geometric mean for each class separately and then averages them, weighted by class support. | Weighted G-mean                                 |
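
The difference between macro, micro, and weighted averaging can be checked by hand. The sketch below uses only the Python standard library and takes the Logistic Regression confusion matrix reported for Notebook 1 as input. The helper name `averaged_recall` is illustrative and does not come from the notebooks.

```python
from statistics import geometric_mean, harmonic_mean

# Confusion matrix [[TN, FP], [FN, TP]] for Logistic Regression in Notebook 1
# (rows = true class, columns = predicted class).
cm = [[53135, 721], [4510, 1634]]

def averaged_recall(cm):
    """Return per-class, macro, micro and weighted recall for a confusion matrix."""
    support = [sum(row) for row in cm]              # true samples per class
    correct = [cm[i][i] for i in range(len(cm))]    # diagonal = correct predictions
    per_class = [c / s for c, s in zip(correct, support)]
    macro = sum(per_class) / len(per_class)         # unweighted mean over classes
    micro = sum(correct) / sum(support)             # pooled over all instances
    weighted = sum(r * s for r, s in zip(per_class, support)) / sum(support)
    return per_class, macro, micro, weighted

per_class, macro, micro, weighted = averaged_recall(cm)
print(f"per-class recall: {per_class[0]:.6f}, {per_class[1]:.6f}")
print(f"macro: {macro:.6f}  micro: {micro:.6f}  weighted: {weighted:.6f}")
print(f"geometric mean of recalls: {geometric_mean(per_class):.6f}")
print(f"harmonic mean of recalls:  {harmonic_mean(per_class):.6f}")
```

For recall, the macro average equals balanced accuracy (0.626281), while the micro and weighted averages both equal plain accuracy (0.912817). The geometric mean of the two recalls is the binary G-mean (0.512240).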

### Summary Table of Mean-Related Metrics in Machine Learning

| Metric                        | Description                                                  | 
|-------------------------------|--------------------------------------------------------------|
| Mean Absolute Error (MAE)     | Average of the absolute differences between predicted and actual values. | 
| Mean Squared Error (MSE)      | Average of the squared differences between predicted and actual values.    | 
| Root Mean Squared Error (RMSE)| Square root of the MSE, providing the error magnitude in the same units as the original values. | 
| Mean Absolute Percentage Error (MAPE)| Average of the absolute percentage differences between predicted and actual values. | 
| Mean Bias Deviation (MBD)     | Measures the mean bias error of a model.                    | 

## 3. Exploratory Analysis Summary

###### Shape
- bank_ds.shape (200000, 202)
- bank_test_ds.shape (200000, 201)
###### Data types
- There is 1 binary feature (the prediction target, 'target')
- There are 0 discrete features
- There are 200 continuous features ('var_0' ... 'var_199')
- There is 1 categorical feature ('ID_code')
- Cardinality of the continuous features ranges from 451 (var_68) to 169968 (var_45)
###### Duplicated rows
- No duplicated rows
###### Missing values
- No missing values
###### Distributions
- Histograms of [var_0, ..., var_199] resemble a normal distribution, although some show several peaks. However, the Shapiro-Wilk test rejects normality for every feature.
###### Outliers
- About 10 features were checked with box plots, and outliers are present
###### Correlation
- No correlation was found among the features [var_0, ..., var_199]
###### Balance ratio 
- target can be 1 or 0
- 1 in 10.049% of cases
- 0 in 89.951% of cases
###### Conclusion
- The distribution plots suggest that the features have already been transformed: they look approximately normal, although the Shapiro-Wilk test does not confirm this.
- The box plots show that some outliers exist.
- The correlation analysis shows that the features are not correlated with each other; only the target has a slight correlation with them.
- The dataset is imbalanced (about 10/90).
###### Actions:
- Drop the categorical column ('ID_code')
- Apply techniques for imbalanced data
- Handle outliers

## 4. Notebook 1 - Santander Step 1: Initial Evaluation - 0.8940
| Model                | Accuracy | Recall Majority (TNR) | Recall Minority (TPR) | Balanced Accuracy | FPR      | FNR      | Precision Majority | Precision Minority | F1-Score Majority | F1-Score Minority | G-mean-binary | G-mean-weighted | G-mean-micro | G-mean-macro | Corrected G-mean | Corrected G-mean-binary | Corrected G-mean-weighted | Corrected G-mean-micro | Corrected G-mean-macro | Dominance |
|----------------------|----------|-----------------------|-----------------------|-------------------|----------|----------|--------------------|--------------------|--------------------|--------------------|----------------|------------------|--------------|--------------|------------------|-------------------------|--------------------------|-------------------------|-------------------------|-----------|
| Baseline             | 0.897600 | 1.000000              | 0.000000              | 0.500000          | 0.000000 | 1.000000 | 0.897600           | 0.000000           | 0.946037           | 0.000000           | 0.000000       | 0.303174         | 0.897600     | 0.500000     | 0.000000         | 0.000000                | 0.128459                 | 0.805686                | 0.250000                | -1.000000  |
| Random Forest        | 0.897600 | 1.000000              | 0.000000              | 0.500000          | 0.000000 | 1.000000 | 0.897600           | 0.000000           | 0.946037           | 0.000000           | 0.000000       | 0.303174         | 0.897600     | 0.500000     | 0.000000         | 0.000000                | 0.128459                 | 0.805686                | 0.250000                | -1.000000  |
| Logistic Regression  | 0.912817 | 0.986612              | 0.265951              | 0.626281          | 0.013388 | 0.734049 | 0.921763           | 0.693843           | 0.953086           | 0.383561           | 0.512240       | 0.556890         | 0.912817     | 0.626281     | 0.262390         | 0.167843                | 0.398988                 | 0.833234                | 0.392228                | -0.720662  |
| XGBoost              | 0.911717 | 0.986650              | 0.254883              | 0.620766          | 0.013350 | 0.745117 | 0.920679           | 0.685339           | 0.952524           | 0.371429           | 0.501478       | 0.548360         | 0.911717     | 0.620766     | 0.251480         | 0.159468                | 0.388187                 | 0.831227                | 0.385351                | -0.731767  |
| LightGBM             | 0.906350 | 0.997995              | 0.103027              | 0.550511          | 0.002005 | 0.896973 | 0.907001           | 0.854251           | 0.950325           | 0.183098           | 0.320657       | 0.420049         | 0.906350     | 0.550511     | 0.102821         | 0.056810                | 0.239226                 | 0.821470                | 0.303062                | -0.894967  |
| AdaBoost             | 0.905017 | 0.989268              | 0.166504              | 0.577886          | 0.010732 | 0.833496 | 0.912310           | 0.638976           | 0.949232           | 0.256757           | 0.405853       | 0.476379         | 0.905017     | 0.577886     | 0.164717         | 0.096955                | 0.301176                 | 0.819055                | 0.333952                | -0.822764  |
| CatBoost             | 0.922533 | 0.990159              | 0.329753              | 0.659956          | 0.009841 | 0.670247 | 0.928313           | 0.792645           | 0.958239           | 0.464437           | 0.571408       | 0.605471         | 0.922533     | 0.659956     | 0.326507         | 0.218694                | 0.462854                 | 0.851068                | 0.435542                | -0.660406  |

| Model                | Confusion matrix                 |
|----------------------|----------------------------------|
| Baseline             | [[53856, 0], [6144, 0]]          |
| Random Forest        | [[53856, 0], [6144, 0]]          |
| Logistic Regression  | [[53135, 721], [4510, 1634]]     |
| XGBoost              | [[53137, 719], [4578, 1566]]     |
| LightGBM             | [[53748, 108], [5511, 633]]      |
| AdaBoost             | [[53278, 578], [5121, 1023]]     |
| CatBoost             | [[53326, 530], [4118, 2026]]     |
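
Most columns of the metrics table can be re-derived from these confusion matrices. The sketch below assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` and treats dominance as TPR minus TNR; that definition is inferred from the reported numbers rather than stated in the notebook. The function name `binary_rates` is illustrative.

```python
from math import sqrt

def binary_rates(cm):
    """Derive rate-based metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    tnr = tn / (tn + fp)          # recall of the majority class
    tpr = tp / (tp + fn)          # recall of the minority class
    return {
        "TNR": tnr,
        "TPR": tpr,
        "FPR": 1 - tnr,
        "FNR": 1 - tpr,
        "balanced_accuracy": (tnr + tpr) / 2,
        "g_mean": sqrt(tnr * tpr),
        "dominance": tpr - tnr,   # assumed definition, see the note above
    }

baseline = binary_rates([[53856, 0], [6144, 0]])
catboost = binary_rates([[53326, 530], [4118, 2026]])

for name, rates in [("Baseline", baseline), ("CatBoost", catboost)]:
    print(name, {key: round(value, 6) for key, value in rates.items()})
```

The baseline never predicts the minority class, so its TPR and G-mean are 0 and its dominance is -1, even though its accuracy is close to 0.90. This is why accuracy alone is a poor guide on this dataset.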

| Model                | ROC-AUC  | PR-AUC  |
|----------------------|----------|---------|
| Baseline             | 0.500000 | 0.102400|
| Random Forest        | 0.747842 | 0.289410|
| Logistic Regression  | 0.860041 | 0.510377|
| XGBoost              | 0.857861 | 0.497970|
| LightGBM             | 0.862282 | 0.517279|
| AdaBoost             | 0.803598 | 0.401087|
| CatBoost             | 0.893987 | 0.606566|
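
ROC-AUC and PR-AUC are ranking metrics, so they are computed from the scores returned by `predict_proba` rather than from the hard labels returned by `predict`. The sketch below is a minimal illustration on hypothetical toy data, not the notebook code. It computes ROC-AUC as the fraction of positive/negative pairs that are ranked correctly, counting ties as one half.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Pairwise comparison is O(n^2); fine for a toy example only.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: true labels and positive-class probabilities.
y_true = [0, 0, 1, 1, 1]
proba = [0.1, 0.4, 0.35, 0.45, 0.8]
hard = [int(p >= 0.5) for p in proba]   # what predict would return at threshold 0.5
constant = [0.0] * len(y_true)          # baseline: always predicts the majority class

print(f"AUC from probabilities: {roc_auc(y_true, proba):.4f}")
print(f"AUC from hard labels:   {roc_auc(y_true, hard):.4f}")
print(f"AUC of constant scores: {roc_auc(y_true, constant):.4f}")
```

Thresholding discards the ranking information, which lowers the AUC in this toy case, and a constant score always yields 0.5, matching the Baseline row.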

## 5. Notebook 2 - Santander Step 2: Stratified Sampling - 0.8939

| Model                | Accuracy | Recall Majority (TNR) | Recall Minority (TPR) | Balanced Accuracy | FPR      | FNR      | Precision Majority | Precision Minority | F1-Score Majority | F1-Score Minority | G-mean-binary | G-mean-weighted | G-mean-micro | G-mean-macro | Corrected G-mean | Corrected G-mean-binary | Corrected G-mean-weighted | Corrected G-mean-micro | Corrected G-mean-macro | Dominance |
|----------------------|----------|-----------------------|-----------------------|-------------------|----------|----------|--------------------|--------------------|-------------------|-------------------|---------------|-----------------|--------------|--------------|------------------|--------------------------|--------------------------|--------------------------|--------------------------|-----------|
| Baseline             | 0.899517 | 1.000000              | 0.000000              | 0.500000          | 0.000000 | 1.000000 | 0.899517           | 0.000000           | 0.947101           | 0.000000          | 0.000000      | 0.300643        | 0.899517     | 0.500000     | 0.000000         | 0.000000                 | 0.126497                 | 0.809130                 | 0.250000                | -1.000000 |
| Random Forest        | 0.899517 | 1.000000              | 0.000000              | 0.500000          | 0.000000 | 1.000000 | 0.899517           | 0.000000           | 0.947101           | 0.000000          | 0.000000      | 0.300643        | 0.899517     | 0.500000     | 0.000000         | 0.000000                 | 0.126497                 | 0.809130                 | 0.250000                | -1.000000 |
| Logistic Regression  | 0.913250 | 0.985807              | 0.263725              | 0.624766          | 0.014193 | 0.736275 | 0.922993           | 0.674873           | 0.953366           | 0.674873          | 0.509885      | 0.554175        | 0.913250     | 0.624766     | 0.259982         | 0.166118                 | 0.395706                 | 0.834026                 | 0.390333                | -0.722082 |
| XGBoost              | 0.913183 | 0.987660              | 0.246475              | 0.617068          | 0.012340 | 0.753525 | 0.921467           | 0.690520           | 0.953416           | 0.690520          | 0.493390      | 0.541376        | 0.913183     | 0.617068     | 0.243434         | 0.153219                 | 0.379876                 | 0.833904                 | 0.380773                | -0.741185 |
| LightGBM             | 0.907650 | 0.998240              | 0.096699              | 0.547470          | 0.001760 | 0.903301 | 0.908196           | 0.859882           | 0.951091           | 0.859882          | 0.310691      | 0.412302        | 0.907650     | 0.547470     | 0.096529         | 0.053017                 | 0.231221                 | 0.823829                 | 0.299723                | -0.901541 |
| CatBoost             | 0.920667 | 0.989420              | 0.305192              | 0.647306          | 0.010580 | 0.694808 | 0.927260           | 0.763169           | 0.957332           | 0.763169          | 0.549511      | 0.586753        | 0.920667     | 0.647306     | 0.301963         | 0.198657                 | 0.438391                 | 0.847627                 | 0.419005                | -0.684229 |

| Model                | Confusion matrix                 |
|----------------------|----------------------------------|
| Baseline             | [[53971, 0], [6029, 0]]          |
| Random Forest        | [[53971, 0], [6029, 0]]          |
| Logistic Regression  | [[53205, 766], [4439, 1590]]     |
| XGBoost              | [[53305, 666], [4543, 1486]]     |
| LightGBM             | [[53876, 95], [5446, 583]]       |
| CatBoost             | [[53400, 571], [4189, 1840]]     |

| Model               | ROC-AUC  | PR-AUC  |
|---------------------|----------|---------|
| Baseline            | 0.500000 | 0.100483|
| Random Forest       | 0.748211 | 0.277832|
| Logistic Regression | 0.858369 | 0.497739|
| XGBoost             | 0.854883 | 0.488493|
| LightGBM            | 0.863721 | 0.507548|
| CatBoost            | 0.893907 | 0.592283|
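
With a stratified split, the test set keeps the class ratio of the full dataset: 6029 of the 60000 test rows are positive (10.05%), compared with 6144 (10.24%) in Notebook 1. The sketch below illustrates the idea with the standard library only. The function name, the toy labels, and the 30% test size (inferred from 60000 test rows out of 200000) are assumptions. In practice the same effect is usually obtained with scikit-learn's `train_test_split(..., stratify=y)`.

```python
import random
from collections import defaultdict

def stratified_split(y, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 10/90 imbalance, similar to the Santander target.
y = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(y)
test_share = sum(y[i] for i in test_idx) / len(test_idx)
print(f"test rows: {len(test_idx)}, minority share in test: {test_share:.3f}")
```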

## 6. Notebook 3 - Santander Step 3: Scaling - 0.8939

| Model                | Accuracy | Recall Majority (TNR) | Recall Minority (TPR) | Balanced Accuracy | FPR      | FNR      | Precision Majority | Precision Minority | F1-Score Majority | F1-Score Minority | G-mean-binary | G-mean-weighted | G-mean-micro | G-mean-macro | Corrected G-mean | Corrected G-mean-binary | Corrected G-mean-weighted | Corrected G-mean-micro | Corrected G-mean-macro | Dominance |
|----------------------|----------|-----------------------|-----------------------|-------------------|----------|----------|--------------------|--------------------|-------------------|-------------------|---------------|-----------------|--------------|--------------|------------------|--------------------------|--------------------------|--------------------------|--------------------------|-----------|
| Baseline             | 0.899517 | 1.000000              | 0.000000              | 0.500000          | 0.000000 | 1.000000 | 0.899517           | 0.000000           | 0.947101           | 0.000000          | 0.000000      | 0.300643        | 0.899517     | 0.500000     | 0.000000         | 0.000000                 | 0.126497                 | 0.809130                 | 0.250000                | -1.000000 |
| Random Forest        | 0.899517 | 1.000000              | 0.000000              | 0.500000          | 0.000000 | 1.000000 | 0.899517           | 0.000000           | 0.947101           | 0.000000          | 0.000000      | 0.300643        | 0.899517     | 0.500000     | 0.000000         | 0.000000                 | 0.126497                 | 0.809130                 | 0.250000                | -1.000000 |
| Logistic Regression  | 0.913517 | 0.986085              | 0.263891              | 0.624988          | 0.013915 | 0.736109 | 0.923029           | 0.679334           | 0.953516           | 0.678529          | 0.510117      | 0.554402        | 0.913517     | 0.624988     | 0.260219         | 0.166255                 | 0.396044                 | 0.834513                 | 0.390610                | -0.722194 |
| XGBoost              | 0.913183 | 0.987660              | 0.246475              | 0.617068          | 0.012340 | 0.753525 | 0.921467           | 0.690520           | 0.953416           | 0.685228          | 0.493390      | 0.541376        | 0.913183     | 0.617068     | 0.243434         | 0.153219                 | 0.379876                 | 0.833904                 | 0.380773                | -0.741185 |
| LightGBM             | 0.907283 | 0.998295              | 0.092553              | 0.545424          | 0.001705 | 0.907447 | 0.907818           | 0.858462           | 0.950909           | 0.165231          | 0.303965      | 0.408099        | 0.907283     | 0.545424     | 0.092395         | 0.050552                 | 0.226811                 | 0.823163                 | 0.297487                | -0.905743 |
| CatBoost             | 0.920667 | 0.989420              | 0.305192              | 0.647306          | 0.010580 | 0.694808 | 0.927260           | 0.763169           | 0.957332           | 0.607332          | 0.549511      | 0.586753        | 0.920667     | 0.647306     | 0.301963         | 0.198657                 | 0.438391                 | 0.847627                 | 0.419005                | -0.684229 |

| Model                | Confusion matrix                 |
|----------------------|----------------------------------|
| Baseline             | [[53971, 0], [6029, 0]]          |
| Random Forest        | [[53971, 0], [6029, 0]]          |
| Logistic Regression  | [[53220, 751], [4438, 1591]]     |
| XGBoost              | [[53305, 666], [4543, 1486]]     |
| LightGBM             | [[53879, 92], [5471, 558]]       |

| Model               | ROC-AUC  | PR-AUC  |
|---------------------|----------|---------|
| Baseline            | 0.500000 | 0.100483|
| Random Forest       | 0.748215 | 0.277840|
| Logistic Regression | 0.858844 | 0.498939|
| XGBoost             | 0.854883 | 0.488493|
| LightGBM            | 0.863880 | 0.511203|
| CatBoost            | 0.893906 | 0.592294|

## 7. Notebook 4 - Santander Step 4: Undersampling Majority Class - 0.8941

With `sampling_strategy='auto'`, every class except the minority class is resampled; in this binary problem that means only the majority class is undersampled.

| Technique                        | Description                                                                                                           | Key Parameters                                                                                   |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| **Random UnderSampler (random)** | Randomly removes samples from the majority class.                                                                     | `sampling_strategy='auto'`, `random_state=0`, `replacement=False`                                |
| **Neighbourhood Cleaning Rule (ncr)** | Removes samples from the majority class based on a neighborhood rule to clean the class distribution.                | `sampling_strategy='auto'`, `n_neighbors=3`, `kind_sel='all'`, `n_jobs=4`, `threshold_cleaning=0.5` |
| **Tomek Links (tomek)**          | Removes samples from the majority class that form Tomek links with the minority class.                                | `sampling_strategy='auto'`, `n_jobs=4`                                                           |
| **One-Sided Selection (oss)**    | Combines Tomek links and condensed nearest neighbors to clean the dataset.                                            | `sampling_strategy='auto'`, `random_state=0`, `n_neighbors=1`, `n_jobs=4`                        |
| **Edited Nearest Neighbours (enn)** | Removes samples from the majority class if they are misclassified by their nearest neighbors.                       | `sampling_strategy='auto'`, `n_neighbors=3`, `kind_sel='all'`, `n_jobs=4`                        |
| **Repeated Edited Nearest Neighbours (renn)** | Applies Edited Nearest Neighbours multiple times iteratively.                                                     | `sampling_strategy='auto'`, `n_neighbors=3`, `kind_sel='all'`, `n_jobs=4`, `max_iter=100`        |
| **NearMiss-1 (nm1)**             | Keeps the majority-class samples with the smallest average distance to their nearest minority-class samples.          | `sampling_strategy='auto'`, `version=1`, `n_neighbors=3`, `n_jobs=4`                             |
| **AllKNN (allknn)**              | Applies Edited Nearest Neighbours repeatedly, increasing the neighborhood size at each pass up to `n_neighbors`.       | `sampling_strategy='auto'`, `n_neighbors=3`, `kind_sel='all'`, `n_jobs=4`                        |
| **Condensed Nearest Neighbour (cnn)** | Keeps a reduced subset of the majority class for which a 1-nearest-neighbor rule still classifies the samples correctly. | `sampling_strategy='auto'`, `random_state=0`, `n_neighbors=1`, `n_jobs=4`                        |
| **NearMiss-2 (nm2)**             | Keeps the majority-class samples with the smallest average distance to their farthest minority-class samples.         | `sampling_strategy='auto'`, `version=2`, `n_neighbors=3`, `n_jobs=4`                             |
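
As an illustration of the simplest technique in the table, the sketch below imitates random undersampling without replacement using only the standard library. It is not the imbalanced-learn implementation; the function name and the toy data are hypothetical. In the notebook the corresponding call would be imbalanced-learn's `RandomUnderSampler` with the parameters listed above.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop rows so that every class has as many rows as the smallest one."""
    counts = Counter(y)
    n_min = min(counts.values())
    rng = random.Random(seed)

    keep = []
    for label in counts:
        idx = [i for i, target in enumerate(y) if target == label]
        if len(idx) > n_min:
            idx = rng.sample(idx, n_min)   # sampling without replacement
        keep.extend(idx)
    keep.sort()                            # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 2 minority rows and 18 majority rows.
X = [[i] for i in range(20)]
y = [1] * 2 + [0] * 18
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

All minority rows are kept, and the majority class is reduced to the same size, which discards most of the majority data.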

| Undersampling Technique | Model               | Accuracy | Recall Majority (TNR) | Recall Minority (TPR) | Balanced Accuracy |       FPR       |       FNR       | Precision Majority | Precision Minority | F1-Score Majority | F1-Score Minority | G-mean-binary | G-mean-weighted | G-mean-micro | G-mean-macro | Corrected G-mean | Corrected G-mean-binary | Corrected G-mean-weighted | Corrected G-mean-micro | Corrected G-mean-macro | Dominance |
|-------------------------|---------------------|----------|-----------------------|-----------------------|-------------------|-----------------|-----------------|--------------------|--------------------|-------------------|-------------------|---------------|-----------------|---------------|---------------|------------------|--------------------------|--------------------------|--------------------------|--------------------------|-----------|
| random                  | Logistic Regression | 0.779217 | 0.780104              | 0.771272              | 0.775688          | 0.219896        | 0.228728        | 0.968286           | 0.281511           | 0.857540          | 0.494507          | 0.775676      | 0.775680        | 0.779217      | 0.775688      | 0.601673         | 0.599016                 | 0.603803                 | 0.607179                 | 0.601692                 | -0.008832 |
| random                  | LightGBM            | 0.772617 | 0.770414              | 0.792337              | 0.781375          | 0.229586        | 0.207663        | 0.970770           | 0.278250           | 0.863440          | 0.499011          | 0.781298      | 0.781326        | 0.772617      | 0.781375      | 0.610427         | 0.617119                 | 0.605124                 | 0.596937                 | 0.610547                 | 0.021923  |
| ncr                     | Logistic Regression | 0.912683 | 0.980212              | 0.308177              | 0.644194          | 0.019788        | 0.691823        | 0.926919           | 0.634997           | 0.912683          | 0.498557          | 0.549617      | 0.585577        | 0.912683      | 0.644194      | 0.302079         | 0.200575                 | 0.434965                 | 0.832991                 | 0.414986                 | -0.672034 |
| ncr                     | LightGBM            | 0.908267 | 0.997684              | 0.107812              | 0.552748          | 0.002316        | 0.892188        | 0.909177           | 0.838710           | 0.908267          | 0.508598          | 0.327967      | 0.423246        | 0.908267      | 0.552748      | 0.107563         | 0.059704                 | 0.242824                 | 0.824948                 | 0.305530                 | -0.889872 |
| enn                     | Logistic Regression | 0.913283 | 0.982750              | 0.291425              | 0.637087          | 0.017250        | 0.708575        | 0.925461           | 0.653646           | 0.913283          | 0.498822          | 0.535161      | 0.574105        | 0.913283      | 0.637087      | 0.286398         | 0.187401                 | 0.420629                 | 0.834086                 | 0.405880                 | -0.691325 |
| enn                     | LightGBM            | 0.908583 | 0.997628              | 0.111461              | 0.554545          | 0.002372        | 0.888539        | 0.909510           | 0.840000           | 0.908583          | 0.512163          | 0.333462      | 0.426822        | 0.908583      | 0.554545      | 0.111197         | 0.061927                 | 0.246674                 | 0.825524                 | 0.307520                 | -0.886167 |
| renn                    | Logistic Regression | 0.913300 | 0.982546              | 0.293415              | 0.637981          | 0.017454        | 0.706585        | 0.925640           | 0.652527           | 0.913300          | 0.498744          | 0.536930      | 0.575516        | 0.913300      | 0.637981      | 0.288294         | 0.188958                 | 0.422409                 | 0.834117                 | 0.407019                 | -0.689131 |
| renn                    | LightGBM            | 0.909067 | 0.997573              | 0.116769              | 0.557171          | 0.002427        | 0.883231        | 0.909997           | 0.843114           | 0.909067          | 0.509786          | 0.341300      | 0.431982        | 0.909067      | 0.557171      | 0.116486         | 0.065185                 | 0.252276                 | 0.826402                 | 0.310439                 | -0.880804 |
| nm1                     | Logistic Regression | 0.797967 | 0.804191              | 0.742246              | 0.773218          | 0.195809        | 0.257754        | 0.965434           | 0.297481           | 0.797967          | 0.493583          | 0.772598      | 0.772822        | 0.797967      | 0.773218      | 0.596908         | 0.578420                 | 0.612035                 | 0.636751                 | 0.597867                 | -0.061945 |
| nm1                     | LightGBM            | 0.675917 | 0.652795              | 0.882899              | 0.767847          | 0.347205        | 0.117101        | 0.980355           | 0.221220           | 0.675917          | 0.505903          | 0.759179      | 0.762324        | 0.675917      | 0.767847      | 0.576352         | 0.642663                 | 0.527714                 | 0.456863                 | 0.589589                 | 0.230104  |
| tomek                   | Logistic Regression | 0.913533 | 0.985900              | 0.265716              | 0.625808          | 0.014100        | 0.734284        | 0.923192           | 0.677952           | 0.913533          | 0.498919          | 0.511829      | 0.555742        | 0.913533      | 0.625808      | 0.261969         | 0.167636                 | 0.397713                 | 0.834543                 | 0.391635                 | -0.720184 |
| tomek                   | LightGBM            | 0.907450 | 0.998351              | 0.093714              | 0.546032          | 0.001649        | 0.906286        | 0.907930           | 0.863914           | 0.907450          | 0.510729          | 0.305874      | 0.409303        | 0.907450      | 0.546032      | 0.093559         | 0.051241                 | 0.228076                 | 0.823466                 | 0.298151                 | -0.904637 |
| oss                     | Logistic Regression | 0.913517 | 0.985881              | 0.265716              | 0.625799          | 0.014119        | 0.734284        | 0.923190           | 0.677665           | 0.913517          | 0.498916          | 0.511824      | 0.555736        | 0.913517      | 0.625799      | 0.261964         | 0.167635                 | 0.397701                 | 0.834513                 | 0.391624                 | -0.720166 |
| oss                     | LightGBM            | 0.907667 | 0.998073              | 0.098358              | 0.548215          | 0.001927        | 0.901642        | 0.908335           | 0.850789           | 0.907667          | 0.507399          | 0.313318      | 0.413926        | 0.907667      | 0.548215      | 0.098168         | 0.054007                 | 0.232922                 | 0.823859                 | 0.300540                 | -0.899715 |

| Undersampling Technique | Model               | ROC-AUC  | PR-AUC  |
|-------------------------|---------------------|----------|---------|
| random                  | Logistic Regression | 0.857540 | 0.494507|
| random                  | LightGBM            | 0.863440 | 0.499011|
| random                  | CatBoost            | 0.888497 | 0.576395|
| ncr                     | Logistic Regression | 0.858739 | 0.498557|
| ncr                     | LightGBM            | 0.862781 | 0.508598|
| ncr                     | CatBoost            | 0.893958 | 0.592104|
| enn                     | Logistic Regression | 0.858747 | 0.498822|
| enn                     | LightGBM            | 0.863523 | 0.512163|
| enn                     | CatBoost            | 0.893906 | 0.592826|
| renn                    | Logistic Regression | 0.858724 | 0.498744|
| renn                    | LightGBM            | 0.863945 | 0.509786|
| renn                    | CatBoost            | 0.894112 | 0.591983|
| nm1                     | Logistic Regression | 0.856994 | 0.493583|
| nm1                     | LightGBM            | 0.864912 | 0.505903|
| nm1                     | CatBoost            | 0.877002 | 0.551184|
| tomek                   | Logistic Regression | 0.858826 | 0.498919|
| tomek                   | LightGBM            | 0.864771 | 0.510729|
| tomek                   | CatBoost            | 0.893994 | 0.591643|
| oss                     | Logistic Regression | 0.858822 | 0.498916|
| oss                     | LightGBM            | 0.865372 | 0.507399|
| oss                     | CatBoost            | 0.893818 | 0.592434|

## 8. Notebook 5 - Santander Step 5: Outliers Engineering - 0.8917

The focus of this notebook was on detecting and handling outliers in the dataset to improve model training. Key steps included:

- **Outlier Detection**: Used the IQR method to identify outliers in each feature.
- **Outlier Analysis**: Counted outliers in each target class to understand their distribution.
- **Data Cleaning**: Removed outliers from the majority class (target == 0) while retaining all samples from the minority class (target == 1).
- **Dataset Comparison**: Compared the shapes and target distributions of the original and cleaned datasets to ensure data quality and class balance.
- **Preparation for Training**: Prepared cleaned training data and retained the test data for further model training.

This process aimed to enhance the model's performance by reducing noise and potential bias introduced by outliers.
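
The cleaning step described above can be sketched as follows. This is a minimal illustration in plain Python, not the notebook's code: the 1.5 multiplier, the choice to compute the fences over the whole dataset, and the names `iqr_bounds` and `drop_majority_outliers` are all assumptions made for the example.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_majority_outliers(rows, targets, k=1.5):
    """Drop majority-class rows (target == 0) that have any feature outside the fences.

    Fences are computed per feature over all rows (an assumption).
    Minority-class rows (target == 1) are always kept.
    """
    n_features = len(rows[0])
    bounds = [iqr_bounds([r[j] for r in rows], k) for j in range(n_features)]
    kept_rows, kept_targets = [], []
    for row, target in zip(rows, targets):
        is_outlier = any(not (lo <= v <= hi) for v, (lo, hi) in zip(row, bounds))
        if target == 1 or not is_outlier:
            kept_rows.append(row)
            kept_targets.append(target)
    return kept_rows, kept_targets


# Toy data: one feature, eight majority rows and one minority row.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [100.0], [200.0]]
targets = [0, 0, 0, 0, 0, 0, 0, 0, 1]
clean_rows, clean_targets = drop_majority_outliers(rows, targets)
```

In this toy example both 100.0 and 200.0 fall outside the fences, but only the majority-class row with 100.0 is removed; the minority-class row is retained.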

| Model               | ROC-AUC  | PR-AUC  |
|---------------------|----------|---------|
| Random Forest       | 0.736037 | 0.224972|
| Logistic Regression | 0.858266 | 0.497735|
| XGBoost             | 0.855851 | 0.483367|
| LightGBM            | 0.861808 | 0.499206|
| CatBoost            | 0.891671 | 0.577059|

## 9. Notebook 6 - Santander Step 6: Oversampling - 0.8876

| Technique                        | Description                                                                                                           | Key Parameters                                                                                   |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| **Random OverSampler (random)**  | Randomly duplicates samples from the minority class.                                                                  | `sampling_strategy='auto'`, `random_state=0`                                                     |
| **SMOTE (smote)**                | Synthetic Minority Over-sampling Technique. Generates synthetic samples by interpolating between minority class samples. | `sampling_strategy='auto'`, `random_state=0`, `k_neighbors=5`, `n_jobs=4`                         |
| **ADASYN (adasyn)**              | Adaptive Synthetic Sampling. Generates synthetic samples by considering the density distribution of the minority class. | `sampling_strategy='auto'`, `random_state=0`, `n_neighbors=5`, `n_jobs=4`                         |
| **Borderline-SMOTE1 (border1)**  | Borderline Synthetic Minority Over-sampling Technique. Generates synthetic samples by focusing on the borderline cases. | `sampling_strategy='auto'`, `random_state=0`, `k_neighbors=5`, `m_neighbors=10`, `kind='borderline-1'`, `n_jobs=4` |
| **Borderline-SMOTE2 (border2)**  | Borderline Synthetic Minority Over-sampling Technique, second variant. Like Borderline-SMOTE1 it oversamples borderline minority samples, but the neighbour used for interpolation may belong to any class, not only the minority class. | `sampling_strategy='auto'`, `random_state=0`, `k_neighbors=5`, `m_neighbors=10`, `kind='borderline-2'`, `n_jobs=4` |
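
To make the interpolation idea behind SMOTE concrete, here is a small dependency-free sketch. It is not the imbalanced-learn implementation used in the notebook; the function name `smote_like_samples` and the toy data are hypothetical, and the real library handles sampling ratios, neighbour search, and edge cases differently.

```python
import math
import random


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Four minority points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_samples(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority points, all new points stay inside the region already spanned by the minority class. Random oversampling, by contrast, only repeats existing rows.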

| Oversampling Technique  | Model               | ROC-AUC  | PR-AUC  |
|-------------------------|---------------------|----------|---------|
| random                  | CatBoost            | 0.887633 | 0.572914|
| smote                   | CatBoost            | 0.780618 | 0.339093|
| adasyn                  | CatBoost            | 0.780315 | 0.340419|
| border1                 | CatBoost            | 0.793785 | 0.364244|
| border2                 | CatBoost            | 0.814447 | 0.408506|

## 10. Notebook 7 - Santander Step 7: Oversampling and Cross-Validation - 0.8847

| Oversampling Technique | Model               | ROC-AUC  | ROC-AUC-std | PR-AUC  | PR-AUC-std |
|------------------------|---------------------|----------|-------------|---------|------------|
| random                 | CatBoost            | 0.884703 | 0.001642    | 0.573771| 0.003754   |
| smote                  | CatBoost            | 0.773963 | 0.001513    | 0.329668| 0.006116   |
| adasyn                 | CatBoost            | 0.774085 | 0.000926    | 0.331239| 0.005575   |
| border1                | CatBoost            | 0.787326 | 0.001666    | 0.354895| 0.004578   |
| border2                | CatBoost            | 0.807864 | 0.001257    | 0.398956| 0.003009   |
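
The notebook's cross-validation code is not reproduced here. One common way to combine oversampling with cross-validation, assumed in this sketch, is to resample only the training part of each fold so that every validation fold keeps the original class ratio. The helpers `stratified_folds` and `random_oversample` are hypothetical plain-Python stand-ins for the library routines.

```python
import random


def stratified_folds(y, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs that preserve the class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    for k in range(n_splits):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid


def random_oversample(train_idx, y, seed=0):
    """Duplicate minority indices (with replacement) until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if y[i] == 1]
    neg = [i for i in train_idx if y[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra


# Toy target: 10 positives and 90 negatives.
y = [1] * 10 + [0] * 90
fold_summary = []  # (train positives, train size, valid positives, valid size)
overlaps = []      # number of validation rows that leaked into the training set
for train, valid in stratified_folds(y):
    resampled = random_oversample(train, y)
    fold_summary.append(
        (sum(y[i] for i in resampled), len(resampled), sum(y[i] for i in valid), len(valid))
    )
    overlaps.append(len(set(resampled) & set(valid)))
```

In a full pipeline, a model would be fitted on `resampled` and scored on `valid` in each iteration, and the per-fold scores would then be averaged to give the mean and standard deviation columns of the table above.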

## 11. Notebook 8 - Santander Step 8: Over and Undersampling and Cross-Validation - 0.857

| Technique                        | Description                                                                                                           | Key Parameters                                                                                   |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| **SMOTEENN (smenn)**             | Combines SMOTE for oversampling and Edited Nearest Neighbours for undersampling to balance the dataset.                | `sampling_strategy='auto'`, `random_state=0`, `smote=SMOTE(sampling_strategy='auto', random_state=0, k_neighbors=5)`, `enn=EditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all')`, `n_jobs=4` |
| **SMOTETomek (smtomek)**         | Combines SMOTE for oversampling and Tomek Links for undersampling to balance the dataset.                              | `sampling_strategy='auto'`, `random_state=0`, `smote=SMOTE(sampling_strategy='auto', random_state=0, k_neighbors=5)`, `tomek=TomekLinks(sampling_strategy='all')`, `n_jobs=4` |
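
The cleaning half of SMOTETomek relies on Tomek links: pairs of samples from different classes that are each other's nearest neighbour. The sketch below finds such pairs with a brute-force search. It is an illustration only, with a hypothetical `tomek_links` helper, and it does not reproduce the imbalanced-learn implementation.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, with different labels that are
    mutual nearest neighbours."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )

    nn = [nearest(i) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Toy data: one feature, two classes.
X = [[0.0], [1.0], [1.2], [3.0], [3.1]]
y = [0, 0, 1, 1, 1]
links = tomek_links(X, y)

# Assuming both members of each link are removed:
linked = {i for pair in links for i in pair}
kept = [i for i in range(len(X)) if i not in linked]
```

Here samples 1 and 2 form the only Tomek link. Samples 3 and 4 are also mutual nearest neighbours, but they share a label, so they are left alone.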

| Resampling Technique   | Model               | ROC-AUC  | ROC-AUC-std | PR-AUC  | PR-AUC-std |
|------------------------|---------------------|----------|-------------|---------|------------|
| smenn                  | CatBoost            | 0.773708 | 0.001129    | 0.331826| 0.003130   |
| smtomek                | CatBoost            | 0.773963 | 0.001513    | 0.329668| 0.006116   |

## 12. Notebook 9 - Santander Step 9: Ensemble Methods - 0.858

This notebook compared ensemble methods, some with built-in resampling and some without.

| Technique                        | Description                                                                                                           | Key Parameters                                                                                   |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| **Balanced Random Forest (balancedRF)** | Random Forest with balanced bootstrapping to handle class imbalance.                                                 | `n_estimators=20`, `criterion='gini'`, `max_depth=3`, `sampling_strategy='auto'`, `n_jobs=4`, `random_state=42` |
| **Bagging Classifier (bagging)** | Bagging ensemble of Logistic Regression without resampling.                                                           | `base_estimator=LogisticRegression(random_state=42)`, `n_estimators=20`, `n_jobs=4`, `random_state=42`         |
| **Balanced Bagging Classifier (balancedbagging)** | Bagging ensemble of Logistic Regression with resampling.                                                            | `estimator=LogisticRegression(random_state=42)`, `n_estimators=20`, `max_samples=1.0`, `max_features=1.0`, `bootstrap=True`, `bootstrap_features=False`, `sampling_strategy='auto'`, `n_jobs=4`, `random_state=42` |
| **RUSBoost Classifier (rusboost)** | Boosting with Random Under Sampling to handle class imbalance.                                                        | `estimator=None`, `n_estimators=20`, `learning_rate=1.0`, `sampling_strategy='auto'`, `random_state=42`          |
| **Easy Ensemble Classifier (easyEnsemble)** | Combines bagging, boosting, and under-sampling to handle class imbalance.                                            | `n_estimators=20`, `sampling_strategy='auto'`, `n_jobs=4`, `random_state=42`                                  |
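
The balanced variants in the table share one idea: each base estimator is trained on a resampled subset in which the classes are equally represented. The sketch below shows one simple way to draw such a subset. It is an assumption for illustration, with a hypothetical `balanced_bootstrap` helper; the exact sampling options of the imbalanced-learn classifiers differ.

```python
import random


def balanced_bootstrap(y, seed=0):
    """Draw one bootstrap sample of indices with equal class counts.

    Each class is sampled with replacement to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_min))
    return sample


# Toy target: 95 negatives and 5 positives; one bag per base estimator.
y = [0] * 95 + [1] * 5
bags = [balanced_bootstrap(y, seed=s) for s in range(20)]
positives_per_bag = [sum(y[i] for i in bag) for bag in bags]
```

Each of the 20 bags contains 5 positives and 5 negatives, so every base estimator sees a balanced sample while the ensemble as a whole still covers many different majority-class rows.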

| Model               | ROC-AUC  | PR-AUC  |
|---------------------|----------|---------|
| balancedRF          | 0.732446 | 0.256243|
| bagging             | 0.858818 | 0.498701|
| balancedbagging     | 0.858477 | 0.497667|
| rusboost            | 0.731738 | 0.272281|
| easyEnsemble        | 0.768610 | 0.319986|

## 13. Notebook 9.1 - Santander Step 9.1: Cost-Sensitive Approach - 0.8916

Cost-sensitive learning is a method used to handle imbalanced datasets by assigning different costs to misclassifications of positive and negative samples. This helps the model to pay more attention to the minority class (usually the positive class in imbalanced datasets).

| Model                | Cost-Sensitive Technique                     | Explanation                                                                                               |
|----------------------|----------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| Random Forest        | `class_weight='balanced'`                    | Automatically adjusts weights inversely proportional to class frequencies, giving more importance to the minority class. |
| Logistic Regression  | `class_weight='balanced', solver='liblinear'`| Adjusts weights inversely proportional to class frequencies, so that the minority class is given more importance during training. scikit-learn describes `solver='liblinear'` as a good choice for small datasets. |
| XGBoost              | `scale_pos_weight=ratio`                     | Adjusts the balance of positive and negative weights by scaling the positive class weight according to the specified ratio of negative to positive samples. |
| LightGBM             | `class_weight='balanced'`                    | Adjusts weights inversely proportional to class frequencies, similar to the Random Forest and Logistic Regression. |
| CatBoost             | `auto_class_weights='Balanced', verbose=False`| Automatically calculates balanced class weights to handle imbalanced datasets, giving more focus to the minority class during training. The `verbose=False` parameter is used to suppress output during training. |
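
As a rough illustration of what these settings compute, the sketch below reproduces the `'balanced'` heuristic documented by scikit-learn, `n_samples / (n_classes * class_count)`, and the negative-to-positive ratio typically passed to `scale_pos_weight`. The helper names are hypothetical, and other libraries may use a slightly different formula for their balanced option.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def negative_to_positive_ratio(y):
    """Ratio of negative (0) to positive (1) samples, e.g. for scale_pos_weight."""
    counts = Counter(y)
    return counts[0] / counts[1]


# Toy target: 90 negatives and 10 positives.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
ratio = negative_to_positive_ratio(y)
```

With this toy target the positive class receives a weight of 5.0 and the negative class about 0.56, and the ratio is 9.0, so an error on a positive sample costs roughly nine times as much as an error on a negative one.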

## 14. Summary Table of Highest ROC-AUC Values

| Notebook    | Technique                           | Model              | ROC-AUC  |
|-------------|-------------------------------------|--------------------|----------|
| Notebook1   | Initial Evaluation                  | CatBoost           | 0.893987 |
| Notebook2   | Stratified Sampling                 | CatBoost           | 0.893907 |
| Notebook3   | Scaling                             | CatBoost           | 0.893906 |
| Notebook4   | Undersampling Majority Class        | CatBoost (renn)    | 0.894112 |
| Notebook5   | Outliers Engineering                | CatBoost           | 0.891671 |
| Notebook6   | Oversampling                        | CatBoost (random)  | 0.887633 |
| Notebook7   | Oversampling and Cross-Validation   | CatBoost (random)  | 0.884703 |
| Notebook8   | Over and Undersampling, Cross-Validation | CatBoost (smtomek) | 0.773963 |
| Notebook9   | Ensemble Methods                    | Bagging            | 0.858818 |
| Notebook9.1 | Cost-Sensitive Approach             | CatBoost           | 0.8916   |

## 15. Conclusion

CatBoost already reached a ROC-AUC of 0.893987 in the initial evaluation, demonstrating strong performance even before any advanced sampling technique was applied. Undersampling combined with CatBoost performed at the same level: NCR reached 0.893958 and RENN reached 0.894112.
