## Experiment Overview

This series of experiments aims to:

1. **Understand the Data:** Analyze the dataset to identify key features and patterns related to fraud.
2. **Find Insights:** Discover data patterns and anomalies that indicate fraudulent transactions.
3. **Model Effectiveness:** Evaluate various machine learning models (Logistic Regression, Random Forest, CatBoost, Decision Tree) to assess the impact of feature selection and class balancing on fraud detection performance.

The goal is to combine data understanding with effective modeling to improve fraud detection capabilities.

The notebook `Load_&_Preprocess.ipynb` handles the following:
- **Data Loading:** Reads and imports the data.
- **Data Transformation:** Applies appropriate encoding and feature selection to prepare the data for model training.
- **Flagging Transactions:** Implements logic to flag reversal and multi-swipe transactions, which are crucial for identifying patterns associated with fraud.
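
The flagging step could look roughly like the sketch below. It is not the notebook's actual code: the column names (`accountNumber`, `merchantName`, `transactionAmount`, `transactionDateTime`, `transactionType`), the `REVERSAL` label, and the five-minute window are all assumptions made for illustration.

```python
import pandas as pd


def flag_transactions(df: pd.DataFrame, window: str = "5min") -> pd.DataFrame:
    """Add boolean `is_reversal` and `is_multi_swipe` columns (hypothetical names)."""
    out = df.copy()

    # Assumption: reversals are labelled explicitly in the transaction type column.
    out["is_reversal"] = out["transactionType"].eq("REVERSAL")

    # Assumption: a multi-swipe is a repeated charge with the same account,
    # merchant and amount that follows the previous charge within `window`.
    key = ["accountNumber", "merchantName", "transactionAmount"]
    purchases = out[~out["is_reversal"]].sort_values("transactionDateTime")
    gap = purchases.groupby(key)["transactionDateTime"].diff()
    is_repeat = gap.le(pd.Timedelta(window))  # first charge in a group has no gap (NaT) and compares False

    out["is_multi_swipe"] = False
    out.loc[is_repeat[is_repeat].index, "is_multi_swipe"] = True
    return out


sample = pd.DataFrame({
    "accountNumber": [1, 1, 1, 2],
    "merchantName": ["A", "A", "A", "B"],
    "transactionAmount": [20.0, 20.0, 20.0, 5.0],
    "transactionDateTime": pd.to_datetime([
        "2024-01-01 10:00:00",
        "2024-01-01 10:01:00",
        "2024-01-01 12:00:00",
        "2024-01-01 09:00:00",
    ]),
    "transactionType": ["PURCHASE", "PURCHASE", "REVERSAL", "PURCHASE"],
})
flagged = flag_transactions(sample)
print(flagged[["is_reversal", "is_multi_swipe"]])
```

In this toy data only the second row is flagged as a multi-swipe, and only the third as a reversal. The sketch assumes the DataFrame has a unique index.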

------

### Experimental Setup

Each model was evaluated under four different conditions, with corresponding Jupyter notebooks documenting the experiments:

1. **Feature Selection and Class Balancing Enabled** (`Feature Selection = TRUE`, `Class Balancing = TRUE`):
   - **File**: `Modelling_Selected_Features_Balanced.ipynb`
   - **Description**: This notebook contains the results for all four models using a set of features selected through feature engineering. The class imbalance was addressed using a `RandomUnderSampler`. This setup represents an optimized approach where irrelevant features are removed, and the dataset is balanced to improve model performance.

2. **Feature Selection Enabled, Class Balancing Disabled** (`Feature Selection = TRUE`, `Class Balancing = FALSE`):
   - **File**: `Modelling_Selected_Features_Unbalanced.ipynb`
   - **Description**: This notebook includes results for all four models with selected features but without any class balancing. The goal here was to observe the effect of feature selection on model performance when the class imbalance is not addressed.

3. **Feature Selection Disabled, Class Balancing Enabled** (`Feature Selection = FALSE`, `Class Balancing = TRUE`):
   - **File**: `Modelling_All_Features_Balanced.ipynb`
   - **Description**: This notebook presents results for all four models using all available features (i.e., no feature selection was performed). Class balancing was applied using a `RandomUnderSampler` to see how the models perform when no feature reduction is done but the data is balanced.

4. **Feature Selection and Class Balancing Disabled** (`Feature Selection = FALSE`, `Class Balancing = FALSE`):
   - **File**: `Modelling_All_Features_Unbalanced.ipynb`
   - **Description**: This notebook contains results for all four models using the full set of features without any class balancing. It serves as a baseline to understand how the models perform with all features and without addressing class imbalance.
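
The four conditions are the cross product of two switches. The sketch below enumerates them and illustrates what random undersampling does. The notebooks use `RandomUnderSampler`; the `undersample` helper here is a plain-pandas stand-in written for illustration, and the `isFraud` column name is an assumption.

```python
from itertools import product

import pandas as pd


def notebook_name(feature_selection: bool, class_balancing: bool) -> str:
    """Map a condition to the notebook file name listed above."""
    features = "Selected" if feature_selection else "All"
    balance = "Balanced" if class_balancing else "Unbalanced"
    return f"Modelling_{features}_Features_{balance}.ipynb"


def undersample(df: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    n_min = df[target].value_counts().min()
    return df.groupby(target).sample(n=n_min, random_state=seed)


# Enumerate the four experimental conditions in the order described above.
conditions = [
    {"feature_selection": fs, "class_balancing": cb, "notebook": notebook_name(fs, cb)}
    for fs, cb in product([True, False], repeat=2)
]
for condition in conditions:
    print(condition)

# Toy data: 2 fraud rows and 8 non-fraud rows (hypothetical `isFraud` column).
toy = pd.DataFrame({"amount": range(10), "isFraud": [1, 1] + [0] * 8})
balanced = undersample(toy, target="isFraud")
print(balanced["isFraud"].value_counts())
```

Undersampling is typically applied to the training split only, so that the test set keeps the original class distribution.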


*Note: The results of all four experiments are stored in `data/result_df.csv`.*

------

## Execution Instructions

Install the required libraries with `pip install -r requirements.txt`.

1. **Run the Data Preprocessing Notebook:**
   - Execute the `Load_&_Preprocess.ipynb` notebook.
   - After running this notebook, two CSV files will be saved in the `data` folder:
     - `transformed_data_all_features.csv`
     - `transformed_data_selected_features.csv`
   - These CSV files can be used in the modeling process wherever required.

2. **Run the Modeling Notebooks:**
   - Start with `3.1 Modelling_Selected_Features_Balanced.ipynb`, which creates the `result_df` DataFrame to store all results and updates it accordingly.
   - Next, run the following notebooks in sequence:
     - `3.2 Modelling_Selected_Features_Unbalanced.ipynb`
     - `3.3 Modelling_All_Features_Balanced.ipynb`
     - `3.4 Modelling_All_Features_Unbalanced.ipynb`
   - Running all these notebooks will generate results for all combinations of feature selection and class balancing.

3. **Results Storage:**
   - At the end of the process, all the updated results are stored in the `data/result_df.csv` file.
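
One way the notebooks could share a single results file is sketched below. The `append_result` helper is hypothetical and the notebooks' actual logic may differ; it creates the CSV on first use and appends a row on later calls.

```python
from pathlib import Path

import pandas as pd


def append_result(row: dict, path: str = "data/result_df.csv") -> pd.DataFrame:
    """Append one experiment result to the shared CSV, creating it if needed."""
    file = Path(path)
    new = pd.DataFrame([row])
    if file.exists():
        result_df = pd.concat([pd.read_csv(file), new], ignore_index=True)
    else:
        file.parent.mkdir(parents=True, exist_ok=True)
        result_df = new
    # index=False keeps the row index out of the file, so reading it back
    # does not produce an extra "Unnamed: 0" column.
    result_df.to_csv(file, index=False)
    return result_df


# Example call (commented out so the sketch does not touch the real file):
# append_result({"Model": "CatBoost", "Feature Selection": True,
#                "Class Balancing": True, "AUC": 0.74})
```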



```python
import pandas as pd

result_df = pd.read_csv("data/result_df.csv")
result_df
```

| Row | Model | Feature Selection | Class Balancing | AUC | Train_Precision | Train_Recall | Test_Precision | Test_Recall |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | True | True | 0.7 | 0.65 | 0.66 | 0.64 | 0.66 |
| 1 | Random Forest | True | True | 0.73 | 0.7 | 0.64 | 0.69 | 0.61 |
| 2 | CatBoost | True | True | 0.74 | 0.71 | 0.7 | 0.67 | 0.67 |
| 3 | Decision Tree | True | True | 0.7 | 0.72 | 0.73 | 0.64 | 0.66 |
| 4 | Logistic Regression | True | False | 0.7 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Random Forest | True | False | 0.73 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | CatBoost | True | False | 0.74 | 0.8 | 0.0 | 0.0 | 0.0 |
| 7 | Decision Tree | True | False | 0.72 | 0.96 | 0.02 | 0.06 | 0.0 |
| 8 | Logistic Regression | False | True | 0.69 | 0.65 | 0.65 | 0.63 | 0.64 |
| 9 | Random Forest | False | True | 0.74 | 0.68 | 0.7 | 0.66 | 0.69 |


## Top 3 Performing Models:

### 1st Place: CatBoost without Feature Selection but with Class Balancing (Row 10)
- **AUC:** ~0.81
- **Precision:** ~0.72
- **Recall:** ~0.73
- **Summary**: This model achieves the highest AUC, precision, and recall among all experiments, making it the best overall performer. It effectively balances precision and recall, which is crucial for minimizing false positives and false negatives in fraud detection.

### 2nd Place: CatBoost with Feature Selection and Class Balancing (Row 2)
- **AUC:** ~0.74
- **Precision:** ~0.67
- **Recall:** ~0.67
- **Summary**: This model ranks second due to its strong balance between AUC, precision, and recall. The feature selection combined with class balancing contributes to its effectiveness in distinguishing fraudulent transactions.

### 3rd Place: Random Forest without Feature Selection but with Class Balancing (Row 9)
- **AUC:** ~0.74
- **Precision:** ~0.66
- **Recall:** ~0.69
- **Summary**: This model also performs well, closely following the CatBoost models. It demonstrates that even without feature selection, class balancing can significantly enhance the model's ability to detect fraud.

> The top-performing models score similarly on the training and test sets, which suggests that they generalize rather than overfit. This consistency matters for real-world applications like fraud detection, where a model must perform reliably on unseen transactions.
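
The train/test consistency can be checked directly from the results. The sketch below uses three rows copied from the results table above (rows 2, 7 and 9) and computes the gap between training and test scores; the `Precision_Gap` and `Recall_Gap` column names are introduced here for illustration.

```python
from io import StringIO

import pandas as pd

# Rows 2, 7 and 9 of the results table, copied verbatim.
csv_text = """Model,Feature Selection,Class Balancing,AUC,Train_Precision,Train_Recall,Test_Precision,Test_Recall
CatBoost,True,True,0.74,0.71,0.7,0.67,0.67
Decision Tree,True,False,0.72,0.96,0.02,0.06,0.0
Random Forest,False,True,0.74,0.68,0.7,0.66,0.69
"""
rows = pd.read_csv(StringIO(csv_text))

# A small gap means the model scores about as well on unseen data as on training data.
rows["Precision_Gap"] = (rows["Train_Precision"] - rows["Test_Precision"]).round(2)
rows["Recall_Gap"] = (rows["Train_Recall"] - rows["Test_Recall"]).round(2)
print(rows[["Model", "Class Balancing", "Precision_Gap", "Recall_Gap"]])
```

For the two balanced models the gaps stay within a few hundredths, while the unbalanced Decision Tree loses most of its training precision on the test set.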

## Important Metrics for Predicting Fraud:
- **AUC (Area Under the Curve):**
  - **Importance**: Measures the model's ability to differentiate between fraudulent and non-fraudulent transactions. A higher AUC is essential for effective fraud detection.
  - **Top Models:**
    - **1st:** CatBoost without feature selection but with class balancing (~0.80)
    - **2nd:** CatBoost with feature selection and class balancing (~0.74)
    - **3rd:** Random Forest without feature selection but with class balancing (~0.74)

- **Precision:**
  - **Importance**: Indicates the accuracy of the model in predicting fraud cases, minimizing false positives.
  - **Top Models:**
    - **1st:** CatBoost without feature selection but with class balancing (~0.72)
    - **2nd:** CatBoost with feature selection and class balancing (~0.67)
    - **3rd:** Random Forest without feature selection but with class balancing (~0.66)

- **Recall:**
  - **Importance**: Measures how well the model identifies actual fraud cases, which is critical for reducing false negatives.
  - **Top Models:**
    - **1st:** CatBoost without feature selection but with class balancing (~0.73)
    - **2nd:** Random Forest without feature selection but with class balancing (~0.69)
    - **3rd:** CatBoost with feature selection and class balancing (~0.67)
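
As a reminder of how these three metrics are defined, here is a minimal plain-Python sketch. The counts and scores are made up for illustration and are not taken from the experiments; the notebooks presumably rely on library implementations instead.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    Returns 0.0 when a denominator is zero, e.g. when the model never predicts fraud.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def auc(fraud_scores: list[float], legit_scores: list[float]) -> float:
    """AUC as the probability that a random fraud case outscores a random
    legitimate case, counting ties as half a win."""
    wins = sum(
        (f > l) + 0.5 * (f == l)
        for f in fraud_scores
        for l in legit_scores
    )
    return wins / (len(fraud_scores) * len(legit_scores))


# Hypothetical confusion-matrix counts.
print(precision_recall(tp=67, fp=33, fn=33))  # (0.67, 0.67)
print(precision_recall(tp=0, fp=0, fn=100))   # (0.0, 0.0)

# Hypothetical model scores: 5 of the 6 fraud/legit pairs are ordered correctly.
print(auc([0.9, 0.4], [0.5, 0.3, 0.1]))
```

The second call shows one way a score of 0.0 can arise: a model that never flags a transaction as fraud has no true positives, so both precision and recall collapse to zero even if its ranking of transactions (and therefore its AUC) is reasonable.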

> **Note:** 
> In fraud detection, recall is particularly important. It is similar to cancer prediction, where failing to detect cancer (or fraud in this case) can have severe consequences. If a non-fraudulent transaction is flagged as fraud, it only incurs the cost of additional verification or inquiry. However, if a fraudulent transaction is not detected, it can result in significant financial losses, potentially amounting to millions of dollars.
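
The asymmetry described in the note can be made concrete with a small cost calculation. All numbers below are hypothetical: a review cost of 5 per false positive, a loss of 500 per missed fraud, and two made-up operating points.

```python
def total_cost(false_positives: int, false_negatives: int,
               review_cost: float = 5.0, fraud_loss: float = 500.0) -> float:
    """Cost of errors: cheap manual reviews plus expensive missed fraud."""
    return false_positives * review_cost + false_negatives * fraud_loss


# Model A flags more aggressively: more false alarms, fewer missed frauds.
cost_a = total_cost(false_positives=300, false_negatives=40)
# Model B is more conservative: fewer false alarms, more missed frauds.
cost_b = total_cost(false_positives=100, false_negatives=60)

print(cost_a)  # 21500.0
print(cost_b)  # 30500.0
```

Under these assumed costs, the higher-recall model A is cheaper overall despite tripling the number of false alarms.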