# Final Assignment
## CM3015 Machine Learning and Neural Networks

### Credit Card Fraud Detection with a Feedforward MLP

- Student: cy150
- Workflow: Chollet's ML workflow (problem → data → evaluation → prep → baseline → model → tuning → final eval)

---

## Step 1 — Define the problem

Chollet’s workflow keeps the project aligned with real-world goals: define the problem and success metrics, understand the data, choose an evaluation protocol, prepare the data, establish a baseline, develop a first model, tune and improve it, then report final test results and deployment considerations.

### Overview: Credit Card Fraud

Credit card fraud is the unauthorized use of a credit (or debit) card to make purchases, withdraw funds, or create transactions that the legitimate cardholder did not approve.

### Problem Statement

Credit card fraud causes direct and significant financial losses: issuers, merchants, and consumers may absorb losses from unauthorized purchases and chargebacks. Fraud also creates investigative overhead (reviews, disputes), temporarily ties up funds, and damages reputation.

As a result, merchants and banks implement stricter verification and know-your-customer (KYC) checks. These organizations can also lose customers through excessive false declines or chargebacks, and failing to protect customers as regulations require can incur penalties.

---

## Step 2 — Identify and understand the data

### Dataset Overview
This dataset consists of credit card transactions by European cardholders. It covers two days of transactions with 492 frauds out of 284,807 transactions.
The dataset is highly imbalanced, with frauds accounting for about 0.172% of all transactions.
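
The imbalance figure follows directly from the two counts quoted above. A quick check, using only those counts (the variable names are my own):

```python
# Counts quoted in the dataset overview.
n_transactions = 284_807
n_fraud = 492

# Share of transactions that are fraudulent.
fraud_rate = n_fraud / n_transactions

# How many legitimate transactions there are for every fraud case.
legit_per_fraud = (n_transactions - n_fraud) / n_fraud

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

Roughly one transaction in several hundred is fraudulent, which is why accuracy alone says little about model quality here.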


### Nature of the Dataset

It contains only numerical input variables resulting from a PCA transformation. Due to confidentiality, the original features and more background information are not available.

### Dataset Features
Features V1, V2, … V28 are principal components from PCA. The only features not transformed with PCA are `Time` and `Amount`.

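- `Time` contains the seconds elapsed between each transaction and the first transaction in the dataset.
- `Amount` is the transaction amount.
- `Class` is the label: 1 for fraud, 0 for legitimate transactions.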

*Example*

| Time |      V1 |      V2 |     V3 |     V4 |      V5 |
| ---: | ------: | ------: | -----: | -----: | ------: |
|    0 | -1.3598 | -0.0728 | 2.5363 | 1.3782 | -0.3383 |
|    1 |  1.1919 |  0.2662 | 0.1665 | 0.4482 |  0.0600 |
|    1 | -1.3584 | -1.3402 | 1.7732 | 0.3798 | -0.5032 |


### Dataset Licensing

The dataset is licensed under the Database Contents License (DbCL).

According to the license (Open Data Commons), the Licensor grants a worldwide, royalty-free, non-exclusive, perpetual, irrevocable copyright license to do any act that is restricted by copyright over anything within the Contents, whether in the original medium or any other. These rights explicitly include commercial use and do not exclude any field of endeavor.

### Permission

The DbCL grant covers any use of the contents, including commercial use, so using this dataset for this final assignment is permitted.

### Dataset Author

- Machine Learning Group - ULB 

### Dataset Source

After browsing Kaggle, I selected a dataset that is complex and challenging while providing rich features for the model to learn from.

Link to dataset: `https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data`

### Justification for this dataset

This dataset supports a full end-to-end deep learning workflow: clear labels, numeric features, and a real-world class imbalance that demands careful evaluation.

#### Rationale for choosing this dataset

- Real-world fraud detection is a technically challenging binary classification task that fits a feedforward MLP well.
- The data are numeric and tabular with fixed-length features, aligning cleanly with Dense and Dropout layers.
- Severe class imbalance calls for robust evaluation (precision, recall, PR AUC, threshold selection).

#### Limitations of this dataset and mitigation strategy

- Features V1 to V28 are anonymized PCA components, which limits feature-level interpretation; only `Time` and `Amount` keep their original meaning. I will treat all of them as informative numeric inputs and focus on predictive performance rather than interpretability.

---

## Step 1 Cont'd — Define success metrics

### Objective

The primary objective is to detect fraudulent transactions while minimizing false positives. Because fraud is rare, the evaluation focuses on minority-class performance and selecting an operating point that reflects the cost of errors.

## Step 3 — Choose an evaluation protocol

### Holdout Protocol

1. The data will be split into:

    - Training set
    - Validation set
    - Test set

2. Preprocessing decisions are fitted using the training set exclusively.
3. The validation set is only used for model and threshold selection.
4. Final performance is evaluated on the untouched test set.


### Evaluation metrics

This section describes how performance will be measured based on the pre-defined success criteria.

### Primary Metrics

| Metric               | What it measures                                                       | Why it matters for fraud under heavy class imbalance                                                    |
| -------------------- | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| Precision–Recall AUC | Area under the precision–recall curve across decision thresholds       | Strong overall summary metric when fraud is rare, more informative than accuracy                        |
| Recall               | True positive rate, how many actual fraud cases are correctly detected | Directly captures missed fraud risk since low recall means more fraud slips through                     |
| F1 Score             | Harmonic mean of precision and recall                                  | Useful single number when you want a balanced tradeoff between catching fraud and limiting false alarms |
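
To make the three primary metrics concrete, here is a minimal pure-Python sketch. The toy labels and scores are invented for illustration, the 0.5 cutoff is only a placeholder, and PR AUC is approximated by average precision assuming no tied scores; in the project itself a library implementation would normally be used.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_precision(y_true, scores):
    """PR AUC estimate: mean precision at the rank of each true positive."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy data (made up): 3 fraud cases among 10 transactions.
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.05, 0.12, 0.91, 0.22, 0.43, 0.61, 0.02, 0.15, 0.83, 0.31]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
pr_auc = average_precision(y_true, scores)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} pr_auc={pr_auc:.3f}")
```

Note that PR AUC depends only on how the scores rank the transactions, while recall and F1 change with the cutoff.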


### Secondary Metrics

| Secondary metric                              | Description                                                                                                       | Purpose                                                                                        |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| Confusion matrix at a predetermined threshold | Uses TP, FP, TN, FN to make tradeoffs explicit                                                                    | Shows performance at the chosen operating point and clarifies the cost of each type of mistake |
| Error rates                                   | False negative rate and false positive rate to quantify misses and false alarms                                   | Measures miss risk versus false alarm burden in a comparable way                               |
| Calibration check                             | Compare predicted probabilities with observed outcomes using a simple binning table to verify probability quality | Checks whether predicted risk scores align with real observed fraud rates                      |
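
The secondary metrics can be sketched in a few lines of plain Python. The toy data, the 0.5 threshold, and the choice of five equal-width bins are assumptions for illustration, not settings fixed by this report.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are flagged as fraud."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def error_rates(tp, fp, tn, fn):
    """False negative rate (missed fraud) and false positive rate (false alarms)."""
    fnr = fn / (fn + tp) if fn + tp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fnr, fpr


def calibration_table(y_true, scores, n_bins=5):
    """Per non-empty bin: (low, high, count, mean score, observed fraud rate)."""
    bins = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, scores):
        index = min(int(score * n_bins), n_bins - 1)  # score 1.0 goes in the last bin
        bins[index].append((score, label))
    rows = []
    for i, members in enumerate(bins):
        if members:
            mean_score = sum(s for s, _ in members) / len(members)
            observed = sum(l for _, l in members) / len(members)
            rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean_score, observed))
    return rows


# Toy data (made up): 3 fraud cases among 10 transactions.
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.05, 0.12, 0.91, 0.22, 0.43, 0.61, 0.02, 0.15, 0.83, 0.31]

tp, fp, tn, fn = confusion_counts(y_true, scores, threshold=0.5)
fnr, fpr = error_rates(tp, fp, tn, fn)
table = calibration_table(y_true, scores)

print(f"TP={tp} FP={fp} TN={tn} FN={fn} FNR={fnr:.3f} FPR={fpr:.3f}")
for low, high, count, mean_score, observed in table:
    print(f"[{low:.1f}, {high:.1f}) n={count} mean_score={mean_score:.2f} observed={observed:.2f}")
```

In a well-calibrated model, the mean score and the observed fraud rate in each bin should be close; large gaps suggest the scores should not be read as probabilities.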


#### Justification for evaluation metrics

Fraud is a rare event, so a single metric can be misleading: a model may look strong on one metric while failing in practice. PR AUC, the primary model-selection metric, provides a consistent rule for model comparison, while recall, F1, and the secondary metrics provide the context needed to interpret false-positive and false-negative tradeoffs.

### Implications of the evaluation metrics

1. Precision reflects workload and customer friction.
2. Recall reflects loss prevention: missed fraud often translates into direct financial losses.
3. PR AUC reflects ranking quality under rare fraud across thresholds.
4. The confusion matrix is reported at the operating threshold chosen on the validation set, making the false-positive and false-negative counts explicit.


## Step 4 — Prepare the data

### Data preparation plan

To prepare the data and avoid leakage:

1. Use a time-aware split: earlier transactions for training, later transactions for validation/test.
2. Fit all preprocessing steps (scaling, imputation if needed) on the training set only.
3. Apply the same fitted transforms to validation and test sets.
4. Preserve the class imbalance during splitting to reflect real deployment.
5. Track feature distributions and label rate over time to identify drift.
6. Use a small threshold sweep on the validation set for later operating-point selection.
7. Check probability calibration with simple binning as a sanity check on model outputs.
8. Re-train periodically as base rates drift in production.
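
The first three steps of the plan can be sketched as follows. This is a minimal illustration on made-up rows; the 60/20/20 split and the helper names (`time_aware_split`, `fit_scaler`, `apply_scaler`) are my own assumptions rather than settings fixed by this report.

```python
from statistics import mean, pstdev


def time_aware_split(rows, train_pct=60, val_pct=20):
    """Sort by Time and cut into train/validation/test without shuffling."""
    ordered = sorted(rows, key=lambda row: row["Time"])
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def fit_scaler(train_rows, column):
    """Learn the mean and standard deviation from the training rows only."""
    values = [row[column] for row in train_rows]
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant column
    return mean(values), sigma


def apply_scaler(rows, column, mu, sigma):
    """Apply already-fitted statistics; never refit on validation or test."""
    return [{**row, column: (row[column] - mu) / sigma} for row in rows]


# Toy rows standing in for real transactions (values are made up).
times = [5, 0, 9, 2, 7, 1, 8, 3, 6, 4]
amounts = [20, 5, 300, 12, 45, 8, 150, 60, 33, 10]
rows = [{"Time": t, "Amount": float(a), "Class": 0} for t, a in zip(times, amounts)]

train, val, test = time_aware_split(rows)
mu, sigma = fit_scaler(train, "Amount")
train_s = apply_scaler(train, "Amount", mu, sigma)
val_s = apply_scaler(val, "Amount", mu, sigma)
test_s = apply_scaler(test, "Amount", mu, sigma)

print(len(train_s), len(val_s), len(test_s))
```

Because the statistics come from the training rows alone, the validation and test values are not guaranteed to have zero mean or unit variance, which is expected and is what prevents leakage.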

## Step 4 (continued) — Data quality checks and class imbalance

- Check for missing values, extreme outliers, and distribution shifts between splits.
- Standardize numeric features using training-set statistics only.
- Consider class-weighting or resampling strategies as part of the training plan.
- Keep an audit trail of preprocessing decisions to support reproducibility.
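
One common class-weighting option is inverse-frequency weights, computed as `n / (k * n_c)` for `n` samples, `k` classes, and `n_c` samples in class `c` (the heuristic scikit-learn calls "balanced"). The sketch below uses the full-dataset counts purely for illustration; in practice the weights would be computed from the training split only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels built from the dataset-level counts (284,315 legitimate, 492 fraud).
labels = [0] * 284_315 + [1] * 492
weights = balanced_class_weights(labels)

print(f"weight for class 0: {weights[0]:.4f}")
print(f"weight for class 1: {weights[1]:.2f}")
```

With these weights each class contributes the same total weight to the loss, so errors on fraud cases are no longer drowned out by the majority class.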

---

## Step 5 — Establish a baseline and pick a starting model

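A trivial baseline is to always predict non-fraud. It reaches about 99.8% accuracy yet has zero recall, which shows why accuracy is uninformative here; for PR AUC, the no-skill reference is the fraud prevalence (about 0.0017). Any useful model must clearly beat these floors. After establishing the baseline, I start with a small-to-medium feedforward MLP: fully connected Dense layers with Dropout and a single sigmoid output for binary classification.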
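
The always-predict-non-fraud baseline can be scored directly from the class counts in the dataset overview. This sketch uses the full-dataset counts for illustration; the reported baseline would be computed on the held-out splits.

```python
# Dataset-level counts: 492 fraud cases out of 284,807 transactions.
n_fraud = 492
n_legit = 284_807 - n_fraud

y_true = [0] * n_legit + [1] * n_fraud
y_pred = [0] * len(y_true)  # baseline: never flag anything

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / len(y_true)
recall = true_positives / n_fraud

# Precision obtained by flagging every transaction instead: equals the fraud prevalence.
flag_all_precision = n_fraud / len(y_true)

print(f"accuracy={accuracy:.4f} recall={recall:.1f} flag-all precision={flag_all_precision:.4f}")
```

The near-perfect accuracy alongside zero recall is the reason accuracy is not used as a selection metric in this project.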

#### Model Size
- Small to medium feedforward multilayer perceptron

## Step 6 — Develop the model

### Overview of feedforward neural networks

A feedforward network passes inputs through a stack of Dense layers without cycles. For tabular data, this is an effective first-choice architecture.

### Overview of multilayer perceptrons

A multilayer perceptron (MLP) stacks Dense layers with nonlinear activations (e.g., ReLU) to learn complex decision boundaries from tabular numeric features.
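
To show what stacking Dense layers, ReLU, Dropout, and a sigmoid output amounts to, here is a dependency-free forward pass. It is a sketch for intuition only: the layer sizes (16 and 8 units), the 0.3 dropout rate, and the random weights are assumptions, and the actual model would be built and trained with a deep learning library.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def dropout(values, rate, rng, training):
    """Inverted dropout: zero units with probability `rate` in training, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def init_layer(n_in, n_out, rng):
    """Random weights in a Glorot-uniform range, zero biases."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers, rng, training=False, dropout_rate=0.3):
    """Hidden layers use ReLU then dropout; the output layer is a single sigmoid unit."""
    hidden = x
    for weights, biases in layers[:-1]:
        hidden = dropout(relu(dense(hidden, weights, biases)), dropout_rate, rng, training)
    out_weights, out_biases = layers[-1]
    logit = dense(hidden, out_weights, out_biases)[0]
    return sigmoid(logit)


rng = random.Random(0)
n_features = 30  # Time, V1..V28, Amount
layers = [init_layer(n_features, 16, rng), init_layer(16, 8, rng), init_layer(8, 1, rng)]

x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]  # one standardized toy transaction
score = forward(x, layers, rng)  # inference mode: dropout is disabled
print(f"fraud score: {score:.4f}")
```

Dropout is active only when `training=True`; at inference the same input always yields the same score.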

### Justification: Why a feedforward MLP

- Works well for fixed-length tabular numeric inputs.
- Efficient to train and tune compared to more complex architectures.
- Provides a strong baseline before exploring more specialized models.

## Step 7 — Model improvement and threshold tuning

- Instead of using a standard 0.5 cutoff, choose a decision threshold on the validation set (maximize recall subject to a minimum precision).
- Use PR AUC as the primary model selection metric, which is more appropriate than accuracy for rare-event detection.
- Run each configuration multiple times and report mean and standard deviation for stability.
- Use a time-aware split to mirror real deployment conditions.
- Apply binning-based calibration checks to verify probability quality.
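
The threshold rule in the first bullet (maximize recall subject to a minimum precision) can be sketched as a simple sweep over candidate cutoffs. The validation labels and scores below are made up, and the 0.7 precision floor is a placeholder rather than a value chosen in this report.

```python
def select_threshold(y_true, scores, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if none qualifies."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return None
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / n_pos
        if precision < min_precision:
            continue
        # Prefer higher recall; break ties in favour of higher precision.
        if best is None or (recall, precision) > (best[2], best[1]):
            best = (threshold, precision, recall)
    return best


# Toy validation data (made up): 3 fraud cases among 10 transactions.
y_val = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
val_scores = [0.05, 0.12, 0.91, 0.22, 0.43, 0.61, 0.02, 0.15, 0.83, 0.31]

chosen = select_threshold(y_val, val_scores, min_precision=0.7)
strict = select_threshold(y_val, val_scores, min_precision=1.0)
print("min precision 0.7:", chosen)
print("min precision 1.0:", strict)
```

Raising the precision floor trades recall for fewer false alarms, so the floor should come from the review capacity and customer-friction tolerance discussed earlier. The selected threshold is then frozen before the test set is evaluated.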


## Step 8 — Final evaluation and deployment considerations

### Current real-life applications

Fraud detection systems are used by card issuers and payment networks to rank transactions by risk, trigger manual review, or apply step-up verification. In production, models are monitored for drift, recalibrated, and re-trained as fraud patterns evolve.


## Glossary of terms


| Term                  | Definition                                                                                                             |
| --------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| Binary classification | A prediction task with two classes, here fraud versus non-fraud                                                        |
| Class imbalance       | A dataset property where one class is much rarer than the other                                                        |
| Positive class        | The class of interest, here fraud transactions labeled 1                                                               |
| Negative class        | The other class, here legitimate transactions labeled 0                                                                |
| Feature               | An input variable used for prediction, such as V1 or Amount                                                            |
| Label                 | The target variable the model learns to predict, here Class                                                            |
| PCA                   | Principal Component Analysis, a transformation that creates new variables as linear combinations of original variables |
| Principal component   | One PCA-derived feature, here V1 to V28                                                                                |
| Train set             | Data used to fit model parameters                                                                                      |
| Validation set        | Data used to select hyperparameters and decision threshold                                                             |
| Test set              | Held-out data used once for final performance reporting                                                                |
| Data leakage          | When information from validation or test data influences training or preprocessing decisions                           |
| Standardization       | Scaling features to have zero mean and unit variance using training statistics                                         |
| Normalization         | Rescaling features to a fixed range, often 0 to 1, depending on the method                                             |
| Model                 | A function that maps input features to a predicted output                                                              |
| Neural network        | A model composed of layers of learned transformations, here Dense and Dropout layers                                   |
| Dense layer           | A fully connected layer that applies a linear transformation followed by an activation function                        |
| Dropout               | A regularization method that randomly disables a fraction of units during training to reduce overfitting               |
| Activation function   | A nonlinear function applied within a layer, such as ReLU or sigmoid                                                   |
| Sigmoid               | An activation that maps a real number to a value between 0 and 1, used for binary outputs                              |
| Logits                | The raw model output before applying sigmoid                                                                           |
| Probability score     | The model output after sigmoid, interpreted as a probability-like score                                                |
| Decision threshold    | The cutoff used to convert probability scores into class predictions                                                   |
| Confusion matrix      | A table counting true positives, false positives, true negatives, and false negatives                                  |
| True positive (TP)    | Fraud correctly predicted as fraud                                                                                     |
| False positive (FP)   | Legitimate predicted as fraud                                                                                          |
| True negative (TN)    | Legitimate correctly predicted as legitimate                                                                           |
| False negative (FN)   | Fraud predicted as legitimate                                                                                          |
| Precision             | TP divided by TP plus FP, the fraction of predicted fraud that is truly fraud                                          |
| Recall                | TP divided by TP plus FN, the fraction of actual fraud that is detected                                                |
| F1 score              | Harmonic mean of precision and recall                                                                                  |
| ROC curve             | Curve of true positive rate versus false positive rate over thresholds                                                 |
| AUC                   | Area under a curve, a threshold-independent performance summary                                                        |
| PR curve              | Precision versus recall over thresholds                                                                                |
| PR AUC                | Area under the precision–recall curve, often preferred for rare-event detection                                        |
| Overfitting           | When a model performs well on training but poorly on new data                                                          |
| Regularization        | Methods that reduce overfitting, such as dropout or weight penalties                                                   |
| Hyperparameter        | A setting chosen outside training, such as number of layers, dropout rate, learning rate                               |
| Learning rate         | Step size used by the optimizer when updating model weights                                                            |
| Optimizer             | Algorithm that updates model weights to minimize the loss, such as Adam                                                |
| Loss function         | The quantity the model minimizes during training, such as binary cross entropy                                         |
| Early stopping        | Stopping training when validation performance stops improving                                                          |
| Calibration           | How well predicted probabilities match observed event rates                                                            |
| Concept drift         | When the data-generating process changes over time, causing performance degradation                                    |
| Baseline model        | A simple reference model used for comparison, such as always predicting non-fraud                                      |


---
### Bibliography & Citations


Carcillo, F., Le Borgne, Y.-A., Caelen, O. and Bontempi, G. (2018) ‘Streaming active learning strategies for real life credit card fraud detection: assessment and visualization’, International Journal of Data Science and Analytics, 5(4), pp. 285–300.

Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y. and Bontempi, G. (2018) ‘Scarff: a scalable framework for streaming credit card fraud detection with Spark’, Information Fusion, 41, pp. 182–194.

Carcillo, F., Le Borgne, Y.-A., Caelen, O., Oblé, F. and Bontempi, G. (2019) ‘Combining unsupervised and supervised learning in credit card fraud detection’, Information Sciences.

Dal Pozzolo, A. (n.d.) Adaptive machine learning for credit card fraud detection. PhD thesis. Université libre de Bruxelles.

Dal Pozzolo, A., Caelen, O., Johnson, R. A. and Bontempi, G. (2015) ‘Calibrating probability with undersampling for unbalanced classification’, in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining. IEEE.

Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S. and Bontempi, G. (2014) ‘Learned lessons in credit card fraud detection from a practitioner perspective’, Expert Systems with Applications, 41(10), pp. 4915–4928.

Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C. and Bontempi, G. (2018) ‘Credit card fraud detection: a realistic modeling and a novel learning strategy’, IEEE Transactions on Neural Networks and Learning Systems, 29(8), pp. 3784–3797.

Lebichot, B., Le Borgne, Y.-A., He, L., Oblé, F. and Bontempi, G. (2019) ‘Deep learning domain adaptation techniques for credit cards fraud detection’, in INNSBDDL 2019 Recent Advances in Big Data and Deep Learning, pp. 78–88.

Lebichot, B., Paldino, G., Siblini, W., He, L., Oblé, F. and Bontempi, G. (n.d.) ‘Incremental learning strategies for credit cards fraud detection’, International Journal of Data Science and Analytics.

Le Borgne, Y.-A. and Bontempi, G. (n.d.) Reproducible machine learning for credit card fraud detection: practical handbook.

