# Probabilistic Machine Learning - Project Report

## Fraud detection

- **Course:** Probabilistic Machine Learning (SoSe 2025)
- **Lecturer:** Alvaro Diaz-Ruelas
- **Student Names:** Khalid Sabih, Abdellah Charki
- **GitHub Usernames:**  @khalidsabih / @abdellahcharki
- **Date:**  05/07/2025
- **PROJECT-ID:** 26-1CASKXX  

---

# 1. Introduction

## 1.1 Motivation
Fraud detection has become an increasingly critical task in financial systems and digital transactions, where even a small number of fraudulent activities can result in significant financial losses and erode trust in institutions. The complexity of detecting fraud arises from its rarity and the constantly evolving tactics used by fraudsters to conceal illicit activities within massive volumes of legitimate transactions. As organizations handle millions of financial operations daily, distinguishing fraudulent patterns from normal behavior is both technically challenging and essential for operational security and customer trust.

## 1.2 Dataset
The dataset used in this project, named Fraud.csv, consists of approximately 6 million synthetic financial transactions, created to reflect realistic banking operations while protecting privacy. Each transaction record contains several attributes that describe its details, including both numerical and categorical features. The primary challenge posed by this dataset is the severe class imbalance, as fraudulent transactions account for fewer than 0.2% of all records, making fraud detection a complex and highly imbalanced classification problem.

The dataset includes the following columns:

- step: The hour of the simulation.
- type: The type of transaction, such as PAYMENT, TRANSFER, CASH_OUT, DEBIT, or CASH_IN.
- amount: The amount of money involved in the transaction.
- nameOrig: An anonymized identifier for the originator’s account.
- oldbalanceOrg: The account balance of the originator before the transaction.
- newbalanceOrig: The account balance of the originator after the transaction.
- nameDest: An anonymized identifier for the recipient’s account.
- oldbalanceDest: The account balance of the recipient before the transaction.
- newbalanceDest: The account balance of the recipient after the transaction.
- isFraud: A binary label indicating whether the transaction was fraudulent (1) or not (0).
- isFlaggedFraud: A binary flag indicating whether the transaction was flagged as suspicious by internal business rules.

## 1.3 Hypothesis
- Fraudulent transactions are more likely to occur in specific transaction types, particularly TRANSFER and CASH_OUT, compared to other types such as PAYMENT or CASH_IN.
- Fraudulent transactions tend to involve higher transaction amounts than legitimate transactions.
- Fraudulent transactions often result in the destination account balance dropping to zero, suggesting immediate withdrawal or transfer of illicit funds.

# 2. Data Loading and Exploration
## 2.1 Data Loading
We use a synthetic fraud detection dataset for training and evaluating fraud detection models.


[Fraud Detection Dataset – Kaggle](https://www.kaggle.com/datasets/ashishkumarjayswal/froud-detection-dataset)


Our analysis begins with loading the dataset `Fraud.csv`, which is stored in the `data/` folder of our project repository. We use the pandas library in Python to handle the file, as it efficiently manages large datasets and provides useful tools for data exploration.
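The loading step can be sketched as follows. This is a minimal illustration rather than the exact project code: the helper name `load_transactions` and the explicit dtypes are assumptions chosen to reduce memory use on roughly 6 million rows, and the two in-memory rows are copied from the snapshot table so that the sketch runs without the file.

```python
import io

import pandas as pd

# Assumed dtypes (not taken from the project code): compact types for the
# categorical and binary columns; all other columns use pandas defaults.
DTYPES = {"type": "category", "isFraud": "int8", "isFlaggedFraud": "int8"}


def load_transactions(source):
    """Read the transaction CSV from a file path or a file-like object."""
    return pd.read_csv(source, dtype=DTYPES)


# Two rows from the snapshot table, held in memory for demonstration.
sample_csv = io.StringIO(
    "step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud\n"
    "1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0\n"
    "1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0\n"
)
df = load_transactions(sample_csv)
print(df.shape)
print(df.dtypes)
```

On the real data, the same helper would be called as `load_transactions("data/Fraud.csv")`.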

**Snapshot of Original Dataset (Before Preprocessing)**


| step | type      | amount   | nameOrig       | oldbalanceOrg | newbalanceOrig | nameDest     | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|------|-----------|----------|----------------|---------------|----------------|--------------|----------------|----------------|---------|----------------|
| 1    | PAYMENT   | 9839.64  | C1231006815    | 170136.0      | 160296.36      | M1979787155  | 0.0            | 0.0            | 0       | 0              |
| 1    | PAYMENT   | 1864.28  | C1666544295    | 21249.0       | 19384.72       | M2044282225  | 0.0            | 0.0            | 0       | 0              |
| 1    | TRANSFER  | 181.00   | C1305486145    | 181.0         | 0.00           | C553264065   | 0.0            | 0.0            | 1       | 0              |
| 1    | CASH_OUT  | 181.00   | C840083671     | 181.0         | 0.00           | C38997010    | 21182.0        | 0.0            | 1       | 0              |
| 1    | PAYMENT   | 11668.14 | C2048537720    | 41554.0       | 29885.86       | M1230701703  | 0.0            | 0.0            | 0       | 0              |


## 2.2 Data Exploration
After successfully loading the dataset, we performed a detailed exploratory data analysis to better understand its structure and the nature of fraudulent transactions.


### 2.2.1 Class Distribution
A critical first step was to examine the distribution of our target variable, `isFraud`. As shown in Figure 1 (Class Distribution), fraudulent transactions are extremely rare, accounting for only about 0.13% of all transactions. In absolute terms, there are 8,213 fraudulent transactions out of a total of 6,362,620 (8,213 + 6,354,407):
- Non-fraudulent (0): 6,354,407 transactions (99.87%)
- Fraudulent (1): 8,213 transactions (0.13%)


![Class Distribution](/results/class_distribution.png)

This significant class imbalance underscores the challenges associated with fraud detection, where traditional metrics like overall accuracy would be misleading.
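A short sketch of how such a class distribution can be computed with pandas is given below. The helper name `class_distribution` and the toy label series are illustrative assumptions; on the full data one would pass `df["isFraud"]`.

```python
import pandas as pd


def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage of each class label."""
    counts = labels.value_counts().sort_index()
    percent = (100 * counts / counts.sum()).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy labels only (998 legitimate, 2 fraudulent), not the real dataset.
toy_labels = pd.Series([0] * 998 + [1] * 2, name="isFraud")
dist = class_distribution(toy_labels)
print(dist)
```

With the counts reported above, a classifier that always predicts "non-fraud" would already reach 6,354,407 / 6,362,620 ≈ 99.87% accuracy, which is why accuracy alone says little here.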

### 2.2.2 Fraud Rate by Transaction Type
We then analyzed how fraud is distributed across different transaction types. The dataset includes various transaction categories such as PAYMENT, TRANSFER, CASH_OUT, DEBIT, and CASH_IN. Our analysis revealed that fraud is concentrated almost exclusively in the TRANSFER and CASH_OUT transaction types.
Figure 2 (Fraud Rate by Transaction Type) illustrates that:

- TRANSFER transactions account for approximately 80.69% of fraudulent activity.

- CASH_OUT transactions account for about 19.31% of fraud.

- Other transaction types show virtually no fraud.

![Fraud Rate by Transaction Type](./results/fraud_rate_by_transaction.png)

These findings highlight the importance of transaction type as a strong predictor of fraud.
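A per-type breakdown of this kind can be sketched with a pandas `groupby`. The helper `fraud_by_type` and the toy frame are assumptions for illustration. The sketch reports two quantities that are easy to confuse: the fraud rate within each type, and each type's share of all fraud cases.

```python
import pandas as pd


def fraud_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Per transaction type: volume, fraud count, fraud rate, share of all fraud."""
    grouped = df.groupby("type")["isFraud"].agg(
        transactions="count", fraud_cases="sum"
    )
    grouped["fraud_rate_pct"] = 100 * grouped["fraud_cases"] / grouped["transactions"]
    grouped["share_of_fraud_pct"] = (
        100 * grouped["fraud_cases"] / grouped["fraud_cases"].sum()
    )
    return grouped.sort_values("fraud_rate_pct", ascending=False)


# Toy data only: 4 transfers (1 fraud), 5 cash-outs (1 fraud), 3 payments (0 fraud).
toy = pd.DataFrame({
    "type": ["TRANSFER"] * 4 + ["CASH_OUT"] * 5 + ["PAYMENT"] * 3,
    "isFraud": [1, 0, 0, 0] + [1, 0, 0, 0, 0] + [0, 0, 0],
})
summary = fraud_by_type(toy)
print(summary)
```

In the toy data, TRANSFER has the higher fraud rate (25% vs. 20%) even though both types contribute the same share of fraud cases (50% each), so the two columns answer different questions.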

### 2.2.3 Transaction Type vs. Fraud Count
To visualize how fraud and non-fraud transactions are distributed across different transaction types, we plotted a count graph, shown in Figure 3 (Transaction Type vs. Fraud). The chart confirms that while PAYMENT, CASH_IN, and DEBIT transactions are numerous, they rarely involve fraud. By contrast, TRANSFER and CASH_OUT transactions, although less frequent overall, carry a much higher proportion of fraudulent cases relative to their volume.

This information is critical for model development, suggesting that the transaction type should be included as a categorical feature in any predictive modeling approach.


![Transaction Type vs. Fraud Count](./results/Transaction_Type_vs_Fraud.png)


### 2.2.4 Correlation Analysis
To further investigate relationships between variables, we computed and visualized a correlation heatmap, presented in Figure 4 (Correlation Heatmap).
The heatmap provides insights into how features relate to each other and to the target variable `isFraud`.

The strongest positive correlations with isFraud were observed for:

- `amount` (correlation coefficient ≈ 0.0767)
- `type_TRANSFER` (≈ 0.0539)
- `isFlaggedFraud` (≈ 0.0441)

Meanwhile, features such as type_PAYMENT show a slight negative correlation with fraud. Although these correlation values are generally low, they point to certain trends that may help distinguish fraudulent transactions.

![Correlation Analysis](results/heatmap.png)
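The ranking of features by their correlation with `isFraud` can be reproduced along the following lines. This is a sketch under assumptions: the helper `correlation_with_target` is hypothetical, Pearson correlation (the pandas default) is assumed, and the six-row frame is toy data, so the printed values will not match the coefficients listed above.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "isFraud") -> pd.Series:
    """Pearson correlation of every numeric or one-hot feature with the target."""
    encoded = pd.get_dummies(df, columns=["type"], dtype=float)
    numeric = encoded.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Toy data only: the two fraud cases carry the largest amounts.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [100.0, 5000.0, 3000.0, 50.0, 200.0, 150.0],
    "isFraud": [0, 1, 1, 0, 0, 0],
})
corr = correlation_with_target(toy)
print(corr)
```

The full matrix returned by `numeric.corr()` is what a heatmap would visualize; the sketch keeps only the column for the target.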

### 2.2.5 Insights from Data Exploration
From this exploratory phase, we can conclude several important patterns:

- The dataset is highly imbalanced, with fraud representing less than 0.2% of transactions.
- Fraud occurs almost exclusively in TRANSFER and CASH_OUT transactions.
- Fraudulent transactions often involve larger amounts, supporting the hypothesis that transaction value is a key indicator of potential fraud.
- Correlation analysis, while showing modest relationships, suggests that transaction type and amount are among the most informative features for predicting fraud.

These findings inform the direction of our feature engineering and modeling strategies. In particular, they emphasize the need to account for class imbalance and to focus on transaction types and amounts when building fraud detection models.

## 3. Data Preprocessing
Before developing any models, we performed several preprocessing steps to prepare the dataset for analysis. These steps ensured that the data was clean, consistent, and suitable for machine learning algorithms.

- **Missing-value check:** Checked all columns and confirmed that no values are missing.
- **Removal of unneeded columns:** Dropped the identifier columns `nameOrig` and `nameDest`.
- **One-hot encoding:** Transformed the categorical `type` column into binary indicator columns such as `type_PAYMENT` and `type_TRANSFER`. Each new column indicates whether the transaction belongs to that type (True/False).
- **Data type checks:** Verified that numeric columns remained floats or integers and that the new one-hot encoded columns are stored as boolean values.
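The preprocessing steps listed above can be sketched as a single function. The name `preprocess` and the toy rows are assumptions. `drop_first=True` is also an assumption: it is chosen because the transformed snapshot contains no `type_CASH_IN` column, which is what pandas produces when the alphabetically first category is dropped.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Check for missing values, drop identifier columns, one-hot encode `type`."""
    # 1. Missing-value check: fail loudly instead of silently imputing.
    missing = df.isna().sum()
    if missing.any():
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")
    # 2. Drop the anonymized account identifiers.
    out = df.drop(columns=["nameOrig", "nameDest"])
    # 3. One-hot encode `type` as boolean columns (assumption: first level dropped).
    return pd.get_dummies(out, columns=["type"], drop_first=True, dtype=bool)


# Toy rows with hypothetical identifiers; one row per transaction type.
raw = pd.DataFrame({
    "step": [1, 1, 1, 1, 1],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"],
    "amount": [9839.64, 181.0, 181.0, 500.0, 1000.0],
    "nameOrig": ["C1", "C2", "C3", "C4", "C5"],
    "nameDest": ["M1", "C6", "C7", "C8", "C9"],
    "isFraud": [0, 1, 1, 0, 0],
})
clean = preprocess(raw)
print(list(clean.columns))
print(clean.dtypes)
```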

**Snapshot of Transformed Data**


| step | amount   | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|------|----------|---------------|----------------|----------------|----------------|---------|----------------|---------------|------------|--------------|---------------|
| 1    | 9839.64  | 170136.00     | 160296.36      | 0.00           | 0.00           | 0       | 0              | False         | False      | True         | False         |
| 1    | 1864.28  | 21249.00      | 19384.72       | 0.00           | 0.00           | 0       | 0              | False         | False      | True         | False         |
| 1    | 181.00   | 181.00        | 0.00           | 0.00           | 0.00           | 1       | 0              | False         | False      | False        | True          |
| 1    | 181.00   | 181.00        | 0.00           | 21182.00       | 0.00           | 1       | 0              | True          | False      | False        | False         |
| 1    | 11668.14 | 41554.00      | 29885.86       | 0.00           | 0.00           | 0       | 0              | False         | False      | True         | False         |

## 4. Modeling Approach

In this project, we applied several probabilistic and statistical classification models to detect fraudulent transactions. The selection was guided by the need to handle a highly imbalanced dataset, maintain interpretability, and explore both discriminative and generative approaches.

### 4.1 Logistic Regression (Frequentist)


A baseline discriminative model that estimates the probability of fraud using the logistic function:
![Equation: logistic regression fraud probability](results/eq1.png)

The parameters β are estimated by maximizing the likelihood (equivalently minimizing the log-loss). Logistic regression is interpretable and provides calibrated probability outputs, making it suitable for fraud detection.

### 4.2 Bayesian Logistic Regression
Extends logistic regression by placing a prior over the parameters β (e.g., Gaussian prior) and inferring a posterior distribution:
![Equation: posterior distribution over β](results/eq2.png)

This allows quantifying parameter uncertainty and producing posterior predictive distributions, which can be valuable in decision-making under uncertainty.
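To illustrate what a posterior predictive distribution provides, the sketch below averages the logistic function over posterior draws of β and reports a central 95% interval per transaction. It uses NumPy only; the function name and the randomly generated "posterior draws" are hypothetical stand-ins for real MCMC output, not results from this project.

```python
import numpy as np


def posterior_predictive(X, beta_samples, intercept_samples):
    """Posterior mean and central 95% interval of p(y=1 | x) for each row of X.

    X: (n, d) features; beta_samples: (S, d); intercept_samples: (S,).
    """
    logits = X @ beta_samples.T + intercept_samples  # shape (n, S)
    probs = 1.0 / (1.0 + np.exp(-logits))
    mean = probs.mean(axis=1)
    lower, upper = np.percentile(probs, [2.5, 97.5], axis=1)
    return mean, lower, upper


rng = np.random.default_rng(0)
# Hypothetical posterior draws; in practice these would come from the sampler.
beta_samples = rng.normal(loc=[1.5, -0.5], scale=0.2, size=(2000, 2))
intercept_samples = rng.normal(loc=-4.0, scale=0.3, size=2000)

X_new = np.array([[0.0, 0.0], [3.0, -1.0]])  # two illustrative transactions
mean, lower, upper = posterior_predictive(X_new, beta_samples, intercept_samples)
for m, lo, hi in zip(mean, lower, upper):
    print(f"p(fraud) = {m:.3f}  (95% interval {lo:.3f} to {hi:.3f})")
```

A wide interval signals that the parameters are poorly determined for that transaction, which is the kind of information a point estimate from maximum likelihood does not expose.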

### 4.3 Naive Bayes
A generative model assuming conditional independence between features given the class label. Using Bayes’ theorem:
![Equation: Naive Bayes posterior via Bayes' theorem](results/eq3.png)

While the independence assumption is often violated, Naive Bayes is computationally efficient and can perform well when features are weakly correlated.


### 4.4 Weighted Logistic Regression
A modification of the logistic regression model to address class imbalance by setting `class_weight='balanced'`. The weighted loss is:
![Equation: weighted logistic loss](results/eq4.png)

where W_y_i is inversely proportional to the class frequency. This ensures the minority fraud class has greater influence on the decision boundary.
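As a worked example, the sketch below computes the weights that the `'balanced'` heuristic assigns, using the formula documented by scikit-learn, `n_samples / (n_classes * count_k)`, together with the class counts from Section 2.2.1. The weighted loss shown is one common form (a class-weighted mean cross-entropy) and is an assumption; it may differ in normalization from the equation above. Function names are illustrative.

```python
import numpy as np


def balanced_weights_from_counts(counts):
    """w_k = n_samples / (n_classes * n_k) for each class k."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)


def weighted_log_loss(y, p, weights):
    """Mean cross-entropy with each sample scaled by its class weight."""
    y = np.asarray(y)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.mean(np.asarray(weights)[y] * per_sample))


# Class counts reported in Section 2.2.1: non-fraud, fraud.
weights = balanced_weights_from_counts([6_354_407, 8_213])
print(weights.round(2))

# The same prediction error costs far more on a fraud case than on a legitimate one.
print(weighted_log_loss([0], [0.3], weights))  # legitimate case scored at p = 0.3
print(weighted_log_loss([1], [0.7], weights))  # fraud case scored at p = 0.7
```

With these counts, the non-fraud weight is roughly 0.50 and the fraud weight roughly 387, so a single misclassified fraud case contributes about as much to the loss as several hundred misclassified legitimate ones.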

### 4.5 Linear Discriminant Analysis (LDA)
A Gaussian Discriminant Analysis method that assumes each class follows a Gaussian distribution with the same covariance matrix Σ, leading to linear decision boundaries. The discriminant function is:

![Equation: LDA discriminant function](results/eq5.png)

We used balanced priors (π₀ = π₁ = 0.5) to counter the strong class imbalance.

### 4.6 Quadratic Discriminant Analysis (QDA)

Similar to LDA, but each class has its own covariance matrix Σₖ, allowing quadratic decision boundaries:

![Equation: QDA discriminant function](results/eq6.png)

We also used balanced priors to avoid bias toward the majority class. A small regularization term was applied to improve numerical stability in the presence of multicollinearity.
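The configuration of both discriminant models can be sketched with scikit-learn as shown below. The data here is synthetic (generated with `make_classification`), not `Fraud.csv`, and the comparison against default priors is added purely to illustrate the effect of balancing; `reg_param=1e-3` follows the value reported in Section 5.2.6.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Synthetic imbalanced data (about 2% positives) standing in for the real features.
X, y = make_classification(
    n_samples=5000, n_features=6, n_informative=4,
    weights=[0.98, 0.02], random_state=0,
)

# Balanced priors replace the empirical class frequencies in Bayes' rule.
lda_balanced = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)
lda_default = LinearDiscriminantAnalysis().fit(X, y)
qda_balanced = QuadraticDiscriminantAnalysis(priors=[0.5, 0.5], reg_param=1e-3).fit(X, y)

flag_rate_balanced = lda_balanced.predict(X).mean()
flag_rate_default = lda_default.predict(X).mean()
qda_flag_rate = qda_balanced.predict(X).mean()

print(f"LDA flagged share, default priors:  {flag_rate_default:.3f}")
print(f"LDA flagged share, balanced priors: {flag_rate_balanced:.3f}")
print(f"QDA flagged share, balanced priors: {qda_flag_rate:.3f}")
```

Raising the prior of the rare class shifts the decision boundary toward the majority class, so more transactions are flagged; this raises recall and typically lowers precision.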

## 5. Model Training and Evaluation

### 5.1 Data Preparation for Modeling
All models used the same preprocessed dataset to ensure fair comparison. Preprocessing steps included:
- **One-Hot Encoding** of the categorical variable `type`.
- **Removal** of identifier columns `nameOrig` and `nameDest`.
- **Standardization** of numerical variables for models sensitive to feature scale (LDA, QDA, Logistic Regression).
- **Train/Test Split**: 80% training, 20% testing, stratified by the target variable to maintain the fraud/non-fraud proportion.
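A minimal sketch of the stratified split and standardization follows. The helper `split_and_scale`, the seed, and the synthetic data are assumptions. Fitting the scaler on the training rows only is also an assumption about the procedure; it is the usual way to keep test-set statistics out of training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(X, y, test_size=0.2, seed=42):
    """Stratified 80/20 split, then standardize using training statistics only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)  # no information from the test rows
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Synthetic stand-in: 1000 rows, 2% positives.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 3))
y = np.array([1] * 20 + [0] * 980)

X_train, X_test, y_train, y_test = split_and_scale(X, y)
print(X_train.shape, X_test.shape)
print("positives in train/test:", y_train.sum(), y_test.sum())
```

Stratification keeps the fraud proportion identical in both parts, which matters when the positive class is this rare: an unstratified split could leave the test set with very few fraud cases.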


### 5.2 Training Procedures
#### 5.2.1 Frequentist Logistic Regression

The frequentist logistic regression model was trained using scikit-learn’s `LogisticRegression` with `max_iter=1000`.  
This model estimates the probability of a transaction being fraudulent using a linear decision boundary in the feature space, optimized via maximum likelihood estimation.

- **ROC AUC**: 0.9874  
- **Accuracy**: 0.96  
- **Classification Report (Fraud class)**:  
  - Precision: 0.03  
  - Recall: 0.92  
  - F1-score: 0.06  

**Interpretation:**  
The model achieves a very high recall for fraudulent transactions, detecting the majority of fraud cases.  
However, the low precision indicates a high number of false positives, which may lead to unnecessary manual reviews.  
This trade-off can be acceptable in high-stakes fraud detection, where missing fraud is costlier than investigating false alerts.

**Performance Curves:**  
![ROC Curve - Logistic Regression](results/output1.png)   ![Precision-Recall Curve - Logistic Regression](results/output2.png)  
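The reported metrics can be computed with a small evaluation helper such as the one below. The helper `evaluate` is hypothetical and the data is synthetic, so the printed numbers are unrelated to the results above. The sketch fits both an unweighted model and one with `class_weight="balanced"` so the two configurations can be compared under identical conditions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """ROC AUC from predicted probabilities; precision/recall/F1 for the positive class."""
    scores = model.predict_proba(X_test)[:, 1]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="binary", zero_division=0
    )
    return {
        "roc_auc": roc_auc_score(y_test, scores),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Synthetic imbalanced data (about 3% positives), not the fraud dataset.
X, y = make_classification(n_samples=4000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

plain_metrics = evaluate(plain, X_te, y_te)
weighted_metrics = evaluate(weighted, X_te, y_te)
print("unweighted:", plain_metrics)
print("balanced:  ", weighted_metrics)
```

ROC AUC depends only on the ranking of the scores, whereas precision, recall, and F1 depend on the 0.5 decision threshold applied by `predict`.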

#### 5.2.2 Naive Bayes

The Gaussian Naive Bayes model was trained using scikit-learn’s `GaussianNB`.  
It assumes that the features are conditionally independent given the class label and that each feature follows a Gaussian distribution within each class.

- **ROC AUC**: 0.8075  
- **Accuracy**: 0.99  

- **Classification Report (Fraud class)**:  
  - Precision: 0.03  
  - Recall: 0.16  
  - F1-score: 0.06  
  
**Interpretation:**  
The model achieves high accuracy due to the overwhelming number of non-fraud cases but performs poorly in recall and precision for fraud detection.  
Its strong independence assumption does not hold for this dataset, which limits its ability to detect fraudulent transactions effectively.

**Performance Curves:**  
![ROC Curve - Naive Bayes](results/output3.png)   ![Precision-Recall Curve - Naive Bayes](results/output4.png)  


#### 5.2.3 Bayesian Logistic Regression

The Bayesian logistic regression model was implemented using a probabilistic programming framework (e.g., PyMC) with Gaussian priors placed over the model coefficients.  
Inference was performed via Markov Chain Monte Carlo (MCMC) sampling, allowing estimation of full posterior distributions for the parameters and predictive probabilities.

- **ROC AUC**: 0.9668  
- **Accuracy**: 0.99  

- **Classification Report (Fraud class)**:  
  - Precision: 1.00  
  - Recall: 0.07  
  - F1-score: 0.14  
  
**Interpretation:**  
The model achieves perfect precision for the fraud class, meaning that every predicted fraud case is truly fraudulent.  
However, this comes at the cost of very low recall, missing the majority of fraud cases.  
The Bayesian approach provides uncertainty estimates for predictions, which can be leveraged in operational settings to set more informed thresholds.

**Performance Plot (Posterior Distributions):**  
![Posterior Distributions - Bayesian Logistic Regression](results/output5.png)  

#### 5.2.4 Weighted Logistic Regression

The weighted logistic regression model was trained using scikit-learn’s `LogisticRegression` with the parameter `class_weight='balanced'`.  
This approach adjusts the loss function to give more weight to the minority fraud class, compensating for the severe class imbalance.

- **ROC AUC**: 0.9874  
- **Accuracy**: 0.96  

- **Classification Report (Fraud class)**:  
  - Precision: 0.03  
  - Recall: 0.92  
  - F1-score: 0.06  

**Interpretation:**  
By rebalancing class weights, the model maintains very high recall for fraud detection, capturing the vast majority of fraudulent transactions.  
However, precision remains low, resulting in a large number of false positives.  
This trade-off is often acceptable in fraud detection, where missing a fraud case is costlier than investigating a false alert.

**Performance Curves:**  
![ROC Curve - Weighted Logistic Regression](results/output1.png)   ![Precision-Recall Curve - Weighted Logistic Regression](results/output2.png)  

#### 5.2.5 Linear Discriminant Analysis (LDA)

The LDA model was trained using scikit-learn’s `LinearDiscriminantAnalysis` with balanced priors (`priors=[0.5, 0.5]`).  
LDA assumes that each class follows a Gaussian distribution with the same covariance matrix, resulting in linear decision boundaries.

- **ROC AUC**: 0.9455  
- **Accuracy**: 0.9991  

- **Classification Report (Fraud class)**:  
  - Precision: 0.7698  
  - Recall: 0.3786  
  - F1-score: 0.5075  

**Interpretation:**  
LDA achieves high precision for fraud detection (0.77), meaning roughly three out of four flagged cases are truly fraudulent.  
However, recall is only 0.38, so roughly 62% of fraud cases are missed.  
This makes LDA more suited for scenarios where minimizing false positives is a priority.

**Performance Curves:**  
![ROC — LDA (Balanced Priors)](results/photo_2025-08-11_22-04-15_3.jpg)   ![Precision–Recall — LDA (Balanced Priors)](results/photo_2025-08-11_22-04-15_4.jpg)  



#### 5.2.6 Quadratic Discriminant Analysis (QDA)

The QDA model was trained using scikit-learn’s `QuadraticDiscriminantAnalysis` with balanced priors (`priors=[0.5, 0.5]`) and a small regularization parameter (`reg_param=1e-3`) to handle covariance singularities.  
QDA allows each class to have its own covariance matrix, resulting in quadratic decision boundaries.

- **ROC AUC**: 0.9790  
- **Accuracy**: 0.6344  

- **Classification Report (Fraud class)**:  
  - Precision: 0.0035  
  - Recall: 1.0000  
  - F1-score: 0.0070  

**Interpretation:**  
QDA achieves perfect recall for fraud detection, capturing all fraudulent transactions.  
However, the precision is extremely low, meaning nearly all flagged transactions are false positives.  
This makes QDA impractical for deployment in production systems without additional filtering mechanisms.

**Performance Curves:**  
![ROC — QDA (Balanced Priors)](results/photo_2025-08-11_22-04-15.jpg)  
![Precision–Recall — QDA (Balanced Priors)](results/photo_2.jpg)  

## 6. Results

### 6.1 Model Performance Summary

| Model | ROC AUC | Accuracy | Precision (Fraud) | Recall (Fraud) | F1 (Fraud) |
|-------|---------|----------|-------------------|----------------|------------|
| Frequentist Logistic Regression | 0.9874 | 0.9600 | 0.03 | 0.92 | 0.06 |
| Naive Bayes | 0.8075 | 0.9900 | 0.03 | 0.16 | 0.06 |
| Bayesian Logistic Regression | 0.9668 | 0.9900 | 1.00 | 0.07 | 0.14 |
| Weighted Logistic Regression | 0.9874 | 0.9600 | 0.03 | 0.92 | 0.06 |
| LDA (Balanced Priors) | 0.9455 | 0.9991 | 0.7698 | 0.3786 | 0.5075 |
| QDA (Balanced Priors) | 0.9790 | 0.6344 | 0.0035 | 1.0000 | 0.0070 |


### 6.2 Observations from Performance Metrics
- **Best Overall ROC AUC**: Frequentist Logistic Regression and Weighted Logistic Regression (0.9874), closely followed by QDA (0.9790).
- **Best Precision**: Bayesian Logistic Regression (1.00) — but at the cost of very low recall.
- **Best Recall**: QDA (1.0) — but with extremely low precision, making it impractical without additional filtering.
- **Balanced Performance**: LDA offers the best balance between precision and recall (highest F1-score, 0.5075), but its recall (0.38) is well below that of Logistic Regression (0.92).
- **Effect of Class Balancing**: Weighted Logistic Regression greatly improves recall over Bayesian Logistic Regression but keeps precision low.
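The F1-scores behind these observations can be cross-checked from the reported precision and recall using F1 = 2PR / (P + R). The sketch below uses only the values from the summary table in Section 6.1; the function name is illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the fraud class, copied from the summary table.
reported = {
    "Frequentist Logistic Regression": (0.03, 0.92),
    "Naive Bayes": (0.03, 0.16),
    "Bayesian Logistic Regression": (1.00, 0.07),
    "Weighted Logistic Regression": (0.03, 0.92),
    "LDA (Balanced Priors)": (0.7698, 0.3786),
    "QDA (Balanced Priors)": (0.0035, 1.0000),
}

f1_by_model = {name: f1_from(p, r) for name, (p, r) in reported.items()}
best_model = max(f1_by_model, key=f1_by_model.get)

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.4f}")
print("highest F1:", best_model)
```

Small deviations from the tabulated F1 values are expected for the models whose precision and recall are reported to only two decimals, since the inputs here are already rounded.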

### 6.3 ROC and Precision–Recall Curve Analysis
**ROC Curves**:
- Logistic Regression models (Frequentist & Weighted) and QDA show curves close to the top-left corner, indicating strong classification performance.
- LDA’s ROC curve is slightly lower, reflecting its more conservative detection strategy.

**Precision–Recall Curves**:
- Bayesian Logistic Regression maintains high precision at low recall.
- Weighted Logistic Regression and Frequentist Logistic Regression achieve much higher recall but see a drop in precision.
- QDA reaches full recall but with a near-zero precision plateau.


## 7. Discussion

The experiments reveal the challenges of detecting fraudulent transactions in a **highly imbalanced dataset**, where the majority of cases are legitimate.  
In such settings, overall accuracy is not a reliable indicator of model performance, as seen with Naive Bayes and LDA, which achieve very high accuracy but differ greatly in their ability to detect fraud.

### 7.1 Trade-off Between Precision and Recall
Fraud detection systems must balance **recall** (catching as many fraudulent cases as possible) and **precision** (minimizing false alarms).  
- **High Recall Models**: Frequentist Logistic Regression, Weighted Logistic Regression, and QDA excel at detecting nearly all fraud cases, but QDA’s extreme false-positive rate makes it unsuitable without post-processing filters.  
- **High Precision Models**: Bayesian Logistic Regression and LDA minimize false positives, which is beneficial for reducing investigation costs, but they miss many fraudulent cases due to lower recall.

### 7.2 Impact of Class Imbalance Handling
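**Class weighting** in Logistic Regression (Weighted LR) gives a fraud recall of 0.92, far above the 0.07 of Bayesian Logistic Regression, at a similar ROC AUC (0.9874 vs. 0.9668).  
The frequentist baseline reports the same metrics as Weighted LR in Section 6.1, so the effect of weighting cannot be isolated from that pair alone. Weighting nevertheless remains a simple technique to counteract the bias toward the majority class in imbalanced datasets.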

### 7.3 Discriminant Analysis Models
LDA and QDA provided insight into the distributional assumptions of the data:
- **LDA** assumes shared covariance across classes, yielding stable decision boundaries and a balanced precision-recall trade-off.
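- **QDA** allows for separate covariances, leading to perfect recall but collapsing precision, plausibly because the fraud-class covariance is estimated from comparatively few examples and the balanced priors push borderline cases toward the fraud class.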

### 7.4 Naive Bayes Limitations
Naive Bayes underperformed in both precision and recall for the fraud class.  
Its assumption of feature independence is likely violated in financial transaction data, leading to suboptimal classification boundaries.

### 7.5 Model Selection for Deployment
In a real-world fraud detection system:
- **Weighted Logistic Regression** would be a strong candidate for deployment because it provides high recall (catching most fraud cases) while keeping the model interpretable and computationally efficient.
- Bayesian Logistic Regression could be integrated as a **second-stage verification model** that confirms only high-confidence fraud predictions, thus reducing false positives.
- LDA could be used in conjunction with Logistic Regression in an **ensemble approach** to balance recall and precision.

### 7.6 Practical Considerations
- Models with high recall but low precision (like QDA) can cause operational inefficiencies due to the large number of false positives.
- Conversely, models with high precision but low recall risk missing costly fraudulent transactions.
- In practice, the decision threshold can be tuned based on business requirements and resource constraints for fraud investigation.
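Threshold tuning of this kind can be sketched with scikit-learn's `precision_recall_curve`. The helper `threshold_for_recall` and the four-sample toy scores are assumptions for illustration: given a minimum recall required by the business, it returns the strictest threshold that still meets it, together with the precision obtained there.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, scores, target_recall):
    """Highest score threshold whose recall is still >= target_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more entry than thresholds;
    # the final entry (recall = 0) has no threshold and is skipped.
    meets_target = np.where(recall[:-1] >= target_recall)[0]
    best = meets_target[-1]  # thresholds increase, so the last index is the strictest
    return thresholds[best], precision[best], recall[best]


# Toy scores only: two legitimate (0) and two fraudulent (1) transactions.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

thr, prec, rec = threshold_for_recall(y_true, scores, target_recall=1.0)
print(f"threshold={thr:.2f}, precision={prec:.3f}, recall={rec:.3f}")
```

In the toy example, catching both fraud cases requires lowering the threshold to 0.35, which also flags one legitimate transaction (precision 2/3). The same helper could be applied to any of the models above that outputs probabilities.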


## 8. Conclusion

This project compared several generative and discriminative probabilistic models for fraud detection on a highly imbalanced dataset.  
Weighted Logistic Regression provided the best trade-off, maintaining very high recall while remaining interpretable.  
Bayesian Logistic Regression achieved perfect precision but low recall, making it better suited as a second-stage filter.  
LDA offered high precision with moderate recall, while QDA achieved perfect recall but impractically low precision.  
Naive Bayes underperformed due to its strong independence assumptions.

Class weighting proved effective in improving fraud detection recall, confirming its value in imbalanced classification problems.  
Future work could explore ensemble methods, cost-sensitive learning, and threshold tuning to further optimize performance.


## 9. References

1. Dataset - [https://www.kaggle.com/datasets/ashishkumarjayswal/froud-detection-dataset](https://www.kaggle.com/datasets/ashishkumarjayswal/froud-detection-dataset)
2. Probabilistic Machine Learning – Lecture Notes. Dr. Álvaro Díaz Ruelas, Leipzig University, 2025.  
3. scikit-learn Documentation — [https://scikit-learn.org/stable/](https://scikit-learn.org/stable/)  
4. PyMC Documentation — [https://www.pymc.io/](https://www.pymc.io/)  