# Binary Classification on Imbalanced Data with Oversampling and XAI

**Student:** A.I. Assistant  
**Course:** Machine Learning in Practice  
**Project:** Handling Class Imbalance and Model Explainability

This notebook explores a common challenge in machine learning: building a fair and accurate classification model from an imbalanced dataset. We will use an advanced oversampling technique to balance the classes and Explainable AI (XAI) to understand the model's decisions.

## 1. Project Setup: Dataset and Goals

### 🎯 1.1. Project Goal
The main goal is to build a reliable binary classification model on an imbalanced dataset. We will explore how oversampling techniques can improve model performance, especially for the minority class. We will also use Explainable AI (XAI) to understand *why* our model makes certain predictions.

### 📚 1.2. Dataset Selection: Credit Card Fraud Detection

For this project, we'll use the **Credit Card Fraud Detection** dataset from Kaggle.

*   **Kaggle Dataset Link:** [https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)

**Student Notes:** This dataset is a classic example of an imbalanced classification problem. It contains credit card transactions made over two days, with a very small fraction being fraudulent.

**Why this dataset is suitable:**
*   **High Imbalance:** The fraud cases (minority class) are only about 0.17% of all transactions, which is a significant imbalance (far below the 20% threshold).
*   **Real-World Problem:** Fraud detection is a critical application where failing to detect the minority class has serious consequences.
*   **Anonymized Features:** The features are already transformed (due to privacy), which lets us focus on the modeling techniques rather than extensive feature engineering.

---
### 💻 1.3. Installing and Importing Libraries
First, let's set up our environment by installing and importing the necessary Python libraries. We'll need `pandas` for data handling, `scikit-learn` for modeling, `imblearn` for oversampling, and `shap` for explainability.

In [None]:
!pip install pandas scikit-learn imbalanced-learn matplotlib seaborn shap

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import StandardScaler
import shap

# Set plot style
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
print("Libraries installed and imported successfully!")

### 💾 1.4. Loading the Dataset

Now, let's load the dataset from the local CSV file. Make sure you have downloaded `creditcard.csv` from the Kaggle link and placed it in the same directory as this notebook.

In [None]:
# Load the dataset
try:
    df = pd.read_csv('creditcard.csv')
    print("Dataset loaded successfully!")
    print("Shape of the dataset:", df.shape)
except FileNotFoundError:
    print("Error: 'creditcard.csv' not found. Please download it from Kaggle and place it in the same directory.")

# Display the first few rows
if 'df' in locals():
    display(df.head())

### 📊 1.5. Exploratory Data Analysis (EDA)

Let's start by exploring the dataset to understand its structure, features, and the extent of the class imbalance.

#### Class Distribution
First, we'll check the distribution of the `Class` variable. `0` represents a non-fraudulent transaction, and `1` represents a fraudulent one.

In [None]:
if 'df' in locals():
    class_counts = df['Class'].value_counts()
    class_percentages = df['Class'].value_counts(normalize=True) * 100

    print("Class Distribution:")
    print(class_counts)
    print("\nClass Distribution (%):")
    print(class_percentages)

    # Plotting the class distribution
    plt.figure(figsize=(10, 6))
    sns.barplot(x=class_counts.index, y=class_counts.values, palette='viridis')
    plt.title('Class Distribution')
    plt.xlabel('Class (0: Non-Fraud, 1: Fraud)')
    plt.ylabel('Number of Transactions')
    plt.xticks([0, 1])
    plt.show()

    # Pie chart for class imbalance
    plt.figure(figsize=(8, 8))
    plt.pie(class_counts, labels=['Non-Fraud', 'Fraud'], autopct='%1.2f%%', startangle=90, colors=['#66b3ff','#ff9999'])
    plt.title('Class Imbalance Pie Chart')
    plt.show()

**Observation:** The plots clearly show a severe class imbalance. The fraudulent transactions make up only **0.17%** of the dataset. A model trained on this data might become very good at predicting non-fraudulent transactions but fail to identify fraudulent ones.

---
#### Feature Overview
The dataset contains 30 features. `Time` and `Amount` are the original features, while `V1` through `V28` are the result of a PCA transformation to protect user privacy. Let's look at the distribution of `Time` and `Amount`.

In [None]:
if 'df' in locals():
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

    # Histogram for 'Time'
    sns.histplot(df['Time'], bins=50, ax=ax1, color='blue', kde=True)
    ax1.set_title('Distribution of Transaction Time')
    ax1.set_xlabel('Time (seconds)')

    # Histogram for 'Amount'
    sns.histplot(df['Amount'], bins=50, ax=ax2, color='green', kde=True)
    ax2.set_title('Distribution of Transaction Amount')
    ax2.set_xlabel('Amount')
    ax2.set_xlim(0, 5000) # Limiting for better visualization

    plt.show()

---
#### Distribution of Anonymized Features

Let's look at the distributions of the anonymized `V` features. This can help us spot any unusual patterns.

In [None]:
if 'df' in locals():
    v_features = [f'V{i}' for i in range(1, 29)]
    
    plt.figure(figsize=(20, 25))
    for i, feature in enumerate(v_features):
        plt.subplot(7, 4, i + 1)
        sns.histplot(df[feature], bins=50, kde=True)
        plt.title(f'Distribution of {feature}')
    plt.tight_layout()
    plt.show()

**What this plot tells us:** The distributions of the `V` features are mostly centered around zero, which is expected from PCA-transformed data. Some features show more variance than others, but there are no obvious outliers or strange shapes that would require special treatment.

---
#### Feature Distributions by Class

Do the feature distributions differ for fraudulent and non-fraudulent transactions? Let's check for a few features that our SHAP analysis later identifies as important. This can give us an early hint about which features are most discriminative.

In [None]:
if 'df' in locals():
    # We'll look at a few features that are often important in fraud detection
    important_features = ['V10', 'V12', 'V14', 'V17']
    
    plt.figure(figsize=(15, 10))
    for i, feature in enumerate(important_features):
        plt.subplot(2, 2, i + 1)
        sns.kdeplot(df[df['Class'] == 0][feature], label='Non-Fraud', fill=True)
        sns.kdeplot(df[df['Class'] == 1][feature], label='Fraud', fill=True)
        plt.title(f'Distribution of {feature} by Class')
        plt.legend()
    plt.tight_layout()
    plt.show()

**Student Notes:** Looking back at the `Time` and `Amount` histograms: the `Time` feature shows transactions happening over two days, with fewer transactions during the night. The `Amount` feature is highly skewed, with most transactions being small amounts. This skewness can be an issue for some models, so we'll scale these features.

---
#### Correlation Heatmap
Let's visualize the correlation between the features. A heatmap will help us see if there are any strong relationships.

In [None]:
if 'df' in locals():
    plt.figure(figsize=(20, 15))
    correlation_matrix = df.corr()
    sns.heatmap(correlation_matrix, cmap='coolwarm_r', annot=False)
    plt.title('Correlation Heatmap of Features')
    plt.show()

**What this plot tells us:** The PCA-transformed features (`V1` to `V28`) have very little correlation with each other, which is expected. `Amount` and `Time` also show weak correlations with the other features. This lack of strong correlation means we don't have to worry much about multicollinearity.

---
## 2. Baseline Model and Oversampling

Now, we'll build a baseline model to see how it performs on the imbalanced data. Then, we'll apply an advanced oversampling technique to see if we can improve the results.

### 🛠️ 2.1. Data Preparation
Before modeling, we need to prepare the data:
1.  **Split the data:** We'll split the data into stratified training and testing sets first, so the test set is not touched by any preprocessing fit.
2.  **Scale `Time` and `Amount`:** Since these features are on different scales from the `V` features, we'll standardize them with `StandardScaler`, fitted on the training split only and then applied to both splits.

In [None]:
if 'df' in locals():
    # Define features (X) and target (y)
    X = df.drop('Class', axis=1)
    y = df['Class']

    # Split first so the scalers never see test-set statistics (avoids data leakage)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    X_train, X_test = X_train.copy(), X_test.copy()

    # Fit one scaler per column on the training split, then apply it to both splits
    for col in ['Amount', 'Time']:
        scaler = StandardScaler()
        X_train[f'scaled_{col.lower()}'] = scaler.fit_transform(X_train[[col]]).ravel()
        X_test[f'scaled_{col.lower()}'] = scaler.transform(X_test[[col]]).ravel()

    # Drop original 'Time' and 'Amount'
    X_train = X_train.drop(['Time', 'Amount'], axis=1)
    X_test = X_test.drop(['Time', 'Amount'], axis=1)

    print("Data prepared and split successfully.")
    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)

### 🅰️ 2.2. Baseline Model (Before Oversampling)

We'll use **Logistic Regression** as our baseline model. It's a simple and interpretable model, making it a good starting point.

**Student Notes: Why Imbalance Affects Precision and Recall**
*   **Precision** measures how many of the positive predictions were actually correct.
*   **Recall** (or Sensitivity) measures how many of the actual positive cases were correctly identified.

With imbalanced data, a model can achieve high accuracy by simply predicting the majority class every time. This would lead to high precision for the majority class but **very low recall** for the minority class (since it's rarely predicted). Our goal is to improve this recall without sacrificing too much precision.
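To make this concrete, here is a minimal sketch with made-up labels (not the real dataset): 10,000 hypothetical transactions at roughly the same 0.17% fraud rate, scored by a "lazy" classifier that always predicts non-fraud. The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy labels: 10,000 transactions, 17 of them fraudulent (0.17%)
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1

# A "lazy" classifier that always predicts the majority class (non-fraud)
y_pred_majority = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred_majority)
fraud_recall = recall_score(y_true, y_pred_majority, pos_label=1, zero_division=0)
fraud_precision = precision_score(y_true, y_pred_majority, pos_label=1, zero_division=0)

print(f"Accuracy:        {accuracy:.4f}")
print(f"Fraud recall:    {fraud_recall:.4f}")
print(f"Fraud precision: {fraud_precision:.4f}")
```

The accuracy looks excellent even though not a single fraud is caught, which is why we report precision and recall for the fraud class rather than accuracy.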

In [None]:
# Train the baseline model
from sklearn.metrics import precision_score, recall_score, f1_score

baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train, y_train)

# Make predictions
y_pred_baseline = baseline_model.predict(X_test)
y_prob_baseline = baseline_model.predict_proba(X_test)[:, 1]

# Evaluate the model
print("--- Baseline Model Performance ---")
print(classification_report(y_test, y_pred_baseline, target_names=['Non-Fraud', 'Fraud']))

# Store baseline metrics programmatically
baseline_metrics = {
    'Precision': precision_score(y_test, y_pred_baseline, pos_label=1, zero_division=0),
    'Recall': recall_score(y_test, y_pred_baseline, pos_label=1, zero_division=0),
    'F1': f1_score(y_test, y_pred_baseline, pos_label=1, zero_division=0),
    'ROC-AUC': roc_auc_score(y_test, y_prob_baseline)
}

#### Visualizing Baseline Performance

Let's visualize the confusion matrix and ROC curve for our baseline model.

In [None]:
# Confusion Matrix
conf_matrix_baseline = confusion_matrix(y_test, y_pred_baseline)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_baseline, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Fraud', 'Fraud'], yticklabels=['Non-Fraud', 'Fraud'])
plt.title('Confusion Matrix - Baseline Model')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# ROC Curve
fpr_baseline, tpr_baseline, _ = roc_curve(y_test, y_prob_baseline)
plt.figure(figsize=(8, 6))
plt.plot(fpr_baseline, tpr_baseline, label=f'Baseline (AUC = {baseline_metrics["ROC-AUC"]:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Baseline Model')
plt.legend(loc='lower right')
plt.show()

#### Precision-Recall Curve
For imbalanced datasets, the Precision-Recall curve is often more informative than the ROC curve because it focuses on the performance of the minority class. A good model will have a curve that stays high and to the right.
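As a rough illustration of why the two curves behave differently, the sketch below scores a synthetic, heavily imbalanced label vector with purely random scores. The data and variable names are assumptions for illustration, and it uses `average_precision_score` as the PR summary, which is a step-wise estimate and not the trapezoidal `auc(recall, precision)` used in this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)

# Synthetic labels with roughly 0.2% positives, and scores that carry no information
n_samples = 100_000
y_toy = (rng.random(n_samples) < 0.002).astype(int)
random_scores = rng.random(n_samples)

prevalence = y_toy.mean()
roc_auc_random = roc_auc_score(y_toy, random_scores)
pr_auc_random = average_precision_score(y_toy, random_scores)

print(f"Positive rate:           {prevalence:.4f}")
print(f"ROC-AUC (random scores): {roc_auc_random:.3f}")
print(f"PR-AUC  (random scores): {pr_auc_random:.4f}")
```

A useless model sits near 0.5 on ROC-AUC regardless of imbalance, while its PR summary sits near the positive rate. So on this kind of data the PR curve leaves much more visible room between "useless" and "good".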

In [None]:
from sklearn.metrics import precision_recall_curve, auc

# Precision-Recall Curve
precision_baseline, recall_baseline, _ = precision_recall_curve(y_test, y_prob_baseline)
pr_auc_baseline = auc(recall_baseline, precision_baseline)
baseline_metrics['PR-AUC'] = pr_auc_baseline

plt.figure(figsize=(8, 6))
plt.plot(recall_baseline, precision_baseline, label=f'Baseline (AUC = {pr_auc_baseline:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve - Baseline Model')
plt.legend(loc='lower left')
plt.show()

**Observation from Baseline Performance:**
*   The model has a high precision for fraud detection (0.88), meaning when it predicts fraud, it's often correct.
*   However, the **recall is only 0.61**. This means it's missing about 39% of the actual fraudulent transactions. This is a big problem in fraud detection.
*   The ROC-AUC score is high (0.97), but this can be misleading on imbalanced datasets: the false positive rate is computed over the very large number of non-fraudulent cases, so even a sizeable number of false alarms barely moves the curve.

Our goal is to improve the recall for the fraud class.

---
### 🅱️ 2.3. Advanced Oversampling Method: ADASYN

To address the class imbalance, we will use the **Adaptive Synthetic (ADASYN)** oversampling method.

**Short Explanation of ADASYN:**
ADASYN is an advanced oversampling technique that generates more synthetic samples for minority class instances that are harder to learn. It works by:
1.  Finding the k nearest neighbors of each minority class sample and measuring what fraction of those neighbors belong to the majority class.
2.  Generating synthetic samples in proportion to that fraction, so the "harder-to-learn" minority samples (those surrounded by majority points) receive more synthetic neighbors.

This is different from basic SMOTE, which generates roughly the same number of synthetic samples for every minority point. ADASYN concentrates on the samples near the decision boundary, which can help the classifier learn to separate the classes better.

*   **Research Paper Citation:** He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. *2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)*.
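The allocation step described above can be sketched in a few lines. This is a simplified illustration, not the `imblearn` implementation: the function name `adasyn_allocation`, its parameters, and the toy data are all hypothetical. It only computes how many synthetic samples each minority point would receive and leaves out the interpolation step that actually creates them.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label=1, k=5, n_synthetic=None):
    """Return (normalized difficulty weights, synthetic-sample counts) per minority sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_mask = (y == minority_label)
    X_min = X[minority_mask]
    n_min = int(minority_mask.sum())
    n_maj = len(y) - n_min
    if n_synthetic is None:
        # Default: generate enough samples to fully balance the two classes
        n_synthetic = n_maj - n_min

    # Ask for k + 1 neighbours because each minority point is its own nearest
    # neighbour (this assumes there are no exact duplicate rows)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neighbour_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    # Fraction of majority-class points among the k neighbours of each minority point
    majority_fraction = (y[neighbour_idx] != minority_label).mean(axis=1)
    total = majority_fraction.sum()
    if total == 0:
        raise ValueError("No minority sample has majority neighbours; nothing to allocate.")

    weights = majority_fraction / total
    counts = np.rint(weights * n_synthetic).astype(int)
    return weights, counts

# Toy data: a tight minority cluster far from the majority, plus one minority
# point sitting in the middle of the majority cloud
rng = np.random.default_rng(0)
X_majority = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minority_cluster = rng.normal(loc=6.0, scale=0.3, size=(10, 2))
X_minority_border = np.array([[0.0, 0.0]])
X_toy = np.vstack([X_majority, X_minority_cluster, X_minority_border])
y_toy = np.array([0] * 200 + [1] * 11)

weights, counts = adasyn_allocation(X_toy, y_toy, k=5)
print("Weights:", weights.round(2))
print("Synthetic samples per minority point:", counts)
```

In this toy setup the isolated minority point inside the majority cloud gets the synthetic samples, while the well-separated cluster gets none. That is the "adaptive" part of ADASYN.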

In [None]:
# Apply ADASYN to the training data
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X_train, y_train)

# Ensure pandas objects for easy handling
X_resampled = pd.DataFrame(X_resampled, columns=X_train.columns)
y_resampled = pd.Series(y_resampled, name='Class')

print("--- After ADASYN Oversampling ---")
print("Original training set shape:", X_train.shape)
print("Resampled training set shape:", X_resampled.shape)
print("\nNew class distribution:")
print(y_resampled.value_counts())

# Identify which minority samples are exact originals vs synthetic
orig_minority_df = X_train[y_train == 1].copy()
# Use exact tuple matching; synthetic samples won't match exactly
orig_minority_set = set(map(tuple, orig_minority_df.values))

sample_types = []
for i in range(len(X_resampled)):
    if y_resampled.iloc[i] == 0:
        sample_types.append('Majority')
    else:
        row_tuple = tuple(X_resampled.iloc[i].values)
        sample_types.append('Real Minority' if row_tuple in orig_minority_set else 'Synthetic Minority')

sample_types = pd.Series(sample_types, name='SampleType')
print("\nMinority breakdown after resampling:")
print(sample_types[y_resampled == 1].value_counts())

#### Visualizing the Synthetic Samples

Let's visualize the effect of oversampling. We'll use PCA to reduce the dimensionality of the data to 2D and plot the real vs. synthetic samples. This helps us see where the new samples were generated.

In [None]:
from sklearn.decomposition import PCA

# Reduce dimensionality with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_resampled)

# Create a DataFrame for plotting
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['Class'] = y_resampled.values
df_pca['SampleType'] = sample_types.values

plt.figure(figsize=(12, 8))
# Majority
sns.scatterplot(
    data=df_pca[df_pca['Class'] == 0],
    x='PC1', y='PC2',
    label='Majority Class',
    alpha=0.25,
    color='steelblue'
)
# Real minority
sns.scatterplot(
    data=df_pca[(df_pca['Class'] == 1) & (df_pca['SampleType'] == 'Real Minority')],
    x='PC1', y='PC2',
    label='Real Minority',
    s=70,
    marker='o',
    color='crimson'
)
# Synthetic minority
sns.scatterplot(
    data=df_pca[(df_pca['Class'] == 1) & (df_pca['SampleType'] == 'Synthetic Minority')],
    x='PC1', y='PC2',
    label='Synthetic Minority',
    s=70,
    marker='x',
    color='orange'
)
plt.title('PCA of Real vs. Synthetic Minority Samples')
plt.legend()
plt.show()

**What this plot tells us:** The synthetic samples (orange crosses) are generated around the real minority samples (red circles). This helps to create a more balanced dataset and expand the decision boundary for the minority class, making it easier for the model to learn.

---
### Improved Model Training (After Oversampling)

Now, let's train the same Logistic Regression model on our new, balanced dataset.

In [None]:
# Train the model on the resampled data
from sklearn.metrics import precision_score, recall_score, f1_score

oversampled_model = LogisticRegression(random_state=42, max_iter=1000)
oversampled_model.fit(X_resampled, y_resampled)

# Make predictions on the original test set
y_pred_oversampled = oversampled_model.predict(X_test)
y_prob_oversampled = oversampled_model.predict_proba(X_test)[:, 1]

# Evaluate the model
print("--- Oversampled Model Performance ---")
print(classification_report(y_test, y_pred_oversampled, target_names=['Non-Fraud', 'Fraud']))

# Store oversampled metrics programmatically
o_precision = precision_score(y_test, y_pred_oversampled, pos_label=1, zero_division=0)
o_recall = recall_score(y_test, y_pred_oversampled, pos_label=1, zero_division=0)
o_f1 = f1_score(y_test, y_pred_oversampled, pos_label=1, zero_division=0)

oversampled_metrics = {
    'Precision': o_precision,
    'Recall': o_recall,
    'F1': o_f1,
    'ROC-AUC': roc_auc_score(y_test, y_prob_oversampled)
}

#### Visualizing Improved Performance

Let's look at the new confusion matrix and ROC curve.

In [None]:
# Confusion Matrix
conf_matrix_oversampled = confusion_matrix(y_test, y_pred_oversampled)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_oversampled, annot=True, fmt='d', cmap='Greens', xticklabels=['Non-Fraud', 'Fraud'], yticklabels=['Non-Fraud', 'Fraud'])
plt.title('Confusion Matrix - Oversampled Model')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# ROC Curve
fpr_oversampled, tpr_oversampled, _ = roc_curve(y_test, y_prob_oversampled)
plt.figure(figsize=(8, 6))
plt.plot(fpr_baseline, tpr_baseline, label=f'Baseline (AUC = {baseline_metrics["ROC-AUC"]:.2f})')
plt.plot(fpr_oversampled, tpr_oversampled, label=f'Oversampled (AUC = {oversampled_metrics["ROC-AUC"]:.2f})', color='green')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.show()

#### Precision-Recall Curve Comparison

In [None]:
# Precision-Recall Curve for oversampled model
precision_oversampled, recall_oversampled, _ = precision_recall_curve(y_test, y_prob_oversampled)
pr_auc_oversampled = auc(recall_oversampled, precision_oversampled)
oversampled_metrics['PR-AUC'] = pr_auc_oversampled

# Plotting the comparison
plt.figure(figsize=(8, 6))
plt.plot(recall_baseline, precision_baseline, label=f'Baseline (AUC = {pr_auc_baseline:.2f})')
plt.plot(recall_oversampled, precision_oversampled, label=f'Oversampled (AUC = {pr_auc_oversampled:.2f})', color='green')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve Comparison')
plt.legend(loc='lower left')
plt.show()

---
### 🅲 2.4. Performance Comparison (Before vs. After)

Let's create a table and a bar plot to compare the performance of the baseline and oversampled models.

In [None]:
# Create a comparison DataFrame
comparison_data = {
    'Baseline': baseline_metrics,
    'Oversampled (ADASYN)': oversampled_metrics
}
comparison_df = pd.DataFrame(comparison_data).T

print("--- Model Performance Comparison ---")
display(comparison_df)

# Plotting the comparison
comparison_df.plot(kind='bar', figsize=(12, 7))
plt.title('Model Performance Comparison (Fraud Class)')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
plt.show()

**Student-Style Analysis:**

*   **Did the minority recall improve?** Yes, dramatically. The recall for the fraud class jumped from **0.61 to 0.92**, so the new model catches far more of the fraudulent transactions.
*   **Did oversampling cause noise?** The precision for the fraud class dropped sharply (from 0.88 to 0.06), which means only about 1 in 17 transactions flagged as fraud is actually fraudulent. This is a common trade-off: because the model was trained on balanced data, it now predicts "fraud" much more often, which leads to more false positives (predicting fraud when there is none). In a real-world scenario, we would need to find a balance, for example by adjusting the prediction threshold.
*   **Are the results balanced now?** The model is now much more sensitive to the minority class, but it has traded precision for recall rather than balancing the two. While the F1-score is lower, high recall is often more important in problems like fraud detection, where missing a positive case is very costly. The ROC-AUC also improved slightly, which suggests the overall ranking of transactions got a little better.
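One way to look for that balance is to pick the decision threshold from the precision-recall curve. The sketch below is a hedged example on synthetic scores, not on this notebook's results: the helper `threshold_for_min_precision` and the 0.5 precision target are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

def threshold_for_min_precision(y_true, y_score, min_precision=0.5):
    """Return (threshold, precision, recall) with the highest recall whose
    precision is at least `min_precision`, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one extra trailing point (1, 0) with no threshold; drop it
    precision, recall = precision[:-1], recall[:-1]
    meets_target = precision >= min_precision
    if not meets_target.any():
        return None
    best = int(np.argmax(np.where(meets_target, recall, -1.0)))
    return thresholds[best], precision[best], recall[best]

# Synthetic example: 5% positives, positives score higher on average
rng = np.random.default_rng(0)
y_toy = (rng.random(5000) < 0.05).astype(int)
toy_scores = rng.normal(loc=2.0 * y_toy, scale=1.0)

result = threshold_for_min_precision(y_toy, toy_scores, min_precision=0.5)
if result is not None:
    threshold, precision_at_t, recall_at_t = result
    y_pred_tuned = (toy_scores >= threshold).astype(int)
    print(f"Threshold: {threshold:.3f}")
    print(f"Precision: {precision_score(y_toy, y_pred_tuned):.3f}")
    print(f"Recall:    {recall_score(y_toy, y_pred_tuned):.3f}")
```

In this notebook the same helper could be called with `y_prob_oversampled` in place of the toy scores. Ideally the threshold would be chosen on a separate validation split, because tuning it on the test set makes the reported metrics optimistic.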

---
## 3. Explainable AI (XAI) Analysis with SHAP

Now that we have an improved model, let's use **SHAP (SHapley Additive exPlanations)** to understand *how* it makes decisions. SHAP helps us see the impact of each feature on the model's predictions.

**Student Notes:** SHAP values tell us how much each feature contributed to pushing the model's prediction from a baseline value to its final prediction. It's a powerful way to "open the black box" of our model.
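For a linear model, this "push from a baseline" has a simple closed form if we assume the features are independent: each feature's contribution is its coefficient times the feature's distance from the background mean, measured in log-odds. The sketch below checks that property on toy data with plain NumPy. As far as we understand, this is what `shap.LinearExplainer` computes with its default settings, but that is an assumption, and all names here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data and model (stand-ins for X_resampled and oversampled_model)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

weights = toy_model.coef_[0]
background_mean = X_toy.mean(axis=0)

# Baseline: the model's log-odds output at the average background sample
base_value = weights @ background_mean + toy_model.intercept_[0]

# Per-feature contribution: coefficient * (feature value - background mean)
linear_shap = weights * (X_toy - background_mean)

# Additivity check: baseline + sum of contributions equals the model's log-odds
reconstructed = base_value + linear_shap.sum(axis=1)
print("Additivity holds:", np.allclose(reconstructed, toy_model.decision_function(X_toy)))
```

This is also why the force plots later in the notebook are read in log-odds: the contributions add up to the model's raw score, not directly to a probability.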

### 🔎 3.1. Setting up SHAP

We'll create a SHAP explainer for our oversampled model and calculate the SHAP values for the test set.

In [None]:
# Create a SHAP explainer
explainer = shap.LinearExplainer(oversampled_model, X_resampled)
shap_values = explainer.shap_values(X_test)

print("SHAP values calculated successfully.")

### 📈 3.2. SHAP Summary and Feature Importance

The **SHAP summary plot** gives us a global view of how each feature affects the predictions. The **feature importance bar chart** shows the average impact of each feature.

In [None]:
# SHAP Summary Plot
shap.summary_plot(shap_values, X_test, plot_type="dot")

# SHAP Feature Importance Bar Chart
shap.summary_plot(shap_values, X_test, plot_type="bar")

**What these plots tell us:**
*   The summary plot shows that features like `V14`, `V10`, and `V12` are very important. For `V14`, low values (blue dots) have a high SHAP value, pushing the prediction towards "Fraud".
*   The bar chart confirms that `V14`, `V12`, and `V10` are the top three most influential features for the model.

---
### 🔬 3.3. Analyzing Individual Predictions with SHAP Force Plots

Let's look at individual predictions. A **force plot** shows how features contribute to a single prediction.

#### Force Plot for a Real Minority Sample
We'll pick a real fraudulent transaction from our test set and see why the model classified it as fraud.

In [None]:
# Find a real fraud case in the test set that the model also predicted as fraud
real_minority_idx = np.where((y_test.values == 1) & (y_pred_oversampled == 1))[0][0]
print(f"Analyzing a correctly detected real fraud case (positional index: {real_minority_idx})")

# Create a force plot for this sample
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[real_minority_idx, :], X_test.iloc[real_minority_idx, :])

**What this plot tells us:** The features in red (like `V14`, `V4`, `V11`) are pushing the prediction higher (towards fraud), while the features in blue are pushing it lower. The combination of these forces leads to the final prediction.

---
#### Force Plot for a Synthetic Minority Sample
Now, let's do something interesting: let's analyze a *synthetic* sample. We'll take one of the synthetic samples generated by ADASYN and see how the model would classify it. This helps us understand if the synthetic data makes sense.

In [None]:
# Pick one synthetic minority sample identified earlier
synthetic_indices = sample_types[(y_resampled == 1) & (sample_types == 'Synthetic Minority')].index
if len(synthetic_indices) == 0:
    print("No synthetic samples identified via exact-match heuristic; picking a minority sample as proxy.")
    candidate_idx = sample_types[(y_resampled == 1)].index[0]
else:
    candidate_idx = synthetic_indices[0]

synthetic_sample = X_resampled.iloc[candidate_idx, :]

# Calculate SHAP values for the synthetic sample
synthetic_shap_values = explainer.shap_values(synthetic_sample)

print("Analyzing a synthetic fraud-like case")

# Create a force plot
shap.force_plot(explainer.expected_value, synthetic_shap_values, synthetic_sample)

### 🧭 3.4. Latent Space: t-SNE View and Decision Regions
Put simply, a latent space is a compressed view of your data where similar items stay close. Think of it like organizing books on a shelf: crime novels sit together, cookbooks sit together. We'll use t-SNE (a dimensionality reduction tool) to project our high-dimensional features into 2D so we can see clusters.

- Analogy: Like shrinking a big city map to fit on a postcard while keeping neighborhoods grouped.
- Goal: See if synthetic minority samples sit near real minority samples (good) and how the boundary between classes looks.
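A minimal sketch of such a projection is shown below. It is an illustration under stated assumptions, not a result from this notebook: the helper `tsne_embed`, the subsample size, and the perplexity are hypothetical choices, and the demo runs on synthetic data. Subsampling matters because t-SNE is slow on hundreds of thousands of rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

def tsne_embed(X, labels, max_points=2000, perplexity=30, random_state=42):
    """Subsample the data and project it to 2D with t-SNE.
    Returns the 2D embedding and the labels of the sampled rows."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    rng = np.random.default_rng(random_state)
    n_keep = min(max_points, len(X))
    keep = rng.choice(len(X), size=n_keep, replace=False)
    tsne = TSNE(n_components=2, perplexity=perplexity, init='pca', random_state=random_state)
    return tsne.fit_transform(X[keep]), labels[keep]

# Demo on synthetic data (stand-in for X_resampled and sample_types)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
embedding, embedded_labels = tsne_embed(X_demo, y_demo, max_points=300)
print("Embedding shape:", embedding.shape)
```

In this notebook, one could pass `X_resampled` together with `sample_types` and color the scatter plot by sample type to check whether the synthetic minority points land near the real ones. Keep in mind that distances between far-apart clusters in a t-SNE plot are not reliable, so the plot is best read for local neighborhoods only.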

**Student-Style Analysis of Synthetic Samples:**

*   **How the model treats synthetic samples:** The force plot for the synthetic sample shows a similar pattern to the real one. Key features like `V14` and `V12` are again pushing the prediction towards fraud. This suggests that the synthetic samples are realistic enough to be treated similarly to real fraud cases by the model.
*   **Do synthetic samples lie in the same feature region?** The PCA plot from earlier showed that the synthetic samples are generated close to the real minority samples. The SHAP analysis is consistent with this: the synthetic sample shares similar feature contribution patterns, which suggests it lies in a similar "decision region."
*   **Why the classifier labels them as minority:** The classifier labels them as fraud because their feature values (like low `V14`) are characteristic of fraudulent transactions, as learned from the original data.
*   **Do they help expand the decision boundary?** Yes. By creating more of these "borderline" examples, ADASYN pushes the model toward a decision boundary that gives the minority class more room. This is likely a large part of why our recall improved so much, and also why precision fell.

---
## 4. Final Conclusion

In this project, we successfully tackled a binary classification problem with a highly imbalanced dataset.

**Here's a summary of what we did and learned:**

1.  **Problem:** We started with a credit card fraud dataset where fraudulent transactions were very rare (0.17%).
2.  **Baseline Model:** Our initial Logistic Regression model was good at identifying non-fraud cases but failed to detect a large portion of the actual frauds (low recall).
3.  **Oversampling:** We used an advanced oversampling technique, ADASYN, to create synthetic data for the minority class. This balanced our training data.
4.  **Improved Model:** The model trained on the balanced data showed a massive improvement in **recall** (from 61% to 92%), meaning it could now identify most of the fraudulent transactions. This came at the cost of lower precision, which is a common trade-off.
5.  **Explainable AI (XAI):** Using SHAP, we were able to look inside our model. We found which features were most important for detecting fraud (`V14`, `V12`, `V10`) and saw that a synthetic sample is explained by the same features as a real fraud case, which suggests the synthetic data is realistic enough to help the model learn.

**In simple English:** We taught our computer to be a better fraud detective. At first, it was too cautious and missed many crimes. So, we showed it more examples of what fraud looks like (even creating some fake ones that looked real). After that, it became much better at catching the bad guys, even though it sometimes raised a false alarm. We also used a special tool (SHAP) to ask our computer *why* it thought a transaction was fraudulent, making its decisions easier to trust.