# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [3]:
#Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [4]:
fraud_data = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud_data.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [5]:
# 1. Analyze the target variable distribution
print("Target Variable Distribution:")
print(fraud_data['fraud'].value_counts(normalize=True))

Target Variable Distribution:
fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64
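As a rough check on step 1, the proportions printed above can be turned into an imbalance ratio and a majority-class baseline. This is only a sketch: the two proportions are copied from the output, and the variable names are illustrative.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
legit_share = 0.912597
fraud_share = 0.087403

# How many legitimate transactions there are for every fraudulent one.
imbalance_ratio = legit_share / fraud_share

# Accuracy of a trivial model that labels every transaction as legit (class 0).
baseline_accuracy = legit_share

print(f"Legit-to-fraud ratio: {imbalance_ratio:.1f} : 1")
print(f"Always-legit baseline accuracy: {baseline_accuracy:.1%}")
```

With roughly ten legitimate transactions per fraudulent one, the dataset is imbalanced, and any accuracy figure should be read against that baseline.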


In [6]:
# Split the data into features and target
X = fraud_data.drop(columns=['fraud'])
y = fraud_data['fraud']

In [7]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [9]:
def train_and_evaluate_model(X_train, y_train, X_test, y_test):
    """Fit a LogisticRegression on the training data and print test-set metrics."""
    # Train the model
    model = LogisticRegression(random_state=42, max_iter=1000)
    model.fit(X_train, y_train)

    # Make predictions on the untouched (never resampled) test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

In [10]:
# 2. Train with the original data
print("Original Data:")
train_and_evaluate_model(X_train, y_train, X_test, y_test)

Original Data:
Confusion Matrix:
[[271936   1843]
 [ 10432  15789]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273779
         1.0       0.90      0.60      0.72     26221

    accuracy                           0.96    300000
   macro avg       0.93      0.80      0.85    300000
weighted avg       0.96      0.96      0.96    300000



**3. Evaluate the model**

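The model has high accuracy (96%), but this is mainly due to the imbalance in the dataset: predicting every transaction as legitimate would already reach about 91%.
Since fraud is the class we care about, recall for the minority class is the metric to watch. It is low (0.60): the model misses 10,432 of the 26,221 fraud cases in the test set, roughly 40%.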
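To make the link between the confusion matrix and the report explicit, the fraud-class metrics can be recomputed by hand. This is a minimal sketch using the four counts printed for the original data; it assumes scikit-learn's layout of rows as actual classes and columns as predicted classes.

```python
# Confusion matrix from the "Original Data" run above,
# rows = actual class, columns = predicted class.
tn, fp = 271936, 1843    # actual legit: correctly kept / wrongly flagged
fn, tp = 10432, 15789    # actual fraud: missed / correctly flagged

total = tn + fp + fn + tp
accuracy = (tp + tn) / total     # share of all predictions that are right
recall = tp / (tp + fn)          # share of real fraud that was caught
precision = tp / (tp + fp)       # share of fraud alerts that were real fraud
missed_share = fn / (tp + fn)    # share of real fraud that slipped through

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} missed={missed_share:.2f}")
```

The rounded values match the classification report, which shows that the accuracy figure is carried almost entirely by the legitimate class.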

In [11]:
# 4. Oversample the minority class

print("\nOversampling:")
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
train_and_evaluate_model(X_resampled, y_resampled, X_test, y_test)


Oversampling:
Confusion Matrix:
[[255518  18261]
 [  1357  24864]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



Oversampling improves recall for fraudulent transactions from 0.60 to 0.95, meaning the model identifies far more fraud cases (missed cases drop from 10,432 to 1,357).
The trade-off is a drop in precision from 0.90 to 0.58: false positives rise from 1,843 to 18,261, and the F1-score for the fraud class stays at 0.72.
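For intuition about what random oversampling does to the training data, here is a small NumPy sketch on toy data. It is not imblearn's implementation; the function `random_oversample` and the toy arrays are hypothetical and only illustrate the idea of duplicating minority rows until the classes are the same size.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority_count = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing number of rows with replacement (zero for the majority class).
        extra = rng.choice(idx, size=majority_count - count, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

# Toy data: 9 legit rows (class 0) and 1 fraud row (class 1).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 9 + [1])

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # class counts after balancing
```

Because the added rows are exact copies, the model sees no new information about fraud; it is only pushed to weight the minority class more heavily, which is consistent with recall rising while precision falls.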


In [12]:
# 5. Undersample the majority class

print("\nUndersampling:")
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
train_and_evaluate_model(X_resampled, y_resampled, X_test, y_test)


Undersampling:
Confusion Matrix:
[[255526  18253]
 [  1327  24894]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



Undersampling shows nearly identical results to oversampling, likely because both methods give the model the same balanced class distribution during training.
Recall remains high (0.95), ensuring that most fraudulent transactions are caught, but precision remains low (0.58), leading to more false positives.
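Although the two methods score almost the same, they produce very different training set sizes. The arithmetic below is a sketch under one assumption: the full dataset has 1,000,000 rows, which is inferred from the 30% test split holding 300,000 rows. The other numbers are copied from the outputs above.

```python
# Assumption: 1,000,000 rows in total (the 30% test split has 300,000 rows).
total_rows = 1_000_000
fraud_share = 0.087403                     # from value_counts(normalize=True)
test_legit, test_fraud = 273_779, 26_221   # "support" column of the reports

# Class counts left in the training split.
total_fraud = round(total_rows * fraud_share)
train_fraud = total_fraud - test_fraud
train_legit = (total_rows - total_fraud) - test_legit

# Oversampling grows fraud up to the legit count;
# undersampling shrinks legit down to the fraud count.
oversampled_size = 2 * train_legit
undersampled_size = 2 * train_fraud

print(f"train legit={train_legit:,} fraud={train_fraud:,}")
print(f"oversampled rows={oversampled_size:,}")
print(f"undersampled rows={undersampled_size:,}")
```

Under that assumption, undersampling trains on about a tenth as many rows as oversampling while reaching the same recall here, which makes it the cheaper option for this model.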


In [13]:
# 6. Apply SMOTE for synthetic oversampling
print("\nSMOTE:")
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
train_and_evaluate_model(X_resampled, y_resampled, X_test, y_test)


SMOTE:
Confusion Matrix:
[[255666  18113]
 [  1401  24820]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



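High recall (0.95) ensures most fraudulent transactions are detected, but precision stays at 0.58, the same as with random oversampling and undersampling.
All three balancing methods trade precision for recall, and none of them changes the F1-score for the fraud class (0.72), so they improve the model only if catching fraud matters more than avoiding false alarms.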
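To compare the four runs side by side, the fraud-class metrics can be recomputed from the confusion matrices printed above. This is a minimal sketch; the dictionary `runs` and the helper `fraud_metrics` are illustrative names, and the counts are copied from the outputs.

```python
# Confusion matrices copied from the four runs above, as (tn, fp, fn, tp).
runs = {
    "Original":      (271936, 1843, 10432, 15789),
    "Oversampling":  (255518, 18261, 1357, 24864),
    "Undersampling": (255526, 18253, 1327, 24894),
    "SMOTE":         (255666, 18113, 1401, 24820),
}

def fraud_metrics(tn, fp, fn, tp):
    """Return precision, recall and F1 for the fraud class (label 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

results = {name: fraud_metrics(*cm) for name, cm in runs.items()}

print(f"{'run':<14}{'precision':>10}{'recall':>8}{'f1':>6}{'missed':>8}{'false alarms':>14}")
for name, (tn, fp, fn, tp) in runs.items():
    precision, recall, f1 = results[name]
    print(f"{name:<14}{precision:>10.2f}{recall:>8.2f}{f1:>6.2f}{fn:>8}{fp:>14}")
```

Reading the missed and false-alarm columns together makes the trade-off concrete: each balancing method catches several thousand more fraud cases at the cost of flagging many more legitimate transactions.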
