# LAB | Imbalanced

**Load the data**

In this challenge, we will work with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction took place.
- **distance_from_last_transaction:** Distance from where the previous transaction took place.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction was made at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [3]:
# Display the first few rows of the dataset
print(fraud.head())

   distance_from_home  distance_from_last_transaction  \
0           57.877857                        0.311140   
1           10.829943                        0.175592   
2            5.091079                        0.805153   
3            2.247564                        5.600044   
4           44.190936                        0.566486   

   ratio_to_median_purchase_price  repeat_retailer  used_chip  \
0                        1.945940              1.0        1.0   
1                        1.294219              1.0        0.0   
2                        0.427715              1.0        0.0   
3                        0.362663              1.0        1.0   
4                        2.222767              1.0        1.0   

   used_pin_number  online_order  fraud  
0              0.0           0.0    0.0  
1              0.0           0.0    0.0  
2              0.0           1.0    0.0  
3              0.0           1.0    0.0  
4              0.0           1.0    0.0  


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [4]:
# Step 1: Check the distribution of the target variable

fraud_distribution = fraud['fraud'].value_counts(normalize=True)
print(f"Fraud distribution:\n{fraud_distribution}")

Fraud distribution:
fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64
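
With roughly 91% legitimate and 9% fraudulent transactions, the target is clearly imbalanced. A minimal sketch of why that matters for evaluation: a model that always predicts the majority class already reaches an accuracy equal to the majority share. The helper name `describe_imbalance` and the 1,000 toy labels are hypothetical; the labels only approximate the proportions printed above and are not the real data.

```python
from collections import Counter

def describe_imbalance(labels):
    """Summarize a binary target: majority class, imbalance ratio, and the
    accuracy of a model that always predicts the majority class."""
    counts = Counter(labels)
    # most_common(2) assumes the target has at least two classes.
    (majority_class, majority_count), (_, minority_count) = counts.most_common(2)
    return {
        "majority_class": majority_class,
        "imbalance_ratio": majority_count / minority_count,
        "baseline_accuracy": majority_count / len(labels),
    }

# Hypothetical labels approximating the 91% / 9% split (not the real dataset).
toy_labels = [0.0] * 913 + [1.0] * 87
stats = describe_imbalance(toy_labels)
print(f"Majority class: {stats['majority_class']}")
print(f"Imbalance ratio: {stats['imbalance_ratio']:.1f} to 1")
print(f"Accuracy of always predicting the majority: {stats['baseline_accuracy']:.3f}")
```

Such a baseline never flags a single fraud, which is why accuracy alone is a poor metric for this problem.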


In [8]:
# Step 2: Train a Logistic Regression model

X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train the Logistic Regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

In [9]:
# Step 3: Evaluate the model

y_pred = logreg.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
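# Note: roc_auc_score receives hard 0/1 predictions here, so the value equals balanced accuracy.
# Passing logreg.predict_proba(X_test)[:, 1] instead would score the ranking across all thresholds.
print("ROC AUC Score:", roc_auc_score(y_test, y_pred))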


Confusion Matrix:
 [[271937   1842]
 [ 10434  15787]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273779
         1.0       0.90      0.60      0.72     26221

    accuracy                           0.96    300000
   macro avg       0.93      0.80      0.85    300000
weighted avg       0.96      0.96      0.96    300000

ROC AUC Score: 0.7976733092962089
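
Because fraud is the class that matters, precision and recall on class 1 say more than the 0.96 accuracy. The sketch below recomputes them from the confusion matrix printed above. It also shows that a ROC AUC computed from hard 0/1 predictions reduces to balanced accuracy, the mean of the recall on each class. The helper name `binary_metrics` is illustrative and not part of scikit-learn.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive class-1 metrics from the four cells of a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    # With hard labels the ROC curve has a single operating point, so its
    # area equals the mean of recall and specificity.
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }

# Values copied from the confusion matrix above, laid out as [[TN, FP], [FN, TP]].
baseline = binary_metrics(tn=271937, fp=1842, fn=10434, tp=15787)
for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

The recall of about 0.60 means roughly four in ten frauds in the test set go undetected, despite the high accuracy.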


In [10]:
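# Step 4: Oversampling (this cell uses SMOTE, which synthesizes new minority samples,
# rather than random oversampling, which duplicates existing ones)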
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

logreg_smote = LogisticRegression(max_iter=1000)
logreg_smote.fit(X_smote, y_smote)

# Evaluate the model with SMOTE
y_pred_smote = logreg_smote.predict(X_test)
print("Confusion Matrix (SMOTE):\n", confusion_matrix(y_test, y_pred_smote))
print("Classification Report (SMOTE):\n", classification_report(y_test, y_pred_smote))
print("ROC AUC Score (SMOTE):", roc_auc_score(y_test, y_pred_smote))

Confusion Matrix (SMOTE):
 [[255667  18112]
 [  1401  24820]]
Classification Report (SMOTE):
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC Score (SMOTE): 0.9402069973385493
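
Step 4 of the lab asks for plain oversampling, while the cell above applies SMOTE. As a point of comparison, here is a minimal sketch of random oversampling written with the standard library only: minority rows are duplicated at random, with replacement, until every class has the same count. The function name `random_oversample` and the toy rows are hypothetical; imbalanced-learn provides the same technique as `RandomOverSampler` in `imblearn.over_sampling`.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until every
    class matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        candidates = [row for row, label in zip(rows, labels) if label == cls]
        # Draw with replacement; k is 0 for the largest class, adding nothing.
        extra = rng.choices(candidates, k=target - count)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy training set with made-up values: 8 legitimate rows, 2 fraudulent rows.
X_toy = [[float(i), 0.0] for i in range(8)] + [[50.0, 1.0], [75.0, 1.0]]
y_toy = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X_toy, y_toy)
print(Counter(y_over))
```

As in the lab code, resampling should be applied to the training split only, so that the test set keeps the original class distribution.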


In [11]:
# Step 5: Undersampling
undersample = RandomUnderSampler(random_state=42)
X_under, y_under = undersample.fit_resample(X_train, y_train)

logreg_under = LogisticRegression(max_iter=1000)
logreg_under.fit(X_under, y_under)

# Evaluate the model with undersampling
y_pred_under = logreg_under.predict(X_test)
print("Confusion Matrix (Undersampling):\n", confusion_matrix(y_test, y_pred_under))
print("Classification Report (Undersampling):\n", classification_report(y_test, y_pred_under))
print("ROC AUC Score (Undersampling):", roc_auc_score(y_test, y_pred_under))

Confusion Matrix (Undersampling):
 [[255527  18252]
 [  1327  24894]]
Classification Report (Undersampling):
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC Score (Undersampling): 0.9413623993817565


In [14]:
# Step 6: SMOTE (refits on the same X_smote, y_smote produced in Step 4, so the results are identical)
logreg_balanced_smote = LogisticRegression(max_iter=1000)
logreg_balanced_smote.fit(X_smote, y_smote)

# Evaluate the model with balanced data using SMOTE
y_pred_balanced_smote = logreg_balanced_smote.predict(X_test)
print("Confusion Matrix (Balanced SMOTE):\n", confusion_matrix(y_test, y_pred_balanced_smote))
print("Classification Report (Balanced SMOTE):\n", classification_report(y_test, y_pred_balanced_smote))
print("ROC AUC Score (Balanced SMOTE):", roc_auc_score(y_test, y_pred_balanced_smote))


Confusion Matrix (Balanced SMOTE):
 [[255667  18112]
 [  1401  24820]]
Classification Report (Balanced SMOTE):
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC Score (Balanced SMOTE): 0.9402069973385493
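
To answer the question the steps pose (does resampling improve the model?), the confusion matrices printed above can be put side by side. A small sketch using only numbers copied from those outputs; the dictionary layout and variable names are illustrative.

```python
# Confusion matrices copied from the outputs above, as (TN, FP, FN, TP).
results = {
    "Original": (271937, 1842, 10434, 15787),
    "SMOTE": (255667, 18112, 1401, 24820),
    "Undersampling": (255527, 18252, 1327, 24894),
}

summary = {}
for name, (tn, fp, fn, tp) in results.items():
    summary[name] = {
        "fraud_recall": tp / (tp + fn),
        "fraud_precision": tp / (tp + fp),
        "missed_frauds": fn,   # false negatives
        "false_alarms": fp,    # false positives
    }

print(f"{'model':<15}{'recall':>8}{'precision':>11}{'missed':>8}{'false alarms':>14}")
for name, m in summary.items():
    print(f"{name:<15}{m['fraud_recall']:>8.2f}{m['fraud_precision']:>11.2f}"
          f"{m['missed_frauds']:>8}{m['false_alarms']:>14}")
```

Both resampled models raise fraud recall from about 0.60 to about 0.95, while fraud precision falls from about 0.90 to about 0.58 and the F1 score stays at 0.72. Whether that counts as an improvement depends on how a missed fraud is weighed against a false alarm.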
