# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home at which the transaction happened.
- **distance_from_last_transaction:** the distance from the location of the previous transaction.
- **ratio_to_median_purchase_price:** the ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** whether the transaction happened at the same retailer.
- **used_chip:** whether the transaction was made with the chip (credit card).
- **used_pin_number:** whether the transaction was made using a PIN number.
- **online_order:** whether the transaction was an online order.
- **fraud:** whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [103]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score

from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [84]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
print(fraud.shape)
display(fraud.head())

(1000000, 8)


,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [85]:
# 1. What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

fraud['fraud'].value_counts(normalize=True)


fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64

In [58]:
# Yes, we are dealing with an imbalanced dataset. The target variable 'fraud' is heavily skewed:
# 0 (non-fraudulent transactions) makes up approximately 91.3% of the data
# and 1 (fraudulent transactions) makes up about 8.7%.
# Fraudulent transactions are therefore roughly ten times less common than non-fraudulent ones in this dataset.
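To see why this imbalance matters for evaluation, here is a minimal sketch on synthetic labels (not the real dataset) with roughly the same class proportions. The array `y_toy` and its 913/87 split are illustrative assumptions. A "model" that always predicts the majority class reaches about 91% accuracy while catching no fraud at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels mimicking the observed proportions (assumption: 913 legit, 87 fraud)
y_toy = np.array([0] * 913 + [1] * 87)

# A "model" that always predicts the majority class (legit)
y_pred_majority = np.zeros_like(y_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred_majority)
baseline_recall = recall_score(y_toy, y_pred_majority)  # recall for class 1 (fraud)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"fraud recall: {baseline_recall:.3f}")
```

This is why accuracy alone is a poor yardstick here and the later steps look at per-class precision and recall.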

In [59]:
# 1.1 EDA

In [86]:
print(fraud.dtypes)
print(fraud.isnull().sum())

distance_from_home                float64
distance_from_last_transaction    float64
ratio_to_median_purchase_price    float64
repeat_retailer                   float64
used_chip                         float64
used_pin_number                   float64
online_order                      float64
fraud                             float64
dtype: object
distance_from_home                0
distance_from_last_transaction    0
ratio_to_median_purchase_price    0
repeat_retailer                   0
used_chip                         0
used_pin_number                   0
online_order                      0
fraud                             0
dtype: int64


In [87]:
# 2. Train a LogisticRegression.

### Splitting the data into features and target
features = fraud.drop(columns = ["fraud"])
target = fraud["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target)
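
The split above uses the scikit-learn defaults (25% test set, no fixed seed), so the exact numbers change from run to run. As a hedged alternative, the sketch below shows a stratified, seeded split on a toy DataFrame; the `toy` frame and its values are made up for illustration. Using it on the real data would give slightly different figures from the reports printed in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 1,000 rows, 10% positives (illustrative only)
toy = pd.DataFrame({
    "feature": np.arange(1000, dtype=float),
    "fraud": [1.0] * 100 + [0.0] * 900,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns=["fraud"]),
    toy["fraud"],
    test_size=0.25,         # same as the scikit-learn default
    stratify=toy["fraud"],  # keep the class ratio the same in both parts
    random_state=42,        # make the split reproducible
)

# Fraud rate in the training and test parts
print(y_tr.mean(), y_te.mean())
```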

In [88]:
### Normalizing and transforming the data

from sklearn.preprocessing import MinMaxScaler

normalizer = MinMaxScaler()
normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)
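
The scaler is fitted on the training set only, so no information from the test set leaks into preprocessing. One consequence, sketched below with made-up numbers, is that test values outside the training range are mapped outside [0, 1]. The arrays `train_vals` and `test_vals` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_vals = np.array([[0.0], [5.0], [10.0]])
test_vals = np.array([[2.5], [12.0]])  # 12.0 exceeds the training maximum

scaler = MinMaxScaler()
scaler.fit(train_vals)  # learns min=0 and max=10 from the training data only

train_scaled = scaler.transform(train_vals)
test_scaled = scaler.transform(test_vals)

print(train_scaled.ravel())
print(test_scaled.ravel())
```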

In [89]:
# LogisticRegression

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

In [90]:
log_reg.fit(X_train_norm, y_train)

In [91]:
# 3. Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.

pred = log_reg.predict(X_test_norm)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97    227999
         1.0       0.92      0.41      0.57     22001

    accuracy                           0.94    250000
   macro avg       0.93      0.70      0.77    250000
weighted avg       0.94      0.94      0.94    250000



In [None]:
### Recall on the fraud class (1.0) is the most important metric here, as we want to minimize false negatives
### (fraudulent transactions classified as non-fraudulent).
### The model is not performing well: fraud recall is only 0.41, so it misses most of the fraudulent transactions.
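
As a reminder of what these metrics measure, recall is TP / (TP + FN) and precision is TP / (TP + FP). The sketch below computes both by hand from a confusion matrix on a tiny, made-up set of labels (`y_true_toy` and `y_pred_toy` are illustrative, not taken from the dataset) and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 legit (0) and 4 fraud (1) transactions
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

recall_manual = tp / (tp + fn)     # share of actual fraud that was caught
precision_manual = tp / (tp + fp)  # share of fraud alerts that were correct

print(recall_manual, recall_score(y_true_toy, y_pred_toy))
print(precision_manual, precision_score(y_true_toy, y_pred_toy))
```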

In [92]:
# 4. Run Oversample in order to balance our target variable and repeat the steps above, now with balanced data. 
# Does it improve the performance of our model? 

train = pd.DataFrame(X_train_norm, columns = X_train.columns)
train['fraud'] = y_train.values

fraud_1 = train[train['fraud'] == 1]
fraud_0 = train[train['fraud'] == 0]

In [93]:
fraud_1_over = resample(fraud_1, 
                          replace=True,             # sample with replacement
                          n_samples=len(fraud_0),   # to match majority class
                          random_state=42)          # reproducible results

In [94]:
train_over = pd.concat([fraud_1_over, fraud_0])
train_over

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
647097,0.009415,0.002825,0.005904,1.0,0.0,0.0,1.0,1.0
181120,0.013988,0.001160,0.005040,1.0,0.0,0.0,1.0,1.0
9799,0.009065,0.015921,0.004960,1.0,0.0,0.0,1.0,1.0
438488,0.003263,0.000090,0.016202,1.0,0.0,0.0,1.0,1.0
717419,0.014137,0.000271,0.000678,1.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...
749995,0.000707,0.000026,0.007384,1.0,0.0,0.0,1.0,0.0
749996,0.001590,0.000734,0.004837,1.0,0.0,0.0,0.0,0.0
749997,0.002535,0.001789,0.005618,1.0,1.0,0.0,1.0,0.0
749998,0.007837,0.000023,0.002461,1.0,1.0,0.0,1.0,0.0


In [95]:
# Create a new LogisticRegression instance and fit it on the balanced (oversampled) data

X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)

In [96]:
pred = log_reg.predict(X_test_norm)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    227999
         1.0       0.56      0.94      0.70     22001

    accuracy                           0.93    250000
   macro avg       0.78      0.93      0.83    250000
weighted avg       0.96      0.93      0.94    250000





In [97]:
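### Fraud recall is much better now (0.94 vs. 0.41): the model detects most of the fraudulent transactions.
### Fraud precision has dropped (0.56 vs. 0.92), so more legitimate transactions are flagged as fraud.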
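
To make the trade-off concrete, the sketch below turns the rounded precision and recall values from the two classification reports into approximate counts. The helper `estimate_counts` is hypothetical, written only for this illustration, and because the inputs are rounded to two decimals the results are rough estimates rather than the exact confusion matrix.

```python
def estimate_counts(support, precision, recall):
    """Approximate caught, missed and false-alarm counts from rounded metrics."""
    true_pos = round(support * recall)     # fraud cases the model caught
    flagged = round(true_pos / precision)  # all transactions flagged as fraud
    return {
        "caught": true_pos,
        "missed": support - true_pos,
        "false_alarms": flagged - true_pos,
    }

# Fraud-class values read from the reports (support = 22001 test-set fraud cases)
baseline = estimate_counts(22001, precision=0.92, recall=0.41)
oversampled = estimate_counts(22001, precision=0.56, recall=0.94)

print("baseline:   ", baseline)
print("oversampled:", oversampled)
```

Under these estimates, oversampling cuts missed fraud from roughly 13,000 cases to roughly 1,300, at the cost of around 16,000 false alarms instead of fewer than 1,000.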

In [98]:
# 5. Now, run Undersample in order to balance our target variable and repeat the steps above (1-3), now with balanced data. 
# Does it improve the performance of our model?

fraud_0_under = resample(fraud_0,
                         replace=False,            # sample without replacement
                         n_samples=len(fraud_1),  # to match minority class
                         random_state=42)         # reproducible results


In [99]:
train_under = pd.concat([fraud_1, fraud_0_under])
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]
log_reg.fit(X_train_under, y_train_under)

In [101]:
pred = log_reg.predict(X_test_norm)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.90      0.94    227999
         1.0       0.47      0.92      0.62     22001

    accuracy                           0.90    250000
   macro avg       0.73      0.91      0.78    250000
weighted avg       0.95      0.90      0.91    250000





In [102]:
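### Performance is slightly worse than with oversampling (fraud recall 0.92 vs. 0.94, precision 0.47 vs. 0.56),
### but fraud recall and F1 are still better than with the original imbalanced data (0.92 vs. 0.41 and 0.62 vs. 0.57).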
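
One possible reason undersampling trails oversampling is that it discards most of the majority-class rows, leaving the model with less data. The sketch below compares the two resampling strategies on a toy frame; `toy_train` and its 900/100 split are assumptions made for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set: 900 majority rows (0) and 100 minority rows (1)
toy_train = pd.DataFrame({"x": range(1000), "fraud": [0] * 900 + [1] * 100})
majority = toy_train[toy_train["fraud"] == 0]
minority = toy_train[toy_train["fraud"] == 1]

# Undersampling: keep only as many majority rows as there are minority rows
majority_under = resample(majority, replace=False,
                          n_samples=len(minority), random_state=42)
# Oversampling: repeat minority rows until they match the majority count
minority_over = resample(minority, replace=True,
                         n_samples=len(majority), random_state=42)

under_set = pd.concat([minority, majority_under])
over_set = pd.concat([minority_over, majority])

print(len(under_set), len(over_set))    # sizes of the two balanced sets
print(minority_over["x"].nunique())     # distinct minority rows after oversampling
```

Both sets are balanced, but the undersampled one keeps only 200 of the 1,000 rows, while the oversampled one contains repeated copies of at most 100 distinct minority rows.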

In [104]:
# 6. Finally, run SMOTE in order to balance our target variable and repeat the steps above (1-3), now with balanced data.
# Does it improve the performance of our model? 

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_norm, y_train)
log_reg.fit(X_train_smote, y_train_smote)

In [105]:
pred = log_reg.predict(X_test_norm)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    227999
         1.0       0.56      0.94      0.70     22001

    accuracy                           0.93    250000
   macro avg       0.78      0.93      0.83    250000
weighted avg       0.96      0.93      0.94    250000



In [106]:
### Similar performance to random oversampling: the reported metrics are identical to two decimals.
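
Unlike random oversampling, which copies existing minority rows, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that interpolation idea, not imblearn's actual implementation; the function `smote_like_sample` and the points in `minority_points` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny minority class in 2D (illustrative values)
minority_points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def smote_like_sample(points, k=2):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbour_idx = np.argsort(distances)[1:k + 1]  # skip the point itself (distance 0)
    j = rng.choice(neighbour_idx)
    lam = rng.random()  # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority_points) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so it stays inside the region the minority class already occupies.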