# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home where the transaction happened.
- **distance_from_last_transaction:** the distance from last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of purchased price transaction to median purchase price.
- **repeat_retailer:** Is the transaction happened from same retailer.
- **used_chip:** Is the transaction through chip (credit card).
- **used_pin_number:** Is the transaction happened by using PIN number.
- **online_order:** Is the transaction an online order.
- **fraud:** Is the transaction fraudulent. **0=legit** -  **1=fraud**


In [122]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [67]:
fraud_df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud_df.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take in consideration class importance, and evaluate it by selection the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [9]:
# Check value counts for [fraud] column
fraud['fraud'].value_counts()

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64

By the value counts, we are dealing with an imbalanced dataset.

In [None]:
# Train logistic regression

# Test train split


In [69]:
# Test train split
features = fraud_df.drop(columns = ["fraud"])
target = fraud_df["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target)

In [71]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [73]:
log_reg = LogisticRegression()

In [75]:
log_reg.fit(X_train_scaled, y_train)

In [76]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    228338
         1.0       0.89      0.61      0.72     21662

    accuracy                           0.96    250000
   macro avg       0.93      0.80      0.85    250000
weighted avg       0.96      0.96      0.96    250000



## Oversample

In [79]:
len(fraud_df[fraud_df["fraud"]==0])

912597

In [81]:
train = pd.DataFrame(X_train_scaled, columns = X_train.columns)
train["fraud"] = y_train.values
yes_fraud = train[train["fraud"] == 1]
no_fraud = train[train["fraud"] == 0]

In [83]:
fraud_oversampled = resample(yes_fraud, 
                    replace=True, 
                    n_samples = len(no_fraud),
                    random_state=15)

In [86]:
train_over = pd.concat([fraud_oversampled, no_fraud])
train_over

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
87736,-0.373920,-0.171637,0.877035,0.366654,-0.735215,-0.334727,0.732866,1.0
605066,0.090310,-0.162682,1.653314,0.366654,1.360147,-0.334727,0.732866,1.0
92881,-0.283740,-0.064858,1.939474,0.366654,-0.735215,-0.334727,0.732866,1.0
559206,-0.119057,0.011857,1.121787,0.366654,-0.735215,-0.334727,0.732866,1.0
404578,-0.342575,0.349893,1.054828,0.366654,-0.735215,-0.334727,0.732866,1.0
...,...,...,...,...,...,...,...,...
749995,-0.208138,-0.160593,-0.531866,0.366654,-0.735215,-0.334727,0.732866,0.0
749996,-0.128501,-0.174351,-0.352374,0.366654,1.360147,-0.334727,0.732866,0.0
749997,-0.332108,-0.210681,-0.500756,0.366654,1.360147,-0.334727,0.732866,0.0
749998,-0.281791,0.318262,1.265886,0.366654,1.360147,-0.334727,-1.364505,0.0


In [90]:
train_over['fraud'].value_counts()

fraud
1.0    684259
0.0    684259
Name: count, dtype: int64

In [94]:
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [96]:
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)

In [98]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228338
         1.0       0.57      0.95      0.71     21662

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



This approach improves precision of no_fraud checks at the cost of recall. It radically reduces precision in the yes_fraud case but increases recall there dramatically as well.

## Undersample

In [103]:
no_fraud_undersampled = resample(no_fraud, 
                                    replace=False, 
                                    n_samples = len(fraud),
                                    random_state=0)

In [105]:
train_under = pd.concat([no_fraud_undersampled, fraud])
train_under

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
620859,-0.356859,-0.164751,-0.128666,0.366654,1.360147,-0.334727,0.732866,0.0
632125,0.411229,-0.213758,-0.110930,0.366654,1.360147,-0.334727,-1.364505,0.0
30058,0.025132,-0.174042,-0.510860,0.366654,-0.735215,-0.334727,0.732866,0.0
623995,-0.368454,0.152974,0.225094,0.366654,-0.735215,-0.334727,-1.364505,0.0
232710,0.252244,-0.166075,-0.408163,0.366654,-0.735215,-0.334727,0.732866,0.0
...,...,...,...,...,...,...,...,...
749930,-0.360327,-0.087676,2.519036,0.366666,-0.734623,-0.334364,0.733008,1.0
749939,-0.265880,-0.189941,1.125390,0.366666,-0.734623,-0.334364,0.733008,1.0
749963,-0.317752,-0.173856,1.933839,0.366666,-0.734623,-0.334364,0.733008,1.0
749983,1.960567,-0.188431,-0.555123,0.366666,-0.734623,-0.334364,0.733008,1.0


In [109]:
train_under['fraud'].value_counts()

fraud
0.0    65580
1.0    65580
Name: count, dtype: int64

In [112]:
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

In [115]:
log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)

In [118]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228338
         1.0       0.57      0.95      0.71     21662

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



As with the oversampling, the improvement in no_fraud precision comes at a cost of yes_fraud precision while improving yes_fraud recall.

## SMOTE

In [128]:
sm = SMOTE(random_state = 1,sampling_strategy=1.0)

In [131]:
X_train_sm,y_train_sm = sm.fit_resample(X_train_scaled,y_train)

In [133]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [136]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228338
         1.0       0.57      0.95      0.71     21662

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



This model appears to mirror the accuracy of the oversampling model. Linear regression remains the best overall performer.