# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home where the transaction happened.
- **distance_from_last_transaction:** the distance from last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of purchased price transaction to median purchase price.
- **repeat_retailer:** Is the transaction happened from same retailer.
- **used_chip:** Is the transaction through chip (credit card).
- **used_pin_number:** Is the transaction happened by using PIN number.
- **online_order:** Is the transaction an online order.
- **fraud:** Is the transaction fraudulent. **0=legit** -  **1=fraud**


In [24]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [7]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
df.head()


Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take in consideration class importance, and evaluate it by selection the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?

- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [16]:
#1. Yes, it's imbalanced
df['fraud'].value_counts()

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64

In [20]:
features = df.drop(columns = "fraud")
target = df.fraud
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [26]:
normalizer = MinMaxScaler()
normalizer.fit(X_train)

In [28]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [84]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)

In [86]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

log_reg = LogisticRegression(random_state=42) 


log_reg.fit(X_train_norm, y_train)


y_pred_train = log_reg.predict(X_train_norm)  # Predictions on training set
y_pred_test = log_reg.predict(X_test_norm)    # Predictions on test set


print("\nTraining Set Performance:")
print(f"Accuracy: {accuracy_score(y_train, y_pred_train): .4f}")
print("\nTest Set Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_test): .4f}")
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred_test))


Training Set Performance:
Accuracy:  0.9448

Test Set Performance:
Accuracy:  0.9455

Classification Report (Test Set):
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97    182615
         1.0       0.92      0.41      0.57     17385

    accuracy                           0.95    200000
   macro avg       0.93      0.70      0.77    200000
weighted avg       0.94      0.95      0.94    200000



In [34]:
X_train_norm["fraud"] = y_train_norm.values

In [36]:
fraud = X_train_norm[X_train_norm["fraud"] == 1]
no_fraud = X_train_norm[X_train_norm["fraud"] == 0]

In [42]:
from sklearn.utils import resample
fraud_oversampled = resample(fraud, 
                                    replace=True, 
                                    n_samples = len(no_fraud),
                                    random_state=0)

In [44]:
train_over = pd.concat([fraud, no_fraud])
train_over

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
32,0.037535,0.000021,0.010371,1.0,0.0,0.0,1.0,1.0
49,0.000112,0.000066,0.020619,0.0,0.0,0.0,0.0,1.0
74,0.001591,0.000129,0.016457,1.0,0.0,0.0,1.0,1.0
111,0.000017,0.007251,0.014927,0.0,1.0,0.0,1.0,1.0
114,0.011131,0.000088,0.000781,1.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...
799995,0.006690,0.000262,0.000437,1.0,0.0,0.0,1.0,0.0
799996,0.000315,0.001097,0.007914,1.0,0.0,0.0,1.0,0.0
799997,0.002521,0.000372,0.001792,1.0,0.0,1.0,0.0,0.0
799998,0.001142,0.000003,0.006678,1.0,0.0,0.0,1.0,0.0


In [46]:
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [48]:
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)

In [52]:
pred = log_reg.predict(X_test_norm)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97    182615
         1.0       0.92      0.41      0.57     17385

    accuracy                           0.95    200000
   macro avg       0.93      0.70      0.77    200000
weighted avg       0.94      0.95      0.94    200000



In [56]:
no_undersampled = resample(fraud, 
                                    replace=False, 
                                    n_samples = len(fraud),
                                    random_state=0)


In [58]:
train_under = pd.concat([no_fraud, fraud])
train_under

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,0.000331,0.000083,0.002352,1.0,1.0,0.0,1.0,0.0
1,0.003688,0.000002,0.002246,1.0,1.0,0.0,1.0,0.0
2,0.005209,0.000082,0.001718,1.0,0.0,0.0,1.0,0.0
3,0.000056,0.000045,0.001963,0.0,1.0,0.0,0.0,0.0
4,0.002078,0.000002,0.002453,1.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
799941,0.000419,0.001097,0.066801,1.0,1.0,0.0,1.0,1.0
799947,0.011019,0.000104,0.000503,1.0,0.0,0.0,1.0,1.0
799959,0.000671,0.000013,0.014943,1.0,0.0,0.0,1.0,1.0
799960,0.013652,0.000002,0.002138,1.0,0.0,0.0,1.0,1.0


In [60]:
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

In [70]:
log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)

In [72]:
pred = log_reg.predict(X_test_norm)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97    182615
         1.0       0.92      0.41      0.57     17385

    accuracy                           0.95    200000
   macro avg       0.93      0.70      0.77    200000
weighted avg       0.94      0.95      0.94    200000



In [88]:
from imblearn.over_sampling import SMOTE

In [92]:
sm = SMOTE(random_state = 1,sampling_strategy=1.0)
X_train_sm,y_train_sm = sm.fit_resample(X_train_norm,y_train)

In [94]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [98]:
pred = log_reg.predict(X_test_norm)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.92      0.96    182615
         1.0       0.54      0.93      0.68     17385

    accuracy                           0.92    200000
   macro avg       0.77      0.93      0.82    200000
weighted avg       0.95      0.92      0.93    200000

