# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction took place.
- **distance_from_last_transaction:** Distance from where the previous transaction took place.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction took place at the same retailer.
- **used_chip:** Whether the transaction was made using the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [3]:
# Libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

index,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **5.** Now, run **Undersample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **6.** Finally, run **SMOTE** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?

In [4]:
fraud["fraud"].value_counts()

# The dataset is heavily imbalanced: only about 8.7% of the transactions are fraudulent.

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64
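
The raw counts are easier to judge as proportions. The sketch below is a minimal illustration that re-enters the two counts printed above by hand; in the notebook itself, `fraud["fraud"].value_counts(normalize=True)` should give the same shares directly. The variable names `counts`, `proportions` and `imbalance_ratio` are illustrative and not part of the original lab.

```python
import pandas as pd

# Class counts copied by hand from the value_counts() output above.
counts = pd.Series({0.0: 912597, 1.0: 87403}, name="count")

# Share of each class in the full dataset.
proportions = counts / counts.sum()

# How many legitimate transactions there are per fraudulent one.
imbalance_ratio = counts[0.0] / counts[1.0]

print(proportions.round(4))
print(f"legit : fraud = {imbalance_ratio:.1f} : 1")
```

With roughly ten legitimate transactions for every fraudulent one, a model that always predicts "legit" would already reach about 91% accuracy, which is why accuracy alone is a weak yardstick here.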

In [5]:
features = fraud.drop(columns = ["fraud"])
target = fraud["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target)
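
The split above uses the defaults, so it is neither reproducible (no `random_state`) nor guaranteed to keep the same fraud share in train and test. One possible variation, shown here on made-up toy data rather than the fraud dataset, is to pass `stratify` and `random_state`. The names `X_toy`, `y_toy` and the seed `42` are assumptions for illustration only, and using this variant in the lab would change the numbers reported in the outputs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 200 rows with a 10% positive class (not the fraud dataset).
X_toy = np.arange(200).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 180)

# stratify keeps the class ratio close to identical in both splits;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```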

In [6]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [7]:
log_reg = LogisticRegression()

log_reg.fit(X_train_scaled, y_train)

In [8]:
log_reg.score(X_test_scaled, y_test)

0.959232

In [9]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    228357
         1.0       0.89      0.60      0.72     21643

    accuracy                           0.96    250000
   macro avg       0.93      0.80      0.85    250000
weighted avg       0.96      0.96      0.96    250000
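
The report shows why accuracy is a poor guide here: overall accuracy is 0.96, yet recall for the fraud class is only 0.60, meaning about 40% of fraudulent transactions in the test set are missed. The toy example below, with made-up labels that do not come from the dataset, sketches how accuracy and minority-class recall can diverge and how the individual metrics can be computed for the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 8 legitimate (0) and 2 fraudulent (1) transactions.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that catches only one of the two frauds.
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)
precision = precision_score(y_true_toy, y_pred_toy, pos_label=1)

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
print(f"missed frauds (false negatives): {fn}")
```

If missing a fraud is costlier than flagging a legitimate transaction, recall on class 1 is a reasonable metric to prioritise, with precision kept in view as the cost of false alarms.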



In [10]:
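# Rebuild a DataFrame from the scaled training features so rows can be split by class.
train = pd.DataFrame(X_train_scaled, columns = X_train.columns)

In [11]:
# Assign labels by position (.values), because train has a fresh 0..n-1 index.
train["fraud"] = y_train.values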

In [12]:
yes_fraud = train[train["fraud"] == 1]
no_fraud = train[train["fraud"] == 0]

In [13]:
yes_fraud_oversampled = resample(yes_fraud, 
                                    replace=True, 
                                    n_samples = len(no_fraud),
                                    random_state=0)
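
To see what `resample` does with `replace=True`, here is a small sketch on a made-up six-row frame (the `amount` column and its values are hypothetical). It draws minority rows with replacement until the minority class matches the majority class in size, which is the same idea applied to `yes_fraud` above.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up frame: 4 legitimate rows and 2 fraudulent rows.
toy = pd.DataFrame(
    {"amount": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "fraud": [0, 0, 0, 0, 1, 1]}
)
minority = toy[toy["fraud"] == 1]
majority = toy[toy["fraud"] == 0]

# Sample minority rows with replacement until both classes have equal size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

balanced = pd.concat([majority, minority_up])
class_counts = balanced["fraud"].value_counts()
print(class_counts)
```

Oversampling only duplicates existing minority rows, so it adds no new information. It should be applied to the training set only, as the lab does, so that duplicated rows never leak into the test set.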

In [15]:
train_over = pd.concat([yes_fraud_oversampled, no_fraud])
train_over

index,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
496808,-0.346762,2.648336,-0.532013,0.365966,-0.734126,-0.334754,0.732909,1.0
486110,-0.086340,1.840318,0.421988,0.365966,-0.734126,-0.334754,0.732909,1.0
523592,-0.226048,-0.100050,1.391003,0.365966,-0.734126,-0.334754,0.732909,1.0
243288,-0.353951,-0.058213,1.149028,0.365966,1.362163,-0.334754,0.732909,1.0
479041,-0.072961,3.453584,-0.012446,0.365966,-0.734126,-0.334754,0.732909,1.0
...,...,...,...,...,...,...,...,...
749995,-0.343658,-0.145783,0.132175,0.365966,-0.734126,-0.334754,0.732909,0.0
749996,-0.351073,-0.130469,-0.545778,0.365966,-0.734126,-0.334754,0.732909,0.0
749997,-0.302696,0.254503,-0.469580,0.365966,1.362163,-0.334754,-1.364425,0.0
749998,-0.379135,-0.172589,-0.240421,0.365966,-0.734126,-0.334754,0.732909,0.0


In [16]:
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [17]:
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)

In [18]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228357
         1.0       0.57      0.95      0.71     21643

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



In [19]:
no_fraud_undersampled = resample(no_fraud, 
                                    replace=False, 
                                    n_samples = len(yes_fraud),
                                    random_state=0)
no_fraud_undersampled

index,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
594100,-0.176137,-0.164726,-0.551061,0.365966,-0.734126,-0.334754,0.732909,0.0
594774,-0.309086,-0.151983,0.139642,0.365966,1.362163,-0.334754,0.732909,0.0
743542,-0.372660,-0.166621,-0.473658,0.365966,1.362163,-0.334754,-1.364425,0.0
241370,0.166760,-0.134001,-0.223304,0.365966,-0.734126,-0.334754,0.732909,0.0
74377,-0.362717,-0.070143,-0.318969,0.365966,-0.734126,-0.334754,0.732909,0.0
...,...,...,...,...,...,...,...,...
188563,0.741779,-0.183761,-0.452655,0.365966,1.362163,-0.334754,0.732909,0.0
273684,-0.272665,-0.172353,0.637322,0.365966,1.362163,-0.334754,0.732909,0.0
53010,-0.376836,-0.168265,-0.376394,0.365966,1.362163,-0.334754,0.732909,0.0
26409,-0.189865,-0.181876,-0.349690,0.365966,-0.734126,-0.334754,0.732909,0.0


In [20]:
train_under = pd.concat([no_fraud_undersampled, yes_fraud])
train_under

index,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
594100,-0.176137,-0.164726,-0.551061,0.365966,-0.734126,-0.334754,0.732909,0.0
594774,-0.309086,-0.151983,0.139642,0.365966,1.362163,-0.334754,0.732909,0.0
743542,-0.372660,-0.166621,-0.473658,0.365966,1.362163,-0.334754,-1.364425,0.0
241370,0.166760,-0.134001,-0.223304,0.365966,-0.734126,-0.334754,0.732909,0.0
74377,-0.362717,-0.070143,-0.318969,0.365966,-0.734126,-0.334754,0.732909,0.0
...,...,...,...,...,...,...,...,...
749942,-0.176960,0.357443,1.774296,0.365966,1.362163,-0.334754,0.732909,1.0
749955,-0.380140,-0.178197,1.075593,0.365966,1.362163,-0.334754,0.732909,1.0
749959,-0.380150,-0.167983,1.049823,0.365966,-0.734126,-0.334754,0.732909,1.0
749961,-0.330416,-0.184752,2.253368,0.365966,1.362163,-0.334754,0.732909,1.0


In [21]:
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

In [22]:
log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)

In [23]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228357
         1.0       0.57      0.95      0.71     21643

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



In [24]:
from imblearn.over_sampling import SMOTE

In [25]:
sm = SMOTE(random_state=1, sampling_strategy=1.0)

In [26]:
X_train_sm, y_train_sm = sm.fit_resample(X_train_scaled, y_train)
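
Unlike plain oversampling, SMOTE does not copy minority rows. It creates synthetic points by interpolating between a minority sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates that interpolation step only. It is not imblearn's implementation: it uses a single nearest neighbour for simplicity (imblearn's `SMOTE` defaults to `k_neighbors=5` and picks among them), and the helper `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in a 2-D feature space.
minority_points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])


def smote_like_sample(points, rng):
    """Return one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    # Distances from the chosen point to every minority point.
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf  # exclude the point itself
    j = int(np.argmin(dists))
    # Random position on the segment between the two points.
    gap = rng.random()
    return points[i] + gap * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority_points, rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.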

In [27]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [28]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228357
         1.0       0.57      0.95      0.71     21643

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000
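
To answer the "does it improve performance?" questions in one place, the fraud-class rows of the four reports above can be collected in a small table. The sketch below re-enters those figures by hand (rounded to two decimals, as printed); the `results` frame and the `recall_gain` column are illustrative additions, not part of the original lab.

```python
import pandas as pd

# Fraud-class (1.0) metrics copied by hand from the four classification reports.
results = pd.DataFrame(
    {
        "precision": [0.89, 0.57, 0.57, 0.57],
        "recall": [0.60, 0.95, 0.95, 0.95],
        "f1": [0.72, 0.71, 0.71, 0.71],
    },
    index=["baseline", "oversample", "undersample", "SMOTE"],
)

# Change in fraud recall relative to the model trained on imbalanced data.
results["recall_gain"] = results["recall"] - results.loc["baseline", "recall"]

print(results)
```

In these runs, all three balancing methods give the same rounded scores: fraud recall rises from 0.60 to 0.95, precision falls from 0.89 to 0.57, and F1 stays almost unchanged (0.72 versus 0.71). Whether that counts as an improvement depends on how the cost of a missed fraud compares with the cost of a false alarm.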

