# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home where the transaction happened.
- **distance_from_last_transaction:** the distance from last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of purchased price transaction to median purchase price.
- **repeat_retailer:** Is the transaction happened from same retailer.
- **used_chip:** Is the transaction through chip (credit card).
- **used_pin_number:** Is the transaction happened by using PIN number.
- **online_order:** Is the transaction an online order.
- **fraud:** Is the transaction fraudulent. **0=legit** -  **1=fraud**


In [22]:
#Libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.utils import resample


In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take in consideration class importance, and evaluate it by selection the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [4]:
fraud["fraud"].value_counts()

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64

In [10]:
lr = LogisticRegression(max_iter=1500)

In [11]:
features = fraud.drop(columns=["fraud"])
target = fraud["fraud"]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

In [19]:
lr.fit(X_train, y_train)

In [21]:
lr_pred = lr.predict(X_test)
recall_score(lr_pred, y_test)

0.8906816161892391

## Since we are looking to avoid False Negative, aka considering a client legit, when in fact he isnt, our priority will be recall_score ##

In [23]:
train = pd.DataFrame(X_train, columns=X_train.columns)
train["fraud"] = y_train.values
fradulent = train[train["fraud"] == 1]
legit = train[train["fraud"] == 0]

oversampled = resample(fradulent,
                      replace=True,
                      n_samples=len(legit),
                      random_state=42)

train_over = pd.concat([oversampled, legit])

In [25]:
train_over["fraud"].value_counts()

fraud
1.0    684339
0.0    684339
Name: count, dtype: int64

In [26]:
lr_2 = LogisticRegression(max_iter=1500)
features_over = train_over.drop(columns=["fraud"])
target_over = train_over["fraud"]

X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(features_over, target_over, random_state=42)

In [27]:
lr_2.fit(X_train_over, y_train_over)

In [28]:
lr_pred_over = lr_2.predict(X_test_over)
recall_score(lr_pred_over, y_test_over)

0.9337445765384217

In [29]:
undersampled = resample(legit,
                      replace=True,
                      n_samples=len(fradulent),
                      random_state=42)

train_under = pd.concat([undersampled, fradulent])

In [30]:
train_under["fraud"].value_counts()

fraud
0.0    65661
1.0    65661
Name: count, dtype: int64

In [31]:
lr_3 = LogisticRegression(max_iter=1500)
features_under = train_under.drop(columns=["fraud"])
target_under = train_under["fraud"]

X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(features_under, target_under, random_state=42)

In [32]:
lr_3.fit(X_train_under, y_train_under)

In [33]:
lr_pred_under = lr_3.predict(X_test_under)
recall_score(lr_pred_under, y_test_under)

0.9327508285628201

In [34]:
from imblearn.over_sampling import SMOTE

In [35]:
sm = SMOTE(random_state = 42,sampling_strategy=1.0)
X_train_sm,y_train_sm = sm.fit_resample(X_train,y_train)

In [36]:
log_reg = LogisticRegression(max_iter=1500)
log_reg.fit(X_train_sm, y_train_sm)

In [38]:
lr_pred_sm = log_reg.predict(X_test)
recall_score(lr_pred_sm, y_test)

0.5765504146538967