# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [187]:
# Only the imports this lab actually uses
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [None]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

In [None]:
fraud.info()

**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

**1** DISTRIBUTION

In [65]:
# Calculate the percentage of each unique value in the "fraud" column
fraud_percentage = fraud["fraud"].value_counts(normalize=True) * 100
print(fraud_percentage)

fraud
0.0    91.2597
1.0     8.7403
Name: proportion, dtype: float64


We are facing an imbalanced dataset: about 91% of the transactions are "legit" and only about 9% are "fraud".
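As a quick sanity check, the proportions printed above can be turned into an imbalance ratio and a majority-class baseline. The sketch below only reuses the two percentages from that output; the variable names are illustrative and not part of the lab.

```python
# Class proportions copied from the value_counts(normalize=True) output above
legit_pct = 91.2597
fraud_pct = 8.7403

# How many legit transactions there are for each fraudulent one
imbalance_ratio = legit_pct / fraud_pct

# Accuracy of a trivial model that always predicts "legit" (class 0)
baseline_accuracy = legit_pct / 100

print(f"legit : fraud ratio   = {imbalance_ratio:.1f} : 1")
print(f"always-legit accuracy = {baseline_accuracy:.4f}")
```

Any accuracy close to that baseline therefore says little about how well fraud is detected.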

**2** LOGISTIC REGRESSION

In [67]:
features = fraud.drop(columns = ["fraud"])
target = fraud["fraud"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=0)

In [69]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled_np = scaler.transform(X_train)
X_test_scaled_np = scaler.transform(X_test)

In [76]:
log_reg = LogisticRegression()

In [78]:
X_train_scaled_df = pd.DataFrame(X_train_scaled_np, columns=X_train.columns, index=X_train.index)
X_test_scaled_df  = pd.DataFrame(X_test_scaled_np, columns=X_test.columns, index=X_test.index)

In [81]:
log_reg.fit(X_train_scaled_df, y_train)
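The scale-then-fit steps above can also be bundled into a single estimator. The following is a minimal sketch on synthetic data, assuming scikit-learn's `make_pipeline` and a `make_classification` stand-in with roughly 9% positives; neither the synthetic data nor the `stratify=y` split is part of the original lab, so the numbers will differ from the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fraud data: 7 features, roughly 9% positives (assumption)
X, y = make_classification(
    n_samples=2000, n_features=7, weights=[0.91, 0.09], random_state=0
)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data only, then the classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the test data is transformed with statistics learned from the training split only, which is what the manual `scaler.fit(X_train)` step achieves.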

**3** MODEL EVALUATION

In [83]:
log_reg.score(X_test_scaled_df, y_test)

0.959012

In [90]:
y_pred_test_log = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = y_pred_test_log, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    228273
         1.0       0.89      0.60      0.72     21727

    accuracy                           0.96    250000
   macro avg       0.93      0.80      0.85    250000
weighted avg       0.96      0.96      0.96    250000



Accuracy alone is misleading here: a model that labels every transaction as legit would already reach about 91% accuracy. Fraud is the class that matters, so the relevant metrics are the precision, recall and F1-score of class 1. The 0.96 accuracy hides a fraud recall of only 0.60, which means that roughly 40% of fraudulent transactions go undetected.
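To illustrate why accuracy is a poor metric on imbalanced data, here is a small sketch that scores a "model" which never flags fraud. The labels are synthetic, generated with roughly 9% positives as an assumption that mirrors this dataset; they are not the lab's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly 9% fraud (assumption mirroring the dataset)
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.09).astype(int)

# A "model" that never flags fraud: every prediction is 0 (legit)
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy:     {accuracy:.3f}")
print(f"fraud recall: {fraud_recall:.3f}")
```

The accuracy looks respectable even though not a single fraud is caught, which is why the fraud-class recall and precision are the numbers to watch.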


**4** OVERSAMPLE

In [107]:
train = pd.DataFrame(X_train_scaled_np, columns=X_train.columns, index=X_train.index)

In [109]:
train["fraud"] = y_train.values

In [112]:
# Note: this rebinds `fraud` from the full dataset to the fraud-only training rows
fraud = train[train["fraud"] == 1]
no_fraud = train[train["fraud"] == 0]

In [130]:
len(fraud),len(no_fraud)

(65676, 684324)

In [123]:
yes_oversampled = resample(fraud, replace=True, n_samples = len(no_fraud), random_state=0)

In [None]:
train_over = pd.concat([yes_oversampled, no_fraud])
train_over

In [None]:
fraud_plt = train_over["fraud"].value_counts()
fraud_plt.plot(kind="bar")
plt.show()

In [142]:
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [144]:
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)

In [146]:
y_pred_test_log = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = y_pred_test_log, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228273
         1.0       0.57      0.95      0.71     21727

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



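This model does not improve my first results overall: the F1-score for fraud is practically unchanged (0.71 vs. 0.72) and accuracy drops from 0.96 to 0.93. What changes is the balance between the two error types: fraud recall rises from 0.60 to 0.95 while precision falls from 0.89 to 0.57.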
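The fraud rows of the two classification reports can be compared more concretely by converting them into approximate error counts. This is a rough sketch: the precision and recall values are copied from the reports, which round to two decimals, so the resulting counts are estimates only. The helper names `f1` and `approx_errors` are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def approx_errors(precision, recall, n_fraud):
    """Estimate missed frauds and false alarms from rounded metrics."""
    true_pos = recall * n_fraud
    missed = n_fraud - true_pos                              # false negatives
    false_alarms = true_pos * (1 - precision) / precision    # false positives
    return round(missed), round(false_alarms)


n_fraud = 21727  # support of class 1.0 in the test set

# (precision, recall) for class 1.0, copied from the two reports
reports = {"original": (0.89, 0.60), "oversampled": (0.57, 0.95)}

for name, (p, r) in reports.items():
    missed, false_alarms = approx_errors(p, r, n_fraud)
    print(f"{name:12s} F1={f1(p, r):.2f}  missed~{missed}  false alarms~{false_alarms}")
```

Under these estimates the oversampled model misses far fewer frauds but raises many more false alarms, so which model is preferable depends on the relative cost of the two errors.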

**5** UNDERSAMPLE

In [None]:
no_undersampled = resample(no_fraud, replace=False, n_samples=len(fraud), random_state=0)
no_undersampled

In [None]:
train_under = pd.concat([no_undersampled, fraud])
train_under

In [None]:
fraud_plt = train_under["fraud"].value_counts()
fraud_plt.plot(kind="bar")
plt.show()

In [170]:
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

In [173]:
log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)

In [177]:
y_pred_test_log = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = y_pred_test_log, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228273
         1.0       0.57      0.95      0.71     21727

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



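This model does not improve my first results either. The report is identical to the oversampled one: higher fraud recall (0.95), lower precision (0.57), and an F1-score for fraud that stays at 0.71.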
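One practical difference between random oversampling and random undersampling is the size of the training set each one produces. The short calculation below uses only the class counts printed earlier by `len(fraud), len(no_fraud)`; the variable names are illustrative.

```python
# Training-set class counts, copied from the len(fraud), len(no_fraud) output
n_fraud, n_legit = 65676, 684324

# Random oversampling repeats fraud rows until they match the legit count
oversampled_rows = 2 * n_legit

# Random undersampling keeps only as many legit rows as there are fraud rows
undersampled_rows = 2 * n_fraud
discarded_legit = n_legit - n_fraud
discarded_share = discarded_legit / n_legit

print(f"oversampled training rows:  {oversampled_rows}")
print(f"undersampled training rows: {undersampled_rows}")
print(f"legit rows discarded:       {discarded_legit} ({discarded_share:.1%})")
```

Undersampling trains much faster on the smaller set but throws away most of the legit rows, while oversampling keeps every row at the cost of duplicating fraud rows many times.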

**6** SMOTE

In [196]:
sm = SMOTE(random_state=1, sampling_strategy=1.0)

In [198]:
# fit_resample() returns the resampled features and target; only the training data is resampled
X_train_sm, y_train_sm = sm.fit_resample(X_train_scaled_df, y_train)

In [199]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [201]:
y_pred_test_log = log_reg.predict(X_test_scaled_df)
print(classification_report(y_pred = y_pred_test_log, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    228273
         1.0       0.57      0.95      0.71     21727

    accuracy                           0.93    250000
   macro avg       0.78      0.94      0.84    250000
weighted avg       0.96      0.93      0.94    250000



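This model does not improve my first results. SMOTE yields the same report as the two random resampling strategies, so all three approaches trade fraud precision for fraud recall without raising the F1-score above that of the first model.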
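For intuition about what SMOTE adds compared with plain duplication, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy illustration of the idea, not imblearn's implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new points on segments between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate: x_new = x + lam * (x_neighbour - x), with lam in [0, 1)
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])


# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new rows are plausible variations rather than exact copies.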