# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [3]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
df_fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
df_fraud.head()

**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [None]:
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample

In [None]:
df_fraud.isnull().sum()

In [None]:
df_fraud.shape

In [None]:
df_fraud["fraud"].value_counts()

In [None]:
# 1. What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
# Yes: 91.26% of the rows have fraud = 0 (legitimate), so only about 8.74% are fraud.

# 2. Train a LogisticRegression.
# As a baseline, we train a logistic regression on the imbalanced data.
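
The class shares quoted above can be read directly from `value_counts(normalize=True)`. The sketch below uses a hypothetical toy target (`toy_target`, not the real dataset) so it runs without downloading anything; the same two calls apply to `df_fraud["fraud"]`.

```python
import pandas as pd

# Hypothetical toy target: 10 transactions, 1 of them fraudulent.
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="fraud")

# Absolute counts per class, then the same counts as fractions of the total.
counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True)

# How many majority rows exist for each minority row.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print("imbalance ratio:", imbalance_ratio)
```

A ratio far from 1 is a quick signal that accuracy alone will be a misleading metric.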

In [None]:
cl_fraud = df_fraud["fraud"].value_counts()
cl_fraud.plot(kind="bar")
plt.show()

In [None]:
features = df_fraud.drop(columns = ["fraud"])
target = df_fraud["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target)
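
With a rare positive class, a plain random split can leave the train and test sets with slightly different fraud shares, and without a seed the split changes on every run. One possible variant, shown here on hypothetical synthetic arrays (`X_toy`, `y_toy`), passes `stratify` and `random_state` to `train_test_split`; this is a sketch of the option, not the split used for the results reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 10% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify keeps the class proportions equal in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```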

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
log_reg = LogisticRegression()

In [None]:
log_reg.fit(X_train_scaled, y_train)

In [None]:
log_reg.score(X_test_scaled, y_test)

In [None]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

In [None]:
# 3. Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
# High accuracy (95.91%): the model classifies most cases correctly, but with 91.26% legitimate transactions, always predicting "legit" would already score about 91%.
# Precision for class 1 (fraud): 89%. When the model predicts fraud, it is correct 89% of the time.
# Recall for class 1 (fraud): 60%. The model misses 40% of actual fraud cases. Fraud recall matters more than overall accuracy.
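
To make the difference between accuracy, precision and recall concrete, the sketch below derives all three from a confusion matrix. The labels (`y_true_toy`, `y_pred_toy`) are hypothetical and chosen only for illustration; they are not model output from this lab.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 10 transactions, 4 of them actual frauds.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)   # of the predicted frauds, how many are real
recall_manual = tp / (tp + fn)      # of the real frauds, how many were caught
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print("precision:", precision_manual, precision_score(y_true_toy, y_pred_toy))
print("recall:", recall_manual, recall_score(y_true_toy, y_pred_toy))
print("accuracy:", accuracy_manual)
```

Here accuracy is 0.7 while recall is only 0.5: half of the frauds go undetected even though most predictions are correct.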

In [None]:
# 4. Run Oversample in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the model's performance?

In [None]:
train = pd.DataFrame(X_train_scaled, columns = X_train.columns)

In [None]:
train["fraud"] = y_train.values

In [None]:
# Add the target column "fraud" manually before resampling.

In [None]:
print(train.columns)  # Verify if "fraud" is present

In [None]:
true = train[train["fraud"] == 1]
false = train[train["fraud"] == 0]

In [None]:
true_oversampled = resample(true, 
                                    replace=True, 
                                    n_samples = len(false),
                                    random_state=0)

In [None]:
train_over = pd.concat([true_oversampled, false])
train_over

In [None]:
true_plt = train_over["fraud"].value_counts()
true_plt.plot(kind="bar")
plt.show()

In [None]:
# Split the oversampled training frame back into features and target.
# The test set (X_test_scaled, y_test) stays untouched so evaluation is done on real, imbalanced data.
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [None]:
log_reg = LogisticRegression()

In [None]:
# Fit on plain arrays so the model matches the array-based X_test_scaled.
log_reg.fit(X_train_over.to_numpy(), y_train_over)

In [None]:
log_reg.score(X_test_scaled, y_test)

In [None]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

In [None]:
# 4. Figures recorded for this step: accuracy 96%, fraud precision 89%, fraud recall 60%, macro avg recall 80%.
# These equal the baseline, which is what happens when the model is fitted on the original X_train_scaled instead of train_over.
# Fitted on train_over, fraud recall typically rises and fraud precision falls, as in the undersampling and SMOTE runs below.

In [None]:
# 5. Now, run Undersample in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
train

In [None]:
false_undersampled = resample(false, 
                                    replace=False, 
                                    n_samples = len(true),
                                    random_state=0)
false_undersampled

In [None]:
train_under = pd.concat([false_undersampled, true])
train_under

In [None]:
true_plt = train_under["fraud"].value_counts()
true_plt.plot(kind="bar")
plt.show()

In [None]:
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

In [None]:
log_reg = LogisticRegression()
# Fit on plain arrays so the model matches the array-based X_test_scaled.
log_reg.fit(X_train_under.to_numpy(), y_train_under)

In [None]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

In [None]:
# Accuracy dropped slightly (93%) compared to oversampling (96%).
# Precision for Fraud (Class 1) decreased (57%), meaning more false positives (incorrect fraud predictions).
# Recall for Fraud (Class 1) improved significantly (95%), meaning the model detects most actual fraud cases.
# Macro Avg Recall (94%) increased compared to oversampling (80%).

# If recall is more important, this model is the most useful.

In [None]:
# 6. Finally, run SMOTE in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
sm = SMOTE(random_state = 1,sampling_strategy=1.0)

In [None]:
X_train_sm,y_train_sm = sm.fit_resample(X_train_scaled,y_train)
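
Unlike random oversampling, SMOTE does not duplicate rows: it creates new minority points on the line segment between a minority sample and one of its nearest minority neighbours. The function below (`smote_like_samples`, a hypothetical name) is a minimal sketch of that interpolation idea using NumPy and scikit-learn; it is not imblearn's implementation and omits most of its options.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because, for distinct points, the nearest
    # neighbour of each point is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = rng.choice(neighbour_idx[base, 1:])  # skip the first column (self)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Hypothetical minority class: the four corners of the unit square.
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_min_toy, n_new=5)
print(new_points)
```

Every synthetic point lies between two existing minority points, so the new rows stay inside the region the minority class already occupies.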

In [None]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [None]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

In [None]:
# Create a df after SMOTE
train_smote = pd.DataFrame(X_train_sm, columns=X_train.columns)
train_smote["fraud"] = y_train_sm.values

In [None]:
# Count fraud after SMOTE
smote_plt = train_smote["fraud"].value_counts()

In [None]:
smote_plt.plot(kind="bar")
plt.title("Class Distribution SMOTE")
plt.xlabel("Fraud (0 = No, 1 = Yes)")
plt.ylabel("Count")
plt.show()

In [None]:
# Overall accuracy (93%) is slightly lower than with oversampling but still strong.
# Recall for fraud (class 1) increased significantly (95%).
# Precision for fraud (class 1) remains low (57%), meaning more false positives.
# Macro avg recall (94%) is higher than with oversampling.

# If recall matters most, this model is the most useful, together with the undersampled one (95% fraud recall in both).

In [None]:
# Continue in the lab-imbalance notebook and add grid search.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [None]:
# Normalizer

In [None]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [None]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [None]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)

In [None]:
# Grid Search

In [None]:
grid = {"n_neighbors": [3, 6, 9, 12, 15, 18]}

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()

In [None]:
# verbose=2 prints progress for every fold and candidate while fitting.
model = GridSearchCV(estimator=knn, param_grid=grid, cv=5, verbose=2)

In [None]:
model.fit(X_train_norm, y_train)

In [None]:
X_train_norm.info()

In [None]:
y_train.dtype

In [None]:
# The target column is "fraud"; cl_fraud is only the value_counts() Series created earlier.
features = df_fraud.drop(columns=['fraud'])
target = df_fraud['fraud']

In [None]:
knn = KNeighborsClassifier(n_neighbors = 9)

In [None]:
knn.fit(X_train_norm, y_train)
knn.score(X_test_norm, y_test)

In [None]:
grid = {"max_depth": [3, 6, 9, 12, 15, 18]}

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
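# Instantiate the tree before handing it to the grid search.
dt = DecisionTreeClassifier(random_state=0)
model = GridSearchCV(estimator = dt, param_grid = grid, cv=5)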

In [None]:
model.fit(X_train, y_train)

In [None]:
model.best_params_

In [None]:
# fraud is a class label, so use the classifier imported above rather than a regressor.
dt = DecisionTreeClassifier(max_depth=10)
dt.fit(X_train,y_train)

In [None]:
dt.score(X_test, y_test)

In [None]:
# After training, we check what are the best values for the hyperparameters that we have tested.

In [None]:
model.best_params_

In [None]:
# You can retrieve the best model with the best parameters when accessing best_estimator_ attribute.

In [None]:
best_model = model.best_estimator_
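
By default `GridSearchCV` ranks candidates by the estimator's own score (accuracy for classifiers), which this lab has shown to be a weak guide on imbalanced data. The sketch below passes `scoring="recall"` instead and then uses `best_estimator_` directly on held-out data. The synthetic dataset and the names `X_syn`, `y_syn`, `search` and `best_tree` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data with roughly 10% positives.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=7, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 6, 9]},
    scoring="recall",  # rank candidates by recall on the positive class
    cv=5,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refit on all of X_tr.
best_tree = search.best_estimator_
test_recall = recall_score(y_te, best_tree.predict(X_te))

print(search.best_params_)
print("test recall:", round(test_recall, 3))
```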

In [None]:
# Evaluate our model

In [None]:
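# ada_reg is not defined in this notebook; best_model (the tree refit by the grid search on the unscaled X_train) is used instead.
pred = best_model.predict(X_test)

# fraud is a class label, so report classification metrics rather than MAE, RMSE and R2.
print(classification_report(y_pred = pred, y_true = y_test))
print("Recall (fraud)", recall_score(y_test, pred))
print("Accuracy", best_model.score(X_test, y_test))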