# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction took place.
- **distance_from_last_transaction:** Distance from where the previous transaction took place.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction took place at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [3]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [4]:
# **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

target_distribution = fraud['fraud'].value_counts(normalize=True)
print(target_distribution)

fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64
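
With roughly 91% legitimate and 9% fraudulent transactions, the target is clearly imbalanced. As a minimal sketch of what that implies, the snippet below uses only the proportions printed above (the variable names are illustrative and not part of the lab) to compute the imbalance ratio and the accuracy a trivial model would reach by always predicting the majority class.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
proportions = {"legit": 0.912597, "fraud": 0.087403}

majority = max(proportions, key=proportions.get)
minority = min(proportions, key=proportions.get)

# How many majority-class rows there are per minority-class row.
imbalance_ratio = proportions[majority] / proportions[minority]

# Accuracy of a model that always predicts the majority class.
trivial_accuracy = proportions[majority]

print(f"Imbalance ratio ({majority} : {minority}): {imbalance_ratio:.1f} : 1")
print(f"Always-predict-{majority} accuracy: {trivial_accuracy:.3f}")
```

Any accuracy figure in the following steps is best read against this trivial baseline of about 0.91 rather than against 0.5.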


In [7]:
# **2.** Train a LogisticRegression.

X = fraud.drop(columns=['fraud'])
y = fraud['fraud']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a Logistic Regression model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.95928


In [9]:
# **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Confusion Matrix:
 [[181291   1228]
 [  6916  10565]]
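
The matrix follows scikit-learn's layout: rows are true classes and columns are predicted classes, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. As a rough worked example, the fraud-class metrics can be recomputed by hand from the counts printed above (the variable names are illustrative).

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 181291, 1228
fn, tp = 6916, 10565

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # of predicted frauds, how many are real
recall = tp / (tp + fn)               # of real frauds, how many are caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.5f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(f"missed frauds: {fn} of {tp + fn}")
```

Accuracy looks strong because the 181,291 true negatives dominate the total, while the recall shows that 6,916 of the 17,481 fraudulent transactions go undetected.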


In [10]:
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['Not Fraud', 'Fraud']))

Classification Report:
               precision    recall  f1-score   support

   Not Fraud       0.96      0.99      0.98    182519
       Fraud       0.90      0.60      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000



The original model has high accuracy (0.96) and high precision for the fraud class (0.90), but its recall for the fraud class is low (0.60): it misses about 40% of the fraudulent transactions.

In [20]:
# **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# Train the Logistic Regression model on the oversampled data
log_reg_ros = LogisticRegression(max_iter=10000)
log_reg_ros.fit(X_train_ros, y_train_ros)

# Make predictions
y_pred_ros = log_reg_ros.predict(X_test)

# Evaluate the oversampled model
print("Oversampled Accuracy:", accuracy_score(y_test, y_pred_ros))


Oversampled Accuracy: 0.93482
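
To make the idea behind random oversampling concrete, here is a minimal standard-library sketch: minority-class rows are duplicated at random, with replacement, until every class is as large as the biggest one. This is an illustration of the concept only, not imblearn's implementation, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data: five legit rows, one fraud row.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [50.0]]
labels = [0, 0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(rows, labels)
print(Counter(y_bal))
```

Because the added rows are exact copies, no new information enters the training set; the duplicates only change how much weight the minority class carries during fitting.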


In [18]:
print("Oversampled Confusion Matrix:\n", confusion_matrix(y_test, y_pred_ros))


Oversampled Confusion Matrix:
 [[170393  12126]
 [   910  16571]]


In [19]:
print("Oversampled Classification Report:\n", classification_report(y_test, y_pred_ros, target_names=['Not Fraud', 'Fraud']))

Oversampled Classification Report:
               precision    recall  f1-score   support

   Not Fraud       0.99      0.93      0.96    182519
       Fraud       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000



Oversampling improves recall for the fraud class (from 0.60 to 0.95), but at the cost of more false positives (from 1,228 to 12,126), which lowers fraud precision from 0.90 to 0.58.
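
One way to quantify this trade-off is the F-beta score, which weights recall beta times as heavily as precision. The sketch below recomputes F1 and F2 for the fraud class from the two confusion matrices printed above. The choice of beta = 2 is an assumption made for illustration (it presumes a missed fraud costs more than a false alarm), not something the lab prescribes.

```python
def fbeta(tp, fp, fn, beta):
    """F-beta from raw counts: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Fraud-class counts copied from the confusion matrices above.
baseline = {"tp": 10565, "fp": 1228, "fn": 6916}
oversampled = {"tp": 16571, "fp": 12126, "fn": 910}

f1_base = fbeta(**baseline, beta=1)
f1_ros = fbeta(**oversampled, beta=1)
f2_base = fbeta(**baseline, beta=2)   # beta=2 is an illustrative assumption
f2_ros = fbeta(**oversampled, beta=2)

print(f"F1: baseline={f1_base:.2f} oversampled={f1_ros:.2f}")
print(f"F2: baseline={f2_base:.2f} oversampled={f2_ros:.2f}")
```

F1 rates both models at about 0.72, which hides the difference between them; a recall-weighted score separates them in favor of the oversampled model.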

In [22]:
# **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Train the Logistic Regression model on the undersampled data
log_reg_rus = LogisticRegression(max_iter=10000)
log_reg_rus.fit(X_train_rus, y_train_rus)

# Make predictions
y_pred_rus = log_reg_rus.predict(X_test)

# Evaluate the undersampled model
print("Undersampled Accuracy:", accuracy_score(y_test, y_pred_rus))


Undersampled Accuracy: 0.934735
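
Undersampling reaches almost the same accuracy with far less training data. The back-of-the-envelope estimate below assumes the stratified 80/20 split leaves the training set with about four times the per-class counts of the test set. The test supports (182,519 and 17,481) come from the classification reports; the exact training counts may differ by a row or two, so treat the results as approximations.

```python
# Per-class test-set counts, taken from the "support" column of the reports.
test_support = {"not_fraud": 182519, "fraud": 17481}
test_size = 0.2

# With a stratified split, train counts are roughly (1 - test_size) / test_size
# times the test counts, i.e. about 4x here.
scale = (1 - test_size) / test_size
train_est = {cls: round(n * scale) for cls, n in test_support.items()}

# Both samplers balance the classes 1:1 by default.
oversampled_rows = 2 * max(train_est.values())   # minority grown to majority size
undersampled_rows = 2 * min(train_est.values())  # majority cut to minority size

print("Estimated training counts:", train_est)
print("Rows after oversampling: ", oversampled_rows)
print("Rows after undersampling:", undersampled_rows)
```

Under these assumptions the undersampled training set is roughly a tenth the size of the oversampled one, which mainly matters for training time on larger models.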


In [23]:
print("Undersampled Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rus))


Undersampled Confusion Matrix:
 [[170384  12135]
 [   918  16563]]


In [24]:
print("Undersampled Classification Report:\n", classification_report(y_test, y_pred_rus, target_names=['Not Fraud', 'Fraud']))

Undersampled Classification Report:
               precision    recall  f1-score   support

   Not Fraud       0.99      0.93      0.96    182519
       Fraud       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000



In this case the classification report is the same as with oversampling, since the difference in the predictions is minimal, as a comparison of the two confusion matrices shows.
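
The comparison can be made explicit by subtracting the two matrices cell by cell. This is a small sketch using the counts printed above; the variable names are illustrative.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]].
oversampled = [[170393, 12126], [910, 16571]]
undersampled = [[170384, 12135], [918, 16563]]

# Cell-by-cell difference (undersampled minus oversampled).
diff = [
    [u - o for o, u in zip(row_o, row_u)]
    for row_o, row_u in zip(oversampled, undersampled)
]

total = sum(sum(row) for row in oversampled)

# Each prediction that moves changes two cells, so halve the absolute sum.
# This is the net shift, a lower bound on how many predictions differ.
net_shift = sum(abs(d) for row in diff for d in row) // 2

print("Difference:", diff)
print(f"Net shift: {net_shift} of {total} test rows ({net_shift / total:.4%})")
```

A net shift of this size is far too small to show up in a report rounded to two decimals.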

In [25]:
# **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train the Logistic Regression model on the SMOTE data
log_reg_smote = LogisticRegression(max_iter=10000)
log_reg_smote.fit(X_train_smote, y_train_smote)

# Make predictions
y_pred_smote = log_reg_smote.predict(X_test)

# Evaluate the SMOTE model
print("SMOTE Accuracy:", accuracy_score(y_test, y_pred_smote))


SMOTE Accuracy: 0.93519
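
Unlike random oversampling, SMOTE creates new minority rows by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below shows that core step with the standard library only. It is a simplified illustration, not imblearn's implementation; the function name, the toy points, and `k=2` are hypothetical choices.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class with two features.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [8.0, 8.0]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two real minority points, so the new rows are plausible variations rather than exact duplicates.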


In [26]:
print("SMOTE Confusion Matrix:\n", confusion_matrix(y_test, y_pred_smote))


SMOTE Confusion Matrix:
 [[170498  12021]
 [   941  16540]]


In [27]:
print("SMOTE Classification Report:\n", classification_report(y_test, y_pred_smote, target_names=['Not Fraud', 'Fraud']))

SMOTE Classification Report:
               precision    recall  f1-score   support

   Not Fraud       0.99      0.93      0.96    182519
       Fraud       0.58      0.95      0.72     17481

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000



In [None]:
SMOTE strikes a balance between precision and recall for both classes, so this would be the model to go with.
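
To put the four models side by side, the fraud-class precision and recall can be recomputed from the confusion matrices printed in the earlier cells. This is a summary sketch only; the dictionary layout and names are illustrative.

```python
# (TN, FP, FN, TP) copied from the four confusion matrices above.
matrices = {
    "Baseline":    (181291, 1228, 6916, 10565),
    "Oversample":  (170393, 12126, 910, 16571),
    "Undersample": (170384, 12135, 918, 16563),
    "SMOTE":       (170498, 12021, 941, 16540),
}

summary = {}
for name, (tn, fp, fn, tp) in matrices.items():
    summary[name] = {
        "precision": tp / (tp + fp),  # fraud-class precision
        "recall": tp / (tp + fn),     # fraud-class recall
        "fp": fp,
        "fn": fn,
    }

print(f"{'Model':<12}{'Precision':>10}{'Recall':>8}{'FP':>8}{'FN':>8}")
for name, m in summary.items():
    print(f"{name:<12}{m['precision']:>10.2f}{m['recall']:>8.2f}"
          f"{m['fp']:>8}{m['fn']:>8}")
```

On these numbers the three resampled models are nearly interchangeable: SMOTE produces slightly fewer false positives and slightly more false negatives than random oversampling or undersampling.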