# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home where the transaction happened.
- **distance_from_last_transaction:** the distance from last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of purchased price transaction to median purchase price.
- **repeat_retailer:** Is the transaction happened from same retailer.
- **used_chip:** Is the transaction through chip (credit card).
- **used_pin_number:** Is the transaction happened by using PIN number.
- **online_order:** Is the transaction an online order.
- **fraud:** Is the transaction fraudulent. **0=legit** -  **1=fraud**


In [14]:
#Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE


In [3]:
# Load the dataset
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv"
fraud = pd.read_csv(url)

# Display the first few rows of the dataset
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take in consideration class importance, and evaluate it by selection the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [9]:
# Calculate the distribution of the target variable
fraud_counts = fraud['fraud'].value_counts(normalize=True)

fraud_counts

fraud
1.0    1.0
Name: proportion, dtype: float64

In [4]:
# Separate the features and the target variable
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [5]:
# Combine the training data back together
train_data = pd.concat([X_train, y_train], axis=1)

# Separate the minority and majority classes
not_fraud = train_data[train_data.fraud == 0]
fraud = train_data[train_data.fraud == 1]

# Oversample the minority class
fraud_upsampled = resample(fraud, replace=True, n_samples=len(not_fraud), random_state=42)

# Combine the majority class with the upsampled minority class
upsampled = pd.concat([not_fraud, fraud_upsampled])

# Separate the features and the target variable again
X_train = upsampled.drop('fraud', axis=1)
y_train = upsampled['fraud']


In [6]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)


In [7]:
# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Print evaluation metrics
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Precision: 0.5739726976647495
Recall: 0.9510122852003521
F1 Score: 0.7158825732476736
Confusion Matrix:
 [[255427  18444]
 [  1280  24849]]

Classification Report:
               precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    273871
         1.0       0.57      0.95      0.72     26129

    accuracy                           0.93    300000
   macro avg       0.78      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



In [15]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Combine the training data back together
train_data = pd.concat([X_train, y_train], axis=1)

# Separate the minority and majority classes
not_fraud = train_data[train_data.fraud == 0]
fraud = train_data[train_data.fraud == 1]

# Undersample the majority class
not_fraud_downsampled = resample(not_fraud, replace=False, n_samples=len(fraud), random_state=42)

# Combine the minority class with the downsampled majority class
downsampled = pd.concat([not_fraud_downsampled, fraud])

# Separate the features and the target variable again
X_train = downsampled.drop('fraud', axis=1)
y_train = downsampled['fraud']

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Print evaluation metrics
print("Undersampling the Majority Class")
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Undersampling the Majority Class
Precision: 0.5718950490951319
Recall: 0.9518159898962838
F1 Score: 0.7144909216272122
Confusion Matrix:
 [[255254  18617]
 [  1259  24870]]

Classification Report:
               precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    273871
         1.0       0.57      0.95      0.71     26129

    accuracy                           0.93    300000
   macro avg       0.78      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



In [18]:
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv"
fraud_data = pd.read_csv(url)
# Separate the features and the target variable
X = fraud_data.drop('fraud', axis=1)
y = fraud_data['fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Ensure both classes are present after SMOTE
print("After SMOTE:")
print("Not fraud:", sum(y_train_smote == 0))
print("Fraud:", sum(y_train_smote == 1))

# Standardize the features
scaler = StandardScaler()
X_train_smote_scaled = scaler.fit_transform(X_train_smote)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_smote_scaled, y_train_smote)

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Print evaluation metrics
print("Applying SMOTE")
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

After SMOTE:
Not fraud: 638726
Fraud: 638726
Applying SMOTE
Precision: 0.5757617985878856
Recall: 0.9487542577213058
F1 Score: 0.7166293271662934
Confusion Matrix:
 [[255605  18266]
 [  1339  24790]]

Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273871
         1.0       0.58      0.95      0.72     26129

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

