# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home at which the transaction happened.
- **distance_from_last_transaction:** the distance from the location of the last transaction.
- **ratio_to_median_purchase_price:** the ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** whether the transaction happened at a retailer used before.
- **used_chip:** whether the transaction was made with the chip (credit card).
- **used_pin_number:** whether the transaction was made using a PIN number.
- **online_order:** whether the transaction is an online order.
- **fraud:** whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [2]:
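# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler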

In [4]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 
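
For step 3, it can help to see why accuracy alone is misleading when one class is rare. The sketch below is not part of the lab solution: it computes accuracy, precision, recall and F1 from raw confusion-matrix counts in plain Python. The helper name `classification_metrics` and all counts are made up for illustration and do not come from the dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the transactions flagged as fraud, how many really were fraud?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real fraud cases, how many did the model catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 1,000 transactions, 90 of them fraudulent.
# Baseline that labels every transaction as legit: it never flags fraud.
always_legit = classification_metrics(tp=0, fp=0, fn=90, tn=910)

# Hypothetical model that catches 60 of the 90 fraud cases with 20 false alarms.
some_model = classification_metrics(tp=60, fp=20, fn=30, tn=890)

for name, metrics in [("always legit", always_legit), ("some model", some_model)]:
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

In this made-up case the always-legit baseline reaches 91% accuracy while catching no fraud at all, so recall and F1 on the fraud class say far more about the model than accuracy does.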

In [None]:
# Check the distribution of the target variable
print("Distribution of the target variable ('fraud'):")
class_counts = fraud['fraud'].value_counts()
print(class_counts)

# Calculate the imbalance ratio
imbalance_ratio = class_counts[0] / class_counts[1]
print(f"\nImbalance ratio (legit to fraud): {imbalance_ratio:.2f}")

# Split the data into features (X) and target variable (y)
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split the data into training and testing sets (80% train, 20% test);
# stratify=y keeps the fraud rate the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize Logistic Regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)

# Fit the model to the training data
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test)

# Evaluate using different metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Oversampling using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_over, y_over = oversampler.fit_resample(X_train, y_train)

# Train Logistic Regression on oversampled data
logreg_over = LogisticRegression(max_iter=1000, random_state=42)
logreg_over.fit(X_over, y_over)

# Evaluate the model on original test set
y_pred_over = logreg_over.predict(X_test)

# Evaluate using different metrics
accuracy_over = accuracy_score(y_test, y_pred_over)
precision_over = precision_score(y_test, y_pred_over)
recall_over = recall_score(y_test, y_pred_over)
f1_over = f1_score(y_test, y_pred_over)

# Print evaluation metrics after oversampling
print("\nMetrics after oversampling:")
print(f"Accuracy: {accuracy_over:.2f}")
print(f"Precision: {precision_over:.2f}")
print(f"Recall: {recall_over:.2f}")
print(f"F1 Score: {f1_over:.2f}")

# Confusion Matrix after oversampling
print("\nConfusion Matrix after oversampling:")
print(confusion_matrix(y_test, y_pred_over))

# Undersampling using RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)
X_under, y_under = undersampler.fit_resample(X_train, y_train)

# Train Logistic Regression on undersampled data
logreg_under = LogisticRegression(max_iter=1000, random_state=42)
logreg_under.fit(X_under, y_under)

# Evaluate the model on original test set
y_pred_under = logreg_under.predict(X_test)

# Evaluate using different metrics
accuracy_under = accuracy_score(y_test, y_pred_under)
precision_under = precision_score(y_test, y_pred_under)
recall_under = recall_score(y_test, y_pred_under)
f1_under = f1_score(y_test, y_pred_under)

# Print evaluation metrics after undersampling
print("\nMetrics after undersampling:")
print(f"Accuracy: {accuracy_under:.2f}")
print(f"Precision: {precision_under:.2f}")
print(f"Recall: {recall_under:.2f}")
print(f"F1 Score: {f1_under:.2f}")

# Confusion Matrix after undersampling
print("\nConfusion Matrix after undersampling:")
print(confusion_matrix(y_test, y_pred_under))

# Balancing using SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

# Train Logistic Regression on SMOTE-balanced data
logreg_smote = LogisticRegression(max_iter=1000, random_state=42)
logreg_smote.fit(X_smote, y_smote)

# Evaluate the model on original test set
y_pred_smote = logreg_smote.predict(X_test)

# Evaluate using different metrics
accuracy_smote = accuracy_score(y_test, y_pred_smote)
precision_smote = precision_score(y_test, y_pred_smote)
recall_smote = recall_score(y_test, y_pred_smote)
f1_smote = f1_score(y_test, y_pred_smote)

# Print evaluation metrics after SMOTE
print("\nMetrics after SMOTE:")
print(f"Accuracy: {accuracy_smote:.2f}")
print(f"Precision: {precision_smote:.2f}")
print(f"Recall: {recall_smote:.2f}")
print(f"F1 Score: {f1_smote:.2f}")

# Confusion Matrix after SMOTE
print("\nConfusion Matrix after SMOTE:")
print(confusion_matrix(y_test, y_pred_smote))
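
To make the difference between the three resampling strategies used above more concrete, here is a rough sketch of what each one does to the class counts, written in plain Python on a tiny made-up dataset. It is not how `imblearn` implements them: all function names are hypothetical, and the SMOTE-like step is simplified to interpolate towards the single nearest minority neighbour, whereas real SMOTE chooses among several nearest neighbours.

```python
import random


def group_by_label(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def flatten(groups):
    """Turn a {label: rows} dict back into parallel row and label lists."""
    rows, labels = [], []
    for label, members in groups.items():
        rows.extend(members)
        labels.extend([label] * len(members))
    return rows, labels


def random_oversample(rows, labels, seed=42):
    """Duplicate random rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = max(len(members) for members in groups.values())
    for members in groups.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        members.extend(extra)
    return flatten(groups)


def random_undersample(rows, labels, seed=42):
    """Keep a random subset of larger classes so every class matches the smallest."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = min(len(members) for members in groups.values())
    return flatten({label: rng.sample(members, target) for label, members in groups.items()})


def smote_like_oversample(rows, labels, minority, seed=42):
    """Add synthetic minority rows between a minority row and its nearest minority neighbour.

    Simplified sketch: one nearest neighbour only, and at least two minority rows are required.
    """
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    members = groups[minority]
    target = max(len(m) for m in groups.values())
    synthetic = []
    for _ in range(target - len(members)):
        base = rng.choice(members)
        others = [m for m in members if m is not base]
        neighbour = min(others, key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    groups[minority] = members + synthetic
    return flatten(groups)


# Made-up toy data: 8 "legit" rows (label 0) and 2 "fraud" rows (label 1), two features each.
rows = [[float(i), float(i % 3)] for i in range(8)] + [[10.0, 10.0], [12.0, 14.0]]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
smote_rows, smote_labels = smote_like_oversample(rows, labels, minority=1)

for name, resampled in [("oversample", over_labels), ("undersample", under_labels), ("SMOTE-like", smote_labels)]:
    print(name, {c: resampled.count(c) for c in sorted(set(resampled))})
```

Oversampling and the SMOTE-like step both grow the fraud class to the size of the legit class, but only the latter creates new points instead of exact copies; undersampling instead shrinks the legit class and discards data. In all cases, as in the cells above, resampling is applied to the training data only and the test set keeps its original distribution.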