# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction used the chip (credit card).
- **used_pin_number:** Whether the transaction used a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

In [3]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and select an appropriate metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

### 1. Check the distribution of the target variable

In [4]:
target_distribution = fraud['fraud'].value_counts(normalize=True)
print("Target variable distribution:")
print(target_distribution)
print("\nThe dataset is imbalanced." if target_distribution[0] > 0.6 else "\nThe dataset is balanced.")

Target variable distribution:
fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64

The dataset is imbalanced.


Explanation  
With ~91% of transactions being legitimate and only ~9% being fraudulent, this dataset is highly imbalanced. There is no universal cutoff: the check above flags a majority share above 60%, and a stricter rule of thumb treats a dataset as imbalanced when the minority class falls below about 30%. This dataset meets both.  

This imbalance can lead to challenges in training a model, as it may favor predicting the majority class (0.0) while overlooking the minority class (1.0), which is usually the class of interest in fraud detection scenarios.
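
To make the risk concrete, here is a minimal sketch of the trivial "always legit" baseline, using only the class proportions printed above. The variable names are illustrative and not part of the lab.

```python
# Class proportions reported by value_counts(normalize=True) above.
p_legit = 0.912597
p_fraud = 0.087403

# A "model" that always predicts 0 (legit) is correct on every legit row
# and wrong on every fraud row.
baseline_accuracy = p_legit
baseline_fraud_recall = 0.0

print(f"Always-legit accuracy:     {baseline_accuracy:.1%}")
print(f"Always-legit fraud recall: {baseline_fraud_recall:.1%}")
```

Any real model should therefore be judged against roughly 91% accuracy, not 50%, and accuracy alone says nothing about how much fraud is caught.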

In [5]:
# Split data into features and target
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

In [6]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [7]:
# Function to evaluate the model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("ROC AUC Score:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

### 2. Train a Logistic Regression model

In [8]:
print("\nTraining Logistic Regression on imbalanced data...")
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)


Training Logistic Regression on imbalanced data...


### 3. Evaluate the model

In [9]:
evaluate_model(logreg, X_test, y_test)

Confusion Matrix:
 [[271936   1843]
 [ 10431  15790]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273779
         1.0       0.90      0.60      0.72     26221

    accuracy                           0.96    300000
   macro avg       0.93      0.80      0.85    300000
weighted avg       0.96      0.96      0.96    300000

ROC AUC Score: 0.9671862712228372


EXPLANATION

- Confusion Matrix:  

True Negatives (TN): 271,936 legitimate transactions correctly classified.  
False Positives (FP): 1,843 legitimate transactions incorrectly classified as fraud.  
False Negatives (FN): 10,431 fraudulent transactions incorrectly classified as legitimate.  
True Positives (TP): 15,790 fraudulent transactions correctly classified.  

- Classification Report:  

Precision for Class 1.0 (Fraud): 90%  
Out of all transactions predicted as fraudulent, 90% are truly fraudulent.  
Recall for Class 1.0 (Fraud): 60%  
Out of all actual fraudulent transactions, only 60% are correctly identified.  
F1-Score for Class 1.0 (Fraud): 72%  
The harmonic mean of precision and recall, showing the balance between the two for detecting fraud.  
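
The fraud-class figures in the classification report can be re-derived by hand from the four confusion matrix counts. The sketch below copies those counts from the output above and uses the standard definitions of precision, recall, and F1.

```python
# Counts copied from the confusion matrix of the model trained on imbalanced data.
tn, fp, fn, tp = 271_936, 1_843, 10_431, 15_790

precision = tp / (tp + fp)   # share of predicted frauds that are real
recall = tp / (tp + fn)      # share of real frauds that are caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these reproduce the 0.90, 0.60, and 0.72 shown for class 1.0.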

- Accuracy:  

Overall accuracy is 96%, but this metric is not reliable in imbalanced datasets, as it is heavily influenced by the majority class (legitimate transactions).  

- Macro Average:  

Averages precision, recall, and F1-score across both classes equally, so the weaker class is not hidden by the larger one. The macro recall (0.80) is the mean of 0.99 for legitimate transactions and 0.60 for fraud, which shows that fraud detection is far less effective than legitimate transaction detection.  

- Weighted Average:  

Similar to the macro average but weights each class by its support (number of samples). Because legitimate transactions make up about 91% of the test set, the weighted averages (0.96) are dominated by the majority class and largely hide the weak fraud recall.  
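
As a rough illustration of how the two averages differ, the sketch below recomputes macro and weighted recall from the confusion matrix counts above. It is a hand calculation, not a replacement for `classification_report`.

```python
# Counts copied from the confusion matrix of the model trained on imbalanced data.
tn, fp, fn, tp = 271_936, 1_843, 10_431, 15_790

support_legit = tn + fp   # actual legitimate transactions in the test set
support_fraud = fn + tp   # actual fraudulent transactions in the test set
total = support_legit + support_fraud

recall_legit = tn / support_legit
recall_fraud = tp / support_fraud

# Macro: every class counts the same. Weighted: every sample counts the same.
macro_recall = (recall_legit + recall_fraud) / 2
weighted_recall = (recall_legit * support_legit + recall_fraud * support_fraud) / total

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

Weighted recall works out to `(tn + tp) / total`, which is the accuracy, so it inherits accuracy's blind spot for the minority class.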

- ROC AUC Score:  

A high ROC AUC score (0.967) indicates that the model ranks fraudulent transactions above legitimate ones well across all decision thresholds. However, at the default 0.5 threshold used by `predict`, recall for fraud is only 60%, so the model would still miss about 40% of fraud cases in real-world applications.  
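
A high ROC AUC alongside a low recall usually comes down to the decision threshold: ROC AUC looks at the ranking of scores, while recall depends on where the cut is made. The sketch below uses a tiny hypothetical set of labels and scores (not the lab's data) to show recall and precision moving as the threshold is lowered.

```python
import numpy as np

# Hypothetical labels and predicted fraud probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.40, 0.70, 0.90])

def recall_precision_at(threshold):
    """Recall and precision for the fraud class at a given score threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return recall, precision

for t in (0.5, 0.3):
    r, p = recall_precision_at(t)
    print(f"threshold={t}: recall={r:.2f} precision={p:.2f}")
```

On the notebook's model, the same idea would apply to `logreg.predict_proba(X_test)[:, 1]`. Lowering the threshold can only keep recall the same or raise it, typically at the cost of precision.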

### 4. Oversampling

In [10]:
print("\nApplying Oversampling...")
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
logreg_over = LogisticRegression(max_iter=1000, random_state=42)
logreg_over.fit(X_resampled, y_resampled)
print("\nEvaluating model with oversampled data...")
evaluate_model(logreg_over, X_test, y_test)


Applying Oversampling...

Evaluating model with oversampled data...
Confusion Matrix:
 [[255521  18258]
 [  1357  24864]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC Score: 0.9795659265409268


Explanation after oversampling:  
- Confusion Matrix:  

True Negatives (TN): 255,521 legitimate transactions correctly classified.  
False Positives (FP): 18,258 legitimate transactions incorrectly classified as fraud.  
False Negatives (FN): 1,357 fraudulent transactions incorrectly classified as legitimate.  
True Positives (TP): 24,864 fraudulent transactions correctly classified.  

- Classification Report:  

Precision for Class 1.0 (Fraud): 58%  
Out of all transactions predicted as fraudulent, 58% are truly fraudulent. Precision dropped sharply compared to the imbalanced data (90%): training on balanced classes makes the model more willing to predict fraud, and false positives rose from 1,843 to 18,258.  

Recall for Class 1.0 (Fraud): 95%  
A significant improvement from the 60% recall with imbalanced data. The model is now much better at identifying fraudulent transactions.  

F1-Score for Class 1.0 (Fraud): 72%  
Matches the previous F1-score: the gain in recall is offset by the loss in precision, so the trade-off has shifted toward catching fraud rather than improving on both fronts.  

Weighted Metrics: The weighted average F1-score remains high (0.94), slightly below the 0.96 obtained on the imbalanced data, because the majority class still dominates it.  

- ROC AUC Score:  
0.9796, higher than the imbalanced dataset's 0.9672, indicating an improved ability to distinguish between fraudulent and legitimate transactions.
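
For intuition about what random oversampling does to the training set, here is a minimal NumPy sketch on toy data. It is an assumption-laden illustration of the idea (duplicate minority rows until the classes are balanced), not the internals of `RandomOverSampler`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set: 8 legit rows (0) and 2 fraud rows (1). Values are illustrative.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have the same count.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_over = np.vstack([X, X[extra_idx]])
y_over = np.concatenate([y, y[extra_idx]])

print("class counts before:", np.bincount(y))
print("class counts after: ", np.bincount(y_over))
```

No new information is created: every added row is an exact copy of an existing fraud row, which is why the test set must stay untouched, as it does in the cell above.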

### 5. Undersampling

In [11]:
print("\nApplying Undersampling...")
rus = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = rus.fit_resample(X_train, y_train)
logreg_under = LogisticRegression(max_iter=1000, random_state=42)
logreg_under.fit(X_resampled_under, y_resampled_under)
print("\nEvaluating model with undersampled data...")
evaluate_model(logreg_under, X_test, y_test)


Applying Undersampling...

Evaluating model with undersampled data...
Confusion Matrix:
 [[255528  18251]
 [  1327  24894]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC Score: 0.979587744378318


EXPLANATION

- Confusion Matrix:  

True Negatives (TN): 255,528 legitimate transactions correctly classified.  
False Positives (FP): 18,251 legitimate transactions incorrectly classified as fraud.  
False Negatives (FN): 1,327 fraudulent transactions incorrectly classified as legitimate.  
True Positives (TP): 24,894 fraudulent transactions correctly classified.  

- Classification Report:

Precision for Class 1.0 (Fraud): 58%  
Precision remains consistent with the oversampling result (58%).  
Recall for Class 1.0 (Fraud): 95%  
Recall matches the oversampling result at two decimals (24,894 vs. 24,864 frauds caught), showing the model's strong ability to detect fraudulent transactions.  
F1-Score for Class 1.0 (Fraud): 72%  
This balance between precision and recall is consistent with oversampling.  
Weighted Metrics: Weighted averages are unchanged from oversampling (F1-score 0.94).  

- ROC AUC Score:  

0.9796, comparable to the score achieved with oversampling. This indicates that the model maintains its ability to distinguish between fraudulent and legitimate transactions.
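
Random undersampling is the mirror image of oversampling: majority rows are dropped until the classes match. The sketch below shows the idea on toy data and is not the internals of `RandomUnderSampler`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set: 8 legit rows (0) and 2 fraud rows (1). Values are illustrative.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep a random subset of majority rows, the same size as the minority class.
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep_idx = np.sort(np.concatenate([keep_majority, minority_idx]))

X_under, y_under = X[keep_idx], y[keep_idx]

print("class counts before:", np.bincount(y))
print("class counts after: ", np.bincount(y_under))
```

The training set shrinks to twice the minority count, which makes fitting faster but discards most of the legitimate transactions.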

### 6. SMOTE

In [12]:
smote = SMOTE(random_state=42)
X_resampled_smote, y_resampled_smote = smote.fit_resample(X_train, y_train)
logreg_smote = LogisticRegression(max_iter=1000, random_state=42)
logreg_smote.fit(X_resampled_smote, y_resampled_smote)
print("\nEvaluating model with SMOTE data...")
evaluate_model(logreg_smote, X_test, y_test)


Evaluating model with SMOTE data...
Confusion Matrix:
 [[255664  18115]
 [  1401  24820]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC Score: 0.9792851947382067


CONCLUSION

SMOTE COMPARISON WITH OVERSAMPLING AND UNDERSAMPLING  
- Recall: SMOTE achieves the same recall improvement (95%, up from 60%) as oversampling and undersampling, addressing the imbalance effectively.  
- Precision: Precision is the same (58%) across all three resampling techniques, well below the 90% of the model trained on imbalanced data, showing the same trade-off between true positives and false positives.  
- F1-Score: Identical to the other methods (72%), indicating a consistent balance between precision and recall.  
- ROC AUC Score: Slightly lower (0.9793 vs. 0.9796), a negligible difference.  
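
The comparison above can be checked side by side by recomputing the fraud-class metrics from the four confusion matrices printed in this notebook. The helper name `fraud_metrics` is introduced here for illustration only.

```python
# (TN, FP, FN, TP) copied from the confusion matrices printed in this notebook.
results = {
    "imbalanced":    (271_936, 1_843, 10_431, 15_790),
    "oversampling":  (255_521, 18_258, 1_357, 24_864),
    "undersampling": (255_528, 18_251, 1_327, 24_894),
    "SMOTE":         (255_664, 18_115, 1_401, 24_820),
}

def fraud_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the fraud class (label 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for name, counts in results.items():
    p, r, f1 = fraud_metrics(*counts)
    print(f"{name:<14} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

At three decimals the resampling methods differ slightly, but all three round to 0.58 precision, 0.95 recall, and 0.72 F1.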

SMOTE performs similarly to oversampling and undersampling in improving the model's ability to detect fraud. It achieves the same recall boost (95%), which is crucial for minimizing false negatives. Unlike undersampling, SMOTE retains all majority class data, and unlike random oversampling it generates synthetic minority samples instead of duplicating existing rows, at the cost of a larger training set. On this dataset the three methods give nearly identical results, so the choice matters little here; SMOTE is often preferred in practical scenarios where maintaining data richness while addressing class imbalance is a priority.
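
To show what "generates synthetic samples" means, here is a minimal sketch of the interpolation step at the heart of SMOTE, for a single pair of hypothetical fraud rows. The real `SMOTE` class also searches for the k nearest minority neighbors and repeats this many times; this sketch assumes the neighbor is already known.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class (fraud) rows assumed to be nearest neighbors.
x = np.array([2.0, 10.0])
neighbor = np.array([4.0, 6.0])

# Synthetic sample: a random point on the line segment between the two rows.
lam = rng.uniform(0.0, 1.0)
synthetic = x + lam * (neighbor - x)

print("interpolation factor:", round(lam, 3))
print("synthetic sample:    ", synthetic)
```

The synthetic row always falls between the two real rows in every feature, so it is new but stays within the region already occupied by the minority class.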