# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction used a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [5]:
fraud.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [12]:
#1. What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

# Check the distribution of the target variable
fraud_distribution = fraud['fraud'].value_counts(normalize=True)
print(fraud_distribution)

#Fraudulent transactions are only 8.74% of the dataset, so it is definitely an imbalanced dataset.
#Models trained on imbalanced datasets may be biased towards the majority class, which in this case is non-fraudulent transactions.




fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64
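
To see why accuracy alone is misleading on this data, consider a trivial classifier that always predicts "legit". The sketch below uses only the proportions printed above; the variable names are illustrative and not part of the lab.

```python
# Class proportions copied from the value_counts(normalize=True) output above
legit_share = 0.912597
fraud_share = 0.087403

# A model that always predicts 0 (legit) is correct on every legit row
# and wrong on every fraud row.
baseline_accuracy = legit_share
baseline_fraud_recall = 0.0

# How many legit transactions there are per fraudulent one
imbalance_ratio = legit_share / fraud_share

print(f"Always-legit accuracy:     {baseline_accuracy:.2%}")
print(f"Always-legit fraud recall: {baseline_fraud_recall:.2%}")
print(f"Legit-to-fraud ratio:      {imbalance_ratio:.1f} to 1")
```

Any model worth keeping has to beat this baseline on fraud recall, not just on accuracy.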


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

#2. Train a LogisticRegression.
#3. Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.



# Split into train and test sets (stratified to preserve the class ratio)
X = fraud.drop(columns=['fraud'])
y = fraud['fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train the logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nROC AUC Score:")
print(roc_auc_score(y_test, y_pred_proba))


Confusion Matrix:
[[181291   1228]
 [  6916  10565]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182519
         1.0       0.90      0.60      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000


ROC AUC Score:
0.9670409810026577
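
The fraud-class figures in the report can be recomputed by hand from the confusion matrix, which makes it clear where the 0.60 recall comes from. This is a small worked check using the counts printed above; it assumes scikit-learn's layout of rows as actual classes and columns as predicted classes.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class)
tn, fp = 181291, 1228    # actual legit: correctly passed, wrongly flagged
fn, tp = 6916, 10565     # actual fraud: missed, correctly caught

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Accuracy is 0.96 even though 6,916 of the 17,481 fraudulent transactions were missed, which is why recall on the fraud class is the more informative metric here.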


In [None]:

#4. Run Oversample in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?

from imblearn.over_sampling import RandomOverSampler

# Apply RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)

# Train a Logistic Regression model on the balanced dataset
log_reg.fit(X_res, y_res)

# Predict on the test set
y_pred_res = log_reg.predict(X_test)
y_pred_res_proba = log_reg.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Confusion Matrix (Oversampled):")
print(confusion_matrix(y_test, y_pred_res))
print("\nClassification Report (Oversampled):")
print(classification_report(y_test, y_pred_res))
print("\nROC AUC Score (Oversampled):")
print(roc_auc_score(y_test, y_pred_res_proba))
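
To make the resampling step less of a black box, here is a minimal NumPy sketch of the idea behind random oversampling: duplicate minority rows, drawn with replacement, until every class matches the largest one. The function `random_oversample` and the toy arrays are hypothetical illustrations, not imbalanced-learn's implementation.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate rows of smaller classes until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy data: 8 legit rows (0) and 2 fraud rows (1)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

As in the cell above, resampling should only ever be applied to the training split, so the test set keeps the real class ratio.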



In [17]:
#5. Now, run Undersample in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

from imblearn.under_sampling import RandomUnderSampler

# Apply RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

# Train model on the balanced dataset
log_reg.fit(X_res, y_res)

# Predict on the test set
y_pred_res = log_reg.predict(X_test)
y_pred_res_proba = log_reg.predict_proba(X_test)[:, 1]

# Evaluate
print("Confusion Matrix (Undersampled):")
print(confusion_matrix(y_test, y_pred_res))
print("\nClassification Report (Undersampled):")
print(classification_report(y_test, y_pred_res))
print("\nROC AUC Score (Undersampled):")
print(roc_auc_score(y_test, y_pred_res_proba))

#Accuracy: The undersampled model's accuracy (0.93) is slightly lower than the original model's (0.96). This is expected: undersampling discards most non-fraudulent training rows, so the model flags far more legitimate transactions as fraud (12,133 false positives versus 1,228).

#Recall for the minority class (fraudulent transactions) rises from 0.60 to 0.95, which is crucial for identifying fraud cases. The cost is lower precision (0.58 versus 0.90) and lower overall accuracy, and the F1-score for the fraud class is unchanged at 0.72. The ROC AUC score improves from 0.967 to 0.980, indicating slightly better separation of fraudulent and non-fraudulent transactions.

#In scenarios where identifying fraudulent transactions is critical (even at the cost of more false positives), the undersampled model performs better. However, if the cost of false positives is high, the original model might be preferred.



Confusion Matrix (Undersampled):
[[170386  12133]
 [   917  16564]]

Classification Report (Undersampled):
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000


ROC AUC Score (Undersampled):
0.9795773854342927
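
The trade-off between missed fraud and false alarms can be made concrete with a simple cost comparison. The sketch below assumes a linear cost model in which every false positive costs one unit and every false negative costs `r` units; that model and the example value of `r` are assumptions for illustration only. The error counts come from the two confusion matrices printed in this notebook.

```python
# Error counts copied from the confusion matrices in this notebook
fp_base, fn_base = 1228, 6916      # original (imbalanced) model
fp_under, fn_under = 12133, 917    # undersampled model

def total_cost(fp, fn, r):
    """Hypothetical cost: 1 unit per false alarm, r units per missed fraud."""
    return fp + r * fn

# Cost ratio at which both models cost the same:
# fp_base + r * fn_base == fp_under + r * fn_under
break_even = (fp_under - fp_base) / (fn_base - fn_under)
print(f"Break-even cost ratio: {break_even:.2f}")

# Example: assume a missed fraud costs 10 times a false alarm
r = 10
print("original    :", total_cost(fp_base, fn_base, r))
print("undersampled:", total_cost(fp_under, fn_under, r))
```

Under this assumption, the undersampled model is cheaper whenever a missed fraud costs more than about 1.8 times a false alarm.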


In [19]:

#6. Finally, run SMOTE in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Train a Logistic model on the balanced dataset
log_reg.fit(X_res, y_res)

# Predict on the test set
y_pred_res = log_reg.predict(X_test)
y_pred_res_proba = log_reg.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Confusion Matrix (SMOTE):")
print(confusion_matrix(y_test, y_pred_res))
print("\nClassification Report (SMOTE):")
print(classification_report(y_test, y_pred_res))
print("\nROC AUC Score (SMOTE):")
print(roc_auc_score(y_test, y_pred_res_proba))


#Using SMOTE shifts the model from high precision to high recall for the fraudulent class: recall rises from 0.60 to 0.95 while precision falls from 0.90 to 0.58, so the F1-score stays at 0.72. Accuracy is slightly lower than the original model (0.94 versus 0.96). The ROC AUC score (0.979 versus 0.967) indicates that the model with SMOTE performs well in distinguishing between fraudulent and non-fraudulent transactions.

#Balancing the dataset using SMOTE improves recall for the minority class (fraudulent transactions) while maintaining good overall performance, with results nearly identical to random undersampling. This makes the SMOTE model a strong candidate when the identification of fraudulent transactions is critical.



Confusion Matrix (SMOTE):
[[170500  12019]
 [   942  16539]]

Classification Report (SMOTE):
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000


ROC AUC Score (SMOTE):
0.9791832099721023
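
Unlike random oversampling, SMOTE creates new minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that core idea only; `smote_sketch`, its parameters, and the toy points are hypothetical, and imbalanced-learn's `SMOTE` handles many details this version ignores.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=42):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)          # starting samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6)
print(synthetic.round(2))
```

Because every synthetic point lies on a segment between two real minority points, SMOTE fills in the minority region instead of repeating identical rows.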
