# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** The distance from home at which the transaction happened.
- **distance_from_last_transaction:** The distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [9]:
fraud.to_csv('fraud.csv', index=False)

**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [None]:
#1 What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

In [21]:
# Display basic information about the dataset
fraud.info()

# Display the distribution of the target variable
distribution = fraud['fraud'].value_counts()
distribution

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB


fraud
0.0    912597
1.0     87403
Name: count, dtype: int64
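The counts above can be turned into proportions to make the imbalance explicit. This is a minimal sketch that reuses the printed counts rather than reloading the CSV; on the real DataFrame, `fraud['fraud'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0.0: 912597, 1.0: 87403}, name="count")

# Share of each class in the dataset
proportions = counts / counts.sum()

# How many legitimate transactions there are per fraudulent one
imbalance_ratio = counts.loc[0.0] / counts.loc[1.0]

print(proportions.round(4))
print(f"legit : fraud = {imbalance_ratio:.1f} : 1")
```

About 8.7% of transactions are fraudulent, so the target is clearly imbalanced.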

In [None]:
#2 Train a LogisticRegression

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Split the data into features and target
X = fraud.drop(columns=['fraud'])
y = fraud['fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
classification_report_initial = classification_report(y_test, y_pred)
print(classification_report_initial)


              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182519
         1.0       0.90      0.60      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000
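To see why accuracy is a poor metric for this problem, it may help to score a trivial baseline on a test set with the same class sizes. The sketch below rebuilds labels from the support column of the report above and evaluates a hypothetical "always legit" predictor; `y_baseline` is an illustration and not part of the lab's model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Test-set class sizes copied from the "support" column of the report above
n_legit, n_fraud = 182519, 17481
y_true = np.concatenate([np.zeros(n_legit), np.ones(n_fraud)])

# Hypothetical baseline: a "model" that labels every transaction as legit
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline, pos_label=1.0)
baseline_f1 = f1_score(y_true, y_baseline, pos_label=1.0, zero_division=0)

print(f"accuracy:     {baseline_accuracy:.3f}")
print(f"fraud recall: {baseline_recall:.3f}")
print(f"fraud F1:     {baseline_f1:.3f}")
```

Under these assumptions the do-nothing baseline reaches about 0.91 accuracy while catching no fraud at all, which is why recall and F1 on class 1.0 are the more informative numbers here.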

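In [None]:
#3 Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric
# The classification report above covers this step. Fraud (class 1.0) is the class that matters, so read its recall (0.60) and F1 (0.72) rather than overall accuracy (0.96).

In [None]:
#4 Run Oversample in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?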

In [27]:
from imblearn.over_sampling import RandomOverSampler

# Apply oversampling
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Split the resampled data
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

# Train the model on the resampled data
model.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_resampled = model.predict(X_test_resampled)

# Evaluate the model
classification_report_oversampled = classification_report(y_test_resampled, y_pred_resampled)
print(classification_report_oversampled)


              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    182520
         1.0       0.93      0.95      0.94    182519

    accuracy                           0.94    365039
   macro avg       0.94      0.94      0.94    365039
weighted avg       0.94      0.94      0.94    365039
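One thing to note about the cell above: the oversampler runs on the full dataset before `train_test_split`, so duplicated fraud rows can land in both the training and the test set, and the test set no longer has the original class mix (about 9% fraud). A common alternative is to split first and resample only the training portion. The sketch below illustrates that order on synthetic data from `make_classification` (an assumption, standing in for `fraud`), with a small hand-rolled NumPy oversampler standing in for `RandomOverSampler`. `oversample_minority` is a hypothetical helper, not an imblearn function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def oversample_minority(X, y, random_state=42):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    n_extra = counts.max() - counts.min()
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Synthetic stand-in for the fraud data: roughly 9% positives
X, y = make_classification(n_samples=20_000, n_features=7, weights=[0.91], random_state=42)

# 1) Split first, so the test set keeps the original class mix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2) Oversample the training portion only
X_train_bal, y_train_bal = oversample_minority(X_train, y_train)

# 3) Fit on balanced data, evaluate on the untouched test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train_bal, y_train_bal)
print(classification_report(y_test, model.predict(X_test)))
```

With this ordering, the fraud recall and precision are measured on data that looks like what the model would see in production, which makes them comparable with the baseline report from step 2.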

In [None]:
#5 Now, run Undersample in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [29]:
from imblearn.under_sampling import RandomUnderSampler

# Apply undersampling
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Split the resampled data
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

# Train the model on the resampled data
model.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_resampled = model.predict(X_test_resampled)

# Evaluate the model
classification_report_undersampled = classification_report(y_test_resampled, y_pred_resampled)
print(classification_report_undersampled)


              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94     17481
         1.0       0.94      0.95      0.94     17481

    accuracy                           0.94     34962
   macro avg       0.94      0.94      0.94     34962
weighted avg       0.94      0.94      0.94     34962
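The small support values above (17,481 rows per class in a 20% test split) reflect how much data random undersampling discards. The arithmetic below makes that cost explicit, using the class counts from step 1 and assuming `RandomUnderSampler`'s default behaviour of sampling the majority class down to the minority size.

```python
# Class counts copied from the value_counts() output earlier in the notebook
n_legit, n_fraud = 912_597, 87_403

# Random undersampling keeps every minority row and an equal number of majority rows
rows_kept = 2 * n_fraud
rows_dropped = n_legit - n_fraud
share_kept = rows_kept / (n_legit + n_fraud)

print(f"rows kept:    {rows_kept:,}")
print(f"rows dropped: {rows_dropped:,}")
print(f"share kept:   {share_kept:.1%}")
```

Only about 17.5% of the original rows survive, which is the trade-off to weigh against the faster training that undersampling gives.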

In [None]:
#6 Finally, run SMOTE in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [31]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split the resampled data
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

# Train the model on the resampled data
model.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_resampled = model.predict(X_test_resampled)

# Evaluate the model
classification_report_smote = classification_report(y_test_resampled, y_pred_resampled)
print(classification_report_smote)

              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    182520
         1.0       0.94      0.95      0.94    182519

    accuracy                           0.94    365039
   macro avg       0.94      0.94      0.94    365039
weighted avg       0.94      0.94      0.94    365039
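Unlike random oversampling, SMOTE does not duplicate rows: it creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea on made-up 2D points, not imblearn's implementation. `make_synthetic` and the toy array are hypothetical names and values introduced for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Toy minority-class points in 2D (made-up values, for illustration only)
X_minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8], [1.8, 1.6]])

# For each point, find its k nearest minority neighbours (the first hit is the point itself)
k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
_, neighbor_idx = nn.kneighbors(X_minority)


def make_synthetic(i):
    """Create one synthetic sample on the segment between point i and a random neighbour."""
    j = rng.choice(neighbor_idx[i, 1:])  # skip column 0 (the point itself)
    lam = rng.random()                   # interpolation factor in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])


# Generate ten synthetic samples from randomly chosen minority points
synthetic = np.array([make_synthetic(i) for i in rng.integers(0, len(X_minority), size=10)])
print(synthetic.round(2))
```

Because every synthetic point is derived from real rows, running SMOTE before the split (as in the cell above) lets rows that end up in the test set shape training samples; resampling only the training split avoids that.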