# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction took place.
- **distance_from_last_transaction:** Distance from where the previous transaction took place.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction was made at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [9]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **5.** Now, run **Undersample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **6.** Finally, run **SMOTE** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?

In [3]:
fraud["fraud"].value_counts()

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64
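
The counts above answer step 1. As a small sketch that uses only the two printed numbers (the variable names are illustrative), the class shares and the accuracy of a trivial model that always predicts "legit" can be worked out directly:

```python
# Class counts copied from the value_counts() output above.
legit_count = 912_597
fraud_count = 87_403
total = legit_count + fraud_count

fraud_share = fraud_count / total             # fraction of fraudulent rows
imbalance_ratio = legit_count / fraud_count   # legit rows per fraud row

# A model that always predicts 0 (legit) is right on every legit row,
# so its accuracy equals the legit share while its fraud recall is 0.
majority_baseline_accuracy = legit_count / total

print(f"Fraud share: {fraud_share:.2%}")
print(f"Legit rows per fraud row: {imbalance_ratio:.1f}")
print(f"Always-legit accuracy: {majority_baseline_accuracy:.2%}")
```

With fraud making up under 9% of the rows, accuracy alone is a weak yardstick: a model has to beat the always-legit accuracy just to outperform a constant prediction.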

In [6]:
# Separate the predictors from the target column
features = fraud.drop(columns="fraud")
target = fraud["fraud"]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [10]:
# Fit the scaler on the training set only, then apply the same scaling to the test set
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [11]:
log_reg = LogisticRegression(random_state=42)

In [12]:
log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
# Make predictions on the testing set
y_pred = log_reg.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.94515
Classification Report:
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97    182557
         1.0       0.91      0.41      0.57     17443

    accuracy                           0.95    200000
   macro avg       0.93      0.70      0.77    200000
weighted avg       0.94      0.95      0.94    200000

Confusion Matrix:
[[181835    722]
 [ 10248   7195]]
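
For step 3, the fraud class is the one that matters, so the fraud-class precision and recall are more informative than accuracy. The sketch below recomputes them by hand from the confusion matrix printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix copied from the baseline output above.
# Layout: rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 181_835, 722
fn, tp = 10_248, 7_195

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the rows flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.5f}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The high accuracy hides the fact that 10,248 of the 17,443 fraudulent test transactions were missed, which is why recall on class 1 is a better metric to track here.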


In [14]:
oversampler = RandomOverSampler(sampling_strategy='minority')

In [15]:
features_resampled, target_resampled = oversampler.fit_resample(features, target)

In [16]:
print(pd.Series(target_resampled).value_counts())

fraud
0.0    912597
1.0    912597
Name: count, dtype: int64


In [17]:
X_train, X_test, y_train, y_test = train_test_split(features_resampled, target_resampled, test_size=0.2, random_state=42)
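
One caveat about the cells above: the data is resampled before it is split, so duplicated fraud rows can end up in both the training and the test set, and the test set no longer has the original class balance. A common alternative is to split first and resample only the training portion. The following is a minimal standard-library sketch of that order on toy data; `random_oversample` is a hypothetical helper written for illustration, not the imblearn API.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until every
    class is as large as the biggest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = rng.choices(pool, k=target_size - count)  # sample with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 90 legit rows (label 0) and 10 fraud rows (label 1).
rows = [[float(i)] for i in range(100)]
labels = [0] * 90 + [1] * 10

# 1) Split first, so the test set keeps the original class balance.
rng = random.Random(0)
indices = list(range(len(rows)))
rng.shuffle(indices)
test_idx, train_idx = indices[:20], indices[20:]
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# 2) Oversample the training portion only.
X_train_bal, y_train_bal = random_oversample(X_train, y_train)

print("train before:", Counter(y_train))
print("train after: ", Counter(y_train_bal))
print("test (untouched):", Counter(y_test))
```

In the notebook itself, the same order would mean calling `train_test_split` on `features` and `target` first and then `fit_resample` on `X_train` and `y_train` only; the scores reported here were not produced that way.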

In [18]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [19]:
log_reg = LogisticRegression(random_state=42)

In [20]:
log_reg.fit(X_train, y_train)

In [21]:
# Make predictions on the testing set
y_pred = log_reg.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9349055854305979
Classification Report:
              precision    recall  f1-score   support

         0.0       0.94      0.93      0.93    182421
         1.0       0.93      0.94      0.94    182618

    accuracy                           0.93    365039
   macro avg       0.93      0.93      0.93    365039
weighted avg       0.93      0.93      0.93    365039

Confusion Matrix:
[[169576  12845]
 [ 10917 171701]]


In [22]:
undersampler = RandomUnderSampler(sampling_strategy='majority')

In [23]:
features_resampled, target_resampled = undersampler.fit_resample(features, target)

In [24]:
print(pd.Series(target_resampled).value_counts())

fraud
0.0    87403
1.0    87403
Name: count, dtype: int64


In [25]:
X_train, X_test, y_train, y_test = train_test_split(features_resampled, target_resampled, test_size=0.2, random_state=42)

In [26]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [27]:
log_reg.fit(X_train, y_train)

In [28]:
# Make predictions on the testing set
y_pred = log_reg.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9174818374234883
Classification Report:
              precision    recall  f1-score   support

         0.0       0.93      0.91      0.92     17474
         1.0       0.91      0.93      0.92     17488

    accuracy                           0.92     34962
   macro avg       0.92      0.92      0.92     34962
weighted avg       0.92      0.92      0.92     34962

Confusion Matrix:
[[15825  1649]
 [ 1236 16252]]


In [29]:
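# SMOTEENN combines SMOTE oversampling with Edited Nearest Neighbours cleaning, so the classes end up close to balanced, not exactly equal
smote = SMOTEENN(sampling_strategy='auto')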

In [30]:
features_resampled, target_resampled = smote.fit_resample(features, target)

In [32]:
print(pd.Series(target_resampled).value_counts())

fraud
1.0    910832
0.0    890089
Name: count, dtype: int64


In [31]:
X_train, X_test, y_train, y_test = train_test_split(features_resampled, target_resampled, test_size=0.2, random_state=42)

In [33]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [34]:
log_reg.fit(X_train, y_train)

In [35]:
# Make predictions on the testing set
y_pred = log_reg.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9389369351860849
Classification Report:
              precision    recall  f1-score   support

         0.0       0.94      0.93      0.94    178115
         1.0       0.94      0.94      0.94    182070

    accuracy                           0.94    360185
   macro avg       0.94      0.94      0.94    360185
weighted avg       0.94      0.94      0.94    360185

Confusion Matrix:
[[166391  11724]
 [ 10270 171800]]
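
When comparing these runs with the baseline, note that the resampled models were scored on test sets that were themselves balanced, so their precision is not directly comparable with the first report. As a rough sketch, the code below takes the confusion matrix of the oversampled run and estimates what its precision would be on a test set with the original class balance. It assumes that the true-positive and false-positive rates carry over unchanged, which is an assumption and not something the notebook verifies.

```python
# Confusion matrix of the oversampled run (rows = true class, columns = predicted).
tn, fp = 169_576, 12_845
fn, tp = 10_917, 171_701

true_positive_rate = tp / (tp + fn)    # fraud recall on the balanced test set
false_positive_rate = fp / (fp + tn)   # share of legit rows flagged as fraud
balanced_precision = tp / (tp + fp)    # precision as reported on the balanced test set

# Class sizes of the original, imbalanced test split
# (support column of the baseline classification report).
legit_rows, fraud_rows = 182_557, 17_443

# Assumption: both rates stay the same on data with the original class balance.
expected_tp = true_positive_rate * fraud_rows
expected_fp = false_positive_rate * legit_rows
expected_precision = expected_tp / (expected_tp + expected_fp)

print(f"fraud recall: {true_positive_rate:.2f}")
print(f"precision on the balanced test set: {balanced_precision:.2f}")
print(f"estimated precision at the original class balance: {expected_precision:.2f}")
```

Under that assumption, fraud recall rises from 0.41 in the baseline to about 0.94, while the estimated precision at the original class balance is noticeably lower than the balanced-test figure, because the same false-positive rate is applied to far more legitimate transactions.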
