# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the previous transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [31]:
#Libraries
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [3]:
fraud

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.311140,1.945940,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...
999995,2.207101,0.112651,1.626798,1.0,1.0,0.0,0.0,0.0
999996,19.872726,2.683904,2.778303,1.0,1.0,0.0,0.0,0.0
999997,2.914857,1.472687,0.218075,1.0,1.0,0.0,1.0,0.0
999998,4.258729,0.242023,0.475822,1.0,0.0,0.0,1.0,0.0


- What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

In [6]:
fraud_count = fraud['fraud'].value_counts()
fraud_count

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64
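To answer the imbalance question, it helps to turn the raw counts into proportions. The sketch below hard-codes the two counts printed above (in the notebook itself, `fraud['fraud'].value_counts(normalize=True)` gives the same shares directly); the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {0.0: 912597, 1.0: 87403}

total = sum(counts.values())
# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}
# How many legitimate transactions there are per fraudulent one.
imbalance_ratio = counts[0.0] / counts[1.0]

for label, share in shares.items():
    print(f"class {label}: {share:.2%}")
print(f"legit-to-fraud ratio: {imbalance_ratio:.1f} : 1")
```

With fraud making up under 9% of the rows, the dataset is clearly imbalanced, so accuracy alone is unlikely to be a trustworthy metric.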

- Train a LogisticRegression.

In [7]:
X = fraud.drop(columns = ['fraud'])
y = fraud['fraud']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
scaler = StandardScaler()

In [12]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
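The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. A minimal pure-Python sketch of that idea follows; the numbers are made up for illustration, and it assumes the usual standardization formula (subtract the mean, divide by the population standard deviation), which is what `StandardScaler` applies by default.

```python
from statistics import fmean, pstdev

# Hypothetical one-feature train and test values (not from the dataset).
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# "fit": learn the mean and standard deviation from the training data only.
mu = fmean(train)
sigma = pstdev(train)

# "transform": apply the same training statistics to both splits.
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

print(train_scaled)
print(test_scaled)
```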

In [13]:
model = LogisticRegression(random_state=42)

In [14]:
model.fit(X_train_scaled, y_train)

In [15]:
y_pred = model.predict(X_test_scaled)

- Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.

In [16]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(conf_matrix)

Accuracy: 0.95875
Precision: 0.8914913550804872
Recall: 0.6000687955053603
F1 Score: 0.7173108552631579
Confusion matrix:
[[181283   1274]
 [  6976  10467]]
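To see where these numbers come from, and why accuracy is a weak metric here, the scores can be recomputed by hand from the confusion matrix. The sketch below hard-codes the four cells printed above and assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]`; `majority_baseline` is an illustrative name for the accuracy of a model that always predicts "legit".

```python
# Confusion matrix cells copied from the output above:
# rows are the actual class, columns are the predicted class.
tn, fp = 181283, 1274
fn, tp = 6976, 10467

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that labels every transaction as legit.
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.5f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"always-legit accuracy={majority_baseline:.5f}")
```

The trivial model already reaches about 91% accuracy, so the 95.9% above says little. Recall is the more informative figure when missing a fraud is the costly mistake: roughly 40% of the fraudulent test transactions (6,976 of 17,443) go undetected.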


- Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 

In [21]:
oversample = RandomOverSampler(random_state=42)
X_train_oversample, y_train_oversample = oversample.fit_resample(X_train_scaled, y_train)
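Conceptually, random oversampling duplicates randomly chosen minority rows until both classes have the same size. The function below is a simplified pure-Python sketch of that idea, not imblearn's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data: five majority rows and a single minority row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
labels = [0, 0, 0, 0, 0, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Going by the counts reported in this notebook, the training split holds 730,040 legit and 69,960 fraud rows, so balancing it this way should yield about 1,460,080 rows. Only the training data is resampled; the test set keeps its original distribution.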

In [22]:
# Retrain the logistic regression model on the balanced data
model.fit(X_train_oversample, y_train_oversample)

In [23]:
y_pred_oversample = model.predict(X_test_scaled)

In [25]:
accuracy = accuracy_score(y_test, y_pred_oversample)
precision = precision_score(y_test, y_pred_oversample)
recall = recall_score(y_test, y_pred_oversample)
f1 = f1_score(y_test, y_pred_oversample)

# Show the metrics with oversampling
print("Model evaluation with oversampling:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

conf_matrix = confusion_matrix(y_test, y_pred_oversample)
print("Confusion matrix:")
print(conf_matrix)

Model evaluation with oversampling:
Accuracy: 0.93469
Precision: 0.5760563869310094
Recall: 0.9511551911941754
F1 Score: 0.7175417351440186
Confusion matrix:
[[170347  12210]
 [   852  16591]]


- Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [27]:
undersample = RandomUnderSampler(random_state=42)
X_train_undersample, y_train_undersample = undersample.fit_resample(X_train_scaled, y_train)
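Random undersampling goes the other way: it keeps every minority row and discards majority rows at random until the classes match. The sketch below is a simplified pure-Python illustration of that idea, not imblearn's implementation; `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=42):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample without replacement; the smallest class is kept in full.
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data: six majority rows and two minority rows.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [9.0], [9.5]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
new_rows, new_labels = random_undersample(rows, labels)
print(Counter(new_labels))
```

Going by the counts reported in this notebook, the balanced training set should shrink from 800,000 rows to about 139,920, which makes training faster at the cost of discarding most of the legitimate transactions.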

In [28]:
model.fit(X_train_undersample, y_train_undersample)

In [29]:
y_pred_undersample = model.predict(X_test_scaled)

In [30]:
accuracy = accuracy_score(y_test, y_pred_undersample)
precision = precision_score(y_test, y_pred_undersample)
recall = recall_score(y_test, y_pred_undersample)
f1 = f1_score(y_test, y_pred_undersample)

# Show the metrics with undersampling
print("Model evaluation with undersampling:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

conf_matrix = confusion_matrix(y_test, y_pred_undersample)
print("Confusion matrix:")
print(conf_matrix)

Model evaluation with undersampling:
Accuracy: 0.93448
Precision: 0.5751654367182899
Recall: 0.9517284870721779
F1 Score: 0.7170129140932061
Confusion matrix:
[[170295  12262]
 [   842  16601]]


- Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [33]:
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
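Unlike random oversampling, SMOTE does not copy existing rows. It creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that core interpolation step in pure Python; `smote_like`, its parameters, and the toy points are hypothetical, and the real imblearn implementation differs in its details.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_like(minority, n_new=5)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region the minority class already occupies.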

In [34]:
model.fit(X_train_smote, y_train_smote)

In [35]:
y_pred_smote = model.predict(X_test_scaled)

In [37]:
accuracy = accuracy_score(y_test, y_pred_smote)
precision = precision_score(y_test, y_pred_smote)
recall = recall_score(y_test, y_pred_smote)
f1 = f1_score(y_test, y_pred_smote)

# Show the metrics with SMOTE
print("Model evaluation with SMOTE:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

conf_matrix = confusion_matrix(y_test, y_pred_smote)
print("Confusion matrix:")
print(conf_matrix)

Model evaluation with SMOTE:
Accuracy: 0.934645
Precision: 0.5758553681726699
Recall: 0.9513845095453763
F1 Score: 0.7174509846306825
Confusion matrix:
[[170334  12223]
 [   848  16595]]
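To answer "does it improve the performance of our model?", the four runs can be compared side by side. The sketch below hard-codes the confusion matrices reported in this notebook, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout; `runs` and `summary` are illustrative names.

```python
# (TN, FP, FN, TP) for each run, copied from the confusion matrices above.
runs = {
    "baseline": (181283, 1274, 6976, 10467),
    "oversample": (170347, 12210, 852, 16591),
    "undersample": (170295, 12262, 842, 16601),
    "SMOTE": (170334, 12223, 848, 16595),
}

summary = {}
for name, (tn, fp, fn, tp) in runs.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    summary[name] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"{name:<12} precision={precision:.3f} recall={recall:.3f} "
          f"f1={f1:.3f} missed_fraud={fn} false_alarms={fp}")
```

All three resampling techniques behave almost identically here: recall rises from about 0.60 to about 0.95, precision falls from about 0.89 to about 0.58, and F1 stays near 0.717. Missed frauds drop from 6,976 to roughly 850, while false alarms grow from 1,274 to roughly 12,200. Whether that counts as an improvement depends on how costly a missed fraud is relative to a false alarm.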
