# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** distance from home at which the transaction took place.
- **distance_from_last_transaction:** distance from the location of the previous transaction.
- **ratio_to_median_purchase_price:** ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** whether the transaction took place at the same retailer.
- **used_chip:** whether the transaction was made with the chip (credit card).
- **used_pin_number:** whether the transaction was made using a PIN.
- **online_order:** whether the transaction was an online order.
- **fraud:** whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split




In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

import seaborn 

In [35]:
total = fraud["fraud"].count()
total

proportion = fraud.groupby(["fraud"])["fraud"].count().rename('counting')
proportion = proportion.reset_index()
proportion["proportion_result"] = (proportion["counting"] / total) * 100
proportion

Unnamed: 0,fraud,counting,proportion_result
0,0.0,912597,91.2597
1,1.0,87403,8.7403
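
With roughly 91% legitimate transactions, the target is clearly imbalanced. A quick arithmetic check, using only the counts shown above, illustrates why accuracy alone is a weak metric here. The "always predict legit" baseline is a hypothetical reference point, not a model trained in this lab.

```python
# Class counts taken from the table above.
n_legit = 912_597
n_fraud = 87_403
n_total = n_legit + n_fraud

# Hypothetical baseline that labels every transaction as legit.
baseline_accuracy = n_legit / n_total  # all legit rows right, all fraud rows wrong
baseline_recall = 0 / n_fraud          # no fraudulent transaction is ever flagged
imbalance_ratio = n_legit / n_fraud    # legit transactions per fraudulent one

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f} : 1")
```

Any model evaluated later should therefore be judged against about 91% accuracy, and mainly on how much fraud it actually catches.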


In [37]:
fraud

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.311140,1.945940,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...
999995,2.207101,0.112651,1.626798,1.0,1.0,0.0,0.0,0.0
999996,19.872726,2.683904,2.778303,1.0,1.0,0.0,0.0,0.0
999997,2.914857,1.472687,0.218075,1.0,1.0,0.0,1.0,0.0
999998,4.258729,0.242023,0.475822,1.0,0.0,0.0,1.0,0.0


In [62]:
fraud["fraud"].value_counts()

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64

In [39]:
X = fraud.drop(columns=["fraud"])
y = fraud["fraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [46]:
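from sklearn.neighbors import KNeighborsRegressor

# k-NN regressor on the 0/1 target: each prediction is the mean label of the
# 5 nearest training rows, i.e. the fraction of those neighbours that are fraud
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_pred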

array([1. , 0. , 0. , ..., 0. , 0. , 0.6])
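
Step 2 calls for a LogisticRegression, whereas the cell above fits a k-NN regressor. The following is a minimal sketch of the logistic-regression variant, assuming scikit-learn is installed. Synthetic data from `make_classification` stands in for `card_transdata.csv` so the example runs offline; the 91/9 class weights, the `stratify=y_demo` split, the `StandardScaler` step, and `max_iter=1000` are illustrative assumptions, not choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fraud data: about 9% positives (assumption).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.91, 0.09], random_state=0,
)

# stratify keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Scaling helps the solver converge; max_iter is raised as a precaution.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# predict returns class labels directly, so no manual threshold is needed.
y_hat = model.predict(X_te)
demo_recall = recall_score(y_te, y_hat)
print("Recall on the synthetic test set:", demo_recall)
```

On the real data, the same pipeline could be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test`.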

In [56]:
# y_pred holds the fraction of fraudulent neighbours; threshold at 0.5 to get binary labels
y_pred_binary = (y_pred >= 0.5).astype(int)

# Now calculate accuracy with the binary predictions
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred_binary)
print("Accuracy:", accuracy)

from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred_binary, average='binary')  # 'binary' for binary classification
print("Recall:", recall)



Accuracy: 0.9831166666666666
Recall: 0.9241942283371019
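
The cell above reports accuracy and recall. For fraud detection, precision and F1 are worth reading alongside recall, because a model can reach high recall simply by flagging many legitimate transactions. The sketch below derives all four metrics from the confusion-matrix counts on a tiny hand-made example. The labels are made up for illustration and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legit passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Ten made-up transactions: three frauds, seven legit.
demo_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
demo_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

demo_metrics = binary_metrics(demo_true, demo_pred)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.8 even though one fraud in three is missed, which is the gap that recall makes visible.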


In [78]:
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [82]:
print(len(X_train), len(y_train))

700000 700000


In [95]:
# Repeat the experiment with random oversampling of the training set

ros = RandomOverSampler(random_state=0)
X_train_oversampled, y_train_oversampled = ros.fit_resample(X_train, y_train)
print(len(X_train_oversampled), len(y_train_oversampled))

knn = KNeighborsRegressor(n_neighbors=5) 
knn.fit(X_train_oversampled, y_train_oversampled) 
y_pred = knn.predict(X_test)

y_pred_binary_over = (y_pred >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred_binary_over)
print("Accuracy:", accuracy)

recall = recall_score(y_test, y_pred_binary_over, average='binary')  # 'binary' for binary classification
print("Recall:", recall)

1277380 1277380
Accuracy: 0.9757166666666667
Recall: 0.9849001647951557
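
The 1,277,380 rows printed above are consistent with both classes being brought to the size of the majority class, 638,690 rows each. The sketch below shows the idea behind random oversampling in plain Python: minority rows are duplicated at random, with replacement, until the classes match. `random_oversample` is a hypothetical helper written for illustration; it is not imblearn's implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the majority."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        if label == majority_label:
            continue
        pool = [r for r, l in zip(rows, labels) if l == label]
        # Sampling with replacement: the same row may be copied several times.
        extra = rng.choices(pool, k=majority_count - count)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data: five legit rows and a single fraud row.
toy_rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
toy_labels = [0, 0, 0, 0, 0, 1]

over_rows, over_labels = random_oversample(toy_rows, toy_labels)
print(Counter(over_labels))
```

Because the added rows are exact copies, oversampling should be applied to the training split only, as the notebook does, so that duplicates never leak into the test set.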


In [101]:
rus = RandomUnderSampler(random_state=0)
X_train_undersampled, y_train_undersampled = rus.fit_resample(X_train, y_train)
print(len(X_train_undersampled), len(y_train_undersampled))

knn = KNeighborsRegressor(n_neighbors=5) 
knn.fit(X_train_undersampled, y_train_undersampled) 
y_pred = knn.predict(X_test)

y_pred_binary_under = (y_pred >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred_binary_under)
print("Accuracy:", accuracy)

recall = recall_score(y_test, y_pred_binary_under, average='binary')  # 'binary' for binary classification
print("Recall:", recall)

122620 122620
Accuracy: 0.9399466666666667
Recall: 0.9977005327099222
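
The 122,620 rows kept above correspond to 61,310 rows per class, so about 82% of the 700,000 training rows are discarded. The sketch below shows the idea behind random undersampling in plain Python: every class is cut down, without replacement, to the size of the smallest one. `random_undersample` is a hypothetical helper for illustration, not imblearn's implementation.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority_count = min(counts.values())

    keep = []
    for label in counts:
        idx = [i for i, l in enumerate(labels) if l == label]
        # Sampling without replacement: each kept row appears once.
        keep.extend(rng.sample(idx, minority_count))
    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: six legit rows and two fraud rows.
toy_rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [9.0], [9.5]]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]

under_rows, under_labels = random_undersample(toy_rows, toy_labels)
print(Counter(under_labels))
```

The trade-off is visible in the results above: recall rises, but accuracy falls, which suggests more legitimate transactions are being flagged.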


In [102]:
smote = SMOTE(random_state=0)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print(len(X_train_smote), len(y_train_smote))

knn = KNeighborsRegressor(n_neighbors=5) 
knn.fit(X_train_smote, y_train_smote) 
y_pred = knn.predict(X_test)

y_pred_binary_smote = (y_pred >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred_binary_smote)
print("Accuracy:", accuracy)

recall = recall_score(y_test, y_pred_binary_smote, average='binary')  # 'binary' for binary classification
print("Recall:", recall)

1277380 1277380
Accuracy: 0.97502
Recall: 0.9875828766335799
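
Unlike random oversampling, SMOTE does not copy rows. It creates new minority points by interpolating between a minority row and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step in plain Python. `smote_like_samples`, the choice of `k=2`, and the toy points are assumptions made for the example; this is a simplified illustration, not imblearn's implementation.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of a unit square.
minority_points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]]
new_points = smote_like_samples(minority_points, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point lies on a segment between two existing minority points, so the new rows stay inside the region the minority class already occupies.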
