# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [3]:
# 1) Target distribution and imbalance check
import pandas as pd
import numpy as np

assert 'fraud' in fraud.columns, "Column 'fraud' not found in the dataset."
target_counts = fraud['fraud'].value_counts(dropna=False)
target_ratio = fraud['fraud'].value_counts(normalize=True).rename('ratio')
display(pd.concat([target_counts, target_ratio], axis=1).rename(columns={'fraud':'count'}))

print("\nImbalanced dataset? Tip: if the minority class is < 10-15% → significant imbalance.")

| fraud | count | ratio |
| --- | --- | --- |
| 0.0 | 912597 | 0.912597 |
| 1.0 | 87403 | 0.087403 |

Imbalanced dataset? Tip: if the minority class is < 10-15% → significant imbalance.
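
To put a number on step 1, the counts printed above can be turned into a minority share and an imbalance ratio. This is a minimal sketch that hard-codes the two counts from that output instead of reloading the dataset; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above (0 = legit, 1 = fraud).
counts = {0: 912_597, 1: 87_403}

total = sum(counts.values())
minority_share = counts[1] / total
imbalance_ratio = counts[0] / counts[1]

# Accuracy of a trivial model that labels every transaction as legit.
always_legit_accuracy = counts[0] / total

print(f"minority share:        {minority_share:.4f}")
print(f"imbalance ratio:       {imbalance_ratio:.2f} : 1")
print(f"always-legit accuracy: {always_legit_accuracy:.4f}")
```

With roughly ten legit transactions for every fraud, a model that never flags fraud would already score about 91% accuracy, which is why accuracy alone is a poor metric for step 3.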


In [4]:
# 2) Split + baseline Pipeline with Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Features (everything except the target)
X = fraud.drop(columns=['fraud']).copy()
y = fraud['fraud'].astype(int)

# Numeric columns (this dataset is numeric/bool; select numeric dtypes just in case)
num_cols = X.select_dtypes(include=['int64','float64','bool']).columns.tolist()
# Cast bool to int
for c in X.select_dtypes(include=['bool']).columns:
    X[c] = X[c].astype(int)

preprocess = ColumnTransformer(
    transformers=[('num', StandardScaler(), num_cols)],
    remainder='drop'
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

base_clf = LogisticRegression(max_iter=200, n_jobs=None, solver='lbfgs')  # no class_weight for now
pipe_base = Pipeline(steps=[('prep', preprocess), ('clf', base_clf)])
pipe_base.fit(X_train, y_train)


Pipeline: steps=[('prep', ...), ('clf', ...)], transform_input=None, memory=None, verbose=False
ColumnTransformer: transformers=[('num', ...)], remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True, force_int_remainder_cols='deprecated'
StandardScaler: copy=True, with_mean=True, with_std=True
LogisticRegression: penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=200


In [5]:
# 3) Evaluation with metrics suited to imbalanced data
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score, precision_recall_fscore_support

def evaluate_model(model, X_te, y_te, name="model"):
    y_pred = model.predict(X_te)
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_te)[:,1]
    else:
        # Fallback for models without predict_proba
        try:
            y_proba = model.decision_function(X_te)
        except AttributeError:
            y_proba = None

    print(f"=== {name} ===")
    print("Confusion matrix:")
    print(confusion_matrix(y_te, y_pred))
    print("\nClassification report:")
    print(classification_report(y_te, y_pred, digits=3))

    # Metrics focused on the positive class (fraud=1)
    prec, rec, f1, _ = precision_recall_fscore_support(y_te, y_pred, average='binary', pos_label=1, zero_division=0)
    print(f"Precision: {prec:.3f} | Recall: {rec:.3f} | F1: {f1:.3f}")
    if y_proba is not None:
        try:
            roc = roc_auc_score(y_te, y_proba)
            pr  = average_precision_score(y_te, y_proba)
            print(f"ROC-AUC: {roc:.3f} | PR-AUC: {pr:.3f}")
        except Exception as e:
            print("Could not compute ROC/PR AUC:", e)
    print()
    return {'precision': prec, 'recall': rec, 'f1': f1, 'model': name}

metrics_summary = []
metrics_summary.append(evaluate_model(pipe_base, X_test, y_test, name="Baseline LogReg"))


=== Baseline LogReg ===
Confusion matrix:
[[181296   1223]
 [  6895  10586]]

Classification report:
              precision    recall  f1-score   support

           0      0.963     0.993     0.978    182519
           1      0.896     0.606     0.723     17481

    accuracy                          0.959    200000
   macro avg      0.930     0.799     0.850    200000
weighted avg      0.958     0.959     0.956    200000

Precision: 0.896 | Recall: 0.606 | F1: 0.723
ROC-AUC: 0.967 | PR-AUC: 0.807
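
The fraud-class metrics in the report can be recomputed by hand from the confusion matrix, which helps show what each number means. The sketch below copies the four cells from the baseline output above and relies on scikit-learn's layout for binary problems, `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions); the variable names are illustrative.

```python
# Cells copied from the baseline confusion matrix above.
tn, fp = 181_296, 1_223
fn, tp = 6_895, 10_586

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many really were
recall = tp / (tp + fn)      # of the real frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# → precision=0.896 recall=0.606 f1=0.723 accuracy=0.959
```

The 0.959 accuracy hides the 6,895 missed frauds. Recall on class 1 makes them visible: about 39% of fraudulent transactions go undetected by the baseline.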



In [6]:
# 4) Oversampling (RandomOverSampler) with the imblearn Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression

oversample = RandomOverSampler(random_state=42)
logreg_os = LogisticRegression(max_iter=200, solver='lbfgs')

pipe_over = ImbPipeline(steps=[('prep', preprocess),
                               ('over', oversample),
                               ('clf', logreg_os)])

pipe_over.fit(X_train, y_train)
metrics_summary.append(evaluate_model(pipe_over, X_test, y_test, name="LogReg + Oversampling"))


=== LogReg + Oversampling ===
Confusion matrix:
[[170390  12129]
 [   911  16570]]

Classification report:
              precision    recall  f1-score   support

           0      0.995     0.934     0.963    182519
           1      0.577     0.948     0.718     17481

    accuracy                          0.935    200000
   macro avg      0.786     0.941     0.840    200000
weighted avg      0.958     0.935     0.942    200000

Precision: 0.577 | Recall: 0.948 | F1: 0.718
ROC-AUC: 0.980 | PR-AUC: 0.757
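
To make the idea behind random oversampling concrete, here is a minimal pure-Python sketch on toy data. It is not imblearn's implementation: the helper `random_oversample` and the toy rows are hypothetical, and it only illustrates the principle of duplicating randomly chosen minority rows until the classes are the same size.

```python
import random

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sampling with replacement: the same minority row can be picked several times.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical toy data: 6 legit rows and 2 fraud rows.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [50.0], [60.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(new_labels.count(0), new_labels.count(1))  # → 6 6
```

No new information is created: every added fraud row is an exact copy of an existing one, so the model simply sees the minority class more often during training.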



In [7]:
# 5) Undersampling (RandomUnderSampler)
from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler(random_state=42)

pipe_under = ImbPipeline(steps=[('prep', preprocess),
                                ('under', undersample),
                                ('clf', LogisticRegression(max_iter=200, solver='lbfgs'))])

pipe_under.fit(X_train, y_train)
metrics_summary.append(evaluate_model(pipe_under, X_test, y_test, name="LogReg + Undersampling"))


=== LogReg + Undersampling ===
Confusion matrix:
[[170394  12125]
 [   918  16563]]

Classification report:
              precision    recall  f1-score   support

           0      0.995     0.934     0.963    182519
           1      0.577     0.947     0.717     17481

    accuracy                          0.935    200000
   macro avg      0.786     0.941     0.840    200000
weighted avg      0.958     0.935     0.942    200000

Precision: 0.577 | Recall: 0.947 | F1: 0.717
ROC-AUC: 0.980 | PR-AUC: 0.757
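
It can help to compare how much training data each strategy leaves the model with. The sketch below derives the training-split class counts from numbers already printed above (full-dataset counts minus the test-split support) and assumes both samplers balance the classes 1:1, which is their default behaviour for a binary target. The variable names are illustrative.

```python
# Counts copied from the outputs above: full-dataset class counts and the
# per-class "support" of the 20% stratified test split.
full_counts = {"legit": 912_597, "fraud": 87_403}
test_support = {"legit": 182_519, "fraud": 17_481}

# The training split holds whatever is not in the test split.
train = {c: full_counts[c] - test_support[c] for c in full_counts}

# Assumption: classes are balanced 1:1 after resampling.
oversampled_rows = 2 * train["legit"]                # fraud rows duplicated up to the legit count
undersampled_rows = 2 * train["fraud"]               # legit rows cut down to the fraud count
discarded_legit = train["legit"] - train["fraud"]    # legit rows dropped by undersampling

print(train)               # → {'legit': 730078, 'fraud': 69922}
print(oversampled_rows)    # → 1460156
print(undersampled_rows)   # → 139844
print(discarded_legit)     # → 660156
```

Undersampling trains on roughly a tenth of the rows that oversampling uses, yet the two reports above are nearly identical. Note that an imblearn pipeline applies the sampler only during `fit`, so the test set keeps its original class distribution in both cases.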



In [9]:
# 6) SMOTE (Synthetic Minority Over-sampling Technique)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
pipe_smote = ImbPipeline(steps=[('prep', preprocess),
                                ('smote', smote),
                                ('clf', LogisticRegression(max_iter=200, solver='lbfgs'))])

pipe_smote.fit(X_train, y_train)
metrics_summary.append(evaluate_model(pipe_smote, X_test, y_test, name="LogReg + SMOTE"))


=== LogReg + SMOTE ===
Confusion matrix:
[[170386  12133]
 [   907  16574]]

Classification report:
              precision    recall  f1-score   support

           0      0.995     0.934     0.963    182519
           1      0.577     0.948     0.718     17481

    accuracy                          0.935    200000
   macro avg      0.786     0.941     0.840    200000
weighted avg      0.958     0.935     0.942    200000

Precision: 0.577 | Recall: 0.948 | F1: 0.718
ROC-AUC: 0.980 | PR-AUC: 0.757
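
Unlike random oversampling, SMOTE creates new minority rows by interpolating between an existing minority row and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that interpolation step only, not imblearn's implementation: the helper `smote_like_point`, the choice `k=2`, and the toy rows are all hypothetical.

```python
import math
import random

def smote_like_point(minority, k=2, seed=42):
    """Return (synthetic, base, neighbour): one point interpolated between a
    random minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    nearest = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(nearest)
    gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbour)]
    return synthetic, base, neighbour

# Hypothetical, already-scaled fraud rows with two features each.
fraud_rows = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [8.0, 9.0]]

synthetic, base, neighbour = smote_like_point(fraud_rows)
print("base:", base, "neighbour:", neighbour, "synthetic:", synthetic)
```

Because the synthetic point lies on the segment between two real fraud rows, binary features such as `used_chip` can end up with fractional values. That is one reason to treat plain SMOTE on this dataset as an approximation.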



In [10]:
# Quick comparison focused on F1/Recall (fraud=1)
import pandas as pd
summary_df = pd.DataFrame(metrics_summary).set_index('model').sort_values('f1', ascending=False)
display(summary_df)
print("\nRecommendation: in fraud detection, RECALL (catching most frauds) is usually prioritized even at the cost of more false positives. Adjust the threshold if needed.")


| model | precision | recall | f1 |
| --- | --- | --- | --- |
| Baseline LogReg | 0.896435 | 0.605572 | 0.722841 |
| LogReg + SMOTE | 0.57735 | 0.948115 | 0.717676 |
| LogReg + Oversampling | 0.577372 | 0.947886 | 0.717627 |
| LogReg + Undersampling | 0.577349 | 0.947486 | 0.717494 |

Recommendation: in fraud detection, RECALL (catching most frauds) is usually prioritized even at the cost of more false positives. Adjust the threshold if needed.
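
The closing recommendation mentions adjusting the decision threshold. As a rough illustration of what that means, the sketch below recomputes precision and recall at several cut-offs. The labels and scores are hypothetical toy values, not outputs of the models above, and `precision_recall_at` is an illustrative helper rather than a library function.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall for the positive class when predicting 1 if score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical toy labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.60, 0.70, 0.90]

for threshold in (0.5, 0.4, 0.3):
    p, r = precision_recall_at(y_true, y_score, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# → threshold=0.5  precision=0.67  recall=0.67
# → threshold=0.4  precision=0.60  recall=1.00
# → threshold=0.3  precision=0.50  recall=1.00
```

In this lab the same idea would apply to `pipe_base.predict_proba(X_test)[:, 1]`: lowering the cut-off below the default 0.5 trades precision for recall without resampling the training data. A threshold chosen this way should be tuned on a validation split rather than on the test set.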
