---

<center>

<h1><b>Fraud Detection Project</b></h1>
<h3>Universidad Carlos III de Madrid · Bluetab</h3>

<p><em>Development of a predictive system using Machine Learning to identify fraudulent transactions and strengthen financial security.</em></p>

</center>

---

### Notebook Overview

This notebook is the second part of the *Bluetab–UC3M Fraud Detection Project*.  
Its main goal is to perform adequate feature engineering and address the class imbalance within the merged dataset to provide an insightful and robust dataset for the machine learning models to be applied.  

Throughout this notebook, we:
- Address the class imbalance through various approaches (SMOTE, no SMOTE)
- Perform feature engineering

---

# **4. Data Preprocessing & Class Balancing**

In this section, we start the preprocessing stage of the project. After importing the merged dataset, we address the strong class imbalance identified during the EDA using SMOTE (Synthetic Minority Oversampling Technique). We then apply feature engineering to extract more information from our available variables.

In [7]:
import sys
import time
import warnings
import chardet
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl
import os
from IPython.display import display

from datetime import datetime, timedelta
from itertools import combinations
from pytz import timezone

from tqdm.auto import tqdm

from pvlib.location import Location
from scipy.stats import pearsonr


from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler, RobustScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import xgboost as xgb
import lightgbm as lgbm
from catboost import CatBoostRegressor, CatBoostClassifier

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTENC

np.random.seed(123)

## Prepare training and test sets 

In [8]:
# Load the merged dataset produced by EDA.ipynb
df = pd.read_csv('content/merged.csv')

# Separate Features (X) and Target (y)
X = df.drop('Class', axis=1)
y = df['Class']

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (286068, 45)
Shape of y: (286068,)


Identify the type of each variable (numerical or categorical): SMOTE cannot handle categorical variables, and SMOTENC needs to know which columns are categorical so that it can apply a special distance metric to them.

In [None]:
# Separate between numerical and categorical features
cat_cols = list(X.select_dtypes(include=['object']).columns)
num_cols = list(X.select_dtypes(include=['float64', 'int64']).columns)

print(cat_cols)
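
To illustrate why plain SMOTE is not suitable for categorical columns, the following minimal sketch applies SMOTE-style linear interpolation to integer-coded categories. The category names and codes are made up for illustration and do not come from the dataset.

```python
# Hypothetical integer codes for a categorical feature (illustration only)
codes = {"ES": 0, "FR": 1, "US": 2}

def interpolate(a, b, gap):
    """SMOTE-style linear interpolation between two feature values."""
    return a + gap * (b - a)

# Halfway between "ES" (0) and "US" (2) lands on 1.0, which is the code of "FR",
# a category that neither of the two samples has.
halfway = interpolate(codes["ES"], codes["US"], 0.5)

# Other gaps give values that are not valid codes at all.
quarter = interpolate(codes["ES"], codes["US"], 0.25)

print(halfway)                      # 1.0
print(quarter in codes.values())    # False
```

Either outcome is meaningless for a nominal variable, which is why the categorical columns are flagged separately before resampling.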

We remove 'transaction_id', 'customer_id', 'device_id', 'ip_address', 'email', 'phone' and 'name' because they are unique identifiers that only add noise to the model.

In [None]:
# Removed id columns
id_cols = [
    'transaction_id', 'customer_id', 'device_id',
    'ip_address', 'email', 'phone', 'name'
]
X = X.drop(columns=id_cols, errors='ignore')

Transform join_date into a useful attribute that is compatible with the models: the number of days since the customer registered, measured from the most recent join date in the dataset.

In [None]:
X['join_date'] = pd.to_datetime(X['join_date'], errors='coerce')                    # Convert to datetime
X['customer_days_since_join'] = (X['join_date'].max() - X['join_date']).dt.days     # Calculate days since join
X = X.drop(columns=['join_date'])
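
As a quick check of this transformation, here is a minimal sketch on three made-up join dates, one of them invalid. It assumes the same reference point as the cell above (the latest join date in the data).

```python
import pandas as pd

# Toy join dates (made up); the invalid string becomes NaT with errors='coerce'
toy = pd.DataFrame({"join_date": ["2023-01-01", "2023-01-31", "not a date"]})
toy["join_date"] = pd.to_datetime(toy["join_date"], errors="coerce")

# Reference point: the most recent join date in the data
reference = toy["join_date"].max()
toy["customer_days_since_join"] = (reference - toy["join_date"]).dt.days

print(toy["customer_days_since_join"].tolist())
```

The unparseable date ends up as a missing value in the new column, so rows like this will likely need to be imputed or dropped before resampling.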

Reduce the cardinality of the categorical features by collapsing the rarest categories into "Other". The number of categories kept per feature is set with 'top'. Without this step, SMOTENC runs into memory problems because it cannot handle so many levels.

In [None]:
top = 20    # Number of most frequent categories to keep per column
def rare_values(series, top=top):
    top_vals = series.value_counts().nlargest(top).index        # Get the top N most frequent values
    return series.where(series.isin(top_vals), 'Other')         # Replace rare values with 'Other'

In [None]:
for col in ['merchant_country', 'customer_country', 'city', 'merchant', 'zip_code']: # TODO: do this automatically instead of passing the list of variables
    if X[col].nunique() > top:                                          # Only apply if there are more than 'top' unique values
        X[col] = rare_values(X[col], top=top)                           # Replace rare values
        print(f"Reduced categories in {col} to top {top} + 'Other'")
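
The column list above is passed by hand. One possible way to automate the selection is sketched below on toy data, assuming that every object column with more than `top` distinct values should be bucketed. `bucket_rare_categories` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def bucket_rare_categories(frame, top=20, other_label="Other"):
    """Collapse all but the `top` most frequent values of every
    high-cardinality object column into `other_label`.
    Returns a new DataFrame; the input is left untouched."""
    out = frame.copy()
    for col in out.select_dtypes(include="object").columns:
        if out[col].nunique() > top:
            keep = out[col].value_counts().nlargest(top).index
            out[col] = out[col].where(out[col].isin(keep), other_label)
    return out

# Toy data (made up): 'city' has 4 levels, 'channel' only 2
toy = pd.DataFrame({
    "city": ["Madrid", "Madrid", "Madrid", "Paris", "Paris", "Rome", "Oslo"],
    "channel": ["web", "app", "web", "app", "web", "app", "web"],
    "Amount": [10.0, 20.0, 15.0, 30.0, 5.0, 8.0, 12.0],
})

bucketed = bucket_rare_categories(toy, top=2)
print(bucketed["city"].tolist())
```

With `top=2`, only 'city' exceeds the threshold, so 'channel' and the numerical column pass through unchanged.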

In [None]:
# Categorical cardinality after bucketing (BEFORE encoding)

def plot_categorical_cardinality(dfX, title="Categorical Cardinality (after bucketing)", top_used=20):
    # detect categories
    cat_cols = list(dfX.select_dtypes(include='object').columns)
    if not cat_cols:
        print("No remaining object (categorical) columns. Did you encode already?")
        return
    
    # cardinality and order
    card = (dfX[cat_cols]
            .nunique(dropna=False)                # if there is Unknown/NaN, count it as a category
            .sort_values(ascending=True)
            .to_frame('n_categories'))
    
    # plot
    sns.set_theme(style="whitegrid")
    h = max(3, 0.40 * len(card))                  
    fig, ax = plt.subplots(figsize=(9, h))
    
    sns.barplot(
        data=card.reset_index(),
        x='n_categories', y='index',
        orient='h', ax=ax, palette="viridis"
    )
    
    # titles and labels
    ax.set_title(f"{title}\n(Top-{top_used} rare bucketing applied)", fontsize=13, weight="bold")
    ax.set_xlabel("Number of distinct categories")
    ax.set_ylabel("Feature")
    
    # number of categories on the bars
    for i, (feat, ncat) in enumerate(card['n_categories'].items()):
        ax.text(ncat + max(card['n_categories']) * 0.01, i, f"{int(ncat)}",
                va="center", ha="left", fontsize=10, weight="bold", color="#2f2f2f")
    
    sns.despine(ax=ax, top=True, right=True)
    plt.tight_layout()
    plt.show()
    
    # also display the cardinality table
    display(card.sort_values('n_categories', ascending=False))

plot_categorical_cardinality(X, title="Categorical Cardinality (after Top-N + 'Other')", top_used=top)


## SMOTE

SMOTENC balances datasets that contain both numerical and categorical variables. For **numerical features** it interpolates between a sample and one of its k nearest neighbors (knn), and for **categorical features** it assigns the most frequent category among those neighbors. This way, new synthetic minority samples are generated following the variables' distribution, without losing the mixed data structure.
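
The two rules described above can be sketched for a single synthetic sample as follows. This is a simplified illustration on made-up values: the neighbors are given directly rather than found with SMOTENC's distance metric, and it is not the imbalanced-learn implementation.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)

# One minority sample and its k=3 nearest minority neighbours (toy values).
# Numerical part: (amount, hour). The categorical feature is kept separately.
sample_num = np.array([100.0, 14.0])
neigh_num = np.array([[120.0, 15.0], [90.0, 13.0], [110.0, 16.0]])
neigh_cat = ["online", "pos", "online"]

# Numerical features: pick one neighbour and move a random fraction towards it
j = rng.integers(len(neigh_num))
gap = rng.random()
synthetic_num = sample_num + gap * (neigh_num[j] - sample_num)

# Categorical feature: take the most frequent category among the neighbours
synthetic_cat = Counter(neigh_cat).most_common(1)[0][0]

print(synthetic_num, synthetic_cat)
```

The numerical values always lie on the segment between the sample and the chosen neighbor, while the categorical value is always a category that already exists in the data.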

In [None]:
# PARAMETERS
knn = 3     # Number of nearest neighbors for SMOTE

In [None]:
# Recalculate the categorical columns and their indices after dropping columns
cat_cols = list(X.select_dtypes(include='object').columns)
cat_idx = [X.columns.get_loc(c) for c in cat_cols]

Encode the categories as integers for SMOTENC (the codes carry no real order), mapping any category that appears only in the test set to -1 to avoid errors. Then apply SMOTENC to the training data.

In [None]:
# Split into train and test sets first, so the encoder never sees the test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_test = X_train.copy(), X_test.copy()                                 # Work on copies, not on slices of X

# Encode categories as integers                                      # TODO: wrap this in a transformer or a function; maybe use one-hot encoding for some of the columns
encode = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)   # Unknown categories in the test set become -1
X_train[cat_cols] = encode.fit_transform(X_train[cat_cols])                     # Fit on the training columns only
X_test[cat_cols] = encode.transform(X_test[cat_cols])                           # Reuse the fitted mapping on the test set

# Apply SMOTENC to the training data
print("Before SMOTENC:", Counter(y_train))

smotenc = SMOTENC(categorical_features=cat_idx, random_state=42, k_neighbors=knn)
X_train_res, y_train_res = smotenc.fit_resample(X_train, y_train)

print("After SMOTENC:", Counter(y_train_res))
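
The -1 fallback only comes into play when the encoder is fitted on one split and then applied to data it has not seen. A minimal sketch on toy data (the country values are made up) shows the behavior:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy splits: "JP" appears in the test set but never in the training set
train = pd.DataFrame({"merchant_country": ["ES", "FR", "ES", "Other"]})
test = pd.DataFrame({"merchant_country": ["FR", "JP"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train)   # learn the category -> integer mapping
test_codes = encoder.transform(test)         # reuse it; unseen values map to -1

print(encoder.categories_[0].tolist())   # ['ES', 'FR', 'Other']
print(test_codes.ravel().tolist())       # [1.0, -1.0]
```

If the encoder is fitted on the full dataset instead, every category is known in advance and the fallback is never used.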

In [None]:
# Class balance before/after (train)
sns.set_theme(style="whitegrid")

def _plot_counts_with_pct(ax, s, title):
    counts = s.value_counts().sort_index()
    total = counts.sum()
    bars = sns.barplot(x=counts.index.astype(str), y=counts.values, ax=ax, palette="viridis")
    ax.set_title(title, fontsize=12, weight="bold")
    ax.set_xlabel("Class"); ax.set_ylabel("Count")
    for i, v in enumerate(counts.values):
        pct = 100.0 * v / total
        ax.text(i, v, f"{v:,}\n({pct:.1f}%)", ha="center", va="bottom", fontsize=10, weight="bold")
    sns.despine(ax=ax)

fig, ax = plt.subplots(1, 2, figsize=(11, 4.5))
_plot_counts_with_pct(ax[0], y_train, "Before SMOTENC (Train)")
_plot_counts_with_pct(ax[1], y_train_res, "After SMOTENC (Train)")
plt.tight_layout()
plt.show()


## Feature Engineering

The purpose of feature engineering is to clean and transform our current variables into more usable and more predictive inputs for our machine learning models.

Section 1 proposed several features that can be added to the merged dataset simply by grouping it in different ways:

- country_transaction_count and city_transaction_count

- avg_city_transactions_per_country

- merchant_transaction_count and merchant_avg_amount

- customer_transaction_count and customer_avg_amount

- ip_per_customer and ip_per_device

- customer_age_group

- The variable join_date can be transformed into join_year and customer_tenure
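
Most of the features listed above are per-group statistics that have to end up on every transaction row. A minimal sketch on made-up transactions (using the same column names as the merged dataset) contrasts a plain aggregation with `transform`, which returns one value per original row:

```python
import pandas as pd

# Toy transactions (made-up values)
toy = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id": ["a", "a", "b", "b", "b"],
    "ip_address": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "3.3.3.3", "3.3.3.3"],
    "Amount": [10.0, 30.0, 5.0, 15.0, 10.0],
})

# Aggregation: one row per customer; a merge would be needed to get back to transactions
avg_per_customer = toy.groupby("customer_id")["Amount"].mean()

# transform: one value per transaction, aligned with the original index
toy["customer_transaction_count"] = toy.groupby("customer_id")["transaction_id"].transform("count")
toy["customer_avg_amount"] = toy.groupby("customer_id")["Amount"].transform("mean")
toy["ip_per_customer"] = toy.groupby("customer_id")["ip_address"].transform("nunique")

print(avg_per_customer.to_dict())                   # {'a': 20.0, 'b': 10.0}
print(toy["customer_transaction_count"].tolist())   # [2, 2, 3, 3, 3]
print(toy["ip_per_customer"].tolist())              # [2, 2, 1, 1, 1]
```

Because the transformed result keeps the original index, it can be assigned as a new column directly, without a separate merge step.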


In [None]:
# Testing with a subset (first 1000 rows)
df = df.iloc[:1000, :].copy()
print(df.columns)

Applying the feature engineering with pandas groupby and transforms.

In [None]:
# Calculate the transaction count for each customer country
df['country_transaction_count'] = (
    df.groupby('customer_country')['transaction_id'].transform('count')
)

# Calculate the average number of transactions per city within each customer country
# First, calculate transaction count per city
df['city_transaction_count'] = (
    df.groupby(['customer_country', 'city'])['transaction_id'].transform('count')
)

# Then, average these counts over all rows of each country
# (a transaction-weighted mean: cities with more transactions contribute more rows)
df['avg_city_transactions_per_country'] = (
    df.groupby('customer_country')['city_transaction_count'].transform('mean')
)

# Calculate the transaction count for each merchant
df['merchant_transaction_count'] = (
    df.groupby('merchant')['transaction_id'].transform('count')
)

# Calculate the average transaction amount for each merchant
df['merchant_avg_amount'] = (
    df.groupby('merchant')['Amount'].transform('mean')
)

# Calculate the transaction count for each customer
df['customer_transaction_count'] = (
    df.groupby('customer_id')['transaction_id'].transform('count')
)

# Calculate the average transaction amount for each customer
df['customer_avg_amount'] = (
    df.groupby('customer_id')['Amount'].transform('mean')
)

# Count the distinct IP addresses used by each customer
df['ip_per_customer'] = (
    df.groupby('customer_id')['ip_address'].transform('nunique')
)

# Count the distinct IP addresses seen on each device
df['ip_per_device'] = (
    df.groupby('device_id')['ip_address'].transform('nunique')
)

# Separate the customer age into different brackets
bins = [0, 25, 35, 45, 60, 100]
labels = ['<25', '25-35', '35-45', '45-60', '60+']
df['age_bracket'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

# Transform customer join_date into join_year and customer_tenure
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
df['join_year'] = df['join_date'].dt.year
df['customer_tenure'] = (pd.Timestamp('now') - df['join_date']).dt.days // 365
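
Two details of the last steps are easy to miss: how `pd.cut` treats bin edges with `right=False`, and the fact that `pd.Timestamp('now')` gives a different tenure on every run. The sketch below checks both on made-up values. The fixed `reference` date is a hypothetical snapshot date chosen for illustration, not a value from the project.

```python
import pandas as pd

bins = [0, 25, 35, 45, 60, 100]
labels = ['<25', '25-35', '35-45', '45-60', '60+']

# right=False makes every bin closed on the left: [0, 25), [25, 35), ...
ages = pd.Series([18, 25, 34, 35, 60, 99, 100])
brackets = pd.cut(ages, bins=bins, labels=labels, right=False)

print(brackets.iloc[:6].astype(str).tolist())
# ['<25', '25-35', '25-35', '35-45', '60+', '60+']
print(brackets.isna().iloc[6])   # True: age 100 falls outside [60, 100)

# Tenure in whole years against a fixed reference date (assumption),
# so the feature does not change between runs
join = pd.to_datetime(pd.Series(["2020-01-01", "2022-06-15"]))
reference = pd.Timestamp("2024-01-01")
tenure_years = (reference - join).dt.days // 365

print(tenure_years.tolist())     # [4, 1]
```

An age exactly on a boundary falls into the upper bracket, and any age of 100 or more is left unassigned unless the last bin edge is raised.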