### **Model Development**

### **Project Setup**

In [1]:
# Import libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import missingno as msno
import kagglehub # type: ignore

# Data Preprocessing
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Training and Evaluations
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Hyper-parameter Tuning and Cross Validation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, cross_validate

# Handling Imbalanced Data
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline, make_pipeline 

# Import machine learning models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from xgboost import XGBClassifier

In [4]:
# Read the dataset
df = pd.read_csv("credit_card_fraud.csv")
df.head()

| | category | transaction_amount | gender | state | job | is_fraud | age | transaction_location | transaction_hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | misc_net | 4.97 | F | NC | Psychologist, counselling | 0 | 31 | -81.1781, 36.0788 | 0 |
| 1 | grocery_pos | 107.23 | F | WA | Special educational needs teacher | 0 | 41 | -118.2105, 48.8878 | 0 |
| 2 | entertainment | 220.11 | M | ID | Nature conservation officer | 0 | 57 | -112.262, 42.1808 | 0 |
| 3 | gas_transport | 45.0 | M | MT | Patent attorney | 0 | 52 | -112.1138, 46.2306 | 0 |
| 4 | misc_pos | 41.96 | M | VA | Dance movement psychotherapist | 0 | 33 | -79.4629, 38.4207 | 0 |


In [5]:
# Dataset information 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1852394 entries, 0 to 1852393
Data columns (total 9 columns):
 #   Column                Dtype  
---  ------                -----  
 0   category              object 
 1   transaction_amount    float64
 2   gender                object 
 3   state                 object 
 4   job                   object 
 5   is_fraud              int64  
 6   age                   int64  
 7   transaction_location  object 
 8   transaction_hour      int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 127.2+ MB


In [None]:
# Descriptive statistics on numerical features
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| transaction_amount | 1852394.0 | 70.063567 | 159.253975 | 1.0 | 9.64 | 47.45 | 83.1 | 28948.9 |
| is_fraud | 1852394.0 | 0.00521 | 0.071992 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| age | 1852394.0 | 46.21138 | 17.395446 | 14.0 | 33.0 | 44.0 | 57.0 | 96.0 |
| transaction_hour | 1852394.0 | 12.806119 | 6.815753 | 0.0 | 7.0 | 14.0 | 19.0 | 23.0 |


In [7]:
# Descriptive statistics on categorical features
df.describe(include='object').T

| | count | unique | top | freq |
|---|---|---|---|---|
| category | 1852394 | 14 | gas_transport | 188029 |
| gender | 1852394 | 2 | F | 1014749 |
| state | 1852394 | 51 | TX | 135269 |
| job | 1852394 | 497 | Film/video editor | 13898 |
| transaction_location | 1852394 | 985 | -108.8964, 43.0048 | 5116 |


### **Feature Engineering**

**1. Time-based Features**

- Transaction frequency features:

    - transactions_last_1h (number of transactions by the same user in the last hour)

    - transactions_last_24h (daily transaction rate)

    - time_since_last_transaction (seconds/minutes since the user’s previous transaction)

- Time-of-day indicators:

    - is_nighttime (e.g., 12 AM – 5 AM local time)

    - is_weekend (flag for Saturday/Sunday, to test whether fraud risk is higher on weekends)
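
The time-based features above could be derived roughly as follows. This is a minimal sketch, not the project's implementation: it assumes the data carries a per-user identifier and a full transaction timestamp in local time, here given the hypothetical column names `user_id` and `timestamp`. The DataFrame loaded earlier exposes only `transaction_hour`, so those two columns would have to come from the raw data. The rolling counts exclude the current transaction, which is a design choice rather than something the feature list specifies.

```python
import pandas as pd


def add_time_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Add per-user frequency features and time-of-day flags.

    Assumes hypothetical `user_id` and `timestamp` columns, which the
    DataFrame loaded above does not contain.
    """
    out = tx.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    # Sort so each user's transactions are contiguous and chronological.
    out = out.sort_values(["user_id", "timestamp"])

    # Seconds since the same user's previous transaction (NaN for the first).
    out["time_since_last_transaction"] = (
        out.groupby("user_id")["timestamp"].diff().dt.total_seconds()
    )

    # Time-based rolling windows need a DatetimeIndex; summing a column of
    # ones counts the rows in each window, including the current row.
    ones = out.assign(_one=1.0).set_index("timestamp").groupby("user_id")["_one"]
    for name, window in (("transactions_last_1h", "1h"),
                         ("transactions_last_24h", "24h")):
        counts = ones.rolling(window).sum().to_numpy().round().astype(int)
        # Subtract 1 so the count covers prior transactions only.
        out[name] = counts - 1

    # Hours 0-4 cover 12 AM to 5 AM; dayofweek 5 and 6 are Saturday and Sunday.
    out["is_nighttime"] = out["timestamp"].dt.hour.between(0, 4).astype(int)
    out["is_weekend"] = (out["timestamp"].dt.dayofweek >= 5).astype(int)

    return out.sort_index()  # restore the caller's row order
```

If only `transaction_hour` is available, `is_nighttime` can still be built from it with `df["transaction_hour"].between(0, 4)`; the frequency features and `is_weekend` need the full timestamp.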

**2. Behavioral Features (User-level)**

- Spending behavior:

    - user_avg_amount (mean transaction amount per user)

    - amount_deviation (current_amount / user_avg_amount)

    - user_median_amount (robust against outliers)

- Transaction patterns:

    - user_transaction_freq (transactions per day/week)

    - user_unique_merchants (number of distinct merchants a user transacts with)
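
One way to build the user-level features above is to fit a per-user profile on the training split and join it onto any set of transactions. The sketch below assumes hypothetical `user_id`, `merchant`, and `timestamp` columns (none of which appear in the DataFrame loaded earlier), and the function names `fit_user_profile` and `add_behavioral_features` are illustrative. Fitting on training data only is a precaution against leaking test-set information into the features.

```python
import pandas as pd


def fit_user_profile(train: pd.DataFrame) -> pd.DataFrame:
    """Summarize each user's spending and activity on the training split.

    Assumes hypothetical `user_id`, `merchant`, and `timestamp` columns.
    """
    train = train.assign(timestamp=pd.to_datetime(train["timestamp"]))
    g = train.groupby("user_id")

    # Observed activity span in days, floored at one day so users with a
    # single transaction (or a very short history) do not divide by ~0.
    span = g["timestamp"].max() - g["timestamp"].min()
    span_days = (span.dt.total_seconds() / 86400).clip(lower=1.0)

    profile = pd.DataFrame({
        "user_avg_amount": g["transaction_amount"].mean(),
        "user_median_amount": g["transaction_amount"].median(),
        "user_unique_merchants": g["merchant"].nunique(),
        "user_transaction_freq": g.size() / span_days,  # transactions per day
    })
    return profile.reset_index()


def add_behavioral_features(tx: pd.DataFrame, profile: pd.DataFrame) -> pd.DataFrame:
    """Join the fitted profile onto transactions and add amount_deviation."""
    out = tx.merge(profile, on="user_id", how="left")
    # Ratio of the current amount to the user's mean; NaN for unseen users.
    out["amount_deviation"] = out["transaction_amount"] / out["user_avg_amount"]
    return out
```

Users that never appear in the training split end up with missing profile values, so a later step would need to impute them or let a tree-based model handle the gaps.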

**3. Merchant-based Features**

- Merchant fraud risk: 

    - merchant_fraud_rate (historical fraud rate per merchant, using target encoding fitted on the training data only to avoid label leakage)

    - merchant_transaction_volume (total transactions per merchant)

- Merchant category:

    - is_high_risk_merchant_category 
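
The merchant features above might be computed along these lines. The sketch assumes a hypothetical `merchant` column, which the DataFrame loaded earlier does not have, alongside the existing `category` and `is_fraud` columns. The smoothing toward the global fraud rate (controlled by the illustrative parameter `prior_weight`) is an added assumption to keep low-volume merchants from getting extreme rates; the list above only says "target encoding". Which categories count as high risk is not stated in this notebook, so the caller supplies them.

```python
import pandas as pd


def fit_merchant_stats(train: pd.DataFrame, prior_weight: float = 20.0):
    """Fit smoothed per-merchant fraud rates on the training split only.

    Assumes a hypothetical `merchant` column. `prior_weight` acts like a
    number of pseudo-transactions at the global fraud rate.
    """
    global_rate = float(train["is_fraud"].mean())
    g = train.groupby("merchant")["is_fraud"]
    stats = pd.DataFrame({
        "merchant_transaction_volume": g.count(),
        "fraud_count": g.sum(),
    })
    # Smoothed target encoding: shrink each merchant toward the global rate.
    stats["merchant_fraud_rate"] = (
        (stats["fraud_count"] + prior_weight * global_rate)
        / (stats["merchant_transaction_volume"] + prior_weight)
    )
    return stats.drop(columns="fraud_count").reset_index(), global_rate


def add_merchant_features(tx, stats, global_rate, high_risk_categories):
    """Join merchant stats and flag caller-chosen high-risk categories."""
    out = tx.merge(stats, on="merchant", how="left")
    # Merchants unseen in training fall back to the global rate and zero volume.
    out["merchant_fraud_rate"] = out["merchant_fraud_rate"].fillna(global_rate)
    out["merchant_transaction_volume"] = (
        out["merchant_transaction_volume"].fillna(0).astype(int)
    )
    out["is_high_risk_merchant_category"] = (
        out["category"].isin(list(high_risk_categories)).astype(int)
    )
    return out
```

Under cross-validation, the statistics would need to be refitted inside each fold so that a transaction's own label never contributes to its encoded value.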

### **Data Preprocessing**

### **Model Training & Evaluation**