# Credit Card Fraud Detection - Exploratory Data Analysis

## 📄 About the Dataset

This project uses a **simulated credit card transaction dataset** provided on Kaggle. The full dataset covers transactions from January 1st, 2019 to December 31st, 2020; the `fraudTrain.csv` file used here contains 1,296,675 of them. The transactions were generated with a simulation tool called **Sparkov**, which creates realistic behavioral patterns based on profiles such as age, gender, and location.

The goal is to analyze the dataset and identify key patterns that distinguish **fraudulent transactions (`is_fraud = 1`)** from **legitimate ones (`is_fraud = 0`)**.

You can find the dataset [here](https://www.kaggle.com/datasets/kartik2112/fraud-detection?select=fraudTrain.csv).

### 💡 Dataset Details:

The dataset includes:

- **Demographic information** (e.g., name, gender, job, DOB)
- **Transaction details** (amount, category, merchant)
- **Geolocation** (latitude, longitude)
- **Temporal features** (transaction timestamp)
- **Label**: `is_fraud` (target class)

---

## ⚠️ Note:

We are using only the `fraudTrain.csv` file as our base dataset. This file contains both fraud and legitimate transactions. We'll later perform our own **train-test split** to ensure consistent preprocessing and evaluation.


### 🧠 Credit Card Fraud Detection – Exploratory Data Analysis (EDA)
This notebook focuses on exploring a simulated credit card transactions dataset to detect patterns and insights related to fraudulent behavior. We aim to understand the data through statistical summaries, visualizations, and correlations before building any models.

- **Source**: Kaggle Dataset - Credit Card Fraud Detection
- **Dataset used**: `fraudTrain.csv`
- **Objective**: Understand the structure and patterns in the data to prepare it for modeling fraudulent vs. legitimate transactions (`is_fraud` target class).


In [14]:
# Basic libraries
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

### 🗂️ Load and View the Dataset
We begin by loading the fraudTrain.csv file to examine its structure and check for basic information like number of rows, columns, and data types.

In [15]:
# Load the dataset
df = pd.read_csv(r"C:\Users\aswin\Downloads\fraudTrain.csv")

# Display the shape and first few rows
print("Shape of dataset:", df.shape)
df.head()


Shape of dataset: (1296675, 23)


,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


### 🔍 Initial Overview
Let's use `.info()` to inspect the dataset’s structure: column names, non-null counts, and data types. Summary statistics from `.describe()` follow in a later cell.

In [16]:
# Data info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

### 🧹 Drop Unnecessary Columns
The `Unnamed: 0` column simply duplicates the row index and adds no analytical value, so we remove it to keep the dataset clean.

In [17]:
# Drop the 'Unnamed: 0' column
df.drop(columns=['Unnamed: 0'], inplace=True)

# Confirm the new shape
print("New shape of dataset:", df.shape)


New shape of dataset: (1296675, 22)


### 🔍 Check for Duplicate Rows
It’s important to identify and remove any duplicate transactions that might skew the analysis or model training.

In [18]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Optionally, drop duplicates if any found
if duplicates > 0:
    df.drop_duplicates(inplace=True)
    print(f"Duplicates removed. New shape: {df.shape}")
else:
    print("No duplicates found.")


Number of duplicate rows: 0
No duplicates found.


In [19]:
# Convert 'trans_date_trans_time' and 'dob' to datetime
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
df['dob'] = pd.to_datetime(df['dob'])

# Confirm the changes
print(df[['trans_date_trans_time', 'dob']].dtypes)


trans_date_trans_time    datetime64[ns]
dob                      datetime64[ns]
dtype: object


In [20]:
df.head()

,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [21]:
# Extracting features from transaction datetime
df['trans_year'] = df['trans_date_trans_time'].dt.year
df['trans_month'] = df['trans_date_trans_time'].dt.month
df['trans_day'] = df['trans_date_trans_time'].dt.day
df['trans_hour'] = df['trans_date_trans_time'].dt.hour
df['trans_day_of_week'] = df['trans_date_trans_time'].dt.dayofweek  # Monday=0, Sunday=6


In [22]:
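# Calculate approximate age (whole years) at the time of transaction;
# dividing by 365 ignores leap days, so values can be off by one near birthdays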
df['age'] = (df['trans_date_trans_time'] - df['dob']).dt.days // 365


### Check for Missing Values
We need to inspect whether any columns have missing (NaN) values. This helps us decide if imputation or column dropping is needed.

In [23]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]

if missing_values.empty:
    print("✅ No missing values found.")
else:
    print("❌ Columns with missing values:\n")
    print(missing_values)


✅ No missing values found.


### Basic Statistical Summary & Class Balance

Before diving into visualizations, let’s:

- Get a statistical summary of the numerical columns.
- Check the balance of our target class (`is_fraud`) to see whether the dataset is imbalanced.

In [24]:
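# Statistical summary of numerical features
# (not rendered here: a notebook displays only the last expression of a cell;
# the summary table appears in the later df.describe() cell)
df.describe()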

# Distribution of target variable
fraud_counts = df['is_fraud'].value_counts()
fraud_percent = df['is_fraud'].value_counts(normalize=True) * 100

print("Fraud Class Distribution:")
print(fraud_counts)
print("\nPercentage Distribution:")
print(fraud_percent)


Fraud Class Distribution:
0    1289169
1       7506
Name: is_fraud, dtype: int64

Percentage Distribution:
0    99.421135
1     0.578865
Name: is_fraud, dtype: float64


Class 0 (Not Fraud): ~99.42%

Class 1 (Fraud): ~0.58%

This imbalance will strongly influence model performance. If we don't address it, the model might just predict “Not Fraud” every time and still get high accuracy — but it will fail to detect actual frauds.
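
To make the point concrete, here is a minimal sketch of that "always predict Not Fraud" baseline. It rebuilds a label vector from the class counts printed above; the names `y_always_legit`, `baseline_accuracy`, and `baseline_recall` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the value_counts() output above
n_legit, n_fraud = 1_289_169, 7_506

# Rebuild a label vector with the same class balance
y_true = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

# Hypothetical "model" that always predicts the majority class (0 = not fraud)
y_always_legit = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_legit)
baseline_recall = recall_score(y_true, y_always_legit)  # recall for class 1 (fraud)

print(f"Accuracy:     {baseline_accuracy:.4f}")
print(f"Fraud recall: {baseline_recall:.4f}")
```

The accuracy lands at roughly 99.4% while fraud recall is zero, which is why recall, precision, and ROC AUC are more informative than accuracy for this dataset.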

### Check Cardinality of Categorical Columns
Next, let's look at how many unique values exist in the categorical columns (like merchant, job, category, etc.). This helps us understand which ones are useful or too sparse.

In [25]:
# Select object (categorical) columns
cat_cols = df.select_dtypes(include='object').columns

# Unique values in each categorical column
for col in cat_cols:
    print(f"{col}: {df[col].nunique()} unique values")


merchant: 693 unique values
category: 14 unique values
first: 352 unique values
last: 481 unique values
gender: 2 unique values
street: 983 unique values
city: 894 unique values
state: 51 unique values
job: 494 unique values
trans_num: 1296675 unique values


In [26]:
df.describe()


,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud,trans_year,trans_month,trans_day,trans_hour,trans_day_of_week,age
count,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0
mean,4.17192e+17,70.35104,48800.67,38.53762,-90.22634,88824.44,1349244000.0,38.53734,-90.22646,0.005788652,2019.287,6.14215,15.58798,12.80486,3.070604,45.52822
std,1.308806e+18,160.316,26893.22,5.075808,13.75908,301956.4,12841280.0,5.109788,13.77109,0.07586269,0.4522452,3.417703,8.829121,6.817824,2.198153,17.40895
min,60416210000.0,1.0,1257.0,20.0271,-165.6723,23.0,1325376000.0,19.02779,-166.6712,0.0,2019.0,1.0,1.0,0.0,0.0,13.0
25%,180042900000000.0,9.65,26237.0,34.6205,-96.798,743.0,1338751000.0,34.73357,-96.89728,0.0,2019.0,3.0,8.0,7.0,1.0,32.0
50%,3521417000000000.0,47.52,48174.0,39.3543,-87.4769,2456.0,1349250000.0,39.36568,-87.43839,0.0,2019.0,6.0,15.0,14.0,3.0,44.0
75%,4642255000000000.0,83.14,72042.0,41.9404,-80.158,20328.0,1359385000.0,41.95716,-80.2368,0.0,2020.0,9.0,23.0,19.0,5.0,57.0
max,4.992346e+18,28948.9,99783.0,66.6933,-67.9503,2906700.0,1371817000.0,67.51027,-66.9509,1.0,2020.0,12.0,31.0,23.0,6.0,95.0


In [27]:
df.drop(['unix_time', 'trans_num', 'first', 'last', 'street'], axis=1, inplace=True)


### Calculate Distance Between Customer and Merchant
We create a new feature, `distance`, using the Haversine formula to compute the great-circle distance in kilometers between the cardholder's location (`lat`, `long`) and the merchant's location (`merch_lat`, `merch_long`).

In [28]:
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    phi1 = np.radians(lat1)
    phi2 = np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)

    a = np.sin(delta_phi / 2.0)**2 + \
        np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# Apply the function
df['distance'] = haversine(df['lat'], df['long'], df['merch_lat'], df['merch_long'])
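
A quick sanity check of the formula can catch unit or argument-order mistakes. The sketch below re-declares the same computation under the hypothetical name `haversine_km` so that it runs on its own, then evaluates it on the coordinates of the first row from the earlier `df.head()` output and on a one-degree step in latitude, which should come out at roughly 111 km.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    R = 6371  # mean Earth radius in kilometers
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2
    return R * 2 * np.arcsin(np.sqrt(a))

# Cardholder vs. merchant coordinates of the first transaction shown earlier
d_first = haversine_km(36.0788, -81.1781, 36.011293, -82.048315)

# Reference case: one degree of latitude along a meridian
d_one_deg = haversine_km(0.0, 0.0, 1.0, 0.0)

print(f"First transaction: {d_first:.1f} km")
print(f"One degree of latitude: {d_one_deg:.1f} km")
```

Because the function uses only NumPy element-wise operations, it works unchanged on whole pandas columns, which is how the notebook applies it.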


### Drop Latitude/Longitude Columns After Distance
After calculating distance, the raw coordinates become redundant:

In [29]:
df.drop(['lat', 'long', 'merch_lat', 'merch_long'], axis=1, inplace=True)


In [30]:
df.head()

,trans_date_trans_time,cc_num,merchant,category,amt,gender,city,state,zip,city_pop,job,dob,is_fraud,trans_year,trans_month,trans_day,trans_hour,trans_day_of_week,age,distance
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,F,Moravian Falls,NC,28654,3495,"Psychologist, counselling",1988-03-09,0,2019,1,1,0,1,30,78.597568
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,F,Orient,WA,99160,149,Special educational needs teacher,1978-06-21,0,2019,1,1,0,1,40,30.212176
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,M,Malad City,ID,83252,4154,Nature conservation officer,1962-01-19,0,2019,1,1,0,1,56,108.206083
3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,M,Boulder,MT,59632,1939,Patent attorney,1967-01-12,0,2019,1,1,0,1,52,95.673231
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,M,Doe Hill,VA,24433,99,Dance movement psychotherapist,1986-03-28,0,2019,1,1,0,1,32,77.556744


### Encoding Categorical Variables

The remaining categorical columns, as seen in the `df.head()` output above, are:

- `merchant`: many unique values (693); dropped here, could be encoded later
- `category`: 14 categories, a good fit for one-hot encoding
- `gender`: binary, so a simple 0/1 mapping is enough
- `city`, `state`: high cardinality; dropped for now, or encode with care
- `job`: high cardinality (494 values); dropped for now, or target encode later
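
If a high-cardinality column such as `job` is kept rather than dropped, one simple option is frequency encoding, a lighter alternative to the target encoding mentioned above. This is a minimal sketch on a toy frame; the data and the names `toy`, `job_freq`, and `new_rows` are made up for illustration, and the notebook itself drops these columns.

```python
import pandas as pd

# Toy frame standing in for a high-cardinality column such as `job`
toy = pd.DataFrame({"job": ["Patent attorney", "Nurse", "Nurse", "Pilot", "Nurse", "Pilot"]})

# Relative frequency of each level (in practice, computed on the training split only)
job_freq = toy["job"].value_counts(normalize=True)

# Replace each label with its frequency
toy["job_freq"] = toy["job"].map(job_freq)

# Labels never seen when the frequencies were computed map to NaN; fill them with 0
new_rows = pd.Series(["Nurse", "Astronaut"]).map(job_freq).fillna(0.0)

print(toy)
print(new_rows.tolist())
```

This keeps a single numeric column per feature instead of hundreds of one-hot columns, at the cost of merging levels that happen to share the same frequency.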

In [31]:
# Label Encode gender
df['gender'] = df['gender'].map({'F': 0, 'M': 1})

# One-hot encode category
df = pd.get_dummies(df, columns=['category'], drop_first=True)

df.drop(['merchant', 'city', 'state', 'job'], axis=1, inplace=True)

df.head()

,trans_date_trans_time,cc_num,amt,gender,zip,city_pop,dob,is_fraud,trans_year,trans_month,...,category_grocery_pos,category_health_fitness,category_home,category_kids_pets,category_misc_net,category_misc_pos,category_personal_care,category_shopping_net,category_shopping_pos,category_travel
0,2019-01-01 00:00:18,2703186189652095,4.97,0,28654,3495,1988-03-09,0,2019,1,...,0,0,0,0,1,0,0,0,0,0
1,2019-01-01 00:00:44,630423337322,107.23,0,99160,149,1978-06-21,0,2019,1,...,1,0,0,0,0,0,0,0,0,0
2,2019-01-01 00:00:51,38859492057661,220.11,1,83252,4154,1962-01-19,0,2019,1,...,0,0,0,0,0,0,0,0,0,0
3,2019-01-01 00:01:16,3534093764340240,45.0,1,59632,1939,1967-01-12,0,2019,1,...,0,0,0,0,0,0,0,0,0,0
4,2019-01-01 00:03:06,375534208663984,41.96,1,24433,99,1986-03-28,0,2019,1,...,0,0,0,0,0,1,0,0,0,0


### Feature Scaling
Before training, we scale the numerical features so that magnitude-sensitive models such as logistic regression or KNN are not dominated by features with large ranges. Tree-based models such as random forests and XGBoost are largely insensitive to this scaling.

In [32]:
from sklearn.preprocessing import StandardScaler

# Select numerical columns to scale
num_cols = ['amt', 'city_pop', 'age', 'distance', 'trans_hour']  # Add other numerical features if present

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
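
The cell above fits the scaler on the full dataset before the train-test split, so the test rows contribute to the mean and standard deviation. A stricter variant, sketched here on synthetic data, splits first and fits the scaler on the training rows only. The arrays and names such as `X_demo` and `scaler_demo` are placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for two skewed numeric features (e.g. amount, population)
X_demo = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 2))
# Synthetic imbalanced labels, roughly 5% positives
y_demo = (rng.random(1000) < 0.05).astype(int)

# Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # the same statistics reused on test rows

print("Train mean after scaling:", X_tr_scaled.mean(axis=0).round(3))
print("Test mean after scaling: ", X_te_scaled.mean(axis=0).round(3))
```

The test means are close to zero but not exactly zero, which is expected: the test set is transformed with statistics it did not help to compute.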


In [33]:
print(df.columns.tolist())


['trans_date_trans_time', 'cc_num', 'amt', 'gender', 'zip', 'city_pop', 'dob', 'is_fraud', 'trans_year', 'trans_month', 'trans_day', 'trans_hour', 'trans_day_of_week', 'age', 'distance', 'category_food_dining', 'category_gas_transport', 'category_grocery_net', 'category_grocery_pos', 'category_health_fitness', 'category_home', 'category_kids_pets', 'category_misc_net', 'category_misc_pos', 'category_personal_care', 'category_shopping_net', 'category_shopping_pos', 'category_travel']


In [34]:
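# Correlation of each numeric feature with the target
# (numeric_only=True skips the datetime columns; requires pandas >= 1.5)
corr_with_target = df.corr(numeric_only=True)['is_fraud'].sort_values(ascending=False)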
print(corr_with_target)


is_fraud                   1.000000
amt                        0.219404
category_shopping_net      0.044261
category_grocery_pos       0.035558
category_misc_net          0.025886
trans_hour                 0.013799
age                        0.012244
gender                     0.007642
category_shopping_pos      0.005955
trans_day                  0.003848
trans_year                 0.003004
city_pop                   0.002136
trans_day_of_week          0.001739
distance                   0.000403
cc_num                    -0.000981
zip                       -0.002162
category_gas_transport    -0.004851
category_travel           -0.006924
category_grocery_net      -0.007136
category_misc_pos         -0.008937
category_personal_care    -0.012167
trans_month               -0.012409
category_health_fitness   -0.014885
category_kids_pets        -0.014967
category_food_dining      -0.015025
category_home             -0.017848
Name: is_fraud, dtype: float64


In [35]:
features_to_keep = [
    'amt',
    'age',
    'gender',
    'trans_hour',
    'trans_day_of_week',
    'category_grocery_pos',
    'category_shopping_net',
    'category_misc_net'
]


In [36]:
df_reduced = df[features_to_keep + ['is_fraud']].copy()  # Keep the target column as well; copy to detach from df


In [37]:
print(df_reduced.head())
df_reduced.info()  # info() prints its report directly and returns None


        amt       age  gender  trans_hour  trans_day_of_week  \
0 -0.407826 -0.891968       0   -1.878145                  1   
1  0.230039 -0.317551       0   -1.878145                  1   
2  0.934149  0.601517       1   -1.878145                  1   
3 -0.158132  0.371750       1   -1.878145                  1   
4 -0.177094 -0.777085       1   -1.878145                  1   

   category_grocery_pos  category_shopping_net  category_misc_net  is_fraud  
0                     0                      0                  1         0  
1                     1                      0                  0         0  
2                     0                      0                  0         0  
3                     0                      0                  0         0  
4                     0                      0                  0         0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 9 columns):
 #   Column                 Non-Nul

In [38]:
from sklearn.model_selection import train_test_split

X = df_reduced.drop('is_fraud', axis=1)
y = df_reduced['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Initialize model
model = LogisticRegression(max_iter=1000)

# Train model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate
print(classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob))


              precision    recall  f1-score   support

           0       0.99      1.00      1.00    257834
           1       0.00      0.00      0.00      1501

    accuracy                           0.99    259335
   macro avg       0.50      0.50      0.50    259335
weighted avg       0.99      0.99      0.99    259335

ROC AUC Score: 0.8061825767005618


In [40]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# X_train, y_train, X_test, y_test come from the train-test split above

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to training data
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", y_train.value_counts())
print("After SMOTE:", pd.Series(y_train_res).value_counts())

# Train logistic regression on the resampled training data
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_res, y_train_res)

# Predict on test data
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate model performance
print(classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob))


Before SMOTE: 0    1031335
1       6005
Name: is_fraud, dtype: int64
After SMOTE: 0    1031335
1    1031335
Name: is_fraud, dtype: int64
              precision    recall  f1-score   support

           0       1.00      0.94      0.97    257834
           1       0.07      0.76      0.12      1501

    accuracy                           0.94    259335
   macro avg       0.53      0.85      0.55    259335
weighted avg       0.99      0.94      0.96    259335

ROC AUC Score: 0.8489611505870691


In [43]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Logistic Regression (already done)
print("Logistic Regression Results:")
model_lr = LogisticRegression(max_iter=1000, random_state=42)
model_lr.fit(X_train_res, y_train_res)
y_pred_lr = model_lr.predict(X_test)
y_prob_lr = model_lr.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred_lr))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob_lr))

# Random Forest
print("\nRandom Forest Results:")
model_rf = RandomForestClassifier(random_state=42, n_jobs=-1)
model_rf.fit(X_train_res, y_train_res)
y_pred_rf = model_rf.predict(X_test)
y_prob_rf = model_rf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred_rf))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob_rf))

# XGBoost
print("\nXGBoost Results:")
model_xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
model_xgb.fit(X_train_res, y_train_res)
y_pred_xgb = model_xgb.predict(X_test)
y_prob_xgb = model_xgb.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred_xgb))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob_xgb))


Logistic Regression Results:
              precision    recall  f1-score   support

           0       1.00      0.94      0.97    257834
           1       0.07      0.76      0.12      1501

    accuracy                           0.94    259335
   macro avg       0.53      0.85      0.55    259335
weighted avg       0.99      0.94      0.96    259335

ROC AUC Score: 0.8489611505870691

Random Forest Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257834
           1       0.48      0.77      0.59      1501

    accuracy                           0.99    259335
   macro avg       0.74      0.88      0.80    259335
weighted avg       1.00      0.99      0.99    259335

ROC AUC Score: 0.9708688536034813

XGBoost Results:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00    257834
           1       0.39      0.82      0.53      1501

    accuracy                           0.99

In [45]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, roc_auc_score

# Parameter values to sample from; only n_iter=10 random combinations are tried
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1],
    'colsample_bytree': [0.7, 0.8, 1]
}

xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_dist,
    n_iter=10,   # fewer iterations
    cv=2,        # fewer folds
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

# Note: no early stopping is configured; every candidate trains its full n_estimators on each fold
random_search.fit(X_train_res, y_train_res)

print("Best params:", random_search.best_params_)

y_pred = random_search.predict(X_test)
y_prob = random_search.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob))


Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best params: {'subsample': 0.8, 'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.2, 'colsample_bytree': 1}
              precision    recall  f1-score   support

           0       1.00      0.99      1.00    257834
           1       0.40      0.83      0.54      1501

    accuracy                           0.99    259335
   macro avg       0.70      0.91      0.77    259335
weighted avg       1.00      0.99      0.99    259335

ROC AUC Score: 0.9881731859898578


✅ ROC AUC Score: 0.988. The tuned model separates fraud from non-fraud well across thresholds.

✅ Recall for class 1 (fraud): 0.83. The model catches 83% of the actual frauds in the test set, which is the priority in fraud detection.

✅ F1-score for fraud: 0.54. Precision is 0.40, so about 60% of flagged transactions are false alarms; this is acceptable given the class imbalance, but leaves room for improvement.

✅ Accuracy: 99%. High, but expected because the dataset is mostly non-fraud, so it says little on its own.
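
The report above is based on the default 0.5 probability threshold. The balance between precision and recall can be shifted by choosing a different threshold from the precision-recall curve. Below is a minimal sketch on synthetic data, assuming a hypothetical requirement of at least 80% recall; the classifier, data, and names such as `target_recall` and `chosen_threshold` are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives)
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# precision and recall have one more entry than thresholds (the final point is (1, 0))
precision, recall, thresholds = precision_recall_curve(y_te, prob)

target_recall = 0.80  # hypothetical business requirement
ok = recall[:-1] >= target_recall

# Among thresholds that still reach the target recall, pick the most precise one
best = np.argmax(np.where(ok, precision[:-1], -1.0))
chosen_threshold = thresholds[best]

y_pred_custom = (prob >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_te, y_pred_custom)

print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Recall at threshold: {achieved_recall:.3f}, precision: {precision[best]:.3f}")
print(f"Average precision (PR AUC): {average_precision_score(y_te, prob):.3f}")
```

In practice the threshold should be chosen on a validation split rather than on the final test set, so that the reported test metrics stay unbiased.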

In [49]:
import joblib
import os

# Define the path to the models directory
model_dir = r"F:\Aswin\01 epita\Projects\Data science Portfolio Projects\Card_Guard\Card_Guard\models"
os.makedirs(model_dir, exist_ok=True)  # Ensure the folder exists

# Full path to save the model
model_path = os.path.join(model_dir, "xgboost_fraud_model.pkl")

# Save the model
joblib.dump(random_search.best_estimator_, model_path)

print(f"Model saved to: {model_path}")


Model saved to: F:\Aswin\01 epita\Projects\Data science Portfolio Projects\Card_Guard\Card_Guard\models\xgboost_fraud_model.pkl


Load the saved model and use it for prediction:



In [50]:
import joblib

# Path to the saved model
model_path = r"F:\Aswin\01 epita\Projects\Data science Portfolio Projects\Card_Guard\Card_Guard\models\xgboost_fraud_model.pkl"

# Load the model
loaded_model = joblib.load(model_path)

# Make predictions (example)
y_loaded_pred = loaded_model.predict(X_test)
y_loaded_prob = loaded_model.predict_proba(X_test)[:, 1]

# Evaluate to confirm it's working
from sklearn.metrics import classification_report, roc_auc_score

print("Loaded Model Evaluation:")
print(classification_report(y_test, y_loaded_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_loaded_prob))


Loaded Model Evaluation:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00    257834
           1       0.40      0.83      0.54      1501

    accuracy                           0.99    259335
   macro avg       0.70      0.91      0.77    259335
weighted avg       1.00      0.99      0.99    259335

ROC AUC Score: 0.9881731859898578
