# Introduction

Credit card transactions are a cornerstone of everyday payments because they are quick and convenient. That convenience comes with a growing risk of credit card fraud, which causes significant financial losses for individuals and institutions and erodes trust in digital payment systems.

Manual detection of fraudulent transactions is impractical due to the sheer volume of transactions and the complexity of fraudulent patterns. Machine Learning (ML) offers a solution by automatically identifying anomalous or suspicious transactions from historical data. ML models can learn subtle patterns that distinguish genuine transactions from fraudulent ones, enabling real-time fraud detection.

This project focuses on building a robust machine learning system capable of classifying credit card transactions as genuine or fraudulent. The study involves data preprocessing, normalization, handling class imbalance, model training, and performance evaluation using metrics suited for highly imbalanced datasets, such as Precision, Recall, F1-score, and the Area Under the Precision-Recall Curve (AUPRC).

# Problem Statement

Credit card fraud is a rare but critical issue in financial transactions. In the dataset considered, fraudulent transactions account for only 0.172% of total transactions, making the dataset highly imbalanced. This poses challenges for traditional classification models, because accuracy alone is misleading: a model that labels every transaction as genuine would be about 99.83% accurate while detecting no fraud at all.
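The effect of this imbalance on accuracy can be checked with a few lines of arithmetic. The sketch below uses an illustrative batch of 100,000 transactions at the 0.172% fraud rate quoted above (the batch size is an assumption chosen for round numbers, not the size of the real dataset) and scores a trivial "model" that never flags anything.

```python
# Illustrative numbers only: a hypothetical batch of 100,000 transactions
# at the 0.172% fraud rate quoted in the text.
n_total = 100_000
n_fraud = 172
n_genuine = n_total - n_fraud

# A trivial "model" that labels every transaction as genuine.
true_negatives = n_genuine   # every genuine transaction is labeled correctly
false_negatives = n_fraud    # every fraud is missed
true_positives = 0
false_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.5f}")
print(f"Recall on fraud: {recall:.5f}")
```

The accuracy looks excellent while recall on the fraud class is zero, which is why the rest of the notebook reports Precision, Recall, F1-score, and AUPRC instead.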

The problem is to develop a machine learning model that can reliably identify fraudulent transactions while minimizing missed fraud cases (false negatives) and avoiding unnecessary false alarms (false positives). The main challenges include:

- Class imbalance – the number of fraud transactions is extremely low compared to genuine ones.

- Feature scaling – variables such as Amount and Time require normalization for effective modeling.

- Evaluation of performance – models must be evaluated with metrics suitable for imbalanced datasets, including Precision, Recall, F1-score, and AUPRC, rather than simple accuracy.

The goal is to create an efficient and accurate fraud detection system that can be integrated into real-world financial monitoring systems to reduce losses and enhance trust.
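As a reference for the metrics named above, here is a minimal sketch of how Precision, Recall, and F1-score follow from confusion-matrix counts. The helper name `precision_recall_f1` and the example counts are hypothetical, chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    tp: frauds correctly flagged, fp: genuine transactions wrongly flagged,
    fn: frauds that were missed. Returns 0.0 for a metric whose denominator is 0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 frauds caught, 20 false alarms, 20 frauds missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"Precision={p:.2f}  Recall={r:.2f}  F1={f1:.2f}")
```

None of these three metrics uses the true-negative count, so the very large number of genuine transactions cannot inflate them the way it inflates accuracy.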

# Objectives

- Preprocess and normalize transaction data, including scaling of non-PCA features (Time and Amount).

- Handle class imbalance using techniques such as oversampling (SMOTE), undersampling, or class weighting.

- Train and compare machine learning models (e.g., Logistic Regression, Random Forest, XGBoost) to classify transactions.

- Evaluate model performance using metrics suitable for imbalanced datasets: Precision, Recall, F1-score, and AUPRC.

- Visualize results and provide insights into model effectiveness for real-world fraud detection.

- Recommend strategies to improve detection and reduce false positives and false negatives.
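Of the imbalance techniques listed in the objectives, class weighting is the easiest to show in isolation. The sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`, on a small hypothetical label vector (the 9,980 / 20 split is an assumption for illustration).

```python
import numpy as np

# Hypothetical labels: 9,980 genuine (0) and 20 fraudulent (1) transactions.
y_demo = np.array([0] * 9_980 + [1] * 20)

counts = np.bincount(y_demo)                    # samples per class
weights = len(y_demo) / (len(counts) * counts)  # "balanced" class weights

for label, weight in enumerate(weights):
    print(f"class {label}: count={counts[label]}, weight={weight:.3f}")
```

Each class ends up with the same total weight, so errors on the rare fraud class cost the model as much overall as errors on the genuine class.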

# Load Dataset

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')

# Exploring Data

In [None]:
print(data.info())

In [None]:
print(data.describe())

In [None]:
print(data['Class'].value_counts())

In [None]:
# Count of each class
class_counts = data['Class'].value_counts()
fraud_percentage = (class_counts[1] / class_counts.sum()) * 100

print(f"Total transactions: {class_counts.sum()}")
print(f"Fraudulent transactions: {class_counts[1]}")
print(f"Percentage of fraud transactions: {fraud_percentage:.3f}%")

In [None]:
import matplotlib.pyplot as plt
# Plot class distribution
plt.figure(figsize=(6,4))
class_counts.plot(kind='bar', color=['skyblue', 'salmon'])
plt.xticks([0,1], ['Genuine (0)', 'Fraud (1)'], rotation=0)
plt.title('Class Distribution of Transactions')
plt.ylabel('Number of Transactions')

# Annotate percentages
for i, count in enumerate(class_counts):
    plt.text(i, count + 3000, f"{(count/class_counts.sum()*100):.3f}%", ha='center')

plt.show()

# Data Preprocessing

In [None]:
from sklearn.preprocessing import StandardScaler

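# Standardize the two raw columns (zero mean, unit variance).
# Note: each scaler is fit on the full dataset, i.e. before the train/test split,
# so the test rows contribute to the scaling statistics.
data['scaled_amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
data['scaled_time'] = StandardScaler().fit_transform(data['Time'].values.reshape(-1,1))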

# Drop original 'Amount' and 'Time'
data.drop(['Amount', 'Time'], axis=1, inplace=True)

# Data Preparation

In [None]:
X = data.drop('Class', axis=1)
y = data['Class']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
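The preprocessing cell fits `StandardScaler` on all rows before this split, so the test set influences the scaling statistics. A stricter variant splits first and fits the scaler on the training rows only. The sketch below shows that order on a small synthetic frame; the values in `demo` are randomly generated stand-ins (an assumption), and only the column names match the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the real data (hypothetical values, real column names).
demo = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, size=1_000),
    "Amount": rng.exponential(scale=90.0, size=1_000),
    "Class": (rng.random(1_000) < 0.05).astype(int),
})

# 1) Split first, stratifying on the label.
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Class"]
)
train_df, test_df = train_df.copy(), test_df.copy()

# 2) Fit the scaler on the training rows only, then apply it to both splits.
cols = ["Amount", "Time"]
scaler = StandardScaler()
train_df[cols] = scaler.fit_transform(train_df[cols])
test_df[cols] = scaler.transform(test_df[cols])

print(train_df[cols].mean().round(6))
print(test_df[cols].mean().round(3))
```

The training columns come out with zero mean by construction, while the test columns are only approximately centered, which is the expected behavior when the test set is treated as unseen data.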

# Handling Class Imbalance

In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Class counts after resampling (SMOTE is applied to the training split only)
print("Class counts after SMOTE:")
print(pd.Series(y_train_res).value_counts())
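To make the resampling step less of a black box, the sketch below illustrates the core idea behind SMOTE: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority neighbors. This is a simplified illustration, not the `imblearn` implementation; the helper `smote_like_sample` and the random `minority` points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
minority = rng.normal(size=(20, 2))


def smote_like_sample(points, k, rng):
    """Create one synthetic point by interpolating toward a near neighbor."""
    x = points[rng.integers(len(points))]
    dists = np.linalg.norm(points - x, axis=1)
    neighbor_ids = np.argsort(dists)[1:k + 1]  # skip index 0: the point itself
    neighbor = points[rng.choice(neighbor_ids)]
    gap = rng.random()                         # position along the segment, in [0, 1)
    return x + gap * (neighbor - x)


synthetic = np.array([smote_like_sample(minority, k=5, rng=rng) for _ in range(10)])
print(synthetic.shape)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.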

# Model Training

## Model 1 - Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
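# The training data is already balanced by SMOTE, so class_weight='balanced'
# assigns equal weights here and has no additional effect.
lr = LogisticRegression(max_iter=1000, class_weight='balanced')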
lr.fit(X_train_res, y_train_res)

## Model 2 - Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50, random_state=42, class_weight='balanced')
rf.fit(X_train_res, y_train_res)

In [None]:
from sklearn.metrics import classification_report, precision_recall_curve, auc
import matplotlib.pyplot as plt

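# Evaluate the Random Forest on the held-out test set, which was never resampled
y_pred = rf.predict(X_test)              # hard class labels
y_prob = rf.predict_proba(X_test)[:,1]   # predicted probability of the fraud class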

# Classification report
print(classification_report(y_test, y_pred))

# Precision-Recall Curve; auc() integrates it with the trapezoidal rule
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)
print(f"AUPRC: {pr_auc:.4f}")

plt.plot(recall, precision, marker='.', label='Random Forest')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()
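The evaluation cell above scores only the Random Forest, although a Logistic Regression was trained as well. A side-by-side comparison could look like the sketch below. It runs on a small synthetic imbalanced dataset from `make_classification` (an assumption, so that the example is self-contained), uses separate variable names ending in `_demo` to avoid overwriting the notebook's data, and reports `average_precision_score`, a step-wise summary of the precision-recall curve that can differ slightly from the trapezoidal `auc` value. The printed scores say nothing about performance on the real credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 3% positives), not the real transactions.
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=10, weights=[0.97, 0.03], random_state=42
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models_demo = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each model and score it with the same metric on the same test split.
scores_demo = {}
for name, model in models_demo.items():
    model.fit(X_tr_demo, y_tr_demo)
    prob = model.predict_proba(X_te_demo)[:, 1]
    scores_demo[name] = average_precision_score(y_te_demo, prob)
    print(f"{name}: average precision = {scores_demo[name]:.4f}")
```

In the notebook itself, the same loop could be run over the fitted `lr` and `rf` with `X_test` and `y_test`, so that both models are compared on identical data.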