Introduction

The objective of this project is to develop a predictive model for detecting fraudulent transactions in a financial dataset. We used a random forest classifier (scikit-learn's RandomForestClassifier), which often performs well on tabular data, to build the model. This report outlines the data cleaning process, model training, evaluation, and key insights derived from the model.

Data Overview

The dataset consists of 6,362,620 rows and 11 columns, with the following attributes:

step: Time step of the transaction; one step represents one hour in the real world.

type: Type of transaction (CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER).

amount: Amount of the transaction in local currency.

nameOrig: Customer who started the transaction.

oldbalanceOrg: Initial balance before the transaction.

newbalanceOrig: New balance after the transaction.

nameDest: Customer who is the recipient of the transaction.

oldbalanceDest: Initial balance of the recipient before the transaction.

newbalanceDest: New balance of the recipient after the transaction.

isFraud: Indicates if the transaction is fraudulent.

isFlaggedFraud: Indicates if the transaction is flagged as fraudulent based on business rules.

In [1]:
pip install statsmodels


In [2]:
pip install xgboost


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import RandomForestClassifier

In [4]:

data = pd.read_csv('Fraud.csv')

In [5]:
print(data.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0        0               0  
1  M2044282225             0.0             0.0        0               0  
2   C553264065             0.0             0.0        1               0  
3    C38997010         21182.0             0.0        1               0  
4  M1230701703             0.0             0.0        0               0  


Handling Missing Values

No missing values were detected in the dataset.

In [6]:
data.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [7]:
data 

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


The categorical 'type' column was encoded as integer category codes.

In [8]:
data['type'] = data['type'].astype('category').cat.codes
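The encoding step above can be sketched on a toy column. This is an illustration on made-up data, not the real file: pandas sorts the inferred categories and uses each category's position as its code, which matches the codes visible in the output below (CASH_OUT = 1, PAYMENT = 3, TRANSFER = 4).

```python
import pandas as pd

# Toy column with the five transaction types (illustrative data, not the real file).
types = pd.Series(["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN", "PAYMENT"])

cat = types.astype("category")

# Inferred categories are sorted; each code is the position in that sorted list.
code_map = {label: code for code, label in enumerate(cat.cat.categories)}
codes = cat.cat.codes

print(code_map)
print(codes.tolist())
```

Integer codes impose an arbitrary ordering on the transaction types. Tree-based models usually tolerate this, but the ordering would matter for linear models or for distance-based statistics.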

In [9]:
dataset = data.copy()


In [10]:
dataset

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,3,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,3,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,4,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,1,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,3,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,1,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,4,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,1,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,4,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


Here we need to drop the 'isFraud' column from the working copy, because otherwise the z-score would also be applied to the 'isFraud' column.

In [11]:
dataset.drop(['isFraud'],axis= 1 , inplace= True)

In [12]:
dataset

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFlaggedFraud
0,1,3,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0
1,1,3,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0
2,1,4,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,0
3,1,1,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,0
4,1,3,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0
...,...,...,...,...,...,...,...,...,...,...
6362615,743,1,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,0
6362616,743,4,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,0
6362617,743,1,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,0
6362618,743,4,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,0


Find the outliers using z-scores with a threshold of 3.

In [13]:
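# Absolute z-score of every numeric column (the target was dropped from `dataset` above)
z_scores = np.abs(stats.zscore(dataset.select_dtypes(include=[np.number])))
# Element-wise mask: True where |z| > 3
outliers = z_scores > 3
# Keep rows of the full `data` frame (including 'isFraud') with no outlying column
data_no_outliers = data[~outliers.any(axis=1)]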
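As a minimal sketch of the filtering rule, the example below applies the same threshold to synthetic data. The helper name `zscore_inlier_mask` and the toy columns are assumptions made for illustration. It also shows one side effect worth keeping in mind: every positive row of a very rare binary column gets a large z-score, so those rows are all dropped.

```python
import numpy as np

def zscore_inlier_mask(values, threshold=3.0):
    """Boolean mask of rows whose absolute z-score is <= threshold in every column."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # population std (ddof=0), the default of scipy.stats.zscore
    z = np.abs((values - mean) / std)
    return ~(z > threshold).any(axis=1)

rng = np.random.default_rng(0)
n = 1000

# Column 0: roughly normal amounts with one extreme value planted in row 0.
amounts = rng.normal(100.0, 10.0, size=n)
amounts[0] = 1_000.0

# Column 1: a rare binary flag that is set only in row 5.
flag = np.zeros(n)
flag[5] = 1.0

mask = zscore_inlier_mask(np.column_stack([amounts, flag]))
print("rows kept:", int(mask.sum()), "of", n)
print("row 0 kept:", bool(mask[0]), "| row 5 kept:", bool(mask[5]))
```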

In [14]:
data_no_outliers

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,3,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.0,0.0,0,0
1,1,3,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.0,0.0,0,0
2,1,4,181.00,C1305486145,181.00,0.00,C553264065,0.0,0.0,1,0
3,1,1,181.00,C840083671,181.00,0.00,C38997010,21182.0,0.0,1,0
4,1,3,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6295995,670,3,8868.80,C1638464390,11556.00,2687.20,M30922949,0.0,0.0,0,0
6295996,670,3,7343.33,C893404945,152.00,0.00,M2090441769,0.0,0.0,0,0
6295997,670,3,3282.37,C1514880247,5954.00,2671.63,M1709561483,0.0,0.0,0,0
6295998,670,3,4527.98,C1386905065,2671.63,0.00,M1659030337,0.0,0.0,0,0


The Variance Inflation Factor (VIF) was calculated to check for multicollinearity, and features with a VIF of 5 or more were removed to reduce redundancy. In this run that removes all four balance columns, leaving step, type and amount.

In [15]:
X = data_no_outliers.drop(['isFraud', 'isFlaggedFraud', 'nameOrig', 'nameDest'], axis=1)
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

In [16]:
vif_data

Unnamed: 0,Feature,VIF
0,step,2.498687
1,type,2.297849
2,amount,3.647315
3,oldbalanceOrg,380.304028
4,newbalanceOrig,393.757958
5,oldbalanceDest,124.806801
6,newbalanceDest,136.537997
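For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the remaining features, so a cutoff of 5 corresponds to R_i^2 = 0.8. The sketch below computes this with NumPy on synthetic data. The helper `vif` is hypothetical and adds an intercept to each regression. statsmodels' `variance_inflation_factor` uses the design matrix as given, so its values can differ from this sketch when no constant column is included, as in the cell above.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 2000
x_indep = rng.normal(size=n)                       # unrelated to the other columns
x_old = rng.normal(size=n)                         # stands in for an "old balance"
x_new = x_old + 0.01 * rng.normal(size=n)          # nearly a copy, like a "new balance"
X_demo = np.column_stack([x_indep, x_old, x_new])

vifs = [vif(X_demo, i) for i in range(X_demo.shape[1])]
print([round(v, 1) for v in vifs])
```

The near-duplicate pair gets a very large VIF while the independent column stays close to 1, which is the same pattern as the old/new balance pairs in the table above.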


In [17]:
features_to_keep = vif_data[vif_data['VIF'] < 5]['Feature']
# Copy so that adding the target column does not trigger a SettingWithCopyWarning
data_cleaned = data_no_outliers.loc[:, features_to_keep].copy()
data_cleaned['isFraud'] = data_no_outliers['isFraud']

In [18]:
X = data_cleaned.drop(['isFraud'], axis=1)
y = data_cleaned['isFraud']

In [19]:
y.value_counts()

isFraud
0    6020098
1       5851
Name: count, dtype: int64

Split the data into 70% training and 30% test sets.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
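With a fraud rate this low, a purely random split can leave the two sets with slightly different fraud proportions. One option, shown here only as a sketch on synthetic data and not as what the notebook does, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both sets. The names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 10,000 rows with 10 positives (0.1%).
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:10] = 1

# stratify=y_demo preserves the 0.1% positive rate in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("positives in train:", int(y_tr.sum()), "| in test:", int(y_te.sum()))
```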

We use RandomForestClassifier. Class imbalance can be addressed by adjusting class weights (the class_weight parameter) or by resampling the training data with a technique such as SMOTE. The model below uses the default settings and applies neither.

In [21]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
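To make the class-weight option concrete, the sketch below works out the weights that scikit-learn's `class_weight="balanced"` setting would assign, using the formula n_samples / (n_classes * class_count). It uses the class counts printed by `y.value_counts()` above, which cover the whole cleaned set, so the weights for the training split alone would differ slightly. The `rf_balanced` model is a hypothetical alternative that is neither fitted nor evaluated in this report.

```python
from sklearn.ensemble import RandomForestClassifier

# Class counts from y.value_counts() above (whole cleaned set, not only the training split).
n_legit, n_fraud = 6_020_098, 5_851
n_samples, n_classes = n_legit + n_fraud, 2

# 'balanced' weighting: n_samples / (n_classes * count_of_class)
weight_legit = n_samples / (n_classes * n_legit)
weight_fraud = n_samples / (n_classes * n_fraud)
print(f"legit weight: {weight_legit:.4f} | fraud weight: {weight_fraud:.1f}")

# Hypothetical alternative to the default model; not fitted or evaluated here.
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
```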

In [22]:
y_pred = rf.predict(X_test)
y_pred_prob = rf.predict_proba(X_test)
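The notebook imports `roc_auc_score` and computes `y_pred_prob`, but does not go on to use them. As a sketch of how they fit together, `predict_proba` returns one column per class, and the ROC AUC is computed from the positive-class column. The tiny arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and a probability matrix shaped like predict_proba output:
# column 0 = P(class 0), column 1 = P(class 1).
y_true = np.array([0, 0, 1, 1])
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.20, 0.80],
])

# Only the positive-class column is passed to roc_auc_score.
fraud_scores = proba[:, 1]
auc = roc_auc_score(y_true, fraud_scores)
print(f"ROC AUC: {auc:.2f}")
```

In the notebook itself the equivalent call would presumably be `roc_auc_score(y_test, y_pred_prob[:, 1])`.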

Classification Metrics:

Precision: Fraction of predicted positives that are actually positive, TP / (TP + FP).

Recall: Fraction of actual positives that the model finds, TP / (TP + FN).

F1-Score: Harmonic mean of precision and recall.

Confusion Matrix: Counts of true negatives, false positives, false negatives and true positives; rows are actual classes and columns are predicted classes.



In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1805949
           1       0.72      0.25      0.37      1836

    accuracy                           1.00   1807785
   macro avg       0.86      0.62      0.68   1807785
weighted avg       1.00      1.00      1.00   1807785

In [24]:
confusion_matrix(y_test, y_pred)

array([[1805772,     177],
       [   1379,     457]], dtype=int64)
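The fraud-class figures in the classification report can be reproduced directly from this confusion matrix. The arithmetic below uses the four counts exactly as printed and scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 1_805_772, 177
fn, tp = 1_379, 457

precision = tp / (tp + fp)   # of the transactions predicted as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The accuracy is dominated by the legitimate class. The recall shows that the model catches 457 of the 1,836 fraudulent test transactions and misses the other 1,379.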