## Fraud Detection Model Description

- Dataset: ~6.36M transactions (6,362,620 rows) with 11 columns, including the `isFraud` target.
- Data Cleaning:
    - Handled outliers using IQR method.
    - Checked skewness for highly skewed features (e.g., amount).
    - Reduced multicollinearity by combining `oldbalanceDest` & `newbalanceDest` into `dest_balance_change`.
- Feature Engineering:
    - Encoded categorical features (`type_encoding`, `nameDest_encoded`, `nameOrig_encoded`).
    - Derived new features like `dest_balance_change`.
- Imbalanced Data Handling:
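    - Fraud cases are very rare (~0.13%). The models use class weights (`scale_pos_weight`, `class_weight='balanced'`); SMOTE resampling is also computed, but the models are fit on the original training split.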
- Models Trained:
    - Random Forest → high precision (98%) but moderate recall (79%).
    - XGBoost → very high recall (99%) but low precision (25%).
    - Ensemble (RF + XGB, soft voting) → best balance: 94% recall, 72% precision.
- Evaluation:
    - Confusion Matrix, Classification Report, ROC-AUC (0.9995).
- Conclusion:
    - Ensemble selected as final model because it catches most frauds while controlling false positives.
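
The IQR step listed under Data Cleaning is not shown in the cells that follow. A minimal sketch of one common variant (clipping values to the 1.5 × IQR fences) on made-up amounts; the helper name `clip_iqr` and the factor `k=1.5` are assumptions, not taken from this notebook:

```python
import pandas as pd

def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Illustrative values only, not rows from Fraud.csv.
amounts = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 10_000.0])
clipped = clip_iqr(amounts)
print(clipped.tolist())
```

Clipping keeps every row, which may matter here: dropping extreme amounts instead could discard some of the rare fraud cases.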


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df=pd.read_csv('/content/drive/MyDrive/Fraud DS project/Fraud.csv')

In [3]:
df.sample(5)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
27459,8,CASH_OUT,14434.99,C1826352341,155195.61,140760.62,C1413677222,509153.86,523588.86,0,0
2733352,212,CASH_OUT,148190.99,C1164502097,0.0,0.0,C930083339,929543.85,1077734.84,0,0
5982853,408,CASH_OUT,256500.67,C673750376,0.0,0.0,C825903694,1119714.94,1376215.62,0,0
4813669,346,PAYMENT,27910.15,C1577384715,24.0,0.0,M1264376572,0.0,0.0,0,0
3113878,235,PAYMENT,9748.12,C1738119702,151312.0,141563.88,M229249350,0.0,0.0,0,0


In [6]:
df.sample(5)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
3409597,255,CASH_IN,113058.19,C1301032550,3352974.84,3466033.04,C1593489597,644383.17,531324.97,0,0
463551,19,CASH_IN,97631.7,C1372293201,66302.0,163933.7,C118525501,40571.82,0.0,0,0
175471,12,PAYMENT,3457.89,C313834325,127388.93,123931.04,M389361824,0.0,0.0,0,0
615139,34,TRANSFER,545852.13,C345624882,0.0,0.0,C1960580171,661923.45,1207775.58,0,0
3793047,281,PAYMENT,7352.31,C834984583,187.11,0.0,M376829670,0.0,0.0,0,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [8]:
df['type'].value_counts()

type,count
CASH_OUT,2237500
PAYMENT,2151495
CASH_IN,1399284
TRANSFER,532909
DEBIT,41432


# Data cleaning including missing values, outliers and multi-collinearity.



In [4]:
df.shape

(6362620, 11)

In [46]:
print(df.isnull().sum())

step                   0
type                   0
amount                 0
nameOrig               0
oldbalanceOrg          0
nameDest               0
isFraud                0
isFlaggedFraud         0
type_encoding          0
dest_balance_change    0
nameDest_initial       0
nameDest_encoded       0
nameOrig_initial       0
nameOrig_encoded       0
dtype: int64


In [6]:
df.duplicated().sum()

np.int64(0)

In [8]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [9]:
# Integer code for each transaction type; any other type (here DEBIT) maps to 5.
TYPE_CODES = {'CASH_OUT': 1, 'PAYMENT': 2, 'CASH_IN': 3, 'TRANSFER': 4}

def change(x):
    return TYPE_CODES.get(x, 5)

In [10]:
df['type_encoding']=df['type'].apply(change)

In [13]:
df.head(2)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_encoding
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,2
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,2



## Multi-collinearity.

* I found strong correlation between oldbalanceOrg & newbalanceOrig (≈ 0.999), and also between oldbalanceDest & newbalanceDest (≈ 0.977).

* Such multicollinearity can confuse the model, as both features carry overlapping information.

* To address this, I created the derived feature dest_balance_change (newbalanceDest - oldbalanceDest).

* Then I dropped the redundant columns (oldbalanceDest, newbalanceDest and newbalanceOrig), keeping the dataset simpler and more meaningful.
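
One way to surface such pairs automatically is to scan the correlation matrix for entries above a cut-off. The sketch below uses synthetic data; the helper `high_corr_pairs`, the 0.9 threshold and the toy values are assumptions for illustration, not part of this notebook:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, r) for column pairs with |r| above threshold."""
    corr = frame.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

# Synthetic stand-in for the balance columns (not the real data).
rng = np.random.default_rng(0)
old = rng.uniform(0, 1e6, size=1_000)
toy = pd.DataFrame({
    "oldbalanceDest": old,
    "newbalanceDest": old + rng.normal(0, 1e4, size=1_000),
    "step": rng.integers(1, 100, size=1_000),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Only the two balance columns are reported, which is the kind of pair that the difference feature replaces.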









In [13]:
df['dest_balance_change'] = df['newbalanceDest'] - df['oldbalanceDest']


In [14]:
df = df.drop(['oldbalanceDest', 'newbalanceDest'], axis=1)


Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,isFraud,type_encoding
amount,1.0,-0.002762,-0.007861,0.294137,0.459304,0.012295,0.076688,0.198987
oldbalanceOrg,-0.002762,1.0,0.998803,0.066243,0.042029,0.003835,0.010154,0.260418
newbalanceOrig,-0.007861,0.998803,1.0,0.067812,0.041837,0.003776,-0.008148,0.270669
oldbalanceDest,0.294137,0.066243,0.067812,1.0,0.976569,-0.000513,-0.005885,0.066255
newbalanceDest,0.459304,0.042029,0.041837,0.976569,1.0,-0.000529,0.000535,0.079111
isFlaggedFraud,0.012295,0.003835,0.003776,-0.000513,-0.000529,1.0,0.044109,0.003144
isFraud,0.076688,0.010154,-0.008148,-0.005885,0.000535,0.044109,1.0,0.016171
type_encoding,0.198987,0.260418,0.270669,0.066255,0.079111,0.003144,0.016171,1.0


## Imbalanced Dataset

In [15]:
df['isFraud'].value_counts()

isFraud,count
0,6354407
1,8213


Data cleaning - no null values, no duplicate rows.
Outliers -
Multicollinearity - strong correlation between oldbalanceOrg & newbalanceOrig (≈ 0.999)
and between oldbalanceDest & newbalanceDest (≈ 0.977).

In [16]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0,0,2,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0,0,2,0.0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,1,0,4,0.0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,1,0,1,-21182.0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0,0,2,0.0


In [14]:
df['amount'].skew()

np.float64(30.99394948249038)

In [18]:
pd.crosstab(df['type'], df['isFraud'])


type,isFraud=0,isFraud=1
CASH_IN,1399284,0
CASH_OUT,2233384,4116
DEBIT,41432,0
PAYMENT,2151495,0
TRANSFER,528812,4097
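
The crosstab above can be turned into per-type fraud rates with plain arithmetic. This sketch only re-uses the counts printed in that output; the variable names are illustrative:

```python
# Counts copied from the crosstab above: (non-fraud, fraud) per type.
counts = {
    "CASH_IN": (1_399_284, 0),
    "CASH_OUT": (2_233_384, 4_116),
    "DEBIT": (41_432, 0),
    "PAYMENT": (2_151_495, 0),
    "TRANSFER": (528_812, 4_097),
}

fraud_rate = {t: fraud / (ok + fraud) for t, (ok, fraud) in counts.items()}
total_fraud = sum(fraud for _, fraud in counts.values())
total_rows = sum(ok + fraud for ok, fraud in counts.values())
overall_rate = total_fraud / total_rows

# Print types from highest to lowest fraud rate, then the overall rate.
for t, rate in sorted(fraud_rate.items(), key=lambda kv: -kv[1]):
    print(f"{t:<9} {rate:.4%}")
print(f"overall   {overall_rate:.4%}")
```

Fraud appears only in TRANSFER and CASH_OUT rows, and the overall rate works out to roughly 0.13%.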


In [19]:
df.head(3)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0,0,2,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0,0,2,0.0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,1,0,4,0.0


In [20]:
df = df.drop('newbalanceOrig', axis=1)


In [29]:
df.head(3)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_initial
0,1,PAYMENT,9839.64,C1231006815,170136.0,M1979787155,0,0,2,0.0,M
1,1,PAYMENT,1864.28,C1666544295,21249.0,M2044282225,0,0,2,0.0,M
2,1,TRANSFER,181.0,C1305486145,181.0,C553264065,1,0,4,0.0,C


In [21]:
df['nameDest_initial'] = df['nameDest'].str[0]

In [22]:
pd.crosstab(df['nameDest_initial'], df['isFraud'])

nameDest_initial,isFraud=0,isFraud=1
C,4202912,8213
M,2151495,0


In [23]:
# Encode the destination account type: customer (C) = 1, merchant (M) = 0.
df['nameDest_encoded'] = df['nameDest_initial'].map({'C': 1, 'M': 0})


In [24]:
df.head(3)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_initial,nameDest_encoded
0,1,PAYMENT,9839.64,C1231006815,170136.0,M1979787155,0,0,2,0.0,M,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,M2044282225,0,0,2,0.0,M,0
2,1,TRANSFER,181.0,C1305486145,181.0,C553264065,1,0,4,0.0,C,1


In [25]:
df.corr(numeric_only=True)

Unnamed: 0,step,amount,oldbalanceOrg,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_encoded
step,1.0,0.022373,-0.010058,0.031578,0.003277,0.012627,0.001325,-0.004926
amount,0.022373,1.0,-0.002762,0.076688,0.012295,0.198987,0.845964,0.197444
oldbalanceOrg,-0.010058,-0.002762,1.0,0.010154,0.003835,0.260418,-0.087032,0.189486
isFraud,0.031578,0.076688,0.010154,1.0,0.044109,0.016171,0.027028,0.025697
isFlaggedFraud,0.003277,0.012295,0.003835,0.044109,1.0,0.003144,-0.000242,0.001133
type_encoding,0.012627,0.198987,0.260418,0.016171,0.003144,1.0,0.080513,0.040302
dest_balance_change,0.001325,0.845964,-0.087032,0.027028,-0.000242,0.080513,1.0,0.109286
nameDest_encoded,-0.004926,0.197444,0.189486,0.025697,0.001133,0.040302,0.109286,1.0


In [26]:
df['nameOrig_initial'] = df['nameOrig'].str[0]

In [27]:
df.head(2)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_initial,nameDest_encoded,nameOrig_initial
0,1,PAYMENT,9839.64,C1231006815,170136.0,M1979787155,0,0,2,0.0,M,0,C
1,1,PAYMENT,1864.28,C1666544295,21249.0,M2044282225,0,0,2,0.0,M,0,C


In [28]:
# Encode the origin account type the same way: customer (C) = 1, merchant (M) = 0.
df['nameOrig_encoded'] = df['nameOrig_initial'].map({'C': 1, 'M': 0})


In [29]:
df['nameOrig_initial'].value_counts()

nameOrig_initial,count
C,6362620


In [30]:
df.corr(numeric_only=True)

Unnamed: 0,step,amount,oldbalanceOrg,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_encoded,nameOrig_encoded
step,1.0,0.022373,-0.010058,0.031578,0.003277,0.012627,0.001325,-0.004926,
amount,0.022373,1.0,-0.002762,0.076688,0.012295,0.198987,0.845964,0.197444,
oldbalanceOrg,-0.010058,-0.002762,1.0,0.010154,0.003835,0.260418,-0.087032,0.189486,
isFraud,0.031578,0.076688,0.010154,1.0,0.044109,0.016171,0.027028,0.025697,
isFlaggedFraud,0.003277,0.012295,0.003835,0.044109,1.0,0.003144,-0.000242,0.001133,
type_encoding,0.012627,0.198987,0.260418,0.016171,0.003144,1.0,0.080513,0.040302,
dest_balance_change,0.001325,0.845964,-0.087032,0.027028,-0.000242,0.080513,1.0,0.109286,
nameDest_encoded,-0.004926,0.197444,0.189486,0.025697,0.001133,0.040302,0.109286,1.0,
nameOrig_encoded,,,,,,,,,


In [32]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_initial,nameDest_encoded,nameOrig_initial,nameOrig_encoded
0,1,PAYMENT,9839.64,C1231006815,170136.0,M1979787155,0,0,2,0.0,M,0,C,1
1,1,PAYMENT,1864.28,C1666544295,21249.0,M2044282225,0,0,2,0.0,M,0,C,1
2,1,TRANSFER,181.0,C1305486145,181.0,C553264065,1,0,4,0.0,C,1,C,1
3,1,CASH_OUT,181.0,C840083671,181.0,C38997010,1,0,1,-21182.0,C,1,C,1
4,1,PAYMENT,11668.14,C2048537720,41554.0,M1230701703,0,0,2,0.0,M,0,C,1


In [47]:
df.head(2)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_initial,nameDest_encoded,nameOrig_initial,nameOrig_encoded
0,1,PAYMENT,9839.64,C1231006815,170136.0,M1979787155,0,0,2,0.0,M,0,C,1
1,1,PAYMENT,1864.28,C1666544295,21249.0,M2044282225,0,0,2,0.0,M,0,C,1


A high skewness value indicates that a feature is not normally distributed.

In [33]:
df.skew(numeric_only=True)

feature,skew
step,0.375177
amount,30.993949
oldbalanceOrg,5.249136
isFraud,27.779538
isFlaggedFraud,630.603629
type_encoding,0.587337
dest_balance_change,32.916341
nameDest_encoded,-0.684258
nameOrig_encoded,0.0
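
A common response to skew of this size is a log transform. The sketch below is not a step from this notebook; it uses synthetic log-normal values as a stand-in for `amount` to show the effect of `np.log1p`:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts (log-normal), standing in for `amount`.
rng = np.random.default_rng(42)
amount = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=10_000))

skew_before = amount.skew()
skew_after = np.log1p(amount).skew()  # log(1 + x) is defined at x = 0
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Tree-based models are insensitive to monotonic transforms of a feature, so this matters mostly for scale-sensitive models such as an SVM. `log1p` also needs values above -1, so it would not apply directly to `dest_balance_change`, which can be negative.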


In [34]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,nameDest,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_initial,nameDest_encoded,nameOrig_initial,nameOrig_encoded
0,1,PAYMENT,9839.64,C1231006815,170136.0,M1979787155,0,0,2,0.0,M,0,C,1
1,1,PAYMENT,1864.28,C1666544295,21249.0,M2044282225,0,0,2,0.0,M,0,C,1
2,1,TRANSFER,181.0,C1305486145,181.0,C553264065,1,0,4,0.0,C,1,C,1
3,1,CASH_OUT,181.0,C840083671,181.0,C38997010,1,0,1,-21182.0,C,1,C,1
4,1,PAYMENT,11668.14,C2048537720,41554.0,M1230701703,0,0,2,0.0,M,0,C,1


# Selection of features

* Dropped irrelevant or redundant columns like nameOrig, type and nameDest.

* Combined highly correlated features (oldbalanceDest & newbalanceDest) into dest_balance_change.

* Encoded categorical features.

* Kept the features used for modelling: step, amount, oldbalanceOrg, isFlaggedFraud, type_encoding, dest_balance_change, nameDest_encoded. nameOrig_encoded is left out because every origin account starts with C, so the column is constant.






In [35]:
data=df[['step','amount','oldbalanceOrg','isFraud','isFlaggedFraud','type_encoding','dest_balance_change','nameDest_encoded']]

In [36]:
data.head()

Unnamed: 0,step,amount,oldbalanceOrg,isFraud,isFlaggedFraud,type_encoding,dest_balance_change,nameDest_encoded
0,1,9839.64,170136.0,0,0,2,0.0,0
1,1,1864.28,21249.0,0,0,2,0.0,0
2,1,181.0,181.0,1,0,4,0.0,1
3,1,181.0,181.0,1,0,1,-21182.0,1
4,1,11668.14,41554.0,0,0,2,0.0,0


# What are the key factors that predict a fraudulent customer?
* Transaction Amount (amount) – unusually high or abnormal transactions often indicate fraud.

* Balance Changes (dest_balance_change) – sudden withdrawals or transfers to other accounts can signal fraudulent activity.

* Transaction Type (type_encoding) – in this dataset all 8,213 fraud cases are TRANSFER (4,097) or CASH_OUT (4,116) transactions; no other type contains fraud.

* Destination Account Type (nameDest_encoded) – every fraud case in this dataset goes to a customer account (C), none to a merchant (M). nameOrig_encoded carries no signal because all origin accounts are customers.

* Old Account Balance (oldbalanceOrg) – very low or zero balances combined with high transfers can be suspicious.

# Do these factors make sense? If yes, How? If not, How not?
* Yes, these factors make sense.

* High transaction amounts and sudden balance changes are common indicators of fraud because fraudsters try to move large sums quickly.

* Certain transaction types like TRANSFER and CASH_OUT are more susceptible to fraudulent activity.

* Targeting customer accounts (C) rather than merchants (M) aligns with how fraudsters operate in real life.

* Low or zero starting balances combined with large transfers are unusual and suspicious, supporting the model’s logic.

# Splitting the dataset

In [37]:
X = data.drop('isFraud', axis=1)  # Features
y = data['isFraud']

In [38]:
from sklearn.model_selection import train_test_split

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [39]:
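from imblearn.over_sampling import SMOTE

# Oversample the fraud class on the training split only, so the test split stays untouched.
# Note: the models below are fit on X_train / y_train with class weights;
# X_train_res / y_train_res are not used by them.
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)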


# What kind of prevention should be adopted while the company updates its infrastructure?
* Implement real-time transaction monitoring to flag suspicious activity immediately.

* Use fraud detection models (like the Ensemble model) during updates to catch anomalies.

* Continuously update and retrain models to adapt to new fraud patterns.

# Assuming these actions have been implemented, how would you determine if they work?
* Monitor Key Metrics: Track fraud detection precision, recall, F1-score, and false positives over time.

* Business Impact: Assess reduction in financial losses and operational efficiency improvements.

# Model 1

In [32]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

xgb = XGBClassifier(scale_pos_weight=len(y_train[y_train==0]) / len(y_train[y_train==1]),
                    eval_metric='auc',
                    random_state=42)

xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
y_prob = xgb.predict_proba(X_test)[:,1]

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_prob))


Confusion Matrix:
 [[1266089    4792]
 [     22    1621]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270881
           1       0.25      0.99      0.40      1643

    accuracy                           1.00   1272524
   macro avg       0.63      0.99      0.70   1272524
weighted avg       1.00      1.00      1.00   1272524


ROC-AUC Score: 0.9994876563462884
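
The `scale_pos_weight` passed to `XGBClassifier` above is the ratio of non-fraud to fraud rows in the training labels. Its approximate value can be recovered from counts already printed in this notebook, as in this small sketch (variable names are illustrative):

```python
# Class counts taken from the outputs above.
total_neg, total_pos = 6_354_407, 8_213   # df['isFraud'].value_counts()
test_neg, test_pos = 1_270_881, 1_643     # "support" column of the report

# Training split = full data minus the test split.
train_neg = total_neg - test_neg
train_pos = total_pos - test_pos

scale_pos_weight = train_neg / train_pos
print(train_neg, train_pos, round(scale_pos_weight, 1))
```

A weight in the high hundreds makes errors on fraud rows far more costly than errors on normal rows, which is consistent with the very high recall and low precision reported for this model.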


# Model 2

In [42]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score


In [43]:
# Use 'balanced' to handle imbalanced dataset
rf_model = RandomForestClassifier(
    n_estimators=100,       # number of trees
    max_depth=None,         # let trees grow fully
    random_state=42,
    class_weight='balanced'  # automatically weights fraud class higher
)


In [85]:
rf_model.fit(X_train, y_train)


In [86]:
y_pred = rf_model.predict(X_test)
y_pred_prob = rf_model.predict_proba(X_test)[:,1]  # probability of fraud


In [87]:
# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_prob)
print(f"\nROC-AUC Score: {roc_auc:.4f}")


Confusion Matrix:
[[1270859      22]
 [    338    1305]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270881
           1       0.98      0.79      0.88      1643

    accuracy                           1.00   1272524
   macro avg       0.99      0.90      0.94   1272524
weighted avg       1.00      1.00      1.00   1272524


ROC-AUC Score: 0.9907


# Model 3

In [None]:
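from sklearn.svm import SVC

# This cell was not executed (In [None], no output). An RBF-kernel SVC with
# probability=True is very slow to fit on a training set of this size (~5M rows).
svm = SVC(kernel='rbf', class_weight='balanced', probability=True, random_state=42)
svm.fit(X_train, y_train)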

y_pred = svm.predict(X_test)
y_prob = svm.predict_proba(X_test)[:,1]

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_prob))


XGBoost, re-created unfitted for use in the ensemble below

In [44]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

xgb = XGBClassifier(scale_pos_weight=len(y_train[y_train==0]) / len(y_train[y_train==1]),
                    eval_metric='auc',
                    random_state=42)

# Demonstrate the performance of the model by using the best set of tools.

In [45]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Ensemble with soft voting (uses predicted probabilities)
ensemble_model = VotingClassifier(
    estimators=[
        ('rf', rf_model),   # RandomForest defined above
        ('xgb', xgb)  # XGBoost re-created above; VotingClassifier fits clones of both
    ],
    voting='soft'
)

# Train ensemble
ensemble_model.fit(X_train, y_train)

# Predictions
y_pred = ensemble_model.predict(X_test)
y_prob = ensemble_model.predict_proba(X_test)[:,1]

# Evaluation
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

print("\nClassification Report:\n", classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_prob)
print("\nROC-AUC Score:", roc_auc)


Confusion Matrix:
 [[1270272     609]
 [     97    1546]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270881
           1       0.72      0.94      0.81      1643

    accuracy                           1.00   1272524
   macro avg       0.86      0.97      0.91   1272524
weighted avg       1.00      1.00      1.00   1272524


ROC-AUC Score: 0.9995020929699184
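
With `voting='soft'`, the ensemble averages the fraud probabilities of its members and predicts the class with the higher mean. The sketch below reproduces that rule for two models with equal weights; the function `soft_vote` and the probabilities are made up for illustration:

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model fraud probabilities and threshold the mean."""
    n_models = len(prob_lists)
    avg = [sum(p) / n_models for p in zip(*prob_lists)]
    return avg, [int(p > threshold) for p in avg]

# Hypothetical fraud probabilities for four transactions.
rf_prob = [0.10, 0.40, 0.90, 0.30]
xgb_prob = [0.20, 0.90, 0.95, 0.60]

avg_prob, labels = soft_vote([rf_prob, xgb_prob])
print(labels)
```

In the second transaction a confident XGBoost score outweighs a hesitant Random Forest; in the fourth, a weak XGBoost alarm is vetoed. That is one plausible reading of why the ensemble lands between the two models on both precision and recall.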


**I compared Random Forest, XGBoost, and an Ensemble. Random Forest had high precision (98%) but missed about a fifth of the frauds (79% recall); XGBoost had excellent recall (99%) but only 25% precision. The soft-voting Ensemble struck the best balance, with 94% recall at 72% precision, which makes it the most practical of the three for deployment.**
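
The fraud-class figures quoted in this comparison can be re-derived from the three confusion matrices printed earlier. A short check, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]`; the helper name `prf` is illustrative:

```python
def prf(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (TN, FP, FN, TP) read off the confusion matrices printed above.
matrices = {
    "XGBoost": (1_266_089, 4_792, 22, 1_621),
    "Random Forest": (1_270_859, 22, 338, 1_305),
    "Ensemble": (1_270_272, 609, 97, 1_546),
}

scores = {name: prf(*cm) for name, cm in matrices.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name:<14} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In absolute terms, the ensemble raises 609 false alarms against 4,792 for XGBoost, and misses 97 frauds against 338 for Random Forest. Accuracy is close to 1.00 for all three because of the class imbalance, so it does not separate them.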

In [50]:
# google colab link
# https://colab.research.google.com/drive/11TvhMoOK5ISWohJzyV-9DNl2RGFohuh6?usp=sharing