<h1 align="center">Online Payment Fraud Detection System</h1>

<h2 align="center"> A Data-Driven Machine Learning Project </h2>

### Introduction:
This project is a machine learning classification model to detect fraudulent transactions in online payments. The dataset used is highly imbalanced, with far more non-fraudulent transactions than fraudulent ones. The project explores multiple algorithms and techniques to address the imbalance and improve the detection of fraud.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import  confusion_matrix, classification_report, roc_auc_score

pd.set_option('display.float_format', lambda x: '{:.2f}'.format(x))
np.set_printoptions(suppress=True)

import warnings
warnings.filterwarnings('ignore')  # To suppress warnings


# Data Loading and Analyzing:

In [None]:
df = pd.read_csv('data.csv')
df.head()

In [None]:
print("Rows and Columns")
print(df.shape)

##### Columns




In [None]:
print(df.columns)
df.rename(columns={'type':'transaction_type'},inplace=True)
print(df.columns)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head(2)

#### Class imbalance 

In [None]:
df['isFraud'].value_counts()

# The classes are imbalanced: fraudulent (1) transactions are far rarer than non-fraudulent (0) ones.
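To make the imbalance concrete, the share of each class can be computed directly. The following is a minimal sketch on a small synthetic label column; the `labels` series and the `class_share` helper are illustrative names, not part of the notebook.

```python
import pandas as pd


def class_share(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows that belong to each class."""
    return labels.value_counts(normalize=True).sort_index()


# Hypothetical labels: 995 legitimate rows and 5 fraudulent ones.
labels = pd.Series([0] * 995 + [1] * 5, name="isFraud")
share = class_share(labels)
print(share)
```

On the real data the same helper would be called as `class_share(df['isFraud'])`.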

In [None]:
numeric_columns = df.select_dtypes(['int64','float64']).columns

df[numeric_columns] = df[numeric_columns].astype('int')
df[numeric_columns].head(3)

In [None]:
cat_column = ['transaction_type','isFraud','isFlaggedFraud']

for col in cat_column:
    print(f"column => {col} = {df[col].unique()}")

#### Features:

## Data Cleaning

In [None]:
print("Checking for NA Values")
print(df.isna().sum())

In [None]:
print("Checking for Duplicates")
print(df.duplicated().sum())

In [None]:
df[numeric_columns].describe()

### Analyze categorical features

In [None]:
df.select_dtypes(exclude=['float64','int64']).columns

categorical_columns = df.select_dtypes(exclude=['float64','int']).columns[0] # We will only use transaction type column as categorical because the other columns are unique identifiers.
categorical_columns

In [None]:
# Unique identifiers:
df[['nameOrig','nameDest']].describe()

# Exploratory Data Analysis

In [None]:
numeric_columns

In [None]:
categorical_columns

In [None]:
transaction_count = df.groupby('transaction_type')['isFraud'].count().reset_index(name='no of transactions') \
   .sort_values(by='no of transactions',ascending=False)

ax = sns.barplot(data=transaction_count,x='transaction_type',y='no of transactions',)
ax.bar_label(ax.containers[0])

plt.xlabel("Transaction Type")
plt.title("Count of Transactions Per Transaction Type")
plt.ylabel("No of Transactions")
plt.show()

### Insights

1. The majority of transactions were of type PAYMENT or CASH_OUT.
2. Very few transactions were of type DEBIT or TRANSFER.

In [None]:
fraud_count_df = df.groupby('transaction_type')['isFraud'].sum().reset_index() \
     .sort_values(by='isFraud',ascending=False) 

ax = sns.barplot(data=fraud_count_df,x='transaction_type',y='isFraud',color='red')
ax.bar_label(ax.containers[0])

plt.title("Count of Fraudulent Transactions per Transaction Type")
plt.xlabel("Transaction Type")
plt.ylabel("No of Fraud Transactions")
plt.show()

### Insights

1. The majority of fraudulent transactions were either withdrawals of money from an account (CASH_OUT) or transfers of funds between two accounts (TRANSFER).
2. No fraudulent transactions occurred in any other transaction type.

In [None]:
numeric_columns

In [None]:
plt.figure(figsize=(8, 4))
sns.kdeplot(np.log1p(df['amount'][df['isFraud'] == 0]), fill=True, label='isFraud = 0')
sns.kdeplot(np.log1p(df['amount'][df['isFraud'] == 1]), fill=True, label='isFraud = 1')
plt.title("Amount KDE Plot with isFraud as Hue (Log Scale)")
plt.xlabel("Log(Amount)")
plt.legend()
plt.show()

#### Insights:

1. Transactions with higher amounts are more likely to be fraudulent.
2. Transactions with smaller amounts tend to be non-fraudulent.

### Correlation matrix:

In [None]:
cm = df[numeric_columns].corr()

plt.figure(figsize=(10,4))
sns.heatmap(cm,annot=True,fmt=".2f",cmap='coolwarm')
plt.title("Correlation matrix")
plt.show()

### Insights from EDA

1. Most transactions were of type CASH_OUT, PAYMENT, or CASH_IN.
2. Fraudulent transactions were concentrated in the CASH_OUT and TRANSFER types.
3. The higher the transaction amount, the higher the risk of fraud.
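The second insight can be checked numerically by computing the fraud rate per transaction type rather than only the raw counts. The sketch below uses a hypothetical six-row frame (`toy`) in place of the real data; only the column names `transaction_type` and `isFraud` come from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only.
toy = pd.DataFrame({
    "transaction_type": ["CASH_OUT", "CASH_OUT", "TRANSFER", "TRANSFER", "PAYMENT", "PAYMENT"],
    "isFraud": [1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of fraudulent rows in each group.
fraud_rate = (
    toy.groupby("transaction_type")["isFraud"]
    .agg(transactions="count", frauds="sum", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)
print(fraud_rate)
```

A rate is easier to compare across types than a count, because the types differ widely in how many transactions they contain.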


# Feature Engineering and Feature Selection

In [None]:
# Usage of step feature:

#### Removing columns based on Domain knowledge

In [None]:
df.drop(columns=['step','nameOrig','nameDest','isFlaggedFraud','oldbalanceDest','newbalanceDest'],inplace=True)
df.head(2)

# Feature Encoding:
Encoding the `transaction_type` column with a manual integer (label) mapping:

In [None]:
df['transaction_type'] = df['transaction_type'].map({"CASH_OUT": 1, "PAYMENT": 2, 
                                 "CASH_IN": 3, "TRANSFER": 4,
                                 "DEBIT": 5})

print(df['transaction_type'].unique())
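An integer mapping implies an ordering between transaction types (for example, that DEBIT is "greater than" CASH_OUT), which a linear model such as logistic regression will treat as meaningful. One common alternative is one-hot encoding. The sketch below is an assumption about how that could look, not the approach the notebook uses; `toy` is a hypothetical sample of the raw column.

```python
import pandas as pd

# Hypothetical sample of the raw transaction_type column.
toy = pd.DataFrame(
    {"transaction_type": ["CASH_OUT", "PAYMENT", "TRANSFER", "DEBIT", "CASH_IN"]}
)

# One indicator column per category; no ordering is implied between types.
encoded = pd.get_dummies(toy, columns=["transaction_type"], prefix="type", dtype=int)
print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice, so the integer mapping may be adequate for the random forest used later.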

In [None]:
df.head(2)

# Model Training

#### Train Test Split

In [None]:
X = df[['transaction_type','amount','oldbalanceOrg','newbalanceOrig']]
y = df['isFraud']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
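With a rare positive class, a plain random split can leave the train and test sets with slightly different fraud rates. Passing `stratify` to `train_test_split` keeps the class proportions equal in both parts. The sketch below demonstrates this on synthetic arrays (`X_demo`, `y_demo` are made up for the example).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))       # synthetic features
y_demo = np.array([1] * 20 + [0] * 980)   # 2% positives

# stratify=y_demo preserves the 2% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to adding `stratify=y` to the existing call.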

### Attempt 1:


##### Without handling Class Imbalance

In [None]:
# Helper that fits the model on the training data and prints a classification report for the test data.

def get_report(model,X_train,y_train,X_test,y_test):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    report = classification_report(y_test,y_pred) 
    
    print(report)

In [None]:
get_report(LogisticRegression(),X_train,y_train,X_test,y_test)

In [None]:
get_report(DecisionTreeClassifier(),X_train,y_train,X_test,y_test)

In [None]:
get_report(RandomForestClassifier(),X_train,y_train,X_test,y_test)

### Insights:

LogisticRegression model: recall = 0.53, precision = 0.98

DecisionTreeClassifier model: recall = 0.96, precision = 0.98

RandomForestClassifier model: recall = 0.96, precision = 0.99
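For reference, precision and recall for the fraud class follow directly from the confusion matrix: precision is TP / (TP + FP) and recall is TP / (TP + FN). A small worked example on made-up labels (`y_true`, `y_hat` are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were flagged
print(precision, recall)
```

In fraud detection a missed fraud (false negative) is typically the costlier error, which is why recall is watched closely alongside precision.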

## Attempt 2:

#### Handling Class Imbalance using SMOTE and RandomUnderSampler:

#### Using SMOTE

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
smt = SMOTE(random_state=42)

# Resample only the training split so that no test rows leak into training.
X_resampled,y_resampled = smt.fit_resample(X_train,y_train)

In [None]:
get_report(LogisticRegression(),X_resampled,y_resampled,X_test,y_test)

In [None]:
get_report(DecisionTreeClassifier(),X_resampled,y_resampled,X_test,y_test)

In [None]:
get_report(RandomForestClassifier(),X_resampled,y_resampled,X_test,y_test)
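SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function below is a simplified illustration of that idea only; it is not the `imblearn` implementation, and `smote_like` and its parameters are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(X_min: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                            # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


# Five hypothetical minority-class points in two dimensions.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like(X_min, n_new=10)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.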

#### Using RandomUnderSampler:

In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
under_sampler= RandomUnderSampler(random_state=42)

# Undersample only the training split so that no test rows leak into training.
X_under,y_under = under_sampler.fit_resample(X_train,y_train)

In [None]:
get_report(LogisticRegression(),X_under,y_under,X_test,y_test)

In [None]:
get_report(DecisionTreeClassifier(),X_under,y_under,X_test,y_test)

In [None]:
get_report(RandomForestClassifier(),X_under,y_under,X_test,y_test)
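Random undersampling simply discards majority-class rows until the classes are the same size. The same effect can be sketched in plain pandas; `toy` below is a hypothetical frame, and this is an illustration of the idea rather than what `RandomUnderSampler` does internally.

```python
import pandas as pd

# Hypothetical frame: 97 legitimate rows and 3 fraudulent ones.
toy = pd.DataFrame({
    "amount": range(100),
    "isFraud": [0] * 97 + [1] * 3,
})

# Size of the smallest class; every class is sampled down to this count.
n_min = toy["isFraud"].value_counts().min()
balanced = toy.groupby("isFraud").sample(n=n_min, random_state=42)
print(balanced["isFraud"].value_counts())
```

The cost of this technique is visible here: 94 of the 97 legitimate rows are thrown away, so the model sees far less of the majority class.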

## Model Fine Tuning:

1. Random Forest Classifier with hand-set hyperparameters, followed by the default configuration that was finalized as `best_model`:

In [None]:
get_report( RandomForestClassifier(n_estimators=300,
    max_depth=10,
    min_samples_leaf=5,
    min_samples_split=10,
    max_features='sqrt',
    class_weight={0: 1, 1: 5},  
    random_state=42), X_train,y_train,X_test,y_test ) 
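Instead of setting hyperparameters by hand, a cross-validated search can compare candidate values systematically. The following is a minimal sketch on synthetic data; the grid values, the `f1` scoring choice, and the `X_demo`/`y_demo` arrays are assumptions made for the example, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small imbalanced synthetic problem (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, class_weight="balanced", random_state=42),
    param_grid={"max_depth": [4, 8], "min_samples_leaf": [1, 5]},
    scoring="f1",  # F1 of the positive class balances precision and recall
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

On the full dataset a search like this can be slow, so a subsample or `RandomizedSearchCV` may be more practical.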


In [None]:
best_model = RandomForestClassifier(random_state=42)
best_model.fit(X_train,y_train)

In [None]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test,y_pred))

### Our Best model is RandomForestClassifier with recall 99% and precision 96%

# Model Evaluation: ROC Curve and AUC

In [None]:
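from sklearn.metrics import roc_curve, auc

# Use predicted fraud probabilities so the curve is traced over many thresholds, not a single hard cut.
y_scores = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test,y_scores)
fpr[:5], tpr[:5], thresholds[:5]

In [None]:
auc_score = auc(fpr,tpr)
print(f"Area Under the Curve = {round(auc_score,2)}")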

In [None]:
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"AUC = {auc_score:.2f}", color='darkorange', lw=2)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--', label='Random Guess')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid()
plt.show()
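On heavily imbalanced data the ROC curve can look optimistic, because the false positive rate is divided by a very large number of negatives. A precision-recall view is a common complement. The sketch below computes it on made-up scores (`y_true`, `y_score` are illustrative, not model output from this notebook).

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])

# Precision and recall at every distinct score threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision summarises the curve as a single number.
ap = average_precision_score(y_true, y_score)
print(round(ap, 3))
```

With the notebook's model, `y_score` would be `best_model.predict_proba(X_test)[:, 1]`.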

#### Manual Inputs for checking the working of the model:

In [None]:
X.head(2)

In [None]:
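# Pass a DataFrame with the training column names: transaction_type, amount, oldbalanceOrg, newbalanceOrig
sample = pd.DataFrame([[1,10240,200000,0]],columns=X.columns)
value = best_model.predict(sample)[0]
print(value)
if value == 0:
    print("Not a Fraud")
else:
    print("Fraud")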

#### Dumping the Model and its Artifacts:

In [None]:
# import joblib

# model_data = {
#     'model':best_model,
#     'features':X_train.columns,
#     'label_mapping': {0: 'Not Fraud', 1: 'Fraud'} 
# }

# joblib.dump(model_data,'fraud_detection_model.pkl')
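A saved artifact is only useful if it loads back and predicts identically. The round trip can be sketched as follows, using a small model trained on synthetic data and a temporary directory; the `artifact` layout mirrors the commented-out dictionary above, while `X_demo`, `y_demo`, and the tiny forest are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 1.0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

artifact = {
    "model": model,
    "features": ["transaction_type", "amount", "oldbalanceOrg", "newbalanceOrig"],
    "label_mapping": {0: "Not Fraud", 1: "Fraud"},
}

# Write to a temporary directory, then load the artifact back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fraud_detection_model.pkl")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored["model"].predict(X_demo))
print(same)
```

Storing the feature list as a plain Python list keeps the artifact readable without pandas at load time.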

In [None]:
df.head(2)

In [None]:
df.query('isFraud ==1 and newbalanceOrig != 0').head(20)

# Transaction Mapping:
# cash_out = 1, payment = 2, cash_in = 3, transfer = 4, debit = 5