<div style="
    background: linear-gradient(45deg, 
        #d6eaf8, 
        #f2f4f4 20%, 
        #ffe5b4 40%, 
        #ffffcc 60%, 
        #d1f2eb 80%, 
        #f3e5f5 100%
    );
    padding: 20px; 
    margin: 20px 0; 
    border-radius: 10px; 
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    font-family: Arial, sans-serif; 
    color: #333;
    text-align: center;
">
<h1><strong>Bank Transaction Fraud Detection</strong></h1>
<h2><strong>Problem Statement</strong></h2>
<p>With the rapid growth of digital banking, fraudulent transactions have become a significant concern for financial institutions. The challenge is to build a robust system that detects and prevents fraudulent transactions in real time while maintaining customer convenience and privacy.</p>
<p>The dataset provided contains detailed information about bank transactions, including customer demographics, transaction metadata, merchant categories, device types, transaction locations, and other relevant attributes. Key fields like transaction descriptions, device usage, and merchant categories provide vital insights for identifying anomalous activities. The "Is_Fraud" label offers a foundation for supervised learning techniques to differentiate between genuine and fraudulent transactions.</p>
<p>The objective of this problem is to analyze transaction patterns and develop predictive models that can accurately classify transactions as fraudulent or legitimate. This task involves exploring feature correlations, detecting unusual transaction behavior, and leveraging machine learning algorithms to create a scalable and efficient fraud detection system.</p>
<p>A successful solution will not only detect fraudulent activities but also minimize false positives, ensuring genuine transactions are not unnecessarily flagged. Insights derived from this analysis can help strengthen security measures, optimize fraud prevention strategies, and enhance the overall banking experience for customers.</p>
</div>

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8, 
        #f2f4f4 20%, 
        #ffe5b4 40%, 
        #ffffcc 60%, 
        #d1f2eb 80%, 
        #f3e5f5 100%
    );
    padding: 20px; 
    margin: 20px 0; 
    border-radius: 10px; 
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    font-family: Arial, sans-serif; 
    color: #333;
    text-align: left;
">
<h2><strong>Objectives for Bank Transaction Fraud Detection</strong></h2>
<ol>
<li>
<p><strong>Fraud Detection:</strong></p>
<ul>
<li>Develop a predictive model to classify bank transactions as fraudulent or legitimate using historical transaction data.</li>
</ul>
</li>
<li>
<p><strong>Anomaly Detection:</strong></p>
<ul>
<li>Identify unusual patterns or behaviors in customer transactions that may indicate potential fraud.</li>
</ul>
</li>
<li>
<p><strong>Feature Analysis:</strong></p>
<ul>
<li>Explore key features such as merchant categories, transaction devices, transaction locations, and account types to understand their impact on fraud detection.</li>
</ul>
</li>
<li>
<p><strong>Model Performance Optimization:</strong></p>
<ul>
<li>Ensure the fraud detection system achieves high accuracy, precision, and recall while minimizing false positives and false negatives.</li>
</ul>
</li>
<li>
<p><strong>Real-Time Fraud Prevention:</strong></p>
<ul>
<li>Create a scalable solution that can potentially be adapted for real-time fraud detection in production environments.</li>
</ul>
</li>
<li>
<p><strong>Customer Behavior Insights:</strong></p>
<ul>
<li>Analyze legitimate transaction behaviors to gain insights into customer banking patterns and preferences.</li>
</ul>
</li>
<li>
<p><strong>Device and Location Security:</strong></p>
<ul>
<li>Understand the correlation between transaction device types, locations, and fraudulent activities.</li>
</ul>
</li>
<li>
<p><strong>Security Enhancements:</strong></p>
<ul>
<li>Provide actionable recommendations to the bank for improving fraud prevention strategies and enhancing digital transaction security.</li>
</ul>
</li>
</ol>
</div>
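The objectives above list accuracy alongside precision and recall. On imbalanced data these metrics can disagree sharply, which the following minimal sketch illustrates. The labels and the two "detectors" are hypothetical, made up purely for illustration, and are not drawn from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1,000 transactions, 50 of them fraudulent (5%)
y_true = np.array([1] * 50 + [0] * 950)

# Baseline that never flags a transaction
y_never = np.zeros_like(y_true)

# Hypothetical detector: catches 30 of the 50 frauds and raises 20 false alarms
y_detector = np.array([1] * 30 + [0] * 20 + [1] * 20 + [0] * 930)


def summarize(y_pred):
    """Return accuracy, precision, and recall for the fraud class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }


baseline = summarize(y_never)
detector = summarize(y_detector)
print("Never flags:", baseline)
print("Detector:   ", detector)
```

The baseline reaches 95% accuracy while catching no fraud at all, so accuracy on its own says little here. Precision and recall on the fraud class are the more informative numbers.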

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 1 - Importing Libraries</strong></h2>
</div>

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.utils.class_weight import compute_class_weight
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.linear_model import LogisticRegression
# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# from sklearn.linear_model import RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn import metrics
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        file_path = os.path.join(dirname, filename)  # the last file found is the one loaded below
        print(file_path)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Load the dataset
df = pd.read_csv(file_path)
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.duplicated().sum()

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 2 - Data Preprocessing</strong></h2>
</div>

In [None]:
# Checking for missing values (isnull() and isna() are aliases, so one check covers both)
print("Missing values per column:")
print(df.isnull().sum())

In [None]:
desc = pd.DataFrame(index = list(df))
desc['type'] = df.dtypes
desc['count'] = df.count()
desc['nunique'] = df.nunique()
desc['%unique'] = desc['nunique'] /len(df) * 100
desc['null'] = df.isnull().sum()
desc['%null'] = desc['null'] / len(df) * 100
desc = pd.concat([desc,df.describe().T.drop('count',axis=1)],axis=1)
desc.sort_values(by=['type','null']).style.background_gradient(cmap='YlOrBr')\
    .bar(subset=['mean'],color='green')\
    .bar(subset=['max'],color='red')\
    .bar(subset=['min'], color='pink')

In [None]:
# Get a list of categorical columns in the dataframe
categorical_columns = df.select_dtypes(include=['object']).columns

# Check the unique values and their counts for each categorical column
for col in categorical_columns:
    print(f"Column: {col}")
    print("-" * 25)
    print(f"Unique values: {df[col].nunique()}")
    print(f"Unique values sample: {df[col].unique()[:10]}")  # Display a sample of unique values
    print("-" * 50)

In [None]:
# If a column has only one unique value, it won't be useful for prediction.
single_value_columns = [col for col in df.columns if df[col].nunique() == 1]
print("Columns with only one unique value:", single_value_columns)

# Dropping columns with one unique value
df = df.drop(columns=single_value_columns)

In [None]:
# Checking columns after dropping single-value columns
df.columns

In [None]:
# Drop identifier and contact columns, which carry no generalizable signal for the model
df = df.drop(columns=['Customer_Contact', 'Customer_Email', 'Customer_Name', 'Customer_ID', 'Transaction_ID', 'Merchant_ID'])
print(df.shape)

In [None]:
# Checking columns after dropping the identifier and contact columns
df.columns

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 3 - Exploratory Data Analysis (EDA)</strong></h2>
</div>

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8, 
        #f2f4f4 20%, 
        #ffe5b4 40%, 
        #ffffcc 60%, 
        #d1f2eb 80%, 
        #f3e5f5 100%
    );
    padding: 20px; 
    margin: 20px 0; 
    border-radius: 10px; 
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    font-family: Arial, sans-serif; 
    color: #333;
    text-align: center;
">
    <h3><strong>EDA for Numerical Columns</strong></h3>
</div>


In [None]:
# For numerical columns, we'll fill missing values with the median of each column
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
for col in numerical_columns:
    df[col] = df[col].fillna(df[col].median())

print(numerical_columns)

In [None]:
# Create a figure with 2 subplots in a horizontal row
fig, axes = plt.subplots(1, 2, figsize=(15, 6))  # 1 row, 2 columns

# KDE plot for the 'Is_Fraud' column (on the first subplot)
sns.kdeplot(df["Is_Fraud"], fill=True, ax=axes[0])
axes[0].set_title('Target Variable Distribution')

# Count plot for the 'Is_Fraud' column (on the second subplot)
sns.countplot(x='Is_Fraud', data=df, ax=axes[1])
axes[1].set_title('Fraudulent Transactions Count')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Loop through each numerical column in your DataFrame
for col in numerical_columns:
    plt.style.use("fivethirtyeight")
    plt.figure(figsize=(10, 6))
    
    # Create the boxplot
    sns.boxplot(x=df[col])
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    
    # Show the plot
    plt.show()


<div style="
    background: linear-gradient(45deg, 
        #d6eaf8, 
        #f2f4f4 20%, 
        #ffe5b4 40%, 
        #ffffcc 60%, 
        #d1f2eb 80%, 
        #f3e5f5 100%
    );
    padding: 20px; 
    margin: 20px 0; 
    border-radius: 10px; 
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    font-family: Arial, sans-serif; 
    color: #333;
    text-align: center;
">
    <h3><strong>EDA for Categorical Columns</strong></h3>
</div>


In [None]:
# For categorical columns, we'll fill missing values with the mode (most frequent category)
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    df[col] = df[col].fillna(df[col].mode()[0])
    
print(categorical_columns)

In [None]:
# Create a figure with 2 subplots in a horizontal row
fig, axes = plt.subplots(1, 2, figsize=(20, 6))  # 1 row, 2 columns

# Histogram for the 'Age' column (on the first subplot)
sns.histplot(df['Age'], kde=True, ax=axes[0], color='skyblue')
axes[0].set_title('Age Distribution')

# Count plot for the 'Gender' column (on the second subplot)
sns.countplot(x='Gender', data=df, ax=axes[1])
axes[1].set_title('Gender Distribution')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Calculate the number of rows needed based on the number of charts
num_cols = 3  # Number of charts per row
# num_rows = (len(categorical_columns) + num_cols - 1) // num_cols  # Calculate rows required for all charts
num_rows = 2 # Number of rows
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 6))  # Adjust figure size for more rows

# Flatten the axes array for easier iteration
axes = axes.flatten()

ax_index = 0
for col in categorical_columns:
    unique_values = df[col].nunique()
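    if unique_values < 10 and ax_index < len(axes):  # Plot low-cardinality columns while subplots remain
        # Plot on the respective subplot
        ax = axes[ax_index]
        counts = df[col].value_counts()
        # Take labels from the counts index so each label matches its slice
        ax.pie(counts, labels=counts.index, autopct='%1.1f%%')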
        ax.set_title(f'{col} Distribution')
        
        # Move to the next subplot
        ax_index += 1

# Hide any unused subplots (in case there are fewer than `num_rows * num_cols` charts)
for i in range(ax_index, len(axes)):
    axes[i].axis('off')

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter categorical columns with less than 20 unique values
categorical_cols = df.select_dtypes(include=['object']).columns
categorical_cols = [col for col in categorical_cols if df[col].nunique() < 20]

# Set the number of charts per row and derive the number of rows needed
num_cols = 3  # Number of charts per row
total_plots = len(categorical_cols)
num_rows = max(1, (total_plots + num_cols - 1) // num_cols)  # Enough rows for every chart

# Create a figure with the appropriate number of rows and columns
plt.figure(figsize=(15, 5 * num_rows))

# Plot the count plots for the filtered categorical columns
for i, col in enumerate(categorical_cols):
    plt.subplot(num_rows, num_cols, i + 1)
    sns.countplot(data=df, x=col, hue='Is_Fraud') 
    plt.title(f'Fraud by {col}')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Calculate fraud rate (%) by category
print("\nFraud Rate by Categories:")
for col in categorical_cols:
    print(f"\n{col} Analysis:")
    print((df.groupby(col)['Is_Fraud'].mean() * 100).round(1))
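A fraud rate is easier to interpret next to the number of transactions behind it, since a category with very few rows can show an extreme rate by chance. The sketch below is one way to report both together. The `Device_Type` column name and the toy values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; 'Device_Type' is an assumed column name
toy = pd.DataFrame({
    "Device_Type": ["Mobile", "Mobile", "Mobile", "Mobile", "ATM", "ATM", "POS"],
    "Is_Fraud":    [0, 1, 0, 0, 1, 0, 1],
})

# Named aggregation: row count, fraud count, and fraud rate per category
summary = (
    toy.groupby("Device_Type")["Is_Fraud"]
       .agg(transactions="count", frauds="sum", fraud_rate="mean")
       .sort_values("fraud_rate", ascending=False)
)
summary["fraud_rate_pct"] = (summary["fraud_rate"] * 100).round(1)
print(summary)
```

In this toy table the `POS` category shows a 100% fraud rate, but it rests on a single transaction, which is exactly the situation the `transactions` column is meant to expose.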


<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 4 - Convert Date Time Columns to Numerical Columns</strong></h2>
</div>

Convert the 'Transaction_Date' and 'Transaction_Time' columns to datetime so that calendar and clock features can be extracted.

In [None]:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'], format='%d-%m-%Y')
df['Transaction_Time'] = pd.to_datetime(df['Transaction_Time'], format='%H:%M:%S')

In [None]:
# Extract new features from 'Transaction_Date' and 'Transaction_Time'
df['Transaction_Day'] = df['Transaction_Date'].dt.day
df['Transaction_Month'] = df['Transaction_Date'].dt.month
df['Transaction_Year'] = df['Transaction_Date'].dt.year
df['Transaction_Hour'] = df['Transaction_Time'].dt.hour
df['Transaction_Minute'] = df['Transaction_Time'].dt.minute
df['Transaction_Second'] = df['Transaction_Time'].dt.second
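A raw hour value treats 23:00 and 00:00 as far apart even though they are adjacent on the clock. An optional refinement, not used in this notebook, is to encode the hour cyclically with sine and cosine. The sketch below assumes a `Transaction_Hour` column like the one extracted above. `Hour_Sin`, `Hour_Cos`, and `circular_distance` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the extracted 'Transaction_Hour' feature
toy = pd.DataFrame({"Transaction_Hour": [0, 6, 12, 18, 23]})

# Map the hour onto a circle so that 23 and 0 end up next to each other
angle = 2 * np.pi * toy["Transaction_Hour"] / 24
toy["Hour_Sin"] = np.sin(angle)
toy["Hour_Cos"] = np.cos(angle)


def circular_distance(row_a, row_b):
    """Euclidean distance between two rows in (sin, cos) space."""
    return float(np.hypot(row_a["Hour_Sin"] - row_b["Hour_Sin"],
                          row_a["Hour_Cos"] - row_b["Hour_Cos"]))


midnight, noon, late = toy.iloc[0], toy.iloc[2], toy.iloc[4]
d_near = circular_distance(late, midnight)  # 23:00 vs 00:00
d_far = circular_distance(late, noon)       # 23:00 vs 12:00
print(f"23h vs 0h: {d_near:.3f}, 23h vs 12h: {d_far:.3f}")
```

Tree-based models can often cope with the raw hour, so this mainly matters for distance-based or linear models such as KNN and logistic regression.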

In [None]:
# Drop 'Transaction_Date' and 'Transaction_Time' columns after feature extraction
df = df.drop(columns=['Transaction_Date', 'Transaction_Time'])

In [None]:
# If a column has only one unique value, it won't be useful for prediction.
single_value_cols = [col for col in df.columns if df[col].nunique() == 1]
print("Columns with only one unique value:", single_value_cols)

# Dropping columns with one unique value
df = df.drop(columns=single_value_cols)

In [None]:
# For numerical columns, updating after conversion
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
print("Numerical Columns ::", numerical_columns)
print("-"*50)
# For categorical columns, updating after conversion
categorical_columns = df.select_dtypes(include=['object']).columns
print("Categorical Columns ::", categorical_columns)

In [None]:
df.head()

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 5 - Encode Categorical Features</strong></h2>
</div>

In [None]:
# Initializing the LabelEncoder
label_encoder = LabelEncoder()

In [None]:
for col in categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])
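Reusing a single `LabelEncoder` across columns keeps only the mapping of the last column it was fitted on. If the original category names are needed later, for example to label plots or decode predictions, one option is to keep a separate encoder per column. The sketch below uses toy data, and the `Account_Type` column name is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; 'Account_Type' is an assumed column name
toy = pd.DataFrame({
    "Account_Type": ["Savings", "Business", "Checking", "Savings"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# One fitted encoder per column, stored so the mapping can be reversed later
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels for one column
decoded = encoders["Account_Type"].inverse_transform(toy["Account_Type"])
print(toy)
print(list(decoded))
```

Label encoding assigns integers in sorted order of the category names, which implies an ordering that does not exist for nominal features. Tree-based models are usually tolerant of this, while linear and distance-based models may be better served by one-hot encoding.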

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.nunique()

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 6 - EDA after Label Encoder</strong></h2>
</div>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter numerical columns with fewer than 200 unique values
numerical_features = df.select_dtypes(include=['float64', 'int64']).columns
numerical_features = [col for col in numerical_features if df[col].nunique() < 200]

# Set the number of charts per row
num_cols = 2  # Number of charts per row

# Calculate the number of rows needed based on the number of features
num_rows = (len(numerical_features) + num_cols - 1) // num_cols  # This ensures enough rows are created

# Create a figure with the appropriate number of rows and columns
plt.figure(figsize=(15, 5 * num_rows))

# Plot the histograms for the filtered numerical columns
for i, feature in enumerate(numerical_features):
    plt.subplot(num_rows, num_cols, i + 1)
    sns.histplot(data=df, x=feature, hue='Is_Fraud', bins=30)
    plt.title(f'{feature} Distribution by Fraud Status')

plt.tight_layout()
plt.show()

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 7 - Visualize Fraud Patterns and Distribution of Features</strong></h2>
</div>

In [None]:
# Create a figure with 2 subplots in a horizontal row
fig, axes = plt.subplots(1, 2, figsize=(15, 6))  # 1 row, 2 columns

# KDE plot for the 'Is_Fraud' column (on the first subplot)
sns.kdeplot(df["Is_Fraud"], fill=True, ax=axes[0])
axes[0].set_title('Target Variable Distribution')

# Count plot for the 'Is_Fraud' column (on the second subplot)
sns.countplot(x='Is_Fraud', data=df, ax=axes[1])
axes[1].set_title('Fraudulent Transactions Count')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Visualize fraud transactions based on 'Transaction_Amount'
plt.figure(figsize=(12, 6))
sns.boxplot(x='Is_Fraud', y='Transaction_Amount', data=df)
plt.title("Transaction Amount vs Fraud/Non-Fraud")
plt.show()

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 8 - Plot Correlation Matrix to Understand Feature Relationships</strong></h2>
</div>

In [None]:
plt.figure(figsize=(14, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

In [None]:
# Calculate the correlation matrix (all columns are numeric after encoding)
correlation_matrix = df.corr()

# Extract correlation with 'Is_Fraud' and drop 'Is_Fraud' itself
fraud_correlation = correlation_matrix['Is_Fraud'].sort_values(ascending=False).drop('Is_Fraud')

# Plot the heatmap for the correlation with 'Is_Fraud'
plt.figure(figsize=(8, 5))
sns.heatmap(fraud_correlation.to_frame(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation with Is_Fraud')
plt.show()

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 9 - Feature Importance using Random Forest</strong></h2>
</div>

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
X = df.drop(columns=['Is_Fraud'])
y = df['Is_Fraud']

In [None]:
print("Shape for X Dataframe: ", X.shape)
print("Columns for X Dataframe: ", X.columns)
print("-"*50)
print("Shape for y Dataframe: ", y.shape)

In [None]:
# Train the model
rf.fit(X, y)

In [None]:
# Get feature importances
feature_importances = pd.DataFrame(rf.feature_importances_, index=X.columns, columns=['importance'])
feature_importances = feature_importances.sort_values('importance', ascending=False)

In [None]:
# Plot feature importances
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
feature_importances.head(20).plot(kind='bar', figsize=(10, 6))
plt.title("Top 20 Feature Importances")
plt.show()
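Impurity-based importances from a random forest are computed on the training data and can favor high-cardinality features. A common cross-check is permutation importance measured on held-out rows. The sketch below uses synthetic data from `make_classification` as a stand-in for the transaction features. Every name in it (`X_toy`, `rf_toy`, `perm_importances`, and so on) is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
X_toy = pd.DataFrame(X_toy, columns=[f"feature_{i}" for i in range(6)])

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

rf_toy = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the validation rows and measure the drop in ROC AUC
result = permutation_importance(
    rf_toy, X_val, y_val, scoring="roc_auc", n_repeats=5, random_state=42
)
perm_importances = pd.Series(
    result.importances_mean, index=X_val.columns
).sort_values(ascending=False)
print(perm_importances)
```

Features whose permutation importance is near zero on held-out data contribute little to generalization, whatever their impurity-based ranking suggests.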

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 10 - Select Only Important Features</strong></h2>
</div>

In [None]:
# Select features with importance greater than a threshold (e.g., 0.01)
important_features = feature_importances[feature_importances['importance'] > 0.01].index
X = df[important_features]
print("Shape for X Dataframe: ", X.shape)
print("Columns for X Dataframe: ", X.columns)

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 11 - Perform PCA (Principal Component Analysis)</strong></h2>
</div>

In [None]:
# PCA is scale-sensitive, so standardize the features before projecting them
pca = PCA(n_components=2)  # Reducing to 2 components for visualization
X_pca = pca.fit_transform(StandardScaler().fit_transform(X))

In [None]:
# Plot PCA results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm')
plt.title("PCA of Important Features")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label='Fraud (1) vs Non-Fraud (0)')
plt.show()
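PCA picks the directions of greatest variance, so a feature measured in large units can dominate the projection. The sketch below compares the explained variance ratio with and without standardization. The three columns are hypothetical and randomly generated, and their scales are chosen only to make the effect visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical, independent features on very different scales
amount = rng.normal(50_000, 20_000, size=500)
age = rng.normal(40, 12, size=500)
hour = rng.integers(0, 24, size=500).astype(float)
X_toy = np.column_stack([amount, age, hour])

# Share of variance captured by each of the first two components
raw_ratio = PCA(n_components=2).fit(X_toy).explained_variance_ratio_
scaled_ratio = (
    PCA(n_components=2)
    .fit(StandardScaler().fit_transform(X_toy))
    .explained_variance_ratio_
)

print("Unscaled:", raw_ratio.round(4))
print("Scaled:  ", scaled_ratio.round(4))
```

Without scaling, the first component is essentially the amount column alone. After scaling, the variance is spread more evenly across components. Checking `explained_variance_ratio_` is also a quick way to judge how much structure a 2-D scatter plot can show.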

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 12 - Train-test Split</strong></h2>
</div>

In [None]:
# Stratify on the target so both splits keep the same fraud rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
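Because fraud is typically a small minority class, a purely random split can leave the train and test sets with different fraud rates. Passing `stratify` to `train_test_split` keeps the rates aligned. The sketch below checks this on hypothetical labels with a 5% fraud rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 50 frauds among 1,000 transactions (5%)
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(f"Train fraud rate: {y_tr.mean():.3f}")
print(f"Test fraud rate:  {y_te.mean():.3f}")
```

Both rates should stay very close to the overall 5%, which makes metrics on the test set comparable with those seen during training.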

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 13 - Feature Scaling</strong></h2>
</div>

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 14 - Model Training and Evaluation</strong></h2>
</div>

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': xgb.XGBClassifier(),
    'LightGBM': lgb.LGBMClassifier(),
    'CatBoost': cb.CatBoostClassifier(silent=True),
    'AdaBoost': AdaBoostClassifier(),
    'Bagging': BaggingClassifier(),
    'KNN': KNeighborsClassifier()
    # 'SVM (RBF)': SVC(kernel='rbf', probability=True),
    # 'SVM (Linear)': LinearSVC(),
    # 'GaussianNB': GaussianNB()
    # 'LDA': LDA(),
    # 'QDA': QuadraticDiscriminantAnalysis(),
    # 'Ridge Classifier': RidgeClassifier(),
}

In [None]:
# Define reduced parameter grids
param_grids = {
    'Logistic Regression': {
        'C': [0.1, 1],
        'solver': ['liblinear'],
        'penalty': ['l2']
    },
    'Decision Tree': {
        'max_depth': [5, 10],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    },
    'Random Forest': {
        'n_estimators': [50, 100],
        'max_depth': [10],
        'min_samples_split': [2],
        'min_samples_leaf': [1]
    },
    'Gradient Boosting': {
        'n_estimators': [100],
        'learning_rate': [0.1],
        'max_depth': [5]
    },
    'XGBoost': {
        'n_estimators': [100],
        'learning_rate': [0.1],
        'max_depth': [5],
        'subsample': [0.8, 1.0]
    },
    'SVM (RBF)': {
        'C': [1, 10],
        'gamma': ['scale', 'auto']
    },
    'SVM (Linear)': {
        'C': [1, 10],
    },
    'LightGBM': {
        'n_estimators': [100],
        'learning_rate': [0.1],
        'max_depth': [3, 5],
    },
    'CatBoost': {
        'iterations': [100],
        'learning_rate': [0.1],
        'depth': [3, 5]
    },
    'KNN': {
        'n_neighbors': [3],
        'weights': ['uniform', 'distance']
    },
    'AdaBoost': {
        'n_estimators': [100],
        'learning_rate': [0.01, 0.1]
    },
    'Bagging': {
        'n_estimators': [100],
        'max_samples': [0.8, 1.0]
    },
    'LDA': {},
    'QDA': {},
    'Ridge Classifier': {
        'alpha': [0.1, 1]
    },
    'GaussianNB': {}
}

In [None]:
# Initialize an empty dictionary to store results
model_results = {}

# Compute balanced class weights for inspection (models that support it use class_weight='balanced' below)
class_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}
print("class_weight_dict: ", class_weight_dict)

# Oversample the minority class in the scaled training set with SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

# Evaluate models with GridSearchCV
for model_name, model in models.items():
    print(f"Training model with GridSearchCV: {model_name}")
    
    # Get the parameter grid for the model
    param_grid = param_grids[model_name]
    
    # Modify model to include class weights where applicable
    if model_name in ['Logistic Regression', 'Random Forest', 'SVM (RBF)', 'SVM (Linear)']:
        # Assign class weights for models that support it
        if model_name == 'Logistic Regression':
            model = LogisticRegression(class_weight='balanced')
        elif model_name == 'Random Forest':
            model = RandomForestClassifier(class_weight='balanced')
        elif model_name in ['SVM (RBF)', 'SVM (Linear)']:
            model = SVC(probability=True, class_weight='balanced') if model_name == 'SVM (RBF)' else LinearSVC(class_weight='balanced')

    # Perform GridSearchCV with parallelism
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    
    # Run the grid search on the SMOTE-resampled training data
    grid_search.fit(X_train_smote, y_train_smote)
    
    # Get the best model and its parameters
    best_model = grid_search.best_estimator_
    print(f"Best parameters for {model_name}: {grid_search.best_params_}")
    
    # Predict on both train and test sets
    y_train_pred = best_model.predict(X_train_smote)
    y_test_pred = best_model.predict(X_test_scaled)
    
    # Store the results
    model_results[model_name] = {
        'train_accuracy': best_model.score(X_train_smote, y_train_smote),
        'test_accuracy': best_model.score(X_test_scaled, y_test),
        'y_test': y_test,
        'y_test_pred': y_test_pred,
        'classification_report': classification_report(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, best_model.predict_proba(X_test_scaled)[:, 1])
    }

    # Print results for the current model
    print("\nModel Evaluation Results:")
    print(f"Model: {model_name}\n")
    print(f"Train Accuracy: {model_results[model_name]['train_accuracy']:.4f}")
    print(f"Test Accuracy: {model_results[model_name]['test_accuracy']:.4f}")
    print(f"ROC AUC: {model_results[model_name]['roc_auc']:.4f}\n")
    print(f"Classification Report:\n{model_results[model_name]['classification_report']}")
    print("-" * 80)
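When the training set is resampled and scaled before it reaches `GridSearchCV`, each validation fold contains rows that already influenced the preprocessing, which can make cross-validation scores optimistic. One way to avoid this is to put the preprocessing inside a pipeline so it is refit on each training fold. The sketch below is a minimal illustration under stated assumptions. It uses synthetic data, swaps SMOTE for `class_weight='balanced'` so that it depends only on scikit-learn, and scores on ROC AUC instead of accuracy. With imbalanced-learn, its own `Pipeline` class can hold a SMOTE step in the same position.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the transaction data
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Scaling lives inside the pipeline, so it is refit on every training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear")),
])

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1]},
    scoring="roc_auc",  # threshold-free metric suited to imbalanced classes
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)

test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print("Best params:", search.best_params_)
print(f"Test ROC AUC: {test_auc:.4f}")
```

The untouched test split is passed to the fitted search as-is, because the pipeline applies the scaler learned from the training data automatically.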

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 15 - Displaying Evaluation Results for All Models</strong></h2>
</div>


In [None]:
# # Print results after all models are evaluated
# print("\nModel Evaluation Results:")
# print(f"Model: {model_results[model_name]}\n")
# print(f"Train Accuracy: {model_results[model_name]['train_accuracy']:.4f}")
# print(f"Test Accuracy: {model_results[model_name]['test_accuracy']:.4f}")
# print(f"ROC AUC: {model_results[model_name]['roc_auc']:.4f}\n")
# print(f"Classification Report:\n{model_results[model_name]['classification_report']}")
# print("-" * 80)

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8 0%, 
        #d6eaf8 10%, 
        #f2f4f4 10%, 
        #f2f4f4 20%, 
        #ffe5b4 20%, 
        #ffe5b4 30%, 
        #ffffcc 30%, 
        #ffffcc 40%, 
        #d1f2eb 40%, 
        #d1f2eb 50%, 
        #f3e5f5 50%, 
        #f3e5f5 60%, 
        #ffe4e1 60%, 
        #ffe4e1 70%
    );
    color: #333;
    padding: 10px;
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    border-radius: 10px;
">
    <h2><strong>Stage 16 - Plotting the ROC Curve and Confusion Matrix for Each Model</strong></h2>
</div>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

# Initialize a list to store results for all models
results_list = []

# Iterate through the models to collect results and plot confusion matrix and ROC curve
for model_name, model in model_results.items():
    # Extract the predicted values and actual values
    y_test_pred = model['y_test_pred']  # Use the predicted labels
    y_test = model['y_test']  # Actual true labels
    
    # Extract metrics
    train_accuracy = model['train_accuracy']
    test_accuracy = model['test_accuracy']
    roc_auc = model['roc_auc']
    
    # Classification Report
    clf_report = classification_report(y_test, y_test_pred)

    # Print the model name followed by its evaluation metrics
    print("-" * 40)
    print(f"Model: {model_name}")
    print("-" * 40)
    print(f"Train Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"ROC AUC: {roc_auc:.4f}")
    print("Classification Report:")
    print(clf_report)
    print("-" * 80)  # Separator line for clarity
    
    # Generate confusion matrix
    cm = confusion_matrix(y_test, y_test_pred)

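    # ROC curve: use continuous scores when they were stored with the results;
    # hard 0/1 predictions are only a fallback and give a single operating point
    y_test_score = model.get('y_test_score', y_test_pred)
    fpr, tpr, _ = roc_curve(y_test, y_test_score)
    roc_auc_value = auc(fpr, tpr)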

    # Create subplots: 1 row, 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))  # Width, Height

    # Plot ROC Curve on the first subplot
    ax1.plot(fpr, tpr, color='b', lw=2, label=f'ROC curve (area = {roc_auc_value:.2f})')
    ax1.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Random classifier line
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.set_title(f'ROC Curve for {model_name}')
    ax1.legend(loc='lower right')

    # Plot Confusion Matrix on the second subplot
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Predicted Negative', 'Predicted Positive'],
                yticklabels=['Actual Negative', 'Actual Positive'], ax=ax2)
    ax2.set_title(f'Confusion Matrix for {model_name}')
    ax2.set_xlabel('Predicted')
    ax2.set_ylabel('Actual')

    # Show both plots
    plt.tight_layout()
    plt.show()

    # Append the results to the list for the DataFrame
    results_list.append({
        'Model': model_name,
        'Train Accuracy': f"{train_accuracy:.4f}",
        'Test Accuracy': f"{test_accuracy:.4f}",
        'ROC AUC': f"{roc_auc:.4f}",
        'Classification Report': clf_report
    })

# Convert results into a DataFrame for better presentation
results_df = pd.DataFrame(results_list)

# Print the summary of results in a tabular format
# print("\nSummary of Model Evaluation Results:")
# print(results_df.to_string(index=False))  # Display as a pretty table
print("-" * 80)
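
A note on reading ROC plots: ROC curves and ROC AUC are meant to be computed from continuous scores (probabilities or decision values), not from thresholded 0/1 predictions. The sketch below uses made-up data and a hypothetical pure-Python helper, `roc_auc`, that computes AUC as the fraction of positive/negative pairs ranked correctly. It only illustrates how far the two numbers can differ; it does not reproduce this notebook's results.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: both frauds are ranked above every genuine transaction,
# but no score reaches the default 0.5 threshold.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.40]
y_pred = [int(s >= 0.5) for s in y_score]  # every prediction is 0

auc_from_scores = roc_auc(y_true, y_score)
auc_from_labels = roc_auc(y_true, y_pred)

print(f"AUC from scores: {auc_from_scores:.2f}")
print(f"AUC from hard labels: {auc_from_labels:.2f}")
```

In this toy case the ranking is perfect, yet the AUC computed from hard labels collapses to the random-guess value because every pair is tied.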


<div style="
    background: linear-gradient(45deg, 
        #d6eaf8, 
        #f2f4f4 20%, 
        #ffe5b4 40%, 
        #ffffcc 60%, 
        #d1f2eb 80%, 
        #f3e5f5 100%
    );
    padding: 20px; 
    margin: 20px 0; 
    border-radius: 10px; 
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    font-family: Arial, sans-serif; 
    color: #333;
    text-align: left;
">
    <h2><strong>Stage 17 - Final Conclusion</strong></h2>
</div>

### Conclusion:

1. **High Test Accuracy**: The model achieved a high test accuracy, meaning it correctly predicted most instances in the test set. With a heavily imbalanced target, however, accuracy mostly reflects the majority class and says little about how well fraud is detected.
   
2. **ROC AUC**: The **ROC AUC** is close to 0.5, the value expected from random guessing. The model therefore has little discriminative power between the two classes, especially for class 1.

3. **Class Imbalance**: The classification report highlights a significant class imbalance. 
   - **Class 0** (majority class) has a high precision of **0.95** and recall of **1.00**, with an **F1-score of 0.97**, indicating that the model performs very well on class 0.
   - **Class 1** (minority class) has very low precision (**0.02**) and recall (**0.00**), with an **F1-score of 0.00**, indicating that the model struggles severely to identify the minority class (class 1).
   
4. **Impact of Class Imbalance**: The poor performance on class 1 suggests that the model may be biased towards predicting the majority class (class 0), and thus failing to identify the minority class. This is supported by the low recall and precision for class 1.

5. **Model Improvement Suggestions**:
   - **Address Class Imbalance**: The notebook already applies SMOTE resampling and `class_weight='balanced'`, yet class 1 is still missed. Revisiting how these are applied, and selecting models by a class-1 metric such as **F1-score** rather than accuracy, may improve detection of the minority class.
   - **Model Tuning**: Exploring other models or hyperparameters to better balance performance across both classes may help.

6. **Final Remarks**: While the model shows strong overall accuracy, it is heavily biased towards the majority class, which makes it unreliable for detecting the minority class. Addressing the class imbalance should be a priority for improving model performance in real-world scenarios.
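
The gap between overall accuracy and class-1 performance can be reproduced with a few lines of arithmetic. The sketch below assumes a hypothetical test set of 950 genuine and 50 fraudulent transactions (a split chosen to resemble the figures above, not taken from the dataset) and a classifier that predicts class 0 for everything. The helper `class_metrics` is illustrative, not part of the notebook.

```python
def class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced test set: 95% genuine (0), 5% fraud (1)
y_true = [0] * 950 + [1] * 50
# A "model" that always predicts the majority class
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics_class0 = class_metrics(y_true, y_pred, label=0)
metrics_class1 = class_metrics(y_true, y_pred, label=1)

print(f"Accuracy: {accuracy:.2f}")
print("Class 0 (precision, recall, f1):", [round(m, 2) for m in metrics_class0])
print("Class 1 (precision, recall, f1):", [round(m, 2) for m in metrics_class1])
```

A classifier that never flags fraud still scores 95% accuracy here, which is why the per-class rows of the classification report matter more than the headline accuracy.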

#### Final Remarks:

1. The model demonstrates strong overall accuracy, indicating its ability to correctly predict the majority of instances within the dataset.
   
2. There is a noticeable discrepancy between training and testing accuracy, which may suggest some degree of overfitting, although the difference is not extreme.

3. The ROC AUC score is close to random guessing, indicating that the model struggles with distinguishing between the two classes, especially for the minority class.

4. Class imbalance is a significant issue, as the model shows excellent performance on the majority class but fails to effectively identify the minority class.

5. Precision and recall for the majority class are very high, showcasing that the model can accurately predict this class without many false positives or negatives.

6. The performance for the minority class is poor, with the model having difficulty detecting and correctly predicting instances of this class.

7. The model's inability to perform well on the minority class suggests a bias toward the majority class, which reduces its overall usefulness in cases where detecting the minority class is important.

8. There is an imbalance between the precision and recall of the two classes, with the model being much more sensitive to the majority class.

9. Improvements to the model should focus on addressing class imbalance, such as through resampling techniques, class weighting, or exploring alternative models that are more adept at handling skewed distributions.

10. The current model, while performing well on the majority class, needs further optimization and tuning to ensure it can reliably detect the minority class and be more robust across all categories.
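
To make the class-weighting suggestion concrete, the sketch below computes the weights that scikit-learn's `class_weight='balanced'` heuristic assigns, `n_samples / (n_classes * count)`, in plain Python. The label counts are hypothetical, and `balanced_class_weights` is an illustrative helper rather than a library function.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights from the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label vectors: the raw imbalanced data and a fully balanced resample
imbalanced = [0] * 950 + [1] * 50
resampled = [0] * 950 + [1] * 950

weights_imbalanced = balanced_class_weights(imbalanced)
weights_resampled = balanced_class_weights(resampled)

print("Imbalanced:", {k: round(v, 3) for k, v in weights_imbalanced.items()})
print("Resampled: ", weights_resampled)
```

On the imbalanced labels each fraud sample counts roughly 19 times as much as a genuine one. On data that has already been balanced by resampling, both weights come out as 1.0, so combining the two techniques adds nothing beyond the resampling itself.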

<div style="
    background: linear-gradient(45deg, 
        #d6eaf8, 
        #f2f4f4 20%, 
        #ffe5b4 40%, 
        #ffffcc 60%, 
        #d1f2eb 80%, 
        #f3e5f5 100%
    );
    padding: 20px; 
    margin: 20px 0; 
    border-radius: 10px; 
    box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3);
    font-family: Arial, sans-serif; 
    color: #333;
    text-align: center;
">
    <h2><strong>Stage 18 - Thank You</strong></h2>
    <p><center><strong>If it didn’t make you cry (tears of frustration or boredom), give that vote button a little click!</strong></center></p>
</div>

Thanks a ton for taking the time to dive into the code! If you enjoyed the ride (or at least didn’t fall asleep halfway through), don’t forget to hit that shiny vote button. 😉

It’s like a high-five in the digital world—except without the awkward hand placement. So go ahead, show some love, and let’s make sure this code gets the recognition it deserves! 

Happy coding! And remember, your vote could save a developer’s day!