# Project name - Paisabazaar Banking Fraud Analysis

**Project type - Exploratory Data Analysis (EDA)**

# Problem Statement

**Objective:**
The objective of this analysis is to detect fraudulent activity in banking transactions and financial behavior using customer data. This involves identifying key risk factors related to credit score, delayed payments, outstanding debt, credit utilization, and spending behavior.

**Challenges in Banking Fraud Detection:**
- Fraudulent customers may manipulate their credit behavior to appear genuine.
- High-risk customers may have poor credit scores, high outstanding debts, or frequent delayed payments.
- Fraudulent activity can arise from excessive credit utilization and unusual spending patterns.

**Key Analysis Areas:**
- Credit Risk Factors
  - Customers with poor credit scores and excessive debt.
  - Patterns in credit utilization ratio and delayed payments.
- Payment & Spending Behavior Analysis
  - Identifying customers with risky spending habits (e.g., high spending with low income).
  - Understanding different payment behaviors (e.g., minimum payments vs. full payments).
- Anomaly Detection for Fraudulent Activities
  - Finding outliers in financial transactions.
  - Identifying unusual loan application trends and banking activities.
- Predictive Fraud Detection Model (if applicable)
  - Using machine learning to classify customers as fraudulent or non-fraudulent.
  - Finding correlations between income, credit score, loan amount, and fraud likelihood.

**Expected Outcome:**
- A data-driven risk assessment model that can flag high-risk customers.
- Insights into fraudulent patterns in banking transactions.
- A potential predictive model to help Paisabazaar prevent financial fraud.
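
The credit risk factors listed above (credit utilization, delayed payments, outstanding debt) can be combined into a simple rule-based flag. The following is a minimal sketch on toy data, not the notebook's method: the column names `Credit_Utilization_Ratio`, `Num_of_Delayed_Payment`, and `Outstanding_Debt`, the function `flag_high_risk`, and every threshold are assumptions chosen for illustration and would need to be checked against the actual dataset.

```python
import pandas as pd

# Toy data. Column names and values are illustrative assumptions,
# not taken from the Paisabazaar dataset.
customers = pd.DataFrame({
    "Credit_Utilization_Ratio": [25.0, 48.0, 39.5, 22.0],
    "Num_of_Delayed_Payment": [1, 12, 3, 15],
    "Outstanding_Debt": [500.0, 4200.0, 1500.0, 3900.0],
})

def flag_high_risk(df, util_max=40.0, delayed_max=10, debt_max=3000.0):
    """Return a boolean Series that is True when at least two risk rules fire.

    The thresholds are hypothetical defaults, not calibrated values.
    """
    rules = pd.DataFrame({
        "high_utilization": df["Credit_Utilization_Ratio"] > util_max,
        "many_delays": df["Num_of_Delayed_Payment"] > delayed_max,
        "high_debt": df["Outstanding_Debt"] > debt_max,
    })
    # Count how many rules each customer violates
    return rules.sum(axis=1) >= 2

customers["High_Risk"] = flag_high_risk(customers)
print(customers)
```

Requiring two rules to fire instead of one is a design choice that reduces false positives from a single noisy feature. A real assessment would tune the thresholds on the data.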

# EDA_Paisabazaar Banking Fraud Analysis

# 1. Know Your Data


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from scipy import stats
from google.colab import drive
import plotly.express as px
drive.mount('/content/drive')


# Uploading Dataset

In [None]:
file_path = '/content/drive/My Drive/Colab Notebooks/Paisabazaar Banking Fraud Analysis.csv'


# Load Dataset

In [None]:
df_file = pd.read_csv(file_path)


# Dataset First Rows

In [None]:
# Show the first five rows to check that the file loaded as expected
print(df_file.head())

# Get information



In [None]:
# Get information about the data types and missing values
print(df_file.info())


# Get descriptive statistics


In [None]:
# Get descriptive statistics for numerical columns
print(df_file.describe())

# Missing Values and Duplicate Rows


In [None]:
print("\nDuplicate Rows:", df_file.duplicated().sum())
print("\nMissing Values:\n", df_file.isnull().sum())
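
Raw missing-value counts are easier to act on when shown as a share of all rows. A small sketch of such a summary follows. The helper `missing_summary` and the toy columns `Occupation` and `Age` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first.

    Columns without missing values are left out of the result.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for df_file (values are made up)
toy = pd.DataFrame({
    "Annual_Income": [50000.0, np.nan, 42000.0, np.nan],
    "Occupation": ["Engineer", None, "Teacher", "Doctor"],
    "Age": [30, 41, 28, 35],
})
report = missing_summary(toy)
print(report)
```

On the real data the same call would be `missing_summary(df_file)`.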

# 2. Handling Missing Values

In [None]:
# 2. Handling Missing Values
imputer = SimpleImputer(strategy='mean')

# Select only numerical features for imputation
numerical_features = df_file.select_dtypes(include=np.number).columns

# Apply imputation to numerical features only
df_file[numerical_features] = imputer.fit_transform(df_file[numerical_features])

# Check data types and unique values
print("\nColumn Data Types:\n", df_file.dtypes)
print("\nUnique Values per Column:\n", df_file.nunique())
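
The cell above fills numerical gaps with the column mean and leaves non-numerical columns untouched. A mean is pulled toward extreme values, which matters for skewed features such as income. One possible alternative, sketched here on toy data with plain pandas, is the median for numerical columns and the most frequent value for categorical ones. The `Occupation` column and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one extreme income and one missing value per column
toy = pd.DataFrame({
    "Annual_Income": [30000.0, 32000.0, np.nan, 35000.0, 900000.0],
    "Occupation": ["Teacher", None, "Teacher", "Doctor", "Engineer"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Median is less affected by the single very large income
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Most frequent category; mode() ignores missing values
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print("Mean income:  ", toy["Annual_Income"].mean())
print("Median income:", toy["Annual_Income"].median())
print(filled)
```

In this toy example the mean is far above every typical income while the median stays among them, which is why the fill value differs so much between the two strategies.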


# 3. Handling Outliers using IQR Method

In [None]:
# Handling outliers with the IQR method
# Select only numerical features for outlier handling
numerical_features = df_file.select_dtypes(include=np.number).columns
df_numeric = df_file[numerical_features]

# Calculate quantiles and IQR for numerical features only
Q1 = df_numeric.quantile(0.25)
Q3 = df_numeric.quantile(0.75)
IQR = Q3 - Q1

# Identify outliers in numerical features
outlier_condition = (df_numeric < (Q1 - 1.5 * IQR)) | (df_numeric > (Q3 + 1.5 * IQR))

# Keep only rows with no outlier in any numerical column.
# df_file stays unchanged; df is an independent, filtered copy.
df = df_file[~outlier_condition.any(axis=1)].copy()
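
Dropping every row that has an outlier in any column can remove a large share of the data, and in a fraud analysis the extreme rows may be the ones of most interest. As a hedged alternative, the sketch below applies the same IQR rule but flags rows instead of removing them. The function `iqr_outlier_mask`, the column `Is_Outlier`, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def iqr_outlier_mask(df, k=1.5):
    """Boolean DataFrame: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))

# Toy numerical data with one extreme value per column (made-up values)
toy = pd.DataFrame({
    "Annual_Income": [30000.0, 32000.0, 31000.0, 35000.0, 33000.0, 900000.0],
    "Outstanding_Debt": [500.0, 700.0, 650.0, 600.0, 5000.0, 550.0],
})

mask = iqr_outlier_mask(toy)
print("Outliers per column:\n", mask.sum())

# Flag rows instead of dropping them, so they remain available for inspection
toy["Is_Outlier"] = mask.any(axis=1)
print(toy)
```

Printing `mask.sum()` before filtering also shows which columns drive the row loss, which helps when deciding whether removal is appropriate at all.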

# 4. Visualization

In [None]:
# 4. Visualization
plt.figure(figsize=(10, 6))
# Replace 'Annual_Income' with the desired column name for visualization
sns.histplot(df['Annual_Income'], bins=30, kde=True)
plt.title("Distribution of Annual_Income") # Update title accordingly
plt.show()


# Correlation Heatmap

In [None]:
plt.figure(figsize=(10, 6))
# Calculate correlation only for numerical features
numerical_features = df.select_dtypes(include=np.number).columns
correlation_matrix = df[numerical_features].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()
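
A heatmap with many features can be hard to read. As a complement, the strongest relationships can be listed as a ranked table. This is a sketch under assumptions: `top_correlations` is a hypothetical helper, and the toy columns and values are made up.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include=np.number).corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    # Order by absolute strength while keeping the original sign
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Toy data: the second column is an exact multiple of the first
toy = pd.DataFrame({
    "Annual_Income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Monthly_Salary": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Delayed_Payments": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top = top_correlations(toy)
print(top)
```

On the notebook's data the call would be `top_correlations(df)`.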


# Pairplot for Relationship


In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 1:
    # Sample from the outlier-filtered frame; cap at 500 rows to keep the plot fast
    sample_df = df[numeric_cols].sample(n=min(500, len(df)), random_state=42)
    sns.pairplot(sample_df)
    plt.show()
else:
    print("Not enough numeric features for a pairplot.")


# Interactive Scatter Matrix with Plotly

In [None]:
# Interactive Scatter Matrix with Plotly
fig = px.scatter_matrix(df, dimensions=df.select_dtypes(include=[np.number]).columns)
fig.update_layout(title="Scatter Matrix of Numeric Features")
fig.show()

# Interactive Boxplot for Outliers Detection

In [None]:
# Interactive Boxplot for Outliers Detection
fig = px.box(df, y=df.select_dtypes(include=[np.number]).columns, title="Interactive Boxplot for Outlier Detection")
fig.show()


# Interactive Bar Chart for Categorical Data Distribution

In [None]:
# Interactive Bar Chart for Categorical Data Distribution
categorical_columns = df_file.select_dtypes(include=['object']).columns
for col in categorical_columns:
    # Build a two-column frame: category values (named after col) and their counts.
    # rename_axis + reset_index(name=...) yields the same column names across pandas versions.
    value_counts_df = df_file[col].value_counts().rename_axis(col).reset_index(name='count')
    fig = px.bar(value_counts_df, x=col, y='count', title=f"Distribution of {col}")
    fig.update_xaxes(title_text=col)  # Label the x-axis with the original column name
    fig.show()

print("\nExploratory Data Analysis Completed Successfully!")
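
Object columns with many distinct values (for example identifiers or names) produce bar charts that are slow to render and hard to read. One way to handle this, sketched below, is to keep the most frequent categories and group the rest under a single label. The helper `top_n_counts`, the label `"Other"`, and the toy series are assumptions for illustration.

```python
import pandas as pd

def top_n_counts(series, n=10):
    """Counts for the n most frequent values, with the remainder summed as 'Other'.

    Returns a DataFrame with two columns: the series name and 'count'.
    """
    counts = series.value_counts()
    top = counts.head(n)
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top.rename_axis(series.name).reset_index(name="count")

# Toy categorical column (made-up values)
occupation = pd.Series(["A", "A", "A", "B", "B", "C", "D"], name="Occupation")
result = top_n_counts(occupation, n=2)
print(result)
```

The returned frame has the same shape as `value_counts_df` in the loop above, so it could be passed to `px.bar` with `x=col` and `y='count'`.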