# Exploratory Data Analysis (EDA) for Fraud Detection

This notebook contains exploratory data analysis for the fraud detection project.

## Goals:
1. **Data Overview**: Understand the structure and basic statistics of the dataset
2. **Data Quality Assessment**: Identify missing values, duplicates, and data inconsistencies
3. **Target Variable Analysis**: Analyze the distribution of fraud vs legitimate transactions
4. **Feature Analysis**: Explore individual features and their relationships with fraud
5. **Correlation Analysis**: Identify correlations between features and with the target variable
6. **Time-based Patterns**: Analyze temporal patterns in fraud occurrence
7. **Amount Analysis**: Study transaction amounts and their relationship to fraud
8. **Geographic Patterns**: Analyze location-based fraud patterns
9. **Merchant Analysis**: Study fraud patterns by merchant category
10. **Feature Engineering Insights**: Identify potential new features to create

## Setup and Imports

Import necessary libraries and set up the environment.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Import project modules
import sys
sys.path.append('..')
from src.data_loader import DataLoader, load_sample_data
from src.preprocess import DataPreprocessor

## 1. Data Loading and Overview

Load the dataset and get a high-level understanding of the data structure.

In [2]:
# Load data
# TODO: Replace with actual data loading
df = load_sample_data()

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nData types:")
print(df.dtypes)
print(f"\nFirst few rows:")
df.head()

2025-07-19 14:12:47,138 - sample_data_loader - INFO - Generated sample data: 1000 rows, 14 columns
2025-07-19 14:12:47,142 - sample_data_loader - INFO - Fraud rate: 0.050
Dataset shape: (1000, 14)
Columns: ['transaction_id', 'amount', 'merchant_category', 'hour_of_day', 'day_of_week', 'customer_age', 'distance_from_home', 'distance_from_last_transaction', 'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip', 'used_pin_number', 'online_order', 'fraud']

Data types:
transaction_id                      int64
amount                            float64
merchant_category                  object
hour_of_day                         int32
day_of_week                         int32
customer_age                        int64
distance_from_home                float64
distance_from_last_transaction    float64
ratio_to_median_purchase_price    float64
repeat_retailer                     int64
used_chip                           int64
used_pin_number                     int64
online_order   

|   | transaction_id | amount | merchant_category | hour_of_day | day_of_week | customer_age | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 46.926809 | travel | 15 | 5 | 28 | 19.103636 | 4.90525 | 1.116329 | 0 | 1 | 0 | 1 | 0 |
| 1 | 2 | 301.012143 | food | 1 | 0 | 59 | 15.013947 | 1.489272 | 2.613066 | 0 | 1 | 0 | 1 | 0 |
| 2 | 3 | 131.674569 | online | 5 | 4 | 36 | 4.583363 | 13.400532 | 1.434548 | 0 | 1 | 0 | 1 | 0 |
| 3 | 4 | 91.294255 | food | 20 | 3 | 30 | 4.710932 | 19.446618 | 0.47728 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 16.962487 | food | 11 | 2 | 36 | 10.737036 | 27.922142 | 0.159812 | 0 | 1 | 0 | 0 | 0 |


## 2. Data Quality Assessment

Check for data quality issues including missing values, duplicates, and data inconsistencies.

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Check for duplicates
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")

# Check for unique values in categorical columns
print("\nUnique values in categorical columns:")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
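
The cell above covers missing values and duplicates but not value-range inconsistencies. A minimal sketch of such a check follows. The toy frame and the valid ranges are assumptions for illustration, not taken from the project's data dictionary.

```python
import pandas as pd

# Hypothetical toy frame mirroring a few columns of the sample data;
# the second and third rows contain deliberately invalid values.
toy = pd.DataFrame({
    "amount": [46.93, -5.00, 131.67],
    "hour_of_day": [15, 1, 27],
    "day_of_week": [5, 0, 4],
    "used_chip": [1, 0, 1],
})

# Assumed valid ranges (inclusive); adjust to the real data dictionary.
valid_ranges = {
    "amount": (0.0, float("inf")),
    "hour_of_day": (0, 23),
    "day_of_week": (0, 6),
    "used_chip": (0, 1),
}


def count_out_of_range(frame, ranges):
    """Count rows per column that fall outside [low, high].

    Missing values also count as violations, because Series.between()
    returns False for NaN.
    """
    violations = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            violations[col] = int((~frame[col].between(low, high)).sum())
    return pd.Series(violations, name="out_of_range")


range_violations = count_out_of_range(toy, valid_ranges)
print(range_violations)
```

On the real data, `toy` would be replaced by `df` and the ranges by whatever the data source documents.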

## 3. Target Variable Analysis

Analyze the distribution of fraud vs legitimate transactions and understand class imbalance.

In [None]:
# Analyze target variable distribution
fraud_counts = df['fraud'].value_counts()
fraud_rate = df['fraud'].mean()

print(f"Fraud distribution:")
print(fraud_counts)
print(f"\nFraud rate: {fraud_rate:.3%}")

# Visualize target distribution
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
fraud_counts.plot(kind='bar')
plt.title('Fraud vs Legitimate Transactions')
plt.xlabel('Fraud')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
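# Derive labels from the index so they stay correct regardless of value_counts() ordering
pie_labels = fraud_counts.index.map({0: 'Legitimate', 1: 'Fraud'})
plt.pie(fraud_counts.values, labels=pie_labels, autopct='%1.1f%%')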
plt.title('Fraud Rate')

plt.tight_layout()
plt.show()
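
To make the class imbalance concrete, the sketch below works through the arithmetic for a trivial baseline that always predicts "legitimate". The labels are synthetic and only mirror the logged sample size and fraud rate (1000 rows, 0.050); they are not the notebook's actual data.

```python
import numpy as np

# Hypothetical labels matching the logged sample: 1000 rows, 5% fraud.
y = np.zeros(1000, dtype=int)
y[:50] = 1

fraud_rate = y.mean()

# A "classifier" that always predicts the majority class (legitimate).
majority_pred = np.zeros_like(y)
baseline_accuracy = (majority_pred == y).mean()

# Recall on the fraud class: share of fraud rows the baseline catches.
baseline_recall = (majority_pred[y == 1] == 1).mean()

# Number of legitimate transactions per fraudulent one.
imbalance_ratio = (y == 0).sum() / (y == 1).sum()

print(f"Fraud rate:        {fraud_rate:.1%}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print(f"Baseline recall:   {baseline_recall:.1%}")
print(f"Imbalance ratio:   {imbalance_ratio:.0f}:1")
```

The baseline reaches 95% accuracy while catching no fraud at all, which is why accuracy alone is a poor metric at this fraud rate.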

## 4. Feature Analysis

Explore individual features and their distributions, focusing on how they relate to fraud.

In [None]:
# Analyze numerical features
numerical_cols = df.select_dtypes(include=[np.number]).columns
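# Exclude the target and the row identifier, which carries no predictive signal
numerical_cols = [col for col in numerical_cols if col not in ('fraud', 'transaction_id')]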

print("Numerical features summary:")
print(df[numerical_cols].describe())

# Visualize numerical features by fraud status
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, col in enumerate(numerical_cols[:6]):  # Plot first 6 features
    df.boxplot(column=col, by='fraud', ax=axes[i])
    axes[i].set_title(f'{col} by Fraud Status')
    axes[i].set_xlabel('Fraud')

fig.suptitle('')  # Remove the automatic "Boxplot grouped by fraud" title added by pandas
plt.tight_layout()
plt.show()
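
Boxplots show the shape of each distribution. A numeric summary of the gap between classes can complement them. The sketch below computes per-class medians and a standardized mean difference on made-up values. The data and the choice of effect-size measure are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration only.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, 200.0, 300.0],
    "distance_from_home": [1.0, 2.0, 3.0, 4.0, 2.0, 3.0],
    "fraud": [0, 0, 0, 0, 1, 1],
})
features = ["amount", "distance_from_home"]

# Per-class medians are robust to the heavy tails typical of amounts.
medians = toy.groupby("fraud")[features].median()

# Standardized mean difference: (mean_fraud - mean_legit) / pooled std,
# where the pooled std averages the two per-class sample variances.
means = toy.groupby("fraud")[features].mean()
variances = toy.groupby("fraud")[features].var()
pooled_std = np.sqrt((variances.loc[0] + variances.loc[1]) / 2)
smd = (means.loc[1] - means.loc[0]) / pooled_std

print(medians)
print(smd.sort_values(ascending=False))
```

Features with a standardized difference near zero, such as `distance_from_home` in this toy frame, separate the classes poorly on their own.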

## 5. Correlation Analysis

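Identify correlations between features and with the target variable. Pearson correlation only captures linear relationships, so treat it as a first signal of which features may matter rather than as a measure of feature importance.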

In [None]:
# Calculate correlation matrix
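# Restrict to numeric columns; non-numeric columns such as merchant_category
# raise an error in pandas >= 2.0 when numeric_only is left at its default
correlation_matrix = df.corr(numeric_only=True)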

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Show correlations with target variable
target_correlations = correlation_matrix['fraud'].sort_values(ascending=False)
print("\nCorrelations with fraud:")
print(target_correlations)
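
Pearson correlation is sensitive to outliers, which are common in transaction amounts. The sketch below compares it with a rank-based (Spearman) correlation, computed here as the Pearson correlation of the ranks. The six-row frame is hypothetical and built to include one extreme amount.

```python
import pandas as pd

# Hypothetical toy data with a single extreme amount.
toy = pd.DataFrame({
    "amount": [10.0, 12.0, 15.0, 18.0, 20.0, 5000.0],
    "fraud": [0, 0, 0, 1, 1, 1],
})

# Pearson on raw values: dominated by the 5000.0 outlier.
pearson = toy["amount"].corr(toy["fraud"])

# Spearman: Pearson correlation of the (average) ranks.
spearman = toy["amount"].rank().corr(toy["fraud"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here fraud is perfectly ordered by amount, yet the raw Pearson value is much lower than the rank-based one because a single point dominates the variance.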

## 6. Time-based Patterns

Analyze temporal patterns in fraud occurrence to identify time-based risk factors.

In [None]:
# Analyze fraud by hour of day
if 'hour_of_day' in df.columns:
    fraud_by_hour = df.groupby('hour_of_day')['fraud'].mean()
    
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    fraud_by_hour.plot(kind='bar')
    plt.title('Fraud Rate by Hour of Day')
    plt.xlabel('Hour')
    plt.ylabel('Fraud Rate')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 2, 2)
    # One bin per hour, centered on the integer hour values 0..23
    df['hour_of_day'].hist(bins=np.arange(25) - 0.5, alpha=0.7)
    plt.title('Transaction Volume by Hour')
    plt.xlabel('Hour')
    plt.ylabel('Count')
    
    plt.tight_layout()
    plt.show()

# Analyze fraud by day of week
if 'day_of_week' in df.columns:
    fraud_by_day = df.groupby('day_of_week')['fraud'].mean()
    
    plt.figure(figsize=(10, 4))
    fraud_by_day.plot(kind='bar')
    plt.title('Fraud Rate by Day of Week')
    plt.xlabel('Day of Week')
    plt.ylabel('Fraud Rate')
    plt.xticks(rotation=45)
    plt.show()
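
With roughly 1000 rows and a 5% fraud rate, each hour holds only a handful of fraud cases, so per-hour rates are noisy. The sketch below reports each rate next to its sample size and flags small groups. The toy data and the `min_count` threshold are assumptions.

```python
import pandas as pd

# Hypothetical toy data: hour 23 has a single transaction, which is fraud.
toy = pd.DataFrame({
    "hour_of_day": [2, 2, 2, 2, 14, 14, 14, 14, 14, 14, 14, 14, 23],
    "fraud":       [1, 0, 0, 0,  0,  0,  0,  0,  0,  0,  1,  0,  1],
})

min_count = 4  # hypothetical threshold below which a rate is not trusted

# Named aggregation keeps the rate, group size, and fraud count together.
by_hour = toy.groupby("hour_of_day")["fraud"].agg(
    fraud_rate="mean", n="count", n_fraud="sum"
)
by_hour["reliable"] = by_hour["n"] >= min_count

print(by_hour)
```

Hour 23 shows a 100% fraud rate from one transaction, which is exactly the kind of value the `reliable` flag is meant to screen out.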

## 7. Amount Analysis

Study transaction amounts and their relationship to fraud to understand monetary risk patterns.

In [None]:
# Analyze transaction amounts
if 'amount' in df.columns:
    plt.figure(figsize=(15, 5))
    
    plt.subplot(1, 3, 1)
    df['amount'].hist(bins=50, alpha=0.7)
    plt.title('Transaction Amount Distribution')
    plt.xlabel('Amount')
    plt.ylabel('Count')
    
    plt.subplot(1, 3, 2)
    df[df['fraud'] == 0]['amount'].hist(bins=50, alpha=0.7, label='Legitimate')
    df[df['fraud'] == 1]['amount'].hist(bins=50, alpha=0.7, label='Fraud')
    plt.title('Amount Distribution by Fraud Status')
    plt.xlabel('Amount')
    plt.ylabel('Count')
    plt.legend()
    
    plt.subplot(1, 3, 3)
    # Bin amounts without adding a helper column to df, then compute the fraud rate per bin
    amount_bins = pd.cut(df['amount'], bins=10)
    fraud_by_amount = df.groupby(amount_bins, observed=False)['fraud'].mean()
    fraud_by_amount.plot(kind='bar')
    plt.title('Fraud Rate by Amount Range')
    plt.xlabel('Amount Range')
    plt.ylabel('Fraud Rate')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()
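
Equal-width bins tend to put most transactions into the first bin when amounts are right-skewed. One alternative is quantile binning with `pd.qcut`, sketched below on hypothetical amounts. This is a suggested variation, not the method used in the cell above.

```python
import pandas as pd

# Hypothetical right-skewed amounts; values are made up for illustration.
toy = pd.DataFrame({
    "amount": [5, 8, 10, 12, 15, 20, 25, 40, 900, 1000],
    "fraud":  [0, 0, 0,  0,  0,  0,  0,  1,  1,   0],
})

# Equal-width bins: the two large amounts stretch the range.
equal_width = pd.cut(toy["amount"], bins=5)
# Equal-count bins: each bin holds the same share of transactions.
equal_count = pd.qcut(toy["amount"], q=5)

width_counts = equal_width.value_counts(sort=False)
quantile_counts = equal_count.value_counts(sort=False)

# Fraud rate per quantile bin, each backed by the same number of rows.
rate_by_quantile = toy.groupby(equal_count, observed=True)["fraud"].mean()

print(width_counts)
print(quantile_counts)
print(rate_by_quantile)
```

With equal-width bins, eight of the ten rows land in one bin. Quantile bins hold two rows each, so the per-bin fraud rates are directly comparable.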

## 8. Merchant Analysis

Study fraud patterns by merchant category to identify high-risk business types.

In [None]:
# Analyze fraud by merchant category
if 'merchant_category' in df.columns:
    fraud_by_merchant = df.groupby('merchant_category')['fraud'].agg(['mean', 'count'])
    fraud_by_merchant.columns = ['fraud_rate', 'transaction_count']
    fraud_by_merchant = fraud_by_merchant.sort_values('fraud_rate', ascending=False)
    
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    fraud_by_merchant['fraud_rate'].plot(kind='bar')
    plt.title('Fraud Rate by Merchant Category')
    plt.xlabel('Merchant Category')
    plt.ylabel('Fraud Rate')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 2, 2)
    fraud_by_merchant['transaction_count'].plot(kind='bar')
    plt.title('Transaction Count by Merchant Category')
    plt.xlabel('Merchant Category')
    plt.ylabel('Transaction Count')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    print("\nFraud analysis by merchant category:")
    print(fraud_by_merchant)
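
Sorting categories by raw fraud rate can push low-volume categories to the top by chance. The sketch below shrinks each category's rate toward the global rate using additive smoothing. The toy data and the `prior_weight` value are assumptions, and the smoothing itself is a suggested refinement rather than part of the analysis above.

```python
import pandas as pd

# Hypothetical toy data: "travel" has only two transactions.
toy = pd.DataFrame({
    "merchant_category": ["food"] * 8 + ["travel"] * 2,
    "fraud": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
})

global_rate = toy["fraud"].mean()
prior_weight = 10  # hypothetical: acts like 10 pseudo-transactions at the global rate

stats = toy.groupby("merchant_category")["fraud"].agg(n_fraud="sum", n="count")
stats["raw_rate"] = stats["n_fraud"] / stats["n"]

# Smoothed rate: (fraud count + prior mass) / (group size + prior weight).
stats["smoothed_rate"] = (
    (stats["n_fraud"] + prior_weight * global_rate) / (stats["n"] + prior_weight)
)

print(stats)
```

The smaller the category, the more its smoothed rate is pulled toward the global rate of 0.2.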

## 9. Feature Engineering Insights

Based on the analysis, identify potential new features that could improve fraud detection.

In [None]:
# Create some example engineered features
df_engineered = df.copy()

# Time-based features
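if 'hour_of_day' in df_engineered.columns:
    # Night is assumed to span 22:00 through 06:59
    df_engineered['is_night'] = (df_engineered['hour_of_day'] >= 22) | (df_engineered['hour_of_day'] <= 6)
if 'day_of_week' in df_engineered.columns:
    # Assumes 0 = Monday, so 5 and 6 are Saturday and Sunday
    df_engineered['is_weekend'] = df_engineered['day_of_week'].isin([5, 6])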

# Amount-based features
if 'amount' in df_engineered.columns:
    df_engineered['amount_log'] = np.log1p(df_engineered['amount'])
    df_engineered['high_amount'] = df_engineered['amount'] > df_engineered['amount'].quantile(0.95)

# Show correlation of new features with fraud
new_features = ['is_night', 'is_weekend', 'amount_log', 'high_amount']
available_features = [f for f in new_features if f in df_engineered.columns]

if available_features:
    new_correlations = df_engineered[available_features + ['fraud']].corr()['fraud'].sort_values(ascending=False)
    print("Correlations of engineered features with fraud:")
    print(new_correlations)
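
Treating `hour_of_day` as a plain integer places hour 23 far from hour 0, although they are adjacent on the clock. A common option is a sine/cosine encoding, sketched below. The column names `hour_sin` and `hour_cos` and the helper `circular_distance` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few hours spread over the day.
toy = pd.DataFrame({"hour_of_day": [0, 6, 12, 18, 23]})

# Map each hour onto the unit circle (24 hours = one full turn).
angle = 2 * np.pi * toy["hour_of_day"] / 24
toy["hour_sin"] = np.sin(angle)
toy["hour_cos"] = np.cos(angle)


def circular_distance(frame, i, j):
    """Euclidean distance between rows i and j in (sin, cos) space."""
    return float(np.hypot(
        frame.loc[i, "hour_sin"] - frame.loc[j, "hour_sin"],
        frame.loc[i, "hour_cos"] - frame.loc[j, "hour_cos"],
    ))


dist_23_to_0 = circular_distance(toy, 4, 0)   # rows holding hours 23 and 0
dist_12_to_0 = circular_distance(toy, 2, 0)   # rows holding hours 12 and 0

print(toy)
print(f"23h vs 0h: {dist_23_to_0:.3f}, 12h vs 0h: {dist_12_to_0:.3f}")
```

In the encoded space, 23:00 sits close to 00:00 while 12:00 is at the maximum distance, matching intuition about time of day.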

## 10. Summary and Next Steps

Summarize key findings and outline next steps for the fraud detection pipeline.

In [None]:
# Summary statistics
print("=== EDA SUMMARY ===\n")

print(f"Dataset Overview:")
print(f"- Total transactions: {len(df):,}")
print(f"- Fraud rate: {df['fraud'].mean():.3%}")
print(f"- Features: {len(df.columns.difference(['transaction_id', 'fraud']))} (excluding transaction_id and fraud)")

print(f"\nKey Findings:")
print(f"- Class imbalance: {df['fraud'].value_counts().to_dict()}")
print(f"- Missing values: {df.isnull().sum().sum()}")
print(f"- Duplicate rows: {df.duplicated().sum()}")

print(f"\nNext Steps:")
print(f"1. Implement data preprocessing pipeline")
print(f"2. Create engineered features based on insights")
print(f"3. Handle class imbalance using resampling techniques")
print(f"4. Train and evaluate multiple models")
print(f"5. Implement model explainability tools")
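
Next step 3 mentions resampling. As a minimal sketch of one such technique, the function below performs random oversampling of the minority class using pandas only. The function name `random_oversample` and the toy data are hypothetical, and other resampling strategies may suit the project better.

```python
import pandas as pd

# Hypothetical toy data: 2 fraud rows out of 20.
toy = pd.DataFrame({
    "amount": range(20),
    "fraud": [1] * 2 + [0] * 18,
})


def random_oversample(frame, target="fraud", random_state=42):
    """Duplicate minority-class rows at random until all classes match the largest."""
    counts = frame[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, size in counts.items():
        group = frame[frame[target] == label]
        if size < majority_size:
            # Sample with replacement to fill the gap to the majority size.
            extra = group.sample(
                n=majority_size - size, replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )


balanced = random_oversample(toy)
print(balanced["fraud"].value_counts())
```

Resampling should be applied to the training split only, after the train/test split, so that duplicated fraud rows do not leak into evaluation data.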