# EDA - Fraud Data (E-commerce Transactions)
This notebook explores the e-commerce transaction dataset to identify patterns related to fraudulent activities.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
df = pd.read_csv('../data/raw/Fraud_Data.csv')

# 1. Data Integrity Check
print("### Data Info ###")
df.info()  # info() prints its own report and returns None, so wrapping it in print() adds a stray "None"

print("\n### Missing Values ###")
print(df.isnull().sum())

print("\n### Duplicate Rows ###")
print(df.duplicated().sum())

**Interpretation (Data Cleaning):** The raw file contains no missing values and no duplicate rows, so no imputation or deduplication is needed. The main challenges are therefore feature engineering and handling class imbalance.

## 2. Class Imbalance Visualization
Understanding the distribution of fraud vs legitimate transactions.

In [None]:
plt.figure(figsize=(8, 5))
ax = sns.countplot(x='class', data=df, palette='viridis')
plt.title('Distribution of Fraudulent vs Legitimate Transactions')
plt.xlabel('Class (0: Legitimate, 1: Fraud)')
plt.ylabel('Count')

# Label each bar from its own height so the text cannot drift out of sync with the bar order
total = len(df)
for p in ax.patches:
    count = int(p.get_height())
    ax.annotate(f'{count} ({100 * count / total:.2f}%)', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

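**Modeling Implication:** At ~9.4% fraud, the imbalance is significant but not extreme. We will use stratified splitting so that the train and test sets both reflect this ratio, and apply SMOTE to the training set only, leaving the test set at its original distribution. Precision-Recall curves will be more informative than ROC-AUC because they focus on the minority (fraud) class and are not inflated by the large number of legitimate transactions.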
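
To make the stratified-splitting idea concrete, here is a minimal sketch using only pandas. It runs on a synthetic frame with roughly 9.4% positives, not on the real file. The helper name `stratified_split` and the toy columns are assumptions made for illustration. In practice scikit-learn's `train_test_split(..., stratify=y)` does the same job, and SMOTE (not shown here) would be applied to the resulting training split only.

```python
import numpy as np
import pandas as pd


def stratified_split(frame, label_col, test_frac=0.2, seed=42):
    """Split `frame` into train/test so each keeps the original label ratio.

    Hypothetical helper: samples `test_frac` of the rows within each class.
    """
    test = frame.groupby(label_col).sample(frac=test_frac, random_state=seed)
    train = frame.drop(test.index)
    return train, test


# Synthetic stand-in for the transaction data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "purchase_value": rng.integers(5, 150, size=10_000),
    "class": (rng.random(10_000) < 0.094).astype(int),
})

train, test = stratified_split(toy, "class")

print(f"overall fraud rate: {toy['class'].mean():.4f}")
print(f"train fraud rate:   {train['class'].mean():.4f}")
print(f"test fraud rate:    {test['class'].mean():.4f}")
```

The three printed rates should agree closely. That agreement is what makes metrics on the test set comparable to what the model saw in training.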

## 3. Univariate Analysis
Distributions of numerical features.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.histplot(df['purchase_value'], bins=50, kde=True, ax=axes[0], color='blue')
axes[0].set_title('Purchase Value Distribution')

sns.histplot(df['age'], bins=30, kde=True, ax=axes[1], color='green')
axes[1].set_title('User Age Distribution')

plt.show()

**Interpretation:** Purchase values are right-skewed, while age is closer to a normal distribution. Most users are in their late 20s or 30s.
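
The skew seen in the histogram can also be measured directly. The sketch below compares sample skewness before and after a `log1p` transform, using synthetic log-normal data as a stand-in. The helper `skew_report` and the toy series are assumptions; in the notebook one would pass `df['purchase_value']` instead. Whether a log transform actually suits the real column is not established here.

```python
import numpy as np
import pandas as pd


def skew_report(values):
    """Return sample skewness of the raw values and of log1p(values).

    Hypothetical helper; assumes the values are non-negative.
    """
    values = pd.Series(values, dtype="float64")
    return {
        "raw_skew": float(values.skew()),
        "log1p_skew": float(np.log1p(values).skew()),
    }


# Synthetic right-skewed data (made up, not taken from Fraud_Data.csv)
rng = np.random.default_rng(0)
toy_purchase_value = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=5_000))

report = skew_report(toy_purchase_value)
print(report)
```

A skewness near 0 indicates a roughly symmetric distribution, while a clearly positive value confirms a right tail.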

## 4. Bivariate Analysis
Relationships between features and fraud.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='class', y='purchase_value', data=df, palette='Set2')
plt.title('Purchase Value vs Fraud Status')
plt.show()

plt.figure(figsize=(12, 6))
sns.countplot(x='source', hue='class', data=df)
plt.title('Fraud counts by Traffic Source')
plt.show()

**Interpretation:** Fraudulent transactions do not show a noticeably higher purchase value on average, and fraud appears in every traffic source, suggesting that it is not restricted to a single acquisition channel.
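
Raw counts per source depend on how much traffic each channel carries, so a per-source fraud rate is a useful complement to the count plot. The following is a small sketch on a made-up frame. The helper `fraud_rate_by` and the source labels are assumptions for illustration, and the numbers say nothing about the real dataset.

```python
import pandas as pd


def fraud_rate_by(frame, feature, label_col="class"):
    """Summarise transactions, fraud count and fraud rate per category.

    Hypothetical helper; assumes `label_col` is coded 0 (legitimate) / 1 (fraud).
    """
    summary = frame.groupby(feature)[label_col].agg(
        transactions="count",
        fraud="sum",
        fraud_rate="mean",
    )
    return summary.sort_values("fraud_rate", ascending=False)


# Tiny made-up example, not real data
toy = pd.DataFrame({
    "source": ["SEO"] * 4 + ["Ads"] * 4 + ["Direct"] * 3,
    "class": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1],
})

rates = fraud_rate_by(toy, "source")
print(rates)
```

In the notebook, `fraud_rate_by(df, 'source')` would give the same table for the real data. Similar rates across sources would support the reading that fraud is not tied to one channel.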