# EDA - Fraud Data (E-commerce Transactions)
This notebook explores the e-commerce transaction dataset to identify patterns related to fraudulent activities.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

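sns.set_theme(style="whitegrid")  # set_theme is the preferred name; sns.set is an alias
df = pd.read_csv('../data/raw/Fraud_Data.csv')
df.info()  # info() prints the summary itself and returns None, so print() would add a stray "None"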

## 1. Class Imbalance Visualization
We visualize the distribution of the target variable `class`.

In [None]:
plt.figure(figsize=(8, 5))
ax = sns.countplot(x='class', data=df, palette='viridis')
plt.title('Distribution of Fraudulent vs Legitimate Transactions')
plt.xlabel('Class (0: Legitimate, 1: Fraud)')
plt.ylabel('Count')

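# Sort by label so the order (0, then 1) matches the left-to-right order of the bars
counts = df['class'].value_counts().sort_index()
percentages = df['class'].value_counts(normalize=True).sort_index() * 100
# Pair each bar with its count and percentage by position rather than by index lookup
for p, count, pct in zip(ax.patches, counts, percentages):
    ax.annotate(f'{count} ({pct:.2f}%)', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')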
plt.show()

print(f"Legitimate: {counts[0]} ({percentages[0]:.2f}%)")
print(f"Fraudulent: {counts[1]} ({percentages[1]:.2f}%)")

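**Interpretation:** The dataset is imbalanced: approximately 9.36% of transactions are labeled fraudulent. This is less extreme than in many bank-fraud datasets, but it is still skewed enough that plain accuracy would be misleading and the modeling stage needs imbalance handling, such as SMOTE applied to the training split only.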
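As a lighter-weight alternative to the SMOTE resampling mentioned above, class weighting is often tried first. The sketch below is not part of the original analysis; it computes "balanced" weights with the common heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy labels for illustration only (not the real dataset): 9 legitimate, 1 fraud.
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)  # the rare class receives the larger weight
```

In the notebook, the same helper could be called as `balanced_class_weights(df['class'])`, and the resulting dictionary passed to an estimator that accepts per-class weights.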

## 2. Univariate Distribution of Key Features

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.histplot(df['purchase_value'], bins=50, kde=True, ax=axes[0], color='blue')
axes[0].set_title('Distribution of Purchase Value')

sns.histplot(df['age'], bins=30, kde=True, ax=axes[1], color='green')
axes[1].set_title('Distribution of User Age')

plt.show()

**Interpretation:** Purchase values are concentrated mostly below $100, and user ages are approximately normally distributed, centered in the early 30s.

## 3. Bivariate Relationships with Fraud Label

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='class', y='purchase_value', data=df, palette='Set2')
plt.title('Purchase Value vs Fraud Status')
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='class', y='age', data=df, palette='Set3')
plt.title('Age vs Fraud Status')
plt.show()

**Interpretation:** There is no substantial difference in median purchase value or median age between fraudulent and legitimate transactions, although the fraud class shows slightly more outliers in purchase value.
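The visual reading of the boxplots can be backed with numbers. The following is a minimal sketch, assuming the same column names as above; the helper `summarize_by_class` and the toy frame are hypothetical. It reports each class's median and the number of points beyond 1.5 × IQR from the quartiles, which is the default rule seaborn uses to draw boxplot outliers.

```python
import pandas as pd


def summarize_by_class(frame, value_col, label_col="class"):
    """Per-class median plus the count of points beyond the 1.5 * IQR fences."""
    rows = {}
    for label, values in frame.groupby(label_col)[value_col]:
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        rows[label] = {"median": values.median(), "n_outliers": int(outside.sum())}
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up toy data for illustration only, not values from Fraud_Data.csv.
toy = pd.DataFrame({
    "class": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "purchase_value": [10, 20, 30, 40, 50, 10, 20, 30, 40, 200],
})
summary = summarize_by_class(toy, "purchase_value")
print(summary)
```

On the real data this would be called as `summarize_by_class(df, 'purchase_value')` and `summarize_by_class(df, 'age')`. Because the classes differ greatly in size, comparing outlier counts as a share of each class is more informative than comparing raw counts.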

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='source', hue='class', data=df)
plt.title('Transaction Counts by Traffic Source and Class')
plt.show()
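Raw counts per source are hard to compare when the sources differ in volume, so a per-source fraud rate may be a useful complement to the plot. This is a sketch only: the helper `fraud_rate_by` is hypothetical, the source names in the toy frame are placeholders, and it relies on `class` being coded 0/1 so that the mean equals the fraud rate.

```python
import pandas as pd


def fraud_rate_by(frame, group_col, label_col="class"):
    """Return row count and fraud rate per group, highest rate first."""
    out = frame.groupby(group_col)[label_col].agg(n="count", fraud_rate="mean")
    return out.sort_values("fraud_rate", ascending=False)


# Made-up toy data for illustration only, not values from Fraud_Data.csv.
toy = pd.DataFrame({
    "source": ["SEO", "SEO", "SEO", "SEO", "Ads", "Ads", "Direct", "Direct"],
    "class": [0, 0, 0, 1, 0, 1, 1, 1],
})
rates = fraud_rate_by(toy, "source")
print(rates)
```

In the notebook this would be `fraud_rate_by(df, 'source')`; keeping the `n` column next to the rate makes it easier to spot rates that rest on very few transactions.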