# EDA - Credit Card Fraud
Analysis of anonymized bank transaction data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
df = pd.read_csv('../data/raw/creditcard.csv')

# 1. Data Integrity Check
print("### Data Info ###")
df.info()  # info() prints its report directly and returns None

print("\n### Missing Values ###")
print(df.isnull().sum().sum())

print("\n### Duplicate Rows ###")
duplicates = df.duplicated().sum()
print(f'Number of duplicates: {duplicates}')

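**Interpretation (Data Cleaning):** No missing values are present in this dataset. The duplicate check does flag fully identical rows; these should be dropped during preprocessing, before any train/test split, so that the same transaction cannot end up in both sets.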
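
The cleaning step mentioned above can be sketched roughly as follows. This is a minimal illustration on a toy frame rather than the real `df`; the values and the names `toy`, `deduped`, and `n_removed` are invented for the example.

```python
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    'Time':   [0, 0, 1, 1, 2],
    'Amount': [9.99, 9.99, 25.00, 25.00, 3.50],
    'Class':  [0, 0, 0, 1, 0],
})

n_before = len(toy)
# keep='first' retains one copy of each fully identical row.
# Rows that differ in any column (including Class) are not duplicates.
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)
n_removed = n_before - len(deduped)

print(f'Removed {n_removed} duplicate rows; {len(deduped)} remain')
print(deduped['Class'].value_counts())
```

Rechecking the class counts after deduplication is worthwhile, since removing rows can shift the fraud ratio slightly.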

## 2. Class Imbalance Visualization

In [None]:
plt.figure(figsize=(8, 5))
ax = sns.countplot(x='Class', data=df, palette='magma')
plt.title('Credit Card Fraud Distribution')
plt.xlabel('Class (0: Legitimate, 1: Fraud)')
plt.ylabel('Count')

counts = df['Class'].value_counts()
percentages = df['Class'].value_counts(normalize=True) * 100
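for i, p in enumerate(ax.patches):
    # Label each bar with its raw count and its share of all rows.
    ax.annotate(f'{counts[i]} ({percentages[i]:.4f}%)',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# Log scale keeps the tiny fraud bar visible next to the legitimate bar.
plt.yscale('log')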
plt.show()

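**Modeling Implication:** The extreme imbalance (roughly 0.17% fraud) makes this a needle-in-a-haystack problem, and plain accuracy is uninformative because a model that always predicts "legitimate" would score above 99%. Resampling with SMOTE is one option, alongside class weighting or undersampling. If SMOTE is used, it must be fit on the training folds only: synthetic samples leaking into validation or test data would inflate the scores and hide overfitting to them. The Precision-Recall AUC will be the primary metric for model selection.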
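
To make the metric choice concrete, here is a minimal sketch on synthetic data (not the real dataset) that contrasts the accuracy of a trivial "always legitimate" model with the Precision-Recall AUC of a simple classifier. It assumes scikit-learn is available. Class weighting stands in for SMOTE here purely to keep the example free of extra dependencies, and all data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 20,000 rows, about 1% positives, two features.
n = 20_000
y = (rng.random(n) < 0.01).astype(int)
X = rng.normal(size=(n, 2))
X[y == 1] += 2.0  # shift the positive class so it is learnable

# Stratify so both splits keep the same positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight='balanced' reweights the loss instead of resampling.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# A model that always predicts class 0 still looks accurate.
trivial_acc = accuracy_score(y_test, np.zeros_like(y_test))
# Average precision is a step-wise summary of the precision-recall curve.
pr_auc = average_precision_score(y_test, scores)
# A random ranker is expected to score about the positive rate.
baseline = y_test.mean()

print(f'Trivial accuracy:   {trivial_acc:.4f}')
print(f'PR-AUC (avg. prec): {pr_auc:.4f}')
print(f'PR-AUC baseline:    {baseline:.4f}')
```

The useful comparison is `pr_auc` against `baseline`, not against 0.5 as with ROC AUC, because the no-skill level of a precision-recall curve depends on the positive rate.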

## 3. Univariate Analysis
Focusing on `Amount` and `Time`.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

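# Restrict to Amount <= 1000 before binning; otherwise the 100 bins span the
# full range and only a handful fall inside the zoomed window.
amount_zoom = df.loc[df['Amount'] <= 1000, 'Amount']
sns.histplot(amount_zoom, bins=100, kde=True, ax=axes[0], color='purple')
axes[0].set_title('Transaction Amount Distribution (Amount <= 1000)')
axes[0].set_xlim([0, 1000])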

sns.histplot(df['Time'], bins=50, kde=True, ax=axes[1], color='orange')
axes[1].set_title('Transaction Time Distribution')

plt.show()

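**Interpretation:** The `Amount` distribution is heavily right-skewed: most transactions are low-value, with a long tail extending beyond the 1,000 cutoff used in the plot. The `Time` distribution shows two peaks separated by a trough, which is consistent with daytime activity on two consecutive days, assuming `Time` records seconds elapsed since the first transaction over a roughly 48-hour window.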
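
Both observations suggest simple derived features. The sketch below uses a toy frame and assumes that `Time` is measured in seconds since the first transaction; the column names `Hour` and `LogAmount` are hypothetical and not part of the original data.

```python
import numpy as np
import pandas as pd

# Toy stand-in; assumes Time is seconds elapsed since the first transaction.
toy = pd.DataFrame({
    'Time':   [0, 3_600, 43_200, 86_400, 129_600, 172_799],
    'Amount': [1.00, 12.50, 99.99, 0.00, 250.00, 5_000.00],
})

# Fold elapsed seconds onto a 24-hour cycle. This is hours since the
# first transaction modulo 24, not true clock time.
toy['Hour'] = (toy['Time'] // 3600) % 24

# log1p compresses the right tail and is defined at Amount == 0.
toy['LogAmount'] = np.log1p(toy['Amount'])

print(toy)
print(f"Skew before: {toy['Amount'].skew():.2f}, "
      f"after: {toy['LogAmount'].skew():.2f}")
```

Plotting `Hour` split by `Class` would show whether fraud follows the same daily cycle as legitimate traffic, which the raw `Time` histogram cannot reveal.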