# EDA - Credit Card Fraud
This notebook analyzes the credit card transaction dataset, which primarily consists of anonymized PCA features.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
df = pd.read_csv('../data/raw/creditcard.csv')
df.info()  # info() prints its summary directly and returns None

## 1. Class Imbalance Visualization
Credit card fraud detection datasets are typically highly imbalanced: fraudulent transactions make up a tiny fraction of the total.

In [None]:
plt.figure(figsize=(8, 5))
ax = sns.countplot(x='Class', data=df, palette='magma')
plt.title('Credit Card Fraud Distribution')
plt.xlabel('Class (0: Legitimate, 1: Fraud)')
plt.ylabel('Count')

counts = df['Class'].value_counts().sort_index()
percentages = df['Class'].value_counts(normalize=True).sort_index() * 100
# Label each bar from its own height so a label cannot be attached to the wrong bar
for p in ax.patches:
    count = int(p.get_height())
    ax.annotate(f'{count} ({count / len(df) * 100:.4f}%)',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.yscale('log')  # Log scale to actually see the fraud bar
plt.show()

print(f"Legitimate: {counts[0]} ({percentages[0]:.4f}%)")
print(f"Fraudulent: {counts[1]} ({percentages[1]:.4f}%)")

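**Interpretation:** The imbalance is extreme: only 0.17% of transactions are fraudulent. A model that labels every transaction as legitimate would be about 99.83% accurate while catching no fraud at all, so accuracy is a poor metric here and we must focus on Precision-Recall AUC instead.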
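
To make the accuracy point concrete, here is a minimal sketch on synthetic labels. The 0.17% fraud rate is taken from the interpretation above; the sample size, seed, and variable names are illustrative assumptions, not part of the dataset.

```python
import numpy as np

# Synthetic labels with roughly the fraud rate quoted above (assumed 0.17%).
rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 0.0017).astype(int)

# Trivial baseline: predict "legitimate" (0) for every transaction.
y_pred = np.zeros(n, dtype=int)

accuracy = (y_true == y_pred).mean()
n_fraud = int(y_true.sum())
# Recall on the fraud class: share of true frauds the baseline flags.
recall = ((y_true == 1) & (y_pred == 1)).sum() / max(n_fraud, 1)

print(f"Accuracy:     {accuracy:.4f}")
print(f"Fraud recall: {recall:.4f}")
```

The baseline scores very high accuracy with zero recall on the fraud class, which is why a metric focused on the positive class is more informative here.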

## 2. Univariate Distribution of Key Features
We'll look at `Amount` and `Time`, since the other features are anonymized PCA components.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.histplot(df['Amount'], bins=100, kde=True, ax=axes[0], color='purple')
axes[0].set_title('Distribution of Transaction Amount')
axes[0].set_xlim([0, 2000]) # Zoom in as most transactions are small

sns.histplot(df['Time'], bins=100, kde=True, ax=axes[1], color='orange')
axes[1].set_title('Distribution of Transaction Time')

plt.show()

**Interpretation:** `Amount` is right-skewed: most transactions are small (under $100), with a long tail of larger ones. `Time` shows a clear cyclical pattern, likely reflecting day/night cycles in transaction volume.
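
One way to follow up on both observations is to derive an hour-of-day feature and a log-scaled amount. The sketch below uses a small toy frame rather than `df`, and it assumes `Time` is measured in seconds elapsed since the first transaction; the notebook does not state this, so verify it against the dataset description first. The column names `Hour` and `LogAmount` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; in the notebook, apply the same steps to the loaded frame.
toy = pd.DataFrame({
    "Time": [0, 3_600, 86_399, 86_400, 90_000, 172_799],
    "Amount": [0.0, 9.99, 25.0, 99.0, 1_500.0, 3.5],
})

# Assumption: Time is seconds since the first transaction, so the hour
# within each 24-hour window is (seconds // 3600) mod 24.
toy["Hour"] = (toy["Time"] // 3600) % 24

# log1p compresses the long right tail and is defined at Amount == 0.
toy["LogAmount"] = np.log1p(toy["Amount"])

print(toy)
```

Note that `Hour` is relative to the first recorded transaction, not to wall-clock time, so hour 0 is not necessarily midnight.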

## 3. Bivariate Relationships (PCA Components vs Class)
Let's check whether the first few PCA features (`V1` to `V4`) separate the two classes.

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for i, feat in enumerate(['V1', 'V2', 'V3', 'V4']):
    sns.boxplot(x='Class', y=feat, data=df, ax=axes[i])
    axes[i].set_title(f'{feat} by Class')
plt.show()

**Interpretation:** Some PCA features, such as V3, show a visibly different range for fraudulent transactions, suggesting they may be strong predictors.
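
Reading boxplots by eye does not scale to all PCA components, so a rough numeric ranking can help. The sketch below scores each feature by the absolute difference in class medians divided by the overall interquartile range. This score and the helper `rank_separation` are illustrative choices, not something the notebook uses, and the demo runs on synthetic data rather than `df`.

```python
import numpy as np
import pandas as pd

def rank_separation(frame, features, target="Class"):
    """Rank features by |median(fraud) - median(legit)| / IQR(all rows)."""
    legit = frame.loc[frame[target] == 0, features]
    fraud = frame.loc[frame[target] == 1, features]
    iqr = frame[features].quantile(0.75) - frame[features].quantile(0.25)
    score = (fraud.median() - legit.median()).abs() / iqr
    return score.sort_values(ascending=False)

# Synthetic demo: "A" is shifted for the positive class, "B" is not.
rng = np.random.default_rng(0)
n = 5_000
labels = (rng.random(n) < 0.05).astype(int)
demo = pd.DataFrame({
    "A": rng.normal(0, 1, n) - 4 * labels,
    "B": rng.normal(0, 1, n),
    "Class": labels,
})

ranking = rank_separation(demo, ["A", "B"])
print(ranking)
```

In the notebook, the equivalent call would be something like `rank_separation(df, [f'V{i}' for i in range(1, 5)])`, assuming the `V1` to `V4` column names used in the boxplots above. A high score only indicates a shift in location between classes; it is a screening aid, not a substitute for model-based feature importance.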