# 01 - Exploratory Data Analysis (EDA)

In this notebook we explore the Credit Card Fraud Detection dataset.

## Goals:
- Load and inspect the dataset
- Check for missing values
- Visualize the data distributions
- Analyze the class imbalance
- Examine correlations between features

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully")

## 1. Data Loading

In [None]:
# Load dataset
df = pd.read_csv('../data/creditcard.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

## 2. Basic Information

In [None]:
# Dataset info
print("Dataset Info:")
df.info()

In [None]:
# Statistical summary
df.describe()

In [None]:
# Check for missing values
missing = df.isnull().sum()
print("Missing values:")
print(missing[missing > 0] if missing.sum() > 0 else "No missing values found ✅")

## 3. Target Variable Analysis

In [None]:
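# Class distribution (sort_index fixes the order as 0 = Normal, 1 = Fraud,
# so the hard-coded plot labels always line up with the counts)
class_counts = df['Class'].value_counts().sort_index()
fraud_percentage = (class_counts[1] / len(df)) * 100

print("Class Distribution:")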
print(f"Normal (0): {class_counts[0]:,} ({(class_counts[0]/len(df)*100):.2f}%)")
print(f"Fraud (1): {class_counts[1]:,} ({fraud_percentage:.4f}%)")
print(f"\n⚠️ Class Imbalance Ratio: 1:{class_counts[0]/class_counts[1]:.0f}")
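The imbalance ratio has a practical consequence: a model that labels every transaction as normal would already look highly accurate while catching no fraud. The sketch below illustrates this on a synthetic label vector. The counts (284,315 normal, 492 fraud) are an assumption based on the figures commonly cited for this dataset, and `y_true` / `y_pred` are hypothetical names that are not used elsewhere in this notebook.

```python
import numpy as np

# Assumed class counts (commonly cited for this dataset);
# in the notebook these could come from class_counts instead.
n_normal, n_fraud = 284_315, 492

# Ground-truth labels: 0 = normal, 1 = fraud
y_true = np.concatenate([
    np.zeros(n_normal, dtype=int),
    np.ones(n_fraud, dtype=int),
])

# Trivial "classifier" that predicts normal for every transaction
y_pred = np.zeros_like(y_true)

# Share of all predictions that are correct
accuracy = (y_true == y_pred).mean()

# Share of actual fraud cases that were flagged as fraud
fraud_recall = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy of always predicting normal: {accuracy:.4%}")
print(f"Recall on the fraud class: {fraud_recall:.1%}")
```

This is why accuracy alone is a weak yardstick here; metrics that focus on the fraud class, such as recall or precision, are likely to be more informative in the modeling notebooks.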

In [None]:
# Visualize class distribution
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Class Distribution', 'Class Distribution (%)'),
    specs=[[{'type': 'bar'}, {'type': 'pie'}]]
)

# Bar chart
fig.add_trace(
    go.Bar(x=['Normal', 'Fraud'], y=class_counts.values, marker_color=['green', 'red']),
    row=1, col=1
)

# Pie chart
fig.add_trace(
    go.Pie(labels=['Normal', 'Fraud'], values=class_counts.values, marker_colors=['green', 'red']),
    row=1, col=2
)

fig.update_layout(height=400, showlegend=False, title_text="Target Variable Analysis")
fig.show()

## 4. Feature Analysis

In [None]:
# Time feature analysis
fig = px.histogram(df, x='Time', color='Class', 
                   title='Transaction Time Distribution',
                   labels={'Class': 'Transaction Type'},
                   color_discrete_map={0: 'green', 1: 'red'})
fig.show()

In [None]:
# Amount feature analysis
fig = make_subplots(rows=1, cols=2, subplot_titles=('Amount Distribution', 'Amount by Class'))

# Overall distribution
fig.add_trace(
    go.Histogram(x=df['Amount'], nbinsx=50, name='All'),
    row=1, col=1
)

# By class
fig.add_trace(
    go.Box(y=df[df['Class']==0]['Amount'], name='Normal', marker_color='green'),
    row=1, col=2
)
fig.add_trace(
    go.Box(y=df[df['Class']==1]['Amount'], name='Fraud', marker_color='red'),
    row=1, col=2
)

fig.update_layout(height=400, showlegend=True, title_text="Amount Analysis")
fig.show()

print(f"\nAmount Statistics:")
print(f"Normal transactions - Mean: ${df[df['Class']==0]['Amount'].mean():.2f}, Median: ${df[df['Class']==0]['Amount'].median():.2f}")
print(f"Fraud transactions - Mean: ${df[df['Class']==1]['Amount'].mean():.2f}, Median: ${df[df['Class']==1]['Amount'].median():.2f}")
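The per-class statistics printed above can also be collected in a single table, which makes it easier to compare upper quantiles as well as the mean and median. The following is a minimal sketch on synthetic data: `demo` is a hypothetical stand-in that only mimics the `Amount` and `Class` columns, and the log-normal shape and 5% positive rate are illustrative assumptions, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (illustrative assumption only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Amount": rng.lognormal(mean=3.0, sigma=1.5, size=1_000).round(2),
    "Class": rng.choice([0, 1], size=1_000, p=[0.95, 0.05]),
})

# One row per class: count, mean, median, 95th percentile and maximum of Amount
amount_summary = demo.groupby("Class")["Amount"].agg(
    count="count",
    mean="mean",
    median="median",
    p95=lambda s: s.quantile(0.95),
    max="max",
)

print(amount_summary.round(2))
```

On the real data, replacing `demo` with `df` should give the same kind of table without repeating the `df[df['Class']==...]` filter for every statistic.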

In [None]:
# PCA features distribution
pca_features = [f'V{i}' for i in range(1, 29)]

# Sample a few PCA features for visualization
sample_features = ['V1', 'V2', 'V3', 'V4', 'V12', 'V14', 'V17']

fig = make_subplots(
    rows=2, cols=4,
    subplot_titles=sample_features,
    vertical_spacing=0.15
)

for idx, feature in enumerate(sample_features):
    row = idx // 4 + 1
    col = idx % 4 + 1
    
    fig.add_trace(
        go.Box(y=df[df['Class']==0][feature], name='Normal', marker_color='green', showlegend=(idx==0)),
        row=row, col=col
    )
    fig.add_trace(
        go.Box(y=df[df['Class']==1][feature], name='Fraud', marker_color='red', showlegend=(idx==0)),
        row=row, col=col
    )

fig.update_layout(height=600, title_text="Sample PCA Features Distribution by Class")
fig.show()

## 5. Correlation Analysis

In [None]:
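# Correlation with target (Pearson), sorted from most positive to most negative
correlations = df.corr()['Class'].sort_values(ascending=False)
print("10 features most positively correlated with Class:")
print(correlations.head(11))  # 11 because Class itself (correlation 1.0) is included

# The tail of this ordering holds the strongest negative correlations, not the weakest ones
print("\n10 features most negatively correlated with Class:")
print(correlations.tail(10))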
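Because the series is sorted by signed value, the features with the weakest relationship to `Class` end up in the middle of the list, not at the bottom. Ranking by absolute correlation puts the strongest relationships first regardless of sign. The sketch below shows the idea on synthetic data; `demo` and the column names `pos_feature`, `neg_feature`, and `noise` are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2_000
target = rng.choice([0, 1], size=n, p=[0.9, 0.1])

# Synthetic features with a known relationship to the target
demo = pd.DataFrame({
    "pos_feature": target * 2.0 + rng.normal(size=n),   # shifted up for class 1
    "neg_feature": -target * 3.0 + rng.normal(size=n),  # shifted down for class 1
    "noise": rng.normal(size=n),                        # unrelated to the target
    "Class": target,
})

# Signed Pearson correlation of each feature with the target
corr_with_class = demo.corr()["Class"].drop("Class")

# Reorder by absolute value so strong negative correlations rank high,
# while keeping the original sign in the output
order = corr_with_class.abs().sort_values(ascending=False).index
ranked = corr_with_class.reindex(order)

print(ranked)
```

In the notebook, the same two lines applied to `correlations.drop('Class')` would give a strength-ordered ranking of the real features.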

In [None]:
# Visualize correlation with target
fig = go.Figure()

fig.add_trace(go.Bar(
    x=correlations.index[1:],  # Exclude Class itself
    y=correlations.values[1:],
    marker_color=['red' if x < 0 else 'green' for x in correlations.values[1:]]
))

fig.update_layout(
    title='Feature Correlation with Class',
    xaxis_title='Features',
    yaxis_title='Correlation',
    height=500
)
fig.show()

In [None]:
# Correlation heatmap (sample features)
sample_cols = ['Time', 'Amount', 'V1', 'V2', 'V3', 'V4', 'V12', 'V14', 'V17', 'Class']
corr_matrix = df[sample_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Heatmap (Sample Features)')
plt.tight_layout()
plt.show()

## 6. Key Findings

### Summary:
1. **Dataset Size**: 284,807 transactions
2. **Class Imbalance**: Highly imbalanced (~0.17% fraud)
3. **Missing Values**: None
4. **Features**: 28 PCA-transformed features + Time + Amount
5. **Key Observations**:
   - Fraud transactions tend to have different patterns in certain V features
   - Amount distribution differs between normal and fraud transactions
   - The Time feature (seconds elapsed since the first transaction in the dataset) shows how transaction volume varies over the recording period
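
To look at the Time observation on a daily cycle, the raw seconds can be folded into an hour bucket. The sketch below assumes that `Time` counts seconds from the first transaction, as the dataset is commonly documented. The resulting hour is therefore relative to the start of recording, not a wall-clock hour. `demo` and the `Hour` column are hypothetical and built from synthetic data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 'Time' in seconds over roughly two days, plus a binary 'Class'
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Time": rng.integers(0, 2 * 24 * 3600, size=5_000),
    "Class": rng.choice([0, 1], size=5_000, p=[0.98, 0.02]),
})

# Hours since the first transaction, folded into a 24-hour cycle (0..23)
demo["Hour"] = (demo["Time"] // 3600) % 24

# Transaction volume and fraud share per hour bucket
hourly = demo.groupby("Hour")["Class"].agg(
    transactions="count",
    fraud_rate="mean",
)

print(hourly.head())
```

Applied to `df`, a table like `hourly` would show whether the fraud share changes during low-volume hours, which could inform the feature engineering step.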

### Next Steps:
1. Build a baseline model
2. Apply feature engineering
3. Handle class imbalance (SMOTE)
4. Optimize model performance

In [None]:
print("✅ EDA completed successfully!")
print("\nNext: Run 02_baseline.ipynb to build a baseline model")