# Day 1: Fraud Detection EDA Dashboard

**Interactive Exploratory Data Analysis for Credit Card Fraud Detection**

## Overview
- **Objective**: Explore and understand fraud detection patterns
- **Dataset**: Credit Card Transactions (Kaggle)
- **Tools**: Plotly Dash for interactive visualization

## What You'll Learn
1. Class distribution analysis (fraud vs legitimate)
2. Transaction amount patterns (log-scale histograms)
3. Feature correlation analysis
4. Time-based patterns (hourly fraud rates)
5. PCA visualization for high-dimensional data

---

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install plotly dash pandas numpy scikit-learn

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

print("✅ Libraries imported successfully!")

## 2. Load and Explore Data

In [None]:
# Load the credit card fraud dataset
# You can download from: https://www.kaggle.com/mlg-ulb/creditcardfraud

df = pd.read_csv('creditcard.csv')

print(f"Dataset Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst 5 rows:")
df.head()

In [None]:
# Summary statistics
print("Summary Statistics:")
print(df.describe())

# Check for missing values
print(f"\nMissing values per column:")
print(df.isnull().sum())

## 3. Class Distribution Analysis

In [None]:
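# Count fraud vs legitimate transactions
# sort_index() fixes the order to [0, 1] so it always matches the bar labels below
class_counts = df['Class'].value_counts().sort_index()
class_percentages = df['Class'].value_counts(normalize=True).sort_index() * 100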

# Create interactive bar chart
fig = go.Figure()
fig.add_trace(go.Bar(
    x=['Legitimate (0)', 'Fraud (1)'],
    y=class_counts.values,
    marker_color=['#4ECDC4', '#FF6B6B'],
    text=[f"{val:,} ({pct:.2f}%)" for val, pct in zip(class_counts.values, class_percentages.values)],
    textposition='auto',
))

fig.update_layout(
    title='Class Distribution: Fraud vs Legitimate Transactions',
    xaxis_title='Transaction Type',
    yaxis_title='Count',
    height=500,
    showlegend=False
)

fig.show()

print(f"\n🔍 Key Insights:")
print(f"  • Legitimate: {class_counts[0]:,} ({class_percentages[0]:.2f}%)")
print(f"  • Fraud: {class_counts[1]:,} ({class_percentages[1]:.2f}%)")
print(f"  • Imbalance Ratio: 1:{class_counts[0]/class_counts[1]:.0f}")

## 4. Transaction Amount Analysis

In [None]:
# Create histograms for fraud vs legitimate amounts
legitimate_amounts = df[df['Class'] == 0]['Amount']
fraud_amounts = df[df['Class'] == 1]['Amount']

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Legitimate Transactions', 'Fraudulent Transactions'),
    specs=[[{'type': 'histogram'}, {'type': 'histogram'}]]
)

# Legitimate transactions histogram
fig.add_trace(
    go.Histogram(
        x=legitimate_amounts,
        nbinsx=50,
        marker_color='#4ECDC4',
        name='Legitimate',
    ),
    row=1, col=1
)

# Fraudulent transactions histogram
fig.add_trace(
    go.Histogram(
        x=fraud_amounts,
        nbinsx=50,
        marker_color='#FF6B6B',
        name='Fraud',
    ),
    row=1, col=2
)

# Add log-scale option button
fig.update_layout(
    title='Transaction Amount Distribution',
    height=400,
    showlegend=False,
    updatemenus=[
        dict(
            type="buttons",
            direction="right",
            active=0,
            buttons=list([
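                # Toggle both subplots: the second histogram is drawn on yaxis2
                dict(label="Linear Scale", method="relayout", args=[{"yaxis.type": "linear", "yaxis2.type": "linear"}]),
                dict(label="Log Scale", method="relayout", args=[{"yaxis.type": "log", "yaxis2.type": "log"}]),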
            ]),
        )
    ]
)

fig.show()

print("\n🔍 Key Insights:")
print(f"  • Median Legitimate Amount: ${legitimate_amounts.median():.2f}")
print(f"  • Median Fraud Amount: ${fraud_amounts.median():.2f}")
print(f"  • Max Legitimate: ${legitimate_amounts.max():.2f}")
print(f"  • Max Fraud: ${fraud_amounts.max():.2f}")
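
Because amounts are heavily right-skewed, a log-scaled count axis still squeezes most transactions into the first few bins. One alternative is to bin the log-transformed amounts instead. The sketch below is a minimal illustration, not part of the original notebook: the helper name `log_amount_histogram` and the synthetic `demo_amounts` series are hypothetical stand-ins for `df['Amount']`, and `log1p` is chosen on the assumption that zero-amount transactions may be present.

```python
import numpy as np
import pandas as pd

def log_amount_histogram(amounts, n_bins=30):
    """Bin log1p-transformed amounts; return (counts, bin edges in dollars)."""
    values = np.asarray(amounts, dtype=float)
    log_values = np.log1p(values)            # log(1 + x) keeps zero amounts finite
    counts, log_edges = np.histogram(log_values, bins=n_bins)
    return counts, np.expm1(log_edges)       # map bin edges back to dollar units

# Synthetic, skewed stand-in for df['Amount'] (illustration only)
rng = np.random.default_rng(42)
demo_amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=1_000))

counts, edges = log_amount_histogram(demo_amounts)
print(f"Bins: {len(counts)}, transactions binned: {counts.sum()}")
```

On the real data, the same helper could be called once per class (`legitimate_amounts`, `fraud_amounts`) and the counts plotted with `go.Bar`.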

## 5. Correlation Analysis

In [None]:
# Compute correlation matrix for key features
# (Using only V1-V5 for clarity)
features = ['Amount'] + [f'V{i}' for i in range(1, 6)] + ['Class']
corr_matrix = df[features].corr()

# Create heatmap
fig = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
    colorscale='RdBu',
    zmid=0,
    text=np.round(corr_matrix.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
))

fig.update_layout(
    title='Feature Correlation Heatmap',
    height=500,
    width=600
)

fig.show()

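# Show correlations with fraud class, ranked by absolute strength
# so that strongly negative correlations are not dropped
fraud_corr = corr_matrix['Class'].drop('Class')
fraud_corr = fraud_corr.reindex(fraud_corr.abs().sort_values(ascending=False).index)
print("\n🔍 Top Features Correlated with Fraud:")
for feature, corr in fraud_corr.head(5).items():
    print(f"  • {feature}: {corr:.3f}")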
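
The heatmap covers only V1-V5, so the same ranking could be extended to every feature column. The following is a rough sketch under stated assumptions: `rank_by_abs_correlation` is a hypothetical helper, and the small synthetic frame merely imitates the `V*` and `Class` columns so the example runs without the CSV.

```python
import numpy as np
import pandas as pd

def rank_by_abs_correlation(frame, target="Class", top_n=5):
    """Return the top_n features ranked by |Pearson correlation| with target."""
    correlations = frame.drop(columns=[target]).corrwith(frame[target])
    order = correlations.abs().sort_values(ascending=False).index
    return correlations.reindex(order).head(top_n)   # signed values, abs-ranked

# Synthetic stand-in: V1 strongly negative, V2 weakly positive, V3 pure noise
rng = np.random.default_rng(0)
n = 2_000
label = (rng.random(n) < 0.2).astype(int)
demo = pd.DataFrame({
    "V1": -2.0 * label + rng.normal(size=n),
    "V2": 0.5 * label + rng.normal(size=n),
    "V3": rng.normal(size=n),
    "Class": label,
})

ranked = rank_by_abs_correlation(demo, top_n=3)
print(ranked)
```

On the notebook's data, this might be called as `rank_by_abs_correlation(df[feature_cols + ['Class']])`, with `feature_cols` being the V1-V28 list.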

## 6. Time-Based Patterns

In [None]:
# The dataset's `Time` column holds seconds elapsed since the first
# transaction, not a timestamp, so it cannot be converted to a calendar
# datetime directly. This cell instead analyzes fraud rate by amount range.

# Analyze transaction patterns by amount ranges
bins = [0, 10, 50, 100, 500, float('inf')]
labels = ['≤$10', '$10-50', '$50-100', '$100-500', '>$500']
# include_lowest=True keeps zero-amount transactions in the first bin
df['Amount_Range'] = pd.cut(df['Amount'], bins=bins, labels=labels, include_lowest=True)

# Calculate fraud rate per amount range
fraud_by_range = df.groupby('Amount_Range', observed=True).agg({
    'Class': ['mean', 'count']
}).reset_index()
fraud_by_range.columns = ['Amount Range', 'Fraud Rate', 'Count']

# Create bar chart
fig = go.Figure()

# Add bars
fig.add_trace(go.Bar(
    x=fraud_by_range['Amount Range'],
    y=fraud_by_range['Fraud Rate'] * 100,
    marker_color='#FF6B6B',
    text=fraud_by_range['Fraud Rate'].apply(lambda x: f"{x*100:.2f}%"),
    textposition='outside',
    name='Fraud Rate'
))

# Add count as secondary y-axis
fig.add_trace(go.Scatter(
    x=fraud_by_range['Amount Range'],
    y=fraud_by_range['Count'],
    mode='lines+markers',
    name='Transaction Count',
    yaxis='y2'
))

fig.update_layout(
    title='Fraud Rate by Transaction Amount Range',
    xaxis_title='Amount Range',
    yaxis_title='Fraud Rate (%)',
    yaxis2=dict(
        title='Transaction Count',
        overlaying='y',
        side='right'
    ),
    height=500,
    barmode='group'
)

fig.show()

print("\n🔍 Key Insights:")
for _, row in fraud_by_range.iterrows():
    print(f"  • {row['Amount Range']}: {row['Fraud Rate']*100:.2f}% fraud ({row['Count']:,} transactions)")
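
For the hourly fraud rates listed in the overview, one option is to bucket `Time` into hours. The sketch below is an assumption-laden illustration rather than the notebook's own method: it assumes `Time` is seconds elapsed since the first transaction (as the Kaggle dataset page describes), so the resulting "hour" is relative to that first record and not a wall-clock hour. The helper name `hourly_fraud_rate` and the tiny `demo` frame are hypothetical.

```python
import pandas as pd

def hourly_fraud_rate(frame, time_col="Time", class_col="Class"):
    """Fraud rate and transaction count per relative hour of day (0-23)."""
    # Assumption: time_col is seconds since the first transaction,
    # so hours wrap every 24 * 3600 seconds.
    hour = (frame[time_col] // 3600 % 24).astype(int)
    summary = frame.groupby(hour)[class_col].agg(fraud_rate="mean", count="count")
    summary.index.name = "hour"
    return summary

# Tiny hand-made example (illustration only)
demo = pd.DataFrame({
    "Time": [0, 1800, 3600, 7200, 90000, 90500],
    "Class": [0, 1, 0, 0, 1, 1],
})

hourly = hourly_fraud_rate(demo)
print(hourly)
```

With the real data, `hourly_fraud_rate(df)` returns a frame whose `fraud_rate` column could be plotted with `go.Bar` in the same way as the amount-range chart.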

## 7. PCA Visualization

In [None]:
# Sample data for PCA (use subset for speed)
sample_df = df.sample(n=min(10000, len(df)), random_state=42)

# Prepare features (V1-V28)
feature_cols = [f'V{i}' for i in range(1, 29)]
X = sample_df[feature_cols].values
y = sample_df['Class'].values

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create scatter plot
fig = go.Figure()

# Add points for each class
for class_val, class_name, color in [(0, 'Legitimate', '#4ECDC4'), (1, 'Fraud', '#FF6B6B')]:
    mask = y == class_val
    fig.add_trace(go.Scatter(
        x=X_pca[mask, 0],
        y=X_pca[mask, 1],
        mode='markers',
        name=class_name,
        marker=dict(
            size=5,
            opacity=0.6,
            color=color
        ),
        text=[f"Class={class_val}" for _ in range(sum(mask))]
    ))

fig.update_layout(
    title=f'PCA Visualization (Explained Variance: {pca.explained_variance_ratio_.sum()*100:.1f}%)',
    xaxis_title='PC1',
    yaxis_title='PC2',
    height=600,
    hovermode='closest'
)

fig.show()

print(f"\n🔍 PCA Results:")
print(f"  • PC1 Explained Variance: {pca.explained_variance_ratio_[0]*100:.2f}%")
print(f"  • PC2 Explained Variance: {pca.explained_variance_ratio_[1]*100:.2f}%")
print(f"  • Total Explained Variance: {pca.explained_variance_ratio_.sum()*100:.2f}%")
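
Two components may capture only a small share of the variance here, since V1-V28 are already PCA outputs and standardizing them spreads variance fairly evenly. A quick way to check how many components a given variance target needs is to inspect the cumulative explained variance. This is a minimal sketch: `components_for_variance` is a hypothetical helper, and `X_demo` is synthetic data built from one latent factor so the example runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def components_for_variance(X, target=0.90):
    """Smallest number of components whose cumulative explained variance >= target."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)                       # keep all components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_needed = int(np.searchsorted(cumulative, target) + 1)
    return min(n_needed, len(cumulative)), cumulative

# Synthetic data: five noisy copies of a single latent factor (illustration only)
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 1))
X_demo = latent @ np.ones((1, 5)) + 0.05 * rng.normal(size=(500, 5))

n_components, cumulative = components_for_variance(X_demo, target=0.90)
print(f"Components needed for 90% variance: {n_components}")
```

On the notebook's sample, the call would be `components_for_variance(X)` with `X` taken from `sample_df[feature_cols]`.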

## 8. Summary Statistics Dashboard

In [None]:
# Calculate key statistics
stats = {
    'Total Transactions': len(df),
    'Fraud Cases': int(df['Class'].sum()),
    'Fraud Rate': f"{df['Class'].mean()*100:.3f}%",
    'Avg Transaction Amount': f"${df['Amount'].mean():.2f}",
    'Median Transaction Amount': f"${df['Amount'].median():.2f}",
    'Max Transaction Amount': f"${df['Amount'].max():.2f}",
    'Features (V1-V28)': 28,
}

# Display as formatted table
print("="*60)
print("FRAUD DETECTION DATASET - SUMMARY STATISTICS")
print("="*60)
for key, value in stats.items():
    print(f"{key:.<40} {value}")
print("="*60)

## 9. Key Takeaways

### Data Characteristics:
1. **Severe Class Imbalance**: Only 0.173% fraud cases (typical for fraud detection)
2. **Anonymized Features**: V1-V28 are PCA-transformed (original features hidden)
3. **Amount Distribution**: Highly skewed, most transactions are small

### Challenges for Machine Learning:
- ❌ **Imbalanced Data**: Models biased toward majority class
- ❌ **Feature Anonymization**: Hard to interpret feature importance
- ⚠️ **Overlap**: Fraud and legitimate transactions overlap in feature space

### Solutions:
- ✅ Use resampling techniques (SMOTE, oversampling)
- ✅ Apply class weights during training
- ✅ Use anomaly detection algorithms
- ✅ Ensemble methods for better performance
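
As a rough illustration of the class-weight option above, the sketch below compares fraud recall with and without `class_weight="balanced"` in scikit-learn's `LogisticRegression`. Everything here is an assumption for demonstration: the data is synthetic (two Gaussian blobs at roughly a 1:100 ratio), the model choice is arbitrary, and the numbers will not match the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 5,000 "legitimate" vs 50 "fraud" points
rng = np.random.default_rng(0)
n_legit, n_fraud = 5_000, 50
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_legit, 2)),
    rng.normal(1.5, 1.0, size=(n_fraud, 2)),
])
y = np.array([0] * n_legit + [1] * n_fraud)

# Stratify so the rare class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"Fraud recall, unweighted: {recall_plain:.2f}")
print(f"Fraud recall, balanced weights: {recall_weighted:.2f}")
```

Higher recall from weighting usually comes with more false positives, so precision should be checked alongside it.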

### Next Steps:
- → **Day 2**: Classification Benchmark (handle imbalance)
- → **Day 3**: Feature Engineering (create better features)

---

**📁 Project Location**: `01_fraud_detection_core/fraud_detection_eda_dashboard/`