# 01 - Data Profiling: Fraud Detection Analysis

**Objective:** Understand fraud distribution and risks in credit card transactions

**Key Questions:**
- What is the class imbalance ratio?
- What is the fraud rate?
- Are there potential data leakage features?
- What are the characteristics of fraudulent vs legitimate transactions?

---

## 1. Setup and Imports

In [1]:
import sys
import os

# Add src to path for imports
sys.path.append(os.path.abspath('../src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Import our data loading module
from data.load_data import load_raw_data, get_data_info, print_data_summary

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")


## 2. Load Data

In [None]:
# Load the dataset
df = load_raw_data()

# Display basic info
print_data_summary(df)

In [None]:
# First look at the data
df.head()

In [None]:
# Data types and memory usage
df.info()

## 3. Class Imbalance Analysis

Understanding the severity of class imbalance is critical for fraud detection.

In [None]:
# Calculate class distribution
class_counts = df['Class'].value_counts()
class_percentages = df['Class'].value_counts(normalize=True) * 100

print("Class Distribution:")
print("="*50)
print(f"Legitimate (0): {class_counts[0]:,} ({class_percentages[0]:.4f}%)")
print(f"Fraud (1):      {class_counts[1]:,} ({class_percentages[1]:.4f}%)")
print(f"\nImbalance Ratio: 1:{class_counts[0]/class_counts[1]:.2f}")
print(f"Fraud Rate: {class_percentages[1]:.4f}%")

In [None]:
# Visualize class imbalance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
class_counts.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
axes[0].set_title('Class Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Class (0=Legitimate, 1=Fraud)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_xticklabels(['Legitimate', 'Fraud'], rotation=0)
axes[0].grid(axis='y', alpha=0.3)

# Add count labels on bars
for i, v in enumerate(class_counts):
    axes[0].text(i, v + 5000, f'{v:,}', ha='center', fontweight='bold')

# Pie chart
colors = ['#2ecc71', '#e74c3c']
axes[1].pie(class_counts, labels=['Legitimate', 'Fraud'], autopct='%1.4f%%',
            startangle=90, colors=colors, explode=(0, 0.1))
axes[1].set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\n⚠️  SEVERE CLASS IMBALANCE DETECTED")
print(f"This will require special handling (SMOTE, class weights, etc.)")
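
Of the options named above, class weights are the simplest to sketch. A common heuristic, and the one scikit-learn documents for `class_weight='balanced'`, sets each weight to `n_samples / (n_classes * class_count)`. The snippet below is a dependency-free illustration on toy labels; `balanced_class_weights` and `toy_labels` are hypothetical names, not part of this project's `src` package.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels (assumption, not the real dataset): 998 legitimate, 2 fraud.
toy_labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy counts a fraud row weighs 499 times as much as a legitimate row, which is exactly the imbalance ratio.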

## 4. Fraud Rate Analysis

In [None]:
# Calculate fraud statistics
total_transactions = len(df)
fraud_transactions = df['Class'].sum()
fraud_rate = (fraud_transactions / total_transactions) * 100

print("Fraud Rate Analysis:")
print("="*50)
print(f"Total Transactions: {total_transactions:,}")
print(f"Fraudulent Transactions: {fraud_transactions:,}")
print(f"Fraud Rate: {fraud_rate:.4f}%")
print(f"\nThis means approximately {fraud_rate * 100:.0f} out of every 10,000 transactions are fraudulent")
print(f"\n💡 Business Impact:")
print(f"   - Accuracy is NOT a good metric ({100 - fraud_rate:.1f}% by predicting all legitimate)")
print(f"   - Need to focus on Precision, Recall, F1-Score, and AUC-ROC")
print(f"   - False Negatives (missed fraud) are costly!")
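
To make the accuracy point concrete, the sketch below scores a trivial model that labels every transaction as legitimate. The counts are toy values chosen for illustration, and `majority_baseline_report` is a hypothetical helper rather than project code.

```python
def majority_baseline_report(labels):
    """Score a 'model' that predicts 0 (legitimate) for every row."""
    n_total = len(labels)
    n_fraud = sum(labels)
    # Every legitimate row is counted as correct, every fraud row as a miss.
    accuracy = (n_total - n_fraud) / n_total
    # No fraud row is ever flagged, so recall on the fraud class is zero.
    fraud_recall = 0.0
    return {"accuracy": accuracy, "fraud_recall": fraud_recall, "missed_fraud": n_fraud}


# Toy labels (assumption): 9,980 legitimate and 20 fraudulent transactions.
toy_labels = [0] * 9980 + [1] * 20
report = majority_baseline_report(toy_labels)
print(report)
```

The baseline reaches 99.8% accuracy on this toy data while missing every fraudulent transaction, which is why recall and precision on the fraud class carry more information here.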

## 5. Feature Analysis

Examine the features to understand their distributions and identify potential leakage.

In [None]:
# Basic statistics
df.describe()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:")
print("="*50)
if missing_values.sum() == 0:
    print("✓ No missing values detected!")
else:
    print(missing_values[missing_values > 0])

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Duplicate Rows: {duplicates}")
if duplicates > 0:
    print(f"⚠️  Warning: {duplicates} duplicate rows found")
else:
    print("✓ No duplicate rows")

### 5.1 Time Feature Analysis

In [None]:
# Analyze Time feature
print("Time Feature Statistics:")
print("="*50)
print(f"Min Time: {df['Time'].min()}")
print(f"Max Time: {df['Time'].max()}")
print(f"Time Range: {df['Time'].max() - df['Time'].min()} seconds")
print(f"Time Range: {(df['Time'].max() - df['Time'].min()) / 3600:.2f} hours")
print(f"Time Range: {(df['Time'].max() - df['Time'].min()) / 86400:.2f} days")

In [None]:
# Fraud distribution over time
fig, axes = plt.subplots(2, 1, figsize=(15, 10))

# Convert time to hours for better visualization
df['Time_Hours'] = df['Time'] / 3600

# Plot 1: All transactions over time
axes[0].scatter(df[df['Class']==0]['Time_Hours'], 
                df[df['Class']==0].index, 
                alpha=0.3, s=1, label='Legitimate', color='blue')
axes[0].scatter(df[df['Class']==1]['Time_Hours'], 
                df[df['Class']==1].index, 
                alpha=0.8, s=10, label='Fraud', color='red')
axes[0].set_xlabel('Time (hours)', fontsize=12)
axes[0].set_ylabel('Transaction Index', fontsize=12)
axes[0].set_title('Transaction Distribution Over Time', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Fraud rate over time bins
time_bins = pd.cut(df['Time_Hours'], bins=20)
fraud_rate_by_time = df.groupby(time_bins)['Class'].agg(['sum', 'count'])
fraud_rate_by_time['fraud_rate'] = (fraud_rate_by_time['sum'] / fraud_rate_by_time['count']) * 100

axes[1].plot(range(len(fraud_rate_by_time)), fraud_rate_by_time['fraud_rate'], 
             marker='o', linewidth=2, markersize=6, color='red')
axes[1].set_xlabel('Time Bin', fontsize=12)
axes[1].set_ylabel('Fraud Rate (%)', fontsize=12)
axes[1].set_title('Fraud Rate Over Time Bins', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Insight: Check if fraud rate varies significantly over time")
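
If the fraud rate does vary over time, an hour-of-day feature is one way to expose that pattern to a model. The sketch below derives it from seconds elapsed. It assumes the first transaction falls at midnight, which the data does not confirm, so the result is best read as a cyclic offset rather than a true clock hour. `hour_of_day` is a hypothetical helper.

```python
def hour_of_day(seconds_elapsed):
    """Map seconds since the first transaction to an hour bucket in 0..23.

    Assumption: time zero is treated as midnight; the real offset is unknown.
    """
    return int(seconds_elapsed // 3600) % 24


# Toy timestamps in seconds: start, just before 24h, and one hour into day two.
toy_times = [0, 86399, 90000]
toy_hours = [hour_of_day(t) for t in toy_times]
print(toy_hours)
```

On a DataFrame the same mapping could be applied to the `Time` column before grouping the fraud rate by hour.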

### 5.2 Amount Feature Analysis

In [None]:
# Amount statistics by class
print("Amount Statistics by Class:")
print("="*50)
print("\nLegitimate Transactions:")
print(df[df['Class']==0]['Amount'].describe())
print("\nFraudulent Transactions:")
print(df[df['Class']==1]['Amount'].describe())

In [None]:
# Visualize Amount distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Amount distribution for legitimate
axes[0, 0].hist(df[df['Class']==0]['Amount'], bins=50, color='blue', alpha=0.7, edgecolor='black')
axes[0, 0].set_xlabel('Amount', fontsize=12)
axes[0, 0].set_ylabel('Frequency', fontsize=12)
axes[0, 0].set_title('Legitimate Transactions - Amount Distribution', fontsize=12, fontweight='bold')
axes[0, 0].grid(alpha=0.3)

# Plot 2: Amount distribution for fraud
axes[0, 1].hist(df[df['Class']==1]['Amount'], bins=50, color='red', alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('Amount', fontsize=12)
axes[0, 1].set_ylabel('Frequency', fontsize=12)
axes[0, 1].set_title('Fraudulent Transactions - Amount Distribution', fontsize=12, fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# Plot 3: Box plot comparison
df.boxplot(column='Amount', by='Class', ax=axes[1, 0])
fig.suptitle('')  # Remove the "Boxplot grouped by Class" title pandas adds to the figure
axes[1, 0].set_xlabel('Class (0=Legitimate, 1=Fraud)', fontsize=12)
axes[1, 0].set_ylabel('Amount', fontsize=12)
axes[1, 0].set_title('Amount Distribution by Class', fontsize=12, fontweight='bold')
plt.sca(axes[1, 0])
plt.xticks([1, 2], ['Legitimate', 'Fraud'])

# Plot 4: Log-scale comparison
axes[1, 1].hist(df[df['Class']==0]['Amount'], bins=50, alpha=0.5, label='Legitimate', color='blue')
axes[1, 1].hist(df[df['Class']==1]['Amount'], bins=50, alpha=0.5, label='Fraud', color='red')
axes[1, 1].set_xlabel('Amount', fontsize=12)
axes[1, 1].set_ylabel('Frequency (log scale)', fontsize=12)
axes[1, 1].set_yscale('log')
axes[1, 1].set_title('Amount Distribution Comparison (Log Scale)', fontsize=12, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 5.3 PCA Features (V1-V28) Analysis

In [None]:
# Get V features
v_features = [col for col in df.columns if col.startswith('V')]
print(f"Number of PCA features: {len(v_features)}")
print(f"Features: {v_features[:5]}... (showing first 5)")

In [None]:
# Calculate correlation with target for each V feature
correlations = df[v_features + ['Class']].corr()['Class'].drop('Class').sort_values(ascending=False)

print("Top 10 Features Most Correlated with Fraud:")
print("="*50)
print(correlations.head(10))
print("\nTop 10 Features Most Negatively Correlated with Fraud:")
print("="*50)
print(correlations.tail(10))

In [None]:
# Visualize feature correlations with target
fig, ax = plt.subplots(figsize=(12, 8))

correlations_sorted = correlations.sort_values()
colors = ['red' if x < 0 else 'green' for x in correlations_sorted]

correlations_sorted.plot(kind='barh', ax=ax, color=colors, alpha=0.7)
ax.set_xlabel('Correlation with Fraud (Class)', fontsize=12)
ax.set_ylabel('Features', fontsize=12)
ax.set_title('Feature Correlation with Fraud Class', fontsize=14, fontweight='bold')
ax.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Features with high absolute correlation may be most predictive")

In [None]:
# Visualize distribution of top correlated features
top_features = correlations.abs().sort_values(ascending=False).head(6).index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    axes[idx].hist(df[df['Class']==0][feature], bins=50, alpha=0.5, label='Legitimate', color='blue', density=True)
    axes[idx].hist(df[df['Class']==1][feature], bins=50, alpha=0.5, label='Fraud', color='red', density=True)
    axes[idx].set_xlabel(feature, fontsize=11)
    axes[idx].set_ylabel('Density', fontsize=11)
    axes[idx].set_title(f'{feature} Distribution (Corr: {correlations[feature]:.3f})', 
                        fontsize=11, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()
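
`DataFrame.corr()` defaults to the Pearson coefficient, and against a 0/1 target that is the point-biserial correlation. The sketch below spells out the calculation on toy values so the numbers in the plots above are easier to interpret; `pearson`, `toy_feature`, and `toy_target` are hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy data (assumption): fraud rows (target 1) sit at strongly negative values.
toy_feature = [0.1, -0.2, 0.0, 0.3, -4.0, -5.0]
toy_target = [0, 0, 0, 0, 1, 1]
r = pearson(toy_feature, toy_target)
print(f"{r:.3f}")
```

A strongly negative value like this one only says the class means differ along a line. It does not capture non-linear separation, so a feature with a weak correlation may still be useful to a tree-based model.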

## 6. Data Leakage Detection

Check for potential data leakage issues that could artificially inflate model performance.

In [None]:
print("Data Leakage Checks:")
print("="*50)

# Check 1: Perfect separation in any feature
# Compare value ranges rather than exact values: continuous features rarely
# share identical floats across classes, so a disjoint-set test over-reports.
print("\n1. Checking for perfect separation in features...")
perfect_separation = False
fraud_rows = df[df['Class'] == 1]
legit_rows = df[df['Class'] == 0]
for col in df.columns:
    if col != 'Class':
        ranges_overlap = (fraud_rows[col].min() <= legit_rows[col].max()
                          and legit_rows[col].min() <= fraud_rows[col].max())
        if not ranges_overlap:
            print(f"   ⚠️  WARNING: Perfect separation in {col}")
            perfect_separation = True

if not perfect_separation:
    print("   ✓ No perfect separation detected")

# Check 2: Suspiciously high correlations
print("\n2. Checking for suspiciously high correlations...")
high_corr_threshold = 0.9
high_corr_features = correlations[correlations.abs() > high_corr_threshold]
if len(high_corr_features) > 0:
    print(f"   ⚠️  WARNING: Features with correlation > {high_corr_threshold}:")
    print(high_corr_features)
else:
    print(f"   ✓ No features with correlation > {high_corr_threshold}")

# Check 3: Time-based leakage
print("\n3. Checking for time-based leakage...")
print("   💡 Time feature represents seconds elapsed - should be safe")
print("   💡 However, be cautious about temporal patterns in deployment")

# Check 4: Feature scaling issues
print("\n4. Checking feature scales...")
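# Report the actual spread instead of assuming unit variance for PCA components
v_stats = df[v_features].agg(['mean', 'std'])
print(f"   - V features (PCA): |mean| <= {v_stats.loc['mean'].abs().max():.2g}, "
      f"std from {v_stats.loc['std'].min():.2f} to {v_stats.loc['std'].max():.2f}")
print(f"   - Time: Range = {df['Time'].min():.0f} to {df['Time'].max():.0f}")
print(f"   - Amount: Range = {df['Amount'].min():.2f} to {df['Amount'].max():.2f}")
print(f"   ⚠️  Time and Amount need scaling before modeling!")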
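
Scaling is itself a place where leakage can creep in: if the mean and standard deviation are computed on the full dataset, information from the test rows reaches the training features. A minimal sketch of the safer pattern, fitting on the training split only and reusing those statistics on the test split, is shown below. The amounts are toy values and both helpers are hypothetical.

```python
import statistics


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Apply previously fitted statistics; never refit on test data."""
    return [(v - mean) / std for v in values]


# Toy amounts (assumption): the test split contains an unusually large value.
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 100.0]

mean, std = fit_standardizer(train_amounts)
scaled_train = standardize(train_amounts, mean, std)
scaled_test = standardize(test_amounts, mean, std)
print(mean, round(std, 4), [round(v, 3) for v in scaled_test])
```

The large test value lands far outside the training range after scaling, which is the honest outcome: the model sees it the way it would see an unusual transaction in production.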

## 7. Key Insights and Recommendations

In [None]:
# Generate summary insights
info = get_data_info(df)

print("\n" + "="*70)
print("KEY INSIGHTS & RECOMMENDATIONS")
print("="*70)

print("\n📊 CLASS IMBALANCE:")
print(f"   - Fraud rate: {info['fraud_percentage']:.4f}%")
print(f"   - Imbalance ratio: 1:{info['legitimate_count']/info['fraud_count']:.0f}")
print(f"   - Recommendation: Use SMOTE, class weights, or ensemble methods")

print("\n📈 EVALUATION METRICS:")
print(f"   - DO NOT use accuracy as primary metric")
print(f"   - Focus on: Precision, Recall, F1-Score, PR-AUC, ROC-AUC")
print(f"   - Consider business cost of False Negatives vs False Positives")

print("\n🔧 PREPROCESSING NEEDED:")
print(f"   - Scale 'Time' and 'Amount' features")
print(f"   - V features are already PCA-transformed; verify their scale before modeling")
print(f"   - Consider creating time-based features (hour of day, etc.)")

print("\n🎯 MODELING STRATEGY:")
print(f"   - Use stratified train-test split to maintain class ratio")
print(f"   - Apply resampling techniques (SMOTE, ADASYN)")
print(f"   - Try ensemble methods (Random Forest, XGBoost, LightGBM)")
print(f"   - Use cross-validation with stratification")

print("\n⚠️  DATA QUALITY:")
print(f"   - Missing values: {info['missing_values']} ✓")
print(f"   - Duplicate rows: {info['duplicate_rows']} ✓")
print(f"   - No obvious data leakage detected ✓")

print("\n🔍 FEATURE IMPORTANCE:")
top_3_features = correlations.abs().sort_values(ascending=False).head(3)
print(f"   Top 3 correlated features:")
for feat, corr in top_3_features.items():
    print(f"   - {feat}: {corr:.4f}")

print("\n" + "="*70)
print("✓ Data profiling complete! Ready for preprocessing and modeling.")
print("="*70)
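
The stratified split recommended above can be sketched without any library: shuffle the row indices of each class separately and take the same fraction from each. This is an illustration on toy labels, not the project's splitting code; `stratified_split` and its parameters are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    rng = random.Random(seed)
    indices_by_class = {}
    for idx, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumption): 990 legitimate and 10 fraudulent rows.
toy_labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)
test_fraud = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_fraud)
```

With so few fraud rows, a plain random split could easily place all of them in one partition; splitting per class guarantees both partitions contain fraud examples.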

## 8. Next Steps

Based on this analysis, the next steps are:

1. **Data Preprocessing** (Notebook 02):
   - Scale Time and Amount features
   - Create additional time-based features
   - Prepare train-test split with stratification

2. **Handle Class Imbalance** (Notebook 03):
   - Apply SMOTE or ADASYN
   - Experiment with class weights
   - Compare different resampling strategies

3. **Model Training** (Notebook 04):
   - Train baseline models
   - Evaluate with appropriate metrics
   - Optimize hyperparameters

4. **Model Evaluation** (Notebook 05):
   - Compare model performance
   - Analyze confusion matrices
   - Calculate business impact metrics
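
As a preview of the evaluation step, the sketch below derives precision, recall, and F1 from confusion-matrix counts and attaches a simple cost to the two error types. All counts and per-error costs are hypothetical placeholders; real costs would have to come from the business.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the fraud class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of false alarms plus missed fraud."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical counts: 80 frauds caught, 20 false alarms, 20 frauds missed.
tp, fp, fn = 80, 20, 20
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
# Hypothetical costs: 5 per false alarm, 100 per missed fraud.
total_cost = error_cost(fp, fn, cost_fp=5.0, cost_fn=100.0)
print(precision, recall, round(f1, 4), total_cost)
```

In this toy setup the 20 missed frauds account for most of the cost even though false alarms are just as frequent, which mirrors the earlier point that false negatives tend to dominate.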