# E-Commerce Fraud Detection

This project models e-commerce transaction data to identify fraudulent activity, based on this [Kaggle dataset](https://www.kaggle.com/datasets/umuttuygurr/e-commerce-fraud-detection-dataset). The dataset is synthetic but designed to be realistic: it is modeled after real-life fraudulent activity observed in 2024, with scenarios such as:
- Cards tested with $1 purchases at midnight
- Transactions that shipped “gaming accessories” 5,000 km away
- Promo codes being reused from freshly created accounts

I decided to focus on this dataset as it is the most complete, realistic data on transaction fraud that I could find. Other fraud datasets that weren't synthetic had to obfuscate the meaning of features and their values for privacy reasons, using techniques like PCA, so features had meaningless names like V1, V2, etc.

Here is a list of the columns in the dataset with brief descriptions:

- `transaction_id`: Unique transaction identifier
- `user_id`: User identifier (each user has 40–60 transactions)
- `account_age_days`: Age of user account in days
- `total_transactions_user`: Number of transactions per user
- `avg_amount_user`: User’s mean transaction amount
- `amount`: Transaction amount (USD)
- `country`: User’s country
- `bin_country`: Country of the card-issuing bank
- `channel`: “web” or “app”
- `merchant_category`: Type of purchase: electronics, fashion, grocery, gaming, travel
- `promo_used`: Whether a discount/promo was used
- `avs_match`: Address Verification System (AVS) result: whether the billing address provided by the customer matches the one on file with the card issuer
- `cvv_result`: CVV match result: whether the 3-digit code on the back of the card, provided during an online transaction, matched the card issuer's records
- `three_ds_flag`: Whether 3D Secure is enabled; when it is, a flagged transaction prompts the customer to complete an extra verification step, such as a one-time code sent to their phone, a password, or a biometric login
- `transaction_time`: Transaction timestamp (UTC)
- `shipping_distance_km`: Distance between billing and shipping addresses
- `is_fraud`: Target label (1 = fraud, 0 = normal)

## Setup
### Define parameters
The input/output parameters are defined in the next cell.

In [None]:
# Data input parameters
kaggle_source = "umuttuygurr/e-commerce-fraud-detection-dataset"
data_dir = "../data"  # Relative to notebooks/ folder
csv_file = "transactions.csv"
# Column definitions
target_col = "is_fraud"
id_cols = ['transaction_id', 'user_id']
date_feature = 'transaction_time'
# Define categorical features (including binary flags)
categorical_features = ['country', 'bin_country', 'channel', 'merchant_category', 
                       'promo_used', 'avs_match', 'cvv_result', 'three_ds_flag']
# Define numeric features (continuous/count variables only)
numeric_features = ['account_age_days', 'total_transactions_user', 'avg_amount_user', 
                   'amount', 'shipping_distance_km']
# Validation/Test split ratios
val_ratio = .2
test_ratio = .2

# Prepend this string to final answers so they print as bold text
BOLD = "\033[1m"

### Import packages

In [None]:
# Add project root to path for imports (needed when running from notebooks/ folder)
import sys
from pathlib import Path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import extracted EDA utilities
from src.fd1_nb.data_utils import (
    download_data_csv, load_data, split_train_val_test, analyze_target_stats,
    analyze_feature_stats, plot_target_distribution
)
from src.fd1_nb.eda_utils import (
    calculate_mi_scores, calculate_numeric_correlations, calculate_vif,
    plot_numeric_distributions, analyze_vif, analyze_correlations,
    plot_box_plots, analyze_temporal_patterns, analyze_categorical_fraud_rates,
    plot_categorical_fraud_rates, analyze_mutual_information
)
from src.fd1_nb.feature_engineering import (
    convert_utc_to_local_time, create_temporal_features,
    create_interaction_features, create_percentile_based_features
)

# Third-party imports (still used in notebook)
import pandas as pd

# Utilities for feature configuration
from src.deployment.preprocessing import FeatureConfig

### Define feature engineering functions

In [None]:
# ============ FEATURE ENGINEERING FUNCTIONS ============

def get_country_timezone_mapping():
    """Create mapping of country codes to capital city timezones."""
    return {
        'US': 'America/New_York',      # Washington D.C.
        'GB': 'Europe/London',          # London
        'FR': 'Europe/Paris',           # Paris
        'DE': 'Europe/Berlin',          # Berlin
        'IT': 'Europe/Rome',            # Rome
        'ES': 'Europe/Madrid',          # Madrid
        'NL': 'Europe/Amsterdam',       # Amsterdam
        'PL': 'Europe/Warsaw',          # Warsaw
        'RO': 'Europe/Bucharest',       # Bucharest
        'TR': 'Europe/Istanbul'         # Ankara (Turkey uses the Europe/Istanbul zone)
    }

def create_amount_features(df):
    """Create transaction amount-based features."""
    df = df.copy()
    
    # Amount deviation from user's average
    df['amount_deviation'] = (df['amount'] - df['avg_amount_user']).abs()
    
    # Amount ratio compared to user's average (handle division by zero)
    df['amount_vs_avg_ratio'] = df['amount'] / df['avg_amount_user'].replace(0, 1)
    
    # Micro transactions (potential card testing)
    df['is_micro_transaction'] = (df['amount'] <= 5).astype(int)
    
    # Large transactions
    df['is_large_transaction'] = (df['amount'] >= df['amount'].quantile(0.95)).astype(int)
    
    features = ['amount_deviation', 'amount_vs_avg_ratio', 'is_micro_transaction', 'is_large_transaction']
    return df, features

def create_user_behavior_features(df):
    """Create user behavior-based features."""
    df = df.copy()
    
    # Transaction velocity (transactions per day of account age)
    df['transaction_velocity'] = df['total_transactions_user'] / df['account_age_days'].replace(0, 1)
    
    # New account flag (30 days old or less)
    df['is_new_account'] = (df['account_age_days'] <= 30).astype(int)
    
    # High frequency user
    df['is_high_frequency_user'] = (df['total_transactions_user'] >= df['total_transactions_user'].quantile(0.75)).astype(int)
    
    features = ['transaction_velocity', 'is_new_account', 'is_high_frequency_user']
    return df, features

def create_geographic_features(df, risk_distance_quantile=0.75):
    """Create geographic-based features."""
    df = df.copy()
    
    # Country mismatch (user country != card issuing country)
    df['country_mismatch'] = (df['country'] != df['bin_country']).astype(int)
    
    # High risk shipping distance (>75th percentile)
    distance_threshold = df['shipping_distance_km'].quantile(risk_distance_quantile)
    df['high_risk_distance'] = (df['shipping_distance_km'] > distance_threshold).astype(int)
    
    # Zero distance (billing = shipping, lower risk)
    df['zero_distance'] = (df['shipping_distance_km'] == 0).astype(int)
    
    features = ['country_mismatch', 'high_risk_distance', 'zero_distance']
    return df, features

def create_security_features(df):
    """Create security verification-based features."""
    df = df.copy()
    
    # Security score (count of passed verifications)
    df['security_score'] = df['avs_match'] + df['cvv_result'] + df['three_ds_flag']
    
    # Count of failed verifications
    df['verification_failures'] = 3 - df['security_score']
    
    # All verifications passed
    df['all_verifications_passed'] = (df['security_score'] == 3).astype(int)
    
    # All verifications failed (high risk)
    df['all_verifications_failed'] = (df['security_score'] == 0).astype(int)
    
    features = ['security_score', 'verification_failures', 'all_verifications_passed', 'all_verifications_failed']
    return df, features

def engineer_features(df, date_col='transaction_time', country_col='country'):
    """
    Master function to create all engineered features.
    Includes both UTC and local time-based features.
    """
    print("=" * 80)
    print("FEATURE ENGINEERING")
    print("=" * 80)
    
    df_eng = df.copy()
    all_new_features = []
    
    # 1. Convert to local time
    print("\n1. TIMEZONE CONVERSION:")
    df_eng = convert_utc_to_local_time(df_eng, date_col, country_col, timezone_mapping=get_country_timezone_mapping())
    
    # 2. Temporal features (UTC)
    print("\n2. TEMPORAL FEATURES (UTC):")
    df_eng, utc_features = create_temporal_features(df_eng, date_col, suffix='', late_night_hours=(23, 4), business_hours=(9, 17))
    print(f"  ✓ Created {len(utc_features)} UTC temporal features")
    all_new_features.extend(utc_features)
    
    # 3. Temporal features (Local time)
    print("\n3. TEMPORAL FEATURES (LOCAL TIME):")
    df_eng, local_features = create_temporal_features(df_eng, 'local_time', suffix='_local', late_night_hours=(23, 4), business_hours=(9, 17))
    print(f"  ✓ Created {len(local_features)} local time temporal features")
    all_new_features.extend(local_features)
    
    # 4. Amount features
    print("\n4. TRANSACTION AMOUNT FEATURES:")
    df_eng, amount_features = create_amount_features(df_eng)
    print(f"  ✓ Created {len(amount_features)} amount-based features: {amount_features}")
    all_new_features.extend(amount_features)
    
    # 5. User behavior features
    print("\n5. USER BEHAVIOR FEATURES:")
    df_eng, behavior_features = create_user_behavior_features(df_eng)
    print(f"  ✓ Created {len(behavior_features)} behavior features: {behavior_features}")
    all_new_features.extend(behavior_features)
    
    # 6. Geographic features
    print("\n6. GEOGRAPHIC FEATURES:")
    df_eng, geo_features = create_geographic_features(df_eng)
    print(f"  ✓ Created {len(geo_features)} geographic features: {geo_features}")
    all_new_features.extend(geo_features)
    
    # 7. Security features
    print("\n7. SECURITY FEATURES:")
    df_eng, security_features = create_security_features(df_eng)
    print(f"  ✓ Created {len(security_features)} security features: {security_features}")
    all_new_features.extend(security_features)
    
    # 8. Interaction features
    print("\n8. INTERACTION FEATURES (High-Risk Combinations):")
    # Define interaction feature configurations
    interaction_config = [
        {
            'name': 'new_account_with_promo',
            'conditions': ['is_new_account == 1', 'promo_used == 1'],
            'operator': 'and'
        },
        {
            'name': 'late_night_micro_transaction',
            'conditions': ['is_late_night_local == 1', 'is_micro_transaction == 1'],
            'operator': 'and'
        },
        {
            'name': 'foreign_card_failed_verification',
            'conditions': ['country_mismatch == 1', 'verification_failures > 0'],
            'operator': 'and'
        },
        {
            'name': 'new_high_velocity_account',
            'conditions': ['is_new_account == 1', 'is_high_frequency_user == 1'],
            'operator': 'and'
        },
        {
            'name': 'high_value_long_distance',
            'conditions': ['is_large_transaction == 1', 'high_risk_distance == 1'],
            'operator': 'and'
        },
        {
            'name': 'triple_risk_combo',
            'conditions': ['is_new_account == 1', 'promo_used == 1', 'verification_failures > 0'],
            'operator': 'and'
        }
    ]

    df_eng, interaction_features = create_interaction_features(df_eng, interaction_config)
    print(f"  ✓ Created {len(interaction_features)} interaction features: {interaction_features}")
    all_new_features.extend(interaction_features)

    # Summary
    print("\n" + "=" * 80)
    print(f"FEATURE ENGINEERING COMPLETE")
    print(f"Total new features created: {len(all_new_features)}")
    print(f"Original shape: {df.shape}")
    print(f"New shape: {df_eng.shape}")
    print("=" * 80)
    
    return df_eng, all_new_features


def print_feature_recommendations(corr_df, mi_df, vif_df, numeric_features, categorical_features):
    """Print comprehensive feature selection recommendations."""
    print("=" * 80)
    print("FEATURE SELECTION RECOMMENDATIONS")
    print("=" * 80)
    
    print("\n📊 SUMMARY OF EDA FINDINGS:")
    print("-" * 80)
    
    print("\n1. NUMERIC FEATURES (Correlation Analysis):")
    print("   Features to KEEP (showing meaningful correlation):")
    for _, row in corr_df.head(5).iterrows():
        print(f"   ✓ {row['feature']}: {row['correlation']:.4f} correlation")
    
    print("\n2. CATEGORICAL FEATURES (Mutual Information Analysis):")
    print("   Features to KEEP (MI > 0.01):")
    high_mi_features = mi_df[mi_df['mi_score'] > 0.01]
    for _, row in high_mi_features.iterrows():
        print(f"   ✓ {row['feature']}: MI = {row['mi_score']:.4f}")
    
    print("\n3. MULTICOLLINEARITY CHECK:")
    if vif_df['VIF'].max() > 10:
        high_vif_features = vif_df[vif_df['VIF'] > 10]
        print("   ⚠️  Consider removing or combining these features:")
        for _, row in high_vif_features.iterrows():
            print(f"   - {row['feature']}: VIF = {row['VIF']:.2f}")
    else:
        print("   ✓ No severe multicollinearity detected")
    
    print("\n4. TEMPORAL PATTERNS:")
    print("   ✓ Hour of day shows fraud patterns (consider time-based features)")
    print("   ✓ Weekend/weekday distinction may be relevant")
    
    print("\n" + "=" * 80)
    print("RECOMMENDED FEATURES FOR MODELING:")
    print("=" * 80)
    
    print("\n✅ NUMERIC FEATURES (all 5):")
    for feat in numeric_features:
        print(f"   • {feat}")
    
    print("\n✅ CATEGORICAL FEATURES (all 8):")
    for feat in categorical_features:
        print(f"   • {feat}")
    
    print("\n✅ TEMPORAL FEATURES TO ENGINEER:")
    print("   • hour (from transaction_time)")
    print("   • day_of_week (from transaction_time)")
    print("   • is_weekend (derived from day_of_week)")
    print("   • is_midnight (hours 23-01)")
    
    print("\n💡 SUGGESTED FEATURE ENGINEERING:")
    print("   • country_mismatch: (country != bin_country)")
    print("   • amount_deviation: |amount - avg_amount_user|")
    print("   • amount_vs_avg_ratio: amount / avg_amount_user")
    print("   • high_risk_distance: (shipping_distance_km > threshold)")
    print("   • security_score: combination of avs_match + cvv_result + three_ds_flag")
    print("   • transaction_velocity: total_transactions_user / account_age_days")
    
    print("\n⚡ MODELING CONSIDERATIONS:")
    print("   • Use stratified sampling (class imbalance: 44:1)")
    print("   • Apply class weights or SMOTE for minority class")
    print("   • Use appropriate metrics: ROC-AUC, F1, Precision-Recall (not accuracy)")
    print("   • Consider threshold tuning for precision/recall trade-off")
    print("   • Try tree-based models (handle categorical features well)")
    
    print("\n" + "=" * 80)

def analyze_final_feature_selection(train_new_features):
    """
    Comprehensive final feature selection analysis based on EDA insights
    and engineered features. Returns categorized feature recommendations.
    """
    print("=" * 80)
    print("FINAL FEATURE SELECTION FOR MODELING")
    print("=" * 80)

    # Define all available features
    original_numeric = ['account_age_days', 'total_transactions_user', 'avg_amount_user',
                       'amount', 'shipping_distance_km']
    original_categorical = ['country', 'bin_country', 'channel', 'merchant_category',
                           'promo_used', 'avs_match', 'cvv_result', 'three_ds_flag']

    print("\n📊 AVAILABLE FEATURES:")
    print("-" * 80)
    print(f"Original features: {len(original_numeric) + len(original_categorical)}")
    print(f"  • Numeric: {len(original_numeric)}")
    print(f"  • Categorical: {len(original_categorical)}")
    print(f"Engineered features: {len(train_new_features)}")
    print(f"Total available: {len(original_numeric) + len(original_categorical) + len(train_new_features)}")

    # Categorize engineered features
    temporal_utc = ['hour', 'day_of_week', 'month', 'is_weekend', 'is_late_night', 'is_business_hours']
    temporal_local = ['hour_local', 'day_of_week_local', 'month_local', 'is_weekend_local',
                     'is_late_night_local', 'is_business_hours_local']
    amount_features = ['amount_deviation', 'amount_vs_avg_ratio', 'is_micro_transaction', 'is_large_transaction']
    behavior_features = ['transaction_velocity', 'is_new_account', 'is_high_frequency_user']
    geographic_features = ['country_mismatch', 'high_risk_distance', 'zero_distance']
    security_features = ['security_score', 'verification_failures', 'all_verifications_passed', 'all_verifications_failed']
    interaction_features = ['new_account_with_promo', 'late_night_micro_transaction',
                           'foreign_card_failed_verification', 'new_high_velocity_account',
                           'high_value_long_distance', 'triple_risk_combo']

    print("\n🔍 FEATURE SELECTION ANALYSIS:")
    print("-" * 80)

    # 1. Original Features - Keep high-value ones
    print("\n1. ORIGINAL FEATURES:")
    print("   ✅ KEEP ALL NUMERIC (5):")
    print("      • shipping_distance_km - Strong correlation (0.27)")
    print("      • amount - Moderate correlation (0.20)")
    print("      • account_age_days - Negative correlation (-0.12)")
    print("      • total_transactions_user, avg_amount_user - Baseline info")

    print("\n   ✅ KEEP HIGH-VALUE CATEGORICAL (5 of 8):")
    print("      • avs_match - High MI (0.017), 9.8% fraud when failed")
    print("      • cvv_result - High MI (0.015), 10.6% fraud when failed")
    print("      • three_ds_flag - High MI (0.010), 6.7% fraud when disabled")
    print("      • channel - High signal (3.6% fraud on web vs 0.8% on app)")
    print("      • promo_used - High signal (4.6% fraud when used)")

    print("\n   ⚠️  EXCLUDE (3 of 8) - Redundant with engineered features:")
    print("      • country → Replaced by country_mismatch (more specific)")
    print("      • bin_country → Replaced by country_mismatch")
    print("      • merchant_category → Low signal, all categories near baseline")

    # 2. Temporal Features - Choose local over UTC
    print("\n2. TEMPORAL FEATURES:")
    print("   ✅ KEEP LOCAL TIME FEATURES (6):")
    print("      • hour_local - Better captures 'unusual hour' fraud")
    print("      • is_late_night_local - Fraud scenario #1 (midnight transactions)")
    print("      • is_weekend_local, day_of_week_local, month_local")
    print("      • is_business_hours_local - Inverse of late_night signal")

    print("\n   ⚠️  EXCLUDE UTC FEATURES (6) - Redundant:")
    print("      • Local time is more meaningful for fraud detection")
    print("      • UTC features don't align with human behavior patterns")

    # 3. Amount Features - Keep all
    print("\n3. AMOUNT FEATURES:")
    print("   ✅ KEEP ALL (4):")
    print("      • is_micro_transaction - Fraud scenario #1 ($1 card testing)")
    print("      • amount_vs_avg_ratio - User deviation signal")
    print("      • is_large_transaction - High-value fraud attempts")
    print("      • amount_deviation - Absolute deviation signal")

    # 4. Behavior Features - Keep all
    print("\n4. USER BEHAVIOR FEATURES:")
    print("   ✅ KEEP ALL (3):")
    print("      • is_new_account - Fraud scenario #3 (fresh accounts)")
    print("      • transaction_velocity - Rapid account usage")
    print("      • is_high_frequency_user - Baseline comparison")

    # 5. Geographic Features - Keep all
    print("\n5. GEOGRAPHIC FEATURES:")
    print("   ✅ KEEP ALL (3):")
    print("      • country_mismatch - Replaces country + bin_country")
    print("      • high_risk_distance - Fraud scenario #2 (5000km shipments)")
    print("      • zero_distance - Low-risk indicator")

    # 6. Security Features - Keep composite score only
    print("\n6. SECURITY FEATURES:")
    print("   ✅ KEEP COMPOSITE SCORE (1 of 4):")
    print("      • security_score - Replaces individual avs/cvv/3ds flags")

    print("\n   ⚠️  EXCLUDE (3 of 4) - Redundant:")
    print("      • verification_failures → Inverse of security_score")
    print("      • all_verifications_passed → Encoded in security_score == 3")
    print("      • all_verifications_failed → Encoded in security_score == 0")
    print("      Note: Keep original avs_match, cvv_result, three_ds_flag for interpretability")

    # 7. Interaction Features - Keep scenario-specific only
    print("\n7. INTERACTION FEATURES:")
    print("   ✅ KEEP SCENARIO-SPECIFIC (3 of 6):")
    print("      • new_account_with_promo - Fraud scenario #3 (explicit)")
    print("      • late_night_micro_transaction - Fraud scenario #1 (explicit)")
    print("      • high_value_long_distance - Fraud scenario #2 variant")

    print("\n   ⚠️  EXCLUDE COMPOSITE INTERACTIONS (3 of 6):")
    print("      • foreign_card_failed_verification → Covered by country_mismatch + security_score")
    print("      • new_high_velocity_account → Covered by is_new_account + is_high_frequency_user")
    print("      • triple_risk_combo → Overly specific, low frequency")

    # Final recommendations
    print("\n" + "=" * 80)
    print("FINAL FEATURE SET FOR MODELING")
    print("=" * 80)

    # Build final feature lists
    final_numeric = original_numeric.copy()
    final_categorical = ['channel', 'promo_used', 'avs_match', 'cvv_result', 'three_ds_flag']
    final_temporal = temporal_local.copy()
    final_amount = amount_features.copy()
    final_behavior = behavior_features.copy()
    final_geographic = geographic_features.copy()
    final_security = ['security_score']
    final_interaction = ['new_account_with_promo', 'late_night_micro_transaction', 'high_value_long_distance']

    all_final_features = (final_numeric + final_categorical + final_temporal +
                         final_amount + final_behavior + final_geographic +
                         final_security + final_interaction)

    print(f"\nTotal features selected: {len(all_final_features)} (from {len(original_numeric) + len(original_categorical) + len(train_new_features)} available)")
    print(f"Reduction: {len(original_numeric) + len(original_categorical) + len(train_new_features) - len(all_final_features)} features excluded")

    print("\n📋 FINAL FEATURE LIST BY CATEGORY:")
    print("-" * 80)

    print(f"\n1. Original Numeric ({len(final_numeric)}):")
    for feat in final_numeric:
        print(f"   • {feat}")

    print(f"\n2. Original Categorical ({len(final_categorical)}):")
    for feat in final_categorical:
        print(f"   • {feat}")

    print(f"\n3. Temporal (Local Time) ({len(final_temporal)}):")
    for feat in final_temporal:
        print(f"   • {feat}")

    print(f"\n4. Amount Features ({len(final_amount)}):")
    for feat in final_amount:
        print(f"   • {feat}")

    print(f"\n5. User Behavior ({len(final_behavior)}):")
    for feat in final_behavior:
        print(f"   • {feat}")

    print(f"\n6. Geographic ({len(final_geographic)}):")
    for feat in final_geographic:
        print(f"   • {feat}")

    print(f"\n7. Security ({len(final_security)}):")
    for feat in final_security:
        print(f"   • {feat}")

    print(f"\n8. Interaction (Fraud Scenarios) ({len(final_interaction)}):")
    for feat in final_interaction:
        print(f"   • {feat}")

    print("\n" + "=" * 80)
    print("KEY DECISIONS:")
    print("=" * 80)
    print("✓ Local time > UTC time (better fraud signal)")
    print("✓ country_mismatch > individual country fields (more specific)")
    print("✓ security_score composite > individual flags (reduces dimensionality)")
    print("✓ Kept original avs/cvv/3ds for interpretability despite redundancy")
    print("✓ Scenario-specific interactions > generic combinations")
    print("✓ Excluded merchant_category (low predictive value)")

    print("\n" + "=" * 80)

    return {
        'numeric': final_numeric,
        'categorical': final_categorical,
        'temporal': final_temporal,
        'amount': final_amount,
        'behavior': final_behavior,
        'geographic': final_geographic,
        'security': final_security,
        'interaction': final_interaction,
        'all_features': all_final_features
    }


## Load data

In [None]:
download_data_csv(kaggle_source, data_dir, csv_file)
input_df = load_data(f"{data_dir}/{csv_file}", verbose=True)

In [None]:
input_df.head()

In [None]:
# No null values
input_df.info()

## Preprocessing
### Verify table grain

In [None]:
print(f"Every row is uniquely defined by transaction and user id columns: {len(input_df)==len(input_df.drop_duplicates(subset=id_cols))}")

### Target class balance
Class imbalance is examined before the train/validation/test split because the dataset is heavily imbalanced (only 2.2% fraud), so the split must be stratified to keep the fraud rate consistent across splits. Modeling will also need to account for the imbalance, for example with class weights and imbalance-aware metrics such as precision-recall or F1 rather than accuracy.
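
As a rough illustration of what such a check involves, the imbalance can be quantified with plain pandas. The `class_balance` helper and the toy frame below are hypothetical and are not the project's `analyze_target_stats`, whose implementation is not shown here.

```python
import pandas as pd

def class_balance(df: pd.DataFrame, target_col: str) -> dict:
    """Return class counts, the fraud rate, and the normal:fraud ratio."""
    counts = df[target_col].value_counts()
    n_pos = int(counts.get(1, 0))
    n_neg = int(counts.get(0, 0))
    return {
        "n_fraud": n_pos,
        "n_normal": n_neg,
        "fraud_rate": n_pos / len(df),
        "imbalance_ratio": n_neg / n_pos if n_pos else float("inf"),
    }

# Toy data (made up): 2 fraud rows out of 100
toy = pd.DataFrame({"is_fraud": [1] * 2 + [0] * 98})
stats = class_balance(toy, "is_fraud")
print(stats)
```

The ratio of normal to fraud rows is the figure usually quoted as "N:1" imbalance.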

In [None]:
analyze_target_stats(input_df, target_col)

### Convert date type

In [None]:
# Parse timestamps as UTC timezone-aware (naive values are treated as UTC; unparseable values become NaT)
input_df[date_feature] = pd.to_datetime(input_df[date_feature], utc=True, errors='coerce')
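
With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising, so it can be worth counting them after the conversion. A minimal sketch with made-up timestamps (the `raw` series is illustrative, not taken from the dataset):

```python
import pandas as pd

# Two valid naive timestamps and one value that cannot be parsed
raw = pd.Series(["2024-03-01 12:30:00", "2024-03-01 23:59:59", "not a timestamp"])

# utc=True localizes naive values to UTC; errors="coerce" maps bad values to NaT
parsed = pd.to_datetime(raw, utc=True, errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(f"Timezone: {parsed.dt.tz}, unparsed rows: {n_unparsed}")
```

If the count is non-zero on the real data, those rows need to be inspected before any time-based features are built.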

### Feature stats
Examine the distribution of categorical features and compute summary statistics for numerical features. Binary features (0/1 flags) are treated as categorical since they represent discrete states (see Setup, Define parameters section above).

In [None]:
analyze_feature_stats(input_df, id_cols, target_col, categorical_features, numeric_features)

### Train/Validation/Test Splits

In [None]:
train_df, val_df, test_df = split_train_val_test(
    input_df, 
    target_col=target_col,
    train_ratio=1 - val_ratio - test_ratio,
    val_ratio=val_ratio, 
    test_ratio=test_ratio,
    random_state=1,
    verbose=True
)
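
The internals of `split_train_val_test` are not shown in this notebook. As a sketch of one way a stratified three-way split could be done with pandas alone, the hypothetical `stratified_three_way_split` below samples within each target class so that every split keeps the same fraud rate. It assumes the frame has a unique index.

```python
import pandas as pd

def stratified_three_way_split(df, target_col, val_ratio=0.2, test_ratio=0.2, random_state=1):
    """Split df into train/val/test while preserving the target class proportions."""
    # Draw the test portion separately within each class
    test = df.groupby(target_col, group_keys=False).sample(
        frac=test_ratio, random_state=random_state
    )
    rest = df.drop(test.index)
    # val_ratio is relative to the full data set, so rescale it for the remainder
    val_frac = val_ratio / (1 - test_ratio)
    val = rest.groupby(target_col, group_keys=False).sample(
        frac=val_frac, random_state=random_state
    )
    train = rest.drop(val.index)
    return train, val, test

# Toy data (made up): 1,000 rows with a 5% fraud rate
toy = pd.DataFrame({"amount": range(1000), "is_fraud": [1] * 50 + [0] * 950})
train_toy, val_toy, test_toy = stratified_three_way_split(toy, "is_fraud")
for name, part in [("train", train_toy), ("val", val_toy), ("test", test_toy)]:
    print(f"{name}: {len(part)} rows, fraud rate {part['is_fraud'].mean():.3f}")
```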

## EDA
### Numeric features
#### Calculate baseline metrics
Define baseline fraud rate for comparison throughout EDA.

In [None]:
# Calculate baseline fraud rate from training set
baseline_fraud_rate = train_df[target_col].mean()
print(f"Baseline fraud rate: {baseline_fraud_rate:.4f} ({baseline_fraud_rate*100:.2f}%)")
print(f"This will be used as a reference point throughout the EDA.")

#### Visualize distributions of numeric features

In [None]:
plot_numeric_distributions(train_df, numeric_features)

#### Multicollinearity Detection (VIF)

In [None]:
vif_df = analyze_vif(train_df, numeric_features)
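
For reference, the variance inflation factor of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The sketch below computes it with NumPy least squares, including an intercept. `vif_table` is a hypothetical helper; the project's `analyze_vif` may be implemented differently.

```python
import numpy as np
import pandas as pd

def vif_table(df, features):
    """Compute VIF_j = 1 / (1 - R^2_j) for each feature, with an intercept in each regression."""
    X = df[features].to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(features):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        rows.append({"feature": name, "VIF": 1.0 / (1.0 - r2) if r2 < 1 else np.inf})
    return pd.DataFrame(rows).sort_values("VIF", ascending=False).reset_index(drop=True)

# Toy data (made up): "c" is nearly a copy of "a", "b" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
toy = pd.DataFrame({"a": a, "b": b, "c": a + 0.1 * rng.normal(size=1000)})
vif_toy = vif_table(toy, ["a", "b", "c"])
print(vif_toy)
```

A VIF above 10 is a common rule of thumb for severe multicollinearity, which is the cutoff the recommendation function in this notebook uses.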

### Bivariate Analysis: Features vs. Target
#### Calculate correlations with target

In [None]:
corr_df = analyze_correlations(train_df, numeric_features, target_col)
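
Because the target is a 0/1 flag, the Pearson correlation between a numeric feature and the target is the point-biserial correlation. A minimal sketch of such a ranking follows; `target_correlations` and the toy values are hypothetical, and the project's `analyze_correlations` may compute or present things differently.

```python
import pandas as pd

def target_correlations(df, features, target_col):
    """Pearson correlation of each feature with the target, sorted by absolute value."""
    corr = df[features].corrwith(df[target_col])
    order = corr.abs().sort_values(ascending=False).index
    return pd.DataFrame({"feature": order, "correlation": corr.loc[order].to_numpy()})

# Toy data (made up): fraud rows ship farther and come from younger accounts
toy = pd.DataFrame({
    "shipping_distance_km": [1, 2, 3, 10, 12],
    "account_age_days": [300, 200, 250, 5, 10],
    "is_fraud": [0, 0, 0, 1, 1],
})
corr_toy = target_correlations(toy, ["shipping_distance_km", "account_age_days"], "is_fraud")
print(corr_toy)
```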

#### Box plots: Compare feature distributions between fraud and non-fraud

In [None]:
plot_box_plots(train_df, numeric_features, target_col)

### Temporal Analysis
Analyze fraud patterns over time to identify temporal trends.

In [None]:
analyze_temporal_patterns(train_df, date_feature, target_col, baseline_fraud_rate)

### Categorical Features
#### Fraud Rates

In [None]:
analyze_categorical_fraud_rates(train_df, categorical_features, target_col)
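
A per-category fraud rate is a grouped mean of the target, and dividing by the overall rate gives a lift relative to baseline. The sketch below shows the idea on made-up data; `fraud_rate_by_category` is a hypothetical helper, not the project's `analyze_categorical_fraud_rates`.

```python
import pandas as pd

def fraud_rate_by_category(df, feature, target_col):
    """Row count, fraud rate, and lift versus the overall fraud rate for each category."""
    baseline = df[target_col].mean()
    out = df.groupby(feature)[target_col].agg(n="count", fraud_rate="mean").reset_index()
    out["lift_vs_baseline"] = out["fraud_rate"] / baseline
    return out.sort_values("fraud_rate", ascending=False).reset_index(drop=True)

# Toy data (made up): all fraud happens on the web channel
toy = pd.DataFrame({
    "channel": ["web"] * 4 + ["app"] * 4,
    "is_fraud": [1, 1, 0, 0, 0, 0, 0, 0],
})
rates_toy = fraud_rate_by_category(toy, "channel", "is_fraud")
print(rates_toy)
```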

#### Visualize fraud rates for categorical features

In [None]:
plot_categorical_fraud_rates(train_df, categorical_features, target_col, baseline_fraud_rate)

#### Calculate mutual information scores for categorical features

In [None]:
mi_df = analyze_mutual_information(train_df, categorical_features, target_col)
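
Mutual information between a categorical feature X and the target Y is the sum over all value pairs of p(x, y) · log(p(x, y) / (p(x) · p(y))). The sketch below is a plug-in estimate in nats from the empirical joint distribution. `mutual_information` is a hypothetical helper, and the project's `analyze_mutual_information` may use a different estimator.

```python
import numpy as np
import pandas as pd

def mutual_information(x: pd.Series, y: pd.Series) -> float:
    """Plug-in estimate of I(X; Y) in nats from the empirical joint distribution."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()  # p(x, y)
    px = joint.sum(axis=1, keepdims=True)                 # p(x), column vector
    py = joint.sum(axis=0, keepdims=True)                 # p(y), row vector
    nonzero = joint > 0                                   # skip 0 * log(0) terms
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())

# Toy data (made up): one feature unrelated to the target, one that determines it
target = pd.Series([0, 1, 0, 1])
mi_independent = mutual_information(pd.Series(["a", "a", "b", "b"]), target)
mi_dependent = mutual_information(pd.Series(["a", "b", "a", "b"]), target)
print(f"independent: {mi_independent:.4f}, fully dependent: {mi_dependent:.4f}")
```

For a balanced binary target, a feature that fully determines the label reaches the maximum of log 2 ≈ 0.693 nats, which gives a scale for reading small scores such as 0.01.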

### Initial Feature Selection Recommendations

In [None]:
print_feature_recommendations(corr_df, mi_df, vif_df, numeric_features, categorical_features)

## Feature Engineering

Apply feature engineering to create new predictive features. This includes:
- **Temporal features** (UTC and local timezone): hour, day_of_week, month, is_weekend, is_late_night, is_business_hours
- **Amount features**: amount_deviation, amount_vs_avg_ratio, is_micro_transaction, is_large_transaction
- **User behavior**: transaction_velocity, is_new_account, is_high_frequency_user
- **Geographic**: country_mismatch, high_risk_distance, zero_distance
- **Security**: security_score, verification_failures, all_verifications_passed, all_verifications_failed
- **Interactions**: high-risk combinations such as new_account_with_promo, late_night_micro_transaction, and high_value_long_distance

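Local timezone conversion approximates each transaction's local time using the timezone of the capital of the user's country, enabling better detection of unusual-hour fraud patterns. This is only an approximation for countries that span several timezones, such as the US.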
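
The conversion can be sketched as follows. This is not the project's `convert_utc_to_local_time`, whose implementation is not shown. `add_local_time` and the two-country `TZ_BY_COUNTRY` mapping are hypothetical, and the sketch assumes the timestamp column is already timezone-aware UTC.

```python
import pandas as pd

# Hypothetical two-country mapping, for illustration only
TZ_BY_COUNTRY = {"US": "America/New_York", "TR": "Europe/Istanbul"}

def add_local_time(df, time_col="transaction_time", country_col="country"):
    """Add naive local timestamps and the local hour, using one timezone per country."""
    out = df.copy()
    pieces = []
    for country, group in out.groupby(country_col):
        tz = TZ_BY_COUNTRY[country]
        # Convert from UTC to the country's zone, then drop the tz info so that
        # rows from different zones can share a single datetime column
        pieces.append(group[time_col].dt.tz_convert(tz).dt.tz_localize(None))
    out["local_time"] = pd.concat(pieces).reindex(out.index)
    out["hour_local"] = out["local_time"].dt.hour
    return out

# Toy data (made up): the same UTC instant seen from two countries
toy = pd.DataFrame({
    "transaction_time": pd.to_datetime(["2024-01-15 03:00:00", "2024-01-15 03:00:00"], utc=True),
    "country": ["US", "TR"],
})
local = add_local_time(toy)
print(local[["country", "local_time", "hour_local"]])
```

The same UTC hour (03:00) is late evening in New York but early morning in Istanbul, which is why a late-night flag computed on local time can differ from its UTC counterpart.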

### Apply to training set

In [None]:
train_fe, train_new_features = engineer_features(train_df, date_col=date_feature, country_col='country')

### Apply to validation and test sets

In [None]:
val_fe, _ = engineer_features(val_df, date_col=date_feature, country_col='country')
test_fe, _ = engineer_features(test_df, date_col=date_feature, country_col='country')
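
Note that the helper functions defined earlier compute their quantile thresholds from whichever frame they receive, so the validation and test flags are based on each split's own quantiles rather than on the training values. One possible way to keep the splits consistent is to fit the thresholds once on the training data and reuse them. The sketch below assumes hypothetical `fit_thresholds` and `apply_thresholds` helpers and covers only two of the quantile-based flags.

```python
import pandas as pd

def fit_thresholds(train: pd.DataFrame) -> dict:
    """Compute quantile thresholds once, on the training split only."""
    return {
        "amount_q95": float(train["amount"].quantile(0.95)),
        "distance_q75": float(train["shipping_distance_km"].quantile(0.75)),
    }

def apply_thresholds(df: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Create the flags on any split using the frozen training thresholds."""
    out = df.copy()
    out["is_large_transaction"] = (out["amount"] >= thresholds["amount_q95"]).astype(int)
    out["high_risk_distance"] = (
        out["shipping_distance_km"] > thresholds["distance_q75"]
    ).astype(int)
    return out

# Toy data (made up): a training split and a tiny validation split
train_toy = pd.DataFrame({"amount": range(1, 101), "shipping_distance_km": range(1, 101)})
val_toy = pd.DataFrame({"amount": [10, 20], "shipping_distance_km": [5, 80]})

thresholds = fit_thresholds(train_toy)
val_flags = apply_thresholds(val_toy, thresholds)
print(thresholds)
print(val_flags)
```

In the toy validation split, an amount of 20 would count as "large" against the split's own 95th percentile, but not against the training threshold.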

### Inspect engineered features

In [None]:
# Display new features created
print(f"New features created ({len(train_new_features)}):")
for i, feat in enumerate(train_new_features, 1):
    print(f"  {i:2d}. {feat}")

# Display sample rows with key engineered features
print("\n" + "=" * 80)
print("Sample of engineered features:")
print("=" * 80)
sample_features = ['transaction_time', 'country', 'hour', 'hour_local', 'is_late_night', 
                  'is_late_night_local', 'amount', 'amount_vs_avg_ratio', 
                  'country_mismatch', 'security_score', 'is_fraud']
display(train_fe[sample_features].head(10))

## Final Feature Selection

In [None]:
final_features = analyze_final_feature_selection(train_new_features)

### Store final feature lists for modeling

In [None]:
# Store final feature lists as variables for easy access in modeling
final_numeric_features = final_features['numeric']
final_categorical_features = final_features['categorical']
final_engineered_features = (
    final_features['temporal'] + 
    final_features['amount'] + 
    final_features['behavior'] + 
    final_features['geographic'] + 
    final_features['security'] + 
    final_features['interaction']
)
final_all_features = final_features['all_features']

print(f"Final feature count: {len(final_all_features)}")
print(f"  • Numeric: {len(final_numeric_features)}")
print(f"  • Categorical: {len(final_categorical_features)}")
print(f"  • Engineered: {len(final_engineered_features)}")

## Save Feature Configuration

In [None]:
# Create and save feature configuration for deployment
# This config stores training-time statistics (quantile thresholds) needed for inference
feature_config = FeatureConfig.from_training_data(train_fe)
feature_config.save("../models/transformer_config.json")

print("✓ Saved feature configuration for deployment:")
print(f"  • models/transformer_config.json")
print(f"\nTraining-time statistics (for inference):")
print(f"  • Amount threshold (95th): ${feature_config.amount_95th_percentile:.2f}")
print(f"  • Transaction threshold (75th): {feature_config.total_transactions_75th_percentile:.0f} transactions")
print(f"  • Distance threshold (75th): {feature_config.shipping_distance_75th_percentile:.2f} km")
print(f"  • Timezone mappings: {len(feature_config.timezone_mapping)} countries")
print(f"  • Final features: {len(feature_config.final_features)} features")
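
The implementation of `FeatureConfig` lives in `src/deployment/preprocessing` and is not shown in this notebook. As a rough sketch of the general pattern (persist training-time statistics as JSON, then reload them at inference), a hypothetical `ThresholdConfig` dataclass could look like the following. The class name, its methods, and the numeric values are all assumptions made for illustration.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass
class ThresholdConfig:
    """Hypothetical container for training-time statistics needed at inference."""
    amount_95th_percentile: float
    total_transactions_75th_percentile: float
    shipping_distance_75th_percentile: float

    def save(self, path) -> None:
        # Serialize the fields as human-readable JSON
        Path(path).write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path) -> "ThresholdConfig":
        # Rebuild the config from the JSON written by save()
        return cls(**json.loads(Path(path).read_text()))

# Round trip through a temporary file (values are made up)
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "transformer_config.json"
    original = ThresholdConfig(250.0, 55.0, 480.5)
    original.save(cfg_path)
    restored = ThresholdConfig.load(cfg_path)

print(restored)
```

Keeping the statistics in a plain JSON file makes them easy to inspect and to version alongside the model artifacts.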