# Preprocessing Pipeline - Experiment Documentation

## Class Imbalance Handling Strategy

This notebook handles the severe class imbalance (0.58% fraud rate) through **cost-sensitive learning** rather than synthetic oversampling.

---

## Experiments Conducted

### Experiment 1: SMOTE Oversampling (REJECTED)

**Status:** Tested and rejected - code preserved below (commented out)

**Configuration Tested:**
```python
# SMOTE(sampling_strategy=0.1, k_neighbors=5, random_state=42)
# Result: 9.09% fraud rate in training data
```

**Results:**
- Validation F1: 0.53
- Test F1: **0.008** (catastrophic failure)
- Reason: Distribution shift between synthetic training data and real test data
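
The 9.09% figure in the configuration comment follows from how imbalanced-learn interprets a float `sampling_strategy`: it is the target ratio of minority to majority samples after resampling, not the minority share of the whole dataset. A quick check of that arithmetic (the helper name is illustrative and not part of the notebook):

```python
def minority_share(sampling_strategy: float) -> float:
    """Convert a minority/majority ratio into the minority share of all rows."""
    return sampling_strategy / (1.0 + sampling_strategy)

share = minority_share(0.1)
print(f"{share:.2%}")  # 9.09%
```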

---

### Experiment 2: SMOTE Variants Testing (ALL REJECTED)

We tested multiple SMOTE variants and sampling strategies:

| Variant | Sampling | Test F1 | Verdict |
|---------|----------|---------|---------|
| No SMOTE (Baseline) | 0.58% | **0.290** | **BEST** |
| SMOTE | 1% | 0.141 | Rejected |
| SMOTE | 2% | 0.248 | Rejected |
| SMOTE | 5% | 0.196 | Rejected |
| SMOTE | 10% | 0.183 | Rejected |
| BorderlineSMOTE | 5% | 0.095 | Rejected |
| ADASYN | 5% | 0.088 | Rejected |
| SMOTEENN | 5% | 0.265 | Rejected |
| SMOTETomek | 5% | 0.182 | Rejected |

**Conclusion:** Every oversampling configuration scored below the no-SMOTE baseline on the test set (best variant: SMOTEENN at 0.265, versus 0.290 for the baseline).

---

### Experiment 3: Natural Distribution + Class Weights (ADOPTED)

**Current Implementation:**

- Training data uses **natural distribution** (0.58% fraud)
- Class imbalance handled via:
  - **XGBoost:** `scale_pos_weight = (1 - fraud_rate) / fraud_rate`
  - **Deep Learning:** Focal Loss or Weighted BCE with `pos_weight`
- Threshold optimized on validation set using PR curve
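
As a rough illustration of the XGBoost weighting formula above, the ratio can be computed directly from class counts. The counts used here are the full-checkpoint totals printed by the data-loading cell of this notebook; in practice the weight would presumably be derived from the training split only, and the helper name is hypothetical.

```python
def scale_pos_weight_from_counts(n_negative: int, n_positive: int) -> float:
    """(1 - fraud_rate) / fraud_rate, which simplifies to n_negative / n_positive."""
    if n_positive <= 0:
        raise ValueError("need at least one positive sample")
    total = n_negative + n_positive
    fraud_rate = n_positive / total
    return (1.0 - fraud_rate) / fraud_rate

# Counts taken from the checkpoint summary (assumption: used only for illustration)
weight = scale_pos_weight_from_counts(n_negative=1_289_169, n_positive=7_506)
print(f"scale_pos_weight ≈ {weight:.1f}")
```

With a fraud rate near 0.58%, each fraud example is weighted roughly 170 times more than a legitimate one, which is how the natural distribution can be kept without resampling.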

**Results:**
- Test F1: 0.29 (XGBoost), 0.37 (Deep Learning)
- Test ROC-AUC: 0.76 - 0.80
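
The threshold optimization step listed under the current implementation can be sketched without any ML library: sweep candidate thresholds over validation scores and keep the one with the highest F1. This is a minimal sketch on made-up scores; the notebook's actual procedure (for instance, one built on a PR-curve utility) may differ.

```python
from typing import List, Tuple

def best_f1_threshold(y_true: List[int], scores: List[float]) -> Tuple[float, float]:
    """Return (threshold, f1) maximizing F1, predicting positive when score >= threshold."""
    best_thr, best_f1 = 0.5, -1.0
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < thr and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

# Toy validation set (illustrative values only)
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
s_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
thr, f1 = best_f1_threshold(y_val, s_val)
print(thr, round(f1, 3))
```

The threshold is chosen on validation data only, so the test set stays untouched until final evaluation.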

---

## Why SMOTE Failed

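1. **Temporal Distribution Shift:** The test data comes from a different time period and shows different fraud patterns (its fraud rate is 0.39%, versus 0.58% in the training data)
2. **Feature Space Interpolation:** Synthetic samples are interpolated between existing fraud cases and do not capture real fraud behavior
3. **Concept Drift:** Fraud patterns evolve, so synthetic samples built from historical patterns do not transfer to later data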
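
To make the interpolation point concrete: SMOTE places each synthetic sample on the line segment between a minority sample and one of its minority-class nearest neighbours. The sketch below shows only that interpolation step on made-up 2-D points (the neighbour search is omitted); it is not the notebook's commented-out experiment code.

```python
import random
from typing import List

def smote_interpolate(x: List[float], neighbor: List[float], rng: random.Random) -> List[float]:
    """Return a synthetic point x + lam * (neighbor - x) with lam drawn from [0, 1)."""
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
fraud_a = [120.0, 23.0]   # hypothetical (amount, hour) of one fraud transaction
fraud_b = [900.0, 2.0]    # hypothetical nearest fraud neighbour
synthetic = smote_interpolate(fraud_a, fraud_b, rng)
print(synthetic)
```

In this toy case, a point between a 23:00 and a 02:00 transaction can land in the middle of the day, a pattern that neither real transaction exhibits.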

---

## Code Organization

The SMOTE experiment code is preserved below (Section 6) but **commented out**. The active code uses natural distribution with class weights.



# Local Setup Instructions

## Prerequisites Checklist

Before running this notebook, ensure you have completed the following setup:

- [ ] **Java 11 installed** and `JAVA_HOME` configured
  - macOS: `brew install openjdk@11`
  - Set: `export JAVA_HOME=$(/usr/libexec/java_home -v 11)`
- [ ] **Conda environment `fraud-shield` created and activated**
  - Create: `conda env create -f environment.yml`
  - Activate: `conda activate fraud-shield`
- [ ] **PySpark verified working**
  - Test: `python -c "from pyspark.sql import SparkSession; print('PySpark OK')"`
- [ ] **Data directories created**
  - `data/checkpoints/` - for EDA checkpoints
  - `data/processed/` - for preprocessed data
  - `models/` - for saved preprocessors
- [ ] **EDA checkpoint available**
  - Run `01-local-fraud-detection-eda.ipynb` first
  - Checkpoint should exist: `data/checkpoints/section11_enriched_features.parquet`

## Environment Activation

```bash
conda activate fraud-shield
```

## Checkpoint Requirements

This notebook requires the Section 11 checkpoint from the EDA notebook:
- `data/checkpoints/section11_enriched_features.parquet`

**Checkpoint & Parquet Update Behavior:** Processed parquet files (train/val/test) and saved pipelines are updated when the notebook runs the relevant sections and the save path executes. They do not auto-update on a schedule. If you change the EDA checkpoint or preprocessing code, re-run the affected sections to refresh outputs.

**Note:** This is a local execution version configured for the `fraud-shield` conda environment on your local machine.

# Spark MLlib Preprocessing Pipeline for Fraud Detection

**Notebook:** 02-local-preprocessing.ipynb  
**Objective:** Production-grade preprocessing using Spark MLlib for scalability

**Architecture:**
- Training data: Loaded from EDA checkpoint as Spark DataFrame
- Test data: Loaded separately from `fraudTest.csv` (never used for fitting)
- Preprocessing: Spark MLlib Pipeline (StringIndexer, VectorAssembler, StandardScaler, Imputer)
- Class Imbalance: Handled via cost-sensitive learning (NOT SMOTE - see experiment documentation)
- Output: Both Spark MLlib and sklearn pipelines saved for flexibility

**Data Leakage Safeguards:**
1. Test data loaded separately, never used for fitting
2. Preprocessor fitted only on training data
3. Natural data distribution preserved (no synthetic samples)
4. Time-aware split prevents future data leakage
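
Safeguard 2 can be illustrated with a small, library-free sketch: imputation and scaling statistics are computed from training values only and then reused unchanged on validation or test values. The class name and data are hypothetical; the notebook itself relies on Spark MLlib's `Imputer` and `StandardScaler`.

```python
import statistics
from typing import List, Optional

class TrainOnlyScaler:
    """Mean-impute then standardize, using statistics learned from training data."""

    def fit(self, values: List[Optional[float]]) -> "TrainOnlyScaler":
        observed = [v for v in values if v is not None]
        self.mean_ = statistics.mean(observed)
        self.std_ = statistics.stdev(observed)  # sample std (n - 1) of observed values
        return self

    def transform(self, values: List[Optional[float]]) -> List[float]:
        # Missing values are filled with the TRAINING mean, never a test statistic
        filled = [self.mean_ if v is None else v for v in values]
        return [(v - self.mean_) / self.std_ for v in filled]

train_amounts = [10.0, 20.0, 30.0]
test_amounts = [None, 50.0]

scaler = TrainOnlyScaler().fit(train_amounts)   # test values are never seen here
scaled_test = scaler.transform(test_amounts)
print(scaled_test)
```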

**Class Imbalance Strategy (FINAL):**
- **SMOTE:** Tested and rejected (see experiment log above)
- **Adopted:** Natural distribution + class weights (scale_pos_weight, Focal Loss)
- **Feature Selection:** Critical + High Priority + Interaction + Enriched features
- **Time-aware Split:** train = first 80% of period, val = next 10%; test = separate file

In [1]:
# ============================================================
# GLOBAL IMPORTS & DEPENDENCIES
# ============================================================

import os
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Data Processing
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

# PySpark & Spark MLlib
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.utils import AnalysisException
from pyspark.sql.window import Window
from pyspark.sql.functions import (
    col, to_timestamp, when, isnan, isnull, min as spark_min, max as spark_max,
    dayofweek, hour, month, datediff, lit, sum as spark_sum, count, floor,
    broadcast, first, trim,
)
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, Imputer
from pyspark.ml.functions import vector_to_array
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructType, StructField, IntegerType

# Sklearn (for pipeline conversion and inference)
from sklearn.preprocessing import StandardScaler as SklearnStandardScaler
from sklearn.preprocessing import LabelEncoder as SklearnLabelEncoder
from sklearn.impute import SimpleImputer as SklearnSimpleImputer
from sklearn.pipeline import Pipeline as SklearnPipeline
from sklearn.compose import ColumnTransformer

# Utilities
import pickle
import joblib
from datetime import datetime

print("All dependencies loaded successfully")

All dependencies loaded successfully


In [2]:
# ============================================================
# SPARK MLlib PREPROCESSOR CLASS
# ============================================================

class SparkMLlibPreprocessor:
    """
    Production-grade preprocessing using Spark MLlib Pipeline.
    Handles categorical encoding, feature assembly, scaling, and imputation.
    """
    
    def __init__(self, feature_names: List[str]):
        self.feature_names = feature_names
        self.categorical_features: List[str] = []
        self.numerical_features: List[str] = []
        self.pipeline: Optional[Pipeline] = None
        self.pipeline_model = None  # fitted PipelineModel, set by fit()
        self.is_fitted = False
    
    def _identify_feature_types(self, spark_df: DataFrame) -> None:
        """Identify categorical vs numerical features from Spark DataFrame schema."""
        self.categorical_features = []
        self.numerical_features = []
        
        missing = [f for f in self.feature_names if f not in spark_df.columns]
        if missing:
            print(f"  WARNING: Features not in DataFrame (skipped for type inference): {missing}")
        
        schema = spark_df.schema
        for feat in self.feature_names:
            if feat not in spark_df.columns:
                continue
            
            field = schema[feat]
            field_type = str(field.dataType)
            
            # Check if categorical (string type only)
            # Binary and low-cardinality integers should be treated as numerical
            # to preserve their numeric values through scaling
            if 'StringType' in field_type or 'String' in field_type:
                self.categorical_features.append(feat)
            else:
                # All numeric types (including binary 0/1) are numerical
                self.numerical_features.append(feat)
    
    def fit(self, spark_df: DataFrame) -> 'SparkMLlibPreprocessor':
        """
        Fit Spark MLlib pipeline on training data.
        
        Args:
            spark_df: Training Spark DataFrame
        """
        self._identify_feature_types(spark_df)
        
        stages = []
        indexed_categorical = []
        imputed_numerical = []
        
        # StringIndexer for categorical features
        for feat in self.categorical_features:
            indexer = StringIndexer(
                inputCol=feat,
                outputCol=f"{feat}_indexed",
                handleInvalid="keep"
            )
            stages.append(indexer)
            indexed_categorical.append(f"{feat}_indexed")
        
        # Imputer for numerical features (handle missing values)
        if len(self.numerical_features) > 0:
            imputer = Imputer(
                inputCols=self.numerical_features,
                outputCols=[f"{f}_imputed" for f in self.numerical_features],
                strategy="mean"
            )
            stages.append(imputer)
            imputed_numerical = [f"{f}_imputed" for f in self.numerical_features]
        
        # VectorAssembler to combine all features
        assembler_inputs = indexed_categorical + imputed_numerical
        assembler = VectorAssembler(
            inputCols=assembler_inputs,
            outputCol="features_raw",
            handleInvalid="skip"
        )
        stages.append(assembler)
        
        # StandardScaler for feature scaling
        scaler = StandardScaler(
            inputCol="features_raw",
            outputCol="features",
            withMean=True,
            withStd=True
        )
        stages.append(scaler)
        
        # Create and fit pipeline
        self.pipeline = Pipeline(stages=stages)
        self.pipeline_model = self.pipeline.fit(spark_df)
        self.is_fitted = True
        
        return self
    
    def transform(self, spark_df: DataFrame) -> DataFrame:
        """
        Transform data using fitted pipeline.
        
        Args:
            spark_df: Spark DataFrame to transform
        
        Returns:
            Transformed Spark DataFrame with 'features' column
        """
        if not self.is_fitted:
            raise ValueError("Preprocessor must be fitted before transform")
        
        return self.pipeline_model.transform(spark_df)
    
    def get_feature_names(self) -> List[str]:
        """Get list of feature names."""
        return self.feature_names.copy()

## Section 0: Setup & Configuration

Configure project paths, directories, and Spark session. All paths are resolved relative to the project root for consistency.

In [3]:
# ============================================================
# CONFIGURATION & PATHS
# ============================================================

NOTEBOOK_DIR = Path.cwd()
if NOTEBOOK_DIR.name == "local_notebooks":
    PROJECT_ROOT = NOTEBOOK_DIR.parent
else:
    PROJECT_ROOT = NOTEBOOK_DIR

os.chdir(PROJECT_ROOT)

DATA_DIR = PROJECT_ROOT / "data"
CHECKPOINT_DIR = DATA_DIR / "checkpoints"
INPUT_DIR = DATA_DIR / "input"
MODELS_DIR = PROJECT_ROOT / "models"
PROCESSED_DATA_DIR = DATA_DIR / "processed"

# parents=True so the call also works when data/ does not exist yet
MODELS_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Data paths
CHECKPOINT_SECTION11 = CHECKPOINT_DIR / 'section11_enriched_features.parquet'
TEST_DATA_PATH = INPUT_DIR / 'fraudTest.csv'

# Output paths
PREPROCESSED_TRAIN_PATH = PROCESSED_DATA_DIR / 'train_preprocessed.parquet'
PREPROCESSED_VAL_PATH = PROCESSED_DATA_DIR / 'val_preprocessed.parquet'
PREPROCESSED_TEST_PATH = PROCESSED_DATA_DIR / 'test_preprocessed.parquet'
SPARK_PREPROCESSER_PATH = MODELS_DIR / 'spark_preprocessor.pkl'
SKLEARN_PREPROCESSER_PATH = MODELS_DIR / 'sklearn_preprocessor.pkl'
FEATURE_NAMES_PATH = MODELS_DIR / 'feature_names.pkl'

print(f"Project root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")
print(f"Checkpoint directory: {CHECKPOINT_DIR}")
print(f"Input directory: {INPUT_DIR}")
print(f"Models directory: {MODELS_DIR}")
print(f"Processed data directory: {PROCESSED_DATA_DIR}")

Project root: /home/alireza/Desktop/projects/fraud-shield-ai
Data directory: /home/alireza/Desktop/projects/fraud-shield-ai/data
Checkpoint directory: /home/alireza/Desktop/projects/fraud-shield-ai/data/checkpoints
Input directory: /home/alireza/Desktop/projects/fraud-shield-ai/data/input
Models directory: /home/alireza/Desktop/projects/fraud-shield-ai/models
Processed data directory: /home/alireza/Desktop/projects/fraud-shield-ai/data/processed


In [4]:
python_exec = sys.executable
print(python_exec)

/home/alireza/anaconda3/envs/fraud-shield/bin/python


In [5]:
# ============================================================
# SET JAVA_HOME FROM CONDA ENVIRONMENT
# ============================================================

# Stop any existing Spark session first
try:
    spark_existing = SparkSession.getActiveSession()
    if spark_existing:
        print("Stopping existing Spark session...")
        spark_existing.stop()
except Exception:
    pass

# Auto-detect JAVA_HOME from conda environment
def _detect_java_home() -> Tuple[str, str]:
    """Return (java_home, source). Raise RuntimeError if Java cannot be found."""
    python_exec = sys.executable
    if 'envs' in python_exec:
        # Method 1: Try conda environment (derived from the sys.executable path)
        env_path = python_exec.split('envs/')[0] + 'envs/' + python_exec.split('envs/')[1].split('/')[0]
        if os.path.exists(os.path.join(env_path, 'bin', 'java')):
            return env_path, "conda environment"
        # Method 2: Try CONDA_PREFIX if available
        conda_prefix = os.environ.get('CONDA_PREFIX')
        if conda_prefix and os.path.exists(os.path.join(conda_prefix, 'bin', 'java')):
            return conda_prefix, "CONDA_PREFIX"
    # Method 3: Find Java via which (also used when not running inside conda)
    java_path = shutil.which('java')
    if java_path and os.path.exists(java_path):
        return os.path.dirname(os.path.dirname(java_path)), "system Java"
    raise RuntimeError(
        "Java not found. Please install Java 11:\n"
        "  conda install -c conda-forge openjdk=11\n"
        "Then restart the Jupyter kernel."
    )

if not os.environ.get('JAVA_HOME'):
    java_home, java_source = _detect_java_home()
    os.environ['JAVA_HOME'] = java_home
    print(f"✓ JAVA_HOME set from {java_source}: {java_home}")
else:
    print(f"✓ JAVA_HOME already set: {os.environ['JAVA_HOME']}")

# Verify Java is accessible
java_home = os.environ.get('JAVA_HOME')
if java_home:
    java_exe = os.path.join(java_home, 'bin', 'java')
    if not os.path.exists(java_exe):
        print(f"⚠ Warning: Java executable not found at {java_exe}")
    else:
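        # Test Java version. `java -version` writes to stderr, so merge it into stdout.
        # (capture_output=True cannot be combined with an explicit stderr argument;
        # doing so raises ValueError, which previously caused the check to be skipped.)
        try:
            result = subprocess.run(
                [java_exe, '-version'],
                stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                text=True, timeout=5
            )
            version_lines = result.stdout.splitlines()
            print(f"✓ Java verified: {version_lines[0] if version_lines else 'Java is working'}")
        except (OSError, subprocess.SubprocessError):
            print("✓ Java path set (version check skipped)")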

# ============================================================
# INITIALIZE SPARK SESSION
# ============================================================

spark = SparkSession.builder \
    .appName("FraudDetectionPreprocessing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

print("Spark session initialized with optimized memory settings")
print(f"Spark version: {spark.version}")

✓ JAVA_HOME already set: /home/alireza/anaconda3/envs/fraud-shield/lib/jvm
✓ Java path set (version check skipped)


26/02/06 15:27:34 WARN Utils: Your hostname, zanganeh-ai resolves to a loopback address: 127.0.1.1; using 192.168.86.248 instead (on interface wlp129s0)
26/02/06 15:27:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


26/02/06 15:27:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark session initialized with optimized memory settings
Spark version: 3.5.0


In [6]:
# ============================================================
# TIMEZONE PIPELINE SETUP
# ============================================================

# Try to import from scripts (local/production)
try:
    if Path.cwd().name == "local_notebooks":
        sys.path.insert(0, str(Path.cwd().parent))
    from scripts.timezone_pipeline import TimezonePipeline
    print("✓ Imported timezone pipeline from scripts/")
except ImportError:
    # Fallback for Kaggle/Colab: paste the TimezonePipeline class definition here.
    # Nothing is defined inline yet, so mark the class as unavailable
    # instead of reporting success.
    TimezonePipeline = None
    print("⚠️ Could not import from scripts/ and no inline definition is provided")

# Build ZIP reference table (same as EDA notebook)
GRID_SIZE = 0.5  # ~50km grid resolution
full_zip_path = os.path.join(INPUT_DIR, "uszips.csv")

if TimezonePipeline is None:
    print("⚠️ Warning: TimezonePipeline is unavailable; timezone conversion will be skipped.")
    timezone_pipeline = None
elif os.path.exists(full_zip_path):
    zip_ref_df = (
        spark.read.csv(full_zip_path, header=True, inferSchema=True)
        .withColumnRenamed("zip", "zip_ref")
        .withColumn("lat_grid", floor(col("lat") / GRID_SIZE))
        .withColumn("lng_grid", floor(col("lng") / GRID_SIZE))
        .select("lat_grid", "lng_grid", "timezone")
        .filter(
            (trim(col("timezone")) != "") &
            (col("timezone") != "FALSE") &
            col("timezone").rlike("^[A-Za-z_/]+$")
        )
        .distinct()
        .groupBy("lat_grid", "lng_grid")
        .agg(first("timezone").alias("timezone"))
        .cache()
    )
    
    print(f"✓ ZIP reference table created: {zip_ref_df.count():,} unique grid cells")
    
    # Create TimezonePipeline instance
    timezone_pipeline = TimezonePipeline(
        zip_ref_df=zip_ref_df,
        grid_size=GRID_SIZE
    )
    print("✓ TimezonePipeline instance created")
else:
    print(f"⚠️ Warning: uszips.csv not found at {full_zip_path}")
    print("  Timezone conversion will be skipped. Ensure EDA notebook has been run first.")
    timezone_pipeline = None

✓ Imported timezone pipeline from scripts/


✓ ZIP reference table created: 3,197 unique grid cells
✓ TimezonePipeline instance created


## Section 1: Data Loading

Load training data from EDA checkpoint and test data from CSV. These are kept separate to ensure test data is never used for fitting, preventing data leakage.

In [7]:
# ============================================================
# DATA LOADING FUNCTIONS
# ============================================================

def load_training_data() -> DataFrame:
    """
    Load training data from EDA checkpoint as Spark DataFrame.
    
    Returns:
        Spark DataFrame with all engineered features from EDA notebook
    """
    if not CHECKPOINT_SECTION11.exists():
        raise FileNotFoundError(
            f"EDA checkpoint not found: {CHECKPOINT_SECTION11}\n"
            "Please run the EDA notebook (01-local-fraud-detection-eda.ipynb) first."
        )
    
    # Defensive: avoid stale file listings if the checkpoint folder is regenerated.
    # If the underlying parquet parts change, Spark can keep a stale reference and crash with SparkFileNotFoundException.
    spark.catalog.clearCache()
    try:
        spark.catalog.refreshByPath(str(CHECKPOINT_SECTION11))
    except (AnalysisException, AttributeError):
        pass

    print(f"Loading training data from checkpoint: {CHECKPOINT_SECTION11}")
    # Cache + materialize once so later actions don't depend on the source files.
    spark_df = spark.read.parquet(str(CHECKPOINT_SECTION11)).cache()
    total_rows = spark_df.count()

    print("✓ Training data loaded successfully")
    print(f"  Rows: {total_rows:,}")
    print(f"  Columns: {len(spark_df.columns)}")

    fraud_counts = spark_df.groupBy("is_fraud").count().collect()
    fraud_yes = spark_df.filter(col("is_fraud") == 1).count()
    fraud_rate = fraud_yes / total_rows if total_rows else 0.0
    
    print(f"\n  Target distribution:")
    for row in fraud_counts:
        print(f"    is_fraud = {'No' if row['is_fraud']==0 else 'Yes'}: {row['count']:,}")
    print(f"  Fraud rate: {fraud_rate:.4%}")
    
    return spark_df


def load_test_data() -> DataFrame:
    """
    Load test data from CSV as Spark DataFrame.
    This data is never used for fitting to prevent data leakage.
    
    Returns:
        Spark DataFrame with test data
    """
    if not TEST_DATA_PATH.exists():
        raise FileNotFoundError(
            f"Test data not found: {TEST_DATA_PATH}\n"
            "Please ensure fraudTest.csv is in the data/input/ directory."
        )
    
    print(f"Loading test data from: {TEST_DATA_PATH}")
    spark_df = spark.read.csv(
        str(TEST_DATA_PATH),
        header=True,
        inferSchema=True
    )
    
    # Count once and reuse; the CSV is not cached, so each count() rescans the file
    total_rows = spark_df.count()
    print("✓ Test data loaded successfully")
    print(f"  Rows: {total_rows:,}")
    print(f"  Columns: {len(spark_df.columns)}")
    
    if "is_fraud" in spark_df.columns:
        fraud_counts = spark_df.groupBy("is_fraud").count().collect()
        fraud_yes = spark_df.filter(col("is_fraud") == 1).count()
        fraud_rate = fraud_yes / total_rows if total_rows else 0.0
        
        print("\n  Target distribution:")
        for row in fraud_counts:
            print(f"    is_fraud = {'No' if row['is_fraud']==0 else 'Yes'}: {row['count']:,}")
        print(f"  Fraud rate: {fraud_rate:.4%}")
    
    return spark_df


# Load data
train_df = load_training_data()
test_df_raw = load_test_data()

# ============================================================
# CREATE MISSING TEMPORAL FEATURES
# ============================================================

# Create day_of_week from merchant_local_time if missing
if "day_of_week" not in train_df.columns and "merchant_local_time" in train_df.columns:
    train_df = train_df.withColumn(
        "day_of_week",
        dayofweek(to_timestamp(col("merchant_local_time")))
    )
    print("✓ Created day_of_week from merchant_local_time")

# Create is_peak_fraud_day from day_of_week if missing
if "is_peak_fraud_day" not in train_df.columns and "day_of_week" in train_df.columns:
    train_df = train_df.withColumn(
        "is_peak_fraud_day",
        when(col("day_of_week").isin([4, 5, 6]), 1).otherwise(0)  # Wed=4, Thu=5, Fri=6
    )
    print("✓ Created is_peak_fraud_day (Wednesday-Friday)")

Loading training data from checkpoint: /home/alireza/Desktop/projects/fraud-shield-ai/data/checkpoints/section11_enriched_features.parquet


26/02/06 15:27:38 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

✓ Training data loaded successfully
  Rows: 1,296,675
  Columns: 57



  Target distribution:
    is_fraud = Yes: 7,506
    is_fraud = No: 1,289,169
  Fraud rate: 0.5789%
Loading test data from: /home/alireza/Desktop/projects/fraud-shield-ai/data/input/fraudTest.csv


✓ Test data loaded successfully
  Rows: 555,719
  Columns: 23



  Target distribution:
    is_fraud = Yes: 2,145
    is_fraud = No: 553,574
  Fraud rate: 0.3860%


## Section 2: Feature Selection

Select features based on EDA findings. Features are categorized by priority: Critical, High Priority, Interaction, and Enriched features.

In [8]:
# ============================================================
# FEATURE SELECTION
# ============================================================

CRITICAL_FEATURES = [
    'transaction_count_bin',
    'card_age_bin',
    'hour',
    'time_bin',
    'is_peak_fraud_hour',
    'is_new_card',
    'is_low_volume_card'
]

HIGH_PRIORITY_FEATURES = [
    'category',
    'day_of_week',
    'month',
    'is_peak_fraud_day',
    'is_peak_fraud_season',
    'is_high_risk_category',
    'card_age_days',
    'transaction_count'
]

INTERACTION_FEATURES = [
    'evening_high_amount',
    'evening_online_shopping',
    'large_city_evening',
    'new_card_evening',
    'high_amount_online'
]

ENRICHED_FEATURES = [
    'temporal_risk_score',
    'geographic_risk_score',
    'card_risk_score',
    'risk_tier'
]

# ============================================================
# EXPERIMENT 2: NEW INDUSTRY-STANDARD FEATURES
# NOTE: These features require recomputation for test data.
#       Temporarily disabled to validate core SMOTE fix first.
#       TODO: Enable after adding feature computation in engineer_test_features()
# ============================================================
# EXPERIMENT2_FEATURES = [
#     'customer_merchant_distance_km',  # Distance between customer and merchant
#     'txn_count_last_1h',              # Velocity: transactions in last hour
#     'amt_relative_to_avg'             # Amount deviation from user's historical average
# ]
EXPERIMENT2_FEATURES = []  # Disabled for Experiment 2a (SMOTE fix validation)

# Combine all feature groups
SELECTED_FEATURES = (
    CRITICAL_FEATURES + 
    HIGH_PRIORITY_FEATURES + 
    INTERACTION_FEATURES + 
    ENRICHED_FEATURES +
    EXPERIMENT2_FEATURES
)

# Filter to only features that exist in the Spark DataFrame
available_features = [f for f in SELECTED_FEATURES if f in train_df.columns]
missing_features = [f for f in SELECTED_FEATURES if f not in train_df.columns]

print("Feature Selection Summary:")
print(f"  Critical features: {len([f for f in CRITICAL_FEATURES if f in train_df.columns])}/{len(CRITICAL_FEATURES)}")
print(f"  High priority features: {len([f for f in HIGH_PRIORITY_FEATURES if f in train_df.columns])}/{len(HIGH_PRIORITY_FEATURES)}")
print(f"  Interaction features: {len([f for f in INTERACTION_FEATURES if f in train_df.columns])}/{len(INTERACTION_FEATURES)}")
print(f"  Enriched features: {len([f for f in ENRICHED_FEATURES if f in train_df.columns])}/{len(ENRICHED_FEATURES)}")
print(f"  Experiment 2 features: {len([f for f in EXPERIMENT2_FEATURES if f in train_df.columns])}/{len(EXPERIMENT2_FEATURES)} (disabled for now)")
print(f"\n  Total selected: {len(available_features)} features")

if missing_features:
    print(f"\n  ⚠ Missing features (will be skipped): {missing_features}")

feature_names = available_features.copy()


Feature Selection Summary:
  Critical features: 7/7
  High priority features: 8/8
  Interaction features: 5/5
  Enriched features: 4/4
  Experiment 2 features: 0/0 (disabled for now)

  Total selected: 24 features


## Section 3: Train/Validation Split

Perform time-aware splitting to respect temporal order. This prevents future data leakage and ensures realistic model evaluation.
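
The idea can be sketched in plain Python, independent of Spark: cutoffs are placed at fixed fractions of the observed date span, and rows are assigned by timestamp rather than at random. The function name and dates are illustrative; the 80%/90% cutoffs mirror the split described in the notebook overview.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

def time_cutoffs(timestamps: List[datetime], train_frac: float = 0.8,
                 val_frac: float = 0.9) -> Tuple[datetime, datetime]:
    """Return (train_end, val_end) placed at fractions of the whole-day span of the data."""
    if not 0 < train_frac < val_frac <= 1.0:
        raise ValueError("Require 0 < train_frac < val_frac <= 1.0")
    start, end = min(timestamps), max(timestamps)
    span_days = (end - start).days
    train_end = start + timedelta(days=int(span_days * train_frac))
    val_end = start + timedelta(days=int(span_days * val_frac))
    return train_end, val_end

# Hypothetical data: one transaction per day for 101 days
first_day = datetime(2020, 1, 1)
stamps = [first_day + timedelta(days=d) for d in range(101)]
train_end, val_end = time_cutoffs(stamps)

train = [t for t in stamps if t <= train_end]
val = [t for t in stamps if train_end < t <= val_end]
print(len(train), len(val))
```

Every validation timestamp is strictly later than every training timestamp, which is what prevents information from the future leaking into model fitting. In this sketch, rows after `val_end` fall into neither split.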

In [9]:
# ============================================================
# TIME-AWARE DATA SPLITTING (Spark SQL)
# ============================================================

def split_data_time_aware(
    spark_df: DataFrame,
    date_col: str = 'merchant_local_time',
    train_frac: float = 0.8,
    val_frac: float = 0.9
) -> Tuple[DataFrame, DataFrame]:
    """
    Split training data into train and validation sets using time-aware split.
    Split is derived from the actual date range in the data (no hardcoded dates).
    Test data is loaded separately and never used for fitting.
    
    Uses merchant_local_time (timezone-aware) by default, as created in EDA notebook.
    Falls back to customer_local_time or trans_date_trans_time if needed.
    
    Args:
        spark_df: Input Spark DataFrame
        date_col: Name of date column (default: 'merchant_local_time')
        train_frac: Fraction of date range for training (default 0.8 = first 80%)
        val_frac: Cumulative fraction for validation end (default 0.9 = train 80%, val 10%);
            rows after this cutoff are not returned by this function
    
    Returns:
        Tuple of (train_df, val_df) as Spark DataFrames
    """
    # Find the date column - prioritize timezone-aware columns from EDA
    date_col_found = None
    if date_col in spark_df.columns:
        date_col_found = date_col
    else:
        # Try timezone-aware columns first (from EDA), then fallback to raw timestamp
        for col_name in ['merchant_local_time', 'customer_local_time', 'trans_date_trans_time']:
            if col_name in spark_df.columns:
                date_col_found = col_name
                print(f"  Using date column: {col_name}")
                break
    
    if date_col_found is None:
        date_like_cols = [c for c in spark_df.columns if 'date' in c.lower() or 'time' in c.lower()]
        raise ValueError(
            f"Date column '{date_col}' not found in DataFrame.\n"
            f"Available date-like columns: {date_like_cols}"
        )
    
    date_col = date_col_found
    
    # Convert date column to timestamp if needed
    spark_df = spark_df.withColumn(
        date_col,
        to_timestamp(col(date_col))
    )
    
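    # Detect the actual date range from the data; the split cutoffs are derived from it.
    # A single aggregation computes both bounds in one Spark job.
    date_bounds = spark_df.select(
        spark_min(col(date_col)).alias("min_date"),
        spark_max(col(date_col)).alias("max_date")
    ).collect()[0]
    actual_min_date, actual_max_date = date_bounds["min_date"], date_bounds["max_date"]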
    
    if actual_min_date is None or actual_max_date is None:
        raise ValueError(f"Date column '{date_col}' contains no valid dates")
    
    actual_min_dt = pd.to_datetime(actual_min_date)
    actual_max_dt = pd.to_datetime(actual_max_date)
    
    if not 0 < train_frac < val_frac <= 1.0:
        raise ValueError("Require 0 < train_frac < val_frac <= 1.0")
    data_span_days = (actual_max_dt - actual_min_dt).days
    train_end_dt = actual_min_dt + pd.Timedelta(days=int(data_span_days * train_frac))
    val_end_dt = actual_min_dt + pd.Timedelta(days=int(data_span_days * val_frac))
    
    print(f"  Data date range: {actual_min_dt.date()} to {actual_max_dt.date()}")
    print(f"  Train: first {train_frac:.0%} of period (end {train_end_dt.date()})")
    print(f"  Validation: next {val_frac - train_frac:.0%} of period (end {val_end_dt.date()})")
    
    # Split by date using Spark SQL filters
    train_df = spark_df.filter(col(date_col) <= train_end_dt)
    val_df = spark_df.filter(
        (col(date_col) > train_end_dt) & (col(date_col) <= val_end_dt)
    )
    
    train_count = train_df.count()
    val_count = val_df.count()
    
    print("Time-aware split summary:")
    print(f"  Train: {train_count:,} samples")
    print(f"  Validation: {val_count:,} samples")
    
    # Check for empty splits before accessing date ranges
    if train_count == 0 or val_count == 0:
        print("\n⚠ Warning: Empty split detected!")
        print(f"  Date column used: {date_col}")
        print(f"  train_frac: {train_frac}, val_frac: {val_frac}")
        
        # Show actual date range in data
        actual_dates = spark_df.select(
            spark_min(col(date_col)).alias("min_date"),
            spark_max(col(date_col)).alias("max_date")
        ).collect()[0]
        if actual_dates[0] is not None:
            print(f"  Actual data date range: {actual_dates[0]} to {actual_dates[1]}")
        
        raise ValueError(
            f"Time-aware split resulted in empty datasets. "
            f"Check date column '{date_col}' and date ranges."
        )
    
    # Get date ranges (FIX: use separate min/max calls, not duplicate dict keys)
    train_date_stats = train_df.select(
        spark_min(col(date_col)).alias("min_date"),
        spark_max(col(date_col)).alias("max_date")
    ).collect()[0]
    
    val_date_stats = val_df.select(
        spark_min(col(date_col)).alias("min_date"),
        spark_max(col(date_col)).alias("max_date")
    ).collect()[0]
    
    print(f"  Train date range: {train_date_stats['min_date']} to {train_date_stats['max_date']}")
    print(f"  Validation date range: {val_date_stats['min_date']} to {val_date_stats['max_date']}")
    
    # Calculate fraud rates
    train_fraud_count = train_df.filter(col("is_fraud") == 1).count()
    val_fraud_count = val_df.filter(col("is_fraud") == 1).count()
    
    print("\nFraud rates:")
    print(f"  Train: {train_fraud_count/train_count:.4%} ({train_fraud_count:,} fraud cases)")
    print(f"  Validation: {val_fraud_count/val_count:.4%} ({val_fraud_count:,} fraud cases)")
    
    return train_df, val_df


# Perform time-aware split on training data
train_df, val_df = split_data_time_aware(train_df)

  Data date range: 2018-12-31 to 2020-06-21
  Train: first 80% of period (end 2020-03-04)
  Validation: next 10% of period (end 2020-04-27)
Time-aware split summary:
  Train: 1,034,987 samples
  Validation: 122,480 samples


  Train date range: 2018-12-31 14:22:06 to 2020-03-04 14:21:42
  Validation date range: 2020-03-04 14:23:10 to 2020-04-27 14:21:51

Fraud rates:
  Train: 0.5757% (5,958 fraud cases)
  Validation: 0.5250% (643 fraud cases)
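
The cutoff arithmetic behind this split can be reproduced without Spark. The sketch below is illustrative only: `split_boundaries` and `assign_split` are hypothetical helpers, and the dates are made-up round numbers rather than the dataset's range. It mirrors the whole-day truncation used in `split_data_time_aware` and shows that, with the 0.8 / 0.9 fractions, rows later than the validation cutoff land in neither split.

```python
from datetime import datetime, timedelta


def split_boundaries(min_dt, max_dt, train_frac=0.8, val_frac=0.9):
    """Hypothetical helper mirroring the cutoff arithmetic of the time-aware split."""
    if not 0 < train_frac < val_frac <= 1.0:
        raise ValueError("Require 0 < train_frac < val_frac <= 1.0")
    span_days = (max_dt - min_dt).days  # whole days only; partial days are dropped
    train_end = min_dt + timedelta(days=int(span_days * train_frac))
    val_end = min_dt + timedelta(days=int(span_days * val_frac))
    return train_end, val_end


def assign_split(ts, train_end, val_end):
    """Same comparisons as the Spark filters: <= train_end, then <= val_end."""
    if ts <= train_end:
        return "train"
    if ts <= val_end:
        return "validation"
    return "unused"  # later than val_end: excluded from both splits


# Illustrative 100-day span (NOT the real data range)
start = datetime(2020, 1, 1)
end = datetime(2020, 4, 10)
train_end, val_end = split_boundaries(start, end)

print(train_end.date(), val_end.date())
print(assign_split(datetime(2020, 4, 5), train_end, val_end))
```

Because the cutoffs are expressed as fractions of the calendar span rather than of the row count, the resulting row proportions depend on how transaction volume is distributed over time.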


## Section 4: Spark MLlib Preprocessing Pipeline

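Create a production-grade Spark MLlib preprocessing pipeline with four kinds of stage, in order: StringIndexer for categorical encoding, Imputer for missing numerical values, VectorAssembler to build the feature vector, and StandardScaler to standardize the assembled vector (which includes the index-encoded categoricals).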

In [10]:
# ============================================================
# SPARK MLlib PREPROCESSOR CLASS
# ============================================================

class SparkMLlibPreprocessor:
    """
    Production-grade preprocessing using Spark MLlib Pipeline.
    Handles categorical encoding, feature assembly, scaling, and imputation.
    """
    
    def __init__(self, feature_names: List[str]):
        self.feature_names = feature_names
        self.categorical_features: List[str] = []
        self.numerical_features: List[str] = []
        self.pipeline: Optional[Pipeline] = None
        self.pipeline_model = None  # fitted PipelineModel, set by fit()
        self.is_fitted = False
    
    def _identify_feature_types(self, spark_df: DataFrame) -> None:
        """Identify categorical vs numerical features from Spark DataFrame schema."""
        self.categorical_features = []
        self.numerical_features = []
        
        missing = [f for f in self.feature_names if f not in spark_df.columns]
        if missing:
            print(f"  WARNING: Features not in DataFrame (skipped for type inference): {missing}")
        
        schema = spark_df.schema
        for feat in self.feature_names:
            if feat not in spark_df.columns:
                continue
            
            field = schema[feat]
            field_type = str(field.dataType)
            
            # Check if categorical (string type only)
            # Binary and low-cardinality integers should be treated as numerical
            if 'String' in field_type:
                self.categorical_features.append(feat)
            else:
                # All numeric types (including binary 0/1) are numerical
                self.numerical_features.append(feat)
    
    def fit(self, spark_df: DataFrame) -> 'SparkMLlibPreprocessor':
        """
        Fit Spark MLlib pipeline on training data.
        
        Args:
            spark_df: Training Spark DataFrame
        """
        self._identify_feature_types(spark_df)
        
        stages = []
        indexed_categorical = []
        imputed_numerical = []
        
        # StringIndexer for categorical features
        for feat in self.categorical_features:
            indexer = StringIndexer(
                inputCol=feat,
                outputCol=f"{feat}_indexed",
                handleInvalid="keep"
            )
            stages.append(indexer)
            indexed_categorical.append(f"{feat}_indexed")
        
        # Imputer for numerical features (handle missing values)
        if len(self.numerical_features) > 0:
            imputer = Imputer(
                inputCols=self.numerical_features,
                outputCols=[f"{f}_imputed" for f in self.numerical_features],
                strategy="mean"
            )
            stages.append(imputer)
            imputed_numerical = [f"{f}_imputed" for f in self.numerical_features]
        
        # VectorAssembler to combine all features
        assembler_inputs = indexed_categorical + imputed_numerical
        assembler = VectorAssembler(
            inputCols=assembler_inputs,
            outputCol="features_raw",
            handleInvalid="skip"
        )
        stages.append(assembler)
        
        # StandardScaler for feature scaling
        scaler = StandardScaler(
            inputCol="features_raw",
            outputCol="features",
            withMean=True,
            withStd=True
        )
        stages.append(scaler)
        
        # Create and fit pipeline
        self.pipeline = Pipeline(stages=stages)
        self.pipeline_model = self.pipeline.fit(spark_df)
        self.is_fitted = True
        
        return self
    
    def transform(self, spark_df: DataFrame) -> DataFrame:
        """
        Transform data using fitted pipeline.
        
        Args:
            spark_df: Spark DataFrame to transform
        
        Returns:
            Transformed Spark DataFrame with 'features' column
        """
        if not self.is_fitted:
            raise ValueError("Preprocessor must be fitted before transform")
        
        return self.pipeline_model.transform(spark_df)
    
    def get_feature_names(self) -> List[str]:
        """Get list of feature names."""
        return self.feature_names.copy()


# Initialize and fit preprocessor
preprocessor = SparkMLlibPreprocessor(feature_names)
preprocessor.fit(train_df)

print("Spark MLlib Preprocessor fitted successfully")
print(f"  Categorical features: {len(preprocessor.categorical_features)}")
print(f"  Numerical features: {len(preprocessor.numerical_features)}")
print(f"  Total features: {len(preprocessor.get_feature_names())}")
print(f"  Pipeline stages: {len(preprocessor.pipeline_model.stages)}")

Spark MLlib Preprocessor fitted successfully
  Categorical features: 5
  Numerical features: 19
  Total features: 24
  Pipeline stages: 8


## Section 5: Apply Preprocessing Pipeline

Transform train, validation, and test splits using the fitted Spark MLlib pipeline. The pipeline ensures consistent preprocessing across all splits.

In [11]:
# ============================================================
# ENGINEER FEATURES FOR TEST DATA
# FIXED: Match EXACT definitions from notebook 01
# ============================================================

def engineer_test_features(test_df: DataFrame) -> DataFrame:
    """
    Add engineered features to test data.
    CRITICAL: Feature definitions MUST match training data exactly!
    """
    df = test_df
    
    # Convert trans_date_trans_time to timestamp if needed
    if "trans_date_trans_time" in df.columns:
        df = df.withColumn("trans_date_trans_time", to_timestamp(col("trans_date_trans_time")))
    
    # Merchant local time (use trans_date_trans_time as fallback)
    if "merchant_local_time" not in df.columns:
        if "trans_date_trans_time" in df.columns:
            df = df.withColumn("merchant_local_time", col("trans_date_trans_time"))
            print("  Created merchant_local_time from trans_date_trans_time")
    
    # Temporal features
    if "hour" not in df.columns and "merchant_local_time" in df.columns:
        df = df.withColumn("hour", hour(to_timestamp(col("merchant_local_time"))))
        print("  Created hour")
    
    if "day_of_week" not in df.columns and "merchant_local_time" in df.columns:
        df = df.withColumn("day_of_week", dayofweek(to_timestamp(col("merchant_local_time"))))
        print("  Created day_of_week")
    
    if "month" not in df.columns and "merchant_local_time" in df.columns:
        df = df.withColumn("month", month(to_timestamp(col("merchant_local_time"))))
        print("  Created month")
    
    # time_bin: Morning (6-11), Afternoon (12-17), Evening (18-23), Night (0-5)
    # Match notebook 01 definition
    if "time_bin" not in df.columns and "hour" in df.columns:
        df = df.withColumn(
            "time_bin",
            when(col("hour").between(6, 11), "Morning")
            .when(col("hour").between(12, 17), "Afternoon")
            .when(col("hour").between(18, 23), "Evening")
            .otherwise("Night")
        )
        print("  Created time_bin")
    
    # is_peak_fraud_hour: evening hours 18-23
    if "is_peak_fraud_hour" not in df.columns and "hour" in df.columns:
        df = df.withColumn("is_peak_fraud_hour", when(col("hour").between(18, 23), 1).otherwise(0))
        print("  Created is_peak_fraud_hour")
    
    # is_peak_fraud_day: Wednesday-Friday (dayofweek: 4,5,6 in Spark)
    if "is_peak_fraud_day" not in df.columns and "day_of_week" in df.columns:
        df = df.withColumn("is_peak_fraud_day", when(col("day_of_week").isin([4, 5, 6]), 1).otherwise(0))
        print("  Created is_peak_fraud_day")
    
    # is_peak_fraud_season: January-February
    if "is_peak_fraud_season" not in df.columns and "month" in df.columns:
        df = df.withColumn("is_peak_fraud_season", when(col("month").isin([1, 2]), 1).otherwise(0))
        print("  Created is_peak_fraud_season")
    
    # city_size: Match notebook 01 definition
    if "city_size" not in df.columns and "city_pop" in df.columns:
        df = df.withColumn(
            "city_size",
            when(col("city_pop") > 1000000, "Metropolitan")
            .when(col("city_pop") > 500000, "Large City")
            .when(col("city_pop") > 100000, "Medium City")
            .when(col("city_pop") > 10000, "Small City")
            .otherwise("Small Town")
        )
        print("  Created city_size")
    
    # amount_bin: Match notebook 01 definition
    if "amount_bin" not in df.columns and "amt" in df.columns:
        df = df.withColumn(
            "amount_bin",
            when(col("amt") > 1000, ">$1000")
            .when(col("amt") > 500, "$500-$1000")
            .when(col("amt") > 300, "$300-$500")
            .when(col("amt") > 100, "$100-$300")
            .when(col("amt") > 50, "$50-$100")
            .otherwise("<$50")
        )
        print("  Created amount_bin")
    
    # Card age features - compute from cc_num within test data
    if "card_age_days" not in df.columns and "cc_num" in df.columns:
        window_spec = Window.partitionBy("cc_num")
        df = df.withColumn(
            "first_txn_time",
            spark_min(col("trans_date_trans_time")).over(window_spec)
        )
        df = df.withColumn(
            "card_age_days",
            datediff(col("trans_date_trans_time"), col("first_txn_time"))
        )
        df = df.drop("first_txn_time")
        print("  Created card_age_days")
    
    if "card_age_bin" not in df.columns and "card_age_days" in df.columns:
        df = df.withColumn(
            "card_age_bin",
            when(col("card_age_days") < 7, "<7 days")
            .when(col("card_age_days") < 30, "7-30 days")
            .when(col("card_age_days") < 90, "30-90 days")
            .when(col("card_age_days") < 180, "90-180 days")
            .otherwise("180+ days")
        )
        print("  Created card_age_bin")
    
    # Transaction count per card
    if "transaction_count" not in df.columns and "cc_num" in df.columns:
        window_spec = Window.partitionBy("cc_num")
        df = df.withColumn("transaction_count", count("*").over(window_spec))
        print("  Created transaction_count")
    
    if "transaction_count_bin" not in df.columns and "transaction_count" in df.columns:
        df = df.withColumn(
            "transaction_count_bin",
            when(col("transaction_count") <= 5, "1-5")
            .when(col("transaction_count").between(6, 20), "6-20")
            .when(col("transaction_count").between(21, 100), "21-100")
            .otherwise("100+")
        )
        print("  Created transaction_count_bin")
    
    # Card flags
    if "is_new_card" not in df.columns and "card_age_days" in df.columns:
        df = df.withColumn("is_new_card", when(col("card_age_days") <= 30, 1).otherwise(0))
        print("  Created is_new_card")
    
    if "is_low_volume_card" not in df.columns and "transaction_count" in df.columns:
        df = df.withColumn("is_low_volume_card", when(col("transaction_count") <= 5, 1).otherwise(0))
        print("  Created is_low_volume_card")
    
    # High risk category
    if "is_high_risk_category" not in df.columns and "category" in df.columns:
        high_risk_categories = ["grocery_pos", "gas_transport", "shopping_pos", "misc_pos", "grocery_net"]
        df = df.withColumn("is_high_risk_category", when(col("category").isin(high_risk_categories), 1).otherwise(0))
        print("  Created is_high_risk_category")
    
    # Interaction features - MATCH NOTEBOOK 01 DEFINITIONS EXACTLY
    high_amount_bins = ["$300-$500", "$500-$1000", ">$1000"]
    
    # evening_high_amount: Evening AND amount in high bins
    if "evening_high_amount" not in df.columns:
        df = df.withColumn(
            "evening_high_amount",
            when((col("time_bin") == "Evening") & (col("amount_bin").isin(high_amount_bins)), 1).otherwise(0)
        )
        print("  Created evening_high_amount")
    
    # evening_online_shopping: Evening AND category is shopping_net or misc_net
    if "evening_online_shopping" not in df.columns:
        df = df.withColumn(
            "evening_online_shopping",
            when((col("time_bin") == "Evening") & (col("category").isin(["shopping_net", "misc_net"])), 1).otherwise(0)
        )
        print("  Created evening_online_shopping")
    
    # large_city_evening: FIXED - use "Large City" not "Large"
    if "large_city_evening" not in df.columns:
        df = df.withColumn(
            "large_city_evening",
            when((col("city_size") == "Large City") & (col("time_bin") == "Evening"), 1).otherwise(0)
        )
        print("  Created large_city_evening")
    
    # new_card_evening: new card AND evening
    if "new_card_evening" not in df.columns:
        df = df.withColumn(
            "new_card_evening",
            when((col("is_new_card") == 1) & (col("time_bin") == "Evening"), 1).otherwise(0)
        )
        print("  Created new_card_evening")
    
    # high_amount_online: high amount AND online category
    if "high_amount_online" not in df.columns:
        df = df.withColumn(
            "high_amount_online",
            when((col("amount_bin").isin(high_amount_bins)) & (col("category").isin(["shopping_net", "misc_net"])), 1).otherwise(0)
        )
        print("  Created high_amount_online")
    
    # Risk scores - compute using same formulas as notebook 01
    if "temporal_risk_score" not in df.columns:
        df = df.withColumn(
            "temporal_risk_score",
            (col("is_peak_fraud_hour").cast("double") * 0.4 +
             col("is_peak_fraud_day").cast("double") * 0.3 +
             col("is_peak_fraud_season").cast("double") * 0.3)
        )
        print("  Created temporal_risk_score")
    
    if "geographic_risk_score" not in df.columns:
        df = df.withColumn(
            "geographic_risk_score",
            when(col("city_pop") < 10000, 0.3)
            .when(col("city_pop") < 50000, 0.2)
            .when(col("city_pop") < 100000, 0.1)
            .otherwise(0.0)
        )
        print("  Created geographic_risk_score")
    
    if "card_risk_score" not in df.columns:
        df = df.withColumn(
            "card_risk_score",
            (col("is_new_card").cast("double") * 0.5 +
             col("is_low_volume_card").cast("double") * 0.3 +
             when(col("card_age_days") < 7, 0.2).otherwise(0.0))
        )
        print("  Created card_risk_score")
    
    if "risk_tier" not in df.columns:
        df = df.withColumn(
            "total_risk",
            col("temporal_risk_score") + col("geographic_risk_score") + col("card_risk_score")
        )
        df = df.withColumn(
            "risk_tier",
            when(col("total_risk") >= 0.8, "High")
            .when(col("total_risk") >= 0.4, "Medium")
            .otherwise("Low")
        )
        df = df.drop("total_risk")
        print("  Created risk_tier")
    
    return df


print("Engineering features for test data...")
test_df_engineered = engineer_test_features(test_df_raw)
print("✓ Test data feature engineering complete\n")


Engineering features for test data...
  Created merchant_local_time from trans_date_trans_time
  Created hour
  Created day_of_week
  Created month
  Created time_bin
  Created is_peak_fraud_hour
  Created is_peak_fraud_day
  Created is_peak_fraud_season
  Created city_size
  Created amount_bin
  Created card_age_days
  Created card_age_bin
  Created transaction_count
  Created transaction_count_bin
  Created is_new_card
  Created is_low_volume_card
  Created is_high_risk_category
  Created evening_high_amount
  Created evening_online_shopping
  Created large_city_evening
  Created new_card_evening
  Created high_amount_online
  Created temporal_risk_score
  Created geographic_risk_score
  Created card_risk_score
  Created risk_tier
✓ Test data feature engineering complete
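
When these temporal features are re-created outside Spark (for example in a pandas or sklearn inference path), the day-of-week convention is easy to get wrong: Spark's `dayofweek` returns 1 for Sunday through 7 for Saturday, whereas Python's `isoweekday()` returns 1 for Monday through 7 for Sunday. The sketch below is a plain-Python illustration, not code from notebook 01; `spark_dayofweek`, `time_bin`, and `is_peak_fraud_day` are hypothetical helpers that follow the definitions used in the cell above.

```python
from datetime import datetime


def spark_dayofweek(ts: datetime) -> int:
    """Map a Python datetime to Spark's convention: 1 = Sunday, ..., 7 = Saturday."""
    return ts.isoweekday() % 7 + 1


def time_bin(hour: int) -> str:
    """Same hour ranges as the Spark `when` chain above."""
    if 6 <= hour <= 11:
        return "Morning"
    if 12 <= hour <= 17:
        return "Afternoon"
    if 18 <= hour <= 23:
        return "Evening"
    return "Night"  # hours 0-5


def is_peak_fraud_day(ts: datetime) -> int:
    """Wednesday-Friday, i.e. Spark dayofweek values 4, 5, 6."""
    return int(spark_dayofweek(ts) in (4, 5, 6))


ts = datetime(2020, 3, 4, 18, 30)  # a Wednesday evening
print(spark_dayofweek(ts), time_bin(ts.hour), is_peak_fraud_day(ts))  # 4 Evening 1
```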



In [12]:
# ============================================================
# APPLY PREPROCESSING PIPELINE
# ============================================================

print("Transforming datasets with fitted preprocessor...")

# Transform training data
print("  Transforming training data...")
train_transformed = preprocessor.transform(train_df)
print(f"    Train samples: {train_transformed.count():,}")

# Transform validation data
print("  Transforming validation data...")
val_transformed = preprocessor.transform(val_df)
print(f"    Validation samples: {val_transformed.count():,}")

# Transform test data (using engineered features)
print("  Transforming test data...")
test_transformed = preprocessor.transform(test_df_engineered)
print(f"    Test samples: {test_transformed.count():,}")

print("\n✓ All datasets transformed successfully")


Transforming datasets with fitted preprocessor...
  Transforming training data...


    Train samples: 1,034,987
  Transforming validation data...


    Validation samples: 122,480
  Transforming test data...


    Test samples: 555,719

✓ All datasets transformed successfully


### Distribution Comparison Diagnostic

This section compares the distributions of train, validation, and test sets to detect potential data drift that could cause poor model generalization.

**Key checks:**
- Fraud rate comparison across splits
- Sample size comparison
- Feature statistics comparison
- Distribution drift warning flags
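
A feature-statistics check that pools every feature into a single mean and standard deviation can miss drift, because shifts in opposite directions cancel out. A per-feature comparison is one way to complement it. The sketch below is an illustration under assumptions, not part of the notebook: `standardized_mean_diff` and `per_feature_drift` are hypothetical helpers, the 0.5 threshold is an arbitrary illustrative cutoff, and the rows are toy values.

```python
from statistics import mean, pstdev


def standardized_mean_diff(train_col, test_col):
    """|mean(test) - mean(train)| divided by the train standard deviation."""
    mu_train, mu_test = mean(train_col), mean(test_col)
    sigma = pstdev(train_col)
    if sigma == 0:
        # Constant train column: any change in mean counts as unbounded drift
        return 0.0 if mu_train == mu_test else float("inf")
    return abs(mu_test - mu_train) / sigma


def per_feature_drift(train_rows, test_rows, threshold=0.5):
    """Return {feature_index: smd} for features whose shift exceeds the threshold."""
    flagged = {}
    for j in range(len(train_rows[0])):
        smd = standardized_mean_diff(
            [row[j] for row in train_rows],
            [row[j] for row in test_rows],
        )
        if smd > threshold:
            flagged[j] = smd
    return flagged


# Toy rows: feature 0 shifts up by 2, feature 1 shifts down by 2
train_rows = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]]
test_rows = [[2.0, 8.0], [3.0, 9.0], [4.0, 10.0]]

pooled_train = mean(v for row in train_rows for v in row)
pooled_test = mean(v for row in test_rows for v in row)
flagged = per_feature_drift(train_rows, test_rows)

print(pooled_train == pooled_test)  # pooled means are identical
print(sorted(flagged))              # yet both features are flagged
```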

In [13]:
# ============================================================
# DISTRIBUTION COMPARISON ANALYSIS
# ============================================================
print("=" * 80)
print("DISTRIBUTION COMPARISON: Train vs Validation vs Test")
print("=" * 80)

# 1. Fraud Rate Comparison (compute fresh)
print("\n1. FRAUD RATE COMPARISON:")
print("-" * 40)
# Aggregate in Spark rather than collecting every label to the driver
train_fraud_rate = train_transformed.selectExpr("avg(is_fraud)").first()[0]
val_fraud_rate = val_transformed.selectExpr("avg(is_fraud)").first()[0]
test_fraud_rate = test_transformed.selectExpr("avg(is_fraud)").first()[0]

print(f"   Train:      {train_fraud_rate:.4%}")
print(f"   Validation: {val_fraud_rate:.4%}")
print(f"   Test:       {test_fraud_rate:.4%}")

# Flag if train/test fraud rate differs significantly
if abs(train_fraud_rate - test_fraud_rate) / max(train_fraud_rate, test_fraud_rate) > 0.2:
    print("   NOTE: Train and test fraud rates differ by >20% - this is expected from different time periods")
else:
    print("   Train and test fraud rates are similar")

# 2. Sample Size Comparison
print("\n2. SAMPLE SIZE COMPARISON:")
print("-" * 40)
train_count = train_transformed.count()
val_count = val_transformed.count()
test_count = test_transformed.count()
print(f"   Train:      {train_count:,} samples")
print(f"   Validation: {val_count:,} samples")
print(f"   Test:       {test_count:,} samples")
print(f"   Total:      {train_count + val_count + test_count:,} samples")

# 3. Feature Vector Statistics (sample-based for efficiency)
print("\n3. FEATURE STATISTICS (sample of 10000 per split):")
print("-" * 40)

def get_feature_stats(df, name, sample_size=10000):
    """Get feature statistics from a sample of the DataFrame."""
    sample_df = df.select("features").sample(False, min(1.0, sample_size / df.count()), seed=42).toPandas()
    if len(sample_df) == 0:
        return None
    features = np.array([row.toArray() for row in sample_df["features"]])
    return {
        'name': name,
        'mean': np.mean(features),
        'std': np.std(features),
        'min': np.min(features),
        'max': np.max(features)
    }

train_stats = get_feature_stats(train_transformed, "Train")
val_stats = get_feature_stats(val_transformed, "Validation")
test_stats = get_feature_stats(test_transformed, "Test")

print(f"   {'Split':<12} {'Mean':>10} {'Std':>10} {'Min':>10} {'Max':>10}")
print(f"   {'-'*52}")
for stats in [train_stats, val_stats, test_stats]:
    if stats:
        print(f"   {stats['name']:<12} {stats['mean']:>10.4f} {stats['std']:>10.4f} {stats['min']:>10.4f} {stats['max']:>10.4f}")

# 4. Check for significant distribution drift
print("\n4. DISTRIBUTION DRIFT CHECK:")
print("-" * 40)
if train_stats and test_stats:
    mean_diff = abs(train_stats['mean'] - test_stats['mean'])
    std_diff = abs(train_stats['std'] - test_stats['std'])
    
    if mean_diff > 0.5 or std_diff > 0.5:
        print("   NOTE: Some distribution drift detected (mean diff: {:.4f}, std diff: {:.4f})".format(mean_diff, std_diff))
    else:
        print("   Train and test feature distributions are aligned")
        print(f"      Mean difference: {mean_diff:.4f}")
        print(f"      Std difference: {std_diff:.4f}")

print("\n" + "=" * 80)
print("Distribution comparison complete.")
print("=" * 80)


DISTRIBUTION COMPARISON: Train vs Validation vs Test

1. FRAUD RATE COMPARISON:
----------------------------------------


   Train:      0.5757%
   Validation: 0.5250%
   Test:       0.3860%
   NOTE: Train and test fraud rates differ by >20% - this is expected from different time periods

2. SAMPLE SIZE COMPARISON:
----------------------------------------


   Train:      1,034,987 samples
   Validation: 122,480 samples
   Test:       555,719 samples
   Total:      1,713,186 samples

3. FEATURE STATISTICS (sample of 10000 per split):
----------------------------------------
   Split              Mean        Std        Min        Max
   ----------------------------------------------------
   Train            0.0000     0.9886    -2.4409    42.7127
   Validation      -0.0495     0.9613    -2.4409    42.7127
   Test             0.1053     1.1990    -2.4369    42.7127

4. DISTRIBUTION DRIFT CHECK:
----------------------------------------
   Train and test feature distributions are aligned
      Mean difference: 0.1052
      Std difference: 0.2103

Distribution comparison complete.

In [14]:
# ============================================================
# CLASS IMBALANCE HANDLING
# ============================================================

# Convert training data to pandas
print("Converting training data to pandas...")
train_pd = train_transformed.select("features", "is_fraud").toPandas()

# Extract feature vectors and labels
X_train = np.array([row.toArray() for row in train_pd["features"]])
y_train = train_pd["is_fraud"].values

print(f"Training data: {X_train.shape[0]:,} samples")
print(f"  Fraud cases: {y_train.sum():,} ({y_train.mean():.4%})")
print(f"  Non-fraud cases: {(y_train == 0).sum():,} ({(y_train == 0).mean():.4%})")

# ============================================================
# SMOTE EXPERIMENTS (ALL REJECTED - PRESERVED FOR DOCUMENTATION)
# ============================================================
# 
# EXPERIMENT 1: Standard SMOTE (FAILED)
# Result: Val F1 = 0.53, Test F1 = 0.008 (catastrophic)
# ---------------------------------------------------------
# from imblearn.over_sampling import SMOTE
# smote = SMOTE(
#     sampling_strategy=0.1,  # minority:majority ratio of 0.1 (about 9.09% fraud rate)
#     k_neighbors=5,
#     random_state=42
# )
# X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# 
# EXPERIMENT 2a: Conservative SMOTE sampling rates (FAILED)
# Result: Test F1 = 0.141 - 0.248 (all worse than baseline)
# ---------------------------------------------------------
# smote_conservative = SMOTE(sampling_strategy=0.01, k_neighbors=5, random_state=42)  # 1%
# smote_conservative = SMOTE(sampling_strategy=0.02, k_neighbors=5, random_state=42)  # 2%
# smote_conservative = SMOTE(sampling_strategy=0.05, k_neighbors=5, random_state=42)  # 5%
# 
# EXPERIMENT 2b: SMOTE Variants (ALL FAILED)
# Result: All performed worse than baseline
# ---------------------------------------------------------
# from imblearn.over_sampling import BorderlineSMOTE, ADASYN
# from imblearn.combine import SMOTEENN, SMOTETomek
# 
# BorderlineSMOTE: Test F1 = 0.095 (focuses on borderline cases)
# borderline_smote = BorderlineSMOTE(sampling_strategy=0.05, k_neighbors=5, random_state=42)
# 
# ADASYN: Test F1 = 0.088 (adaptive synthetic sampling)
# adasyn = ADASYN(sampling_strategy=0.05, n_neighbors=5, random_state=42)
# 
# SMOTEENN: Test F1 = 0.265 (SMOTE + Edited Nearest Neighbors cleaning)
# smoteenn = SMOTEENN(sampling_strategy=0.05, random_state=42)
# 
# SMOTETomek: Test F1 = 0.182 (SMOTE + Tomek links cleaning)
# smotetomek = SMOTETomek(sampling_strategy=0.05, random_state=42)
# 
# WHY ALL SMOTE EXPERIMENTS FAILED:
# 1. Temporal distribution shift: Test data is from different time period
# 2. Feature space interpolation creates non-realistic fraud patterns
# 3. Synthetic samples don't capture evolving fraud behavior
# ============================================================

# ============================================================
# ADOPTED APPROACH: Natural Distribution + Class Weights
# Result: Test F1 = 0.29 (XGBoost), 0.37 (Deep Learning)
# ============================================================
X_train_resampled = X_train
y_train_resampled = y_train

print("\n" + "=" * 60)
print("FINAL: Using Natural Distribution (No SMOTE)")
print("=" * 60)
print(f"Training data: {X_train_resampled.shape[0]:,} samples")
print(f"  Fraud rate: {y_train_resampled.mean():.4%}")
print(f"  Class weight ratio: {(1 - y_train_resampled.mean()) / y_train_resampled.mean():.1f}:1")
print("\nClass imbalance handled via cost-sensitive learning:")
print("  - XGBoost: scale_pos_weight = neg_count / pos_count")
print("  - Deep Learning: Focal Loss or Weighted BCE")
print("  - Threshold: Optimized on validation PR curve")
print("=" * 60)

# Store data as pandas for downstream compatibility
train_resampled_pd = pd.DataFrame(X_train_resampled)
train_resampled_pd['is_fraud'] = y_train_resampled

print("\n✓ Training data prepared with natural distribution")
print("  Validation and test sets remain imbalanced (real-world scenario)")


Converting training data to pandas...
Training data: 1,034,987 samples
  Fraud cases: 5,958 (0.5757%)
  Non-fraud cases: 1,029,029 (99.4243%)

FINAL: Using Natural Distribution (No SMOTE)
Training data: 1,034,987 samples
  Fraud rate: 0.5757%
  Class weight ratio: 172.7:1

Class imbalance handled via cost-sensitive learning:
  - XGBoost: scale_pos_weight = neg_count / pos_count
  - Deep Learning: Focal Loss or Weighted BCE
  - Threshold: Optimized on validation PR curve

✓ Training data prepared with natural distribution
  Validation and test sets remain imbalanced (real-world scenario)
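
The class weight ratio printed above follows directly from the class counts. As a minimal sketch of how that ratio enters a weighted loss, the plain-Python example below recomputes it from the reported counts and applies it as a positive-class weight in binary cross-entropy. `class_weight_ratio` and `weighted_bce` are hypothetical helpers written for illustration; they are not the notebook's training code, and the actual XGBoost or deep learning setup may differ.

```python
import math


def class_weight_ratio(n_neg: int, n_pos: int) -> float:
    """neg_count / pos_count, the value typically passed as scale_pos_weight."""
    if n_pos <= 0:
        raise ValueError("Need at least one positive example")
    return n_neg / n_pos


def weighted_bce(y_true, p_pred, pos_weight: float) -> float:
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Counts taken from the training output above
ratio = class_weight_ratio(n_neg=1_029_029, n_pos=5_958)
print(f"{ratio:.1f}")  # 172.7

# A missed fraud case costs `ratio` times more than an equally wrong non-fraud case
loss_fraud = weighted_bce([1], [0.5], pos_weight=ratio)
loss_legit = weighted_bce([0], [0.5], pos_weight=ratio)
print(loss_fraud / loss_legit)
```

Weighting the loss this way leaves the training distribution untouched, which is the property the adopted approach relies on.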


## Section 7: Pipeline Conversion & Saving

Convert the fitted Spark MLlib preprocessing pipeline into a sklearn-compatible pipeline for lightweight inference, then persist:
- Preprocessed train/validation/test datasets
- Spark pipeline artifacts
- Sklearn pipeline + feature name list for downstream modeling/inference

## Section Summary

**Data Loading:**
- Training data: Loaded from EDA checkpoint as Spark DataFrame
- Test data: Loaded separately from CSV (never used for fitting)

**Feature Selection:**
- Selected features by category: Critical, High Priority, Interaction, Enriched
- Total features selected: 24 (5 categorical, 19 numerical), based on availability in the dataset

**Train/Validation Split:**
- Time-aware split using Spark SQL filters; dates derived from actual data range
- Train: first 80% of period (by date column)
- Validation: next 10% of period
- Test: loaded separately (own date range)

**Preprocessing Pipeline:**
- Spark MLlib Pipeline with StringIndexer, Imputer, VectorAssembler, StandardScaler
- Fitted only on training data
- Applied consistently to train/val/test splits

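**Class Imbalance Handling:**
- SMOTE and its variants were tested and rejected; training data keeps its natural distribution (0.5757% fraud)
- Imbalance is handled downstream via cost-sensitive learning (class weights, Focal Loss or Weighted BCE) and a threshold tuned on the validation PR curve
- Training features converted to pandas for downstream (PyTorch) compatibility
- Validation and test sets remain imbalanced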

**Artifacts Saved:**
- Preprocessed data: train, validation, test (parquet format)
- Spark MLlib pipeline: For scalable training
- Sklearn pipeline: For fast inference (if conversion succeeds)
- Feature names: For reference
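
When porting the categorical encoding out of Spark, two StringIndexer behaviors need to be preserved: the fitted `labels` list is ordered by descending frequency by default (not alphabetically), and `handleInvalid="keep"` sends unseen values to an extra bucket with index `len(labels)`. One way to mirror both is an explicit dictionary lookup. The sketch below is a hedged illustration, not the notebook's conversion code: `build_index_map` and `encode` are hypothetical helpers, and the label order shown is invented for the example.

```python
from typing import Dict, List, Optional, Sequence


def build_index_map(labels: Sequence[str]) -> Dict[str, int]:
    """Map each label to its position in the fitted StringIndexer labels list."""
    return {label: idx for idx, label in enumerate(labels)}


def encode(values: Sequence[Optional[str]], index_map: Dict[str, int]) -> List[int]:
    """Encode values; anything not in the map goes to the extra 'keep' bucket."""
    unseen_index = len(index_map)  # same index Spark uses with handleInvalid="keep"
    return [index_map.get(value, unseen_index) for value in values]


# Hypothetical frequency-ordered labels (most frequent first), NOT the fitted values
labels = ["grocery_pos", "gas_transport", "shopping_net"]
index_map = build_index_map(labels)

print(encode(["gas_transport", "grocery_pos", "travel", None], index_map))
# "travel" and None are not in the fitted labels, so both map to index 3
```

An encoder that derives indices from sorted class names would assign different integers than the fitted Spark model for any label list that is not already in alphabetical order, so checking the converted encoder against a few rows of Spark output is a cheap safeguard.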

In [15]:
# ============================================================
# PIPELINE CONVERSION & SAVING
# ============================================================

def convert_spark_to_sklearn_pipeline(
    spark_preprocessor: SparkMLlibPreprocessor
) -> SklearnPipeline:
    """
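    Convert a fitted Spark MLlib pipeline to an sklearn pipeline for fast inference.
    This enables real-time prediction while maintaining training scalability.

    Note: fitted attributes (classes_, statistics_, mean_, scale_) are copied
    from the Spark stages rather than learned via fit(), so output parity with
    the Spark pipeline should be checked before relying on the result.

    Args:
        spark_preprocessor: Fitted Spark MLlib preprocessor

    Returns:
        sklearn Pipeline assembled from the Spark stage parameters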
    """
    transformers = []
    
    # Extract StringIndexer mappings for categorical features
    stage_idx = 0
    for feat in spark_preprocessor.categorical_features:
        indexer_model = spark_preprocessor.pipeline_model.stages[stage_idx]
        labels = indexer_model.labels
        
        le = SklearnLabelEncoder()
        le.classes_ = np.array(labels)
        transformers.append((f"label_encoder_{feat}", le, [feat]))
        stage_idx += 1
    
    # Extract Imputer parameters for numerical features
    if len(spark_preprocessor.numerical_features) > 0:
        imputer_model = spark_preprocessor.pipeline_model.stages[stage_idx]
        imputer = SklearnSimpleImputer(strategy="mean")
        # Get statistics from ImputerModel
        try:
            # Spark ImputerModel stores statistics - access via surrogateDF
            # The statistics are stored as a single row DataFrame
            stats_row = imputer_model.surrogateDF.collect()[0]
            # Extract statistics in order of numerical_features
            # Spark stores means with column names matching input columns
            stats_values = []
            for feat in spark_preprocessor.numerical_features:
                # Look up the fitted surrogate value by input column name
                if feat in stats_row.asDict():
                    stats_values.append(float(stats_row[feat]))
                else:
                    # Fallback: impute with 0.0. No mean is computed here, so
                    # this feature will diverge from the Spark pipeline.
                    stats_values.append(0.0)
            imputer.statistics_ = np.array(stats_values)
        except Exception as e:
            # Fallback: use 0.0 for all features if extraction fails
            print(f"    Warning: Could not extract Imputer statistics: {e}")
            imputer.statistics_ = np.zeros(len(spark_preprocessor.numerical_features))
        transformers.append(("imputer", imputer, spark_preprocessor.numerical_features))
        stage_idx += 1
    
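    # Extract StandardScaler parameters (assumes the scaler is the last stage)
    scaler_model = spark_preprocessor.pipeline_model.stages[-1]
    # Mirror the Spark settings instead of hard-coding them:
    # Spark's StandardScaler defaults to withMean=False
    scaler = SklearnStandardScaler(
        with_mean=scaler_model.getWithMean(),
        with_std=scaler_model.getWithStd(),
    )
    scaler.mean_ = np.array(scaler_model.mean.toArray())
    scaler.scale_ = np.array(scaler_model.std.toArray())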
    
    # Create ColumnTransformer
    column_transformer = ColumnTransformer(transformers, remainder="passthrough")
    
    # Create sklearn pipeline
    sklearn_pipeline = SklearnPipeline([
        ("preprocessor", column_transformer),
        ("scaler", scaler)
    ])
    
    return sklearn_pipeline


# Convert Spark pipeline to sklearn pipeline
print("Converting Spark MLlib pipeline to sklearn pipeline...")
try:
    sklearn_preprocessor = convert_spark_to_sklearn_pipeline(preprocessor)
    print("✓ Pipeline conversion complete")
except Exception as e:
    print(f"⚠ Pipeline conversion failed: {e}")
    print("  Saving Spark pipeline only. Sklearn conversion can be done later.")
    sklearn_preprocessor = None

# Save preprocessed data
print("\nSaving preprocessed data...")

# Save training data (natural distribution; SMOTE is disabled, so no rows were added)
train_resampled_pd.to_parquet(PREPROCESSED_TRAIN_PATH, index=False)
print(f"  Train: {PREPROCESSED_TRAIN_PATH} ({len(train_resampled_pd):,} samples)")

# Save validation data (convert from Spark to pandas)
val_pd = val_transformed.select("features", "is_fraud").toPandas()
val_features = np.array([row.toArray() for row in val_pd["features"]])
val_preprocessed = pd.DataFrame(val_features)
val_preprocessed['is_fraud'] = val_pd['is_fraud'].values
val_preprocessed.to_parquet(PREPROCESSED_VAL_PATH, index=False)
print(f"  Validation: {PREPROCESSED_VAL_PATH} ({len(val_preprocessed):,} samples)")

# Save test data (convert from Spark to pandas)
test_pd = test_transformed.select("features", "is_fraud").toPandas()
test_features = np.array([row.toArray() for row in test_pd["features"]])
test_preprocessed = pd.DataFrame(test_features)
test_preprocessed['is_fraud'] = test_pd['is_fraud'].values
test_preprocessed.to_parquet(PREPROCESSED_TEST_PATH, index=False)
print(f"  Test: {PREPROCESSED_TEST_PATH} ({len(test_preprocessed):,} samples)")

# Save Spark MLlib pipeline (use Spark's native save method, not joblib)
# Spark saves to a directory, so remove .pkl extension
spark_pipeline_dir = str(SPARK_PREPROCESSER_PATH).replace('.pkl', '')
preprocessor.pipeline_model.write().overwrite().save(spark_pipeline_dir)
print(f"\nSpark MLlib pipeline saved: {spark_pipeline_dir}")

# Save sklearn pipeline if conversion succeeded
if sklearn_preprocessor is not None:
    joblib.dump(sklearn_preprocessor, SKLEARN_PREPROCESSER_PATH)
    print(f"Sklearn pipeline saved: {SKLEARN_PREPROCESSER_PATH}")

# Save feature names
with open(FEATURE_NAMES_PATH, 'wb') as f:
    pickle.dump(preprocessor.get_feature_names(), f)
print(f"Feature names saved: {FEATURE_NAMES_PATH}")

print("\n" + "=" * 60)
print("✓ Preprocessing pipeline complete!")
print("=" * 60)
print("\nRun status:")
print("  End-to-end run succeeded (no SparkFileNotFoundException / schema errors)")
print("  Timezone: merchant_local_time resolved on test (UTC fallback for unresolved timezones)")
print("\nPipeline:")
print(f"  Selected model features: {len(preprocessor.get_feature_names())}")
print(f"  Feature vector size: {len(preprocessor.get_feature_names())}")
print(f"  Pipeline stages: {len(preprocessor.pipeline_model.stages)}")
print("\nSample counts:")
print(f"  Train (before SMOTE): {X_train.shape[0]:,}")
print(f"  Train (after SMOTE): {len(train_resampled_pd):,}")
print(f"  Validation: {len(val_preprocessed):,}")
print(f"  Test: {len(test_preprocessed):,}")
print("\nSaved artifacts:")
print(f"  Spark preprocessor: {SPARK_PREPROCESSER_PATH}")
print(f"  Sklearn preprocessor: {SKLEARN_PREPROCESSER_PATH}")
print(f"  Feature names: {FEATURE_NAMES_PATH}")
print("\nProcessed data (parquet):")
print(f"  Train: {PREPROCESSED_TRAIN_PATH}")
print(f"  Validation: {PREPROCESSED_VAL_PATH}")
print(f"  Test: {PREPROCESSED_TEST_PATH}")
print("=" * 60)

Converting Spark MLlib pipeline to sklearn pipeline...
✓ Pipeline conversion complete

Saving preprocessed data...
  Train: /home/alireza/Desktop/projects/fraud-shield-ai/data/processed/train_preprocessed.parquet (1,034,987 samples)
  Validation: /home/alireza/Desktop/projects/fraud-shield-ai/data/processed/val_preprocessed.parquet (122,480 samples)
  Test: /home/alireza/Desktop/projects/fraud-shield-ai/data/processed/test_preprocessed.parquet (555,719 samples)

Spark MLlib pipeline saved: /home/alireza/Desktop/projects/fraud-shield-ai/models/spark_preprocessor
Sklearn pipeline saved: /home/alireza/Desktop/projects/fraud-shield-ai/models/sklearn_preprocessor.pkl
Feature names saved: /home/alireza/Desktop/projects/fraud-shield-ai/models/feature_names.pkl

✓ Preprocessing pipeline complete!

Run status:
  End-to-end run succeeded (no SparkFileNotFoundException / schema errors)
  Timezone: merchant_local_time resolved on test (UTC fallback for unresolved timezones)

Pipeline:
  Selected model features: 24
  Feature vector size: 24
  Pipeline stages: 8

Sample counts:
  Train (before SMOTE): 1,034,987
  Train (after SMOTE): 1,034,987
  Validation: 122,480
  Test: 555,719

Saved artifacts:
  Spark preprocessor: /home/alireza/Desktop/projects/fraud-shield-ai/models/spark_preprocessor.pkl
  Sklearn preprocessor: /home/alireza/Desktop/projects/fraud-shield-ai/models/sklearn_preprocessor.pkl
  Feature names: /home/alireza/Desktop/projects/fraud-shield