# Local Setup Instructions

## Prerequisites Checklist

Before running this notebook, ensure you have completed the following setup:

- [ ] **Java 11 installed** and `JAVA_HOME` configured
  - macOS: `brew install openjdk@11`
  - Set: `export JAVA_HOME=$(/usr/libexec/java_home -v 11)`
- [ ] **Conda environment `fraud-shield` created and activated**
  - Create: `conda env create -f environment.yml`
  - Activate: `conda activate fraud-shield`
- [ ] **PySpark verified working**
  - Test: `python -c "from pyspark.sql import SparkSession; print('PySpark OK')"`
- [ ] **Data directories created**
  - `data/checkpoints/` - for EDA checkpoints
  - `data/processed/` - for preprocessed data
  - `models/` - for saved preprocessors
- [ ] **EDA checkpoint available**
  - Run `01-local-fraud-detection-eda.ipynb` first
  - Checkpoint should exist: `data/checkpoints/section11_enriched_features.parquet`
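
The checklist above can also be verified programmatically. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_prerequisites` is hypothetical, and it assumes the directory layout listed in the checklist, relative to the project root.

```python
import os
import shutil
from pathlib import Path
from typing import List


def find_missing_prerequisites(project_root: Path) -> List[str]:
    """Return a human-readable list of checklist items that are not satisfied."""
    problems = []

    # Java: PySpark needs either JAVA_HOME or a `java` binary on PATH
    if not os.environ.get("JAVA_HOME") and shutil.which("java") is None:
        problems.append("Java not found (JAVA_HOME unset and no `java` on PATH)")

    # Directories named in the checklist
    for rel in ("data/checkpoints", "data/processed", "models"):
        if not (project_root / rel).is_dir():
            problems.append(f"Missing directory: {rel}/")

    # Checkpoint written by the EDA notebook
    checkpoint = project_root / "data/checkpoints/section11_enriched_features.parquet"
    if not checkpoint.exists():
        problems.append(f"Missing EDA checkpoint: {checkpoint.name}")

    return problems


if __name__ == "__main__":
    for problem in find_missing_prerequisites(Path.cwd()):
        print(f"[ ] {problem}")
```

Run it from the project root; an empty result means every item it knows about is in place. It does not check the Java version or the active conda environment.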

## Environment Activation

```bash
conda activate fraud-shield
```

## Checkpoint Requirements

This notebook requires the Section 11 checkpoint from the EDA notebook:
- `data/checkpoints/section11_enriched_features.parquet`

**Note:** This is a local execution version configured for the `fraud-shield` conda environment on your local machine.

# Preprocessing Pipeline for Fraud Detection - Local Execution Version

**Notebook:** 02-local-preprocessing.ipynb  
**Objective:** Feature selection, encoding, scaling, and class imbalance handling

**Note:** Local execution version - configured for conda environment `fraud-shield`

## Preprocessing Strategy

**SMOTE Parameters:**
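- `k_neighbors`: 5 (the imblearn default, suitable for our dataset size)
- `sampling_strategy`: 0.1 (oversample fraud until the minority-to-majority ratio reaches 0.1, i.e. about 9.1% of the resampled training set; balances performance and computational cost)
- `random_state`: 42 (reproducibility)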
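
To make the effect of `sampling_strategy` concrete, here is a small arithmetic sketch. The function name `smote_target_counts` and the class counts are hypothetical, chosen only for illustration; they are not taken from the dataset.

```python
def smote_target_counts(n_majority: int, n_minority: int, sampling_strategy: float):
    """Return (synthetic samples to add, minority share of the resampled set).

    For a binary problem, a float sampling_strategy is the desired ratio
    n_minority / n_majority after resampling.
    """
    target_minority = int(sampling_strategy * n_majority)
    # imblearn raises an error if the ratio is already exceeded; this sketch returns 0 instead
    n_synthetic = max(target_minority - n_minority, 0)
    final_minority = n_minority + n_synthetic
    minority_share = final_minority / (final_minority + n_majority)
    return n_synthetic, minority_share


# Hypothetical counts, for illustration only
n_synthetic, share = smote_target_counts(
    n_majority=1_000_000, n_minority=5_000, sampling_strategy=0.1
)
print(n_synthetic)     # 95000
print(f"{share:.4f}")  # 0.0909
```

A ratio of 0.1 therefore yields a training set in which roughly one row in eleven is fraud, not one in ten.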

**Feature Selection Approach:**
- Start with Critical Priority features (transaction_count_bin, card_age_bin, hour)
- Add High Priority features (category, day_of_week, month)
- Include interaction features (evening_high_amount, new_card_evening, high_amount_online)
- Use feature importance from tree models to validate and refine
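
The last step, validating the selection with tree-model importances, might look like the sketch below. It uses synthetic stand-in data and hypothetical feature names, and `RandomForestClassifier` is an assumption: the notebook does not say which tree model is intended.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-in data: only the first column carries signal
signal = rng.normal(size=n)
noise = rng.normal(size=(n, 2))
X = np.column_stack([signal, noise])
y = (signal > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

names = ["informative", "noise_a", "noise_b"]  # hypothetical feature names
ranking = sorted(
    zip(names, model.feature_importances_), key=lambda t: t[1], reverse=True
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Features that rank consistently near the bottom are candidates for removal; impurity-based importances can favour high-cardinality features, so treat the ranking as a sanity check rather than a final verdict.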

**Validation Split Method:**
- Time-aware split (respect temporal order)
- Train: Jan 2012 - Dec 2012 (12 months)
- Validation: Jan 2013 - Mar 2013 (3 months)
- Test: Apr 2013 - Jun 2013 (3 months)
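
One detail worth getting right in a time-aware split is the boundary day. Comparing timestamps against `2012-12-31` with `<=` compares against midnight, so transactions later on 31 December would land in validation. The sketch below, with a hypothetical helper `assign_split`, treats each end date as an inclusive calendar day.

```python
from datetime import datetime, timedelta


def assign_split(ts: datetime, train_end: str = "2012-12-31", val_end: str = "2013-03-31") -> str:
    """Assign a timestamp to 'train', 'val', or 'test', treating end dates as inclusive days."""
    # Use the first instant of the following day as an exclusive upper bound
    train_cutoff = datetime.strptime(train_end, "%Y-%m-%d") + timedelta(days=1)
    val_cutoff = datetime.strptime(val_end, "%Y-%m-%d") + timedelta(days=1)
    if ts < train_cutoff:
        return "train"
    if ts < val_cutoff:
        return "val"
    return "test"


print(assign_split(datetime(2012, 12, 31, 18, 30)))  # train
print(assign_split(datetime(2013, 1, 1, 0, 0)))      # val
print(assign_split(datetime(2013, 4, 1, 9, 15)))     # test
```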

In [1]:
# ============================================================
# GLOBAL IMPORTS & DEPENDENCIES
# ============================================================

import os
import sys
from pathlib import Path
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Data Processing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# PySpark (for loading checkpoint data)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Utilities
import pickle
import joblib
from datetime import datetime

print("All dependencies loaded successfully")

All dependencies loaded successfully


In [2]:
# ============================================================
# CONFIGURATION & PATHS
# ============================================================

# Path resolution for local execution
# Calculate PROJECT_ROOT based on notebook location
# Since notebook is in local_notebooks/, project root is parent directory
NOTEBOOK_DIR = Path.cwd()  # Current working directory
# If we're in local_notebooks/, go up one level to get project root
if NOTEBOOK_DIR.name == "local_notebooks":
    PROJECT_ROOT = NOTEBOOK_DIR.parent
else:
    # Fallback: assume we're already at project root
    PROJECT_ROOT = NOTEBOOK_DIR

# Change working directory to project root for consistency
os.chdir(PROJECT_ROOT)

DATA_DIR = PROJECT_ROOT / "data"
CHECKPOINT_DIR = DATA_DIR / "checkpoints"
MODELS_DIR = PROJECT_ROOT / "models"
PROCESSED_DATA_DIR = DATA_DIR / "processed"

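# Create directories if they don't exist
# parents=True also creates data/ when it is missing (e.g. on a fresh clone)
MODELS_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)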

# Checkpoint path from EDA notebook (Section 11 final checkpoint)
CHECKPOINT_SECTION11 = CHECKPOINT_DIR / 'section11_enriched_features.parquet'

# Output paths
PREPROCESSED_TRAIN_PATH = PROCESSED_DATA_DIR / 'train_preprocessed.parquet'
PREPROCESSED_VAL_PATH = PROCESSED_DATA_DIR / 'val_preprocessed.parquet'
PREPROCESSED_TEST_PATH = PROCESSED_DATA_DIR / 'test_preprocessed.parquet'
PREPROCESSER_PATH = MODELS_DIR / 'preprocessor.pkl'
FEATURE_NAMES_PATH = MODELS_DIR / 'feature_names.pkl'

print(f"Project root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")
print(f"Checkpoint directory: {CHECKPOINT_DIR}")
print(f"Models directory: {MODELS_DIR}")

Project root: /Users/abzanganeh/Desktop/projects/fraud-shield-ai
Data directory: /Users/abzanganeh/Desktop/projects/fraud-shield-ai/data
Checkpoint directory: /Users/abzanganeh/Desktop/projects/fraud-shield-ai/data/checkpoints
Models directory: /Users/abzanganeh/Desktop/projects/fraud-shield-ai/models
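
The path logic above depends on the notebook being started from `local_notebooks/` or from the project root. A more location-independent alternative is to walk upward until a marker file is found. This is a sketch under an assumption: that `environment.yml` (the file used to create the conda environment) sits at the project root. The helper name `find_project_root` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "environment.yml") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

Usage could be `PROJECT_ROOT = find_project_root(Path.cwd()) or Path.cwd()`, which falls back to the current directory when no marker is found.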


In [3]:
# ============================================================
# INITIALIZE SPARK SESSION (for loading checkpoint data)
# ============================================================

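# Local mode runs executors inside the driver JVM, so driver memory bounds toPandas().
# "4g" is an assumed value; adjust it to your machine. The setting only takes effect
# when a new JVM starts, so restart the kernel if a Spark session already exists.
spark = SparkSession.builder \
    .appName("FraudDetectionPreprocessing") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()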

print("Spark session initialized")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/25 22:57:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/01/25 22:57:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Spark session initialized


## 1. Load Data from EDA Checkpoint

In [4]:
# ============================================================
# LOAD DATA FROM EDA CHECKPOINT
# ============================================================

def load_eda_checkpoint() -> pd.DataFrame:
    """
    Load the final enriched dataset from EDA notebook checkpoint.
    
    Returns:
        pd.DataFrame: Dataset with all engineered features
    """
    if not CHECKPOINT_SECTION11.exists():
        raise FileNotFoundError(
            f"EDA checkpoint not found: {CHECKPOINT_SECTION11}\n"
            "Please run the EDA notebook (01-local-fraud-detection-eda.ipynb) first."
        )
    
    print(f"Loading checkpoint from: {CHECKPOINT_SECTION11}")
    
    # Load as Spark DataFrame
    spark_df = spark.read.parquet(str(CHECKPOINT_SECTION11))
    
    # Convert to Pandas (for modeling)
    print("Converting Spark DataFrame to Pandas...")
    df = spark_df.toPandas()
    
    print(f"\n✓ Loaded successfully")
    print(f"Shape: {df.shape}")
    print(f"Columns: {len(df.columns)}")
    print(f"\nTarget distribution:")
    print(df['is_fraud'].value_counts())
    print(f"\nFraud rate: {df['is_fraud'].mean():.4%}")
    
    return df

# Load data
df = load_eda_checkpoint()

Loading checkpoint from: /Users/abzanganeh/Desktop/projects/fraud-shield-ai/data/checkpoints/section11_enriched_features.parquet

Converting Spark DataFrame to Pandas...


26/01/25 22:57:55 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
26/01/25 22:58:11 ERROR Executor: Exception in task 2.0 in stage 1.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
	at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$2(SparkPlan.scala:368)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$2$adapted(SparkPlan.scala:368)
	at org.apache.spark.sql.execution.SparkPlan$$Lambda$2546/0x00000008011c6840.apply(Unknown Source)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
	at net.jpountz.lz4.LZ4BlockOut

ConnectionRefusedError: [Errno 61] Connection refused

## 2. Feature Selection

In [None]:
# ============================================================
# FEATURE SELECTION - CRITICAL PRIORITY FEATURES
# ============================================================

# Define feature categories based on EDA findings
CRITICAL_FEATURES = [
    'transaction_count_bin',  # 191.57x ratio
    'card_age_bin',            # 13.77x ratio
    'hour',                    # 36.85x ratio
    'time_bin',              # Derived from hour
    'is_peak_fraud_hour',     # Pattern flag
    'is_new_card',            # Pattern flag
    'is_low_volume_card'      # Pattern flag
]

HIGH_PRIORITY_FEATURES = [
    'category',               # 11.34x ratio
    'day_of_week',            # Temporal pattern
    'month',                  # 2.26x ratio
    'is_peak_fraud_day',      # Pattern flag
    'is_peak_fraud_season',   # Pattern flag
    'is_high_risk_category',  # Pattern flag
    'card_age_days',          # Continuous version
    'transaction_count'       # Continuous version
]

INTERACTION_FEATURES = [
    'evening_high_amount',
    'evening_online_shopping',
    'large_city_evening',
    'new_card_evening',
    'high_amount_online'
]

ENRICHED_FEATURES = [
    'temporal_risk_score',
    'geographic_risk_score',
    'card_risk_score',
    'risk_tier'
]

# Combine all four feature groups (Critical, High Priority, Interaction, Enriched)
SELECTED_FEATURES = CRITICAL_FEATURES + HIGH_PRIORITY_FEATURES + INTERACTION_FEATURES + ENRICHED_FEATURES

# Filter to only features that exist in the dataset
available_features = [f for f in SELECTED_FEATURES if f in df.columns]
missing_features = [f for f in SELECTED_FEATURES if f not in df.columns]

print("Feature Selection Summary:")
print(f"  Critical features: {len([f for f in CRITICAL_FEATURES if f in df.columns])}/{len(CRITICAL_FEATURES)}")
print(f"  High priority features: {len([f for f in HIGH_PRIORITY_FEATURES if f in df.columns])}/{len(HIGH_PRIORITY_FEATURES)}")
print(f"  Interaction features: {len([f for f in INTERACTION_FEATURES if f in df.columns])}/{len(INTERACTION_FEATURES)}")
print(f"  Enriched features: {len([f for f in ENRICHED_FEATURES if f in df.columns])}/{len(ENRICHED_FEATURES)}")
print(f"\n  Total selected: {len(available_features)} features")

if missing_features:
    print(f"\n  ⚠ Missing features (will be skipped): {missing_features}")

# Store feature list for later use
feature_names = available_features.copy()

## 3. Data Splitting (Time-Aware)

In [None]:
# ============================================================
# TIME-AWARE DATA SPLITTING
# ============================================================

def split_data_time_aware(
    df: pd.DataFrame,
    date_col: str = 'trans_date_trans_time',
    train_end: str = '2012-12-31',
    val_end: str = '2013-03-31'
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split data respecting temporal order.
    
    Args:
        df: Input dataframe
        date_col: Name of date column
        train_end: End date for training set (YYYY-MM-DD)
        val_end: End date for validation set (YYYY-MM-DD)
    
    Returns:
        Tuple of (train_df, val_df, test_df)
    """
    # Convert date column to datetime if needed
    if not pd.api.types.is_datetime64_any_dtype(df[date_col]):
        df[date_col] = pd.to_datetime(df[date_col])
    
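    # Treat end dates as inclusive: compare against the start of the following day,
    # so timestamps later on the end date itself stay in the intended split
    train_cutoff = pd.to_datetime(train_end) + pd.Timedelta(days=1)
    val_cutoff = pd.to_datetime(val_end) + pd.Timedelta(days=1)
    
    # Split by date
    train_df = df[df[date_col] < train_cutoff].copy()
    val_df = df[(df[date_col] >= train_cutoff) & (df[date_col] < val_cutoff)].copy()
    test_df = df[df[date_col] >= val_cutoff].copy()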
    
    print("Time-aware split summary:")
    print(f"  Train: {train_df.shape[0]:,} samples ({train_df[date_col].min()} to {train_df[date_col].max()})")
    print(f"  Validation: {val_df.shape[0]:,} samples ({val_df[date_col].min()} to {val_df[date_col].max()})")
    print(f"  Test: {test_df.shape[0]:,} samples ({test_df[date_col].min()} to {test_df[date_col].max()})")
    
    print(f"\nFraud rates:")
    print(f"  Train: {train_df['is_fraud'].mean():.4%}")
    print(f"  Validation: {val_df['is_fraud'].mean():.4%}")
    print(f"  Test: {test_df['is_fraud'].mean():.4%}")
    
    return train_df, val_df, test_df

# Check if we have a date column (try common names)
date_col = None
for col_name in ['trans_date_trans_time', 'merchant_local_time', 'customer_local_time']:
    if col_name in df.columns:
        date_col = col_name
        break

if date_col:
    train_df, val_df, test_df = split_data_time_aware(df, date_col=date_col)
else:
    print("⚠ No date column found - using random split instead")
    train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['is_fraud'])
    val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df['is_fraud'])
    print(f"  Train: {train_df.shape[0]:,} samples")
    print(f"  Validation: {val_df.shape[0]:,} samples")
    print(f"  Test: {test_df.shape[0]:,} samples")

## 4. Feature Encoding & Preprocessing

In [None]:
# ============================================================
# FEATURE ENCODING CLASS
# ============================================================

class FeaturePreprocessor:
    """
    Handles encoding, scaling, and preprocessing of features.
    """
    
    def __init__(self):
        self.label_encoders: Dict[str, LabelEncoder] = {}
        self.scaler = StandardScaler()
        self.feature_names: List[str] = []
        self.categorical_features: List[str] = []
        self.numerical_features: List[str] = []
        self.is_fitted = False
    
    def _identify_feature_types(self, df: pd.DataFrame, features: List[str]) -> None:
        """Identify categorical vs numerical features."""
        self.categorical_features = []
        self.numerical_features = []
        
        for feat in features:
            if feat not in df.columns:
                continue
                
            # Check if categorical (object type or low cardinality integer)
            if df[feat].dtype == 'object' or df[feat].dtype.name == 'category':
                self.categorical_features.append(feat)
            elif df[feat].dtype in ['int64', 'int32', 'float64', 'float32']:
                # Check cardinality for integer features
                if df[feat].dtype in ['int64', 'int32']:
                    unique_count = df[feat].nunique()
                    if unique_count < 20:  # Low cardinality - treat as categorical
                        self.categorical_features.append(feat)
                    else:
                        self.numerical_features.append(feat)
                else:
                    self.numerical_features.append(feat)
            else:
                # Default to numerical
                self.numerical_features.append(feat)
    
    def fit(self, df: pd.DataFrame, features: List[str], target: str = 'is_fraud') -> 'FeaturePreprocessor':
        """
        Fit preprocessor on training data.
        
        Args:
            df: Training dataframe
            features: List of feature names to process
            target: Name of target column
        """
        self.feature_names = [f for f in features if f in df.columns]
        self._identify_feature_types(df, self.feature_names)
        
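        # Fit label encoders for categorical features
        for feat in self.categorical_features:
            le = LabelEncoder()
            # Fill NaN before casting: astype(str) alone would turn NaN into the string 'nan'
            values = df[feat].astype(object).fillna('missing').astype(str)
            # Always reserve a 'missing' class so unseen categories can be mapped to it later
            le.fit(values.tolist() + ['missing'])
            self.label_encoders[feat] = le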
        
        # Prepare numerical features for scaling
        X_numerical = df[self.numerical_features].fillna(0).values if len(self.numerical_features) > 0 else np.array([]).reshape(len(df), 0)
        
        # Fit scaler
        if len(self.numerical_features) > 0:
            self.scaler.fit(X_numerical)
        
        self.is_fitted = True
        return self
    
    def transform(self, df: pd.DataFrame) -> np.ndarray:
        """
        Transform data using fitted encoders and scaler.
        
        Args:
            df: Dataframe to transform
        
        Returns:
            numpy array of transformed features
        """
        if not self.is_fitted:
            raise ValueError("Preprocessor must be fitted before transform")
        
        encoded_features = []
        
        # Encode categorical features
        for feat in self.categorical_features:
            if feat in df.columns:
                le = self.label_encoders[feat]
                # Fill NaN before casting: astype(str) alone would turn NaN into the string 'nan'
                encoded = df[feat].astype(object).fillna('missing').astype(str)
                # Map categories unseen during fit to 'missing' (vectorized membership test)
                encoded = encoded.where(encoded.isin(le.classes_), 'missing')
                # Extend the encoder only if 'missing' was not already a known class
                if 'missing' not in le.classes_:
                    le.classes_ = np.append(le.classes_, 'missing')
                encoded_features.append(le.transform(encoded).reshape(-1, 1))
        
        # Scale numerical features
        if len(self.numerical_features) > 0:
            X_numerical = df[self.numerical_features].fillna(0).values
            X_numerical_scaled = self.scaler.transform(X_numerical)
            encoded_features.append(X_numerical_scaled)
        
        # Combine all features
        if encoded_features:
            X_combined = np.hstack(encoded_features)
        else:
            X_combined = np.array([]).reshape(len(df), 0)
        
        return X_combined
    
    def get_feature_names(self) -> List[str]:
        """Get list of feature names after encoding."""
        feature_names_out = []
        
        # Categorical features (label encoding yields one output column per feature)
        for feat in self.categorical_features:
            feature_names_out.append(feat)
        
        # Numerical features
        feature_names_out.extend(self.numerical_features)
        
        return feature_names_out

# Initialize preprocessor
preprocessor = FeaturePreprocessor()
preprocessor.fit(train_df, feature_names)

print("Preprocessor fitted successfully")
print(f"  Categorical features: {len(preprocessor.categorical_features)}")
print(f"  Numerical features: {len(preprocessor.numerical_features)}")
print(f"  Total output features: {len(preprocessor.get_feature_names())}")
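
The key idea in `FeaturePreprocessor` is that the category-to-integer mapping is learned from the training split only, with nulls and categories first seen at transform time sent to a shared `'missing'` bucket. The standalone sketch below illustrates that idea with plain dictionaries; the helper names and the category values are hypothetical, and it is not a drop-in replacement for the class above.

```python
from typing import Dict, Iterable, List, Optional

MISSING = "missing"


def fit_codes(train_values: Iterable[Optional[str]]) -> Dict[str, int]:
    """Build a category -> integer mapping from training values only."""
    categories = {MISSING if v is None else str(v) for v in train_values}
    categories.add(MISSING)  # always reserve a bucket for nulls and unseen categories
    return {cat: code for code, cat in enumerate(sorted(categories))}


def apply_codes(values: Iterable[Optional[str]], mapping: Dict[str, int]) -> List[int]:
    """Encode values, sending nulls and categories unseen during fit to the 'missing' bucket."""
    return [mapping.get(MISSING if v is None else str(v), mapping[MISSING]) for v in values]


# Hypothetical category values, for illustration only
mapping = fit_codes(["grocery_pos", "shopping_net", None, "grocery_pos"])
print(mapping)  # {'grocery_pos': 0, 'missing': 1, 'shopping_net': 2}
print(apply_codes(["shopping_net", "travel", None], mapping))  # [2, 1, 1]
```

Because the mapping is frozen after fitting, validation and test rows cannot influence the encoding, which avoids leakage across the time-aware split.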

## 5. Apply Preprocessing & Handle Class Imbalance

In [None]:
# ============================================================
# APPLY PREPROCESSING TO ALL SPLITS
# ============================================================

# Transform all splits
X_train = preprocessor.transform(train_df)
y_train = train_df['is_fraud'].values

X_val = preprocessor.transform(val_df)
y_val = val_df['is_fraud'].values

X_test = preprocessor.transform(test_df)
y_test = test_df['is_fraud'].values

print("Preprocessing applied:")
print(f"  Train: X={X_train.shape}, y={y_train.shape}")
print(f"  Validation: X={X_val.shape}, y={y_val.shape}")
print(f"  Test: X={X_test.shape}, y={y_test.shape}")

print(f"\nOriginal fraud rates:")
print(f"  Train: {y_train.mean():.4%}")
print(f"  Validation: {y_val.mean():.4%}")
print(f"  Test: {y_test.mean():.4%}")

In [None]:
# ============================================================
# APPLY SMOTE FOR CLASS IMBALANCE HANDLING
# ============================================================

# SMOTE parameters (as decided in preprocessing strategy)
smote = SMOTE(
    sampling_strategy=0.1,  # minority:majority ratio after resampling (fraud ≈ 9.1% of rows)
    k_neighbors=5,
    random_state=42
)

print("Applying SMOTE to training set...")
print(f"  Before SMOTE: {X_train.shape[0]:,} samples (fraud: {y_train.sum():,}, {y_train.mean():.4%})")

X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"  After SMOTE: {X_train_resampled.shape[0]:,} samples (fraud: {y_train_resampled.sum():,}, {y_train_resampled.mean():.4%})")
print(f"  Oversampling ratio: {X_train_resampled.shape[0] / X_train.shape[0]:.2f}x")

# Note: SMOTE is only applied to training set
# Validation and test sets remain imbalanced (real-world scenario)
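
SMOTE creates each synthetic sample by interpolating between a minority-class point and one of its `k_neighbors` nearest minority neighbours. The sketch below shows only that interpolation step, not the neighbour search or imblearn's actual implementation; the function name and the two example rows are hypothetical.

```python
import random
from typing import List, Sequence


def interpolate(x: Sequence[float], neighbor: Sequence[float], rng: random.Random) -> List[float]:
    """Create one synthetic point on the segment between x and a minority-class neighbour."""
    lam = rng.random()  # uniform in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(42)
# Hypothetical rows: [label-encoded category code, scaled amount]
x = [0.0, -0.5]
neighbor = [3.0, 1.5]
synthetic = interpolate(x, neighbor, rng)
print(synthetic)
```

Note the first coordinate: interpolating between category codes 0 and 3 generally produces a fractional value that corresponds to no real category. Tree models often tolerate this, but if it matters, imblearn's `SMOTENC` is designed for data that mixes categorical and continuous columns.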

## 6. Save Preprocessed Data & Preprocessor

In [None]:
# ============================================================
# SAVE PREPROCESSED DATA
# ============================================================

# Save as DataFrames for easier loading
train_preprocessed = pd.DataFrame(X_train_resampled, columns=preprocessor.get_feature_names())
train_preprocessed['is_fraud'] = y_train_resampled

val_preprocessed = pd.DataFrame(X_val, columns=preprocessor.get_feature_names())
val_preprocessed['is_fraud'] = y_val

test_preprocessed = pd.DataFrame(X_test, columns=preprocessor.get_feature_names())
test_preprocessed['is_fraud'] = y_test

# Save to parquet
train_preprocessed.to_parquet(PREPROCESSED_TRAIN_PATH, index=False)
val_preprocessed.to_parquet(PREPROCESSED_VAL_PATH, index=False)
test_preprocessed.to_parquet(PREPROCESSED_TEST_PATH, index=False)

print("Preprocessed data saved:")
print(f"  Train: {PREPROCESSED_TRAIN_PATH}")
print(f"  Validation: {PREPROCESSED_VAL_PATH}")
print(f"  Test: {PREPROCESSED_TEST_PATH}")

# Save preprocessor
joblib.dump(preprocessor, PREPROCESSER_PATH)
print(f"\nPreprocessor saved: {PREPROCESSER_PATH}")

# Save feature names
with open(FEATURE_NAMES_PATH, 'wb') as f:
    pickle.dump(preprocessor.get_feature_names(), f)
print(f"Feature names saved: {FEATURE_NAMES_PATH}")

print("\n✓ Preprocessing pipeline complete!")
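
A downstream notebook that reloads these artifacts may want to confirm that the saved feature list still matches the Parquet columns. The sketch below round-trips a hypothetical feature list through pickle and checks the alignment; `check_feature_alignment` is an illustrative helper, not part of the pipeline above.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Sequence


def check_feature_alignment(
    feature_names: Sequence[str], columns: Sequence[str], target: str = "is_fraud"
) -> None:
    """Raise ValueError unless `columns` is exactly the saved features followed by the target."""
    expected = list(feature_names) + [target]
    if list(columns) != expected:
        raise ValueError(f"Column mismatch: expected {expected}, got {list(columns)}")


# Round-trip a hypothetical feature list through pickle
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_names.pkl"
    with open(path, "wb") as f:
        pickle.dump(["hour", "category"], f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

check_feature_alignment(restored, ["hour", "category", "is_fraud"])  # passes silently
```

When reloading the preprocessor itself with `joblib.load`, the `FeaturePreprocessor` class must be defined or importable in the loading session, because pickle-based formats store a reference to the class rather than its code.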