# Feature Engineering

- **Purpose:** Missing value handling and feature engineering for fraud detection  
- **Author:** Devbrew LLC  
- **Last Updated:** October 18, 2025  
- **Status:** In Progress  
- **License:** Apache 2.0 (Code) | Non-commercial (Data)

---

## Dataset License Notice

This notebook uses the **IEEE-CIS Fraud Detection dataset** from Kaggle.

**Dataset License:** Non-commercial research use only
- You must download the dataset yourself from [Kaggle IEEE-CIS Competition](https://www.kaggle.com/c/ieee-fraud-detection)
- You must accept the competition rules before downloading
- You may not use the dataset for commercial purposes
- You may not redistribute the raw dataset

**Setup Instructions:** See [`../data_catalog/README.md`](../data_catalog/README.md) for download instructions.

**Code License:** This notebook's code is licensed under Apache 2.0 (open source).

---

## Notebook Configuration

### Environment Setup

We import the required libraries, set display and plotting defaults, and fix the random seed so that results are consistent across runs. Every later feature engineering step relies on these settings.

In [12]:
import warnings
from pathlib import Path
import json
from typing import Optional, Tuple, List

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", '{:.2f}'.format)

# Plotting configuration
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("\nEnvironment configured successfully")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")


Environment configured successfully
pandas: 2.3.3
numpy: 2.3.3
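
As a small illustration of what the fixed seed provides, the sketch below seeds two NumPy generators with the same value and checks that their draws match. It uses `np.random.default_rng` rather than the global `np.random.seed` call in the cell above, and the helper name `sample_noise` is hypothetical.

```python
import numpy as np

RANDOM_STATE = 42  # same seed value as the configuration cell


def sample_noise(seed: int, size: int = 5) -> np.ndarray:
    """Draw `size` standard-normal values from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(size)


first = sample_noise(RANDOM_STATE)
second = sample_noise(RANDOM_STATE)
different = sample_noise(RANDOM_STATE + 1)

print(np.array_equal(first, second))     # same seed: identical draws
print(np.array_equal(first, different))  # different seed: different draws
```

Passing a seeded generator into functions explicitly tends to be easier to reason about than relying on global state, although either approach gives repeatable runs.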


### Path Configuration

We define the project directory structure and check that the IEEE-CIS training files are present before feature engineering starts. The check reports each file as found or missing and points to the download instructions when something is absent.

In [13]:
# Project paths
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data_catalog"
PROCESSED_DIR = DATA_DIR / "processed"
IEEE_CIS_DIR = DATA_DIR / "ieee-fraud"
NOTEBOOKS_DIR = PROJECT_ROOT / "notebooks"

# Ensure processed directory exists
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Validate required data
def validate_required_data() -> dict:
    """Validate that required datasets exist before feature engineering"""
    paths_status = {
        'IEEE Train Transaction:': (IEEE_CIS_DIR / 'train_transaction.csv').exists(),
        'IEEE Train Identity:': (IEEE_CIS_DIR / 'train_identity.csv').exists(),
    }

    print("\nData Availability Check:")
    for name, exists in paths_status.items():
        status = "Found" if exists else "Missing"
        print(f" • {name} {status}")
    
    all_exist = all(paths_status.values())
    if not all_exist:
        print("\n[WARNING] Some datasets are missing. Check data_catalog/README.md for instructions")
    else:
        print("\nAll required datasets are available")
    
    return paths_status
    
path_status = validate_required_data()


Data Availability Check:
 • IEEE Train Transaction: Found
 • IEEE Train Identity: Found

All required datasets are available
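
The check above only prints a warning when a file is missing. A stricter variant, sketched here under the assumption that stopping early is preferable in an automated run, raises an exception instead. The helper names `find_missing_files` and `require_files` are hypothetical, and the demo uses a temporary directory rather than the real `data_catalog` layout.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_missing_files(base_dir: Path, filenames: Iterable[str]) -> List[str]:
    """Return the names in `filenames` that do not exist under `base_dir`."""
    return [name for name in filenames if not (base_dir / name).exists()]


def require_files(base_dir: Path, filenames: Iterable[str]) -> None:
    """Raise FileNotFoundError if any required file is absent."""
    missing = find_missing_files(base_dir, filenames)
    if missing:
        raise FileNotFoundError(f"Missing required files in {base_dir}: {missing}")


# Demo against a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "train_transaction.csv").write_text("TransactionID\n")
    required = ["train_transaction.csv", "train_identity.csv"]

    missing = find_missing_files(demo_dir, required)
    print(missing)

    try:
        require_files(demo_dir, required)
        raised = False
    except FileNotFoundError:
        raised = True
    print(raised)
```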


## Helper Functions

We define reusable utilities for missing value analysis, imputation, and feature engineering. Each function has type hints and a docstring and prints a short progress summary. Where a required column may be absent (`DeviceInfo`, `card1`), the function warns and returns its input unchanged.

Keeping this logic in functions makes the later feature engineering steps reproducible and easier to maintain.

In [15]:
def analyze_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Comprehensive missing value analysis with categorization.

    Args:
        df: DataFrame to analyze

    Returns:
        DataFrame with missing value statistics
    """
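    # Compute counts once, then order both statistics by missing percentage
    missing_count = df.isnull().sum()
    missing_pct = (missing_count / len(df) * 100).sort_values(ascending=False)
    missing_summary = pd.DataFrame({
        "column": missing_pct.index,
        "missing_pct": missing_pct.values,
        # Reorder counts to match the sorted percentages
        "missing_count": missing_count.loc[missing_pct.index].values
    })

    # Categorize by severity (first edge is below 0 so that 0% lands in 'none')
    missing_summary['category'] = pd.cut(
        missing_summary['missing_pct'],
        bins=[-0.1, 0, 50, 90, 100],
        labels=['none', 'low', 'medium', 'high']
    )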
    
    return missing_summary

def apply_missing_value_strategy(
    df: pd.DataFrame,
    drop_threshold: float = 90.0,
    ) -> Tuple[pd.DataFrame, List[str]]:
    """
    Apply missing value handling strategy

    Strategy:
    - Drop columns whose missing percentage exceeds drop_threshold (default 90%)
    - Impute numeric columns with median
    - Impute categorical columns with mode, or "Unknown" if the column has no mode
    

    Args:
        df: DataFrame to process
        drop_threshold: Percentage threshold for dropping columns

    Returns:
        Tuple of (cleaned_df, dropped_columns)
    """
    print(f"\nApplying Missing Value Strategy\n")

    # Analyze missing values
    missing_summary = analyze_missing_values(df)
   
    # Identify columns to drop
    cols_to_drop = missing_summary[missing_summary['missing_pct'] > drop_threshold]['column'].tolist()

    print(f"Dropping {len(cols_to_drop)} columns with > {drop_threshold}% missing values: {cols_to_drop}")
    df_clean = df.drop(columns=cols_to_drop)

    # Separate numeric and categorical columns
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df_clean.select_dtypes(include=['object']).columns.tolist()

    # Remove target if present
    if 'isFraud' in numeric_cols:
        numeric_cols.remove('isFraud')
    

    # Impute numeric with median
    print(f"\nImputing {len([col for col in numeric_cols if df_clean[col].isnull().any()])} numeric columns with median")

    for col in numeric_cols:
        if df_clean[col].isnull().any():
            median_val = df_clean[col].median()
            # Assign back: chained inplace fillna may not modify df_clean
            df_clean[col] = df_clean[col].fillna(median_val)
    
    # Impute categorical with mode or "Unknown"
    print(f"\nImputing {len([col for col in categorical_cols if df_clean[col].isnull().any()])} categorical columns with mode or 'Unknown'")
    for col in categorical_cols:
        if df_clean[col].isnull().any():
            mode_val = df_clean[col].mode()
            # mode() is empty only when the column has no non-null values
            if len(mode_val) > 0:
                df_clean[col] = df_clean[col].fillna(mode_val[0])
            else:
                df_clean[col] = df_clean[col].fillna("Unknown")
    
    # Verify
    remaining_missing = df_clean.isnull().sum().sum()
    print(f"\nMissing values after strategy: {remaining_missing}")
    print(f"Final shape: {df_clean.shape}")

    return df_clean, cols_to_drop

def calculate_velocity_features(
    df: pd.DataFrame,
    group_col: str,
    time_col: str,
    windows: List[int] = [3600, 86400],
    ) -> pd.DataFrame:
    """
    Calculate transaction velocity features (count in time window).

    Args:
        df: DataFrame sorted by time
        group_col: Column to group by (e.g. user_id)
        time_col: Column containing transaction timestamps
        windows: Time windows in seconds [1h=3600, 24h=86400]

    Returns:
        DataFrame with velocity columns added
    """
    print(f"\nCalculating Velocity Features for {group_col}\n")

    df = df.sort_values([group_col, time_col]).reset_index(drop=True)
    
    for window in windows:
        # Format whole hours without a trailing ".0" (e.g. "1h", "24h")
        window_name = f"{window / 3600:g}h" if window >= 3600 else f"{window}s"
        col_name = f'{group_col}_txn_{window_name}'

        print(f"Calculating {col_name}...")

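        # Expanding window over each group's history: count earlier transactions
        # within `window` seconds of the current one (the -1 excludes the current row)
        df[col_name] = df.groupby(group_col)[time_col].transform(
            lambda x: x.rolling(window=len(x), min_periods=1).apply(
                lambda times: ((times.iloc[-1] - times) <= window).sum() - 1,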
                raw=False
            )
        )
    
    print(f"\nVelocity features created!")
    return df

def engineer_time_features(df: pd.DataFrame, time_col: str = 'TransactionDT') -> pd.DataFrame:
    """
    Engineer time-based features from transaction timestamp.

    Args:
        df: DataFrame with time column
        time_col: Name of time column (seconds since reference)

    Returns:
        DataFrame with time features added
    """
    print(f"\nEngineering Time Features\n")

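    df[f'{time_col}_hour'] = (df[time_col] // 3600) % 24
    # Day index counts from the dataset's reference time, so treating 5 and 6 as
    # the weekend assumes day 0 falls on a Monday
    df[f'{time_col}_day'] = (df[time_col] // 86400) % 7
    df[f'{time_col}_is_weekend'] = df[f'{time_col}_day'].isin([5, 6]).astype(int)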

    print(f'Created: {time_col}_hour, {time_col}_day, {time_col}_is_weekend features')
    return df

def engineer_device_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Engineer device reuse features.

    Args:
        df: DataFrame with DeviceInfo column

    Returns:
        DataFrame with device features added
    """
    print(f"\nEngineering Device Features\n")
    
    if 'DeviceInfo' not in df.columns:
        print(f"[WARNING] DeviceInfo not found. Skipping device features.")
        return df

    # Cards per device
    device_card_counts = df.groupby('DeviceInfo')['card1'].nunique().to_dict()
    df['device_card_count'] = df['DeviceInfo'].map(device_card_counts)

    # Multi-card device flag
    df['device_multi_card'] = (df['device_card_count'] > 1).astype(int)

    print(f"Created: device_card_count, device_multi_card features")
    return df

def engineer_amount_features(df: pd.DataFrame, amount_col: str = 'TransactionAmt') -> pd.DataFrame:
    """
    Engineer amount-based statistical features.

    Args:
        df: DataFrame with amount column
        amount_col: Name of amount column

    Returns:
        DataFrame with amount features added
    """
    print(f"\nEngineering Amount Features\n")

    if 'card1' not in df.columns:
        print(f"[WARNING] card1 not found. Skipping amount features.")
        return df
    
    # Per-card statistics
    cards_stats = df.groupby('card1')[amount_col].agg(['mean', 'std']).reset_index()
    cards_stats.columns = ['card1', 'card_amt_mean', 'card_amt_std']

    df = df.merge(cards_stats, on='card1', how='left')
    

    # Z-score
    df['amt_zscore'] = (df[amount_col] - df['card_amt_mean']) / (df['card_amt_std'] + 1e-6)

    # Cards with a single transaction have NaN std, so their z-score is set to 0
    df['amt_zscore'] = df['amt_zscore'].fillna(0)
    
    print(f"Created: card_amt_mean, card_amt_std, amt_zscore features")
    return df

print('\nHelper functions loaded')    


Helper functions loaded
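
To make the missing value strategy concrete, here is a minimal sketch on a toy DataFrame. It repeats the severity binning and the drop-then-impute steps inline instead of calling the helpers, so it runs on its own. The toy columns and values are invented for illustration and are not taken from the IEEE-CIS data.

```python
import numpy as np
import pandas as pd

# Toy data with known missing percentages per column (illustrative only).
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 40.0],              # 25% missing
    "email": ["a.com", None, None, None],              # 75% missing
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan],  # 100% missing
    "complete": [1, 2, 3, 4],                          # 0% missing
})

missing_pct = toy.isnull().mean() * 100

# Bin edges start below zero so that exactly 0% falls in the 'none' bin.
severity = pd.cut(
    missing_pct,
    bins=[-0.1, 0, 50, 90, 100],
    labels=["none", "low", "medium", "high"],
)
print(severity.astype(str).to_dict())

# Drop columns above the threshold, then impute what remains.
drop_threshold = 90.0
kept = toy.loc[:, missing_pct <= drop_threshold].copy()
kept["amount"] = kept["amount"].fillna(kept["amount"].median())
kept["email"] = kept["email"].fillna(kept["email"].mode()[0])
print(kept)
```

The bins are right-inclusive, so a column with exactly 50% missing values is labelled `low` and one with exactly 90% is labelled `medium`.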

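The velocity helper re-scans each group's full history for every row, which grows quadratically with group size. The sketch below shows one possible alternative that should give the same counts using `np.searchsorted` on timestamps sorted within each group. It is not the notebook's implementation; the function name `count_prior_in_window` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def count_prior_in_window(times: pd.Series, window: int) -> pd.Series:
    """For each timestamp, count earlier rows in the group within `window` seconds.

    Assumes `times` is sorted ascending within the group.
    """
    t = times.to_numpy()
    # Index of the first timestamp that is >= t[i] - window.
    start = np.searchsorted(t, t - window, side="left")
    # Rows start..i-1 fall inside the window; the current row is excluded.
    counts = np.arange(len(t)) - start
    return pd.Series(counts, index=times.index)


toy = pd.DataFrame({
    "card1": [1, 1, 1, 1, 2, 2],
    "TransactionDT": [0, 1800, 3600, 90000, 100, 200],
})
toy = toy.sort_values(["card1", "TransactionDT"]).reset_index(drop=True)

# Extra keyword arguments to transform are forwarded to the function.
toy["card1_txn_1h"] = toy.groupby("card1")["TransactionDT"].transform(
    count_prior_in_window, window=3600
)
print(toy)
```

For card 1, the transaction at 3600 seconds has two earlier transactions within the previous hour (at 0 and 1800), while the one at 90000 has none.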