<h1 style="color:#FB4834;"> 02 - Feature Engineering </h1>


This notebook performs Feature Engineering on the fraud detection dataset.

**Focus: Minimizing False Alerts for Legitimate Frequent Customers**

***

In [15]:
import pandas as pd
import numpy as np
import os

RAW_DATA_PATH = '../data/raw/'
PROCESSED_DATA_PATH = '../data/processed/'
DATA_FILE = 'transactions.csv'

os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)

In [16]:
input_filepath = os.path.join(RAW_DATA_PATH, DATA_FILE)
print(f"Loading dataset from: {input_filepath}...")
if not os.path.exists(input_filepath):
    raise FileNotFoundError(f"Data file not found at: {input_filepath}")


df = pd.read_csv(input_filepath)

# Sort chronologically so the per-customer shift/expanding/cumcount features below only see earlier transactions
df = df.sort_values('unix_time').reset_index(drop=True)

print("Dataset loaded successfully.")
print(f"Original shape: {df.shape}")



df_fe = df.copy()

Loading dataset from: ../data/raw/transactions.csv...
Dataset loaded successfully.
Original shape: (1852394, 35)


In [27]:
df.columns

Index(['cc_num', 'merchant', 'category', 'amt', 'first', 'last', 'gender',
       'street', 'city', 'state', 'zip', 'lat', 'long', 'city_pop', 'job',
       'dob', 'trans_num', 'unix_time', 'merch_lat', 'merch_long', 'is_fraud',
       'amt_month', 'amt_year', 'amt_month_shopping_net_spend',
       'count_month_shopping_net', 'first_time_at_merchant',
       'dist_between_client_and_merch', 'trans_month', 'trans_day', 'hour',
       'year', 'times_shopped_at_merchant', 'times_shopped_at_merchant_year',
       'times_shopped_at_merchant_month', 'times_shopped_at_merchant_day'],
      dtype='object')

## 1. Customer Behavior Features
These features help establish what's "normal" for each customer, reducing false alerts for regular behavior.

- `customer_preferred_hour`: Most common hour at which the customer transacts
- `customer_preferred_category`: Most common category the customer shops in
- `is_preferred_hour`: Flag set when the transaction falls in the customer's preferred hour
- `is_preferred_category`: Flag set when the transaction is in the customer's preferred category
- `is_frequent_merchant`: Flag set once the customer has at least 8 earlier transactions at this merchant

**Why?** Legitimate customers tend to have consistent patterns. Deviations might indicate fraud.
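
To make the `is_frequent_merchant` flag concrete, here is a minimal sketch on invented toy data. The threshold of 8 earlier transactions is the one used in this notebook's code; the toy rows and the helper column `prior_visits` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy data (invented): one customer with ten visits to merchant "A" and one to "B",
# already sorted in time order.
toy = pd.DataFrame({
    "cc_num": [1] * 11,
    "merchant": ["A"] * 5 + ["B"] + ["A"] * 5,
})

# cumcount() numbers the rows of each (customer, merchant) group from 0,
# so it equals the number of earlier transactions at that merchant.
toy["prior_visits"] = toy.groupby(["cc_num", "merchant"]).cumcount()

# Flag the transaction once at least 8 earlier visits exist.
toy["is_frequent_merchant"] = (toy["prior_visits"] >= 8).astype(int)

print(toy)
```

Only the ninth and tenth visits to merchant "A" are flagged here, which shows how strict this threshold is for a customer with a short history.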


In [46]:
def create_customer_behavior_features(df):
    """Create features related to customer behavior patterns"""
    df = df.copy()
    group = df.groupby("cc_num")

    # Regular transaction patterns.
    # transform() broadcasts the per-customer mode back onto every row of that
    # customer; apply() would return one value per cc_num, which does not align
    # with the row index of df.
    # Note: the mode is taken over all but the customer's last transaction,
    # so it is not a strictly point-in-time value.
    df["customer_preferred_hour"] = group["hour"].transform(
        lambda s: s.shift().mode().iloc[0] if len(s) > 1 else -1
    )
    df["customer_preferred_category"] = group["category"].transform(
        lambda s: s.shift().mode().iloc[0] if len(s) > 1 else "_none"
    )

    # Number of earlier transactions by this customer at this merchant (0-based)
    df["times_prior_at_merchant"] = df.groupby(["cc_num", "merchant"]).cumcount()

    # Transaction regularity scores
    df["is_preferred_hour"] = (df["hour"] == df["customer_preferred_hour"]).astype(int)
    df["is_preferred_category"] = (
        df["category"] == df["customer_preferred_category"]
    ).astype(int)

    # ! The threshold of 8 earlier visits is arbitrary and debatable
    df["is_frequent_merchant"] = (df["times_prior_at_merchant"] >= 8).astype(int)

    print("Created customer behavior features.")
    return df
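
If the preferred hour should only reflect transactions that happened before the current one, a running mode is one option. The sketch below is an assumption about how that could be done, not the method used in this notebook; `prior_mode` and the column `prior_preferred_hour` are hypothetical names, and the rows are assumed to be sorted by time already.

```python
from collections import Counter

import pandas as pd


def prior_mode(values, default=-1):
    """For each position, return the most common value among the earlier items.

    Ties keep the value that reached the count first; positions with no
    history get `default`.
    """
    counts = Counter()
    best = default
    out = []
    for v in values:
        out.append(best)          # mode of everything seen so far
        counts[v] += 1
        # Only the value just incremented can overtake the current mode.
        if counts[v] > counts[best]:
            best = v
    return out


# Toy data (invented), already in time order.
toy = pd.DataFrame({
    "cc_num": [1, 1, 1, 1, 1, 1, 2],
    "hour":   [9, 9, 14, 14, 14, 9, 20],
})

toy["prior_preferred_hour"] = toy.groupby("cc_num")["hour"].transform(
    lambda s: pd.Series(prior_mode(s.tolist()), index=s.index)
)
print(toy)
```

The Python loop runs once per customer, so on a dataset of this size it may be noticeably slower than the vectorized features elsewhere in the notebook.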

## 2. Amount Pattern Features
These features detect if transaction amounts are unusual for the specific customer.

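- `customer_amt_mean`: Running mean of the customer's earlier transaction amounts (NaN for their first transaction)
- `customer_amt_std`: Running standard deviation of the customer's earlier amounts (set to 0.01 when undefined)
- `amt_zscore_by_customer`: How many standard deviations this amount lies from the customer's running mean
- `amt_pct_diff_from_mean`: Relative difference between this amount and the customer's running mean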

**Why?** Frequent customers have consistent spending patterns. Sudden large deviations might indicate fraud.


In [18]:
def create_amount_pattern_features(df):
    """Create features related to amount patterns per customer"""
    df = df.copy()

    # Customer's normal amount ranges
    group = df.groupby("cc_num", group_keys=False)

    df["customer_amt_mean"] = group["amt"].apply(lambda s: s.shift().expanding().mean())
    df["customer_amt_std"] = (
        group["amt"].apply(lambda s: s.shift().expanding().std()).fillna(0.01)
    )

    # Transaction amount unusualness
    df["amt_zscore_by_customer"] = (df["amt"] - df["customer_amt_mean"]) / df[
        "customer_amt_std"
    ].clip(lower=0.01)
    df["amt_pct_diff_from_mean"] = (df["amt"] - df["customer_amt_mean"]) / df[
        "customer_amt_mean"
    ].fillna(df["amt"])

    print("Created amount pattern features.")
    return df
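
The shift-then-expand pattern used above can be checked on a small example. This is a sketch on invented amounts; it uses `transform` instead of `apply`, which should give the same row-aligned result for this case.

```python
import pandas as pd

# Toy data (invented), already in time order within each customer.
toy = pd.DataFrame({
    "cc_num": [1, 1, 1, 2, 2],
    "amt":    [10.0, 20.0, 60.0, 5.0, 7.0],
})

# shift() drops the current row from its own statistics, and expanding()
# then averages everything that came before it.
toy["prior_mean"] = toy.groupby("cc_num")["amt"].transform(
    lambda s: s.shift().expanding().mean()
)

print(toy)
```

Each customer's first row has no history, so its running mean is NaN. The third row of customer 1 is compared against 15.0, the mean of 10.0 and 20.0, and never against its own value of 60.0.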

## 3. Time Pattern Features
These features look at when customers typically transact.

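- `customer_tx_hour_std`: Standard deviation of the customer's transaction hours, computed over their full history
- `time_pattern_regularity`: Timing consistency score, `1 / (1 + customer_tx_hour_std)`; values closer to 1 mean more regular timing
- `high_frequency_day`: Flag set when `times_shopped_at_merchant_day` is greater than 1
- `high_frequency_month`: Flag set when `times_shopped_at_merchant_month` is greater than 5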

**Why?** Regular customers often transact at similar times. Unusual timing might indicate fraud.


In [19]:
def create_time_pattern_features(df):
    """Create features related to transaction timing patterns"""
    df = df.copy()
    
    # Time-based patterns
    df['customer_tx_hour_std'] = df.groupby('cc_num')['hour'].transform('std')
    df['time_pattern_regularity'] = 1 / (1 + df['customer_tx_hour_std'])
    
    # Using existing merchant time features
    df['high_frequency_day'] = (df['times_shopped_at_merchant_day'] > 1).astype(int)
    df['high_frequency_month'] = (df['times_shopped_at_merchant_month'] > 5).astype(int)
    
    print("Created time pattern features.")
    return df
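
The function above computes the hour spread over each customer's whole history, so a row also sees later transactions. A possible variant that only uses earlier transactions is sketched below on invented data. The column names `prior_hour_std` and `prior_regularity` are hypothetical, and filling a missing spread with 0 (treating a customer with fewer than two earlier transactions as perfectly regular) is an assumption, not something this notebook prescribes.

```python
import pandas as pd

# Toy data (invented), already in time order.
toy = pd.DataFrame({
    "cc_num": [1, 1, 1, 1],
    "hour":   [9, 9, 11, 23],
})

# Standard deviation of the hours of earlier transactions only.
prior_std = toy.groupby("cc_num")["hour"].transform(
    lambda s: s.shift().expanding().std()
)

# Assumption: no usable history is treated as zero spread.
toy["prior_hour_std"] = prior_std.fillna(0.0)

# Same regularity formula as in the notebook, applied to the prior-only spread.
toy["prior_regularity"] = 1 / (1 + toy["prior_hour_std"])

print(toy)
```

The score stays at 1.0 until the customer's earlier hours start to differ, and drops below 1.0 on the last row, whose history contains 9, 9 and 11.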

## 4. Velocity Features
These features look at transaction frequency patterns.

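- `merchant_velocity_score`: Weighted sum of the daily, monthly and yearly merchant visit counts, standardized to zero mean and unit variance
- The weights (0.5 daily, 0.3 monthly, 0.2 yearly) put more emphasis on recent activity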

**Why?** Helps distinguish between genuine frequent shopping and suspicious rapid transactions.


In [20]:
def create_velocity_features(df):
    """Create transaction velocity features using existing merchant metrics"""
    df = df.copy()
    
    # Using the existing merchant frequency features
    df['merchant_velocity_score'] = (
        df['times_shopped_at_merchant_day'] * 0.5 +
        df['times_shopped_at_merchant_month'] * 0.3 +
        df['times_shopped_at_merchant_year'] * 0.2
    )
    
    # Normalize the velocity score
    df['merchant_velocity_score'] = (df['merchant_velocity_score'] - df['merchant_velocity_score'].mean()) / df['merchant_velocity_score'].std()
    
    print("Created velocity features.")
    return df
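
The weighting and normalization can be traced on three invented rows. The weights are the ones used in the function above; the counts are made up.

```python
import pandas as pd

# Toy counts (invented) for three transactions.
toy = pd.DataFrame({
    "times_shopped_at_merchant_day":   [1, 2, 1],
    "times_shopped_at_merchant_month": [2, 6, 1],
    "times_shopped_at_merchant_year":  [10, 20, 5],
})

# Weighted sum: 0.5 * day + 0.3 * month + 0.2 * year
raw_score = (
    toy["times_shopped_at_merchant_day"] * 0.5
    + toy["times_shopped_at_merchant_month"] * 0.3
    + toy["times_shopped_at_merchant_year"] * 0.2
)

# Z-score normalization (pandas uses the sample standard deviation, ddof=1).
z_score = (raw_score - raw_score.mean()) / raw_score.std()

print(raw_score.tolist())
print(z_score.tolist())
```

In the first toy row the yearly term contributes 2.0 of the raw score of 3.1 even though it has the smallest weight, because yearly counts are naturally larger. Whether that matches the intended emphasis on recent activity is worth checking on the real data.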

## 5. Basic Transformations
Foundation features that support the above analysis.

- `amt_log1p`: Log-transformed transaction amount, `log(1 + amt)`, which compresses large values
- `day_of_week`, `is_weekend`: Time-based context
- `transaction_datetime`: Unified timestamp for analysis

**Why?** Makes it easier to compare transactions across different scales and times.

---

In [21]:
def create_datetime_features(df):
    """Create datetime-related features from separate time columns"""
    df = df.copy()
    
    # Create datetime from components
    time_cols_map = {
        'year': 'year', 
        'trans_month': 'month', 
        'trans_day': 'day', 
        'hour': 'hour'
    }
    
    df['transaction_datetime'] = pd.to_datetime(
        df[list(time_cols_map)].rename(columns=time_cols_map),
        errors='coerce'
    )
    
    # Extract time components
    if not df['transaction_datetime'].isnull().all():
        df['day_of_week'] = df['transaction_datetime'].dt.dayofweek  # Monday=0, Sunday=6
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['day_of_year'] = df['transaction_datetime'].dt.dayofyear
        print("Successfully created day_of_week, is_weekend, day_of_year.")
    else:
        print("Could not create 'transaction_datetime' reliably, skipping dependent features.")
        df = df.drop(columns=['transaction_datetime'], errors='ignore')
    
    return df
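
As a small illustration of the step above, `pd.to_datetime` can assemble timestamps from a frame whose columns are named `year`, `month`, `day` and `hour`. The rows below are invented, and the third one is deliberately invalid to show what `errors='coerce'` does.

```python
import pandas as pd

# Toy date components (invented). 30 February does not exist.
parts = pd.DataFrame({
    "year":  [2019, 2020, 2020],
    "month": [1, 6, 2],
    "day":   [1, 21, 30],
    "hour":  [0, 13, 8],
})

# Invalid combinations become NaT instead of raising an error.
parsed = pd.to_datetime(parts, errors="coerce")

# Monday=0 ... Sunday=6, so 5 and 6 mark the weekend.
is_weekend = parsed.dt.dayofweek.isin([5, 6]).astype(int)

print(parsed)
print(is_weekend.tolist())
```

Rows that fail to parse end up with `is_weekend` equal to 0, so it can be worth counting the NaT values before relying on the derived columns.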

In [22]:
def create_amount_features(df):
    """Apply transformations to amount-related columns"""
    df = df.copy()
    
    # Log transform amount
    if 'amt' in df.columns:
        df['amt_log1p'] = np.log1p(df['amt'])
        print("Created 'amt_log1p'.")
    else:
        print("Warning: Column 'amt' not found for log transformation.")
    
    return df
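
For reference, a short sketch of what `np.log1p` does to amounts of very different sizes, using invented values. `np.expm1` is its inverse and recovers the original amounts.

```python
import numpy as np

# Toy amounts (invented) spanning several orders of magnitude.
amounts = np.array([0.0, 9.0, 99.0, 9999.0])

# log1p(x) = log(1 + x); it is defined at 0, unlike a plain log.
logged = np.log1p(amounts)

# expm1(y) = exp(y) - 1 undoes the transform.
recovered = np.expm1(logged)

print(logged)
print(recovered)
```

The largest toy amount is more than a thousand times the second one, yet after the transform it is only four times as large, which is the compression the feature is meant to provide.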

In [47]:
def apply_all_feature_engineering(df):
    """Apply all feature engineering steps in the correct order"""
    print("Starting feature engineering process...")
    
    # 1. Datetime features (day_of_week, is_weekend, day_of_year)
    df = create_datetime_features(df)

    # 2. Basic amount transformations
    df = create_amount_features(df)

    # 3. Customer behavior features (uses the existing 'hour', 'category' and 'merchant' columns)
    df = create_customer_behavior_features(df)

    # 4. Amount patterns (uses the raw 'amt' column)
    df = create_amount_pattern_features(df)

    # 5. Time patterns (uses 'hour' and the existing merchant frequency columns)
    df = create_time_pattern_features(df)

    # 6. Velocity features (uses the existing merchant frequency columns)
    df = create_velocity_features(df)
    
    print("Feature engineering completed.")
    return df

df_final = apply_all_feature_engineering(df_fe)

Starting feature engineering process...
Successfully created day_of_week, is_weekend, day_of_year.
Created 'amt_log1p'.
Created customer behavior features.
Created amount pattern features.
Created time pattern features.
Created velocity features.
Feature engineering completed.


In [48]:
df_final

Unnamed: 0,cc_num,merchant,category,amt,first,last,gender,street,city,state,...,is_frequent_merchant,customer_amt_mean,customer_amt_std,amt_zscore_by_customer,amt_pct_diff_from_mean,customer_tx_hour_std,time_pattern_regularity,high_frequency_day,high_frequency_month,merchant_velocity_score
0,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,...,0,,0.010000,,,6.549756,0.132455,0,0,0.032240
1,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,...,0,,0.010000,,,6.382896,0.135448,0,0,-0.347733
2,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,...,0,,0.010000,,,6.985246,0.125231,0,0,-0.601049
3,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.00,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,...,0,,0.010000,,,7.144652,0.122780,0,0,-1.107680
4,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,VA,...,0,,0.010000,,,6.944418,0.125875,0,0,-1.107680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1852389,30560609640617,fraud_Reilly and Sons,health_fitness,43.77,Michael,Olson,M,558 Michael Estates,Luray,MO,...,0,62.356436,110.845707,-0.167678,-0.298068,6.945358,0.125860,0,0,-0.601049
1852390,3556613125071656,fraud_Hoppe-Parisian,kids_pets,111.84,Jose,Vasquez,M,572 Davis Mountains,Lake Jackson,TX,...,0,50.435516,168.381067,0.364676,1.217485,5.842394,0.146148,0,0,-0.601049
1852391,6011724471098086,fraud_Rau-Robel,kids_pets,86.88,Ann,Lawson,F,144 Evans Islands Apt. 683,Burbank,WA,...,0,88.704797,119.965224,-0.015211,-0.020572,6.598787,0.131600,1,0,1.045502
1852392,4079773899158,fraud_Breitenberg LLC,travel,7.99,Eric,Preston,M,7020 Doyle Stream Apt. 951,Mesa,ID,...,0,61.016205,89.535597,-0.592236,-0.869051,6.991283,0.125136,0,0,-0.474391


In [34]:
# Save the processed data
df_final.to_csv(os.path.join(PROCESSED_DATA_PATH, 'transactions_processed.csv'), index=False)
print("Processed data saved to:", os.path.join(PROCESSED_DATA_PATH, 'transactions_processed.csv'))


Processed data saved to: ../data/processed/transactions_processed.csv


In [49]:
# How many transactions are flagged as coming from a frequent customer of the merchant? (is_frequent_merchant is a per-row flag)

df_final['is_frequent_merchant'].value_counts()


is_frequent_merchant
0    1851260
1       1134
Name: count, dtype: int64
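
The counts above can be turned into a share of all rows. The sketch below only reuses the two numbers printed in the output; the variable names are hypothetical.

```python
# Counts copied from the value_counts() output above.
flagged = 1134
not_flagged = 1851260

total = flagged + not_flagged
share_pct = 100 * flagged / total

print(f"{share_pct:.4f}% of {total} transactions are flagged as frequent-merchant")
```

That is roughly 0.06% of all transactions. On the full frame the same figure should be available as `df_final['is_frequent_merchant'].mean() * 100`.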