# E-Commerce Fraud Detection

This project models e-commerce transaction data to identify fraudulent activity, based on this [Kaggle Dataset](https://www.kaggle.com/datasets/umuttuygurr/e-commerce-fraud-detection-dataset). The dataset is synthetic but designed to be realistic, as it is modeled after real-life fraudulent activity observed in 2024, with scenarios such as:
- Cards tested with $1 purchases at midnight
- Transactions that shipped “gaming accessories” 5,000 km away
- Promo codes being reused from freshly created accounts.

I decided to focus on this dataset because it is the most complete, realistic data on transaction fraud that I could find. The non-synthetic fraud datasets I found had to obfuscate the meaning of features and their values for privacy reasons, using techniques like PCA, which left features with uninformative names like V1, V2, etc.

Here is a list of the columns in the dataset with brief descriptions:

- `transaction_id`: Unique transaction identifier
- `user_id`: User identifier (each user has 40–60 transactions)
- `account_age_days`: Age of user account in days
- `total_transactions_user`: Number of transactions per user
- `avg_amount_user`: User’s mean transaction amount
- `amount`: Transaction amount (USD)
- `country`: User’s country
- `bin_country`: Country of the card-issuing bank
- `channel`: “web” or “app”
- `merchant_category`: Type of purchase: electronics, fashion, grocery, gaming, travel
- `promo_used`: Whether a discount/promo code was used
- `avs_match`: Address Verification Service (AVS) result: whether the billing address provided by the customer matches the one on file with the card issuer
- `cvv_result`: CVV match result: whether the 3-digit code on the back of the card, provided during the online transaction, matched the card issuer's records
- `three_ds_flag`: Whether 3D Secure is enabled. If a transaction is flagged, the customer is prompted to complete an extra verification step, such as a one-time code sent to their phone, a password, or a biometric login
- `transaction_time`: Transaction timestamp (UTC)
- `shipping_distance_km`: Distance between billing and shipping addresses
- `is_fraud`: Target label (1 = fraud, 0 = normal)
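
To make the link between these columns and the fraud scenarios listed earlier concrete, here is a minimal sketch that turns a few of them into boolean flags. The toy rows are made up, and the flag names and thresholds (`<= 1.0`, `< 6`, `< 7`, `> 5000`) are assumptions chosen only for illustration; they are not part of the dataset or of the modeling in this project.

```python
import pandas as pd

# Toy rows using the column names listed above; the values are made up.
toy = pd.DataFrame({
    "amount": [1.00, 84.75, 250.00],
    "account_age_days": [2, 141, 900],
    "promo_used": [1, 0, 0],
    "country": ["FR", "FR", "US"],
    "bin_country": ["US", "FR", "US"],
    "transaction_time": pd.to_datetime(
        ["2024-01-06T00:09:39Z", "2024-01-09T20:13:47Z", "2024-01-12T06:20:11Z"],
        utc=True,
    ),
    "shipping_distance_km": [5200.0, 149.62, 12.0],
})

# Hypothetical rule-of-thumb flags; every threshold here is an assumption.
flags = pd.DataFrame({
    "small_night_purchase": (toy["amount"] <= 1.0) & (toy["transaction_time"].dt.hour < 6),
    "country_mismatch": toy["country"] != toy["bin_country"],
    "promo_on_new_account": (toy["promo_used"] == 1) & (toy["account_age_days"] < 7),
    "long_distance_shipping": toy["shipping_distance_km"] > 5000,
})
print(flags)
```

Only the first toy row trips the flags. A model can learn such interactions on its own, so this is meant as a reading aid for the columns rather than a feature engineering recommendation.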

## Setup
### Define parameters
The input/output parameters are defined in the next cell.

In [67]:
# Data input parameters
kaggle_source = "umuttuygurr/e-commerce-fraud-detection-dataset"
data_dir = "./data"
csv_file = "transactions.csv"
# Column definitions
target_col = "is_fraud"
id_cols = ['transaction_id', 'user_id']
date_col = 'transaction_time'
# Validation/Test split ratios
val_ratio = .2
test_ratio = .2

### Import packages

In [13]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
import pickle
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score, roc_auc_score, f1_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
import xgboost as xgb

### Define functions

In [76]:
def download_data_csv(kaggle_source, data_dir, csv_file):
    """Download csv file from kaggle_source. Requires install of kaggle python
    package to use the Kaggle API and Kaggle API credentials set up in
    `~/.kaggle/kaggle.json`. Creates data directory, data_dir, if it doesn't
    exist. csv_file is the name of the downloaded file.
    """
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    if not os.path.exists(f"{data_dir}/{csv_file}"):
        print(f"\nDownloading dataset from Kaggle...")
        !kaggle datasets download -d {kaggle_source} -p {data_dir} --unzip
        print("Download complete!")
    else:
        print(f"\nDataset already exists at {data_dir}/{csv_file}")

def load_data(data_dir, csv_file, verbose=True):
    df = pd.read_csv(
        f"{data_dir}/{csv_file}",
        low_memory=False  # Read entire file to infer dtypes properly
    )
    if verbose:
        print(f"\nDataset Shape: {df.shape[0]} rows, {df.shape[1]} columns")
        print(f"\nMemory Usage:\n{df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    return df

def analyze_target_stats(df, target_col):
    # Target distribution
    target_dist = df[target_col].value_counts(normalize=True)
    print("\nTarget Distribution (%):")
    print(target_dist * 100)
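    # Check class imbalance. value_counts sorts by frequency (descending),
    # so index 0 is the majority class and index 1 the minority class.
    target_vals = target_dist.values
    target_ratio = target_vals[0] / target_vals[1]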
    if target_ratio > 10:
        print(f"\nWarning: Large class imbalance ({target_ratio:.1f}) detected!")
    else:
        print(f"\nClass imbalance = {target_ratio:.1f}")

def analyze_feature_stats(df, id_cols, target_col):
    # Summary for categorical/object columns
    categorical_cols = df.select_dtypes(include=['object']).columns
    if len(categorical_cols) > 0:
        print("Categorical Columns Summary:")
        for col in categorical_cols:
            print(f"\n{col}:")
            print(f"  Unique values: {df[col].nunique()}")
            print(f"  Top 5 values:\n{df[col].value_counts().head()}")

    # Statistical summary for numerical columns
    numeric_cols = df.select_dtypes(include=['int', 'float']).columns
    numeric_cols = [nc for nc in numeric_cols if nc not in id_cols and nc != target_col]
    print()
    display(df[numeric_cols].describe())
        
def split_train_val_test(df, val_ratio=.2, test_ratio=.2, stratify=None, r_seed=1, verbose=False):
    """Use the train_test_split function from sklearn to split input dataframe
    into randomly shuffled train, validation, and test datasets with the
    validation dataset containing val_ratio of the input data and the test
    dataset containing test_ratio of the input data. Stratify, if provided, is 
    the name of the column in df to use when stratifying the splits.
    """
    # Split off the test dataset
    strat_col = df[stratify] if stratify else None
    full_train_df, test_df = train_test_split(df, test_size=test_ratio, stratify=strat_col, random_state=r_seed)
    test_df = test_df.reset_index(drop=True)
    # Split the remainder into train and validation datasets. val_ratio is
    # rescaled because it is defined relative to the full input data.
    val_ft_ratio = val_ratio / (1 - test_ratio)
    strat_col = full_train_df[stratify] if stratify else None
    train_df, val_df = train_test_split(full_train_df, test_size=val_ft_ratio, stratify=strat_col, random_state=r_seed)
    train_df = train_df.reset_index(drop=True)
    val_df = val_df.reset_index(drop=True)
    if verbose:
        print(f"All rows in the original dataframe are contained within the training, validation, or test datasets: {len(train_df) + len(val_df) + len(test_df) == len(df)}")
    return train_df, val_df, test_df

def calculate_mi_scores(df, categorical_features, target_col):
    """
    Calculate mutual information scores for categorical and ordinal features.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe
    categorical_features : list
        List of categorical feature column names
    target_col : str
        Name of target column
    
    Returns:
    --------
    pd.DataFrame : DataFrame with features and their MI scores, sorted by score descending
    """
    mi_scores = []
    
    for feature in categorical_features:
        score = mutual_info_score(df[feature], df[target_col])
        mi_scores.append({'feature': feature, 'mi_score': score})
    
    mi_df = pd.DataFrame(mi_scores).sort_values('mi_score', ascending=False).reset_index(drop=True)
    return mi_df


def calculate_numeric_correlations(df, numeric_features, target_col):
    """
    Calculate Pearson correlations for numeric features with target.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe
    numeric_features : list
        List of numeric feature column names
    target_col : str
        Name of target column
    
    Returns:
    --------
    pd.DataFrame : DataFrame with features and their correlation with target, sorted by absolute value descending
    """
    correlations = []
    
    for feature in numeric_features:
        corr = df[feature].corr(df[target_col])
        correlations.append({'feature': feature, 'correlation': corr, 'abs_correlation': abs(corr)})
    
    corr_df = pd.DataFrame(correlations).sort_values('abs_correlation', ascending=False).reset_index(drop=True)
    return corr_df[['feature', 'correlation']]


def calculate_vif(df, numeric_features):
    """
    Calculate Variance Inflation Factor for numeric features.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe
    numeric_features : list
        List of numeric feature column names
    
    Returns:
    --------
    pd.DataFrame : DataFrame with features and their VIF values, sorted by VIF descending
    """
    # Create subset with only numeric features
    X = df[numeric_features].values
    
    vif_data = []
    for i, feature in enumerate(numeric_features):
        vif = variance_inflation_factor(X, i)
        vif_data.append({'feature': feature, 'VIF': vif})
    
    vif_df = pd.DataFrame(vif_data).sort_values('VIF', ascending=False).reset_index(drop=True)
    return vif_df
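
As a sanity check on how to read the scores from `calculate_mi_scores`, here is a small sketch on synthetic data. It assumes nothing about the real dataset: `copy_of_target` and `random_channel` are hypothetical columns created only for this example. `mutual_info_score` treats both inputs as discrete labels and returns a value in nats, so a feature that fully determines a roughly balanced binary target scores close to ln 2 ≈ 0.693, while an unrelated feature scores close to 0.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, size=n)

# Synthetic frame: one feature that mirrors the target, one that is pure noise.
toy = pd.DataFrame({
    "is_fraud": target,
    "copy_of_target": np.where(target == 1, "b", "a"),
    "random_channel": rng.choice(["web", "app"], size=n),
})

mi_copy = mutual_info_score(toy["copy_of_target"], toy["is_fraud"])
mi_rand = mutual_info_score(toy["random_channel"], toy["is_fraud"])
print(f"copy_of_target: {mi_copy:.4f}")
print(f"random_channel: {mi_rand:.4f}")
```

Because the inputs are treated as labels, continuous features would need to be binned before being scored this way.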

## Load data

In [29]:
download_data_csv(kaggle_source, data_dir, csv_file)
input_df = load_data(data_dir, csv_file, verbose=True)

Dataset already exists at ./data/transactions.csv

Dataset Shape: 299695 rows, 17 columns

Memory Usage:
107.29 MB


In [26]:
input_df.head()

| | transaction_id | user_id | account_age_days | total_transactions_user | avg_amount_user | amount | country | bin_country | channel | merchant_category | promo_used | avs_match | cvv_result | three_ds_flag | transaction_time | shipping_distance_km | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 141 | 47 | 147.93 | 84.75 | FR | FR | web | travel | 0 | 1 | 1 | 1 | 2024-01-06T04:09:39Z | 370.95 | 0 |
| 1 | 2 | 1 | 141 | 47 | 147.93 | 107.9 | FR | FR | web | travel | 0 | 0 | 0 | 0 | 2024-01-09T20:13:47Z | 149.62 | 0 |
| 2 | 3 | 1 | 141 | 47 | 147.93 | 92.36 | FR | FR | app | travel | 1 | 1 | 1 | 1 | 2024-01-12T06:20:11Z | 164.08 | 0 |
| 3 | 4 | 1 | 141 | 47 | 147.93 | 112.47 | FR | FR | web | fashion | 0 | 1 | 1 | 1 | 2024-01-15T17:00:04Z | 397.4 | 0 |
| 4 | 5 | 1 | 141 | 47 | 147.93 | 132.91 | FR | US | web | electronics | 0 | 1 | 1 | 1 | 2024-01-17T01:27:31Z | 935.28 | 0 |


In [27]:
# No null values
input_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299695 entries, 0 to 299694
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   transaction_id           299695 non-null  int64  
 1   user_id                  299695 non-null  int64  
 2   account_age_days         299695 non-null  int64  
 3   total_transactions_user  299695 non-null  int64  
 4   avg_amount_user          299695 non-null  float64
 5   amount                   299695 non-null  float64
 6   country                  299695 non-null  object 
 7   bin_country              299695 non-null  object 
 8   channel                  299695 non-null  object 
 9   merchant_category        299695 non-null  object 
 10  promo_used               299695 non-null  int64  
 11  avs_match                299695 non-null  int64  
 12  cvv_result               299695 non-null  int64  
 13  three_ds_flag            299695 non-null  int64  
 14  tran

## Preprocessing
### Verify table grain

In [39]:
print(f"Every row is uniquely defined by transaction and user id columns: {len(input_df)==len(input_df.drop_duplicates(subset=id_cols))}")

Every row is uniquely defined by transaction and user id columns: True
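
The check above de-duplicates on both id columns together, so it would still pass if the same `transaction_id` appeared under two different users. A stricter variant, sketched here on a made-up three-row frame, tests `transaction_id` on its own. Whether the stricter grain is the intended one for this dataset is an assumption on my part.

```python
import pandas as pd

# Made-up rows: transaction_id 2 appears twice, but under different users.
toy = pd.DataFrame({
    "transaction_id": [1, 2, 2],
    "user_id": [1, 1, 2],
})

# Uniqueness of the (transaction_id, user_id) pair, as in the check above.
pair_unique = not toy.duplicated(subset=["transaction_id", "user_id"]).any()
# Stricter check: transaction_id alone identifies a row.
txn_unique = toy["transaction_id"].is_unique

print(f"Pair is unique: {pair_unique}")
print(f"transaction_id is unique: {txn_unique}")
```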


### Target class balance

In [35]:
analyze_target_stats(input_df, target_col)


Target Distribution (%):
is_fraud
0    97.793757
1     2.206243
Name: proportion, dtype: float64
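
To put these percentages in perspective, the short calculation below derives the imbalance ratio that `analyze_target_stats` compares against its threshold of 10, along with the accuracy of a trivial model that always predicts "not fraud". The two percentages are copied from the output above; the rest is plain arithmetic.

```python
# Class percentages copied from the target distribution output above.
neg_pct = 97.793757  # is_fraud == 0
pos_pct = 2.206243   # is_fraud == 1

# Majority-to-minority ratio, the quantity compared against 10 in analyze_target_stats.
ratio = neg_pct / pos_pct
# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = neg_pct / 100

print(f"Legitimate transactions per fraudulent one: {ratio:.1f}")
print(f"Accuracy of always predicting 0: {baseline_accuracy:.4f}")
```

With roughly 44 legitimate transactions per fraudulent one, plain accuracy says little about fraud detection quality, which is why metrics such as `roc_auc_score` and `f1_score` are imported in the setup. A ratio like this is also what is commonly passed to XGBoost's `scale_pos_weight` parameter, although whether to use it here is a modeling choice.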



### Convert date type

In [48]:
input_df[date_col] = pd.to_datetime(input_df[date_col], errors='coerce')
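
One caveat with `errors='coerce'` is that unparseable values silently become `NaT`, so it is worth counting them after the conversion. The sketch below shows the behavior on a made-up series; the `utc=True` argument is my addition for the example and is not used in the cell above.

```python
import pandas as pd

# Made-up timestamps, one of which cannot be parsed.
raw = pd.Series(["2024-01-06T04:09:39Z", "not a timestamp", "2024-01-09T20:13:47Z"])

# Invalid entries are converted to NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)
n_failed = int(parsed.isna().sum())

print(f"Unparseable timestamps: {n_failed}")
# Hour of day (UTC); NaT rows yield NaN.
print(parsed.dt.hour.tolist())
```

For the real data, `input_df[date_col].isna().sum()` after the conversion would confirm whether any timestamps were lost.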

### Feature stats
Examine the distribution of categorical features and compute summary statistics for numerical features.

In [77]:
analyze_feature_stats(input_df, id_cols, target_col)

Categorical Columns Summary:

country:
  Unique values: 10
  Top 5 values:
country
US    32430
GB    30602
FR    30343
NL    30220
TR    30074
Name: count, dtype: int64

bin_country:
  Unique values: 10
  Top 5 values:
bin_country
US    32295
GB    30563
FR    30261
NL    30256
TR    29972
Name: count, dtype: int64

channel:
  Unique values: 2
  Top 5 values:
channel
web    152226
app    147469
Name: count, dtype: int64

merchant_category:
  Unique values: 5
  Top 5 values:
merchant_category
electronics    60220
travel         59922
grocery        59913
gaming         59839
fashion        59801
Name: count, dtype: int64



| | account_age_days | total_transactions_user | avg_amount_user | amount | promo_used | avs_match | cvv_result | three_ds_flag | shipping_distance_km |
|---|---|---|---|---|---|---|---|---|---|
| count | 299695.0 | 299695.0 | 299695.0 | 299695.0 | 299695.0 | 299695.0 | 299695.0 | 299695.0 | 299695.0 |
| mean | 973.397871 | 50.673321 | 148.142973 | 177.165279 | 0.15364 | 0.837999 | 0.87211 | 0.784588 | 357.049028 |
| std | 525.241409 | 5.976391 | 200.364624 | 306.926507 | 0.360603 | 0.368453 | 0.333968 | 0.411109 | 427.672074 |
| min | 1.0 | 40.0 | 3.52 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 516.0 | 46.0 | 46.19 | 42.1 | 0.0 | 1.0 | 1.0 | 1.0 | 136.6 |
| 50% | 975.0 | 51.0 | 90.13 | 89.99 | 0.0 | 1.0 | 1.0 | 1.0 | 273.02 |
| 75% | 1425.0 | 56.0 | 173.45 | 191.11 | 0.0 | 1.0 | 1.0 | 1.0 | 409.18 |
| max | 1890.0 | 60.0 | 4565.29 | 16994.74 | 1.0 | 1.0 | 1.0 | 1.0 | 3748.56 |


### Train/Validation/Test Splits

In [64]:
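# Assumption: pp_df is the preprocessed dataframe, i.e. input_df after the date conversion above
pp_df = input_df
train_df, val_df, test_df = split_train_val_test(pp_df, val_ratio=val_ratio, test_ratio=test_ratio, stratify=target_col, verbose=True)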

['transaction_id', 'user_id']
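
The two-stage split rescales the validation ratio because the second split only sees what is left after the test set is removed: with `val_ratio = test_ratio = 0.2`, the second split takes 0.2 / (1 - 0.2) = 0.25 of the remaining 80%, which is 20% of the original data. The sketch below reproduces that logic on a made-up 1,000-row frame with a 5% positive rate to show the resulting sizes and the effect of stratification. The toy data is an assumption; only the split logic mirrors `split_train_val_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

val_ratio, test_ratio = 0.2, 0.2

# Made-up data: 1,000 rows with a 5% positive rate.
toy = pd.DataFrame({"x": np.arange(1000), "is_fraud": [1] * 50 + [0] * 950})

# Stage 1: hold out the test set.
full_train, test = train_test_split(
    toy, test_size=test_ratio, stratify=toy["is_fraud"], random_state=1
)
# Stage 2: rescale val_ratio so it refers to the original row count.
val_share = val_ratio / (1 - test_ratio)
train, val = train_test_split(
    full_train, test_size=val_share, stratify=full_train["is_fraud"], random_state=1
)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(part["is_fraud"].mean(), 3))
```

Stratifying on the target keeps the fraud rate nearly identical across the three splits, which matters when only about 2% of rows are positive.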

## EDA
### Target variable

### Numeric Features

#### Multicollinearity Detection (VIF)
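
For reference, the VIF of feature i is 1 / (1 - R_i²), where R_i² comes from regressing feature i on all the other features. As far as I know, `variance_inflation_factor` from statsmodels does not add an intercept to that regression itself, so a constant column is commonly added to the input first. The sketch below is not the `calculate_vif` helper from this notebook. It is a plain NumPy version, with a hypothetical function name `vif_with_intercept`, that includes the intercept explicitly and runs on synthetic data.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF for each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        # Design matrix: intercept column followed by the remaining features.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        # R^2 of the auxiliary regression (residuals have zero mean due to the intercept).
        r2 = 1 - resid.var() / y.var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Synthetic features: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

The independent feature gets a VIF near 1, while the two nearly collinear features get large values. A common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity.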

#### Bivariate Analysis: Features vs. Target

### Categorical Features
#### Fraud Rates

### Feature Selection Recommendations

## Model Training