# Phase 1: Exploratory Data Analysis (EDA) & Feature Engineering
**Project:** Nigerian-Context Fraud Detection System
**Goal:** Identify patterns in Nigerian financial transactions and engineer features that capture local fraud behaviors (USSD drains, midnight spikes, etc.).

### Objectives:
1. Load the synthetic Nigerian transaction data.
2. Analyze the distribution of fraud across different "Naira Bands."
3. Engineer 20+ features for our ML models.

In [3]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from src.logger import get_logger
from pathlib import Path

# Initialize logger for the notebook
logger = get_logger("notebook_eda")

# Set visual style
plt.style.use('fivethirtyeight')
%matplotlib inline

# Load the data

filepath = Path("../../data/nigerian_transactions.csv").resolve()
df = pd.read_csv(filepath)
df['timestamp'] = pd.to_datetime(df['timestamp'])

logger.info("Data loaded and timestamp converted.")
df.head()

[2025-12-27 00:41:48] INFO [notebook_eda.<module>:22] Data loaded and timestamp converted.


| Unnamed: 0 | transaction_id | timestamp | sender_bank | amount | channel | location | device_type | is_fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | TRX-1000000 | 2024-01-06 08:48:00 | OPay | 299301.54 | Mobile App | Lagos | Feature Phone | 0 |
| 1 | TRX-1000001 | 2024-01-20 22:00:00 | Moniepoint | 5416.13 | Web | Kano | Web | 0 |
| 2 | TRX-1000002 | 2024-01-20 01:07:00 | Moniepoint | 19560.98 | POS | Enugu | Android | 0 |
| 3 | TRX-1000003 | 2024-01-28 03:57:00 | Moniepoint | 1017.63 | USSD | Ibadan | Android | 0 |
| 4 | TRX-1000004 | 2024-01-07 16:01:00 | OPay | 19093.36 | Mobile App | Abuja | Feature Phone | 0 |


## 1. Contextual Analysis: The "Naira Band"
In the Nigerian ecosystem, fraud often clusters in specific amount ranges.
* **Micro-transactions:** Airtime/data top-ups (small amounts, banded as ₦5,000 or less in this notebook).
* **Target Bands:** ₦50,000 to ₦200,000 (common daily limits and typical drain targets).
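
As a rough illustration of how fraud rates could be compared across amount bands, the sketch below bins a small, made-up table with `pd.cut` and takes the mean of the fraud label per band. The `toy` frame and its labels are hypothetical; only the band edges (₦5,000, ₦50,000, ₦200,000) follow this notebook's banding.

```python
import pandas as pd

# Hypothetical toy data: amounts are in naira and the fraud labels are invented.
toy = pd.DataFrame({
    "amount":   [1_000, 5_000, 20_000, 50_000, 120_000, 200_000, 750_000, 60_000],
    "is_fraud": [0, 0, 0, 0, 1, 0, 0, 1],
})

# Right-closed bins: [0, 5k], (5k, 50k], (50k, 200k], (200k, inf).
bins = [0, 5_000, 50_000, 200_000, float("inf")]
labels = ["micro", "small", "medium", "high"]
toy["amount_band"] = pd.cut(toy["amount"], bins=bins, labels=labels, include_lowest=True)

# Fraud rate per band is the mean of the 0/1 label within that band.
band_summary = (
    toy.groupby("amount_band", observed=False)["is_fraud"]
       .agg(transactions="count", fraud_rate="mean")
)
print(band_summary)
```

Comparing rates rather than raw counts keeps a high-volume band from looking risky simply because it contains more transactions.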


In [4]:
def engineer_initial_features(data):
    df_feat = data.copy()

    # 1. Temporal Features (WAT - West Africa Time)
    df_feat['hour'] = df_feat['timestamp'].dt.hour
    df_feat['day_of_week'] = df_feat['timestamp'].dt.dayofweek

    # Flag: late-night transactions (00:00 to 04:59 WAT) - High risk in Nigeria
    df_feat['is_midnight'] = df_feat['hour'].between(0, 4).astype(int)

    # 2. Amount Bands (Naira specific)
    conditions = [
        (df_feat['amount'] <= 5000),
        (df_feat['amount'] > 5000) & (df_feat['amount'] <= 50000),
        (df_feat['amount'] > 50000) & (df_feat['amount'] <= 200000),
        (df_feat['amount'] > 200000)
    ]
    values = ['micro', 'small', 'medium', 'high']
    # The default must be a string: np.select's default of 0 has no common dtype with string choices
    df_feat['amount_band'] = np.select(conditions, values, default='unknown')

    # 3. Channel Risk
    df_feat['is_ussd'] = (df_feat['channel'] == 'USSD').astype(int)

    logger.info("Initial features engineered: Hour, Midnight Flag, Amount Bands.")
    return df_feat

df_engineered = engineer_initial_features(df)
df_engineered.head()

## 2. Visualizing Fraud Distribution

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(data=df_engineered, x='hour', hue='is_fraud')
plt.title('Transaction Volume vs Fraud Incidents by Hour (WAT)')
plt.show()
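
Because fraud is usually a small share of all transactions, raw counts per hour can hide how risky each hour is. A minimal sketch of the complementary view, the fraud rate per hour, is shown below on a hypothetical six-row table (the `toy` frame and its labels are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data: six transactions with invented fraud labels.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-06 01:15", "2024-01-06 01:40", "2024-01-06 02:05",
        "2024-01-06 14:00", "2024-01-06 14:30", "2024-01-06 14:45",
    ]),
    "is_fraud": [1, 0, 1, 0, 0, 0],
})
toy["hour"] = toy["timestamp"].dt.hour

# Volume and fraud rate side by side, one row per hour of day.
hourly = toy.groupby("hour")["is_fraud"].agg(transactions="count", fraud_rate="mean")
print(hourly)
```

The same `groupby` could be applied to `df_engineered` once the `hour` column exists.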

## 3. Advanced Feature Engineering: Transaction Velocity
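Fraudsters often probe an account with small test transactions ("Account Probing") or empty it in a quick burst ("Rapid Draining").
The columns shown above include no account identifier, so we group by sender bank and location as a proxy and calculate:
* **Transaction Count (Last 24h):** Number of transactions from the same bank/location in the trailing 24 hours.
* **Total Spend (Last 24h):** Sum of amounts from the same bank/location over the same window.
* **Amount Deviation:** Is this ₦200,000 transfer "normal" for this sender bank? Measured as the ratio of the amount to that bank's average transaction amount.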
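
To make the trailing-window idea concrete, here is a small sketch on hypothetical data for a single bank/location group. It assumes pandas' offset-based rolling window, which requires a datetime index and by default covers the interval (t - 24h, t], so the current transaction is included.

```python
import pandas as pd

# Hypothetical toy data for one (sender_bank, location) group.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 21:00",
        "2024-01-02 08:00", "2024-01-02 10:00", "2024-01-03 12:00",
    ]),
    "amount": [10_000.0, 20_000.0, 5_000.0, 50_000.0, 1_000.0],
}).sort_values("timestamp")

# Index the amounts by timestamp so the '24h' offset window is valid.
window = toy.set_index("timestamp")["amount"].rolling("24h")

# Copy results back positionally; row order is unchanged by set_index.
toy["tx_count_24h"] = window.count().to_numpy()
toy["total_spend_24h"] = window.sum().to_numpy()
print(toy)
```

In this toy group the 10:00 transaction on 2 January no longer counts the 09:00 transaction from the previous day, because that one falls just outside the 24-hour window.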

In [None]:
def engineer_velocity_features(data):
    # Sort by time: offset-based rolling windows need monotonic timestamps
    df_v = data.sort_values('timestamp').reset_index(drop=True)
    grouped = df_v.groupby(['sender_bank', 'location'])['amount']

    def window_24h(amounts):
        # Index one group's amounts by timestamp so a '24h' window is valid
        times = pd.DatetimeIndex(df_v.loc[amounts.index, 'timestamp'])
        return pd.Series(amounts.to_numpy(), index=times).rolling('24h')

    # 1. Frequency: transactions in the trailing 24 hours (current one included)
    df_v['tx_count_24h'] = grouped.transform(lambda x: window_24h(x).count().to_numpy())

    # 2. Amount Velocity: Cumulative spend in the trailing 24 hours
    df_v['total_spend_24h'] = grouped.transform(lambda x: window_24h(x).sum().to_numpy())

    # 3. Average Transaction Value (ATV) Deviation, relative to the sender bank's mean
    df_v['avg_tx_amount'] = df_v.groupby('sender_bank')['amount'].transform('mean')
    df_v['amount_vs_avg_ratio'] = df_v['amount'] / df_v['avg_tx_amount']

    logger.info("Velocity features engineered: tx_count_24h, total_spend_24h, amount_ratio.")
    return df_v


df_final = engineer_velocity_features(df_engineered)

In [None]:
# 1. Finalize the features
df_final = engineer_velocity_features(df_engineered)

# 2. Drop columns we don't need for the ML model
# ('Unnamed: 0' is the row index stored in the CSV, not a real feature)
cols_to_drop = ['Unnamed: 0', 'timestamp', 'transaction_id']
ml_ready_df = df_final.drop(columns=cols_to_drop)

# 3. Save as Parquet
file_output = '../../data/nigerian_fraud_features.parquet'
ml_ready_df.to_parquet(file_output, index=False)

logger.info(f"Feature Engineering Complete. Data saved to {file_output}")
ml_ready_df.head()