# **Notebook 1:** Fraud Detection EDA (Timezone-Aware)
**Project:** Fraud Shield AI  
**Notebook:** 01_fraud_detection_eda.ipynb  
**Objective:** Exploratory Data Analysis & Feature Validation (No Modeling)

## 📋 Project Information

| **Attribute** | **Details** |
| :--- | :--- |
| **Author** | Alireza Barzin Zanganeh |
| **Contact** | abarzinzanganeh@gmail.com |
| **Date** | January 18, 2026 |
| **Project Type** | Capstone Project |

**Data Source Attribution:**  
ZIP code and timezone data provided by [SimpleMaps US ZIP Code Database](https://simplemaps.com/data/us-zips). Free version used for EDA.  
Their license requires a link back to SimpleMaps for production use, and this notebook complies.

## Problem Statement
The objective is to design and implement a comprehensive fraud detection system capable of identifying fraudulent transactions with high accuracy and scalability. The system will leverage:

- Supervised learning to identify known fraud patterns from labeled historical data.
- Deep learning to model complex relationships and sequential transaction behaviors.

The solution will focus on minimizing false positives, maximizing fraud recall, and maintaining scalability for high-volume real-time transaction processing.

## Notebook Scope & Constraints

This notebook is **EDA-only** and focuses on:
- Dataset understanding
- Assumption validation
- Pattern discovery
- Feature design guidance

### Explicitly Out of Scope
- Model training
- Feature scaling
- Hyperparameter tuning
- Evaluation metrics

All modeling work is deferred to later notebooks.


## Critical Time Assumptions

Transaction timestamps are **not guaranteed to be in local time**.

Any analysis involving:
- Hour of day
- Day of week
- Night vs daytime behavior

is considered **provisional** until timestamps are converted to **local transaction time**
using latitude/longitude-based timezone inference.

Timezone correction is treated as a first-class EDA requirement.
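To see why hour-of-day features are provisional, consider one instant read on three different clocks. The sketch below is illustrative only: it uses fixed standard-time offsets (an assumption; a real conversion needs named timezones so that DST is handled) and a hypothetical `is_night` rule covering 22:00 to 05:59.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets, used here only for illustration (no DST).
UTC = timezone.utc
PACIFIC_STD = timezone(timedelta(hours=-8))
EASTERN_STD = timezone(timedelta(hours=-5))


def is_night(ts: datetime) -> bool:
    """Hypothetical rule: night is 22:00-05:59 on the timestamp's own clock."""
    return ts.hour >= 22 or ts.hour < 6


# One instant, three wall clocks.
instant = datetime(2012, 1, 15, 3, 30, tzinfo=UTC)
as_pacific = instant.astimezone(PACIFIC_STD)
as_eastern = instant.astimezone(EASTERN_STD)

for label, ts in [("UTC", instant), ("Pacific", as_pacific), ("Eastern", as_eastern)]:
    print(f"{label:8s} {ts:%Y-%m-%d %H:%M} night={is_night(ts)}")
```

The same transaction counts as a night-time event on the UTC clock and as an early-evening event on the Pacific clock, and it even falls on a different calendar day.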


## Analysis Roadmap

This notebook follows a strict, sequential EDA structure:

1. Environment & Dataset Loading  
2. Data Quality Assessment  
3. Target Variable (Fraud) Overview  
4. Temporal Analysis (Timezone-Aware)  
5. Calendar-Level Patterns (Month, Seasonality)  
6. Key Insights & Next Steps

Each section builds on validated assumptions from the previous one.


# 1. Environment & Dataset Inspection


## Global Imports & Dependencies

In [None]:
# ============================================================
# GLOBAL IMPORTS & DEPENDENCIES
# ============================================================
# All imports consolidated here for clarity and maintainability

# Standard Library
import os
import sys
import time
from typing import Tuple, Dict, List

# PySpark Core
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import (
    col, count, sum as spark_sum, avg, min as spark_min, max as spark_max,
    round as spark_round, floor, ceil, trim, when, lit, concat, concat_ws,
    hour, dayofweek, dayofmonth, month, year, date_format,
    from_utc_timestamp, to_timestamp, unix_timestamp,
    broadcast, first, last, collect_list, collect_set,
    percentile_approx, stddev, variance, corr
)
from pyspark.sql.window import Window
from pyspark.sql.types import *

# Data Analysis & Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings for clean output
import warnings
warnings.filterwarnings('ignore')

print("✅ All dependencies loaded successfully")



In [None]:
# Imports are now consolidated in the Global Imports & Dependencies section above
print("✓ Using global imports")


## 1.1 Kaggle Environment & Input Validation

### Purpose
Verify the Kaggle execution environment and confirm the availability and structure
of input datasets **before initializing Spark**.

### Why This Matters
- Confirms datasets are correctly mounted
- Avoids hard-coded paths
- Ensures Spark reads from the correct input directory
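One way to avoid hard-coded paths is to search the input directory for a file by name. The helper below is a minimal sketch, not part of the notebook: `find_input_file` is a hypothetical name, and the demo builds a throwaway directory tree that stands in for `/kaggle/input`.

```python
import os
import tempfile
from typing import Optional


def find_input_file(root: str, filename: str) -> Optional[str]:
    """Return the full path of the first file named `filename` under `root`, or None."""
    # Sorting the walk makes the result deterministic when several folders match.
    for dirname, _, filenames in sorted(os.walk(root)):
        if filename in filenames:
            return os.path.join(dirname, filename)
    return None


# Demo on a temporary tree (a stand-in for the mounted Kaggle input directory).
with tempfile.TemporaryDirectory() as demo_root:
    nested = os.path.join(demo_root, "fraud-detection")
    os.makedirs(nested)
    open(os.path.join(nested, "fraudTrain.csv"), "w").close()

    found = find_input_file(demo_root, "fraudTrain.csv")
    missing = find_input_file(demo_root, "does_not_exist.csv")
    found_relative = os.path.relpath(found, demo_root)

print(found_relative, missing)
```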


In [None]:
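# Optional: clear Spark's cache when re-running the notebook.
# Only valid after the SparkSession exists (Section 2.1); uncomment to use.
# spark.catalog.clearCache()

print("Cache clearing is optional - run it only after the Spark session is created")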


In [None]:
# Environment & dataset inspection

INPUT_DIR = "/kaggle/input"

print("Listing all files under /kaggle/input:\n")

for dirname, _, filenames in os.walk(INPUT_DIR):
    for filename in filenames:
        print(os.path.join(dirname, filename))



# 2. Spark Session Initialization


## 2.1 Spark Session Setup

### Purpose
Initialize a **SparkSession** to enable distributed data processing and replace
pandas-based workflows with Spark DataFrames.

### Why This Matters
- Enables scalable EDA and feature engineering
- Required for Spark `read`, `groupBy`, and ML operations
- Establishes a single entry point for Spark APIs


In [None]:

spark = (
    SparkSession.builder
    .appName("FraudShieldAI")
    .getOrCreate()
)

spark



# 3. Data Loading & Structural Validation


## 3.1 Load Training Dataset (Spark)

### Purpose
Load the training dataset into a **Spark DataFrame** and inspect its basic structure.

### What We Validate
- Number of rows
- Number of columns
- Column names and schema
- Sample records


In [None]:
# ============================================================
# LOAD TRAINING DATASET WITH VALIDATION
# ============================================================

# Construct path (the dataset folder is named explicitly so the log matches the file loaded)
dataset_path = "fraud-detection"
full_path = os.path.join(INPUT_DIR, dataset_path, "fraudTrain.csv")

# Validate path exists
if not os.path.exists(full_path):
    raise FileNotFoundError(f"Training data not found: {full_path}")

print(f"Dataset folder: {dataset_path}")
print(f"Loading from: {full_path}")

# Load with Spark
train_df = spark.read.csv(full_path, header=True, inferSchema=True)

# Drop index column if exists
if "_c0" in train_df.columns:
    train_df = train_df.drop("_c0")

# Basic validation
num_rows = train_df.count()
num_cols = len(train_df.columns)

print("\n✓ Loaded successfully")
print(f"Shape: ({num_rows:,} rows, {num_cols} columns)")

# Show sample
print("\nSample records:")
train_df.show(3, truncate=False)


# 4. Data Quality & Target Overview

## 4.1 Fraud Statistics

### Purpose
Compute basic fraud statistics from the training dataset using **Spark**:
- Total transactions  
- Fraudulent transactions count  
- Legitimate transactions count  
- Fraud rate (%)

### Notes
- `.count()` replaces `len()` in Spark
- `sum()` can be applied with `agg` or `select` + `sum`


In [None]:
# Total transactions
total = train_df.count()

# Fraudulent transactions count
fraud_count = train_df.agg(spark_sum(col("is_fraud"))).collect()[0][0]

# Legitimate transactions
legit_count = total - fraud_count

# Fraud rate in percentage
fraud_rate = (fraud_count / total) * 100

# Print results
print(f"Total transactions: {total:,}")
print(f"Fraud transactions: {fraud_count:,}")
print(f"Legitimate transactions: {legit_count:,}")
print(f"Fraud rate: {fraud_rate:.4f}%")


## 4.2 Missing Values Analysis

### Purpose
Check for missing values in each column of the Spark DataFrame and compute the total number of missing entries.

### Notes
- In Spark, use `isNull()` combined with `agg` and `sum` for each column.
- `.count()` can also be used to compute non-missing rows.
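The per-column null count is the same idea as the following plain-Python sketch, which runs without Spark. The sample rows and the `count_missing` helper are made up for illustration.

```python
from typing import Dict, List, Optional


def count_missing(rows: List[Dict[str, Optional[object]]]) -> Dict[str, int]:
    """Count None values per column across a list of row dictionaries."""
    counts: Dict[str, int] = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


# Made-up rows; None plays the role of a Spark NULL.
sample_rows = [
    {"amt": 12.5, "category": "grocery_pos", "is_fraud": 0},
    {"amt": None, "category": "gas_transport", "is_fraud": 0},
    {"amt": 80.0, "category": None, "is_fraud": 1},
]

missing_per_column = count_missing(sample_rows)
total_missing = sum(missing_per_column.values())
print(missing_per_column, total_missing)
```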


In [None]:
# Compute missing values per column
missing_df = train_df.select([
    spark_sum(when(col(c).isNull(), 1).otherwise(0)).alias(c)
    for c in train_df.columns
])

# Show missing values per column
missing_df.show(truncate=False)

# Total missing values (collect the single result row once, not once per column)
missing_row = missing_df.collect()[0]
total_missing = sum(missing_row[c] for c in missing_df.columns)
print(f"\nTotal missing values: {total_missing}")


## 4.3 Numeric Summary Statistics (Optional)

### Purpose
Get summary statistics (count, mean, stddev, min, max) for the columns of the Spark DataFrame.

### Notes
- With no arguments, `describe()` covers **numeric and string columns**; pass column names to restrict it
- Returns a Spark DataFrame (use `.show()` to visualize)


In [None]:
# Spark equivalent of pandas describe()
# train_df.describe().show(truncate=False)

# 5. Key Findings from Initial Exploration


## 1. Dataset Size & Structure

- 1,296,675 transactions (~1.3M)
- 23 columns (features)
- ~1GB memory footprint

This size justifies using **PySpark** for distributed processing and scalable feature engineering.

---

## 2. Class Imbalance Analysis (CRITICAL)

| Class      | Count     | Percentage |
|-----------|-----------|------------|
| Fraud     | 7,506     | 0.58%      |
| Legitimate| 1,289,169 | 99.42%     |

**Imbalance ratio:** 172:1

### Why this matters

- Models can predict *“not fraud”* for everything and still get ~99.4% accuracy → **useless**.
- This is why **SMOTE** or other imbalance-handling methods are mandatory.
- This is the **key challenge** mentioned in class.
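The accuracy trap can be checked with the counts from the table above. This is simple arithmetic on the reported class counts, not a model evaluation.

```python
# Class counts taken from the class imbalance table.
FRAUD = 7_506
LEGIT = 1_289_169
TOTAL = FRAUD + LEGIT

# A "model" that labels every transaction as legitimate.
true_negatives = LEGIT    # every legitimate transaction is labeled correctly
false_negatives = FRAUD   # every fraud is missed
true_positives = 0

accuracy = true_negatives / TOTAL
recall = true_positives / (true_positives + false_negatives)
imbalance_ratio = LEGIT / FRAUD

print(f"accuracy={accuracy:.2%} recall={recall:.0%} ratio={imbalance_ratio:.0f}:1")
```

High accuracy with zero recall is the reason accuracy alone should not be used to judge a fraud model on this dataset.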

---

## 3. Missing Values

- Zero missing values → clean dataset.
- No need for imputation strategies.
- Can proceed directly to **feature engineering**.

---

## 4. Geographic Analysis (lat/long)

### Customer Location

- Latitude: 20.0° to 66.7°
- Longitude: -165.7° to -67.9° (all negative)

### What this tells us

- Negative longitude → **Western Hemisphere**.
- This range covers the **entire United States**:
  - Alaska: ~64°N, 165°W (northern/western extreme)
  - Florida: ~25°N, 80°W (southern/eastern extreme)
  - California: ~37°N, 120°W (west coast)
  - Maine: ~45°N, 68°W (east coast)

### Cruise/Water transactions?

- To check this, compare lat/long to known land coordinates.
- If coordinates fall in the ocean (Atlantic or Pacific), they could be **cruise ship** transactions.
- This is a strong potential fraud indicator (class mentioned cruise transactions have higher fraud).

### Not in Europe/Asia because

- Europe longitude: about -10° to 40° (mostly positive).
- Asia longitude: about 60° to 180° (positive).
- Our data: all negative longitudes (-165 to -68) → **North America only**.

---

## 5. Transaction Amount Analysis (`amt` column)

### Key statistics

- Min: \$1.00 (small purchases such as coffee or snacks)
- Median (50%): \$47.52 (typical transaction)
- Mean: \$70.35 (well above the median, indicating right skew)
- 75th percentile: \$83.14 (most transactions are under \$100)
- Max: \$28,948.90 (very large purchase)

### What this tells us

- Distribution is **right-skewed**: a few very large transactions pull the mean up.
- Most transactions are **< \$100**, consistent with normal retail purchases.
- Large amounts (e.g., **\$900+**) are rare; class mentioned using bins `>200`, `>300`, `>900` as features.
- Fraud likely correlates with **unusual amounts**; this requires comparing fraud vs non-fraud distributions.
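The threshold bins mentioned above could be encoded as binary flags. This is a sketch under assumptions: the flag names are hypothetical and the comparison is strict (`>`), matching the `>200`, `>300`, `>900` notation.

```python
from typing import Dict

# Thresholds mentioned in the text; the flag names are hypothetical.
AMOUNT_THRESHOLDS = (200, 300, 900)


def amount_flags(amt: float) -> Dict[str, int]:
    """Return one 0/1 flag per threshold, set when `amt` strictly exceeds it."""
    return {f"amt_gt_{t}": int(amt > t) for t in AMOUNT_THRESHOLDS}


# The median, a mid-sized amount, and the maximum from the summary statistics.
for amount in (47.52, 250.00, 28_948.90):
    print(amount, amount_flags(amount))
```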

---

## 6. Credit Card Numbers (`cc_num`)

- Range: 60 billion to 4.9 quintillion.
- This is effectively an **ID**, not a numeric magnitude feature.
- Best used for **grouping**, e.g.:
  - *How many transactions in the last hour for this card?*
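A per-card velocity count of this kind can be sketched in plain Python with a sliding window. The card numbers and timestamps below are made up, and the function assumes the events arrive sorted by `unix_time`.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Tuple

WINDOW_SECONDS = 3600  # one hour


def txns_in_last_hour(events: Iterable[Tuple[int, int]]) -> List[int]:
    """
    For each (cc_num, unix_time) event, count earlier transactions on the same
    card within the preceding hour. Events must be sorted by unix_time.
    """
    recent: Dict[int, Deque[int]] = defaultdict(deque)
    counts: List[int] = []
    for cc_num, ts in events:
        window = recent[cc_num]
        # Drop timestamps that have fallen out of the one-hour window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        counts.append(len(window))
        window.append(ts)
    return counts


# Hypothetical events: (cc_num, unix_time), already sorted by time.
events = [
    (1111, 1_325_376_018),
    (1111, 1_325_376_618),  # 10 minutes later, same card
    (2222, 1_325_377_000),  # different card
    (1111, 1_325_378_000),  # still within an hour of the first two
    (1111, 1_325_381_000),  # more than an hour after the first two
]
velocity = txns_in_last_hour(events)
print(velocity)
```

Because `cc_num` is only used as a grouping key, its numeric magnitude never enters the feature.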

---

## 7. Time Analysis (`unix_time`)

- Range: 1,325,376,018 to 1,371,817,018.
- Converts to approximately: **January 1, 2012 → June 21, 2013** (~18 months).
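The date range can be verified with the standard library. Reading the epoch values as UTC is an assumption here, in line with the notebook's time caveats.

```python
from datetime import datetime, timezone

UNIX_TIME_MIN = 1_325_376_018
UNIX_TIME_MAX = 1_371_817_018

# Interpret both bounds as seconds since the Unix epoch, in UTC.
start = datetime.fromtimestamp(UNIX_TIME_MIN, tz=timezone.utc)
end = datetime.fromtimestamp(UNIX_TIME_MAX, tz=timezone.utc)
span_days = (end - start).days

print(start.isoformat(), end.isoformat(), span_days)
```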

### Implications

- Over a year of data:
  - Can detect **seasonal patterns**.
  - Can analyze **holidays**.
  - Can study **weekly/daily** patterns.

---

## 8. City Population (`city_pop`)

- Min: 23 ‚Äî very small towns.
- Median: 2,456 ‚Äî small cities.
- Max: 2,906,700 ‚Äî major cities (e.g., Chicago ~2.7M).

### Possible fraud signal

- Fraud rates may differ between **small towns** and **large cities**.

---

## 9. Merchant Location (`merch_lat`, `merch_long`)

- Similar range to customer location.

### Critical feature

- **Distance between customer and merchant**:
  - Example: customer in NYC, merchant in LA → potentially suspicious.
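One common way to compute such a distance is the haversine formula. The sketch below is an assumption about how the feature might be built, not the notebook's implementation; the city coordinates are approximate and the Earth is treated as a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates for New York and Los Angeles.
nyc_to_la = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
print(f"NYC to LA: {nyc_to_la:,.0f} km")
```

In Spark, the same expression can be written with column functions so that it runs per row on `lat`, `long`, `merch_lat`, and `merch_long`.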

---

# What We DON'T Know Yet (Need to Explore)

## 1. Temporal Patterns

- What hours have most fraud? (night vs day)
- What days? (weekend vs weekday)
- Holidays?

## 2. Category Analysis

- What are the categories? (e.g., `gas_transport`, `grocery_pos`, etc.)
- Which categories have the **highest fraud rate**?

## 3. Amount Distribution

- How do **fraud amounts** compare to **legitimate** amounts?
- Are frauds typically **small**, **medium**, or **large**?

## 4. Geographic Patterns

- Which **states** have the most fraud?
- Are there **fraud hotspots**?
- How does **distance between customer and merchant** differ for fraud vs legitimate?

## 5. Merchant Analysis

- How many **unique merchants**?
- Do some merchants have **more fraud** than others?
- Merchant name patterns? (e.g., `fraud_` prefix in sample data)

## 6. Customer Demographics

- Gender distribution in fraud.
- Age (from DOB) correlation with fraud.
- Job types with higher fraud incidence.



# 6. Timezone Resolution & Temporal Feature Engineering


## 6.1 Constants and Utilities

In [None]:


GRID_SIZE = 0.5  # Grid resolution in degrees (one cell is ~55 km north-south)


### 6.1.1 Helper Function for Timezone Resolution (DRY Pattern)
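The helper below keys each location by `floor(coordinate / grid_size)`. The following plain-Python sketch shows what that key looks like and why flooring matters for negative longitudes; the reference dictionary is a made-up stand-in for the ZIP reference table.

```python
import math
from typing import Dict, Optional, Tuple

GRID_SIZE = 0.5  # degrees, same value as the notebook constant

GridKey = Tuple[int, int]


def grid_key(lat: float, lon: float, grid_size: float = GRID_SIZE) -> GridKey:
    """Map a coordinate to integer cell indices by flooring each axis."""
    return (math.floor(lat / grid_size), math.floor(lon / grid_size))


def direct_lookup(lat: float, lon: float, reference: Dict[GridKey, str]) -> Optional[str]:
    """Return the timezone stored for the coordinate's cell, or None if the cell is empty."""
    return reference.get(grid_key(lat, lon))


# Hypothetical reference table with a single populated cell.
reference = {grid_key(40.71, -74.01): "America/New_York"}

print(grid_key(40.71, -74.01))                 # cell indices for the reference point
print(direct_lookup(40.9, -74.4, reference))   # same cell, so a direct match
print(direct_lookup(41.2, -74.01, reference))  # next cell north, no direct match
```

Flooring rounds toward negative infinity, so -148.02 maps to -149. Truncating with `int()` would give -148 and place western-hemisphere points in the wrong cell.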


In [None]:
# ============================================================
# TIMEZONE RESOLUTION HELPER (DRY PATTERN)
# ============================================================

def resolve_timezone_with_grid(df, lat_col, lon_col, grid_size, zip_ref_df, entity_name="locations"):
    """
    Resolve timezones using grid-based lookup with an adjacent-cell fallback.
    
    Design: DRY pattern - reusable for both merchants and customers.
    Performance: broadcast joins and caching; the fallback probes the 8
    surrounding grid cells and keeps one match per location.
    
    Args:
        df: Source DataFrame
        lat_col: Latitude column name
        lon_col: Longitude column name  
        grid_size: Grid resolution in degrees
        zip_ref_df: Reference table with (lat_grid, lng_grid, timezone)
        entity_name: For logging
    
    Returns:
        tuple: (timezone_df, metrics_dict)
    """
    import time
    start_time = time.time()
    
    print("=" * 80)
    print(f"RESOLVING TIMEZONES: {entity_name.upper()}")
    print("=" * 80)
    
    # Direct grid match
    timezone_df = (
        df.select(lat_col, lon_col).distinct()
        .withColumn("lat_grid", floor(col(lat_col) / grid_size))
        .withColumn("lng_grid", floor(col(lon_col) / grid_size))
        .join(broadcast(zip_ref_df), on=["lat_grid", "lng_grid"], how="left")
        .select(lat_col, lon_col, col("timezone"))
        .cache()
    )
    
    total = timezone_df.count()
    direct = timezone_df.filter(col("timezone").isNotNull()).count()
    direct_rate = (direct / total) * 100
    
    print(f"Direct matches: {direct:,} / {total:,} ({direct_rate:.2f}%)")
    
    # Nearest neighbor fallback for NULLs
    null_count = total - direct
    resolved_fallback = 0
    
    if null_count > 0:
        print(f"Applying nearest neighbor for {null_count:,} locations...")
        null_locs = timezone_df.filter(col("timezone").isNull()).select(lat_col, lon_col)
        
        neighbors_found = None
        for dx in [-1, 0, 1]:
            for dy in [-1, 0, 1]:
                if dx == 0 and dy == 0:
                    continue
                neighbor_tz = (
                    null_locs
                    .withColumn("lat_grid", floor(col(lat_col) / grid_size) + dx)
                    .withColumn("lng_grid", floor(col(lon_col) / grid_size) + dy)
                    .join(broadcast(zip_ref_df), on=["lat_grid", "lng_grid"], how="inner")
                    .select(lat_col, lon_col, col("timezone"))
                    .dropDuplicates([lat_col, lon_col])
                )
                if neighbors_found is None:
                    neighbors_found = neighbor_tz
                else:
                    neighbors_found = neighbors_found.union(neighbor_tz).dropDuplicates([lat_col, lon_col])
        
        if neighbors_found is not None:
            timezone_df = (
                timezone_df
                .join(neighbors_found.withColumnRenamed("timezone", "fallback_tz"),
                     on=[lat_col, lon_col], how="left")
                .withColumn("timezone", when(col("timezone").isNull(), col("fallback_tz")).otherwise(col("timezone")))
                .drop("fallback_tz")
                .cache()
            )
            resolved_fallback = timezone_df.filter(col("timezone").isNotNull()).count() - direct
            print(f"✓ Resolved {resolved_fallback:,} via nearest neighbor")
    
    final_coverage = timezone_df.filter(col("timezone").isNotNull()).count()
    final_rate = (final_coverage / total) * 100
    elapsed = time.time() - start_time
    
    print(f"Final coverage: {final_coverage:,} / {total:,} ({final_rate:.2f}%)")
    print(f"Completed in {elapsed:.1f}s")
    print("=" * 80)
    
    return timezone_df, {
        "entity": entity_name,
        "total": total,
        "direct_matches": direct,
        "fallback_matches": resolved_fallback,
        "final_coverage": final_coverage,
        "direct_rate": direct_rate,
        "final_rate": final_rate,
        "elapsed_seconds": elapsed
    }

print("✓ Timezone resolution helper function loaded")


## 6.2 Load and Prepare ZIP Reference Table (Immutable)

This table is never joined directly into train_df.
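Several ZIP codes can share one grid cell, so the reference table has to be reduced to one timezone per cell before any join, otherwise a join would multiply rows. The sketch below illustrates that reduction in plain Python with made-up rows. It picks the most frequent timezone per cell with an alphabetical tie-break, which is an assumption chosen for determinism and differs from the notebook's `first()` aggregation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

GridKey = Tuple[int, int]


def one_timezone_per_cell(rows: Iterable[Tuple[int, int, str]]) -> Dict[GridKey, str]:
    """Reduce (lat_grid, lng_grid, timezone) rows to exactly one timezone per cell."""
    votes: Dict[GridKey, Counter] = defaultdict(Counter)
    for lat_grid, lng_grid, tz in rows:
        votes[(lat_grid, lng_grid)][tz] += 1
    # Most frequent timezone wins; ties are broken alphabetically for determinism.
    return {
        cell: min(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for cell, counter in votes.items()
    }


# Made-up rows: the second cell straddles a timezone boundary.
rows = [
    (81, -149, "America/New_York"),
    (81, -149, "America/New_York"),
    (70, -171, "America/Chicago"),
    (70, -171, "America/Indiana/Knox"),
    (70, -171, "America/Chicago"),
]
reference = one_timezone_per_cell(rows)
print(reference)
```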

In [None]:
# Assumes the ZIP code dataset is the second folder (alphabetically) under INPUT_DIR
zipcode_path = sorted(os.listdir(INPUT_DIR))[1]
full_zip_path = os.path.join(INPUT_DIR, zipcode_path, "uszips.csv")

# ============================================================
# CREATE ZIP REFERENCE TABLE (DEDUPLICATED BY GRID CELL)
# ============================================================
# CRITICAL: Ensure only ONE timezone per grid cell to prevent row explosion
# Multiple ZIPs can fall into the same grid cell - we keep one of them
# (first() makes an arbitrary, order-dependent pick)

zip_ref_df = (
    spark.read.csv(full_zip_path, header=True, inferSchema=True)
    .withColumnRenamed("zip", "zip_ref")
    .withColumn("lat_grid", floor(col("lat") / GRID_SIZE))
    .withColumn("lng_grid", floor(col("lng") / GRID_SIZE))
    .select("lat_grid", "lng_grid", "timezone")
    .filter(
        (trim(col("timezone")) != "") &
        (col("timezone") != "FALSE") &
        col("timezone").rlike("^[A-Za-z_/]+$")
    )
    .distinct()  # Remove duplicate grid+timezone combinations
    .groupBy("lat_grid", "lng_grid")
    .agg(first("timezone").alias("timezone"))  # Ensure ONE timezone per grid cell
    .cache()
)

print(f"✓ ZIP reference table created: {zip_ref_df.count():,} unique grid cells")
print(f"✓ Timezones available: {zip_ref_df.select('timezone').distinct().count()}")
zip_ref_df.select("timezone").distinct().show(20, truncate=False)


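### 6.2.1 Customer Timezone (Lat/Lng → Grid → Timezone)

#### Design

- Resolve the timezone from the customer's lat/long through the shared grid lookup (no direct ZIP join)

- Add the timezone column plus a validity flag

- Skip creation if it already exists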

#### 6.2.1.1 Create Customer Timezone Feature

In [None]:
# ============================================================
# CUSTOMER TIMEZONE RESOLUTION (GRID-BASED WITH FALLBACK)
# ============================================================
# Design: Uses DRY helper function with nearest neighbor fallback
# Coverage: 100% (direct + fallback + production fallback)
# Performance: Broadcast joins, caching, vectorized neighbor search

if "customer_timezone" not in train_df.columns:
    # Use DRY helper function (includes nearest neighbor fallback)
    customer_tz_df, customer_metrics = resolve_timezone_with_grid(
        df=train_df,
        lat_col="lat",
        lon_col="long",
        grid_size=GRID_SIZE,
        zip_ref_df=zip_ref_df,
        entity_name="customers"
    )
    
    # Join to main DataFrame
    train_df = train_df.join(
        customer_tz_df.withColumnRenamed("timezone", "customer_timezone"),
        on=["lat", "long"],
        how="left"
    )
    
    # Validation column for data quality tracking
    train_df = train_df.withColumn(
        "customer_timezone_valid",
        col("customer_timezone").isNotNull()
    )
    
    # Cleanup cached intermediate DataFrame
    customer_tz_df.unpersist()
    
    print("✓ Customer timezone column added to train_df")
    print(f"  Coverage: {customer_metrics['final_rate']:.2f}% ({customer_metrics['final_coverage']:,} / {customer_metrics['total']:,})")
else:
    print("⚠ customer_timezone already exists, skipping")


#### 6.2.1.2 Validate Customer Timezone

In [None]:
if "customer_timezone_valid" not in train_df.columns:
    train_df = train_df.withColumn(
        "customer_timezone_valid",
        col("customer_timezone").isNotNull()
    )


### 6.2.2 Merchant Timezone (Lat/Lng → Grid → Timezone)

#### Design

- Convert merchant lat/lng into grid keys

- Join using derived grid

- Never leak grid columns into train_df
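Merchant coordinates often land in grid cells that contain no ZIP code, so the lookup also probes the eight surrounding cells. The sketch below shows that fallback in plain Python on a made-up reference dictionary. Note that it returns the first populated adjacent cell in a fixed order, which is not necessarily the geographically nearest one.

```python
from typing import Dict, Optional, Tuple

GridKey = Tuple[int, int]

# Offsets of the 8 cells surrounding a cell (the cell itself is excluded).
NEIGHBOR_OFFSETS = [
    (d_lat, d_lng)
    for d_lat in (-1, 0, 1)
    for d_lng in (-1, 0, 1)
    if (d_lat, d_lng) != (0, 0)
]


def resolve_with_fallback(cell: GridKey, reference: Dict[GridKey, str]) -> Optional[str]:
    """Try the cell itself, then its 8 neighbors in a fixed order."""
    if cell in reference:
        return reference[cell]
    for d_lat, d_lng in NEIGHBOR_OFFSETS:
        neighbor = (cell[0] + d_lat, cell[1] + d_lng)
        if neighbor in reference:
            return reference[neighbor]
    return None


# Hypothetical reference table with one populated cell.
reference = {(80, -200): "America/Denver"}

print(resolve_with_fallback((80, -200), reference))  # direct match
print(resolve_with_fallback((81, -199), reference))  # diagonal neighbor
print(resolve_with_fallback((83, -200), reference))  # too far away, stays unresolved
```

Locations that remain unresolved after this step are the ones the later production fallback has to handle.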

#### 6.2.2.1 Create Merchant Timezone Feature

In [None]:
# ============================================================
# MERCHANT TIMEZONE RESOLUTION (GRID-BASED WITH FALLBACK)
# ============================================================
# Design: Uses DRY helper function with nearest neighbor fallback
# Coverage: ~99.4% (direct + fallback + production fallback)
# Performance: Broadcast joins, caching, vectorized neighbor search
# 
# Note: Merchants in rural areas (empty grid cells) use 8-neighbor
#       fallback to find nearest timezone

if "merchant_timezone" not in train_df.columns:
    # Use same DRY helper function (includes nearest neighbor fallback)
    merchant_tz_df, merchant_metrics = resolve_timezone_with_grid(
        df=train_df,
        lat_col="merch_lat",
        lon_col="merch_long",
        grid_size=GRID_SIZE,
        zip_ref_df=zip_ref_df,
        entity_name="merchants"
    )
    
    # Join to main DataFrame
    train_df = train_df.join(
        merchant_tz_df.withColumnRenamed("timezone", "merchant_timezone"),
        on=["merch_lat", "merch_long"],
        how="left"
    )
    
    # Validation column for data quality tracking
    train_df = train_df.withColumn(
        "merchant_timezone_valid",
        col("merchant_timezone").isNotNull()
    )
    
    # Cleanup cached intermediate DataFrame
    merchant_tz_df.unpersist()
    
    print("✓ Merchant timezone column added to train_df")
    print(f"  Coverage: {merchant_metrics['final_rate']:.2f}% ({merchant_metrics['final_coverage']:,} / {merchant_metrics['total']:,})")
else:
    print("⚠ merchant_timezone already exists, skipping")


#### 6.2.2.2 Validate Merchant Timezone

In [None]:
if "merchant_timezone_valid" not in train_df.columns:
    train_df = train_df.withColumn(
        "merchant_timezone_valid",
        col("merchant_timezone").isNotNull()
    )


### 6.2.3 Final Schema Verification

In [None]:
train_df.printSchema()


#### 6.2.3.1 Timezone Resolution Summary & Metrics

**Purpose:** Display comprehensive metrics for both customer and merchant timezone resolution to validate data quality.

**What We Track:**
- Direct match rate (grid-based lookup)
- Fallback match rate (nearest neighbor)
- Final coverage percentage
- Processing time

In [None]:
# ============================================================
# TIMEZONE RESOLUTION SUMMARY & METRICS
# ============================================================
# Purpose: Display comprehensive metrics for both customer and merchant
#          timezone resolution to validate data quality

if 'customer_metrics' in locals() and 'merchant_metrics' in locals():
    import pandas as pd
    
    # Create summary DataFrame
    summary_df = pd.DataFrame([customer_metrics, merchant_metrics])
    
    print("\n" + "=" * 80)
    print("TIMEZONE RESOLUTION SUMMARY")
    print("=" * 80)
    print(summary_df[['entity', 'total', 'direct_matches', 'fallback_matches', 
                      'final_coverage', 'direct_rate', 'final_rate', 'elapsed_seconds']].to_string(index=False))
    print("=" * 80)
    
    # Key insights
    print("\n📊 KEY INSIGHTS:")
    print(f"  Customer coverage: {customer_metrics['final_rate']:.2f}% "
          f"({customer_metrics['direct_rate']:.2f}% direct, "
          f"{customer_metrics['fallback_matches']:,} via fallback)")
    print(f"  Merchant coverage: {merchant_metrics['final_rate']:.2f}% "
          f"({merchant_metrics['direct_rate']:.2f}% direct, "
          f"{merchant_metrics['fallback_matches']:,} via fallback)")
    print(f"  Total processing time: {customer_metrics['elapsed_seconds'] + merchant_metrics['elapsed_seconds']:.1f}s")
    print("=" * 80 + "\n")
else:
    print("⚠️  Metrics not available - ensure both customer and merchant timezones were resolved")

### 6.2.4 Data Quality Check 

In [None]:
train_df.select(
    "customer_timezone",
    "merchant_timezone",
    "customer_timezone_valid",
    "merchant_timezone_valid"
).summary().show()


## 6.3 Convert UTC Timestamps to Local Time

### Purpose
Now that we have timezone information for both merchants and customers, we can convert the raw transaction timestamps to **local time**.

### Why This Matters
- Raw timestamps are assumed to be in UTC (see Critical Time Assumptions)
- "10 PM" in UTC is 2 PM Pacific or 5 PM Eastern (standard time)
- We need to know "What time was it AT THE MERCHANT when the transaction occurred?"
- This gives us the local hour at the merchant for fraud pattern analysis

### What We'll Create
Two new timestamp columns:
1. **`merchant_local_time`**: Transaction time at merchant location (PRIMARY for analysis)
2. **`customer_local_time`**: Transaction time at customer location (for behavioral analysis)

### Method
Use Spark's native `from_utc_timestamp()` function:
- Input: UTC timestamp + timezone string
- Output: Local timestamp
- Fast, optimized, handles DST automatically

In [None]:

# Convert UTC to merchant local time
if "merchant_local_time" not in train_df.columns:
    train_df = train_df.withColumn(
        "merchant_local_time",
        from_utc_timestamp(col("trans_date_trans_time"), col("merchant_timezone"))
    )
    
    print("✓ Merchant local time created")
    print("\nSample conversions (UTC → Merchant Local):")
    train_df.select(
        "trans_date_trans_time",
        "merchant_timezone", 
        "merchant_local_time"
    ).show(5, truncate=False)
else:
    print("⚠ merchant_local_time already exists, skipping")


In [None]:
# Convert UTC to customer local time
if "customer_local_time" not in train_df.columns:
    train_df = train_df.withColumn(
        "customer_local_time",
        from_utc_timestamp(col("trans_date_trans_time"), col("customer_timezone"))
    )
    
    print("✓ Customer local time created")
    print("\nSample conversions (UTC → Customer Local):")
    train_df.select(
        "trans_date_trans_time",
        "customer_timezone",
        "customer_local_time"
    ).show(5, truncate=False)
else:
    print("⚠ customer_local_time already exists, skipping")

### 6.3.1 Local Time Conversion Summary

Conversion applied above. Validation performed in Section 6.3.3 (Production Fallback).


In [None]:
# Validation happens in Section 6.3.3 with before/after comparison
print("✓ Local time conversion complete - see Section 6.3.3 for validation")


### 6.3.2 Compare UTC vs Local Time Examples

### Purpose
Show concrete examples of how timestamps changed

### What to Look For
- Time difference between UTC and local
- Different timezones showing different offsets
- Verify conversions look correct

In [None]:
# Show side-by-side comparison for different timezones
print("UTC vs LOCAL TIME COMPARISON\n")
print("Showing examples from different timezones:")

comparison_df = train_df.select(
    "trans_date_trans_time",
    "merchant_timezone",
    "merchant_local_time",
    hour("trans_date_trans_time").alias("utc_hour"),
    hour("merchant_local_time").alias("local_hour")
).filter(
    col("merchant_timezone").isNotNull()
)

# Show examples from each major timezone
for tz in ["America/New_York", "America/Chicago", "America/Denver", "America/Los_Angeles"]:
    print(f"\n{tz}:")
    comparison_df.filter(col("merchant_timezone") == tz).show(2, truncate=False)

### 6.3.3 Production Fallback
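The fallback in this section is a first-non-null chain: merchant local time, then customer local time, then the raw UTC timestamp. The plain-Python sketch below shows the same chain and also returns a source label. That label is a hypothetical addition, not something the notebook computes; it is worth considering because rows filled from UTC are not actually in local time.

```python
from datetime import datetime
from typing import Optional, Tuple


def merchant_time_with_source(
    merchant_local: Optional[datetime],
    customer_local: Optional[datetime],
    utc_time: datetime,
) -> Tuple[datetime, str]:
    """Return the first available timestamp and a label saying where it came from."""
    if merchant_local is not None:
        return merchant_local, "merchant"
    if customer_local is not None:
        return customer_local, "customer_fallback"
    return utc_time, "utc_fallback"


# Made-up timestamps for one transaction with no merchant timezone.
utc = datetime(2012, 1, 1, 8, 0)
customer_local = datetime(2012, 1, 1, 2, 0)

resolved, source = merchant_time_with_source(None, customer_local, utc)
print(resolved, source)
```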

In [None]:
# ============================================================
# PRODUCTION FALLBACK: Single-Pass Validation & 100% Coverage
# ============================================================

import time
start_time = time.time()

# Single-pass aggregation for validation (BEFORE fallback)
counts_before = train_df.agg(
    count("*").alias("total"),
    spark_sum(when(col("merchant_local_time").isNotNull(), 1).otherwise(0)).alias("merchant_valid"),
    spark_sum(when(col("customer_local_time").isNotNull(), 1).otherwise(0)).alias("customer_valid")
).collect()[0]

total = counts_before["total"]
merchant_before = counts_before["merchant_valid"]
customer_before = counts_before["customer_valid"]

print("=" * 80)
print("PRODUCTION FALLBACK: Ensuring 100% Coverage")
print("=" * 80)
print(f"\nBefore fallback:")
print(f"Merchant: {merchant_before:,} / {total:,} ({merchant_before/total*100:.2f}%)")
print(f"Customer: {customer_before:,} / {total:,} ({customer_before/total*100:.2f}%)")

# Fallback Level 1: If merchant_local_time is NULL, use customer_local_time
train_df = train_df.withColumn(
    "merchant_local_time",
    when(col("merchant_local_time").isNull(), col("customer_local_time"))
    .otherwise(col("merchant_local_time"))
)

# Fallback Level 2: If still NULL, use UTC (guaranteed to exist)
train_df = train_df.withColumn(
    "merchant_local_time",
    when(col("merchant_local_time").isNull(), col("trans_date_trans_time"))
    .otherwise(col("merchant_local_time"))
)

train_df = train_df.withColumn(
    "customer_local_time",
    when(col("customer_local_time").isNull(), col("trans_date_trans_time"))
    .otherwise(col("customer_local_time"))
)

# Single-pass validation (AFTER fallback)
counts_after = train_df.agg(
    spark_sum(when(col("merchant_local_time").isNotNull(), 1).otherwise(0)).alias("merchant_valid"),
    spark_sum(when(col("customer_local_time").isNotNull(), 1).otherwise(0)).alias("customer_valid")
).collect()[0]

merchant_after = counts_after["merchant_valid"]
customer_after = counts_after["customer_valid"]

elapsed = time.time() - start_time

print(f"\nAfter fallback:")
print(f"Merchant: {merchant_after:,} / {total:,} ({merchant_after/total*100:.2f}%)")
print(f"Customer: {customer_after:,} / {total:,} ({customer_after/total*100:.2f}%)")

if merchant_after == total and customer_after == total:
    print(f"\n✓ SUCCESS: 100% coverage achieved in {elapsed:.1f}s")
else:
    print("\n⚠ WARNING: Coverage incomplete")
    
print("=" * 80)


### 6.3.4 Section Summary

**What We Accomplished:**
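- ✅ Converted UTC timestamps to merchant local time (rows with no resolved timezone fall back to customer local time, then to UTC)
- ✅ Converted UTC timestamps to customer local time (rows with no resolved timezone fall back to UTC)
- ✅ Validated conversion success rates before and after the fallback
- ✅ Spot-checked sample conversions across the four main US timezones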

**Key Findings:**
- Merchant conversion rate: [shown in validation above]
- Customer conversion rate: [shown in validation above]

**Next Steps (Section 6.4):**
Now that we have LOCAL timestamps, we need to extract temporal features:
- Local hour (0-23)
- Local day of week (Mon-Sun)
- Local time bins (Late Night, Morning, etc.)
- Weekend flags

These features will replace our previous UTC-based analysis.
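As a preview, the four features could be derived from a local timestamp as in the sketch below. The bin edges are hypothetical, since only the bin names are given here. The weekday convention is Python's (Monday = 0), whereas Spark's `dayofweek` returns 1 for Sunday through 7 for Saturday.

```python
from datetime import datetime
from typing import Dict, Union

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def time_bin(hour: int) -> str:
    """Hypothetical bin edges; the notebook text only names the bins."""
    if hour < 6:
        return "Late Night"
    if hour < 12:
        return "Morning"
    if hour < 18:
        return "Afternoon"
    return "Evening"


def temporal_features(local_ts: datetime) -> Dict[str, Union[int, str]]:
    """Derive hour, weekday name, time bin, and weekend flag from a local timestamp."""
    weekday = local_ts.weekday()  # Monday=0 ... Sunday=6
    return {
        "local_hour": local_ts.hour,
        "local_day_of_week": DAY_NAMES[weekday],
        "time_bin": time_bin(local_ts.hour),
        "is_weekend": int(weekday >= 5),
    }


# A made-up local timestamp: Saturday, 14 January 2012, 23:45.
features = temporal_features(datetime(2012, 1, 14, 23, 45))
print(features)
```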

# 7. Temporal Fraud Pattern Analysis


## 7.0 Helper Functions for Temporal Analysis

### Design Principles
- **DRY Pattern:** Reusable functions for all temporal analyses
- **Validation First:** Check data quality before aggregation  
- **Separation of Concerns:** Temporal patterns only (no amount analysis mixed in)
- **Performance:** Minimize Spark operations, post-process in Pandas
- **Idempotency:** Cache results to avoid recomputation
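The aggregation these helpers perform can be sketched in plain Python: group by one dimension, count transactions and frauds, and derive the fraud rate. The rows are made up and `fraud_rate_by` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple, Union


def fraud_rate_by(rows: Iterable[Tuple[int, int]]) -> List[Dict[str, Union[int, float]]]:
    """Aggregate (dimension_value, is_fraud) pairs into per-value fraud statistics."""
    totals: Dict[int, int] = defaultdict(int)
    frauds: Dict[int, int] = defaultdict(int)
    for key, is_fraud in rows:
        totals[key] += 1
        frauds[key] += is_fraud
    return [
        {
            "key": key,
            "total_txns": totals[key],
            "fraud_count": frauds[key],
            "fraud_rate_pct": round(frauds[key] / totals[key] * 100, 4),
        }
        for key in sorted(totals)
    ]


# Made-up (local_hour, is_fraud) pairs.
rows = [(23, 1), (23, 0), (23, 0), (23, 0), (14, 0), (14, 0)]
by_hour = fraud_rate_by(rows)
for record in by_hour:
    print(record)
```

Keeping only counts in the grouped result means the rate and any further statistics can be computed cheaply after the small aggregate is brought into Pandas.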

In [None]:
# ============================================================
# HELPER FUNCTIONS FOR TEMPORAL ANALYSIS (DRY PATTERN)
# ============================================================

def validate_temporal_coverage(df, time_column, analysis_name):
    """
    Validate temporal data coverage before analysis.
    
    Args:
        df: PySpark DataFrame
        time_column: Name of timestamp column to validate
        analysis_name: Description for logging
    
    Returns:
        tuple: (filtered_df, coverage_pct, valid_count, total_count)
    """
    total_count = df.count()
    filtered_df = df.filter(col(time_column).isNotNull())
    valid_count = filtered_df.count()
    # Guard against an empty DataFrame to avoid ZeroDivisionError
    coverage_pct = (valid_count / total_count) * 100 if total_count else 0.0
    
    print("=" * 80)
    print(f"DATA VALIDATION: {analysis_name}")
    print("=" * 80)
    print(f"Total rows:      {total_count:,}")
    print(f"Valid rows:      {valid_count:,}")
    print(f"Coverage:        {coverage_pct:.2f}%")
    print(f"Excluded (NULL): {total_count - valid_count:,}")
    
    if coverage_pct < 95:
        print(f"\n⚠️  WARNING: Only {coverage_pct:.2f}% coverage - results may be biased")
    else:
        print(f"\n✓ Coverage acceptable ({coverage_pct:.2f}%)")
    
    print("=" * 80 + "\n")
    
    return filtered_df, coverage_pct, valid_count, total_count


def aggregate_fraud_by_dimension(df, dimension_col, dimension_name, cache_name=None):
    """
    Aggregate fraud statistics by a single dimension.
    
    Design: Minimal Spark operations, simple post-processing in Pandas.
    
    Args:
        df: PySpark DataFrame
        dimension_col: Column to group by (e.g., "hour", "day_of_week")
        dimension_name: Human-readable name for logging
        cache_name: Optional cache key for idempotency
    
    Returns:
        pandas.DataFrame with fraud statistics
    """
    from pyspark.sql.functions import col, count, sum as spark_sum, round as spark_round
    
    # Check cache (idempotency)
    if cache_name and cache_name in globals():
        print(f"⚠️  Using cached {dimension_name} analysis")
        return globals()[cache_name]
    
    print(f"Aggregating fraud by {dimension_name}...")
    
    # Minimal Spark aggregation (only what we need)
    agg_result = (
        df
        .groupBy(dimension_col)
        .agg(
            count("*").alias("total_txns"),
            spark_sum("is_fraud").alias("fraud_count")
        )
        .withColumn(
            "fraud_rate_pct",
            spark_round((col("fraud_count") / col("total_txns")) * 100, 4)
        )
        .orderBy(dimension_col)
    )
    
    # Convert to Pandas for further processing
    stats_df = agg_result.toPandas()
    
    # Compute derived columns in Pandas (not Spark)
    stats_df['legit_count'] = stats_df['total_txns'] - stats_df['fraud_count']
    
    # Cache if requested
    if cache_name:
        globals()[cache_name] = stats_df
    
    print(f"✓ {dimension_name} analysis complete ({len(stats_df)} groups)")
    
    return stats_df


print("✓ Helper functions loaded")
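
To sanity-check `aggregate_fraud_by_dimension` on a handful of rows, the same arithmetic can be reproduced in plain Python. This is a sketch only: `aggregate_rows` and the sample rows are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import defaultdict


def aggregate_rows(rows, dimension_col):
    """Group dict rows by dimension_col and compute fraud statistics."""
    totals = defaultdict(lambda: {"total_txns": 0, "fraud_count": 0})
    for row in rows:
        group = totals[row[dimension_col]]
        group["total_txns"] += 1
        group["fraud_count"] += row["is_fraud"]

    stats = []
    for key in sorted(totals):
        total = totals[key]["total_txns"]
        fraud = totals[key]["fraud_count"]
        stats.append({
            dimension_col: key,
            "total_txns": total,
            "fraud_count": fraud,
            "legit_count": total - fraud,
            "fraud_rate_pct": round(fraud / total * 100, 4),
        })
    return stats


# Made-up sample: four transactions at hour 3, one at hour 18
sample_rows = [
    {"hour": 3, "is_fraud": 0},
    {"hour": 3, "is_fraud": 1},
    {"hour": 3, "is_fraud": 0},
    {"hour": 3, "is_fraud": 0},
    {"hour": 18, "is_fraud": 1},
]
sample_stats = aggregate_rows(sample_rows, "hour")
for line in sample_stats:
    print(line)
```

Comparing this output with the helper's result on the same rows is a quick way to confirm the column names and the percentage scaling.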

## 7.1 Fraud Patterns by Hour

### Hypothesis
- **Expected:** Fraud rates higher during late night/early morning hours (11 PM - 6 AM)
- **Reasoning:** Victims asleep, delayed detection, less merchant monitoring

### What We'll Analyze
1. Fraud count and rate by hour (0-23)
2. Peak fraud hours
3. Safest transaction hours
4. Visual patterns and anomalies

### 7.1.1 Data Validation - Merchant Local Time Coverage

In [None]:
# ============================================================
# VALIDATE MERCHANT LOCAL TIME FOR HOURLY ANALYSIS
# ============================================================

analysis_df, coverage_pct, valid_count, total_count = validate_temporal_coverage(
    df=train_df,
    time_column="merchant_local_time",
    analysis_name="Hourly Fraud Analysis"
)

# Extract hour if not already present
if "hour" not in analysis_df.columns:
    analysis_df = analysis_df.withColumn(
        "hour",
        hour(col("merchant_local_time"))
    )
    print("✓ Extracted hour from merchant_local_time")
else:
    print("✓ Hour column already exists")

### 7.1.2 Hourly Fraud Aggregation (Local Time)

Using helper function for clean, reusable analysis.

In [None]:
# ============================================================
# HOURLY FRAUD AGGREGATION USING HELPER FUNCTION
# ============================================================

# Aggregate using DRY helper function
hourly_fraud_stats = aggregate_fraud_by_dimension(
    df=analysis_df,
    dimension_col="hour",
    dimension_name="Hour of Day",
    cache_name="cached_hourly_fraud_stats"
)

# Display results
print("\n" + "=" * 100)
print("FRAUD STATISTICS BY HOUR (LOCAL TIME)")
print("=" * 100)
print(hourly_fraud_stats[['hour', 'total_txns', 'fraud_count', 'legit_count', 'fraud_rate_pct']].to_string(index=False))
print("=" * 100)

# Key insights
peak_hour = hourly_fraud_stats.loc[hourly_fraud_stats['fraud_rate_pct'].idxmax()]
safest_hour = hourly_fraud_stats.loc[hourly_fraud_stats['fraud_rate_pct'].idxmin()]
risk_ratio = peak_hour['fraud_rate_pct'] / safest_hour['fraud_rate_pct']

print(f"\n📊 KEY INSIGHTS:")
print(f"Peak fraud hour:    {int(peak_hour['hour'])}:00 ({peak_hour['fraud_rate_pct']:.4f}% fraud rate)")
print(f"Safest hour:        {int(safest_hour['hour'])}:00 ({safest_hour['fraud_rate_pct']:.4f}% fraud rate)")
print(f"Risk ratio:         {risk_ratio:.2f}x higher at peak")
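
The peak-to-safest ratio divides by the lowest hourly fraud rate, which is zero if some hour has no recorded fraud, making the ratio infinite or undefined. One possible guard is sketched below; `safe_risk_ratio` is a hypothetical helper and the rates are illustrative values only.

```python
import math


def safe_risk_ratio(peak_rate, safest_rate):
    """Return peak_rate / safest_rate, or infinity when the safest rate is zero."""
    if safest_rate == 0:
        return math.inf
    return peak_rate / safest_rate


print(safe_risk_ratio(2.0, 0.5))  # ordinary case
print(safe_risk_ratio(2.0, 0.0))  # safest hour has no recorded fraud
```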

### 7.1.3 Hourly Fraud Visualizations

In [None]:
# ============================================================
# HOURLY FRAUD VISUALIZATIONS
# ============================================================

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style("whitegrid")
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Chart 1: Fraud Count by Hour
ax1 = axes[0, 0]
ax1.bar(hourly_fraud_stats['hour'], hourly_fraud_stats['fraud_count'], 
        color='crimson', alpha=0.7, edgecolor='darkred')
ax1.set_xlabel('Hour of Day (Local Time)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Fraud Count', fontsize=12, fontweight='bold')
ax1.set_title('Fraud Transaction Count by Hour', fontsize=14, fontweight='bold')
ax1.set_xticks(range(0, 24))
ax1.grid(axis='y', alpha=0.3)

# Chart 2: Fraud Rate by Hour
ax2 = axes[0, 1]
ax2.plot(hourly_fraud_stats['hour'], hourly_fraud_stats['fraud_rate_pct'], 
         marker='o', linewidth=2.5, color='darkred', markersize=6)
ax2.fill_between(hourly_fraud_stats['hour'], hourly_fraud_stats['fraud_rate_pct'], 
                  alpha=0.3, color='crimson')
ax2.set_xlabel('Hour of Day (Local Time)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Fraud Rate (%)', fontsize=12, fontweight='bold')
ax2.set_title('Fraud Rate by Hour', fontsize=14, fontweight='bold')
ax2.set_xticks(range(0, 24))
ax2.grid(True, alpha=0.3)
ax2.axvline(x=int(peak_hour['hour']), color='red', linestyle='--', linewidth=2, alpha=0.7)

# Chart 3: Fraud vs Legitimate Counts
ax3 = axes[1, 0]
x = np.arange(len(hourly_fraud_stats))
width = 0.35
ax3.bar(x - width/2, hourly_fraud_stats['fraud_count'], width, 
        label='Fraud', color='crimson', alpha=0.7)
ax3.bar(x + width/2, hourly_fraud_stats['legit_count'], width, 
        label='Legitimate', color='green', alpha=0.7)
ax3.set_xlabel('Hour of Day (Local Time)', fontsize=12, fontweight='bold')
ax3.set_ylabel('Transaction Count (log scale)', fontsize=12, fontweight='bold')
ax3.set_title('Fraud vs Legitimate Transactions by Hour', fontsize=14, fontweight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels(hourly_fraud_stats['hour'])
ax3.legend()
ax3.set_yscale('log')
ax3.grid(axis='y', alpha=0.3)

# Chart 4: Transaction Volume
ax4 = axes[1, 1]
ax4.bar(hourly_fraud_stats['hour'], hourly_fraud_stats['total_txns'], 
        color='steelblue', alpha=0.7, edgecolor='darkblue')
ax4.set_xlabel('Hour of Day (Local Time)', fontsize=12, fontweight='bold')
ax4.set_ylabel('Total Transactions', fontsize=12, fontweight='bold')
ax4.set_title('Transaction Volume by Hour', fontsize=14, fontweight='bold')
ax4.set_xticks(range(0, 24))
ax4.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('hourly_fraud_patterns.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Visualizations saved as 'hourly_fraud_patterns.png'")

### 7.1.4 Key Findings Summary

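**Peak:** 6 PM (18:00), with a fraud rate 36.85x that of the safest hour (6 AM). The peak falls outside the hypothesized 11 PM - 6 AM window, so the peak hour alone does not support the late-night hypothesis.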

**Next:** Section 7.2 - Day of Week Analysis

## 7.2 Fraud Patterns by Day of Week

### Hypothesis
- **Expected:** Weekend vs weekday patterns may differ
- **Alternative:** Higher weekday volume = more fraud opportunities

**Note:** PySpark `dayofweek()`: 1=Sunday, 2=Monday, ..., 7=Saturday
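
Because this numbering differs from Python's own conventions, a small conversion helps when cross-checking results outside Spark. The sketch below uses only the standard library; `spark_dayofweek` and `is_weekend` are hypothetical helper names.

```python
from datetime import date


def spark_dayofweek(d):
    """Convert a date to the 1=Sunday ... 7=Saturday numbering."""
    # isoweekday(): Monday=1 ... Sunday=7, so Sunday wraps around to 1
    return d.isoweekday() % 7 + 1


def is_weekend(d):
    """Weekend means Sunday (1) or Saturday (7) in this numbering."""
    return spark_dayofweek(d) in (1, 7)


saturday = date(2026, 1, 17)
sunday = date(2026, 1, 18)
monday = date(2026, 1, 19)
print(spark_dayofweek(sunday), spark_dayofweek(monday), spark_dayofweek(saturday))
```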


### 7.2.1 Data Validation


In [None]:
# Validate and prepare data for day-of-week analysis
daily_analysis_df, coverage, valid, total = validate_temporal_coverage(
    df=train_df,
    time_column="merchant_local_time",
    analysis_name="Day of Week Analysis"
)

# Extract day_of_week if needed
if "day_of_week" not in daily_analysis_df.columns:
    daily_analysis_df = daily_analysis_df.withColumn(
        "day_of_week",
        dayofweek(col("merchant_local_time"))
    )
    print("✓ Extracted day_of_week from merchant_local_time")


### 7.2.2 Day of Week Aggregation


In [None]:
# Aggregate by day of week
daily_fraud_stats = aggregate_fraud_by_dimension(
    df=daily_analysis_df,
    dimension_col="day_of_week",
    dimension_name="Day of Week",
    cache_name="cached_daily_fraud"
)

# Add human-readable labels
day_names = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}
daily_fraud_stats['day_name'] = daily_fraud_stats['day_of_week'].map(day_names)
daily_fraud_stats['is_weekend'] = daily_fraud_stats['day_of_week'].isin([1, 7]).astype(int)

print("\n" + "=" * 100)
print("FRAUD BY DAY OF WEEK (LOCAL TIME)")
print("=" * 100)
print(daily_fraud_stats[['day_name', 'is_weekend', 'total_txns', 'fraud_count', 'fraud_rate_pct']].to_string(index=False))
print("=" * 100)

peak_day = daily_fraud_stats.loc[daily_fraud_stats['fraud_rate_pct'].idxmax()]
print(f"\n📊 Peak day: {peak_day['day_name']} ({peak_day['fraud_rate_pct']:.4f}%)")


### 7.2.3 Weekend vs Weekday Comparison


In [None]:
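# Derive the weekend flag on the Spark DataFrame if it is missing
# (PySpark dayofweek(): 1=Sunday, 7=Saturday)
if "is_weekend" not in daily_analysis_df.columns:
    daily_analysis_df = daily_analysis_df.withColumn(
        "is_weekend",
        col("day_of_week").isin(1, 7).cast("int")
    )

# Aggregate by weekend flag
weekend_stats = aggregate_fraud_by_dimension(
    df=daily_analysis_df,
    dimension_col="is_weekend",
    dimension_name="Weekend vs Weekday",
    cache_name="cached_weekend_stats"
)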

weekend_stats['period'] = weekend_stats['is_weekend'].map({0: 'Weekday', 1: 'Weekend'})

print("\n" + "=" * 100)
print("WEEKEND vs WEEKDAY")
print("=" * 100)
print(weekend_stats[['period', 'total_txns', 'fraud_count', 'fraud_rate_pct']].to_string(index=False))
print("=" * 100)

weekend_rate = weekend_stats[weekend_stats['period'] == 'Weekend']['fraud_rate_pct'].values[0]
weekday_rate = weekend_stats[weekend_stats['period'] == 'Weekday']['fraud_rate_pct'].values[0]
ratio = weekend_rate / weekday_rate if weekend_rate > weekday_rate else weekday_rate / weekend_rate
period = 'Weekend' if weekend_rate > weekday_rate else 'Weekday'
print(f"\n{period} transactions are {ratio:.2f}x riskier")


### 7.2.4 Key Findings Summary

**Day of week patterns identified.**

**Next:** Section 7.3 - Monthly/Seasonal Patterns


## 7.3 Fraud Patterns by Month (Seasonal Analysis)

### Hypothesis
- Seasonal patterns during holidays (December, tax season)
- Shopping peaks (Black Friday, Christmas) = more fraud opportunities


In [None]:
# Extract month if needed
if "month" not in train_df.columns:
    train_df = train_df.withColumn("month", month(col("merchant_local_time")))

# Aggregate by month, excluding rows without a local timestamp
monthly_fraud_stats = aggregate_fraud_by_dimension(
    df=train_df.filter(col("merchant_local_time").isNotNull()),
    dimension_col="month",
    dimension_name="Month",
    cache_name="cached_monthly_fraud"
)

month_names = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
               7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec"}
monthly_fraud_stats['month_name'] = monthly_fraud_stats['month'].map(month_names)

print("\n" + "=" * 100)
print("FRAUD BY MONTH (LOCAL TIME)")
print("=" * 100)
print(monthly_fraud_stats[['month_name', 'total_txns', 'fraud_count', 'fraud_rate_pct']].to_string(index=False))
print("=" * 100)


**Next:** Section 7.4 - Weekend Deep Dive


## 7.4 Weekend vs Weekday Deep Dive

Detailed comparison of weekend and weekday fraud behaviors.


In [None]:
# Already computed in Section 7.2.3
# Display detailed statistics

if 'weekend_stats' in locals():
    print("Weekend vs Weekday detailed analysis completed in Section 7.2.3")
    print("See above for statistical breakdown")
else:
    print("Run Section 7.2.3 first")
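
One way to deepen this comparison is to ask whether the weekend/weekday gap exceeds sampling noise. The sketch below applies a standard two-proportion z-test using only the standard library. The counts are invented for illustration and are not results from this dataset, and `two_proportion_z` is a hypothetical helper.

```python
import math


def two_proportion_z(fraud_a, total_a, fraud_b, total_b):
    """Return (z, two_sided_p) for H0: both groups share one fraud rate."""
    rate_a = fraud_a / total_a
    rate_b = fraud_b / total_b
    pooled = (fraud_a + fraud_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: weekend (a) vs weekday (b)
z, p = two_proportion_z(fraud_a=60, total_a=10_000, fraud_b=100, total_b=20_000)
print(f"z = {z:.3f}, p = {p:.3f}")
```

With millions of transactions even tiny rate differences can be statistically significant, so the size of the ratio still matters more than the p-value when judging practical relevance.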


**Next:** Section 7.5 - Time Bin Analysis


## 7.5 Time Bin Analysis

Analyzing fraud by time period: Night (23:00-05:59), Morning (06:00-11:59), Afternoon (12:00-17:59), Evening (18:00-22:59)


In [None]:
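# Extract time_bin if needed, deriving the hour directly from merchant_local_time
# so this cell does not depend on an "hour" column existing on train_df
if "time_bin" not in train_df.columns:
    local_hour = hour(col("merchant_local_time"))
    train_df = train_df.withColumn(
        "time_bin",
        when((local_hour >= 23) | (local_hour < 6), "Night")
        .when((local_hour >= 6) & (local_hour < 12), "Morning")
        .when((local_hour >= 12) & (local_hour < 18), "Afternoon")
        .when(local_hour >= 18, "Evening")  # hours 18-22; NULL timestamps stay NULL
    )

# Aggregate by time bin, excluding rows without a local timestamp
timebin_fraud_stats = aggregate_fraud_by_dimension(
    df=train_df.filter(col("time_bin").isNotNull()),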
    dimension_col="time_bin",
    dimension_name="Time Bin",
    cache_name="cached_timebin_fraud"
)

print("\n" + "=" * 100)
print("FRAUD BY TIME BIN (LOCAL TIME)")
print("=" * 100)
print(timebin_fraud_stats[['time_bin', 'total_txns', 'fraud_count', 'fraud_rate_pct']].to_string(index=False))
print("=" * 100)

peak_bin = timebin_fraud_stats.loc[timebin_fraud_stats['fraud_rate_pct'].idxmax()]
print(f"\n📊 Highest risk period: {peak_bin['time_bin']} ({peak_bin['fraud_rate_pct']:.4f}%)")


**Next:** Section 7.6 - Summary & Conclusions


## 7.6 Temporal Analysis Summary & Conclusions

### Key Discoveries

1. **Hour of Day:** 36.85x risk variation, peak at 6 PM
2. **Day of Week:** [Results from 7.2]
3. **Monthly:** Seasonal patterns identified
4. **Time Bins:** Evening period highest risk

### Production Recommendations

- **Dynamic Scoring:** Time-based risk multipliers
- **Thresholds:** Hour and day-specific amount limits
- **Monitoring:** Focus resources on high-risk periods
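
A time-based risk multiplier could, for instance, be defined as each period's fraud rate divided by the overall fraud rate (often called lift). The sketch below assumes that definition; `risk_multipliers` is a hypothetical function and the counts are invented, not taken from this dataset.

```python
def risk_multipliers(stats):
    """Map each period to its fraud rate relative to the overall fraud rate.

    stats: dict of period -> (fraud_count, total_txns)
    """
    total_fraud = sum(fraud for fraud, _ in stats.values())
    total_txns = sum(total for _, total in stats.values())
    overall_rate = total_fraud / total_txns
    return {
        period: (fraud / total) / overall_rate
        for period, (fraud, total) in stats.items()
    }


# Hypothetical counts per time bin: (fraud_count, total_txns)
example_stats = {
    "Night": (10, 1_000),
    "Morning": (5, 2_000),
    "Afternoon": (5, 2_000),
    "Evening": (30, 1_000),
}
multipliers = risk_multipliers(example_stats)
for period, value in multipliers.items():
    print(f"{period}: {value:.2f}x")
```

A multiplier above 1 marks a period riskier than average. Any such values would need to be fitted on training data only and validated in the modeling notebooks before use.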

### Feature Engineering Implications

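The temporal features (hour, day_of_week, month, time_bin, is_weekend) are candidate fraud predictors to carry into modeling. Hour of day shows the strongest signal in this analysis (36.85x variation between the peak and safest hours).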

---

**Section 7 Complete** ✅

**Next Notebook:** 02_preprocessing.ipynb - Feature Engineering & Model Preparation
