# Model Training

- **Purpose:** Set up the environment, then load and validate the feature-engineered dataset for fraud model training  
- **Author:** Devbrew LLC  
- **Last Updated:** October 23, 2025  
- **Status:** In progress  
- **License:** Apache 2.0 (Code) | Non-commercial (Data)

---

## Dataset License Notice

This notebook uses the **IEEE-CIS Fraud Detection dataset** from Kaggle.

**Dataset License:** Non-commercial research use only  
- You must download the dataset yourself from [Kaggle IEEE-CIS Competition](https://www.kaggle.com/c/ieee-fraud-detection)  
- You must accept the competition rules before downloading  
- You may not use the dataset for commercial purposes  
- You may not redistribute the raw dataset

**Setup Instructions:** See [`../data_catalog/README.md`](../data_catalog/README.md) for download instructions.

**Code License:** This notebook's code is licensed under Apache 2.0 (open source).

---

## Notebook Configuration

### Environment Setup

We import the required libraries, apply display and plotting defaults, and fix the random seed. Seeding NumPy's global generator makes NumPy-based random operations repeatable across runs, so experiments can be compared reliably. Note that this cell also suppresses all Python warnings.

All later training cells rely on these settings.

In [2]:
import warnings
from pathlib import Path
import json
import hashlib
from typing import Dict, Any, Optional

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", '{:.2f}'.format)

# Plotting configuration
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Environment configured successfully")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")

Environment configured successfully
pandas: 2.3.3
numpy: 2.3.3
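
The setup cell seeds NumPy's legacy global generator. As a minimal sketch that is not part of the original notebook, a locally scoped `numpy.random.Generator` seeded from the same constant gives the same repeatability without relying on global state. The helper names `make_rng` and `sample_indices` are hypothetical and used only for illustration.

```python
import numpy as np

RANDOM_STATE = 42  # same constant as in the setup cell


def make_rng(seed: int = RANDOM_STATE) -> np.random.Generator:
    """Return a fresh, locally scoped generator seeded with `seed` (hypothetical helper)."""
    return np.random.default_rng(seed)


def sample_indices(n_rows: int, n_samples: int, seed: int = RANDOM_STATE) -> np.ndarray:
    """Draw `n_samples` distinct row indices reproducibly (hypothetical helper)."""
    rng = make_rng(seed)
    return rng.choice(n_rows, size=n_samples, replace=False)


first = sample_indices(1_000, 5)
second = sample_indices(1_000, 5)
# Two calls with the same seed yield the same draw
print(np.array_equal(first, second))
```

Passing a generator (or a seed) explicitly into each function that needs randomness keeps results independent of cell execution order, which a global seed does not guarantee.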


### Path Configuration

We define the project directory structure and check that the processed outputs of the feature engineering stage exist. The check reports which artifacts are found or missing before we proceed with training.

All paths are derived from `PROJECT_ROOT`, which assumes the notebook is run from a directory one level below the project root.

In [3]:
# Project paths
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data_catalog"
IEEE_CIS_DIR = DATA_DIR / "ieee-fraud"
PROCESSED_DIR = DATA_DIR / "processed"
NOTEBOOKS_DIR = PROJECT_ROOT / "notebooks"

# Ensure processed directory exists
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Expected artifacts
FEATURES_PATH = PROCESSED_DIR / "train_features.parquet"
FE_METADATA_PATH = PROCESSED_DIR / "feature_engineering_metadata.json"

def validate_required_artifacts() -> Dict[str, bool]:
    """Report whether each required artifact exists and return the per-file status."""
    path_status = {
        'train_features.parquet': FEATURES_PATH.exists(),
        'feature_engineering_metadata.json': FE_METADATA_PATH.exists()
    }
    print("Artifact Availability Check:")
    for name, exists in path_status.items():
        status = "Found" if exists else "Missing"
        print(f" - {name}: {status}")

    if not all(path_status.values()):
        print("\n[WARNING] Some artifacts are missing; ensure feature engineering completed successfully")
    else:
        print("\nAll required artifacts are available")

    # Return the status so the assignment below holds a usable value instead of None
    return path_status

artifact_status = validate_required_artifacts()

Artifact Availability Check:
 - train_features.parquet: Found
 - feature_engineering_metadata.json: Found

All required artifacts are available
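
The check above only prints a warning when a file is missing. If a hard stop is preferred, one option is a fail-fast variant, sketched below under the assumption that training should not start without every artifact. `require_artifacts` is a hypothetical helper, not something defined in the notebook.

```python
from pathlib import Path
from typing import Dict, List


def require_artifacts(paths: Dict[str, Path]) -> None:
    """Raise FileNotFoundError naming every missing artifact (hypothetical helper)."""
    missing: List[str] = [name for name, path in paths.items() if not path.exists()]
    if missing:
        raise FileNotFoundError(
            f"Missing required artifacts: {', '.join(sorted(missing))}"
        )


# Possible usage with the paths defined in the cell above:
# require_artifacts({
#     "train_features.parquet": FEATURES_PATH,
#     "feature_engineering_metadata.json": FE_METADATA_PATH,
# })
```

Raising early surfaces a broken pipeline stage at the top of the notebook rather than as a less obvious error in a later cell.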


## Load Features & Data Manifest

We load the feature-engineered dataset and validate integrity against recorded metadata. We also create a simple data manifest to document:
- shape
- feature count
- missing values
- target distribution (`isFraud`)
- memory footprint
- file hash (for reproducibility)

### Validation Checklist
- Verify shape matches metadata (590,540 × 432)
- Confirm zero missing values
- Check target distribution (~3.5% fraud rate)
- Validate `TransactionDT` exists for time-based split
- Validate identifiers (`TransactionID`) and target (`isFraud`) are present
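
The checklist above could also be expressed as a single function that returns one pass/fail flag per item. This is a sketch under stated assumptions: `run_validation_checklist` is a hypothetical name, the fraud-rate bounds of 3% to 4% are an assumed tolerance around the ~3.5% figure, and the toy frame stands in for the real dataset.

```python
from typing import Dict, Optional, Tuple

import pandas as pd


def run_validation_checklist(
    df: pd.DataFrame,
    expected_shape: Optional[Tuple[int, int]] = None,
    fraud_rate_bounds: Tuple[float, float] = (0.03, 0.04),  # assumed tolerance
) -> Dict[str, bool]:
    """Return one boolean per checklist item (hypothetical helper)."""
    required = ["TransactionID", "TransactionDT", "isFraud"]
    results = {
        # Shape is only compared when an expected shape is supplied
        "shape_matches": expected_shape is None or df.shape == tuple(expected_shape),
        "no_missing_values": not df.isna().any().any(),
        "required_columns_present": all(col in df.columns for col in required),
    }
    if "isFraud" in df.columns:
        low, high = fraud_rate_bounds
        results["fraud_rate_in_bounds"] = low <= float(df["isFraud"].mean()) <= high
    else:
        results["fraud_rate_in_bounds"] = False
    return results


# Toy frame for illustration only; not the real dataset
toy = pd.DataFrame({
    "TransactionID": range(200),
    "TransactionDT": range(200),
    "isFraud": [1] * 7 + [0] * 193,  # 7 of 200 rows are positive
})
print(run_validation_checklist(toy, expected_shape=(200, 3)))
```

Returning a dictionary rather than asserting lets the caller decide whether a failed item should stop the run or only be logged.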

In [5]:
def file_sha256(path: Path, chunk_size: int = 2**20) -> Optional[str]:
    """Compute the SHA-256 hash of a file; return None if the file is missing."""
    if not path.exists():
        return None
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def build_data_manifest(df: pd.DataFrame, file_path: Path, fe_meta: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Create a manifest capturing data properties for reproducibility."""
    manifest: Dict[str, Any] = {
        "generated_at": pd.Timestamp.now().isoformat(),
        "source_file": str(file_path),
        "source_hash_sha256": file_sha256(file_path),
        "rows": int(df.shape[0]),
        "columns": int(df.shape[1]),
        "memory_gb": float(df.memory_usage().sum() / 1e9),
        "dtypes_summary": df.dtypes.astype(str).value_counts().to_dict(),
        "null_values_total": int(df.isna().sum().sum()),
        "columns_with_nulls": df.columns[df.isna().any()].tolist(),
        "target": "isFraud",
        "target_distribution": df["isFraud"].value_counts(dropna=False).to_dict() if "isFraud" in df.columns else {},
        "target_rate": float(df["isFraud"].mean()) if "isFraud" in df.columns else None,
        "has_transactiondt": bool("TransactionDT" in df.columns),
        "has_transactionid": bool("TransactionID" in df.columns),
        "random_state": RANDOM_STATE,
    }

    if fe_meta:
        manifest["feature_engineering_metadata"] = {
            "total_features_expected": fe_meta.get("total_features"),
            "dataset_shape_expected": fe_meta.get("dataset_shape"),
            "engineering_date": fe_meta.get("engineering_date"),
        }

    return manifest

# Load feature engineering metadata if available
fe_meta = None
if FE_METADATA_PATH.exists():
    with open(FE_METADATA_PATH, "r") as f:
        fe_meta = json.load(f)

print("Loading Feature-Engineered Data...")
if not FEATURES_PATH.exists():
    raise FileNotFoundError(f"Missing features file: {FEATURES_PATH}")

df = pd.read_parquet(FEATURES_PATH)
print(f"Loaded features: {df.shape[0]:,} rows x {df.shape[1]:,} columns")
print(f"Memory usage: {df.memory_usage().sum() / 1e9:.2f} GB")

# Basic target info
if "isFraud" in df.columns:
    print("\nTarget Distribution:")
    print(df["isFraud"].value_counts())
    print(f"Fraud rate: {df['isFraud'].mean() * 100:.2f}%")
else:
    print("\n[WARNING] 'isFraud' target column not found in features dataset")

# Data check vs metadata
if fe_meta and "dataset_shape" in fe_meta:
    expected_rows, expected_cols = fe_meta["dataset_shape"]
    ok_shape = (df.shape[0] == expected_rows) and (df.shape[1] == expected_cols)
    print(f"\nShape validation vs metadata: {'PASS' if ok_shape else 'FAIL'}")
    print(f" - (Expected {expected_rows:,} x {expected_cols:,} columns, got {df.shape[0]:,} x {df.shape[1]:,} columns)")
else:
    print("\n[WARNING] Feature engineering metadata not available for shape validation")

# Nulls
total_nulls = int(df.isnull().sum().sum())
print(f"\nTotal missing values: {total_nulls:,}")

# Key column presence
print(f"Has TransactionDT: {'Yes' if 'TransactionDT' in df.columns else 'No'}")
print(f"Has TransactionID: {'Yes' if 'TransactionID' in df.columns else 'No'}")
    

# Dtypes quick summary
dtype_counts = df.dtypes.astype(str).value_counts()
print("\nDtype Summary:")
for dtype, count in dtype_counts.items():
    print(f" - {dtype}: {count}")

# Build and save manifest
manifest = build_data_manifest(df, FEATURES_PATH, fe_meta)

# Save manifest to data catalog
MANIFEST_PATH = PROCESSED_DIR / "training_data_manifest.json"
MANIFEST_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(MANIFEST_PATH, "w") as f:
    json.dump(manifest, f, indent=4)
print(f"\nTraining data manifest saved to: {MANIFEST_PATH}")

Loading Feature-Engineered Data...
Loaded features: 590,540 rows x 432 columns
Memory usage: 2.04 GB

Target Distribution:
isFraud
0    569877
1     20663
Name: count, dtype: int64
Fraud rate: 3.50%

Shape validation vs metadata: PASS
 - (Expected 590,540 x 432 columns, got 590,540 x 432 columns)

Total missing values: 0
Has TransactionDT: Yes
Has TransactionID: Yes

Dtype Summary:
 - float64: 394
 - object: 29
 - int64: 9

Training data manifest saved to: /Users/joekariuki/Documents/Research/Projects/devbrew-payments-fraud-sanctions/data_catalog/processed/training_data_manifest.json
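
The manifest records a SHA-256 hash of the features file so a later run can detect whether the file has changed. A minimal sketch of that check follows, assuming the manifest keeps the hash under the `source_hash_sha256` key as written by `build_data_manifest`. The helpers `sha256_of` and `verify_against_manifest` are hypothetical names introduced here.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 2**20) -> str:
    """Hash a file in chunks so large files are never fully loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_against_manifest(data_path: Path, manifest_path: Path) -> bool:
    """Return True if the file's current hash equals the hash recorded in the manifest."""
    with manifest_path.open("r") as f:
        manifest = json.load(f)
    recorded = manifest.get("source_hash_sha256")
    # A manifest without a recorded hash cannot confirm anything
    return recorded is not None and recorded == sha256_of(data_path)


# Possible usage with the paths defined in this notebook:
# if not verify_against_manifest(FEATURES_PATH, MANIFEST_PATH):
#     print("[WARNING] Features file differs from the one recorded in the manifest")
```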
