# Model Training

- **Purpose:** Set up the environment, then load and validate the feature-engineered dataset for fraud model training  
- **Author:** Devbrew LLC  
- **Last Updated:** October 22, 2025  
- **Status:** In progress  
- **License:** Apache 2.0 (Code) | Non-commercial (Data)

---

## Dataset License Notice

This notebook uses the **IEEE-CIS Fraud Detection dataset** from Kaggle.

**Dataset License:** Non-commercial research use only  
- You must download the dataset yourself from [Kaggle IEEE-CIS Competition](https://www.kaggle.com/c/ieee-fraud-detection)  
- You must accept the competition rules before downloading  
- The data cannot be used for commercial purposes  
- The raw dataset cannot be redistributed

**Setup Instructions:** See [`../data_catalog/README.md`](../data_catalog/README.md) for download instructions.

**Code License:** This notebook's code is licensed under Apache 2.0 (open source).

---

## Notebook Configuration

### Environment Setup

We import the required libraries, apply standard display and plotting settings, and fix the random seed (`RANDOM_STATE = 42`) so that NumPy-based random operations give the same results on every run. Note that `warnings.filterwarnings("ignore")` suppresses all warnings for the session, including deprecation notices.

In [2]:
import warnings
from pathlib import Path
import json
import hashlib
from typing import Dict, Any, Optional

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", '{:.2f}'.format)

# Plotting configuration
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Environment configured successfully")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")

Environment configured successfully
pandas: 2.3.3
numpy: 2.3.3
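
`np.random.seed` seeds only NumPy's legacy global random state. The following is a minimal sketch, assuming later cells also use Python's built-in `random` module or would benefit from an isolated generator. The helper name `make_rng` is hypothetical and not part of this notebook.

```python
import random

import numpy as np

RANDOM_STATE = 42  # same seed value as the configuration cell above


def make_rng(seed: int = RANDOM_STATE) -> np.random.Generator:
    """Return an independent NumPy generator seeded with `seed` (hypothetical helper)."""
    return np.random.default_rng(seed)


# Seed Python's built-in RNG as well; np.random.seed does not affect it.
random.seed(RANDOM_STATE)

# Two generators built from the same seed produce identical draws.
rng_a = make_rng()
rng_b = make_rng()
sample_a = rng_a.integers(0, 100, size=5)
sample_b = rng_b.integers(0, 100, size=5)
print(np.array_equal(sample_a, sample_b))
```

Passing a generator such as `rng_a` explicitly to sampling code keeps its random stream separate from any other code that touches the global state.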


### Path Configuration

We define the project directory structure and check that the processed outputs of feature engineering exist before training starts. `PROJECT_ROOT` is derived from `Path.cwd().parent`, so the notebook is expected to run from a directory directly under the project root, such as `notebooks/`.

In [4]:
# Project paths
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data_catalog"
IEEE_CIS_DIR = DATA_DIR / "ieee-fraud"
PROCESSED_DIR = DATA_DIR / "processed"
NOTEBOOKS_DIR = PROJECT_ROOT / "notebooks"

# Ensure processed directory exists
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Expected artifacts
FEATURES_PATH = PROCESSED_DIR / "train_features.parquet"
FE_METADATA_PATH = PROCESSED_DIR / "feature_engineering_metadata.json"

def validate_required_artifacts() -> Dict[str, bool]:
    """Report whether each required artifact exists and return the per-file status."""
    path_status = {
        'train_features.parquet': FEATURES_PATH.exists(),
        'feature_engineering_metadata.json': FE_METADATA_PATH.exists()
    }
    print("Artifact Availability Check:")
    for name, exists in path_status.items():
        status = "Found" if exists else "Missing"
        print(f" - {name}: {status}")

    if not all(path_status.values()):
        print("\n[WARNING] Some artifacts are missing; ensure feature engineering completed successfully")
    else:
        print("\nAll required artifacts are available")

    # Return the status so the assignment below holds a usable value instead of None
    return path_status

artifact_status = validate_required_artifacts()

Artifact Availability Check:
 - train_features.parquet: Found
 - feature_engineering_metadata.json: Found

All required artifacts are available
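
The check above reports missing files but lets execution continue. The following is a minimal sketch of a fail-fast variant, assuming training should stop when an input is absent. `require_artifacts` is a hypothetical helper, not part of this notebook, and the demo runs against a temporary directory rather than the real `data_catalog/processed` folder.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def require_artifacts(paths: Iterable[Path]) -> List[Path]:
    """Return the paths if all exist; otherwise raise FileNotFoundError."""
    paths = list(paths)
    missing = [p for p in paths if not p.exists()]
    if missing:
        names = ", ".join(p.name for p in missing)
        raise FileNotFoundError(f"Missing required artifacts: {names}")
    return paths


# Demo in a throwaway directory: one file present, one absent.
with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp)
    features = processed / "train_features.parquet"
    metadata = processed / "feature_engineering_metadata.json"
    features.touch()  # empty placeholder, not a real parquet file

    try:
        require_artifacts([features, metadata])
        error_message = ""
    except FileNotFoundError as exc:
        error_message = str(exc)

    # Once both files exist, the check passes and returns the paths.
    metadata.touch()
    checked = require_artifacts([features, metadata])
    n_checked = len(checked)

print(error_message)
```

In this notebook the equivalent call would be `require_artifacts([FEATURES_PATH, FE_METADATA_PATH])`, placed right after the path definitions.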
