# Task 2: Model Building and Training

This notebook builds and evaluates two fraud-detection models, Logistic Regression and Random Forest. Preprocessing covers one-hot encoding of categorical variables, standardization of numerical features, and SMOTE oversampling to address class imbalance.

## Objectives
- Preprocess data for modeling (SMOTE, scaling, encoding).
- Train and evaluate Logistic Regression and Random Forest models.
- Report performance metrics (accuracy, precision, recall, F1-score, ROC-AUC).

## Datasets
- `processed_ecommerce.csv`: Cleaned e-commerce data from Task 1.
- `processed_creditcard.csv`: Cleaned credit card data from Task 1.

## Setup
- Run in the same virtual environment with dependencies from `requirements.txt`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE
import sys
import os
sys.path.append('..')
from src.data_utils import load_data

%matplotlib inline
sns.set_style('whitegrid')

# Load processed datasets (the cells below model the e-commerce data only)
ecommerce_df = load_data('data/processed/processed_ecommerce.csv')
creditcard_df = load_data('data/processed/processed_creditcard.csv')

## Data Preprocessing

- Encode categorical variables with one-hot encoding.
- Standardize numerical features.
- Apply SMOTE to the training set only, so the test set keeps its original class distribution.
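
SMOTE balances classes by creating synthetic minority samples that lie on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea with NumPy; it is not imblearn's implementation, and the helper name `smote_like_samples` and the toy array `X_fraud` are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per row

    # Pick a random base sample and one of its k neighbours for each new row
    base_idx = rng.integers(0, len(X_minority), size=n_new)
    nn_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_minority[base_idx] + lam * (X_minority[nn_idx] - X_minority[base_idx])

# Hypothetical toy minority class with two features
X_fraud = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])
X_synthetic = smote_like_samples(X_fraud, n_new=6)
print(X_synthetic.shape)  # (6, 2)
```

Because every synthetic row is a blend of two real rows, applying this kind of interpolation to one-hot columns can produce fractional values, which is worth keeping in mind when categorical features are encoded before resampling.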

In [None]:
# Select features and target
features = ['purchase_value', 'time_since_signup', 'hour_of_day', 'day_of_week', 'source', 'browser', 'country']
target = 'class'

# Separate features and target for e-commerce data
X = ecommerce_df[features]
y = ecommerce_df[target]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Identify categorical and numerical columns
cat_cols = ['source', 'browser', 'country']
num_cols = ['purchase_value', 'time_since_signup', 'hour_of_day', 'day_of_week']

# Encode categorical variables
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_cat = encoder.fit_transform(X_train[cat_cols])
X_test_cat = encoder.transform(X_test[cat_cols])

# Scale numerical features
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[num_cols])
X_test_num = scaler.transform(X_test[num_cols])

# Combine numerical and categorical features
X_train_processed = np.hstack((X_train_num, X_train_cat))
X_test_processed = np.hstack((X_test_num, X_test_cat))

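# Apply SMOTE to the training set only to handle class imbalance.
# SMOTE interpolates between samples, so one-hot columns can take fractional values;
# imblearn's SMOTENC is an alternative for mixed numerical/categorical data.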
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_processed, y_train)

print('Original train set shape:', X_train_processed.shape)
print('Resampled train set shape:', X_train_res.shape)
print('Class distribution after SMOTE:', np.bincount(y_train_res))
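
The manual encode, scale, and `np.hstack` steps above could also be expressed with scikit-learn's `ColumnTransformer`, which fits both transformers in one call and keeps the column order consistent between train and test. This is a minimal sketch on a hypothetical toy frame (`toy_train`, `toy_test`) that reuses two of the notebook's column names; it is not a drop-in replacement for the cell above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames reusing two of the notebook's column names
toy_train = pd.DataFrame({
    'purchase_value': [10.0, 20.0, 30.0, 40.0],
    'browser': ['Chrome', 'Safari', 'Chrome', 'Firefox'],
})
toy_test = pd.DataFrame({
    'purchase_value': [25.0],
    'browser': ['Opera'],  # category not seen during fit
})

preprocessor = ColumnTransformer(
    [
        ('num', StandardScaler(), ['purchase_value']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['browser']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted statistics on test rows
train_matrix = preprocessor.fit_transform(toy_train)
test_matrix = preprocessor.transform(toy_test)

print(train_matrix.shape)
print(test_matrix)
```

The unseen browser is encoded as all zeros because of `handle_unknown='ignore'`, the same setting used in the notebook's encoder.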

## Model Training and Evaluation

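- Train Logistic Regression and Random Forest on the resampled training set.
- Evaluate both models on the untouched test set using accuracy, precision, recall, F1-score, and ROC-AUC.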

In [None]:
# Initialize models
lr_model = LogisticRegression(random_state=42, max_iter=1000)
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)

# Train models
lr_model.fit(X_train_res, y_train_res)
rf_model.fit(X_train_res, y_train_res)

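# Predict hard labels (for threshold-based metrics) and fraud probabilities (for ROC-AUC).
# Column 1 of predict_proba is the positive class, assuming labels are 0 (legitimate) and 1 (fraud).
y_pred_lr = lr_model.predict(X_test_processed)
y_pred_rf = rf_model.predict(X_test_processed)
y_proba_lr = lr_model.predict_proba(X_test_processed)[:, 1]
y_proba_rf = rf_model.predict_proba(X_test_processed)[:, 1]

# Evaluate metrics; ROC-AUC is computed from probabilities rather than hard labels
metrics_lr = {
    'accuracy': accuracy_score(y_test, y_pred_lr),
    'precision': precision_score(y_test, y_pred_lr),
    'recall': recall_score(y_test, y_pred_lr),
    'f1': f1_score(y_test, y_pred_lr),
    'roc_auc': roc_auc_score(y_test, y_proba_lr)
}
metrics_rf = {
    'accuracy': accuracy_score(y_test, y_pred_rf),
    'precision': precision_score(y_test, y_pred_rf),
    'recall': recall_score(y_test, y_pred_rf),
    'f1': f1_score(y_test, y_pred_rf),
    'roc_auc': roc_auc_score(y_test, y_proba_rf)
}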

print('Logistic Regression Metrics:', metrics_lr)
print('Random Forest Metrics:', metrics_rf)
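
ROC-AUC is a ranking metric, so it is normally computed from predicted probabilities or decision scores. Hard class labels collapse the ROC curve to a single operating point and usually give a different, less informative value. The small illustration below uses made-up labels and scores (`y_true`, `y_score`) to show the gap.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted fraud probabilities
y_true = [0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.4, 0.9]
# Hard labels at the default 0.5 threshold
y_label = [int(s >= 0.5) for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)
print(f'AUC from probabilities: {auc_from_scores:.3f}')  # 0.833
print(f'AUC from hard labels:   {auc_from_labels:.3f}')  # 0.583
```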

## Next Steps

- Optimize hyperparameters using GridSearchCV.
- Generate ROC curves for visualization.
- Prepare Interim-2 report with detailed analysis.
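
As a starting point for the first two items, the sketch below runs `GridSearchCV` on a Random Forest and then computes ROC curve points from the best estimator. It uses synthetic data from `make_classification` in place of the processed fraud features, and the parameter grid is an illustrative assumption, not a tuned recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the processed fraud features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Illustrative grid only; the values are assumptions
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='roc_auc',  # rank-based metric, better suited to imbalance than accuracy
    cv=3,
)
search.fit(X_tr, y_tr)
print('Best parameters:', search.best_params_)

# ROC curve points from predicted probabilities of the positive class
proba = search.best_estimator_.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
# In the notebook, plt.plot(fpr, tpr) would draw the curve
```

If SMOTE is combined with cross-validation, one option is to wrap the resampler and the model in `imblearn.pipeline.Pipeline` so that oversampling is applied only to the training folds.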