# Data Preprocessing & Resampling

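This notebook walks through three preprocessing steps:
1.  **Scaling**: apply RobustScaler to the `Amount` and `Time` columns.
2.  **Splitting**: hold out 20% of the rows as a test set (80/20 train/test split).
3.  **Resampling**: compare the class counts produced by SMOTE oversampling and by undersampling.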

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
import importlib

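# Make the project's src/ directory importable (assumes it sits next to the notebook's folder)
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

# Reload the modules so edits to them are picked up without restarting the kernel
import load_data
import preprocess
importlib.reload(load_data)
importlib.reload(preprocess)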

from load_data import load_data
from preprocess import preprocess_data, get_resampled_data

sns.set_style("whitegrid")

## 1. Load and Split Data

In [None]:
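df = load_data()
if df is not None:
    # preprocess_data scales the features and returns the train/test split
    X_train, X_test, y_train, y_test = preprocess_data(df)
    print("Train Shape:", X_train.shape)
    print("Test Shape:", X_test.shape)
else:
    # The cells below are guarded by the same check and will be skipped
    print("load_data() returned None; nothing to preprocess.")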
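
The internals of `preprocess_data` are not shown in this notebook, so the following is only a rough sketch of what an 80/20 split can look like. It assumes a stratified split (each class is split 80/20 separately), which is a common choice for heavily imbalanced labels but is not confirmed here. The helper `stratified_split_indices` and the synthetic labels are hypothetical and exist only for illustration.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in both.

    Hypothetical helper for illustration; not the project's preprocess_data.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle the rows of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Synthetic labels: 10,000 rows with 0.5% positives (made-up data).
y = np.zeros(10_000, dtype=int)
y[:50] = 1

train_idx, test_idx = stratified_split_indices(y)
print("train size:", len(train_idx), "positive rate:", y[train_idx].mean())
print("test size:", len(test_idx), "positive rate:", y[test_idx].mean())
```

Splitting per class keeps the rare class from being under-represented (or missing entirely) in the test set, which a plain random split cannot guarantee when positives are this scarce.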

## 2. Verify Scaling
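With its default settings, RobustScaler centers each feature on its median and divides by its interquartile range, so the scaled values should have a median near 0. Outliers are not clipped, so the overall range can still be large.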
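
To make the expected shape of the scaled values concrete, here is a minimal NumPy sketch of the same transformation (subtract the median, divide by the 25th to 75th percentile range). It is not the project's `preprocess_data`; the function `robust_scale` and the synthetic amounts are made up for illustration.

```python
import numpy as np

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

# Synthetic, right-skewed "amounts" plus one extreme outlier (made-up data).
rng = np.random.default_rng(0)
amounts = np.append(rng.exponential(scale=50.0, size=999), 25_000.0)

scaled = robust_scale(amounts)
scaled_q1, scaled_q3 = np.percentile(scaled, [25, 75])

print("median of scaled values:", float(np.median(scaled)))
print("IQR of scaled values:", float(scaled_q3 - scaled_q1))
print("max of scaled values:", float(scaled.max()))
```

The median lands at 0 and the IQR at 1, while the single outlier still sits hundreds of IQRs away from the center. That is why the histograms below can look compressed near 0 with a long tail.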

In [None]:
if df is not None:
    fig, ax = plt.subplots(1, 2, figsize=(14, 5))
    
    sns.histplot(X_train['Amount'], bins=50, ax=ax[0])
    ax[0].set_title('Scaled Amount Distribution (Train)')
    
    sns.histplot(X_train['Time'], bins=50, ax=ax[1])
    ax[1].set_title('Scaled Time Distribution (Train)')
    
    plt.show()

## 3. Resampling Visualization
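We compare the class distribution of the training set before resampling, after SMOTE, and after undersampling. Only the training split is resampled; the test set keeps its original class balance.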
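
As a rough illustration of what the two strategies do to the class counts, the sketch below implements random undersampling and a simplified SMOTE-style oversampler in plain NumPy. It is not the code behind `get_resampled_data`, which is not shown here. The function names and the synthetic data are hypothetical, and the sketch assumes that label 1 is the minority class and that there are more than `k` minority rows.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Drop majority rows at random until both classes have the same count."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([kept, minority_idx])
    return X[idx], y[idx]

def smote_like_oversample(X, y, rng, k=5):
    """Add synthetic minority rows until both classes have the same count.

    Each synthetic row lies on the segment between a minority row and one
    of its k nearest minority neighbours (the core idea of SMOTE).
    """
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    # Pairwise distances among minority rows; a row is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a base row, one of its neighbours, and a random point between them.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out

# Synthetic data: 1,000 rows, 2 features, 2% minority (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.zeros(1000, dtype=int)
y[:20] = 1

X_smote, y_smote = smote_like_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)

print("original counts:", np.bincount(y).tolist())
print("SMOTE-like counts:", np.bincount(y_smote).tolist())
print("undersampled counts:", np.bincount(y_under).tolist())
```

Both approaches end with balanced classes, but at very different sizes: oversampling grows the training set to twice the majority count, while undersampling shrinks it to twice the minority count and discards most of the majority rows.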

In [None]:
if df is not None:
    # Apply SMOTE
    X_smote, y_smote = get_resampled_data(X_train, y_train, method='SMOTE')
    
    # Apply Undersampling
    X_under, y_under = get_resampled_data(X_train, y_train, method='Undersampling')
    
    print("Original Train Count:", y_train.value_counts(sort=False).to_dict())
    print("SMOTE Count:", y_smote.value_counts(sort=False).to_dict())
    print("Undersampling Count:", y_under.value_counts(sort=False).to_dict())

    # Plot
    fig, ax = plt.subplots(1, 3, figsize=(18, 5))
    
    sns.countplot(x=y_train, ax=ax[0])
    ax[0].set_title(f'Original (N={len(y_train)})')
    
    sns.countplot(x=y_smote, ax=ax[1])
    ax[1].set_title(f'SMOTE (N={len(y_smote)})')
    
    sns.countplot(x=y_under, ax=ax[2])
    ax[2].set_title(f'Undersampling (N={len(y_under)})')
    
    plt.show()