# Modeling Plan


- Three training regimes, each applied to multiple models.
- Each model is trained and tested under these regimes:
1. Train on original data and test on original data (planned as an 80/20 split; the Phase 1 code uses 70/15/15 train/val/test)
2. Train on original + synthetic and test on original
3. Train on synthetic and test on original
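
The three train/test setups listed above differ only in the training frame, since the test set is always original data. A minimal sketch of that idea, assuming the data sits in pandas DataFrames; `build_training_regimes` and the regime names are hypothetical and not part of this notebook:

```python
import pandas as pd


def build_training_regimes(og_train: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Return one training frame per regime (hypothetical helper).

    Every regime is evaluated on the same held-out original data,
    so only the training frame changes.
    """
    return {
        'original_only': og_train,
        'original_plus_synthetic': pd.concat([og_train, synthetic], ignore_index=True),
        'synthetic_only': synthetic,
    }


# toy frames standing in for the real datasets (values are made up)
og_train = pd.DataFrame({'duration': [10.0, 20.0], 'calories': [50.0, 110.0]})
synthetic = pd.DataFrame({'duration': [5.0, 15.0, 25.0], 'calories': [20.0, 80.0, 140.0]})

regimes = build_training_regimes(og_train, synthetic)
for name, frame in regimes.items():
    print(f'{name}: {len(frame)} rows')
```

Keeping the regimes in one dictionary means the same fit/evaluate loop can run over all three without copy-pasting cells.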



## Model Selection
- Linear Regression
- Random Forests
- XGBoost
- Hill Climbing? Never done before

## Feature Selection
- PCA
- Regularization (via Lasso?)
- Test the collinearity we discovered in the previous notebook; this needs quick iteration, so start with simpler models?
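
One quick way to check collinearity before fitting anything is to list feature pairs with a high absolute Pearson correlation. The sketch below is an assumption about how that check could look, not what the previous notebook did; `high_correlation_pairs`, the 0.9 threshold, and the toy data are all hypothetical:

```python
import numpy as np


def high_correlation_pairs(X, names, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# toy data: heart_rate is built from duration on purpose, age is independent
rng = np.random.default_rng(0)
duration = rng.uniform(1, 30, size=500)
heart_rate = 80 + duration + rng.normal(0, 1.0, size=500)
age = rng.uniform(20, 80, size=500)

X = np.column_stack([duration, heart_rate, age])
pairs = high_correlation_pairs(X, ['duration', 'heart_rate', 'age'])
print(pairs)
```

Pairs flagged this way are candidates for dropping one feature, or for letting Lasso or PCA handle the redundancy.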






### Plan
- to begin, we will use only the original set since it is the smallest. once the approach is refined there, we will wrap it in functions so we can run it on each of the split datasets without constant code repetition
- phase 1 includes:
    - split data
    - ensure no data leakage (nothing from the val/test sets, and nothing derived from the target, should inform training or preprocessing)
    - finalize feature selection?
    - scale features? 
    - define eval metrics (Kaggle uses RMSLE)
- phase 2 includes:
    - base model (maybe lr and rf first to try to tune params and then test on xgb/hill climbing after)
    - cv once we get to rf, xgb, hill climbing
    - evaluate on val set
    - check for overfitting
- phase 3 includes:
    - visuals and reporting
    - feature importance plots
    - analyze residuals?
    - documentation
    - create final submission (train on synthetic train, predict on synthetic test) for a late Kaggle submission, out of curiosity
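
Phase 1 names RMSLE as the evaluation metric. It is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + target)`. A minimal NumPy sketch; the function name `rmsle` and the choice to raise on negative inputs are assumptions made here:

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; assumes non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError('RMSLE is undefined for negative values')
    # log1p(x) computes log(1 + x) accurately for small x
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# identical predictions and targets give an error of zero
perfect = rmsle([150.0, 34.0, 29.0], [150.0, 34.0, 29.0])
print(perfect)
```

Models that can output negative values, such as plain linear regression, would need their predictions clipped at zero before scoring with this metric.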


### Personal Preferences/Stylizations
- when training, i like to see progress bars/output updates every x time interval to ensure that everything is running correctly (quick dopamine when programming == good)
- historically, when trying n different models i would just create a standard notebook that repeated the same code n times, just replacing the model definition to iterate -> don't want to do that anymore
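
Both preferences can be met with one loop over a dictionary of model factories that prints progress as it goes. This is a sketch under assumptions: `run_experiments`, `MeanBaseline`, and `MedianBaseline` are hypothetical stand-ins, and any object with `fit` and `predict` methods could take their place:

```python
import time

import numpy as np


class MeanBaseline:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.value_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.value_)


class MedianBaseline(MeanBaseline):
    """Hypothetical stand-in model: always predicts the training median."""

    def fit(self, X, y):
        self.value_ = float(np.median(y))
        return self


def run_experiments(models, X_train, y_train, X_val, y_val, metric):
    """Fit each model once, score it on the val set, and print progress."""
    results = {}
    for i, (name, make_model) in enumerate(models.items(), start=1):
        start = time.perf_counter()
        print(f'[{i}/{len(models)}] fitting {name}...', flush=True)
        model = make_model()
        model.fit(X_train, y_train)
        results[name] = metric(y_val, model.predict(X_val))
        elapsed = time.perf_counter() - start
        print(f'    done in {elapsed:.2f}s, score={results[name]:.4f}')
    return results


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# toy data (made up); features are ignored by the baseline models
X_train, y_train = np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 10.0])
X_val, y_val = np.zeros((2, 1)), np.array([2.0, 3.0])

models = {'mean_baseline': MeanBaseline, 'median_baseline': MedianBaseline}
results = run_experiments(models, X_train, y_train, X_val, y_val, rmse)
```

Adding a new model then means adding one dictionary entry rather than duplicating a notebook cell.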


# Phase 1 - Data Prep and Library Imports

In [3]:
import os
import sys
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns 
sys.path.append('../src')

import importlib
import data_processing
importlib.reload(data_processing)
from data_processing import *

# 1.0 --> Load data
synthetic_path = '../data/raw/synthetic_train.csv'  
original_path = '../data/raw/og_calories.csv'       
combined_df = combine_datasets(synthetic_path, original_path)
combined_df.head()

Loaded synthetic_train dataset: (750000, 9)
Loaded og_all dataset: (15000, 9)
✅ Combined datasets: 765000 total samples
   • Synthetic: 750000 samples
   • Original: 15000 samples


| | id | sex | age | height | weight | duration | heart_rate | body_temp | calories | tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | male | 36 | 189.0 | 82.0 | 26.0 | 101.0 | 41.0 | 150.0 | syn_train |
| 1 | 1.0 | female | 64 | 163.0 | 60.0 | 8.0 | 85.0 | 39.7 | 34.0 | syn_train |
| 2 | 2.0 | female | 51 | 161.0 | 64.0 | 7.0 | 84.0 | 39.8 | 29.0 | syn_train |
| 3 | 3.0 | male | 20 | 192.0 | 90.0 | 25.0 | 105.0 | 40.7 | 140.0 | syn_train |
| 4 | 4.0 | female | 38 | 166.0 | 61.0 | 25.0 | 102.0 | 40.6 | 146.0 | syn_train |


In [4]:
# 1.1 --> Add calorie_burn_rate and remove outliers (pre-split)


def remove_outliers_by_dataset(combined_df):
    '''
        Remove outliers separately for each dataset type

        Args:
            combined_df (DataFrame): dataframe with 'tag' column indicating dataset source

        Returns:
            tuple of (dataframe with outliers removed,
                      row counts per tag before removal,
                      row counts per tag after removal)
    '''
    combined_df = calculate_calorie_burn_rate(combined_df)
    
    print('Step 1.1 -> Removing outliers (Pre-Split)')
    print('=' * 50)

    # track removal statistics
    original_counts = {}
    clean_counts = {}
    cleaned_parts = []

    # process each dataset type separately
    for tag in combined_df['tag'].unique():
        print(f'\nProcessing {tag} dataset...')

        # get subset
        subset = combined_df[combined_df['tag'] == tag].copy()
        original_counts[tag] = len(subset)

        # apply appropriate outlier removal
        if tag == 'syn_train':
            result = remove_calorie_rate_outliers_synthetic(subset)
        elif tag == 'og_all':
            result = remove_calorie_rate_outliers_original(subset, min_rate=2.0)
        else:
            print(f'Unknown tag: {tag}, skipping outlier removal')
            result = subset

        # the helpers return (df, summary); fall back to the value itself if they ever return a bare df
        subset_clean = result[0] if isinstance(result, tuple) else result
        clean_counts[tag] = len(subset_clean)
        cleaned_parts.append(subset_clean)

        # progress update
        removed = original_counts[tag] - clean_counts[tag]
        removal_pct = (removed / original_counts[tag]) * 100
        print(f'{tag}: {original_counts[tag]} -> {clean_counts[tag]}'
                f'({removed} removed, {removal_pct:.2f}%)')

    # reassemble once instead of rebuilding the dataframe on every iteration
    clean_df = pd.concat(cleaned_parts, ignore_index=True)

    return clean_df, original_counts, clean_counts

clean_df, original_counts, clean_counts = remove_outliers_by_dataset(combined_df)
print(clean_df.shape)

Step 1.1 -> Removing outliers (Pre-Split)

Processing syn_train dataset...
Removed 6587 outliers (0.88%) from synthetic dataset
syn_train: 750000 -> 743413(6587 removed, 0.88%)

Processing og_all dataset...
Removed 127 low-end outliers (0.85%) from original dataset
og_all: 15000 -> 14873(127 removed, 0.85%)
(758286, 11)
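
The helpers `calculate_calorie_burn_rate` and `remove_calorie_rate_outliers_original` live in `data_processing` and are not shown in this notebook. The sketch below is only a guess at the general shape of such a step, assuming the rate is `calories / duration` and that low-end outliers are rows below `min_rate`. The function names `add_burn_rate` and `drop_low_rate_rows` and the column name `calorie_burn_rate` are hypothetical:

```python
import pandas as pd


def add_burn_rate(df):
    """Assumed definition: calories divided by exercise duration."""
    out = df.copy()
    out['calorie_burn_rate'] = out['calories'] / out['duration']
    return out


def drop_low_rate_rows(df, min_rate=2.0):
    """Drop rows whose burn rate falls below min_rate; return (df, summary)."""
    keep = df['calorie_burn_rate'] >= min_rate
    summary = {'removed': int((~keep).sum()), 'kept': int(keep.sum())}
    return df[keep].reset_index(drop=True), summary


# toy rows (the last one is made up to fall below the threshold)
toy = pd.DataFrame({
    'duration': [26.0, 8.0, 10.0],
    'calories': [150.0, 34.0, 5.0],
})
clean, summary = drop_low_rate_rows(add_burn_rate(toy), min_rate=2.0)
print(summary)
```

Returning a `(df, summary)` tuple mirrors the return shape noted in the cell above, which is why the notebook unpacks `result[0]`.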


In [5]:
# 1.2 --> Split Dataset

def split_datasets(clean_df, test_size=0.15, val_size=0.15, random_state=42):
    '''
        Split original dataset into 70/15/15 and keep synthetic data separate due to experiment design

        Args:
            clean_df (DataFrame): the combined dataframe with outliers removed
            test_size (float): proportion of original data used for test set
            val_size (float): proportion of original data used for val set
            random_state (int): seed for reproducibility

        Returns:
            dictionary with split datasets
    '''

    print(f'\nStep 1.2 --> Splitting Dataset')
    print("=" * 50)

    original_data = clean_df[clean_df['tag'] == 'og_all'].copy()
    synthetic_data = clean_df[clean_df['tag'] == 'syn_train'].copy()

    # remove tag col for modeling
    original_features = original_data.drop(columns=['tag'])
    synthetic_features = synthetic_data.drop(columns=['tag'])

    # first split: hold out the test set (15%), stratified on sex to keep it balanced
    train_and_val, test = train_test_split(
        original_features,
        test_size=test_size,
        random_state=random_state,
        stratify=original_features['sex'],
    )

    # second split: carve the val set out of the remaining data
    # val_size is a share of the full original set, so rescale it: 0.15 / 0.85 ≈ 0.176
    val_proportion = val_size / (1 - test_size)
    train, val = train_test_split(
        train_and_val,
        test_size=val_proportion,
        random_state=random_state,
        stratify=train_and_val['sex'],
    )

    # create dataset dictionary
    datasets = {
        'og_train': train,
        'og_val': val,
        'og_test': test,
        'synthetic_full': synthetic_features
    }

    return datasets



# output and validate splits
def validate_splits(datasets):
    print(f'\nSplit Validation')
    print('=' * 50)

    total_original = sum(len(df) for name, df in datasets.items() if name.startswith('og_'))

    for name, df in datasets.items():
        percentage = (len(df) / total_original) * 100 if name.startswith('og_') else None

        if percentage is not None:
            print(f'- {name}: {len(df)} samples ({percentage:.2f}%)')
        else:
            print(f'-{name}: {len(df)} samples')

    # verify balanced gender distribution
    print(f'\nGender Distribution')
    print('=' * 50)
    for name, df in datasets.items():
        if 'sex' in df.columns:
            gender_dist = df['sex'].value_counts(normalize=True)
            male_pct = gender_dist.get('male', 0) * 100
            female_pct = gender_dist.get('female', 0) * 100
            print(f'-{name}: {male_pct:.2f}% male, {female_pct:.2f}% female')


# save the split datasets to disk
def save_split_datasets(datasets, output_dir='../data/processed/'):
    print(f'\nSaving Split Datasets')
    print('=' * 50)

    os.makedirs(output_dir, exist_ok=True)

    for name, df in datasets.items():
        file_path = os.path.join(output_dir, f'{name}.csv')
        df.to_csv(file_path, index=False)
        print(f'Saved {name}: {file_path}')

datasets = split_datasets(clean_df, test_size=.15, val_size=.15, random_state=42)
validate_splits(datasets)
save_split_datasets(datasets)


Step 1.2 --> Splitting Dataset

Split Validation
- og_train: 10411 samples (70.00%)
- og_val: 2231 samples (15.00%)
- og_test: 2231 samples (15.00%)
-synthetic_full: 743413 samples

Gender Distribution
-og_train: 49.22% male, 50.78% female
-og_val: 49.22% male, 50.78% female
-og_test: 49.22% male, 50.78% female
-synthetic_full: 49.49% male, 50.51% female

Saving Split Datasets
Saved og_train: ../data/processed/og_train.csv
Saved og_val: ../data/processed/og_val.csv
Saved og_test: ../data/processed/og_test.csv
Saved synthetic_full: ../data/processed/synthetic_full.csv
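
With the splits in place, the "no data leakage" item from the plan mostly comes down to learning every preprocessing statistic from `og_train` only and reusing it on `og_val` and `og_test`. A minimal NumPy sketch of that rule for feature scaling; `TrainOnlyScaler` is a hypothetical class written for illustration, and scikit-learn's `StandardScaler` follows the same fit-on-train, transform-everything pattern:

```python
import numpy as np


class TrainOnlyScaler:
    """Standardize features using statistics learned from the training split only."""

    def fit(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        self.mean_ = X_train.mean(axis=0)
        std = X_train.std(axis=0)
        # constant columns have std 0; use 1.0 so they are left centered but unscaled
        self.scale_ = np.where(std == 0, 1.0, std)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_


# toy height/weight rows standing in for og_train and og_val
X_train = np.array([[160.0, 60.0], [170.0, 70.0], [180.0, 80.0]])
X_val = np.array([[190.0, 90.0]])

scaler = TrainOnlyScaler().fit(X_train)   # statistics come from the training split only
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)    # val/test reuse the training statistics
```

A related caution: if `calorie_burn_rate` is computed from `calories` (its definition is not shown here), it would need to be dropped from the feature matrix before modeling, since it would carry the target into training.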


In [6]:
# Step 1.3 --> Create Pipeline function to handle Steps 1.1 and 1.2 

def run_data_pipeline(combined_df, save_outputs=False):
    '''
        Run the data prep pipeline

        Args:
            combined_df (DataFrame): raw combined dataframe 
            save_outputs (bool): whether to save the split datasets to disk
        
        Returns:
            dictionary with clean split datasets
    '''

    print(f'\nStarting Data Preparation Pipeline')
    print('=' * 50)

    # 1.1
    clean_df, original_counts, clean_counts = remove_outliers_by_dataset(combined_df)
    

    # 1.2
    datasets = split_datasets(clean_df)
    validate_splits(datasets)

    if save_outputs:
        save_split_datasets(datasets)

    print(f'Data Prep Complete')

    return datasets