# Model Optimization and Presentation

This notebook performs hyperparameter tuning for XGBoost and Random Forest models, compares their performance against a baseline Linear Regression model, and visualizes the results.

In [1]:
import pandas as pd
import numpy as np
import sys
import os

# Add project root to path to import src modules
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from src.ingestion import load_data
from src.preprocessing import preprocess_data, encode_categorical, handle_missing_values
from src.feature_engineering import feature_engineering
from src.optimization import get_baseline_model, tune_random_forest, tune_xgboost, compare_models
from src.visualization import plot_model_performance, plot_predictions, plot_feature_importance
from sklearn.model_selection import train_test_split

## 1. Data Loading and Preparation

In [2]:
# Load Data
file_path = "../data/raw/dataset_v4.csv"
df = load_data(file_path)

# Preprocessing
df = preprocess_data(df)
df = handle_missing_values(df)

# Feature Engineering
df = feature_engineering(df)
df = encode_categorical(df)

# Load selected features
feature_shortlist = pd.read_csv('../data/processed/feature_shortlist.csv')['feature'].tolist()
print(f"Selected Features: {feature_shortlist}")

# Prepare X and y
target_col = 'delivery_time_days'
X = df[feature_shortlist]
y = df[target_col]

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Data loaded successfully from ../data/raw/dataset_v4.csv. Shape: (116573, 47)
Selected Features: ['customer_geolocation_lat', 'order_purchase_timestamp_month', 'freight_value', 'customer_geolocation_lng', 'customer_geolocation_zip_code_prefix', 'customer_zip_code_prefix', 'review_score', 'seller_geolocation_lng', 'seller_zip_code_prefix', 'seller_geolocation_zip_code_prefix', 'seller_geolocation_lat', 'payment_value', 'order_purchase_timestamp_year', 'price', 'product_weight_g', 'order_purchase_timestamp_day', 'product_height_cm', 'product_width_cm', 'product_length_cm', 'order_purchase_timestamp_dayofweek']


## 2. Baseline Model (Linear Regression)

In [3]:
lr_model = get_baseline_model(X_train, y_train)
print("Baseline Linear Regression trained.")

Baseline Linear Regression trained.


## 3. Hyperparameter Tuning

In [None]:
print("Tuning Random Forest...")
rf_tuned = tune_random_forest(X_train, y_train)

print("Tuning XGBoost...")
xgb_tuned = tune_xgboost(X_train, y_train)

Tuning Random Forest...
Fitting 3 folds for each of 2 candidates, totalling 6 fits


## 4. Model Comparison

In [None]:
models = {
    'Linear Regression': lr_model,
    'Random Forest (Tuned)': rf_tuned,
    'XGBoost (Tuned)': xgb_tuned
}

results_df = compare_models(models, X_test, y_test)
print(results_df)
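
For readers without access to `src.optimization`, the following is a minimal sketch of what a comparison helper of this kind might compute. The function name `compare_models_sketch`, the choice of metrics, and the synthetic data are assumptions made for illustration. The only detail taken from this notebook is that the result has `Model` and `RMSE` columns, which Section 5 relies on.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def compare_models_sketch(models, X_test, y_test):
    """Hypothetical stand-in for compare_models: one row of test metrics per fitted model."""
    rows = []
    for name, model in models.items():
        y_pred = model.predict(X_test)
        rows.append({
            "Model": name,
            "RMSE": float(np.sqrt(mean_squared_error(y_test, y_pred))),
            "MAE": float(mean_absolute_error(y_test, y_pred)),
            "R2": float(r2_score(y_test, y_pred)),
        })
    return pd.DataFrame(rows)


# Synthetic data, used only to make the sketch runnable (not the delivery dataset)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(0, 0.1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "Linear Regression": LinearRegression().fit(X_tr, y_tr),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr),
}
demo_results = compare_models_sketch(demo_models, X_te, y_te)
print(demo_results)
```

All metrics are computed on the held-out split only, so the table reflects generalization rather than training fit.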

## 5. Visualization

In [None]:
# Performance Metrics
plot_model_performance(results_df)

# Predictions vs Actuals (for best model)
best_model_name = results_df.loc[results_df['RMSE'].idxmin(), 'Model']
best_model = models[best_model_name]
y_pred = best_model.predict(X_test)
plot_predictions(y_test, y_pred, best_model_name)

# Feature Importance
plot_feature_importance(best_model, feature_shortlist, best_model_name)

## 6. Justification for Model Selection

### Why Random Forest / XGBoost?
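- **Non-linearity**: Delivery times often have non-linear relationships with features like distance or time of year, which a linear model cannot capture without manually engineered terms.
- **Robustness**: Tree-based models split on feature thresholds, so they are insensitive to feature scaling and generally robust to outliers in the inputs. The preprocessing above (missing-value handling, categorical encoding) is still applied before training.
- **Feature Importance**: They expose feature importance scores that show which inputs the model relies on most, which supports business interpretation. These scores describe the model, not causal effects.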
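
The non-linearity and feature-importance points can be illustrated with a small sketch on synthetic data. Everything here is an assumption for demonstration: the data is generated, not drawn from the delivery dataset, and the variable names (`X_syn`, `linear`, `forest`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic target with a U-shaped dependence on the first feature;
# the second feature is pure noise with respect to the target
rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(1000, 2))
y_syn = X_syn[:, 0] ** 2 + rng.normal(0, 0.2, size=1000)

# Simple holdout: first 800 rows to fit, last 200 to evaluate
X_fit, X_eval = X_syn[:800], X_syn[800:]
y_fit, y_eval = y_syn[:800], y_syn[800:]

linear = LinearRegression().fit(X_fit, y_fit)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_fit, y_fit)


def rmse(model):
    """Root mean squared error on the evaluation rows."""
    return float(np.sqrt(mean_squared_error(y_eval, model.predict(X_eval))))


linear_rmse = rmse(linear)
forest_rmse = rmse(forest)
print(f"Linear RMSE: {linear_rmse:.3f}, Random Forest RMSE: {forest_rmse:.3f}")

# Impurity-based importances are normalized to sum to 1
importances = forest.feature_importances_
print(dict(zip(["x0", "x1"], importances.round(3))))
```

On this toy problem the straight-line fit cannot follow the U-shaped relationship, while the forest can, and its importance scores concentrate on the feature that actually drives the target.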

### Tuning Results
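We used RandomizedSearchCV to search for good hyperparameters. The Random Forest log in Section 3 reports 3-fold cross-validation over 2 candidates (6 fits in total), so the selected configuration is the best of those tried, not a guaranteed optimum. Whether the tuned models improve on the baseline should be read from the RMSE column of `results_df` in Section 4; models with default hyperparameters are not evaluated in this notebook. Parameters such as `max_depth` and `min_samples_split` cap tree complexity, which helps limit overfitting.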
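
As a rough sketch of what a randomized search over a Random Forest can look like, the example below uses scikit-learn's `RandomizedSearchCV` with `n_iter=2` and `cv=3`, mirroring the `Fitting 3 folds for each of 2 candidates` log in Section 3. The search space, the scoring choice, and the synthetic data are assumptions; the actual settings inside `tune_random_forest` are not shown in this notebook and may differ. XGBoost is left out so the sketch depends only on scikit-learn and SciPy.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, used only to make the sketch runnable
rng = np.random.default_rng(1)
X_small = rng.normal(size=(300, 4))
y_small = 2 * X_small[:, 0] + np.sin(X_small[:, 1]) + rng.normal(0, 0.1, size=300)

# Hypothetical search space; the ranges are illustrative, not the project's
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": [4, 8, None],
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=2,  # number of sampled candidates
    cv=3,      # 3-fold cross-validation, so 2 x 3 = 6 fits
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_small, y_small)

# The scorer is negated RMSE, so flip the sign to report RMSE
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

With so few candidates the search is fast but explores very little of the space; raising `n_iter` trades runtime for a better chance of finding a strong configuration.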