The goal of this notebook is to set up a global model training framework, in which a single model is trained on all PC types.

In [None]:
import os

from dotenv import load_dotenv
import mlflow

from src.modeling.multivariate_model_training import train_global_model

load_dotenv()
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI"))

In [None]:
# Suppress annoying mlflow warning about dependencies
# We manage dependencies with uv...

import warnings

warnings.filterwarnings("ignore", message="Failed to resolve installed pip version")

# 1. Load and Prepare Data 

**See `src/modeling/multivariate_data_prep.py` -> `load_and_prepare_data` for details.**

In [None]:
# from src.modeling.multivariate_data_prep import load_and_prepare_data

# df, target_col, feature_cols = (
# load_and_prepare_data(group_by_pc_types=False, horizon=6)
# )

`load_and_prepare_data` loads the data and separates the features from the target variable in the dataframe. The columns fall into several groups:
- Target variable: `pc_price`.
- Meta features: `region`, `pc_type` and `date`. (Used for grouping and weighting but not as model features.)
- Numerical features: `pc_price_lag_*`, `pc_price_rolling_mean_*`, `regional_avg_price`, `regional_price_volaility_`, `price_deviation_from_regional_avg`, exogenous features such as `bpa_capacity_loss_kt` and their lags (fewer lags than for the target), and time features such as `month_sin`, `month_cos`, and the raw `month` or `year`.
- Binary categorical features (kept as is, since tree-based models handle $0$ and $1$ directly): `is_recycled`, `is_glass_filled`, `is_flame_retardant`.
- Categorical features with multiple categories (Label encoded): `region`, `pc_type`.
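
As a rough illustration of how the lag and cyclical time features listed above could be derived, here is a minimal sketch in plain Python. The helpers `lag` and `month_cyclical` are hypothetical names chosen for this example; they are not taken from `load_and_prepare_data`, whose actual implementation may differ.

```python
import math
from typing import List, Optional, Tuple


def month_cyclical(month: int) -> Tuple[float, float]:
    """Map a month in 1..12 to a (sin, cos) pair on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def lag(values: List[float], k: int) -> List[Optional[float]]:
    """Shift a series by k steps; positions without enough history are None."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


# Hypothetical monthly prices for a single (region, pc_type) group.
prices = [100.0, 102.0, 101.0]
print(lag(prices, 1))  # [None, 100.0, 102.0]
```

The sine/cosine pair places December next to January, which a raw `month` integer does not.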

# 2. Split Data

**See `src/modeling/multivariate_data_prep.py` -> `adaptive_train_validation_test_split` and `adaptive_train_test_split` for details.**

In [None]:
# from src.modeling.multivariate_data_prep import adaptive_train_test_split

# train_df, test_df = adaptive_train_test_split(
#     df,
#     target_test_ratio=0.2,
#     min_train_samples=55,
#     min_test_samples=20,
#     group_by_pc_types=False,
# )

For data not grouped by PC types, a standard train-validation-test split is possible. Because the data is imbalanced across PC types, however, both the train set and the test set must still contain enough samples from each PC type.

For data grouped by PC types, the same imbalance leads us to perform only a train-test split, with no validation set. Rare PC types (notably `gf20`) make the minimum-sample requirement hard to meet, so we use an adaptive train-test split that ensures each PC type is represented in both sets with a minimum number of samples.
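
The idea of an adaptive split can be sketched as follows. This is not the code of `adaptive_train_test_split`; the function name `adaptive_split_sketch`, the row layout (dicts with `pc_type` and `date` keys), and in particular the fallback for groups that are too small (they go entirely to the train set) are assumptions made for illustration.

```python
from collections import defaultdict


def adaptive_split_sketch(rows, target_test_ratio=0.2,
                          min_train_samples=55, min_test_samples=20):
    """Chronological train/test split applied separately to each PC type."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["pc_type"]].append(row)

    train, test = [], []
    for group in by_type.values():
        group = sorted(group, key=lambda r: r["date"])
        n = len(group)
        if n < min_train_samples + min_test_samples:
            # Assumption: a group too small for both minimums is used for training only.
            train.extend(group)
            continue
        # Aim for the target ratio, but respect both minimum sizes.
        n_test = max(min_test_samples, round(target_test_ratio * n))
        n_test = min(n_test, n - min_train_samples)
        cut = n - n_test
        train.extend(group[:cut])   # oldest observations
        test.extend(group[cut:])    # most recent observations
    return train, test
```

Splitting chronologically within each group keeps the test period strictly after the training period for that PC type.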

# 3. Prepare Features and Target

**See `src/modeling/multivariate_data_prep.py` -> `prepare_training_data` for details.**

We prepare the training, validation, and test sets by separating the features from the target variable and shifting both according to the specified history and forecast horizons.
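
To make the shifting concrete, the sketch below builds supervised pairs from a single price series: each feature window holds `history` consecutive values, and the target is the value `horizon` steps after the end of that window. The function `make_supervised` and its parameter names are hypothetical and only illustrate the idea; `prepare_training_data` may organize this differently, for example by working on one series per region and PC type.

```python
def make_supervised(series, history, horizon):
    """Return (X, y) where y[i] lies `horizon` steps after the last value in X[i]."""
    X, y = [], []
    for end in range(history, len(series) - horizon + 1):
        X.append(series[end - history:end])      # the `history` most recent values
        y.append(series[end - 1 + horizon])      # value `horizon` steps ahead
    return X, y


# Hypothetical monthly prices.
X, y = make_supervised([10, 11, 12, 13, 14, 15], history=2, horizon=3)
print(X)  # [[10, 11], [11, 12]]
print(y)  # [14, 15]
```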

# 4. Compute Sample Weights

**See `src/modeling/multivariate_data_prep.py` -> `compute_sample_weights` for details.**

Because the data is imbalanced across PC types, we compute sample weights that give more importance to under-represented PC types during training. Without them, the model would be biased towards the common PC types: the global performance metric might look good while performance on the rare PC types remains poor.

We can also weight samples by region. We are only concerned with performance in Europe, so samples from this region can receive more weight. We keep PC types from all regions in the training set to have more data, but prioritize performance on European PC types.
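
A minimal sketch of such a weighting scheme is shown below. It assumes that the `"balanced"` method follows the same heuristic as scikit-learn's `class_weight="balanced"`, namely `n_samples / (n_classes * class_count)`; whether `compute_sample_weights` does exactly this is not confirmed here. The region label `"europe"` and the multiplier `region_boost=2.0` are hypothetical values chosen for illustration.

```python
from collections import Counter


def balanced_weights(pc_types, regions=None,
                     priority_region="europe", region_boost=2.0):
    """Per-sample weights: inverse PC-type frequency, optionally boosted by region."""
    counts = Counter(pc_types)
    n, k = len(pc_types), len(counts)
    # Rare PC types get weights above 1, common ones below 1.
    weights = [n / (k * counts[t]) for t in pc_types]
    if regions is not None:
        # Assumption: a simple multiplicative boost for the priority region.
        weights = [w * (region_boost if r == priority_region else 1.0)
                   for w, r in zip(weights, regions)]
    return weights


print(balanced_weights(["a", "a", "b", "a"]))
```

With the frequency term alone, the weights sum to the number of samples, so the overall scale of the loss is unchanged.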

# 5. Define Evaluation Metrics

**See `src/modeling/evaluation.py` -> `multi_compute_performance_metrics` for details.**

For model evaluation, we use the Mean Absolute Percentage Error (MAPE) as our primary metric. MAPE is a normalized measure of prediction accuracy, which lets us compare model performance across PC types and price ranges.

A global MAPE alone can be misleading because the dataset is imbalanced. We therefore also compute a weighted MAPE, in which each PC type's contribution is weighted in inverse proportion to its frequency in the dataset. This keeps performance on rare PC types visible in the evaluation and discourages optimizing for the common PC types at the expense of the rare ones. In total, we track $3$ metrics:
- Global MAPE: Overall MAPE across all samples.
- Weighted MAPE: MAPE computed with pc type weights.
- Per pc type MAPE: MAPE computed for each pc type separately.
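
The three metrics can be sketched in plain Python as follows. This is an illustration rather than the code of `multi_compute_performance_metrics`; the function names are hypothetical. It assumes each sample is weighted by the inverse of its PC type's sample count, in which case the weighted MAPE reduces to the unweighted mean of the per-type MAPEs. It also assumes all true prices are non-zero, since MAPE is undefined otherwise.

```python
from collections import defaultdict


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape_report(y_true, y_pred, pc_types):
    """Global, weighted, and per-PC-type MAPE."""
    true_by_type, pred_by_type = defaultdict(list), defaultdict(list)
    for t, p, g in zip(y_true, y_pred, pc_types):
        true_by_type[g].append(t)
        pred_by_type[g].append(p)
    per_type = {g: mape(true_by_type[g], pred_by_type[g]) for g in true_by_type}
    return {
        "global": mape(y_true, y_pred),
        # Inverse-frequency sample weights give every PC type the same total weight.
        "weighted": sum(per_type.values()) / len(per_type),
        "per_type": per_type,
    }
```

In a toy case with three accurate predictions for a common type and one poor prediction for a rare type, the global MAPE stays low while the weighted MAPE rises, which is the behavior the weighted metric is meant to expose.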

# 6. Train Global Model

**See `src/modeling/multivariate_model_training.py` -> `train_global_model` for details.**

We train a single global model on all pc types using the computed sample weights, and log it to MLflow. This gives us the following training pipeline.

In [None]:
train_global_model(
    group_by_pc_types=False,
    horizon=3,
    use_validation_set=False,
    target_test_ratio=0.2,
    target_validation_ratio=0.1,
    min_train_samples=55,
    min_test_samples=20,
    min_validation_samples=15,
    weighting_method="balanced",
    model_type="xgboost",
    hyperparameter_grid=None,
    mlflow_run_name="global_xgboost_model_3m_horizon_no_grouping",
    n_trials=1,
    shap_max_display=5,
)

In [None]:
mlflow.search_runs()