# Forecasting with LightGBM
- Using Sklearn and MLflow
- Iterative Multi-step Forecasting (for now)  


Beginning with interable model where we can assess how much variability is explained by different features, then we will build deep learning models and compare the performance.

In [1]:
import mlflow
import pandas as pd
# import optuna
import pickle
from pathlib import Path
import lightgbm as lgb
# import os
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit 
# from skopt import BayesSearchCV
import numpy as np
import time

# Give url for local mlflow server
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Multiple outputs per notebook cell
%config InteractiveShell.ast_node_interactivity = 'all'

# random_state for different processes
RANDOM_STATE = 221


### Load Data
- Creating train/test and validation sets, then ensure windows are correct

In [2]:
cd = Path.cwd()
data_dir = str(cd.parents[1])
upsampled = data_dir + '/datasets/country_energy/load_wthr_downsample.pickle'
downsampled = data_dir + '/datasets/country_energy/load_wthr_upsample.pickle'

# Xu - upsampled
with open(upsampled, 'rb') as f:
    Xu = pickle.load(f)

# Xd - downsampled
with open(downsampled, 'rb') as f:
    Xd = pickle.load(f)

# Create test/train and validation set. Ensure dates are correct
validation_u = Xu.loc[Xu['day'] > (Xu['day'].max() - pd.to_timedelta('8day'))] # 8 days is selected because last day 2019-4-30 has 0 hours
Xu = Xu.loc[Xu['day'] < (Xu['day'].max() - pd.to_timedelta('7day'))]
print(f'Ensure validation set is last 7 days; min date: {validation_u.index.min()}, max date: {validation_u.index.max()}')
print(f'Ensure train/test set excludes last 7 days; min date: {Xu.index.min()}, max date: {Xu.index.max()}')

validation_d = Xd.loc[Xd['day'] > (Xd['day'].max() - pd.to_timedelta('8day'))]
Xd = Xd.loc[Xd['day'] < (Xd['day'].max() - pd.to_timedelta('7day'))]
print(f'Ensure validation set is last 7 days; min date: {validation_d.index.min()}, max date: {validation_d.index.max()}')
print(f'Ensure train/test set excludes last 7 days; min date: {Xd.index.min()}, max date: {Xd.index.max()}')

Ensure validation set is last 7 days; min date: 2019-04-23 00:00:00+00:00, max date: 2019-04-30 00:00:00+00:00
Ensure train/test set excludes last 7 days; min date: 2015-01-01 00:00:00+00:00, max date: 2019-04-22 23:45:00+00:00
Ensure validation set is last 7 days; min date: 2019-04-23 00:00:00+00:00, max date: 2019-04-30 00:00:00+00:00
Ensure train/test set excludes last 7 days; min date: 2015-01-01 00:00:00+00:00, max date: 2019-04-22 23:00:00+00:00


Prepare Data for Model Training

In [3]:
Xd_features = Xd.drop(columns='load_actual') # country, day, hdd, cdd
yd = Xd['load_actual']

X_train, X_test = Xd_features[0:24], Xd_features[24:36]
y_train, y_test = yd[36:60], yd[60:72]

X_train.head()
X_test.head()
y_train.head()
y_test.head()

Unnamed: 0_level_0,day,day_ordinal,year,week_of_year,month,hour,country,is_weekend,is_holiday,temperature,...,hdd,cdd,temperature_lag1_days,temperature_rollmean1_days,temperature_lag2_days,temperature_rollmean2_days,temperature_lag7_days,temperature_rollmean7_days,temperature_lag14_days,temperature_rollmean14_days
utc_timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-01-01 00:00:00+00:00,2015-01-01,735599.0,2015.0,1.0,1.0,0.0,DE,0.0,1.0,-0.981,...,1,0,,,,,,,,
2015-01-01 01:00:00+00:00,2015-01-01,735599.0,2015.0,1.0,1.0,1.0,DE,0.0,1.0,-1.035,...,1,0,,,,,,,,
2015-01-01 02:00:00+00:00,2015-01-01,735599.0,2015.0,1.0,1.0,2.0,DE,0.0,1.0,-1.109,...,1,0,,,,,,,,
2015-01-01 03:00:00+00:00,2015-01-01,735599.0,2015.0,1.0,1.0,3.0,DE,0.0,1.0,-1.166,...,1,0,,,,,,,,
2015-01-01 04:00:00+00:00,2015-01-01,735599.0,2015.0,1.0,1.0,4.0,DE,0.0,1.0,-1.226,...,1,0,,,,,,,,


Unnamed: 0_level_0,day,day_ordinal,year,week_of_year,month,hour,country,is_weekend,is_holiday,temperature,...,hdd,cdd,temperature_lag1_days,temperature_rollmean1_days,temperature_lag2_days,temperature_rollmean2_days,temperature_lag7_days,temperature_rollmean7_days,temperature_lag14_days,temperature_rollmean14_days
utc_timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-01-02 00:00:00+00:00,2015-01-02,735600.0,2015.0,1.0,1.0,0.0,DE,0.0,0.0,-0.49,...,1,0,-0.981,-2.51325,,,,,,
2015-01-02 01:00:00+00:00,2015-01-02,735600.0,2015.0,1.0,1.0,1.0,DE,0.0,0.0,-0.432,...,1,0,-1.035,-2.331667,,,,,,
2015-01-02 02:00:00+00:00,2015-01-02,735600.0,2015.0,1.0,1.0,2.0,DE,0.0,0.0,-0.326,...,1,0,-1.109,-2.143792,,,,,,
2015-01-02 03:00:00+00:00,2015-01-02,735600.0,2015.0,1.0,1.0,3.0,DE,0.0,0.0,-0.17,...,1,0,-1.166,-1.962167,,,,,,
2015-01-02 04:00:00+00:00,2015-01-02,735600.0,2015.0,1.0,1.0,4.0,DE,0.0,0.0,-0.016,...,1,0,-1.226,-1.802333,,,,,,


utc_timestamp
2015-01-02 12:00:00+00:00    59567.8950
2015-01-02 13:00:00+00:00    58561.0800
2015-01-02 14:00:00+00:00    58011.8600
2015-01-02 15:00:00+00:00    59286.7050
2015-01-02 16:00:00+00:00    62207.6925
Name: load_actual, dtype: float64

utc_timestamp
2015-01-03 12:00:00+00:00    52905.5625
2015-01-03 13:00:00+00:00    51660.2750
2015-01-03 14:00:00+00:00    51103.9925
2015-01-03 15:00:00+00:00    53047.3625
2015-01-03 16:00:00+00:00    56013.8200
Name: load_actual, dtype: float64

Objective Function (LightGBM) -- consider adding timing param within mlflow (assuming mlflow doesn't already track that)

In [28]:
def objective(params, x, y, n_splits=10): # n_splits=10, params, # adjust x, y being the upsample and downsample datasets
    
    # Time Series splits for cross validation 
    num_rows_horizon = len(x.loc[x['day'] > (x['day'].max() - pd.to_timedelta('7day'))])
    ts_cv = TimeSeriesSplit(n_splits=10, test_size=num_rows_horizon) 
    folds = []
    fold_mae = []
    fold_mape = []

    # drop unneeded day variable
    x = x.drop(columns='day')

    # start run with ml flow, record metrics
    # try this: mlflow.autolog()
    with mlflow.start_run(nested=False): #nested=True
        mlflow.log_params(params)

        # Dataset splits for cross-validation
        for i, (train_idx, test_idx) in enumerate(ts_cv.split(x)):
            X_train, X_test = x.iloc[train_idx], x.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

            # model, model fit, and predictions
                # partial w/ params?
            model = lgb.LGBMRegressor(
                **params, 
                random_state=RANDOM_STATE, 
                n_jobs=-1,
                early_stopping_round = 5,
 # early_stopping_min_delta 
                )  # path_smooth 
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            # loss metrics
            mae = mean_absolute_error(y_test, y_pred)
            mape = mean_absolute_percentage_error(y_test, y_pred)

            # record loss metrics for mean
            folds.append(i+1)
            fold_mae.append(mae)
            fold_mape.append(mape)
    
        # log mean / std of folds
        avg_mae = np.mean(fold_mae)
        std_mae = np.std(fold_mae)
        avg_mape = np.mean(fold_mape)
        std_mape = np.std(fold_mape)
        mlflow.log_metrics({
            'avg_mae': avg_mae,
            'std_mae': std_mae,
            'avg_mape': avg_mape,
            'std_mape': std_mape,   
        })

        # fold level results
        tbl = pd.DataFrame({'folds': folds, 
                            'mae_per_fold': fold_mae,
                            "mape_per_fold": fold_mape}).round(4)
        mlflow.log_table(data=tbl, artifact_file='results_per_fold.json')
        
    mlflow.end_run()
    return {'avg_mae': avg_mae, 'std_mae': std_mae, 'avg_mape': avg_mape, 'std_mape': std_mape}

Model Training and Hyperparameter Selection

In [29]:
# Prep data
Xd_features = (Xd.loc[Xd['country'] == 'BE']
               .drop(columns=['load_actual', 'country']) # hdd cdd
               .reset_index(drop=True)) 

yd = (Xd.loc[Xd.country == 'BE']
      .reset_index(drop=True)[['load_actual']])
    ### automate to work across countries

#### Seek to do param searhc in parallel
# Prelim Params
# params = {
#     'n_estimators': trial.suggest_int('n_estimators', 100, 1000), # adjust
#     'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.5),
#     'num_leaves': trial.suggest_int('num_leaves', 31, 511), # adjust
#     'max_depth': trial.suggest_int('max_depth', 3, 9), # adjust if overfit
#     'subsample': trial.suggest_float('subsample', 0.5, 1), # research
#     'colsample_bytree': trial.suggest-float('colsample_bytree', 0.7, 1),
#     'reg_alpha': trial.suggest_float('reg_alpha', 0, 1),
#     'reg_lambda': trial.suggest_float('reg_lambda', 0, 1)
# }

# simplified params for testing
# params = {
#     'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.5),
#     'reg_alpha': trial.suggest_float('reg_alpha', 0, 1)
# }

params = {'learning_rate': [0.001],
          'reg_alpha': [0.5]}

mlflow.end_run() # cancel any existing flows
if __name__ == "__main__":

    mlflow.set_experiment("Time Series CV Parameter Tuning")
    
    # Parameter search loop - seek to do in parallel
    # BayesSearchCV(
    #         estimator = LGBMClassifier(),
    #         search_spaces = param_space,
    #         scoring = 'accuracy', 
    #         cv = 1, # cv=None defaults to 3 folds
    #         n_iter = 50, 
    #         n_jobs = -1,
    #         return_train_score = True,
    #         random_state = 42 
    #     ).fit(x_train, y_train);

    start = time.time() # mlflow may replace this
    results = objective(params, Xd_features, yd) 
    end = round(time.time() - start, 2)
    print(f"\n\nParams = {params}: MAE = {results['avg_mae']:.2f}, MAPE = {results['avg_mape']:.2%}, Total Runtime = {end}")    


    # for params in generate_parameter_combinations(param_grid):  # Implement your grid generator
    #     with mlflow.start_run():
    #         results = objective(params, X, y)
    #         print(f"Tested {params}: MAE={results['avg_mae']:.2f}, MAPE={results['avg_mape']:.2%}")    

<Experiment: artifact_location='mlflow-artifacts:/359589701925068288', creation_time=1740264869609, experiment_id='359589701925068288', last_update_time=1740264869609, lifecycle_stage='active', name='Time Series CV Parameter Tuning', tags={}>

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006785 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3930
[LightGBM] [Info] Number of data points in the train set: 36072, number of used features: 23
[LightGBM] [Info] Start training from score 9978.581286
🏃 View run intrigued-squid-866 at: http://127.0.0.1:5000/#/experiments/359589701925068288/runs/d046f44dd80341c0b2c06f709ef5b53f
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/359589701925068288


ValueError: For early stopping, at least one dataset and eval metric is required for evaluation

In [44]:
results

{'avg_mae': 887.8164082561352,
 'std_mae': 123.64013095076388,
 'avg_mape': 9.279068587006188,
 'std_mape': 123.64013095076388}

In [70]:
# Hypothesis 1: Index values are different for selected country, Belgium
# Hypothesis 2: there is not row index value, only utc_timestamp
    # both were true

# Hypothesis 3:


RangeIndex(start=0, stop=37752, step=1)

In [None]:
# Set random string
    # ensure if covers mlfflow and other processes

# parallelization with mlflow
# https://mlflow.org/docs/latest/traditional-ml/hyperparameter-tuning-with-child-runs/notebooks/index.html

# study.trials_dataframe()


Results Graphs (MLflow)

In [None]:
# Can optuna or MLflow assist in this
# mlflow.set_experiment("check-localhost-connection")
# mlflow.set_experiment("LightGBM Forecasting")

# with mlflow.start_run():
#     mlflow.log_metric("foo", 1)
#     mlflow.log_metric("bar", 2)


2025/02/20 13:53:40 INFO mlflow.tracking.fluent: Experiment with name 'check-localhost-connection' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/140043457852554855', creation_time=1740084820943, experiment_id='140043457852554855', last_update_time=1740084820943, lifecycle_stage='active', name='check-localhost-connection', tags={}>

🏃 View run omniscient-cod-280 at: http://127.0.0.1:5000/#/experiments/140043457852554855/runs/b337cbc6f61c4ea7960c2d7baf56f10e
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/140043457852554855


Feature Importance of Best Model, MAPE, MAE

Best Model on final validation set