# New Model Features Exploration

In this notebook I'll explore engineering some new features that I think will increase the model's predictive power:
- Lag features
- Weather features: Temperature and maybe cloud cover
- Holidays

# Library Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import os
from prefect import flow 

In [2]:
# Auto reload core modules so I don't need to restart kernel when I change
# the code in those modules
%load_ext autoreload
%autoreload 2

In [11]:
from core.consts import EIA_TEST_SET_HOURS, EIA_EARLIEST_HOUR_UTC
from flows.train_model_flow import train_model
from core.utils import utcnow_minus_buffer_ts
from core.types import DVCDatasetInfo, ModelFeatureFlags
from core.data import get_dvc_remote_repo_url

In [9]:
# @flow()
# def run_eia_extraction():
#     start_ts = pd.to_datetime(EIA_EARLIEST_HOUR_UTC)
#     end_ts = utcnow_minus_buffer_ts()
#     eia_df = concurrent_fetch_EIA_data(start_ts, end_ts)
#     return eia_df
# eia_df = run_eia_extraction()

In [15]:
# @flow()
# def run_eia_transform(df):
#     # Type conversions
#     df = transform(df)
#     # Preprocess: Outlier capping + temporal features
#     df = preprocess_data(df)
#     return df
# eia_df = run_eia_transform(eia_df)
git_PAT = os.getenv('DVC_GIT_REPO_PAT')
git_repo_url = get_dvc_remote_repo_url(git_PAT)
path = 'data/eia_d_df_2019-01-01_00_2024-10-01_00.parquet'
rev = 'f82d25aad35da88dd595e9f7cfed6ac03a13296b'
dvc_dataset_info = DVCDatasetInfo(repo=git_repo_url, path=path, rev=rev)

reg = train_model(dvc_dataset_info=dvc_dataset_info, mlflow_tracking=False, log_prints=True)

Calculating Metrics:   0%|          | 0/15 [00:00<?, ?it/s]

GX Validation success: suite:etl


Input data skew: 159.52987768899504
Output data skew: 0.8481053110465101
Null demand values: 117


                                             utc_ts        D  hour  month  \
utc_ts                                                                      
2019-01-01 00:00:00+00:00 2019-01-01 00:00:00+00:00  94016.0     0      1   
2019-01-01 01:00:00+00:00 2019-01-01 01:00:00+00:00  90385.0     1      1   
2019-01-01 02:00:00+00:00 2019-01-01 02:00:00+00:00  86724.0     2      1   
2019-01-01 03:00:00+00:00 2019-01-01 03:00:00+00:00  82978.0     3      1   
2019-01-01 04:00:00+00:00 2019-01-01 04:00:00+00:00  79536.0     4      1   
...                                             ...      ...   ...    ...   
2024-09-30 20:00:00+00:00 2024-09-30 20:00:00+00:00  95573.0    20      9   
2024-09-30 21:00:00+00:00 2024-09-30 21:00:00+00:00  96891.0    21      9   
2024-09-30 22:00:00+00:00 2024-09-30 22:00:00+00:00  97449.0    22      9   
2024-09-30 23:00:00+00:00 2024-09-30 23:00:00+00:00  97578.0    23      9   
2024-10-01 00:00:00+00:00 2024-10-01 00:00:00+00:00  97712.0     0     10   

Performing  cross validation
Fitting 8 folds for each of 1 candidates, totalling 8 fits
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   0.8s
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   1.5s
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   1.8s
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   0.9s
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   1.3s
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   1.1s
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   0.9s
[CV] END learning_rate=0.02, max_depth=5, n_estimators=1000, objective=reg:squarederror; total time=   1.1s
Cross validation results:
   mean_fit_time  std_

In [None]:
eia_df

# Lag Features

Let's add time-series lag features: demand at the same hour and day of week roughly $Y$ years earlier, for $Y \in \{1,2,3\}$.

Once explored in this notebook, this logic should be moved into the feature engineering (pre-processing) section of `train_model_flow`.
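
As a starting point for that move, here is a minimal sketch of the lag logic as a reusable function. The name `add_lag_features` and its parameters are hypothetical (they are not part of `core` or `flows`), and the synthetic hourly frame only stands in for `eia_df`. It assumes a unique, timezone-aware hourly index.

```python
import pandas as pd

LAG_DAYS_PER_YEAR = 364  # 52 weeks, so the lagged timestamp keeps its weekday


def add_lag_features(df, col="D", years=(1, 2, 3)):
    """Return a copy of df with lag_{y}y columns (hypothetical helper)."""
    out = df.copy()
    for y in years:
        lagged_ts = out.index - pd.Timedelta(days=LAG_DAYS_PER_YEAR * y)
        # reindex yields NaN wherever the lagged timestamp is not in the data
        out[f"lag_{y}y"] = out[col].reindex(lagged_ts).to_numpy()
    return out


# Synthetic hourly series standing in for eia_df (assumption: unique UTC index)
idx = pd.date_range(
    "2019-01-01", periods=24 * 1100, freq=pd.Timedelta(hours=1), tz="UTC"
)
demo = add_lag_features(pd.DataFrame({"D": range(len(idx))}, index=idx))
```

Rows in the first 364 days have no one-year lag and get `NaN`, so the training flow would need to either drop them or rely on the model's handling of missing values.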

In [None]:
ts_to_D = eia_df.D.to_dict()
# Trick: offset by 364 days (52 weeks) => lagged day is the same day of week
LAG_DAYS_1Y = '364 days'
LAG_DAYS_2Y = '728 days'
LAG_DAYS_3Y = '1092 days'

eia_df['lag_1y'] = (eia_df.index - pd.Timedelta(LAG_DAYS_1Y)).map(ts_to_D)
eia_df['lag_2y'] = (eia_df.index - pd.Timedelta(LAG_DAYS_2Y)).map(ts_to_D)
eia_df['lag_3y'] = (eia_df.index - pd.Timedelta(LAG_DAYS_3Y)).map(ts_to_D)
eia_df

In [None]:
# Confirm, for a given row, that the lag values are correct
# Timestamps of interest
t = '2024-09-17 20:00:00+00:00'
t_lag1y = eia_df.loc[t, 'utc_ts'] - pd.Timedelta(LAG_DAYS_1Y)
t_lag2y = eia_df.loc[t, 'utc_ts'] - pd.Timedelta(LAG_DAYS_2Y)
t_lag3y = eia_df.loc[t, 'utc_ts'] - pd.Timedelta(LAG_DAYS_3Y)
# Confirm this row's lag column values match the D value of their respective rows
assert eia_df.loc[t, 'lag_1y'] == eia_df.loc[t_lag1y, 'D']
assert eia_df.loc[t, 'lag_2y'] == eia_df.loc[t_lag2y, 'D']
assert eia_df.loc[t, 'lag_3y'] == eia_df.loc[t_lag3y, 'D']
# Confirm that day of week is maintained for lagged dates
assert pd.to_datetime(t).dayofweek == pd.to_datetime(t_lag1y).dayofweek
assert pd.to_datetime(t).dayofweek == pd.to_datetime(t_lag2y).dayofweek
assert pd.to_datetime(t).dayofweek == pd.to_datetime(t_lag3y).dayofweek
print('All good') # TODO add this as a functional test

In [None]:
# reg = train_xgboost(eia_df, hyperparam_tuning=False)
reg = train_model(
    dvc_dataset_info=dvc_dataset_info,
    mlflow_tracking=False,
    feature_flags=ModelFeatureFlags(lag=True),
)
# TODO: Next replace above with train_model flow to try out feature flags in training

In [None]:
type(reg)

# Features from Additional Data Sources

## Weather

[OpenMeteo](https://open-meteo.com/)
- Easily handles large historical data requests:
  ```sh
  curl "https://archive-api.open-meteo.com/v1/era5?latitude=52.52&longitude=13.41&start_date=2019-01-01&end_date=2024-09-30&hourly=temperature_2m,cloud_cover" > temp_data.json
  ```
- And forecasts:
  ```sh
  curl "https://api.open-meteo.com/v1/forecast?latitude=52.52&longitude=13.41&hourly=temperature_2m,cloud_cover&forecast_days=14" > forecast_data.json
  ```
- With a common response format
- Caching: Historical data will never change, so it could be cached. Is that worth implementing? Not yet; skip it until something forces the issue.
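
The same requests can be made from Python. The sketch below assumes the response carries an `hourly` object with a `time` list plus one list per requested variable, and that timestamps are in GMT (the API default when no `timezone` parameter is passed). The function names are hypothetical, and only the parsing step is exercised here, on a hand-written payload.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/era5"
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"
HOURLY_VARS = ("temperature_2m", "cloud_cover")


def open_meteo_hourly_to_df(payload):
    """Turn the assumed 'hourly' response block into a UTC-indexed frame."""
    hourly = payload["hourly"]
    index = pd.to_datetime(hourly["time"], utc=True).rename("utc_ts")
    data = {name: values for name, values in hourly.items() if name != "time"}
    return pd.DataFrame(data, index=index)


def fetch_hourly(base_url, latitude, longitude, **params):
    """Request hourly variables; extra params (e.g. start_date) pass through."""
    query = urllib.parse.urlencode(
        {
            "latitude": latitude,
            "longitude": longitude,
            "hourly": ",".join(HOURLY_VARS),
            **params,
        }
    )
    with urllib.request.urlopen(f"{base_url}?{query}", timeout=30) as resp:
        return open_meteo_hourly_to_df(json.load(resp))


# Hand-written payload in the assumed shape (not real weather data)
sample = {
    "hourly": {
        "time": ["2019-01-01T00:00", "2019-01-01T01:00"],
        "temperature_2m": [1.5, 1.2],
        "cloud_cover": [100, 90],
    }
}
sample_df = open_meteo_hourly_to_df(sample)
```

A historical request would then look like `fetch_hourly(ARCHIVE_URL, 52.52, 13.41, start_date="2019-01-01", end_date="2024-09-30")`, and a forecast request like `fetch_hourly(FORECAST_URL, 52.52, 13.41, forecast_days=14)`. Because both return the same frame layout, downstream feature code does not need to know which endpoint was used.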

### Questions

- Should I include multiple weather features: temperature, cloud cover, and precipitation level? There may be predictive value in each (e.g. on a cloudy day people turn on more lights, on a rainy/snowy day people stay home, etc).
- Historical vs forecast data: For training my model, I'll use historical weather data for features. For predictions, the weather data may be either historical or forecast, depending on whether the test/eval time period is in the past or the future. How can I merge historical and forecast data seamlessly?
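
One possible answer to the merge question, sketched under the assumption that both sources have already been loaded into frames with the same columns and a UTC hourly index: prefer the historical value wherever it exists and fall back to the forecast otherwise. `combine_weather` is a hypothetical name, and the numbers are made up for illustration.

```python
import pandas as pd


def combine_weather(historical, forecast):
    """Prefer historical values; fill gaps and future hours from the forecast."""
    # combine_first takes the union of both indexes and keeps the caller's
    # non-null values, using the argument only where the caller has none.
    return historical.combine_first(forecast)


hours = pd.date_range(
    "2024-09-30 21:00", periods=5, freq=pd.Timedelta(hours=1), tz="UTC"
)
# Historical data covers the first three hours, the last one still missing
hist = pd.DataFrame({"temperature_2m": [1.0, 2.0, float("nan")]}, index=hours[:3])
# Forecast data overlaps on the third hour and extends into the future
fcst = pd.DataFrame({"temperature_2m": [30.0, 40.0, 50.0]}, index=hours[2:])
combined = combine_weather(hist, fcst)
```

This keeps the rule in one place: if the historical archive later fills in an hour that was previously forecast-only, rerunning the merge picks up the historical value without any other change.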
  

## Holidays

[Calendarific](https://calendarific.com/api-documentation)

```sh
curl "https://calendarific.com/api/v2/holidays?api_key=${API_KEY}&country=US&type=national&year=2019" > holidays_2019.json
```

- Need to make one API request per year.
- Includes lots of obscure holidays, but can filter to `primary_type: "Federal Holiday"`
- **TODO**: This has an API limit, and the amount of data is small - so prefetch it all and store it in a file.
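
A rough sketch of the prefetch-and-store idea, plus a holiday flag feature. It assumes each response nests holidays under `response.holidays`, with an ISO date string at `date.iso` and the `primary_type` field mentioned above; verify that shape against a real response. All function names are hypothetical. Since holidays fall on local calendar days while the index is UTC, it also assumes `America/New_York` as the local timezone.

```python
import json
from pathlib import Path

import pandas as pd


def federal_holiday_dates(payload):
    """Extract federal holiday dates from one (assumed-shape) yearly response."""
    return {
        pd.Timestamp(h["date"]["iso"][:10]).date()
        for h in payload["response"]["holidays"]
        if h.get("primary_type") == "Federal Holiday"
    }


def save_holiday_dates(dates, path):
    """Store the prefetched dates as a sorted JSON list of ISO strings."""
    Path(path).write_text(json.dumps(sorted(d.isoformat() for d in dates)))


def load_holiday_dates(path):
    """Read the dates back from the JSON file written above."""
    return {pd.Timestamp(s).date() for s in json.loads(Path(path).read_text())}


def add_holiday_flag(df, holiday_dates, local_tz="America/New_York"):
    """Add a boolean is_holiday column based on the local calendar date."""
    out = df.copy()
    local_dates = pd.Series(out.index.tz_convert(local_tz).date, index=out.index)
    out["is_holiday"] = local_dates.isin(list(holiday_dates))
    return out


# Hand-written payload in the assumed shape (not a real API response)
sample = {
    "response": {
        "holidays": [
            {"date": {"iso": "2019-07-04"}, "primary_type": "Federal Holiday"},
            {"date": {"iso": "2019-07-05"}, "primary_type": "Observance"},
        ]
    }
}
dates = federal_holiday_dates(sample)
idx = pd.date_range(
    "2019-07-04", periods=36, freq=pd.Timedelta(hours=1), tz="UTC"
)
flagged = add_holiday_flag(pd.DataFrame({"D": range(len(idx))}, index=idx), dates)
```

With one response per year, the sets from each year can be unioned and written once with `save_holiday_dates`, so training runs never touch the API.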

# Questions

- What location should I choose as representative of the weather for the PJM region? I could take several locations and average them, but the simpler approach (one location) is probably better to start.
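
If the multi-location route is tried later, the averaging itself is small. The sketch below assumes one weather frame per location, all sharing the same columns and a UTC hourly index. The location labels, the values, and the name `average_weather` are illustrative only.

```python
import pandas as pd


def average_weather(frames):
    """Average each weather column across locations, hour by hour."""
    # Passing a dict to concat uses its keys as the outer index level
    stacked = pd.concat(frames, names=["location", "utc_ts"])
    # mean() skips NaN, so an hour missing at one location still gets a value
    return stacked.groupby(level="utc_ts").mean()


hours = pd.date_range(
    "2024-01-01", periods=2, freq=pd.Timedelta(hours=1), tz="UTC"
)
frames = {
    "location_a": pd.DataFrame({"temperature_2m": [10.0, 20.0]}, index=hours),
    "location_b": pd.DataFrame({"temperature_2m": [20.0, 40.0]}, index=hours),
}
regional = average_weather(frames)
```

An unweighted mean treats every location equally; weighting by population or load share would be a natural refinement, but it needs data this notebook does not have.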
