### Time Series Workshop 
# 6. Multi-Step Ahead Forecasting 👉 👉 👉

So far, we've limited ourselves to single-step forecasting, i.e., we always predicted exactly one time step ahead (1 hour for the air pollution data, 1 month for the retail challenge).

But what about multi-step forecasting? Can we predict the next 24 hours of air pollution? Or the next 12 months of retail sales?

Here we'll tackle this problem and dive into the two most common approaches to multi-step forecasting: 
- Direct forecasting
- Recursive forecasting 
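
To make the difference between the two approaches concrete, here is a minimal sketch on a toy sine wave (not the workshop data). It assumes a plain `LinearRegression` on the last three values; the helper `make_lagged` is hypothetical and only exists for this illustration. The direct approach trains one model per horizon step, while the recursive approach trains a single one-step model and feeds its own predictions back in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy series with a 24-step period (hypothetical data, not the air-quality set).
y = np.sin(np.arange(200) * 2 * np.pi / 24)
n_lags, horizon = 3, 4


def make_lagged(series, n_lags, step):
    """Hypothetical helper: pair each window of n_lags values with the value
    `step` steps after the end of that window."""
    X, t = [], []
    for i in range(n_lags, len(series) - step + 1):
        X.append(series[i - n_lags : i])
        t.append(series[i + step - 1])
    return np.array(X), np.array(t)


last_window = y[-n_lags:]

# Direct: one separate model per horizon step h = 1..horizon.
direct_forecast = []
for h in range(1, horizon + 1):
    X_h, t_h = make_lagged(y, n_lags, h)
    model_h = LinearRegression().fit(X_h, t_h)
    direct_forecast.append(model_h.predict(last_window.reshape(1, -1))[0])

# Recursive: a single one-step model whose predictions become the next inputs.
X_1, t_1 = make_lagged(y, n_lags, 1)
one_step_model = LinearRegression().fit(X_1, t_1)
window = list(last_window)
recursive_forecast = []
for _ in range(horizon):
    pred = one_step_model.predict(np.array(window[-n_lags:]).reshape(1, -1))[0]
    recursive_forecast.append(pred)
    window.append(pred)  # the forecast is reused as a "lag" in the next step

print(np.round(direct_forecast, 3))
print(np.round(recursive_forecast, 3))
```

On real, noisy data the two forecasts usually differ: recursive forecasts can accumulate errors step by step, while direct forecasts need one model per step.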

In [3]:
%config InlineBackend.figure_format='retina'
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

from timeseries.data import load_air_quality
from timeseries.utils import print_metrics

DATA_DIR = Path("..") / Path("data")



## Load data
Let's come back to our air-pollution example from before!

In [6]:
SPLIT_DATE = "2005-02-01"
TARGET_COL = "co_sensor"
FILE_PATH = DATA_DIR / "air_quality.csv"
VARIABLES = [TARGET_COL, "humidity"]

df_in = load_air_quality(FILE_PATH)[VARIABLES]
df_in.head()

date_time,co_sensor,humidity
2004-04-04 00:00:00,1224.0,56.5
2004-04-04 01:00:00,1215.0,59.2
2004-04-04 02:00:00,1115.0,62.4
2004-04-04 03:00:00,1124.0,65.0
2004-04-04 04:00:00,1028.0,65.3


## Streamlined pre-processing
- We were quite verbose with our feature-engineering earlier
- Let's streamline this a little bit with some more concise transformers

In [12]:
from feature_engine.creation import CyclicalFeatures
from feature_engine.datetime import DatetimeFeatures
from feature_engine.imputation import DropMissingData
from feature_engine.selection import DropFeatures
from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)

# Date feature transformer:
datetime_features = DatetimeFeatures(
    variables="index",
    features_to_extract=[
        "month",
        "week",
        "day_of_week",
        "day_of_month",
        "hour",
        "weekend",
    ],
)

# Lag feature transformer:
lag_features = LagFeatures(
    variables=VARIABLES, freq=["1H", "24H"], missing_values="ignore"
)

# Window feature transformer:
window_features = WindowFeatures(
    variables=VARIABLES,
    window="3H",
    freq="1H",
    missing_values="ignore",
    functions=["mean", "min", "max", "std"],
)

# Cyclical feature transformer (this one we already know!):
cyclic_features = CyclicalFeatures(variables=["month", "hour"], drop_original=False)

# Drop missing data transformer:
dropnas = DropMissingData()

# Drop features transformer (to avoid look-ahead bias):
drop_features = DropFeatures(features_to_drop=VARIABLES)
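
If you're wondering what these transformers compute, here is a rough by-hand sketch in plain pandas. It uses the first five `co_sensor` values shown above and makes some simplifying assumptions: the data is strictly hourly (so an integer `shift(1)` stands in for the `"1H"` frequency), the rolling window is integer-based (so the first rows are NaN rather than partially filled), and the cyclical encoding divides by the maximum value of the variable (23 for `hour`). Treat it as an illustration of the idea, not as the library's implementation.

```python
import numpy as np
import pandas as pd

# First five co_sensor readings from the table above, on an hourly index.
idx = pd.date_range("2004-04-04", periods=5, freq=pd.Timedelta(hours=1))
co = pd.Series([1224.0, 1215.0, 1115.0, 1124.0, 1028.0], index=idx, name="co_sensor")

# Lag feature: the value observed one hour earlier.
co_lag_1h = co.shift(1)

# Window feature: mean over the previous three hours, shifted by one step
# so the current value never leaks into its own feature.
co_window_3h_mean = co.rolling(3).mean().shift(1)

# Cyclical feature: map the hour onto a circle (assumes max hour = 23).
hours = pd.Series(np.arange(24), name="hour")
hour_sin = np.sin(hours * 2 * np.pi / hours.max())
hour_cos = np.cos(hours * 2 * np.pi / hours.max())

print(pd.concat([co, co_lag_1h.rename("lag_1h"), co_window_3h_mean.rename("win_3h_mean")], axis=1))
```

The shift on the window feature is the important part: without it, the rolling mean would include the value we are trying to predict.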

Let's combine all of this in a neat little sklearn pipeline:

In [15]:
pipe = Pipeline(
    [
        ("datetime_features", datetime_features),
        ("lag_features", lag_features),
        ("window_features", window_features),
        ("cyclic_features", cyclic_features),
        ("dropnas", dropnas),
        ("drop_features", drop_features),
    ]
)
pipe

In [18]:
df = df_in.copy()

df_processed = pipe.fit_transform(df)
df_processed.head(3)

date_time,month,week,day_of_week,day_of_month,hour,weekend,co_sensor_lag_1H,humidity_lag_1H,co_sensor_lag_24H,humidity_lag_24H,...,co_sensor_window_3H_max,co_sensor_window_3H_std,humidity_window_3H_mean,humidity_window_3H_min,humidity_window_3H_max,humidity_window_3H_std,month_sin,month_cos,hour_sin,hour_cos
2004-04-05 00:00:00,4,15,0,5,0,0,1188.0,60.8,1224.0,56.5,...,1196.0,45.785733,58.566667,56.1,60.8,2.358672,0.866025,-0.5,0.0,1.0
2004-04-05 01:00:00,4,15,0,5,1,0,1065.0,65.8,1215.0,59.2,...,1196.0,73.432509,61.8,58.8,65.8,3.605551,0.866025,-0.5,0.269797,0.962917
2004-04-05 02:00:00,4,15,0,5,2,0,999.0,79.2,1115.0,62.4,...,1188.0,95.921843,68.6,60.8,79.2,9.5142,0.866025,-0.5,0.519584,0.854419


Ah, way better and not too cluttered.

## Multi-step forecasting: Direct approach!
- Split train-test first

In [19]:
X_train = df[df.index < "2005-03-04"]
X_train

date_time,co_sensor,humidity
2004-04-04 00:00:00,1224.0,56.5
2004-04-04 01:00:00,1215.0,59.2
2004-04-04 02:00:00,1115.0,62.4
2004-04-04 03:00:00,1124.0,65.0
2004-04-04 04:00:00,1028.0,65.3
...,...,...
2005-03-03 19:00:00,1473.0,82.4
2005-03-03 20:00:00,1396.0,84.0
2005-03-03 21:00:00,1285.0,83.6
2005-03-03 22:00:00,1206.0,82.5


In [None]:
# Input data: the test set starts 24 hours before the split date
# (this leaves room for the 24H lag features at the first test timestamps).
split_ts = pd.Timestamp("2005-03-04")
X_train = df[df.index < split_ts]
X_test = df[df.index >= split_ts - pd.offsets.Hour(24)]

# Target: select the same rows, using the column name defined in TARGET_COL
y_train = X_train[TARGET_COL]
y_test = X_test[TARGET_COL]
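
For the direct approach we need one target per forecast step. A minimal sketch of how such targets could be built, assuming a hypothetical helper `make_direct_targets` and a toy series instead of the real `co_sensor` column: each column holds the value `h` steps into the future, obtained with a negative shift.

```python
import numpy as np
import pandas as pd


def make_direct_targets(y, horizon):
    """Hypothetical helper: column `<name>_h{h}` holds the value h steps ahead."""
    return pd.concat(
        {f"{y.name}_h{h}": y.shift(-h) for h in range(1, horizon + 1)},
        axis=1,
    )


# Toy hourly series 0, 1, 2, ... so the shifted targets are easy to verify by eye.
idx = pd.date_range("2005-01-01", periods=10, freq=pd.Timedelta(hours=1))
y_toy = pd.Series(np.arange(10, dtype=float), index=idx, name="co_sensor")

# The last `horizon` rows have no complete future and are dropped.
Y_toy = make_direct_targets(y_toy, horizon=3).dropna()
print(Y_toy.head(3))
```

Each column can then get its own regressor, either in a simple loop or, for example, by wrapping a model in sklearn's `MultiOutputRegressor`, which fits one estimator per target column.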

## Analyze data
- Familiarize yourselves with the data. This one doesn't have too many pitfalls, hopefully.
- Do we have missing data?
- Can we see some obvious seasonal pattern? If so, what could be the reason for this?

In [19]:
...

Ellipsis

## Feature engineering
- Create some features that you think might be useful for forecasting
- Do we need to do some more pre-processing?

In [20]:
...

Ellipsis

## Train-test split
- Split the data into train and test sets according to the SPLIT_DATE parameter defined above
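
As a hint, a date-based split on a `DatetimeIndex` can look like the sketch below. It uses a toy frame (hypothetical data) and re-declares `SPLIT_DATE` only to stay self-contained. Unlike a random split, everything before the date goes to training and everything from the date onwards goes to testing, so no future information ends up in the training set.

```python
import numpy as np
import pandas as pd

SPLIT_DATE = "2005-02-01"  # same value as defined at the top of the notebook

# Toy hourly frame spanning two days before and two days after the split date.
idx = pd.date_range("2005-01-30", periods=96, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"co_sensor": np.arange(96.0)}, index=idx)

train = toy[toy.index < SPLIT_DATE]
test = toy[toy.index >= SPLIT_DATE]

print(train.index.max(), test.index.min())
```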


In [21]:
...

Ellipsis

## Build models and forecast!
- Fit a model on the processed training data
- Predict for the test set
- Calculate the usual metrics
  - How good is your forecast? Compare a naive baseline model with something more sophisticated.
  - What metric is the most appropriate here?
  - Can you manage to beat my own forecast? (We will have a little competition here) 🚨
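
If you need a starting point for the baseline, here is a minimal sketch of two naive forecasts scored with the mean absolute error. It runs on a synthetic noisy daily cycle (hypothetical data, seeded for reproducibility), so the numbers say nothing about the air-quality set. The persistence baseline repeats the last observed value, and the seasonal baseline repeats the value from 24 hours earlier.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Synthetic hourly series: daily sine cycle plus Gaussian noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2005-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
values = np.sin(np.arange(len(idx)) * 2 * np.pi / 24) + rng.normal(0, 0.1, len(idx))
y_demo = pd.Series(values, index=idx, name="co_sensor")

# Naive baselines built from past values only.
naive_last = y_demo.shift(1)       # persistence: previous hour
naive_seasonal = y_demo.shift(24)  # seasonal naive: same hour yesterday

# Score both on the rows where both baselines are defined.
valid = naive_seasonal.notna()
mae_last = mean_absolute_error(y_demo[valid], naive_last[valid])
mae_seasonal = mean_absolute_error(y_demo[valid], naive_seasonal[valid])

print(f"MAE persistence:    {mae_last:.3f}")
print(f"MAE seasonal naive: {mae_seasonal:.3f}")
```

Any model you build should at least beat these baselines on the same test rows before it counts as "more sophisticated".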

Good luck!

In [22]:
...

Ellipsis