# Horizon as a Feature

In this notebook, we illustrate how to train a single model for multiple forecast horizons by adding the horizon as an extra feature.

We compare this approach with the traditional one, i.e. one model per horizon.

As an example, we use precipitation data from a city in Switzerland (Lugano).

In [None]:
import pandas as pd
import plotly.graph_objects as go
from lightgbm import LGBMRegressor

## Load Climate Data


The data can be found [here](https://www.meteoswiss.admin.ch/services-and-publications/applications/ext/climate-tables-homogenized.html).


First, let's load the climate data into a pandas DataFrame.

In [None]:
def load_climate_data(path):
    # Load the raw text file into a DataFrame.
    with open(path, "r", encoding="ISO-8859-1") as f:
        data = f.readlines()
    # The column names are on the 28th line; the data rows follow.
    columns = data[27].split()
    data = [v.split() for v in data[28:]]
    data = pd.DataFrame(data, columns=columns)
    
    # Build a datetime index from the Year and Month columns.
    data["time"] = pd.to_datetime(data.Year + "-" + data.Month.astype(str))
    data = data.drop(columns=["Year", "Month"])
    data = data.set_index("time")
    
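    # Fix types: mark "NA" entries as missing (NaN) before casting to float.
    # An explicit NaN is used because, in some pandas versions,
    # replace(..., None) forward-fills instead of inserting missing values.
    data = data.replace("NA", float("nan"))
    return data.astype(float)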


DATA_PATH = "./data/climate-reports-tables-homogenized_LUG.txt"
data = load_climate_data(DATA_PATH)

## Quick Exploration

Let's have a quick look at the data.

In [None]:
data.tail()

In [None]:
data.shape

In [None]:
# Let's visualize the data.
def show_data(data, title=""):
    traces = [go.Scatter(x=data.index, y=data[c], name=c) for c in data.columns]
    go.Figure(traces, layout=dict(title=title)).show()

show_data(data, "Weather Data in Lugano")

We have 150+ years of monthly temperature and precipitation measurements for the region of Lugano (Switzerland).

We see that the data is seasonal and that precipitation is more irregular than temperature, as expected. 

Since precipitation is harder to predict, it's also more interesting! Let's try to forecast its values for the next 1, 2, and 3 months.

## Data Engineering

Let's prepare the data to train a forecasting model. We are going to use lagged values of precipitation and temperature to forecast future values of precipitation.
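
To make the terminology concrete, here is a minimal sketch on a made-up toy series (the values and the name `toy` are illustrative and not taken from the Lugano data). It assumes, as in the cells below, that lags and targets are both obtained with `Series.shift`: a positive shift looks into the past, a negative shift looks into the future.

```python
import pandas as pd

# Toy monthly series (made-up values, for illustration only).
index = pd.date_range("2020-01-01", periods=5, freq="MS")
toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index, name="toy")

horizon = 2
frame = pd.DataFrame(
    {
        "lag_0": toy.shift(0),          # value known at the forecast date
        "lag_1": toy.shift(1),          # value one month earlier
        "target": toy.shift(-horizon),  # value `horizon` months ahead
    }
)

# Rows without a complete set of lags or without a target are dropped.
frame = frame.dropna()
print(frame)
```

Each remaining row pairs what is known at the forecast date with the value observed `horizon` months later.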

In [None]:
def build_target_features(data, horizon):
    targ = build_target(data.Precipitation, horizon)
    feat = build_features(data, horizon)
    
    # Drop missing values generated by lags/horizon.
    idx = ~(feat.isnull().any(axis=1) | targ.isnull())
    feat = feat.loc[idx]
    targ = targ.loc[idx]
    
    return targ, feat


def build_target(series, horizon):
    return series.shift(-horizon)


def build_features(data, horizon):
    """Build lagged features.
    
    The lags depend on the horizon because some of them are defined
    relative to the target month. E.g., if the horizon is equal to 1,
    the value observed 12 months before the target corresponds to a lag of 11.
    """
    # Here we hardcode values to simplify code reading, but everything could 
    # (and should) be parametrized.
    precipitation_lags = [0, 1, 2, 12 - horizon, 24 - horizon, 36 - horizon]
    temperature_lags = [0, 1, 12 - horizon, 24 - horizon]
    
    # Concatenate precipitation and temperature features.
    features = pd.concat(
        [
            build_lagged_features(data.Precipitation, lags=precipitation_lags),
            build_lagged_features(data.Temperature, lags=temperature_lags),
        ],
        axis=1,
    )
    
    # Add the calendar month of the target (1-12) as a feature.
    features["horizon_month"] = (features.index.month + horizon - 1) % 12 + 1

    # Trick to later allow concatenation of features for different target horizons.
    features = features.rename(
        columns={
            f"Precipitation_lag_{12-horizon}": "Precipitation_lag_12_before_target",
            f"Precipitation_lag_{24-horizon}": "Precipitation_lag_24_before_target",
            f"Precipitation_lag_{36-horizon}": "Precipitation_lag_36_before_target",
            f"Temperature_lag_{12-horizon}": "Temperature_lag_12_before_target",
            f"Temperature_lag_{24-horizon}": "Temperature_lag_24_before_target",
        }
    )
    
    return features


def build_lagged_features(series, lags):
    return pd.concat(
        [series.shift(lag).rename(f"{series.name}_lag_{lag}") for lag in lags],
        axis=1,
    )


In [None]:
# Let's build the targets and features for each horizon.
HORIZONS = [1, 2, 3]
target_features = {h: build_target_features(data, h) for h in HORIZONS}
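
The month arithmetic behind `horizon_month` and the horizon-dependent lags can be checked in isolation. The sketch below is only an illustration: `target_month` is a hypothetical helper that is not part of the notebook, and it simply reuses the same formula on plain integers.

```python
def target_month(month, horizon):
    """Calendar month (1-12) reached `horizon` months after `month`."""
    return (month + horizon - 1) % 12 + 1


# December + 1 month wraps around to January.
assert target_month(12, 1) == 1
# October + 3 months gives January as well.
assert target_month(10, 3) == 1
# June + 2 months gives August.
assert target_month(6, 2) == 8

# A lag of (12 - horizon), counted back from the forecast date, points to
# the observation exactly 12 months before the target.
for horizon in (1, 2, 3):
    lag = 12 - horizon
    assert lag + horizon == 12
```

This is why the renamed columns (`..._12_before_target` and so on) carry the same meaning for every horizon, which is what makes the later concatenation possible.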

## Split Train & Test

Let's use the last 10 years as the test set.
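
The split is chronological: nothing is shuffled, and the most recent rows are held out. The following sketch uses a made-up 24-month series and a 6-month test set to illustrate one consequence. It assumes, as in `build_target`, that the target is `series.shift(-horizon)` and that rows are indexed by forecast date. Under those assumptions, taking the last `test_size` valid rows for each horizon means the predicted months cover the same calendar period for every horizon.

```python
import pandas as pd

# Toy monthly series (made-up values): 24 months starting January 2000.
index = pd.date_range("2000-01-01", periods=24, freq="MS")
series = pd.Series(range(24), index=index, dtype=float)
test_size = 6

test_target_period = {}
for horizon in (1, 2, 3):
    # Rows are indexed by forecast date; the target lies `horizon` months later.
    target = series.shift(-horizon).dropna()
    train, test = target.iloc[:-test_size], target.iloc[-test_size:]
    # Every training forecast date precedes every test forecast date.
    assert train.index.max() < test.index.min()
    # Calendar months actually being predicted in the test set.
    target_dates = test.index + pd.DateOffset(months=horizon)
    test_target_period[horizon] = (target_dates.min(), target_dates.max())

print(test_target_period)
```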

In [None]:
TEST_SIZE = 10 * 12

In [None]:
def split_train_test(target_features, test_size):
    targ_feat_split = {}
    for horizon, (targ,feat) in target_features.items():
        targ_train = targ.iloc[:-test_size]
        feat_train = feat.iloc[:-test_size]
        targ_test = targ.iloc[-test_size:]
        feat_test = feat.iloc[-test_size:]
        
        targ_feat_split[horizon] = targ_train, feat_train, targ_test, feat_test
        
    return targ_feat_split


targ_feat_split = split_train_test(target_features, test_size=TEST_SIZE)

## Model Training

We are going to use LightGBM as the ML model.
Since we only care about comparing the two approaches, we'll keep the default hyperparameters.

### Train: One model per horizon

In [None]:
def train_models_by_horizon(targ_feat_split, model_params=None):
    if model_params is None:
        model_params = {}
    
    # Train one model for each horizon
    models_by_horizon = {}
    for horizon, (targ_train,feat_train,_,_) in targ_feat_split.items():
        model = LGBMRegressor(**model_params)
        model.fit(feat_train, targ_train)
        models_by_horizon[horizon] = model
        
    return models_by_horizon


models_by_horizon = train_models_by_horizon(targ_feat_split)

### Train: One model for all horizons
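
The idea is to stack the per-horizon training sets into one table and let a `target_horizon` column tell the model which horizon each row belongs to. Here is a minimal sketch of the stacking step on made-up feature tables (the column `lag_0` and its values are illustrative only):

```python
import pandas as pd

# Toy per-horizon feature tables (made-up values and column name).
feat_by_horizon = {
    1: pd.DataFrame({"lag_0": [1.0, 2.0]}),
    2: pd.DataFrame({"lag_0": [1.0, 2.0]}),
    3: pd.DataFrame({"lag_0": [1.0, 2.0]}),
}

# Tag each table with its horizon, then stack them row-wise.
stacked = pd.concat(
    [feat.assign(target_horizon=h) for h, feat in feat_by_horizon.items()]
)
print(stacked)
```

The stacked table has three times as many rows as each individual one, and identical feature rows are distinguished only by `target_horizon`.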

In [None]:
def train_model_across_horizons(targ_feat_split, model_params=None):
    if model_params is None:
        model_params = {}
    
    # Concatenate data across horizons.
    targ_train_all = []
    feat_train_all = []
    for horizon, (targ_train,feat_train,_,_) in targ_feat_split.items():
        # Add horizon as a feature.
        feat_train = feat_train.copy()
        feat_train["target_horizon"] = horizon
        
        targ_train_all.append(targ_train)
        feat_train_all.append(feat_train)
        
    targ_train_all = pd.concat(targ_train_all)
    feat_train_all = pd.concat(feat_train_all)
    
    # Train a single model.
    model = LGBMRegressor(**model_params)
    model.fit(feat_train_all, targ_train_all)

    return model


model_shared = train_model_across_horizons(targ_feat_split)

## Predict on the Test set

Let's make predictions on the test set with the two approaches.

In [None]:
def predict_models_by_horizon(targ_feat_split, models_by_horizon):
    preds = {}
    for horizon, (_,_,_,feat_test) in targ_feat_split.items():
        preds[horizon] = models_by_horizon[horizon].predict(feat_test)
    return preds


preds_by_horizon = predict_models_by_horizon(targ_feat_split, models_by_horizon)

In [None]:
def predict_model_across_horizons(targ_feat_split, model):
    preds = {}
    for horizon, (_,_,_,feat_test) in targ_feat_split.items():
        # Add horizon as a feature.
        feat_test = feat_test.copy()
        feat_test["target_horizon"] = horizon
        
        preds[horizon] = model.predict(feat_test)
    return preds


preds_model_shared = predict_model_across_horizons(targ_feat_split, model_shared)

## Error Analysis
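
We compare the two approaches using the mean absolute error (MAE), i.e. the average of the absolute differences between targets and predictions. As a quick sanity check of the expression used in this section, here is a sketch on made-up numbers:

```python
import pandas as pd

# Made-up targets and predictions, for illustration only.
target = pd.Series([10.0, 20.0, 30.0])
pred = pd.Series([12.0, 18.0, 33.0])

# Absolute errors are 2, 2 and 3, so the MAE is 7 / 3.
mae = (target - pred).abs().mean()
print(f"MAE: {mae:.2f}")
```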

In [None]:
# Let's combine the output in a convenient format.
output = {}
for horizon in HORIZONS:
    df = targ_feat_split[horizon][2].rename("target").to_frame()
    df["pred_model_by_horizon"] = preds_by_horizon[horizon]
    df["pred_model_shared"] = preds_model_shared[horizon]
    output[horizon] = df

In [None]:
def print_stats(output):
    output_all = pd.concat(output.values())
    mae_by_horizon = (output_all.target - output_all.pred_model_by_horizon).abs().mean()
    mae_shared = (output_all.target - output_all.pred_model_shared).abs().mean()

    print("                 BY HORIZON     SHARED")
    print(f"MAE overall    :    {mae_by_horizon:.1f}         {mae_shared:.1f}\n")
    for h,df in output.items():   
        mae_by_horizon = (df.target - df.pred_model_by_horizon).abs().mean()
        mae_shared = (df.target - df.pred_model_shared).abs().mean()
        print(f"MAE - horizon {h}:    {mae_by_horizon:.1f}         {mae_shared:.1f}")

# Let's show some statistics.
print_stats(output)

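We see that the model shared across horizons always leads to a lower MAE.

This is somewhat expected, as it was trained on a larger dataset: the training sets of all three horizons are concatenated, so it sees roughly three times as many rows as each per-horizon model.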

In [None]:
# Let's have a look at the predictions.
for horizon, df in output.items():
    show_data(df,f"Predictions at Horizon {horizon}")

We see that the models still fail to capture extreme events (like August 2014). 

This is expected since:
- we have very limited information on the real state of the system
- extreme events are very hard to predict with ML because, by definition, the training set contains few of them.