# Forecasting

We want to forecast future observations based on past observations. We will cover:

-   Naive methods
-   Exponential Smoothing models
-   ARIMA/SARIMA models
-   How to set up a one-step-ahead forecast

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

In [None]:
import warnings
warnings.filterwarnings('ignore')

We use a dataset of monthly totals (1000s) of international airline
passengers between 1949 and 1960:

In [None]:
# Load the AirPassengers dataset
data = sm.datasets.get_rdataset("AirPassengers").data

# Convert to datetime and set as index
data['Month'] = pd.date_range(start='1949-01-01', periods=len(data), freq='MS')
data.set_index('Month', inplace=True)

# Convert passengers column to time series
air_passengers = data['value']

# Create training and validation sets
training = air_passengers['1949-01-01':'1956-12-01']
validation = air_passengers['1957-01-01':]

In [None]:
plt.plot(air_passengers)

## 1. Naive Methods

Any forecasting method should be evaluated against a naive baseline.
This helps ensure that the effort invested in a more complex model is
worth it in terms of performance.

The simplest of all methods is the simple naive forecast: the forecast
for tomorrow is what we observe today.

Another approach, the seasonal naive forecast, is a little more
"complex": the forecast for tomorrow is what we observed one season
earlier (a week, a month or a year before, depending on the seasonal
period of the data).
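Before moving to the seasonal version, the simple naive forecast described above can be sketched in a few lines. This is only an illustration on a toy array: the values and the names `toy_series`, `toy_train`, `toy_valid` and `simple_naive` are assumptions made for the example, not part of the AirPassengers workflow.

```python
import numpy as np

# Toy series (hypothetical values, used only for illustration)
toy_series = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])
toy_train, toy_valid = toy_series[:7], toy_series[7:]

# Simple naive forecast: every future value equals the last training observation
simple_naive = np.full(len(toy_valid), toy_train[-1])

print(simple_naive)  # three copies of 148, the last training value
```

The forecast is a flat line, which is why it is a useful lower bar: any model that cannot beat it is not extracting information from the series.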

Here is how to do a seasonal naive forecast:

In [None]:
from sklearn.metrics import mean_absolute_percentage_error

# Seasonal naive forecast: repeat last 12 months of training for the forecast horizon
season_length = 12
h = len(validation)

# Get the last season from training data
last_season = training[-season_length:]

# Repeat last season to match the forecast horizon
# np.tile repeats the input array a given number of times: we repeat enough
# full seasons to cover the horizon, then truncate to exactly h values.
naive_forecast = np.tile(last_season.values, h // season_length + 1)[:h]

# Compute MAPE
mape = mean_absolute_percentage_error(validation, naive_forecast) * 100
print(f'MAPE: {mape:.2f}%')

This gives us a **MAPE of 19.5%**.
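For reference, the MAPE (mean absolute percentage error) used throughout is the average of $|y_t - \hat{y}_t| / |y_t|$ over the validation period. A minimal sketch of the computation follows; `mape_percent` is a hypothetical helper written for this illustration, and the numbers are made up.

```python
import numpy as np

def mape_percent(y_true, y_pred):
    """Mean absolute percentage error, in percent (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Errors are scaled by the TRUE values, so the argument order matters
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

toy_mape = mape_percent([100, 200, 400], [110, 180, 400])
print(f'{toy_mape:.2f}%')  # (10% + 10% + 0%) / 3 = 6.67%
```

Because the errors are divided by the observed values, swapping the two arguments generally gives a different number.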

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(air_passengers, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, naive_forecast, color='red', label='Seasonal Naive Forecast', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.title('Seasonal Naive Forecast')
plt.legend()
plt.grid(True)
plt.show()

What happened in the last year of data is repeated as a forecast for the
entire validation set.

## 2. Exponential Smoothing

In exponential smoothing we give a declining weight to observations: the
more recent an observation, the more importance it will have in our
forecast.

Further components can be added: for instance a trend component
(**Holt method**), or a trend plus a seasonal component (**Holt-Winters**).
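To make the "declining weight" idea concrete, here is a minimal sketch of simple exponential smoothing (no trend, no seasonality). The function name `simple_exp_smoothing`, the smoothing parameter value and the toy data are assumptions for illustration; initialising the level with the first observation is one common choice among several.

```python
import numpy as np

def simple_exp_smoothing(y, alpha):
    """Return the smoothed level after each observation (hypothetical helper)."""
    level = y[0]                      # initial level: first observation (an assumption)
    levels = [level]
    for obs in y[1:]:
        # New level = weighted average of the latest observation and the previous level
        level = alpha * obs + (1 - alpha) * level
        levels.append(level)
    return np.array(levels)

alpha = 0.5
# Implied weights on the 4 most recent observations: alpha * (1 - alpha)^k
weights = alpha * (1 - alpha) ** np.arange(4)
print(weights)  # 0.5, 0.25, 0.125, 0.0625: halving at each step back in time

toy_y = np.array([10.0, 12.0, 11.0, 13.0])
ses_forecast = simple_exp_smoothing(toy_y, alpha)[-1]
print(ses_forecast)  # the last level is the forecast for the next period
```

A larger `alpha` makes the weights decay faster, so the forecast reacts more quickly to recent observations.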

### Holt / Holt-Winters method

The model can be additive or multiplicative:

$$
y_t = f(S_t, T_t, E_t)
$$

-   $S_t$: seasonal component
-   $T_t$: trend component
-   $E_t$: error (remainder)

-   **Additive model**: $y_t = S_t + T_t + E_t$
-   **Multiplicative model**: $y_t = S_t \cdot T_t \cdot E_t$

In the code, each component type (error, trend, seasonal) is set to
`'add'` (additive) or `'mul'` (multiplicative).
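The difference between the two forms can be illustrated on synthetic data. The sketch below builds a toy trend and seasonal pattern (all values are assumptions, and the error term is left out for clarity): in the additive series the seasonal swing stays constant, while in the multiplicative series it grows with the level.

```python
import numpy as np

t = np.arange(24)                                    # two years of monthly time steps
trend = 100 + 2.0 * t                                # toy linear trend
season_add = 10 * np.sin(2 * np.pi * t / 12)         # fixed-size seasonal swing
season_mul = 1 + 0.1 * np.sin(2 * np.pi * t / 12)    # swing as a fraction of the level

y_add = trend + season_add                           # additive composition
y_mul = trend * season_mul                           # multiplicative composition

# Largest seasonal deviation from the trend in year 1 and in year 2
amp_add = [np.max(y_add[:12] - trend[:12]), np.max(y_add[12:] - trend[12:])]
amp_mul = [np.max(y_mul[:12] - trend[:12]), np.max(y_mul[12:] - trend[12:])]
print(amp_add)  # same amplitude in both years
print(amp_mul)  # amplitude grows with the trend

# Taking logs turns the multiplicative model into an additive one
log_is_additive = np.allclose(np.log(y_mul), np.log(trend) + np.log(season_mul))
```

This is also why a log transform is often applied to series whose seasonal swings grow with the level before fitting an additive model.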

In [None]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit a Holt-Winters model with additive trend and additive seasonality
ets_model = ExponentialSmoothing(
    training,
    trend='add',
    seasonal='add',
    seasonal_periods=12
).fit(optimized=True)

#ets_model = ETSModel(training, error='add', trend='add', seasonal='add', damped_trend=True)
#ets_model = ets_model.fit()

# Forecast
ets_forecast = ets_model.forecast(len(validation))

# Compute MAPE
ets_mape = mean_absolute_percentage_error(validation, ets_forecast) * 100
print(f'MAPE (ETS): {ets_mape:.2f}%')

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(air_passengers, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, ets_forecast, color='red', label='ETS Forecast (additive)', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.title('Exponential smoothing - additive model')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# Fit the ETS model (multiplicative error, trend and seasonality, damped trend)
ets_model = ETSModel(training, error='mul', trend='mul', seasonal='mul',
                     damped_trend=True, seasonal_periods=12)
ets_fitted = ets_model.fit()

# Alternative with the Holt-Winters API:
#ets_model = ExponentialSmoothing(
#    training,
#    trend='mul',
#    seasonal='mul',
#    seasonal_periods=12
#)
#ets_fitted = ets_model.fit(optimized=True)

# Forecast for the validation period
forecast_horizon = len(validation)
ets_forecast_mul = ets_fitted.forecast(steps=forecast_horizon)

# Calculate MAPE
mape = mean_absolute_percentage_error(validation, ets_forecast_mul) * 100
print(f'MAPE: {mape:.2f}%')

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(air_passengers, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, ets_forecast_mul, color='red', label='ETS Forecast (multiplicative)', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.title('Exponential smoothing - multiplicative model')
plt.legend()
plt.grid(True)
plt.show()

## 3. ARIMA/SARIMA models

ARIMA stands for Autoregressive Integrated Moving Average.

ARIMA models contain three things:

-   AR(p): autoregressive part of the model. Means that we use $p$ past
    observations from the time series as predictors
-   Differencing (**d**): used to transform the time series into a
    stationary sequence by taking the differences between successive
    observations; $d$ is the number of times the differencing is applied
-   MA(q): Moving Average - uses $q$ past forecast errors as predictors

If you need to add a seasonal component to the model you can use SARIMA
(Seasonal ARIMA).
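The effect of the differencing steps can be checked on a toy series. In the sketch below (synthetic values, chosen only for illustration), seasonal differencing at lag 12 removes the repeating monthly pattern, and one further regular difference removes what remains of the linear trend. This illustrates the $D$ and $d$ terms only, not the AR or MA parts.

```python
import numpy as np

t = np.arange(36)
# Toy series: linear trend plus a 12-month pattern repeated 3 times (hypothetical data)
pattern = np.tile(np.array([0, 2, 5, 3, 1, 4, 8, 9, 6, 2, 1, 3]), 3)
toy = 50 + 1.5 * t + pattern

# Seasonal differencing (D=1, S=12): y_t - y_{t-12} cancels the repeating pattern
seasonal_diff = toy[12:] - toy[:-12]
print(seasonal_diff[:5])   # constant: the trend gains 1.5 * 12 = 18 per year

# Regular differencing (d=1): differences between successive observations
first_diff = np.diff(seasonal_diff)
print(first_diff[:5])      # zeros: the differenced toy series is stationary
```

On real data the result is not exactly constant, but the same operations are what the `d` and `D` entries of `order` and `seasonal_order` request.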


In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_percentage_error

# Fit SARIMA model (p=2, d=1, q=1) with seasonal part (P=1, D=1, Q=1, S=12)
sarima_model = SARIMAX(training,
                       order=(2, 1, 1),               # (p, d, q): AR order, differencing order, MA order
                                                      # e.g. AR(0) is white noise; AR(1) with coefficient 1 is a random walk
                       seasonal_order=(1, 1, 1, 12),  # seasonal parameters (P, D, Q, S), S = season length
                       enforce_stationarity=False,
                       enforce_invertibility=False)
sarima_result = sarima_model.fit(disp=False)

# Forecast
forecast_horizon = len(validation)
sarima_forecast = sarima_result.forecast(steps=forecast_horizon)

# Compute MAPE
mape = mean_absolute_percentage_error(validation, sarima_forecast) * 100
print(f'MAPE: {mape:.2f}%')

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(air_passengers, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, sarima_forecast, color='red', label='SARIMA Forecast', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.title('SARIMA model')
plt.legend()
plt.grid(True)
plt.show()

## 4. Setting up a one-step-ahead forecast

In all the previous examples, we forecast 4 years into the future (48
months) in one go. However, if you produce a new forecast every month,
why would you rely on a forecast made up to 4 years earlier when you
could use the actual observations to predict the next month?

The idea of setting up a one-step-ahead forecast is to evaluate how well
a model would have done if you had forecast one month ahead, every month
for 4 years, each time using the latest observations.

Simply put: instead of forecasting once for the 48 months ahead, we
forecast 48 times for the upcoming month, using latest observations.

Coding this is quite simple. All we need is to iteratively add the
latest observation to the training dataset, forecast from there and
repeat.
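The loop itself does not depend on the model. As a minimal sketch under stated assumptions, the expanding-window scheme can be written with a pluggable forecaster; here a seasonal naive rule stands in for the real model, and `walk_forward`, `seasonal_naive_one` and the toy data are hypothetical names and values introduced for illustration.

```python
import numpy as np

def walk_forward(series, n_train, forecast_one):
    """One-step-ahead evaluation with an expanding window (hypothetical helper).

    forecast_one(history) must return a forecast for the next point."""
    preds = []
    for i in range(n_train, len(series)):
        history = series[:i]              # everything observed before time i
        preds.append(forecast_one(history))
    return np.array(preds)

def seasonal_naive_one(history, season_length=12):
    """Forecast the next value as the value observed one season earlier."""
    return history[-season_length]

t = np.arange(36)
toy = 100 + t + 10 * np.sin(2 * np.pi * t / 12)   # toy trend + seasonality
wf_preds = walk_forward(toy, 24, seasonal_naive_one)
wf_actual = toy[24:]

print(wf_actual - wf_preds)  # constant error of 12: one year of trend growth
```

Note that `history` never contains the value being forecast: the window ends just before the target point, which is what keeps the evaluation honest.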

In [None]:
# Assume `air_passengers` is the full series (Pandas Series)
# training: initial training set up to some date
# validation: the next 48 months to validate against
nmonths = 48
one_step_ahead_sarima = np.zeros((nmonths, 2))

for i in range(nmonths):
    # Extend training window by 1 each loop
    end_idx = len(training) + i
    training_observed = air_passengers.iloc[:end_idx]

    # Fit SARIMA model
    model = SARIMAX(training_observed,
                    order=(0, 1, 1),
                    seasonal_order=(1, 1, 0, 12),
                    enforce_stationarity=False,
                    enforce_invertibility=False)
    results = model.fit(disp=False)

    # One-step-ahead forecast
    forecast = results.forecast(steps=1).iloc[0]

    # Store actual and predicted values
    observed = validation.iloc[i]
    one_step_ahead_sarima[i, 0] = observed
    one_step_ahead_sarima[i, 1] = forecast

# Compute MAPE
mape = mean_absolute_percentage_error(one_step_ahead_sarima[:, 0], one_step_ahead_sarima[:, 1]) * 100
print(f'MAPE: {mape:.2f}%')

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(air_passengers, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, one_step_ahead_sarima[:,1], color='red', label='One-step-ahead SARIMA Forecast', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.title('One-step ahead SARIMA forecast')
plt.legend()
plt.grid(True)
plt.show()

## Stillbirth data

European Stillbirth Rate Time Series Dataset (data repo at:
<https://zenodo.org/record/6505519>)

In [None]:
url = "https://raw.githubusercontent.com/filippob/longitudinal_data_analysis/refs/heads/main/data/stillbirth/sbr_all.xlsx"
stillbirth = pd.read_excel(url, sheet_name=1)

stillbirth

In [None]:
stillbirth.shape

In [None]:
temp = stillbirth[['year', 'sbr_swe']]
temp = temp.dropna()
temp.shape

In [None]:
# Generate a datetime index from 1775 to 2021 with yearly frequency
years = pd.date_range(start='1775', end='2022', freq='Y')

# Create time series from the 'sbr_swe' column
temp_series = pd.Series(temp['sbr_swe'].values[:len(years)], index=years)

In [None]:
plt.plot(temp_series, color='blue', label='Actual', linewidth=1)

In [None]:
# Create training and validation sets: time-wise
training = temp_series['1775':'1999']
validation = temp_series['2000':]

In [None]:
validation.head()

In [None]:
# Fit SARIMA model
model = SARIMAX(training,
                    order=(0, 1, 1),
                    seasonal_order=(1, 1, 0, 12),
                    enforce_stationarity=False,
                    enforce_invertibility=False)
results = model.fit(disp=False)

# Forecast
forecast_horizon = len(validation)
sarima_forecast = results.forecast(steps=forecast_horizon)

In [None]:
# Compute MAPE
mape = mean_absolute_percentage_error(validation, sarima_forecast) * 100
print(f'MAPE: {mape:.2f}%')

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(temp_series, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, sarima_forecast, color='red', label='SARIMA Forecast', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Stillbirth rate')
plt.title('SARIMA forecast - stillbirth rate')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
nyears = 22
one_step_ahead_sarima = np.zeros((nyears, 2))

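# Loop through each year in validation
for i in range(nyears):
    # Extend the training set by one year in each loop.
    # Label-based slicing is inclusive, so the window must end the year
    # before the one being forecast (1999 when forecasting 2000).
    end_year = 1999 + i
    training_observed = temp_series[:str(end_year)]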

    # Fit SARIMA model: (1,1,1) x (1,1,0,10)
    model = SARIMAX(training_observed,
                    order=(1, 1, 1),
                    seasonal_order=(1, 1, 0, 10),
                    enforce_stationarity=False,
                    enforce_invertibility=False)
    results = model.fit(disp=False)

    # One-step-ahead forecast
    forecast = results.forecast(steps=1).iloc[0]

    # Actual value from validation
    observed = validation.iloc[i]

    # Store observed and predicted
    one_step_ahead_sarima[i, 0] = observed
    one_step_ahead_sarima[i, 1] = forecast

# Compute MAPE
mape = mean_absolute_percentage_error(one_step_ahead_sarima[:, 0], one_step_ahead_sarima[:, 1]) * 100
print(f'MAPE: {mape:.2f}%')

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(temp_series, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, one_step_ahead_sarima[:,1], color='red', label='One-step-ahead SARIMA Forecast', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Stillbirth rate')
plt.title('One-step ahead SARIMA forecast')
plt.legend()
plt.grid(True)
plt.show()

## Normalising time series data?

In some circumstances, time series data may need to be normalised: e.g.
when future data fall outside the range of the training data, especially
when the forecasting method does not account for seasonality, trend etc.

We can use the percent change along the sequence:

In [None]:
# Calculate percent change
x = temp['sbr_swe'].diff().iloc[1:]               # First difference (skip NaN)
z = temp['sbr_swe'].shift(1).iloc[1:]             # Previous values
percent_change = (x / z) * 100

In [None]:
# Add leading 0 to match length, then attach the yearly datetime index
sbr_swe_norm = pd.concat([pd.Series([0]), percent_change])
sbr_swe_norm = pd.Series(sbr_swe_norm.values, index=years)

In [None]:
sbr_swe_norm.head()

In [None]:
plt.plot(sbr_swe_norm)
plt.title("Normalised stillbirth rate data")

### Splitting data

In [None]:
#Create training and validation sets: time-wise
training = sbr_swe_norm['1775':'1999']
validation = sbr_swe_norm['2000':]

### Fitting the forecasting model on the training data

In [None]:
# Fit SARIMA model
model = SARIMAX(training,
                    order=(0, 1, 1),
                    seasonal_order=(1, 1, 0, 12),
                    enforce_stationarity=False,
                    enforce_invertibility=False)
results = model.fit(disp=False)

# Forecast
forecast_horizon = len(validation)
sarima_forecast_norm = results.forecast(steps=forecast_horizon)

### Backtransform

We now have test values expressed as sequential percent differences. To
evaluate our model, we need to backtransform the data to the original
stillbirth rate:

-   divide by 100 (to remove the percent scaling)
-   multiply by the original data shifted back by one year (i.e. the
    previous year's value): this gives the vector of sequential
    differences
-   add the same shifted original data to these differences: this
    recovers the original validation data
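The round trip described in these steps can be verified on a handful of numbers. The sketch below uses made-up rates (the array `toy_rates` and the other names are assumptions for illustration): it converts to percent change and then applies the three back-transform steps.

```python
import numpy as np

toy_rates = np.array([4.0, 3.8, 3.9, 3.5])     # hypothetical values on the original scale

prev = toy_rates[:-1]                           # previous-year values (series shifted by 1)
pct = (toy_rates[1:] - prev) / prev * 100       # forward transform: percent change

diffs = pct / 100 * prev                        # steps 1-2: back to sequential differences
restored = prev + diffs                         # step 3: add the previous-year values

print(restored)  # matches toy_rates[1:]
```

The same arithmetic can be written in one line as `prev * (1 + pct / 100)`.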

In [None]:
## validation starts in year 2000; we also need 1999 as the "previous value" for 2000
valid_orig = temp_series['1999':]
d = np.array(validation / 100) * np.array(valid_orig.iloc[:-1])  ## vector of sequential differences
d + valid_orig.iloc[:-1]

This was easy (actually, we already had the original validation data,
this was mainly a sanity check test). We need to do the same thing for
the model predictions, to bring them on the same scale as the original
stillbirth rate.

In [None]:
valid_orig.head()

In [None]:
# Use the forecast made on the normalised (percent change) scale
d = np.array(sarima_forecast_norm / 100) * np.array(valid_orig.iloc[:-1])
backtransformed_pred = d + np.array(valid_orig.iloc[:-1])

In [None]:
backtransformed_pred

In [None]:
# Compute MAPE (argument order: observed values first, predictions second)
mape = mean_absolute_percentage_error(valid_orig.iloc[1:], backtransformed_pred) * 100
print(f'MAPE: {mape:.2f}%')

In [None]:
# Plot the full original series
plt.figure(figsize=(10, 6))
plt.plot(temp_series, color='blue', label='Actual', linewidth=1)

# Create a datetime index for the forecast
forecast_index = validation.index
plt.plot(forecast_index, backtransformed_pred, color='red', label='SARIMA Forecast (back-transformed)', linewidth=2)

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Stillbirth rate')
plt.title('SARIMA forecast from normalised data')
plt.legend()
plt.grid(True)
plt.show()

**Q: would normalization improve predictions with the naive method? Try
it!**