import warnings
warnings.filterwarnings('ignore')

<h1 style = "fontsize:400%;text-align:center;">QBUS3850: Time Series and Forecasting</h1>
<h2 style = "fontsize:300%;text-align:center;">Forecast Combination</h2>
<h3 style = "fontsize:200%;text-align:center;">Lecture Notes</h3>

<h2 style = "fontsize:300%;text-align:center;">Forecast Combination</h2>

- A game is to guess the number of jelly beans in a jar.
- While individual guesses are wrong, the average of many guesses will be close to the answer.
- In forecasting, we can improve forecasts by averaging over different models.
- The same principle works for expert judgements.

<img src="beans.jpeg" alt="beans" width="600"/>


# Wisdom of the crowd (of forecasters)

- In a seminal 1969 paper Bates and Granger propose forecast combination.
- Consider the case of two forecasts.
- Forecast quality measured by Mean Square Error of forecasts.
- They are able to show that
  - Combination weights depend on variances and covariances of forecast errors.
  - With optimal weights, the combined forecast has a mean square error no larger than that of the best individual forecast.

# The math

Let $\hat{y}_{1}$ and $\hat{y}_{2}$ be two forecasts, and $w$ be a combination weight. The combined forecast $\hat{y}_c$ is given by

$$\hat{y}_{c}=w\hat{y}_{1}+(1-w)\hat{y}_{2}$$

Also let $\sigma^2_1$ and $\sigma^2_2$ be the forecast error variances of the two forecasts respectively.

$$\sigma^2_1 = E\left[(y-\hat{y}_{1})^2\right]\quad\textrm{and}\quad\sigma^2_2 = E\left[(y-\hat{y}_{2})^2\right]$$

This expectation is with respect to forecast errors (not in-sample errors).

# More detail

Each forecast is made for time $t+h$ using information available at time $t$. For each forecast, the expectation is taken with respect to the conditional distribution given information at time $t$:

$$\sigma^2 = E_{y_{t+h|t}}\left[(y_{t+h}-\hat{y}_{t+h|t})^2\right]$$

If forecasts are unbiased $E_{t+h|t}\left[\hat{y}_{t+h|t}\right]=y_{t+h}$. This means the expected square error is given by

$$\sigma^2 = E_{y_{t+h|t}}\left[\left(\hat{y}_{t+h|t}-E_{t+h|t}[\hat{y}_{t+h|t}]\right)^2\right]$$

Each $\sigma^2$ is the forecast error variance. From now on let's keep the notation simple.

# Expected value of combination


$$E\left[\hat{y}_{c}\right]=wE\left[\hat{y}_{1}\right]+(1-w)E\left[\hat{y}_{2}\right]$$

If forecasts are unbiased, then

$$E\left[\hat{y}_{c}\right]=wy+(1-w)y=y$$

Combinations of unbiased forecasts are also unbiased (if weights sum to one).

# Variance of combination

$$\begin{aligned}E\left[(\hat{y}_{c}-y)^2\right]&=E\left[(w\hat{y}_1+(1-w)\hat{y}_2-y)^2\right]\\&=E\left[(w\hat{y}_1+(1-w)\hat{y}_2-wy-(1-w)y)^2\right]\\&=E\left[\left(w(\hat{y}_{1}-y)+(1-w)(\hat{y}_{2}-y)\right)^2\right]\\&=E\left[w^2(\hat{y}_{1}-y)^2+2w(1-w)(\hat{y}_{1}-y)(\hat{y}_{2}-y)+(1-w)^2(\hat{y}_{2}-y)^2\right]\\&=w^2\sigma^2_1+2w(1-w)\rho\sigma_1\sigma_2+(1-w)^2\sigma_2^2\end{aligned}$$

Where $\rho$ is the correlation between the two forecast errors.
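As a quick sanity check, the sketch below compares the closed-form combination variance with a simulated value. The numbers for $\sigma_1$, $\sigma_2$, $\rho$ and $w$ are made up for illustration, and the forecast errors are assumed to be jointly normal with mean zero (an assumption that the derivation itself does not need).

```python
import numpy as np

def combination_mse(w, sigma1, sigma2, rho):
    """Expected squared error of w*yhat1 + (1-w)*yhat2 for unbiased forecasts."""
    return (w**2 * sigma1**2
            + 2 * w * (1 - w) * rho * sigma1 * sigma2
            + (1 - w)**2 * sigma2**2)

# Illustrative (made-up) values, not taken from any data set
sigma1, sigma2, rho, w = 1.0, 2.0, 0.3, 0.7

# Simulate zero-mean forecast errors with the chosen variances and correlation
rng = np.random.default_rng(0)
cov = [[sigma1**2, rho * sigma1 * sigma2],
       [rho * sigma1 * sigma2, sigma2**2]]
errors = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

# Error of the combined forecast is the same weighted sum of individual errors
combined_error = w * errors[:, 0] + (1 - w) * errors[:, 1]

simulated_mse = np.mean(combined_error**2)
theoretical_mse = combination_mse(w, sigma1, sigma2, rho)
print(theoretical_mse, simulated_mse)
```

The two printed numbers should agree up to simulation noise.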



# Optimal Combination Weights

Minimising the above expression with respect to $w$ gives the optimal weight

$$w^{(\textrm{opt})}=\frac{\sigma^2_2-\rho\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2-2\rho\sigma_1\sigma_2}$$

For the case of uncorrelated forecasts this simplifies to:

$$w^{(\textrm{opt})}=\frac{\sigma^2_2}{\sigma_1^2+\sigma_2^2}$$

It can be proven that the optimally weighted combination has a forecast error variance no larger than that of either individual forecast.
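A minimal sketch of the two-forecast case follows. Setting the derivative of the combination variance to zero gives a weight whose denominator is $\sigma_1^2+\sigma_2^2-2\rho\sigma_1\sigma_2$, the variance of the difference between the two forecast errors. The function names and the numerical inputs are illustrative assumptions.

```python
import numpy as np

def combination_mse(w, sigma1, sigma2, rho):
    """Expected squared error of the combined forecast."""
    return (w**2 * sigma1**2
            + 2 * w * (1 - w) * rho * sigma1 * sigma2
            + (1 - w)**2 * sigma2**2)

def optimal_weight(sigma1, sigma2, rho):
    """Weight on forecast 1 that minimises the combination variance."""
    numerator = sigma2**2 - rho * sigma1 * sigma2
    denominator = sigma1**2 + sigma2**2 - 2 * rho * sigma1 * sigma2
    return numerator / denominator

# Illustrative (made-up) values
sigma1, sigma2, rho = 1.0, 2.0, 0.3

w_opt = optimal_weight(sigma1, sigma2, rho)          # 3.4 / 3.8, about 0.895
mse_opt = combination_mse(w_opt, sigma1, sigma2, rho)

# Check against a brute-force grid search over candidate weights
grid = np.linspace(-1.0, 2.0, 3001)
w_grid = grid[np.argmin(combination_mse(grid, sigma1, sigma2, rho))]

print(w_opt, w_grid, mse_opt, min(sigma1**2, sigma2**2))
```

In this example the more accurate forecast receives most of the weight, and the combined variance is below the smaller of the two individual variances.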

# Your turn...

- When $\sigma_1$ is high, is $w^{(\textrm{opt})}$ bigger or smaller?
  - Does this make sense?
- When $\sigma_2$ is high, is $w^{(\textrm{opt})}$ bigger or smaller?
  - Does this make sense?
- When $\rho$ is high, is $w^{(\textrm{opt})}$ bigger or smaller?
  - Does this make sense?
  
  
  

# The general case

For more than two forecasts, the optimal weights solve

$$\boldsymbol{w}^{(\textrm{opt})}=\underset{\mathbf{w}}{\operatorname{argmin}}\,\mathbf{w}'\boldsymbol{\Sigma}\mathbf{w}\quad\textrm{s.t.}\quad\boldsymbol{\iota}'\mathbf{w}=1$$

Where $\boldsymbol{\Sigma}$ is the covariance matrix of the forecast errors and $\boldsymbol{\iota}$ is a column of 1's. This can be solved as

$$\boldsymbol{w}^{(\textrm{opt})}=\frac{{\boldsymbol{\Sigma^{-1}\iota}}}{\boldsymbol{\iota'\Sigma^{-1}\iota}}$$
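The closed-form solution can be sketched in a few lines of NumPy. The covariance matrices below are made-up examples, and `optimal_weights` is a hypothetical helper name. The linear system is solved directly rather than forming $\boldsymbol{\Sigma}^{-1}$ explicitly, which is usually the more stable choice.

```python
import numpy as np

def optimal_weights(Sigma):
    """Minimum-variance weights: Sigma^{-1} iota / (iota' Sigma^{-1} iota)."""
    Sigma = np.asarray(Sigma, dtype=float)
    iota = np.ones(Sigma.shape[0])
    x = np.linalg.solve(Sigma, iota)   # x = Sigma^{-1} iota
    return x / (iota @ x)              # normalise so the weights sum to one

# Two-forecast check: sigma1 = 1, sigma2 = 2, rho = 0.3 (covariance 0.6)
Sigma2 = np.array([[1.0, 0.6],
                   [0.6, 4.0]])
w2 = optimal_weights(Sigma2)

# A made-up three-forecast example
Sigma3 = np.array([[1.0, 0.3, 0.2],
                   [0.3, 2.0, 0.5],
                   [0.2, 0.5, 1.5]])
w3 = optimal_weights(Sigma3)

print(w2, w3, w3 @ Sigma3 @ w3)
```

For the two-forecast matrix, the first weight equals $(\sigma_2^2-\rho\sigma_1\sigma_2)/(\sigma_1^2+\sigma_2^2-2\rho\sigma_1\sigma_2)=3.4/3.8$.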



# Estimating $\sigma_j$ and $\rho$

- In practice the forecast variances are not known and need to be estimated.
- For statistical models we have expressions for the forecast variance, but these are not always available.
  - For example, consider expert judgments.
- As long as we have forecasts and true values (e.g. via a rolling window), we can estimate $\sigma_j$ and $\rho$ and use these to calculate combination weights.
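The sketch below shows one way the estimation step might look for two forecasts. The data are simulated purely for illustration (in practice `y`, `f1` and `f2` would come from a rolling-window evaluation), and `estimate_combination` is a hypothetical helper. Moments are taken around zero, which assumes the forecasts are unbiased.

```python
import numpy as np

def estimate_combination(y, f1, f2):
    """Estimate error variances, error correlation and the optimal weight."""
    e1, e2 = y - f1, y - f2
    var1 = np.mean(e1**2)            # mean square error of forecast 1
    var2 = np.mean(e2**2)            # mean square error of forecast 2
    cov12 = np.mean(e1 * e2)         # cross moment of the two errors
    rho = cov12 / np.sqrt(var1 * var2)
    w = (var2 - cov12) / (var1 + var2 - 2 * cov12)
    return var1, var2, rho, w

# Simulated example: both forecasts share a common error component,
# and forecast 1 is more accurate than forecast 2
rng = np.random.default_rng(1)
T = 200
y = 10 + rng.normal(size=T)
shared = rng.normal(size=T)
f1 = y + 0.5 * shared + rng.normal(scale=0.8, size=T)
f2 = y + 0.5 * shared + rng.normal(scale=1.5, size=T)

var1_hat, var2_hat, rho_hat, w_hat = estimate_combination(y, f1, f2)
print(var1_hat, var2_hat, rho_hat, w_hat)
```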

# Strategies (2 forecast case)

Let $\hat\sigma^2_1$ and $\hat\sigma^2_2$ be the mean square (forecast) errors. Let $T$ be the time at which we want to form combination weights. Methods include:

- $w_T=\frac{\hat\sigma^2_2}{\hat\sigma^2_1+\hat\sigma^2_2}$
- $w_T=\gamma w_{T-1}+(1-\gamma)\frac{\hat\sigma^2_2}{\hat\sigma^2_1+\hat\sigma^2_2}$

Alternatively, variances (and covariances) can be computed as weighted sums, with higher weights for more recent errors:

$$\tilde\sigma^2_1=\sum_{t=1}^{T} \alpha_t(y_t-\hat{y}_{1,t})^2$$

where $\alpha_t$ is increasing in $t$ and a similar expression is used to estimate $\tilde\sigma^2_2$.
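These strategies can be sketched as small functions. The geometric form $\alpha_t\propto\delta^{T-t}$ with $0<\delta<1$ is only one possible choice of increasing weights and is an assumption here, as are the function names and the toy error series.

```python
import numpy as np

def mse_weight(e1, e2):
    """Weight on forecast 1 from the ratio of mean square errors."""
    mse1, mse2 = np.mean(e1**2), np.mean(e2**2)
    return mse2 / (mse1 + mse2)

def smoothed_weight(w_prev, e1, e2, gamma):
    """Blend the previous weight with the newly estimated one."""
    return gamma * w_prev + (1 - gamma) * mse_weight(e1, e2)

def discounted_mse(e, delta):
    """Weighted MSE with alpha_t proportional to delta**(T - t)."""
    T = len(e)
    alpha = delta ** np.arange(T - 1, -1, -1)   # increasing in t when delta < 1
    alpha = alpha / alpha.sum()                 # normalise weights to sum to one
    return np.sum(alpha * e**2)

# Toy forecast errors (made up): forecast 2 has errors twice as large
e1 = np.array([1.0, -1.0, 1.0, -1.0])
e2 = np.array([2.0, -2.0, 2.0, -2.0])

w_T = mse_weight(e1, e2)                                  # 4 / (1 + 4) = 0.8
w_smooth = smoothed_weight(0.5, e1, e2, gamma=0.5)        # 0.5*0.5 + 0.5*0.8 = 0.65

# A forecast that has deteriorated recently is penalised more heavily
e_recent = np.array([0.0, 0.0, 0.0, 2.0])
print(w_T, w_smooth, np.mean(e_recent**2), discounted_mse(e_recent, delta=0.5))
```

A larger $\gamma$ makes the weights change more slowly over time, while a smaller $\delta$ makes the variance estimate react faster to recent errors.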

<h2 style = "fontsize:300%;text-align:center;">Forecast Combination Puzzle</h2>

# The puzzle

- The theory shows that equal weights (i.e. $1/K$ for each of $K$ forecasts) are not guaranteed to be optimal.
- However, years of forecasting practice and research have shown that equal weights often outperform so-called optimal weights.
- This is known as the *forecast combination puzzle*.
- There are several explanations of the puzzle.

# Explanation

- Recall from a few slides back that, to show combinations are unbiased, we used

$$E\left[\hat{y}_{c}\right]=wE\left[\hat{y}_{1}\right]+(1-w)E\left[\hat{y}_{2}\right]$$

- This line of math assumes that $w$ is fixed. However, as we have seen, it is often estimated from data.
- The need to estimate $w$ can introduce a bias in the forecast combination.
- In contrast, equal weights are truly non-random.
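A small simulation can illustrate one side of this argument, namely the extra error caused by randomness in estimated weights. The setup is entirely assumed: two unbiased forecasts with uncorrelated normal errors, slightly different error standard deviations (1.0 and 1.1), and weights estimated from only 10 past errors using the mean square error ratio. For each estimated weight, the expected out-of-sample MSE is computed from the known variances.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma1, sigma2 = 1.0, 1.1        # assumed true error standard deviations
n_train, n_rep = 10, 5000        # short estimation window, many replications

def expected_mse(w):
    """True expected squared error of the combination (uncorrelated errors)."""
    return w**2 * sigma1**2 + (1 - w)**2 * sigma2**2

# Past forecast errors used to estimate the weights, one row per replication
e1 = rng.normal(0.0, sigma1, size=(n_rep, n_train))
e2 = rng.normal(0.0, sigma2, size=(n_rep, n_train))
mse1 = np.mean(e1**2, axis=1)
mse2 = np.mean(e2**2, axis=1)
w_hat = mse2 / (mse1 + mse2)     # estimated weight differs in every replication

w_opt = sigma2**2 / (sigma1**2 + sigma2**2)

mse_optimal = expected_mse(w_opt)            # infeasible: needs the true variances
mse_equal = expected_mse(0.5)                # equal weights, nothing to estimate
mse_estimated = expected_mse(w_hat).mean()   # averaged over estimated weights

print(mse_optimal, mse_equal, mse_estimated)
```

In this setup the true optimal weights beat equal weights only slightly, and the estimated weights lose more from estimation noise than they gain from being centred near the optimum.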

# Visually

- The bottom curve shows the expected mean square error of combined forecasts against the value of the weight.
  - F is the optimal point
  - E is equal weights
- The top curve is higher due to randomness of weights.
  - R is the optimal point

<img src="puzzle.png" alt="puzzle" width="900"/>

Source: [Claeskens et al. (2016)](https://www.sciencedirect.com/science/article/abs/pii/S0169207016000327)

# General case uncorrelated forecasts

If forecast errors are uncorrelated, then $\boldsymbol{\Sigma}$ is diagonal and each weight is proportional to the inverse of the corresponding forecast error variance. For weight $j$ when there are $K$ forecasts in total

$$w_j=\frac{1/\sigma^2_j}{\sum\limits_{i=1}^K1/\sigma_i^2}$$

This provides a simple way of getting weights that does not rely on estimating covariances.
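A minimal sketch of inverse-variance weighting, using made-up variances and a hypothetical helper name, together with a check that it matches the general formula $\boldsymbol{\Sigma}^{-1}\boldsymbol{\iota}/(\boldsymbol{\iota}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\iota})$ when $\boldsymbol{\Sigma}$ is diagonal:

```python
import numpy as np

def inverse_variance_weights(variances):
    """Weights proportional to 1 / sigma_j^2, normalised to sum to one."""
    inv = 1.0 / np.asarray(variances, dtype=float)
    return inv / inv.sum()

# Made-up forecast error variances for K = 3 forecasts
variances = np.array([1.0, 4.0, 2.0])
w_inv = inverse_variance_weights(variances)

# Same answer from the general solution with a diagonal covariance matrix
Sigma = np.diag(variances)
x = np.linalg.solve(Sigma, np.ones(3))
w_general = x / x.sum()

print(w_inv, w_general)
```

Here the inverse variances are 1, 0.25 and 0.5, so the weights are 4/7, 1/7 and 2/7.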

# Shrink to equal weights

Diebold and Shin (2019) propose using regularisation to shrink towards equal weights. Rewrite the problem as a regression model

$y=w_1\hat{y}_1+w_2\hat{y}_2+w_3\hat{y}_3+\dots+w_K\hat{y}_K+\epsilon$

Rather than only minimising the residual sum of squares

$RSS = \sum(y-w_1\hat{y}_1-w_2\hat{y}_2-w_3\hat{y}_3-\dots-w_K\hat{y}_K)^2$

add an L1 or L2 penalty on the $w$'s that shrinks them towards $1/K$.

# Shrink to equal weights

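- Egalitarian Ridge

$$\mathbf{w}^{(er)}=\underset{\mathbf{w}}{\operatorname{argmin}}\left(RSS+\lambda\sum\limits_{k=1}^K(w_k-(1/K))^2\right)$$

- Egalitarian Lasso

$$\mathbf{w}^{(el)}=\underset{\mathbf{w}}{\operatorname{argmin}}\left(RSS+\lambda\sum\limits^K_{k=1}|w_k-(1/K)|\right)$$

Here $\lambda\geq 0$ is a tuning parameter that controls the strength of shrinkage towards equal weights.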
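The egalitarian ridge has a closed-form solution, sketched below. Writing $\mathbf{F}$ for the matrix whose columns are the $K$ forecasts and `lam` for the penalty strength (a tuning parameter assumed here, with `lam=1` giving an unscaled penalty), the first-order condition is $(\mathbf{F}'\mathbf{F}+\lambda\mathbf{I})\mathbf{w}=\mathbf{F}'\mathbf{y}+\lambda\boldsymbol{\iota}/K$. The data and the function name are illustrative assumptions. The egalitarian lasso has no closed form and is not covered by this sketch.

```python
import numpy as np

def egalitarian_ridge(F, y, lam):
    """Minimise ||y - F w||^2 + lam * ||w - 1/K||^2 over w."""
    F = np.asarray(F, dtype=float)
    y = np.asarray(y, dtype=float)
    K = F.shape[1]
    A = F.T @ F + lam * np.eye(K)
    b = F.T @ y + lam * np.ones(K) / K
    return np.linalg.solve(A, b)

# Simulated data: three unbiased forecasts with different noise levels
rng = np.random.default_rng(0)
T = 50
y = 10 + rng.normal(size=T)
F = y[:, None] + rng.normal(scale=[1.0, 1.5, 2.0], size=(T, 3))

w_ols = egalitarian_ridge(F, y, lam=0.0)      # no shrinkage: least squares
w_mid = egalitarian_ridge(F, y, lam=50.0)     # partial shrinkage
w_big = egalitarian_ridge(F, y, lam=1e8)      # essentially equal weights

print(w_ols, w_mid, w_big)
```

With `lam=0` the weights are the unpenalised least squares weights, and as `lam` grows they approach $1/K$ for every forecast.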