Chapters/01-Introduction.Rmd

# Introduction {#intro}
I started writing this book in 2020 during the COVID-19 pandemic, having figured out that it had been more than ten years since publication of the fundamental monograph of @Hyndman2008b. Their original book discussed ETS (Error-Trend-Seasonality) model in the Single Source of Error (SSOE) form and that the topic has not been updated substantially since then. If you are interested in learning more about exponential smoothing, then this is a must-read material on the topic.

However, there has been substantial progress in the area since 2008, and I have developed some models and functions based on SSOE, making the framework more flexible and general. Given that publication of all aspects of these models in peer-reviewed journals would be very time consuming, I have decided to summarise all progress in this book, showing what happens inside the models and how to use the functions in different cases, so that there is a source to refer to.

Many parts of this monograph rely on such topics as a model, scales of information, model uncertainty, likelihood, information criteria and model building. All these topics are discussed in detail in the online textbook of @SvetunkovSBA. You are recommended to familiarise yourself with them before moving to ADAM's more advanced modelling topics.

In this chapter, we explain what forecasting is, how it is different from planning and analytics and the main forecasting principles one should follow in order not to fail in trying to predict the future.


## Forecasting, planning and analytics {#forecastingPlanningAnalytics}
While there are many definitions of what forecast is, I like the following one, proposed by Sergey Svetunkov [@Svetunkov2014Textbook]: **Forecast is a scientifically justified assertion about possible states of an object in future**. This definition does not have the word "probability" in it, because in some cases, forecasts do not rely on rigorous statistical methods and theory of probabilities. For example, the Delphi method allows obtaining judgmental forecasts, typically focusing on what to expect, not on the probability side. An essential word in the definition is "**scientific**". If a prediction is made based on coffee grounds, then it is not a forecast. Judgmental predictions, on the other hand, can be considered as forecasts if a person has a reason behind them. If they do not, this should be called "guesses", not forecasts. Finally, the word "future" is important as well, as it shows the focus of the discipline: without the future, there is no forecasting, only overfitting. As for the definition of **forecasting**, it is a process of producing forecasts -- as simple as that.

Forecasting is a vital activity carried by many organisations, some of which do it unconsciously or label it as "demand planning" or "predictive analytics". However, there is a difference between the terms "forecasting" and "planning". The latter relies on the former and implies the company's actions to adjust its decisions. For example, if we forecast that the sales will go down, a company may make some marketing decisions to increase the demand on the product. The first part relates to forecasting, while the second relates to planning. If a company does not like a forecast, it should change something in its activities, not in the forecast itself. It is important not to confuse these terms in practice.

Another crucial thing to keep in mind is that any forecasting activity should be done to inform decisions. Forecasting for the sake of forecasting is pointless. Yes, we can forecast the overall number of hospitalisations due to SARS-CoV-2 virus in the world for the next decade, but what decisions can be made based on that? If there are some decisions, then this exercise is worthwhile. If not, then this is just a waste of time.

::: example
Retailers typically need to order some amount of milk that they will sell over the next week. They do not know how much they will sell, so they usually order, hoping to satisfy, let us say, 95% of demand. This situation tells us that the forecasts need to be made a week ahead, they should be cumulative (considering the overall demand during a week before the following order) and that they should focus on an upper bound of a 95% prediction interval. Producing only point forecasts would not be helpful in this situation.
:::

Related to this is the question of forecasts accuracy. In reality, accurate forecasts do not always translate to good decisions. This is because many different aspects of reality need to be taken into account, and forecasting focuses only on one of them. Capturing the variability of demand correctly is sometimes more useful than producing very accurate point forecasts -- this is because many decisions are based on distributions of possible values rather than on point forecasts. The classical example of this situation is inventory management, where the ordering decisions are made based on quantiles of distribution to form safety stock. Furthermore, the orders are typically done in pallets, so it is not important whether the expected demand is 99 or 95 units if a pallet includes 100 units of a product. This means that whenever we produce forecasts, we need to consider how they will be used.

In some cases, accurate forecasts might be wasted if people make decisions differently and/or do not trust what they see. For example, a demand planner might decide that a straight line is not an appropriate forecast for their data and would start changing the values, introducing noise. This might happen due to a lack of experience, expertise or trust in models [@Spavound2022], and this means that it is crucial to understand who will use the forecasts and how.

Finally, in practice, not all issues can be resolved with forecasting. In some cases, companies can make decisions based on other reasons. For example, promotional decisions can be dictated by the existing stocks of the product that need to be moved out. In another case, if the holding costs for a product are low, then there is no need to spend time forecasting the demand for it -- a company can implement a simple replenishment policy, ordering, when the stock reaches some threshold. And in times of crisis, some decisions are dictated by the company's financial situation, not by forecasts: arguably, you do not need to predict demand for products that are sold out of prestige if they are not profitable, and a company needs to cut costs.

Summarising all the above, it makes sense to determine what decisions will be made based on forecasts, by whom and how. There is no need to waste time and effort on improving the forecasting accuracy if the process in the company is flawed and forecasts are then ignored, not needed or amended inappropriately.

As for **analytics**, this is a relatively new term, which implies a set of activities based on analysis, forecasting and optimisation to support informed managerial decisions. The term is broad and relies on many research areas, including forecasting, simulations, optimisation etc. In this monograph, we will focus on the forecasting aspect, occasionally discussing how to analyse the existing processes (thus touching the analytics part) and how various models could help make good practical decisions.


## Forecasting principles {#forecastingPrinciples}
If you have decided that you need to forecast something, it makes sense to keep several important forecasting principles in mind.

First, as discussed earlier, you need to understand why the forecast is required, how it will be used and by whom. Answers to these questions will guide you in deciding what technique to use, how specifically to forecast and what should be reported. For example, if a client does not know machine learning, it might be unwise to use Neural Networks for forecasting -- the client will potentially not trust the technique and thus will not trust the forecasts, switching to simpler methods. If the final decision is to order a number of units, it would be more reasonable to produce cumulative forecasts over the lead time (time between the order and product delivery) and form safety stock based on the model and assumed distribution.

When you understand what to forecast and how, the second principle comes into play: select the relevant error measure. You need to decide how to measure the accuracy of forecasting methods, keeping in mind that accuracy needs to be as close to the final decision as possible. For example, if you need to decide the number of nurses for a specific day in the A&E department based on the patients' attendance, then it would be more reasonable to compare models in terms of their quantile performance (see Section \@ref(uncertainty)) rather than expectation or median. Thus, it would be more appropriate to calculate pinball loss instead of MAE or RMSE (see details in Chapter \@ref(forecastsEvaluation)).

Third, you should always test your models on a sample of data not seen by them. Train your model on one part of a sample (called *train set* or *in-sample*) and test it on another one (called *test set* or *holdout sample*). This way, you can have some guarantees that the model will not overfit the data and that it will be reasonable when you need to produce a final forecast. Yes, there are cases when you do not have enough data to do that. All you can do in these situations is use simpler, robust models [for example, damped trend exponential smoothing by @Roberts1982; and @Gardner1985a; or Theta by @Assimakopoulos2000] and to use judgment in deciding whether the final forecasts are reasonable or not. But in all the other cases, you should test the model on the data it is unaware of. The recommended approach, in this case, is rolling origin, discussed in more detail in Section \@ref(rollingOrigin).

Fourth, the forecast horizon should be aligned with specific decisions in practice. If you need predictions for a week ahead, there is no need to produce forecasts for the next 52 weeks. If you do that then on the one hand, this will be costly and excessive, and on the other hand, the accuracy measurement will not align with the company's needs. The related issue is the test set (or holdout) size selection. There is no unique guideline for this, but it should not be shorter than the forecasting horizon and preferrably it should align with the specific horizon coming from managerial decisions.

Fifth, the time series aggregation level should be as close to the specific decisions as possible. There is no need to produce forecasts on an hourly level for the next week (168 hours ahead) if the decision is based on the order of a product for the whole week. We would not need such a granularity of data for the decision; aggregating the actual values to the weekly level and then applying models will do the trick. Otherwise, we would be wasting a lot of time making complicated models work on an hourly level.

Sixth, you need to have benchmark models. Always compare forecasts from your favourite approach with those from Naïve, global average and/or regression (they are discussed in Section \@ref(simpleForecastingMethods)) -- depending on what you deal with specifically. If your fancy Neural Network performs worse than Naïve, it does not bring value and should not be used in practice. Comparing one Neural Network with another is also not a good idea because Simple Exponential Smoothing (see Section \@ref(SES)), being a much simpler model, might beat both networks, and you would never find out about that. If possible, also compare forecasts from the proposed approach with forecasts of other well-established benchmarks, such as ETS [@Hyndman2008b], ARIMA [@Box1976] and Theta [@Assimakopoulos2000].

Finally, when comparing forecasts from different models, you might end up with several very similar performing approaches. If the difference between them is not significant, then the general recommendation is to select the faster and simpler one. This is because simpler models are more difficult to break, and those that work faster are more attractive in practice due to reduced energy consumption [save the planet and stop global warming! @Dhar2020].

These principles do not guarantee that you will end up with the most accurate forecasts, but at least you will not end up with unreasonable ones.


## Types of forecasts {#typesOfForecasts}
Depending on circumstances, we might require different types of forecasts with different characteristics. It is essential to understand what your model produces to measure its performance correctly (see Section \@ref(errorMeasures)) and make correct decisions in practice. Several things are typically produced for forecasting purposes. We start with the most common one.

### Point forecasts {#typesOfForecastsPoint}
Point forecast is the most often produced type of forecast. It corresponds to a trajectory from a model. This, however, might align with different types of statistics depending on the model and its assumptions. In the case of a pure additive model (such as linear regression), the point forecasts correspond to the conditional expectation (**mean**) from the model. The conventional interpretation of this value is that it shows what to expect on average if the situation would repeat itself many times (e.g. if we have the day with similar conditions, then the average temperature will be 10 degrees Celsius). In the case of time series, this interpretation is difficult to digest, given that time does not repeat itself, but this is the best we have. We will come back to the technicalities of producing conditional expectations from ADAM in Section \@ref(ADAMForecastingMoments).

Another type of point forecast is the (conditional) geometric expectation (**geometric mean**). It typically arises, when the model is applied to the data in logarithms and the final forecast is then exponentiated. This becomes apparent from the following definition of geometric mean:
\begin{equation}
    \check{\mu} = \sqrt[T]{\prod_{t=1}^T y_t} = \exp \left(\frac{1}{T} \sum_{t=1}^T \log(y_t) \right) ,
    (\#eq:GeoMean)
\end{equation}
where $y_t$ is the actual value on observation $t$, and $T$ is the sample size. To use the geometric mean, we need to assume that the actual values can only be positive. Otherwise, the root in \@ref(eq:GeoMean) might produce imaginary units (because of taking a square root of a negative number) or be equal to zero (if one of the values is zero). In general, the arithmetic and geometric means are related via the following inequality:
\begin{equation}
    \check{\mu} \leq \mu ,
    (\#eq:GeoAndArithMeans)
\end{equation}
where $\check{\mu}$ is the geometric mean and $\mu$ is the arithmetic one. Although geometric mean makes sense in many contexts, it is more difficult to explain than the arithmetic one to decision makers.

Finally, sometimes **medians** are used as point forecasts. In this case, the point forecast splits the sample into two halves and shows the level below which 50% of observations will lie in the future.

::: remark
The specific type of point forecast will differ depending on the model used in construction. For example, in the case of the pure additive model, assuming some symmetric distribution (e.g. Normal one), the arithmetic mean, geometric mean and median will coincide. On the other hand, a model constructed in logarithms will assume an asymmetric distribution for the original data, leading to the following relation between the means and the median (in case of positively skewed distribution):
\begin{equation}
    \check{\mu} \leq \tilde{\mu} \leq \mu ,
    (\#eq:GeoAndArithMeansAndMedian)
\end{equation}
where $\tilde{\mu}$ is the median of distribution.
:::

### Quantiles and prediction intervals {#typesOfForecastsInterval}
As some forecasters say, all point forecasts are wrong. They will never correspond to the actual values because they only capture the model's mean (or median) performance, as discussed in the previous subsection. Everything that is not included in the point forecast can be considered as the uncertainty of demand. For example, we never will be able to say precisely how many cups of coffee a cafe will sell on the forthcoming Monday, but we can at least capture the main tendencies and the uncertainty around our point forecast.

```{r adamExampleNormal, fig.cap="An example of a well behaved data, point forecast and a 95% prediction interval.", echo=FALSE}
testAdam <- adam(rnorm(100,100,10), "ANN", persistence=0, h=10, holdout=TRUE)
testAdamForecast <- forecast(testAdam, h=10, interval="approx")
plot(testAdamForecast, main="")
```

Figure \@ref(fig:adamExampleNormal) shows an example with a well-behaved demand, for which the best point forecast is the straight line. To capture the uncertainty of demand, we can construct the prediction interval, which will tell roughly where the demand will lie in $1-\alpha$ percent of cases. The interval in Figure \@ref(fig:adamExampleNormal) has the width of 95% (significance level $\alpha=0.05$) and shows that if the situation is repeated many times, the actual demand will be between `r round(testAdamForecast$lower[1],2)` and `r round(testAdamForecast$upper[1],2)`. Capturing the uncertainty correctly is important because real-life decisions need to be made based on the full information, not only on the point forecasts.

We will discuss how to produce prediction intervals in more detail in Section \@ref(ADAMForecastingPI). For a more detailed discussion on the concepts of prediction and confidence intervals, see Chapter 6 of @SvetunkovSBA.

Another way to capture the uncertainty (related to the prediction interval) is via specific quantiles of distribution. The prediction interval typically has two sides, leaving $\frac{\alpha}{2}$ values outside of each side of distribution. Instead of producing the interval, in some cases, we might need just a specific quantile, essentially creating the one-sided prediction interval (see Section \@ref(forecastingADAMOtherOneSided) for technicalities). The bound in this case will show the particular value below which the pre-selected percentage of cases would lie. This becomes especially useful in such contexts as the safety stock calculation (because we are not interested in knowing the lower bound, we want products in inventory to satisfy some proportion of demand).


### Forecast horizon
Finally, an important aspect in forecasting is the horizon, for which we need to produce forecasts. Depending on the context, we might need:

1. Only a specific value h steps ahead, e.g., the temperature following Monday.
2. All values from 1 to h steps ahead, e.g. how many patients we will have each day next week.
3. Cumulative values for the period from 1 to h steps ahead, e.g. what the cumulative demand over the lead time (the time between the order and product delivery) will be (see discussion in Section \@ref(forecastingADAMOtherCumulative)).

It is essential to understand how decisions are made in practice and align them with the forecast horizon. In combination with the point forecasts and prediction intervals discussed above, this will give us an understanding of what to produce from the model and how. For example, in the case of safety stock calculation, it would be more reasonable to produce quantiles of the cumulative over the lead time demand than to produce point forecasts from the model.


## Models, methods and typical assumptions {#modelsMethods}
While we do not aim to fully cover the topic of models, methods, and typical assumptions of statistical models, we need to make several important definitions to clarify what we will discuss later in this monograph. For a more detailed discussion, see Chapters 1 and 15 of @SvetunkovSBA.

The Cambridge dictionary [@CambridgeMethod] defines **method** as a particular way of doing something. So, the method does not necessarily explain the structure or how any error term interacts with it; it only describes how a value (for example, point forecast) is produced. In our context, the forecasting method would be a formula that generates point forecasts based on some parameters and available data. It would not explain how what underlies the data.

**Statistical model** on the other hand, is a 'mathematical representation of a real phenomenon with a complete specification of distribution and parameters' [@Svetunkov2019a]. It explains what happens inside the data, reveals the structure and shows how the error term interacts with the structure.

@Chatfield2001 was the first to discuss the distinction between forecasting model and method, although the two are not thoroughly defined in their paper: "method, meaning a computational procedure for producing forecasts", and "a model, meaning a mathematical representation of reality".

While discussing statistical models, we should also define **true model**. It is "the idealistic statistical model that is correctly specified (has all the necessary components in the correct form), and applied to the data in population" [@SvetunkovSBA]. Some statisticians also use the term **Data Generating Process** (DGP) when discussing the true model. Still, we need to distinguish between the two terms, as DGP implies that the data is somehow generated using a mathematical formula. In real life, the data is not generated from any function; it comes from a measurement of a complex process, influenced by many factors (e.g. behaviour of a group of customers based on their individual preferences and mental states). The DGP is useful when we want to conduct experiments on simulated data in a controlled environment, but it is not helpful when applying models to the data. Finally, the true model is an abstract notion because it is never known or reachable (e.g. we do not always have all the necessary variables). But it is still a useful one, as it allows us to see what would happen if we knew the model and, more importantly, what would happen if the model we used was wrong (which is always the case in real life).

Related to this definition is the **estimated** or **applied model**, which is the statistical model that is applied to the available sample of data. This model will almost always be wrong because even if we know the specification of the true model for some mysterious reason, we would still need to estimate it on our data. In this case, the estimates of parameters would differ from those in the population, and thus the model will still be wrong.

Mathematically, in the simplest case the true model can be written as:
\begin{equation}
    y_t = \mu_{y,t} + \epsilon_t,
    (\#eq:TrueModel)
\end{equation}
where $y_t$ is the actual observed value, $\mu_{y,t}$ is the structure and $\epsilon_t$ is the true noise. If we manage to capture the structure correctly, the model applied to the sample of data would be written as:
\begin{equation}
    y_t = \hat{\mu}_{y,t} + e_t,
    (\#eq:AppliedModel)
\end{equation}
where $\hat{\mu}_{y,t}$ is the estimate of the structure $\mu_{y,t}$ and $e_t$ is the estimate of the noise $\epsilon_t$ (also known as "**residuals**"). If the structure is captured correctly, there would still be a difference between \@ref(eq:TrueModel) and \@ref(eq:AppliedModel) because the latter is estimated on the data. However, if the sample size increases and we use an adequate estimation procedure, then due to Central Limit Theorem [see Chapter 6 of @SvetunkovSBA], the distance between the two models will decrease and asymptotically (with the increase of sample size) $e_t$ would converge to $\epsilon_t$. This does not happen automatically, and some assumptions must hold for this to happen.

### Assumptions of statistical models {#assumptions}
Very roughly, the typical assumptions of statistical models can be split into the following categories [@SvetunkovSBA]:

1. Model is correctly specified:
a. We have not omitted important variables in the model (underfitting the data);
b. We do not have redundant variables in the model (overfitting the data);
c. The necessary transformations of the variables are applied;
d. We do not have outliers in the model;
2. Errors are independent and identically distributed (i.i.d.):
a. There is no autocorrelation in the residuals;
b. The residuals are homoscedastic (i.e. have constant variance);
c. The expectation of residuals is zero, no matter what;
d. The variable follows the assumed distribution;
e. More, generally speaking, the distribution of residuals does not change over time;
3. The explanatory variables are not correlated with anything but the response variable:
a. No multicollinearity;
b. No endogeneity.

Many of these assumptions come from the idea that we have correctly captured the structure, meaning that we have not omitted any essential variables, we have not included redundant ones, and we transformed all the variables correctly (e.g. took logarithms, where needed). If all these assumptions hold, then we would expect the applied model to converge to the true one with the increase of the sample size. If some of them do not hold, then the point forecasts from our model might be biased, or we might end up producing wider (or narrower) prediction intervals than expected.

These assumptions with their implications on an example of multiple regression are discussed in detail in Chapter 15 of @SvetunkovSBA. The diagnostics of dynamic models based on these assumptions is discussed in Chapter \@ref(diagnostics).