# types of forecast "goodness"

Allan Murphy's essay "What is a good forecast? An essay on the nature of goodness in weather forecasting" [Murphy 1993](https://doi.org/10.1175/1520-0434(1993)008%3C0281:WIAGFA%3E2.0.CO;2) explored several fundamental concepts in forecast verification. In it, he defined three distinct types of forecast "goodness":

## consistency
Consistency was defined as the degree to which the issued forecast corresponds to the forecaster's best judgment about the situation. Since the judgment of a forecaster is internal, this is not something that can be measured. Murphy (1993) discussed ways that forecasts can be inconsistent, such as when the uncertainty inherent in the forecaster's judgment is not properly expressed in the final forecast product, or when performance measures/scores are designed in such a way as to encourage forecasters to alter their issued products to try to optimize the expected score.

## quality
Quality was defined as "the degree of correspondence between forecasts and observations" (Murphy 1993). This is typically what we are investigating when practicing forecast verification; different verification scores and measures provide information related to different "attributes" of quality. There are several dimensions or "aspects" of forecast quality, which are inter-related, complicating comprehensive analyses of forecast quality.

## value
Value was defined as "the benefits realized - or expenses incurred - by individuals or organizations who use the forecasts to guide their choices" (Murphy 1993). Murphy made it clear that forecasts by themselves have no intrinsic value; they become valuable if and when they positively influence the decisions made by the users of the forecast. Therefore, forecast value is directly connected to the end-users/stakeholders and their decision-making problems, making analysis of value quite challenging.

## aspects of quality

[Murphy (1993)](https://doi.org/10.1175/1520-0434(1993)008%3C0281:WIAGFA%3E2.0.CO;2) identified nine aspects of forecast quality, which are described below. Each of these can be related to the general statistical framework for forecast verification established by [Murphy and Winkler (1987)](https://doi.org/10.1175/1520-0493(1987)115%3C1330:AGFFFV%3E2.0.CO;2), which is based on the joint probability of forecasts and observations. Denoting the forecasts by $f$ and the observations by $x$, Murphy and Winkler (1987) described how the complete verification information can be obtained from the joint distribution of forecasts and observations $p(f,x)$.
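
As a rough illustration of the joint-distribution view, the sketch below estimates $p(f,x)$ from a small sample by counting how often each forecast/observation pair occurs, and then recovers the marginal distributions $p(f)$ and $p(x)$ by summing over the other variable. The data and the helper names `empirical_joint` and `marginal` are hypothetical, and the sketch assumes the forecasts take only a few discrete values.

```python
from collections import Counter


def empirical_joint(forecasts, observations):
    """Estimate p(f, x) as the relative frequency of each (f, x) pair."""
    counts = Counter(zip(forecasts, observations))
    n = len(forecasts)
    return {pair: count / n for pair, count in counts.items()}


def marginal(joint, axis):
    """Sum the joint distribution over the other variable (axis 0 -> p(f), axis 1 -> p(x))."""
    out = {}
    for pair, prob in joint.items():
        key = pair[axis]
        out[key] = out.get(key, 0.0) + prob
    return out


# Hypothetical sample: probability forecasts f and binary observations x.
forecasts = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
observations = [0, 0, 1, 1, 1, 1, 0, 1]

joint = empirical_joint(forecasts, observations)
p_f = marginal(joint, axis=0)  # distribution of forecast values
p_x = marginal(joint, axis=1)  # distribution of observed values
```

With continuous forecasts, the values would first need to be binned before counting; the bin choice is a separate decision that this sketch does not address.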

### uncertainty
The variability of the observations, typically measured by the sample variance of observed values ($s^2_x=\frac{1}{N}\sum(x_i-\bar x)^2$). Presumably, observations that are highly variable are more difficult to predict, and are considered to have high uncertainty.

### sharpness
The variability of the forecasts, described by the distribution of forecast values and often measured by the sample variance of forecast values ($s^2_f=\frac{1}{N}\sum(f_i-\bar f)^2$).
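
A minimal sketch of the two variance calculations, $s^2_x$ for uncertainty and $s^2_f$ for sharpness, using the $1/N$ divisor from the formulas. The sample values and the function name `sample_variance` are made up for illustration.

```python
def sample_variance(values):
    """Variance with a 1/N divisor, matching the formulas for s^2_x and s^2_f."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Hypothetical paired sample of forecasts f and observations x.
forecasts = [2.0, 4.0, 6.0, 8.0]
observations = [1.0, 5.0, 5.0, 9.0]

uncertainty = sample_variance(observations)  # s^2_x = 8.0
sharpness = sample_variance(forecasts)       # s^2_f = 5.0
```

Note that uncertainty depends only on the observations and sharpness only on the forecasts, so neither says anything about how well the two match.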

### bias
The correspondence between the mean forecast $\bar f=\frac{1}{N}\sum f_i$ and mean observation $\bar x=\frac{1}{N}\sum x_i$. Typically measured as the difference between them (often called mean error $ME = \frac{1}{N}\sum(f_i-x_i)=\bar f - \bar x$) or as a ratio when forecasts are binary or probabilistic (often called frequency bias $B=\frac{\bar f}{\bar x}$).
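
Both bias measures can be sketched in a few lines. The temperature-like values and the binary yes/no series below are hypothetical, as are the function names `mean_error` and `frequency_bias`.

```python
def mean_error(forecasts, observations):
    """ME: average of (f_i - x_i), equal to mean forecast minus mean observation."""
    n = len(forecasts)
    return sum(f - x for f, x in zip(forecasts, observations)) / n


def frequency_bias(forecasts, observations):
    """B: ratio of mean forecast to mean observation (undefined if the event never occurs)."""
    mean_f = sum(forecasts) / len(forecasts)
    mean_x = sum(observations) / len(observations)
    return mean_f / mean_x


# Hypothetical continuous forecasts and observations.
temp_forecasts = [21.0, 18.0, 25.0, 20.0]
temp_observations = [20.0, 17.0, 23.0, 20.0]
me = mean_error(temp_forecasts, temp_observations)  # 1.0: forecasts too high on average

# Hypothetical binary forecasts (1 = event forecast) and observations (1 = event occurred).
yes_no_forecasts = [1, 1, 0, 1, 1, 1, 1, 0]
yes_no_observations = [1, 0, 0, 1, 0, 1, 1, 0]
b = frequency_bias(yes_no_forecasts, yes_no_observations)  # 1.5: event forecast too often
```

A mean error of zero or a frequency bias of one indicates no overall bias, but says nothing about whether individual forecasts match individual observations.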

### association
The strength of the linear relationship between individual pairs of forecasts and observations, typically measured by the sample covariance: 
\begin{equation}
s_{fx}=\frac{1}{N}\sum(f_i-\bar f)(x_i-\bar x)
\end{equation}

or correlation coefficient: 
\begin{equation}
r_{fx}=\frac{s_{fx}}{s_f s_x}
\end{equation}

Mutual information can also provide information on the degree of dependence between forecast and observed variables.
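
The covariance and correlation formulas can be sketched as follows, again with the $1/N$ divisor. The data and the function names are hypothetical; mutual information is not covered here.

```python
import math


def covariance(f, x):
    """s_fx with a 1/N divisor; covariance(f, f) gives the variance s^2_f."""
    n = len(f)
    f_bar = sum(f) / n
    x_bar = sum(x) / n
    return sum((fi - f_bar) * (xi - x_bar) for fi, xi in zip(f, x)) / n


def correlation(f, x):
    """r_fx = s_fx / (s_f * s_x); undefined if either series is constant."""
    s_f = math.sqrt(covariance(f, f))
    s_x = math.sqrt(covariance(x, x))
    return covariance(f, x) / (s_f * s_x)


# Hypothetical paired sample.
forecasts = [2.0, 4.0, 6.0, 8.0]
observations = [1.0, 5.0, 5.0, 9.0]

s_fx = covariance(forecasts, observations)   # 6.0
r_fx = correlation(forecasts, observations)  # about 0.949
```

Because correlation is insensitive to shifts and rescaling, adding a constant to every forecast would leave `r_fx` unchanged even though the forecasts would become biased.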

### accuracy
The average correspondence between individual pairs of forecasts and observations. Many typical verification scores are measures of accuracy, such as mean squared error:
\begin{equation}
MSE=\frac{1}{N}\sum(f_i-x_i)^2
\end{equation}

and mean absolute error:
\begin{equation}
MAE=\frac{1}{N}\sum|f_i-x_i|
\end{equation}
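
A short sketch of both accuracy measures on a hypothetical sample (the values and function names are illustrative only):

```python
def mean_squared_error(forecasts, observations):
    """MSE: average squared difference between paired forecasts and observations."""
    n = len(forecasts)
    return sum((f - x) ** 2 for f, x in zip(forecasts, observations)) / n


def mean_absolute_error(forecasts, observations):
    """MAE: average absolute difference between paired forecasts and observations."""
    n = len(forecasts)
    return sum(abs(f - x) for f, x in zip(forecasts, observations)) / n


# Hypothetical paired sample; the individual errors are 1, 1, 2, and 0.
forecasts = [21.0, 18.0, 25.0, 20.0]
observations = [20.0, 17.0, 23.0, 20.0]

mse = mean_squared_error(forecasts, observations)   # 1.5
mae = mean_absolute_error(forecasts, observations)  # 1.0
```

The single error of 2 contributes four of the six squared-error units, which shows how MSE penalizes large errors more heavily than MAE.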

### skill
Since "ignorant" forecasting approaches (persistence, climatology, or even random chance) can display some degree of accuracy, a forecasting system must provide better accuracy than those methods in order to be considered "skillful". Therefore, forecast skill is defined in terms of relative accuracy, or the accuracy of a forecast system compared to the accuracy of forecasts based on a "reference" approach, such as random chance, persistence, climatology, or a competing forecast system.

We can convert an accuracy measure into a measure of forecast skill by comparing the $score$ to one determined from a “reference” forecast ($score_{ref}$). The difference between the score obtained from the sample of forecasts and observations and that obtained from the reference forecast can be normalized using the expected value of the measure for a perfect forecasting system ($score_{perfect}$). The reference forecast system provides a "baseline" for determining the degree of skill in the forecast system of interest. Positive skill is obtained when the forecast system is “better” than the reference system, and negative skill when the forecasts are “worse” than the reference.

A generic skill score ($SS$) can be defined using the following:

\begin{equation}
SS=\frac{score - score_{ref}}{score_{perfect} - score_{ref}}
\end{equation}
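
As a worked example of the generic skill score, the sketch below uses MSE as the accuracy measure, for which a perfect forecast scores 0. The reference is a constant "climatology" forecast set to the mean of the same sample of observations; that in-sample choice, the data, and the function names are assumptions made for illustration.

```python
def mean_squared_error(forecasts, observations):
    """MSE: average squared difference between paired forecasts and observations."""
    n = len(forecasts)
    return sum((f - x) ** 2 for f, x in zip(forecasts, observations)) / n


def skill_score(score, score_ref, score_perfect=0.0):
    """Generic SS; undefined when the reference is itself perfect."""
    return (score - score_ref) / (score_perfect - score_ref)


# Hypothetical paired sample.
forecasts = [2.0, 4.0, 6.0, 8.0]
observations = [1.0, 5.0, 5.0, 9.0]

# Assumed reference: always forecast the sample mean observation.
climatology = sum(observations) / len(observations)
reference = [climatology] * len(observations)

mse = mean_squared_error(forecasts, observations)      # 1.0
mse_ref = mean_squared_error(reference, observations)  # 8.0
ss = skill_score(mse, mse_ref)                         # 0.875
```

Here the forecasts remove 87.5% of the reference's squared error. A forecast that scored the same as the reference would give $SS=0$, and a perfect forecast would give $SS=1$.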

### reliability
The remaining aspects of forecast quality are related to conditional distributions (either the forecasts conditioned on the observations $p(f|x)$ or the observations conditioned on the forecasts $p(x|f)$). Reliability is generally concerned with conditional bias, or whether the forecasts can be "taken at face value". Murphy (1993) defined reliability as "the correspondence between the conditional mean observation and the conditioning forecast". For example, a set of forecasts of "30%" probability is considered reliable if the event is observed on 30% of the occasions when that specific probability was issued. Reliability is typically analyzed with reliability (or attributes) diagrams.
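
The numbers behind a reliability diagram can be sketched by grouping the observations by the forecast probability that was issued and computing the observed relative frequency in each group. The sample below is hypothetical, and `reliability_table` is an illustrative name; the sketch assumes the forecasts take only a few discrete probability values.

```python
from collections import defaultdict


def reliability_table(probabilities, outcomes):
    """Map each issued probability to the observed event frequency when it was issued."""
    groups = defaultdict(list)
    for p, outcome in zip(probabilities, outcomes):
        groups[p].append(outcome)
    return {p: sum(obs) / len(obs) for p, obs in sorted(groups.items())}


# Hypothetical sample: issued probabilities and binary outcomes (1 = event occurred).
probabilities = [0.2] * 5 + [0.5] * 4 + [0.8] * 5
outcomes = [0, 0, 1, 0, 0] + [1, 1, 1, 0] + [1, 1, 1, 0, 1]

table = reliability_table(probabilities, outcomes)
# table == {0.2: 0.2, 0.5: 0.75, 0.8: 0.8}
```

In this made-up sample the 20% and 80% forecasts are reliable, while the 50% forecasts are not: the event occurred 75% of the time they were issued. Plotting the observed frequencies against the issued probabilities gives the curve shown in a reliability diagram, with perfect reliability along the diagonal.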

### resolution
Resolution is also determined from the conditional distribution of observations given the forecast, so it is connected to reliability. Murphy (1993) defined resolution as the "difference between the conditional mean observation and the unconditional mean observation" (conditioned on the forecast). In other words, how do the conditional mean observations that correspond to forecasts of "40" differ from those that correspond to forecasts of "60", or from the overall mean observation? Forecasts with good resolution can sort (or "resolve") the observed outcomes into groups with different frequencies when different forecast values are issued. Resolution can also be analyzed with reliability diagrams.
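
One way to put a number on this idea is to average the squared difference between each conditional mean observation and the overall mean observation, weighted by how often each forecast value was issued. This weighted form is the one used for the resolution term in the Brier score decomposition, which the text above does not cover, so treat it as one assumed summary among several. The data and function names are hypothetical.

```python
from collections import defaultdict


def conditional_means(forecasts, observations):
    """Map each forecast value to (conditional mean observation, number of cases)."""
    groups = defaultdict(list)
    for f, x in zip(forecasts, observations):
        groups[f].append(x)
    return {f: (sum(xs) / len(xs), len(xs)) for f, xs in sorted(groups.items())}


def resolution_summary(forecasts, observations):
    """Weighted mean squared difference between conditional and overall mean observation."""
    n = len(observations)
    overall_mean = sum(observations) / n
    cond = conditional_means(forecasts, observations)
    return sum(count * (mean - overall_mean) ** 2 for mean, count in cond.values()) / n


# Hypothetical forecasts that separate the outcomes: conditional means 0.2 and 0.8.
resolved = resolution_summary([0.2] * 5 + [0.8] * 5, [0, 0, 1, 0, 0, 1, 1, 1, 0, 1])

# Hypothetical forecasts that do not: both conditional means equal the overall mean 0.5.
unresolved = resolution_summary([0.2, 0.2, 0.8, 0.8], [0, 1, 0, 1])
```

The first set of forecasts gives a value of about 0.09, while the second gives exactly 0 because the observed frequency is the same whichever forecast value was issued.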

### discrimination
Discrimination is roughly defined as the forecast system's ability to discriminate among different observations, and it is determined from the conditional distribution of forecasts given the observation $p(f|x)$. Murphy (1993) defined two types of discrimination: "the correspondence between the conditional mean forecast and the conditioning observation" and the "difference between the conditional mean forecast and the unconditional mean forecast". In situations with only two possible observed outcomes (for instance, the event occurred, $x=1$, or did not occur, $x=0$), discrimination can be examined by comparing the two resulting conditional forecast distributions $p(f|x=0)$ and $p(f|x=1)$; a plot of these is typically called a discrimination diagram. A forecast system with good discrimination ability will show these conditional forecast distributions as well separated, perhaps skewing towards higher forecast values when $x=1$ and lower forecast values when $x=0$. ROC curves are also used to analyze discrimination.
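
For the two-outcome case, a minimal sketch is to split the forecasts by the observed outcome and compare the two conditional samples, here summarized only by their means. The data and the function name `split_by_outcome` are hypothetical; a full discrimination diagram would plot the two conditional distributions rather than just their means.

```python
def split_by_outcome(forecasts, observations):
    """Group forecast values by the binary outcome that was observed (0 or 1)."""
    by_outcome = {0: [], 1: []}
    for f, x in zip(forecasts, observations):
        by_outcome[x].append(f)
    return by_outcome


def mean(values):
    return sum(values) / len(values)


# Hypothetical probability forecasts and binary observations.
forecasts = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.4, 0.6]
observations = [0, 0, 0, 1, 1, 1, 1, 0]

groups = split_by_outcome(forecasts, observations)
mean_f_given_no_event = mean(groups[0])  # about 0.3, sample from p(f | x=0)
mean_f_given_event = mean(groups[1])     # about 0.7, sample from p(f | x=1)
separation = mean_f_given_event - mean_f_given_no_event  # about 0.4
```

A larger gap between the two conditional means suggests better discrimination, although two distributions can share similar means and still differ in shape, which is why the full distributions are usually inspected.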