# Introduction: The Cautious Expert and the Charlatan

Consider two types of probabilistic forecasters. The first, whom we shall term the expert, forecasts cautiously: the probability he gives for an event reflects the inherent uncertainty that governs whether it occurs. The second type of forecaster is more confident that he is right. Whenever an event is more likely than not to occur, he expresses more certainty in his forecast than is reasonable, given the prevalent uncertainty. We shall term this forecaster a charlatan.

An expert forecaster will state that the probability of rolling a five or less on a fair die is five in six. A charlatan will see that such an event happens more than half the time and proclaim, with a higher degree of confidence, that it will happen. When it does happen far more often than not, he will herald his expertise to the world, denigrating those wishy-washy, fence-sitting forecasters who give a lower probability of occurrence for an event that is obviously likely to happen. When a six is rolled, the charlatan will claim that nobody could have foreseen such an event, and point to his expert counterpart, who also got it wrong in 'falsely' claiming that such an event would not happen - the expert did provide 83% certainty that it wouldn't happen, after all. Of course, because rolling a die is an event that can be repeated again and again and the results tabulated, the charlatan will eventually be exposed.

## Measures to Evaluate Probabilistic Performance

To take a small detour, let's consider how we show that the expert in this scenario is right and the charlatan misguided. To evaluate probabilistic forecasting performance, we compute the average difference between the degree of certitude given for an event's occurrence and whether or not it occurred (the event scores 1 if it occurred and 0 if it did not) over many repeated trials. The forecaster with the lower average difference is proclaimed the better forecaster (in this instance at least).

To score the expert and the charlatan in the dice rolling exercise, let's consider two widely used scoring rules - the simple average error, which takes the absolute difference between the outcome (1 if the event occurred, 0 if not) and the forecast probability; and the Brier Score, which takes the squared difference.

That is, the simple average forecast error (SAFE) for an event that has two outcomes, occurs or does not occur, is given by:

$$
SAFE_i = |E_i - f|
$$

and the Brier Score is given, in its traditional form by:

$$
Brier_i = (E_i - f)^2 + ((1-E_i) - (1-f))^2
$$

for each of a number of trials $i = 1, 2, ..., n$, where $E_i$ is 1 if the event occurred and 0 if it did not occur in trial $i$, and $f$ is the probability of the event occurring given by the forecaster.

We let the charlatan's forecasting strategy be to double his confidence in the likelier outcome by halving his estimate of the probability of the unlikely outcome. That is, if an event will occur with probability 0.2, the charlatan says it will occur with probability 0.1; and if it will occur with probability 0.8, the charlatan will see that it fails to occur with probability 0.2, claim that it fails with probability 0.1, and hence forecast that the event will occur with probability 1 - 0.1 = 0.9. In the die roll example, the expert forecasts that a 5 or less will be rolled on average five times in six, and provides a probabilistic forecast of occurrence of 0.8333, while the charlatan provides a probabilistic forecast of $1 - \frac{1 - 0.8333}{2} = 0.9167$.
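
The halving rule described above can be sketched in a few lines of Python. The function name `charlatan_forecast` is introduced here for illustration, and the treatment of a probability of exactly 0.5 (left unchanged, since neither outcome is the likelier one) is an assumption that the text does not spell out.

```python
def charlatan_forecast(p):
    """Halve the probability assigned to whichever outcome is less likely."""
    if p > 0.5:
        # The event is the likelier outcome: halve the chance that it fails.
        return 1.0 - (1.0 - p) / 2.0
    if p < 0.5:
        # The event is the less likely outcome: halve its probability directly.
        return p / 2.0
    # Assumption: at exactly 0.5 there is no likelier outcome to exaggerate.
    return 0.5


for p in (0.2, 0.8, 5 / 6):
    print(f"true probability {p:.4f} -> charlatan forecast {charlatan_forecast(p):.4f}")
```

For the three inputs this reproduces the figures quoted above: 0.1, 0.9 and 0.9167.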

We can see in Figure 1a that, in the die roll example, the expert forecaster's simple average error will be 0.1667 if the event occurs and 0.8333 if it does not, and will hence converge to $0.8333 \times 0.1667 + 0.1667 \times 0.8333 = 0.2778$ over a large number of trials, while the charlatan's simple average error will converge to $0.8333 \times 0.0833 + 0.1667 \times 0.9167 = 0.2222$. This perverse outcome, where the worse forecaster gets a better score, illustrates the unsuitability of SAFE for evaluating forecasting performance. Yet for some reason the measure still enjoys widespread use, including, notably, by the political polling outfit FiveThirtyEight in evaluating the quality of election forecasts.

The expert's Brier Score converges to $0.8333 \times (0.1667^2 + 0.1667^2) + 0.1667 \times (0.8333^2 + 0.8333^2) = 0.2778$, while the charlatan's Brier Score converges to $0.8333 \times (0.0833^2 + 0.0833^2) + 0.1667 \times (0.9167^2 + 0.9167^2) = 0.2917$ over a large number of trials (Figure 1b). Here, because the Brier Score heavily punishes (by squaring the error term) events occurring when a high degree of certainty has been given against their occurrence, the expert gets a lower, and hence 'better', Brier Score on this occasion. (*NB: A lower Brier Score is only better when comparing similar forecasts, and scores should only be compared in relative terms, not in absolute terms. For example, a forecaster who correctly predicts that a die will roll 5 or less five times out of six will get a better Brier Score than one who correctly predicts that the same die will roll 3 or less three times out of six. This is a useful feature in some instances, as it rewards boldness in predictions, but it makes comparison hard when aleatory uncertainty is high*).
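
The long-run figures for the die roll can be checked with a short Python sketch. The helper names (`safe`, `brier`, `expected_score`) are illustrative, and SAFE is taken as the absolute difference between outcome and forecast, which is what the errors of 0.1667 and 0.8333 quoted above imply.

```python
def safe(outcome, forecast):
    """Simple average forecast error for one trial (absolute difference)."""
    return abs(outcome - forecast)


def brier(outcome, forecast):
    """Two-term Brier Score for one trial of a binary event."""
    return (outcome - forecast) ** 2 + ((1 - outcome) - (1 - forecast)) ** 2


def expected_score(rule, p_true, forecast):
    """Long-run average score when the event truly occurs with probability p_true."""
    return p_true * rule(1, forecast) + (1 - p_true) * rule(0, forecast)


p_true = 5 / 6                         # rolling a five or less on a fair die
expert = p_true                        # the expert reports the true probability
charlatan = 1 - (1 - p_true) / 2       # the charlatan halves the unlikely outcome

print(f"SAFE  expert    {expected_score(safe, p_true, expert):.4f}")      # 0.2778
print(f"SAFE  charlatan {expected_score(safe, p_true, charlatan):.4f}")   # 0.2222
print(f"Brier expert    {expected_score(brier, p_true, expert):.4f}")     # 0.2778
print(f"Brier charlatan {expected_score(brier, p_true, charlatan):.4f}")  # 0.2917
```

SAFE ranks the charlatan ahead while the Brier Score ranks the expert ahead, matching the two calculations above.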

![image.png](attachment:image.png)

More generally, where each prediction has feedback before the next prediction is made, the Brier Score is a valid rule, adjudging better probabilistic forecasts to indeed be better, which explains its popularity in gauging the quality of weather forecasts and in forecasting competitions. The heatmap in Figure 2 shows that, for a given event, the long-term Brier Score is lowest when the forecast is equal to the actual probability of the event's occurrence, as expected.
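
A short derivation makes this concrete. It uses the two-term Brier Score defined earlier and writes $p$ for the true probability of the event, a symbol introduced here for illustration. For a fixed forecast $f$, the long-run average Brier Score and its derivative with respect to $f$ are:

$$
\mathbb{E}[Brier] = 2p(1-f)^2 + 2(1-p)f^2, \qquad \frac{d}{df}\mathbb{E}[Brier] = -4p(1-f) + 4(1-p)f = 4(f - p)
$$

The derivative is zero only at $f = p$ and the second derivative is $4 > 0$, so the honest forecast is the unique minimum, with value $2p(1-p)$ (which gives 0.2778 at $p = 5/6$). By contrast, if the simple average error is taken as the absolute difference, its long-run average is linear in $f$:

$$
\mathbb{E}[SAFE] = p(1-f) + (1-p)f = p + (1-2p)f
$$

Whenever $p > 0.5$ this decreases as $f$ increases, so it rewards pushing the forecast all the way to 1, which is exactly the behaviour the charlatan exploits.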

But what about the case where new information is received after a forecast is made but prior to the event's occurrence, and where forecasts are allowed to be updated?

![image.png](attachment:image.png)

## Brier Scores in Situations of Systemic Uncertainty

Brier Scores come unstuck, however, when three factors are present in a prediction problem: (1) forecasts are subject to systemic, rather than (or as well as) aleatory, uncertainty - that is, the uncertainty in prediction comes from simply not having enough information, unlike the die rolling example, where we know the long-term probability of rolling a certain value; (2) this systemic uncertainty changes over time as new information is diffused; and (3) forecasts are allowed to be updated when new information that reduces the level of systemic uncertainty is processed.

In many real world forecasting problems that predict human behaviour, in contrast with games of chance, there is no steady-state aleatory uncertainty. Rather, the uncertainty comes from an inability to fully obtain or process all information before making a probabilistic forecast, and it decreases as information is released closer to an event's occurrence (or non-occurrence). Consequently, an expert forecaster able to process all information currently available to him (but not information useful for the forecasting problem that he does not yet have access to) will become more confident about whether the event will happen as he receives and processes more information about it.

In the scenario below, assume that after an initial probabilistic forecast is made about an event's occurrence, information is disseminated which, when processed, changes the likelihood of the event occurring.

We assume that at time T = t, t > 0, the original probability of the event's occurrence at time T = 0, $p_0$, together with the information disseminated between T = 0 and T = t, if interpreted correctly, comprises everything necessary to make an accurate probabilistic forecast at time T = t.

Let $I_t$ be the information received at step $t$, equal to $s$ with probability 50% and $-s$ with probability 50%, where $s$ is the step size. That is, the step size in the random walk on $p$ is $s$:

$$
I_t = \begin{cases}
      s & \text{with probability } 0.5 \\
     -s & \text{with probability } 0.5
   \end{cases}
$$

So $p_t = p_{t-1} + I_t$

The expert forecaster fully realises the effect of each piece of information and is able to discern where the walk is heading. That is, writing $q_t$ for the expert's forecast at time $t$, $q_t = p_t \; \forall \; t, I_t$.

In game 1, assume the overconfident forecaster has the same interpretative abilities as the expert, but suffers from the same mental bias as before; i.e. although he can discern the movement of the random walk and its effects, he still doubles his confidence.

That is, the charlatan's halving rule from the die roll example still holds.

Let us show two scenarios. In the first, there is only one piece of information disseminated per time period, and hence the total number of time periods is the first $t$ at which $p_t = 0$ or $p_t = 1$ (i.e. the event finally happens or its opposite does). In the second, there are a fixed number of periods, and multiple pieces of information may be disseminated each time period.
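
A minimal sketch of the first scenario, assuming the walk is stopped as soon as it reaches 0 or 1. The function name `simulate_walk` and its parameters are illustrative, and the position is tracked as an integer number of steps so that the boundaries are hit exactly rather than approximately in floating point.

```python
import random


def simulate_walk(p0=0.5, step=0.05, rng=None):
    """Return the probabilities p_0, p_1, ... up to absorption at 0 or 1."""
    rng = rng or random.Random()
    n_states = round(1 / step)   # number of steps between 0 and 1
    k = round(p0 / step)         # current position, counted in steps
    path = [k / n_states]
    while 0 < k < n_states:
        k += rng.choice((1, -1))  # I_t is +s or -s with equal probability
        path.append(k / n_states)
    return path


path = simulate_walk(rng=random.Random(42))
outcome = "happened" if path[-1] == 1.0 else "did not happen"
print(f"{len(path) - 1} pieces of information; the event {outcome}")
```

Each element of `path` is both the true probability $p_t$ and the expert's forecast at that time.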

As an example of the first scenario, consider the following game: 

A coin is tossed multiple times. Tossing heads is worth +1 point, and tossing tails worth -1 point. Heads wins if 10 points are scored before the score hits 0, and tails wins if the score hits 0 before it gets to 10. A forecaster is predicting the probability of heads winning (NB: the true aleatory probability of heads winning this game is the current score / 10, which can be easily proved through Markov equations with end states at 1 and 0). 
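
The claim that heads wins with probability score / 10 can be checked directly against the Markov equations. The sketch below uses hypothetical helper names and exact fractions: it verifies that the candidate solution satisfies both boundary conditions and the fair-coin recurrence at every interior score. Because that recurrence with two boundary values has a unique solution, passing the check is enough.

```python
from fractions import Fraction

TARGET = 10  # heads wins at 10 points, tails wins at 0


def heads_win_probability(score):
    """Claimed probability that heads wins from the current score."""
    return Fraction(score, TARGET)


def satisfies_markov_equations(h, target=TARGET):
    """Check the boundary conditions and the fair-coin recurrence for h."""
    if h(0) != 0 or h(target) != 1:
        return False
    # From score k the next toss moves to k - 1 or k + 1 with equal probability.
    return all(h(k) == (h(k - 1) + h(k + 1)) / 2 for k in range(1, target))


print(satisfies_markov_equations(heads_win_probability))  # True
```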

The second scenario is analogous to many real world events that are to take place at a certain time or date, or to prediction problems where probabilistic forecasters are to predict whether an event will happen by a certain time or date.

With a step size $s$ of $0.05$ and a starting probability $p_0 = 0.5$, let's show one such random walk to an event that, in this case, did not happen. The red line shows the forecast of the expert, and the blue line the forecast of the overconfident forecaster. We can see that, because the outcome that was more likely than not for most of the forecasting period is the one that ultimately occurred, the average Brier Score of the overconfident forecast was lower than that of the fair forecast. This is to be expected: the overconfident score is lower whenever an outcome is forecast with higher confidence than its actual probability of occurrence and then does occur, which happens many times by chance.

![image.png](attachment:image.png)

We convert this random walk into an example of the second scenario, by fixing a certain number of time periods t, and for a total number of steps on the random walk z, allocating z/t steps to each time period. If we fix 10 time periods, then there are approximately 22 pieces of information disseminated per time period. $p_t$ for each t is then determined by where the walk has got to. Let's see the expert and overconfident forecasts in this scenario. Again the overconfident forecaster has a lower Brier Score, but again it's one data point. 
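
One way to carry out this conversion is sketched below. The function name `sample_fixed_periods` is illustrative. Two assumptions go beyond the text: the forecast at time 0 is kept, and each period's forecast is read off at the end of that period, with the step index rounded when z/t is not a whole number.

```python
def sample_fixed_periods(path, n_periods=10):
    """Reduce a full walk to the initial forecast plus one forecast per period."""
    n_steps = len(path) - 1
    forecasts = [path[0]]
    for period in range(1, n_periods + 1):
        # Index of the last step allocated to this period.
        idx = round(period * n_steps / n_periods)
        forecasts.append(path[idx])
    return forecasts


# A toy walk that falls steadily from 0.5 to 0 in ten steps.
toy_path = [i / 20 for i in range(10, -1, -1)]
print(sample_fixed_periods(toy_path, n_periods=5))
# [0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
```

The final period always ends on the last point of the walk, so the resolved outcome is the last forecast in the list.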

![image.png](attachment:image.png)

Now let's look at the forecasting performance for 10000 simulated scenarios, each with a starting probability of 0.5 (i.e. the first forecast was made before either forecaster knew anything about the problem that they were forecasting).

Brier Scores| Expert Forecaster | Charlatan
------------|-------------------|----------
Variable number of forecasts|0.3492|0.3494
Fixed number of forecasts|0.3311|0.3313

In this case we can see that there's virtually no difference in the *average* Brier Score between the expert and the charlatan. However, a deeper dive into the Brier Scores for each simulation shows that the charlatan has a better Brier Score about 58% of the time. Figure 3 shows how the percentage of simulations in which the charlatan scores better increases as the initial probability of the event occurring moves away from 0.5 in either direction. Since the averages are nonetheless almost equal at 0.5, this intuitively means that when the charlatan is wrong, he's very wrong, as confirmed by Figure 4, which shows the difference between the Brier Score of the expert and the charlatan in each simulation (a positive difference means the charlatan had a lower, or better, Brier Score). We see that although the charlatan has a lower Brier Score more often, the difference is on average about 40% larger (0.1064 against 0.0759 when the initial probability is 50%) when the expert has the lower score.
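
A minimal Monte Carlo sketch of the variable-length scenario follows. It is not the code behind the table, and several choices are assumptions: the final, resolved forecast (0 or 1) is included in each average, the charlatan leaves a probability of exactly 0.5 unchanged, and only 2000 runs with a fixed seed are used to keep it quick. Its output should therefore not be expected to reproduce the table exactly.

```python
import random


def charlatan_forecast(p):
    """Halve the probability of whichever outcome is less likely (0.5 stays put)."""
    if p > 0.5:
        return 1.0 - (1.0 - p) / 2.0
    if p < 0.5:
        return p / 2.0
    return 0.5


def brier(outcome, forecast):
    """Two-term Brier Score for a binary event; ranges from 0 to 2."""
    return (outcome - forecast) ** 2 + ((1 - outcome) - (1 - forecast)) ** 2


def simulate_walk(rng, start=10, n_states=20):
    """Walk on the integers 0..n_states until absorbed; return the probabilities."""
    k = start
    path = [k / n_states]
    while 0 < k < n_states:
        k += rng.choice((1, -1))
        path.append(k / n_states)
    return path


def mean_brier(forecasts, outcome):
    """Average Brier Score over every forecast made during one walk."""
    return sum(brier(outcome, f) for f in forecasts) / len(forecasts)


def run_simulations(n_sims, seed=0):
    """Return (expert average, charlatan average, share of runs the charlatan wins)."""
    rng = random.Random(seed)
    expert_total = charlatan_total = 0.0
    charlatan_wins = 0
    for _ in range(n_sims):
        path = simulate_walk(rng)
        outcome = int(path[-1] == 1.0)
        expert = mean_brier(path, outcome)
        charlatan = mean_brier([charlatan_forecast(p) for p in path], outcome)
        expert_total += expert
        charlatan_total += charlatan
        charlatan_wins += charlatan < expert
    return expert_total / n_sims, charlatan_total / n_sims, charlatan_wins / n_sims


expert_avg, charlatan_avg, charlatan_share = run_simulations(2000)
print(f"expert {expert_avg:.4f}, charlatan {charlatan_avg:.4f}, "
      f"charlatan better in {charlatan_share:.1%} of runs")
```

Changing `start` in `simulate_walk` moves the initial probability away from 0.5 in steps of 0.05.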

**Figure 3. Performance of expert and charlatan forecasts for different initial probability of event's occurrence**

Initial probability of event | Average Brier Score (expert) | Average Brier Score (charlatan) | Percentage of simulations in which expert has better Brier Score | Percentage of simulations in which charlatan has better Brier Score | Average Brier Score difference when expert better | Average Brier Score difference when charlatan better |
----|----|----|---|----|----|----|
0.1 |0.1136|0.1049|12.64%|87.36%|0.1165|0.0268|
0.2 |0.2115|0.1989|24.22%|75.78%|0.1182|0.0543|
0.3 |0.2873|0.2789|34.76%|65.24%|0.1189|0.0761|
0.4 |0.3315|0.3289|40.63%|59.37%|0.1119|0.0810|
0.5 |0.3492|0.3494|41.76%|58.24%|0.1064|0.0759|
0.6 |0.3320|0.3296|40.77%|59.23%|0.1115|0.0809|
0.7 |0.2853|0.2755|34.55%|65.45%|0.1147|0.0755|
0.8 |0.2123|0.2002|24.43%|75.57%|0.1179|0.0542|
0.9 |0.1116|0.1020|12.07%|87.93%|0.1161|0.0269|
Random|0.|0.|%|%|0.|0.|

![image.png](attachment:image.png)

We can also see in Figure 3 that when the starting probability of an event occurring differs from 0.5, the charlatan actually has a lower average Brier Score than the expert. This is also true when the starting probability is drawn from a $\text{Uniform}(0,1)$ distribution. This is because the charlatan can effectively 'stack' highly confident forecasts of high-probability events, but has the luxury of adjusting the forecast the other way once the alternative becomes more likely.

Now if we look at an example of a simulation where the charlatan has a higher (worse) Brier Score than the expert, we see the following. Both forecasters generally predicted that the event was more likely than not to happen, and then it ultimately did not. This is a classic case in which the charlatan would be able to make two claims: (1) that it was an event that couldn't be foreseen in advance, as evidenced by the fact that all parties forecasted incorrectly (even the expert claimed it was about 60% likely to occur for the majority of the simulation), and hence the charlatan's own bad forecast can be safely disregarded; and (2) that a key piece of knowledge was introduced two days before the event, after which the charlatan adjusted his forecasts more swiftly than the expert anyway.

![image.png](attachment:image.png)

Imagine that this series represents a series of election forecasts by a charlatan and a fairer forecaster, with one event being one candidate winning and the alternative being that candidate losing, and see how easily this could be explained away by the charlatan: "Candidate A was ahead for most of the race, which we predicted, but so did everybody else; then at some point there was a public opinion swing fuelled by {insert contemporaneous event here} at which point once we realised the magnitude of the swing we reacted faster than everybody else and correctly projected that Candidate B would win".

Indeed, see FiveThirtyEight's assessment of polling efforts for the 2020 Democratic primary.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Look familiar (from https://projects.fivethirtyeight.com/2020-primary-forecast/)?

This illustrates the key problem that we have here - we don't have a close relationship between a forecast and its response, so it is hard to validate forecasting correctness. We can create an adjusted Brier Score that gives more weight to later forecasts, which goes some way towards mitigating the fact that systemic uncertainty is only slowly alleviated.
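
As one illustration of what such an adjustment might look like, the sketch below computes a weighted average of the two-term Brier Score across a sequence of forecasts for a single event. The text does not specify a weighting scheme, so the default of linearly increasing weights (1, 2, 3, ...) and the function name `weighted_brier` are assumptions made purely for illustration.

```python
def weighted_brier(forecasts, outcome, weights=None):
    """Weighted mean of two-term Brier Scores over successive forecasts."""
    if weights is None:
        # Assumption: weight grows linearly, so later forecasts count more.
        weights = [i + 1 for i in range(len(forecasts))]
    scores = [
        (outcome - f) ** 2 + ((1 - outcome) - (1 - f)) ** 2 for f in forecasts
    ]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


forecasts = [0.5, 1.0]  # an uninformed early forecast, then a correct late one
print(weighted_brier(forecasts, outcome=1, weights=[1, 1]))  # 0.25
print(weighted_brier(forecasts, outcome=1))
```

With equal weights the early, uninformed forecast counts as much as the late one. With the linear weights its contribution shrinks, so the score depends more on forecasts made once most of the information had arrived.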

----

Further work to show in appendix: 

1) Show that even with an error term, it doesn't matter - overconfident forecast better. 
2) Truncate chain before final forecast. 
3) Randomise starting probability. 