# Chapter 5

`Original content created by Cam Davidson-Pilon`

`Ported to Julia and Turing by George Crowther`
____

### Would you rather lose an arm or a leg?

Statisticians can be a sour bunch. Instead of considering their winnings, they only measure how much they have lost. In fact, they consider their wins as negative losses. But what's interesting is *how they measure their losses*.

Consider the following example:

> A meteorologist is predicting the probability of a possible hurricane striking his city. He estimates, with 95% confidence, that the probability of it *not* striking is between 99% and 100%. He is very happy with his precision and advises the city that a major evacuation is unnecessary. Unfortunately, the hurricane does strike and the city is flooded.

This stylized example shows the flaw in using a pure accuracy metric to measure outcomes. Using a measure that emphasizes estimation accuracy, while an appealing and *objective* thing to do, misses the point of why you are even performing the statistical inference in the first place: the results of inference. Nassim Taleb, author of *The Black Swan* and *Antifragile*, stresses the importance of the *payoffs* of decisions, *not the accuracy*. Taleb distils this quite succinctly: "I would rather be vaguely right than very wrong."

## Loss Functions

We introduce what statisticians and decision theorists call *loss functions*. A loss function is a function of the true parameter, and an estimate of that parameter.

$$
L(\theta, \hat{\theta}) = f(\theta, \hat{\theta})
$$
The important point of a loss function is that it measures how *bad* our current estimate is: the larger the loss, the worse the estimate is according to the loss function. A simple, and very common, example of a loss function is the *squared-error loss*:
$$
L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2
$$
The squared-error loss function is used in estimators like linear regression, UMVUEs, and many areas of machine learning. We can also consider an asymmetric squared-error loss function, something like:
$$
L(\theta, \hat{\theta}) =
\begin{cases}
(\theta-\hat{\theta})^2 & \hat{\theta} < \theta \\
c(\theta - \hat{\theta})^2 & \hat{\theta}\geq \theta, \quad 0 < c < 1
\end{cases}
$$
which represents that estimating a value larger than the true value is preferable to estimating a value below it. A situation where this might be useful is in estimating web traffic for the next month, where an over-estimated outlook is preferred so as to avoid an underallocation of server resources.
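
As a minimal sketch, the two losses above could be written in Julia as follows. The function names, the default `c = 0.5`, and the traffic numbers are illustrative assumptions, not part of the original text.

```julia
# Symmetric squared-error loss.
squared_error(θ, θ_hat) = (θ - θ_hat)^2

# Asymmetric variant: overestimates (θ_hat ≥ θ) are discounted by a factor c in (0, 1).
# The default c = 0.5 is an arbitrary illustrative choice.
function asymmetric_squared_error(θ, θ_hat; c = 0.5)
    0 < c < 1 || throw(ArgumentError("c must lie strictly between 0 and 1"))
    err = (θ - θ_hat)^2
    return θ_hat < θ ? err : c * err
end

# Hypothetical web-traffic numbers: the true value is 100 (say, thousand visits).
squared_error(100, 90)              # underestimate by 10
asymmetric_squared_error(100, 90)   # same penalty as the symmetric loss
asymmetric_squared_error(100, 110)  # overestimate by 10, penalty scaled by c
```

With these numbers, the overestimate is penalized half as much as the underestimate of the same size, which is the behavior we want when underallocating servers is the costlier mistake.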

A drawback of the squared-error loss is that it puts a disproportionate emphasis on large outliers. This is because the loss increases quadratically, and not linearly, as the estimate moves away. That is, the penalty for being three units away is much smaller than the penalty for being five units away, yet not much larger than the penalty for being one unit away, even though in both cases the difference in distance is the same:
$$
\frac{1^2}{3^2} < \frac{3^2}{5^2}, \text{ although } 3-1 = 5-3
$$
This loss function treats large errors as *very* bad. A more *robust* loss function that increases linearly with the difference is the *absolute-loss*:
$$
L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|
$$
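
To make the contrast concrete, here is a small sketch comparing how the two penalties grow for estimates that are 1, 3 and 5 units from the truth. The true value `θ = 0.0` and the function names are assumptions chosen for illustration.

```julia
squared_error(θ, θ_hat) = (θ - θ_hat)^2
absolute_error(θ, θ_hat) = abs(θ - θ_hat)

θ = 0.0                      # illustrative true value
estimates = [1.0, 3.0, 5.0]  # estimates that are 1, 3 and 5 units away

sq = squared_error.(θ, estimates)    # [1.0, 9.0, 25.0]
ab = absolute_error.(θ, estimates)   # [1.0, 3.0, 5.0]

# Each step moves the estimate 2 units further away, but the squared-error
# penalty grows by 8 and then by 16, while the absolute penalty grows by 2 each time.
diff(sq)  # [8.0, 16.0]
diff(ab)  # [2.0, 2.0]
```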
Other popular loss functions include:
- $L(\theta, \hat{\theta})=\mathbb{1}_{\hat{\theta}\neq\theta}$ is the zero-one loss often used in machine learning classification algorithms.
- $L(\theta, \hat{\theta})=-\theta \log(\hat{\theta}) - (1-\theta)\log(1-\hat{\theta}), \; \theta \in \{0, 1\}, \; \hat{\theta}\in [0,1]$, called the *log-loss*, also used in machine learning.
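
The two losses in the list above can be sketched in Julia as follows. The function names are assumptions, and the log-loss is written in branch form (one term per outcome) so that the endpoints $\hat{\theta} = 0$ and $\hat{\theta} = 1$ do not produce `0 * log(0)`.

```julia
# Zero-one loss: 1 when the estimate differs from the truth, 0 otherwise.
zero_one_loss(θ, θ_hat) = θ_hat != θ ? 1 : 0

# Log-loss for a binary outcome θ ∈ {0, 1} and a predicted probability θ_hat ∈ [0, 1].
function log_loss(θ, θ_hat)
    θ in (0, 1) || throw(ArgumentError("θ must be 0 or 1"))
    0 <= θ_hat <= 1 || throw(ArgumentError("θ_hat must lie in [0, 1]"))
    return θ == 1 ? -log(θ_hat) : -log(1 - θ_hat)
end

zero_one_loss(1, 1)   # 0
zero_one_loss(1, 0)   # 1
log_loss(1, 0.9)      # small loss: confident and correct
log_loss(1, 0.1)      # large loss: confident and wrong
```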

Historically, loss functions have been motivated by 1) mathematical convenience, and 2) robustness to application, i.e., they are objective measures of loss. The first reason has really held back the full breadth of loss functions. With computers being agnostic to mathematical convenience, we are free to design our own loss functions, which we take full advantage of later in this chapter.

With respect to the second point, the above loss functions are indeed objective, in that they are most often a function of the difference between the estimate and the true parameter, independent of the sign or payoff of choosing that estimate. This last point, independence of payoff, causes quite pathological results though. Consider our hurricane example above: the statistician equivalently predicted that the probability of the hurricane striking was between 0% and 1%. But if he had ignored being precise and instead focused on outcomes (99% chance of no flood, 1% chance of flood), he might have advised differently.

By shifting our focus from trying to be incredibly precise about parameter estimation to focusing on the outcomes of our parameter estimation, we can customize our estimates to be optimized for our application. This requires us to design new loss functions that reflect our goals and outcomes. Some examples of more interesting loss functions:
- $L(\theta, \hat{\theta}) = \frac{|\theta-\hat{\theta}|}{\theta(1-\theta)}, \; \hat{\theta}, \theta \in [0, 1]$ emphasizes an estimate closer to 0 or 1 since if the true value of $\theta$ is near 0 or 1, the loss will be very large unless $\hat{\theta}$ is similarly close to 0 or 1. This function might be used by a political pundit whose job requires him or her to give confident "Yes/No" answers. This loss reflects that if the true parameter is close to 1 (for example, if a political outcome is very likely to occur), he or she would want to strongly agree so as not to look like a skeptic.
- $L(\theta, \hat{\theta})=1 - \exp\left(-(\theta-\hat{\theta})^2\right)$ is bounded between 0 and 1 and reflects that the user is indifferent to sufficiently far away estimates. It is similar to the zero-one loss above, but not quite as penalizing to estimates that are close to the true parameter.
- Complicated non-linear loss functions can be programmed:

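      # Piecewise loss: absolute error when θ and θ_hat share a sign,
      # otherwise a squared error scaled by |θ_hat|.
      function loss(θ, θ_hat)
          if θ * θ_hat > 0
              return abs(θ_hat - θ)
          else
              return abs(θ_hat) * (θ - θ_hat)^2
          end
      end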
- Another example is from the book *The Signal and The Noise*. Weather forecasters have an interesting loss function for their predictions.

>People notice one type of mistake - the failure to predict rain - more than another, false alarms. If it rains when it isn't supposed to, they curse the weatherman for ruining their picnic, whereas an unexpectedly sunny day is taken as a serendipitous bonus.

>[The Weather Channel's bias] is limited to slightly exaggerating the probability of rain when it is unlikely to occur - saying there is a 20 percent chance when they know it is really a 5 or 10 percent chance - covering their butts in the case of an unexpected sprinkle.

As you can see, loss functions can be used for good and evil: with great power, comes great - well, you know.
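
As a rough sketch, the pundit-style loss and the bounded exponential loss from the list above could be coded as below. The function names and example values are assumptions made for illustration. The bounded loss is written with the exponent $-(\theta-\hat{\theta})^2$, which is what keeps it between 0 and 1.

```julia
# Pundit-style loss: errors are magnified when the true θ is near 0 or 1.
# Only usable for 0 < θ < 1, since the denominator vanishes at the endpoints.
pundit_loss(θ, θ_hat) = abs(θ - θ_hat) / (θ * (1 - θ))

# Bounded loss: 0 when θ_hat == θ, approaching 1 for estimates far from θ.
bounded_loss(θ, θ_hat) = 1 - exp(-(θ - θ_hat)^2)

pundit_loss(0.5, 0.4)    # error of 0.1 with the true value in the middle
pundit_loss(0.95, 0.85)  # same error of 0.1 with the true value near 1: much larger loss
bounded_loss(0.0, 0.0)   # 0.0
bounded_loss(0.0, 10.0)  # very close to 1
```

The same absolute error costs the pundit roughly five times more when the true value is 0.95 than when it is 0.5, while the bounded loss barely distinguishes between estimates that are far away and those that are very far away.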

## Loss functions in the real world
So far we have been under the unrealistic assumption that we know the true parameter. Of course if we knew the true parameter, bothering to guess an estimate is pointless. Hence a loss function is really only practical when the true parameter is unknown.

In Bayesian inference, we have a mindset that the unknown parameters are really random variables with prior and posterior distributions. Concerning the posterior distribution, a value drawn from it is a *possible* realization of what the true parameter could be. Given that realization, we can compute a loss associated with an estimate. As we have a whole distribution of what the unknown parameter could be (the posterior), we should be more interested in computing the *expected loss* given an estimate. This expected loss is a better estimate of the true loss than comparing the given loss from only a single sample from the posterior.

First, it will be useful to explain a *Bayesian point estimate*. The systems and machinery present in the modern world are not built to accept posterior distributions as input. It is also rude to hand someone a distribution when all they asked for was an estimate. In the course of an individual's day, when faced with uncertainty, we still act by distilling our uncertainty down to a single action. Similarly, we need to distill our posterior distribution down to a single value (or vector in the multivariate case). If the value is chosen intelligently, we can avoid the flaw of frequentist methodologies that mask the uncertainty and provide a more informative result. The value chosen, if from a Bayesian posterior, is a Bayesian point estimate.

Suppose $P(\theta|X)$ is the posterior distribution of $\theta$ after observing data $X$. Then the following function can be understood as the *expected loss of choosing estimate $\hat{\theta}$ to estimate $\theta$*:
$$
l(\hat{\theta}) = E_\theta\left[\;L(\theta, \hat{\theta})\;\right]
$$

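This is also known as the *risk* of estimate $\hat{\theta}$. The subscript $\theta$ under the expectation indicates that $\theta$ is the unknown (random) variable, so the expectation is taken with respect to the posterior $P(\theta|X)$.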
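
One common way to approximate this expectation, assuming we have samples from the posterior, is to average the loss over those samples. The sketch below is illustrative only: the posterior samples are stand-ins drawn from a Normal(2, 1) rather than from a fitted model, and names such as `expected_loss` and `candidates` are assumptions. It also scans a grid of candidate estimates and keeps the one with the smallest approximate expected loss.

```julia
using Random, Statistics

# Stand-in for posterior samples of θ. In practice these would come from an
# MCMC sampler; here we draw from Normal(2, 1) purely for illustration.
rng = MersenneTwister(42)
posterior_samples = 2.0 .+ randn(rng, 10_000)

squared_error(θ, θ_hat) = (θ - θ_hat)^2
absolute_error(θ, θ_hat) = abs(θ - θ_hat)

# Monte Carlo approximation of l(θ_hat) = E_θ[L(θ, θ_hat)]:
# average the loss over the posterior samples.
expected_loss(loss, θ_hat, samples) = mean(loss(θ, θ_hat) for θ in samples)

# Evaluate the expected loss over a grid of candidate estimates and keep the minimizer.
candidates = range(0.0, 4.0; length = 401)
best_sq  = candidates[argmin([expected_loss(squared_error,  c, posterior_samples) for c in candidates])]
best_abs = candidates[argmin([expected_loss(absolute_error, c, posterior_samples) for c in candidates])]

# For squared-error loss the minimizer should sit near the sample mean,
# and for absolute loss near the sample median.
(best_sq, mean(posterior_samples))
(best_abs, median(posterior_samples))
```

The grid search is a deliberately simple choice. For these two losses the minimizers are known in closed form (the posterior mean and median, respectively), which makes them a convenient sanity check before moving to custom losses where no closed form exists.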

    using Bokeh, Distributions

    # Density of Normal(35_000, 7_500) on [0, 60_000]
    x₁ = LinRange(0, 60_000, 200)
    y₁ = pdf.(Normal(35_000, 7_500), x₁)
    fig₁ = figure(width=500, height=150, toolbar_location=nothing, background_fill_color="#eff0f1")
    Bokeh.plot!(fig₁, VArea, x=x₁, y1=0.0, y2=y₁, alpha=0.3)
    Bokeh.plot!(fig₁, Line, x=x₁, y=y₁)

    # Density of Normal(3_000, 500) on [0, 10_000]
    x₂ = LinRange(0, 10_000, 200)
    y₂ = pdf.(Normal(3_000, 500), x₂)
    fig₂ = figure(width=500, height=150, toolbar_location=nothing, background_fill_color="#eff0f1")
    Bokeh.plot!(fig₂, VArea, x=x₂, y1=0.0, y2=y₂, alpha=0.3, color="red")
    Bokeh.plot!(fig₂, Line, x=x₂, y=y₂, color="red")

    # Density of Normal(12_000, 3_000) on [0, 25_000]
    x₃ = LinRange(0, 25_000, 200)
    y₃ = pdf.(Normal(12_000, 3_000), x₃)
    fig₃ = figure(width=500, height=150, toolbar_location=nothing, background_fill_color="#eff0f1")
    Bokeh.plot!(fig₃, VArea, x=x₃, y1=0.0, y2=y₃, alpha=0.3)
    Bokeh.plot!(fig₃, Line, x=x₃, y=y₃)

    # Stack the three figures vertically
    column([fig₁, fig₂, fig₃])

    # Piecewise loss: absolute error when θ and θ_hat share a sign,
    # otherwise a squared error scaled by |θ_hat|.
    function loss(θ, θ_hat)
        if θ * θ_hat > 0
            return abs(θ_hat - θ)
        else
            return abs(θ_hat) * (θ - θ_hat)^2
        end
    end