# Utility or Risk - Back to Optimization

Link to original AM 207 course staff notes: https://am207.github.io/2018spring/wiki/utilityorrisk.html#the-logarithmic-utility-function-and-probabilistic-prediction

-----

## Decision Theory

* The basic idea behind decision theory is that **predictions** (or **actions based on predictions**) are **described by a utility or loss function**, whose values can be **computed given the observed data**.
* Key distributions in the Bayesian scenario (our scenarios of interest)
    * **posterior**: $p(\theta | D)$
    * **posterior predictive**: $p(y^* | D) = \int p(y^* | \theta) \ p(\theta | D) \ d\theta $   
* Components of the decision scenario:
    1. $a \in A$, an **action** from a **set of available actions** for the decision problem
    2. $\omega \in \Omega$, a **state** in the **set of states in the world**
        * In our scenario, $\omega$ could be either $y^*$ or $\theta$, depending on the problem at hand
    3. $p(\omega | D)$, our **current belief about the world**.
        * in practice, can be either the posterior distribution or the posterior predictive distribution
    4. a **utility function** $u(a, \omega) : A \times \Omega \rightarrow \mathbb{R}$. This function **awards a utility to each action $a$, when the state of the world is $\omega$**
* the goal is to **maximize the distribution-averaged (expected) utility $\bar{u}(a)$ over all possible actions**
    * $\bar{u}(a) = \int u(a, \omega) \ p(\omega | D) \ d\omega$
    * so we want to find $\hat{a} = \underset{a}{\arg\max } \ \bar{u}(a)$, the **Bayes action**
* the maximum expected utility is then given by $\bar{u}(\hat{a}, p) = \bar{u}(\hat{a})$
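* we define a **divergence** quantity $d(a, p) = \bar{u}(p, p) - \bar{u}(a, p)$, where $\bar{u}(p, p)$ denotes the maximum expected utility $\bar{u}(\hat{a}, p)$ and $\bar{u}(a, p)$ is the expected utility of an arbitrary action $a$ under $p$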
    * so, **one can think of minimizing $d(a, p)$ with respect to $a$ as a way to get $\hat{a}$**
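
To make the recipe above concrete, here is a minimal sketch that estimates $\bar{u}(a)$ by Monte Carlo and picks the Bayes action from a grid of candidates. Everything specific in it is an assumption for illustration: the belief $p(\omega | D)$ is a made-up Gamma distribution represented by samples, the utility is negative absolute error, and the function names `expected_utility` and `bayes_action` are hypothetical.

```python
import random
import statistics

def expected_utility(action, state_samples, utility):
    """Monte Carlo estimate of u_bar(a), the average of u(a, omega) over p(omega | D)."""
    return sum(utility(action, w) for w in state_samples) / len(state_samples)

def bayes_action(actions, state_samples, utility):
    """Return the candidate action with the largest estimated expected utility."""
    return max(actions, key=lambda a: expected_utility(a, state_samples, utility))

# Hypothetical belief p(omega | D): a skewed Gamma(2, 1), represented by samples.
rng = random.Random(0)
state_samples = [rng.gammavariate(2.0, 1.0) for _ in range(2000)]

# Hypothetical utility: negative absolute error between action and state.
def neg_abs_error(a, w):
    return -abs(a - w)

# Candidate actions: a grid on [0, 8] with spacing 0.05.
actions = [i * 0.05 for i in range(161)]

a_hat = bayes_action(actions, state_samples, neg_abs_error)
u_hat = expected_utility(a_hat, state_samples, neg_abs_error)

# Divergence of a suboptimal grid action (a = 2.0) from the best grid action.
divergence = u_hat - expected_utility(actions[40], state_samples, neg_abs_error)

print("Bayes action on the grid:", a_hat)
print("sample median:", statistics.median(state_samples))
print("divergence of a = 2.0:", divergence)
```

For this particular utility the Bayes action is the median of the belief distribution, so the grid search should land within one grid step of the sample median. The divergence is non-negative by construction, since `a_hat` is the best action on the grid.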

### Risk from the Posterior Predictive

* consider the case where $\omega = y^*$, and we have a model $M$ with respect to which we can define a posterior predictive distribution. Then we can condition on this model in the expressions for $\bar{u}(a)$ and $\bar{u}(\hat{a}, p)$ as follows:

$$\bar{u}(a) = \int u(a, y^*) \ p(y^* | D, M) \ dy^*,$$
$$\bar{u}(\hat{a}, p) = \bar{u}(\hat{a}) = \int u(\hat{a}, y^*) \ p(y^* | D, M) \ dy^*$$

* can make $a$ a function of $x$ as needed
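
In practice the integral over $y^*$ is usually approximated with draws from the posterior predictive. A minimal sketch of how such draws can be produced, assuming a hypothetical Normal posterior on $\theta$ and a Normal likelihood with known standard deviation (the function name and all parameter values are made up for illustration):

```python
import random
import statistics

def sample_posterior_predictive(n_draws, post_mean, post_sd, obs_sd, seed=0):
    """Draw y* in two steps: theta ~ p(theta | D), then y* ~ p(y* | theta)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(post_mean, post_sd)   # one posterior draw of theta
        draws.append(rng.gauss(theta, obs_sd))  # one future observation given theta
    return draws

# Hypothetical posterior N(1.0, 0.5^2) on theta, observation noise sd 2.0.
y_star = sample_posterior_predictive(20000, post_mean=1.0, post_sd=0.5, obs_sd=2.0)

print("predictive mean:", statistics.fmean(y_star))
print("predictive variance:", statistics.pvariance(y_star))
```

In this assumed Normal-Normal setup the posterior predictive is itself Normal with mean `post_mean` and variance `post_sd**2 + obs_sd**2`, which gives a quick sanity check on the draws. Averaging $u(a, y^*)$ over `y_star` then gives a Monte Carlo estimate of $\bar{u}(a)$.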

-----

## Point Prediction

### The squared error loss/utility

* squared error loss is an example of a risk **defined to make a point estimate**
* (I like this, quoted directly: "Given a posterior predictive distribution, **how do you communicate one number to your boss from it?**")
* define the squared error loss of an action, given a predicted value: $$l(a, y^*) = (a - y^*)^2$$
* Expression for **expected loss**: $$\bar{l}(a) = \int (a - y^*)^2 \ p(y^* \mid D, M) \ dy^*$$
* The point that minimizes this expected loss is $$\hat{a} = E_p[y^*]$$
    * Then, the **expected loss just becomes $Var_p[y^*]$** (just plug $\hat{a} = E_p[y^*]$ into the equation for expected loss and follow the calculations through)
* (course staff lecture notes make the claim that "Using such a loss thus indicates that you only care about the first two moments about the distribution, and that there is no gain to considering things like skewness and kurtosis"... **think about just why this is the case**)
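
One way to see both claims is the decomposition $\bar{l}(a) = (a - E_p[y^*])^2 + Var_p[y^*]$, which follows from expanding the square around $E_p[y^*]$: the expected loss depends on the distribution only through its mean and variance. The sketch below checks this numerically on samples from a hypothetical skewed predictive distribution (a log-normal chosen only for illustration; `expected_sq_loss` is a made-up helper name).

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical skewed posterior predictive, represented by samples.
y_star = [rng.lognormvariate(0.0, 0.75) for _ in range(5000)]

def expected_sq_loss(a, samples):
    """Monte Carlo estimate of the expected squared error loss of action a."""
    return sum((a - y) ** 2 for y in samples) / len(samples)

a_hat = statistics.fmean(y_star)         # minimizer: the predictive mean
min_loss = expected_sq_loss(a_hat, y_star)
variance = statistics.pvariance(y_star)  # population variance of the samples

print("loss at the mean:", min_loss)
print("variance:        ", variance)

# Any other action pays exactly (a - a_hat)^2 on top of the variance,
# regardless of how skewed the distribution is.
for a in (a_hat - 1.0, a_hat + 0.3, statistics.median(y_star)):
    excess = expected_sq_loss(a, y_star) - min_loss
    print(f"a = {a:.3f}: excess loss = {excess:.6f}, (a - a_hat)^2 = {(a - a_hat) ** 2:.6f}")
```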

-----

## The Logarithmic Utility Function and Probabilistic Prediction

* **logarithmic utility function** is used for probabilistic prediction when the **unknown state is a future observation $y^*$**
    * goal is to use this utility function to find the distribution of the future observations
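* logarithmic utility function is $$u(a, y^*) = \log a(y^*)$$ where the action $a(\cdot)$ is itself a predictive distribution over $y^*$
* the expected utility then takes the usual form: $$\bar{u}(a) = \int \log a(y^*) \ p(y^* \mid D, M) \ dy^*$$
* the $a$ that maximizes expected utility is just the posterior predictive distribution: $$\hat{a}(y^*) = p(y^* \mid D, M)$$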
* the maximized utility is then just the **negative entropy of the posterior predictive distribution**: $$\bar{u}(\hat{a}) = \int \log(p(y^* \mid D, M)) \ p(y^* \mid D, M) \ dy^*$$
    * then the divergence in this case is just the **KL divergence** $\mathrm{KL}(p \,\|\, a)$ between the posterior predictive $p$ and the action $a$
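
A small numerical check of these statements, using a discrete analogue (sums instead of integrals) as a simplifying assumption. The two distributions `p` and `a` are made-up numbers, and the helper names are hypothetical.

```python
import math

def expected_log_utility(a, p):
    """u_bar(a) = sum_y p(y) * log a(y) for discrete distributions a and p."""
    return sum(p_y * math.log(a_y) for p_y, a_y in zip(p, a) if p_y > 0)

def kl_divergence(p, a):
    """KL(p || a) = sum_y p(y) * log(p(y) / a(y))."""
    return sum(p_y * math.log(p_y / a_y) for p_y, a_y in zip(p, a) if p_y > 0)

p = [0.5, 0.3, 0.2]  # hypothetical posterior predictive over three outcomes
a = [0.4, 0.4, 0.2]  # some other distribution reported as the action

u_max = expected_log_utility(p, p)  # utility of reporting p itself
u_a = expected_log_utility(a, p)    # utility of reporting a instead
entropy = -sum(p_y * math.log(p_y) for p_y in p)
divergence = u_max - u_a

print("maximized utility:", u_max, "negative entropy:", -entropy)
print("divergence:", divergence, "KL(p || a):", kl_divergence(p, a))
```

Reporting `p` itself gives the highest expected log utility, that maximum equals the negative entropy of `p`, and the utility lost by reporting `a` instead matches `kl_divergence(p, a)`.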

### Single prediction vs. multiple prediction

* the context above is akin to trying to predict the marginal predictive distribution
* theoretical derivation is the same for the joint distribution: "consider the joint to be derived step by step from updated posterior predictives, as the new data points 'come in'"

-------

## Predictions with respect to which model/distribution?

* When our goal is to compare models, we can do so without knowing the **"true model"**: "This is the essential idea behind taking the difference in the KL divergences (or divergences of another kind) which allow us to **create a relative scale on which quantities like DIC or WAIC can be compared**"
* define a **generalization utility**: $$\bar{u}_t(\hat{a}) = \int u(\hat{a}, y^*) \ p_t(y^*) \ dy^*$$
    * $p_t(y^*)$ is the true predictive distribution.
    * use $\hat{a}$ since we're already considering the action as optimal w.r.t. a model's posterior predictive
* **True belief distribution**: not included in these notes, but see [Vehtari and Ojanen (2012)](https://projecteuclid.org/euclid.ssu/1356628931) for an exposition
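
To illustrate the relative-scale idea, the sketch below estimates the generalization utility under the log utility for two candidate predictive densities, then takes their difference. It is purely illustrative: it assumes we can sample from $p_t(y^*)$, which is exactly what is unavailable in practice, and the true distribution, the two candidate densities, and the helper names are all made up.

```python
import math
import random

def normal_logpdf(y, mean, sd):
    """Log density of a Normal(mean, sd^2) distribution at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2

def generalization_utility(log_density, true_samples):
    """Monte Carlo estimate of the integral of log a_hat(y*) against p_t(y*)."""
    return sum(log_density(y) for y in true_samples) / len(true_samples)

# Hypothetical true predictive distribution p_t: standard Normal.
rng = random.Random(2)
true_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Two hypothetical model predictive densities acting as a_hat.
u_model_1 = generalization_utility(lambda y: normal_logpdf(y, 0.0, 1.0), true_samples)
u_model_2 = generalization_utility(lambda y: normal_logpdf(y, 0.5, 2.0), true_samples)

# The unknown entropy of p_t cancels in the difference, so models can be
# ranked on a relative scale without evaluating the true density itself.
difference = u_model_1 - u_model_2

print("model 1:", u_model_1)
print("model 2:", u_model_2)
print("difference:", difference)
```

The difference in utilities equals the difference in the two models' KL divergences from $p_t$, because the entropy term of $p_t$ is common to both and drops out.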

### Bayesian model averaging

[TODO]

### Where are the models?

[TODO]

-----

## Model Comparison

* the key thing here is that we will **sort our average utilities in some order**

[TODO]

-----

## Risk from the Posterior: Posterior Points

[TODO]