# Chapter 12 Notes

### Preliminary

Multiple estimators were presented in the previous chapters. <b>Decision Theory</b> is the formal theory used for comparing statistical procedures. 

An estimator chosen from a set of candidates is called a <b>decision rule</b>, and the possible values of a decision rule are called <b>actions</b>. 

A loss function $L(\theta, \hat\theta)$ is a function that measures the discrepancy between $\theta$ and $\hat\theta$. Loss functions are used to quantify the accuracy of an estimator. 

The <b>risk</b> function $R(\theta, \hat\theta)$ is the expected value of the loss: $R(\theta, \hat\theta) = \mathbb{E}_\theta(L(\theta, \hat\theta)) = \int L(\theta, \hat\theta(x))f(x; \theta)dx$. In other words, risk is the average loss over repeated samples drawn with $\theta$ held fixed. 
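
The definition of risk as an average loss can be illustrated by simulation. The sketch below is only an illustration under assumed choices that are not part of the notes: the model is $N(\theta, 1)$, the estimator is the sample mean, the loss is squared error, and the helper names (`monte_carlo_risk`, `sample_mean`) are hypothetical.

```python
import random


def squared_error(theta, theta_hat):
    """Squared error loss L(theta, theta_hat)."""
    return (theta - theta_hat) ** 2


def sample_mean(xs):
    return sum(xs) / len(xs)


def monte_carlo_risk(estimator, theta, n, loss=squared_error, n_sims=20000, seed=0):
    """Approximate R(theta, estimator) by averaging the loss over simulated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        # Assumed model for this sketch: X_1, ..., X_n ~ N(theta, 1)
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += loss(theta, estimator(sample))
    return total / n_sims


# For the sample mean under squared error the exact risk is 1/n,
# so the simulated value should be close to 0.1 here.
risk_at_2 = monte_carlo_risk(sample_mean, theta=2.0, n=10)
print(risk_at_2)
```

Repeating the call for several values of `theta` traces out the risk function, which for this estimator is flat in $\theta$.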

### Comparing Risk Functions 

Comparing risk functions directly to pick the best estimator is inconvenient because the ranking can depend on the value of $\theta$. Instead, a one-number summary of the risk function can be used. Two such summaries are presented: the <b>maximum risk</b> and the <b>Bayes risk</b>.

<b>Maximum risk</b>: the supremum of the risk function over $\theta$. <br> 
$\bar R(\hat\theta) = \sup_{\theta}R(\theta, \hat\theta)$

<b>Bayes risk</b>: weighted average of the risk function, with weights given by a prior $f(\theta)$ on $\theta$:  
$r(f,\hat\theta) = \int R(\theta,\hat\theta)f(\theta)d\theta$


A decision rule that minimizes the Bayes risk is called a <b>Bayes rule</b>. Formally, $\hat\theta$ is a Bayes rule with respect to the prior $f$ if:  $$r(f,\hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta)$$

where the infimum is over all estimators $\tilde\theta$. An estimator that minimizes the maximum risk is called a <b>minimax rule</b>. Formally, $\hat\theta$ is minimax if: 
$$ \sup_{\theta} R(\theta, \hat\theta) = \inf_{\tilde\theta}\sup_{\theta}R(\theta,\tilde\theta) $$
where the infimum is over all estimators $\tilde\theta$. 
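
To make the two summaries concrete, the following sketch computes the maximum risk and the Bayes risk for two estimators of a binomial proportion. Everything specific here is an assumption chosen for illustration: the model $X \sim \text{Binomial}(n, p)$, squared error loss, a uniform prior on $p$, the two estimators (`mle` and `shrunk`), and the grid-based approximations of the supremum and the integral.

```python
import math


def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def risk(estimator, p, n):
    """Exact squared-error risk: average the loss over all outcomes x = 0..n."""
    return sum((estimator(x, n) - p) ** 2 * binom_pmf(x, n, p) for x in range(n + 1))


def max_risk(estimator, n, grid_size=1001):
    """Approximate sup over p of the risk with an evenly spaced grid on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(risk(estimator, p, n) for p in grid)


def bayes_risk_uniform(estimator, n, grid_size=1000):
    """Approximate the Bayes risk under a uniform prior f(p) = 1 (midpoint rule)."""
    h = 1.0 / grid_size
    return sum(risk(estimator, (i + 0.5) * h, n) for i in range(grid_size)) * h


def mle(x, n):
    return x / n


def shrunk(x, n):
    # Hypothetical competitor: posterior mean under a Beta(sqrt(n)/2, sqrt(n)/2) prior
    a = math.sqrt(n) / 2
    return (x + a) / (n + 2 * a)


n = 20
summary = {
    name: (max_risk(est, n), bayes_risk_uniform(est, n))
    for name, est in [("mle", mle), ("shrunk", shrunk)]
}
for name, (mx, br) in summary.items():
    print(name, "max risk:", mx, "Bayes risk (uniform prior):", br)
```

With $n = 20$ the two summaries disagree: `shrunk` has the smaller maximum risk, while `mle` has the slightly smaller Bayes risk under the uniform prior. Which estimator is preferred therefore depends on the summary chosen.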


### Bayes Estimators

<b>Posterior risk</b> definition: 
$ r(\hat\theta|x) = \int L(\theta,\hat\theta)f(\theta|x)d\theta$

<b>Theorem 12.7</b> <br> 
The Bayes risk satisfies: $r(f,\hat\theta) = \int r(\hat\theta|x)m(x)dx$ <br> 
where $m(x) = \int f(x|\theta)f(\theta)d\theta$ is the <b>marginal distribution</b> of $X$. <br> 
Let $\hat\theta(x)$ be the value that minimizes the posterior risk $r(\hat\theta|x)$ for each $x$. Then $\hat\theta$ is the Bayes estimator. 

This theorem is useful because minimizing the Bayes risk directly involves a double integral (over $x$ and $\theta$), whereas minimizing the posterior risk involves a single integral over $\theta$ for each observed $x$. 


<b>Theorem 12.8</b> <br> 
If $L(\theta, \hat\theta) = (\theta-\hat\theta)^2$, then the Bayes estimator is the posterior mean: $\hat\theta(x) = \int \theta f(\theta|x)d\theta = \mathbb{E}(\theta \mid X = x)$

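Proof: differentiate the posterior risk with respect to $\hat\theta$ and set the derivative to zero, using $\int f(\theta|x)d\theta = 1$: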

$ \dfrac{d}{d\hat\theta} r(\hat\theta|x) = \dfrac{d}{d\hat\theta} \int L(\theta,\hat\theta)f(\theta|x)d\theta
= \dfrac{d}{d\hat\theta} \int (\hat\theta-\theta)^2f(\theta|x)d\theta
= \int 2(\hat\theta-\theta)f(\theta|x)d\theta
= -2\int \theta f(\theta|x)d\theta + 2\hat\theta = 0$ <br> 
$\implies \hat\theta  = \int \theta f(\theta|x)d\theta$
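
The result can be checked numerically by minimizing the posterior risk over a grid of candidate values. This is a rough sketch under assumed inputs: a $\text{Beta}(3, 5)$ posterior (hypothetical parameters), midpoint-rule integration, and a candidate grid with step $0.005$.

```python
import math


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def posterior_risk(p_hat, a, b, grid_size=1000):
    """Midpoint-rule approximation of the integral of (p - p_hat)^2 f(p | x) dp."""
    h = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        p = (i + 0.5) * h
        total += (p - p_hat) ** 2 * beta_pdf(p, a, b)
    return total * h


a_post, b_post = 3.0, 5.0  # hypothetical posterior parameters
candidates = [i / 200 for i in range(201)]
best = min(candidates, key=lambda c: posterior_risk(c, a_post, b_post))
posterior_mean = a_post / (a_post + b_post)
print(best, posterior_mean)
```

The grid minimizer coincides with the posterior mean $\alpha'/(\alpha'+\beta')$, as the theorem predicts for squared error loss.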


### Minimax Rule 
<b>Theorem 12.10</b> <br>
Suppose $\hat\theta^f$ is the Bayes rule for some prior $f$: 
$$ r(f,\hat\theta^f) = \inf_{\hat\theta} r(f,\hat\theta) $$
Suppose also that: $$R(\theta,\hat\theta^f) \leq r(f,\hat\theta^f) \qquad\forall \theta$$
Then $\hat\theta^f$ is minimax.

Proof by contradiction: <br> 
Assume $\hat\theta^f$ is not minimax. Then there is an estimator $\hat\theta_0$ with a strictly smaller maximum risk: <br> 
$  \sup_{\theta}R(\theta,\hat\theta_0) < \sup_{\theta}R(\theta,\hat\theta^f) \leq r(f,\hat\theta^f) $,<br> 
where the second inequality is the hypothesis of the theorem. Since the average of a function is always less than or equal to its maximum value, we have that

$r(f,\hat\theta_0) = \int R(\theta, \hat\theta_0)f(\theta)d\theta \leq \sup_\theta R(\theta, \hat\theta_0)$ <br> 
Combining the previous results yields: <br> 
$r(f,\hat\theta_0) \leq \sup_\theta R(\theta, \hat\theta_0) < \sup_{\theta}R(\theta,\hat\theta^f) \leq r(f,\hat\theta^f)$ <br>
so $r(f,\hat\theta_0) < r(f,\hat\theta^f)$, which contradicts the assumption that $\hat\theta^f$ is the Bayes rule, i.e. that it minimizes the Bayes risk.


<br> 
<b>Theorem 12.11</b> <br> 
Let $\hat\theta$ be the Bayes rule for some prior $f$. Suppose further that $\hat\theta$ has constant risk: $R(\theta,\hat\theta) = c$ for all $\theta$. Then $\hat\theta$ is minimax. <br>  
Proof: 
$r(f,\hat\theta) = \int R(\theta,\hat\theta)f(\theta)d\theta = c$. Consequently, $R(\theta,\hat\theta)\leq r(f,\hat\theta)$ for all $\theta$, and the previous theorem implies that $\hat\theta$ is minimax. 
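
A standard use of this theorem is the binomial model with a $\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ prior, whose Bayes rule has constant risk. The sketch below checks that constancy numerically. The model, the prior, and the helper names are assumptions introduced for illustration, not something the notes work through at this point.

```python
import math


def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def risk(estimator, p, n):
    """Exact squared-error risk for X ~ Binomial(n, p)."""
    return sum((estimator(x, n) - p) ** 2 * binom_pmf(x, n, p) for x in range(n + 1))


def bayes_rule(x, n):
    # Posterior mean under the assumed Beta(sqrt(n)/2, sqrt(n)/2) prior
    a = math.sqrt(n) / 2
    return (x + a) / (n + 2 * a)


n = 16
risks = [risk(bayes_rule, p, n) for p in (0.05, 0.25, 0.5, 0.75, 0.95)]
constant = n / (4 * (n + math.sqrt(n)) ** 2)
print(risks, constant)
```

Every entry of `risks` agrees with `constant` up to floating-point error, so the hypothesis of the theorem holds for this estimator and it is minimax under squared error loss.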


<b>Theorem 12.14</b> <br> 
Let $X_1, X_2, \dots, X_n \sim N(\theta,1)$ and let $\hat\theta = \bar X$. Then $\hat\theta$ is minimax with respect to any well-behaved (convex and symmetric) loss function, and it is the only estimator with this property. The result holds only when the parameter space is unrestricted. 


### Maximum likelihood, Minimax, and Bayes 

In short, in most parametric models, with large samples, the MLE is approximately minimax and Bayes. 

### Admissibility 

An estimator is <b>inadmissible</b> if there is another estimator whose risk is never larger, and is strictly smaller for at least one value of $\theta$. In other words, $\hat\theta$ is inadmissible if there exists another estimator $\theta'$ such that  

$$ R(\theta,\theta') \leq R(\theta,\hat\theta) \text{  for all }\theta\text{ and}$$ 
$$ R(\theta,\theta') < R(\theta,\hat\theta) \text{  for at least one value of }\theta$$ 

Otherwise, $\hat\theta$ is admissible. 
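
The two conditions translate directly into a check on a grid of parameter values. The sketch below assumes a single observation $X \sim N(\theta, 1)$ with squared error loss and uses closed-form risks for three hypothetical estimators. A grid check can show that one estimator dominates another on that grid, but failing to find a dominating estimator does not prove admissibility.

```python
def dominates(risk_a, risk_b, thetas, tol=1e-12):
    """True if risk_a <= risk_b at every theta in the grid and < at some theta."""
    never_worse = all(risk_a(t) <= risk_b(t) + tol for t in thetas)
    somewhere_better = any(risk_a(t) < risk_b(t) - tol for t in thetas)
    return never_worse and somewhere_better


# Closed-form squared-error risks for one observation X ~ N(theta, 1)
def risk_x(theta):
    return 1.0  # theta_hat = X: variance 1, no bias


def risk_x_plus_1(theta):
    return 2.0  # theta_hat = X + 1: variance 1 plus squared bias 1


def risk_const_3(theta):
    return (theta - 3.0) ** 2  # theta_hat = 3: no variance, bias 3 - theta


thetas = [i / 10 for i in range(-100, 101)]
print(dominates(risk_x, risk_x_plus_1, thetas))  # X dominates X + 1
print(dominates(risk_x, risk_const_3, thetas))   # X does not dominate the constant
```

Here $X + 1$ is inadmissible because $X$ beats it everywhere. The constant estimator is not dominated by $X$, since its risk is zero at $\theta = 3$.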



# Chapter 12 Exercises


### Exercise 12.1 

(a) 

<b> Part I - Bayes estimator:</b> <br> 
Posterior distribution: <br> 
$f(p|x) = \dfrac{f(x|p)f(p)}{\int f(x|p)f(p)dp}$ <br> 
The prior is Beta distributed while the likelihood is binomial. The Beta prior is therefore conjugate to the likelihood, and the posterior distribution is Beta with parameters:  <br>  
$\alpha' = \alpha+ x $  <br>
$\beta' = \beta + n - x$ <br> (for a single observation $X \sim \text{Binomial}(n, p)$)<br>

To find the Bayes estimator, we could directly apply Theorem 12.8. Alternatively, we could take the longer route and minimize the posterior risk $r(\hat p|x)$: <br> 
$r(\hat p|x) = \int L(p, \hat p)f(p|x) dp = \int (p-\hat p)^2 \dfrac{p^{\alpha'-1}(1-p)^{\beta'-1}}{\text{Beta}(\alpha',\beta')}dp$ <br> 

Taking the derivative with respect to $\hat p$ and setting it to $0$: <br> 
$\dfrac{d}{d\hat p}r(\hat p|x) = -\int 2(p-\hat p) \dfrac{p^{\alpha'-1}(1-p)^{\beta'-1}}{\text{Beta}(\alpha',\beta')}dp = 
2\hat p - 2\int p \dfrac{p^{\alpha'-1}(1-p)^{\beta'-1}}{\text{Beta}(\alpha',\beta')}dp = 
2\hat p - 2 \dfrac{\alpha'}{\alpha' + \beta'}  =0 \implies \hat p = \dfrac{\alpha'}{\alpha' + \beta'}$ <br> 

$$ \implies \hat p = \dfrac{\alpha+ x}{\alpha + \beta + n}$$

<b> Part II - Bayes risk:</b> <br> 
Risk of the Bayes estimator (under squared error loss, the risk is the MSE): <br> 
$R(p,\hat p) = \mathbb{V}(\hat p) + \text{bias}^2(\hat p) = \dfrac{np(1-p)}{(\alpha + \beta + n)^2} + \left(\dfrac{\alpha+ np}{\alpha + \beta + n} - p\right)^2$ <br> 
$= \dfrac{np(1-p)}{(\alpha + \beta + n)^2} + \left(\dfrac{\alpha  - \alpha p -\beta p}{\alpha + \beta + n}\right)^2$ <br> 
$= \dfrac{np(1-p) + (\alpha  - (\alpha+\beta)p)^2}{(\alpha + \beta + n)^2}$ <br> 
$= \dfrac{np - np^2 + \alpha^2 + (\alpha+\beta)^2p^2 - 2\alpha(\alpha+\beta)p}{(\alpha + \beta + n)^2}$ <br> 
$= \dfrac{p^2((\alpha+\beta)^2-n) + p(n - 2\alpha(\alpha+\beta) ) + \alpha^2 }{(\alpha + \beta + n)^2}$

Bayes risk of the Bayes estimator: <br> 

$r(f,\hat p) = \int R(p,\hat p)f(p)dp$ <br>  
$= (\alpha + \beta + n)^{-2}\left(((\alpha+\beta)^2-n)\int p^2f(p)dp + (n - 2\alpha(\alpha+\beta) )\int pf(p)dp + \alpha^2\right)$ <br> 

For a $\text{Beta}(\alpha,\beta)$ prior, $\int pf(p)dp = \dfrac{\alpha}{\alpha+\beta}$ and $\int p^2f(p)dp = \mathbb{V}(p) + \mathbb{E}(p)^2 = \dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} + \left(\dfrac{\alpha}{\alpha+\beta}\right)^2 = \dfrac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$. Substituting: <br> 

$r(f,\hat p) = (\alpha + \beta + n)^{-2}\left(\dfrac{((\alpha+\beta)^2-n)\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} + \dfrac{(n - 2\alpha(\alpha+\beta))\alpha}{\alpha+\beta} + \alpha^2\right)$ <br> 

$= \dfrac{\alpha}{(\alpha+\beta)(\alpha+\beta+1)(\alpha + \beta + n)^{2}}\left( ((\alpha+\beta)^2-n)(\alpha+1) + (n - \alpha(\alpha+\beta))(\alpha+\beta+1) \right)$ <br> 

The bracket expands to $\beta(\alpha+\beta+n)$, so one factor of $(\alpha+\beta+n)$ cancels:

$$ r(f,\hat p) = \dfrac{\alpha\beta}{(\alpha+\beta)(\alpha+\beta+1)(\alpha+\beta+n)}$$
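
Long algebra like this is easy to get wrong, so a numerical check is worthwhile. The sketch below integrates the risk polynomial from Part II against the Beta prior and compares the result with the candidate closed form $\dfrac{\alpha\beta}{(\alpha+\beta)(\alpha+\beta+1)(\alpha+\beta+n)}$, which follows from $\mathbb{E}(p) = \dfrac{\alpha}{\alpha+\beta}$ and $\mathbb{E}(p^2) = \dfrac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$. The parameter values and function names are hypothetical.

```python
import math


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def risk(p, a, b, n):
    """Risk of p_hat = (a + x) / (a + b + n), written as a polynomial in p."""
    s = a + b
    return (p**2 * (s**2 - n) + p * (n - 2 * a * s) + a**2) / (s + n) ** 2


def bayes_risk_numeric(a, b, n, grid_size=20000):
    """Midpoint-rule approximation of the integral of R(p, p_hat) f(p) dp."""
    h = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        p = (i + 0.5) * h
        total += risk(p, a, b, n) * beta_pdf(p, a, b)
    return total * h


def bayes_risk_closed_form(a, b, n):
    s = a + b
    return a * b / (s * (s + 1) * (s + n))


a, b, n = 2.0, 3.0, 10  # hypothetical prior parameters and number of trials
numeric = bayes_risk_numeric(a, b, n)
closed = bayes_risk_closed_form(a, b, n)
print(numeric, closed)
```

The two values agree to within the integration error. Setting $n = 0$ in the closed form recovers the prior variance, which is the Bayes risk of the prior mean when no data are observed.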

(b) 

$f(\lambda|x) = \dfrac{f(x|\lambda)f(\lambda)}{\int f(x|\lambda)f(\lambda)d\lambda}$


The Gamma distribution is a conjugate prior for the Poisson distribution. Given a Gamma prior with shape $\alpha$ and scale $\beta$ (so that the prior mean is $\alpha\beta$), the posterior is Gamma with parameters: <br> 
$\alpha' = \alpha + \sum x_i = \alpha + x$ <br> 
$\beta' = \dfrac{\beta}{n\beta+1}=  \dfrac{\beta}{\beta+1}$ <br> 
(the last equality in each line is for a single observation, $n = 1$)

<b> Part I: Bayes estimator </b> <br> 
When the loss function is squared error, the Bayes estimator is simply the mean of the posterior distribution (Theorem 12.8), hence: <br>
$\hat\lambda = \alpha'\beta' = \dfrac{\beta(\alpha + x)}{\beta+1}$


<b> Part II: Bayes risk </b> <br>
Since the estimator is the Bayes estimator and the loss is squared error, the risk is the MSE of the Bayes estimator: <br> 
$\text{MSE}(\hat\lambda) 
= \mathbb{V}(\hat\lambda) + \text{bias}^2(\hat\lambda) 
= \dfrac{\beta^2}{(\beta+1)^2}\mathbb{V}(X) + \left(\dfrac{\beta\alpha + \beta\mathbb{E}(X)}{\beta+1} - \lambda\right)^2
= \dfrac{\beta^2\lambda}{(\beta+1)^2} + \left(\dfrac{\beta\alpha + \beta\lambda}{\beta+1} - \lambda\right)^2
= \dfrac{\beta^2\lambda}{(\beta+1)^2} + \left(\dfrac{\beta\alpha - \lambda}{\beta+1}\right)^2$
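
The variance-plus-squared-bias decomposition can be checked against the definition of risk by summing the squared error loss over the Poisson probabilities. In this sketch the bias term enters squared, the Gamma prior uses shape $\alpha$ and scale $\beta$, a single observation is assumed, and the parameter values, the truncation point `max_x`, and the function names are all hypothetical.

```python
import math


def poisson_pmf(x, lam):
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def bayes_estimator(x, a, b):
    """Posterior mean for one Poisson observation with a Gamma(shape a, scale b) prior."""
    return b * (a + x) / (b + 1)


def exact_risk(lam, a, b, max_x=200):
    """Squared-error risk from the definition, truncating the Poisson sum at max_x."""
    return sum(
        (bayes_estimator(x, a, b) - lam) ** 2 * poisson_pmf(x, lam)
        for x in range(max_x + 1)
    )


def formula_risk(lam, a, b):
    """Variance plus squared bias of the Bayes estimator."""
    variance = b**2 * lam / (b + 1) ** 2
    bias = (a * b - lam) / (b + 1)
    return variance + bias**2


a, b, lam = 2.0, 1.5, 4.0  # hypothetical values
from_definition = exact_risk(lam, a, b)
from_formula = formula_risk(lam, a, b)
print(from_definition, from_formula)
```

For these values the variance term is $1.44$ and the squared bias is $0.16$, and the sum over the Poisson probabilities reproduces their total.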