### Optimal Compression

"Statistics" is a function of data.

A **sufficient statistic** $t(d)$ has the following property:
> $p(\theta \mid d) = p(\theta \mid t(d))$. Example:<br><br>
  &nbsp;&nbsp;&nbsp;&nbsp;For $d\sim\mathcal{U}(a,b)$ and $\theta = (a,b)$, $p(\theta\mid d) = p(\theta \mid \max(d), \min(d))$.<br>
  &nbsp;&nbsp;&nbsp;&nbsp;So in this case $t(d) = (\max(d), \min(d))$ is a sufficient statistic, and<br>
  &nbsp;&nbsp;&nbsp;&nbsp;$p(\theta \mid t(d)) \propto p(t(d) \mid \theta)\, p(\theta)$.<br>

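The uniform example can be checked numerically. The sketch below is illustrative only: the function name `uniform_log_likelihood` and the two data sets are made up for this note, and the sample size is held fixed. It shows that two different data sets with the same minimum and maximum give identical likelihoods for every $(a,b)$, so with a common prior they give the same posterior.

```python
import math

def uniform_log_likelihood(data, a, b):
    """Log-likelihood of i.i.d. samples drawn from U(a, b)."""
    if a >= b or min(data) < a or max(data) > b:
        return -math.inf  # at least one sample lies outside [a, b]
    return -len(data) * math.log(b - a)

# Two different data sets (hypothetical values) sharing size, min and max.
d1 = [0.2, 0.5, 0.9, 1.7]
d2 = [0.2, 1.1, 1.3, 1.7]

for a, b in [(0.0, 2.0), (0.1, 1.8), (0.3, 2.0)]:
    # The two columns agree: only (min, max) enters the likelihood.
    print(a, b, uniform_log_likelihood(d1, a, b), uniform_log_likelihood(d2, a, b))
```
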
Exact sufficient statistics are rarely available, so let's look at "locally sufficient statistics":<br>
functions of the data that are "nearly" sufficient in the neighborhood of a "fiducial" parameter $\theta_{\text{fid}}$.

Consider the log likelihood in a neighborhood around $\theta_{\text{fid}}$:<br>
$\ln p(d\mid \theta_{\text{fid}}+\Delta \theta) = \ln p(d\mid \theta_{\text{fid}}) + \partial_{\theta_i} \ln p(d\mid \theta_{\text{fid}}) \Delta \theta_i + \frac{1}{2} \partial_{\theta_i}\partial_{\theta_j} \ln p(d\mid \theta_{\text{fid}}) \Delta \theta_i \Delta \theta_j + \cdots \quad \Leftarrow$ "asymptotic" expansion.<br>
- $\partial_{\theta_i} \ln p(d\mid \theta_{\text{fid}}) \Delta \theta_i$ is the leading order term coupling $\Delta \theta$ and $d$, $s(d)_i = \partial_{\theta_i} \ln p(d\mid \theta_{\text{fid}})$ is the Fisher score function;<br>
- $\partial_{\theta_i}\partial_{\theta_j} \ln p(d\mid \theta_{\text{fid}}) = -K_{ij}$, where $K$ is the curvature matrix (minus the Hessian of the log likelihood).<br>

$$
\begin{align*}
\langle s_i(d)\rangle_{d\sim p(d\mid \theta_{\text{fid}})} &= \int \partial_{\theta_i} \ln p \,\big|_{\theta_{\text{fid}}} \; p \,\big|_{\theta_{\text{fid}}} \, \mathrm{d}d \\
&= \int \frac{\partial_{\theta_i} p}{p} \, p \, \mathrm{d}d \\
&= \partial_{\theta_i} \int p(d\mid \theta) \, \mathrm{d}d \,\Big|_{\theta_{\text{fid}}} \\
&= \partial_{\theta_i} 1 = 0
\end{align*}
$$

$\Rightarrow$ The linear term vanishes on average, so the average log likelihood is quadratic around $\theta_{\text{fid}}$ to leading order.

Now let's find $\hat{\theta}$ that maximizes $\ln p(d\mid \hat{\theta})$.

$\partial_{\Delta \theta_i} \ln p \,|_{\Delta \theta = \hat{\theta} - \theta_{\text{fid}}} = 0 = s_i - K_{ij} \Delta \theta_j$

$\hat{\theta}_i = \theta_{\text{fid},i} + (K^{-1})_{ij} s_j$, the "quadratic maximum likelihood" estimate.

Iterating this $\Rightarrow$ Maximum likelihood estimate, Newton-Raphson method.

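A minimal sketch of the iteration, assuming a one-parameter Poisson model for concreteness (the model, the function names and the counts are illustrative choices, not part of the notes). Each pass recomputes the score $s$ and curvature $K$ at the current point and applies the update $\theta \leftarrow \theta + K^{-1}s$. For a Poisson rate the maximum likelihood estimate is the sample mean, which gives something to compare against.

```python
def score(data, lam):
    """s = d/d(lam) of the Poisson log-likelihood sum_k [d_k ln(lam) - lam]."""
    return sum(data) / lam - len(data)

def curvature(data, lam):
    """K = minus the second derivative of the same log-likelihood."""
    return sum(data) / lam**2

def newton_raphson_mle(data, lam_fid, n_iter=20, tol=1e-12):
    """Iterate lam <- lam + K^{-1} s, starting from the fiducial value."""
    lam = lam_fid
    for _ in range(n_iter):
        step = score(data, lam) / curvature(data, lam)
        lam += step
        if abs(step) < tol:
            break
    return lam

counts = [3, 5, 4, 6, 2]                      # hypothetical data
lam_hat = newton_raphson_mle(counts, lam_fid=3.0)
print(lam_hat, sum(counts) / len(counts))     # iterate vs. sample mean
```
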
Let's replace $K$ with its expectation $F$ (the Fisher information matrix), where $F = \langle K \rangle$:

$\hat{\theta}_i = \theta_{\text{fid},i} + (F^{-1})_{ij} s_j$

> Do I lose information due to this? We will show that no other unbiased estimator has lower covariance.
- Unbiased if $\theta_{\text{true}} = \theta_{\text{fid}}$: since $\langle s\rangle = 0$, $\langle \hat{\theta}\rangle = \theta_{\text{fid}}$, thus unbiased;
- Covariance:

$\operatorname{Cov}(\hat{\theta}) = \langle F^{-1}ss^{\top}F^{-1}\rangle = F^{-1}\langle ss^{\top}\rangle F^{-1}$, where:

$$
\begin{align*}
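\langle s_i s_j\rangle &= \int \partial_{\theta_i} \ln p \; \partial_{\theta_j} \ln p \; p \, \mathrm{d}d \\
&= \int \partial_{\theta_i} \ln p \, \frac{\partial_{\theta_j} p}{p} \, p \, \mathrm{d}d \\
&= -\int \partial_{\theta_i}\partial_{\theta_j} \ln p \; p \, \mathrm{d}d \qquad \text{(from } \partial_{\theta_j}\langle s_i\rangle = 0\text{)} \\
&= \langle K_{ij}\rangle = F_{ij}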
\end{align*}
$$
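
The identities $\langle s\rangle = 0$ and $\langle ss^{\top}\rangle = \langle K\rangle = F$ can be checked by Monte Carlo. The sketch below assumes a toy model chosen for this note: a single datum $d\sim\mathcal{N}(0, v)$ with the variance $v$ as the parameter, for which $F = 1/(2v^2)$. The curvature $K$ depends on the data here, so agreement between the three averages is not trivial. The seed, sample count and fiducial value are arbitrary.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is reproducible
v_fid = 2.0              # fiducial variance (illustrative value)
n_samples = 200_000

def score(d, v):
    """s = d/dv ln p for a zero-mean Gaussian with variance v."""
    return -0.5 / v + 0.5 * d**2 / v**2

def curvature(d, v):
    """K = -d^2/dv^2 ln p; note that it depends on the data."""
    return d**2 / v**3 - 0.5 / v**2

samples = [rng.gauss(0.0, math.sqrt(v_fid)) for _ in range(n_samples)]
mean_s = sum(score(d, v_fid) for d in samples) / n_samples
mean_ss = sum(score(d, v_fid) ** 2 for d in samples) / n_samples
mean_K = sum(curvature(d, v_fid) for d in samples) / n_samples
fisher_exact = 0.5 / v_fid**2   # analytic F = 1 / (2 v^2)

# mean_s should be near 0; mean_ss and mean_K should both be near fisher_exact.
print(mean_s, mean_ss, mean_K, fisher_exact)
```

The averages agree only up to Monte Carlo noise, which shrinks like $1/\sqrt{n}$ with the number of samples.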




Thus $\operatorname{Cov}(\hat{\theta})= F^{-1}FF^{-1}=F^{-1}$. Can we do better?

Assuming that $\theta_{\text{fid}}$ is known, let's take any estimator $f(d)$ such that 
$$\langle f(d)\rangle_{d\sim p(d\mid \theta)}=\theta$$
for $\theta$ in a neighborhood of $\theta_{\text{fid}}$. Then $f(d)$ is unbiased.


Consider the joint vector
$$\begin{pmatrix}
f \\ s
\end{pmatrix}$$
and its covariance
$$\operatorname{Cov}\left( \begin{pmatrix}
f \\ s
\end{pmatrix}\right) = \begin{pmatrix}
C_{ff} & C_{fs} \\
C_{sf} & C_{ss}=F
\end{pmatrix}\succeq 0$$
Then 
$$\begin{align*}
\left[C_{fs}\right]_{ij} &= \langle (f_{i} - \langle f_{i}\rangle)(s_{j} - \langle s_{j}\rangle) \rangle \\
&=\langle (f_{i} - \langle f_{i}\rangle)s_j \rangle \\
&=\langle f_{i}s_{j}\rangle \\
&= \int f_{i}(d)\, \partial_{\theta_{j}}\ln p \; p \, \mathrm{d}d \\
&= \int f_{i}(d)\,\frac{\partial_{\theta_{j}}p}{p}\, p \, \mathrm{d}d \\
&= \partial_{\theta_{j}}\int f_{i}(d)\, p \, \mathrm{d}d\\
&= \partial_{\theta_{j}}\langle f_{i}\rangle \\
&= \partial_{\theta_{j}}\theta_{i} \\
&= \delta_{ij}
\end{align*}$$
so $C_{fs} = I$ for any unbiased estimator $f$. Here $s(d)$ is the most informative statistic (near $\theta_{\text{fid}}$), and thus locally sufficient. 

Remember that 
$$C_{f\mid s}=C_{ff}-C_{fs}C_{ss}^{-1}C_{sf}$$ 
is a covariance matrix (the Schur complement of $C_{ss}$), thus $C_{f\mid s}\succeq 0$. With $C_{fs}=C_{sf}=I$ and $C_{ss}=F$, 
$$C_{ff}-F^{-1}\succeq 0 \Longrightarrow C_{ff}\succeq F^{-1}$$
which is the multivariate generalization of the Cramér-Rao bound. 

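As a sanity check, the one-parameter case can be worked out directly. Writing $\sigma_f^2$ for the variance of $f$ (notation introduced here for illustration), and using $\operatorname{Cov}(f,s)=1$ and $\operatorname{Var}(s)=F$ from the steps above, the $2\times 2$ joint covariance must have a non-negative determinant:

$$\begin{align*}
\operatorname{Cov}\begin{pmatrix} f \\ s \end{pmatrix} = \begin{pmatrix} \sigma_f^2 & 1 \\ 1 & F \end{pmatrix} \succeq 0
\;&\Longrightarrow\; \sigma_f^2 F - 1 \ge 0 \\
\;&\Longrightarrow\; \sigma_f^2 \ge \frac{1}{F}
\end{align*}$$
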
Near $\theta_{fid}$, $\hat{\theta}$ or $s$ are locally optimal compressed statistics, taking $\operatorname{dim}(d) \to \operatorname{dim}\theta$. 

**Quick example**:<br><br>
Gaussian likelihood with parameter-dependent mean. 
$$\begin{align*}
    \ln p(d\mid \theta) &= \text{const} -\frac{1}{2} (d-\mu(\theta))^{\top} N^{-1} (d-\mu(\theta)) \\
    \partial_{\theta_{i}} \ln p \,\big|_{\theta_{\text{fid}}} &= \left( \partial_{\theta_{i}}\mu \,\big|_{\theta_{\text{fid}}} \right)^{\top} N^{-1} \left( d-\mu(\theta_{\text{fid}})\right) \\
    &=s_i
\end{align*}$$
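So a locally sufficient statistic is $(\partial_{\theta_{i}}\mu)^{\top}N^{-1}d$; the remaining term $-(\partial_{\theta_{i}}\mu)^{\top}N^{-1}\mu(\theta_{\text{fid}})$ does not depend on the data and can be dropped. 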
$$\begin{align*}
    F_{ij} &= \operatorname{Cov}(s)_{ij} \\
    &= (\partial_{\theta_{i}}\mu)^{\top} N^{-1} \left\langle \left(d-\mu(\theta_{\text{fid}})\right) \left(d-\mu(\theta_{\text{fid}})\right)^{\top}\right\rangle N^{-1}(\partial_{\theta_{j}}\mu) \\
    &=(\partial_{\theta_{i}}\mu)^{\top} N^{-1} N N^{-1}(\partial_{\theta_{j}}\mu) \\
    &=(\partial_{\theta_{i}}\mu)^{\top} N^{-1}(\partial_{\theta_{j}}\mu) 
\end{align*}$$
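
A small sketch of this compression, under assumptions made for illustration only: a straight-line mean $\mu_k(\theta) = \theta_0 + \theta_1 x_k$, a diagonal noise covariance $N$, and made-up values for `x`, `noise_var` and `theta_true`. The helper names `compress` and `quadratic_mle` are hypothetical. The data are generated without noise so the result is easy to read: because the mean is linear in $\theta$, $\hat{\theta} = \theta_{\text{fid}} + F^{-1}s$ recovers the true parameters from any fiducial point.

```python
def compress(d, x, noise_var, theta_fid):
    """Score s and Fisher matrix F for the mean mu_k = theta_0 + theta_1 * x_k."""
    # Derivatives of the mean: d(mu_k)/d(theta_0) = 1, d(mu_k)/d(theta_1) = x_k.
    dmu = [[1.0] * len(x), list(x)]
    resid = [dk - (theta_fid[0] + theta_fid[1] * xk) for dk, xk in zip(d, x)]
    # s_i = (d_i mu)^T N^{-1} (d - mu(theta_fid)), with N diagonal.
    s = [sum(g * r / v for g, r, v in zip(dmu[i], resid, noise_var))
         for i in range(2)]
    # F_ij = (d_i mu)^T N^{-1} (d_j mu).
    F = [[sum(gi * gj / v for gi, gj, v in zip(dmu[i], dmu[j], noise_var))
          for j in range(2)] for i in range(2)]
    return s, F

def quadratic_mle(s, F, theta_fid):
    """theta_hat = theta_fid + F^{-1} s, using the closed-form 2x2 inverse."""
    det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
    step0 = (F[1][1] * s[0] - F[0][1] * s[1]) / det
    step1 = (-F[1][0] * s[0] + F[0][0] * s[1]) / det
    return [theta_fid[0] + step0, theta_fid[1] + step1]

x = [0.0, 1.0, 2.0, 3.0, 4.0]
noise_var = [0.5, 0.5, 1.0, 1.0, 2.0]           # diagonal of N
theta_true = [1.0, -0.5]
d = [theta_true[0] + theta_true[1] * xk for xk in x]  # noise-free data

s, F = compress(d, x, noise_var, theta_fid=[0.0, 0.0])
theta_hat = quadratic_mle(s, F, theta_fid=[0.0, 0.0])
print(theta_hat)   # five data points compressed to two numbers
```

With noisy data, $\hat{\theta}$ would scatter around the true value with covariance $F^{-1}$, in line with the bound derived above.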