Add convenience function for calculating loss (log-likelihood) over data, also information criterion adjusted #7

Blunde1 opened this issue Dec 15, 2023 · 2 comments

Blunde1 commented Dec 15, 2023

The loss should use the triangular structure, and thus the additive nature of the log-likelihood. For each sub log-likelihood, we may add an information criterion component. That is, we seek to evaluate
$$l(u;\hat{\Lambda})=\sum_j l(u_j;\hat{C}_j)$$

and to evaluate
$$E[l(u_{test};\hat{\Lambda})]$$
as
$$E[l(u_{test};\hat{\Lambda})]\approx \sum_j \left( l(u_{j,~train};\hat{C}_j) + IC(\hat{C}_j) \right)$$
where the penalty depends on $j$ and therefore sits inside the sum. For $IC(\hat{C}_j)$ we may try the following (a code sketch follows the list):

  • AIC: $IC(\hat{C}_j)=ne(j)+1$. Easy, standard, and may be computed locally for $\hat{C}_j$ or globally for $\hat{\Lambda}$. Downside: it relies on asymptotics and assumes the population precision lies in the family of precisions being estimated over.
  • AICc: $IC(\hat{C}_j)=k + \frac{k^2+k}{n-k-1}$ for $k=ne(j)+1$. Easy and relatively standard, but may only be computed locally on $\hat{C}_j$. It adjusts for small sample sizes but still requires $n > k+1$, and it carries the same assumption on the population precision as the AIC.
  • TIC: $IC(\hat{C}_j)=\mathrm{tr}\left[\left(\nabla^2 l(u_j;\hat{C}_j)\right)^{-1} \nabla l(u_j;\hat{C}_j)\,\nabla l(u_j;\hat{C}_j)^\top\right]$ can be computed locally or globally. Locally makes sense, as we have access to derivatives from the optimization. It avoids the assumptions on the population precision.
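
A minimal sketch of what this convenience function could look like, assuming each row-wise fit exposes its in-sample loss (negative log-likelihood), sample size, parameter count, and, for the TIC, per-sample scores and the Hessian at the optimum. The `row_fits` container and its attribute names are hypothetical:

```python
import numpy as np

def expected_test_loss(row_fits, criterion="tic"):
    """Approximate E[l(u_test; Lambda_hat)] as
    sum_j ( l(u_j_train; C_j) + IC(C_j) ).

    Assumes l is the negative log-likelihood, so the Hessian at the
    optimum is positive (semi-)definite. Each fit is assumed to expose:
      fit.loss   -- in-sample loss l(u_j; C_j)
      fit.n      -- sample size n
      fit.k      -- number of estimated parameters, k = ne(j) + 1
      fit.scores -- (n, k) per-sample gradients at the optimum (TIC only)
      fit.hess   -- (k, k) average per-sample Hessian (TIC only)
    """
    total = 0.0
    for fit in row_fits:
        if criterion == "aic":
            ic = fit.k
        elif criterion == "aicc":
            if fit.n <= fit.k + 1:
                raise ValueError("AICc requires n > k + 1")
            ic = fit.k + (fit.k**2 + fit.k) / (fit.n - fit.k - 1)
        elif criterion == "tic":
            # tr(J^{-1} K): J is the observed information (Hessian of the
            # loss), K the sample covariance of the per-sample scores.
            K = fit.scores.T @ fit.scores / fit.n
            ic = float(np.trace(np.linalg.solve(fit.hess, K)))
        else:
            raise ValueError(f"unknown criterion: {criterion!r}")
        total += fit.loss + ic
    return total
```

Since the derivatives are already available from the optimization, the TIC branch comes essentially for free per row.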

All of the above employ asymptotic results. Is it possible to use e.g. the bootstrap (in the frequentist domain) to alleviate these assumptions when $n$ is small? One possibility is sketched below.
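
For instance, an Efron-style optimism bootstrap estimates $IC(\hat{C}_j)$ directly as the average gap between a refit model's loss on the original data and its loss on its own bootstrap sample, trading the asymptotic arguments for computation. A minimal sketch, where `fit_fn` and `loss_fn` are hypothetical hooks into the row-wise estimation:

```python
import numpy as np

def bootstrap_ic(u_j, fit_fn, loss_fn, n_boot=200, seed=0):
    """Bootstrap estimate of the optimism IC(C_j) for a single row.

    fit_fn(u)         -- refits the row model, returning parameters
    loss_fn(u, theta) -- evaluates the negative log-likelihood loss
    """
    rng = np.random.default_rng(seed)
    n = u_j.shape[0]
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample rows with replacement
        theta_b = fit_fn(u_j[idx])          # refit on the bootstrap sample
        # optimism: loss on original data minus in-(bootstrap-)sample loss
        gaps.append(loss_fn(u_j, theta_b) - loss_fn(u_j[idx], theta_b))
    return float(np.mean(gaps))
```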

Blunde1 commented Dec 15, 2023

To second order, quite generally we have
$$IC(\theta) = \mathrm{tr}\left(E\left[\nabla_\theta^2 l(u;\hat{\theta})\right] Cov(\hat{\theta})\right)$$
Replacing $Cov(\hat{\theta})$ by the sample estimate is not the best choice. In fact, the "trace inner product" induces the Frobenius norm as the relevant measure of error, and there exist results on adaptive inflation that improve the estimator under this norm.
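
To spell out the norm argument: the trace here is the Frobenius inner product, so Cauchy–Schwarz bounds the error in the penalty by the Frobenius error of the covariance estimate,
$$\left|\mathrm{tr}\left(H\left(\widehat{Cov}(\hat{\theta}) - Cov(\hat{\theta})\right)\right)\right| \le \|H\|_F\,\left\|\widehat{Cov}(\hat{\theta}) - Cov(\hat{\theta})\right\|_F, \qquad H = E\left[\nabla_\theta^2 l(u;\hat{\theta})\right].$$
Any covariance estimator with smaller Frobenius risk than the sample covariance (Ledoit–Wolf-type shrinkage is one such family) therefore yields a better penalty estimate.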

Blunde1 commented Dec 15, 2023

It is not exactly the sample estimator that is employed for $Cov(\hat{\theta})$, but rather the delta method using the sample covariance for $Cov(\nabla_\theta l(u;\hat{\theta}))$. The argument on the "best estimator" above still applies. This is particularly relevant when $p \gg n$ but a global maximum exists (e.g. under L2 regularization of the objective); see the sketch below.
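
A sketch of that delta-method penalty, with a shrinkage estimate of the score covariance standing in for the plain sample covariance; scikit-learn's `LedoitWolf` is one off-the-shelf choice, and the input names are hypothetical:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def delta_method_penalty(scores, hess, shrink=True):
    """IC = tr( E[grad^2 l] Cov(theta_hat) ) via the delta method.

    scores -- (n, k) per-sample gradients of the loss at theta_hat
    hess   -- (k, k) average per-sample Hessian at theta_hat

    With Cov(theta_hat) ~= H^{-1} Cov(grad l) H^{-1} and i.i.d. samples,
    the factors of n cancel and the penalty reduces to tr(hess^{-1} K).
    """
    if shrink:
        # Shrinkage covariance: lower Frobenius risk than the sample
        # covariance, per the norm argument in the previous comment.
        K = LedoitWolf().fit(scores).covariance_
    else:
        K = np.cov(scores, rowvar=False, bias=True)
    return float(np.trace(np.linalg.solve(hess, K)))
```

Note that with the plain sample covariance this collapses to exactly the TIC from the opening comment.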

Note that this (implicitly) answers the question in https://www.tandfonline.com/doi/abs/10.1198/000313006X152207 of why practitioners should care about parameter variance while ignoring bias.
