Add convenience function for calculating loss (log-likelihood) over data, also information criterion adjusted #7

Blunde1 opened this issue Dec 15, 2023 · 2 comments

Blunde1 commented Dec 15, 2023

The loss should use the triangular structure, and thus the additive nature of the log-likelihood. For each sub log-likelihood, we may add an information criterion component. That is, we seek to evaluate
$$l(u;\hat{\Lambda})=\sum_j l(u_j;\hat{C}_j)$$

and to evaluate
$$E[l(u_{test};\hat{\Lambda})]$$
as
$$E[l(u_{test};\hat{\Lambda})]\approx \sum_j \left( l(u_{j,~train};\hat{C}_j) + IC(\hat{C}_j) \right)$$
where the penalty depends on $j$ and therefore sits inside the sum. For $IC(\hat{C}_j)$ we may try the following (a code sketch follows the list):

  • AIC: $IC(\hat{C}_j)=ne(j)+1$. Easy, standard, and may be computed locally for $\hat{C}_j$ or globally for $\hat{\Lambda}$. Downside: it relies on asymptotics and assumes the population precision lies in the family of precisions being estimated over.
  • AICc: $IC(\hat{C}_j)=k + \frac{k^2+k}{n-k-1}$ for $k=ne(j)+1$. Easy and relatively standard, but may only be computed locally on $\hat{C}_j$. It adjusts for small sample sizes but still requires $n > k+1$, and it carries the same assumption on the population precision as the AIC.
  • TIC: $IC(\hat{C}_j)=\mathrm{tr}\left[\left(\nabla^2 l(u_j;\hat{C}_j)\right)^{-1} \nabla l(u_j;\hat{C}_j)\,\nabla l(u_j;\hat{C}_j)^\top\right]$ can be computed locally or globally. Locally makes sense, as we have access to derivatives from the optimization. It avoids the assumptions on the population precision.
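
A minimal sketch of what this convenience function could look like, assuming each row-wise fit exposes its in-sample loss (negative log-likelihood), sample size, parameter count, and, for the TIC, per-sample scores and the Hessian at the optimum. The `row_fits` container and its attribute names are hypothetical:

```python
import numpy as np

def expected_test_loss(row_fits, criterion="tic"):
    """Approximate E[l(u_test; Lambda_hat)] as
    sum_j ( l(u_j_train; C_j) + IC(C_j) ).

    Assumes l is the negative log-likelihood, so the Hessian at the
    optimum is positive (semi-)definite. Each fit is assumed to expose:
      fit.loss   -- in-sample loss l(u_j; C_j)
      fit.n      -- sample size n
      fit.k      -- number of estimated parameters, k = ne(j) + 1
      fit.scores -- (n, k) per-sample gradients at the optimum (TIC only)
      fit.hess   -- (k, k) average per-sample Hessian (TIC only)
    """
    total = 0.0
    for fit in row_fits:
        if criterion == "aic":
            ic = fit.k
        elif criterion == "aicc":
            if fit.n <= fit.k + 1:
                raise ValueError("AICc requires n > k + 1")
            ic = fit.k + (fit.k**2 + fit.k) / (fit.n - fit.k - 1)
        elif criterion == "tic":
            # tr(J^{-1} K): J is the observed information (Hessian of the
            # loss), K the sample covariance of the per-sample scores.
            K = fit.scores.T @ fit.scores / fit.n
            ic = float(np.trace(np.linalg.solve(fit.hess, K)))
        else:
            raise ValueError(f"unknown criterion: {criterion!r}")
        total += fit.loss + ic
    return total
```

Since the derivatives are already available from the optimization, the TIC branch comes essentially for free per row.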

All of the above employ asymptotic results. Is it possible to use e.g. the bootstrap (in the frequentist domain) to alleviate these assumptions when $n$ is small? One possibility is sketched below.
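
For instance, an Efron-style optimism bootstrap estimates $IC(\hat{C}_j)$ directly as the average gap between a refit model's loss on the original data and its loss on its own bootstrap sample, trading the asymptotic arguments for computation. A minimal sketch, where `fit_fn` and `loss_fn` are hypothetical hooks into the row-wise estimation:

```python
import numpy as np

def bootstrap_ic(u_j, fit_fn, loss_fn, n_boot=200, seed=0):
    """Bootstrap estimate of the optimism IC(C_j) for a single row.

    fit_fn(u)         -- refits the row model, returning parameters
    loss_fn(u, theta) -- evaluates the negative log-likelihood loss
    """
    rng = np.random.default_rng(seed)
    n = u_j.shape[0]
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample rows with replacement
        theta_b = fit_fn(u_j[idx])          # refit on the bootstrap sample
        # optimism: loss on original data minus in-(bootstrap-)sample loss
        gaps.append(loss_fn(u_j, theta_b) - loss_fn(u_j[idx], theta_b))
    return float(np.mean(gaps))
```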

Blunde1 commented Dec 15, 2023

To second order, quite generally we have
$$IC(\theta) = \mathrm{tr}\left(E\left[\nabla_\theta^2 l(u;\hat{\theta})\right] Cov(\hat{\theta})\right)$$
Replacing $Cov(\hat{\theta})$ by the sample estimate is not the best choice. In fact, the "trace inner product" induces the Frobenius norm as the relevant measure of error, and there exist results on adaptive inflation that improve the estimator under this norm.
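
To spell out the norm argument: the trace here is the Frobenius inner product, so Cauchy–Schwarz bounds the error in the penalty by the Frobenius error of the covariance estimate,
$$\left|\mathrm{tr}\left(H\left(\widehat{Cov}(\hat{\theta}) - Cov(\hat{\theta})\right)\right)\right| \le \|H\|_F\,\left\|\widehat{Cov}(\hat{\theta}) - Cov(\hat{\theta})\right\|_F, \qquad H = E\left[\nabla_\theta^2 l(u;\hat{\theta})\right].$$
Any covariance estimator with smaller Frobenius risk than the sample covariance (Ledoit–Wolf-type shrinkage is one such family) therefore yields a better penalty estimate.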

Blunde1 commented Dec 15, 2023

It is not exactly the sample estimator that is employed for $Cov(\hat{\theta})$, but rather the delta method using the sample covariance for $Cov(\nabla_\theta l(u;\hat{\theta}))$. The argument on the "best estimator" above still applies. This is particularly relevant when $p \gg n$ but a global maximum exists (e.g. under L2 regularization of the objective); see the sketch below.
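
A sketch of that delta-method penalty, with a shrinkage estimate of the score covariance standing in for the plain sample covariance; scikit-learn's `LedoitWolf` is one off-the-shelf choice, and the input names are hypothetical:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def delta_method_penalty(scores, hess, shrink=True):
    """IC = tr( E[grad^2 l] Cov(theta_hat) ) via the delta method.

    scores -- (n, k) per-sample gradients of the loss at theta_hat
    hess   -- (k, k) average per-sample Hessian at theta_hat

    With Cov(theta_hat) ~= H^{-1} Cov(grad l) H^{-1} and i.i.d. samples,
    the factors of n cancel and the penalty reduces to tr(hess^{-1} K).
    """
    if shrink:
        # Shrinkage covariance: lower Frobenius risk than the sample
        # covariance, per the norm argument in the previous comment.
        K = LedoitWolf().fit(scores).covariance_
    else:
        K = np.cov(scores, rowvar=False, bias=True)
    return float(np.trace(np.linalg.solve(hess, K)))
```

Note that with the plain sample covariance this collapses to exactly the TIC from the opening comment.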

Note that this (implicitly) answers the question in https://www.tandfonline.com/doi/abs/10.1198/000313006X152207 of why practitioners should care about parameter variance while ignoring bias.
