Additional LOO functionality #2059

Open · 4 tasks

sethaxen opened this issue Jun 23, 2022 · 6 comments

Comments

@sethaxen
Member

sethaxen commented Jun 23, 2022

There are several useful functions that can be used for model assessment and comparison that could be added here. In some cases these are already available in loo or soon will be:

  • loo_expectation(data, values, kwargs...): compute the expectation of values with respect to the leave-one-out posteriors using PSIS. E.g., if values are likelihoods, this computes the pointwise ELPD, though less efficiently and stably than loo does. If values are posterior predictions, the result is the leave-one-out posterior predictive means. Exists in loo as E_loo and was previously proposed for ArviZ in Is there a LOO R2? #1931 (comment). A minimal sketch of the computation follows this list.
  • logo(data, groups, var_name, kwargs...): compute leave-one-group-out cross-validation. groups is an array of the same shape as the log-likelihood of the variable specified by var_name, minus sample dimensions. The unique group values define new dimensions, and log-likelihoods that share the same group are summed together. The docstring should warn that this is more likely to produce Pareto shape errors than LOO, and it will probably fail if there are group-specific parameters.
  • loo_predictive_error(data, method, kwargs...): use loo_expectation to compute point estimates of the LOO posterior-predictive means, then use the name or function method to compute pointwise and mean predictive error across the data points. Coming to loo: Additional loo utilities stan-dev/loo#202
  • loo_crps(idata, scale=False, kwargs...): use loo_expectation to compute the continuous ranked probability score (CRPS) or its scaled variant (SCRPS), another strictly proper scoring rule besides the log score (ELPD) to use for model comparison. Requires 2 posterior predictive draws for each posterior draw, so blocked by Supporting multiple posterior predictive draws for each posterior draw #2048. Coming to loo: CRPS and SCRPS stan-dev/loo#203
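For reference, a minimal numpy sketch of the self-normalized computation loo_expectation would perform, built on arviz's existing psislw (the synthetic arrays and the e_loo name are placeholders for illustration, not a proposed API):

import numpy as np
import arviz as az

rng = np.random.default_rng(0)
n_obs, n_samples = 50, 1000
log_lik = rng.normal(size=(n_obs, n_samples))  # pointwise log-likelihood draws
values = rng.normal(size=(n_obs, n_samples))   # quantity whose LOO expectation we want

# Smoothed LOO log-weights; psislw normalizes them per observation,
# so no division by the weight sum is needed afterwards.
log_weights, pareto_k = az.psislw(-log_lik)

# E_loo[values] per observation, combined on the log scale so that tiny
# weights do not underflow before being multiplied by large values.
e_loo = (np.sign(values) * np.exp(np.log(np.abs(values)) + log_weights)).sum(axis=-1)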
@OriolAbril
Member

Generally agree. Not so sure about the logo one. It would basically be

  1. groupby using xarray
  2. call az.loo

I personally think it would be better to document this and have users do the groupby themselves instead of adding a logo function. If we add a function, I expect users will assume that only CV schemes directly compatible with loo or logo are supported, when in fact they can build any combination themselves, and I don't think we can support all possible cases. A draft example of computing logo with loo is https://nbviewer.org/github/OriolAbril/Exploratory-Analysis-of-Bayesian-Models/blob/multi_obs_ic/content/Section_04/Multiple_likelihoods.ipynb#Predicting-team-level-performance (see the sketch below).
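For concreteness, a rough sketch of that two-step recipe on synthetic data (the variable y, dimension obs_id, and coordinate group are hypothetical names; the random log-likelihood values only exercise the mechanics and may trigger khat warnings):

import numpy as np
import xarray as xr
import arviz as az

rng = np.random.default_rng(0)
n_chains, n_draws, n_obs = 4, 500, 30
idata = az.from_dict(
    posterior={"mu": rng.normal(size=(n_chains, n_draws))},
    log_likelihood={"y": rng.normal(size=(n_chains, n_draws, n_obs))},
    dims={"y": ["obs_id"]},
    coords={"obs_id": np.arange(n_obs)},
)
# Hypothetical group labels: 3 groups of 10 observations each.
group = xr.DataArray(np.repeat(["a", "b", "c"], 10), dims="obs_id")
log_lik = idata.log_likelihood["y"].assign_coords(group=group)

# 1. groupby using xarray: sum pointwise log-likelihoods within each group,
#    so each group becomes a single leave-one-group-out "observation".
group_log_lik = log_lik.groupby("group").sum("obs_id")

# 2. call az.loo on the group-level log-likelihood.
idata_logo = az.InferenceData(
    posterior=idata.posterior,
    log_likelihood=group_log_lik.to_dataset(name="y"),
)
print(az.loo(idata_logo))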

@sethaxen
Member Author

Good point about logo. I agree: if we can document it and the code only takes a few lines, then it doesn't make sense to add a dedicated logo implementation. Thanks for the example; it should be straightforward to adapt.

@aadya940
Contributor

Hi @sethaxen, I would like to implement the loo_expectation function in arviz.stats. From what I understand, we could implement it by calculating the weights using az.loo and then using the weights to compute the weighted expectation, something like (values * weights).sum() / weights.sum(). I just wanted to know if I'm on the right track?

@sethaxen
Member Author

That would be one approach. loo computes the log-weights using psislw and then uses them to estimate the ELPD. Technically this function only needs the log-weights, so it would be more efficient to call psislw directly, as loo does.

something like (values * weights).sum()/weights.sum().

Yep, this is the right approach, with a few notes:

  • The log-weights returned by psislw are already normalized, so no need for weights.sum().
  • You probably want to avoid the numerical issues that arise when exp(log_weights[i]) underflows to 0 but exp(log(abs(values[i])) + log_weights[i]) is actually large. Something like (np.sign(values) * np.exp(np.log(np.abs(values)) + log_weights)).sum() might be the right approach.
  • Third, the PSIS paper in Section E describes an approach for diagnosing an expectation computed using importance weights, described below.

Namely, in addition to using psislw(-log_likelihood) to get the LOO log-weights, you would also compute something like

# summands h(theta) * r(theta), handled on the log scale for stability
log_abs_summand = np.log(np.abs(values)) - log_likelihood
max_log_abs_summand = log_abs_summand.max()
# rescale by the maximum before exponentiating so the largest magnitude is 1
summand_scaled = np.sign(values) * np.exp(log_abs_summand - max_log_abs_summand)
tail_right = ...  # select right tail of summand_scaled using approach of https://github.com/arviz-devs/arviz/blob/6b1b2be60cca804d4b0ec43a3303820bfa8785c6/arviz/stats/stats.py#L974-L980
tail_left = ...  # do the same but for `-summand_scaled`
k_hat_right, _ = _gpdfit(tail_right)
k_hat_left, _ = _gpdfit(tail_left) 
k_hat = max(k_hat_right, k_hat_left)

It might be better to refactor _psislw slightly to reduce code duplication. k_hat should at least be used to warn if an expectation will be poorly estimated, but a user may optionally want it returned. Some reworking is probably needed to compute expectations with respect to multiple functions simultaneously. (IIRC, this is effectively what loo_pit does and needs.)
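To make the tail diagnostic concrete, here is a hypothetical helper for a single observation (the function name and signature are made up; _gpdfit is an arviz-private helper whose location may change between versions, imported here purely for illustration; the tail-length rule mirrors psislw assuming reff = 1):

import numpy as np
from arviz.stats.stats import _gpdfit  # private helper; illustration only


def e_loo_khat(values, log_likelihood):
    # values, log_likelihood: 1-D arrays over posterior draws for one observation.
    # Summands h(theta) * r(theta) on the log scale, rescaled for stability.
    log_abs_summand = np.log(np.abs(values)) - log_likelihood
    summand_scaled = np.sign(values) * np.exp(log_abs_summand - log_abs_summand.max())

    # Tail length as in psislw (assuming reff = 1).
    n = len(values)
    tail_len = int(np.ceil(min(n / 5, 3 * np.sqrt(n))))

    khats = []
    for sgn in (1, -1):  # right tail of the summands, then of their negation
        x = np.sort(sgn * summand_scaled)
        cutoff = x[-tail_len - 1]
        # _gpdfit expects sorted exceedances over the cutoff.
        khat, _ = _gpdfit(x[-tail_len:] - cutoff)
        khats.append(khat)
    return max(khats)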

@avehtari does this look right?

FWIW, I would support adding loo_expectation in one PR and then adding the diagnostics in a later PR.

@avehtari
Contributor

Corresponding R functions are in https://github.com/stan-dev/loo/blob/master/R/E_loo.R; see specifically .wmean, .wvar, .wsd, and .wquant. The loo package stores log weights, but for these computations the normalized weights are computed and passed to the .w* functions.
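For reference, rough Python counterparts of those .w* helpers, assuming normalized weights w that sum to 1 (the 1 / (1 - sum(w**2)) variance correction is the standard one for normalized weights, and the quantile version is a simplified weighted empirical CDF; check E_loo.R for loo's exact choices):

import numpy as np


def wmean(x, w):
    # Weighted mean with normalized weights (w.sum() == 1).
    return np.sum(w * x)


def wvar(x, w):
    # Bias-corrected weighted variance for normalized weights.
    return np.sum(w * (x - wmean(x, w)) ** 2) / (1 - np.sum(w ** 2))


def wsd(x, w):
    return np.sqrt(wvar(x, w))


def wquant(x, w, probs):
    # Weighted quantiles by interpolating the weighted empirical CDF.
    order = np.argsort(x)
    x, w = x[order], w[order]
    return np.interp(probs, np.cumsum(w), x)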

For the diagnostics, the loo package now also has a minimum sample size and a sample-size-dependent khat threshold. The new diagnostic summary messages are in a PR (https://github.com/n-kall/posterior/blob/9a59fb57fbd7947a06eab908b3b2209fc60a2146/R/pareto_smooth.R#L611) and will be merged before the next release.

@aadya940
Contributor

@sethaxen Thanks for the quick reply 👍. I'll keep these points in mind and try to submit a draft PR implementing the loo_expectation function; you can review it whenever you have time.
