Memory issues with log likelihood in io_pymc3 #1077

Closed
OriolAbril opened this issue Feb 18, 2020 · 8 comments
@OriolAbril
Member

In the presence of a large number of observations, the log likelihood array cannot be created due to memory limitations. This is partly because the code that stores log likelihood values in io_pymc3 grows lists inside loops, which is not memory efficient; in extreme cases the log likelihood may not fit in memory at all.

I propose a two-part solution, which could later be improved by storing the results on file, with or without dask.

The first part (which would also solve other issues related to io_pymc3) is to add an argument to exclude the log likelihood data. This argument could also be extended to pyro and numpyro; other converters already allow this to be specified explicitly.

The second part is to preallocate the arrays, which will reduce memory usage and also slightly speed up conversion. The fix could be based on pymc-devs/pymc#3556, which fixed a similar issue with posterior predictive sampling.
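To illustrate the difference between the two patterns, here is a minimal sketch (not the actual io_pymc3 code; the function names and shapes are made up for illustration) comparing a list grown inside a loop against a preallocated array filled in place:

```python
import numpy as np

# Illustrative dimensions only.
n_chains, n_draws, n_obs = 2, 10, 5

def gather_with_lists(compute_pointwise):
    # Memory-hungry pattern: grow a Python list of per-draw arrays,
    # then stack at the end, briefly holding two copies of the data.
    rows = []
    for _ in range(n_chains * n_draws):
        rows.append(compute_pointwise())
    return np.stack(rows).reshape(n_chains, n_draws, n_obs)

def gather_preallocated(compute_pointwise):
    # Preallocated alternative: one array, filled in place, so peak
    # memory stays close to the size of the final result.
    out = np.empty((n_chains, n_draws, n_obs))
    for chain in range(n_chains):
        for draw in range(n_draws):
            out[chain, draw] = compute_pointwise()
    return out
```

The preallocated version also avoids the per-append list bookkeeping, which is where the slight conversion speedup would come from.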

@nitishp25
Contributor

Will this argument contain a list of variables for which we calculate log_likelihood?

@OriolAbril
Member Author

I am not sure about this. I think that when errors occur they will affect all variables at once, but it could make sense as a memory-saving measure.

@nitishp25
Contributor

nitishp25 commented Feb 19, 2020

Yes, I mean that keeping it None as the default will avoid generating log_likelihood. But the second part, memory pre-allocation, will have some effect in preventing errors, right (if the user inputs the variables for log_likelihood)? If not, could we use dask arrays instead of numpy?

@OriolAbril
Member Author

I am not sure I follow, sorry; maybe I could have been clearer from the start. There are actually two issues in one here:

The first is memory usage. On this side, the code does work, but it is not very memory efficient. This could eventually raise a MemoryError, but in general what happens is that the computer freezes (one example here). Thinking about this does pose a question, though: what if there are so many observations that the arrays do not fit in memory? Then no memory usage optimization can solve the problem, hence the proposal to allow users to skip log likelihood storage (while always defaulting to storing it). In the long run, storing on file or using dask would allow us to work with data that does not fit in memory; however, plots and algorithms do not work with dask yet, so it doesn't really make sense to look into this as of now. Here, storing one variable instead of two could be relevant in terms of memory usage.

The second is that it is common to get errors when trying to retrieve the log_likelihood data in from_pymc3; examples are #395 or #690. Avoiding the log likelihood computation should solve many if not all such issues. Here, storing one variable instead of two would probably NOT avoid the error (this is a suspicion; it has to be tested anyway).

@nitishp25
Contributor

Sorry, now I understand. I was actually going through the log_likelihood argument of from_pystan and thought something similar would work here; I wasn't aware of the issues log_likelihood created in PyMC3.

So you're saying that a small fix for now could be a boolean argument which defaults to True (compute log_likelihood for all variables) and, when False, skips computing log_likelihood entirely, plus the memory pre-allocation (like PyMC3's posterior_predictive) for optimization?

@OriolAbril
Member Author

Yep, we could extend the argument to accept either a boolean or a list of var_names, but always defaulting to True or something truthy. The tricky part for me is how not to confuse users: in data functions, None means no data in that group, whereas in plots None means all variables. 🤔

@nitishp25
Contributor

nitishp25 commented Feb 20, 2020

Maybe for plots set the default of var_names to "all" and convert it inside the code:

if var_names == "all":
    var_names = None    # internally, None means all variables

so that the user thinks all variables are being plotted and our purpose is also served? Though I think this could confuse developers.

Also, can I work on the issue?

@OriolAbril
Member Author

The None is actually used for convenience (it is internally converted to the list of variable names with _var_names(var_names, data); most plots use this line at some point). I would not touch it, as things would get quite confusing, and nobody seems to have any issue with var_names.

For this case it is probably better to use: None -> all vars; list -> vars in the list; False -> skip the log likelihood group. My guess is that users will generally either use one library (in which case they won't notice this) or use multiple libraries because of their differences, and won't have much problem with this difference in behaviour.
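The None/list/False convention above could be normalized in one place before conversion; this is a hypothetical sketch (the function name, parameters, and behaviour are illustrative, not the actual from_pymc3 API):

```python
def resolve_log_likelihood(log_likelihood, observed_vars):
    """Map the user-facing argument to a list of variable names.

    None or True -> all observed variables (the default behaviour),
    a list       -> only the named variables,
    False        -> empty list, i.e. skip the group entirely.
    """
    if log_likelihood is False:
        return []
    if log_likelihood is None or log_likelihood is True:
        return list(observed_vars)
    # Keep the order of observed_vars for reproducible output.
    requested = set(log_likelihood)
    return [var for var in observed_vars if var in requested]
```

Centralizing the normalization like this would keep the converter loop simple: it only ever iterates over an explicit list, which may be empty.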

You can definitely work on this!
