Memory issues with log likelihood in io_pymc3 #1077

Closed
OriolAbril opened this issue Feb 18, 2020 · 8 comments
@OriolAbril
Member

In the presence of a large number of observations, the log likelihood array cannot be created due to memory limitations. This is partly because the code that stores log likelihood values in io_pymc3 grows lists inside loops, which is not memory efficient; in extreme cases the log likelihood may not fit in memory at all.

I propose a two-part solution, which could later be improved by storing the results on file, with or without dask.

The first part (which would also solve other issues related to io_pymc3) is to add an argument to exclude the log likelihood data. This argument could also be extended to pyro and numpyro; other converters already allow this to be specified explicitly.

The second part is to preallocate the arrays, which will reduce memory usage and also slightly speed up conversion. The fix could be based on pymc-devs/pymc#3556, which fixed a similar issue with posterior predictive sampling.
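To illustrate the difference between the two patterns, here is a minimal sketch (not the actual io_pymc3 code; the function names and shapes are made up for illustration) comparing a list grown inside a loop against a preallocated array filled in place:

```python
import numpy as np

# Illustrative dimensions only.
n_chains, n_draws, n_obs = 2, 10, 5

def gather_with_lists(compute_pointwise):
    # Memory-hungry pattern: grow a Python list of per-draw arrays,
    # then stack at the end, briefly holding two copies of the data.
    rows = []
    for _ in range(n_chains * n_draws):
        rows.append(compute_pointwise())
    return np.stack(rows).reshape(n_chains, n_draws, n_obs)

def gather_preallocated(compute_pointwise):
    # Preallocated alternative: one array, filled in place, so peak
    # memory stays close to the size of the final result.
    out = np.empty((n_chains, n_draws, n_obs))
    for chain in range(n_chains):
        for draw in range(n_draws):
            out[chain, draw] = compute_pointwise()
    return out
```

The preallocated version also avoids the per-append list bookkeeping, which is where the slight conversion speedup would come from.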

@nitishp25
Contributor

Will this argument contain a list of variables for which we calculate log_likelihood?

@OriolAbril
Member Author

I am not sure about this. I think that when errors occur they will affect all variables at once, but it could make sense as a memory-saving measure.

@nitishp25
Contributor

nitishp25 commented Feb 19, 2020

Yes, I mean that keeping it None as the default will avoid generating log_likelihood. But the second part, memory pre-allocation, will have some effect in preventing errors, right (if the user inputs the variables for log_likelihood)? If not, could we use dask arrays instead of numpy?

@OriolAbril
Member Author

I am not sure I follow, sorry; maybe I could have been clearer from the start. There are actually two issues in one here:

The first is memory usage. On this side, the code does work, but it is not very memory efficient. This could eventually raise a MemoryError, but in general what happens is that the computer freezes (one example here). Thinking about this does pose a question, though: what if there are so many observations that the arrays do not fit in memory? Then no memory usage optimization can solve the problem, hence the proposal to allow users to skip log likelihood storage (while always defaulting to storing it). In the long run, storing on file or using dask would allow us to work with data that does not fit in memory; however, plots and algorithms do not work with dask yet, so it doesn't really make sense to look into this as of now. Here, storing one variable instead of two could be relevant in terms of memory usage.

The second is that it is common to get errors when trying to retrieve the log_likelihood data in from_pymc3; examples are #395 or #690. Avoiding the log likelihood computation should solve many if not all such issues. Here, storing one variable instead of two would probably NOT avoid the error (this is a suspicion; it has to be tested anyway).

@nitishp25
Contributor

Sorry, now I understand. I was actually going through the log_likelihood argument of from_pystan and thought something similar would work here; I wasn't aware of the issues log_likelihood created in PyMC3.

So you're saying that a small fix for now could be a boolean argument which defaults to True (compute log_likelihood for all variables) and, when False, skips computing log_likelihood entirely, plus the memory pre-allocation (like PyMC3's posterior_predictive) for optimization?

@OriolAbril
Member Author

Yep, we could extend the argument to accept either a boolean or a list of var_names, but always defaulting to True or something truthy. The tricky part for me is how not to confuse users: in data functions, None means no data in that group, whereas in plots None means all variables. 🤔

@nitishp25
Contributor

nitishp25 commented Feb 20, 2020

Maybe for plots set the default of var_names to "all" and convert it inside the code:

if var_names == "all":
    var_names = None    # internally, None means all variables

so that the user thinks all variables are being plotted and our purpose is also served? Though I think this could confuse developers.

Also, can I work on the issue?

@OriolAbril
Member Author

The None is actually used for convenience (it is internally converted to the list of variable names with _var_names(var_names, data); most plots use this line at some point). I would not touch it, as things would get quite confusing, and nobody seems to have any issue with var_names.

For this case it is probably better to use: None -> all vars; list -> vars in the list; False -> skip the log likelihood group. My guess is that users will generally either use one library (in which case they won't notice this) or use multiple libraries because of their differences, and won't have much problem with this difference in behaviour.
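The None/list/False convention above could be normalized in one place before conversion; this is a hypothetical sketch (the function name, parameters, and behaviour are illustrative, not the actual from_pymc3 API):

```python
def resolve_log_likelihood(log_likelihood, observed_vars):
    """Map the user-facing argument to a list of variable names.

    None or True -> all observed variables (the default behaviour),
    a list       -> only the named variables,
    False        -> empty list, i.e. skip the group entirely.
    """
    if log_likelihood is False:
        return []
    if log_likelihood is None or log_likelihood is True:
        return list(observed_vars)
    # Keep the order of observed_vars for reproducible output.
    requested = set(log_likelihood)
    return [var for var in observed_vars if var in requested]
```

Centralizing the normalization like this would keep the converter loop simple: it only ever iterates over an explicit list, which may be empty.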

You can definitely work on this!
