Memory issues with log likelihood in io_pymc3 #1077
Will this argument contain a list of variables for which we calculate the log_likelihood?
I am not sure about this; I think errors, when present, will affect all variables together, but it could make sense as a memory-saving measure.
Yes, I mean that keeping None as the default will avoid generating it.
I am not sure I follow, sorry; maybe I could have been clearer from the start. There are actually two issues in one here.

The first is memory usage. On this side the code does work, but it is not memory efficient. This can eventually raise a MemoryError, but in general what happens is that the computer freezes (one example here). Thinking about this does pose a question: what if there are so many observations that the arrays do not fit in memory? Then no memory usage optimization can solve the problem, hence the proposal to allow users to skip log likelihood storage (always defaulting to storing it, though). In the long run, storing on file or using dask would allow us to work with data that does not fit in memory; however, plots and algorithms do not work with dask yet, so it doesn't really make sense to look into this as of now. Here, storing one variable instead of two could be relevant in terms of memory usage.

The second is that it is common to get errors when trying to retrieve the log_likelihood data in from_pymc3; examples are #395 or #690. Avoiding the log likelihood computation should solve many if not all such issues. Here, storing one variable instead of two would probably NOT avoid the error (this is a suspicion; it has to be tried anyway).
Sorry, now I understand. So you're saying that a small fix for now could be a boolean argument which defaults to True (compute log_likelihood for all variables) and, when False, skips computing log_likelihood entirely, plus memory preallocation (like PyMC3's posterior_predictive) for optimization?
Yep, and we could extend the argument to accept either a boolean or a list of var_names, always defaulting to True or something truish. The tricky part for me is how not to confuse users: in data functions, None means no data in that group, whereas in plots None means all variables. 🤔
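A minimal sketch of how such an argument could be normalized inside a converter. The helper name, its signature, and the assumption that the converter knows the observed variable names are all illustrative, not the actual ArviZ API:

```python
# Hypothetical helper: normalize a `log_likelihood` argument that may be
# True (store log likelihood for all observed variables), False (skip the
# computation entirely), or an explicit list of variable names.
def expand_log_likelihood_arg(log_likelihood, observed_names):
    if log_likelihood is True:
        return list(observed_names)
    if log_likelihood is False:
        return []
    # Assume an iterable of variable names; keep only known observed variables.
    return [name for name in log_likelihood if name in observed_names]
```

Defaulting to True keeps the current behavior, while False gives users an escape hatch when the computation errors out or the arrays would not fit in memory.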
Maybe for plots we could set a default so that the user sees all variables plotted and our purpose is also solved? Though I think this could confuse developers. Also, can I work on the issue?
The None is actually used for convenience: it is internally converted to the list of all variable names. For this case it is probably better to use a different value. You can definitely work on this!
In the presence of a large number of observations, the log likelihood array cannot be created due to memory limitations. This is partly because the code that stores log likelihood values in io_pymc3 relies on growing lists inside loops, which is not memory efficient and can even reach the point of the log likelihood not fitting in memory.
I propose a two-part solution, which could afterwards be improved by storing the results on file, with or without dask.
The first part (which would also solve other issues related to io_pymc3) is to add an argument to exclude the log likelihood data. This argument could also be extended to pyro and numpyro; other converters already allow being explicit about this.
The second part is to preallocate the arrays, which will reduce memory usage and also slightly speed up conversion. The fix could be based on pymc-devs/pymc#3556, which fixed a similar issue with posterior predictive sampling.
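A rough sketch of the preallocation idea; the function and variable names here are made up for illustration, and the real fix would live inside the io_pymc3 converter:

```python
import numpy as np

# Instead of appending one array per draw to a Python list and stacking at
# the end (which briefly holds two copies of the data), allocate the full
# (chain, draw, *obs_shape) array once and fill it in place.
def collect_pointwise(compute_pointwise, n_chains, n_draws, obs_shape):
    out = np.empty((n_chains, n_draws) + tuple(obs_shape))
    for chain in range(n_chains):
        for draw in range(n_draws):
            out[chain, draw] = compute_pointwise(chain, draw)
    return out
```

For example, with a dummy per-draw function, `collect_pointwise(lambda c, d: np.full((3,), c * 10 + d), 2, 4, (3,))` returns an array of shape `(2, 4, 3)` without ever building an intermediate list.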