The datasets conditioned/gsim_XXX are using too much disk space #9387

Closed
micheles opened this issue Jan 26, 2024 · 5 comments · Fixed by #9540

@micheles
Contributor

micheles commented Jan 26, 2024

As reported by @CatalinaYepes. A solution could be to store them in the .tmp.hdf5 file. Otherwise, we could revert #9094.

@micheles micheles added this to the Engine 3.19.0 milestone Jan 26, 2024
@micheles micheles self-assigned this Jan 26, 2024
@micheles micheles modified the milestones: Engine 3.19.0, Engine 3.20.0 Mar 1, 2024
@micheles
Contributor Author

Actually, the only solution is to reduce the number of sites, since the memory/disk space usage is quadratic in the number of sites.
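
For a rough sense of the scaling (illustrative arithmetic only, assuming the conditioned data includes an N×N covariance matrix of 64-bit floats per GSIM and per IMT; the helper name below is hypothetical):

```python
# Back-of-the-envelope estimate: storage for an N x N float64 matrix.
def covariance_size_gb(num_sites, bytes_per_value=8):
    return num_sites ** 2 * bytes_per_value / 1024 ** 3

for n in (1_000, 10_000, 50_000, 100_000):
    print(f"{n:>7,} sites -> ~{covariance_size_gb(n):.2f} GB")
# ~0.01 GB at 1k sites, ~0.75 GB at 10k, ~18.6 GB at 50k, ~74.5 GB at 100k,
# and that is per GSIM and per IMT, before any copies made while reading.
```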

@raoanirudh
Member

Why doesn't storing them in the .tmp.hdf5 file work? This data is needed only during the calculation and doesn't need to be stored in the final calc.hdf5.

@micheles
Contributor Author

Because you will soon run out of disk space; that is how Cata discovered the issue. Also, once you start storing 100+ GB, reading the data back will kill your calculation (out of memory, or so slow that it is impossible to run). No matter how big your machine is, a quadratic calculation will run out of resources fairly soon. You would need an algorithm that is not quadratic in the number of sites.

@raoanirudh
Member

Reopening this issue, as the problem persists.

The issue is not related to having too many sites in the calculation. It is that the conditioned/mean_covs data now stored in the calc_xxx.hdf5 file is useful only while the calculation is running and can safely be deleted from the datastore once the calculation is completed. The other option would be to store it in the calc_xxx.tmp.hdf5 file instead, which gets deleted at the end of the calculation, since this interim data is of no use to the user after the calculation is over. If the conditioned/mean_covs data is deleted from the datastore, the hdf5 file sizes in oqdata should go back to the regular sizes for scenario calculations that do not involve conditioning.
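
As an illustration of that pattern with plain h5py (hypothetical paths and shapes; this is not the engine's actual datastore API, just the general idea of keeping the interim arrays in a throwaway file):

```python
import os
import h5py
import numpy as np

# Sketch: write the interim conditioned data to a temporary HDF5 file that
# is removed once the calculation is over, so it never inflates the final
# calc_xxx.hdf5 in oqdata.
tmp_path = 'calc_xxx.tmp.hdf5'          # hypothetical path
mean_covs = np.zeros((4, 1000, 1000))   # placeholder interim data

with h5py.File(tmp_path, 'w') as tmp:
    tmp.create_dataset('conditioned/mean_covs', data=mean_covs)

# ... during the calculation, read back only what is needed ...
with h5py.File(tmp_path, 'r') as tmp:
    covs = tmp['conditioned/mean_covs'][:]

os.remove(tmp_path)  # nothing conditioning-specific is left on disk
```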

@raoanirudh raoanirudh reopened this May 15, 2024
@micheles
Contributor Author

micheles commented May 16, 2024

You are partially right @raoanirudh, but my point still stands that calculations with too many sites will be impossible. The only solution I see for Aristotle calculations is to use a large enough region_grid_spacing so that the calculations can run. Then, to avoid wasting too much disk space, we can store the temporary data in _tmp.hdf5 or, even better, only keep it in memory as it was originally, before #9094 (in retrospect, that was a bad idea, trading a decent but not impressive speedup for too much disk space).
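
To make the region_grid_spacing trade-off concrete (illustrative numbers only; the helper names are made up): the number of grid sites scales with the inverse square of the spacing, and the conditioned data scales with the square of the number of sites, so coarsening the grid pays off twice.

```python
# Illustrative only: doubling region_grid_spacing roughly quarters the number
# of sites and shrinks the quadratic conditioned data by a factor of ~16.
def approx_num_sites(region_area_km2, spacing_km):
    return region_area_km2 / spacing_km ** 2

def quadratic_storage_gb(num_sites, bytes_per_value=8):
    return num_sites ** 2 * bytes_per_value / 1024 ** 3

area_km2 = 100_000  # hypothetical region
for spacing in (2.0, 5.0, 10.0):
    n = approx_num_sites(area_km2, spacing)
    print(f"spacing {spacing:4.1f} km -> ~{n:,.0f} sites, "
          f"~{quadratic_storage_gb(n):.2f} GB")
```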
