Issue opening h5py arrays #7930
Thanks for your quick reply. Calculating the mean with the client works, but I run into #7926 when saving. Traceback (for the full version see #7926):
I've had the same problem using xarray's open_mfdataset to read large netcdf files. Downgrading Dask to 2.28 solved it for me.
I'm wondering if your original approach would work better now that #7583 is resolved. Are you able to update to the latest dask and try again?
Thank you! Not using distributed:
Using distributed cluster:
Ok, thanks for the report!
I used dask (and xarray) to combine a set of H5py files into a dataframe.
This worked great until I updated dask from 2.28 to 2021.07.1.
If I run the same script now, I always run out of memory just after loading the files, as soon as I do any operation on the dataframe.
I first thought it was an xarray issue (reported here: pydata/xarray#5585).
So I switched to a pure dask solution (based on: https://docs.dask.org/en/latest/array-creation.html)
However, I keep running into the same problem.
I tried 3 different methods:
All with the same result: the kernel died due to memory exhaustion.
I tried different chunk sizes. Smaller chunks just make everything slower (including overloading the RAM); bigger chunks make it faster but still overload the memory. In either case the kernel crashes early in the process due to memory overload.
Also, when trying to save the dataframe, nothing (or just a few kB) is written.
Therefore I suspect the problem is in the creation of the dataframe, not in saving it.
First I suspected the transformation from array to dataframe to be the issue.
But running calculations on the array itself, or saving it, causes the same problem.
That's why I think it has to do with the loading of an array from h5py.
If I am doing something fundamentally wrong here, please let me know. But this code worked great for me on earlier versions of dask.
Thank you!
Here is code to reproduce. I have 16GB RAM, so to reproduce you might need to adjust the number of files loaded if you have more RAM:
Create Datafiles (note this takes a couple of GB disk space)
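The original generator script did not survive extraction; here is a minimal stand-in. The file names, count, shapes, and the dataset name "data" are all assumptions — the real files total several GB.

```python
import h5py
import numpy as np

# Hypothetical generator: writes n_files HDF5 files, each holding one
# 2-D float dataset named "data". Sizes here are tiny placeholders.
n_files = 4
rows, cols = 1000, 100

for i in range(n_files):
    with h5py.File(f"data_{i}.h5", "w") as f:
        f.create_dataset("data", data=np.random.rand(rows, cols))
```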
Setup
Method 1 (Pandas)
Method 2 (ddf per file)
Method 3: combine arrays
Testing ddf
All these methods produce a dask dataframe. However, in the next steps memory fills up completely and the kernel crashes:
output:
[ ] | 0% Completed | 6min 57.9s
output:
[# ] | 3% Completed | 13min 8.7s
output:
[## ] | 5% Completed | 4min 11.9s
output:
[# ] | 3% Completed | 5min 57.9s
Environment: