MemoryError with dask.array.from_tiledb
#5915
Hi Peter, thanks for raising an issue. I think it would probably be helpful to pull in @ihnorton here since it looks like they did most of the TileDB implementation.
Thanks for the ping @jsignell, I will take a look.
Some quick notes here, will look further on Monday:
It looks like this happens because dask.array is trying to cover the shape of the TileDB array, which is the entire requested domain rather than just the nonempty region (a sketch illustrating the distinction follows this comment).

Dask supports … Alternatively, we may be able to add an option in TileDB-Py to take a view on a subset of a dimension (as suggested by @DPeterK), which would make the …
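To illustrate the first point, a minimal sketch (not from the original comment; the array path, sizes and dtype below are invented) showing that the schema domain covers the whole requested range while the nonempty domain covers only the written data:

```python
import numpy as np
import tiledb

uri = "example_dense_array"  # hypothetical array path
dom = tiledb.Domain(
    tiledb.Dim(name="x", domain=(0, 10**12), tile=1000, dtype=np.int64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="a", dtype=np.float64)],
)
tiledb.DenseArray.create(uri, schema)

# Write only a tiny region of the huge requested domain.
with tiledb.DenseArray(uri, mode="w") as A:
    A[0:10] = np.arange(10, dtype=np.float64)

with tiledb.DenseArray(uri) as A:
    print(A.schema.domain.shape)   # full requested domain: (10**12 + 1,)
    print(A.nonempty_domain())     # only the written region: ((0, 9),)
```

The full-domain shape is what dask.array currently tries to cover when building chunks.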
For the MemoryError relating to the chunk list, I believe the dimension size for non-delayed arrays is inherently limited by the memory available on the main node, because chunk lists are calculated eagerly (there are some open issues about task graph size which might also come into play). Hopefully a dask dev will comment if I missed something. The large-dimension scenario might be a good fit for a delayed array or delayed dataframe (given that you will be creating arrays with multiple attributes). The first issue, with the attribute name mismatch leading to a crash, is now fixed in the TileDB-Py dev tree and will be included in the next release.
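A rough sketch of the delayed-array route mentioned above, shown for a single dimension for brevity; the array path, attribute name, chunk size and dtype are assumptions, not taken from the thread:

```python
import dask
import dask.array as da
import numpy as np
import tiledb

uri = "example_dense_array"          # hypothetical array path
attr = "surface_air_pressure"        # hypothetical attribute name
chunk = 4096                         # illustrative chunk length

@dask.delayed
def read_block(start, stop):
    # Open inside the task so nothing is read until the chunk is computed.
    with tiledb.open(uri, attr=attr) as A:
        return A[start:stop]

# Build the dask array over the nonempty domain only, not the full requested domain.
with tiledb.open(uri, attr=attr) as A:
    lo, hi = A.nonempty_domain()[0]

blocks = [
    da.from_delayed(read_block(s, min(s + chunk, hi + 1)),
                    shape=(min(s + chunk, hi + 1) - s,),
                    dtype=np.float64)
    for s in range(lo, hi + 1, chunk)
]
arr = da.concatenate(blocks)
```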
Thanks @jsignell and @ihnorton for engaging with this issue, and apologies for the slight mistake in the example code! I've updated the code example so that the array name is consistent, but I'm glad that mistake led to some improved error reporting. I've just tried using the indexing approach that I tentatively suggested above:

```python
tdb = tiledb.open(tdb_data_path, attr='surface_air_pressure')
points = da.from_array(tdb[0:60096, 0:219, 0:286], chunks)
```

Given that my Jupyter kernel keeps dying, it seems this approach tries to realise all of the indexed TileDB array contents in memory and then write them to a dask array, rather than streaming or proxying the data out of the TileDB array, as I was concerned about above. And yes, this is a large volume of data, but that's sort of the point here!
I've realised that Iris already has a solution for the problem I described just above, in the form of the data proxy classes. Data proxy classes duck-type to being array-like, but only maintain a reference to the array on disk, rather than actually loading the array into memory. The only time the array is loaded into memory is on an indexing request, when only the subset defined by the index is loaded. I've a functional implementation of this for TileDB at https://github.com/informatics-lab/tiledb_netcdf/blob/master/nctotdb/readers.py#L38.
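For reference, a bare-bones sketch of the data-proxy idea; this is not the linked tiledb_netcdf code, just an illustration with made-up names and shapes:

```python
import numpy as np
import tiledb
import dask.array as da


class TileDBDataProxy:
    """Array-like object that holds a reference to an on-disk TileDB array."""

    def __init__(self, uri, attr, shape, dtype):
        self.uri = uri
        self.attr = attr
        self.shape = shape        # duck-typed attributes dask expects
        self.dtype = dtype
        self.ndim = len(shape)

    def __getitem__(self, keys):
        # Data is only read from disk here, and only the requested subset.
        with tiledb.open(self.uri, attr=self.attr) as A:
            return A[keys]


# Usage sketch: wrap the proxy in a dask array; chunks are read lazily on compute.
proxy = TileDBDataProxy("example_array", "surface_air_pressure",
                        shape=(60096, 219, 286), dtype=np.float32)
lazy = da.from_array(proxy, chunks=(1000, 219, 286))
```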
Great! So are you all good? If so I'll close the issue.
I personally am sorted, but others using dask with TileDB in cases like the ones I've described here will still hit memory errors 😉 I feel this might still be worth a fix in dask itself.
Fair enough. I'll leave it open in case it helps others.
What's the summary of the current understanding? Is it the case that

```python
tdb = tiledb.open(tdb_data_path, attr='surface_air_pressure')
points = da.from_array(tdb[0:60096, 0:219, 0:286], chunks)
```

involves materializing the whole `tdb[0:60096, 0:219, 0:286]` slice in memory? What's the next step for Dask to take?
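One way to check empirically (a sketch, not from the thread): slicing a dense TileDB array opened with a single attr returns a plain NumPy array, so the slice is fully realised before `da.from_array` ever sees it. The path, attribute and indices below are illustrative:

```python
import tiledb

tdb = tiledb.open("example_array", attr="surface_air_pressure")
block = tdb[0:10, 0:10, 0:10]       # small slice for illustration
print(type(block))                   # <class 'numpy.ndarray'>
print(block.nbytes)                  # memory already allocated for the slice
```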
Hi @TomAugspurger,

(script and output omitted from this extract)

Some other notes below:

- Running the above script gives a …
- also (note that I am using …)
TileDB allows you to create arrays with an effectively unlimited domain. If you try to create a dask array from such a TileDB array, however, you unsurprisingly hit issues. For example, if you run the code below you get a Segmentation Fault, which I think is hiding a MemoryError (I've seen this error before with similar code to this, but unfortunately can't reproduce it here). Here's some code to demonstrate the problem:
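The code block from the original report did not survive extraction; the following is a hedged reconstruction that matches the description and the output shown below (the array name, tile sizes and dtype are guesses):

```python
# dask_memerror_code.py -- reconstructed reproducer (original code block lost).
import dask.array as da
import numpy as np
import tiledb

uri = "unlimited_domain_array"
max_idx = np.iinfo(np.int64).max - 1000  # an "effectively unlimited" domain

dom = tiledb.Domain(
    tiledb.Dim(name="x", domain=(0, max_idx), tile=10, dtype=np.int64),
    tiledb.Dim(name="y", domain=(0, max_idx), tile=10, dtype=np.int64),
)
schema = tiledb.ArraySchema(domain=dom,
                            attrs=[tiledb.Attr(name="data", dtype=np.float64)])
tiledb.DenseArray.create(uri, schema)

# Only a 10x10 region of the huge domain actually holds data.
with tiledb.DenseArray(uri, mode="w") as A:
    A[0:10, 0:10] = np.random.random((10, 10))

with tiledb.DenseArray(uri) as A:
    print("Nonempty domain:", A.nonempty_domain())

# Crashes: dask tries to build chunks covering the full requested domain,
# not just the nonempty region.
arr = da.from_tiledb(uri, attribute="data", chunks=(10, 10))
```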
And here's an example of running the code:

```
$ python dask_memerror_code.py
Nonempty domain: ((0, 9), (0, 9))
Segmentation fault (core dumped)
```
I think dask is unable to create a Python array large enough for the size of the TileDB array, regardless of the TileDB array's nonempty domain.
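To put rough numbers on that (my own arithmetic, not from the report):

```python
# Back-of-envelope illustration: the per-dimension chunk tuple alone is enormous.
domain_size = 2**63 - 1              # an "effectively unlimited" dimension
chunk_size = 10
n_chunks_per_dim = domain_size // chunk_size + 1   # ~9.2e17 entries per dimension
# dask builds this list of block sizes eagerly, so memory is exhausted long
# before any data is read from TileDB.
print(n_chunks_per_dim)
```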
A possible solution for this would be to add functionality to `da.from_tiledb` to allow the TileDB array to be subset when it's read into a dask array, such as the sketch below, which could then be used when creating the dask array to subset the TileDB array:
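A hedged sketch of what such functionality might look like; the wrapper and its `index` argument are hypothetical and do not exist in dask today:

```python
import dask.array as da
import numpy as np
import tiledb


def from_tiledb_subset(uri, attribute, index, chunks, dtype=np.float64):
    """Build a dask array covering only the `index` region of a TileDB array.

    `index` is a tuple of slices with explicit start/stop values.
    """
    shape = tuple(s.stop - s.start for s in index)

    class _SubsetProxy:
        def __init__(self):
            self.shape = shape
            self.dtype = np.dtype(dtype)
            self.ndim = len(shape)

        def __getitem__(self, keys):
            # Translate block-relative slices back to absolute TileDB coordinates.
            abs_keys = tuple(
                slice(outer.start + (k.start or 0),
                      outer.start + (k.stop if k.stop is not None else length))
                for outer, k, length in zip(index, keys, shape)
            )
            with tiledb.open(uri, attr=attribute) as A:
                return A[abs_keys]

    return da.from_array(_SubsetProxy(), chunks=chunks)


# Hypothetical usage against the array from the report above:
# arr = from_tiledb_subset("unlimited_domain_array", "data",
#                          index=(slice(0, 10), slice(0, 10)), chunks=(10, 10))
```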
I'm not sure, however, whether this would cause the data from the TileDB array to be loaded into local memory, potentially creating a different problem with memory being flooded if the indexed array were itself large.