Scheduler fail case: centering data with dask.array #874
Comments
If […] This results in the desired scheduler behavior. In general, we can't assume that recomputing the initial chunks is fast, but for things like […]
Currently […]
In general, it's probably not a good idea to assume that loading data from disk is "fast", although it's certainly a preferable alternative to exhausting memory. It would be nice if we could set up dask to recompute chunks once they start to overflow some memory threshold, which might default to some fraction of the available system memory. The challenge then is figuring out which chunks to throw away. Cachey might have most of the appropriate logic for this.
It would be an interesting intellectual exercise to figure out how to do this generally. Any thoughts on how we could solve this problem if we tracked the number of bytes of each output and the computation times?
One thought would be to pass in a […]
Yes, I think dask.cache/cachey already uses a roughly appropriate metric.
From http://xarray.pydata.org/en/stable/dask.html
@shoyer, does this imply that memory usage for computation of the mean is linearly dependent upon the length of the unlimited dimension? We are having challenging memory issues with computing the mean for high-spatial-resolution datasets, for both short and long time records.
In xarray-space, is there a way I can get the same behavior as a […]
Computing the mean shouldn't be hard. The problem arises when you have a large array interact with the mean, like the following:
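(The original snippet didn't survive extraction; a minimal sketch of the problematic pattern, with a synthetic array standing in for the real dataset:)

```python
import dask.array as da

# stands in for a large array loaded lazily from disk
x = da.ones((100000, 1000), chunks=(1000, 1000))

# every chunk of x is needed twice: once to compute the mean and once
# for the subtraction, so the scheduler pins all of them in memory
y = (x - x.mean(axis=0)).sum()
y.compute()
```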
The current workaround is to explicitly compute the mean beforehand:
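(Again, the original code wasn't preserved; a sketch of the workaround, continuing the example above:)

```python
m = x.mean(axis=0).compute()  # first pass: materialize the mean as a NumPy array
y = (x - m).sum()             # second pass: centering no longer pins every chunk
y.compute()
```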
This will require two passes over the data. Note that this should only be an issue if you don't have enough memory; on a distributed system it might not be a problem.
@mrocklin, the problem is exactly as you imply, but we have the limitation that we are trying really hard to keep compute cost down, so […] For […]
Storing […] You should always have significantly more memory than your chunk size. How much RAM do you have? What are your chunk sizes?
For reference: my notebook has 16 GB of memory and I typically shoot for chunk sizes in the 100 MB range, so there is about a 100x gap there.
Thanks @mrocklin for the quick reply. We are storing […] We have had both problems, but in its simplest form we are having trouble computing just the mean without exceeding memory.

Fundamentally, I think I'm confused about the rules relating chunk size to total memory available. Does the computation performed determine the required ratio of chunk size to total memory? It seems like this is the case, but it is not obvious how to estimate this ratio. Ultimately, it would be good to have a rule of thumb that we can add to http://xarray.pydata.org/en/stable/dask.html. To clarify, how many chunks get loaded into memory for computation of the mean (mixed xarray and dask-esque pseudocode shown below)?
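(The pseudocode from the original comment wasn't preserved; a hypothetical reconstruction, with the file pattern and variable name made up, and a Time chunk size of 10 inferred from the follow-up question:)

```python
import xarray as xr

# open a multi-file dataset chunked along the unlimited (Time) dimension
ds = xr.open_mfdataset("output.*.nc", chunks={"Time": 10})

# reduce over Time; how many of those chunks must be resident at once?
mean = ds["someVariable"].mean(dim="Time").compute()
```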
Does this mean that 10 chunks are loaded into memory to compute the mean, because the chunk size on […]? I suspect I'm wrong in many places here, so please set me on the right conceptual path.
@mrocklin, this is just a good general heuristic, right? Or is this specific to the […]?
Pessimistically, dask will use something like […]
What I'm getting at is that there are two ways to compute the mean that I can foresee:

1. A tree reduction: compute a partial mean for each chunk (the leaves), then combine those partial results pairwise up a tree.
2. A sequential accumulation: stream through adjacent time slices in order, updating a running mean, so each chunk is read from disk exactly once.
Does dask do 1 or 2 or some combination or alternative strategy? |
Strategy 1. However, it tries to walk that tree in a depth-first way, rather than producing all of the leaves first.
That is just for performance, because it is the optimal way to compute on the tree, right? That way some leaves are effectively never needed and we can average a leaf into some averaged-average, correct? OK, this is starting to make sense now.

nco probably does 2, which is why it sometimes seems to work better: if data access at a particular point is really expensive, e.g., getting a data chunk from a file, approach 2 will make more sense because there will be fewer cache misses overall and the primary cost is getting data from disk into memory. Do we have the freedom or recourse to force dask to minimize data accesses somehow (optimally, to force strategy 2 so that adjacent times are computed first)? The issue is that we never get close to using all the cores because we are so disk/memory limited. Maybe this is a setup problem on our end.

I fully recognize I'm asking a lot that is probably not in the dask design here, because there is probably a latent assumption that most of the data is already in memory, since dask is supposed to be more of a threaded/distributed numpy, where it would be unwise to do all the loading and unloading of data in and out of the data structures anyway.
Is a performance solution to artificially limit […]?
If you were to set […]
If you're also computing the mean across Nx and Ny, then I wouldn't anticipate a problem; the intermediates would be very small. I don't think that this is what you're asking, though.
No, the problem is we just need a time mean. If data access were free we could do this with some type of SIMD kernel, e.g., at each point just average over the time dimension. We just have slow drives on HPC. If we had more compute time, dask.distributed would make sense and we could effectively get this to scale as it does on a laptop, but we can't afford the cycles for the analysis, so this obvious solution is inapplicable for our case.
Does this work for you?
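(The suggested snippet wasn't preserved; given the later replies mentioning split_every=2, it was presumably something like this sketch:)

```python
import dask.array as da

x = da.ones((100000, 1000), chunks=(1000, 1000))

# split_every=2 limits the fan-in of the reduction tree, so fewer
# intermediate results are held in memory at any one time
y = x.mean(axis=0, split_every=2)
y.compute()
```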
I haven't tried it yet. To be 100% honest, I'm not sure I understand how to set this option in xarray for the computation; advice on this would be greatly appreciated. Is this one of those times I need to convert from xarray to dask first? Also, is the split going to be over the non-reduced direction first? If so, this is the best solution, but I can hack it by setting the chunk size to span the entire spatial dimension (a sketch of that hack follows below). The problem is that we could have more performance hurdles at larger scales if we don't compute over Nx and Ny first and then over adjacent time steps. So this fixes today's problem, but we could still have a latent issue.
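(A hypothetical illustration of the chunking hack mentioned above; the file pattern, variable, and dimension names are made up:)

```python
import xarray as xr

# a chunk size of -1 means "one chunk spanning the whole dimension", so
# the reduction only ever combines adjacent time chunks
ds = xr.open_mfdataset("output.*.nc").chunk({"Time": 10, "x": -1, "y": -1})
mean = ds["someVariable"].mean(dim="Time").compute()
```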
Yes, I think this is safe, though we should certainly document it (and test it, if it isn't tested already). See pydata/xarray#1360
I would use ds.mean(split_every=2, get=dask.async.get_sync), right? Even with the default multiprocessing scheduler it looks like memory is much better constrained with ds.mean(split_every=2).
get is a keyword to compute. If compute is buried then you'll probably have to use this as a context manager or global.
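(A sketch of the two styles using the current dask API; the thread predates it and passed get= directly, e.g. get=dask.async.get_sync, which later became the "synchronous" scheduler:)

```python
import dask
import dask.array as da

x = da.ones((100000, 1000), chunks=(1000, 1000))

# scheduler passed as a keyword directly to compute
result = x.mean(axis=0, split_every=2).compute(scheduler="synchronous")

# as a context manager/global, for when .compute() is buried in library code
with dask.config.set(scheduler="synchronous"):
    result = x.mean(axis=0, split_every=2).compute()
```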
.compute() in xarray currently doesn't allow for any arguments, but arguably it should, just passing them on to dask.
It looks like we might be able to get by without these more complex approaches. We had some non-cached code (https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/mpas_analysis/shared/climatology/climatology.py#L595) that appears to produce a non-optimal xarray/dask lazy computation resulting in a memory overrun. I'm clearly not as fully versed in dask as I should be, because it wasn't obvious to me that this code would cause problems when I merged it, but following today's discussion it seems reasonable to suspect that combinations of lazy reductions are dangerous and should be "barriered" via a […].

The computation with a simple mean, in contrast, appears to be readily solved by chunking, which is consistent with previous explanations and usage of xarray. So the issue appears, thankfully, to be on our side. Many thanks to @mrocklin and @shoyer for the special help with commands to keep the computations in check.
The main reason is to parallelise execution of stefcal. It could still be a bit of a memory hog: see dask/dask#874. The phase normalisation is changed a bit to avoid using a median, which is not available in dask. Instead, the angles are wrapped such that the branch cut is on the side opposite an arbitrary data point, and then the mean is taken. Provided all the phases are clustered within pi of each other this should be just as good, but it is less robust to a bad outlier (particularly since the "arbitrary" point chosen is the minimum initial angle).
Looks like I'm late to the party on this important issue (I discovered it via pydata/xarray#1832). Is this considered resolved with the […]?
Not sure if this is still interesting to you, @pwolfram or others, but I figured I'd leave this note in case it is. It's definitely possible to check in a result on disk: Dask Array's store can return the stored result as a new Dask Array via return_stored=True. Below is some sample code demonstrating its usage with Zarr. HTH

In [1]: import dask.array as da
In [2]: import zarr
In [3]: a = da.random.random((10, 10), chunks=(4, 4))
In [4]: z = zarr.open_array("data.zarr", shape=a.shape, dtype=a.dtype, chunks=(4, 4))
In [5]: a2 = a.store(z, return_stored=True)
In [6]: a2
Out[6]: dask.array<load-store-50dda186-8bd9-11e8-970a, shape=(10, 10), dtype=float64, chunksize=(4, 4)>

Edit: This is also now documented in this doc section.
I know that persisting data on disk (or in memory) is the currently recommended solution here, but it occurred to me that another good tool for fixing these sorts of issues would be a way to explicitly "duplicate" all of the nodes in the graph underlying a dask collection. That way, there couldn't be any undesirable caching, because they would be totally separate tasks. I opened a new issue to discuss this proposal: #6674
Something else people here might want to play with is graphchain. Admittedly this is still storing results in memory, but it does so without one needing to specify what to persist. |
People following this issue may be interested in advanced graph manipulation functionality that has been added somewhat recently.
A common use case for many modeling problems (e.g., in machine learning or climate science) is to center data by subtracting an average of some kind over a given axis. The dask scheduler currently falls flat on its face when attempting to schedule these types of problems.
Here's a simple example of such a fail case:
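(The example code didn't survive extraction; a minimal reconstruction of this kind of fail case, with shapes chosen arbitrarily:)

```python
import dask.array as da

# x stands in for data that is cheap to reload from its file on disk
x = da.ones((10000, 1000), chunks=(1000, 1000))

# every chunk of x feeds both the mean and the later subtraction, so the
# scheduler keeps all of them in memory instead of recomputing them
centered = x - x.mean(axis=0)
centered.compute()
```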
The scheduler will keep each of the initial chunks in memory that it uses to compute the mean, because they will be used later as an argument to sub. In contrast, the appropriate way to handle this graph to avoid blowing up memory would be to compute the initial chunks twice.

I know that in principle this could be avoided by using an on-disk cache. But this seems like a waste, because the initial values are often sitting in a file on disk anyway.
This is a pretty typical use case for dask.array (one of the first things people try with xray), so it's worth seeing if we can come up with a solution that works by default.