JIT graph building #10518
Thanks for your report. Before I get into this: what you call "compilation", i.e. the process of building the "low level graph" / initializing all tasks, is what we typically refer to as "materialization". With this out of the way, let's look at your example. You'll notice that there are a couple of layer types in here that are already materialized. Looking through the code base, it appears that for those operations (primarily concatenate and overlap) the "non-materialized" representation of the graph is simply not implemented yet. If this is implemented properly, we would indeed just compute whatever is necessary to calculate the desired output key, as you are describing.
Apart from the materialization issues, your example is generating a graph of about 4MM tasks. This is larger than what we are typically dealing with, and especially if you are adding more layers to this graph, this could quickly grow out of hand. If there is any opportunity to rechunk / use larger chunks, this could help significantly. IIUC your input dataset has chunks of about 50MiB. You may be able to go a little higher here:

```python
zero = da.zeros((10000, 5000, 5000, 5), dtype='float32', chunks=(1000, 500, 25, 1))

# If the input data is the way it is, you can still rechunk later.
# Mind that the new chunks are a multiple of the input chunks; this way the operation is trivial.
zero = zero.rechunk((2000, 1000, 25, 1))

# Adjust the overlap as necessary
data = da.overlap.overlap(zero, depth={0: 5, 1: 2, 2: 2}, boundary=-1)
```
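To see why larger chunks help, here is some back-of-the-envelope arithmetic (plain Python, no dask required) for the shapes in the snippet above. The 4-byte itemsize is for `float32`; the chunk counts only cover the `zeros` layer itself, not the downstream overlap/concatenate layers:

```python
import math

# Shapes from the example above; float32 has an itemsize of 4 bytes.
ITEMSIZE = 4
shape = (10000, 5000, 5000, 5)
original_chunk = (1000, 500, 25, 1)
larger_chunk = (2000, 1000, 25, 1)

def chunk_mib(chunk):
    """Size of a single chunk in MiB."""
    return math.prod(chunk) * ITEMSIZE / 2**20

def n_chunks(shape, chunk):
    """How many chunks (i.e. tasks for this one layer) the array splits into."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunk))

print(round(chunk_mib(original_chunk), 1))   # 47.7 -> the ~50MiB chunks mentioned above
print(n_chunks(shape, original_chunk))       # 100000 chunks before rechunking
print(n_chunks(shape, larger_chunk))         # 25000 chunks after rechunking
```

Quadrupling the chunk size cuts the task count of every layer built on top of this array by the same factor, which is why rechunking early can shrink the overall graph so much.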
Thanks for the reply! So to reiterate what you're saying: the materialization does in fact already happen just-in-time; it's only because the non-materialized representation isn't implemented for concatenate/overlap that the current behavior requires those layers to be fully materialized in order to compute any output keys. If/when a non-materialized representation is available for those layers, it should be the case that in the example above, the first block can be computed without actually materializing the other (unnecessary for that block) parts of the low level graph. Did I understand that correctly? For example, this specific area is where that non-materialized representation would be implemented? And this pull request is actually addressing that, except that it hasn't been merged yet. I can bump that pull request, since it seems like it may have slipped through the cracks.

So I guess that answers my main question, thanks! One thing from your reply that I'm unsure of, though: having something like 4MM (or even 400MM) tasks in a graph should only be a problem for the scheduler if you're actually trying to compute that entire graph, right? If instead we're only computing a small segment of the graph at a time (like in the example above, one block at a time) - and assuming the graph has a full non-materialized representation - there really shouldn't be any bottleneck on how large the overall graph can be. The bottleneck is purely the number of tasks specifically required for the requested output key. Is that correct?
Yes, we call this culling. We'll basically take the output key you requested and walk our way backwards building the graph. This way, you'll only build and compute what you actually require for your result.
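The backwards walk described above can be sketched in a few lines over a plain `dict`-style task graph. This is a simplified toy model, not dask's actual `cull` implementation: values are either literals or `(function, *argument_keys)` tuples, and we keep only what the requested key transitively depends on:

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# A toy task graph: values are literals or (function, *arg_keys) tuples.
graph = {
    'a': 1,
    'b': (inc, 'a'),
    'c': (inc, 'b'),
    'd': 100,            # not needed to compute 'c'
    'e': (add, 'c', 'd'),
}

def cull(graph, key):
    """Walk backwards from `key`, keeping only the tasks it depends on."""
    needed = {}
    stack = [key]
    while stack:
        k = stack.pop()
        if k in needed:
            continue
        task = graph[k]
        needed[k] = task
        if isinstance(task, tuple):
            # Everything after the callable that names another task is a dependency.
            stack.extend(dep for dep in task[1:] if dep in graph)
    return needed

print(sorted(cull(graph, 'c')))  # ['a', 'b', 'c'] -- 'd' and 'e' are never built
```

Requesting `'c'` never touches `'d'` or `'e'`, which is exactly the "only build what you need" behavior described above.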
I'm not familiar with this yet but, from a brief glance, yes. For the sake of transparency, this
Unfortunately my application needs to work with arrays rather than DataFrames, but thanks for mentioning it - I'll definitely keep my eye on the progress.

With that said, I am somewhat invested in getting a faster subgraph computation like in the example above, and so I'm trying to dig into exactly why this example needs to take so long to compute the first block. In the reference you provided above to culling, one of the first lines of that function calls for the calculation of all external keys in the graph. The problem with that is, when there are millions of external keys for a layer (which seems to commonly be the case for an overlap layer), this takes significantly more time and memory to create and work with than any single key's dependency path actually requires (regardless of whether the layer is actually materialized, if I understand things correctly). I'm having some difficulty understanding exactly why an individual layer needs the entire graph's external keys in order to cull its own dependencies, however. When I trace back the usage of the full external key set, it appears as though each layer is just checking whether each of its dependencies is actually contained in the full graph key set (here)...which, to me, seems tautologically true.

I've made some modifications to dask that allow the above example to run completely in < 1 second (essentially removing the time dependency on the size of the graph/data).
This requires removing the calculation of all external keys that I mentioned above. It does work for this example - and I think it theoretically should never be required for any other situation - but I'm not familiar enough with dask internals to say for sure. Does anyone know why the calculation of all external keys is necessary? In what situation(s) would a layer need the external keys of the entire graph in order to cull its own dependencies?
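To make the question above concrete, here is a toy model of per-layer culling (this is not dask's `HighLevelGraph`/`Layer` API, just an illustration of the idea). Each layer culls itself using only the output keys requested of it and reports which upstream keys it needs; no global set of all external keys is ever computed:

```python
def inc(x):
    return x + 1

class Layer:
    """A toy graph layer: culls itself given only the keys requested of it,
    returning the surviving tasks plus the upstream keys they depend on."""
    def __init__(self, tasks, deps):
        self.tasks = tasks  # key -> task
        self.deps = deps    # key -> set of upstream keys

    def cull(self, requested):
        kept = {k: self.tasks[k] for k in requested}
        upstream = set().union(*(self.deps[k] for k in requested)) if requested else set()
        return kept, upstream

# Two stacked layers of 1000 tasks each; culling walks layer by layer, top down.
bottom = Layer({('a', i): i for i in range(1000)},
               {('a', i): set() for i in range(1000)})
top = Layer({('b', i): (inc, ('a', i)) for i in range(1000)},
            {('b', i): {('a', i)} for i in range(1000)})

needed_top, upstream = top.cull({('b', 3)})
needed_bottom, _ = bottom.cull(upstream)
# Only 2 tasks survive out of 2000; no global external-key set was needed.
print(len(needed_top) + len(needed_bottom))  # 2
```

In this model the cost of culling depends only on the size of the requested dependency path, not on the total number of keys in the graph, which is the behavior the patch is after.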
@rjzamora maybe you've got some insights here?
You have a good eye @BrandonSmithJ - the call you are referring to is there largely for historical reasons. You are absolutely correct that there is no fundamental reason we need the full set of external keys for a layer to cull its own dependencies. Both myself and others have proposed changes to this in the past (e.g. #9216). However, slightly more-visible problems have tended to take priority.
Thanks @rjzamora! The historical usage of it makes sense, and I think that does solidify my understanding of things. I cleaned up my patch a bit and created a pull request (#10534) that addresses this issue and seems to pass all tests (locally, at least). So it appears this might offer a significant general improvement in speed when a user is only computing parts of the overall graph. Happy to discuss this further - whether it's any other small modifications that should be made, or if the change can't be incorporated for whatever reason.
Here's a motivating example for what I'm referring to:

In the actual data I'm working with, the graph takes a significant amount of time to create before the first block can even start work. In this case it's due to the rather large number of operations that need to take place in `dask.layers.ArrayOverlapLayer`, but the general idea I'm discussing is: how feasible would it be to allow just-in-time graph compilation, in order to remove the startup time dependency on the size of the data / overall number of tasks? Is it possible to compile only the exact task dependency graph for a given object when it's computed, rather than having to build all tasks at the outset?
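The just-in-time idea being asked about can be sketched with a toy graph whose tasks are generated on demand rather than built up front (a simplified illustration, not how dask is structured internally). Startup cost is then independent of the total number of blocks:

```python
def make_block(i):
    """Stand-in for real per-block work."""
    return [i] * 3

class LazyBlockGraph:
    """Toy just-in-time graph: the task for block i is generated only
    when that block is actually requested."""
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.built = 0  # counts how many tasks were ever materialized

    def task_for(self, i):
        if not 0 <= i < self.nblocks:
            raise KeyError(i)
        self.built += 1
        return (make_block, i)

    def compute_block(self, i):
        func, arg = self.task_for(i)
        return func(arg)

g = LazyBlockGraph(nblocks=1_000_000)  # a million blocks, none built up front
print(g.compute_block(7), g.built)     # [7, 7, 7] 1
```

Constructing the graph is O(1) here; only the single requested block ever turns into a concrete task, which is the startup behavior the question is after.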