Add another optimization step to accommodate ops that require expensive meta information (e.g. `divisions`) #181

phofl · 2023-06-23T14:08:30Z

Matt and I chatted offline about #166 and potential traps when implementing set_index.

Generally, we shouldn't use divisions while simplifying the Expression tree and pushing stuff up and down (I'll open another issue about how we can avoid this partially with align-like stuff). Computing divisions might be expensive, accessing them multiple times for a set_index/sort_values/... call will trigger multiple computations. So it's better to avoid them whenever possible.

Some methods will require information about the divisions to select the best algorithm (merge as an example). We might pass a _simplify_down step multiple times depending on the structure of our Expression tree, which would potentially require re-computing the divisions each time that merge is hit by this.

We can avoid this conflict through adding another optimization step that comes after simplify, lower for example. This step can accommodate optimizations that require information that are relatively expensive to obtain (like divisions). Ideally we avoid multiple passes in this case.

We should keep in mind that we need a more complex structure in the future, which would require us to generalize the different optimization steps.

cc @rjzamora

The text was updated successfully, but these errors were encountered:

rjzamora · 2023-06-23T17:04:04Z

I started digesting the discussion in #166, and fell down a bit of a rabbit hole myself :)

This problem is reminding me of the fact that I was originally hoping to free Expr objects from needing to track both npartitions and divisions information. I eventually gave up on this idea as “too extreme”, but it may not be too late to reconsider.

For example, if we are splitting the optimization pass into different kinds of Expr changes, then it may make sense to roughly split into the following separate passes:

abstract optimizations that don't require divisions/npartitions information but may affect the final divisions/npartitions (predicate pushdown, projection, etc)
divisions/npartitions "lowering" (i.e. IO layers are forced to extract "known" divisions/npartitions)
anther abstract-optimization pass for things that cannot affect divisions/npartitions

(can stop here if you don't need to generate a graph)

algorithm "lowering" (ensuring all expressions can generate a graph)
another abstact-optimization pass for things that cannot affect divisions/npartitions
fusion

I will probably change my mind about this, just sharing my current thoughts :)

phofl mentioned this issue Jun 23, 2023

Caching __dask_graph__ #183

Open

rjzamora mentioned this issue Jun 27, 2023

[Proposal] Add "phases" argument to Expr.simplify #194

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add another optimization step to accommodate ops that require expensive meta information (e.g. `divisions`) #181

Add another optimization step to accommodate ops that require expensive meta information (e.g. `divisions`) #181

phofl commented Jun 23, 2023

rjzamora commented Jun 23, 2023 •

edited

Loading

Add another optimization step to accommodate ops that require expensive meta information (e.g. divisions) #181

Add another optimization step to accommodate ops that require expensive meta information (e.g. divisions) #181

Comments

phofl commented Jun 23, 2023

rjzamora commented Jun 23, 2023 • edited Loading

Add another optimization step to accommodate ops that require expensive meta information (e.g. `divisions`) #181

Add another optimization step to accommodate ops that require expensive meta information (e.g. `divisions`) #181

rjzamora commented Jun 23, 2023 •

edited

Loading