Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add another optimization step to accommodate ops that require expensive meta information (e.g. divisions) #181

Open
phofl opened this issue Jun 23, 2023 · 1 comment

Comments

@phofl
Copy link
Collaborator

phofl commented Jun 23, 2023

Matt and I chatted offline about #166 and potential traps when implementing set_index.

Generally, we shouldn't use divisions while simplifying the Expression tree and pushing stuff up and down (I'll open another issue about how we can avoid this partially with align-like stuff). Computing divisions might be expensive, accessing them multiple times for a set_index/sort_values/... call will trigger multiple computations. So it's better to avoid them whenever possible.

Some methods will require information about the divisions to select the best algorithm (merge as an example). We might pass a _simplify_down step multiple times depending on the structure of our Expression tree, which would potentially require re-computing the divisions each time that merge is hit by this.

We can avoid this conflict through adding another optimization step that comes after simplify, lower for example. This step can accommodate optimizations that require information that are relatively expensive to obtain (like divisions). Ideally we avoid multiple passes in this case.

We should keep in mind that we need a more complex structure in the future, which would require us to generalize the different optimization steps.

cc @rjzamora

@rjzamora
Copy link
Member

rjzamora commented Jun 23, 2023

I started digesting the discussion in #166, and fell down a bit of a rabbit hole myself :)

This problem is reminding me of the fact that I was originally hoping to free Expr objects from needing to track both npartitions and divisions information. I eventually gave up on this idea as “too extreme”, but it may not be too late to reconsider.

For example, if we are splitting the optimization pass into different kinds of Expr changes, then it may make sense to roughly split into the following separate passes:

  • abstract optimizations that don't require divisions/npartitions information but may affect the final divisions/npartitions (predicate pushdown, projection, etc)
  • divisions/npartitions "lowering" (i.e. IO layers are forced to extract "known" divisions/npartitions)
  • anther abstract-optimization pass for things that cannot affect divisions/npartitions

(can stop here if you don't need to generate a graph)

  • algorithm "lowering" (ensuring all expressions can generate a graph)
  • another abstact-optimization pass for things that cannot affect divisions/npartitions
  • fusion

I will probably change my mind about this, just sharing my current thoughts :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants