You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Matt and I chatted offline about #166 and potential traps when implementing set_index.
Generally, we shouldn't use divisions while simplifying the Expression tree and pushing stuff up and down (I'll open another issue about how we can avoid this partially with align-like stuff). Computing divisions might be expensive, accessing them multiple times for a set_index/sort_values/... call will trigger multiple computations. So it's better to avoid them whenever possible.
Some methods will require information about the divisions to select the best algorithm (merge as an example). We might pass a _simplify_down step multiple times depending on the structure of our Expression tree, which would potentially require re-computing the divisions each time that merge is hit by this.
We can avoid this conflict through adding another optimization step that comes after simplify, lower for example. This step can accommodate optimizations that require information that are relatively expensive to obtain (like divisions). Ideally we avoid multiple passes in this case.
We should keep in mind that we need a more complex structure in the future, which would require us to generalize the different optimization steps.
I started digesting the discussion in #166, and fell down a bit of a rabbit hole myself :)
This problem is reminding me of the fact that I was originally hoping to free Expr objects from needing to track both npartitions and divisions information. I eventually gave up on this idea as “too extreme”, but it may not be too late to reconsider.
For example, if we are splitting the optimization pass into different kinds of Expr changes, then it may make sense to roughly split into the following separate passes:
abstract optimizations that don't require divisions/npartitions information but may affect the final divisions/npartitions (predicate pushdown, projection, etc)
divisions/npartitions "lowering" (i.e. IO layers are forced to extract "known" divisions/npartitions)
anther abstract-optimization pass for things that cannot affect divisions/npartitions
(can stop here if you don't need to generate a graph)
algorithm "lowering" (ensuring all expressions can generate a graph)
another abstact-optimization pass for things that cannot affect divisions/npartitions
fusion
I will probably change my mind about this, just sharing my current thoughts :)
Matt and I chatted offline about #166 and potential traps when implementing
set_index
.Generally, we shouldn't use
divisions
while simplifying the Expression tree and pushing stuff up and down (I'll open another issue about how we can avoid this partially with align-like stuff). Computing divisions might be expensive, accessing them multiple times for aset_index
/sort_values
/... call will trigger multiple computations. So it's better to avoid them whenever possible.Some methods will require information about the divisions to select the best algorithm (
merge
as an example). We might pass a_simplify_down
step multiple times depending on the structure of our Expression tree, which would potentially require re-computing thedivisions
each time thatmerge
is hit by this.We can avoid this conflict through adding another optimization step that comes after
simplify
,lower
for example. This step can accommodate optimizations that require information that are relatively expensive to obtain (likedivisions
). Ideally we avoid multiple passes in this case.We should keep in mind that we need a more complex structure in the future, which would require us to generalize the different optimization steps.
cc @rjzamora
The text was updated successfully, but these errors were encountered: