Consider reactivating low-level DataFrame optimization when not all layers are Blockwise #8447

gjoseph92 · 2021-12-03T00:10:05Z

Since #7620, we've seen a few instances where users have gotten burned by root-task overproduction (see dask/distributed#5555, dask/distributed#5223 for background) because certain DataFrame optimizations still use low-level graphs, and therefore aren't getting fused anymore. Examples:

We do want to get everything to Blockwise eventually, but our bandwidth to track these down and fix them is limited. In the interim, I propose that by default, we still do low-level fusion when any of the layers in the graph are materialized.
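A minimal sketch of the proposed default, written as a dependency-free toy model rather than dask's actual code. The `Layer` and `Blockwise` classes are stand-ins for graph layers, and `should_fuse_low_level` and its `fuse_active` parameter are hypothetical names chosen for illustration. It assumes a three-state setting in which `None` means "use the default".

```python
from typing import Optional, Sequence


class Layer:
    """Stand-in for a materialized (low-level) graph layer. Hypothetical."""


class Blockwise(Layer):
    """Stand-in for a Blockwise layer that high-level optimization handles."""


def should_fuse_low_level(
    layers: Sequence[Layer], fuse_active: Optional[bool] = None
) -> bool:
    """Decide whether to run low-level fusion on a graph.

    Assumption: `fuse_active` is True/False when the user set it explicitly,
    and None when the default behavior should apply.
    """
    # An explicit user choice always wins.
    if fuse_active is not None:
        return fuse_active
    # Proposed default: fuse if any layer is not Blockwise (i.e. materialized).
    return any(not isinstance(layer, Blockwise) for layer in layers)


all_blockwise = [Blockwise(), Blockwise()]
mixed = [Blockwise(), Layer()]

print(should_fuse_low_level(all_blockwise))  # no materialized layers
print(should_fuse_low_level(mixed))          # one materialized layer
```

Under this sketch, graphs made up entirely of Blockwise layers skip low-level fusion as they do today, while any graph with a materialized layer falls back to it unless the user has explicitly turned fusion off.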

cc @rjzamora @ian-r-rose @jrbourbeau

rjzamora · 2021-12-03T02:57:14Z

> We do want to get everything to Blockwise eventually, but our bandwidth to track these down and fix them is limited. In the interim, I propose that by default, we still do low-level fusion when any of the layers in the graph are materialized.

My motivation to finally refocus my efforts on HLG cleanup is certainly reaching a tipping point! Sorry for being somewhat MIA in these topics lately.

I agree that we probably want the middle ground you are suggesting. As I type this, I see that you have submitted a PR, so perhaps we can discuss the specific details there :)

This was referenced Dec 3, 2021

Suboptimal graph structure when read-writing a parquet #8445

Closed

Add fusion optimization for Delayed #8448

Open

gjoseph92 added the dataframe label Dec 3, 2021

gjoseph92 added a commit to gjoseph92/dask that referenced this issue Dec 3, 2021

DataFrame fusion when some layers are materialized

bcde4b8

Change the DataFrame optimization default so that DataFrames that may need low-level fusion still get it. Closes dask#8447

gjoseph92 linked a pull request Dec 3, 2021 that will close this issue

DataFrame fusion when some layers are materialized #8451

Open

rjzamora mentioned this issue Dec 8, 2021

Move DataFrame ACA aggregations to HLG #8468

Merged

gjoseph92 mentioned this issue Dec 8, 2021

[DNM] P2P shuffle skeleton - scheduler plugin dask/distributed#5524

Closed

gjoseph92 mentioned this issue Dec 16, 2021

from_dask_array materializes the graph unnecessarily #8496

Closed

gjoseph92 mentioned this issue Jan 19, 2022

Spatial join benchmarks (from Scipy 2020 talk) geopandas/dask-geopandas#114

Open

gjoseph92 mentioned this issue Mar 10, 2022

Graph optimization loses annotations #7036

Open

gjoseph92 mentioned this issue Oct 25, 2022

Automatic retries beyond read_parquet / to_parquet #9594

Open
