
Use atop fusion in dask dataframe #4229

Merged: 64 commits, Dec 27, 2018

Conversation

mrocklin
Member

The high-level atop fusion layers can be used in dask.dataframe as well as dask.array. Today this makes task fusion cheaper on dataframes with many partitions. In the future it should make it easier to build more sophisticated high-level optimizations.

This currently builds on #4092

This currently breaks tests. There is a fair amount that we can clean up in both dask.dataframe's broadcast operations (map_partitions, elemwise) and in the dask.array.atop code.

  • Tests added / passed
  • Passes flake8 dask
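
The core idea can be illustrated with a toy model (the `Layer` and `fuse_layers` names here are hypothetical, not dask's actual API): each partitionwise operation is a layer that maps a function over partitions, and fusing a stack of layers composes the functions so the scheduler sees one task per partition instead of one per layer per partition.

```python
# Toy illustration of high-level partitionwise ("atop"/blockwise) fusion.
# Layer and fuse_layers are hypothetical names, not dask's implementation.
from functools import reduce

class Layer:
    """A partitionwise operation: apply `func` to each partition."""
    def __init__(self, func):
        self.func = func

def fuse_layers(layers):
    """Compose a stack of partitionwise layers into a single layer,
    so each partition is touched by one task instead of len(layers)."""
    def fused(part):
        return reduce(lambda x, f: f(x), (l.func for l in layers), part)
    return Layer(fused)

partitions = [[1, 2], [3, 4], [5, 6]]          # a 3-partition "dataframe"
stack = [Layer(lambda p: [x + 1 for x in p]),  # df + 1
         Layer(lambda p: [x * 2 for x in p])]  # ... * 2

# Unfused: len(stack) * len(partitions) tasks; fused: len(partitions) tasks.
fused = fuse_layers(stack)
result = [fused.func(p) for p in partitions]
print(result)  # [[4, 6], [8, 10], [12, 14]]
```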

@mrocklin
Member Author

OK, most dataframe tests pass here now.

@mrocklin mrocklin changed the title [WIP] Use atop fusion in dask dataframe Use atop fusion in dask dataframe Dec 10, 2018
@mrocklin
Member Author

OK, this is ready for review. There is some future work though:

  • Regression: I've turned off dataframe-parquet column projection for now (the previous implementation was very brittle). I think that we'll want to redo it anyway in short order.
  • I think that we should rename atop to blockwise, or something else (see "Better name for atop?" #4035).
  • There are issues around fusing diamond-like atop graphs. We might need to repeat some of the work of @jcrist and @eriknw at a higher level.
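
To make the column-projection regression concrete, here is a toy sketch of what that optimization does conceptually (all names here are hypothetical, not dask's optimizer): when the only consumers of a read task select specific columns, the read can be rewritten to load just those columns.

```python
# Toy sketch of column projection (hypothetical names; not dask's optimizer).
# If the graph only ever selects column 'x' from a read, rewrite the read
# to load just that column.

def read(columns=None):
    data = {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0], "name": ["a", "b", "c"]}
    cols = columns if columns is not None else list(data)
    return {c: data[c] for c in cols}

def getitem(d, col):
    return d[col]

dsk = {
    "df": (read, None),
    "x": (getitem, "df", "x"),
}

def project_columns(dsk):
    """If every consumer of the read task is a single-column getitem,
    push the needed columns into the read call."""
    needed = {t[2] for t in dsk.values() if t[0] is getitem and t[1] == "df"}
    out = dict(dsk)
    out["df"] = (read, sorted(needed))
    return out

optimized = project_columns(dsk)
loaded = optimized["df"][0](optimized["df"][1])
print(list(loaded))  # ['x'] -- only the needed column is read
```

The brittleness mentioned above comes from deciding safely when this rewrite applies; a consumer that touches the whole dataframe must disable it.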

@mrocklin
Member Author

@jcrist @TomAugspurger: if either of you has time, it would be helpful if you could take a look at this.

@TomAugspurger
Member

TomAugspurger commented Dec 13, 2018 via email

assert len(b) <= 15


@pytest.mark.xfail(reason="need better high level fusion")
Member Author

This is future work. We need better fusion at the high-level-graph level. cc @jcrist?

Member

Sure, that'd be fun to work on. Is this blocking for this PR to merge, or just future work?

Member Author

Not blocking. Things are better with this PR (df + 1 + 2 + 3 + 4 happens in one task), but they could be much better still with better fusion. In particular, I've seen the use case in this failing test happen frequently. It seems that common use of pandas includes dozens of lines of modifying columns in place, all of which generate diamond-like graphs.
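
A toy example of why diamond-shaped graphs resist simple fusion (the `consumers` helper here is a sketch; `dask.optimization.fuse` does related work on low-level graphs): linear fusion can only merge a task into its consumer when the task has exactly one consumer, and the shared root of a diamond has two.

```python
# Toy illustration of why diamond-shaped graphs resist linear fusion.
# `consumers` is a hypothetical helper for this sketch.

def inc(x): return x + 1
def dec(x): return x - 1
def add(a, b): return a + b

# df['b'] = f(df['a']); df['c'] = g(df['a']); then use both: a diamond.
dsk = {
    "a": 10,
    "b": (inc, "a"),
    "c": (dec, "a"),
    "d": (add, "b", "c"),
}

def consumers(dsk):
    """Map each key to the list of tasks that depend on it."""
    deps = {k: [] for k in dsk}
    for k, v in dsk.items():
        if isinstance(v, tuple):
            for arg in v[1:]:
                if arg in deps:
                    deps[arg].append(k)
    return deps

# Linear fusion can only merge a task into its consumer when it has
# exactly one consumer.  'a' has two consumers (b and c), so the
# diamond stays as four tasks; collapsing it needs a smarter rewrite.
deps = consumers(dsk)
fusible = [k for k, cs in deps.items() if len(cs) == 1]
print(sorted(fusible))  # ['b', 'c'] -- 'a' cannot be fused linearly
```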

Member Author

The PR itself could also use review though.

@TomAugspurger (Member) left a comment

This PR looks good, but I haven't really kept up with the new high-level graph stuff.

Only request would be a docstring for the new broadcast method. And could you explain the reasoning behind that name? IIUC, it adds a task to each partition/block of the input args? I worry a bit about broadcast because it clashes with NumPy's concept (and with broadcasting data in distributed), but I haven't come up with a better name.

@mrocklin
Member Author

Only request would be a docstring for the new broadcast method. And could you explain the reasoning behind that name?

Renamed to partitionwise_graph and added a docstring. I also moved it into core.py.
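
A simplified sketch of what a partitionwise graph builder does (this is a hypothetical, low-level stand-in; the real partitionwise_graph in dask.dataframe.core emits a high-level Blockwise layer rather than a plain dict): one task per partition, where collection arguments contribute their i-th partition key and scalar arguments are broadcast to every task.

```python
# Simplified sketch of building a partitionwise graph (hypothetical helper;
# the real partitionwise_graph emits a Blockwise layer, not a plain dict).

def partitionwise_graph(func, name, npartitions, *args):
    """One task per partition: string args name a collection and contribute
    their i-th partition key; other args are broadcast to every task."""
    dsk = {}
    for i in range(npartitions):
        task_args = [(arg, i) if isinstance(arg, str) else arg
                     for arg in args]
        dsk[(name, i)] = (func, *task_args)
    return dsk

dsk = {("df", 0): [1, 2], ("df", 1): [3, 4]}
dsk.update(partitionwise_graph(lambda p, n: [x + n for x in p],
                               "add", 2, "df", 10))

def get(dsk, key):
    """Tiny recursive scheduler, just enough to run this sketch."""
    v = dsk[key]
    if isinstance(v, tuple) and callable(v[0]):
        return v[0](*[get(dsk, a) if a in dsk else a for a in v[1:]])
    return v

print([get(dsk, ("add", i)) for i in range(2)])  # [[11, 12], [13, 14]]
```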

@bluecoconut
Contributor

Just did a test locally and this branch had a dramatic impact (~2x improvement) in performance for the code below, and even has a lower memory footprint for the larger stat tests.

import dask
import dask.datasets
import numpy as np
import time
from distributed import Client

client = Client()
client

df = dask.datasets.timeseries()
df = df.repartition(npartitions=300)
df = client.persist(df)

def random_indexer(df):
    # Build a boolean mask by AND-ing together 1-15 random comparisons
    indexer = ~df.index.isnull()
    for i in range(np.random.randint(15) + 1):
        col = np.random.choice(['x', 'y'])
        value = np.random.uniform(-1, 1)
        op = np.random.choice([lambda x, y: x < y, lambda x, y: x > y])
        indexer = np.logical_and(indexer, op(df[col], value))
    return indexer

def random_statistic(indexer, df):
    # Compute a random reduction over a randomly chosen, filtered column
    col = np.random.choice(['x', 'y', 'name'])
    if col == 'name':
        op = np.random.choice([lambda x: x.unique().size, np.min, np.max])
    else:
        op = np.random.choice([lambda x: x.unique().size, np.min, np.max,
                               np.sum, np.mean])
    return op(df[col][indexer])

np.random.seed(137)
stats = []
for i in range(10):
    ind = random_indexer(df)
    for k in range(20):
        stats.append(random_statistic(ind, df))

st = time.time()
stat_computed = client.compute(stats)
ft = time.time()
print(ft-st)

st = time.time()
stat_results = client.gather(stat_computed)
ft = time.time()
print(ft-st)

For calculating 200 statistics

(10 unique filtering, 20 different statistics from that filtered subset)

npartitions | graph create (master) | graph create (high-level) | execution (master) | execution (high-level)
------------|-----------------------|---------------------------|--------------------|-----------------------
100         | 1.7s                  | 2.5s                      | 10.2s              | 5.6s
300         | 5.4s                  | 8.1s                      | 31s                | 15.3s
600         | 11.6s                 | 16s                       | 59s                | 29.35s

For calculating 2000 statistics

I also tried to increase the scale of stats (for i in range(100), 2000 stats total). I see the same garbage-collection warnings in both branches, and memory still grows a lot more than I anticipated.

However, on this branch the calculation with 2000 stats actually completes! Also, the "Bytes stored" number on the dashboard seems to be accurate (though quite large: from 1.5 GB after persist(df) to 7 GB stored after calculating the stats). I will need to dig into why so much data is left in the cluster next.

The biggest difference from current 1.0.0 master is that master doesn't complete this same calculation on my machine at all: it runs into the memory issue, heads into swap, and eventually kills workers.

@mrocklin
Member Author

Thank you for trying things out @bluecoconut and for the benchmark. It's always nice to see things work well on unseen problems :)

I look forward to finding out why graph construction was more expensive (I would have expected it to decrease).

I suspect that you'll see additional boosts as we get a bit better at fusing things at the high level (see the comments on the xfailed test).

@mrocklin
Member Author

Merging this tomorrow if there are no further comments.
