Make map_blocks with block_info produce a Blockwise#5896

Merged
TomAugspurger merged 3 commits into dask:master from bmerry:block-info-in-blockwise on Mar 6, 2020


Conversation

@bmerry
Contributor

@bmerry bmerry commented Feb 14, 2020

Previously, using map_blocks with block_info or block_id would denature
a Blockwise and do a substitution on every instantiation of the subgraph
callable to inject the block info.

Instead, create a dask array whose elements are the block infos, and
pass this as an extra parameter into dask.array.blockwise. This makes
the layer a Blockwise, and hence a candidate for rewrite_blockwise.

  • Tests added / passed
  • Passes black dask / flake8 dask
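For readers unfamiliar with the feature this PR optimises, here is a minimal sketch of `map_blocks` with `block_info` (illustrative only, not taken from the PR): `block_info` maps each input's argument index, plus `None` for the output, to metadata such as `'chunk-location'` and `'array-location'`.

```python
import dask.array as da

def tag_block(x, block_info=None):
    # block_info[None] describes the output block; 'chunk-location'
    # is the block's coordinates in the grid of chunks
    row, col = block_info[None]["chunk-location"]
    return x * 0 + row * 10 + col

a = da.ones((4, 6), chunks=(2, 3))
result = da.map_blocks(tag_block, a, dtype=a.dtype).compute()
# each 2x3 block is filled with 10 * row + col of its chunk coordinates
```

Before this PR, requesting `block_info` like this forced the layer out of the `Blockwise` fast path; afterwards it stays a `Blockwise` layer.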

@bmerry
Contributor Author

bmerry commented Feb 14, 2020

I've marked this as a draft PR because it's built on top of #5895. In the meantime, if you want to look at what it's doing, ignore the first two commits.

In this (rather contrived) example, map_blocks is about 75x faster, and the computation is slightly faster too:

import dask.array as da
import numpy as np

import timeit

def combine(a, b, block_id=None):
    # block_id is the chunk's coordinates in the grid of blocks
    return a + b if block_id[0] == block_id[1] else np.zeros_like(a)

a = da.ones((1000, 100), chunks=(1, 1))
b = da.zeros((1000, 100), chunks=(1, 1))
print(timeit.timeit('da.map_blocks(combine, a, b, dtype=a.dtype)', number=10, globals=globals()))
c = da.map_blocks(combine, a, b, dtype=a.dtype)
print(timeit.timeit('c.compute()', number=1, globals=globals()))

Gains with block_info will be less because the data for block_info is more expensive to compute than block_id. On the other hand, I'm hoping this change will make some more complex graphs cheaper to optimise by allowing fusion to occur at the blockwise level instead of on individual blocks.

@bmerry
Contributor Author

bmerry commented Feb 14, 2020

By the way, are there any issues around serializability of the closures returned by _pass_block_info / _pass_block_id?

@bmerry bmerry requested a review from TomAugspurger February 14, 2020 13:18
@mrocklin
Member

By the way, are there any issues around serializability of the closures returned by _pass_block_info / _pass_block_id?

Everything will work, but serialization is a bit slower. In general we try to use raw top-level defined functions and pass through additional data as keywords when possible. The core pickle module is able to handle these really efficiently, and it opens up fewer opportunities for odd serialization issues to arise.
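To illustrate the point (a generic Python sketch, not dask internals): the stdlib pickle serialises a top-level function cheaply by its qualified name, while a closure cannot be pickled at all without cloudpickle, which dask falls back to at extra cost.

```python
import pickle

def add(x, extra):                # top-level: pickle stores just its name
    return x + extra

def make_adder(extra):
    def adder(x):                 # closure: stdlib pickle cannot find it by name
        return x + extra
    return adder

restored = pickle.loads(pickle.dumps(add))
print(restored(2, 3))             # 5

try:
    pickle.dumps(make_adder(1))
except (pickle.PicklingError, AttributeError):
    print("closure is not picklable with the stdlib pickle")
```

This is why "everything will work" in dask (cloudpickle handles the closure case) but a top-level function plus keyword arguments stays on the fast path.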

@bmerry
Contributor Author

bmerry commented Feb 15, 2020

In general we try to use raw top-level defined functions and pass through additional data as keywords when possible.

Ok, I'll have a look next week to see if there is some way to achieve that - possibly by passing the original function as a keyword arg.
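A hypothetical sketch of that keyword-argument approach (the names here are illustrative, not dask's actual internals): a top-level wrapper receives the user's function as an ordinary argument, so no closure ever needs to be serialised.

```python
import numpy as np

def _unpack_and_call(x, user_func=None, block_id=None):
    # top-level wrapper, picklable by name; the user's function and the
    # block coordinates both travel as plain keyword arguments
    return user_func(x, block_id=block_id)

def shift(x, block_id=None):
    return x + block_id[0]

out = _unpack_and_call(np.zeros(3), user_func=shift, block_id=(2, 0))
print(out)  # [2. 2. 2.]
```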

@bmerry
Contributor Author

bmerry commented Feb 16, 2020

I figured out how to avoid using closures. Once #5895 is merged (it's currently approved) I'll rebase against master. I'll also need to check the test failures, but I think it's just a formatting issue from black.

Previously, using map_blocks with block_info or block_id would denature
a Blockwise and do a substitution on every instantiation of the subgraph
callable to inject the block info.

Instead, create a dask array whose elements are the block infos, and
pass this as an extra parameter into `dask.array.blockwise`. This makes
the layer a Blockwise, and hence a candidate for `rewrite_blockwise`.
@bmerry bmerry force-pushed the block-info-in-blockwise branch from bf0a5c1 to c3f97e2 on February 16, 2020 17:35
@bmerry bmerry marked this pull request as ready for review February 16, 2020 17:42
@bmerry
Contributor Author

bmerry commented Feb 16, 2020

I've rebased against master, so this is ready for review now.

@bmerry
Contributor Author

bmerry commented Feb 16, 2020

By the way, I haven't added any tests because this is just an optimisation, not a new feature, and I think the existing map_blocks tests are reasonably comprehensive.

@mrocklin
Member

I apologize for the delay in reviewing this. There are relatively few people familiar with this part of the codebase, and it's a bit more involved to review.

This makes the layer a Blockwise, and hence a candidate for rewrite_blockwise.

It still depends on a non-Blockwise though, but I guess that's ok? rewrite_blockwise is comfortable treating that as a small input to the entire resulting output?

@mrocklin
Member

We might test that rewrite_blockwise on such a graph produces a HighLevelGraph with a small number of layers. This isn't strictly necessary for this PR, but if this is functionality that you value then it might be wise to add a small test for it, if it's easy for you.
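A sketch of such a test (hedged: the exact layer count depends on the dask version, so this only checks that optimisation does not increase the number of layers, and that the result is unchanged):

```python
import dask.array as da
from dask.blockwise import optimize_blockwise

def double(x, block_id=None):
    return x * 2

a = da.ones((10,), chunks=5)
b = da.map_blocks(double, a, dtype=a.dtype) + 1

# compare layer counts before and after blockwise optimisation
before = len(b.__dask_graph__().layers)
after = len(optimize_blockwise(b.__dask_graph__()).layers)
print(before, after)
```

If the `map_blocks` layer is a true `Blockwise`, `optimize_blockwise` should be able to fuse it with the elementwise `+ 1` layer, so `after` should be small.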

@bmerry
Contributor Author

bmerry commented Feb 20, 2020

It still depends on a non-Blockwise though, but I guess that's ok? rewrite_blockwise is comfortable treating that as a small input to the entire resulting output?

I'm now also less certain that there is any benefit to the resulting graph structure - I should definitely write a test to check that rewrite_blockwise is actually able to do the optimisation I'm claiming.

My initial plan had been to follow this up by supporting block_id directly in Blockwise, by injecting the block coordinates as part of make_blockwise_graph, which would have removed the need for an extra layer. But thinking about how to support that in rewrite_blockwise is making my head hurt even more than this PR did, so I probably won't get to it.

If there is no concatenation, set `concatenate` to None. map_blocks
always passes True, which was preventing map_blocks layers from being
fused with elementwise layers by `optimize_blockwise`.
@bmerry
Contributor Author

bmerry commented Feb 20, 2020

So it turned out that map_blocks wouldn't fuse with elementwise operations, even without block_id/block_info, because map_blocks always passed concatenate=True and blockwise ops with different concatenation settings don't merge.

Once that's fixed, this example becomes much faster, both to build the structures and to compute (the chunks are tiny to emphasize the overheads):

#!/usr/bin/env python3
import time
import dask.array as da
from dask.blockwise import optimize_blockwise
import numpy as np

CHUNK_SIZE = 10
NCHUNKS = 10000
SIZE = CHUNK_SIZE * NCHUNKS

def combine(x, y, block_id):
    # block_id is requested only to exercise the block-info machinery;
    # its value is unused
    return x + y

t0 = time.monotonic()
base = [da.full((SIZE,), i, dtype=np.int8, chunks=CHUNK_SIZE) for i in range(4)]
a = base[0] + base[1]
b = da.map_blocks(combine, a, base[2], dtype=np.int8)
c = b + base[3]
t1 = time.monotonic()
c.compute()
t2 = time.monotonic()

print(t1 - t0)
print(t2 - t1)

On master:

0.23567571800049336
8.729706922999867

On this branch:

0.08010524800010899
4.941571717000443

@TomAugspurger
Member

and blockwise ops with different concatenation settings don't merge.

Once that's fixed, this example becomes much faster

Just to clarify, that's fixed on this branch (likely in bad97f0)?

@bmerry
Contributor Author

bmerry commented Mar 4, 2020

Just to clarify, that's fixed on this branch (likely in bad97f0)?

Yes.

TomAugspurger added a commit to TomAugspurger/dask-benchmarks that referenced this pull request Mar 4, 2020
@TomAugspurger TomAugspurger merged commit 1190067 into dask:master Mar 6, 2020
@TomAugspurger
Member

Thanks @bmerry!

@bmerry bmerry deleted the block-info-in-blockwise branch February 25, 2021 12:54