Add colors to represent high level layer types #7974

freyam · 2021-08-02T10:08:04Z

Closes Categorise high level layers for display #7919
Tests passed
Passes black dask / flake8 dask / isort dask

In this PR, I have added an extra attribute to the graphviz output of High-Level graphs - node fill color. Nodes are colored on the basis of their layer_type.

This was achieved by using DOT's fillcolor attribute.
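
As a rough illustration of the mechanism (not the code in this PR), the sketch below builds DOT node statements by hand. `dot_node` is a hypothetical helper and the node names are made up. Note that Graphviz only applies `fillcolor` when the node's `style` includes `filled`.

```python
def dot_node(name, **attrs):
    """Return a DOT node statement with its attributes in sorted order."""
    body = ", ".join(f'{key}="{value}"' for key, value in sorted(attrs.items()))
    return f'"{name}" [{body}];'


# Two illustrative nodes; the names and colors are placeholders.
lines = ["digraph G {"]
lines.append("  " + dot_node("read-parquet", style="filled", fillcolor="purple"))
lines.append("  " + dot_node("getitem", style="filled", fillcolor="green"))
lines.append('  "read-parquet" -> "getitem";')
lines.append("}")

dot_source = "\n".join(lines)
print(dot_source)
```

The resulting text can be rendered with the `dot` command-line tool if Graphviz is installed.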

Currently, the coloring is opt-in. Users enable it by passing the additional kwarg color="layer_type" when calling visualize() on the high-level graph, as in the demo below.

Demo

import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()

`df3.dask.visualize(color="layer_type")`

Color Scheme

layer_colors = {
    "DataFrameIOLayer": "purple",
    "ShuffleLayer": "rose",
    "SimpleShuffleLayer": "rose",
    "ArrayOverlayLayer": "pink",
    "BroadcastJoinLayer": "blue",
    "Blockwise": "green",
    "BlockwiseLayer": "green",
    "BlockwiseCreateArray": "green",
    "MaterializedLayer": "gray",
}

Explanation

DataFrameIOLayer: inefficient;
ShuffleLayer, SimpleShuffleLayer: inefficient;
ArrayOverlayLayer: inefficient;
BroadcastJoinLayer: (?);
Blockwise, BlockwiseLayer, BlockwiseCreateArray: efficient; easy to parallelize;
MaterializedLayer: inefficient; better to materialize as late as possible;

Key Points

Blockwise layers are more efficient than the others and can be readily parallelized, so they are green (a color that signals something is right). When users see a green layer, they can be confident that it is an efficient way to accomplish things and that no optimization is required.
The gray color (which signifies neutrality and balance) is used for Materialized layers to indicate that the layer should be materialized as late as feasible. Since we are not sure of the optimal way to optimize these layers, gray also signals that uncertainty.
All of the other layers are inefficient in some way, so they get brighter, stronger colors to attract the user's attention. Colors such as purple, pink, and blue indicate that something may need to be optimized. Users should use as few of these layers as feasible.
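
The scheme above amounts to a lookup keyed on the layer's class name. Here is a minimal sketch of that idea, not the PR's implementation: `color_for` is a hypothetical helper, the two classes are empty stand-ins rather than dask's real layer classes, and the white fallback for unlisted types is an assumption.

```python
# Subset of the color scheme proposed in this thread.
layer_colors = {
    "DataFrameIOLayer": "purple",
    "ShuffleLayer": "rose",
    "Blockwise": "green",
    "MaterializedLayer": "gray",
}


def color_for(layer, default="white"):
    """Pick a fill color from the class name of `layer` (fallback is assumed)."""
    return layer_colors.get(type(layer).__name__, default)


class Blockwise:
    """Empty stand-in used only for this illustration."""


class SomeOtherLayer:
    """Stand-in for a layer type that is not in the scheme."""


print(color_for(Blockwise()))
print(color_for(SomeOtherLayer()))
```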

GPUtester · 2021-08-02T10:08:06Z

Can one of the admins verify this patch?

GenevieveBuckley · 2021-08-10T08:23:34Z

Discussed in the meeting today, we want to:

Make this color-coding feature optional, and turn it on using a color keyword argument (similar to https://docs.dask.org/en/latest/api.html#dask.visualize)
Write a short explanation targeted at new Dask users, telling them what the colours mean and why they should care about it.

GenevieveBuckley · 2021-08-10T08:28:22Z

For that second part, why should someone care / what does it mean:

There's some discussion of this in the thread here Categorise high level layers for display #7919
Very roughly, blockwise layers are pretty good/efficient, shuffles are not very efficient so you don't want to have too many of them (probably similar for array overlap layers), and you want to avoid materializing layers until the very end when the computation runs

freyam · 2021-08-10T10:16:58Z

Demo

Made the color-coding optional.

Users need to supply color="layer_type" to see the colors and the Legend.

Now, I will try to gather info about different layer_types.

freyam · 2021-08-11T12:45:27Z

Update

Color Scheme

layer_colors = {
    "DataFrameIOLayer": "purple",
    "ShuffleLayer": "rose",
    "SimpleShuffleLayer": "rose",
    "ArrayOverlayLayer": "pink",
    "BroadcastJoinLayer": "blue",
    "Blockwise": "green",
    "BlockwiseLayer": "green",
    "BlockwiseCreateArray": "green",
    "MaterializedLayer": "gray",
}

Explanation

DataFrameIOLayer: inefficient;
ShuffleLayer, SimpleShuffleLayer: inefficient;
ArrayOverlayLayer: inefficient;
BroadcastJoinLayer: (?);
Blockwise, BlockwiseLayer, BlockwiseCreateArray: efficient; easy to parallelize;
MaterializedLayer: inefficient; better to materialize as late as possible;

Key Points

Blockwise layers are more efficient than the others and can be readily parallelized, so they are green (a color that signals something is right). When users see a green layer, they can be confident that it is an efficient way to accomplish things and that no optimization is required.
The gray color (which signifies neutrality and balance) is used for Materialized layers to indicate that the layer should be materialized as late as feasible. Since we are not sure of the optimal way to optimize these layers, gray also signals that uncertainty.
All of the other layers are inefficient in some way, so they get brighter, stronger colors to attract the user's attention. Colors such as purple, pink, and blue indicate that something may need to be optimized. Users should use as few of these layers as feasible.

freyam · 2021-08-11T13:16:58Z

Apart from colors, we could also experiment with the node outline or the node shape.

Some examples

More peripheries to show bottlenecks.

Rounded node shape to show simplicity and easiness
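
For reference, both ideas map onto standard DOT node attributes: `peripheries=2` draws a double outline, and adding `rounded` to `style` rounds the corners of a box. The sketch below is only an illustration; `node_attrs` and `to_dot` are hypothetical helpers, and which layers would count as bottlenecks or as simple is left open.

```python
def node_attrs(fill, bottleneck=False, simple=False):
    """Build DOT attributes for one node (flag names are illustrative)."""
    styles = ["filled"]
    if simple:
        styles.append("rounded")  # rounded corners on a box-shaped node
    attrs = {"shape": "box", "fillcolor": fill, "style": ",".join(styles)}
    if bottleneck:
        attrs["peripheries"] = "2"  # double outline
    return attrs


def to_dot(name, attrs):
    """Format a node name and its attributes as a DOT node statement."""
    body = ", ".join(f'{key}="{value}"' for key, value in attrs.items())
    return f'"{name}" [{body}];'


print(to_dot("shuffle", node_attrs("pink", bottleneck=True)))
print(to_dot("blockwise", node_attrs("green", simple=True)))
```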

@martindurant

freyam · 2021-08-11T13:25:13Z

Also, where do we stand on adding these colors to the HTML reprs as well (as shown in #7919 (comment))?

(outdated screenshot, but concept remains)

GenevieveBuckley · 2021-08-13T01:28:34Z

Can we put "Legend: Layer types" as the legend heading? I think adding that it's about layer types will help make it clearer.

The color keyword argument is nice. We'll need to add it to the visualize docstring, including what the allowed values are, so that people know it exists. Take a look at how it's phrased in the low level graph visualize docstring and use that as a guide.

I like the double layer outline, that does draw my attention to it. The rounded edges I would be less likely to notice, for me that's less effective.

I personally am not a big fan of adding more colours to the HTML reprs. I think it makes it more confusing because it's not explained what they mean, and the layer type information is included in the details dropdown for each of them already.

freyam · 2021-08-13T08:49:48Z

> I like the double layer outline, that does draw my attention to it.

Which layers would you like to see with this double outline?

My plan:

DataFrameIOLayer
ShuffleLayer
SimpleShuffleLayer
ArrayOverlayLayer

GenevieveBuckley · 2021-08-16T06:46:10Z

I'm unsure about whether we should highlight some layer types (eg: with double outlines) given that we don't really have a clear-cut, definitive list for what is or isn't efficient. I'll let @martindurant weigh in though, since I think you may have had some previous conversations around that.

GenevieveBuckley · 2021-08-16T23:53:55Z

@martindurant when you get a chance, can you look this over for merging? It is Freyam's last few days of GSOC, so we'd like to get these open PRs in.

martindurant · 2021-08-17T02:30:41Z

Can we have a final demo image here, perhaps also add to the initial PR description, so people coming here get to understand what was done quickly?

I'm not sure I answered regarding double outlining some nodes; I would not do it yet, and see how the colouring goes down with the community first. If you did the rounded corners, that's fine, since it's rather subtle. I expect in the long run we will come up with some shapes too, but let's let this PR do what the current title says.

freyam · 2021-08-17T02:42:20Z

@martindurant Updated ✔️

martindurant

Sorry, I went through this again and I have some thoughts.

dask/highlevelgraph.py

martindurant · 2021-08-17T13:06:02Z

dask/highlevelgraph.py

@@ -1271,13 +1320,52 @@ def to_graphviz(
        attrs.setdefault("fontsize", str(node_size))
        attrs.setdefault("tooltip", str(node_tooltips))

+        if color == "layer_type":
+            layer_colors = {


Possible future improvement: it feels like these should be (class) attributes of the layers themselves. Since colouring is not going to be default yet, it's ok to leave them here for now.
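
One way to read this suggestion, sketched under stated assumptions: each layer class carries its own color as a class attribute, and the visualization code reads it instead of consulting a central dict. The classes below are empty stand-ins, and `viz_color` is a hypothetical attribute name, not part of dask's Layer API.

```python
class Layer:
    # Hypothetical attribute; the default color is an assumption.
    viz_color = "white"


class Blockwise(Layer):
    viz_color = "green"


class MaterializedLayer(Layer):
    viz_color = "gray"


class CustomLayer(Layer):
    """A layer type that does not override the color inherits the default."""


def fill_color(layer):
    """Read the color from the layer's class instead of a lookup table."""
    return type(layer).viz_color


print(fill_color(Blockwise()))
print(fill_color(CustomLayer()))
```

With this layout, a new layer type would pick up a color without any change to the graphviz code.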

dask/highlevelgraph.py

martindurant

Some more comments (sorry) on the docs

dask/base.py

dask/highlevelgraph.py

freyam · 2021-08-18T20:02:37Z

💛

base implementation

1f507b2

freyam mentioned this pull request Aug 2, 2021

Categorise high level layers for display #7919

Closed

added legend

b10c84d

freyam mentioned this pull request Aug 5, 2021

Add tooltips to graphviz #7973

Merged

3 tasks

made color coding optional

700e958

Merge branch 'main' of https://github.com/dask/dask into high-level-l…

e3d7009

…ayers-repr

GenevieveBuckley mentioned this pull request Aug 10, 2021

Google Summer of Code 2021 Project dask/dask-blog#107

Merged

Merge branch 'main' of https://github.com/dask/dask into high-level-l…

3cabc7d

…ayers-repr

new colors

409bb26

added doctstrings

c073aa5

freyam marked this pull request as ready for review August 13, 2021 10:45

freyam added 2 commits August 16, 2021 12:20

cleaner code

f82e55e

Merge branch 'main' of https://github.com/dask/dask into high-level-l…

486961c

…ayers-repr

martindurant reviewed Aug 17, 2021

View reviewed changes

freyam added 2 commits August 17, 2021 18:00

only show present colors in Legend

42c85a4

update doc

88a13f7

martindurant reviewed Aug 18, 2021

View reviewed changes

freyam added 3 commits August 18, 2021 22:28

better docs

56698f9

added low level

666db96

fixed default filenames

cf39759

martindurant merged commit a5ff631 into dask:main Aug 18, 2021

freyam deleted the high-level-layers-repr branch August 18, 2021 20:02

ian-r-rose mentioned this pull request Jul 19, 2022

Change repr methods to avoid Layer materialization #9289

Merged
