Add colors to represent high level layer types #7974

Merged
martindurant merged 14 commits into dask:main from the high-level-layers-repr branch on Aug 18, 2021

freyam
Contributor

@freyam freyam commented Aug 2, 2021

This PR adds an extra attribute to the graphviz output of high-level graphs: node fill color. Nodes are colored according to their layer_type.

This was achieved by using DOT's fillcolor attribute.

Currently, the coloring is optional. To enable it, users pass an additional kwarg, color="layer_type", when calling dask.visualize().
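As a rough illustration of the mechanism (not the code in this PR), the sketch below emits DOT text in which each node gets a fillcolor chosen from its layer type. The function name `to_dot`, the `layer_types` mapping, and the fallback color are assumptions made for this example. Note that DOT only paints the fill when the node style includes `filled`.

```python
def to_dot(layer_types, layer_colors, color=None):
    """Return DOT source with one box node per layer.

    layer_types: dict of node name -> layer type name (hypothetical input).
    layer_colors: dict of layer type name -> color name.
    color: pass "layer_type" to opt in to coloring; None leaves nodes plain.
    """
    lines = ["digraph G {", "  node [shape=box];"]
    for name, layer_type in layer_types.items():
        attrs = {"label": name}
        if color == "layer_type":
            # fillcolor is only drawn when the style includes "filled".
            attrs["style"] = "filled"
            # Unknown layer types fall back to white (an assumption).
            attrs["fillcolor"] = layer_colors.get(layer_type, "white")
        attr_text = ", ".join(f'{k}="{v}"' for k, v in attrs.items())
        lines.append(f'  "{name}" [{attr_text}];')
    lines.append("}")
    return "\n".join(lines)


layers = {"read-parquet": "DataFrameIOLayer", "getitem": "Blockwise"}
colors = {"DataFrameIOLayer": "purple", "Blockwise": "green"}
plain_dot = to_dot(layers, colors)
colored_dot = to_dot(layers, colors, color="layer_type")
print(colored_dot)
```

Keeping the default at `color=None` means existing callers see unchanged output, which matches the opt-in behavior described above.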

Demo

import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()

df3.dask.visualize(color="layer_type")

image

df3.dask.visualize(color="layer_type")

image

Color Scheme

layer_colors = {
    "DataFrameIOLayer": "purple",
    "ShuffleLayer": "rose",
    "SimpleShuffleLayer": "rose",
    "ArrayOverlapLayer": "pink",
    "BroadcastJoinLayer": "blue",
    "Blockwise": "green",
    "BlockwiseLayer": "green",
    "BlockwiseCreateArray": "green",
    "MaterializedLayer": "gray",
}

image

Explanation

  • DataFrameIOLayer: inefficient
  • ShuffleLayer, SimpleShuffleLayer: inefficient
  • ArrayOverlapLayer: inefficient
  • BroadcastJoinLayer: (?)
  • Blockwise, BlockwiseLayer, BlockwiseCreateArray: efficient; easy to parallelize
  • MaterializedLayer: inefficient; better to materialize as late as possible

Key Points

  • Blockwise layers are more efficient than the others and can be readily parallelized, so they are green (a color associated with something being right). When users see a green layer, they can be confident that the operation is already efficient and needs no optimization.
  • Gray (which signifies neutrality and balance) is used for Materialized layers to indicate that materialization should happen as late as feasible. Since we are not sure of the optimal way to optimize them, gray also signals that uncertainty.
  • All of the other layers are inefficient in some way, so they get brighter, stronger colors to attract the user's attention. Colors such as purple, pink, and blue indicate that something may need to be optimized. Users should use as few of these layers as feasible.
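The three groups described in the key points can be sketched as a small lookup. This is only an illustration: the `describe` helper and its return strings are hypothetical, and the dictionary is a subset of the scheme posted above.

```python
# Subset of the posted color scheme, copied as written in the PR description.
LAYER_COLORS = {
    "DataFrameIOLayer": "purple",
    "ShuffleLayer": "rose",
    "BroadcastJoinLayer": "blue",
    "Blockwise": "green",
    "MaterializedLayer": "gray",
}


def describe(layer_type):
    """Translate a layer type into the reading suggested by its color."""
    color = LAYER_COLORS.get(layer_type)
    if color is None:
        return "unknown layer type"
    if color == "green":
        return "efficient"
    if color == "gray":
        return "neutral: materialize as late as possible"
    # Every remaining color is one of the brighter, attention-drawing ones.
    return "worth a closer look"


for name in ("Blockwise", "MaterializedLayer", "ShuffleLayer", "SomethingElse"):
    print(name, "->", describe(name))
```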

@GPUtester
Collaborator

Can one of the admins verify this patch?

@freyam freyam mentioned this pull request Aug 5, 2021
3 tasks
@GenevieveBuckley
Contributor

Discussed in the meeting today, we want to:

  1. Make this color-coding feature optional, and turn it on using a color keyword argument (similar to https://docs.dask.org/en/latest/api.html#dask.visualize)
  2. Write a short explanation targeted at new Dask users, telling them what the colours mean and why they should care about it.

@GenevieveBuckley
Contributor

For that second part, why should someone care / what does it mean:

  • There's some discussion of this in the thread "Categorise high level layers for display" (#7919)
  • Very roughly, blockwise layers are pretty good/efficient, shuffles are not very efficient so you don't want to have too many of them (probably similar for array overlap layers), and you want to avoid materializing layers until the very end when the computation runs
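One way to act on this rule of thumb is to count how many layers of each type a graph contains. The sketch below is a minimal, hedged example: `count_layer_types` is a hypothetical helper, and the classes are empty stand-ins so the snippet runs without Dask. With Dask installed, the `layers` mapping of a collection's high-level graph could be passed in instead.

```python
from collections import Counter


def count_layer_types(layers):
    """Count layers by class name.

    layers: mapping of layer name -> layer object.
    """
    return Counter(type(layer).__name__ for layer in layers.values())


# Empty stand-in classes (assumptions for this example, not Dask's classes).
class Blockwise:
    pass


class SimpleShuffleLayer:
    pass


class MaterializedLayer:
    pass


layers = {
    "getitem": Blockwise(),
    "gt": Blockwise(),
    "shuffle": SimpleShuffleLayer(),
    "finalize": MaterializedLayer(),
}
counts = count_layer_types(layers)
print(counts.most_common())
```

A graph dominated by blockwise layers with few shuffles would, by the reasoning above, be in reasonably good shape.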

@freyam
Contributor Author

freyam commented Aug 10, 2021

Demo

Made the color-coding optional.

Users need to supply color="layer_type" to see the colors and the Legend.

image
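For readers curious how a legend can be expressed in DOT, here is a hedged sketch that builds a cluster subgraph listing only the layer types present in the graph. The `legend_dot` function and its node naming are assumptions for this example and are not taken from the PR's implementation. In DOT, a subgraph whose name starts with `cluster` is drawn as a labeled box.

```python
def legend_dot(present_types, layer_colors, title="Legend"):
    """Return a DOT cluster with one filled box per layer type present."""
    lines = ["subgraph cluster_legend {", f'  label="{title}";']
    for layer_type in sorted(set(present_types)):
        color = layer_colors.get(layer_type)
        if color is None:
            # Skip layer types that have no assigned color.
            continue
        lines.append(
            f'  "legend_{layer_type}" [label="{layer_type}", shape=box, '
            f'style="filled", fillcolor="{color}"];'
        )
    lines.append("}")
    return "\n".join(lines)


colors = {"Blockwise": "green", "MaterializedLayer": "gray"}
legend = legend_dot(["Blockwise", "Blockwise", "Unknown", "MaterializedLayer"], colors)
print(legend)
```

The returned text would be placed inside the enclosing `digraph { ... }` block, and only when the caller opted in with color="layer_type".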

Now, I will try to gather info about different layer_types.

@freyam
Contributor Author

freyam commented Aug 11, 2021

Update

Color Scheme

layer_colors = {
    "DataFrameIOLayer": "purple",
    "ShuffleLayer": "rose",
    "SimpleShuffleLayer": "rose",
    "ArrayOverlapLayer": "pink",
    "BroadcastJoinLayer": "blue",
    "Blockwise": "green",
    "BlockwiseLayer": "green",
    "BlockwiseCreateArray": "green",
    "MaterializedLayer": "gray",
}

image

Explanation

  • DataFrameIOLayer: inefficient
  • ShuffleLayer, SimpleShuffleLayer: inefficient
  • ArrayOverlapLayer: inefficient
  • BroadcastJoinLayer: (?)
  • Blockwise, BlockwiseLayer, BlockwiseCreateArray: efficient; easy to parallelize
  • MaterializedLayer: inefficient; better to materialize as late as possible

Key Points

  • Blockwise layers are more efficient than the others and can be readily parallelized, so they are green (a color associated with something being right). When users see a green layer, they can be confident that the operation is already efficient and needs no optimization.
  • Gray (which signifies neutrality and balance) is used for Materialized layers to indicate that materialization should happen as late as feasible. Since we are not sure of the optimal way to optimize them, gray also signals that uncertainty.
  • All of the other layers are inefficient in some way, so they get brighter, stronger colors to attract the user's attention. Colors such as purple, pink, and blue indicate that something may need to be optimized. Users should use as few of these layers as feasible.

@freyam
Contributor Author

freyam commented Aug 11, 2021

Apart from colors, we could also tinker with the outline or the node shape.

Some examples

  1. More peripheries to show bottlenecks.

image

  2. Rounded node shape to show simplicity and ease

image
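Both ideas map onto standard Graphviz node attributes: `peripheries` sets the number of outlines, and adding `rounded` to `style` rounds the corners of a box. The sketch below shows one way to assemble such attributes. The helper names and the choice of which layer types count as bottlenecks or as simple are purely hypothetical.

```python
def node_attrs(layer_type, bottlenecks=(), simple=()):
    """Return DOT node attributes for a layer type."""
    attrs = {"shape": "box", "style": "filled"}
    if layer_type in bottlenecks:
        # Two peripheries draw a double outline around the node.
        attrs["peripheries"] = "2"
    if layer_type in simple:
        # "rounded" combines with "filled" as a comma-separated style list.
        attrs["style"] = "rounded,filled"
    return attrs


def format_attrs(attrs):
    """Render an attribute dict as the body of a DOT attribute list."""
    return ", ".join(f'{key}="{value}"' for key, value in attrs.items())


shuffle_attrs = node_attrs("ShuffleLayer", bottlenecks={"ShuffleLayer"})
blockwise_attrs = node_attrs("Blockwise", simple={"Blockwise"})
print(f'"shuffle" [{format_attrs(shuffle_attrs)}];')
print(f'"add" [{format_attrs(blockwise_attrs)}];')
```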

@martindurant

@freyam
Contributor Author

freyam commented Aug 11, 2021

Also, where do we stand on adding these colors to the HTML reprs as well (as shown in #7919 (comment))?

image
(outdated screenshot, but concept remains)

@GenevieveBuckley
Contributor

GenevieveBuckley commented Aug 13, 2021

Can we put "Legend: Layer types" as the legend heading? I think adding that it's about layer types will help make it clearer.

The color keyword argument is nice. We'll need to add it to the visualize docstring, including what the allowed values are, so that people know it exists. Take a look at how it's phrased in the low level graph visualize docstring and use that as guide.

I like the double layer outline, that does draw my attention to it. The rounded edges I would be less likely to notice, for me that's less effective.

I personally am not a big fan of adding more colours to the HTML reprs. I think it makes it more confusing because it's not explained what they mean, and the layer type information is included in the details dropdown for each of them already.

@freyam
Contributor Author

freyam commented Aug 13, 2021

I like the double layer outline, that does draw my attention to it.

Which layers would you like to see use this?

My plan:

  • DataFrameIOLayer
  • ShuffleLayer
  • SimpleShuffleLayer
  • ArrayOverlapLayer

@freyam freyam marked this pull request as ready for review August 13, 2021 10:45
@GenevieveBuckley
Contributor

I'm unsure about whether we should highlight some layer types (eg: with double outlines) given that we don't really have a clear-cut, definitive list for what is or isn't efficient. I'll let @martindurant weigh in though, since I think you may have had some previous conversations around that.

@GenevieveBuckley
Contributor

@martindurant when you get a chance, can you look this over for merging? It is Freyam's last few days of GSoC, so we'd like to get these open PRs in.

@martindurant
Member

Can we have a final demo image here, perhaps also add to the initial PR description, so people coming here get to understand what was done quickly?

I'm not sure I answered regarding double outlining some nodes; I would not do it yet, and see how the colouring goes down with the community first. If you did the rounded corners, that's fine, since it's rather subtle. I expect in the long run we will come up with some shapes too, but let's let this PR do what the current title says.

@freyam
Contributor Author

freyam commented Aug 17, 2021

@martindurant Updated ✔️

Member

@martindurant martindurant left a comment

Sorry, I went through this again and I have some thoughts.

dask/highlevelgraph.py
@@ -1271,13 +1320,52 @@ def to_graphviz(
attrs.setdefault("fontsize", str(node_size))
attrs.setdefault("tooltip", str(node_tooltips))

if color == "layer_type":
layer_colors = {
Member

Possible future improvement: it feels like these should be (class) attributes of the layers themselves. Since colouring is not going to be default yet, it's ok to leave them here for now.
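A minimal sketch of what that suggestion could look like, assuming a hypothetical `layer_color` class attribute. Dask's layer classes are not known to define this attribute; the classes below are stand-ins written only for this example.

```python
class Layer:
    # Hypothetical attribute: the default fill color for any layer.
    layer_color = "white"


class Blockwise(Layer):
    layer_color = "green"


class MaterializedLayer(Layer):
    layer_color = "gray"


class CustomLayer(Layer):
    # Defines no color of its own, so it inherits the base default.
    pass


def fillcolor_for(layer):
    """Read the color from the layer's class instead of a central dict."""
    return type(layer).layer_color


print([fillcolor_for(obj) for obj in (Blockwise(), MaterializedLayer(), CustomLayer())])
```

With this layout, a new layer type carries its own color and the visualization code needs no central lookup table to stay in sync.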

Member

@martindurant martindurant left a comment

Some more comments (sorry) on the docs

@martindurant martindurant merged commit a5ff631 into dask:main Aug 18, 2021
@freyam
Contributor Author

freyam commented Aug 18, 2021

💛

@freyam freyam deleted the high-level-layers-repr branch August 18, 2021 20:02
Development

Successfully merging this pull request may close these issues.

Categorise high level layers for display
4 participants