Categorise high level layers for display #7919

martindurant · 2021-07-20T18:27:08Z

There are currently a concrete number of subclasses of the base highlevelgraph.Layer. Some of these have specific contexts or collection linkage (array vs. dataframe); others do not. For the sake of the work being done by @freyam, it would be nice to create some categories for the purpose of being shown in .visualize().

Layers allow for attaching attributes at instantiation. I suggest there might also be class attributes giving information about the layer type, which will be true for all instances.

Example: DataFrameIOLayer is IO by operation type and dataframe by collection. It would be reasonable for these to be among the default annotations of all instances.
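
A minimal sketch of this idea, assuming a hypothetical class attribute named `default_annotations` (this name is an illustration, not part of dask's API) and using stand-in classes rather than the real dask ones:

```python
from typing import Any, Dict, Optional


class Layer:
    """Stand-in for dask's Layer, for illustration only."""

    # Hypothetical class-level defaults shared by every instance.
    default_annotations: Dict[str, Any] = {}

    def __init__(self, annotations: Optional[Dict[str, Any]] = None):
        # Annotations passed at instantiation override the class-level defaults.
        self.annotations = {**type(self).default_annotations, **(annotations or {})}


class DataFrameIOLayer(Layer):
    # IO by operation type, dataframe by collection.
    default_annotations = {"operation": "io", "collection": "dataframe"}


layer = DataFrameIOLayer(annotations={"priority": 1})
print(layer.annotations)
```

Every `DataFrameIOLayer` instance would then carry the two category annotations without the caller having to pass them.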


martindurant · 2021-07-20T18:32:49Z

cc @GenevieveBuckley

freyam · 2021-07-20T18:45:18Z

This is amazing! I will get started on finding more about this and share my findings here! 🚀

martindurant · 2021-07-20T18:57:45Z

I suggest starting with a concrete list of the current known implementations of Layer.

freyam · 2021-07-21T07:29:48Z

List of Layers

General

# High level graph layer
class Layer(collections.abc.Mapping)

# Fully materialized layer of `Layer`
class MaterializedLayer(Layer)

# Tensor Operation
class Blockwise(Layer)

Array Layers

# Specialized Blockwise Layer for array creation routines
class BlockwiseCreateArray(Blockwise)

# Simple HighLevelGraph array overlap layer
class ArrayOverlapLayer(Layer)

Dataframe Layers

# DataFrame-based HighLevelGraph Layer
class DataFrameLayer(Layer)

# High-level graph layer for a simple shuffle operation in which each output partition depends on all input partitions
class SimpleShuffleLayer(DataFrameLayer)

# High-level graph layer corresponding to a single stage of a multi-stage inter-partition shuffle operation.
class ShuffleLayer(SimpleShuffleLayer)

# High-level graph layer for a join operation requiring the smaller collection to be broadcasted to every partition of the larger collection.
class BroadcastJoinLayer(DataFrameLayer)

# DataFrame-based Blockwise Layer with IO
class DataFrameIOLayer(Blockwise, DataFrameLayer)
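
A list like this could also be gathered programmatically. The sketch below walks `__subclasses__()` recursively; `all_subclasses` is a hypothetical helper, and the toy hierarchy only mirrors part of the list above rather than importing the real dask classes. Note that this approach only finds subclasses defined in modules that have already been imported.

```python
from typing import List, Type


def all_subclasses(cls: Type) -> List[Type]:
    """Return every direct and indirect subclass of cls that is currently defined."""
    found: List[Type] = []
    for sub in cls.__subclasses__():
        if sub not in found:
            found.append(sub)
        # Recurse so that grandchildren are included; skip duplicates that
        # arise from multiple inheritance.
        for nested in all_subclasses(sub):
            if nested not in found:
                found.append(nested)
    return found


# Toy hierarchy (not the real dask classes).
class Layer: ...
class Blockwise(Layer): ...
class DataFrameLayer(Layer): ...
class DataFrameIOLayer(Blockwise, DataFrameLayer): ...


print(sorted(c.__name__ for c in all_subclasses(Layer)))
```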

GenevieveBuckley · 2021-07-21T08:41:08Z

Copying over my comment from the Slack thread earlier today:

I think figuring out what categories should be included will be the tricky thing, and trying to make those categories be something users will care about. It seems like it'd be more useful if we can use them to highlight bottlenecks. I assume that's why you suggest IO and shuffle. We seem a bit vague on things beyond that though, I don't know that "all other CPU computations" is likely to be very useful to display (and will probably be most cases)

martindurant · 2021-07-21T13:38:50Z

So I think I'm more advocating that the Layer classes have some default annotations - and that's all. That way, we only need to look at these annotations, we don't need to do any isinstance checks. Of course, we still might want to look into the layer contents, to make the difference between GPU/CPU.

Totally agree that "CPU" should not have any associated decoration, as it will be the default, most common box.
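
As a rough sketch of what "only look at the annotations" could mean, assuming a hypothetical `"operation"` annotation key and placeholder colours (neither is taken from dask):

```python
from typing import Dict, Optional

# Hypothetical decorations keyed by an "operation" annotation.
# The colour names are placeholders, not the ones used in dask.
DECORATIONS: Dict[str, str] = {"io": "orange", "shuffle": "red"}


def decoration_for(annotations: Optional[Dict[str, str]]) -> Optional[str]:
    """Return a decoration for a layer, or None for the undecorated default."""
    operation = (annotations or {}).get("operation")
    # Unknown or missing operations (the common "CPU" case) get no decoration.
    return DECORATIONS.get(operation)


print(decoration_for({"operation": "io"}))
print(decoration_for(None))
```

No `isinstance` check is involved: the display code never needs to know which `Layer` subclass it is looking at.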

freyam · 2021-07-23T16:30:39Z

@martindurant
What would the default annotations be for all the layers? I can think of layer_id.

Meanwhile, I have come up with the visualization for the HLGs!

Design

Group the listed layer_types into 4 categories

IO (DataFrame)

DataFrameIOLayer

Shuffle (DataFrame)

SimpleShuffleLayer
ShuffleLayer

Blockwise (Array mostly + DataFrame)

Blockwise
BlockwiseCreateArray

Materialized (Array + DataFrame)

MaterializedLayer

The remaining layers:

ArrayOverlapLayer
DataFrameLayer
BroadcastJoinLayer
can just be left as they are.
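
The grouping above can be sketched as a plain lookup on the layer class name. `LAYER_CATEGORIES` and `categorise` are hypothetical names used for illustration; uncategorised layers map to `None`:

```python
from typing import Dict, Optional

# Mapping from layer class name to the display category proposed above.
LAYER_CATEGORIES: Dict[str, str] = {
    "DataFrameIOLayer": "IO",
    "SimpleShuffleLayer": "Shuffle",
    "ShuffleLayer": "Shuffle",
    "Blockwise": "Blockwise",
    "BlockwiseCreateArray": "Blockwise",
    "MaterializedLayer": "Materialized",
}


def categorise(layer_type_name: str) -> Optional[str]:
    """Return the display category for a layer class name, or None if uncategorised."""
    return LAYER_CATEGORIES.get(layer_type_name)


for name in ["ShuffleLayer", "BlockwiseCreateArray", "ArrayOverlapLayer"]:
    print(name, "->", categorise(name))
```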

Visualization

Graphviz (`Layer.dask.visualize()`)

We can show the different groups with different colors.

EDIT: Updated Graph Representation

HTML Repr (`Layer.dask`)

Using the same colors used in the Graphviz, we change the tiny circles next to each layer to reflect its layer_type.

EDIT: Updated HTML Representation

(I have also modified the MaterializedLayer's existing color to be a little lighter. I feel it looks nicer and is easier on the eyes.)
It might seem too colorful, but that's because I am showing all the different types of layers at the same time. This is highly unlikely to be observed in real Dask graphs.

/cc @jacobtomlinson I would love to hear your opinion as well on the colors. You have done an amazing job adding all the existing colors ✨

martindurant · 2021-07-23T16:32:00Z

> What would the default annotations be for all the layers

Empty? If we can't think of anything useful, it's better not to complicate the visuals.

freyam · 2021-07-23T16:39:06Z

/cc @mrocklin What do you think about the use of colors?
(colors are open for discussion)

GenevieveBuckley · 2021-07-26T02:45:27Z

I hadn't thought of changing the colours in the dots next to the layer titles in the HTML representation, that's a nice touch.

Personally, I'm not sure we're gaining much by having multiple shades of green/blue/etc. for different types of layers in the same larger category. I think that adds more confusion.

We'll need some effort spent on:

- Explaining what information the user should gain from this (e.g. shuffles are inefficient so you don't want to see too many of them, blockwise layers are likely to be nicely parallelized, etc.). You'll need to be able to explain this very clearly for all of them.
- How we are going to communicate that information to the user (a legend for graphviz? added information in the docs? something in the HTML repr?)

freyam · 2021-07-26T09:42:39Z

> Personally, I'm not sure we're gaining much by having multiple shades of green/blue/etc. for different types of layers in the same larger category. I think that adds more confusion.

You are right! This looks even cleaner ✨

I have also added a Legend below so users have a good idea of what the colors mean.

Note: The colors used in Layer 2, 5, and 6 are supposed to be Green (Blockwise) as well, but I have used different ones to show all of the colors

> Explaining what information the user should gain from this (eg: shuffles are inefficient so you don't want to see too many of them, blockwise layers are likely to be nicely parallelized, etc.). You'll need to be able to explain this very clearly for all of them.

I'm unfamiliar with this section of the code. So I'm not sure which layers are more difficult for users and which are simple (quick?).

If I have more context about the different layers and what makes them slower than the rest, I can perhaps do something more with those layers' nodes.

For example, I can add a little node form note to all of the Shuffle and DataFrameIO layers. This could imply that this is an area worth investigating.

How does this sound?

GenevieveBuckley · 2021-07-27T02:27:53Z

> For example, I can add a little node form note to all of the Shuffle and DataFrameIO layers.

I'm sorry, what does this mean?

martindurant · 2021-07-27T02:44:26Z

I think it means that the outline box is not just a rectangle, but has a little fold in the corner. Pretty subtle.

freyam · 2021-07-27T03:14:19Z

@GenevieveBuckley Yes, Martin is correct.

So, in essence, the current node shape is "box." Another node shape is "note," which resembles a paper fold from the corner.

Normally, people fold a page to bookmark it so that it can be easily found later. In this case, the fold would indicate that this area warrants further investigation.
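
To make the box/note distinction concrete, here is a small sketch that emits Graphviz DOT text by hand. `box` and `note` are both standard Graphviz node shapes; the helper names and the choice of which layers to highlight are assumptions for illustration only.

```python
from typing import Iterable, Set


def dot_node(name: str, highlight: bool = False) -> str:
    """Return one DOT node statement, using the 'note' shape for highlighted layers."""
    shape = "note" if highlight else "box"
    return f'  "{name}" [shape={shape}];'


def to_dot(layer_names: Iterable[str], highlighted: Set[str]) -> str:
    """Build a DOT digraph containing one node per layer name."""
    lines = ["digraph G {"]
    for name in layer_names:
        lines.append(dot_node(name, highlight=name in highlighted))
    lines.append("}")
    return "\n".join(lines)


print(to_dot(["read-csv", "shuffle", "add"], highlighted={"read-csv", "shuffle"}))
```

The resulting text can be rendered with the `dot` command-line tool to compare the two shapes side by side.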

jacobtomlinson · 2021-07-27T09:45:49Z

> /cc @jacobtomlinson I would love to hear your opinion as well on the colors. You have done an amazing job adding all the existing colors ✨

Thanks @freyam, but all the colour design was done by an external designer. See dask/community#135. I've added another comment to that issue to see if we can get the designer involved in choosing more colours.

freyam · 2021-07-27T09:48:44Z

> I've added another comment to that issue to see if we can get the designer involved in choosing the more colours.

That's amazing! 🚀

martindurant · 2021-07-28T15:32:00Z

> So, in essence, the current node shape is "box." Another node shape is "note," which resembles a paper fold from the corner.

I think that the difference is too subtle to see in your examples, and the meaning of the different shapes will not be obvious to users.

freyam · 2021-07-28T22:41:13Z

Just wanna confirm:
Shuffle and DataFrameIO layers are the 2 specific layers you want to highlight differently to point at the potential source of inefficiency?

freyam · 2021-08-02T10:08:43Z

I have opened a Draft PR where I will be working along with this discussion.
#7974

This was referenced Jul 31, 2021:

Adding tooltips to graphviz representation #7970 (Closed)
Add colors to represent high level layer types #7974 (Merged)

martindurant closed this as completed in #7974 Aug 18, 2021

DahnJ mentioned this issue Oct 13, 2021:

Update HighLevelGraph documentation #7709 (Open)
