Categorise high level layers for display #7919

Closed
martindurant opened this issue Jul 20, 2021 · 19 comments · Fixed by #7974

Comments

@martindurant
Member

There is currently a concrete number of subclasses of the base highlevelgraph.Layer. Some of these have specific contexts or collection linkage (array vs. dataframe), others do not. For the sake of the work being done by @freyam, it would be nice to create some categories for the purpose of being shown in .visualize().

Layers allow for attaching attributes at instantiation. I suggest there might also be class attributes giving information about the layer type, which will be true for all instances.

Example: DataFrameIOLayer is IO by operation type and dataframe by collection. It would be reasonable for these to be among the default annotations of all instances.
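A minimal sketch of that idea, under stated assumptions: the `Layer` class below is a toy stand-in rather than Dask's real `highlevelgraph.Layer`, and the attribute names `operation_type` and `collection_type` are hypothetical. It only illustrates how class-level categories could be folded into the annotations of every instance.

```python
from collections.abc import Mapping


class Layer(Mapping):
    """Toy stand-in for dask.highlevelgraph.Layer (not the real class)."""

    # Hypothetical class-level categories; these names are assumptions.
    operation_type = None
    collection_type = None

    def __init__(self, tasks=None, annotations=None):
        self._tasks = dict(tasks or {})
        # Start from the class-level categories, skipping unset ones,
        # then let instance annotations override them.
        defaults = {
            "operation_type": self.operation_type,
            "collection_type": self.collection_type,
        }
        self.annotations = {k: v for k, v in defaults.items() if v is not None}
        self.annotations.update(annotations or {})

    def __getitem__(self, key):
        return self._tasks[key]

    def __iter__(self):
        return iter(self._tasks)

    def __len__(self):
        return len(self._tasks)


class DataFrameIOLayer(Layer):
    # IO by operation type, dataframe by collection.
    operation_type = "io"
    collection_type = "dataframe"


layer = DataFrameIOLayer(annotations={"priority": 1})
print(layer.annotations)
```

With this shape, a plain `Layer()` carries no default annotations at all, so nothing extra would need to be drawn for it.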

@martindurant
Member Author

cc @GenevieveBuckley

@freyam
Contributor

freyam commented Jul 20, 2021

This is amazing! I will get started on finding more about this and share my findings here! 🚀

@martindurant
Member Author

I suggest starting with a concrete list of the current known implementations of Layer.

@freyam
Contributor

freyam commented Jul 21, 2021

List of Layers

General

# High level graph layer
class Layer(collections.abc.Mapping)

# Fully materialized layer of `Layer`
class MaterializedLayer(Layer)

# Tensor Operation
class Blockwise(Layer)

Array Layers

# Specialized Blockwise Layer for array creation routines
class BlockwiseCreateArray(Blockwise)

# Simple HighLevelGraph array overlap layer
class ArrayOverlapLayer(Layer)

Dataframe Layers

# DataFrame-based HighLevelGraph Layer
class DataFrameLayer(Layer)

# High-level graph layer for a simple shuffle operation in which each output partition depends on all input partitions
class SimpleShuffleLayer(DataFrameLayer)

# High-level graph layer corresponding to a single stage of a multi-stage inter-partition shuffle operation.
class ShuffleLayer(SimpleShuffleLayer)

# High-level graph layer for a join operation requiring the smaller collection to be broadcasted to every partition of the larger collection.
class BroadcastJoinLayer(DataFrameLayer)

# DataFrame-based Blockwise Layer with IO
class DataFrameIOLayer(Blockwise, DataFrameLayer)
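A list like this could also be produced programmatically by walking `__subclasses__()`. The sketch below uses empty toy classes that mirror the hierarchy above instead of importing Dask; `all_subclasses` is a hypothetical helper name. With real classes, note that `__subclasses__()` only reports classes whose defining modules have already been imported.

```python
# Empty toy classes mirroring the hierarchy listed above (not the Dask classes).
class Layer: pass
class MaterializedLayer(Layer): pass
class Blockwise(Layer): pass
class BlockwiseCreateArray(Blockwise): pass
class ArrayOverlapLayer(Layer): pass
class DataFrameLayer(Layer): pass
class SimpleShuffleLayer(DataFrameLayer): pass
class ShuffleLayer(SimpleShuffleLayer): pass
class BroadcastJoinLayer(DataFrameLayer): pass
class DataFrameIOLayer(Blockwise, DataFrameLayer): pass


def all_subclasses(cls):
    """Return every direct and indirect subclass of cls, without duplicates."""
    found = []
    for sub in cls.__subclasses__():
        if sub not in found:
            found.append(sub)
        # Recurse so that grandchildren such as ShuffleLayer are included.
        for deeper in all_subclasses(sub):
            if deeper not in found:
                found.append(deeper)
    return found


names = sorted(c.__name__ for c in all_subclasses(Layer))
for name in names:
    print(name)
```

The de-duplication matters because `DataFrameIOLayer` is reachable through both `Blockwise` and `DataFrameLayer`.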

@GenevieveBuckley
Contributor

Copying over my comment from the Slack thread earlier today:

I think figuring out what categories should be included will be the tricky thing, and trying to make those categories be something users will care about. It seems like it'd be more useful if we can use them to highlight bottlenecks. I assume that's why you suggest IO and shuffle. We seem a bit vague on things beyond that though, I don't know that "all other CPU computations" is likely to be very useful to display (and will probably be most cases)

@martindurant
Member Author

So I think I'm more advocating that the Layer classes have some default annotations - and that's all. That way, we only need to look at these annotations, we don't need to do any isinstance checks. Of course, we still might want to look into the layer contents, to distinguish GPU from CPU.

Totally agree that "CPU" should not have any associated decoration, as it will be the default, most common box.
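As a rough illustration of reading annotations instead of doing isinstance checks: the annotation key `operation_type`, the `DECORATIONS` table, and the colour values below are all hypothetical placeholders, not anything Dask defines.

```python
# Hypothetical decoration table keyed by an assumed "operation_type" annotation.
DECORATIONS = {
    "io": {"fillcolor": "#add8e6"},
    "shuffle": {"fillcolor": "#ffa500"},
}


def node_attributes(annotations):
    """Pick display attributes from a layer's annotations only.

    No isinstance checks are involved. A layer with no recognised category
    (the common CPU case) gets an empty dict, i.e. the default plain box.
    """
    category = (annotations or {}).get("operation_type")
    # Copy so callers cannot mutate the shared table.
    return dict(DECORATIONS.get(category, {}))


print(node_attributes({"operation_type": "shuffle"}))
print(node_attributes({}))
```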

@freyam
Contributor

freyam commented Jul 23, 2021

@martindurant
What would the default annotations be for all the layers? I can think of layer_id.

Meanwhile, I have come up with the visualization for the HLGs!

Design

Group the listed layer_types into 4 categories

IO (DataFrame)

  1. DataFrameIOLayer

Shuffle (DataFrame)

  1. SimpleShuffleLayer
  2. ShuffleLayer

Blockwise (Array mostly + DataFrame)

  1. Blockwise
  2. BlockwiseCreateArray

Materialized (Array + DataFrame)

  1. MaterializedLayer

The remaining layers can be left as they are:

  1. ArrayOverlapLayer
  2. DataFrameLayer
  3. BroadcastJoinLayer
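The grouping above can be written down as a plain lookup table. This is only a sketch: the category strings and the helper `group_by_category` are hypothetical, and the lookup is by exact class name on purpose, so that `DataFrameIOLayer` lands in IO even though it subclasses `Blockwise`.

```python
# Category per layer type, following the four groups proposed above.
# Layer types missing from the table are left uncategorised (None).
LAYER_CATEGORIES = {
    "DataFrameIOLayer": "io",
    "SimpleShuffleLayer": "shuffle",
    "ShuffleLayer": "shuffle",
    "Blockwise": "blockwise",
    "BlockwiseCreateArray": "blockwise",
    "MaterializedLayer": "materialized",
}


def group_by_category(layer_type_names):
    """Group layer type names by category, preserving input order."""
    groups = {}
    for name in layer_type_names:
        category = LAYER_CATEGORIES.get(name)  # None for the remaining layers
        groups.setdefault(category, []).append(name)
    return groups


groups = group_by_category(
    ["DataFrameIOLayer", "ShuffleLayer", "Blockwise", "ArrayOverlapLayer"]
)
print(groups)
```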

Visualization

Graphviz (Layer.dask.visualize())

We can show the different groups with different colors.
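One way to picture this without depending on the graphviz package is to build the DOT source by hand. Everything here is illustrative: the hex colours are placeholders (the real palette was still under discussion), and the layer names in the example are made up.

```python
# Placeholder colours per category; not the palette used in Dask.
CATEGORY_COLORS = {
    "io": "#add8e6",
    "shuffle": "#ffa500",
    "blockwise": "#90ee90",
    "materialized": "#d3d3d3",
}


def to_dot(layers, dependencies):
    """Build DOT source for a layer graph.

    layers: dict mapping layer name -> category (or None).
    dependencies: dict mapping layer name -> names of layers it depends on.
    """
    lines = ["digraph hlg {", "  node [shape=box];"]
    for name, category in layers.items():
        color = CATEGORY_COLORS.get(category)
        if color is None:
            # Uncategorised layers keep the default, unfilled box.
            lines.append(f'  "{name}";')
        else:
            lines.append(f'  "{name}" [style=filled, fillcolor="{color}"];')
    for name, deps in dependencies.items():
        for dep in deps:
            lines.append(f'  "{dep}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(
    {"read-parquet": "io", "assign": "blockwise", "overlap": None},
    {"assign": ["read-parquet"], "overlap": ["assign"]},
)
print(dot)
```

The resulting string can be rendered with the `dot` command-line tool if Graphviz is installed.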

EDIT: Updated Graph Representation
image

HTML Repr (Layer.dask)

Using the same colors used in the Graphviz, we change the tiny circles next to each layer to reflect its layer_type.
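A small sketch of what such a coloured marker could look like in generated HTML. The function name `layer_header_html`, the markup, and the default colour are assumptions for illustration and do not reflect Dask's actual HTML repr templates.

```python
from html import escape


def layer_header_html(name, color="#cccccc"):
    """Return an HTML snippet: a small filled circle followed by the layer name."""
    return (
        '<svg width="12" height="12" style="vertical-align: middle;">'
        f'<circle cx="6" cy="6" r="5" fill="{escape(color, quote=True)}" />'
        "</svg> "
        # Escape the name so characters like < and & render literally.
        f"<span>{escape(name)}</span>"
    )


snippet = layer_header_html("read-parquet", color="#add8e6")
print(snippet)
```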

EDIT: Updated HTML Representation
image

(I have also modified the MaterializedLayer's existing color to be a little lighter. I feel it looks nicer and is easier on the eyes.)
It might seem too colorful, but that's because I am showing all the different types of layers at the same time. This is highly unlikely to be observed in real Dask graphs.

/cc @jacobtomlinson I would love to hear your opinion as well on the colors. You have done an amazing job adding all the existing colors ✨

@martindurant
Member Author

> What would the default annotations be for all the layers

Empty? If we can't think of anything useful, it's better not to complicate the visuals.

@freyam
Contributor

freyam commented Jul 23, 2021

/cc @mrocklin What do you think about the use of colors?
(colors are open for discussion)

@GenevieveBuckley
Contributor

I hadn't thought of changing the colours in the dots next to the layer titles in the HTML representation, that's a nice touch.

Personally, I'm not sure we're gaining much by having multiple shades of green/blue/etc. for different types of layers in the same larger category. I think that adds more confusion.

We'll need some effort spent on:

  1. Explaining what information the user should gain from this (eg: shuffles are inefficient so you don't want to see too many of them, blockwise layers are likely to be nicely parallelized, etc.). You'll need to be able to explain this very clearly for all of them.
  2. How we are going to communicate that information to the user (a legend for graphviz? added information in the docs? something in the HTML repr?)

@freyam
Contributor

freyam commented Jul 26, 2021

> Personally, I'm not sure we're gaining much by having multiple shades of green/blue/etc. for different types of layers in the same larger category. I think that adds more confusion.

You are right! This looks even cleaner ✨

I have also added a Legend below so users have a good idea of what the colors mean.
image

and

image

Note: The colors used in Layer 2, 5, and 6 are supposed to be Green (Blockwise) as well, but I have used different ones to show all of the colors


> Explaining what information the user should gain from this (eg: shuffles are inefficient so you don't want to see too many of them, blockwise layers are likely to be nicely parallelized, etc.). You'll need to be able to explain this very clearly for all of them.

I'm unfamiliar with this section of the code. So I'm not sure which layers are more difficult for users and which are simple (quick?).

If I have more context about the different layers and what makes them slower than the rest, I can perhaps do something more with those layers' nodes.

For example, I can add a little node form note to all of the Shuffle and DataFrameIO layers. This could imply that this is an area worth investigating.

image

How does this sound?

@GenevieveBuckley
Contributor

> For example, I can add a little node form note to all of the Shuffle and DataFrameIO layers.

I'm sorry, what does this mean?

@martindurant
Member Author

martindurant commented Jul 27, 2021 via email

@freyam
Contributor

freyam commented Jul 27, 2021

@GenevieveBuckley Yes, Martin is correct.

So, in essence, the current node shape is "box." Another node shape is "note," which resembles a paper fold from the corner.

Normally, people fold a page to bookmark it so that it can be easily found later. In this case, the fold would indicate that this area warrants further investigation.
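In DOT terms, this amounts to switching the node's `shape` attribute between Graphviz's built-in `box` and `note` shapes. A minimal sketch, where the set of flagged categories and the helper name `node_statement` are assumptions:

```python
# Categories to flag with the folded-corner shape; this choice is an assumption.
FLAGGED_CATEGORIES = {"shuffle", "io"}


def node_statement(name, category):
    """Return a DOT node statement, using shape=note for flagged categories."""
    shape = "note" if category in FLAGGED_CATEGORIES else "box"
    return f'"{name}" [shape={shape}];'


print(node_statement("shuffle-join", "shuffle"))
print(node_statement("add", "blockwise"))
```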

@jacobtomlinson
Member

> /cc @jacobtomlinson I would love to hear your opinion as well on the colors. You have done an amazing job adding all the existing colors ✨

Thanks @freyam, but all the colour design was done by an external designer. See dask/community#135. I've added another comment to that issue to see if we can get the designer involved in choosing more colours.

@freyam
Contributor

freyam commented Jul 27, 2021

> I've added another comment to that issue to see if we can get the designer involved in choosing more colours.

That's amazing! 🚀

@martindurant
Member Author

> So, in essence, the current node shape is "box." Another node shape is "note," which resembles a paper fold from the corner.

I think that the difference is too subtle to see in your examples, and the meaning of the different shapes will not be obvious to users.

@freyam
Contributor

freyam commented Jul 28, 2021

Just wanna confirm:
Shuffle and DataFrameIO layers are the 2 specific layers you want to highlight differently to point at the potential source of inefficiency?

@freyam
Contributor

freyam commented Aug 2, 2021

I have opened a Draft PR where I will be working along with this discussion.
#7974
