[Seeking Feedback] Flux Resource Model Strawman #247

Closed
dongahn opened this issue Jun 18, 2017 · 41 comments

dongahn commented Jun 18, 2017

I implemented a strawman to help me further design the upcoming flux resource comms. module -- a new service that selects the best-matching resources for each job. At this point, it will be extremely helpful if I can get some feedback from advanced scheduling researchers to make sure I will have a future-scheduling-proof design. The strawman code you can play with is at https://github.com/dongahn/resource_module_strawman

Using your feedback, I would like to determine:

  • Which resource representations are well suited to the various advanced scheduling approaches you are working on? In other words, can our primitives model, traverse, and match on such representations?

  • Programmability -- can scheduler writers easily express their algorithms using our graph-match visitor patterns?

This is a strawman and so doesn't have a whole lot in it. So, do assume some of the other core technologies as you evaluate this. In particular, please assume you can use planner at any level in each subsystem
and keep aggregate information about the vertex’s subtree in a scalable fashion. This is described here and here.
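To illustrate the subtree-aggregate idea, here is a minimal sketch (hypothetical types and names, not planner's actual API): each vertex caches an aggregate over its subtree, and an allocation or release is propagated up the root path so matchers can prune whole branches cheaply.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: each vertex caches how many free nodes exist in its
// subtree so a matcher can prune a branch without descending into it.
struct Vertex {
    int parent;      // index of the parent vertex; -1 for the root
    int free_nodes;  // aggregate free-node count over this vertex's subtree
};

// Propagate an allocation (delta < 0) or a release (delta > 0) up the tree,
// keeping every ancestor's aggregate consistent in O(depth) time.
void update_aggregates(std::vector<Vertex> &g, int v, int delta) {
    for (int u = v; u != -1; u = g[u].parent)
        g[u].free_nodes += delta;
}
```

The point of the sketch is only that the update cost is proportional to tree depth, which is what makes the aggregates scalable to maintain.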

@SteVwonder: Please look at this from our IO-Aware scheduling algorithms angle.

  • Use PFS1 IO bandwidth subsystem as the dominant subsystem; and see if and how you would replicate your algorithm in our HPDC 16 paper. I configured a simple model as a starting point under
    --matcher=PFS1BA option (See below for the graph rendered for a mini scale system)

  • Use the containment subsystem as your dominant subsystem and PFS1 IO bandwidth subsystem as an auxiliary, and see if you can implement a similar IO Aware algorithm. I configured a simple model as a starting point under --matcher=C+PFS1BA option (See below for the graph rendered for mini scale)

  • In both cases, I suspect that you might need slightly different graph representations. Please let me know if you want to / need to have the graph model refined for your algorithms.

@tpatki: Please look at this from a Power-Aware scheduling algorithm angle

  • Use the power distribution subsystem as the dominant subsystem; and see if and how you would implement power-aware scheduling algorithms. I configured a simple model as a starting point under
    --matcher=PA option (See below for the graph rendered for mini scale)

  • Use the containment subsystem as the dominant subsystem and power distribution subsystem as an auxiliary subsystem; and see if and how you can implement effective algorithms. I configured a simple model as a starting point under --matcher=C+PA option (See below for the graph rendered for mini scale)

  • In both cases, I suspect you might need slightly different graph models. Please let me know which models are most suitable for your algorithms. I happened to talk with an IBM power-scheduling researcher at my CORAL face-to-face meeting, and he believes that future compute nodes may need to be grouped together based on their power efficiency levels. I do believe this can be easily done by refining the power hierarchy model. If you can suggest some models based on your expertise, that would be very helpful at this point.

Nikil (I don't have his github id): Please look at this from the standpoint of your IB-network-connection-aware scheduling algorithm:

  • Use the IB connection subsystem as the dominant subsystem; and see if and how you would implement pod-based scheduling algorithms. I configured a simple model as a starting point under the --matcher=IBA option (See below for the graph rendered for mini scale)

  • Use the containment subsystem as the dominant subsystem and the IB connection subsystem as an auxiliary; see if you can implement these algorithms as well. I configured a simple model as a starting point under --matcher=C+IBA option (See below for the graph rendered for mini scale)

  • In your case, I highly suspect we would want to refine the network subsystem graph model. This also makes a good test for me, because I want to see how much flexibility our new Flux resource model strawman gives a scheduler plugin writer like you to do exactly that.

@lipari and @morrone: any early feedback from you two would also be highly useful, of course!


dongahn commented Jun 18, 2017

For @SteVwonder:

[figure: mini-scale graph rendered with --matcher=PFS1BA]

[figure: mini-scale graph rendered with --matcher=C+PFS1BA]


dongahn commented Jun 18, 2017

For @tpatki:

[figure: mini-scale graph rendered with --matcher=PA]

[figure: mini-scale graph rendered with --matcher=C+PA]


dongahn commented Jun 18, 2017

For Nikil:

[figure: mini-scale graph rendered with --matcher=IBA]

[figure: mini-scale graph rendered with --matcher=C+IBA]


dongahn commented Jun 18, 2017

Oh, forgot to mention. The strawman currently doesn't build on TOSS2 because its compilers are so ancient. Please use a TOSS3 system or your laptop.


springme commented Jun 18, 2017 via email


tpatki commented Jun 18, 2017 via email


SteVwonder commented Jun 19, 2017

@dongahn nice work!

Poking around a bit, it seems that the C+PFS1BA works exactly as I expected: it performs a depth-first search through the dominant hierarchy (containment) and then when it hits the auxiliary hierarchy (IO) it performs an upwalk. The PFS1BA matcher isn't working as I expected. I expected it to perform a depth first search through the dominant hierarchy (IO) (e.g. pfsbw->lnetbw->coreswitchbw->leafswitchbw->nicbw->node), but it seems to only visit the root vertex:

→ ./resource --matcher=PFS1BA
[INFO] Load the matcher ...
----dom_discover_vtx: pfs1bw
----dom_finish_vtx: pfs1bw
*********************************************************
* Elapse time 0.000062
*   Start Time: 1497908308.232591
*   End Time: 1497908308.232653
*********************************************************

I am not sure what is going on here. Any thoughts?

In the meantime, I am working on a matcher that works for the C+PFS1BA case.

EDIT: does it maybe have to do with the directionality of the links between the resources? Maybe when IO is dominant, it should no longer be a "flowsup" relationship but a relationship with the reverse directionality.
EDIT2: nevermind, I think I figured it out. The call to add_subsystem should use flows_down not flows_up.


dongahn commented Jun 19, 2017

Ah. Thank you for looking at this so quickly. I will take a look shortly.


dongahn commented Jun 19, 2017

EDIT2: nevermind, I think I figured it out. The call to add_subsystem should use flows_down not flows_up.

Great.

Your schedulers can also choose all relationships for both cases with:

        if ( !rc && (rc = subsystem_exist (ctx, "PFS1BA")) == 0)
            matcher.add_subsystem ("pfs1bw", "*");

For the depth-first search on the dominant subsystem, I color the vertices so that the traversal stops whenever a back edge is detected.
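A minimal stand-in for this coloring scheme (illustrative only; the strawman uses BGL's visitor machinery rather than this hand-rolled DFS): a vertex is gray while it is on the DFS stack, so an edge into a gray vertex is a back edge and is skipped rather than followed.

```cpp
#include <cassert>
#include <vector>

enum class Color { white, gray, black };

// Simplified DFS over an adjacency list that prunes back edges: an edge into
// a gray (on-stack) vertex would close a cycle, so it is not traversed.
// Returns the number of vertices visited from v.
int dfs_count(const std::vector<std::vector<int>> &adj,
              std::vector<Color> &color, int v) {
    color[v] = Color::gray;   // v is now on the DFS stack
    int visited = 1;
    for (int u : adj[v]) {
        if (color[u] == Color::white)
            visited += dfs_count(adj, color, u);
        // gray (back edge) or black (finished): do not traverse further
    }
    color[v] = Color::black;  // v's subtree is fully explored
    return visited;
}
```

Even with the cycle 0 -> 1 -> 2 -> 0 present, every vertex is visited exactly once and the walk terminates.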


dongahn commented Jun 19, 2017

Also FYI, future work will include providing richer idioms for schedulers to use to select edge types within each target subsystem -- probably string pattern-matching functions for inclusion and exclusion.


dongahn commented Jun 20, 2017

@SteVwonder: I also created the coloring_for_upwalk dev branch in the strawman and pushed my changes. As I was adding support for GraphML, I found that I was actually adding parallel edges! So I fixed that and also added coloring support for up-walks -- with this, the DFU traverser can detect loops during the up-walk as well. This led to slight changes to the matcher callback names.


dongahn commented Jun 21, 2017

@nikhil-jain: So that this tracking ticket will make it into your email. @SteVwonder has been looking at this strawman and has been involved in our recent scheduling discussions. @SteVwonder should be a good resource for you.


dongahn commented Jun 23, 2017

As @SteVwonder, @morrone and I discuss and debate the capabilities and limitations of the resource strawman, the classes of matching problems that the current strawman+planner can and can't solve are starting to emerge. So, I thought I should document them here.

First, there are at least two main classes of advanced matching problems our current scheme can solve on a single dominant hierarchy beyond simple flat matching.

  • Hierarchical Exact matching, for example,

    1. If the scheduler chooses the containment hierarchy as dominant: e.g., cluster[1]->rack[2]->node[4]
    2. If the scheduler chooses the IB network connection hierarchy as dominant: e.g., edge_switch[2]->node[4]
  • Hierarchical Aware matching: for example, choose node resources in a way to maximize or minimize certain hierarchical quantities

    1. Choose node[4] in such a way that we minimize the number of edge switches used given the current resource state (using the IB network connection hierarchy as dominant along with planners embedded at various levels)
    2. Choose node[4] in such a way that we spread them across as many PDUs as possible given the current state (using the power hierarchy as the dominant hierarchy and its planners)

There are also some classes of multi-hierarchy matching problems that the DFU+planner can solve exactly. That is, when we arrange certain hierarchies such that one becomes dominant and the rest become an ordered set of auxiliaries.

  • Hierarchical Exact on dom hierarchy and Flow Exact matching on aux hierarchies: for example,

    1. When the per-node flow resource quantities are given along with an exact hierarchical request on the containment hierarchy, e.g., cluster[1]->rack[2]->node[4], where each of the 8 nodes requires at least 20MB/s or 100 Watts. The Flow Exact constraint can be checked during each up-walk.
  • Hierarchical Exact on dom hierarchy and Hierarchical Aware matching on aux: for example,

    1. Within the hard hierarchical constraint on the dominant hierarchy, you want to minimize or maximize certain quantities on one or more auxiliary hierarchies: "within the rack[3]->nodes[18] hierarchical constraint, we can select nodes in such a way that we use the edge switches that have the highest or lowest number of available nodes underneath them first."
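The Flow Exact check during each up-walk, described above, can be sketched roughly as follows (hypothetical types and names, not the strawman's API): starting from a matched node, walk up its auxiliary flow hierarchy and verify every level along the path can still supply the requested rate.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of a Flow Exact check on an up-walk. The auxiliary
// flow hierarchy (e.g., nic -> leaf switch -> core switch -> pfs) is stored
// as a parent-pointer tree with available bandwidth at each level.
struct FlowVertex {
    int parent;       // index of the parent flow vertex; -1 at the flow root
    double avail_bw;  // bandwidth currently available at this level (MB/s)
};

// Returns true only if every level from the leaf up to the root can
// satisfy the per-node request -- i.e., the Flow Exact constraint holds.
bool flow_exact_ok(const std::vector<FlowVertex> &flow, int leaf,
                   double request) {
    for (int v = leaf; v != -1; v = flow[v].parent)
        if (flow[v].avail_bw < request)
            return false;  // some level along the up-walk cannot supply it
    return true;
}
```

On a successful match, the same up-walk would also subtract `request` from each level's availability; the sketch only shows the feasibility test.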

Now, we also realized that there are certain classes of matching problems that DFU+planner alone cannot solve exactly. Most notably: Hierarchical Exact matching over multiple hierarchies: e.g., cluster[1]->rack[2]->node[18]<-edge_switch[2].

So, if I summarize this into a capability table of DFU+planner:

| Dom\Aux | None | Hierarchical Exact | Hierarchical Aware | Flow Exact |
|---|---|---|---|---|
| Hierarchical Exact | Yes | No | Yes | Yes |
| Hierarchical Aware | Yes | NA | Yes | Yes |
| Flow Exact | Yes | NA | Yes | Yes |

For the Multi Hierarchical Exact case, I have to believe this can be solved by introducing a somewhat more advanced traversal type (which I call Loop-enabled DFU) and a variant of planner (which I call an up-aggregator). The idea is to perform DFU on some of the subtrees "twice" to find the exact matches: the first iteration generates some high-level information on the aux hierarchies, and the second finds the exact match. E.g., at the rack level, the first DFU deposits the switch info of the current loop iteration at the root of the IB connection aux hierarchy, and in the second iteration we use that as high-level switch locality info for the actual selection.

The loop traversal concept would be similar to: https://github.com/tinkerpop/gremlin/wiki/Loop-Pattern.

Of course, it is less likely we will do anything more than simple DFU in our short-term directions, but I thought it would be good to flesh out the details of what can and can't be solved.


dongahn commented Jun 23, 2017

In case Depth First on the dominant hierarchy and Up on auxiliary hierarchies (DFU) and the envisioned match scoring schemes are not clear enough:

[figure: DFU traversal and match scoring scheme]

Also, for planner:

[figure: planner]


lipari commented Jun 23, 2017

Thank you @dongahn for documenting your discussions. It provides a concise synopsis of the frontiers of the resource selection thinking. At this point, I don't have much to add. I know you are a proponent for a one-pass traversal solution to maximize performance. So I will review the loop traversal doc you provided to understand what it has to offer.


dongahn commented Jun 23, 2017

Thanks @lipari.

I know you are a proponent for a one-pass traversal solution to maximize performance.

What we realized is that for Multi Hierarchical Exact matching problems, there is no viable solution even with the two-pass algorithms we proposed in the past, like:

  • find all hierarchically matching nodes in the containment hierarchy: e.g., rack[2]->node[4]
  • find all hierarchically matching nodes in the IB network connection hierarchy: e.g., edge_switch[2]->node[4]
  • find all intersected nodes. (We could not come up with a good solution at this step for this matching problem.)

It feels like, if you want to solve this problem exactly without some high-level info aggregation techniques, this may become a computationally intractable problem (feels like knapsack / linear programming), although I haven't thought too deeply about this.

You can review the Loop traversal in Gremlin, but it may be a bit misleading, because what you really want is custom semantics combining the backtrack and loop patterns.

E.g., if you want rack[2]->node[4] with each rack's 4 nodes under exactly edge_switch[2]: when you perform DFU at each rack level for the first iteration, you aggregate edge_switch info at the IB network connection hierarchy root (that is, during each up-walk). Then, once the first iteration is done, you perform the second iteration of DFU at that rack vertex level. In terms of traversal, this is exactly the same walk, but the matcher will do things differently this time: it will use the up-aggregated information to find the exact match.

For example, at the end of the first DFU iteration, you may have

edge_switch1: 3
edge_switch2: 2
edge_switch3: 1

Then, in the second iteration, you can choose those nodes by round-robining edge_switch1 and edge_switch2 (or a variant according to your matching policy). Once the second DFU iteration at the rack vertex is done, you clear the up-aggregator to prepare for the looped DFU at the next rack level.
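A sketch of that second-pass selection, using the example counts above (illustrative only; `select_round_robin` and its policy are hypothetical, not the strawman's code):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical second-pass selector: given the up-aggregated count of
// available nodes under each edge switch (produced by the first DFU
// iteration), round-robin across the switches until nnodes are chosen.
// Each pick names the switch under which one node is taken.
std::vector<std::string> select_round_robin(
    std::map<std::string, int> avail, int nnodes) {
    std::vector<std::string> picks;
    while ((int)picks.size() < nnodes) {
        bool progressed = false;
        for (auto &kv : avail) {
            if (kv.second > 0 && (int)picks.size() < nnodes) {
                picks.push_back(kv.first);  // take one node under this switch
                kv.second--;
                progressed = true;
            }
        }
        if (!progressed)
            break;  // not enough available nodes anywhere
    }
    return picks;
}
```

With the aggregates edge_switch1: 3, edge_switch2: 2, edge_switch3: 1 and a 4-node request, this variant spreads the picks across all three switches before returning to edge_switch1; a policy that prefers the fullest switches first would sort the candidates instead.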

A beauty of the current division between traverser and matcher is that we can separate the walking concern from the matching concern, so you can introduce increasingly complex traversal types without affecting other concerns.

In any case, I think there are enough patterns here that we should be able to generalize this and incorporate it into our scheduling infrastructure.


dongahn commented Jun 23, 2017

The more I think about this, the more I'm unsure whether Hierarchical Exact matching across multiple hierarchies is worth the extra complexity. The other thing we can do is look at this problem from the opposite angle: give a scheduler a way to use a "virtual hierarchy" and limit Hierarchical Exact matching to a single hierarchy.

For example, nothing prevents me from introducing a virtual hierarchy that combines the containment and IB network connection hierarchies in a way that makes sense, as part of the resource specification. The composed virtual hierarchy can look like:

cluster->rack->edge_switch->nodes->sockets->cores

If the scheduler needs to express Hierarchical Exact matching of this kind, it just subscribes to this virtual hierarchy as its dominant hierarchy and does matching on it. This will have an impact on the job spec, though, as the spec must be valid with respect to this virtual hierarchy: instead of rack[2]->node[18]<-edge_switch[1], rack[2]->edge_switch[1]->node[18]

Maybe I should play with the strawman to see what it takes to create a virtual hierarchy and what effect it has.


morrone commented Jun 23, 2017

So does this mean we should perhaps eliminate the edges section from the jobspec for now? Is there any reasonable use case we can think of now that we are making this limitation?


dongahn commented Jun 23, 2017

Good question. I think making virtual hierarchies is not really possible or desirable in some cases, so I have to believe we will need edges for general cases. Lacking compelling use cases, maybe

rack[2]->edge_switch[1]->node[5]<-compiler_license[5]

Or at the time of power work where we want to schedule jobs only through a specific named pdu,

rack[2]->edge_switch[1]->node[5]<-pdu5

although you can argue that, in the latter case, the scheduler can use the power hierarchy as its dominant.

In my mind, some of the bigger benefits of having edges would be that this gets us a step closer to use it as our resource specification language.

Given all the complexity in the Flux resource model, I have to think a big challenge in front of us is how to ensure a job spec is valid without even doing expensive graph matching.

I thought that if we use the same language for both the resources section of the job spec and the resource specification, a good level of job spec validation could be done against the schema used to generate resources, like using XML Schema Definition (XSD) or DTD.


morrone commented Jun 23, 2017

It's not clear that a compiler license would need that sort of mapping. We already have a "forest of trees", so you could simply have this in the resources section:

rack[2]->edge_switch[1]->node[5]
compiler_license[5]

And the switch/pdu example you gave looks pretty much exactly what we just decided we aren't going to allow. :)

In my mind, some of the bigger benefits of having edges would be that this gets us a step closer to use it as our resource specification language.

I don't think that is a benefit at all. The jobspec is not the resource spec. We can put additional things in the resource spec that don't appear in the resources section of the jobspec.

Given all the complexity in the Flux resource model, I have to think a big challenge in front of us is how to ensure a job spec is valid without even doing expensive graph matching.

Right, so if the jobspec doesn't allow completely arbitrary graphs at this stage, it only makes our lives easier.


lipari commented Jun 23, 2017

Regarding terms, I prefer using the term "resource request" to refer to the resources the job specification is requesting. And I agree with @morrone that the simpler we design the resource request specification, the easier will be our lives and the lives of our users. Namely, it would be nice if by default our users don't have to construct their resource request in graph terms. As an example, a user could ask for 5 nodes, each capable of receiving at least 50 watts.


dongahn commented Jun 23, 2017

And the switch/pdu example you gave looks pretty much exactly what we just decided we aren't going to allow. :)

Well, there is a difference between abstract hierarchical matching and concrete hierarchical matching, where concrete resource names are given. Abstract matching is computationally difficult, but concrete matching is tractable (this is the same class as the Flow Exact problem, I think). Notice pdu5 isn't a quantity but a name. But this wasn't a compelling use case anyway, so I hear you :)

Right, so if the jobspec doesn't allow completely arbitrary graphs at this stage, it only makes our lives easier.

Actually, I think you are right. Others, any near- to mid-term use cases with this simplification?

Wait. How about things like shared node-local resource in relation to cores: GPU, burst buffer?


morrone commented Jun 24, 2017

Wait. How about things like shared node-local resource in relation to cores: GPU, burst buffer?

Can't that be represented by the existing "with" method?


dongahn commented Jun 24, 2017

I haven't looked at the with construct closely enough. If it can express a tree, that'd be fine.


morrone commented Jun 24, 2017

Yes, we can express a forest of trees without the addition of the "edges" directive.


dongahn commented Jun 26, 2017

OK. It seems a virtual hierarchy can be easily added -- I pushed my changes to the resource strawman in the virtual_hier branch. This hierarchy is just another set of edges that span multiple real hierarchies -- essentially, a new annotation added to the existing edges. As long as the resource specification language is flexible enough to allow overlaying a virtual hierarchy like this, we should be able to approximate multi-hierarchical exact matching with single-hierarchical matching.

[figure: mini-scale graph rendered with --matcher=VA]

[figure: mini-scale graph rendered with --matcher=V+PFS1BA]

@SteVwonder

Right, so if the jobspec doesn't allow completely arbitrary graphs at this stage, it only makes our lives easier.

Others, any near- to mid-term use cases with this simplification?

I agree that we shouldn't immediately support arbitrary graphs in the job spec. Once we have some more concrete use cases and a prototype scheduler/matcher, then I think we can issue a new version of the job spec. Related question: how are we planning on versioning the job spec? Currently it seems to just be an int. Do we want to semantically version the spec?

This hierarchy is just another set of edges that span multiple real hierarchies. Essentially, adding a new annotation to the existing edges.

@dongahn, very cool! My only question: what is the 'vhierarchy' resource in the figures that you included? Is that necessary to make the virtual hierarchy work, or can it be omitted if the user desires?


lipari commented Jun 26, 2017

@dongahn, is that edgeswitch0 intermediary between rack0 and node0 vestigial? I don't think it is necessary with edgeswitchbw0 now included in the bandwidth portion of the virtual hierarchy.

@SteVwonder

@lipari: I believe it is required if the scheduler/user wants to do Hierarchical Exact matching w.r.t. the edge switches (i.e., cluster[1]->rack[2]->node[18]<-edge_switch[2], which in this scenario I believe would be written as cluster[1]->rack[2]->edge_switch[2]->node[18]) in addition to BW-aware scheduling. Its existence in the dominant virtual hierarchy allows for the exact matching, and its existence in the auxiliary hierarchy allows for the BW constraint checking/allocation.


lipari commented Jun 26, 2017

@SteVwonder, while I can see the performance advantage of including the switch in the composite hierarchy, I wonder what we would do if a switch served all the nodes in racks 1 and 2 and only half of the nodes in rack 3.


dongahn commented Jun 26, 2017

@dongahn, very cool! My only question: what is the 'vhierarchy' resource in the figures that you included? Is that necessary to make the virtual hierarchy work, or can it be omitted if the user desires?

You shouldn't have to. You should be able to use an existing vertex as the root of a new virtual hierarchy. This of course will heavily depend on the expressiveness of our future resource specification language, but I don't think this is a difficult idiom to support at all.


dongahn commented Jun 26, 2017

@SteVwonder, while I can see the performance advantage of including the switch in the composite hierarchy, I wonder what we would do if a switch served all the nodes in racks 1 and 2 and only half of the nodes in rack 3.

In this case, you may just use the real IB network hierarchy for hierarchical exact matching, or augment the infrastructure the way I was talking about.

The virtual hierarchy support is proposed so that we can get the most mileage out of the current DFU+planner scheduling infrastructure. Nothing we do now will limit more complex hierarchical exact matching for long-term future.


lipari commented Jun 26, 2017

Nothing we do now will limit more complex hierarchical exact matching for long-term future.

I like that answer.


dongahn commented Jul 13, 2017

@SteVwonder: I am going to add support for specifying the resource graph generator in GraphML (right now it's C++ code) to get some more mileage out of my strawman. I saw you forked and reviewed the strawman. Any reason to believe the proposed DFU+planner scheme won't work to implement your IO BW-aware policy? Any programmability comments?

@SteVwonder

I know we discussed this in today's meeting, but I wanted to document the discussion here.
I have a working traversal, but I want to verify it on resources that are partially allocated. Besides that verification, I believe the current API should work.


dongahn commented Jul 17, 2017

@SteVwonder: great! Once this GraphML-based resource generator work is done, I will work on 2 more items: introducing self-sorting accumulators and integrating @morrone's job spec prototype. At that point, we should be able to allocate and deallocate, which should ease your verification effort, I think.


dongahn commented Jul 21, 2017

Ok. I pushed the GraphML-based resource generator work into the strawman, and updated the master with that branch as well. I plan to document how to model your resources using this resource generator so that folks can more easily create their own models and walk the resources.


dongahn commented Jul 22, 2017

FYI -- I used BGL's generic depth_first_search itself to walk the generator recipe and emit and store the actual resource vertices and edges into the resource graph database.

The way this setup algorithm works, it checks whether there is an existing edge between two vertices (when the generation edge type is associative). The only mechanism I found that allows for this is to iterate over all edges of a vertex and see if one of them has the same target. My performance profiling shows that this is where most of the time is going! I suspect edge-iteration performance may ultimately become the bottleneck for many operations. Definitely something to pay attention to.
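To illustrate the hotspot, here is a generic sketch (not the strawman's BGL code) contrasting the linear out-edge scan with a hash-set lookup over (src, dst) pairs, which is one way such edge-existence queries could be made O(1) on average:

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Models the observed hotspot: answering "does edge (src, dst) exist?" by
// scanning all out-edges of src costs O(out-degree) per query.
bool linear_edge_exists(const std::vector<std::vector<int>> &adj,
                        int src, int dst) {
    for (int t : adj[src])
        if (t == dst)
            return true;
    return false;
}

// Alternative: maintain a hash set of (src, dst) pairs alongside the graph,
// answering the same query in expected O(1). Assumes non-negative ids.
struct EdgeSet {
    std::unordered_set<long long> edges;
    static long long key(int src, int dst) {
        return ((long long)src << 32) | (unsigned)dst;  // pack the pair
    }
    void add(int src, int dst) { edges.insert(key(src, dst)); }
    bool exists(int src, int dst) const {
        return edges.count(key(src, dst)) != 0;
    }
};
```

The trade-off is the memory and bookkeeping of the side index; for a boost::adjacency_list graph, the vertex-list/edge-list selectors also affect the cost of this query, so the right fix depends on the chosen graph representation.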

[figure: profiler output showing time spent in edge iteration]


dongahn commented Jul 22, 2017

OK. I drafted a section into README.md on Generating Resources Using GraphML (GRUG) and pushed my changes into master.


lipari commented Jul 24, 2017

Looks great, @dongahn! I hope you don't mind, but I will insert comments as I review instead of bundling them all up into one composite comment.


dongahn commented Sep 11, 2017

I believe I have received all the feedback I needed for my resource strawman. Now that I'm moving to grow this strawman code further into a resource-query utility, I'm closing this one.
