-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Seeking Feedback] Flux Resource Model Strawman #247
Comments
For @SteVwonder: --matcher=PFS1BA option--matcher=C+PFS1BA |
For @tpatki: --matcher=PA--matcher=C+PA |
Oh, forgot to mension. The strawman currently doesn't build on TOSS2 because their compilers are so ancient. Please use a TOSS3 system or your laptop. |
Thanks for working on flux on the week end, Dong. I trust that you will all spend tomorrow with your kids if you have them. Happy Father's Day to all of the dads!
Becky
…Sent from my iPhone
On Jun 17, 2017, at 6:22 PM, Dong H. Ahn <notifications@github.com<mailto:notifications@github.com>> wrote:
For @SteVwonder<https://github.com/stevwonder>:
[pfs1ba mini]<https://user-images.githubusercontent.com/862123/27257387-77d821dc-5389-11e7-931a-ccd844353fd7.png>
--matcher=PFS1BA option
[c pfs1ba mini]<https://user-images.githubusercontent.com/862123/27257392-9390e95e-5389-11e7-9758-e05dc19b2f5f.png>
--matcher=C+PFS1BA
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub<#247 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AEdU0EYB0nmLn9ZFc39fWth6LPBIKHD9ks5sFHuxgaJpZM4N9X8v>.
|
Thanks, Dong, I’ll take a look at this shortly. Thanks for putting this together over the weekend!
Sincerely,
T
From: Becky Springmeyer <notifications@github.com>
Reply-To: flux-framework/flux-sched <reply@reply.github.com>
Date: Saturday, June 17, 2017 at 8:22 PM
To: flux-framework/flux-sched <flux-sched@noreply.github.com>
Cc: "Patki, Tapasya" <patki1@llnl.gov>, Mention <mention@noreply.github.com>
Subject: Re: [flux-framework/flux-sched] [Seeking Feedback] Flux Resource Model Strawman (#247)
Thanks for working on flux on the week end, Dong. I trust that you will all spend tomorrow with your kids if you have them. Happy Father's Day to all of the dads!
Becky
…Sent from my iPhone
On Jun 17, 2017, at 6:22 PM, Dong H. Ahn <notifications@github.com<mailto:notifications@github.com>> wrote:
For @SteVwonder<https://github.com/stevwonder>:
[pfs1ba mini]<https://user-images.githubusercontent.com/862123/27257387-77d821dc-5389-11e7-931a-ccd844353fd7.png>
--matcher=PFS1BA option
[c pfs1ba mini]<https://user-images.githubusercontent.com/862123/27257392-9390e95e-5389-11e7-9758-e05dc19b2f5f.png>
--matcher=C+PFS1BA
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub<#247 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AEdU0EYB0nmLn9ZFc39fWth6LPBIKHD9ks5sFHuxgaJpZM4N9X8v>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#247 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABhgTlir3jydTZL7478ayMulqxVyA9a6ks5sFJgEgaJpZM4N9X8v>.
|
@dongahn nice work! Poking around a bit, it seems that the
I am not sure what is going on here. Any thoughts? In the meantime, I am working on a matcher that works for the EDIT: does it maybe have to do with the directionality of the links between the resources? Maybe when IO is dominant, it should no longer be a "flowsup" relationship but a relationship with the reverse directionality. |
Ah. Thank you for looking at this so quickly. I will take a look shortly. |
Great. Your schedulers can also choose all relationships for both cases with: if ( !rc && (rc = subsystem_exist (ctx, "PFS1BA")) == 0)
matcher.add_subsystem ("pfs1bw", "*"); For the depth first search on the dominant subsystem, I color the vertex so that I won't traverse further whenever detecting a back edge. |
Also FYI, future work will include providing richer idioms for scheduler to use to select edge types within each target subsystem. Probably, string pattern matching fucntions for inclusion and exclusion. |
@SteVwonder: I also created |
@nikhil-jain: So that this this tracking ticket will make into your email. @SteVwonder has been looking at this strawman as well as been involved in our recent scheduling discussions. @SteVwonder should be a good resource for you. |
As @SteVwonder, @morrone and I are discussing and debating the capabilities and limitations of the resource strawman, it seems the classes of matching problems that the current strawman+planner can and can’t solve start to emerge. So, I thought I should document them here. First, there are at least two main classes of advanced matching problems our current scheme can solve on a single dominant hierarchy beyond simple flat matching.
There are also some classes of multi-hierarchy matching problems that the DFU+planner can solve exactly. That is, when we arrange certain hierarchies such that one becomes dominant and the rest become an ordered set of auxiliary.
Now, we also realized that there are certain classes of matching problems that DFU+planner alone cannot solve exactly. Most notably: Hierarchical Exact matching over multiple hierarchies: e.g., So, If I summarize this into a capability table of DFU+planner:
For the Multi Hierarchical Exact column, I have to believe that this can be solved by introducing a bit more advanced traversal type (which I call Loop-enabled DFU) and a variant of planner (which I call up-aggregator). The idea is to perform DFU on some of the subtrees "twice" to find the exact matches, the first iteration to generate some high-level information on aux hierarchies and the second to find the exact match. E.g., at a rack level, the first DFU deposits the swich info of the current loop iteration to the root of the IB connection aux hierarchy and the second iteration, we use that as high level switch locality info for the actual selection. The loop traversal concept would be similar to: https://github.com/tinkerpop/gremlin/wiki/Loop-Pattern. Of course, it will be less likely we will do anything for more than simple DFU as our short-term directions but I thought it would be good to flesh out details on what can and can't. |
Thank you @dongahn for documenting your discussions. It provides a concise synopsis of the frontiers of the resource selection thinking. At this point, I don't have much to add. I know you are a proponent for a one-pass traversal solution to maximize performance. So I will review the loop traversal doc you provided to understand what it has to offer. |
Thanks @lipari.
What we realized is that for Multi Hierarchical Exact matching problems, there is no viable solution even with two-pass algorithms we proposed in the past like:
It feels like, if you want to solve this problem exactly without some high-level info aggregation techniques, this may become a computational intractable problem (feels like knapsack / linear LP) although I haven't thought too deeply about this. You can review the Loop traversal in Gremlin but this will be a bit misleading. Because what you really want is a custom semantics combining the back trek and loop patterns. E.g., if you want For example, at the end of the first DFU iteration, you may have
Then, in the second iteration, you can choose those nodes by round-robining edge_switch1 and 2. (or a variant according to your matching policy). Once the second DFU iteration at the rack vertex is done. You clear up the up_aggregator to prepare for the Looped DFU at the next rack level. A beauty of the current division between traverser and matcher is that we can separate the walking concern and matching concern. So you can introduce increasingly complex traversal types without affecting other concerns. In any case, I think there is enough of patterns, I think we should be able to generalize this and incorporate this into our scheduling infrastructure. |
The more I think about this, the more I'm unsure if Hierarchical Exact matching across multiple hierarchies worth extra complexity. The other thing we can do is to look at this problem at the opposite angle: Giving a scheduler a way to use a "virtual hierarchy" and limit the Hierarchical Exact matching to be single. For example, nothing prevents me from introducing a virtual hierarchy that combines containment and ib-network connection hierarchies in a way that makes sense, as part of resource specification. The composed virtual hierarchy can look like:
If the scheduler needs to express Hierarchical Exact matching for this kind , it just subscribes to this virtual hierarchy as its dominant hierarchy and do matching on this. This will have impact to the job spec though as the spec should be valid with respect to this virtual hierarchy. Instead of Maybe I should play w/ strawman to see what it takes to create a virtual hierarchy and the effect of it. |
So does this mean we should perhaps eliminate the edges section from the jobspec for now? Is there any reasonable use case we can think of now that we are making this limitation? |
Good question. I think making virtual hierarchies for some cases are not really possible or desirable so I have to believe we will need edges for general cases. Lack of compelling use cases, maybe
Or at the time of power work where we want to schedule jobs only through a specific named pdu,
although you can argue, in the latter case, these scheduler can use power hierarchy as its dominant. In my mind, some of the bigger benefits of having edges would be that this gets us a step closer to use it as our resource specification language. Given all the complexity in Flux resource model, I have to think a big challenge in front us is how to ensure a job spec is valid without even doing expensive graph matching. I thought, if we use the same language for both the resource section of the job spec and resource specification, a good level of validation of a job spec can be done against the schema used to generate resources, like using XML Schema Definition (XSD) or DTD. |
Its not clear that a compiler license would need that sort of mapping. We already have a "forest of trees", so you could simply have this in the resources section:
And the switch/pdu example you gave looks pretty much exactly what we just decided we aren't going to allow. :)
I don't think that is a benefit at all. The jobspec is not the resource spec. We can put additional things in the resource spec that don't appear in the resources section of the jobspec.
Right, so if the jobspec doesn't allow completely arbitrary graphs at this stage, it only makes our lives easier. |
Regarding terms, I prefer using the term "resource request" to refer to the resources the job specification is requesting. And I agree with @morrone that the simpler we design the resource request specification, the easier will be our lives and the lives of our users. Namely, it would be nice if by default our users don't have to construct their resource request in graph terms. As an example, a user could ask for 5 nodes, each capable of receiving at least 50 watts. |
Well, there is a difference between abstract hierarchical matching vs. concrete hierarchical matching where concrete resource names are given. Abstract matching is computationally difficult, but concrete matching is tractable (this is the same class as Flow Exact problem, I think). Notice pdu5 isn't a quantity but a name. But this wasn't a compelling usage case anyway so I hear you :)
Actually, I think you are right. Others, any near- to mid-term use cases with this simplification? Wait. How about things like shared node-local resource in relation to cores: GPU, burst buffer? |
Can't that be represented by the existing "with" method? |
I haven't looked at the with construct close enough. If it can express a tree, that'd be fine. |
Yes, we can express a forest of trees without the addition of the "edges" directive. |
OK. It seems a virtual hierarchy can be easily added -- pushed my changes to resource strawman in the --matcher=VA--matcher=V+PFS1BA |
I agree that we shouldn't immediately support arbitrary graphs in the job spec. Once we have some more concrete use-cases and a prototype scheduler/matcher, then I think we issue a new version of the job spec. Related quesiton: how are we planning on versioning the job spec? Currently it seems to just be an int. Do we want to semantically version the spec?
@dongahn, very cool! My only question: what is the 'vhierarchy' resource in the figures that you included? Is that necessary to make the virtual hierarchy work, or can it be omitted if the user desires? |
@dongahn, is that edgeswitch0 intermediary between rack0 and node0 vestigial? I don't think it is necessary with edgeswitchbw0 now included in the bandwidth portion of the virtual hierarchy. |
@lipari: I believe it is required if the scheduler/user wants to do Hierarchical Exact matching w.r.t the edge switches (i.e. cluster[1]->rack[2]->node[18]<-edge_switch[2] which in this scenario, i believe would be written as cluster[1]->rack[2]->edge_switch[2]->node[18]) in addition to BW-aware scheduling. It's existence in the dominant virtual hierarchy allows for the exact matching, and it's existence in the auxiliary hierarchy allows for the BW constraint checking/allocation. |
@SteVwonder, while I can see the performance advantage of including the switch in the composite hierarchy, I wonder what we would do if a switch served all the nodes in racks 1 and 2 and only half of the nodes in rack 3. |
You shouldn't have to. You should be able to use an existing vertex as a root of a new virtual hierarchy. This of course will heavily depend on the expressibility of our future resource specification language. But I don't think this is a difficult idiom to support at all. |
In this case, you may just use real ib network hierarchy for hierarchical exact matching or augment the infrastructure that I was talking about. The virtual hierarchy support is proposed so that we can get the most mileage out of the current DFU+planner scheduling infrastructure. Nothing we do now will limit more complex hierarchical exact matching for long-term future. |
I like that answer. |
@SteVwonder: I am going to add support to allow for specifying the resource graph generator in graphml (now it's C++ code) to get some more mileage out of my strawman. I saw you forked and reviewed the strawman. Any reason to believe the proposed DFU+planner scheme won't work to implement your IO BW aware policy? Programmability comment? |
I know we discussed in today's meeting, but I wanted to document the discussion here. |
@SteVwonder: great! Once this graphml-based resource generator work is done, I will work on 2 more items: introducing self-sorting accumulators and integrating @morrone's job spec prototype. At that point, we should be able to allocate and deallocate, which can make ease your verification effort, I think. |
Ok. I pushed the GraphML-based resource generator work into the strawman, and updated the master with that branch as well. I plan to document how to model your resources using this resource generator so that folks can more easily create their own models and walk the resources. |
FYI -- I used the BGL's generic depth_first_search itself to walk the generator recipe and emit and store the actual resource vertices and edges into the resource graph database. The way this setup algorithm works, I checks whether there is an existing edge between two vertices (when the generation edge type is associative). The only mechanism I found, which allows for this is to iterate over all edges of a vertex and see if one of the edge has the same target. My performance profiling shows that this is where the most of time is going! I suspect ultimately edge iteration performance may become the bottleneck for many operations. Definitely, something to pay attention to. |
OK. I drafted a section into README.md on Generating Resources Using GraphML (GRUG) and pushed my changes into master. |
Looks great, @dongahn! I hope don't mind but I will insert comments as I review them instead of bundling them all up for one composite comment. |
I believe I have received all the feedback I needed for my resource strawman. Now I'm moving to grow this strawman code further into |
I implemented a strawman to help me further design the upcoming flux resource comms. module -- a new service that selects the best-matching resources for each job. At this point, it will be extremely helpful if I can get some feedback from advanced scheduling researchers to make sure I will have a future-scheduling-proof design. The strawman code you can play with is at https://github.com/dongahn/resource_module_strawman
Using your feedback, I would like to determine:
What are the resource representations suited well for various advanced scheduling you are working on? Answer, if we can use our primitives to model, traverse, and match on such representations.
Programmability -- can scheduler writers easily express their algorithms using our graph match visitor patterns.
This is a strawman and so doesn't have a whole lot in it. So, do assume some of other core technologies as you evaluate this. In particular, please assume you can use planner on any level in each subsystem
and keep aggregate information about the vertex’s subtree in a scalable fashion. This is descried in here and here
@SteVwonder: Please look at this from our IO-Aware scheduling algorithms angle.
Use PFS1 IO bandwidth subsystem as the dominant subsystem; and see if and how you would replicate your algorithm in our HPDC 16 paper. I configured a simple model as a starting point under
--matcher=PFS1BA option
(See below for the graph rendered for a mini scale system)Use the containment subsystem as your dominant subsystem and PFS1 IO bandwidth subsystem as an auxiliary, and see if you can implement a similar IO Aware algorithm. I configured a simple model as a starting point under
--matcher=C+PFS1BA option
(See below for the graph rendered for mini scale)In both cases, I suspect that you might need a bit different graph representations. Please let me know if you want to / need to have the graph model refined for your algorithms
@tpatki: Please look at this from a Power-Aware scheduling algorithm angle
Use the power distribution subsystem as the dominant subsystem; and see if and how you would implement power-aware scheduling algorithms. I configured a simple model as a starting point under
--matcher=PA option
(See below for the graph rendered for mini scale)Use the containment subsystem as the dominant subsystem and power distribution subsystem as an auxiliary subsystem; and see if and how you can implement effective algorithms. I configured a simple model as a starting point under
--matcher=C+PA option
(See below for the graph rendered for mini scale)In both cases, I suspect you might need a bit different graph models. Please let me know what are the models that are most suitable for your algorithms. I happened to talk with an IBM power-scheduling researcher at my CORAL face to face meeting, and he believes that future compute nodes may need to be grouped together based on their power efficiency levels. I do believe this can be easily done by refining power hierarchy model. If you can suggest some models based on your expertise, this would be very helpful at this point.
Nikil (I don't have his github id): Please look at this from the standpoint of your IB-network-connection-aware scheduling algorithm:
Use the IB connection subsystem as the dominant subsystem; and see if an dhow you would implement pod-based scheduling algorithms. I configured a simple model as a starting point under
--matcher=IBA option
(See below for the graph rendered for mini scale)Use the containment subsystem as the dominant subsystem and the IB connection subsystem as an auxiliary; see if you can implement these algorithms as well. I configured a simple model as a starting point under
--matcher=C+IBA option
(See below for the graph rendered for mini scale)In your case, I highly suspect we would want to refine the network subsystem graph model. This opens up a good test for me because I want to see how flexible our new flux model strawman implementation allows a scheduler plugin writer like you can do this.
@lipari and @morrone: any early feedback from you two should also be high useful, of course!
The text was updated successfully, but these errors were encountered: