Need specification for "resource set", R #109
Nice start, @grondo. Just wanted to add some distinctions... The R will contain node-resident resources that will become Rlocal as well as ancillary resources like switches, racks, licenses, bandwidth and power. For resources like a rack or a switch, the scheduler may schedule jobs that are distributed to specific racks or switches. However, one could imagine that the allocated rack or switch in this case remains in the scheduler's domain and does not get included in the R that is passed to the system instance (as embodied by a collection of brokers). On the other hand, for resources for which we will have controls (e.g., license managers), these resources would become part of R but would only be relevant to specific agents like license managers, and not the brokers (unless we have a broker devoted to controlling a license manager). Power and network bandwidth could fall into either of these two cases depending on whether a throttle was available to limit power or bandwidth serving the allocated resources. |
Thanks, @lipari! You bring up some good points.
These generic "global" job resources like licenses, burst buffer storage space, bandwidth, etc. will have to be passed in to some sort of containment management, in case there is some action required to give access to the licenses, or reserve space, etc. My thought is that these would be included in Rlocal, and then the container plugin specific to that resource type would be able to decide how to contain or make available that specific resource. For example, a simple approach would be to have the plugins on the first node of a job operate on these resources. What remains to be decided is how these resources do get included in Rn for each IMP n. We may have to put some tag on these kinds of global resources so they are automatically included in any Rn. I would also argue that Rlocal for any IMP should include not only the resource vertex(es) on which the IMP will be run, but also all the parents of the vertex up to the root (in the hierarchical resource tree). This will give the IMP containment plugins a bit more information about where they are running in the global hierarchy, which could be useful, and also allows us to keep the global resources discussed above in their proper place in any hierarchy. E.g. instead of a simple Rlocal like |
I'm worried that these new parts of the plan are undermining the original need/goal for Rlocal. If the Rlocal contains resource types that the IMP doesn't directly control, then we are back to the original situation we had with a single large R. The IMP now needs to parse out and identify only its locally relevant resource types from a larger tree of information. Rather than going that route, I wonder if it wouldn't make more sense to just go back to having a single complete R document that is sent everywhere including the IMP (although this time around we are deciding that the complete R no longer needs to be signed). In other words, if the IMP needs to parse out potentially extraneous information from Rlocal and find the point where the information begins to align with its local resources, then it could do that just as easily from a large R. What is the value then of Rlocal? |
Rlocal allows the instance that is starting the IMP to control the shape of the container under which the IMP will execute the job shell, instead of relying on the IMP to make that decision, when it doesn't have or need the necessary data to make the correct decision about what goes in the local "container". I guess where you might differ in opinion is whether the parents of a resource are part of that conceptual container. I tend to think a container that is just "cpu0" doesn't make any sense, you need … Another benefit of Rlocal is potentially eliminating dependence on a flux-sched or flux-core provided resource query language that might be required to perform the intersection between local resources and global R (though even if you had this support for the IMP, I'm not convinced the IMP alone could make the right decision here). To realize this particular goal, Rlocal will need to be simple enough that the IMP or its plugins could parse it easily themselves. |
A specific case where Rlocal might be required is if an instance, for testing or other good reason, would like to start more than one IMP per broker. To do this, the instance would break up local resources into multiple Rlocal and pass to each IMP. I don't see how it would be possible if the instance passed the global R to each IMP. BTW, I was mainly taking a long-term view on inclusion of parent resources in Rlocal, and as long as it is possible to add that support in at a future date, I'm ok with leaving it out for now. I think off-node resources like burst-buffer space and licenses could be handled by including enough metadata in those resources included in Rlocal such that a plugin could know exactly which licenses it was operating on, or which burst buffers it was reserving space in... etc. |
Ah. I thought we agreed that we weren't doing that, that instead the IMP always controls all resources on the node and leaves further resource masking to the jobshell. But if we are reversing that I suppose that is fine.
If we think that the IMP is going to need to know about global resources, then again I think I'm back to thinking we should just send the full R. We can always add annotations to that if we ever want to run multiple IMPs per node. Adding global resources to Rlocal makes the name a bit of a misnomer. :) |
The IMP reading a global R does not necessarily imply that the IMP must use a "resource query language". The IMP just needs a parser. The parser for R and Rlocal, if Rlocal is allowed to contain global resources, will be very nearly identical I think. We can choose to implement the parser twice or cut-and-paste it into the IMP if we want to keep it separate. |
Yeah, I completely understand your sentiment. I'm fine with leaving Rlocal with only "local" resources (whatever "local" may mean), but then we have no proposed method to handle off-node resources (since IMP can only run at most within a node). |
I thought we had talked about that. I think that the "execution management" module, or whatever we are calling it now, would handle off-compute-node resource setup before launching remote execution and the IMPs. There would be plugins into that module that can instantiate the various resources that people come up with. |
The IMP will need to parse R, but then how does it complete the intersection between local available resources and R? It would need to generate an R' from hwloc or some other local HW query code, then take the intersection of R' and R. If only Rlocal is sent to the IMP, it doesn't need to read the local HW configuration, it doesn't need to generate a second R' from that information, and it doesn't need to do the work of the intersection. So that feels like quite a bit of code saved from a security significant piece of software. |
That could work but the instance doesn't have any privilege except through the IMP. Is it a requirement that all off-node resources don't require privilege to access (this is possible, I just didn't think of it that way before)? |
I think it is actually a lot simpler than that if there is just a single IMP per node. It just walks the tree of data in R, and looks up each resource it sees in its internal table "oh! that belongs to me, I'll note that", "nope that doesn't belong to me, skip it". There is no complicated intersection needed, really. |
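A minimal Python sketch of that walk may help; note that the nested-dict layout of R and the field names "name"/"children" here are assumptions for illustration only, not an agreed format:

```python
# Hypothetical sketch: an IMP filtering a global R tree down to its local
# resources. The R layout (dicts with "type"/"name"/"children") and all
# function names are assumptions for illustration only.

def collect_local(vertex, local_names, claimed=None):
    """Walk R depth-first; when a vertex we own is found, claim it and
    everything beneath it in the tree."""
    if claimed is None:
        claimed = []
    if vertex.get("name") in local_names:
        claim_subtree(vertex, claimed)      # "oh! that belongs to me"
    else:
        for child in vertex.get("children", []):   # "skip it", keep walking
            collect_local(child, local_names, claimed)
    return claimed

def claim_subtree(vertex, claimed):
    claimed.append(vertex["name"])
    for child in vertex.get("children", []):
        claim_subtree(child, claimed)

R = {"type": "cluster", "name": "c0", "children": [
        {"type": "node", "name": "node7", "children": [
            {"type": "core", "name": "core0", "children": []}]},
        {"type": "node", "name": "node8", "children": []}]}

print(collect_local(R, {"node7"}))   # ['node7', 'core0']
```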
Ok, I guess I couldn't visualize how to make it quite that simple. Whereas with Rlocal, the IMP would walk each type of resource for which it has a containment plugin and hand the list of those resources in Rlocal to the plugin (or alternately each plugin could generate the list itself). No comparisons needed at all. Since there won't be containment plugins for "node", "switch", "datacenter", and other resources, those would be safely ignored if they were there at all. |
It would be preferable when possible. But it can be handled on a case-by-case basis. Doing this through the IMP could potentially introduce a fair bit of complexity. We might be back to needing a way to track authority back through multiple levels of flux instances. We were able to avoid that when the IMP was constrained to dealing with resources inside of its local node's container. |
Yeah, actually it is slightly more complicated than I stated, but not much. In each case where it finds a resource it owns, it needs to remember that AND all of the resources under it in the tree. But that is still pretty straightforward I think. I don't think the IMP can only look at types even for an Rlocal. It needs to look at either names or counts too. Flux allows two approaches to something like a "socket". We can either represent all of the sockets on a node as a single resource vertex and use the count within that vertex to represent all of the sockets, or we can have a resource vertex with a name/id/uuid/whatever for _each_ of the sockets. In the latter case, with individual resource vertices, the scheduler picks the exact resources, and the IMP just needs to carry out the instructions. In the former, with counted resources, the IMP needs to be more aware of what is happening with allocations on the node (for instance, if nodes are shared). But actually, I'm not sure that the IMP can read the scheduler's mind enough to always make the same selection pattern... and that could lead to resources being shared (sockets/cores) that the scheduler could have avoided. So actually, for things like sockets and cores I suspect that we will always use separate resource vertices.
I think it is reasonable for the IMP to know what node it is on. And once it knows that, it will be fairly easy to pick out its own resources from the global R. |
I'm not sure that is required, but either way you need to verify ownership of the resource whether it is through an IMP plugin or a plugin in the execution system. One example I can think of is if you restricted access to a license server through some sort of iptables rules. For jobs granted access to the license server you would have to allow iptables rules to be modified in the network namespace of the job. I can envision how this could be done with an IMP plugin, but not at all what kind of system you'd need if you tried to do it through a plugin in the unprivileged execution modules. Besides, the containers can't exist until the IMP runs so there is an ordering problem there. For this case would you consider licenses a "local" resource, or perhaps rename them network access tokens or something? (If that is the case then I could see that this scheme could work) |
Ok, but why? I guess it is immaterial if we pass R or Rlocal since the IMP always filters R to Rlocal anyway (it will just be a noop in the second case). The question I keep struggling with is why you'd want to do that work in your privileged process if you don't have to?
Yes, that is what I meant. Each containment plugin will need to know the list and count (especially for RAM) of each of the resource types it knows how to deal with. |
Here's my proposal for a simplified Rlocal to satisfy near-term milestones, if it is acceptable that R and Rlocal have different specifications. The simplified Rlocal as input to the IMP will be a JSON document with a list of resource types for which the IMP should create a container. The IMP will support a list of plugins which operate on one or more of each type, and access the Rlocal directly to determine the parameters of the containers they can create. E.g. a memory and cpu "cgroups" container would read the "sockets", "cpus", and "memory" fields of the Rlocal dictionary and add cpus and mems to a cpuset cgroup, and constrain memory with a memory cgroup. The format of Rlocal might look (very roughly) something like:

```json
{
  "cpu": { "list": [0, 1, 2, 3], "count": 4 },
  "socket": { "list": [0], "count": 1 },
  "memory": { "count": 1024, "units": "MB" }
}
```

This is just off the cuff so there may be missing fields, but it is meant to give a general idea. If the IMP must take the full R as input, then I'd suggest a plugin to the IMP, provided by the instance, would generate this format as input to the IMP containment plugin infrastructure. That would further require that the IMP operate in privilege separation mode so that the plugin operating on R runs with the permissions of the instance owner. This would avoid copy-and-paste parsing code between flux-framework projects, and allow a single, system-installed version of flux-security to work with multiple versions of other Flux projects which may generate R with different formats or capabilities. |
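As an illustration only, here is a rough Python sketch of how containment plugins might consume such a simplified Rlocal; the plugin names and dispatch are assumptions, and returning strings stands in for real cgroup operations:

```python
# Hypothetical sketch of IMP containment plugins reading the simplified
# Rlocal proposed above. Everything here is illustrative; real plugins
# would act on cgroups rather than return strings.

Rlocal = {
    "cpu":    {"list": [0, 1, 2, 3], "count": 4},
    "socket": {"list": [0], "count": 1},
    "memory": {"count": 1024, "units": "MB"},
}

def cpuset_plugin(rlocal):
    # would add these ids to a cpuset cgroup's cpus file
    cpus = rlocal.get("cpu", {}).get("list", [])
    return "cpus=" + ",".join(str(c) for c in cpus)

def memory_plugin(rlocal):
    # would configure a memcg limit from count + units
    mem = rlocal.get("memory", {})
    return "limit={}{}".format(mem.get("count"), mem.get("units"))

# Each plugin touches only the resource types it knows how to contain;
# unknown types ("switch", "rack", ...) are simply never inspected.
for plugin in (cpuset_plugin, memory_plugin):
    print(plugin(Rlocal))   # cpus=0,1,2,3  then  limit=1024MB
```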
Me too! Which is why adding non-local things to Rlocal that need to be skipped seems like exactly the same kind of processing in a privileged process that we were trying to avoid by having Rlocal in the first place. I don't see much difference between skipping 3 things or skipping 1000. The parsing code needs to be rock solid in either case, so the implementation work seems the same. But if they are the same, then why put the extra effort into implementing Rlocal in the first place?
That is interesting...is that how it usually works? I was thinking more that some number of floating licenses would be allocated to an entire job, and they could choose to use them where they like. But I guess it depends on the sophistication of the particular license server. |
I actually don't know, but it was a real proposal from somewhere (but not sure if it was ever implemented).
You do make good points. Either way, I don't think we're closing the door on including either more or less resources in the R in future implementations, so I'm thinking we can move on for now... |
The devil is in the details there. At some point the IMP needs to ingest resource data in some form from an untrusted source. It could either be by parsing a resource document itself, or by parsing the output of something that parses the resource document. But at some point it always needs to be defensive and validate its input. I would need to know more to evaluate it, I think. I'm not really clear why parsing R seems more scary than parsing Rlocal, or why it would necessarily need to be handled separately through plugins and/or privilege separation. I think R could be handled by an internal parser in the IMP exactly the same way that Rlocal is being proposed to be handled. |
Ok, again I understand your point. I guess I'm arguing that Rlocal as used internally by the IMP is a different, much simpler, format than R (sorry, maybe it shouldn't be called Rlocal anymore?). The amount of code being used would therefore be less, and therefore, provably fewer bugs. The Rlocal format could evolve much more slowly than R, though I admit it hasn't been proven that R will change at any kind of pace that would require frequent updates to the flux-security project, so perhaps a weaker argument here. Also it just kind of seems to make sense to send less data to the IMP, even though this doesn't have a security argument. For a job with 1000 cores on 1000 nodes, R is potentially 1000x the size of Rlocal.... |
Totally agree. |
To summarize, I think consensus here is that Rlocal should contain only node-local resources, but that it is still useful to send only a subset of R, perhaps in a simpler format, as input to the IMP. As far as the topic of this issue, which is the specification of R, I don't think that changes much. We still need to be able to generate some Rlocal from R, and therefore we'll need some kind of format of R that allows this within an instance. |
IMHO this approach is sound. As I already discussed w/ @grondo, extracting Rlocal should be fully distributed so that a centralized component doesn't become a scalability bottleneck. As for many execution service modules needing to fetch R to extract Rlocal, I believe this should be scalable, as this would essentially have the performance complexity of a broadcast... At some point, we may want to measure this though... Sorry if this was too obvious. |
I would like to have a bit more discussion on the main topic of this issue: R format. As @grondo nicely captured in the beginning, R will serve as the input and output of a range of components. For example, it will be the input to the …
It seems an important decision we can all benefit from at this point would be whether we want to spec out a common R format or go with an opaque approach with just common abstractions on it agreed upon. W/o looking at this too closely, if we go with a graph format with an optional ability to annotate extra information on resource vertex and edge (i.e., the concepts described in RFC4), we should be able to describe the format captured in all of the above use cases, and I can contribute to that effort based on my current … But a part of me is also asking whether this kind of rigor on the format is necessary at this point. Another approach can be, for each of the format variants, to require a library to expose a set of common operations including "reader" and "writer." The former would be a bit more rigorous but at the same time it can be a bit more time consuming. But maybe something that we ought to do anyway. There is also the third possibility, which is to start with the opaque approach above; as we reach agreements on the common abstractions, we will know the requirements on the format better. And at that point, we can formalize the format. It seems we will have to write libraries that expose those abstractions around the R anyway, so this also is not a bad idea IMHO. Thoughts? |
Thanks @dongahn. Some very good thoughts above. Another element we need to keep in mind is how various components of flux would manage dependencies of interpreting and managing R. Ideally perhaps, the R format would be supported directly by flux-core, so that the execution system, which depends on it, can be tested stand-alone. However, this approach might lead to a lowest-common-denominator format, which may not support the needs of advanced resource services and/or schedulers. An argument might also be made that the R format is solely the domain of a resource service, and the R interpreter and generator should therefore be offered by that service, though that would leave the execution service dependent on resource services being installed, which might not be what we want. Another approach would be to have the R spec opaque as you said, with each type supplying a corresponding API that satisfies the requirements of all use cases outside of resource service internals. Somehow the required implementation would be encoded in R itself and the correct implementation loaded at runtime. One more idea would be to have a very basic R specification, but allow a section for "extensions" which might be ignored by most components, but used for any extra information needed by the resource service itself. The base R spec might not even need an API if it was simple enough, thereby removing the pain of deciding where the dependent libraries might live. |
Of the four possibilities, the last two seem most attractive from my perspective. A bit on the 4th option:
This is a very interesting idea from my perspective @grondo. I consider the resource representation needed by … Why don't I put a few examples of the graph representations I plan to use by … |
sounds good @dongahn! Thanks! |
Yup! I was thinking along the same line. |
```yaml
defaults:
  edge:
    attrs: { subsystem: containment, out: contains, in: in }
resource:
  - type: rack
    count: 1
    id: 0
    edges:
      - out:
          - type: node
            name: node7
            id: 7
            count: 1
            edges:
              - out:
                  - type: socket
                    count: 1
                    edges:
                      - out:
                          - type: core
                            name: core0
                            count: 1
                            id: 0
                          - type: memory
                            name: memory0
                            count: 4
                            unit: GB
                      # how to annotate different out-edge type
                      - out:
                          type: foo
                          name: foo1
                          count: 1
```

I like this direction. But I am not clear what is the best way to annotate an out edge when it has a different attribute than the default. Now I remember I used the singular |
```yaml
edges:
  - out:
      ? attrs: {}
      vtx: $resource_vertex
```

We can also do it this way at the expense of being verbose... of course. |
Note that, at least early on in here, that was meant to be dealt with by range expansion on names and/or IDs such that you could have something like:

```yaml
type: node
name: n[1-50]
count: 50
- type: core
  ...
```

I'm not sure we still want to do that, but it's an option. Otherwise, for machine-generated R, it could just be explicitly laid out with counts of only 1. |
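For illustration, a small Python sketch of that kind of range expansion (handling only a single `[lo-hi]` bracket; real hostlist syntax is richer, with comma lists and zero-padding):

```python
# Minimal sketch of expanding a name like "n[1-50]" into individual
# names. Only a single [lo-hi] range is handled here.
import re

def expand(name):
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", name)
    if not m:
        return [name]                      # no range: name is literal
    prefix, suffix = m.group(1), m.group(4)
    lo, hi = int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}{suffix}" for i in range(lo, hi + 1)]

print(expand("n[1-50]")[:3])   # ['n1', 'n2', 'n3']
print(len(expand("n[1-50]")))  # 50
```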
If I'm following correctly we have an … However, there may be multiple types of … Maybe new edge types should be specified in an outer dictionary, along with each type's attributes? This is probably better and less verbose than using properties:

```yaml
edge:
  with: { "attrs": { "subsystem": "containment", "out": "contains", "in": "in" }}
  foo: { "attrs": ... }  # set attributes for edge type "foo"
```

Or something similar. Would this work? The drawback is that properties can't be overridden within the spec, like if we had separate … |
In another thread we had other ways to specify edges; this is a version I had for a relatively dense way to hand-write edges, for example:

```yaml
type: node
<power:
  - type: pdu
>with: core
<with: rack
```

Where the prefix characters on the key represented either in or out. That's rather... opaque I would say, but it's OK as shorthand. If the goal is to do multiple edge types under a single edges key, what we were talking about in 2015 has an example here with description, copied below, called links back then:

```yaml
type: Core
count: 1
tasks: 1  # defaults to one, meaning one of these rspecs per task, to get one total, use *
sharing: exclusive
contains: []
links:
  - type: uses
    direction: out
    target: 17
```

We discussed this concept and expressing it at length in that issue. |
Thanks for commenting @trws! I freely admit I've completely lost context on what we've discussed before. |
Happy to, sorry it took so long actually, OpenMP F2F meeting this week. I like the idea of doing "edges: type : ..." by the way, that's a nice way to handle multiple types without all the extra verbosity of having a "type: ..." key on every one. |
Throwing one crazy idea out there: it seems a generic graph format "standard" for JSON is also emerging (like graphml being a small subset of xml to specify a graph), and one possibility is to latch onto that format instead of reinventing our own... http://jsongraphformat.info |
I actually used one of those as the serialization format for the test implementation of the original python version of jobspec. As long as the graph format supports directed multigraphs it would be perfectly reasonable to use it. Asking users to write it is another issue, but using it for passing graphs around would be fine, and could even store “canonicalized” jobspec for that matter.
On 29 Jan 2019, at 13:38, Dong H. Ahn wrote:
Also https://github.com/jsongraph/json-graph-specification
|
> I actually used one of those as the serialization format for the test implementation of the original python version of jobspec. As long as the graph format supports directed multigraphs it would be perfectly reasonable to use it.
Oh. Good. I am looking at this, and it seems to have all the constructs I need. I will play with it some more then.
> Asking users to write it is another issue
Right. This isn’t so human friendly. The current resource section of jobspec would be far better for that purpose. I don’t know if our requirements include R to be generated directly by human users.
> but using it for passing graphs around would be fine, and could even store “canonicalized” jobspec for that matter.
Yes, for the future when the jobspec supports a full graph, this could be useful as well.
|
OK. It looks like JSON Graph Format (JGF) gives me everything that I need to emit the resource graph. See below an example in JGF to encode the graph in #109 (comment) Pros:
Cons:
```json
{
  "graph": {
    "nodes": [
      { "id": "0", "metadata": { "type": "rack", "name": "rack0", "id": 0 } },
      { "id": "1", "metadata": { "type": "node", "name": "node7", "id": 7 } },
      { "id": "2", "metadata": { "type": "socket", "name": "socket0", "id": 0 } },
      { "id": "3", "metadata": { "type": "core", "name": "core0", "id": 0 } },
      { "id": "4", "metadata": { "type": "memory", "name": "memory0", "count": 2, "unit": 1073741824, "id": 0 } },
      { "id": "5", "metadata": { "type": "foo", "name": "foo0", "id": 0 } }
    ],
    "edges": [
      { "source": "0", "target": "1", "metadata": { "subsystem": "containment", "relationship": "contains" } },
      { "source": "1", "target": "2", "metadata": { "subsystem": "containment", "relationship": "contains" } },
      { "source": "2", "target": "3", "metadata": { "subsystem": "containment", "relationship": "contains" } },
      { "source": "2", "target": "4", "metadata": { "subsystem": "containment", "relationship": "contains" } },
      { "source": "2", "target": "5", "metadata": { "subsystem": "foo", "relationship": "bars" } }
    ]
  }
}
```
|
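As a quick sanity check that nothing beyond a stock JSON parser is needed, here is a Python sketch that walks the containment subsystem of a JGF document shaped like the one above (field names follow that example; this is not a claim about a final R format):

```python
# Sketch: consuming a JGF-style document with only the json module and
# extracting the containment hierarchy. Field names follow the example
# above and are not a format commitment.
import json

def containment_children(doc):
    """Map node id -> ids it contains, using only containment edges."""
    children = {}
    for e in doc["graph"]["edges"]:
        if e["metadata"].get("subsystem") == "containment":
            children.setdefault(e["source"], []).append(e["target"])
    return children

doc = json.loads("""
{ "graph": {
    "nodes": [ {"id": "0", "metadata": {"type": "rack"}},
               {"id": "1", "metadata": {"type": "node"}},
               {"id": "2", "metadata": {"type": "socket"}} ],
    "edges": [ {"source": "0", "target": "1",
                "metadata": {"subsystem": "containment"}},
               {"source": "1", "target": "2",
                "metadata": {"subsystem": "containment"}} ] } }
""")

print(containment_children(doc))   # {'0': ['1'], '1': ['2']}
```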
From the current spec, it wasn't clear if JGF has support for adding default node/edge properties. But since this is JSON, we can always add those properties as extra data (building on @grondo's idea at #109 (comment)):

```json
{
  "properties": {
    "edges": [
      { "id": "default", "subsystem": "containment", "relationship": "contains" },
      { "id": "foo", "subsystem": "foo", "relationship": "bars" }
    ]
  },
  "graph": {
    "nodes": [
      { "id": "0", "metadata": { "type": "rack", "name": "rack0", "id": 0 } },
      { "id": "1", "metadata": { "type": "node", "name": "node7", "id": 7 } },
      { "id": "2", "metadata": { "type": "socket", "name": "socket0", "id": 0 } },
      { "id": "3", "metadata": { "type": "core", "name": "core0", "id": 0 } },
      { "id": "4", "metadata": { "type": "memory", "name": "memory0", "count": 2, "unit": 1073741824, "id": 0 } },
      { "id": "5", "metadata": { "type": "foo", "name": "foo0", "id": 0 } }
    ],
    "edges": [
      { "source": "0", "target": "1" },
      { "source": "1", "target": "2" },
      { "source": "2", "target": "3" },
      { "source": "2", "target": "4" },
      { "source": "2", "target": "5", "metadata": { "property": "foo" } }
    ]
  }
}
```
|
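One way the two sections could be tied together is sketched below in Python; the rule that missing edge metadata means the "default" property set is an assumption, not something specified anywhere:

```python
# Sketch of resolving edge attributes against a "properties" section like
# the one above: edges with no metadata get the "default" property set;
# an edge tagged {"property": "foo"} gets the "foo" set. This dispatch
# rule is assumed for illustration only.

properties = {
    "default": {"subsystem": "containment", "relationship": "contains"},
    "foo":     {"subsystem": "foo", "relationship": "bars"},
}

def resolve(edge):
    key = edge.get("metadata", {}).get("property", "default")
    return dict(edge, metadata=properties[key])

edges = [{"source": "1", "target": "2"},
         {"source": "2", "target": "5", "metadata": {"property": "foo"}}]

for e in edges:
    print(resolve(e)["metadata"]["relationship"])
# contains
# bars
```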
Also I found that JGF is much more condensed and legible than GraphML (#109 (comment)), though! |
👍 From their website, it looks like they are also using json-schema for validation. So +1 for no added dependencies to read and another +1 for no added dependencies to validate.
I agree with you that I don't expect users to have to write R, but they will most likely have to read it. I imagine myself frequently dumping R from the KVS to see what resources my job ran on. That being said, the examples you posted are quite legible IMO.
Just a thought, but this format probably compresses really well. Lots of repeated tags and patterns. It wouldn't help human-readability, but if we pass the compressed version around in the messages, it can potentially help performance. We could also store it compressed in the KVS, but if we do that, we'll want to make it simple for users to decompress and view R from the KVS.
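As a rough illustration of that point (the document here is synthetic and numbers will vary with content, so no particular ratio is claimed):

```python
# Sketch: JGF-style R repeats keys and patterns heavily, so zlib should
# shrink it substantially. Purely illustrative.
import json, zlib

def core(i):
    return {"id": str(i), "metadata": {"type": "core",
                                       "name": f"core{i}", "id": i}}

doc = json.dumps({"graph": {"nodes": [core(i) for i in range(1000)],
                            "edges": []}}).encode()
compressed = zlib.compress(doc, 9)

print(len(doc), len(compressed))   # compressed is far smaller
```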
👍 👍 I agree. Much nicer to look at than GraphML. |
Can we propose that the base … Really, what is required from flux-core is the basic containment hierarchy, and an ability to map … Not that I am fully opposed to specifying the format of R as JGF, but at this point it seems like overkill for flux-core components. But I'd hate to add extra complexity elsewhere for only a modicum of simplicity in flux-core. |
I think a two section approach has several advantages. For example, this way, we don't have to include rank and slot info in the "graph" section needed for the nested schedulers. One disadvantage is, though, R will be on the order of twice as big with the two section approach. That may be okay... Generally, the two section approach would be a bit faster in exchange for more needed space. One alternative approach would be to develop a converter layer that converts the graph into the containment as what you require for the execution service. With either approach, a question remains what should be the exact format for the … |
How about: #109 (comment) or similar under the … For our current sprint, … |
BTW, does the execution system need info on the higher level resources? Like rack or cluster? Or just node and down? |
@grondo and @SteVwonder: I can start to draft an RFC on a two section proposal as a way to push forward this dialogue further, if you like. |
That would be a great start! Thanks
|
I realize I'm coming to this a bit late, but I'd put in my 2c for keeping it one piece. The hardware topology should always be walkable by just limiting the graph representation to that kind of edge, and if it's in two formats in two parts at least some of the components will have to work with both. The job shell has to take R emitted from sched for example, so either sched needs to work with both formats or there has to be one that works for both. Perhaps it would be good to come up with an API or similar for what the core side wants to talk to, that could understand whatever the format is underneath and provide the appropriate information rather than expecting it to walk the resource spec directly? |
Thank you for your thoughts. I think what @grondo wants is to put what's required by the execution service in one section in an easy-to-use format and the full information pertaining to the scheduler in the second section. This isn't too difficult to do by the scheduler. I hope that we don't find a situation where any component needs to read both sections. Clearly this isn't optimal in terms of storage and R producer performance. But it has a consumer performance advantage (the execution system only needs to read one section while the nested scheduler instance doesn't need to read data like "rank" and "slots") and lower complexity in the execution system software. The API approach (or converter approach as I suggested above) would be another excellent way to overcome this issue. But what @grondo seems to be concerned about is the complexity of designing an API at this point, as it will have to be a graph code. |
Fair enough. I'm all for using the format that makes sense, just pointing out that just because it's stored as a graph (or as a tree) doesn't mean we have to access it that way in terms of the API. The resource spec format is a graph jammed into a tree format, so is JGF, so there are other options if for some reason it turns out to be complicated to get a simple serialization format.
|
Completely agreed! Good thoughts @trws. |
Thank you for the good discussions. We will probably want to refer back to this when we evolve R. But for now PR #155 resolved the ticket. |
This issue is being opened to start a discussion on the use cases, API, and/or specification for the R as in RFC 15. R is the serialized version of any resource set, and is presumably produced by the serializer described in RFC4, consumed by the resource service in an instance as configuration, and used by the IMP and job shell to determine shape of containment and local resource slots.
In essence, the R format will be the way composite resource and resource configuration information will be transmitted to and from instances of Flux.
Ideally, the purpose of this issue is to determine the format of R such that a new RFC could be drafted.
To get the discussion started, here are some high level requirements and use cases for R:

- R should act as resource configuration input to an instance, therefore it may be that configuration of even the system instance is written in R spec, or the configuration language (RDL?) generates R. (In fact, one use case might be to directly generate R from hwloc data.)
- The execution service in an instance needs to be able to generate Rlocal from R for each rank. So given a rank or even a generic "resource vertex", there should be a function to generate an Rn from R, where Rn is a hierarchical subset of R.
- The containment plugins in the IMP will need to query Rlocal for the list of local resources of a given type or types on which the containment plugins operate. For instance, a memory plugin will need to determine the amount and location of RAM contained in Rlocal in order to set up memcg limits. Similarly, a Socket/CPU plugin would need to iterate over or query the list of local sockets/cores in Rlocal to add these to the cgroup.
- The job shell will use jobspec+R to determine the local 'task slots' that map to commands in the 'tasks' section.
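As a sketch of the Rn-from-R use case above, assuming purely for illustration that node vertices carry a "rank" field in a nested-dict R (neither the field names nor the layout are specified anywhere):

```python
# Hypothetical sketch of deriving R_n for one rank from a global R.
# The "rank"/"children" field names and tree layout are assumptions.

def extract_rank(vertex, rank):
    """Return the subtree rooted at the node vertex with the given rank,
    or None if this branch does not contain it."""
    if vertex.get("type") == "node" and vertex.get("rank") == rank:
        return vertex
    for child in vertex.get("children", []):
        found = extract_rank(child, rank)
        if found is not None:
            return found
    return None

R = {"type": "cluster", "children": [
        {"type": "node", "rank": 0, "children": [{"type": "core", "id": 0}]},
        {"type": "node", "rank": 1, "children": [{"type": "core", "id": 0}]}]}

print(extract_rank(R, 1))
# {'type': 'node', 'rank': 1, 'children': [{'type': 'core', 'id': 0}]}
```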
Dependency management here might get challenging. The IMP is a user of Rn, but we want to ideally eliminate dependencies in the flux-security project on other flux-framework projects. Possible approaches here might include: