Need specification for "resource set", R #109

Closed
grondo opened this issue Oct 3, 2017 · 95 comments

@grondo (Contributor) commented Oct 3, 2017

This issue is being opened to start a discussion on the use cases, API, and/or specification for R, as described in RFC 15. R is the serialized version of any resource set; it is presumably produced by the serializer described in RFC 4, consumed by the resource service in an instance as configuration, and used by the IMP and job shell to determine the shape of containment and local resource slots.

In essence, the R format will be the way composite resource and resource configuration information will be transmitted to and from instances of Flux.

Ideally, the purpose of this issue is to determine the format of R such that a new RFC could be drafted.

To get the discussion started, here are some high level requirements and use cases for R:

  • R should act as resource configuration input to an instance,
    therefore it may be that configuration of even the system instance
    is written in R spec, or the configuration language (RDL?) generates R.
    (in fact, one use case might be to directly generate R from hwloc data)

  • Execution service in an instance needs to be able to generate
    Rlocal from R for each rank. So given a rank or even generic
    "resource vertex", there should be a function to generate an
    Rn from R, where Rn is a hierarchical subset of R.

  • The containment plugins in the IMP will need to query Rlocal
    for the list of local resources of given type or types on which
    the containment plugins operate. For instance, a memory plugin
    will need to determine the amount and location of RAM contained
    in Rlocal in order to set up memcg limits. Similarly a Socket/CPU
    plugin would need to iterate over or query the list of local
    sockets/cores in Rlocal to add these to the cgroup.

  • The job shell will use jobspec+R to determine the local
    'task slots' that map to commands in the 'tasks' section.
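
To make the second requirement concrete, here is a minimal sketch of deriving an Rn from R for a given rank. Every field name here ("rank", "children", etc.) is hypothetical, since the actual format of R is exactly what this issue is trying to decide:

```python
# Hypothetical sketch: derive a rank-local resource subset (Rn) from a
# global R. All key names are invented for illustration; the real
# format is what this issue aims to specify.

def r_local(r, rank):
    """Return the subtree of R assigned to `rank`, or None if this
    vertex (and everything below it) belongs to another rank."""
    if r.get("rank") not in (None, rank):
        return None  # vertex pinned to a different rank; prune it
    children = [c for c in (r_local(ch, rank) for ch in r.get("children", []))
                if c is not None]
    local = dict(r, children=children)
    # keep a vertex only if it, or something under it, is on this rank
    if r.get("rank") == rank or children:
        return local
    return None

R = {"type": "cluster", "children": [
        {"type": "node", "id": 0, "rank": 0,
         "children": [{"type": "core", "id": 0, "rank": 0, "children": []}]},
        {"type": "node", "id": 1, "rank": 1, "children": []}]}

rn = r_local(R, 0)  # hierarchical subset of R for rank 0
```

The point of the sketch is only that Rn remains a hierarchical subset of R, so the same reader could consume either document.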

Dependency management here might get challenging. The IMP is a user of Rn, but ideally we want to eliminate dependencies in the flux-security project on other flux-framework projects. Possible approaches here might include:

  1. The IMP could take a subset of the R specification, simple enough to parse with its own parser, and offer some simplified interface to IMP plugins that do containment. Containment plugins probably only really need to get a list of local resources from R, as long as the logical IDs match the actual system logical IDs. The IMP's interface to R could later be expanded to offer higher-level functionality for advanced containment (though I don't have any use cases that I can think of here).
  2. Alternately, the IMP itself could treat R as opaque data, passed to plugins. The plugins would then have a dependency on some library from a system-installed flux core or sched project.
@lipari (Contributor) commented Oct 4, 2017

Nice start, @grondo. Just wanted to add some distinctions... The R will contain node-resident resources that will become Rlocal as well as ancillary resources like switches, racks, licenses, bandwidth and power. For resources like a rack or a switch, the scheduler may schedule jobs that are distributed to specific racks or switches. However, one could imagine that the allocated rack or switch in this case remains in the scheduler's domain and does not get included in the R that is passed to the system instance (as embodied by a collection of brokers).

On the other hand, for resources for which we will have controls (e.g., license managers), these resources would become part of R but would only be relevant to specific agents like license managers, and not the brokers (unless we have a broker devoted to controlling a license manager).

Power and network bandwidth could fall into either of these two cases depending on whether a throttle was available to limit power or bandwidth serving the allocated resources.

@grondo (Contributor, Author) commented Oct 4, 2017

Thanks, @lipari! You bring up some good points.

On the other hand, for resources for which we will have controls (e.g., license managers), these resources would become part of R but would only be relevant to specific agents like license managers, and not the brokers (unless we have a broker devoted to controlling a license manager).

These generic "global" job resources like licenses, burst buffer storage space, bandwidth, etc., will have to be passed in to some sort of containment management, in case there is some action required to give access to the licenses, reserve space, etc. My thought is that these would be included in Rlocal, and then the container plugin specific to that resource type would be able to decide how to contain or make available that specific resource. For example, a simple approach would be to have the plugins on the first node of a job operate on these resources.

What remains to be decided is how these resources get included in Rn for each IMP n. We may have to put some tag on these kinds of global resources so they are automatically included in any Rn.

I would also argue that Rlocal for any IMP should include not only the resource vertex(es) on which the IMP will be run, but also all the parents of the vertex up to the root (in the hierarchical resource tree). This will give the IMP containment plugins a bit more information about where they are running in the global hierarchy, which could be useful, and also allows us to keep global resources discussed above in their proper place in any hierarchy.

E.g. instead of a simple Rlocal like socket[0]->core[0-3] for an IMP managing a single socket, you might instead have llnl->cluster[5](name=hype)->node[113](name=hype113)->socket[0]->core[0-3]
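
Rendered as a nested document, that fully qualified Rlocal might look something like the following (the encoding is purely illustrative, not a proposal):

```json
{
  "type": "datacenter", "name": "llnl",
  "children": [
    { "type": "cluster", "id": 5, "name": "hype",
      "children": [
        { "type": "node", "id": 113, "name": "hype113",
          "children": [
            { "type": "socket", "id": 0,
              "children": [ { "type": "core", "ids": "0-3" } ] } ] } ] } ]
}
```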

@morrone (Contributor) commented Oct 4, 2017

My thought is that these would be included in Rlocal, and then the container plugin specific to that resource type would be able to decide how to contain or make available that specific resource.

I would also argue that Rlocal for any IMP should include not only the resource vertex(es) on which the IMP will be run, but also all the parents of the vertex up to the root (in the hierarchical resource tree).

I'm worried that these new parts of the plan are undermining the original need/goal for Rlocal. If Rlocal contains resource types that the IMP doesn't directly control, then we are back to the original situation we had with a single large R. The IMP now needs to parse out and identify only its locally relevant resource types from a larger tree of information. Rather than going that route, I wonder if it wouldn't make more sense to just go back to having a single complete R document that is sent everywhere, including the IMP (although this time around we are deciding that the complete R no longer needs to be signed).

In other words, if the IMP needs to parse out potentially extraneous information from Rlocal and find the point where the information begins to align with its local resources, then it could do that just as easily from a large R. What is the value then of Rlocal?

@grondo (Contributor, Author) commented Oct 4, 2017

In other words, if the IMP needs to parse out potentially extraneous information from Rlocal and find the point where the information begins to align with its local resources, then it could do that just as easily from a large R. What is the value then of Rlocal?

Rlocal allows the instance that is starting the IMP to control the shape of the container under which the IMP will execute the job shell, instead of relying on the IMP to make that decision, when it doesn't have or need the necessary data to make the correct decision about what goes in the local "container".

I guess where you might differ in opinion is whether the parents of a resource are part of that conceptual container. I tend to think a container that is just "cpu0" doesn't make any sense, you need node->socket[0]->cpu[0] at least to resolve the container. Taking that idea a bit further, in our resource model, node0 is not a valid container either, you need llnl->hype->node[0]. The IMP won't have any containment plugins that try to operate on resource type "datacenter", "node", or "switch", etc., so the extra resources will be safely ignored. However, if a containment plugin happens to need this information, it may be at least able to get it. (location of off-node resources is the main use case I'm considering now)

Another benefit of Rlocal is potentially eliminating dependence on flux-sched or flux-core provided resource query language that might be required to perform the intersection between local resources and global R (though even if you had this support for the IMP, I'm not convinced the IMP alone could make the right decision here). To realize this particular goal, Rlocal will need to be simple enough that the IMP or its plugins could parse it easily themselves.

@grondo (Contributor, Author) commented Oct 4, 2017

A specific case where Rlocal might be required is if an instance, for testing or other good reason, would like to start more than one IMP per broker. To do this, the instance would break up local resources into multiple Rlocal and pass to each IMP. I don't see how it would be possible if the instance passed the global R to each IMP.

BTW, I was mainly taking a long-term view on inclusion of parent resources in Rlocal, and as long as it is possible to add that support in at a future date, I'm ok with leaving it out for now. I think off-node resources like burst-buffer space and licenses could be handled by including enough metadata in those resources included in Rlocal such that a plugin could know exactly which licenses it was operating on, or which burst buffers it was reserving space in... etc.

@morrone (Contributor) commented Oct 4, 2017

A specific case where Rlocal might be required is if an instance, for testing or other good reason, would like to start more than one IMP per broker.

Ah. I thought we agreed that we weren't doing that, and that instead the IMP always controls all resources on the node and leaves further resource masking to the job shell. But if we are reversing that, I suppose that is fine.

I think off-node resources like burst-buffer space and licenses could be handled by including enough metadata in those resources included in Rlocal such that a plugin could know exactly which licenses it was operating on, or which burst buffers it was reserving space in... etc.

If we think that the IMP is going to need to know about global resources, then again I think I'm back to thinking we should just send the full R. We can always add annotations to that if we ever want to run multiple IMPs per node. Adding global resources to Rlocal makes the name a bit of a misnomer. :)

@morrone (Contributor) commented Oct 4, 2017

Another benefit of Rlocal is potentially eliminating dependence on flux-sched or flux-core provided resource query language that might be required to perform the intersection between local resources and global R

The IMP reading a global R does not necessarily imply that the IMP must use a "resource query language". The IMP just needs a parser. The parser for R and Rlocal, if Rlocal is allowed to contain global resources, will be very nearly identical I think. We can choose to implement the parser twice or cut-and-paste it into the IMP if we want to keep it separate.

@grondo (Contributor, Author) commented Oct 4, 2017

If we think that the IMP is going to need to know about global resources, then again I think I'm back to thinking we should just send the full R. We can always add annotations to that if we ever want to run multiple IMPs per node. Adding global resources to Rlocal makes the name a bit of a misnomer. :)

Yeah, I completely understand your sentiment. I'm fine with leaving Rlocal with only "local" resources (whatever "local" may mean), but then we have no proposed method to handle off-node resources (since IMP can only run at most within a node).

@morrone (Contributor) commented Oct 4, 2017

I thought we had talked about that. I think that the "execution management" module, or whatever we are calling it now, would handle off-compute-node resource setup before launching remote execution and the IMPs. There would be plugins into that module that can instantiate the various resources that people come up with.

@grondo (Contributor, Author) commented Oct 4, 2017

The IMP reading a global R does not necessarily imply that the IMP must use a "resource query language". The IMP just needs a parser. The parser for R and Rlocal, if Rlocal is allowed to contain global resources, will be very nearly identical I think. We can choose to implement the parser twice or cut-and-paste it into the IMP if we want to keep it separate.

The IMP will need to parse R, but then how does it complete the intersection between locally available resources and R? It would need to generate an R' from hwloc or some other local HW query code, then take the intersection of R' and R.

If only Rlocal is sent to the IMP, it doesn't need to read the local HW configuration, it doesn't need to generate a second R' from that information, and it doesn't need to do the work of the intersection. So that feels like quite a bit of code saved from a security-significant piece of software.

@grondo (Contributor, Author) commented Oct 4, 2017

I thought we had talked about that. I think that the "execution management" module, or whatever we are calling it now, would handle off-compute-node resource setup before launching remote execution and the IMPs. There would be plugins into that module that can instantiate the various resources that people come up with.

That could work, but the instance doesn't have any privilege except through the IMP. Is it a requirement that all off-node resources don't require privilege to access (this is possible, I just didn't think of it that way before)?

@morrone (Contributor) commented Oct 4, 2017

The IMP will need to parse R, but then how does it complete the intersection between local available resources and R. It would need to generate an R' from hwloc or some other local HW query code, then take the intersection of R' and R.

I think it is actually a lot simpler than that if there is just a single IMP per node. It just walks the tree of data in R, and looks up each resource it sees in its internal table "oh! that belongs to me, I'll note that", "nope that doesn't belong to me, skip it". There is no complicated intersection needed, really.

@grondo (Contributor, Author) commented Oct 4, 2017

There is no complicated intersection needed, really.

Ok, I guess I couldn't visualize how to make it quite that simple.

Whereas with Rlocal, the IMP would walk each type of resource for which it has a containment plugin and hand the list of those resources in Rlocal to the plugin (or alternately each plugin could generate the list itself). No comparisons needed at all. Since there won't be containment plugins for "node", "switch", "datacenter", and other resources, those would be safely ignored if they were there at all.

@morrone (Contributor) commented Oct 4, 2017

That could work, but the instance doesn't have any privilege except through the IMP. Is it a requirement that all off-node resources don't require privilege to access (this is possible, I just didn't think of it that way before)?

It would be preferable when possible. But it can be handled on a case-by-case basis.

Doing this through the IMP could potentially introduce a fair bit of complexity. We might be back to needing a way to track authority back through multiple levels of flux instances. We were able to avoid that when the IMP was constrained to dealing with resources inside of its local node's container.

@morrone (Contributor) commented Oct 4, 2017

Whereas with Rlocal, the IMP would walk each type of resource for which it has a containment plugin and hand the list of those resources in Rlocal to the plugin

Yeah, actually it is slightly more complicated than I stated, but not much. In each case where it finds a resource it owns, it needs to remember that AND all of the resources under it in the tree. But that is still pretty straightforward, I think.

I don't think the IMP can only look at types, even for an Rlocal. It needs to look at either names or counts too. Flux allows two approaches to something like a "socket". We can either represent all of the sockets on a node as a single resource vertex and use the count within that vertex to represent all of the sockets, or we can have a resource vertex with a name/id/uuid/whatever for *each* of the sockets.

In the latter case, with individual resource vertices, the scheduler picks the exact resources, and the IMP just needs to carry out the instructions. In the former, with counted resources, the IMP needs to be more aware of what is happening with allocations on the node (for instance, if nodes are shared). But actually, I'm not sure that the IMP can read the scheduler's mind enough to always make the same selection pattern...and that could lead to resources being shared (sockets/cores) that the scheduler could have avoided. So actually, for things like sockets and cores I suspect that we will always use separate resource vertices.

Since there won't be containment plugins for "node", "switch", "datacenter", and other resources, those would be safely ignored if they were there at all.

I think it is reasonable for the IMP to know what node it is on. And once it knows that, it will be fairly easy to pick out its own resources from the global R.
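
The walk described above (match a vertex to the local node, then claim it and everything beneath it) can be sketched as follows; the tree layout and field names are invented for illustration:

```python
# Sketch of the IMP-side walk over a global R: when a vertex matches
# this node, claim it and its entire subtree. The "type"/"name"/
# "children" layout is hypothetical.

def claim_local(r, hostname, claimed=None):
    claimed = [] if claimed is None else claimed
    if r.get("type") == "node" and r.get("name") == hostname:
        collect_subtree(r, claimed)  # ours: take everything below it
    else:
        for child in r.get("children", []):
            claim_local(child, hostname, claimed)
    return claimed

def collect_subtree(r, out):
    out.append({"type": r["type"], "name": r.get("name")})
    for child in r.get("children", []):
        collect_subtree(child, out)

R = {"type": "cluster", "name": "hype", "children": [
        {"type": "node", "name": "hype113", "children": [
            {"type": "core", "name": "core0", "children": []}]},
        {"type": "node", "name": "hype114", "children": []}]}

mine = claim_local(R, "hype113")  # node hype113 plus its core
```

No intersection with a locally generated R' is needed; the hostname comparison alone decides ownership.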

@grondo (Contributor, Author) commented Oct 4, 2017

Doing this through the IMP could potentially introduce a fair bit of complexity. We might be back to needing a way to track authority back through multiple levels of flux instances. We were able to avoid that when the IMP was constrained to dealing with resources inside of its local node's container.

I'm not sure that is required, but either way you need to verify ownership of the resource whether it is through an IMP plugin or a plugin in the execution system.

One example I can think of is if you restricted access to a license server through some sort of iptables rules. For jobs granted access to the license server, you would have to allow iptables rules to be modified in the network namespace of the job. I can envision how this could be done with an IMP plugin, but it is not at all clear what kind of system you'd need if you tried to do it through a plugin in the unprivileged execution modules. Besides, the containers can't exist until the IMP runs, so there is an ordering problem there.

For this case would you consider licenses a "local" resource, or perhaps rename them network access tokens or something? (If that is the case then I could see that this scheme could work)

@grondo (Contributor, Author) commented Oct 4, 2017

I think it is reasonable for the IMP to know what node it is on. And once it knows that, it will be fairly easy to pick out its own resources from the global R.

Ok, but why? I guess it is immaterial if we pass R or Rlocal since the IMP always filters R to Rlocal anyway (it will just be a noop in the second case). The question I keep struggling with is why you'd want to do that work in your privileged process if you don't have to?

I don't think the IMP can only look at types even for an Rlocal. It needs to look at either names or counts too.

Yes, that is what I meant. Each containment plugin will need to know the list and count (especially for RAM) of each of the resource types it knows how to deal with.

@grondo (Contributor, Author) commented Oct 4, 2017

Here's my proposal for a simplified Rlocal to satisfy near-term milestones, if it is acceptable that R and Rlocal have different specifications.

The simplified Rlocal as input to the IMP will be a JSON document with a list of resource types for which the IMP should create a container. The IMP will support a list of plugins which operate on one or more of these types, and which access the Rlocal directly to determine the parameters of the containers they can create. E.g., a memory and cpu "cgroups" container would read the "socket", "cpu", and "memory" fields of the Rlocal dictionary, add cpus and mems to a cpuset cgroup, and constrain memory with a memory cgroup.

The format of Rlocal might look (very roughly), something like:

{
  "cpu": { "list": [0, 1, 2, 3], "count": 4 },
  "socket": { "list": [0], "count": 1 },
  "memory": { "count": 1024, "units": "MB" }
}

This is just off the cuff so there may be missing fields, but is meant to give a general idea.
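
To illustrate how little a containment plugin would need to do with this document, here is a sketch that turns the "cpu" and "memory" fields above into cpuset/memory-cgroup parameters; the helper and its output format are assumptions, not part of the proposal:

```python
# Sketch: turn the simplified Rlocal JSON into cpuset/memory cgroup
# parameters. Field names follow the proposal above; the range syntax
# and byte conversion are illustrative.
import json

rlocal = json.loads("""
{
  "cpu":    { "list": [0, 1, 2, 3], "count": 4 },
  "socket": { "list": [0], "count": 1 },
  "memory": { "count": 1024, "units": "MB" }
}
""")

def to_cpuset(ids):
    """Collapse a sorted id list like [0, 1, 2, 3, 7] into "0-3,7"."""
    out, i = [], 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1  # extend the contiguous run
        out.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(out)

cpus = to_cpuset(rlocal["cpu"]["list"])           # cpuset.cpus string
mem_bytes = rlocal["memory"]["count"] * 1024**2   # "units" says MB
```

A plugin that only knows its own resource types never has to inspect, or even tolerate, anything else in the document.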

If the IMP must take the full R as input, then I'd suggest a plugin to the IMP, provided by the instance, would generate this format as input to the IMP containment plugin infrastructure. That would further require that the IMP operate in privilege separation mode so that the plugin operating on R runs with permissions of the instance owner. This would avoid copy-and-paste parsing code between flux-framework projects, and allow a single, system-installed version of flux-security to work with multiple versions of other Flux projects which may generate R with different formats or capabilities.

@morrone (Contributor) commented Oct 4, 2017

The question I keep struggling with is why you'd want to do that work in your privileged process if you don't have to?

Me too! Which is why adding non-local things to Rlocal that need to be skipped seems like exactly the same kind of processing in a privileged process that we were trying to avoid by having Rlocal in the first place. I don't see much difference between skipping 3 things or skipping 1000. The parsing code needs to be rock solid in either case, so the implementation work seems the same. But if they are the same, then why put the extra effort into implementing Rlocal in the first place?

One example I can think of is if you restricted access to a license server through some sort of iptables rules.

That is interesting...is that how it usually works? I was thinking more that some number of floating licenses would be allocated to an entire job, and they could choose to use them where they like. But I guess it depends on the sophistication of the particular license server.

@grondo (Contributor, Author) commented Oct 4, 2017

That is interesting...is that how it usually works? I was thinking more that some number of floating licenses would be allocated to an entire job, and they could choose to use them where they like. But I guess it depends on the sophistication of the particular license server.

I actually don't know, but it was a real proposal from somewhere (but not sure if it was ever implemented).

Me too! Which is why adding non-local things to Rlocal that need to be skipped seems like exactly the same kind of processing in a privileged process that we were trying to avoid by having Rlocal in the first place. I don't see much difference between skipping 3 things or skipping 1000. The parsing code needs to be rock solid in either case, so the implementation work seems the same. But if they are the same, then why put the extra effort into implementing Rlocal in the first place?

You do make good points.
I guess you are implementing Rlocal in either case; it is just a question of where it is generated. I say you have my consensus that non-local things should not go into Rlocal, and let's leave it at that for now.
However, also take a look at my proposal above of possibly using instance-provided code to parse R from within the IMP by running it in an unprivileged child.

Either way, I don't think we're closing the door on including either more or less resources in the R in future implementations, so I'm thinking we can move on for now...

@morrone (Contributor) commented Oct 4, 2017

However, also take a look at my proposal above of possibly using instance-provided code to parse R from within the IMP by running it in an unprivileged child.

The devil is in the details there. At some point the IMP needs to ingest resource data in some form from an untrusted source. It could either be by parsing a resource document itself, or from parsing the output of something that parses the resource document. But at some point it always needs to be defensive and validate its input. I would need to know more to evaluate it, I think.

I'm not really clear why parsing R seems more scary than parsing Rlocal, or why it would necessarily need to be handled separately through plugins and/or resource separation. I think R could be handled by an internal parser in the IMP exactly the same way that Rlocal is being proposed to be handled.

@grondo (Contributor, Author) commented Oct 4, 2017

I'm not really clear why parsing R seems more scary than parsing Rlocal, or why it would necessarily need to be handled separately through plugins and/or resource separation. I think R could be handled by an internal parser in the IMP exactly the same way that Rlocal is being proposed to be handled.

Ok, again I understand your point.

I guess I'm arguing that Rlocal as used internally by the IMP is a different, much simpler, format than R (sorry, maybe it shouldn't be called Rlocal anymore?). The amount of code would therefore be smaller, and therefore there would provably be fewer bugs.

The Rlocal format could evolve much more slowly than R, though I admit it hasn't been proven that R will change at any kind of pace that would require frequent updates to the flux-security project, so perhaps a weaker argument here.

Also it just kind of seems to make sense to send less data to the IMP, even though this doesn't have a security argument. For a job with 1000 cores on 1000 nodes, R is potentially 1000x the size of Rlocal....

@lipari (Contributor) commented Oct 4, 2017

Also it just kind of seems to make sense to send less data to the IMP, even though this doesn't have a security argument. For a job with 1000 cores on 1000 nodes, R is potentially 1000x the size of Rlocal....

Totally agree.

@grondo (Contributor, Author) commented Oct 6, 2017

To summarize, I think consensus here is that Rlocal should contain only node local resources, but that it is still useful to send only a subset of R, perhaps in a simpler format, as input to the IMP.

As far as the topic of this issue, which is the specification of R, I don't think that changes much. We still need to be able to generate some Rlocal from R, and therefore we'll need some kind of format for R that allows this within an instance.

@dongahn (Member) commented Oct 6, 2017

To summarize, I think consensus here is that Rlocal should contain only node local resources, but that it is still useful to send only a subset of R, perhaps in a simpler format, as input to the IMP.

IMHO this approach is sound.

As I already discussed with @grondo, extracting Rlocal should be fully distributed so that a centralized component doesn't become a scalability bottleneck.

As for the many execution service modules needing to fetch R to extract Rlocal, I believe this should be scalable, as it would essentially have the performance complexity of a broadcast... At some point, we may want to measure this though.

Sorry if this was too obvious.

@dongahn (Member) commented Oct 9, 2017

I would like to have a bit more discussion on the main topic of this issue: the R format. As @grondo nicely captured in the beginning, R will serve as the input and output of a range of components. For example, it will be the input to the resource service (the resource selection part of scheduling) in a nested instance, as well as to the remote execution service and job shell. Similarly, it will be the output of the resource service, and also of other related services, utilities, and even manual effort, e.g.:

  • hwloc (probably via other services that help with formatting);
  • GRUG through resource service;
  • RDL through resrc;

It seems an important decision we can all benefit from at this point would be whether we want to spec out a common R format, or to go with an opaque approach with only a set of agreed-upon common abstractions on top of it.

Without looking at this too closely: if we go with a graph format with an optional ability to annotate extra information on resource vertices and edges (i.e., the concepts described in RFC 4), we should be able to cover all of the above use cases, and I can contribute to that effort based on my current resource-query experience.

But a part of me is also saying whether this kind of rigor on the format is necessary at this point.

Another approach would be to require, for each format variant, a library that exposes a set of common operations, including a "reader" and "writer."

The former would be a bit more rigorous, but at the same time it can be a bit more time consuming. But maybe it is something that we ought to do anyway.

There is also a third possibility, which is to start with the opaque approach above; as we reach agreements on the common abstractions, we will know the requirements on the format better, and at that point we can formalize it. Since it seems we will have to write libraries that expose those abstractions around R anyway, this is not a bad idea IMHO either.

Thoughts?

@grondo (Contributor, Author) commented Oct 9, 2017

Thanks @dongahn. Some very good thoughts above.

Another element we need to keep in mind is how various components of Flux would manage dependencies for interpreting and managing R. Ideally, perhaps, the R format would be supported directly by flux-core, so that the execution system, which depends on it, can be tested stand-alone. However, this approach might lead to a lowest-common-denominator format, which may not support the needs of advanced resource services and/or schedulers.

An argument might also be made that the R format is solely the domain of a resource service, and the R interpreter and generator should therefore be offered by that service, though that would leave the execution service dependent on resource services being installed, which might not be what we want.

Another approach would be to have the R spec opaque as you said, with each type supplying a corresponding API that satisfies the requirements of all use cases outside of resource service internals. Somehow the required implementation would be encoded in R itself and the correct implementation loaded at runtime.

One more idea would be to have a very basic R specification, but allow a section for "extensions" which might be ignored by most components, but used for any extra information needed by the resource service itself. The base R spec might not even need an API if it was simple enough, thereby removing the pain of deciding where the dependent libraries might live.
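
As a strawman for this fourth option, an R with a small required base plus an "extensions" section that non-resource-service components simply skip might look like the sketch below (all key names hypothetical):

```python
import json

# Hypothetical R: a minimal required base, plus an "extensions"
# section that only the resource service interprets. All key names
# are invented for illustration.
r_doc = json.loads("""
{
  "version": 1,
  "resources": [
    { "type": "node", "name": "hype113",
      "children": [ { "type": "core", "count": 4 } ] }
  ],
  "extensions": {
    "sched": { "subsystem": "power", "watts": 350 }
  }
}
""")

# A base-only consumer (e.g. the execution system) reads "resources"
# and never looks inside "extensions":
base = {k: v for k, v in r_doc.items() if k != "extensions"}
```

The base portion stays simple enough that components without a resource-service dependency can consume it with a plain JSON parser.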

@dongahn (Member) commented Oct 9, 2017

Of the four possibilities, the last two seem most attractive from my perspective. A bit on the 4th option:

One more idea would be to have a very basic R specification, but allow a section for "extensions" which might be ignored by most components, but used for any extra information needed by the resource service itself. The base R spec might not even need an API if it was simple enough, thereby removing the pain of deciding where the dependent libraries might live.

This is a very interesting idea from my perspective @grondo. I consider the resource representation needed by flux-core elements to be a "proper" subset of the representation needed by resource, so this can work out nicely if we can reasonably separate the baseline from the extension. I don't know if "section" is the right construct, but I got the idea.

Why don't I put up a few examples of the graph representations I plan to use in resource, see what belongs to the baseline and what belongs to the core elements, and see whether these are easily separable. I will use GraphML, but any other markup language capable of describing a graph would do as well.

@grondo
Contributor Author

grondo commented Oct 9, 2017

sounds good @dongahn! Thanks!

@dongahn
Member

dongahn commented Jan 29, 2019

I suggest we don't add top-level keys in R like this. For readabillity, extensibility (and a bit of sanity), I'd suggest something namespaced, like

Yup! I was thinking along the same line.

@dongahn
Member

dongahn commented Jan 29, 2019

Great idea. Let me play with it. Love the idea of edges plural.

defaults:
    edge:
        attrs: { subsystem: containment, out: contains, in: in }
    
resource:
  - type: rack
    count: 1
    id: 0
    edges:
      - out:
        - type: node
          name: node7
          id: 7
          count: 1
          edges:
            - out:
              - type: socket
                count: 1
                edges:
                  - out:
                      - type: core
                        name: core0
                        count: 1
                        id: 0
                      - type: memory
                        name: memory0
                        count: 4
                        unit: GB
                  # how to annotate different out-edge type
                  - out:
                      type: foo
                      name: foo1
                      count: 1

I like this direction. But I am not clear on the best way to annotate an out edge when it has an attribute different from the default. Now I remember I used the singular edge key because of this. @grondo: any idea?

@dongahn
Member

dongahn commented Jan 29, 2019

edges:
    - out:
        ? attrs: {}
        vtx: $resource_vertex

We can also do it this way at the expense of being verbose...of course.

@trws
Member

trws commented Jan 29, 2019

  • with implies a multiplicative edge, and I found that in a fully concretized resource set these semantics can be confusing. What happens if an intermediate vertex's count is greater than one? This can lead to an incorrect interpretation of the size of a target vertex. I think introducing the edge key, whose semantic is simply associative, would be much cleaner.

Note that, at least early on in here, that was meant to be dealt with by range expansion on names and/or IDs, such that you could have something like:

type: node
name: n[1-50]
count: 50
with:
  - type: core
...

I'm not sure we still want to do that, but it's an option. Otherwise, for machine generated R, it could just be explicitly laid out with counts of only 1.
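For machine consumers, bracketed ranges like the `n[1-50]` in the example above are cheap to expand. A small sketch (the function name and exact syntax handling are assumptions for illustration, not the actual Flux implementation):

```python
import re

def expand_names(name):
    """Expand a bracketed range like 'n[1-50]' into concrete names.

    Plain names with no bracketed range are returned as a
    single-element list. This sketches the range syntax from the
    example above; it is not the real Flux range expander.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", name)
    if not m:
        return [name]
    prefix, lo, hi, suffix = m.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(lo), int(hi) + 1)]
```

For example, `expand_names("n[1-3]")` yields `["n1", "n2", "n3"]`, so a vertex with `name: n[1-50]` and `count: 50` can be concretized into 50 singleton vertices.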

@grondo
Contributor Author

grondo commented Jan 29, 2019

If I'm following correctly, we have an edges: key which specifies a list of edges that are currently either in: or out:, and each of these keys is in turn a list of vertices connected by that type of edge.

However, there may be multiple types of in or out edges.

Maybe new edge types should be specified in an outer dictionary, along with each type's attributes? This is probably better and less verbose than using defaults:

properties:
   edge:
      with: { "attrs": { "subsystem": "containment", "out": "contains", "in": "in" }}
      foo: { "attrs": ... } # set attributes for edge type "foo"

Or something similar.

Would this work? The drawback is that properties can't be overridden within the spec, as they could be if we had separate attr and vertices keys...

@trws
Member

trws commented Jan 29, 2019

In another thread we had other ways to specify edges, this is a version I had for a relatively dense way to hand-write edges for example:

type: node
<power:
  - type: pdu
>with: core
<with: rack

Where the prefix characters on the key represented either in- or out-edges. That's rather... opaque I would say, but it's OK as shorthand. If the goal is to do multiple edge types under a single edges key, what we were talking about in 2015 has an example here with description, copied below, called links back then:

    type: Core
    count: 1
    tasks: 1 # defaults to one, meaning one of these rspecs per task, to get one total, use *
    sharing: exclusive
    contains: []
    links:
      - type: uses
        direction: out
        target: 17

We discussed this concept and expressing it at length in that issue.

@grondo
Contributor Author

grondo commented Jan 29, 2019

Thanks for commenting @trws! I freely admit I've completely lost context on what we've discussed before.

@trws
Member

trws commented Jan 29, 2019

Happy to; sorry it took so long, actually: OpenMP F2F meeting this week. I like the idea of doing "edges: type : ..." by the way; that's a nice way to handle multiple types without all the extra verbosity of having a "type: ..." key on every one.

@dongahn
Member

dongahn commented Jan 29, 2019

We shouldn't bend over backwards to make the resources section from jobspec fit our R though. If it would be better to first emit vertices, then edges in a separate object, perhaps we should allow for that.

Yes. We almost have to allow this. It's just that I also wanted to make the similar structure of jobspec's resource section also be a valid R format.

Throwing one crazy idea out there: it seems a generic graph format "standard" for JSON is also emerging (like graphml being a small subset of xml to specify a graph), and one possibility is to latch onto that format instead of reinventing our own... http://jsongraphformat.info

@dongahn
Member

dongahn commented Jan 29, 2019

@trws
Member

trws commented Jan 29, 2019 via email

@dongahn
Member

dongahn commented Jan 29, 2019 via email

@dongahn
Member

dongahn commented Jan 30, 2019

OK. It looks like JSON Graph Format (JGF) gives me everything that I need to emit the resource graph. See below an example in JGF to encode the graph in #109 (comment)

Pros:

  • If JGF becomes a standard, we should have plenty of tools we can use -- e.g., validators, readers/writers, visualizers (things like jsongraph.py already exist), and editors for debugging and human interactivity;
  • Similarly, building on a standard may allow us to leverage future techniques such as compression, which the community may work on (or we can work on and contribute).

Cons:

  • A bit verbose (I will look at the spec to see whether I can create default attributes so that we can optimize.)
  • Lose the human-friendly structure that jobspec's resource section has
{  
  "graph":{  
    "nodes":[  
      {  
        "id":"0",
        "metadata":{  
          "type":"rack",
          "name":"rack0",
          "id":0
        }
      },
      {  
        "id":"1",
        "metadata":{  
          "type":"node",
          "name":"node7",
          "id":7
        }
      },
      {  
        "id":"2",
        "metadata":{  
          "type":"socket",
          "name":"socket0",
          "id":0
        }
      },
      {  
        "id":"3",
        "metadata":{  
          "type":"core",
          "name":"core0",
          "id":0
        }
      },
      {  
        "id":"4",
        "metadata":{  
          "type":"memory",
          "name":"memory0",
          "count":2,
          "unit":1073741824,
          "id":0
        }
      },
      {  
        "id":"5",
        "metadata":{  
          "type":"foo",
          "name":"foo0",
          "id":0
        }
      }
    ],
    "edges":[  
      {  
        "source":"0",
        "target":"1",
        "metadata":{  
          "subsystem":"containment",
          "relationship":"contains"
        }
      },
      {  
        "source":"1",
        "target":"2",
        "metadata":{  
          "subsystem":"containment",
          "relationship":"contains"
        }
      },
      {  
        "source":"2",
        "target":"3",
        "metadata":{  
          "subsystem":"containment",
          "relationship":"contains"
        }
      },
      {  
        "source":"2",
        "target":"4",
        "metadata":{  
          "subsystem":"containment",
          "relationship":"contains"
        }
      },
      {  
        "source":"2",
        "target":"5",
        "metadata":{  
          "subsystem":"foo",
          "relationship":"bars"
        }
      }
    ]
  }
}

@dongahn
Member

dongahn commented Jan 30, 2019

A bit verbose (I will look at the spec to see if I can create a default attributes so that we can optimize. )

From the current spec, it wasn't clear whether JGF has support for adding default node/edge properties. But since this is JSON, we can always add those properties as extra data (building on @grondo's idea at #109 (comment)).

{
    "properties": {
       "edges": [
           {
               "id": "default",
               "subsystem": "containment",
               "relationship": "contains"
           },
           {
               "id": "foo",
               "subsystem": "foo",
               "relationship": "bars"
           }
       ]
    },
   "graph": {
       "nodes": [
            {
                "id": "0",
                 "metadata": {
                     "type": "rack",
                     "name": "rack0",
                     "id": 0 
                  }
            },
            {
                "id": "1",
                "metadata": {
                     "type": "node",
                     "name": "node7",
                     "id": 7
                  } 
            },
            {
                "id": "2",
                "metadata": {
                     "type": "socket",
                     "name": "socket0",
                     "id": 0
                  }
            },
            {
                "id": "3",
                "metadata": {
                     "type": "core",
                     "name": "core0",
                     "id": 0
                  }
            },
            {
                "id": "4",
                "metadata": {
                     "type": "memory",
                     "name": "memory0",
                     "count": 2,
                     "unit": 1073741824,
                     "id": 0
                  }
            },
            {
                "id": "5", 
                "metadata": {
                     "type": "foo",
                     "name": "foo0",
                     "id": 0
                  }
            }
        ],
        "edges": [
            {
                "source": "0",
                "target": "1"
            },
            {
                "source": "1",
                "target": "2"
            },
            {
                "source": "2",
                "target": "3"
            },
            {
                "source": "2",
                "target": "4"
            },
            {
                "source": "2",
                "target": "5",
                "metadata": {
                    "property": "foo"
                 }
            }
        ]
    }
}

@grondo and @trws: thoughts?
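One way to read the compact form above is to expand each edge's metadata from the properties table at load time. A rough sketch, assuming the key names from the example (including the `property` selector on the last edge); the function name is made up:

```python
def expand_edge_metadata(doc):
    """Return the edge list with metadata filled in from the
    'properties' table, following the example above. Edges that name
    no property fall back to the 'default' entry."""
    table = {p["id"]: {k: v for k, v in p.items() if k != "id"}
             for p in doc.get("properties", {}).get("edges", [])}
    out = []
    for edge in doc["graph"]["edges"]:
        md = dict(edge.get("metadata", {}))
        key = md.pop("property", "default")   # which property set applies
        merged = dict(table.get(key, {}))     # start from the named defaults
        merged.update(md)                     # edge-local metadata wins
        out.append({"source": edge["source"],
                    "target": edge["target"],
                    "metadata": merged})
    return out
```

With the document above, the first four edges would pick up the containment defaults, while the edge from "2" to "5" would expand to the "foo" subsystem metadata.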

@dongahn
Member

dongahn commented Jan 30, 2019

A bit verbose

Also, I found that JGF is much more condensed and legible than GraphML (#109 (comment))!

@SteVwonder
Member

It looks like JSON Graph Format (JGF) gives me everything that I need to emit the resource graph.

If JGF becomes a standard, we should have plenty of tools we can use -- e.g., validators, readers/writers, visualizers (things like jsongraph.py already exist), and editors for debugging and human interactivity;

👍 From their website, it looks like they are also using json-schema for validation. So +1 for no added dependencies to read and another +1 for no added dependencies to validate.

Right. This isn’t so human friendly. The current resource section of jobspec would be far better for that purpose. I don’t know if our requirements include R to be generated directly by human users.

I agree with you that I don't expect users to have to write R, but they will most likely have to read it. I imagine myself frequently dumping R from the KVS to see what resources my job ran on. That being said, the examples you posted are quite legible IMO.

A bit verbose

Just a thought, but this format probably compresses really well. Lots of repeated tags and patterns. It wouldn't help human-readability, but if we pass the compressed version around in the messages, it can potentially help performance. We could also store it compressed in the KVS, but if we do that, we'll want to make it simple for users to decompress and view R from the KVS.

Also I found that JGF is much more condense and legible than GraphML (#109 (comment)), though!

👍 👍 I agree. Much nicer to look at than GraphML.
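The compression point is easy to sanity-check: JGF's repeated keys and metadata compress very well with a stock codec like zlib. A quick illustration on a synthetic JGF-style document (not real R):

```python
import json
import zlib

# Build a synthetic JGF-style document with many similar nodes/edges,
# mimicking the repeated keys and metadata of a concretized resource set.
doc = {"graph": {
    "nodes": [{"id": str(i),
               "metadata": {"type": "core", "name": f"core{i}", "id": i}}
              for i in range(1000)],
    "edges": [{"source": "0", "target": str(i),
               "metadata": {"subsystem": "containment",
                            "relationship": "contains"}}
              for i in range(1, 1000)],
}}

raw = json.dumps(doc).encode()
packed = zlib.compress(raw, level=9)
print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

On documents like this the repeated structure typically compresses by an order of magnitude, which supports passing or storing a compressed R as long as tools can transparently decompress it for viewing.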

@grondo
Contributor Author

grondo commented Feb 4, 2019

Cons:
A bit verbose (I will look at the spec to see if I can create a default attributes so that we can optimize. )
Lose the human-friendly structure as jobspec's resource section has

Can we propose that the base R type still contains a hardware topology ("containment" hierarchy) as resource: or some other key, with a simple, more human readable hierarchy-only representation, while flux-sched and other advanced schedulers can extend the base R with a "graph:" (and other) sections which can be considered opaque to flux-core components?

Really, what is required from flux-core is the basic containment hierarchy, and an ability to map R to ranks and construct an R_local (exec system), and map task slots to local resources (job shell). We could also offer simple tools that parse and display the R for a job (e.g. in queue listing we might just need to pull out a host list, or in a more detailed listing a user-friendly representation of R), or users could dump just the resource: section of R directly.

Not that I am fully opposed to specifying the format of R as JGF, but at this point it seems like overkill for flux-core components. But I'd hate to add extra complexity elsewhere for only a modicum of simplicity in flux-core.
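As a toy illustration of the R_local requirement, here is what deriving a per-rank subset from a flat, rank-annotated resource list could look like. The shape of the data (a `resource` list whose entries carry a `rank` key) is invented for this sketch, not a proposed format:

```python
def r_local(r, rank):
    """Filter a hypothetical rank-annotated resource list down to the
    entries for a single broker rank. The 'rank' key on each entry is
    an assumption made for this sketch."""
    return [entry for entry in r["resource"] if entry.get("rank") == rank]

# Hypothetical two-node R: each entry carries the broker rank it lives on.
R = {
    "resource": [
        {"type": "node", "name": "node0", "rank": 0, "children": {"core": "0-3"}},
        {"type": "node", "name": "node1", "rank": 1, "children": {"core": "0-3"}},
    ]
}
print(r_local(R, 1))  # only node1's entry
```

The point is that whatever format wins, the execution system needs this rank-to-resources mapping to be cheap and unambiguous.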

@dongahn
Member

dongahn commented Feb 5, 2019

Can we propose that the base R type still contains a hardware topology ("containment" hierarchy) as resource: or some other key, with a simple, more human readable hierarchy-only representation, while flux-sched and other advanced schedulers can extend the base R with a "graph:" (and other) sections which can be considered opaque to flux-core components?

I think a two section approach has several advantages. For example, this way, we don't have to include rank and slot info into the "graph" section needed for the nested schedulers.

One disadvantage, though, is that R will be on the order of twice as big with the two-section approach. That may be okay... Generally, the two-section approach would be a bit faster in exchange for more space.

One alternative approach would be to develop a converter layer that converts the graph into the containment representation that the execution service requires.

With either approach, a question remains: what should the exact format for the resource section be? As we discussed above, at least I would avoid use of with as a first cut.

@dongahn
Member

dongahn commented Feb 5, 2019

How about:

#109 (comment) or similar under the resource key and JGF under the sched or graph key?

For our current sprint, resource can have R_lite++ of course?

@dongahn
Member

dongahn commented Feb 5, 2019

BTW, does the execution system need info on the higher level resources? Like rack or cluster? Or just node and down?

@dongahn
Member

dongahn commented Feb 5, 2019

@grondo and @SteVwonder: I can start to draft an RFC on a two section proposal as a way to push forward this dialogue further, if you like.

@grondo
Contributor Author

grondo commented Feb 5, 2019 via email

@trws
Member

trws commented Feb 6, 2019

I realize I'm coming to this a bit late, but I'd put in my 2c for keeping it one piece. The hardware topology should always be walkable by just limiting the graph representation to that kind of edge, and if it's in two formats in two parts at least some of the components will have to work with both. The job shell has to take R emitted from sched for example, so either sched needs to work with both formats or there has to be one that works for both. Perhaps it would be good to come up with an API or similar for what the core side wants to talk to, that could understand whatever the format is underneath and provide the appropriate information rather than expecting it to walk the resource spec directly?
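One shape such an API could take is a thin reader interface: core components ask questions ("which ranks?", "which cores on this rank?") and the backend parses whichever encoding was emitted. Everything below is a hypothetical sketch (the class names, methods, and the flat JSON encoding are all invented for illustration):

```python
import json
from abc import ABC, abstractmethod

class ResourceSetReader(ABC):
    """Hypothetical interface the execution side could program against,
    independent of how R is encoded underneath."""

    @abstractmethod
    def ranks(self):
        """Return the broker ranks covered by this resource set."""

    @abstractmethod
    def cores(self, rank):
        """Return the core ids assigned on one rank."""

class FlatJSONReader(ResourceSetReader):
    """Toy backend for an invented flat JSON encoding of R."""

    def __init__(self, text):
        self.entries = json.loads(text)

    def ranks(self):
        return sorted({e["rank"] for e in self.entries})

    def cores(self, rank):
        return [c for e in self.entries if e["rank"] == rank
                for c in e["cores"]]
```

A graph-backed reader (e.g. one that walks only containment edges of a JGF document) could implement the same interface, so the job shell and exec system never touch the raw format directly.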

@dongahn
Member

dongahn commented Feb 6, 2019

@trws:

Thank you for your thoughts.

I think what @grondo wants is to put what's required by the execution service in one section in an easy-to-use format, and the full information pertaining to the scheduler in a second section. This isn't too difficult for the scheduler to do. I hope that we don't find a situation where any component needs to read both sections. Clearly this isn't optimal in terms of storage and R producer performance. But it has a consumer performance advantage (the execution system only needs to read one section while the nested scheduler instance doesn't need to read data like "rank" and "slots") and lowers complexity in the execution system software. The API approach (or the converter approach I suggested above) would be another excellent way to overcome this issue. But what @grondo seems concerned about is the complexity of designing an API at this point, as it will have to be a graph code.

@trws
Member

trws commented Feb 6, 2019 via email

@dongahn
Member

dongahn commented Feb 6, 2019

Completely agreed! Good thoughts @trws.

@dongahn
Member

dongahn commented Feb 13, 2019

Thank you for the good discussions. We will probably want to refer back to this when we evolve R later.

But for now PR #155 resolved the ticket.

@dongahn dongahn closed this as completed Feb 13, 2019