
embed SELinux security context into credential metadata #70

Open
dun opened this issue Jun 15, 2019 · 4 comments

@dun
Owner

dun commented Jun 15, 2019

Embed the SELinux security context into the credential metadata. This will benefit resource managers (e.g., Flux and Slurm) that want to confine jobs to a specific SELinux context.

The goal is to allow a user to change into a project context where they can only write to specific directories, have that user launch a job with that security context, and ensure the job runs in that security context in order to prevent cross-contamination of data.

munged can retrieve the context of a peer socket via getpeercon(3). Getting the appropriate security context of the peer is going to be tricky.
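
For reference, the getpeercon(3) lookup is roughly the following (a minimal sketch; error handling mostly omitted):

#include <selinux/selinux.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch: return a malloc'd copy of the SELinux context of the
 * client on the other end of the accepted Unix domain socket sd, or
 * NULL on error.  getpeercon(3) wraps the SO_PEERSEC socket option. */
static char * peer_context (int sd)
{
    char *peercon = NULL;
    if (getpeercon (sd, &peercon) < 0)
        return NULL;
    char *copy = strdup (peercon);
    freecon (peercon);  /* contexts from libselinux are freed with freecon(3) */
    return copy;
}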

@dun
Owner Author

dun commented Jul 23, 2019

Email conversations with Steve Lawrence (@stevedlawrence) and Tim Wickberg (@wickberg):


From: Chris Dunlap
Date: Wed, 26 Jun 2019 17:36:39 -0700

The MUNGE daemon communicates with a client over a Unix domain socket. Under Linux, it uses the SO_PEERCRED socket option to obtain the UID/GID of the client (e.g., srun) connected to this socket. It could be modified to call getpeercon() there to obtain the security context of the client as well. That information is encapsulated into a credential in a manner that prevents tampering. When slurmd receives the credential, it can extract the UID/GID and security context without having to trust the client. It can then perform a combination of setuid() and setexeccon() before exec'ing the job.
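
For reference, that SO_PEERCRED lookup is roughly the following (a minimal sketch; error handling omitted):

#define _GNU_SOURCE  /* for struct ucred */
#include <sys/socket.h>

/* Minimal sketch: fill in the uid, gid, and pid of the client
 * connected to the Unix domain socket sd.  Returns 0 on success. */
static int peer_creds (int sd, struct ucred *cred)
{
    socklen_t len = sizeof (*cred);
    return getsockopt (sd, SOL_SOCKET, SO_PEERCRED, cred, &len);
}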

Is that basically how you see this working? What other attributes are you needing to pass, how are these obtained, etc.?


From: Steve Lawrence
Date: Thu, 27 Jun 2019 08:19:08 -0400

That all sounds good to me, except for one potential concern, which is the use of getpeercon() to get the context of the submitting job. The socket you mention is created by srun, which means the context returned by getpeercon() will be the same as the srun process. That context could be different from that of the process that executed srun, which is what we want to use for setexeccon(). This is very common in SELinux and happens by way of an SELinux domain transition.

A domain transition is somewhat analogous to the setuid/setgid bits in standard Linux DAC. When these bits are set and a user executes the file, the effective uid and gid of the new process are changed to those of the executable's owner. A domain transition works similarly -- when a user executes a file, there are ways to change the SELinux context of the new process. This is often used to restrict or elevate privileges for the new process and is very common.

As an example, say we have a very restricted user that has no network access enforced by SELinux, running under the SELinux context "restricted_user_t". When the user executes srun, we could then perform a domain transition to a more privileged context "srun_t", which does have network access so slurm can submit the job to the master. The issue here is that if srun runs under "srun_t", getpeercon() will return that context, when what we really want is "restricted_user_t". Imagine someone set the setuid/setgid bits on the srun executable -- srun would run under a different uid/gid and munge would get the wrong uid/gid. Same kind of issue with SELinux.

One option to resolve this potential issue is to configure the SELinux policy to just not do a domain transition when a user executes srun. This way, srun will run as the user's context "restricted_user_t", and getpeercon() will get the expected context. The downside here is that a user will require whatever permissions srun requires, which might violate security requirements. For example, now our "restricted_user_t" context needs network access since srun needs network access.

In the case of this project, I don't think that would violate any security requirements, but I haven't looked in detail at what srun does to submit a job. Though, if effort is being put into making this change, it might be worth making it general purpose enough that if we or someone else does want to separate srun into a different context, they can do so. If this weren't an issue, I would very likely recommend doing the domain transition, so this isn't a totally made up use case.

Another option: munge could determine the process id of the submitting process (not srun, but whatever executed srun) and then use getpidcon() to get the context of that process. SO_PEERCRED will get the pid of srun, and finding its parent could be done by querying /proc, so this probably isn't too much extra work.

As an example, the process tree with contexts might look something like this:

xterm (restricted_user_t)
-- bash (restricted_user_t)
---- srun (srun_t)

In this case, munge would want to find the pid of the bash process and use getpidcon() to get "restricted_user_t".

Another related issue is the tools that wrap srun (e.g., salloc, sbatch). With these wrapper tools, the process tree might look something like this if we transitioned on the tools in addition to srun:

xterm (restricted_user_t)
-- bash (restricted_user_t)
---- salloc (srun_t)
------ srun (srun_t)

Note that in SELinux, salloc and srun are likely considered security equivalent and would run under the same context. In this case, munge would need to find the parent of salloc and get its context, rather than the parent of srun. An alternative here would be to just not transition when salloc is executed, but instead only when srun is executed. In which case the process tree looks like this:

xterm (restricted_user_t)
-- bash (restricted_user_t)
---- salloc (restricted_user_t)
------ srun (srun_t)

In this case, it would be reasonable to get the context of srun's parent, which is salloc with the same context as bash (i.e., we can just always get the context of the parent of srun). I'm not familiar enough with salloc and other wrappers to know if they have any privileges that a restricted user shouldn't have, or if they really are just fairly thin wrappers of srun. My guess is they don't really have excessive privilege requirements, and so this option would be viable and munge can just always get the context of the parent of srun. I'm also not familiar with what privileges srun needs, so it may be entirely reasonable to run it as the user context, and SO_PEERCRED + getpeercon() would be sufficient.


From: Chris Dunlap
Date: Thu, 27 Jun 2019 12:17:40 -0700

Thanks, Steve! This really helps. I was worried getpeercon() might return the wrong context, but wasn't sure what to do about it. I still have a couple of questions.

1. From a security perspective, is there a difference between having srun obtain the security context vs. having MUNGE obtain it? In particular, srun could call getpidcon() of its "pppid", and embed the resulting security context into the payload of the MUNGE credential. (MUNGE already supports embedding an arbitrary payload into the credential.) I'm wondering if there is any benefit to having MUNGE directly obtain the security context instead of having it passed from the resource manager when the credential is created. (A sketch of this payload approach follows these questions.)

2. Is there a way to algorithmically determine how far back in the process tree you need to go to get the appropriate security context? For a general purpose solution, I'm worried there could be multiple (or no) wrappers. Should you keep traversing the parent PIDs until the process cmdline matches a shell?
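
Regarding question 1, the payload approach would use the existing libmunge API as-is (a minimal sketch; error handling omitted, and the context string is assumed to have already been obtained):

#include <munge.h>
#include <string.h>

/* Minimal sketch: embed an already-obtained context string into the
 * credential payload via the existing libmunge API (no new options).
 * Returns the credential, or NULL on error; the caller frees it. */
static char * encode_with_context (const char *context)
{
    char *cred = NULL;
    munge_ctx_t ctx = munge_ctx_create ();
    munge_err_t err = munge_encode (&cred, ctx, context, strlen (context) + 1);
    munge_ctx_destroy (ctx);
    return (err == EMUNGE_SUCCESS) ? cred : NULL;
}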


From: Steve Lawrence
Date: Fri, 28 Jun 2019 08:06:33 -0400

1. From an SELinux perspective, I don't think there's really a difference between MUNGE or srun getting the security context. But from a general security perspective, it might make sense to have MUNGE get it. I don't really know the details of how MUNGE and srun interact, but if srun sends the context to MUNGE or just appends something to MUNGE's payload, that means there's some inherent trust in srun to do the right thing. And there might be a potential attack where a malicious user could download and execute a modified srun that appends an arbitrary context that has elevated privileges. So I'm not sure we can really trust having srun get the context.

And I believe this is why MUNGE exists in the first place. I believe it has private access to keys used to sign the uid/gid credentials. Those keys wouldn't be accessible by a modified malicious binary so it would be difficult to forge the SELinux context without the master noticing. I presumed this is why MUNGE is trusted to get the uid/gid right now, because of this signing.

Side-note, SELinux could easily prevent the attack I mentioned of a user running a modified srun, but security is all about layers, so if we can add an extra layer of whatever security MUNGE provides, that would be preferred.

2. This is where things get tricky. It's probably best not to look at parents to find a command shell, because there could be cases where a slurm job isn't submitted from a shell. E.g., maybe it's submitted from a GUI or from cron or something that isn't necessarily a shell. It would be fine for our use case, but I can imagine cases where it might not work.

The best solution I've thought of so far is to have the ability to configure Slurm/MUNGE with a list of slurm-related SELinux contexts, and MUNGE would just keep looking at parent processes until it found one that doesn't have one of these slurm contexts. The benefit to this is that if we wanted to have different SELinux contexts for srun and the wrappers we could do so. I imagine the pseudo-code would look something like this:

slurm_contexts = [srun_t, salloc_t, sbatch_t]
pid = get_pid_of_srun()
context = getpidcon(pid)
while (context is in slurm_contexts) {
  pid = get_parent_pid(pid)
  context = getpidcon(pid)
}
return context

I'm not a huge fan of this; it seems a bit over-engineered. But it is generic and should support all potential use cases. And it does use SELinux contexts to determine which processes to ignore, which I like since a user has no control over their SELinux context. Instead, the SELinux policy author has full control over how slurm determines which SELinux context to use without needing to mess with slurm at all.

And this type of list of contexts isn't unprecedented. There's a handful of files in /etc/selinux/<policy>/contexts/ that contain lists of contexts specific to certain applications. I don't think any processes use those lists in the manner I'm suggesting for slurm, but how those lists are used is application-specific and varies. Adding a slurm_contexts file that contains the list mentioned above doesn't seem completely unreasonable.

All that said, if that all sounds overly complex, it would probably be reasonable to just always get the context of the parent of srun; it's just not the most flexible. Doing that would mean we couldn't write a separate policy for the wrappers, which is probably sufficient, but without an in-depth analysis of these wrappers it's hard to say for sure. If we want to play it safe and leave that door open, something like the above suggestion might be needed.


From: Chris Dunlap
Date: Fri, 05 Jul 2019 00:27:29 -0700

MUNGE and srun interact via the C library libmunge. This library provides the client which communicates with the local munged daemon over a Unix domain socket. As far as MUNGE is concerned, srun is just another client. MUNGE has no special knowledge of (or hooks for) Slurm.

MUNGE was originally developed to address several problems we were having with user authentication in the context of a cluster resource manager. We needed a portable API that worked across several flavors of Unix. We couldn't rely on root privileges or reserved ports. The remote daemons on the compute nodes within the cluster needed to reliably determine the user's identity. We couldn't trust the client for the user's identity. The credential had to be tamper-proof since it was being forwarded by untrusted intermediaries. And encoding/decoding the credential had to be fast at scale.

I think having MUNGE obtain the SELinux security context makes sense. It falls within the bailiwick of its user authentication. The issue has more to do with the time frame.

I'm currently envisioning adding a MUNGE_OPT_SELINUX context option to libmunge that takes an exclusion list of SELinux contexts in the form of a NULL-terminated array of strings. The client (srun) could create its MUNGE credential as follows:

char *slurm_contexts[] = { "srun_t", "salloc_t", "sbatch_t", NULL };
mungectx = munge_ctx_create ();
munge_ctx_set (MUNGE_OPT_SELINUX, slurm_contexts);
munge_encode (&mungecred, mungectx, payload_buf, payload_len);
munge_ctx_destroy (mungectx);
return mungecred;

The exclusion list is sent to munged in the request (via munge_encode()). munged obtains the uid, gid, and pid of the client via the SO_PEERCRED socket option. It looks up the process corresponding to that pid to obtain its uid & ppid. And it keeps traversing the process tree until either the uid transitions or the SELinux context fails to match the exclusion list:

getsockopt (sd, SOL_SOCKET, SO_PEERCRED, &cred, &credlen);
lookup-process-by-pid (cred.pid, &uid, &ppid);
for (pid=ppid; lookup-process-by-pid (pid, &uid, &ppid); pid=ppid) {
  if (uid != cred.uid) break;
  getpidcon (pid, &context);
  if (context-not-in-exclusion-list) break;
}
return (uid == cred.uid) ? context : NULL;

If the exclusion list is simply NULL, the for-loop will exit with the SELinux context of the parent of the peer process (e.g., the parent of srun). The uid check ensures the uid of the selected process matches the uid of the peer process that called munge_encode(). I don't want a bad exclusion list to result in returning the context of init or some other system process.

A few more questions...

Do you foresee any issues with the uid check above? Am I correct in assuming the uid of the matching process must match the uid of the process that requested the credential?

Would the SELinux contexts in the exclusion list always be specified as user:role:domain, or could they be abbreviated with just the domain (e.g., "system_u:system_r:srun_t" vs. "srun_t")? And what about security levels?


From: Steve Lawrence
Date: Fri, 5 Jul 2019 08:23:35 -0400

Sounds like the right approach to me.

The client (srun) could create its MUNGE credential as follows:

char *slurm_contexts[] = { "srun_t", "salloc_t", "sbatch_t", NULL };
...

Any particular reason to get the context list from srun rather than MUNGE having the list? srun really doesn't need to know anything about SELinux, so if all the logic, including what a srun context is, could be kept in MUNGE that might simplify things a bit.

I'd also probably prefer that the slurm_contexts not be hard coded. An SELinux policy is free to run srun and wrappers in whatever context they want, so hard-coding types could be limiting. For example, some policies are starting to drop the "_t" suffix, some policies are starting to have types like "srun.process", etc., so it isn't unrealistic that srun could run as something different than the list you have.

A configuration file like /etc/selinux/<policy>/contexts/slurm_contexts or in a slurm/munge-specific config file would be ideal. I think /etc/selinux/<policy> probably makes the most sense since slurm types are really specific to each policy. For example, you could actually have two different policies on a system that an admin could switch between (e.g., /etc/selinux/steve_policy/ and /etc/selinux/chris_policy/) and each could define different contexts for srun. Putting the config file in /etc/selinux/<policy>/contexts/ allows each policy to define what their slurm contexts are specific to their policy.

To get the contexts directory of the currently loaded policy you can use the libselinux function selinux_contexts_path() defined in <selinux/selinux.h>. You could then append "slurm_contexts" to this path to get the config file and read the list of slurm-related contexts. It's a bit more work, but allows extra functionality.

As to the questions:

Do you foresee any issues with the uid check above? Am I correct in assuming the uid of the matching process must match the uid of the process that requested the credential?

That check doesn't seem unreasonable to me. The only time I think it could cause issues is if someone had the setuid bit set on srun or a wrapper, but I can't imagine a system that would do that, and it would probably break other things if they did. That said, I can't think of a way where the loop would get up to init if the uid check were removed, but it can't hurt and is added protection from something dangerous.

Another option would be, instead of having a list of slurm contexts, to have a whitelist of contexts that are allowed to submit slurm jobs. Then you traverse up the process tree until you find one of the whitelisted contexts. If it never finds one, the job isn't submitted. This is probably better from a security perspective, but is more difficult to maintain since it requires updates anytime something new is allowed to execute slurm. I prefer the existing solution, but both have their pros and cons.

Would the SELinux contexts in the exclusion list always be specified as user:role:domain, or could they be abbreviated with just the domain (e.g., "system_u:system_r:srun_t" vs. "srun_t")? And what about security levels?

I think all we really care about is the type/domain. The user, role, and (optional) security levels are all based on the context of the process that executed srun, so that could vary. For example, if a "staff" user ran srun, the full context might look like:

staff_u:staff_r:srun_t

Whereas a guest's context might be:

guest_u:guest_r:srun_t

In this case, all we really care about when determining which process submitted the job is the type, srun_t, which remains constant regardless of who executes it. That is what lets us know if a process is a slurm-related process or not. So the exclusion list should really just contain types. Note that getpidcon() returns a full context, so you'll need to extract the type out of that for comparison. Fortunately, libselinux provides functions in <selinux/context.h> to extract the type, so it'll look something like this (with no error checking for brevity):

char *pidcon = NULL;
getpidcon(pid, &pidcon);
context_t con = context_new(pidcon);
freecon(pidcon); // context_new() copies the string, so the raw context can be freed
const char *type = context_type_get(con);
// do comparisons with type
...
context_free(con);

Note that you'll still want to send pidcon to the slurm master so that the full context is used when the job is executed.


From: Chris Dunlap
Date: Sat, 06 Jul 2019 13:06:50 -0700

I don't see how MUNGE could manage the context list. It doesn't have any knowledge of Slurm or the client that's calling into libmunge. It can't apply a "slurm_contexts" policy to all clients. Are you suggesting that MUNGE loads an applicable policy somehow based on the security context of the process calling munge_encode()? And that this policy would need to be loaded and parsed again for each request encapsulating an SELinux context?


From: Steve Lawrence
Date: Mon, 8 Jul 2019 09:18:30 -0400

I didn't realize MUNGE could be used for things other than Slurm. I now see why you suggested that slurm send contexts to MUNGE, and that changes things a little bit.

However, one concern with this is that it requires some inherent trust in srun to send appropriate contexts. If srun were compromised, it could send contexts in addition to the slurm contexts and result in the wrong context being used for the job. And in our specific use case, the uid check wouldn't even save us, because our processes calling srun run under the same uid but with different contexts.

I personally would rather put more trust in MUNGE than in srun, since it seems to be smaller and more security-focused, so I'd prefer it handles getting the list of client types. The added benefit here is clients don't need to make code changes to get SELinux support. Only MUNGE and how it gets the list of client contexts need to be updated, and a config file makes that very easy.

We could go complex and make the config file a list of types depending on the client type. We could call the config file something like /etc/selinux/<policy>/contexts/munge_client_contexts, and it could look something like:

srun_t = srun_t salloc_t sbatch_t
client_a_t = client_a_t

The first type is the type of the client, and the types after the equal sign are parent contexts that should be ignored. So it should be similar logic to what you have now, but you first need to pick the appropriate list based on the client context. This config file wouldn't need to be updated often, so it really only needs to be parsed and loaded once when munge starts.

Though, I'm still not entirely convinced we even need the srun/salloc/sbatch context distinctions. If anyone knows if these wrappers really are different than srun from a security perspective, that would be good to know. But I suspect that salloc/sbatch/srun can all run as the same SELinux context, which simplifies both the config file format and, I think, the code. The config would then just contain:

srun_t
client_a_t

and the code might look something like this:

// only do this once at process start time
munge_client_contexts = read_munge_client_contexts()

client_context = get_context_of_client_socket()
if (munge_client_contexts.contains(client_context)) {
  pid = get_pid_of_client_socket()
  do {
    pid = get_ppid(pid)
    context = getpidcon(pid)
  } while (context == client_context);
  return context;
} else {
  return client_context;
}

If the context of the client is one of those listed in the munge_client_contexts file, then we find the first parent process that has a context other than the client context.

If the context of the client is not in the list, then we just use the client context.

This way is maybe not as generic and wouldn't support wrappers with different contexts, but I'm not convinced that's necessary. And it makes the config file quite a bit simpler -- just a list of types rather than mapping between type and list. I'm not sure if the code is any simpler, but it does avoid having to load and lookup one of multiple lists.


From: Chris Dunlap
Date: Wed, 17 Jul 2019 19:26:55 -0700

OK, this makes sense.

What's the rationale for placing the configuration in /etc/selinux/<policy>/contexts/munge_client_contexts vs. /etc/munge/munged.conf? Is there a need to switch between SELinux policies where the security contexts/types would differ? Is this just the standard location used by SELinux-enabled apps? It seems like it would be more straightforward to configure and maintain in /etc/munge/munged.conf, but perhaps that approach offers less flexibility.


From: Steve Lawrence
Date: Thu, 18 Jul 2019 07:22:47 -0400

Yes, that's the idea. The types that are assigned to srun are specific to each policy. For example, one policy might use the type "srun_t", where another policy might use the type "srun.process", and another might be something generic like "munge_client_t". Since these types are tied specifically to the policy, the convention is for SELinux-aware applications to get their SELinux-aware configurations from the policy directory. Although not super common, this allows users to easily switch between policies without having to modify the specific configurations of SELinux-aware applications. Most SELinux-aware applications have a config file in /etc/selinux/<policy>/contexts/.

The format of the file is completely up to the application, so you could make it very similar to the munge config file and reuse that config parsing logic if you wanted.

SELinux has an API function to get the contexts directory:

selinux_contexts_path()

So you don't need to use any complex logic to figure out which policy directory to use. You just need to call that and append "munge_client_contexts".
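
Putting that together (a minimal sketch; a doubled '/' is harmless if the returned path already ends with one):

#include <selinux/selinux.h>
#include <stdio.h>

/* Minimal sketch: build the path to the per-policy config file,
 * e.g. "/etc/selinux/targeted/contexts/munge_client_contexts". */
static int client_contexts_path(char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%s/munge_client_contexts",
                    selinux_contexts_path());
}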


From: Tim Wickberg
Date: Wed, 31 Jul 2019 16:32:36 -0600

Hi Steve -

I spent some time chatting with Chris today trying to come up to speed on how best to handle securely transmitting the SELinux context, and it does sound like there are a few subtleties that I had not been aware of that need to be resolved.

Specifically, if the salloc/sbatch/srun commands themselves have been placed into different contexts, then MUNGE would not see the context that would ideally be the one restored when launching the users' processes on the compute nodes.

Is this case -- where salloc/sbatch/srun have different labels -- an expected one for the LLNL deployment under discussion, or more of a hypothetical issue? I think this is critical to deciding what path we take, and does have significant ramifications for when parts of this may be able to be developed and by whom.

At the moment, my working expectation has been that salloc/sbatch/srun are all within whatever the users' "normal" context is, and have not had any different context applied that would lead to a domain transition. In which case, MUNGE could get that context through SO_PEERSEC on the socket, and embed that into a new revision of their RPC message type. Slurm would need to extract that new field, and apply it correctly on the compute nodes, but otherwise this path is relatively straightforward.

If they do have a different context, that is when things get tricky. I know Chris outlined a scenario where MUNGE looks up the pid of, e.g., srun, and walks up the process tree to find a usable label, but I'd really rather not see MUNGE tackle that complexity. There are all sorts of subtle race conditions that this would expose MUNGE to with how the /proc data is maintained -- Slurm itself has to deal with some of these and they're a huge hassle -- and I'd rather not see MUNGE adopt that complexity both from a security and performance standpoint.

There is a potential alternative here if the salloc/sbatch/srun processes do not map directly to the required context: simply rely on some other mechanism to inform Slurm in a secure fashion what the desired context is. This can be accomplished through the use of a "job_submit" plugin, which is a piece of code running inside the slurmctld process, and which could provide the appropriate result based on the uid/gid of the submitting user through some other mechanism.

Alternatively, it may be simpler to configure pam_selinux alongside Slurm's UsePAM=1 configuration option to have PAM itself decide what context to establish. (And I believe that could be accomplished today without any modification to Slurm or MUNGE.)
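
For reference, the pam_selinux arrangement would follow the same pattern as the stock sshd configuration (a sketch; the slurm service-file name is a guess on my part):

# /etc/pam.d/slurm (hypothetical service file)
session    required    pam_selinux.so close
session    required    pam_selinux.so open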


From: Steve Lawrence
Date: Mon, 05 Aug 2019 09:30:44 -0400

Is this case -- where salloc/sbatch/srun have different labels -- an expected one for the LLNL deployment under discussion, or more of a hypothetical issue?

srun will almost certainly be run under a different label. It's still hypothetical whether salloc/sbatch will run under a different label or can just run as the label of whatever executed them. It's possible for security reasons that salloc and sbatch should run as something different (likely the same as srun), but I don't know enough about what capabilities salloc/sbatch need compared to a normal user vs srun. I'm sure srun requires extra privileges (e.g. open network connections to the master). But what I don't know is what salloc and sbatch need or what they really do under the hood. No one has clarified this to me.

At the moment, my working expectation has been that salloc/sbatch/srun are all within whatever the users' "normal" context is, and have not had any different context applied that would lead to a domain transition. In which case, MUNGE could get that context through SO_PEERSEC on the socket, and embed that into a new revision of their RPC message type. Slurm would need to extract that new field, and apply it correctly on the compute nodes, but otherwise this path is relatively straightforward.

I think we need to change this expectation. I believe it will violate security requirements if, at a minimum, srun does not transition to something different, so getting the context via SO_PEERSEC will likely not work.

If they do have a different context, that is when things get tricky. I know Chris outlined a scenario where MUNGE looks up the pid of, e.g., srun, and walks up the process tree to find a usable label, but I'd really rather not see MUNGE tackle that complexity. There are all sorts of subtle race conditions that this would expose MUNGE to with how the /proc data is maintained -- Slurm itself has to deal with some of these and they're a huge hassle -- and I'd rather not see MUNGE adopt that complexity both from a security and performance standpoint.

Understandable. That said, it sounds to me like MUNGE is responsible and trusted to perform the security logic. Hence the reason why Slurm does not do it. It seems logical that it also handle this. But I'm open to alternatives. I'm not all that familiar with slurm/MUNGE.

There is a potential alternative here if the salloc/sbatch/srun processes do not map directly to the required context: simply rely on some other mechanism to inform Slurm in a secure fashion what the desired context is. This can be accomplished through the use of a "job_submit" plugin, which is a piece of code running inside the slurmctld process, and which could provide the appropriate result based on the uid/gid of the submitting user through some other mechanism.

Unfortunately, we cannot rely on uid/gid. In our particular use case, a user will switch between different security labels, but stay under the same uid/gid. We need slurm/MUNGE/something to determine what security label was used to submit a job and maintain that when executing the job.

Alternatively, it may be simpler to configure pam_selinux alongside Slurm's UsePAM=1 configuration option to have PAM itself decide what context to establish. (And I believe that could be accomplished today without any modification to Slurm or MUNGE.)

For similar reasons above, I'm not sure PAM will work since what the user logs in as isn't necessarily the same as the security label that a job could be executed as. A user can switch to one of many security labels after logging in and submit jobs from different labels.


From: Tim Wickberg
Date: Mon, 05 Aug 2019 09:35:48 -0600

srun will almost certainly be run under a different label. It's still hypothetical whether salloc/sbatch will run under a different label or can just run as the label of whatever executed them. It's possible for security reasons that salloc and sbatch should run as something different (likely the same as srun), but I don't know enough about what capabilities salloc/sbatch need compared to a normal user vs srun. I'm sure srun requires extra privileges (e.g. open network connections to the master). But what I don't know is what salloc and sbatch need or what they really do under the hood. No one has clarified this to me.

salloc/sbatch have very similar requirements to srun -- being able to connect out to the controller and compute nodes, as well as establish ephemeral TCP ports to listen on -- and would need equivalent privileges.

I think we need to change this expectation. I believe it will violate security requirements if, at a minimum, srun does not transition to something different, so getting the context via SO_PEERSEC will likely not work.

That's unfortunate, and at this point I believe precludes having MUNGE provide this info for us.

Unfortunately, we cannot rely on uid/gid. In our particular use case, a user will switch between different security labels, but stay under the same uid/gid. We need slurm/MUNGE/something to determine what security label was used to submit a job and maintain that when executing the job.

This is a problem as well. What you've described here implies the only way to find the "correct" label that the job should be running under is to walk up the process tree, and stop once an appropriate label has been identified.

However, building such a process tree representation in a secure manner is very difficult on a Linux system, and there are many subtle race conditions that -- in some cases -- cannot be overcome 100%. So modifying MUNGE to provide this is a non-starter as well from my perspective, as I certainly would not want to develop or submit insecure extensions to MUNGE, and I'm sure Chris would not accept such work either.

The salloc/sbatch/srun client commands themselves cannot provide this info either. They're inherently untrusted, and there is nothing that stops a user from installing modified versions bypassing any logic we add to handle this.

I think that we've thus reached an impasse, and I don't have a workable solution given your current set of requirements.

From my perspective, either salloc/sbatch/srun need to run under the context you expect the job to run with on the node, or you need to provide the slurmctld process with some way to securely look up the correct mapping for a given user/group (possibly alongside some additional job annotations such as QOS, account, or WCKey).


From: Steve Lawrence
Date: Mon, 05 Aug 2019 14:33:08 -0400

srun will almost certainly be run under a different label. It's still hypothetical whether salloc/sbatch will run under a different label or can just run as the label of whatever executed them. It's possible for security reasons that salloc and sbatch should run as something different (likely the same as srun), but I don't know enough about what capabilities salloc/sbatch need compared to a normal user vs srun. I'm sure srun requires extra privileges (e.g. open network connections to the master). But what I don't know is what salloc and sbatch need or what they really do under the hood. No one has clarified this to me.

salloc/sbatch have very similar requirements to srun -- being able to connect out to the controller and compute nodes, as well as establish ephemeral TCP ports to listen on -- and would need equivalent privileges.

Ok, then it sounds to me like ideally salloc/sbatch would have the same label as srun. This only adds extra complication, since it requires looking up a process tree rather than just a single parent.

Unfortunately, we cannot rely on uid/gid. In our particular use case, a user will switch between different security labels, but stay under the same uid/gid. We need slurm/MUNGE/something to determine what security label was used to submit a job and maintain that when executing the job.

This is a problem as well. What you've described here implies the only way to find the "correct" label that the job should be running under is to walk up the process tree, and stop once an appropriate label has been identified.

However, building such a process tree representation in a secure manner is very difficult on a Linux system, and there are many subtle race conditions that -- in some cases -- cannot be overcome 100%. So modifying MUNGE to provide this is a non-starter as well from my perspective, as I certainly would not want to develop or submit insecure extensions to MUNGE, and I'm sure Chris would not accept such work either.

Ok, I'll have to see if there are any alternatives. I have one idea, but it's extremely ugly -- we essentially need a duplicate srun/salloc/sbatch executable file for every different security label that one could transition from, and the user needs to execute the right one. SELinux could enforce that only the right one could be executed, but the user must know which one to execute, or we provide a wrapper script that figures it out. This allows srun to have additional privileges to submit jobs while still having each execution of srun get a unique label.

Out of curiosity, what are the race conditions with calculating parent process ids? I can certainly imagine complexities (e.g. parent processes terminating, various process states to deal with), but I'm not aware of any race conditions with security implications. Maybe something like srun and the parent exiting before munge reads /proc and another process starting with the same pid?

The salloc/sbatch/srun client commands themselves cannot provide this info either. They're inherently untrusted, and there is nothing that stops a user from installing modified versions bypassing any logic we add to handle this.

Agreed. That's why I was hoping MUNGE would be able to handle it.

From my perspective, either salloc/sbatch/srun need to run under the context you expect the job to run with on the node, or you need to provide the slurmctld process with some way to securely look up the correct mapping for a given user/group (possibly alongside some additional job annotations such as QOS, account, or WCKey).

Are there any annotations that are specific to a process? How are such annotations usually applied?


From: Tim Wickberg
Date: Mon, 05 Aug 2019 13:29:43 -0600

Ok, I'll have to see if there are any alternatives. I have one idea, but it's extremely ugly -- we essentially need a duplicate srun/salloc/sbatch executable file for every different security label that one could transition from, and the user needs to execute the right one. SELinux could enforce that only the right one could be executed, but the user must know which one to execute, or we provide a wrapper script that figures it out. This allows srun to have additional privileges to submit jobs while still having each execution of srun get a unique label.

That sounds like a complete nightmare to maintain, and I wouldn't expect to do anything in Slurm in support of such a model.

Out of curiosity, what are the race conditions with calculating parent process ids? I can certainly imagine complexities (e.g. parent processes terminating, various process states to deal with), but I'm not aware of any race conditions with security implications. Maybe something like srun and the parent exiting before munge reads /proc and another process starting with the same pid?

You've hit on the main issue, yes. There is no atomic method to interact with /proc; when you fetch the ppid of a process, there is always a potential race in between reading that value and opening /proc/$ppid/stat to fetch further info on that parent in which the parent has exited and the pid reassigned to a new process.

Are there any annotations that are specific to a process? How are such annotations usually applied?

At job submission and/or step launch time through various arguments. The way in which a given site uses these varies widely, but the --qos, --partition, and --wckey ones are the most frequently used to implement some sort of policy. There are plugin APIs available -- cli_filter or SPANK in particular -- that can help set up some of this automatically.

But I don't think that helps here. Any of these can be set to arbitrary values by the user -- enforcement happens within our slurmctld process on the control node -- and thus cannot be used for security.

To rephrase some of my earlier remarks, if the design for the SELinux contexts on these systems has salloc/sbatch/srun running in some other context and not the desired context for the users' processes to launch in on the compute nodes, I do not know that there is anything we can do here.

Please understand that anything I build in Slurm needs to be usable on other systems. I cannot build a one-off set of commands, installation mechanisms, or other work that is only of use in this very specific LLNL deployment. That unfortunately seems to be the path we're headed down at this point.


From: Steve Lawrence
Date: Mon, 05 Aug 2019 16:48:10 -0400

Ok, I'll have to see if there are any alternatives. I have one idea, but it's extremely ugly -- we essentially need a duplicate srun/salloc/sbatch executable file for every different security label that one could transition from, and the user needs to execute the right one. SELinux could enforce that only the right one could be executed, but the user must know which one to execute, or we provide a wrapper script that figures it out. This allows srun to have additional privileges to submit jobs while still having each execution of srun get a unique label.

That sounds like a complete nightmare to maintain, and I wouldn't expect to do anything in Slurm in support of such a model.

Agreed. Nothing in slurm would need to change with this. This would all need to be managed by LLNL if we went this route.

Out of curiosity, what are the race conditions with calculating parent process ids? I can certainly imagine complexities (e.g. parent processes terminating, various process states to deal with), but I'm not aware of any race conditions with security implications. Maybe something like srun and the parent exiting before munge reads /proc and another process starting with the same pid?

You've hit on the main issue, yes. There is no atomic method to interact with /proc; when you fetch the ppid of a process, there is always a potential race in between reading that value and opening /proc/$ppid/stat to fetch further info on that parent in which the parent has exited and the pid reassigned to a new process.

I'm surprised no one has written up a reasonable solution to this problem, but I did find mentions of using openat to do it. If we were able to resolve the potential race issue, although with some complexity, would you be open to adding the capability to MUNGE? I'm wondering if something like this C-ish code would work to safely (ignoring error checking/closing descriptors/etc.) walk up parent pids, via the use of an open/openat combination:

char* get_srun_parent_context(int srun_pid) {

  char *context = NULL;  // SELinux context, filled in by getpidcon() below

  int pid_dir = open("/proc/$(srun_pid)/", ...);

  while (true) {
    // open the /proc/pid/stat file and extract the ppid
    int pid_stat = openat(pid_dir, "stat", ...);
    int ppid = get_ppid_from_stat(pid_stat);

    // error if ppid is 1 or a subreaper process, this pid was
    // orphaned

    // get the pid context
    getpidcon(ppid, &context);

    // get what we assume is a file descriptor to the ppid /proc dir
    int ppid_dir = open("/proc/$(ppid)/", ...);

    // re-open the /proc/pid/stat file and extract the ppid
    int check_pid_stat = openat(pid_dir, "stat", ...);
    int check_ppid = get_ppid_from_stat(check_pid_stat);

    if (ppid != check_ppid) {
      // The stat ppid changed (should be 1 or a subreaper pid),
      // meaning the parent pid terminated at some point after reading
      // stat the first time. We don't know when, so whatever context
      // we got or we opened for ppid_dir may have been some other
      // process that reused the process id. So we cannot trust any
      // information we got. Must error.
      return NULL;
    }

    // check if this is still an srun client process or not
    int is_client = is_srun_client_context(context);

    if (!is_client) {
      return context;
    } else {
      // loop around using the new ppid as pid
      pid_dir = ppid_dir;
    }
  }
}

I think the only race-y thing here is the original open of pid_dir (i.e. opening the srun /proc directory), but I presume something similar can be done to validate that we get a file descriptor to the correct srun process (e.g. checking ppid).

To rephrase some of my earlier remarks, if the design for the SELinux contexts on these systems has salloc/sbatch/srun running in some other context and not the desired context for the users' processes to launch in on the compute nodes, I do not know that there is anything we can do here.

Please understand that anything I build in Slurm needs to be usable on other systems. I cannot build a one-off set of commands, installation mechanisms, or other work that is only of use in this very specific LLNL deployment. That unfortunately seems to be the path we're headed down at this point.

Completely understood. I don't believe the need to get the parent process context is specific to LLNL's use case. IMHO, any system that cares enough about security to want Slurm SELinux support should also want to have separate labels for srun/salloc/sbatch, and should also want to use the label of the submitting process instead of one of these labels. This follows the principle of least privilege and ensures that processes that can execute srun do not need to have all the privileges that srun requires.


From: Chris Dunlap
Date: Thu, 08 Aug 2019 17:56:20 -0700

Sorry for the delayed response.

I'm open to adding SELinux support to MUNGE -- assuming all potential race conditions and issues of correctness are addressed, obviously. The main issue is a question of when. There are additional complexities that will take time.

This code would be running in a munged thread that cannot block. I don't expect these calls to block since they're accessing kernel data structures, but they all have to be checked. If some could potentially block, that adds more complexity.

There's the issue of the credential format, which needs to be updated to an extensible format in order to support new features. And there's the compatibility code to handle both the new and old formats.

There are portability concerns. The SELinux bits have to be properly autoconf'd and tested on multiple platforms, some of them dating back years. MUNGE still has to build on systems that don't support SELinux. Builds that don't support SELinux need to handle credentials from builds that do, and vice-versa. Automated tests need to be added to the test suite.

There's the performance impact, which becomes a concern at scale. The cost of these additional system calls adds up and will reduce credential throughput. I might be able to mitigate this some by checking that the process start time is monotonically decreasing as you traverse up the process tree, but I'd need to research that more. (There was a recent issue where an attacker could delay the process start time since fork() is not atomic.)
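
For illustration, that start-time check might be built on something like this (a minimal sketch; per proc(5), field 22 of /proc/<pid>/stat is the start time in clock ticks since boot, and the strrchr(')') parse skips the comm field, which can contain spaces):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Minimal sketch: return the start time of pid (field 22 of
 * /proc/<pid>/stat), or 0 on error.  A parent must have started no
 * later than its child; if the "parent" started afterwards, the
 * ppid has been recycled by an unrelated process. */
static unsigned long long proc_start_time (pid_t pid)
{
    char path [64];
    char buf [1024];
    unsigned long long start = 0;

    snprintf (path, sizeof (path), "/proc/%ld/stat", (long) pid);
    FILE *fp = fopen (path, "r");
    if (fp == NULL)
        return 0;
    char *line = fgets (buf, sizeof (buf), fp);
    fclose (fp);
    if (line == NULL)
        return 0;
    char *p = strrchr (buf, ')');  /* comm can contain spaces */
    if (p == NULL)
        return 0;
    /* Skip fields 3-21 (19 fields), then read field 22. */
    if (sscanf (p + 1, " %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s"
                " %*s %*s %*s %*s %*s %*s %*s %*s %*s %llu", &start) != 1)
        return 0;
    return start;
}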

There's the requisite plumbing to extend the libmunge API and client/server protocol. There should ideally be compatibility code to handle the new and old protocols.

There's the munge/unmunge/remunge clients that need to be updated accordingly. This, at least, is relatively straightforward.

These are just the issues off the top of my head. I haven't had time to do a detailed design. There could be other surprises. In short, this feature adds a lot of complexity that will take time to properly address.

Initially, I was told the anticipated time frame for this would be around the end of August. That's not a target I can meet. My plan for the short-term was to have Slurm do the security context lookup and pass the context string to munge_encode() for embedding into the credential payload; the SELinux policy could be updated to protect against a modified/malicious srun client. But Tim says any changes being made to Slurm cannot be just for a very specific LLNL deployment.

So far, I've largely been gathering requirements. I'd want to allow for at least 6 months development time to address the concerns noted above. That time would have to be negotiated with my management.


From: Steve Lawrence
Date: Fri, 09 Aug 2019 07:51:12 -0400

Ok. Thanks for the response. I'll talk to Carlos and see if any of the options that require less intrusive MUNGE changes (i.e. just get the context of the socket) can be made to work. The options are pretty ugly and it doesn't feel like the "right" way to do it, but it's certainly less complex and more likely to fit in their time frame.


From: Steve Lawrence
Date: Fri, 09 Aug 2019 09:03:41 -0400

Thinking more on this, I think I may have a solution that's fairly clean from an SELinux perspective (more complicated and a bit awkward, but nothing too bad), but requires a reliable and specific process tree from slurm.

The required changes to MUNGE follow the original approach, in which MUNGE gets the context of the srun socket (no process tree craziness) and a slurm process will use setexeccon(3) to restore that context at the appropriate time.

But I need some clarification on how a job is executed to be sure this will work. Can you verify the process tree that leads to the execution of the job once slurmd receives it? As I recall the slurmd daemon spawns a new instance of slurmstepd for every job, and then slurmstepd executes the actual job binary? So a process tree of multiple jobs running might look something like this:

slurmd
|-slurmstepd
| `-job1.sh
`-slurmstepd
  `-job2.sh

Are the slurmstepd processes always unique for each job, and do not share any resources with any other jobs?

If so, we can define the SELinux policy so srun/salloc get a unique type based on what executes it without too much hassle. So every srun instance gets a unique and limited type. Then slurmd can execute slurmstepd with that type. And finally slurmstepd will transition back to the original type when it executes a job.

So the whole thing looks something like this:

On the job submission node:

gdm
|-bash (projectA_t)
| `-salloc (projectA_srun_t)
|   `-srun job1 (projectA_srun_t)
`-bash (projectB_t)
  `-salloc (projectB_srun_t)
    `-srun job2 (projectB_srun_t)

MUNGE will then get the context of the srun socket (i.e. projectX_srun_t) and securely pass that to the master slurm node.

On the master node:

When slurmd executes slurmstepd it will transition (via setexeccon(3)) to the appropriate srun type that MUNGE gathered. And then slurmstepd will transition (via normal SELinux policy) to the original type, e.g.:

slurmd (slurmd_t)
|-slurmstepd (projectA_srun_t)
| `-job1.sh (projectA_t)
`-slurmstepd (projectB_srun_t)
  `-job2.sh (projectB_t)
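
In code, the slurmd side of that transition would be something like this (a minimal sketch; the slurmstepd path and context string are placeholders):

#include <selinux/selinux.h>
#include <unistd.h>

/* Minimal sketch: make the next exec from this process run under the
 * context MUNGE extracted, e.g. "staff_u:staff_r:projectA_srun_t",
 * then exec slurmstepd (the path is a placeholder). */
static int spawn_stepd(const char *context)
{
    if (setexeccon(context) < 0)  /* applies only to the next execve() */
        return -1;
    execl("/usr/sbin/slurmstepd", "slurmstepd", (char *) NULL);
    return -1;  /* reached only if execl() failed */
}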

This is a bit awkward (and potentially a security violation) because the special *_srun_t type needs the combined privileges of srun/salloc and slurmstepd. (Really slurmstepd should be something like *_slurmstepd_t, but we can't do that.) There are ways we can split things up in policy so projectX_srun_t has different privileges depending on whether it's on a master node or not, but it's a bit harder to manage.

Some questions:

1. Is my understanding of the process tree correct for executing a job?

2. Are any resources shared/reused between different slurmstepd processes? Are they truly one-off and solitary processes per job?

3. Do srun/salloc and slurmstepd require essentially the same permissions, or are they really very different? I assume they're pretty different, but maybe not? Is slurmstepd what actually communicates back to salloc/srun to provide the job output, or is that routed through slurmd?

4. Are there any concerns with these changes to MUNGE and slurmd? I think this is back to the original approach, so hopefully not.


From: Chris Dunlap
Date: Fri, 09 Aug 2019 16:26:19 -0700

Slurm specifics are outside my bailiwick, so I'll leave those for Tim.

To be honest, I don't fully understand the proposed SELinux policy. It seems convoluted to me, and that makes me think this approach is likely to result in a site-specific and workload-specific feature.

Also, relying on specific Slurm behavior makes this a resource manager specific solution, not a general-purpose one. Any features I add to MUNGE have to provide a general-purpose solution.

While spelunking /proc to locate an appropriate process adds substantial complexity, it also seems to offer a general-purpose solution. Removing that part only removes ~20% of the work involved. The rationale for placing that code in MUNGE is to remove the need for applications to roll their own (potentially incompatible or insecure) solution. MUNGE strives to provide basic user authentication information in a simple, secure, fast, and portable manner.

Unfortunately, properly addressing this complexity will take significant time. I don't see a quick path forward for adding this feature to MUNGE. Given the time requirements, you might want to pursue other solutions.


From: Steve Lawrence
Date: Tue, 13 Aug 2019 09:21:13 -0400

As someone who's done a lot of SELinux work, this really isn't anything that convoluted from an SELinux perspective. Ensuring that domain transitions happen with the right domains in the right spots is a good chunk of what gives SELinux its strength, and also what can make it seem complicated. So while this might seem convoluted, it really isn't anything too crazy to anyone familiar with SELinux.

The only thing here that complicates things is that slurm/MUNGE need to be SELinux aware, which does require understanding how the processes all interact to know the correct place to insert the awareness. There's really no way around this, regardless of which approach is taken.

And there are some good security reasons to have srun/salloc transition to something like projectA_srun_t, so regardless of whether MUNGE just uses the context of srun/salloc or looks up the process tree, we'll probably still do that. And the security reasons also apply to slurmstepd too (though, it's still not clear to me how slurmstepd gets information back to salloc/srun). So restoring the process tree context might not actually be the best thing, since it doesn't allow separating slurmstepd.

I'm not sure I understand what makes this not general purpose. I believe the way MUNGE works right now is it captures UID/GID information from srun, provides that info to slurmd, and then slurmd restores that at some point. I think we should be able to basically do the same thing but with SELinux. And SELinux policy can be made as complicated or as simple as needed to take advantage of this.

As an example of simple usage, assume we didn't have any of our LLNL project stuff going on, there was just a single user, and we don't care about transitioning srun. In this case, the SELinux policy would be configured to do this:

gdm
`-bash (user_t)
  `-salloc (user_t)
    `-srun job1.sh (user_t)

So salloc and srun do not have transitions and they all run in user_t. MUNGE would capture user_t from the srun socket (no process tree) and send that to slurmd, which looks like this:

slurmd (slurmd_t)
`-slurmstepd (user_t)
  `-job1.sh (user_t)

So slurmd would be SELinux aware to execute slurmstepd as user_t, which would then execute the job. This way, the job is executed exactly as if the user had executed it.

So in the non-complicated LLNL case, this is all pretty straightforward. And LLNL will just add some extra transitions (that MUNGE/slurm do not need to be aware of) to make sure things are a bit more restricted to meet its security requirements.

Another question I thought of:

1. Where does the UID/GID get restored by slurm? Is slurmstepd executed as the original UID/GID, or just the job that slurmstepd executes?

The other questions for reference:

2. Is my understanding of the process tree correct for executing a job?

3. Are any resources shared/reused between different slurmstepd processes? Are they truly one-off and solitary processes per job?

4. Do srun/salloc and slurmstepd require essentially the same permissions, or are they really very different? I assume they're pretty different, but maybe not? Is slurmstepd what actually communicates back to salloc/srun to provide the job output, or is that routed through slurmd?

5. Are there any concerns with these changes to MUNGE and slurmd? I think this is back to the original approach, so hopefully not.


From: Steve Lawrence
Date: Tue, 13 Aug 2019 11:05:45 -0400

Just want to add that after talking with some colleagues, they agree that the process tree thing isn't as good as it sounds once we really dug into the implications. It sounds right, but doesn't allow for as much confinement of slurmstepd as one would really like. And as you mentioned, the process tree lookup is complex and potentially racy if not done correctly.

I think the best way forward, for both time and capabilities, is for MUNGE to just get the context of the peer socket via getpeercon(3) and set that at the appropriate point in slurm via setexeccon(3). It's relatively simple for MUNGE, isn't specific to LLNL, and an SELinux policy can accomplish a good deal of lockdown with it, including reaching LLNL's requirements.

A colleague also mentioned that it might make things cleaner if MUNGE also had the ability to do some sort of translation on the context. For example, if getpeercon(3) returned "projectA_srun_t", ideally it would call setexeccon(3) with "projectA_slurmstepd_t". Such a translation is common in things like pam, where the context of a user is different based on how the user logs in (e.g. sshd vs locallogin). A simple search/replace (e.g. s/srun/slurmstepd/g) would probably suffice, but something like a lookup table could also work. There are also options like making it part of SELinux policy, with MUNGE being extra SELinux aware and becoming an SELinux userspace object manager. So lots of options. But this isn't strictly needed -- it would just make the SELinux parts a bit cleaner.
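
A sketch of the search/replace version (the srun -> slurmstepd type-naming convention is just an assumption on my part):

#include <selinux/context.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch: rewrite the type field of a context, mapping e.g.
 * "projectA_srun_t" to "projectA_slurmstepd_t".  Returns a malloc'd
 * context string, or NULL if no translation applies. */
static char *translate_context(const char *peercon)
{
    char *result = NULL;
    context_t con = context_new(peercon);
    if (con == NULL)
        return NULL;
    const char *type = context_type_get(con);  // e.g. "projectA_srun_t"
    const char *hit = (type != NULL) ? strstr(type, "srun") : NULL;
    if (hit != NULL) {
        char newtype[256];
        // splice "slurmstepd" in place of "srun"
        snprintf(newtype, sizeof(newtype), "%.*sslurmstepd%s",
                 (int) (hit - type), type, hit + strlen("srun"));
        if (context_type_set(con, newtype) == 0)
            result = strdup(context_str(con));
    }
    context_free(con);
    return result;
}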

I do still have some unknowns about exactly how slurmd/slurmstepd/jobs interact, so those questions need to be answered before I can be 100% confident this solution works.

@dun
Owner Author

dun commented Dec 22, 2022

Cautionary note since munged will presumably need to spelunk /proc to find the appropriate security context to embed.

[Linux] /proc/pid/stat parsing bugs
https://www.openwall.com/lists/oss-security/2022/12/21/6

Most of the code splits it by space and takes an N-th field.
The problem is that the process name "(cat)" can contain spaces (and
brackets). Potentially some important software (containers/sandboxes)
can be tricked into getting wrong data, and I've seen cases close to
stack overflows (buffer for a fixed number of fields is allocated on
stack).

The only way to parse it is to do strrchr(')') first (fortunately it
contains just one unescaped string).
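
A minimal sketch of the parse the advisory recommends, for reference:

#include <stdio.h>
#include <string.h>

/* Minimal sketch: extract the ppid (field 4) from a /proc/<pid>/stat
 * line.  The "(comm)" field can contain spaces and brackets, so find
 * the last ')' first and parse the remaining fields after it.
 * Returns the ppid, or -1 on malformed input. */
static long ppid_from_stat (const char *line)
{
    long ppid;
    const char *p = strrchr (line, ')');  /* end of "(comm)" */
    if (p == NULL)
        return -1;
    if (sscanf (p + 1, " %*c %ld", &ppid) != 1)  /* skip state, read ppid */
        return -1;
    return ppid;
}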

@wickberg

The SELinux context can be fetched using the SO_PEERSEC option to getsockopt()... I wouldn't expect MUNGE to do anything unusual with /proc there.
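
For reference, that lookup is roughly (a minimal sketch; error handling omitted):

#include <sys/socket.h>

/* Minimal sketch: fetch the SELinux label of the peer connected to
 * socket sd.  On success the kernel fills buf with the context and
 * *lenp with its length. */
static int peer_label(int sd, char *buf, socklen_t *lenp)
{
    return getsockopt(sd, SOL_SOCKET, SO_PEERSEC, buf, lenp);
}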

@dun
Owner Author

dun commented Dec 22, 2022

There's been some discussion in this thread on traversing /proc to find the appropriate process to pass to getpidcon().
