Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

munged crashes when processing supplementary group info under glibc #2

Closed
GoogleCodeExporter opened this issue May 15, 2015 · 12 comments
Labels
Milestone

Comments

@GoogleCodeExporter
Copy link

What steps will reproduce the problem?

Add the following to /etc/passwd (this would never exist in production, but is used to emulate the fact that ldap groups can have case insensitive users defined in them)

dummy1:x:9999:9999::/tmp:/bin/false
Dummy1:x:9999:9999::/tmp/bin/false

to /etc/group

dummy:x:9999:dumm1,Dummy1

What is the expected output? What do you see instead?

munge should run and not crash. Munge will crash.

What version of the software are you using? On what operating system?

I have tried EPEL 0.5.8 (x86_64) and self built 0.5.9 (x86_64,i386) with identical results under Centos 5.5.

Please provide any additional information below.

Munge was quitting without any warning after 1 hour of running. The only clue in the logs was when it started it spit out an ...

2010-10-27 20:21:21 Error:     Unable to query group info: Permission denied

Tracing the source it became obvious that munge can not handle the possibility that the same uid is mapped to two different strings, including case insensitivity. The fix for the most obvious issues should be as simple as altering the user comparison to a stricmp, but that does not fix the silent crash of munge during some edge cases and a group info reload.

Original issue reported on code.google.com by ericew on 28 Oct 2010 at 12:35

@GoogleCodeExporter
Copy link
Author

The more I dig the more confused I get. munge mixes pthreads and getgrent calls, why was this not migrated to the reentrant versions of the calls?

Original comment by ericew on 28 Oct 2010 at 2:34

@GoogleCodeExporter
Copy link
Author

Without patch:

-bash-3.2$ id
uid=18070(ew2193) gid=972(faculty) 
groups=972(faculty),1000(shared),1015(strnmr),1016(mullin),1024(lcg),1067(ritsta
ff),1106(evalcon),1131(bsmlab),1223(pvc)
-bash-3.2$ munge -n -g mullin > /tmp/bug_test
-bash-3.2$ unmunge < /tmp/bug_test
unmunge: Error: Unauthorized credential for client uid=18070 gid=972
-bash-3.2$ newgrp mullin
bash-3.2$ unmunge < /tmp/bug_test
STATUS:           Success (0)
ENCODE_HOST:      hermes.rit.albany.edu (169.226.65.36)
ENCODE_TIME:      2010-10-28 15:04:44 (1288292684)
DECODE_TIME:      2010-10-28 15:05:16 (1288292716)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              ew2193 (18070)
GID:              faculty (972)
GID_RESTRICTION:  mullin (1016)
LENGTH:           0

With patch:

-bash-3.2$ id
uid=18070(ew2193) gid=972(faculty) 
groups=972(faculty),1015(strnmr),1016(mullin),1024(lcg),1067(ritstaff),1106(eval
con),1131(bsmlab),1223(pvc)
-bash-3.2$ munge -n -g mullin > /tmp/bug_test
-bash-3.2$ unmunge < /tmp/bug_test
STATUS:           Success (0)
ENCODE_HOST:      hermes-04.rit.albany.edu (169.226.65.69)
ENCODE_TIME:      2010-10-28 15:06:16 (1288292776)
DECODE_TIME:      2010-10-28 15:06:23 (1288292783)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              ew2193 (18070)
GID:              faculty (972)
GID_RESTRICTION:  mullin (1016)
LENGTH:           0

Right now it's a big patch to gids.c, but really it should be cleaned up some since there are essentially two different ways to fix the problem. The easier one removes all the hashing code and uses getgrouplist which is nscd/ldap friendly and works. The other is a fix to the existing group scanning code to use reentrant (yes, it makes a difference on some systems). Really if getgrouplist is on the system then we should jettison a lot of code and drop some command line arguments.

Original comment by ericew on 28 Oct 2010 at 7:11

@GoogleCodeExporter
Copy link
Author

GoogleCodeExporter commented May 15, 2015

As a work-around, the gids code path can be disabled by running munged with --group-update-time=-1. The only functionality you'll lose is restricting credentials by supplementary GIDs.

So far, I've been unable to reproduce this behavior here under RHEL 5.5. One difference between our systems is that we're not running LDAP.

The same UID can be mapped to different user names. User names and group names are treated as case-sensitive strings like they are with the standard password & group routines. I don't understand what the problem is with this behavior. I'm not very familiar with LDAP.

Re: comment 1, getgrent_r() and getpwnam_r() were not used since they are only called in/from _gids_hash_create(), and only one instance of that routine can be running at a time within munged. As such, the static buffers in getgrent() and getpwnam() didn't seem to be an issue.

Re: comment 2, getgrouplist() is non-standard. It's not available on AIX or Solaris, so an alternate method would still be required.

Also, the comment 2 "without patch" excerpt is curious. I don't see why newgrp was needed there unless you had disabled the gids hash (e.g., via --group-update-time=-1). If you build or configure with CFLAGS='-DGIDS_DEBUG', the gids hash will be dumped to stdout every time it is rebuilt. Of course, this is only useful if you're running munged in the foreground (-F).

I had a similar bug report a while back with 0.5.8 under RHEL 5.2 (glibc-2.5-25) where munged would segfault in getpwnam(). That user was also running LDAP. I created a small reproducer (gids-test.c attached) that mostly mimicked the calls in gids.c. He confirmed the reproducer worked (i.e., segfaulted). Note that it is not threaded. Try it and let me know whether it works for you.

I'm still tracing through gids.c, but nothing major has jumped out at me yet.

gids-test.c.gz

Original comment by chris.m.dunlap on 29 Oct 2010 at 3:55

@GoogleCodeExporter
Copy link
Author

Results from gids-test was identical to munge. The test end in a

*ERROR* cannot query group info: Permission denied

Which mirrors the error I was getting

2010-10-27 20:21:21 Error:     Unable to query group info: Permission denied

As I have worked on this there are a number of factors that have come up.

  1. getgrent and getgrent_r are two different code paths in glibc. Altering gids-test to be re-entrant causes the error to go away. There also appears to be some hostility in the glibc community against using the getgrent for anything more than legacy code. (below is diff to gids-test.c).

    --- gids-test.c 2010-10-29 07:53:54.000000000 -0400
    +++ gids-test-r.c       2010-10-29 08:13:16.000000000 -0400
    @@ -6,6 +6,8 @@
    #include <string.h>
    #include <sys/types.h>
    
    +#define BUFLEN 4096
    +
    int
    main (int argc, char *argv[])
    {
    @@ -17,21 +19,21 @@
     uid_t           uid;
     gid_t           gid_tmp;
    
    +    struct group grp;
    +    char buf[BUFLEN];
    +    int i;
    +
     setgrent ();
    
     while (1) {
    -        errno = 0;
    -        gr_ptr = getgrent ();
    +       i = getgrent_r(&grp, buf, BUFLEN, &gr_ptr);
    +
    +       if (i == ERANGE) {
    +               fprintf(stderr, "*ERROR* ERANGE Insufficient buffer space supplied. Try again with larger buffer.\n");
    +       }
    +       if (i)
    +               break;
    
    -        if (gr_ptr == NULL) {
    -            if ((errno == 0) || (errno == ENOENT))
    -                break;
    -            if (errno == EINTR)
    -                continue;
    -            fprintf (stderr, "*ERROR* cannot query group info: %s",
    -                    strerror (errno));
    -            exit (EXIT_FAILURE);
    -        }
         gid = gr_ptr->gr_gid;
         printf ("  group [%s] (gid=%u)\n",
                 (gr_ptr->gr_name ? gr_ptr->gr_name : "*NULL*"), gid);

    http://cygwin.ru/ml/libc-alpha/2006-08/msg00029.html

  2. The error does not crash munge right away, but causes some lingering problem that crashes the daemon on the next refresh of the group information. If the first run fails we should either be bailing out completely with some sane error message or we should be turning off refreshes with a note to the logs. Silently failing on the next refresh was infuriating to track down. I see at least one other case in the slurm mailing lists where this appears to be the case.

  3. Even if this bug is fixed the supplemental gid information will still be incomplete and inaccurate for some systems. On many enterprise systems including some LDAP, NIS, and AD style authentication a non-root user does not have access to a dump of the passwd or group databases. The way around this on linux and bsd systems with nscd is to use the glibc getgrouplist. I understand that this is not a standard call which is why my patch also modifies the autoconf to test for its presence before enabling that feature. When found it jettisons 100% of the code for gid hashing and replaces the supplemental group test function with a call to getgrouplist.

I plan on cleaning up and submitting 3 patches. I hope to have them by the end of the day barring any interruptions.

  1. Changes gids.c to re-entrant code. This should be pretty small and benign.
  2. A patch for getgrouplist support.
  3. A patch against 0.5.9 configure.in and configure so that people don't have to rebuild it themselves.

Original comment by ericew on 29 Oct 2010 at 1:04

@GoogleCodeExporter
Copy link
Author

Thanks for running this down, especially since I've been unable to reproduce it here. I'm somewhat surprised getgrent() was the culprit under the circumstances, but gids-test.c & gids-test-r.c seem to confirm it. Fortunately, it's a fairly straightforward correction to make.

Re: 2), this matches the comment 2 "without patch" curious behavior and explains why newgrp was needed there. In gids.c, return values are checked and errors are logged. If an error is encountered while building the next map, the previous map is kept and updates continue as scheduled since the error is likely transient. I'm guessing the silent crash-on-next-update behavior was memory corruption from getgrent() that will go away with the switch to getgrent_r() & getpwnam_r(). Time will tell.

Re: 3), munged can be run as root if needed; clients still won't need root privileges. I have concerns with getgrouplist() in the credential decode path. Response time and availability of a network service are two issues that immediately come to mind. Then there's the issue of performance since you also need getpwuid() to convert the uid within the credential into a username for getgrouplist(). I'm not comfortable with adding support for getgrouplist() at this time.

Original comment by chris.m.dunlap on 31 Oct 2010 at 8:53

  • Changed title: munged crashes when processing supplementary group info under glibc
  • Changed state: Accepted

@GoogleCodeExporter
Copy link
Author

  1. What about a lazy caching scheme? getgrouplist and getpwuid results saved for a time using the hashes. Timecode would determine if getgrouplist needs to be refreshed. Possibly merging the two ways to get group info where available and offering it as a command line option or even having a stub, like the gids-test, that can provide some feedback to the user to see which method might be more appropriate.

The lazy caching scheme would add a field to the gid hash and otherwise be low impact. The getgrouplist implementation can be in parallel code not enabled by default while it "bakes". Maybe I can also add a one-off check after the decode path to confirm that getgrent_r and getgrouplist agree and throw a warning into the logs if not.

I'll rework my patch to separate the code paths and not jettison the original code. That will allow the patch back in without any compromises at this point. Users can enable it with a flag if it's appropriate for their environment.

Original comment by ericew on 1 Nov 2010 at 1:03

@GoogleCodeExporter
Copy link
Author

This issue was updated by svn:r854 / 842ec31.

Created issue-2 branch.

Original comment by chris.m.dunlap on 3 Nov 2010 at 8:19

  • Changed state: Started

@GoogleCodeExporter
Copy link
Author

GoogleCodeExporter commented May 15, 2015

getgrent -> getgrent_r patch. It's pretty minimal. I have been swamped, but I do plan on submitting a getgrouplist branch that can be enabled via command line switch.

munge-issue2.patch.gz

Original comment by ericew on 3 Nov 2010 at 9:48

@GoogleCodeExporter
Copy link
Author

Thanks for confirming that getgrent_r() here fixes the problem under glibc.

I can't simply replace getgrent() with getgrent_r() -- some platforms don't have it, and others use different prototypes. It's somewhat of a pain to deal with, which is one reason why the non-reentrant routines were used in the first place. Ditto for getpwnam(). But I'll deal with it.

As for getgrouplist(), I won't allow that to be called from a worker thread. The reason the gids map is built in a separate thread is to prevent it from adversely impacting credential processing time.

Original comment by chris.m.dunlap on 4 Nov 2010 at 3:16

@GoogleCodeExporter
Copy link
Author

Out of curiosity, does running gids-test under valgrind provide any more information as to where the underlying problem lies?

Original comment by chris.m.dunlap on 4 Nov 2010 at 3:19

@GoogleCodeExporter
Copy link
Author

This issue was closed by svn:r889 / dfe5ea5.

Original comment by chris.m.dunlap on 20 Jan 2011 at 1:42

  • Changed state: Fixed

@GoogleCodeExporter
Copy link
Author

Original comment by chris.m.dunlap on 4 Feb 2011 at 3:31

  • Added labels: Milestone-0.5.10

@dun dun added bug and removed auto-migrated labels Jun 4, 2015
@dun dun added this to the 0.5.10 milestone Jun 4, 2015
dun added a commit that referenced this issue Jun 17, 2015
dun added a commit that referenced this issue Oct 7, 2020
dun added a commit that referenced this issue Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants