
Intel MPI (versions in 2019a and 2019b) segfaults on CentOS 8 #11762

Open
branfosj opened this issue Nov 25, 2020 · 11 comments

@branfosj
Member

branfosj commented Nov 25, 2020

  • Intel 2019a toolchain: impi-2018.4.274-iccifort-2019.1.144-GCC-8.2.0-2.31.1.eb
  • Intel 2019b toolchain: impi-2018.5.288-iccifort-2019.5.281.eb

These Intel MPI versions segfault on CentOS 8.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 892328 RUNNING AT bear-pg0206u40a
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 892328 RUNNING AT bear-pg0206u40a
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

From https://github.com/easybuilders/easybuild/wiki/Conference-call-notes-20201125#qa

at HPC-UGent the impi in intel/2019b was bumped to the one used in intel/2020a

We've decided to follow the same approach and have bumped the impi in intel/2019a and intel/2019b.

@boegel boegel added this to the 4.x milestone Nov 26, 2020
@boegel
Member

boegel commented Nov 26, 2020

@hajgato Care to pitch in here?
We've seen the same on RHEL 8.2, right?

Have we looked at possible workarounds?

@hajgato
Collaborator

hajgato commented Nov 26, 2020

As far as I remember, those IMPI libs clash with glibc, so there is not much we can do. If it is the known CHAR_WIDTH clash (I never checked), then we might be able to change that symbol in the mpi.so to, for example, CHAR_VIDTH.

@branfosj
Member Author

In our testing of intel/2019a we experienced problems with impi using the UCX 1.5.1 that is already in the repo (UCX-1.5.1-GCCcore-8.2.0.eb). We bumped this to 1.6.1 and have not had any issues when testing using OSU Micro Benchmarks.

@hajgato
Collaborator

hajgato commented Dec 22, 2020

I have found a workaround that works on RHEL 8.2 and AMD Rome with impi/2018.4.274-iccifort-2019.1.144-GCC-8.2.0-2.31.1 (it's not my job to judge things...)

https://software.intel.com/content/www/us/en/develop/articles/resolving-segfaults-in-legacy-intel-mpi-library-on-newer-linux-distributions.html

Just for safety, I'm copying the relevant part here:
If you must remain on an older version of the Intel® MPI Library, there is a workaround for this issue. Create a file strtok_proxy.c with the following code:

#define _GNU_SOURCE 1   /* needed for RTLD_NEXT */

#include <string.h>
#include <dlfcn.h>

typedef char *(*strtok_t) (char *str, const char *delimiters);
static strtok_t strtok_orig_p = NULL;

/* Interpose strtok: on the first call, resolve the real strtok via dlsym(RTLD_NEXT)
 * and replace a NULL str argument (the call pattern that segfaults on newer glibc)
 * with an empty string. */
char *strtok(char *str, const char *delimiters)
{

    if (strtok_orig_p == NULL) {
        strtok_orig_p = (strtok_t) dlsym(RTLD_NEXT, "strtok");
        if (strtok_orig_p == NULL) {
            /* could not resolve the real strtok: error handling */
            return NULL;
        }
        if (str == NULL) {
            /* first-ever call with NULL would crash; substitute an empty string */
            str = "";
        }
    }

    return strtok_orig_p(str, delimiters);
}

Compile this file using the following commands:

gcc -c -Wall -Werror -fpic ./strtok_proxy.c
gcc -ldl -shared -o ./strtok_proxy.so ./strtok_proxy.o

And apply the generated library at runtime using the following:

export LD_PRELOAD=./strtok_proxy.so
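
For what it's worth, instead of exporting LD_PRELOAD globally, the preload can also be scoped to a single run (just a quick sketch; the benchmark binary name is only a placeholder):

# hypothetical one-off test: preload the proxy only for this mpirun invocation
LD_PRELOAD=$PWD/strtok_proxy.so mpirun -np 2 ./osu_latency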

@boegel
Member

boegel commented Dec 22, 2020

So I guess we should update the impi easyblock to do this automagically for the versions that require it...

I'm wondering how to detect that this dirty hack is needed though... Just using nm on /usr/lib/libc.so.6 doesn't seem to be sufficient.

It would be nice if they provided a bit more context, but I guess we should be happy that there's a workaround. :)

@branfosj
Member Author

Should we apply the dirty hack if glibc > some version and impi version < 2019? The linked document says 'Intel® MPI Library 2018 and earlier'. We should be able to determine a suitable glibc version to make the split on - somewhere between what is in CentOS 7 and 8. Do we know if this is also seen on Ubuntu 18.04, as that would allow us to narrow this down further?
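
Something along these lines might work for the check (just a sketch; the 2.26 cut-off is a guess, since CentOS 7 ships glibc 2.17 and CentOS 8 ships 2.28, so the exact split still needs to be confirmed):

# hypothetical version check: only enable the workaround on 'new enough' glibc
glibc_version=$(getconf GNU_LIBC_VERSION | awk '{print $2}')
if printf '%s\n' "2.26" "$glibc_version" | sort -V -C; then
    echo "glibc $glibc_version >= 2.26: apply the strtok_proxy.so workaround for impi < 2019"
fi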

@hajgato
Collaborator

hajgato commented Dec 22, 2020

I strongly suspect that the same issue exists on Ubuntu 18.04. See: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-MPI-segmentation-fault-bug/td-p/1154073

@hajgato
Collaborator

hajgato commented Dec 22, 2020

Anyway, I would be happier if we made an mpirun/mpiexec (or whatever) wrapper that preloads the strtok_proxy. (The reason is that I do not like LD_PRELOAD, and if we have to use it, we should minimize when it is used.)
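
Something like this is what I have in mind (only a sketch; the paths are made up), so the proxy is only preloaded for the actual MPI launch and not for everything in the session:

#!/bin/bash
# hypothetical mpirun wrapper: prepend strtok_proxy.so to LD_PRELOAD just for this launch
PROXY=/path/to/impi/lib/strtok_proxy.so        # made-up install location
exec env LD_PRELOAD="${PROXY}${LD_PRELOAD:+:$LD_PRELOAD}" /path/to/real/mpirun "$@"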

@boegel
Member

boegel commented Dec 23, 2020

We briefly discussed this during the EB conf call today.

There's probably several options here, and $LD_PRELOAD certainly raises some concerns, but it may be difficult to avoid.

Why create wrappers for mpirun rather than just letting the impi module add strtok_proxy.so to $LD_PRELOAD?
To avoid it always being there, rather than only when running MPI stuff?

Maybe we should reach out to Intel support to get more info on this, and make it clear we're not happy with the $LD_PRELOAD workaround...

It was also mentioned that the latest impi 2019 version (2019 update 5) may no longer have this issue, so one possible workaround could be having a tweaked intel/2019b (& co) on RHEL8 that has the latest impi 2019.x ...

@akesandgren
Contributor

Note that this seems to be a problem only when using mpirun to start things. Using srun inside a batch job doesn't show the problem for me on Ubuntu Focal.

@akesandgren
Contributor

akesandgren commented Apr 30, 2021

Just for the record, 2018.5 also has the same problem.
2019.7 does work.
