dlopen of libgomp 13.1.0 and 13.2.0 with RTLD_DEEPBIND on Python fail with segmentation fault on Ubuntu 22.04 #114

Closed
1 task done
traversaro opened this issue Sep 13, 2023 · 15 comments · Fixed by #117
Labels
bug Something isn't working

Comments

@traversaro
Contributor

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

If I try to dlopen libgomp 13.* with RTLD_DEEPBIND from a Python environment, I get a segfault. A simple reproducer is the command python -c "import ctypes; import os; ctypes._dlopen(os.environ['CONDA_PREFIX']+'/lib/libgomp.so.1', os.RTLD_DEEPBIND)" :

(testsegfault) traversaro@IITICUBLAP257:~$ python -c "import ctypes; import os; ctypes._dlopen(os.environ['CONDA_PREFIX']+'/lib/libgomp.so.1', os.RTLD_DEEPBIND)"
Segmentation fault

The issue does not appear if:

  • A C/C++ program is used for dlopen, without going through the Python interpreter
  • libgomp <= 12 is used
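
When testing this across several environments, it can help to run the reproducer in a child interpreter, so that the crash is reported as a signal instead of killing the calling process. The following is only a sketch of that idea: the helper name run_isolated is hypothetical, and it assumes a POSIX system where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_isolated(snippet: str) -> str:
    """Run a Python snippet in a child interpreter and describe how it exited."""
    proc = subprocess.run([sys.executable, "-c", snippet])
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the killing signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exit code " + str(proc.returncode)


# The one-liner from the report above, unchanged.
REPRODUCER = (
    "import ctypes; import os; "
    "ctypes._dlopen(os.environ['CONDA_PREFIX']+'/lib/libgomp.so.1', os.RTLD_DEEPBIND)"
)

if __name__ == "__main__":
    print(run_isolated(REPRODUCER))
```

In an affected environment this would be expected to report a SIGSEGV, while an unaffected one should report exit code 0.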

The backtrace is the following:

(gdb) bt
#0  initialize_env () at ../../../libgomp/env.c:2062
#1  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffc1f8, env=env@entry=0x7fffffffc218)
    at ./elf/dl-init.c:70
#2  0x00007ffff7fc9568 in call_init (env=0x7fffffffc218, argv=0x7fffffffc1f8, argc=3, l=<optimized out>) at ./elf/dl-init.c:33
#3  _dl_init (main_map=0x555555b8e620, argc=3, argv=0x7fffffffc1f8, env=0x7fffffffc218) at ./elf/dl-init.c:117
#4  0x00007ffff7e09c85 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>)
    at ./elf/dl-error-skeleton.c:182
#5  0x00007ffff7fd0ff6 in dl_open_worker (a=0x7fffffffb910) at ./elf/dl-open.c:808

It seems to indicate that something is going wrong around https://github.com/gcc-mirror/gcc/blob/releases/gcc-13.2.0/libgomp/env.c#L2062 . I have a few ideas for investigating this further, such as inspecting the value of the environ global variable, but I am not sure when I will have time for this, so in the meantime I opened this issue.

Downstream issue: conda-forge/casadi-feedstock#91 .

Installed packages

(testsegfault) traversaro@IITICUBLAP257:~$ conda list
# packages in environment at /home/traversaro/miniforge3/envs/testsegfault:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2023.7.22            hbcca054_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_0    conda-forge
libgomp                   13.2.0               h807b86a_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libsqlite                 3.43.0               h2797004_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
openssl                   3.1.2                hd590300_0    conda-forge
pip                       23.2.1             pyhd8ed1ab_0    conda-forge
python                    3.11.5          hab00c5b_0_cpython    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                68.2.2             pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.41.2             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

(testsegfault) traversaro@IITICUBLAP257:~$ conda info

     active environment : testsegfault
    active env location : /home/traversaro/miniforge3/envs/testsegfault
            shell level : 1
       user config file : /home/traversaro/.condarc
 populated config files : /home/traversaro/miniforge3/.condarc
                          /home/traversaro/.condarc
          conda version : 23.3.1
    conda-build version : not installed
         python version : 3.10.12.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.2=0
                          __glibc=2.35=0
                          __linux=5.15.90.1=0
                          __unix=0=0
       base environment : /home/traversaro/miniforge3  (writable)
      conda av data dir : /home/traversaro/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/traversaro/miniforge3/pkgs
                          /home/traversaro/.conda/pkgs
       envs directories : /home/traversaro/miniforge3/envs
                          /home/traversaro/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.31.0 CPython/3.10.12 Linux/5.15.90.1-microsoft-standard-WSL2 ubuntu/22.04.2 glibc/2.35
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False
@traversaro traversaro added the bug Something isn't working label Sep 13, 2023
@traversaro
Contributor Author

libgomp <= 12 is used

Indeed, it seems that the problematic piece of code was only introduced in libgomp 13 : gcc-mirror/gcc@9f2fca5 .

@h-vetinari
Member

Based on the patch you found, it seems to have something to do with parsing OMP_* environment variables?

I saw that the ipopt-feedstock sets

  # Environment variables needed by spral
  # See https://github.com/ralna/spral#usage-at-a-glance
  export OMP_CANCELLATION=TRUE
  export OMP_PROC_BIND=TRUE

In particular, from the commit you linked that introduced the new facility for host vs. device, it seems to me that:

  • The parsing for OMP_PROC_BIND changed substantially (as opposed to OMP_CANCELLATION)
  • While things clearly shouldn't break, the code does warn for invalid values, so trying to rebuild the affected stack against libgomp 13.x would probably be a good idea.
  • Out of all the test cases in that commit, none of them has a value of TRUE for OMP_PROC_BIND, but rather things like: "spread", "close", "spread,spread", "spread,close"

@traversaro
Contributor Author

I am not sure this is related to ipopt/spral. The environment in which this happens, reported in #114 (comment), was created with mamba create -n testsegfault libgomp python, and in that environment no OMP_* variables are defined.

While things clearly shouldn't break, the code does warn for invalid values, so trying to rebuild the affected stack against libgomp 13.x would probably be a good idea.

Just to understand, which stack? The problem occurs just by combining libgomp and python, and I do not think that python depends on libgomp .

@h-vetinari
Member

OK, sorry about that. I followed your "downstream issue" a bit, that's why I got to ipopt. If this happens purely with python+libgomp, then I'm more stumped (I thought it was something about setting/parsing the OMP_* options). I fail to imagine how the commit you referenced would touch the ABI, but perhaps that's the case. Might be interesting to rebuild python with gcc 13 to see if that changes anything?

@traversaro
Contributor Author

I found another issue involving a segfault in libgomp's initialize_env(): weechat/weechat#2009 . If I understood it correctly, it also happens with libgomp 13.2.0, but on Fedora 39.

@traversaro
Contributor Author

traversaro commented Sep 13, 2023

I reproduced the issue on Debian and Ubuntu releases whose apt packages contain gomp 13, while earlier releases with gomp 12 all pass: https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6172933871. On the other hand, Fedora 38 has gomp 13.2.0 but does not reproduce the error, and neither does the latest Arch.

@S-Dafarra

I found another issue involving a segfault in libgomp's initialize_env(): weechat/weechat#2009 . If I understood it correctly, it also happens with libgomp 13.2.0, but on Fedora 39.

The issue here seems to happen even with PHP. I wonder if it happens in general when using dlopen

@traversaro
Contributor Author

I found another issue involving a segfault in libgomp's initialize_env(): weechat/weechat#2009 . If I understood it correctly, it also happens with libgomp 13.2.0, but on Fedora 39.

The issue here seems to happen even with PHP. I wonder if it happens in general when using dlopen

I tested with casadi, and the issue did not happen when using a simple C++ example (I tested https://github.com/casadi/casadi/blob/main/docs/examples/cplusplus/ipopt_nl.cpp).

@traversaro
Contributor Author

traversaro commented Sep 13, 2023

Just to be sure I created a minimal C-based test, and indeed the issue does not appear to happen with that, see https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6174144211 and https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/blob/main/test.c .

@traversaro
Contributor Author

I was able to reproduce the problem without libgomp, just with a manually coded shared lib, i.e. testso.c :

#include <stdio.h>

extern char **environ; // Declare environ explicitly

// Constructor: runs when the shared library is loaded
static void __attribute__((constructor))
initialize_env (void)
{
    char **env;
    fprintf(stderr, "Print debug\n");
    env = environ;
    fprintf(stderr, "environ %p env %p\n", (void *) environ, (void *) env);
    // No NULL check: dereferencing *env crashes when environ == NULL
    for (env = environ; *env != 0; env++)
    {
        fprintf(stderr, "%s\n", *env);
    }
}
gcc -shared -fPIC testso.c -o testso.so
(testsegfault) traversaro@IITICUBLAP257:~/test_ipopt_dir$ python -c "import ctypes; import os; ctypes._dlopen('./testso.so', os.RTLD_DEEPBIND)"
Print debug
environ (nil) env (nil)
Segmentation fault

While in normal use:

Trying to load with RTLD_LAZY|RTLD_DEEPBIND ./testso.so
Print debug
environ 0x7ffcf93cefc0 env 0x7ffcf93cefc0

For some reason the environ global variable is set to 0/NULL.

So perhaps we should move the issue to the Python feedstock?
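
As a possible cross-check, the value of environ can also be read from within the interpreter itself via ctypes. This is only a sketch: the helper name environ_pointer is hypothetical, and it reports what a lookup in the main program's global scope sees, which is not necessarily what a library loaded with RTLD_DEEPBIND sees, so it does not by itself explain the NULL observed in the constructor.

```python
import ctypes


def environ_pointer():
    """Return the C `environ` pointer as seen from the global scope, or None if NULL."""
    # CDLL(None) wraps dlopen(NULL), so the symbol lookup uses the main
    # program's global scope rather than a separately loaded library.
    main_program = ctypes.CDLL(None)
    # in_dll reads the exported data symbol; .value is None for a NULL pointer.
    return ctypes.c_void_p.in_dll(main_program, "environ").value


if __name__ == "__main__":
    ptr = environ_pointer()
    print("environ is NULL" if ptr is None else f"environ = {ptr:#x}")
```

If this prints a non-NULL address while the constructor of testso.so still prints (nil), the two are looking at different views of environ.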

@traversaro
Contributor Author

Ok, I think this is the combination of two different behaviours/problems:

  • P1: constructors of shared libraries opened via dlopen with RTLD_DEEPBIND from Python on conda-forge/Debian see environ==NULL
    • I am not sure why this happens, or whether it is expected behaviour or a bug
  • P2: libgomp >= 13 segfaults if environ==NULL
    • I think this behaviour is a bug in libgomp, as environ==NULL is a valid state, for example after calling clearenv() on Linux.

P2 can be reproduced easily on libgomp >= 13 with this MWE:

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main () {
    // clearenv() leaves environ == NULL on Linux
    clearenv();
    void * handle = dlopen("libgomp.so.1", RTLD_NOW);

    if (handle) {
        fprintf(stderr, "dlopen of libgomp.so.1 done correctly.\n");
        return EXIT_SUCCESS;
    } else {
        fprintf(stderr, "dlopen of libgomp.so.1 failed with error: %s.\n", dlerror());
        return EXIT_FAILURE;
    }
}

to run:

gcc test_gomp_segfault.c -o test_gomp_segfault -ldl
./test_gomp_segfault
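
The same check can probably be driven from Python without a compiler, by calling clearenv() through ctypes in a child interpreter and then loading the library. This is a sketch under assumptions: the helper name dlopen_after_clearenv is hypothetical, it relies on a Linux C library that exports clearenv(), and it has not been run against an affected libgomp here.

```python
import subprocess
import sys

# Child script: clear the environment at the C level, then dlopen the library.
CHILD_TEMPLATE = """
import ctypes
libc = ctypes.CDLL(None)      # symbols of the already loaded C library
libc.clearenv()               # as in the C MWE, this leaves environ == NULL
ctypes.CDLL({library!r})      # plain dlopen, no RTLD_DEEPBIND involved
"""


def dlopen_after_clearenv(library: str) -> int:
    """Return the child's exit status; negative values are signal numbers."""
    child = CHILD_TEMPLATE.format(library=library)
    return subprocess.run([sys.executable, "-c", child]).returncode


if __name__ == "__main__":
    print(dlopen_after_clearenv("libgomp.so.1"))
```

A result of 0 means the library loaded, 1 means the child raised an exception (for example because the library was not found), and a negative value such as -11 means the child was killed by that signal.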

I will open a bug upstream in GCC for P2.

@traversaro
Contributor Author

I will open a bug upstream in GCC for P2.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111413

@traversaro traversaro changed the title dlopen of libgomp 13.1.0 and 13.2.0 with RTLD_DEEPBIND fail with segmentation fault on Ubuntu 22.04 dlopen of libgomp 13.1.0 and 13.2.0 with RTLD_DEEPBIND on Python fail with segmentation fault on Ubuntu 22.04 Sep 14, 2023
@h-vetinari h-vetinari mentioned this issue Sep 16, 2023
5 tasks
@traversaro
Contributor Author

I will open a bug upstream in GCC for P2.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111413

The issue was fixed upstream for GCC14, see:

The patch is huge, but ignoring indentation changes it boils down to a single-line change, which may be better suited for a backport because it reduces the risk of patch conflicts.

@h-vetinari
Member

Great job!

The patch is huge, but ignoring indentation changes it boils down to a single-line change

Proof of that statement, using Github's UI.

@traversaro
Contributor Author

P1: constructors of shared libraries opened via dlopen with RTLD_DEEPBIND from Python on conda-forge/Debian see environ==NULL

* I am not sure why this happens, or whether it is expected behaviour or a bug

It turns out that this also worked fine with gomp <= 12 and does not work with gomp 13, so I opened an issue for that as well: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111556 . However, to be honest, I am not sure whether this is a problem in libgomp, in glibc, or simply in how ELF and the POSIX spec interact.
