segfault when shared library used from petsc: mac + openmpi only #110

Closed
minrk opened this issue Jan 30, 2024 · 9 comments · Fixed by #114

@minrk
Member

minrk commented Jan 30, 2024

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

Originally reported in dolfinx-mpc, I've now reproduced the issue with just petsc4py, with help from @jorgensd. Running a simple solve segfaults. Amazingly, this somehow only affects mac openmpi builds; all other combinations I've tested work fine.
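Roughly, the kind of solve that triggers it looks like the sketch below (names and sizes are illustrative, not the exact dolfinx-mpc reproducer; it just has to route a KSP solve through a MUMPS LU factorization, matching the backtrace KSPSolve_PREONLY → PCApply_LU → MatSolve_MUMPS):

```python
# Illustrative reproducer sketch (not the exact dolfinx-mpc script):
# solve a small tridiagonal system with a direct LU via MUMPS.
# Run with e.g.: mpiexec -n 2 python repro.py
from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

b = A.createVecLeft()
b.set(1.0)
x = A.createVecRight()

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType("preonly")           # KSPSolve_PREONLY in the backtrace
pc = ksp.getPC()
pc.setType("lu")                 # PCApply_LU
pc.setFactorSolverType("mumps")  # MatSolve_MUMPS -> dmumps_* frames
ksp.setFromOptions()
ksp.solve(b, x)                  # segfaults on mac + openmpi with shared mumps
```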

After some local rebuilds of petsc and mumps, I have narrowed this down to the shared-library builds of mumps: when I rebuild mumps-mpi with static libs only and then rebuild petsc against that, there are no errors.

The segfault:

* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x6001550217f0)
  * frame #0: 0x00000001024c98a0 libdmumps.dylib`dmumps_scatter_dist_rhs_ + 736
    frame #1: 0x00000001024c4d94 libdmumps.dylib`dmumps_solve_driver_ + 56724
    frame #2: 0x0000000102516a1c libdmumps.dylib`dmumps_ + 1708
    frame #3: 0x000000010251b114 libdmumps.dylib`dmumps_f77_ + 4228
    frame #4: 0x0000000102514ba4 libdmumps.dylib`dmumps_c + 3312
    frame #5: 0x00000001030c7a90 libpetsc.3.20.3.dylib`MatSolve_MUMPS + 688
    frame #6: 0x0000000102fd61d0 libpetsc.3.20.3.dylib`MatSolve + 308
    frame #7: 0x000000010374bd7c libpetsc.3.20.3.dylib`PCApply_LU + 84
    frame #8: 0x0000000103880710 libpetsc.3.20.3.dylib`PCApply + 204
    frame #9: 0x00000001036556c4 libpetsc.3.20.3.dylib`KSPSolve_PREONLY + 248
    frame #10: 0x00000001036fa518 libpetsc.3.20.3.dylib`KSPSolve_Private + 1056
    frame #11: 0x00000001036fa0b4 libpetsc.3.20.3.dylib`KSPSolve + 16
    frame #12: 0x00000001019aaf38 PETSc.cpython-310-darwin.so`__pyx_pw_8petsc4py_5PETSc_3KSP_101solve + 176

Still working out what to do about that, but the shared-library builds on mac rely on patches I made, so something could be wrong there.

Installed packages

# packages in environment at /Users/minrk/conda/envs/ompi-petsc-latest:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                h93a5062_5    conda-forge
c-ares                    1.26.0               h93a5062_0    conda-forge
ca-certificates           2023.11.17           hf0a4a13_0    conda-forge
fftw                      3.3.10          mpi_openmpi_haef8dc3_8    conda-forge
gmp                       6.3.0                h965bd2d_0    conda-forge
hdf5                      1.14.3          mpi_openmpi_h20f603a_0    conda-forge
hypre                     2.28.0          mpi_openmpi_haba3941_0    conda-forge
icu                       73.2                 hc8870d7_0    conda-forge
krb5                      1.21.2               h92f50d5_0    conda-forge
libaec                    1.1.2                h13dd4ca_1    conda-forge
libblas                   3.9.0           21_osxarm64_openblas    conda-forge
libcblas                  3.9.0           21_osxarm64_openblas    conda-forge
libcurl                   8.5.0                h2d989ff_0    conda-forge
libcxx                    16.0.6               h4653b0c_0    conda-forge
libedit                   3.1.20191231         hc8eb9b7_2    conda-forge
libev                     4.33                 h93a5062_2    conda-forge
libffi                    3.4.2                h3422bc3_5    conda-forge
libgfortran               5.0.0           13_2_0_hd922786_2    conda-forge
libgfortran5              13.2.0               hf226fd6_2    conda-forge
libhwloc                  2.9.3           default_h4394839_1009    conda-forge
libiconv                  1.17                 h0d3ecfb_2    conda-forge
liblapack                 3.9.0           21_osxarm64_openblas    conda-forge
libnghttp2                1.58.0               ha4dd798_1    conda-forge
libopenblas               0.3.26          openmp_h6c19121_0    conda-forge
libptscotch               7.0.4                h820b06d_1    conda-forge
libscotch                 7.0.4                hf7fe8bf_1    conda-forge
libsqlite                 3.44.2               h091b4b1_0    conda-forge
libssh2                   1.11.0               h7a5bd25_0    conda-forge
libxml2                   2.12.4               h0d0cfa8_1    conda-forge
libzlib                   1.2.13               h53f4e23_5    conda-forge
llvm-openmp               17.0.6               hcd81f8e_0    conda-forge
metis                     5.1.0             h13dd4ca_1007    conda-forge
mpfr                      4.2.1                h9546428_0    conda-forge
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.1.5           py310hd3bd7df_0    conda-forge
mumps-include             5.6.2                hce30654_4    conda-forge
mumps-mpi                 5.6.2                hc6b315c_4    conda-forge
ncurses                   6.4                  h463b476_2    conda-forge
numpy                     1.26.3          py310hd45542a_0    conda-forge
openmpi                   4.1.6              h526c993_101    conda-forge
openssl                   3.2.0                h0d3ecfb_1    conda-forge
parmetis                  4.0.3             h6eb5794_1005    conda-forge
petsc                     3.20.4          real_hdd9ae42_100    conda-forge
petsc4py                  3.20.3          real_heb9844d_100    conda-forge
pip                       23.3.2             pyhd8ed1ab_0    conda-forge
ptscotch                  7.0.4                hc1c4572_1    conda-forge
python                    3.10.13         h2469fbe_1_cpython    conda-forge
python_abi                3.10                    4_cp310    conda-forge
readline                  8.2                  h92ec313_1    conda-forge
scalapack                 2.2.0                h515df86_1    conda-forge
scotch                    7.0.4                hc1c4572_1    conda-forge
setuptools                69.0.3             pyhd8ed1ab_0    conda-forge
suitesparse               5.10.1               h79486c6_3    conda-forge
superlu                   5.2.2                hc615359_0    conda-forge
superlu_dist              8.2.1                h3dacc9e_1    conda-forge
tbb                       2021.11.0            h2ffa867_1    conda-forge
tk                        8.6.13               h5083fa2_1    conda-forge
tzdata                    2023d                h0c530f3_0    conda-forge
wheel                     0.42.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h57fd34a_0    conda-forge
yaml                      0.2.5                h3422bc3_2    conda-forge
zlib                      1.2.13               h53f4e23_5    conda-forge
zstd                      1.5.5                h4f39d0f_0    conda-forge

Environment info

active environment : ompi-mumps-561
    active env location : /Users/minrk/conda/envs/ompi-mumps-561
            shell level : 2
       user config file : /Users/minrk/.condarc
 populated config files : /Users/minrk/conda/.condarc
                          /Users/minrk/.condarc
          conda version : 23.11.0
    conda-build version : 3.28.4
         python version : 3.10.13.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=m1
                          __conda=23.11.0=0
                          __osx=14.3=0
                          __unix=0=0
       base environment : /Users/minrk/conda  (writable)
      conda av data dir : /Users/minrk/conda/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /Users/minrk/conda/pkgs
                          /Users/minrk/.conda/pkgs
       envs directories : /Users/minrk/conda/envs
                          /Users/minrk/.conda/envs
               platform : osx-arm64
             user-agent : conda/23.11.0 requests/2.31.0 CPython/3.10.13 Darwin/23.3.0 OSX/14.3 solver/libmamba conda-libmamba-solver/23.11.1 libmambapy/1.5.6
                UID:GID : 501:20
             netrc file : /Users/minrk/.netrc
           offline mode : False
@minrk
Member Author

minrk commented Apr 29, 2024

This appears to be affecting mpich on mac now, too. I don't know enough to debug it or how to move forward.

It is very strange, but building with static mumps on mac still fixes the problem, which suggests to me that some compile flags are wrong. I don't know what kind of flag could cause a segfault in dmumps_scatter_dist_rhs_, or how to isolate this into what should be a failing test in either the mumps or petsc package.

The petsc KSP tests pass with mumps when I try, but this example still segfaults.

I'm somewhat inclined to switch the mac builds to static while working this out, since it fixes serious problems. But I can't tell the scope of what is affected.

@dalcinl
Contributor

dalcinl commented Apr 29, 2024

@minrk Could this be related to mpifort not passing -Wl,-commons,use_dylibs anymore? Apple really screwed Fortran users with its new linker version.

@minrk
Member Author

minrk commented Apr 29, 2024

Thanks for the pointer. I have no idea, but I can try it.

I'd love a test that relies only on mumps, but so far I can't reproduce it with anything smaller than petsc4py.

@dalcinl
Contributor

dalcinl commented Apr 29, 2024

From the MUMPS INSTALL documentation:

-DAVOID_MPI_IN_PLACE:
MUMPS uses MPI_IN_PLACE in some collective MPI operations. In case of
MPI environments where MPI_IN_PLACE is failing, it is possible to
avoid the use of MPI_IN_PLACE at the cost of more temporary memory
allocation and possibly less efficient code.

Can you try to build (I have no idea how) passing that define? If the issue is related to -Wl,-commons,use_dylibs, then avoiding the use of MPI_IN_PLACE may be all you need.
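For anyone unfamiliar with the trade-off, the two forms look roughly like this in mpi4py (illustrative only; MUMPS does the equivalent in Fortran, and the define switches it to the buffered form at the cost of a temporary array):

```python
# Illustrative only: "using MPI_IN_PLACE" vs. avoiding it, in mpi4py terms.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# In-place form: send and receive share the same buffer.
a = np.arange(4, dtype="d")
comm.Allreduce(MPI.IN_PLACE, a, op=MPI.SUM)

# Avoided form: reduce into a temporary, then copy back.
b = np.arange(4, dtype="d")
tmp = np.empty_like(b)
comm.Allreduce(b, tmp, op=MPI.SUM)
b[:] = tmp
```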

@minrk
Member Author

minrk commented Apr 30, 2024

Thank you!

I tried with -DAVOID_MPI_IN_PLACE, but it doesn't compile:

zsol_c.F:2517:54:

 2517 |       ALLOCATE(TMP_INT_ARRAY(KEEP(28)), STAT = allocok)
      |                                                      1
Error: Symbol ‘allocok’ at (1) has no IMPLICIT type

suggesting this option perhaps isn't used very often. I'll try the use_dylibs flag next.

@minrk
Member Author

minrk commented Apr 30, 2024

@dalcinl you're a hero! I've confirmed that AVOID_MPI_IN_PLACE fixes the problem (after a patch to define a missing variable). I have now written my first line of Fortran code, I think. #114

@minrk
Member Author

minrk commented Apr 30, 2024

I'd still love to have a regression test in this repo if anyone knows how to translate this to pure mumps calls, but I'm thrilled to have working mumps again on mac.

@dalcinl
Contributor

dalcinl commented Apr 30, 2024

I have now written my first line of Fortran code, I think. #114

I'm not sure whether to congratulate you or say I am sorry 🤣.

@dalcinl
Contributor

dalcinl commented Apr 30, 2024

I've confirmed that AVOID_MPI_IN_PLACE fixes the problem (after a patch to define a missing variable)

Are you planning to submit the patch upstream to MUMPS developers?
