Possible missing MPI_Type_free in ESMCI_VMKernel? #209

Open
mathomp4 opened this issue Jan 12, 2024 · 16 comments

@mathomp4
Contributor

This is a long shot in the dark. @climbfuji and I are trying to get GEOS to work with Spack, namely the JCSDA spack-stack. In the tests by @climbfuji with spack-stack, he kept getting crashes at the end of execution of GEOSgcm (and even smaller, more boring programs, but ones that did link to MAPL and thus ESMF).

So I started with mothership spack, and my first test showed all was well. But he reminded me that spack-stack builds ESMF as static-only, no shared. So I built GEOS against a static-only ESMF and, yup, crashes on program exit. Turning on all the debugging flags in GEOS and MAPL didn't help too much, but I did get out:

double free or corruption (fasttop)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x14d40515cdbf in ???
#1  0x14d40515cd2b in ???
#2  0x14d40515e3e4 in ???
#3  0x14d4051a2c26 in ???
#4  0x14d4051aacc9 in ???
#5  0x14d4051ac8a3 in ???
#6  0x14d415265e9b in _ZNSt15__new_allocatorIP15ompi_datatype_tE10deallocateEPS1_m
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/new_allocator.h:158
#7  0x14d415263683 in _ZNSt16allocator_traitsISaIP15ompi_datatype_tEE10deallocateERS2_PS1_m
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/alloc_traits.h:496
#8  0x14d415260139 in _ZNSt12_Vector_baseIP15ompi_datatype_tSaIS1_EE13_M_deallocateEPS1_m
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/stl_vector.h:387
#9  0x14d41525cd69 in _ZNSt12_Vector_baseIP15ompi_datatype_tSaIS1_EED2Ev
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/stl_vector.h:366
#10  0x14d41526bd28 in ???
#11  0x14d4051601bd in ???
#12  0x14d411a28a76 in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node borgm001 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Now, not much traceback, but it does seem to point to MPI type-ish stuff? Maybe? Honestly, I'm reaching here.
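
Frames #6 through #9 of the backtrace are mangled C++ symbols, and demangling them makes the "MPI type-ish" hunch easier to check. Below is a minimal sketch, assuming a GCC or Clang toolchain that ships `<cxxabi.h>`; the helper name `demangle` is made up for this illustration. Passing the symbol from frame #9 yields a `std::_Vector_base<ompi_datatype_t*, ...>` destructor. Since `ompi_datatype_t*` is the type behind Open MPI's `MPI_Datatype` handle, that suggests a `std::vector` of MPI datatypes is being destroyed during program exit.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium-ABI symbol name such as the ones in frames #6-#9.
// Returns the input unchanged if the name cannot be demangled.
// `demangle` is a hypothetical helper name, not part of ESMF or MAPL.
std::string demangle(const char* mangled) {
  assert(mangled != nullptr);
  int status = 0;
  // With a null output buffer, __cxa_demangle returns a malloc'd string
  // that the caller must release with free().
  char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;
  }
  std::string result(out);
  std::free(out);
  return result;
}
```

For example, `demangle("_ZNSt12_Vector_baseIP15ompi_datatype_tSaIS1_EED2Ev")` gives the readable form of frame #9. The `c++filt` command-line tool does the same job without writing any code.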

So I grepped both ESMF and MAPL, and there are many types around, but one thing I saw was that in ESMCI_VMKernel.C you have:

MPI_Type_contiguous(byteCount, MPI_BYTE, &(customType[i]));
MPI_Type_commit(&(customType[i]));

and I don't see a corresponding MPI_Type_free for customType.
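
To make the create/free pairing concrete, here is a minimal sketch of one way to guarantee that every committed type is freed exactly once. It is not how ESMF is written. Everything in the `mock` namespace is a hypothetical stand-in so the sketch compiles without an MPI installation; real code would call `MPI_Type_contiguous`, `MPI_Type_commit` and `MPI_Type_free` in those places. `CustomType` and `build_and_release` are also invented names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the MPI datatype calls (assumption: these only
// count live types; they do not model real MPI behavior).
namespace mock {
using Datatype = int;           // stand-in for MPI_Datatype
constexpr Datatype kNull = -1;  // stand-in for MPI_DATATYPE_NULL

inline int& live_types() {      // committed types that are not yet freed
  static int n = 0;
  return n;
}
inline void type_create_and_commit(int byteCount, Datatype* t) {
  assert(byteCount > 0);
  *t = byteCount;
  ++live_types();
}
inline void type_free(Datatype* t) {
  assert(*t != kNull);          // a second free of the same handle trips here
  *t = kNull;
  --live_types();
}
}  // namespace mock

// Owns one committed type and frees it exactly once on destruction.
class CustomType {
 public:
  explicit CustomType(int byteCount) {
    mock::type_create_and_commit(byteCount, &type_);
  }
  ~CustomType() {
    if (type_ != mock::kNull) mock::type_free(&type_);
  }
  // Copying is disabled so two owners can never free the same handle.
  CustomType(const CustomType&) = delete;
  CustomType& operator=(const CustomType&) = delete;
  // Moving transfers ownership and leaves the source empty.
  CustomType(CustomType&& other) noexcept : type_(other.type_) {
    other.type_ = mock::kNull;
  }
  CustomType& operator=(CustomType&&) = delete;

 private:
  mock::Datatype type_ = mock::kNull;
};

// Creates one type per byte count, lets scope exit free them all, and
// returns how many types are still live afterwards (0 if pairing is right).
int build_and_release(const std::vector<int>& byteCounts) {
  {
    std::vector<CustomType> customType;
    customType.reserve(byteCounts.size());
    for (int n : byteCounts) customType.emplace_back(n);
    assert(mock::live_types() == static_cast<int>(byteCounts.size()));
  }  // destructors run here, one free per committed type
  return mock::live_types();
}
```

Calling `build_and_release({8, 16, 32})` returns 0 because each wrapper frees its own handle. One caveat if this pattern is used with real MPI: the frees have to happen before `MPI_Finalize`, so wrappers like this should not live in global or static objects that are destroyed at program exit.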

Of course, ESMF is complex and this is also C++ code which I am not very good at. It's possible the frees are done elsewhere? (aka Fun with OO programming!)

It's also possible this has absolutely nothing to do with the crash. I mean, I currently load 51 (!) modules when I run with spack so...that's a lot of things to look at. But the fact that just changing from shared to static ESMF causes a crash does point us toward ESMF...

@oehmke oehmke assigned oehmke and theurich and unassigned oehmke Jan 12, 2024
@oehmke
Contributor

oehmke commented Jan 12, 2024

Thanks for letting us know. This is deep in Gerhard's (@theurich) territory, so I'm going to assign it to him and hopefully he'll have a chance soon to take a look and make sure things are as they should be. What machine is this? I noticed the 12.3 gcc and wondered if this relates to Tom's issue (#397). Is he using a static ESMF?

@climbfuji

This is on Discover. We've observed the same with gcc@10.1.0

@mathomp4
Contributor Author

> Thanks for letting us know. This is deep in Gerhard's (@theurich) territory, so I'm going to assign it to him and hopefully he'll have a chance soon to take a look and make sure things are as they should be. What machine is this? I noticed the 12.3 gcc and wondered if this relates to Tom's issue (#397). Is he using a static ESMF?

@oehmke My guess is @tclune is not using static ESMF. ESMA-Baselibs currently builds ESMF as static and shared and from the experiments @climbfuji and myself have done with spack and other observations, it looks like FindESMF.cmake chooses shared by default.

And I now realize instead of rebuilding ESMF as static only, I could have just set -DUSE_ESMF_STATIC_LIBS=YES in my GEOS builds. Son of a ... dangit. 😠

@mathomp4
Contributor Author

Well, my current tests are not looking good for this being the issue. I mean, it's probably a memory leak (maybe?), but it'd be teeny. I've tried a few different ways of doing the MPI_Type_free (loop in order, in reverse order) and no change. (Well, in one attempt I think I did it too late and things went nutty, but I think that's due to GEOS.)

As @atrayano said when I talked with him, since it's a double free, it's more likely that MAPL or ESMF is freeing something twice. But all the explicit MPI_Type_free calls in MAPL are matched up. And almost all in ESMF are as well. Grah.

@mathomp4
Contributor Author

As a test, per a suggestion by @oehmke, I built ESMF with ESMF_PIO=OFF and ESMF_MOAB=OFF but no change. Dang.

But a thought occurred to me while chatting with @atrayano: what if we build MAPL as static along with ESMF? Do that and one of my at-finalize double-free errors (rs_numtiles.x) goes away.

So, I'm wondering if static ESMF means everything GEOS makes has to be static as well?

@oehmke
Contributor

oehmke commented Jan 16, 2024 via email

@tclune
Collaborator

tclune commented Jan 16, 2024 via email

@climbfuji

I think the right way forward is to re-enable the shared esmf build. I just confirmed that if I do that (flip one character in our spack config file), geos builds and runs correctly.

Then we give the UFS folks a heads up that with the next spack-stack release ESMF will be both shared and static, and that they have to fix their build system to correctly pick up the static version (or move away from static libraries - it's a thing of the past anyway).

@climbfuji

See ufs-community/ufs-weather-model#2094 for the heads-up to the UFS that future versions of spack-stack will have both shared and static esmf and mapl. See JCSDA/spack-stack#953 and JCSDA/spack#372 for the spack-stack and spack changes to support GEOS (and build esmf and mapl both shared and static).

I agree nonetheless that this issue should be fixed between esmf and mapl so that one can combine shared and static libraries.

@oehmke
Contributor

oehmke commented Jan 16, 2024 via email

@oehmke
Contributor

oehmke commented Jan 17, 2024 via email

@mathomp4
Contributor Author

@oehmke Yup. Both GEOS and ESMF with debugging flags. And even that just gave the four usable lines of traceback.

@climbfuji

climbfuji commented Jan 17, 2024

I think this goes all the way back to 8.3.0, maybe beta snapshot 09. Could also be earlier, but we didn't run the UFS with earlier versions of spack-stack, therefore can't tell.

@climbfuji

Fun stuff. Building ESMF in spack shared fails on macOS in the linker stage, see JCSDA/spack-stack#956 ...

@oehmke
Contributor

oehmke commented Jan 17, 2024 via email

@climbfuji

> This looks like it may be an issue with a fix for tracing we put in for Darwin. Would you try setting ESMF_TRACE_LIB_BUILD=OFF when building ESMF and see if that fixes it? Thanks.

Thanks so much @oehmke, that worked! I'll submit a PR to spack with the change for macOS when building shared ESMF. Sorry for the late reply, all-day meeting today ...
