
Problem with non mpich mpi #71

Closed

LaurentPlagne opened this issue Oct 6, 2023 · 19 comments

LaurentPlagne commented Oct 6, 2023

Hello,

I use ImplicitGlobalGrid and ParallelStencil for my simulation, and everything is fine when I use the default MPI (MPICH).
Things do not work properly when OpenMPI is selected via MPIPreferences: MPI.jl itself seems to run OK, but ImplicitGlobalGrid complains about a mismatching MPI version and runs the same single-process problem N times.

Are there any particular settings I should use for ImplicitGlobalGrid.jl to make it use the proper MPI library?

julia 1.9.3

(Acoustic3DFD) pkg> status
Project Acoustic3DFD v0.1.0
Status `~/travail/triscale/git/TransfoUS.jl/Acoustic3DFD.jl/Project.toml`
⌅ [052768ef] CUDA v4.4.1
  [5789e2e9] FileIO v1.16.1
  [f67ccb44] HDF5 v0.17.1
  [4d7a3746] ImplicitGlobalGrid v0.13.0
  [033835bb] JLD2 v0.4.35
  [682c06a0] JSON v0.21.4
⌃ [da04e1cc] MPI v0.20.14
  [3da0fdf6] MPIPreferences v0.1.9
  [94395366] ParallelStencil v0.9.0
  [92933f4c] ProgressMeter v1.9.0
  [90137ffa] StaticArrays v1.6.5
  [37e2e46d] LinearAlgebra
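
For context, selecting a system OpenMPI via MPIPreferences is typically done along these lines (a minimal sketch, not taken from the issue; the library_names hint is optional and the exact library name is an assumption):

# Run once in the project environment, then restart Julia.
using MPIPreferences

# Search the library path for a system libmpi (OpenMPI here) and record the
# choice in LocalPreferences.toml so that MPI.jl uses it instead of MPICH_jll.
MPIPreferences.use_system_binary(library_names=["libmpi"])
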
luraess (Collaborator) commented Oct 6, 2023

This issue does not seem directly related to IGG, but rather to not being able to hook into the system MPI. Did you try a plain MPI.jl example without using the jll-shipped MPICH?
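
A plain MPI.jl check of the kind suggested here could look roughly like this (a minimal sketch; the script name and process count are arbitrary):

# hello_mpi.jl -- smoke test with whichever MPI library MPIPreferences selected
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
println("Hello from rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")
MPI.Finalize()

Run it with, e.g., mpirun -n 4 julia --project hello_mpi.jl and check that the ranks report distinct rank numbers rather than N copies of rank 0 of 1.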

ffevotte commented Oct 6, 2023

Thanks @luraess, it looks like you're right: when using MPIPreferences.jl to select the system-wide MPI installation:

  • simple MPI.jl examples work
  • simple ImplicitGlobalGrid.jl examples work
  • the 50-lines multi-xPU example from ParallelStencil.jl also works.

So it looks like the whole chain of dependencies can work fine with the locally installed OpenMPI version. However, in our real use case, ImplicitGlobalGrid.init_global_grid consistently fails (but things work when using the packaged MPICH_jll version). So we'll have to track down the issue in our code... Would you have any idea where to start?

ffevotte commented Oct 6, 2023

PS: Not sure whether it's related, but we also have to add the select_device=false kwarg to init_global_grid in order for the ParallelStencil example to work. Otherwise ParallelStencil complains that there are fewer GPUs than MPI processes.

(It's not really an issue for us, but I'm not sure whether that warrants an update of the example in the README)

omlins (Collaborator) commented Oct 6, 2023

@ffevotte @LaurentPlagne

However, in our real use-case ImplicitGlobalGrid.init_global_grid consistently fails

What is the error message?

Would you have any idea where to start?

Are there any differences in how you call init_global_grid, i.e., do you use different keyword arguments?
Do you have different environment variables set, in particular IGG_CUDAAWARE_MPI or IGG_ROCMAWARE_MPI?

omlins (Collaborator) commented Oct 6, 2023

PS: Not sure whether it's related, but we also have to add the select_device=false kwarg to init_global_grid in order for the ParallelStencil example to work. Otherwise ParallelStencil complains that there are fewer GPUs than MPI processes.

(It's not really an issue for us, but I'm not sure whether that warrants an update of the example in the README)

That should not be: we will sort this out rather than adjusting the README 👍 Thanks for letting us know!

ffevotte commented Oct 6, 2023

Each process writes an error looking like:

--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[titanium:20416] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

Some println-based debugging seems to indicate that this happens while calling

 me, dims, nprocs, coords, comm = init_global_grid(nx, ny, nz, select_device=false) 

The line above works when run in the context of the 50-lines multi-xPU example in the exact same environment.

omlins (Collaborator) commented Oct 6, 2023

Might you be calling init_global_grid from multiple threads (on the same process)?

ffevotte commented Oct 6, 2023

In all cases, I'm running the computation using

$ mpirun -n 8 julia --project -t 1 my_script.jl

That would mean there is only one thread per process, right?

Also, both in the (working) ParallelStencil example and the (non-working) real code, the parallel stencil initialization is the same:

@init_parallel_stencil(Threads, Float64, 3)

omlins (Collaborator) commented Oct 6, 2023

PS: Not sure whether it's related, but we also have to add the select_device=false kwarg to init_global_grid in order for the ParallelStencil example to work. Otherwise ParallelStencil complains that there are fewer GPUs than MPI processes.
(It's not really an issue for us, but I'm not sure whether that warrants an update of the example in the README)

That should not be: we will sort this out rather than adjusting the README 👍 Thanks for letting us know!

Actually, it might be that you are running more processes than you have GPUs and the error occurs rightfully. If your GPUs are distributed on multiple nodes, then you can solve this, for example, with a hostfile.

omlins (Collaborator) commented Oct 6, 2023

Can you call MPI.Init before init_global_grid and then call init_global_grid with the keyword init_MPI=false and check if the error occurs in MPI.Init?
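
For reference, a minimal sketch of that check could be (nx, ny, nz assumed to be defined as in your script):

using MPI, ImplicitGlobalGrid
MPI.Init()  # if the failure already occurs here, it is unrelated to ImplicitGlobalGrid
me, dims, nprocs, coords, comm = init_global_grid(nx, ny, nz; init_MPI=false, select_device=false)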

omlins (Collaborator) commented Oct 6, 2023

Also don't forget:

Do you have different environment variables set, in particular IGG_CUDAAWARE_MPI or IGG_ROCMAWARE_MPI?

ffevotte commented Oct 6, 2023

Can you call MPI.Init before init_global_grid and then call init_global_grid with the keyword init_MPI=false and check if the error occurs in MPI.Init?

Yes, if I do this the same error occurs in MPI.Init().

Also don't forget:

Do you have different environment variables set, in particular IGG_CUDAAWARE_MPI or IGG_ROCMAWARE_MPI?

Sorry, I had forgotten about it. But no: no IGG_* environment variable is set in any of my tests.

ffevotte commented Oct 6, 2023

Can you call MPI.Init before init_global_grid and then call init_global_grid with the keyword init_MPI=false and check if the error occurs in MPI.Init?

Yes, if I do this the same error occurs in MPI.Init().

Ah, looking at it more closely, it seems that MPI.Init() fails when placed right before the init_global_grid call, but it does work if it happens "early enough" in the process. Characterizing more precisely what "early enough" means should help us find what breaks MPI.

In any case, it definitely sounds like (1) this issue doesn't have much (if anything) to do with ImplicitGlobalGrid or ParallelStencil, and (2) you've probably given us enough insight to debug things on our own.

Many many thanks! We'll report back if/when we find what happens, in case this could help other ImplicitGlobalGrid users.

luraess (Collaborator) commented Oct 6, 2023

It seems the shmem issue you are getting is indeed related to the device selection procedure, where the approach is to use all ranks on the same shared-memory location to define the node-local rank and thus map GPUs to ranks. It could be that there is an issue with this as well.

If you manage to make an MWE or reproducer, then we could also look further into it if needed. And if it turns out to be an issue not related to IGG, maybe posting it on Discourse could help other users know about it.
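
For illustration only (this is a sketch of the approach described above, not IGG's actual implementation), mapping GPUs to ranks via the node-local rank can look like this:

using MPI, CUDA
MPI.Init()
comm = MPI.COMM_WORLD
# Split the communicator by shared-memory domain (i.e. by node)...
comm_node = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
# ...and use the node-local rank to select a GPU on that node.
local_rank = MPI.Comm_rank(comm_node)
CUDA.device!(local_rank % length(CUDA.devices()))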

ffevotte commented Oct 7, 2023

OK, I think I found the issue. It is a "long distance" bug that has nothing to do with either ImplicitGlobalGrid or ParallelStencil. It is rather caused by HDF5, which we use in our larger project. This tiny example demonstrates it clearly I think:

using MPI
# using HDF5 # <-- Uncomment this and `MPI.Init()` breaks when using system-wide MPI
MPI.Init()

comm = MPI.COMM_WORLD
print("Hello world, I am rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")
MPI.Barrier(comm)

It seems we're hitting JuliaIO/HDF5.jl#1079, which is due to recent HDF5.jl versions always loading the JLL version of MPI, regardless of what MPIPreferences.jl says. As mentioned in that issue, reverting to an older version of HDF5_jll avoids the problem.
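
In case it helps other readers hitting the same problem, which MPI library MPI.jl actually binds to can be checked along these lines (a minimal sketch):

using MPI, MPIPreferences
println(MPIPreferences.binary)       # "system" when MPIPreferences is in effect, e.g. "MPICH_jll" otherwise
MPI.Init()
println(MPI.Get_library_version())   # version string of the MPI library that was really loaded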

In any case, thank you for your time and for helping us find the origin of the problem!

ffevotte commented Oct 7, 2023

Actually, it might be that you are running more processes than you have GPUs and the error occurs rightfully. If your GPUs are distributed on multiple nodes, then you can solve this, for example, with a hostfile.

Yes, it might very well be that this error occurs rightfully, and we simply do not use IGG in the correct way.

In this specific instance, for example, our objective was to test that everything works first in a multi-CPU context. So we're launching a few MPI processes on one compute node (that happens to have a GPU, but we don't really want to use it, only the CPUs). To be clear, IGG is right to say that there are more MPI processes than GPUs, but this is also precisely the situation we want to be in.

What would be the correct way to do this? Should we use a hostfile rather than using the select_device kwarg to init_global_grid?

omlins (Collaborator) commented Oct 9, 2023

Okay, then this means that the error occurs rightfully and it is correct usage to deactivate device selection by setting the keyword select_device=false. By doing so, IGG won't do anything related to GPUs, as everything else is done in update_halo! based on the array types passed as arguments (relying on multiple dispatch). That said, we could add an option "none" for the keyword device_type (which would automatically set select_device=false) in order to make your particular use case clearer. Would this make the code look fully clear to you?

Should we use a hostfile rather than using the select_device kwarg to init_global_grid?

No, a hostfile would only be something to use if you are trying to run multiple processes on different nodes but, due to the way you're launching your commands, they are not placed on the nodes as you wish.
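
To illustrate the CPU-only usage discussed above, a minimal sketch (grid sizes arbitrary) could be:

using ImplicitGlobalGrid
nx = ny = nz = 64
me, dims, nprocs, coords, comm = init_global_grid(nx, ny, nz; select_device=false)
A = zeros(nx, ny, nz)   # plain CPU array; update_halo! dispatches on the array type
update_halo!(A)
finalize_global_grid()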

ffevotte commented Oct 9, 2023

Thanks for your answer!

that said, we could add an option "none" to the keyword device_type (which would automatically set select_device=false), in order to make your particular use case more clear. Would this make the code look fully clear to you?

Yes, that's a very good idea. I have to admit I was at first a bit puzzled that there wasn't any option available for device_type to ask for a CPU-only computation. But select_device appears right after device_type in the documentation, and it is relatively clear that select_device=false disables automatic GPU selection, so the current API works, too.

As I was writing above, I tend to think that a simple comment in the 50-lines multi-xPU example might help future users find this option more easily. Maybe simply something along the lines of:

# Numerics
nx, ny, nz = 256, 256, 256;                              # Number of gridpoints in dimensions x, y and z.
nt         = 100;                                        # Number of time steps

# Add the select_device=false / device_type="none" kwarg below
# to ignore GPUs in a multi-CPU configuration
init_global_grid(nx, ny, nz);

dx         = lx/(nx_g()-1);                              # Space step in x-dimension
dy         = ly/(ny_g()-1);                              # Space step in y-dimension
dz         = lz/(nz_g()-1);                              # Space step in z-dimension

omlins (Collaborator) commented Dec 1, 2023

#79 : @ffevotte @LaurentPlagne
