MPI hanging if MPI_Init_thread used on generoso with foss/2023a (OpenMPI 4.1.5 / libfabric 1.18.0) #18925

Closed
branfosj opened this issue Oct 5, 2023 · 6 comments


@branfosj
Member

branfosj commented Oct 5, 2023

It is hanging in the PSM3 provider in libfabric. We are seeing this when testing on generoso in #18443, #18444, #18731.

Error:

$ mpirun -n 2 ./a.out
login1:rank0.a.out: Failed to get eth0 (unit 0) cpu set
login1:rank0: PSM3 can't open nic unit: 0 (err=23)
login1:rank0: PSM3 can't open nic unit: 0 (err=23)
login1:rank0.a.out: Failed to get eth0 (unit 0) cpu set
login1:rank1: PSM3 can't open nic unit: 0 (err=23)
login1:rank1.a.out: Failed to get eth0 (unit 0) cpu set
login1:rank1.a.out: Failed to get eth0 (unit 0) cpu set
login1:rank1: PSM3 can't open nic unit: 0 (err=23)

If we control libfabric with FI_PROVIDER (such as FI_PROVIDER="udp,tcp" or FI_PROVIDER="psm2") then the example completes. It fails if we set FI_PROVIDER="psm3".
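Because the failure shows up as a hang, comparing providers is easier with a time limit on each run. The following is a minimal sketch, not something taken from this thread: `try_provider` is a hypothetical helper name, the 60-second limit in the example is arbitrary, and it assumes GNU coreutils `timeout` is available and that all ranks run on the local node so they inherit the environment of `mpirun`.

```shell
# Hypothetical helper (not from the issue): run a command under one
# FI_PROVIDER value with a time limit, so a hang is reported instead of
# blocking the shell forever.
try_provider() {
    tp_provider=$1
    tp_limit=$2
    shift 2
    tp_status=0
    # timeout exits with 124 when the time limit is reached
    FI_PROVIDER="$tp_provider" timeout "$tp_limit" "$@" >/dev/null 2>&1 || tp_status=$?
    case $tp_status in
        0)   echo "$tp_provider: completed" ;;
        124) echo "$tp_provider: timed out (possible hang)" ;;
        *)   echo "$tp_provider: failed (exit $tp_status)" ;;
    esac
}

# Example with the provider values mentioned above:
# for p in "udp,tcp" psm2 psm3; do try_provider "$p" 60 mpirun -n 2 ./a.out; done
```

The helper discards the program output on purpose; drop the redirection to see the PSM3 messages themselves.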

Running FI_LOG_LEVEL=debug mpirun -n 2 ./a.out gives lots of output. I think the relevant line is:

libfabric:3915417:1696406953::psm3:core:psmx3_trx_ctxt_alloc():320<warn> login1:rank1: psm3_ep_open returns 23, errno=2

From the code:

libfabric-1.18.0/prov/psm3/src/psmx3_util.c:    -FI_EIO,        /* PSM2_EP_DEVICE_FAILURE = 23 */
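
Since the debug log is large, it can help to narrow it down to warnings from the PSM3 provider, such as the line quoted above. This is a rough sketch: `psm3_warnings` is a hypothetical name, and the pattern is only inferred from the format of that single log line, so it may miss messages formatted differently.

```shell
# Hypothetical filter (not from the issue): keep only warning lines emitted
# by the psm3 provider in libfabric debug output read from stdin.
psm3_warnings() {
    grep -E 'libfabric:.*:psm3:.*<warn>'
}

# Example (assumes the log is written to stderr, hence the 2>&1):
# FI_LOG_LEVEL=debug mpirun -n 2 ./a.out 2>&1 | psm3_warnings
```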
@boegel
Member

boegel commented Oct 5, 2023

Also worth mentioning is that we're not seeing this with OpenMPI/4.1.6-GCC-13.2.0 (libfabric 1.19.0)

@branfosj
Member Author

branfosj commented Oct 5, 2023

These settings also solve the issue:

  • PSM3_HAL=loopback
  • PSM3_DEVICES=self
  • PSM3_DEVICES=shm
  • FI_PROVIDER="^psm3"

Test ideas from a comment on open-mpi/ompi#11295:

Please try setting export PSM3_HAL=loopback. This should stop PSM3 from connecting to any NICs by using the loopback HAL.
You can also set export PSM3_DEVICES=self to disable the nic and shm devices from being used by psm3. (Default is self,shm,nic)

@boegel
Member

boegel commented Oct 5, 2023

So, which option is the safest (in particular for generoso)?

@boegel
Member

boegel commented Oct 5, 2023

workaround for generoso: boegel/boegelbot#28

@boegel
Member

boegel commented Oct 11, 2023

@branfosj We can close this one now, right?

@boegel boegel modified the milestones: next release (4.8.2?), 4.x Oct 11, 2023
@branfosj
Member Author

branfosj commented Oct 11, 2023

@branfosj We can close this one now, right?

Yes.

For reference, we've decided on

PSM3_DEVICES='self,shm' 

as this fixed the issue we were seeing and looked to be the change least likely to break anything else.
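
One way to apply this setting to a single command rather than the whole session is sketched below. `with_psm3_no_nic` is a hypothetical wrapper name, and this is not necessarily how the generoso workaround was implemented; it only shows the variable being set for one invocation.

```shell
# Hypothetical wrapper (not from the issue): run a command with PSM3
# restricted to the self and shm devices, leaving out the nic device
# from the default list (self,shm,nic).
with_psm3_no_nic() {
    env PSM3_DEVICES='self,shm' "$@"
}

# Example:
# with_psm3_no_nic mpirun -n 2 ./a.out
```

For runs that span several nodes, the variable may also need to be forwarded to the remote ranks, for example with the `-x PSM3_DEVICES` option of Open MPI's mpirun; that case was not tested in this thread.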
