Intel MPI failure with EFA enabled #1988

Closed
WilliamDowns opened this issue Aug 28, 2020 · 9 comments
@WilliamDowns

Environment:

  • AWS ParallelCluster / CfnCluster version: 2.8.0 or 2.7.0
  • ParallelCluster configuration file:
[aws]
aws_region_name = us-east-1

[global]
cluster_template = centos_test
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster centos_test]
key_name = KEYNAME
base_os = centos7
scheduler = slurm
master_instance_type = c5n.9xlarge
compute_instance_type = c5n.18xlarge
maintain_initial_size = true
vpc_settings = default
enable_efa = compute
cluster_type = spot
placement_group = DYNAMIC
placement = cluster
initial_queue_size = 1
max_queue_size = 8
master_root_volume_size = 200
compute_root_volume_size = 40
ebs_settings = shared
disable_hyperthreading = true

[ebs shared]
shared_dir = shared
volume_type = st1
volume_size = 500

[vpc default]
vpc_id = VPCID
master_subnet_id = MASTERID
compute_subnet_id = SUBNETID
#use_public_ips = false

  • Other environment settings:
spack unload
export PATH=`echo $PATH | tr ":" "\n" | grep -v "openmpi" | tr "\n" ":"`
export SPACK_ROOT=/shared/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack load emacs
#==============================================================================
# %%%%% Load Spackages %%%%%
#==============================================================================
spack load gcc@9.3.0
spack load git%gcc@9.3.0
spack load cmake%gcc@9.3.0

module load libfabric-aws/1.10.1amzn1.1
#module load intelmpi
#source /opt/intel/impi/2019.7.217/intel64/bin/mpivars.sh
source $(spack location -i intel-mpi)/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpivars.sh
spack load intel-mpi

export I_MPI_CC=gcc
export I_MPI_CXX=g++
export I_MPI_FC=gfortran
export I_MPI_F77=gfortran
export I_MPI_F90=gfortran
export I_MPI_DEBUG=5
#export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi.so
unset I_MPI_PMI_LIBRARY
#export I_MPI_FABRICS="shm:ofi"
#unset I_MPI_FABRICS
export FI_PROVIDER=efa
#export FI_PROVIDER=tcp
export FI_LOG_LEVEL=debug

spack load netcdf-fortran%gcc@9.3.0

export MPI_ROOT=$(spack location -i intel-mpi)

# Make all files world-readable by default
umask 022

# Specify compilers
export CC=gcc
export CXX=g++
export FC=gfortran

# For ESMF
export ESMF_COMPILER=gfortran
export ESMF_COMM=intelmpi
export ESMF_DIR=/shared/ESMF
export ESMF_INSTALL_PREFIX=${ESMF_DIR}/INSTALL_intelmpi_2019_8

# For GCHP
export ESMF_ROOT=${ESMF_INSTALL_PREFIX}

#==============================================================================
# Set limits
#==============================================================================

#ulimit -c 0                      # coredumpsize
export OMP_STACKSIZE=500m
ulimit -l unlimited              # memorylocked
ulimit -u 50000                  # maxproc
ulimit -v unlimited              # vmemoryuse
ulimit -s unlimited              # stacksize

#==============================================================================
# Print information
#==============================================================================

#module list
echo ""
echo "Environment:"
echo ""
echo "CC: ${CC}"
echo "CXX: ${CXX}"
echo "FC: ${FC}"
echo "ESMF_COMM: ${ESMF_COMM}"
echo "ESMF_COMPILER: ${ESMF_COMPILER}"
echo "ESMF_DIR: ${ESMF_DIR}"
echo "ESMF_INSTALL_PREFIX: ${ESMF_INSTALL_PREFIX}"
echo "ESMF_ROOT: ${ESMF_ROOT}"
echo "MPI_ROOT: ${MPI_ROOT}"
echo "NetCDF C: $(nc-config --prefix)"
#echo "NetCDF Fortran: $(nf-config --prefix)"
echo ""
echo "Done sourcing ${BASH_SOURCE[0]}"

Bug description and how to reproduce:

I'm attempting to run the GEOS-Chem High Performance (GCHP) model on a ParallelCluster setup. This has been done successfully in the past with older versions of GCHP and ParallelCluster (see work by @JiaweiZhuang as documented here). I'm using a setup similar to the one described at that link, except that I am using GNU compilers rather than Intel ones, along with newer versions of Intel MPI and ParallelCluster. I have been unable to complete a Slurm-submitted run of the model using Intel MPI with EFA on ParallelCluster 2.8.0; every attempt fails with the following error:

libfabric:5288:efa:mr:efa_mr_reg_impl():312<warn> Unable to register MR: Unknown error -22
libfabric:5288:efa:mr:efa_mr_regattr():400<warn> Unable to register MR: Invalid argument
libfabric:5288:efa:mr:rxr_mr_regattr():108<warn> Unable to register MR buf (Invalid argument): 0x7ffed9d5d790 len: 0
libfabric:5289:efa:mr:efa_mr_reg_impl():312<warn> Unable to register MR: Unknown error -22
libfabric:5289:efa:mr:efa_mr_regattr():400<warn> Unable to register MR: Invalid argument
libfabric:5289:efa:mr:rxr_mr_regattr():108<warn> Unable to register MR buf (Invalid argument): 0x7ffcc1bc1990 len: 0
Abort(807023887) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
PMPI_Win_create(201)......: MPI_Win_create(base=0x7ffcc1bc1990, size=0, disp_unit=1, MPI_INFO_NULL, comm=0x84000006, win=0x51f517c) failed
MPID_Win_create(120)......: 
MPIDIG_mpi_win_create(828): 

This failure occurs with both the pre-installed version of Intel MPI that ships with ParallelCluster 2.8.0 (2019 Update 7) and a Spack-built 2019 Update 8. The model runs successfully under mpirun on the Master node alone, but any submission through Slurm (either using srun directly or a batch script containing srun or mpirun) fails instantly. The failure only occurs when EFA is the fabric provider; setting FI_PROVIDER to tcp or something similar results in a successful but extremely slow run (several times slower than running locally on the Master node).

Additional context:
Runs using other MPI implementations (I have tested OpenMPI and MVAPICH2) complete successfully but extremely slowly compared to local Master-node runs when submitted through Slurm across one or multiple nodes. I am not sure whether the poor performance with these implementations is due to a lack of EC2-specific optimization, as described in #1436, or whether I am simply missing some necessary configuration steps. Assuming the former, I would like to get Intel MPI working (and also avoid the Intel compilers because of their licensing requirements) for running the model on EC2. I will note that the model performs roughly equally across different MPI implementations, including Intel MPI, on my institution's local computing cluster, and it actually runs successfully on multiple nodes there.

I've pasted my environment setup and other cluster settings above (I've tried most combinations of the commented-out options). I've also attached my Slurm job submission script and a lengthy output file containing I_MPI_DEBUG and FI_LOG_LEVEL output. Let me know if you would like any additional clarification or would like me to run more specific tests.

slurm_script.txt
run_output.txt
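
For reference, here is a minimal sketch of the kind of job script I'm submitting; the node/task counts, paths, and executable name are placeholders rather than the contents of the attached slurm_script.txt:

#!/bin/bash
#SBATCH --job-name=gchp_efa
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=36
#SBATCH --exclusive

# Source the environment settings shown above (placeholder path)
source /shared/gchp.env

# Run over EFA; switching to FI_PROVIDER=tcp avoids the crash but is very slow
export FI_PROVIDER=efa
export I_MPI_DEBUG=5
export FI_LOG_LEVEL=debug

# Either launcher fails the same way when submitted through Slurm
mpirun -np 72 ./gchp
# srun -n 72 ./gchp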

@ddeidda
Contributor

ddeidda commented Sep 1, 2020

Hi William, looking into it.

@WilliamDowns
Author

To add, I can run the full IMPI benchmark suite on the master node successfully if I do not specify FI_PROVIDER=efa (which makes sense, I suppose, given that EFA is not enabled on the master node). The same environment setup results in a failure when I submit my model run through Slurm. SSHing into a compute node and running the same benchmark suite with FI_PROVIDER=efa seems okay so far (still running). Running the model fails when I attempt an interactive (non-Slurm) run using mpirun while SSHed into a compute node with FI_PROVIDER=efa. The full error output is below (very similar to the error output from a Slurm-submitted run):

libfabric:5464:efa:mr:efa_mr_reg():254<warn> efa_cmd_reg_mr: Invalid argument(22)
libfabric:5464:efa:mr:efa_mr_reg():267<warn> Unable to register MR: Invalid argument
libfabric:5464:efa:mr:rxr_mr_regattr():169<warn> Unable to register MR buf (Invalid argument): 0x7ffc94c94c40 len: 0
libfabric:5463:efa:mr:efa_mr_reg():254<warn> efa_cmd_reg_mr: Invalid argument(22)
libfabric:5463:efa:mr:efa_mr_reg():267<warn> Unable to register MR: Invalid argument
libfabric:5463:efa:mr:rxr_mr_regattr():169<warn> Unable to register MR buf (Invalid argument): 0x7fffbffd54e0 len: 0
Abort(941241615) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
PMPI_Win_create(201)......: MPI_Win_create(base=0x7fffbffd54e0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0x84000006, win=0x5cdbf0c) failed
MPID_Win_create(120)......:
MPIDIG_mpi_win_create(819):
win_allgather(195)........: OFI memory registration failed (ofi_win.c:195:win_allgather:Invalid argument)
Abort(1717519) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
PMPI_Win_create(201)......: MPI_Win_create(base=0x7ffc94c94c40, size=0, disp_unit=1, MPI_INFO_NULL, comm=0x84000006, win=0x631ef0c) failed
MPID_Win_create(120)......:
MPIDIG_mpi_win_create(819):
win_allgather(195)........: OFI memory registration failed (ofi_win.c:195:win_allgather:Invalid argument)

The previous working configuration created by @JiaweiZhuang used ParallelCluster 2.4.1, so I'm not sure whether something has changed in ParallelCluster or Intel MPI since then that requires my environment to be set up differently.

@WilliamDowns
Author

WilliamDowns commented Sep 9, 2020

I believe I've narrowed down the issue: it turns out I cannot complete the full IMPI benchmark suite on the compute node with FI_PROVIDER=efa. All of the benchmarks succeed except for the EXT benchmark set. I tested the benchmarks from this set individually, and the only one that fails is Window, which produces the same error I get when running my model with EFA enabled (see my comment above). The benchmark error also occurs if I do not specify FI_PROVIDER=efa, but it does not occur if I specify another provider such as tcp. I've attached the output from the failed Window run, along with a log from the Accumulate benchmark for comparison.
log.window.txt
log.accumulate.txt
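
To reproduce just the benchmark failure on a compute node, something along these lines is enough (IMB-EXT, Window, and Accumulate are the standard benchmark executables/names shipped with Intel MPI; the process count is arbitrary):

export FI_PROVIDER=efa
# Passes: other one-sided benchmarks in the EXT set, e.g. Accumulate
mpirun -np 2 IMB-EXT Accumulate
# Fails with the memory-registration error shown in my earlier comment
mpirun -np 2 IMB-EXT Window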

@WilliamDowns WilliamDowns changed the title Slurm jobs fail when using Intel MPI with EFA enabled Intel MPI failure with EFA enabled Sep 9, 2020
@rwespetal

I apologize for the delayed response here and thank you for the update and detailed logs.

We've been able to reproduce the memory registration error you are seeing with EFA and are investigating a fix. You're correct that the error you're seeing with GEOS-Chem and the IMB-EXT Window test is the same bug.

@rwespetal

I've been corresponding with the Libfabric community here and we've asked the Intel MPI team to look into this behavior.

Intel suggested turning off direct RMA operations using the environment variable MPIR_CVAR_CH4_OFI_ENABLE_RMA=0. I can't guarantee this will fix the issue you are experiencing with the application, but this workaround does fix the IMB-EXT Window test for me. I don't expect a performance impact using this option as the EFA Libfabric provider already emulates RMA operations.
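
Something like the following in the job environment before launching should be enough to try it; only the MPIR_CVAR_CH4_OFI_ENABLE_RMA setting is the actual suggestion, the launch line is just an illustration:

# Disable Intel MPI's direct use of OFI RMA; one-sided operations then take the
# emulated path (the EFA Libfabric provider already emulates RMA anyway).
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0
export FI_PROVIDER=efa
mpirun -np $SLURM_NTASKS ./gchp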

@WilliamDowns
Author

I can confirm this works and I can now run the model to completion (though now I'm also running into srun being much slower than mpirun, but this is unrelated). Thanks very much for the tip!

@rwespetal

Thank you for testing that and confirming. We'll continue work on a fix for the issue.

@wzamazon

wzamazon commented Nov 5, 2020

The libfabric community discussed the topic of zero-length memory registration in

ofiwg/libfabric#6245

and concluded that zero-byte memory registration is invalid behavior. The documentation has been updated accordingly in

ofiwg/libfabric#6297

so in this case Intel MPI is not following the libfabric standard.

@demartinofra
Contributor

I'm going to resolve this issue since there isn't anything we can do on the ParallelCluster side other than upgrade to the latest Intel MPI library once Intel aligns with the libfabric standard. We have relayed this information to Intel, which is aware of the issue.
