
OpenMPI Bcast and Allreduce much slower than Intel-MPI (unrelated to EFA) #1436

Closed

JiaweiZhuang opened this issue Nov 8, 2019 · 6 comments

JiaweiZhuang commented Nov 8, 2019

Environment

  • AWS ParallelCluster version: 2.4.1
  • OS: CentOS7
  • Scheduler: Slurm
  • Master instance type: c5n.xlarge
  • Compute instance type: c5n.18xlarge

Bug description and how to reproduce

I found that OpenMPI collectives like Bcast and Allreduce are 3~5 times slower than Intel-MPI. This is unrelated to the intra-node EFA latency issue at #1143 (comment), since the performance gap still exists without EFA enabled. I ran an extensive benchmark across various MPI implementations and versions, and tweaked the collective algorithms via I_MPI_ADJUST_BCAST for Intel-MPI and --mca coll_tuned_bcast_algorithm for OpenMPI. The results are summarized below. All results can be reproduced from this repo: https://github.com/JiaweiZhuang/aws-mpi-benchmark
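
For context, those two knobs were set roughly as follows (a minimal sketch; the rank count and the ./osu_bcast path are illustrative, and algorithm 7 is the OpenMPI4 knomial tree discussed later):

# Intel-MPI: select a Bcast algorithm by index (the benchmark swept the
# indices documented for I_MPI_ADJUST_BCAST)
export I_MPI_ADJUST_BCAST=1
mpirun -n 288 ./osu_bcast

# OpenMPI: force a specific tuned-collective Bcast algorithm; dynamic rules
# must be enabled for the forced value to take effect
mpirun -n 288 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_bcast_algorithm 7 \
    ./osu_bcast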

I specifically need help with:

  1. Check whether my benchmark was performed correctly. Is such a performance gap real, or just a misconfiguration on my side?
  2. See whether it is possible to improve OpenMPI performance by fine-tuning its MCA parameters. Resolving the OpenMPI EFA latency problem (Abnormal in-node latency with EFA enabled, #1143) alone is probably not enough to close such a large performance gap.

I understand that it is difficult for an open-source MPI implementation to beat Intel-MPI's closed-source magic. I just don't want to incorrectly blame OpenMPI for slow performance if I made a mistake somewhere.

Or should I report this on OpenMPI's issue tracker instead? Given that this problem is specific to the EC2 / ParallelCluster environment, and that OpenMPI works fine on my local InfiniBand cluster, I think it makes sense to post it here?

Other collectives could have a similar problem, but I haven't tested them. Bcast and Allreduce are probably two of the most commonly used collective calls (e.g. Table 3 in this paper), so I focus on them here.

Benchmark results

General config:

  • MPI libraries: Intel-MPI 2019.4, OpenMPI 3.1.4, OpenMPI 4.0.1, MPICH 3.3.1
  • One MPI rank per physical core, i.e. 36 ranks for one c5n.18xlarge node.
  • All OpenMPI cases below do not use EFA, to avoid the abnormal in-node latency with EFA enabled (#1143); see the launch sketch after this list.
  • Each run is repeated 5 times and averaged, on top of OSU's internal 1000/100-iteration averaging.
  • Except for the choice of collective algorithm, MPI parameters are kept at their defaults.
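
As a concrete illustration of the launch configuration above, an OpenMPI run without EFA can be pinned to TCP roughly like this (a sketch, assuming a Slurm allocation of 8 c5n.18xlarge nodes; the OSU binary path is illustrative):

# 288 ranks = 8 nodes x 36 ranks, one rank per physical core,
# forcing the ob1 PML with the TCP/shared-memory BTLs so the
# libfabric/EFA path is not used
mpirun -np 288 --map-by ppr:36:node --bind-to core \
    --mca pml ob1 \
    --mca btl tcp,self,vader \
    ./osu_allreduce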

1. Time vs number of cores

Bcast

The out-of-the-box OpenMPI is 4x slower than IntelMPI-EFA, and still 3x slower than IntelMPI-TCP without EFA enabled. So the performance gap comes mainly from the collective algorithms, and less from the EFA vs TCP difference. The "knomial tree" algorithm newly introduced in OpenMPI4 significantly improves performance and gets closer to IntelMPI-TCP; the default algorithm in OpenMPI4 is still relatively slow. See the later section "Try all collective algorithms" for all algorithms.

[figures: bcast_scaling_1048576, bcast_scaling_65536]

Allreduce

For a large message, OpenMPI is 5x slower than Intel-MPI (with or without EFA). For smaller messages, the performance gap is even larger.

[figures: allreduce_1048576, allreduce_65536]

2. Time vs message size

These are the commonly used time vs message size plots from the OSU benchmarks, in case you want to see them.

Bcast

[figures: bcast_small_facet, bcast_large_facet]

Allreduce

[figures: allreduce_small_facet, allreduce_large_facet]

3. Try all collective algorithms

Most Bcast algorithms are reviewed in the paper "Performance Analysis of MPI Collective Operations". Allreduce algorithms are reviewed in "Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations".

The x-axis is clipped so that very slow algorithms are not shown; see the numbers on each bar for the actual times.

Bcast

The "knormial tree OpenMPI4" is significantly faster than default, and also scales better with more cores as shown previously (here only uses 288 cores).

[figure: bcast_all_algo_1048576]

Smaller message:


[figure: bcast_all_algo_65536]

Allreduce

I am not sure why the same algorithm can have very different performance between OpenMPI3 and OpenMPI4 (e.g. "recursive doubling" and "basic linear"). My guess is that some algorithm names exist in OpenMPI3 but are not actually implemented there, so it just falls back to the default?
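
One way to check what each Open MPI build actually exposes is to list the tuned-collective parameters and their accepted values with ompi_info (the grep pattern below is just an illustration):

# Show the coll/tuned MCA parameters, including the enumerated
# algorithm values recognized by this particular Open MPI build
ompi_info --param coll tuned --level 9 | grep -E "bcast_algorithm|allreduce_algorithm"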

[figure: allreduce_all_algo_1048576]

Smaller message:


[figure: allreduce_all_algo_65536]

Some algorithms in OpenMPI4 can be faster than the default, but only for specific message sizes. For example, "recursive doubling" is fast for medium-size messages but becomes very slow for large messages.

[figure: allreduce_important_algo]

Tentative conclusion

If the above analysis is correct, my tentative conclusion is: for Bcast, using the OpenMPI4 knomial tree (--mca coll_tuned_bcast_algorithm 7) gives a 2~3x performance boost over out-of-the-box OpenMPI3/OpenMPI4, getting closer to IntelMPI without EFA. For Allreduce, a clever switch between algorithms, eager vs rendezvous protocols, etc., depending on message size, could give a ~30% speed-up, but that is still 2x away from IntelMPI-TCP and 3x away from IntelMPI-EFA.
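
For anyone who wants to apply the Bcast setting persistently rather than on every mpirun command line, a minimal sketch using Open MPI's per-user MCA parameter file (assuming the default ~/.openmpi/mca-params.conf location):

# ~/.openmpi/mca-params.conf  (read automatically by mpirun)
# Enable dynamic rules so the forced algorithm takes effect,
# then select the knomial-tree Bcast (algorithm 7 in OpenMPI4)
coll_tuned_use_dynamic_rules = 1
coll_tuned_bcast_algorithm = 7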

The easiest solution is probably "just use Intel MPI on EC2" if you don't care whether the library is open-source or not. Tuning every collective function is like reinventing an MPI library, and most users shouldn't have to worry about that.


JiaweiZhuang commented Nov 8, 2019

I notice that most HPC & ML articles on the AWS Blog use OpenMPI.

Given such popularity, it is probably worth spending some effort optimizing OpenMPI collectives for the EC2 cluster environment. "Just use Intel MPI on EC2" might not be a viable solution for a well-established user software stack. GPU/NCCL will further complicate the problem; I am curious to see how this issue affects distributed GPU workloads...

@sean-smith
Contributor

Hi @JiaweiZhuang Thanks for the detailed analysis! We're looking into it.


ErikLacharite commented Nov 28, 2019

I have also noticed in the past that Intel-MPI 2019 was 2x slower than Intel-MPI 2018. It may be worthwhile to add Intel-MPI 2018 to your testing as well.

Install:

yum-config-manager --add-repo https://yum.repos.intel.com/mpi/setup/intel-mpi.repo
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-
yum install intel-mpi-rt-2018.4-274.x86_64

Configure:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-impi

@ErikLacharite

> Use one MPI rank for one physical core, thus 36 ranks for 1 c5n.18xlarge node.

There may also be a difference in the default binding/pinning behavior between IntelMPI and OpenMPI (since the benchmarks above were done on a 72-core machine with 36 processes). Disabling hyperthreading may produce different results. It may not make a huge difference in the micro-benchmarks above, but in real-world applications it can be significant.

https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html#disable-hyperthreading
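
For what it's worth, a minimal sketch of making the binding explicit on both sides (one rank per physical core; the flags and environment variables below are the standard binding controls rather than settings taken from this thread, and the benchmark path is illustrative):

# OpenMPI: bind one rank per physical core and print the resulting bindings
mpirun -np 36 --map-by core --bind-to core --report-bindings ./osu_allreduce

# Intel-MPI: pin one rank per physical core
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
mpirun -n 36 ./osu_allreduce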


rajachan commented Dec 2, 2019

@JiaweiZhuang - Thanks again for the thorough analysis and the report. We are aware of the collective performance issue with Open MPI's OFI MTL. Like you've concluded, this has to do with the default thresholds for switching between various collective algorithms. We are seeing a significant improvement when using the right algorithms for the various collectives in our internal testing as well. The default values are suboptimal for both TCP and EFA when running with the OFI MTL and they need to be tuned. Running exhaustive experiments and tuning these thresholds is definitely on our roadmap.

@demartinofra
Contributor

@JiaweiZhuang if you are ok with it, I'm going to close this one, since the reported issue is not strictly related to ParallelCluster; there is no real action item we can take other than installing the latest OpenMPI once this enhancement is published upstream. As @rajachan mentioned, this is something we (as AWS) are tracking on our roadmap and plan to contribute to.

Again, thanks for the very valid and thorough analysis.
