
OpenMPI Bcast and Allreduce much slower than Intel-MPI (unrelated to EFA) #1436

Closed

JiaweiZhuang opened this issue Nov 8, 2019 · 6 comments

JiaweiZhuang commented Nov 8, 2019

Environment

  • AWS ParallelCluster version: 2.4.1
  • OS: CentOS7
  • Scheduler: Slurm
  • Master instance type: c5n.xlarge
  • Compute instance type: c5n.18xlarge

Bug description and how to reproduce

I found that OpenMPI collectives like Bcast and Allreduce are 3~5 times slower than Intel-MPI. This is unrelated to the intra-node EFA latency issue at #1143 (comment), since the performance gap still exists without EFA enabled. I ran an extensive benchmark across various MPI implementations and versions, and tweaked the collective algorithms via I_MPI_ADJUST_BCAST for Intel-MPI and --mca coll_tuned_bcast_algorithm for OpenMPI. The results are summarized below. All results can be reproduced from this repo: https://github.com/JiaweiZhuang/aws-mpi-benchmark
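
For context, those two knobs were set roughly as follows (a minimal sketch; the rank count and the ./osu_bcast path are illustrative, and algorithm 7 is the OpenMPI4 knomial tree discussed later):

# Intel-MPI: select a Bcast algorithm by index (the benchmark swept the
# indices documented for I_MPI_ADJUST_BCAST)
export I_MPI_ADJUST_BCAST=1
mpirun -n 288 ./osu_bcast

# OpenMPI: force a specific tuned-collective Bcast algorithm; dynamic rules
# must be enabled for the forced value to take effect
mpirun -n 288 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_bcast_algorithm 7 \
    ./osu_bcast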

I specifically need help with:

  1. Check whether my benchmark was performed correctly. Is such a performance gap real, or just a misconfiguration on my side?
  2. See whether it is possible to improve OpenMPI performance by fine-tuning its MCA parameters. Resolving the OpenMPI EFA latency problem (Abnormal in-node latency with EFA enabled, #1143) alone is probably not enough to close such a large performance gap.

I understand that it is difficult for an open-source MPI implementation to beat Intel-MPI's closed-source magic. I just don't want to incorrectly blame OpenMPI for slow performance if I made a mistake somewhere.

Or should I report this on OpenMPI's issue tracker instead? Given that this problem is specific to the EC2 / ParallelCluster environment, and that OpenMPI works fine on my local InfiniBand cluster, I think it makes sense to post it here?

Other collectives could have a similar problem, but I haven't tested them. Bcast and Allreduce are probably two of the most commonly used collective calls (e.g. Table 3 in this paper), so I focus on them here.

Benchmark results

General config:

  • MPI libraries: Intel-MPI 2019.4, OpenMPI 3.1.4, OpenMPI 4.0.1, MPICH 3.3.1
  • One MPI rank per physical core, i.e. 36 ranks for one c5n.18xlarge node.
  • All OpenMPI cases below do not use EFA, to avoid the abnormal in-node latency with EFA enabled (#1143); see the launch sketch after this list.
  • Each run is repeated 5 times and averaged, on top of OSU's internal 1000/100-iteration averaging.
  • Except for the choice of collective algorithm, MPI parameters are kept at their defaults.
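
As a concrete illustration of the launch configuration above, an OpenMPI run without EFA can be pinned to TCP roughly like this (a sketch, assuming a Slurm allocation of 8 c5n.18xlarge nodes; the OSU binary path is illustrative):

# 288 ranks = 8 nodes x 36 ranks, one rank per physical core,
# forcing the ob1 PML with the TCP/shared-memory BTLs so the
# libfabric/EFA path is not used
mpirun -np 288 --map-by ppr:36:node --bind-to core \
    --mca pml ob1 \
    --mca btl tcp,self,vader \
    ./osu_allreduce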

1. Time vs number of cores

Bcast

The out-of-the-box OpenMPI is 4x slower than IntelMPI-EFA, and still 3x slower than IntelMPI-TCP without EFA enabled. So the performance gap comes mainly from the collective algorithms, and less from the EFA vs TCP difference. The "knomial tree" algorithm newly introduced in OpenMPI4 significantly improves performance and gets closer to IntelMPI-TCP; the default algorithm in OpenMPI4 is still relatively slow. See the later section "Try all collective algorithms" for all algorithms.

[figures: bcast_scaling_1048576, bcast_scaling_65536]

Allreduce

For a large message, OpenMPI is 5x slower than Intel-MPI (with or without EFA). For smaller messages, the performance gap is even larger.

[figures: allreduce_1048576, allreduce_65536]

2. Time vs message size

These are the commonly used time vs message size plots from the OSU benchmarks, in case you want to see them.

Bcast

[figures: bcast_small_facet, bcast_large_facet]

Allreduce

[figures: allreduce_small_facet, allreduce_large_facet]

3. Try all collective algorithms

Most Bcast algorithms are reviewed in the paper "Performance Analysis of MPI Collective Operations". Allreduce algorithms are reviewed in "Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations".

The x-axis is clipped so that very slow algorithms are not shown; see the numbers on each bar for the actual times.

Bcast

The "knormial tree OpenMPI4" is significantly faster than default, and also scales better with more cores as shown previously (here only uses 288 cores).

[figure: bcast_all_algo_1048576]

Smaller message:


[figure: bcast_all_algo_65536]

Allreduce

I am not sure why the same algorithm can have very different performance between OpenMPI3 and OpenMPI4 (e.g. "recursive doubling" and "basic linear"). My guess is that some algorithm names exist in OpenMPI3 but are not actually implemented there, so it just falls back to the default?
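
One way to check what each Open MPI build actually exposes is to list the tuned-collective parameters and their accepted values with ompi_info (the grep pattern below is just an illustration):

# Show the coll/tuned MCA parameters, including the enumerated
# algorithm values recognized by this particular Open MPI build
ompi_info --param coll tuned --level 9 | grep -E "bcast_algorithm|allreduce_algorithm"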

[figure: allreduce_all_algo_1048576]

Smaller message:


[figure: allreduce_all_algo_65536]

Some algorithms in OpenMPI4 can be faster than the default, but only for specific message sizes. For example, "recursive doubling" is fast for medium-size messages but becomes very slow for large messages.

[figure: allreduce_important_algo]

Tentative conclusion

If the above analysis is correct, my tentative conclusion is: for Bcast, using the OpenMPI4 knomial tree (--mca coll_tuned_bcast_algorithm 7) gives a 2~3x performance boost over out-of-the-box OpenMPI3/OpenMPI4, getting closer to IntelMPI without EFA. For Allreduce, a clever switch between algorithms, eager vs rendezvous protocols, etc., depending on message size, could give a ~30% speed-up, but that is still 2x away from IntelMPI-TCP and 3x away from IntelMPI-EFA.
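
For anyone who wants to apply the Bcast setting persistently rather than on every mpirun command line, a minimal sketch using Open MPI's per-user MCA parameter file (assuming the default ~/.openmpi/mca-params.conf location):

# ~/.openmpi/mca-params.conf  (read automatically by mpirun)
# Enable dynamic rules so the forced algorithm takes effect,
# then select the knomial-tree Bcast (algorithm 7 in OpenMPI4)
coll_tuned_use_dynamic_rules = 1
coll_tuned_bcast_algorithm = 7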

The easiest solution is probably "just use Intel MPI on EC2" if you don't care whether the library is open-source or not. Tuning every collective function is like reinventing an MPI library, and most users shouldn't have to worry about that.


JiaweiZhuang commented Nov 8, 2019

I notice that most HPC & ML articles on the AWS Blog use OpenMPI.

Given such popularity, it is probably worth spending some effort optimizing OpenMPI collectives for the EC2 cluster environment. "Just use Intel MPI on EC2" might not be a viable solution for a well-established user software stack. GPU/NCCL will further complicate the problem; I am curious to see how this issue affects distributed GPU workloads...

@sean-smith
Contributor

Hi @JiaweiZhuang Thanks for the detailed analysis! We're looking into it.


ErikLacharite commented Nov 28, 2019

I have also noticed in the past that Intel-MPI 2019 was 2x slower than Intel-MPI 2018. It may be worthwhile to add Intel-MPI 2018 to your testing as well.

Install:

yum-config-manager --add-repo https://yum.repos.intel.com/mpi/setup/intel-mpi.repo
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-
yum install intel-mpi-rt-2018.4-274.x86_64

Configure:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-impi

@ErikLacharite

> Use one MPI rank for one physical core, thus 36 ranks for 1 c5n.18xlarge node.

There may also be a difference in the default binding/pinning behavior between IntelMPI and OpenMPI (since the benchmarks above were done on a 72-core machine with 36 processes). Disabling hyperthreading may produce different results. It may not make a huge difference in the micro-benchmarks above, but in real-world applications it can be significant.

https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html#disable-hyperthreading
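
For what it's worth, a minimal sketch of making the binding explicit on both sides (one rank per physical core; the flags and environment variables below are the standard binding controls rather than settings taken from this thread, and the benchmark path is illustrative):

# OpenMPI: bind one rank per physical core and print the resulting bindings
mpirun -np 36 --map-by core --bind-to core --report-bindings ./osu_allreduce

# Intel-MPI: pin one rank per physical core
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
mpirun -n 36 ./osu_allreduce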


rajachan commented Dec 2, 2019

@JiaweiZhuang - Thanks again for the thorough analysis and the report. We are aware of the collective performance issue with Open MPI's OFI MTL. Like you've concluded, this has to do with the default thresholds for switching between various collective algorithms. We are seeing a significant improvement when using the right algorithms for the various collectives in our internal testing as well. The default values are suboptimal for both TCP and EFA when running with the OFI MTL and they need to be tuned. Running exhaustive experiments and tuning these thresholds is definitely on our roadmap.

@demartinofra
Contributor

@JiaweiZhuang if you are ok with it, I'm going to close this one, since the reported issue is not strictly related to ParallelCluster; there is no real action item we can take other than installing the latest OpenMPI once this enhancement is published upstream. As @rajachan mentioned, this is something we (as AWS) are tracking on our roadmap and plan to contribute to.

Again, thanks for the very valid and thorough analysis.
