OpenMPI Bcast and Allreduce much slower than Intel-MPI (unrelated to EFA) #1436
Comments
I notice that most HPC & ML articles on AWS Blog use OpenMPI, for example:
Given such popularity, it is probably worth spending some effort optimizing OpenMPI collectives for the EC2 cluster environment. "Just use Intel MPI on EC2" might not be a viable solution for a well-established user software stack. GPU/NCCL will further complicate the problem; I am curious to see how such issues will affect distributed GPU workloads...
Hi @JiaweiZhuang Thanks for the detailed analysis! We're looking into it.
I also noticed in the past that Intel MPI 2019 was 2x slower than Intel MPI 2018. It may be worthwhile also adding Intel MPI 2018 to your testing. Install:
Configure:
There may also be a difference in the default binding/pinning behavior between IntelMPI and OpenMPI (since the benchmarks above were done on a 72-core machine with 36 processes). Disabling hyperthreading may produce different results. It may not make a huge difference in the micro-benchmarks above, but in real-world applications it can be significant. https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html#disable-hyperthreading
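For reference, a minimal sketch of how the pinning could be made explicit on both sides (the `./osu_allreduce` path and the rank count are placeholders, not the exact commands used in the benchmarks; the flags themselves are standard Open MPI / Intel MPI options):

```bash
# Open MPI: pin one rank per physical core and print the resulting bindings
mpirun -np 36 --bind-to core --map-by core --report-bindings ./osu_allreduce

# Intel MPI: pin ranks to physical cores and print the pinning at startup
I_MPI_PIN=1 I_MPI_PIN_PROCESSOR_LIST=allcores I_MPI_DEBUG=4 mpirun -np 36 ./osu_allreduce
```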
@JiaweiZhuang - Thanks again for the thorough analysis and the report. We are aware of the collective performance issue with Open MPI's OFI MTL. Like you've concluded, this has to do with the default thresholds for switching between various collective algorithms. We are seeing a significant improvement when using the right algorithms for the various collectives even with our internal testing. The default values are suboptimal for both TCP and EFA when running with the OFI MTL and they need to be tuned. Running exhaustive experiments and tuning these thresholds is definitely on our roadmap.
@JiaweiZhuang if you are OK with it, I'm going to close this one since the reported issue is not something strictly related to ParallelCluster, hence there is no real action item we can take other than installing the latest OpenMPI once this enhancement is published upstream. As @rajachan mentioned, this is something we (as AWS) are tracking on our roadmap and plan to contribute to. Again, thanks for the very valid and thorough analysis.
Environment
Bug description and how to reproduce
I found that OpenMPI collectives like Bcast and Allreduce are 3~5 times slower than IntelMPI. This is unrelated to the intra-node EFA latency issue at #1143 (comment), since the performance gap still exists without EFA enabled. I did an extensive benchmark using various MPI implementations and versions, and tweaked the collective algorithms via `I_MPI_ADJUST_BCAST` for Intel-MPI and `--mca coll_tuned_bcast_algorithm` for OpenMPI (see the example invocations below). The results are summarized below. All results can be reproduced in this repo: https://github.com/JiaweiZhuang/aws-mpi-benchmark

I specifically need help on:
- I understand that it is difficult for an open-source MPI implementation to beat Intel-MPI's closed-source magic; I just don't want to incorrectly blame OpenMPI for slow performance if I made a mistake somewhere.
- Should I report this on OpenMPI's issue tracker instead? Given that this problem is specific to the EC2 / ParallelCluster environment, and that OpenMPI works fine on my local InfiniBand cluster, I think it makes sense to post it here?

Other collectives could have a similar problem, but I haven't tested them. Bcast and Allreduce are probably two of the most commonly used collective calls (e.g. Table 3 in this paper), so I focus on them here.
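For context, the algorithm overrides were passed roughly like this (a sketch only; the `./osu_bcast` binary path and the rank count are placeholders, not the exact commands from the repo):

```bash
# Intel MPI: force a specific Bcast algorithm (IDs are documented under the
# I_MPI_ADJUST family in the Intel MPI reference)
I_MPI_ADJUST_BCAST=4 mpirun -np 288 ./osu_bcast

# Open MPI: dynamic rules must be enabled for the fixed algorithm ID to take effect;
# 7 selects the knomial tree in OpenMPI4
mpirun -np 288 --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_bcast_algorithm 7 ./osu_bcast
```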
Benchmark results
General config: benchmarks run on `c5n.18xlarge` nodes.

1. Time vs number of cores
Bcast
The out-of-box OpenMPI is 4x slower than IntelMPI-EFA, and still 3x slower than IntelMPI-TCP without EFA enabled. Thus the performance gap comes mostly from the collective algorithms and less from the EFA vs TCP difference. The "knomial tree" algorithm newly introduced in OpenMPI4 significantly improves the performance and gets closer to IntelMPI-TCP. The default algorithm in OpenMPI4 is still relatively slow. See the later section "Try all collective algorithms" for all algorithms.
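(To see which algorithm IDs a given Open MPI build exposes, something like the following should work; the grep pattern is just illustrative:)

```bash
# List the tuned-collective parameters, including the enumerated Bcast algorithms
ompi_info --param coll tuned --level 9 | grep -A 2 bcast_algorithm
```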
Allreduce
For a large message, OpenMPI is 5x slower than Intel-MPI (with or without EFA enabled). For smaller messages, the performance gap is even larger.
2. Time vs message size
This is the commonly used time vs message size plot for OSU, if you want to see it (expand the details).
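A representative OSU invocation for this kind of message-size sweep would look like the sketch below (the hostfile, binary paths, and rank count are placeholders, not the exact commands used):

```bash
# -m sets the min:max message size range; -f prints full statistics
mpirun -np 288 --hostfile hosts ./osu_bcast -m 8:1048576 -f
mpirun -np 288 --hostfile hosts ./osu_allreduce -m 8:1048576 -f
```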
Bcast
Allreduce
3. Try all collective algorithms
Most Bcast algorithms are reviewed in the paper Performance Analysis of MPI Collective Operations. Allreduce algorithms are reviewed in Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations.
The x-axis is clipped so that very slow algorithms are not shown. See the numbers on each bar for the actual time.
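The brute-force sweep over algorithms can be scripted along these lines (the ID range, rank count, and binary path are placeholders; check `ompi_info` for the valid IDs on a given build):

```bash
# Try every tuned Bcast algorithm in turn (0 = let Open MPI decide)
for alg in $(seq 0 9); do
  echo "== Bcast algorithm $alg =="
  mpirun -np 288 --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_bcast_algorithm "$alg" \
         ./osu_bcast -m 1048576:1048576 -f
done
```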
Bcast
The "knormial tree OpenMPI4" is significantly faster than default, and also scales better with more cores as shown previously (here only uses 288 cores).
Smaller message:
Allreduce
Not sure why the same algorithm can have very different performance between OpenMPI3 and OpenMPI4 (e.g. "recursive doubling" and "basic linear"). I guess some algorithm names exist in OpenMPI3 but are not yet implemented, so they just fall back to the default?
Smaller message:
Some algorithms in OpenMPI4 can be faster than the default, but only for specific message sizes. For example, "recursive doubling" is fast for medium-size messages but becomes very slow for large messages.
Tentative conclusion
If the above analysis is done correctly, my tentative conclusion is: for Bcast, using the OpenMPI4 knomial tree (`--mca coll_tuned_bcast_algorithm 7`) gives a 2~3x performance boost over the out-of-box OpenMPI3/OpenMPI4, getting closer to IntelMPI without EFA. For Allreduce, a clever switch between algorithms, eager vs rendezvous protocol, etc., depending on message size, could give a ~30% speed-up, but that is still 2x away from IntelMPI-TCP and 3x away from Intel-MPI EFA.

The easiest solution is probably "just use Intel MPI on EC2" if you don't care about whether the library is open-source or not. Tuning every collective function is like reinventing a new MPI library, and most users shouldn't have to worry about that.