{lib}[GCCcore/10.2.0] OpenMPI v4.0.5, libevent v2.1.12, libfabric v1.11.0, PMIx 3.1.5 #11333

Merged
merged 2 commits into easybuilders:develop from 20200923113619_new_pr_OpenMPI405 on Sep 30, 2020

Conversation

boegel
Member

@boegel boegel commented Sep 23, 2020

(created using eb --new-pr)
requires easybuilders/easybuild-easyblocks#2184 + #11320 (UCX) + #11332 (hwloc)

@boegel boegel added the update label Sep 23, 2020
@boegel boegel added this to the next release (4.3.1) milestone Sep 23, 2020
@boegel
Member Author

boegel commented Sep 23, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11333 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11333 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7856

Test results coming soon (I hope)...

- notification for comment with ID 697295413 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
generoso-x-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/b62bfc33c94b6a4fd0d17ee00f8f1e75 for a full test report.

@boegel
Member Author

boegel commented Sep 23, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node3406.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/ba213ae6dbf1b2f33bf35108c19d1e84 for a full test report.

@boegel
Member Author

boegel commented Sep 23, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node3149.skitty.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/b1a7fec7331b23b529b5f01a34bb735b for a full test report.

@boegel
Member Author

boegel commented Sep 23, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node2609.swalot.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 2.7.5
See https://gist.github.com/54e0c16b41bdeb8e18cf5527a0ea4afd for a full test report.

@boegel boegel added the 2020b issues & PRs related to 2020b label Sep 24, 2020
@lexming
Contributor

lexming commented Sep 25, 2020

Test report by @lexming
SUCCESS
Build succeeded for 5 out of 5 (4 easyconfigs in this PR)
node127.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/e6d314163c5aedf8d32de7f0f9056d9f for a full test report.

@lexming
Contributor

lexming commented Sep 25, 2020

Test report by @lexming
SUCCESS
Build succeeded for 5 out of 5 (4 easyconfigs in this PR)
node376.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/07f7fa869c7e57849c243ebc26b6175d for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Sep 25, 2020
Contributor

@lexming lexming left a comment

This OpenMPI is not working well on my side. A simple MPI hello world program fails to initialise OpenFabrics:

$ mpirun ./test
[node379.hydra.os:24944] [[51950,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node378
  Local device: mlx5_0
--------------------------------------------------------------------------
Hello world from processor node379.hydra.os, rank 0 out of 2 processors
Hello world from processor node378.hydra.os, rank 1 out of 2 processors

OSU-Micro-benchmarks has the same issue:

# OSU MPI Latency Test v5.6.3
# Size          Latency (us)
1024                    2.08
2048                    2.83
4096                    3.72
8192                    5.46
16384                   7.56
32768                   9.83
65536                  14.34
131072                 22.34
262144                 32.28
524288                 54.46
1048576                97.91
2097152               181.33
4194304               354.37
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node378
  Local device: mlx5_0
--------------------------------------------------------------------------
[node379.hydra.os:15539] [[38701,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501

The execution completes in both cases, but those errors are not good.
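Since both runs finish normally, the exit status alone would not reveal the problem. A minimal sketch of scanning captured `mpirun` output for the two signatures quoted above follows; the function name `find_mpi_warnings` and the pattern list are hypothetical and only illustrate the idea, they are not part of EasyBuild or Open MPI.

```python
import re

# Warning signatures copied from the output quoted in this comment.
# Hypothetical helper: not part of EasyBuild or Open MPI.
WARNING_PATTERNS = [
    re.compile(r"error initializing an OpenFabrics device"),
    re.compile(r"ORTE_ERROR_LOG"),
]


def find_mpi_warnings(output):
    """Return the lines of `output` that match a known warning signature."""
    return [
        line
        for line in output.splitlines()
        if any(pattern.search(line) for pattern in WARNING_PATTERNS)
    ]


if __name__ == "__main__":
    sample = (
        "WARNING: There was an error initializing an OpenFabrics device.\n"
        "Hello world from processor node379.hydra.os, rank 0 out of 2 processors\n"
    )
    for line in find_mpi_warnings(sample):
        print("suspicious:", line)
```

A check like this could run after a smoke test so that a run which completes but prints these warnings is still flagged.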

@terjekv
Collaborator

terjekv commented Sep 25, 2020

Started a test build on a "clean" arm box. It'll take a bit. It started building M4... The box has no toolchains. :)

@terjekv
Collaborator

terjekv commented Sep 25, 2020

Test report by @terjekv
SUCCESS
Build succeeded for 37 out of 37 (4 easyconfigs in this PR)
arm2 - Linux ubuntu 18.04, AArch64, UNKNOWN, Python 3.6.9
See https://gist.github.com/3f3bd7ad22e365787aba06bd717bc751 for a full test report.

@boegel
Member Author

boegel commented Sep 26, 2020

> This OpenMPI is not working well on my side. A simple MPI hello world program fails to initialise OpenFabrics

The problem here is that we should be configuring OpenMPI with --without-verbs when we're using UCX.
That certainly fixes the problem for me (and it's a known issue, see https://www.open-mpi.org/faq/?category=all#ofa-device-error).

Please try again with the updated OpenMPI easyblock from easybuilders/easybuild-easyblocks#2188 .
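The rule described in this comment (pass `--without-verbs` to OpenMPI's configure when UCX is used) can be sketched roughly as below. This is an illustration only and not the code from easybuilders/easybuild-easyblocks#2188; the function name `openmpi_configopts` and its arguments are hypothetical, and the sketch assumes an explicit verbs option already present in the configure options should be left alone.

```python
# Illustrative sketch only; this is NOT the actual OpenMPI easyblock code.
# `openmpi_configopts`, `dep_names` and `configopts` are hypothetical names.

def openmpi_configopts(dep_names, configopts=""):
    """Return configure options, adding --without-verbs when UCX is a dependency."""
    opts = configopts.split()
    uses_ucx = "UCX" in dep_names
    # Assumption: an explicit --with-verbs/--without-verbs choice is respected.
    verbs_already_set = any(
        opt.startswith(("--with-verbs", "--without-verbs")) for opt in opts
    )
    if uses_ucx and not verbs_already_set:
        opts.append("--without-verbs")
    return " ".join(opts)


if __name__ == "__main__":
    print(openmpi_configopts(["UCX", "hwloc"], "--enable-shared"))
```

Keeping the decision in one small function makes it straightforward to check that builds without UCX are left unchanged.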

@lexming
Contributor

lexming commented Sep 26, 2020

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
node128.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/96567ab27531f1376f4b0764aa6edc36 for a full test report.

@lexming
Contributor

lexming commented Sep 26, 2020

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
node378.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/6314f78e9edcc6d1e3dfd75fe5257922 for a full test report.

@lexming
Contributor

lexming commented Sep 26, 2020

@boegel thanks a lot, that was indeed the issue. We have already been disabling verbs on our production system, but I was totally misled by the ORTE_ERROR_LOG: Data unpack would read past... error.

Contributor

@lexming lexming left a comment

LGTM

@boegel boegel added this to In progress in EasyBuild v4.3.1 via automation Sep 26, 2020
@boegel
Member Author

boegel commented Sep 30, 2020

@lexming So let's merge? Or do you want to see more tests?

@boegel
Member Author

boegel commented Sep 30, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node3502.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7302P 16-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/bb1df99da5e1cdd18ba1f4d58c1b7d14 for a full test report.

@lexming
Contributor

lexming commented Sep 30, 2020

Going in, thanks @boegel !

@lexming lexming merged commit ef4e18c into easybuilders:develop Sep 30, 2020
EasyBuild v4.3.1 automation moved this from In progress to Done Sep 30, 2020
@boegel boegel deleted the 20200923113619_new_pr_OpenMPI405 branch October 2, 2020 20:37
Labels
2020b issues & PRs related to 2020b update

4 participants