improve QuantumESPRESSO easyblock by cleaning up and extending configure step + running test suite #3241

Merged
merged 39 commits into from Mar 19, 2024

Conversation

Crivella
Contributor

@Crivella Crivella commented Mar 1, 2024

This PR, opened in relation to issue #3234, changes the behavior of the QuantumESPRESSO easyblock in order to make sure the correct compilation flags are used for the majority of versions from 5.x to 7.x.

The inclusion of flags has also been modularized to make it easier to manage.

The new easyblock has been tested by compiling several QE versions, starting from the latest available intel and foss recipes and using the --try-software-version flag to install other versions.
When a valid version of the HDF5, LibXC, or ELPA libraries was not readily available, that library was manually disabled in the original config file.
In detail:

  • ELPA and LibXC have not been used for QE < 7.x (they are available, but testing them would require recompiling a different version of each library)
  • HDF5 has not been used for QE < 5.2.1 (support was experimental from 5.0, but in my case I was getting segfaults when HDF5 functions were invoked).

NOTE: Due to a problem with compiling and running QE with the intel toolchain + OpenMP, OpenMP was disabled when using intel2022b.

Below are the results of a small reframe test (see PR) used to check that all codes run to completion without errors or segfaults.
(The test was scaled down only to check that all codes reach the JOB DONE line; given how small this calculation was, the timings themselves are not very significant.)
[Image: batch_tests]
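
The completion check described above can be sketched roughly as follows. This is only a minimal illustration, assuming the output of each run is available as a string; the function name `reached_job_done` and the sample outputs are hypothetical and are not taken from the reframe test in the PR.

```python
def reached_job_done(output):
    """Return True if any line of the run output contains the 'JOB DONE' marker."""
    return any('JOB DONE' in line for line in output.splitlines())


# Hypothetical sample outputs, for illustration only (not real QE logs)
finished_output = "some calculation output\n   JOB DONE.\n"
crashed_output = "some calculation output\nSegmentation fault\n"

print(reached_job_done(finished_output))
print(reached_job_done(crashed_output))
```

A check like this only says that a run reached its end; it says nothing about the numerical results, which is why the timings and values from such a scaled-down run should not be over-interpreted.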

@ocaisa ocaisa self-assigned this Mar 5, 2024
@ocaisa ocaisa added the bug fix label Mar 5, 2024
@ocaisa ocaisa added this to the release after 4.9.0 milestone Mar 5, 2024
Crivella and others added 3 commits March 6, 2024 15:45
Co-authored-by: ocaisa <alan.ocais@cecam.org>
Co-authored-by: ocaisa <alan.ocais@cecam.org>
Co-authored-by: ocaisa <alan.ocais@cecam.org>
@Crivella
Contributor Author

Crivella commented Mar 9, 2024

@ocaisa I just checked those failures in the bot run. Apparently QE implemented NPROCS for the test_suite in 7.2, not 7.0. Before that, only the -parallel/-serial run targets are present.
I added a fix for that.

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

Build succeeded for 3 out of 7 (7 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/b9c6f520ca24b3ccca205f450fbb5a18 for a full test report.

@ocaisa
Member

ocaisa commented Mar 9, 2024

@boegelbot please test @ generoso
EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-6.8-intel-2021a.eb QuantumESPRESSO-7.0-foss-2021b.eb QuantumESPRESSO-7.0-intel-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb QuantumESPRESSO-7.1-intel-2022a.eb "

@boegelbot

@ocaisa: Request for testing this PR well received on login1

PR test command 'EB_PR=3241 EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-6.8-intel-2021a.eb QuantumESPRESSO-7.0-foss-2021b.eb QuantumESPRESSO-7.0-intel-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb QuantumESPRESSO-7.1-intel-2022a.eb " EB_CONTAINER= EB_REPO=easybuild-easyblocks /opt/software/slurm/bin/sbatch --job-name test_PR_3241 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13062

Test results coming soon (I hope)...

- notification for comment with ID 1986953955 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@ocaisa
Member

ocaisa commented Mar 11, 2024

@boegelbot please test @ jsc-zen3
EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-6.8-intel-2021a.eb QuantumESPRESSO-7.0-foss-2021b.eb QuantumESPRESSO-7.0-intel-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb QuantumESPRESSO-7.1-intel-2022a.eb "

@boegelbot

@ocaisa: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=3241 EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-6.8-intel-2021a.eb QuantumESPRESSO-7.0-foss-2021b.eb QuantumESPRESSO-7.0-intel-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb QuantumESPRESSO-7.1-intel-2022a.eb " EB_REPO=easybuild-easyblocks EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_3241 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3742

Test results coming soon (I hope)...

- notification for comment with ID 1988182263 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

Build succeeded for 7 out of 15 (7 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/1e2ada209d0c8f375f067c8ba8f07475 for a full test report.

@ocaisa
Member

ocaisa commented Mar 12, 2024

@Crivella I've got passing tests almost everywhere now (here and in easybuilders/easybuild-easyconfigs#20070), apart from an issue with the software stack for version 7.0 (which has nothing to do with this PR). I'll do a final review now and then merge

'ph_ahc_diam', # Test detects a ! as an energy in baseline
'tddfpt_magnons_fe', # Too strict thresholds
], "List of test suite targets that are allowed to fail (name can partially match)", CUSTOM],
'test_suite_threshold': [0.97, "Threshold for test suite success rate", CUSTOM],
Member

I'm still not that comfortable with giving a % threshold here; I'd prefer to give a specific number (with the default being zero). Then we explicitly number the failures in the easyconfig, expecting it to be version (and perhaps toolchain) specific. PyTorch uses

'max_failed_tests': [0, "Maximum number of failing tests", CUSTOM],

Member

With all the testing I've done, I have the exact numbers for almost everything already, I just need to dig them out. We can then add them to the easyconfigs and do a final rerun of the builds. If we do a rerun of builds, do you think we should add the fast-math as well to see how we do?

Contributor Author

With the flakiness of some tests, which fail just because the absolute/relative errors are slightly higher than the thresholds set in the baselines, I think this could be tricky.
What I am mostly worried about is that if that number is not carefully curated, we might miss some segfaults that could arise.
This is why I previously added what should be the flaky tests to test_suite_allow_failures and raised an error (removed in commit f182aea) if a test not in that list failed (most likely it would fail because the calculation does not actually finish).

Contributor Author

@Crivella Crivella Mar 12, 2024

> With all the testing I've done, I have the exact numbers for almost everything already, I just need to dig them out. We can then add them to the easyconfigs and do a final rerun of the builds. If we do a rerun of builds, do you think we should add the fast-math as well to see how we do?

Sure, we could try. Last night I finished testing an easyconfig for 7.3 (I will open a PR for it soon) with that option. I think it should work for all versions just by adding

    'extra_cflags': '-ffast-math',
    'extra_fflags': '-ffast-math',
    'extra_fcflags': '-ffast-math',
    'extra_f90flags': '-ffast-math',

to the toolchain options
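
As a rough sketch of what that would mean in an easyconfig: the four `extra_*flags` entries are the ones listed above, while the pre-existing `usempi` and `openmp` values are only assumed here for illustration and will differ between easyconfigs.

```python
# Hypothetical existing toolchain options of an easyconfig (assumed values)
toolchainopts = {'usempi': True, 'openmp': True}

# The fast-math flags suggested above, one entry per compiler flag variable
fast_math_opts = {
    'extra_cflags': '-ffast-math',
    'extra_fflags': '-ffast-math',
    'extra_fcflags': '-ffast-math',
    'extra_f90flags': '-ffast-math',
}

# Merge them into the existing options without dropping what is already set
toolchainopts.update(fast_math_opts)

print(sorted(toolchainopts))
```

In a real easyconfig the merged dictionary would simply be written out literally as `toolchainopts = {...}`.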

Contributor Author

> With the flakiness of some tests, which fail just because the absolute/relative errors are slightly higher than the thresholds set in the baselines, I think this could be tricky. What I am mostly worried about is that if that number is not carefully curated, we might miss some segfaults that could arise. This is why I previously added what should be the flaky tests to test_suite_allow_failures and raised an error (removed in commit f182aea) if a test not in that list failed (most likely it would fail because the calculation does not actually finish).

But I guess we can use the failures array to check the number of failures without the ignored ones.
I would still leave the threshold on the total number (without the ignored ones) though, as the ignored tests, including relax, could actually be excluding a non-trivial number of test failures.
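
One possible reading of the combined check discussed in this thread is sketched below. It is not the merged implementation: the function names, the example test names other than `ph_ahc_diam`, and the way the two criteria are combined are assumptions made for illustration. The allow-list uses partial name matching, as described for `test_suite_allow_failures`.

```python
def is_allowed(test_name, allow_failures):
    """True if any allow-list entry is a substring of the test name."""
    return any(pattern in test_name for pattern in allow_failures)


def evaluate_test_suite(failed, total, allow_failures,
                        max_failed_tests=0, threshold=0.97):
    """Combine an absolute failure count with an overall success rate.

    Assumption: the count applies only to failures that are not on the
    allow-list, while the success rate is computed over all failures so
    that a broad pattern such as 'relax' cannot hide many failures.
    """
    unexpected = [t for t in failed if not is_allowed(t, allow_failures)]
    success_rate = (total - len(failed)) / total if total else 0.0
    ok = len(unexpected) <= max_failed_tests and success_rate >= threshold
    return ok, unexpected, success_rate


# Hypothetical run: 3 failures out of 200 tests, 2 of them on the allow-list
failed = ['ph_ahc_diam', 'pw_relax_1', 'pw_scf_3']
allow_failures = ['ph_ahc_diam', 'relax']

print(evaluate_test_suite(failed, 200, allow_failures))
print(evaluate_test_suite(failed, 200, allow_failures, max_failed_tests=1))
```

With the default of zero unexpected failures, a single test outside the allow-list is enough to fail the check, which addresses the concern about missing segfaults; the rate threshold acts as a backstop for the broad `relax` pattern.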

@boegel
Member

boegel commented Mar 12, 2024

> @Crivella I've got passing tests almost everywhere now (here and in easybuilders/easybuild-easyconfigs#20070), apart from an issue with the software stack for version 7.0 (which has nothing to do with this PR). I'll do a final review now and then merge

Keep in mind that jsc-zen3 is running Rocky 9.x, so the problems you're seeing there probably just mean that the older Intel compilers are not compatible with the glibc in Rocky 9.x.

@ocaisa
Member

ocaisa commented Mar 15, 2024

@boegelbot please test @ jsc-zen3
EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-6.8-intel-2021a.eb QuantumESPRESSO-7.0-foss-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb QuantumESPRESSO-7.1-intel-2022a.eb "

@boegelbot

@ocaisa: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=3241 EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-6.8-intel-2021a.eb QuantumESPRESSO-7.0-foss-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb QuantumESPRESSO-7.1-intel-2022a.eb " EB_REPO=easybuild-easyblocks EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_3241 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3783

Test results coming soon (I hope)...

- notification for comment with ID 1999632000 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

Build succeeded for 3 out of 6 (6 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/d294a007eb6c6d4c15dafd780818c7bd for a full test report.

@ocaisa
Member

ocaisa commented Mar 18, 2024

@boegelbot please test @ jsc-zen3
EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb "

@boegelbot

@ocaisa: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=3241 EB_ARGS=" QuantumESPRESSO-6.8-foss-2021a.eb QuantumESPRESSO-6.8-foss-2021b.eb QuantumESPRESSO-7.1-foss-2022a.eb " EB_REPO=easybuild-easyblocks EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_3241 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3794

Test results coming soon (I hope)...

- notification for comment with ID 2005027818 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

  • SUCCESS QuantumESPRESSO-6.8-foss-2021a.eb
  • SUCCESS QuantumESPRESSO-6.8-foss-2021b.eb
  • SUCCESS QuantumESPRESSO-7.1-foss-2022a.eb

Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/1ee57e8e66d8074a40f5e4b6d054e9e5 for a full test report.

@ocaisa
Member

ocaisa commented Mar 19, 2024

This has been extensively tested here and via easybuilders/easybuild-easyconfigs#20070. Thanks for all the effort, @Crivella.

@ocaisa ocaisa merged commit 2bc9e04 into easybuilders:develop Mar 19, 2024
47 checks passed
@Crivella Crivella deleted the feature-improve_qe_eblock branch March 19, 2024 15:05
@boegel boegel changed the title Feature improve QuantumESPRESSO easyblock improve QuantumESPRESSO easyblock by cleaning up and extending configure step + running test suite Apr 4, 2024