{bio}[foss/2023b] GROMACS v2024.2 w/ CUDA 12.5.0 #20809

Open
wants to merge 4 commits into
base: develop
from

Conversation

boegel
Member

@boegel boegel commented Jun 12, 2024

@boegel boegel added the update label Jun 12, 2024
@boegel boegel added this to the 4.x milestone Jun 12, 2024
@boegel
Member Author

boegel commented Jun 12, 2024

Test report by @boegel
FAILED
Build succeeded for 1 out of 2 (1 easyconfigs in total)
node3306.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/0e97c95f98a87e72fd334b973a72901f for a full test report.

edit: Timeout for MdrunCoordinationCouplingTests2Ranks because $OMP_PROC_BIND was set to TRUE in environment...

sources = [SOURCELOWER_TAR_GZ]
patches = [
    'GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch',
    'GROMACS-2023.1_fix_tests_for_gmx_thread_mpi.patch',
Contributor

I was recently looking into GROMACS 2024.2 as well, and was wondering if this patch is still required. https://gitlab.com/gromacs/gromacs/-/merge_requests/4093 may solve the same issue, or at least a similar one. Looks like @akesandgren added this patch, so maybe he knows?

Contributor

GROMACS-2023.1_fix_tests_for_gmx_thread_mpi.patch should still be relevant as long as it passes patching.
Especially the "Don't drop relevant PYTHONPATH and LD_LIBRARY_PATH settings." part is vital.

I haven't looked at 2024.2 but I assume they haven't fixed those parts yet.

And GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch is just an extra precaution to keep OMP_NUM_THREADS and --ntomp in sync.
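
To illustrate what "in sync" means here, a minimal sketch follows. It is not what the patch does (the patch changes the GROMACS test setup); `mdrun_invocation` is a hypothetical helper, not part of GROMACS or EasyBuild. It builds an `mdrun` command line together with an environment in which `OMP_NUM_THREADS` always equals the `-ntomp` value:

```python
import os


def mdrun_invocation(ntomp, base_env=None):
    """Return (command, env) with OMP_NUM_THREADS matching -ntomp.

    Hypothetical helper for illustration only.
    """
    if ntomp < 1:
        raise ValueError("ntomp must be a positive integer")
    # Start from the given environment, or from the current one.
    env = dict(os.environ if base_env is None else base_env)
    # Overwrite any inherited value so the two settings cannot disagree.
    env["OMP_NUM_THREADS"] = str(ntomp)
    command = ["gmx", "mdrun", "-ntomp", str(ntomp)]
    return command, env


command, env = mdrun_invocation(2, base_env={"OMP_NUM_THREADS": "8"})
print(command, env["OMP_NUM_THREADS"])
```

The inherited `OMP_NUM_THREADS=8` is replaced by `2`, so the environment and the command line agree.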

@boegel
Member Author

boegel commented Jun 14, 2024

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3903.accelgor.os - Linux RHEL 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/cf1254074701e06327ebcf84b832673c for a full test report.

edit: Timeout for MdrunCoordinationCouplingTests2Ranks because $OMP_PROC_BIND was set to TRUE in environment (?)

@boegel
Member Author

boegel commented Jun 15, 2024

Not setting $OMP_PROC_BIND to TRUE on our system is not an option, because then the GROMACS test suite doesn't finish even after 11 hours (still running)...

@boegel
Member Author

boegel commented Jun 15, 2024

Maybe we should always use -DGMX_TEST_TIMEOUT_FACTOR to increase the timeout a bit, see also https://gitlab.com/gromacs/gromacs/-/issues/5062.
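
In the easyconfig this could look roughly like the fragment below. It is only a sketch: the factor 3 is an arbitrary illustrative value, not one proposed in this PR, and it assumes the option is passed through the standard `configopts` easyconfig parameter.

```python
# Easyconfig fragment (easyconfigs use Python syntax).
# Assumption: a factor of 3 is purely illustrative; pick whatever suits the build hosts.
configopts = '-DGMX_TEST_TIMEOUT_FACTOR=3 '
```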

I think the issue in my case is that I'm running in a Slurm job that's asking for a partial node, and I'm not getting lucky w.r.t. which cores are assigned to the job, which makes this particular test quite slow...
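
One way to check which cores a job actually received is a small script like this, run inside the job. It is a diagnostic sketch only: `cpu_binding_report` is a hypothetical helper name, and `os.sched_getaffinity` is only available on Linux.

```python
import os


def cpu_binding_report():
    """Collect the CPU cores this process may run on, plus related settings."""
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = None  # affinity query not available on this platform
    return {
        "allowed_cores": cores,
        "OMP_PROC_BIND": os.environ.get("OMP_PROC_BIND"),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "SLURM_CPUS_ON_NODE": os.environ.get("SLURM_CPUS_ON_NODE"),
    }


for key, value in cpu_binding_report().items():
    print(f"{key}: {value}")
```

Comparing the `allowed_cores` list between a fast and a slow run might show whether the assigned cores are spread across sockets or NUMA domains.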

@bedroge
Contributor

bedroge commented Jun 15, 2024

Test report by @bedroge
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gpu2 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (skylake_avx512), 1 x NVIDIA GRID V100D-32Q, 535.161.07, Python 3.6.8
See https://gist.github.com/bedroge/d229b0d73d21ced263c31f40a2f15f5c for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0208u15a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8
See https://gist.github.com/branfosj/3dd073c237e57b2305fcf41b13f9f6ba for a full test report.

@bedroge
Contributor

bedroge commented Jun 15, 2024

> Not setting $OMP_PROC_BIND to TRUE on our system is not an option, because then the GROMACS test suite doesn't finish even after 11 hours (still running)...

I'm now running it in a Slurm job on an Icelake+A100 node (the successful test report was done on an interactive node without Slurm), and that one also seems to get stuck or something. The test step of the first iteration has been running for more than an hour, while it only took 18 minutes for the interactive V100 build.

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20809 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20809 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4418

Test results coming soon (I hope)...

- notification for comment with ID 2178169901 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.54.15, Python 3.9.18
See https://gist.github.com/boegelbot/bb3a1716c849cb6a6085c1b64622964e for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1611.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.78, Python 3.10.12
See https://gist.github.com/akesandgren/446251f9d574523a8ff45149c187361b for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1502.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, 2 x NVIDIA Tesla V100-PCIE-16GB, 545.29.06, Python 3.8.10
See https://gist.github.com/akesandgren/8236afa5668895b2c6201bce5be8e959 for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1602.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 9454 48-Core Processor, 4 x NVIDIA NVIDIA H100 80GB HBM3, 550.78, Python 3.10.12
See https://gist.github.com/akesandgren/259dc4b1f2a21ccb5103d17f61c10614 for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1604.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 9254 24-Core Processor, 2 x NVIDIA NVIDIA L40S, 550.78, Python 3.10.12
See https://gist.github.com/akesandgren/cb3dec04d82d480e9bad23a4d85f412f for a full test report.

@akesandgren
Contributor

I had no problem building this for an A40 on a broadwell node in a non-interactive batch job.

@@ -74,7 +74,8 @@ exts_default_options = {

 exts_list = [
     ('gmxapi', '0.4.2', {
-        'preinstallopts': 'export CMAKE_ARGS="-Dgmxapi_ROOT=%(installdir)s -C %(installdir)s/share/cmake/gromacs_mpi/gromacs-hints_mpi.cmake" && ',
+        'preinstallopts': 'export CMAKE_ARGS="-Dgmxapi_ROOT=%(installdir)s ' +
+                          '-C %(installdir)s/share/cmake/gromacs_mpi/gromacs-hints_mpi.cmake" && ',
         'source_tmpl': 'gromacs-2023.3.tar.gz',
Contributor

@bedroge bedroge Jun 26, 2024

This should correspond to the current version (and the version of gmxapi needs to be bumped as well).

Member Author

Ugh... Fixed in 5dc1057

@boegel
Member Author

boegel commented Jun 26, 2024

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3302.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/516323bafd0a78e15c512921d7f25383 for a full test report.

@boegel
Member Author

boegel commented Jun 26, 2024

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3902.accelgor.os - Linux RHEL 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/0b1eca1a8e7788ef14314e1781129edb for a full test report.

6 participants