{data}[gompi/2022.10] netCDF-Fortran v4.6.0, netCDF v4.9.0, HDF5 v1.12.2, Szip v2.1.1 #16834


boegel
Member

@boegel boegel commented Dec 8, 2022

@boegel boegel added the update label Dec 8, 2022
@boegel boegel added this to the next release (4.7.0) milestone Dec 8, 2022
@boegel
Member Author

boegel commented Dec 8, 2022

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=16834 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_16834 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9718

Test results coming soon (I hope)...

- notification for comment with ID 1342538205 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 4 (4 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/454fa3603b82904c59a5a44082f9c4b7 for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Dec 8, 2022
@branfosj
Member

branfosj commented Dec 8, 2022

Test report by @branfosj
SUCCESS
Build succeeded for 7 out of 7 (4 easyconfigs in total)
bear-pg0211u15a.bear.cluster - Linux Ubuntu 20.04.2 LTS (Focal Fossa), x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.8.5
See https://gist.github.com/cce047459e4433f1f1cc1791cfe1102f for a full test report.

@branfosj
Member

branfosj commented Dec 8, 2022

Test report by @branfosj
SUCCESS
Build succeeded for 7 out of 7 (4 easyconfigs in total)
bear-pg0105u36b.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/a9a985bf072e63c2ab85027b9189dcd6 for a full test report.

@branfosj
Member

branfosj commented Dec 8, 2022

@boegelbot please test @ jsc-zen2

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=16834 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_16834 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 1853

Test results coming soon (I hope)...

- notification for comment with ID 1343226691 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 4 (4 easyconfigs in total)
jsczen2g1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/ec1025e47b4a8dd5776c8eb9f7e10ed5 for a full test report.

@boegel
Member Author

boegel commented Dec 9, 2022

The same test fails on both generoso and jsc-zen2, and it doesn't look like the timeout is too strict:

161/218 Test #161: nc_test4_run_par_test .................***Timeout 1500.15 sec
...
The following tests FAILED:
	161 - nc_test4_run_par_test (Timeout)

In the past it seems like this was a sign of a bug in Open MPI, see Unidata/netcdf-c#1500.
But here it succeeds on most systems...
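To compare runs across systems, the timed-out tests can be pulled out of the ctest output automatically. The following is only a minimal sketch, assuming the log lines look like the excerpt quoted above; the function name `find_timeouts` and the regular expression are illustrative, not part of EasyBuild or CTest.

```python
import re

# Sample lines copied from the ctest output quoted above.
CTEST_OUTPUT = """\
161/218 Test #161: nc_test4_run_par_test .................***Timeout 1500.15 sec
The following tests FAILED:
\t161 - nc_test4_run_par_test (Timeout)
"""

# Matches per-test result lines such as
# "161/218 Test #161: <name> ....***Timeout 1500.15 sec".
# The pattern is an assumption based on the excerpt, not a ctest specification.
TIMEOUT_LINE = re.compile(
    r"Test\s+#(?P<num>\d+):\s+(?P<name>\S+)\s+\.*\*\*\*Timeout\s+(?P<secs>[\d.]+)\s+sec"
)


def find_timeouts(text):
    """Return (test number, test name, seconds) for every timed-out test."""
    return [
        (int(m.group("num")), m.group("name"), float(m.group("secs")))
        for m in TIMEOUT_LINE.finditer(text)
    ]


print(find_timeouts(CTEST_OUTPUT))
```

Only the per-test result line is matched; the summary line under "The following tests FAILED" is ignored so that each test is reported once.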

@branfosj
Member

branfosj commented Dec 9, 2022

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on login1

PR test command 'EB_PR=16834 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_16834 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9728

Test results coming soon (I hope)...

- notification for comment with ID 1344306828 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member

branfosj commented Dec 9, 2022

We disabled one of the parts of nc_test4_run_par_test in #16050, but we are not even getting that far. This part of the tests runs really slowly for some - seeing if more cores on generoso helps.

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 4 (4 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/ea8e72aa79c7aa8e0b67f109422feda1 for a full test report.

@boegel
Member Author

boegel commented Dec 12, 2022

@branfosj How do you think we should proceed here? Extend the timeout?

@branfosj
Member

@branfosj How do you think we should proceed here? Extend the timeout?

We should try with a longer timeout, to see how long it takes for these tests to complete. I expect that we'll just keep increasing it for the tests in here - I suspect that the functionality these tests hit is a bad fit for certain types of filesystems.
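One way to find out how long a single test really needs is to run it on its own under a generous time limit and record the wall-clock time. This is a rough sketch using only the Python standard library; `time_command` is a hypothetical helper, and the two commands below are stand-ins for a real test binary such as the ones in nc_test4.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout_sec):
    """Run cmd and return (finished, elapsed_seconds).

    finished is False when the command was killed after timeout_sec.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            timeout=timeout_sec,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start


# Stand-in commands (assumption): replace with the real test invocation.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]

print(time_command(quick, timeout_sec=30))
print(time_command(slow, timeout_sec=0.5))
```

Running the same command on different filesystems (local disk, NFS, a bind-mounted container path) would show whether the slowdown follows the filesystem.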

@boegel
Member Author

boegel commented Dec 20, 2022

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=16834 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_16834 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9852

Test results coming soon (I hope)...

- notification for comment with ID 1360078495 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member Author

boegel commented Dec 20, 2022

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
node3104.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/9a852e73965ea63f7b1099fff2ef353a for a full test report.

@boegel boegel force-pushed the 20221208120057_new_pr_netCDF-Fortran460 branch from 18fa679 to d721297 Compare December 21, 2022 08:27
@boegel
Member Author

boegel commented Dec 21, 2022

I have cancelled job 9852 that was running on generoso with a relaxed timeout (ARGS='--timeout 100000' in pretestopts), since it had already been running for 12.5h.

Instead, I've added a new patch in d721297 which disables one more test, because it's running excessively long on some types of filesystems.
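For reference, the longer ctest timeout mentioned above could be expressed in an easyconfig roughly as follows. This is a sketch, not the actual contents of the PR: `pretestopts` and the `ARGS` value come from the comment, while `runtest = 'test'` and the way the command line is assembled are assumptions made for illustration.

```python
# Hypothetical easyconfig fragment (easyconfigs use Python syntax).

# Assumption: the tests are run via the 'test' make target.
runtest = 'test'

# Prepended to the test command so that ctest receives a much longer timeout.
pretestopts = "ARGS='--timeout 100000' "

# Illustration only: roughly the shell command line that results.
test_cmd = pretestopts + "make " + runtest
print(test_cmd)
```

Since the job still had not finished after 12.5h with this setting, raising the timeout alone does not look like a practical fix here.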

@boegel
Member Author

boegel commented Dec 21, 2022

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=16834 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_16834 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9856

Test results coming soon (I hope)...

- notification for comment with ID 1361001272 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 4 (4 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/7aafcc29d42601ec98c6e59b75ee6d75 for a full test report.

@boegel
Member Author

boegel commented Dec 21, 2022

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
node3104.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/846087f611eb1ad12fbc4784aebf2512 for a full test report.

and machine-independent data formats that support the creation, access, and sharing of array-oriented
scientific data."""

toolchain = {'name': 'gompi', 'version': '2022.10'}
Member Author

Note: version should be updated to 2022b once #16961 is merged

@SebastianAchilles
Member

SebastianAchilles commented Jan 4, 2023

I am also seeing the timeout on my system. In my case I am running the test report in a container and the filesystem is mounted into the container, e.g. similar to:

docker run -it -v /data/easybuild/:/home/easybuild/.local/easybuild/ ghcr.io/easybuilders/rockylinux-9.0:latest

/data is a 4TB SATA HDD with an ext4 filesystem. Using --buildpath=/dev/shm/$USER/ doesn't make a difference.

I tried to nail down the tests that are taking so long. To get rid of the timeout I had to disable, in addition to the already disabled tests, the @MPIEXEC@ -n 4 ./tst_parallel_zlib and @MPIEXEC@ -n 4 ./tst_parallel_compress tests, i.e. I added the following to the patch netCDF-4.9.0_skip-timeout-tests.patch:

@@ -50,14 +50,14 @@
 # Only run these tests if HDF5 supports parallel filters (v1.10.2 and
 # later).
 if test "@HAS_PAR_FILTERS@" = "yes"; then
-    echo
-    echo "Parallel I/O test with zlib."
-    @MPIEXEC@ -n 4 ./tst_parallel_zlib
+#     echo
+#     echo "Parallel I/O test with zlib."
+#     @MPIEXEC@ -n 4 ./tst_parallel_zlib
 
     echo
     echo "Parallel I/O more tests with zlib and szip (if present in HDF5)."
     @MPIEXEC@ -n 1 ./tst_parallel_compress
-    @MPIEXEC@ -n 4 ./tst_parallel_compress
+#     @MPIEXEC@ -n 4 ./tst_parallel_compress
 fi
 
 echo

Skipping these tests also works on jsc-zen2, which is using an NFS filesystem.
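When experimenting with which tests to skip, the same change as in the diff above can be tried out without hand-editing the script. The sketch below is an assumption-laden illustration: `comment_out` is a hypothetical helper, it only comments out the two test invocations (not the accompanying echo lines), and the sample text is a shortened stand-in for the real test script.

```python
# Test invocations to disable, taken from the diff above.
SKIP = (
    "@MPIEXEC@ -n 4 ./tst_parallel_zlib",
    "@MPIEXEC@ -n 4 ./tst_parallel_compress",
)


def comment_out(script_text, skip=SKIP):
    """Return script_text with every line listed in skip commented out."""
    out = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped in skip:
            # Keep the original indentation in front of the comment marker.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}# {stripped}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Shortened stand-in for the test script (assumption).
SAMPLE = """\
    @MPIEXEC@ -n 4 ./tst_parallel_zlib
    @MPIEXEC@ -n 1 ./tst_parallel_compress
    @MPIEXEC@ -n 4 ./tst_parallel_compress
"""

print(comment_out(SAMPLE), end="")
```

The single-process run of tst_parallel_compress is left untouched, matching the diff.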

@Micket
Contributor

Micket commented Jan 4, 2023

I would have thought the netCDF-4.9.0_skip-nasa-test.patch patch was enough. That seemed to be enough for netCDF-4.9.0-gompi-2022a.eb and netCDF-4.9.0-iimpi-2022a.eb

It did fix things for the person who originated this issue a while back (#15959).

If it's not enough, then we should apply this new patch to those easyconfigs as well.

@SebastianAchilles
Member

Yes, in the 2022a toolchain skipping only the Parallel Performance Test for NASA for netCDF-4.9.0-gompi-2022a.eb worked on generoso, jsc-zen2, and the container that I am using for test reports. For the 2022b version, more tests seem to take longer. The exact reason why the tests take longer is not clear to me.

@Micket
Contributor

Micket commented Jan 4, 2023

So the difference stems from the compiler and MPI change, or the version bump of libxml2 and cURL.
I would suspect MPI if anything.

Maybe it's not so clear cut that these failures should just be ignored?

I'll try on cephfs, to see if that also has problems without these patches (it never had issues with any of the tests before).

@boegel
Member Author

boegel commented Feb 28, 2023

closing in favor of @jfgrimm's #17107 (which just got merged)...

@boegel boegel closed this Feb 28, 2023
@boegel boegel deleted the 20221208120057_new_pr_netCDF-Fortran460 branch February 28, 2023 15:47
@boegel boegel modified the milestones: next release (4.7.1?), 4.x Feb 28, 2023