test timeouts for netCDF 4.9.0 #15959

Open
Ghepardo opened this issue Aug 3, 2022 · 24 comments · Fixed by #16050

Comments

@Ghepardo
Contributor

Ghepardo commented Aug 3, 2022

(I tried raising this by e-mailing easybuild-request@lists.ugent.be, but I received no responses.)

I am getting a consistent build error when attempting to build netCDF-4.9.0-gompi-2022a.eb. This is a dependency for the R 4.2.1 build. I have tried this using both 10 CPUs and 20 CPUs on the build system, with the same result, which is a failure in the MPI tests. I attach copies of the build logs. Can anyone shed any light on what's going wrong here?
easybuild-FMwVd0.log
easybuild-netCDF-4.9.0-20220802.094732.cKaNQ.log
easybuild-netCDF-4.9.0-20220802.094732.cKaNQ_test_report.md

@Micket
Contributor

Micket commented Aug 8, 2022

161 - nc_test4_run_par_test (Timeout)

so it times out.
My first guess is that this was perhaps built from a node with an interactive job, doing something like

srun -n 20 bash
eb netCDF--etc...eb 

If so, be aware that this will not work and will typically fail for the same reason: the srun that is running the interactive job blocks the MPI tests from launching (until it completes, which it never will, because it is waiting for the MPI job to run).
This happens with basically every package that runs MPI as part of its tests.

@Ghepardo
Contributor Author

Ghepardo commented Aug 9, 2022 via email

@Micket
Contributor

Micket commented Aug 9, 2022

However, I have now retried this by connecting to the same node using a simple SSH session, such that all 32 of the node's CPUs were available to me,

Just to be doubly sure, can you write a batch script that simply submits this job to the queue? i.e.

#!/usr/bin/env bash
#SBATCH -n 20

module load EasyBuild
eb netCDF-4.9.0-gompi-2022a.eb -r

@Micket
Contributor

Micket commented Aug 9, 2022

Regarding the other questions:

Should I have done anything special in setting up my SSH session?

If this happens to be a node where you are currently running a job, and you SSH into it with PAM's pam_slurm_adopt module active, then it makes no difference: Slurm still considers the node busy.

Or is there an srun option that would allow the use of MPI within an interactive session running under it?

There are surely flags, but it's not very convenient to try to pass those down.
I've hacked around this on a few occasions when I just wanted to play around and test stuff, by simply unsetting the SLURM* environment variables in the shell, which lets mpirun ignore Slurm and just run (I don't know which ones are strictly required; I just unset all of them):

unset SLURMD_NODENAME
unset SLURM_CLUSTER_NAME
unset SLURM_CONF
unset SLURM_CPUS_ON_NODE
unset SLURM_DISTRIBUTION
unset SLURM_GTIDS
unset SLURM_JOBID
unset SLURM_JOB_ACCOUNT
unset SLURM_JOB_CPUS_PER_NODE
unset SLURM_JOB_GID
unset SLURM_JOB_ID
unset SLURM_JOB_NAME
unset SLURM_JOB_NODELIST
unset SLURM_JOB_NUM_NODES
unset SLURM_JOB_PARTITION
unset SLURM_JOB_QOS
unset SLURM_JOB_RESERVATION
unset SLURM_JOB_UID
unset SLURM_JOB_USER
unset SLURM_LAUNCH_NODE_IPADDR
unset SLURM_LOCALID
unset SLURM_NNODES
unset SLURM_NODEID
unset SLURM_NODELIST
unset SLURM_NPROCS
unset SLURM_NTASKS
unset SLURM_PRIO_PROCESS
unset SLURM_PROCID
unset SLURM_PTY_PORT
unset SLURM_PTY_WIN_COL
unset SLURM_PTY_WIN_ROW
unset SLURM_SRUN_COMM_HOST
unset SLURM_SRUN_COMM_PORT
unset SLURM_STEPID
unset SLURM_STEP_GPUS
unset SLURM_STEP_ID
unset SLURM_STEP_LAUNCHER_PORT
unset SLURM_STEP_NODELIST
unset SLURM_STEP_NUM_NODES
unset SLURM_STEP_NUM_TASKS
unset SLURM_STEP_TASKS_PER_NODE
unset SLURM_SUBMIT_DIR
unset SLURM_SUBMIT_HOST
unset SLURM_TASKS_PER_NODE
unset SLURM_TASK_PID
unset SLURM_TOPOLOGY_ADDR
unset SLURM_TOPOLOGY_ADDR_PATTERN
unset SLURM_UMASK
unset SLURM_WORKING_CLUSTER
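
A shorter, bash-specific equivalent of the list above (same caveat: I don't know which variables are strictly required, so this just clears everything starting with SLURM in the current shell):

# unset every variable whose name starts with SLURM
for v in "${!SLURM@}"; do unset "$v"; done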

@Ghepardo
Contributor Author

Ghepardo commented Aug 9, 2022 via email

@Micket
Contributor

Micket commented Aug 9, 2022

It's best if you answer comments on GitHub instead of replying directly to the email; your mail client adds a bunch of extra stuff that ends up here, making things very hard to read (I edited that out of your comments).

I think you need to dig deeper here;

eb --dump-env netCDF-4.9.0-gompi-2022a.eb
source netCDF-4.9.0-gompi-2022a.env
cd /path/to/build/dir/netCDF-4.9.0-gompi-2022a/...

and try to run the test commands manually; look at what is happening while they run (is there any load at all?), perhaps strace'ing the test.

If you think there is just something wrong with this one test (I don't know if there are other MPI tests that pass... maybe?), then you can also just skip the test step.
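
For that last option, and assuming your EasyBuild version supports the option, the test step can be skipped from the command line without editing the easyconfig:

eb --skip-test-step netCDF-4.9.0-gompi-2022a.eb -r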

@Ghepardo
Contributor Author

Ghepardo commented Aug 9, 2022

Noted (and sorry) re response method.

I've created that .env file and sourced it, and I've found a directory /home/apps/eb/build/netCDF/4.9.0/gompi-2022a. This contains no files and 2 subdirectories, easybuild_obj and netcdf-c-4.9.0. I'm not clear what you're asking me to do from here (I have no idea how to run the test commands manually).

@Micket
Contributor

Micket commented Aug 9, 2022

I've created that .env file and sourced it, and I've found a directory /home/apps/eb/build/netCDF/4.9.0/gompi-2022a. This contains no files and 2 subdirectories, easybuild_obj and netcdf-c-4.9.0.

Yes: the build and source directories, respectively.

I'm not clear what you're asking me to do from here (I have no idea how to run the test commands manually).

Well, I'm just suggesting steps you can take to dig in and debug for yourself why this one test fails.
You can find each configure/build/test command that EB runs by e.g. grepping for run.py in the log file. Otherwise, there is no general answer, just generic problem solving.
I have never used netCDF, and it wouldn't surprise me if no other maintainer has either, as with 99% of the software we build. We are unlikely to find the cause, given that no one else seems to have managed to reproduce the issue.
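
As a concrete starting point, something like this should pull the exact commands EB ran out of the log (file name taken from your attachment above; adjust to your own):

grep "running cmd" easybuild-netCDF-4.9.0-20220802.094732.cKaNQ.log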

@Ghepardo
Contributor Author

Ghepardo commented Aug 9, 2022

Hmm, I'm pretty lost here. The log file shows:
== 2022-08-09 12:52:18,019 run.py:233 INFO running cmd: OMPI_MCA_rmaps_base_oversubscribe=1 make test

...before the start of the list of the 227 tests. I can see there's a Makefile, and it contains 2 "special rules" for "target test". One of these specifies a ".../ctest" command with arguments, which also appears in the log file. Are you saying I should try to run that command manually? If I do, won't I simply get the same result that Easybuild does?

I am not sure about getting strace output. If I managed this, wouldn't it be voluminous for those 227 tests? And searching for a timeout error in it would be like looking for a needle in a haystack?

I think you are saying mine is the only Easybuild site that has reported this error, and that other sites have built this with no problem. If that's the case, that suggests the problem is with our site, and I do understand that you can't spend your time trying to help find such a problem. But it does present us with an issue, given that this is a dependency for R, which is a very important package for us. We may have to look for alternative ways of obtaining R.

I do also understand that the Easybuild maintainers can't be responsible for maintaining easyconfigs submitted by the community. It is a shame that Easybuild does not seem to have any mechanism for ensuring that the submitter provides some level of support.

@Micket
Contributor

Micket commented Aug 9, 2022

...before the start of the list of the 227 tests. I can see there's a Makefile, and it contains 2 "special rules" for "target test". One of these specifies a ".../ctest" command with arguments, which also appears in the log file. Are you saying I should try to run that command manually? If I do, won't I simply get the same result that Easybuild does?

Well, that's how I start tackling most software problems: start by reproducing it, then continue to dig deeper, and deeper, and deeper. For example: checking whether any other tests ran MPI and somehow worked; seeing what actually happens on the system while the failing test is running (full load? waiting for MPI init?); checking whether there is any stderr/stdout from the failing test (ctest -V); maybe turning on some Open MPI debug environment variables to see if they reveal anything.

I am not sure about getting strace output. If I managed this, wouldn't it be voluminous for those 227 tests? And searching for a timeout error in it would be like looking for a needle in a haystack?

No, just the exact command (whatever ctest actually runs, e.g. ctest -V -R failing_test_name), and I wouldn't start off with strace. I'd first just see whether it is doing anything at all, i.e. is it stuck at MPI init, or is it stuck in a loop at full load? Then proceed from there.
Basically just standard debugging.
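
A rough sketch of that, using the build directory you mentioned earlier and the test name from the ctest output:

cd /home/apps/eb/build/netCDF/4.9.0/gompi-2022a/easybuild_obj
# re-run only the failing test, showing its output live
ctest -V -R nc_test4_run_par_test
# meanwhile, in another shell on the same node, check whether the processes are doing anything
top -u $USER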

I think you are saying mine is the only Easybuild site that has reported this error, and that other sites have built this with no problem.

So far, yes (as far as I've seen). Whoever merges a PR makes sure there is at least one successful build report from EB, and most of the common easyconfigs are probably built at 10+ different sites before they're shipped.
But this is a newer toolchain, so not as many people outside of the maintainers may have built it yet.

I do also understand that the Easybuild maintainers can't be responsible for maintaining easyconfigs submitted by the community. It is a shame that Easybuild does not seem to have any mechanism for ensuring that the submitter provides some level of support.

Then no one would volunteer to submit anything, so we certainly won't ever hold anyone to that (how could we? no one is getting paid here). It's all just a collective best effort where everyone submits patches, fixes, and improvements.

@Ghepardo
Contributor Author

Ghepardo commented Aug 9, 2022

OK, thanks. My first attempt to run "make test" manually did in fact produce slightly more information than appeared in the EasyBuild log file: it suggested I rerun with some additional arguments, "--rerun-failed --output-on-failure". Once I'd worked out where to put these (it was in the Makefile), I reran the "make test".

Unfortunately, this produced no further useful information: it seems to be a timeout in the "Parallel Performance Test for NASA".

I did attach strace to one of the PIDs participating in this - one with no further children. All I got was the following repeated over and over again:

fcntl(16, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(16, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
pwrite64(16, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 28208) = 40

Only the penultimate number (28208 in the extract above) varied between repetitions. I wouldn't know where to begin in making any use of that. This brings me back to my point about maintainers: it might mean something to someone who (a) is a developer and (b) knows this product. I am neither of those things myself.

If you've got any further suggestions I'd be interested to hear them, but I sense we've probably got as far as we can with this. I shall have to seek alternatives to Easybuild. But I'd be reluctant to close this issue, because it's not been resolved.

@Ghepardo
Contributor Author

Ghepardo commented Aug 9, 2022

Sorry, I didn't address a couple of the points you made:

Like checking if there were any other tests that ran MPI and somehow worked, seeing what actually happens on the system while the failing test is running (full load? waiting for MPI init?).

  • I'm afraid I wouldn't know how to identify those tests that ran MPI and those that didn't, and, if I did, I wouldn't know what to do with the information.
  • The system did not seem to be very loaded while the failing test was running - certainly far less than 100% CPU usage. I'm afraid I wouldn't know how to identify whether a system was waiting for MPI init.

@Ghepardo
Contributor Author

I have noticed that this build will succeed for me if I comment out the following line in the easyconfig:

runtest = 'test'

I have no way of telling what this line does: "runtest" does not appear in the Makefile, but it does appear in literally hundreds of other files in the build directory. This is a new line in this easyconfig, compared to the previous version.

@branfosj I see you committed this easyconfig. Is it possible to tell who submitted it to you, so that we may pick their brains on "runtest"?

@Ghepardo
Contributor Author

I have discovered that 'runtest' is just an easyconfig keyword, and "runtest = 'test'" just says to run 'make test' after 'make'. I guess its absence from previous versions of this easyconfig means we weren't running 'make test' for them. I feel back at square one again. @branfosj probably best if you ignore my earlier question.

@Micket
Contributor

Micket commented Aug 15, 2022

I have noticed that this build will succeed for me if I comment out the following line in the easyconfig:

runtest = 'test'

Well, yes, not running the test suite for a piece of software will ignore the issue, which might not be affecting you in practice, depending on what is failing. If all the numerical math were wrong in NumPy, I certainly wouldn't want to install that module without a closer look, but I have skipped failing tests for PyTorch when the failure only occurred on nodes with more than 6 GPUs (and it was a true bug that I reported upstream, with no easy fix or workaround). So, as I said earlier, ignoring failing tests is an option.

Unfortunately, this produced no further useful information: it seems to be a timeout in the "Parallel Performance Test for NASA".

Why? This seems very useful. I had a look, and it's just a very simple shell script, nc_test4/run_par_test.sh.in, so now we know that:

  1. it runs a bunch of MPI tests that did pass with no problems, so it's not all MPI stuff that has issues
  2. this seems like a much chunkier test, a performance test, that does parallel I/O, and it's failing on:
@MPIEXEC@ -n 4 ./tst_nc4perf

My guesses are

  1. This is likely impacted by what filesystem we build on. Trying to just switch to /dev/shm/ as the build directory is a good idea if we weren't doing that already (see the sketch after this list).
  2. Perhaps it's just really really slow on whatever system we build it on. We could run ctest (which is what make test does here) with a way higher timeout to see if it ever finishes. ctest --timeout 100000 -R name_of_test
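
For point 1, an untested sketch of building in memory instead of on the parallel filesystem (assuming /dev/shm has enough free space for the netCDF build):

eb --buildpath /dev/shm/$USER/eb-build netCDF-4.9.0-gompi-2022a.eb -r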

This brings me back to my point about maintainers:

I can only speak for myself, but there is nothing magic about being a maintainer. We are basically all just users ourselves who were foolish enough to also volunteer our spare time to drive the project and review contributions to the best of our ability.

it might mean something to someone who (a) is a developer and (b) knows this product. I am neither of those things myself.

Indeed; one will need to learn it, or pay someone else who knows it.

As noted in the other PR that is linked above, it seems likely that the same issue occurred on the test build for the iimpi variant on the generoso test cluster. I don't have access there but some other maintainer might find time to investigate where that one failed.

@Ghepardo
Contributor Author

Thanks @Micket, that's really helpful. I reran the failing test with the timeout you suggested. The test now ran to completion - but, in taking over 4 hours to complete, it really did need the much higher timeout you suggested. Here is what I got:

Running tests...
/home/apps/eb/software/CMake/3.23.1-GCCcore-11.3.0/bin/ctest --rerun-failed --output-on-failure --timeout 100000 --force-new-ctest-process
Test project /home/apps/eb/build/netCDF/4.9.0/gompi-2022a/easybuild_obj
    Start 161: nc_test4_run_par_test
1/1 Test #161: nc_test4_run_par_test ............   Passed  15827.64 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 15828.10 sec

I see that the Makefile invokes ctest with ${ARGS}, so I tried export ARGS='--timeout 100000' before rerunning the build. That seems to have done the trick - the build has been running for 2 hours so far, and would have fallen over by now without that variable setting.

Is it possible to get that --timeout option to ctest passed via the easyconfig? I can't see anything in the easyconfig options that would allow this.

Of course, this raises the question of why this test takes 100x as long on my platform as on yours. I shall try using /dev/shm next.

@Micket
Contributor

Micket commented Aug 16, 2022

Parallel filesystems can be very picky about which I/O patterns are used. Perhaps this benchmark induces some locking behaviour, or perhaps it just creates the worst-case scenario for your particular filesystem (I don't know which one you have).

I'd also be interested in seeing whether it's just this test, or whether any of the others are also slow. We can consider patching out the specific NASA performance benchmark; it might not be broadly applicable.
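
On the earlier question of passing the ctest timeout via the easyconfig: I haven't tried it, but since make test forwards ${ARGS} to ctest, something like this might work, assuming your EasyBuild version supports the pretestopts parameter:

# untested sketch: pretestopts is prefixed to the test command, so ARGS ends up in the environment of 'make test'
pretestopts = 'ARGS="--timeout 100000" '
runtest = 'test'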

edit: accidentally double commented since my browser hadn't refreshed

@Micket
Contributor

Micket commented Aug 17, 2022

Please test #16050

eb --from-pr 16050 --upload-test-report --rebuild 

@Ghepardo
Contributor Author

Ghepardo commented Aug 17, 2022

@Micket running this rebuild myself now. This looks very promising - the testing for iteration #0 took 0:18:35 instead of the 4:37:23 previously. I think the testing for iteration #1 may take a similar time.

@Ghepardo
Contributor Author

And it's completed. Testing for the second iteration took 0:18:47. I believe you'll find a test report on the PR. Just confirming: this build was run without ARGS set in my environment. (Previously, this had needed to contain the value '--timeout 100000'.)

I think we could close this issue, if you're happy.

@Micket
Contributor

Micket commented Aug 17, 2022

It'll be auto-closed when the PR merges; thanks for the test report.

@Poshi
Contributor

Poshi commented Sep 3, 2022

Just for reference, I also hit the same issue with a slightly modified toolchain ("just" updating GCC to 12.1.0).
I ran the tests with an increased time limit and the test at hand took 16844.59 s, but finished properly. So I applied the changes proposed in the PR and it worked (it avoided the problematic test and installed the software).
On my side, as long as the results are OK, the installation is OK. If a specific test has very bad performance, that is acceptable to me (I'm just installing R because it is needed for Picard to run a script that generates a graph... and I don't care about the graph, but I cannot disable the graph generation :-()

@boegel boegel added this to the release after 4.7.1 milestone Mar 18, 2023
@boegel boegel changed the title Build error for netCDF 4.9.0 test timeouts for netCDF 4.9.0 Mar 18, 2023
@boegel
Member

boegel commented Mar 18, 2023

Not out of the woods yet here, I'm seeing test timeouts for netCDF-4.9.0-iimpi-2022b.eb on pretty much all our clusters (though netCDF-4.9.0-gompi-2022b.eb works fine):

99% tests passed, 2 tests failed out of 218

Total Test time (real) = 3090.92 sec

The following tests FAILED:
        161 - nc_test4_run_par_test (Timeout)
        188 - h5_test_run_par_tests (Timeout)

@sassy-crick
Collaborator

To add to this (it might or might not be relevant): I am seeing similar timeouts with netCDF-4.9.0-iimpi-2022a.eb on AMD Rome, but not on our Haswell/Broadwell builds. The failing tests are:

The following tests FAILED:
        161 - nc_test4_run_par_test (Timeout)
        188 - h5_test_run_par_tests (Timeout)
