Fix command template for empty env_extra in HTCondor #570

Merged
merged 18 commits into from Aug 8, 2022

Conversation

jolange
Contributor

@jolange jolange commented Aug 8, 2022

Due to the config read with default=[], env_extra will not stay None but become an empty list. This resulted in a command template starting with a semicolon.

By first merging env_extra and _command_template to a single list, this is avoided.

Possibly related to #568, cc @guillaumeeb
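
To illustrate the change described above, here is a minimal sketch of the two joining strategies. The function names and the sample command strings are hypothetical and are not taken from the dask-jobqueue source; only the behaviour (an empty `env_extra` list producing a leading semicolon) comes from the description.

```python
def join_separately(env_extra, command_template):
    """Old-style joining (assumed shape): env lines first, then '; ' + command.

    With env_extra == [] the joined prefix is an empty string, so the
    result starts with a stray semicolon.
    """
    if env_extra is not None:
        return "; ".join(env_extra) + "; " + command_template
    return command_template


def join_as_single_list(env_extra, command_template):
    """Merge everything into one list first, then join once.

    An empty or missing env_extra simply contributes no elements.
    """
    parts = list(env_extra or []) + [command_template]
    return "; ".join(parts)


if __name__ == "__main__":
    command = "python -m distributed.cli.dask_worker"  # placeholder command
    print(repr(join_separately([], command)))
    print(repr(join_as_single_list([], command)))
    print(repr(join_as_single_list(["export FOO=bar"], command)))
```

Joining a single list means the separator only ever appears between two real elements, so no special case for the empty list is needed.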

@jolange
Contributor Author

jolange commented Aug 8, 2022

Now I am confused:

  • PBS failed in the container setup step. Solved with new push
  • Slurm failed. <Client: No scheduler connected> sounds suspicious. Solved with new push
  • HTCondor did not run? Ok, I now saw that you explicitly disabled it!

@jolange
Contributor Author

jolange commented Aug 8, 2022

This is really shaky: after adding debug output (f03f257), the CI for HTCondor was successful, but failed again later.

Now, this is an example stderr of a worker (from this run):

2022-08-08 13:40:37,987 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.3:44283'
2022-08-08 13:40:39,274 - distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO -          Listening to:     tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO -          dashboard at:           172.18.0.3:38721
2022-08-08 13:40:39,276 - distributed.worker - INFO - Waiting to connect to:     tcp://172.18.0.5:36311
2022-08-08 13:40:39,276 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,277 - distributed.worker - INFO -               Threads:                          1
2022-08-08 13:40:39,277 - distributed.worker - INFO -                Memory:                 100.00 MiB
2022-08-08 13:40:39,278 - distributed.worker - INFO -       Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-wfg9pxc7
2022-08-08 13:40:39,278 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,285 - distributed.worker - INFO -         Registered to:     tcp://172.18.0.5:36311
2022-08-08 13:40:39,285 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,286 - distributed.core - INFO - Starting established connection
2022-08-08 13:40:39,422 - distributed.worker_memory - WARNING - Worker tcp://172.18.0.3:42825 (pid=93) exceeded 95% memory budget. Restarting...
2022-08-08 13:40:39,430 - distributed.nanny - INFO - Worker process 93 was killed by signal 15
2022-08-08 13:40:39,435 - distributed.nanny - WARNING - Restarting worker
2022-08-08 13:40:40,714 - distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:45541
2022-08-08 13:40:40,714 - distributed.worker - INFO -          Listening to:     tcp://172.18.0.3:45541
2022-08-08 13:40:40,714 - distributed.worker - INFO -          dashboard at:           172.18.0.3:33933
2022-08-08 13:40:40,714 - distributed.worker - INFO - Waiting to connect to:     tcp://172.18.0.5:36311
2022-08-08 13:40:40,714 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:40,714 - distributed.worker - INFO -               Threads:                          1
2022-08-08 13:40:40,714 - distributed.worker - INFO -                Memory:                 100.00 MiB
2022-08-08 13:40:40,714 - distributed.worker - INFO -       Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-mgz6j_u2
2022-08-08 13:40:40,714 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:40,723 - distributed.worker_memory - WARNING - Worker tcp://172.18.0.3:42825 (pid=102) exceeded 95% memory budget. Restarting...
2022-08-08 13:40:40,737 - distributed.nanny - INFO - Worker process 102 was killed by signal 15
2022-08-08 13:40:40,740 - distributed.nanny - WARNING - Restarting worker
2022-08-08 13:40:40,771 - distributed._signals - INFO - Received signal SIGTERM (15)
2022-08-08 13:40:40,772 - distributed.nanny - INFO - Closing Nanny at 'tcp://172.18.0.3:44283'.
2022-08-08 13:40:40,772 - distributed.nanny - INFO - Nanny asking worker to close

@guillaumeeb Is the problem simply WARNING - Worker tcp://172.18.0.3:42825 (pid=93) exceeded 95% memory budget. Restarting...? But Slurm, SGE, PBS, ... use 2GB for this test. I'll try that now.
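
For spotting this pattern in longer worker logs, a small helper along these lines could be used. It is only a sketch: the regular expression is modelled on the two warning lines in the stderr above, and the function name is made up for illustration.

```python
import re

# Pattern modelled on the warning lines in the stderr excerpt above.
MEMORY_WARNING = re.compile(
    r"distributed\.worker_memory - WARNING - Worker (?P<address>\S+) "
    r"\(pid=(?P<pid>\d+)\) exceeded (?P<percent>\d+)% memory budget"
)


def find_memory_restarts(stderr_text):
    """Return (address, pid, percent) for each memory-budget warning found."""
    hits = []
    for line in stderr_text.splitlines():
        match = MEMORY_WARNING.search(line)
        if match:
            hits.append(
                (match["address"], int(match["pid"]), int(match["percent"]))
            )
    return hits


SAMPLE = (
    "2022-08-08 13:40:39,286 - distributed.core - INFO - Starting established connection\n"
    "2022-08-08 13:40:39,422 - distributed.worker_memory - WARNING - "
    "Worker tcp://172.18.0.3:42825 (pid=93) exceeded 95% memory budget. Restarting...\n"
    "2022-08-08 13:40:40,723 - distributed.worker_memory - WARNING - "
    "Worker tcp://172.18.0.3:42825 (pid=102) exceeded 95% memory budget. Restarting...\n"
)

if __name__ == "__main__":
    for address, pid, percent in find_memory_restarts(SAMPLE):
        print(f"{address} (pid={pid}) hit {percent}% of its memory budget")
```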

@guillaumeeb
Member

@jolange I think you're onto something!! It looks like 100MiB is not enough for running a Dask Worker!

However, I believe the default Condor setup has only 1GB available on each Condor worker node, so you should use a number lower than that. Maybe try with 500MiB to be safe?

@jolange
Contributor Author

jolange commented Aug 8, 2022

Ah, thanks, I was just trying to find out what the available memory could be. With 2GiB the job did not start to run, so that seemed too much ;-) I'm trying with 500MiB now.

@jolange
Contributor Author

jolange commented Aug 8, 2022

With 500MiB it worked without the warning in stderr, and I also had a successful CI run for HTCondor.
Still, the last run resulted in a timeout again, but that also happens for "CI / build (none)" for the LocalCluster from time to time.

@guillaumeeb
Member

Just tried a complementary fix on your branch, hope it's okay. The second test was probably fragile too, because it also used only 100MiB for worker jobs. If that test fails and its workers are not cleaned up, other tests will fail.

@guillaumeeb
Member

Okay, HTCondor CI is green, nice 👏. Thanks a lot @jolange!

I will just make another commit here to re-add some of the debug tricks you used, it could be nice later on to have worker logs again!

Member

@guillaumeeb guillaumeeb left a comment


All is green and the output is detailed. Again, thanks a lot for the work here @jolange!

@guillaumeeb guillaumeeb merged commit 7ec9bd0 into dask:main Aug 8, 2022
@jolange
Contributor Author

jolange commented Aug 8, 2022

Nice, thanks!

@jolange jolange deleted the fix_empty_env_extra branch August 10, 2022 13:45