HTCondor CI is failing #568

Closed
guillaumeeb opened this issue Aug 7, 2022 · 2 comments


@guillaumeeb
Member

I've spent some time trying to debug the HTCondor CI (see #562 (comment)).

Although it works on my laptop, I have not been able to make it work in GitHub Actions, and I don't know why yet.

Some details:

  • I've updated the Docker images to run the correct Python version (the same as the other CI environments).
  • I've updated the docker-compose setup to configure the HTCondor negotiator interval, i.e. the time between two scheduling cycles. The default of 60s could make tests fail, or at least made them really long.
  • I've added some debugging docker commands to try to understand what's happening.

From the last run here, the Cleanup sections and the condor_history command output show that the HTCondor queuing system is working. The problem seems to come from the worker jobs, which complete really fast: the Dask workers never connect to the Scheduler.

I guess we would need to see the jobs' stdout/stderr to debug the problem further.
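
If the jobs were configured to write their output files into a known directory, a small helper along these lines could print them at the end of a CI run. This is only a sketch: the directory layout, the `.out`/`.err` suffixes and the function names are assumptions for illustration, not something taken from the CI setup.

```python
from pathlib import Path


def collect_job_logs(log_dir, suffixes=(".out", ".err")):
    """Return {file name: content} for job output files found in log_dir."""
    logs = {}
    for path in sorted(Path(log_dir).iterdir()):
        # Only pick up regular files with one of the assumed suffixes.
        if path.is_file() and path.suffix in suffixes:
            logs[path.name] = path.read_text(errors="replace")
    return logs


def print_job_logs(log_dir):
    """Print every collected file under a header, marking empty files."""
    for name, content in collect_job_logs(log_dir).items():
        print(f"===== {name} =====")
        print(content if content else "(empty)")
```

Calling `print_job_logs("/path/to/logs")` from a debugging step would then show each worker's stdout and stderr, including files that exist but are empty.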

cc @riedel @mivade @jolange.

@jolange
Contributor

jolange commented Aug 8, 2022

Thanks @guillaumeeb! Yes, it looks like the worker jobs are failing immediately.
The command now starts with a semicolon:

 Arguments = "-c '; /opt/anaconda/bin/python -m distributed.cli.dask_worker tcp://172.18.0.4:33755 --nthreads 1 --memory-limit 100.00MiB --name HTCondorCluster-1 --nanny --death-timeout 60'"

I certainly broke this with my changes, I guess because env_extra does not stay None due to this line.
I'm not sure whether this alone would cause the jobs to fail, but I'll fix it first before looking any further.
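
To illustrate how an empty list can produce the leading semicolon seen in the Arguments line, here is a minimal Python sketch. The function names and the simplified template are hypothetical; this is not the actual dask-jobqueue code, only an assumption about the general shape of the problem.

```python
def build_command_buggy(env_extra, worker_command):
    # Hypothetical, simplified template: only None means "no extra commands".
    if env_extra is None:
        return worker_command
    # With env_extra == [], the join yields "" and the result starts with "; ".
    return "; ".join(env_extra) + "; " + worker_command


def build_command_fixed(env_extra, worker_command):
    # Treat both None and an empty list as "nothing to prepend".
    if not env_extra:
        return worker_command
    return "; ".join(env_extra) + "; " + worker_command


worker = "python -m distributed.cli.dask_worker tcp://172.18.0.4:33755"
buggy = build_command_buggy([], worker)
fixed = build_command_fixed([], worker)
print(repr(buggy))
print(repr(fixed))
```

Checking for falsiness rather than `is None` covers the case where a configuration default turns the value into an empty list.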

Otherwise, seeing stderr would probably help a lot. I remember that I had to manually downgrade click for some Python versions, but I tried a number of setups, so I don't really remember...

jolange added a commit to jolange/dask-jobqueue that referenced this issue Aug 8, 2022
Due to config read with `default=[]`, `env_extra` will not stay `None`
but become an empty list. This resulted in a command template starting
with a semicolon.

Possibly related to dask#568
guillaumeeb added a commit that referenced this issue Aug 8, 2022
* Fix command template for empty `env_extra` in HTCondor

Due to config read with `default=[]`, `env_extra` will not stay `None`
but become an empty list. This resulted in a command template starting
with a semicolon.

Possibly related to #568

* formatting

* reenable HTCondor CI workflow

* DEBUG: downgrade click

* DEBUG: condor_q + condor logs

* DEBUG: more output

* DEBUG

* DEBUG

* test_basic[HTCondorCluster]: use 2GiB memory

* DEBUG more output

* test_basic[HTCondorCluster]: use 500MiB memory

* adapt assertion to 500MiB

* DEBUG: revert all debugging stuff

* test_basic[HTCondorCluster]: use 500MiB memory

* Revert "DEBUG: downgrade click"

This reverts commit 56e56d4.

* EMPTY to trigger CI

* Fix also test_extra_args_broken_cancel

* Re-add some debugging outputs for HTCondor CI just in case

Co-authored-by: Guillaume EB <g.eynard.bontemps@gmail.com>
@guillaumeeb
Member Author

Closed by #570.
