HTCondor CI is failing #568

guillaumeeb · 2022-08-07T17:29:17Z

I've spent some time trying to debug the HTCondor CI (see #562 (comment)).

It works on my laptop, but I haven't been able to make it work in GitHub Actions, and I don't know why yet.

Some details:

I've updated the Docker images to run the correct Python version (the same as the other CI environments).
I've updated the docker-compose setup to configure the HTCondor negotiator interval, the time between two scheduling cycles. The default of 60s could make tests fail, or made them really long.
I've added some debugging docker commands to try to understand what's happening.

From the last run here, we can see from the Cleanup sections and the condor_history command output that the HTCondor queuing system is working. The problem seems to come from the worker jobs, which complete really fast: the Dask workers never connect to the scheduler.

I guess we would need to see the jobs' stdout/stderr to debug the problem further.

cc @riedel @mivade @jolange.


jolange · 2022-08-08T08:19:15Z

Thanks @guillaumeeb! Yes, this looks like the worker jobs are failing directly.
The command now starts with a semicolon:

 Arguments = "-c '; /opt/anaconda/bin/python -m distributed.cli.dask_worker tcp://172.18.0.4:33755 --nthreads 1 --memory-limit 100.00MiB --name HTCondorCluster-1 --nanny --death-timeout 60'"

I certainly broke this with my changes, I guess because env_extra does not stay None due to this line.
I'm not sure if this would cause this to fail, but I'll fix it first before looking further into this.
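To illustrate the suspected cause, here is a minimal sketch of how an empty list can yield a leading semicolon. The functions `command_buggy` and `command_fixed` are hypothetical and are not the actual dask-jobqueue code; the sketch only assumes that the environment lines and the worker command are joined with `; `.

```python
def command_buggy(env_extra, worker_command):
    # Hypothetical: only guards against None, so an empty list
    # still takes the "has environment lines" branch.
    if env_extra is None:
        return worker_command
    return "; ".join(env_extra) + "; " + worker_command


def command_fixed(env_extra, worker_command):
    # Hypothetical: a truthiness check treats None and [] the same way.
    if not env_extra:
        return worker_command
    return "; ".join(env_extra) + "; " + worker_command


worker = "python -m distributed.cli.dask_worker tcp://172.18.0.4:33755"
print(repr(command_buggy([], worker)))  # starts with "; "
print(repr(command_fixed([], worker)))  # the plain worker command
```

With `env_extra=[]`, the first variant joins nothing and then appends the separator, which matches the `-c '; ...'` seen in the job arguments above.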

Otherwise, seeing stderr would probably help a lot -- I remember that I had to do some manual downgrades of click for some Python versions. But I tried a number of setups, so I don't really remember...

Fix command template for empty `env_extra` in HTCondor

Due to config read with `default=[]`, `env_extra` will not stay `None` but become an empty list. This resulted in a command template starting with a semicolon. Possibly related to #568

* formatting
* reenable HTCondor CI workflow
* DEBUG: downgrade click
* DEBUG: condor_q + condor logs
* DEBUG: more output
* DEBUG
* DEBUG
* test_basic[HTCondorCluster]: use 2GiB memory
* DEBUG more output
* test_basic[HTCondorCluster]: use 500MiB memory
* adapt assertion to 500MiB
* DEBUG: revert all debugging stuff
* test_basic[HTCondorCluster]: use 500MiB memory
* Revert "DEBUG: downgrade click" This reverts commit 56e56d4.
* EMPTY to trigger CI
* Fix also test_extra_args_broken_cancel
* Re-add some debugging outputs for HTCondor CI just in case

Co-authored-by: Guillaume EB <g.eynard.bontemps@gmail.com>

guillaumeeb · 2022-08-08T18:04:56Z

Closed by #570.

riedel mentioned this issue Aug 7, 2022

CI errors after dropping Python 3.6 #547

Closed

jolange mentioned this issue Aug 8, 2022

Fix command template for empty env_extra in HTCondor #570

Merged

guillaumeeb closed this as completed Aug 8, 2022
