
Drop Python 3.7 #562

Merged: 21 commits merged into dask:main on Aug 7, 2022

Conversation

guillaumeeb
Member

Fixes #561.

At the same time, I'm trying to fix CI issues where possible.

@guillaumeeb
Member Author

guillaumeeb commented Jul 26, 2022

Okay, so I'm not sure why the `none` CI tests are failing with a config error here. When I try to reproduce it locally in the same environment, everything is fine on the config side. See:

$ pytest dask_jobqueue/tests/test_slurm.py --verbose
=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0 -- /home/guillaumeeb/miniconda3/envs/dask-dev/bin/python
cachedir: .pytest_cache
rootdir: /home/guillaumeeb/dask-jobqueue
collected 7 items

dask_jobqueue/tests/test_slurm.py::test_header PASSED                                                                                                                                [ 14%]
dask_jobqueue/tests/test_slurm.py::test_job_script PASSED                                                                                                                            [ 28%]
dask_jobqueue/tests/test_slurm.py::test_basic SKIPPED (test requires env in ['slurm'])                                                                                               [ 42%]
dask_jobqueue/tests/test_slurm.py::test_adaptive SKIPPED (test requires env in ['slurm'])                                                                                            [ 57%]
dask_jobqueue/tests/test_slurm.py::test_config_name_slurm_takes_custom_config PASSED                                                                                                 [ 71%]
dask_jobqueue/tests/test_slurm.py::test_different_interfaces_on_scheduler_and_workers SKIPPED (test requires env in ['slurm'])                                                       [ 85%]
dask_jobqueue/tests/test_slurm.py::test_worker_name_uses_cluster_name SKIPPED (test requires env in ['slurm'])                                                                       [100%]

=============================================================================== 3 passed, 4 skipped in 4.36s ===============================================================================

@guillaumeeb
Member Author

Any insights, @jacobtomlinson @jrbourbeau?

@guillaumeeb
Member Author

Correction: when I launch all the tests, I also see the KeyError from config. I'll try to understand why.

@jacobtomlinson
Member

I've not seen that before. I know @mrocklin was talking about changing the way configs in subprojects are handled in dask/dask#9171, although that was just a documentation change. But maybe something else has changed?

@guillaumeeb
Member Author

guillaumeeb commented Jul 26, 2022

Okay, so after some investigation, I narrowed the problem down to tests using the loop fixture from distributed.utils_test.

I'm not sure why we use this fixture so often or what it's supposed to do (provide a clean event loop for the Scheduler to run on?), but it doesn't seem to play well with the dask configuration loaded by default. Whenever this fixture is used, the default jobqueue config is lost from dask.config (although the dask and distributed configuration still seems to be there).

Any hints on how to avoid that or what to change?
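For reference, dask-jobqueue registers its defaults once at import time, and re-applying them looks roughly like the sketch below (a simplified illustration mirroring what dask_jobqueue/config.py does, not the exact project code):

import os

import dask
import dask_jobqueue
import yaml

# Re-apply the jobqueue defaults the same way the package does at import time.
fn = os.path.join(os.path.dirname(dask_jobqueue.__file__), "jobqueue.yaml")
with open(fn) as f:
    dask.config.update_defaults(yaml.safe_load(f))

# With the defaults in place this prints "dask-worker"; after a config reset
# the same lookup raises a KeyError instead.
print(dask.config.get("jobqueue.slurm.name"))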

@jacobtomlinson
Member

@graingert has been cleaning up a bunch of the loop handling in distributed lately, so perhaps one of those changes broke things here. Thomas, do you have any thoughts on this?

@graingert
Member

we reset the config as part of the loop fixture setup: https://github.com/dask/distributed/blob/236945a3f734eea2e687381e5cd86b561796c535/distributed/utils_test.py#L1883

when you make your own loop you should be able to use:

@pytest.fixture
def my_loop(loop):
    # wrap distributed's loop fixture, applying the needed dask config around it
    with dask.config.set(...):
        yield loop
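A test could then request the wrapped fixture instead of distributed's loop directly; a hypothetical example (the fixture name and config key are illustrative, and it assumes the ... above re-applies the jobqueue defaults):

import dask

# Hypothetical test consuming the wrapped fixture: whatever was passed to
# dask.config.set() in my_loop is visible for the duration of the test.
def test_with_wrapped_loop(my_loop):
    assert dask.config.get("jobqueue.slurm.name", default=None) is not None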

@graingert
Member

graingert commented Jul 27, 2022

and setting the dask config within the cleanup fixture should help too:

From 70375807624e819620b4cad21041b41bb3ece69a Mon Sep 17 00:00:00 2001
From: Thomas Grainger <tagrain@gmail.com>
Date: Wed, 27 Jul 2022 10:47:36 +0100
Subject: [PATCH] set dask config within cleanup

---
 conftest.py | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/conftest.py b/conftest.py
index 1b92c19..26f7aae 100644
--- a/conftest.py
+++ b/conftest.py
@@ -1,7 +1,7 @@
 # content of conftest.py
 
 # Make loop fixture available in all tests
-from distributed.utils_test import loop  # noqa: F401
+from distributed.utils_test import loop, cleanup  # noqa: F401
 
 import pytest
 
@@ -97,18 +97,22 @@ all_envs = {
 @pytest.fixture(
     params=[pytest.param(v, marks=[pytest.mark.env(k)]) for (k, v) in all_envs.items()]
 )
-def EnvSpecificCluster(request):
+def EnvSpecificCluster(request, cleanup):
     """Run test only with the specific cluster class set by the environment"""
     if request.param == HTCondorCluster:
         # HTCondor requires explicitly specifying requested disk space
-        dask.config.set({"jobqueue.htcondor.disk": "1GB"})
-    return request.param
+        with dask.config.set({"jobqueue.htcondor.disk": "1GB"}):
+            yield request.param
+    else:
+        yield request.param
 
 
-@pytest.fixture(params=list(all_envs.values()))
-def Cluster(request):
+@pytest.fixture(params=list(all_envs.values()))
+def Cluster(request, cleanup):
     """Run test for each cluster class when no environment is set (test should not require the actual scheduler)"""
     if request.param == HTCondorCluster:
         # HTCondor requires explicitly specifying requested disk space
-        dask.config.set({"jobqueue.htcondor.disk": "1GB"})
-    return request.param
+        with dask.config.set({"jobqueue.htcondor.disk": "1GB"}):
+            yield request.param
+    else:
+        yield request.param
-- 
2.34.1
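Note that using dask.config.set as a context manager around the yield means the override is undone on fixture teardown, so the HTCondor disk setting no longer leaks into tests parametrized with the other cluster classes.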

@guillaumeeb
Member Author

when you make your own loop you should be able to use

I haven't tried it yet, and I'm not fluent with context managers and all that, but I'm not sure how this would fix our issue here.

How can I use the loop fixture and still inherit the default dask-jobqueue config with it? Is that what the code you mention is doing?

@guillaumeeb
Member Author

Okay, so we have some bytes-counting issues here. After investigation, this originates from dask/dask#7484. Since we store the memory as an integer, and in our tests we express it in GB rather than GiB, we lose information converting back and forth between string and int. See below what I get in Python:

In [23]: from dask.utils import parse_bytes, format_bytes

In [24]: format_bytes(parse_bytes('2GB'))
Out[24]: '1.86 GiB'

In [25]: parse_bytes(format_bytes(parse_bytes('2GB')))
Out[25]: 1997159792
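For comparison, sticking to power-of-two units makes the round trip lossless, which is what the test changes here rely on (a quick illustration, not taken from the test suite):

from dask.utils import format_bytes, parse_bytes

# With a power-of-two size, the string -> int -> string round trip is exact,
# so assertions on formatted memory stay stable.
assert format_bytes(parse_bytes("2GiB")) == "2.00 GiB"
assert parse_bytes(format_bytes(parse_bytes("2GiB"))) == 2 * 1024**3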

@guillaumeeb
Member Author

So I've made things work locally, even with the HTCondor tests. But CI still fails.

The HTCondor test fix is not really clean: one test leaves a job running, which prevents other tests from succeeding, so I had to manually remove these jobs from the queue at the end of the test. The problem is that if a test fails, the queue might not be cleaned properly. Adding a fixture that cleans up running HTCondor jobs might help (see the sketch below).

Moreover, the scheduling cycle seems to be a bit long with the default config, so I had to use somewhat high timeouts.

We could probably do better, cc @riedel, @mivade or @jolange. If someone is willing to take a look, please do so. We should look into the Docker Compose setup to reduce the scheduling cycle.
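A rough sketch of such a cleanup fixture (the fixture name is made up, and it assumes condor_rm is available inside the test container):

import subprocess

import pytest


@pytest.fixture
def clean_htcondor_queue():
    # Run the test first, then remove any jobs it left behind so the next
    # test starts from an empty queue, even if the test itself failed.
    yield
    subprocess.run(["condor_rm", "-all"], check=False)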

@guillaumeeb
Member Author

So I was going to give up on the HTCondor tests for now and call it a day (more like a week, actually), but the dirty hack I introduced by overriding the cleanup fixture doesn't work anymore with the latest distributed version. The loop fixture has been modified there (dask/distributed@172e37f), so once again all the dask-jobqueue configuration is lost when using it.

@graingert I think I'll really need your help to fix this. I don't see a simple, even hacky, workaround to make CI work here; we'll probably need to fix #567 by continuing #565. But if you have a temporary workaround, I'll take it.

@guillaumeeb
Member Author

One thing I'd like to understand is: do we really need the loop fixture from distributed?

I've looked at dask-kubernetes, dask-cloudprovider, and dask-yarn, and didn't find any use of it there.

The only external place I saw it used was in Xarray.

What's the goal of this fixture for us?

@guillaumeeb
Member Author

guillaumeeb commented Aug 7, 2022

There are a lot of modifications here. The first is the drop of Python 3.7; some are genuine test fixes (PBS, LSF, None, HTCondor); others are attempts to fix tests or to add debugging and readability.

Finally, there is one part that is just a dirty workaround involving the distributed fixtures, to get tests passing even though the code does not handle event loops correctly (see the sketch below).
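For the record, the workaround boils down to replacing distributed.utils_test.reset_config with a version that re-applies the jobqueue defaults; a simplified, hypothetical sketch (the actual code in this PR may differ):

import importlib

import distributed.utils_test

import dask_jobqueue.config


def _reset_config_keeping_jobqueue():
    # Re-run dask-jobqueue's import-time configuration so its defaults come
    # back after distributed's fixtures reset dask.config.
    importlib.reload(dask_jobqueue.config)


# Swap in our version so the loop/cleanup fixtures no longer drop the
# jobqueue configuration between tests.
distributed.utils_test.reset_config = _reset_config_keeping_jobqueue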

I'm going to merge this anyway, because it still gives us some tests and checks on merge requests.
I'm also going to open issues for the remaining questions and unsolved problems.

@guillaumeeb guillaumeeb merged commit 5308486 into dask:main Aug 7, 2022
guillaumeeb added a commit to ikabadzhov/dask-jobqueue that referenced this pull request Aug 7, 2022
* Drop Python 3.7

* Fix cleanup fixture problem (see distributed#9137)

* Override cleanup distributed fixture, and reconfigure dask-jobqueue when called

* Use power of 2 for the memory checks in tests (see dask#7484)

* Apply Black

* Apply correct version of black...

* conda install Python in HTCondor Docker image

* Fix HTCondor Dockerfile

* Fix PBS Issue

* Add a timeout for every wait_for_workers call

* Fix HTCondor tests, leave room for scheduling cycle to take place for HTCondor

* flake check

* debugging HTCondor tests on CI

* Flake

* reduce negotiator interval for faster job queuing, more debugging logs

* move condor command at the right place

* always run cleanup step, and print more things on HTCondor

* Clean temporary debug and other not necessary modifications

* Disable HTCondor CI for now

* Import loop_in_thread fixture from distributed

* Override method from distributed to make tests pass
guillaumeeb added a commit that referenced this pull request Aug 7, 2022
…ob scripts (#560)

* Use `--nworkers` instead of deprecated `--nprocs` in the generated job scripts

Fix #559.

* Update the requirements for dask and distributed

This is needed to support the `--nworkers` option.

* Fix behaviour of `env_extra` for HTCondor (#563)

* fix behaviour of `env_extra` for HTCondor

Before, only environment variables were considered, using HTCondor's
`Environment =` feature, whereas the parameter description generally
says "Other commands to add".

If the `Environment =` is needed, one can still use the generic
`job_extra` parameter to set it.

fixes #393
related to #323 #556

* adapt htcondor tests for new `env_extra` behaviour

* adapt htcondor tests for new `env_extra` behaviour

"export" is preserved now

* docs: made "Moab Deployments" heading the same level as the others

* docs: added description of HTCondorCluster + env_extra

- example in the docstring
- description in "example deployments""
- description in "advanced tops an tricks"

* docs: removed the HTCondorCluster section from examples

* formatting according to black and flake8

* Drop Python 3.7 (#562)

* Drop Python 3.7

* Fix cleanup fixture problem (see distributed#9137)

* Override cleanup distributed fixture, and reconfigure dask-jobqueue when called

* Use power of 2 for the memory checks in tests (see dask#7484)

* Apply Black

* Apply correct version of black...

* conda install Python in HTCondor Docker image

* Fix HTCondor Dockerfile

* Fix PBS Issue

* Add a timeout for every wait_for_workers call

* Fix HTCondor tests, leave room for scheduling cycle to take place for HTCondor

* flake check

* debugging HTCondor tests on CI

* Flake

* reduce negotiator interval for faster job queuing, more debugging logs

* move condor command at the right place

* always run cleanup step, and print more things on HTCondor

* Clean temporary debug and other not necessary modifications

* Disable HTCondor CI for now

* Import loop_in_thread fixture from distributed

* Override method from distributed to make tests pass

Co-authored-by: Johannes Lange <jolange@users.noreply.github.com>
Co-authored-by: Guillaume Eynard-Bontemps <g.eynard.bontemps@gmail.com>


# Overriding distributed.utils_test.reset_config() method because it reset the
# config from ou tests.
Member

Our

Member Author

Whoops, I was too eager to merge this and left a typo, thanks for the proofread! Anyway, I hope this text will soon disappear.
