Make dataframe `assert_eq` scheduler configurable #8811

charlesbluca · 2022-03-15T18:30:42Z

Adds dataframe.assert-eq.scheduler to the config, which controls the default scheduler used in compute calls in assert_eq. This could be used to globally switch the behavior of assert_eq without needing to specify a scheduler kwarg in each call of the function.

This could be used in Dask-SQL to globally set scheduler="distributed" when running tests on a remote Dask cluster, where we would want to ensure that no computations are happening locally - dask-contrib/dask-sql#365 (comment) for additional context.

Closes #xxxx
Tests added / passed
Passes pre-commit run --all-files

jsignell

This is a really nice enhancement. If it's going to be a top level setting though I think it should affect array and bag assert_eq as well.

dask/dask.yaml

charlesbluca · 2022-03-15T18:56:23Z

I think it should affect array and bag assert_eq as well

That makes sense to me - in that case, should we move this setting to a different config module, i.e. something like testing.assert-eq.scheduler? Or were you thinking of having a unique config for dataframe, array, and bag assert_eq so they could all be configured separately?

Co-authored-by: Julia Signell <jsignell@gmail.com>

jsignell · 2022-03-15T18:58:42Z

I think it should affect array and bag assert_eq as well

That makes sense to me - in that case, should we move this setting to a different config module, i.e. something like testing.assert-eq.scheduler? Or were you thinking of having a unique config for dataframe, array, and bag assert_eq so they could all be configured separately?

I missed that this config was nested under dataframe already 🤦 I was imagining one config value which would make sense to me to put under testing like you suggest.

dask/dataframe/utils.py

jcrist · 2022-03-15T19:07:17Z

+1 for testing.assert-eq.scheduler instead of one specific to dataframe/array/bag. This is likely to be useful across all tests, and namespacing it away under testing should better clarify its functionality to users (who likely won't ever need to touch this config).

dask/dask-schema.yaml

dask/dask.yaml

Co-authored-by: Julia Signell <jsignell@gmail.com>

ian-r-rose · 2022-03-16T20:26:07Z

Is it possible to accomplish the same thing using a pytest fixture and the existing scheduler config value? That is to say, do we need to add a new config value here?

ian-r-rose · 2022-03-16T20:47:31Z

To be a bit more concrete about my question above: I think we could make the scheduler configurable on a per-test (or per-module) basis using the exiting config and a pytest fixture, and not have to add any extra config values. It would look something like:

Changing assert_eq's scheduler kwarg default to None, as done here, so that the config value would take precedence, and
Adding a fixture something like the below:

@pytest.fixture
def use_distributed_scheduler():
    distributed = pytest.importorskip("distributed")
    with dask.config.set({"scheduler": "distributed"}):
        with distributed.Client() as c:
            yield

(n.b., you'd probably use one of distributed's test utils for generating a cluster/client pair rather than the top-level Client() constructor).

This fixture could then be applied on a per-test basis, or a per-module/session basis with a @pytest.fixture(autouse=True), and we could avoid adding a config value that most users wouldn't need.

charlesbluca · 2022-03-16T21:50:24Z

The fixture solution certainly seems like a reasonable workaround here, although one potential concern with having assert_eq always grab from the config value is that we might want to manually set it to use the single-threaded scheduler for tests, which would require setting an env variable in CI here (and any other downstream projects expecting it to run single-threaded).

ian-r-rose · 2022-03-16T22:25:42Z

we might want to manually set it to use the single-threaded scheduler for tests, which would require setting an env variable in CI here (and any other downstream projects expecting it to run single-threaded)

I'm not sure I understand what you mean -- I chose to use a distributed scheduler in my snippet above, but the fixture could be more complicated, and dispatch to other schedulers on an as-needed basis (and different fixtures could be used by different downstream projects). I am not advocating for a project-wide autouse fixture in dask/dask, I'm thinking something like this could be used in whatever project that wants to customize the default scheduler its test suite (including dask-sql)

I guess I'm advocating for using pytest idioms to customize pytest behavior, and I think we can do that using the existing config, rather than adding another very specialized configurable value.

charlesbluca · 2022-03-17T02:56:00Z

I'm strictly speaking about how testing behavior in Dask would change if we switched the default of assert_eg's scheduler kwarg from "sync" to None - doing this would mean that by default, assert_eq would end up using the underlying __dask_scheduler__ of the collections it's computing rather than the single-threaded scheduler (i.e. threaded.get or dask.multiprocessing.get instead of dask.local.get_sync) - is this a change that we're okay with making, or was there a larger underlying reason why Dask defaulted to the single-threaded scheduler for assert_eq (my immediate guess would be for debugging purposes, but there could be other reasons)?

We could avoid these implications, but that would require either:

marking all (or relevant) Dask tests with a fixture to run on single-threaded scheduler, essentially your code snippet but replacing "distributed" with "sync"
setting an environment variable at test time to globally set the scheduler, i.e. DASK_SCHEDULER="sync"

~~Also, not sure if I'm missing something but I don't actually see a scheduler option documented in Dask's configuration reference.~~ I see now that it exists and is used occasionally - perhaps we should add it to the configuration reference so it is more immediately obvious to users that it can be configured?

charlesbluca

@ian-r-rose how does this look as a compromise between the current PR and your preference? This would make it so assert_eq's behavior can be controlled by config.set(scheduler=...) (avoiding the need for an additional config option), while also implicitly keeping scheduler="sync" as its default behavior when nothing is specified:

charlesbluca · 2022-03-17T03:42:31Z

dask/array/utils.py

    **kwargs,
 ):
+    if scheduler is None:
+        scheduler = config.get("testing.assert-eq.scheduler")


Note that we would be using the old style of config.get, as I don't think we want Dask using the single-threaded scheduler by default:

Suggested change

scheduler = config.get("testing.assert-eq.scheduler")

scheduler = config.get("scheduler", "sync")

charlesbluca · 2022-03-17T03:42:58Z

dask/bag/utils.py

+
+def assert_eq(a, b, scheduler=None):
+    if scheduler is None:
+        scheduler = config.get("testing.assert-eq.scheduler")


Suggested change

scheduler = config.get("testing.assert-eq.scheduler")

scheduler = config.get("scheduler", "sync")

charlesbluca · 2022-03-17T03:44:16Z

dask/dask-schema.yaml

+
+  testing:
+    type: object
+    properties:
+
+      assert-eq:
+        type: object
+        properties:
+
+          scheduler:
+            type: string
+            description: |
+              The scheduler used to compute Dask collections when they are provided as
+              input to ``assert_eq``. By default, ``assert_eq`` will set
+              ``scheduler="sync"``, using a local single-threaded scheduler. Can be
+              overriden by explicitly passing a ``scheduler`` argument to ``assert_eq``.


Wouldn't need this config option anymore:

Suggested change

testing:

type: object

properties:

assert-eq:

type: object

properties:

scheduler:

type: string

description: |

The scheduler used to compute Dask collections when they are provided as

input to ``assert_eq``. By default, ``assert_eq`` will set

``scheduler="sync"``, using a local single-threaded scheduler. Can be

overriden by explicitly passing a ``scheduler`` argument to ``assert_eq``.

charlesbluca · 2022-03-17T03:44:40Z

dask/dask.yaml

+
+testing:
+  assert-eq:
+    scheduler: "sync"  # default scheduler used when computing dask collections for assertions


Suggested change

testing:

assert-eq:

scheduler: "sync" # default scheduler used when computing dask collections for assertions

charlesbluca · 2022-03-17T03:45:16Z

dask/dataframe/utils.py

    **kwargs,
 ):
+    if scheduler is None:
+        scheduler = config.get("testing.assert-eq.scheduler")


Suggested change

scheduler = config.get("testing.assert-eq.scheduler")

scheduler = config.get("scheduler", "sync")

charlesbluca · 2022-03-17T04:28:45Z

Opened #8821 as a larger effort to address the fact that scheduler (among other config options) is not documented in the config reference

ian-r-rose · 2022-03-17T18:10:02Z

is this a change that we're okay with making

Good question! To me it seems okay, unless it makes the test suite completely blow up. A related question is: "do we want users to be able to affect how the test suite runs with the dask config system?". Both of the approaches here do that, but it does have some potential downsides as you point out. Local config has also caused some problems with running tests before, which resulted in this PR in distributed (briefly, some local preload scripts were wreaking havoc on running the test suite).

I see now that it exists and is used occasionally

Yeah, it's unfortunate that this is not well documented, thanks for opening up #8821. But it's used more than occasionally (at least in terms of reads), since the primary compute() entrypoint for all dask collections pulls from this config value.

ian-r-rose · 2022-03-17T18:14:13Z

how does this look as a compromise between the current PR and your preference?

This sounds like a reasonable idea to me.

ian-r-rose · 2022-03-17T18:15:53Z

I'm also curious if @jsignell agrees with my preference for avoiding new config values :)

jsignell · 2022-03-22T21:06:37Z

Sorry for being slow to respond. I don't have a strong feeling either way. I generally feel that testing needs are pretty different from general dask needs so I am fine with them being separated out into a testing section of the config. @charlesbluca's point about how we currently want scheduler to be sync for tests rather than default kind of emphasizes how the needs are different. I guess I mildly prefer the existing option in this PR ( aka make a separate config field for controlling tests).

Consider how confusing it would be if you set the scheduler in the config to something, and then a bunch of tests failed because they expected that the scheduler that they'd be testing with would be "sync". That would be pretty weird to try to debug and it's unlikely that that would be what the user intended.

ian-r-rose · 2022-03-23T15:16:59Z

Thanks @jsignell! Ultimately, I don't have super strong opinions here, so if you are happy with a separate config option, happy to go that way.

I generally feel that testing needs are pretty different from general dask needs so I am fine with them being separated out into a testing section of the config.

I sort of feel the opposite -- that tests should be as close as possible to IRL usage of the library, and special-casing tests is a good way to miss important regressions or use-cases.

Consider how confusing it would be if you set the scheduler in the config to something, and then a bunch of tests failed because they expected that the scheduler that they'd be testing with would be "sync". That would be pretty weird to try to debug and it's unlikely that that would be what the user intended.

I agree! And similar local config has caused weird test behavior in the past (I'm thinking specifically of local preload scripts in distributed). I actually think we should disable loading local config during tests at all, but instead should be setting it with fixtures as suggested above when we need it. If a downstream project then wanted to customize what scheduler was run for a subset of tests, it would then be free to do it in code, rather than relying on a locally set value. But that's a rabbit hole which is outside of the scope of this PR.

jcrist · 2022-03-23T16:05:42Z

Hi @charlesbluca, from my understanding the goal here is:

to override the default scheduler used by assert_eq to use some other specified scheduler
to use this behavior in the dask-sql test suite

Instead of adding a new config field (which I'm ok with, but not thrilled about), couldn't you define your own assert_eq shim that does what you want, and use that in the dask-sql test suite?

Something like:

from dask.dataframe import assert_eq as _assert_eq

def assert_eq(*args, **kwargs):
    kwargs.setdefault("scheduler", "your-new-default-scheduler-value")
    return _assert_eq(*args, **kwargs)

# Then use this assert_eq everywhere in the dask-sql test suite

This is simple to understand, and fully separates a user's dask config from the test suite itself.

Would this be sufficient to solve your use case?

charlesbluca · 2022-03-23T16:59:49Z

Thanks @jcrist, I do think that shim should be sufficient for my case in that it doesn't require any modifications to existing assert_eq calls to control behavior - I will try that out and ping if it works well.

If that method succeeds, is there any consensus on what to do with this PR? My thought process was that this could potentially be useful if we wanted to control assert_eq behavior from outside the scope of tests with relatively minimal code changes, but @ian-r-rose and @jcrist have shown there are plenty of minimal ways to handle the various cases I had in mind, so don't mind closing 🙂

jcrist · 2022-03-23T17:14:01Z

Yeah, if the shim is sufficient for your current use case, then I think we close this.

jsignell · 2022-03-28T18:54:17Z

Closing this, since @charlesbluca has agreed on a different approach.

Make dataframe assert_eq scheduler configurable

2c91c29

github-actions bot added the dataframe label Mar 15, 2022

charlesbluca mentioned this pull request Mar 15, 2022

Switch tests from pd.testing.assert_frame_equal to dd.assert_eq dask-contrib/dask-sql#365

Merged

jsignell reviewed Mar 15, 2022

View reviewed changes

dask/dask.yaml Outdated Show resolved Hide resolved

Update dask/dask.yaml

9041f64

Co-authored-by: Julia Signell <jsignell@gmail.com>

jsignell reviewed Mar 15, 2022

View reviewed changes

dask/dataframe/utils.py Outdated Show resolved Hide resolved

charlesbluca added 2 commits March 15, 2022 12:18

Remove default case from config.get call

686630c

Move config to testing prefix, make affect all assert_eq functions

fd926db

github-actions bot added the array label Mar 15, 2022

jsignell reviewed Mar 15, 2022

View reviewed changes

dask/dask-schema.yaml Outdated Show resolved Hide resolved

dask/dask.yaml Outdated Show resolved Hide resolved

charlesbluca and others added 3 commits March 15, 2022 17:03

Update dask/dask.yaml

d03a2c9

Co-authored-by: Julia Signell <jsignell@gmail.com>

Merge remote-tracking branch 'upstream/main' into assert-eq-config

1a2e8df

Update schema description

0fb74ac

jsignell approved these changes Mar 16, 2022

View reviewed changes

charlesbluca commented Mar 17, 2022

View reviewed changes

charlesbluca mentioned this pull request Mar 22, 2022

Add basic predicate-pushdown optimization dask-contrib/dask-sql#433

Merged

3 tasks

jsignell closed this Mar 28, 2022

crusaderky added a commit to fjetter/dask that referenced this pull request Jan 26, 2024

Align to dask#8811

2891540

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make dataframe `assert_eq` scheduler configurable #8811

Make dataframe `assert_eq` scheduler configurable #8811

charlesbluca commented Mar 15, 2022

jsignell left a comment

charlesbluca commented Mar 15, 2022

jsignell commented Mar 15, 2022

jcrist commented Mar 15, 2022

ian-r-rose commented Mar 16, 2022

ian-r-rose commented Mar 16, 2022

charlesbluca commented Mar 16, 2022

ian-r-rose commented Mar 16, 2022

charlesbluca commented Mar 17, 2022 •

edited

charlesbluca left a comment •

edited

charlesbluca Mar 17, 2022

charlesbluca Mar 17, 2022

charlesbluca Mar 17, 2022

charlesbluca Mar 17, 2022

charlesbluca Mar 17, 2022

charlesbluca commented Mar 17, 2022

ian-r-rose commented Mar 17, 2022

ian-r-rose commented Mar 17, 2022

ian-r-rose commented Mar 17, 2022

jsignell commented Mar 22, 2022

ian-r-rose commented Mar 23, 2022

jcrist commented Mar 23, 2022

charlesbluca commented Mar 23, 2022

jcrist commented Mar 23, 2022

jsignell commented Mar 28, 2022

	scheduler = config.get("testing.assert-eq.scheduler")
	scheduler = config.get("scheduler", "sync")

Make dataframe assert_eq scheduler configurable #8811

Make dataframe assert_eq scheduler configurable #8811

Conversation

charlesbluca commented Mar 15, 2022

jsignell left a comment

Choose a reason for hiding this comment

charlesbluca commented Mar 15, 2022

jsignell commented Mar 15, 2022

jcrist commented Mar 15, 2022

ian-r-rose commented Mar 16, 2022

ian-r-rose commented Mar 16, 2022

charlesbluca commented Mar 16, 2022

ian-r-rose commented Mar 16, 2022

charlesbluca commented Mar 17, 2022 • edited

charlesbluca left a comment • edited

Choose a reason for hiding this comment

charlesbluca Mar 17, 2022

Choose a reason for hiding this comment

charlesbluca Mar 17, 2022

Choose a reason for hiding this comment

charlesbluca Mar 17, 2022

Choose a reason for hiding this comment

charlesbluca Mar 17, 2022

Choose a reason for hiding this comment

charlesbluca Mar 17, 2022

Choose a reason for hiding this comment

charlesbluca commented Mar 17, 2022

ian-r-rose commented Mar 17, 2022

ian-r-rose commented Mar 17, 2022

ian-r-rose commented Mar 17, 2022

jsignell commented Mar 22, 2022

ian-r-rose commented Mar 23, 2022

jcrist commented Mar 23, 2022

charlesbluca commented Mar 23, 2022

jcrist commented Mar 23, 2022

jsignell commented Mar 28, 2022

Make dataframe `assert_eq` scheduler configurable #8811

Make dataframe `assert_eq` scheduler configurable #8811

charlesbluca commented Mar 17, 2022 •

edited

charlesbluca left a comment •

edited