Make dataframe assert_eq scheduler configurable #8811
This is a really nice enhancement. If it's going to be a top-level setting though, I think it should affect array and bag assert_eq as well.
That makes sense to me - in that case, should we move this setting to a different config module, i.e. something like |
Co-authored-by: Julia Signell <jsignell@gmail.com>
I missed that this config was nested under dataframe already 🤦 I was imagining one config value which would make sense to me to put under testing like you suggest. |
+1 for |
Co-authored-by: Julia Signell <jsignell@gmail.com>
Is it possible to accomplish the same thing using a pytest fixture and the existing |
To be a bit more concrete about my question above: I think we could make the scheduler configurable on a per-test (or per-module) basis using the existing config and a pytest fixture, and not have to add any extra config values. It would look something like:
import dask
import pytest

@pytest.fixture
def use_distributed_scheduler():
    # Skip the test if distributed is not installed
    distributed = pytest.importorskip("distributed")
    with dask.config.set({"scheduler": "distributed"}):
        with distributed.Client() as c:
            yield
(n.b., you'd probably use one of distributed's test utils for generating a cluster/client pair rather than the top-level This fixture could then be applied on a per-test basis, or a per-module/session basis with a |
The fixture solution certainly seems like a reasonable workaround here, although one potential concern with having |
I'm not sure I understand what you mean -- I chose to use a distributed scheduler in my snippet above, but the fixture could be more complicated, and dispatch to other schedulers on an as-needed basis (and different fixtures could be used by different downstream projects). I am not advocating for a project-wide autouse fixture in I guess I'm advocating for using pytest idioms to customize pytest behavior, and I think we can do that using the existing config, rather than adding another very specialized configurable value. |
I'm strictly speaking about how testing behavior in Dask would change if we switched the default of We could avoid these implications, but that would require either:
@ian-r-rose how does this look as a compromise between the current PR and your preference? This would make it so assert_eq's behavior can be controlled by config.set(scheduler=...) (avoiding the need for an additional config option), while also implicitly keeping scheduler="sync" as its default behavior when nothing is specified:
    **kwargs,
):
    if scheduler is None:
        scheduler = config.get("testing.assert-eq.scheduler")
Note that we would be using the old style of config.get, as I don't think we want Dask using the single-threaded scheduler by default:
-        scheduler = config.get("testing.assert-eq.scheduler")
+        scheduler = config.get("scheduler", "sync")
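The precedence this suggestion implies (explicit argument first, then the global `scheduler` config value, then `"sync"`) can be sketched in isolation. This is a minimal illustration only, not Dask's implementation: `_config`, `set_config`, and `resolve_scheduler` are hypothetical stand-ins for `dask.config` and the first lines of `assert_eq`.

```python
from contextlib import contextmanager

# Hypothetical stand-in for Dask's global config (not the real dask.config).
_config = {}


@contextmanager
def set_config(**overrides):
    """Temporarily override config values, restoring the old state on exit."""
    saved = dict(_config)
    _config.update(overrides)
    try:
        yield
    finally:
        _config.clear()
        _config.update(saved)


def resolve_scheduler(scheduler=None):
    """Pick the scheduler: explicit argument, then config, then "sync"."""
    if scheduler is None:
        scheduler = _config.get("scheduler", "sync")
    return scheduler


if __name__ == "__main__":
    print(resolve_scheduler())                # nothing configured
    with set_config(scheduler="distributed"):
        print(resolve_scheduler())            # value taken from the config
        print(resolve_scheduler("threads"))   # explicit argument wins
```

Under this scheme a test run with no configuration still uses the single-threaded scheduler, while any globally configured scheduler also changes what the assertion helper computes with, which is the trade-off discussed in this thread.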
def assert_eq(a, b, scheduler=None):
    if scheduler is None:
        scheduler = config.get("testing.assert-eq.scheduler")
-        scheduler = config.get("testing.assert-eq.scheduler")
+        scheduler = config.get("scheduler", "sync")
testing:
  type: object
  properties:

    assert-eq:
      type: object
      properties:

        scheduler:
          type: string
          description: |
            The scheduler used to compute Dask collections when they are provided as
            input to ``assert_eq``. By default, ``assert_eq`` will set
            ``scheduler="sync"``, using a local single-threaded scheduler. Can be
            overriden by explicitly passing a ``scheduler`` argument to ``assert_eq``.
Wouldn't need this config option anymore:
-testing:
-  type: object
-  properties:
-    assert-eq:
-      type: object
-      properties:
-        scheduler:
-          type: string
-          description: |
-            The scheduler used to compute Dask collections when they are provided as
-            input to ``assert_eq``. By default, ``assert_eq`` will set
-            ``scheduler="sync"``, using a local single-threaded scheduler. Can be
-            overriden by explicitly passing a ``scheduler`` argument to ``assert_eq``.
testing:
  assert-eq:
    scheduler: "sync"  # default scheduler used when computing dask collections for assertions
-testing:
-  assert-eq:
-    scheduler: "sync"  # default scheduler used when computing dask collections for assertions
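For readers unfamiliar with dotted config keys such as `testing.assert-eq.scheduler`, here is a rough sketch of how such a key maps onto the nested YAML above. It is an illustration under stated assumptions, not Dask's code: `get_nested` and `defaults` are hypothetical names, and the behavior (raise `KeyError` unless a default is supplied) reflects my understanding of how `dask.config.get` behaves.

```python
# Hypothetical nested config mirroring the YAML fragment in this thread.
defaults = {"testing": {"assert-eq": {"scheduler": "sync"}}}

# Sentinel so that None can still be passed as a legitimate default.
_missing = object()


def get_nested(config, key, default=_missing):
    """Look up a dotted key such as "testing.assert-eq.scheduler"."""
    node = config
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is _missing:
                raise KeyError(key)
            return default
        node = node[part]
    return node


if __name__ == "__main__":
    # Dedicated testing key, as proposed in the PR.
    print(get_nested(defaults, "testing.assert-eq.scheduler"))
    # Global key with an inline fallback, as proposed in the review.
    print(get_nested(defaults, "scheduler", "sync"))
```

The first call depends on the key being declared in the defaults file; the second works without any new config entry because the fallback is supplied at the call site.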
    **kwargs,
):
    if scheduler is None:
        scheduler = config.get("testing.assert-eq.scheduler")
-        scheduler = config.get("testing.assert-eq.scheduler")
+        scheduler = config.get("scheduler", "sync")
Opened #8821 as a larger effort to address the fact that |
Good question! To me it seems okay, unless it makes the test suite completely blow up. A related question is: "do we want users to be able to affect how the test suite runs with the dask config system?". Both of the approaches here do that, but it does have some potential downsides as you point out. Local config has also caused some problems with running tests before, which resulted in this PR in
Yeah, it's unfortunate that this is not well documented, thanks for opening up #8821. But it's used more than occasionally (at least in terms of reads), since the primary |
This sounds like a reasonable idea to me. |
I'm also curious if @jsignell agrees with my preference for avoiding new config values :) |
Sorry for being slow to respond. I don't have a strong feeling either way. I generally feel that testing needs are pretty different from general dask needs, so I am fine with them being separated out into a testing section of the config. @charlesbluca's point about how we currently want scheduler to be sync for tests rather than default kind of emphasizes how the needs are different. I guess I mildly prefer the existing option in this PR (aka make a separate config field for controlling tests). Consider how confusing it would be if you set the scheduler in the config to something, and then a bunch of tests failed because they expected that the scheduler that they'd be testing with would be |
Thanks @jsignell! Ultimately, I don't have super strong opinions here, so if you are happy with a separate config option, happy to go that way.
I sort of feel the opposite -- that tests should be as close as possible to IRL usage of the library, and special-casing tests is a good way to miss important regressions or use-cases.
I agree! And similar local config has caused weird test behavior in the past (I'm thinking specifically of local preload scripts in |
Hi @charlesbluca, from my understanding the goal here is:
Instead of adding a new config field (which I'm ok with, but not thrilled about), couldn't you define your own Something like:
from dask.dataframe import assert_eq as _assert_eq

def assert_eq(*args, **kwargs):
    kwargs.setdefault("scheduler", "your-new-default-scheduler-value")
    return _assert_eq(*args, **kwargs)

# Then use this assert_eq everywhere in the dask-sql test suite
This is simple to understand, and fully separates a user's dask config from the test suite itself. Would this be sufficient to solve your use case? |
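To show that a shim like this only supplies a default and never overrides a caller's explicit choice, here is a self-contained sketch that runs without Dask installed. The names `with_default_scheduler` and `fake_assert_eq` are hypothetical; `fake_assert_eq` stands in for the real `assert_eq` and returns the scheduler it received purely so the behavior can be observed.

```python
import functools


def with_default_scheduler(assert_fn, default_scheduler):
    """Wrap assert_fn so `scheduler` defaults to default_scheduler."""

    @functools.wraps(assert_fn)
    def wrapper(*args, **kwargs):
        # setdefault leaves an explicitly passed scheduler untouched.
        kwargs.setdefault("scheduler", default_scheduler)
        return assert_fn(*args, **kwargs)

    return wrapper


def fake_assert_eq(a, b, scheduler="sync"):
    """Hypothetical stand-in: compare values, report the scheduler used."""
    assert a == b
    return scheduler


# Project-local helper with a different default scheduler.
assert_eq = with_default_scheduler(fake_assert_eq, "distributed")

if __name__ == "__main__":
    print(assert_eq(1, 1))                       # shim default applies
    print(assert_eq(1, 1, scheduler="threads"))  # explicit value is kept
```

Because the default lives in the downstream test suite rather than in Dask's config, a user's local Dask configuration has no effect on which scheduler the wrapped helper requests.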
Thanks @jcrist, I do think that shim should be sufficient for my case in that it doesn't require any modifications to existing If that method succeeds, is there any consensus on what to do with this PR? My thought process was that this could potentially be useful if we wanted to control |
Yeah, if the shim is sufficient for your current use case, then I think we close this. |
Closing this, since @charlesbluca has agreed on a different approach. |
Adds dataframe.assert-eq.scheduler to the config, which controls the default scheduler used in compute calls in assert_eq. This could be used to globally switch the behavior of assert_eq without needing to specify a scheduler kwarg in each call of the function.
This could be used in Dask-SQL to globally set scheduler="distributed" when running tests on a remote Dask cluster, where we would want to ensure that no computations are happening locally - dask-contrib/dask-sql#365 (comment) for additional context.
pre-commit run --all-files