Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make dataframe assert_eq scheduler configurable #8811

Closed
wants to merge 7 commits into from

Conversation

charlesbluca
Copy link
Member

Adds dataframe.assert-eq.scheduler to the config, which controls the default scheduler used in compute calls in assert_eq. This could be used to globally switch the behavior of assert_eq without needing to specify a scheduler kwarg in each call of the function.

This could be used in Dask-SQL to globally set scheduler="distributed" when running tests on a remote Dask cluster, where we would want to ensure that no computations are happening locally - dask-contrib/dask-sql#365 (comment) for additional context.

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice enhancement. If it's going to be a top level setting though I think it should affect array and bag assert_eq as well.

dask/dask.yaml Outdated Show resolved Hide resolved
@charlesbluca
Copy link
Member Author

I think it should affect array and bag assert_eq as well

That makes sense to me - in that case, should we move this setting to a different config module, i.e. something like testing.assert-eq.scheduler? Or were you thinking of having a unique config for dataframe, array, and bag assert_eq so they could all be configured separately?

Co-authored-by: Julia Signell <jsignell@gmail.com>
@jsignell
Copy link
Member

I think it should affect array and bag assert_eq as well

That makes sense to me - in that case, should we move this setting to a different config module, i.e. something like testing.assert-eq.scheduler? Or were you thinking of having a unique config for dataframe, array, and bag assert_eq so they could all be configured separately?

I missed that this config was nested under dataframe already 🤦 I was imagining one config value which would make sense to me to put under testing like you suggest.

dask/dataframe/utils.py Outdated Show resolved Hide resolved
@jcrist
Copy link
Member

jcrist commented Mar 15, 2022

+1 for testing.assert-eq.scheduler instead of one specific to dataframe/array/bag. This is likely to be useful across all tests, and namespacing it away under testing should better clarify its functionality to users (who likely won't ever need to touch this config).

@github-actions github-actions bot added the array label Mar 15, 2022
dask/dask-schema.yaml Outdated Show resolved Hide resolved
dask/dask.yaml Outdated Show resolved Hide resolved
@ian-r-rose
Copy link
Collaborator

Is it possible to accomplish the same thing using a pytest fixture and the existing scheduler config value? That is to say, do we need to add a new config value here?

@ian-r-rose
Copy link
Collaborator

To be a bit more concrete about my question above: I think we could make the scheduler configurable on a per-test (or per-module) basis using the exiting config and a pytest fixture, and not have to add any extra config values. It would look something like:

  1. Changing assert_eq's scheduler kwarg default to None, as done here, so that the config value would take precedence, and
  2. Adding a fixture something like the below:
@pytest.fixture
def use_distributed_scheduler():
    distributed = pytest.importorskip("distributed")
    with dask.config.set({"scheduler": "distributed"}):
        with distributed.Client() as c:
            yield

(n.b., you'd probably use one of distributed's test utils for generating a cluster/client pair rather than the top-level Client() constructor).

This fixture could then be applied on a per-test basis, or a per-module/session basis with a @pytest.fixture(autouse=True), and we could avoid adding a config value that most users wouldn't need.

@charlesbluca
Copy link
Member Author

The fixture solution certainly seems like a reasonable workaround here, although one potential concern with having assert_eq always grab from the config value is that we might want to manually set it to use the single-threaded scheduler for tests, which would require setting an env variable in CI here (and any other downstream projects expecting it to run single-threaded).

@ian-r-rose
Copy link
Collaborator

we might want to manually set it to use the single-threaded scheduler for tests, which would require setting an env variable in CI here (and any other downstream projects expecting it to run single-threaded)

I'm not sure I understand what you mean -- I chose to use a distributed scheduler in my snippet above, but the fixture could be more complicated, and dispatch to other schedulers on an as-needed basis (and different fixtures could be used by different downstream projects). I am not advocating for a project-wide autouse fixture in dask/dask, I'm thinking something like this could be used in whatever project that wants to customize the default scheduler its test suite (including dask-sql)

I guess I'm advocating for using pytest idioms to customize pytest behavior, and I think we can do that using the existing config, rather than adding another very specialized configurable value.

@charlesbluca
Copy link
Member Author

charlesbluca commented Mar 17, 2022

I'm strictly speaking about how testing behavior in Dask would change if we switched the default of assert_eg's scheduler kwarg from "sync" to None - doing this would mean that by default, assert_eq would end up using the underlying __dask_scheduler__ of the collections it's computing rather than the single-threaded scheduler (i.e. threaded.get or dask.multiprocessing.get instead of dask.local.get_sync) - is this a change that we're okay with making, or was there a larger underlying reason why Dask defaulted to the single-threaded scheduler for assert_eq (my immediate guess would be for debugging purposes, but there could be other reasons)?

We could avoid these implications, but that would require either:

  • marking all (or relevant) Dask tests with a fixture to run on single-threaded scheduler, essentially your code snippet but replacing "distributed" with "sync"
  • setting an environment variable at test time to globally set the scheduler, i.e. DASK_SCHEDULER="sync"

Also, not sure if I'm missing something but I don't actually see a scheduler option documented in Dask's configuration reference. I see now that it exists and is used occasionally - perhaps we should add it to the configuration reference so it is more immediately obvious to users that it can be configured?

Copy link
Member Author

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ian-r-rose how does this look as a compromise between the current PR and your preference? This would make it so assert_eq's behavior can be controlled by config.set(scheduler=...) (avoiding the need for an additional config option), while also implicitly keeping scheduler="sync" as its default behavior when nothing is specified:

**kwargs,
):
if scheduler is None:
scheduler = config.get("testing.assert-eq.scheduler")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we would be using the old style of config.get, as I don't think we want Dask using the single-threaded scheduler by default:

Suggested change
scheduler = config.get("testing.assert-eq.scheduler")
scheduler = config.get("scheduler", "sync")


def assert_eq(a, b, scheduler=None):
if scheduler is None:
scheduler = config.get("testing.assert-eq.scheduler")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
scheduler = config.get("testing.assert-eq.scheduler")
scheduler = config.get("scheduler", "sync")

Comment on lines +138 to +153

testing:
type: object
properties:

assert-eq:
type: object
properties:

scheduler:
type: string
description: |
The scheduler used to compute Dask collections when they are provided as
input to ``assert_eq``. By default, ``assert_eq`` will set
``scheduler="sync"``, using a local single-threaded scheduler. Can be
overriden by explicitly passing a ``scheduler`` argument to ``assert_eq``.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't need this config option anymore:

Suggested change
testing:
type: object
properties:
assert-eq:
type: object
properties:
scheduler:
type: string
description: |
The scheduler used to compute Dask collections when they are provided as
input to ``assert_eq``. By default, ``assert_eq`` will set
``scheduler="sync"``, using a local single-threaded scheduler. Can be
overriden by explicitly passing a ``scheduler`` argument to ``assert_eq``.

Comment on lines +27 to +30

testing:
assert-eq:
scheduler: "sync" # default scheduler used when computing dask collections for assertions
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
testing:
assert-eq:
scheduler: "sync" # default scheduler used when computing dask collections for assertions

**kwargs,
):
if scheduler is None:
scheduler = config.get("testing.assert-eq.scheduler")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
scheduler = config.get("testing.assert-eq.scheduler")
scheduler = config.get("scheduler", "sync")

@charlesbluca
Copy link
Member Author

Opened #8821 as a larger effort to address the fact that scheduler (among other config options) is not documented in the config reference

@ian-r-rose
Copy link
Collaborator

is this a change that we're okay with making

Good question! To me it seems okay, unless it makes the test suite completely blow up. A related question is: "do we want users to be able to affect how the test suite runs with the dask config system?". Both of the approaches here do that, but it does have some potential downsides as you point out. Local config has also caused some problems with running tests before, which resulted in this PR in distributed (briefly, some local preload scripts were wreaking havoc on running the test suite).

I see now that it exists and is used occasionally

Yeah, it's unfortunate that this is not well documented, thanks for opening up #8821. But it's used more than occasionally (at least in terms of reads), since the primary compute() entrypoint for all dask collections pulls from this config value.

@ian-r-rose
Copy link
Collaborator

how does this look as a compromise between the current PR and your preference?

This sounds like a reasonable idea to me.

@ian-r-rose
Copy link
Collaborator

I'm also curious if @jsignell agrees with my preference for avoiding new config values :)

@jsignell
Copy link
Member

Sorry for being slow to respond. I don't have a strong feeling either way. I generally feel that testing needs are pretty different from general dask needs so I am fine with them being separated out into a testing section of the config. @charlesbluca's point about how we currently want scheduler to be sync for tests rather than default kind of emphasizes how the needs are different. I guess I mildly prefer the existing option in this PR ( aka make a separate config field for controlling tests).

Consider how confusing it would be if you set the scheduler in the config to something, and then a bunch of tests failed because they expected that the scheduler that they'd be testing with would be "sync". That would be pretty weird to try to debug and it's unlikely that that would be what the user intended.

@ian-r-rose
Copy link
Collaborator

Thanks @jsignell! Ultimately, I don't have super strong opinions here, so if you are happy with a separate config option, happy to go that way.

I generally feel that testing needs are pretty different from general dask needs so I am fine with them being separated out into a testing section of the config.

I sort of feel the opposite -- that tests should be as close as possible to IRL usage of the library, and special-casing tests is a good way to miss important regressions or use-cases.

Consider how confusing it would be if you set the scheduler in the config to something, and then a bunch of tests failed because they expected that the scheduler that they'd be testing with would be "sync". That would be pretty weird to try to debug and it's unlikely that that would be what the user intended.

I agree! And similar local config has caused weird test behavior in the past (I'm thinking specifically of local preload scripts in distributed). I actually think we should disable loading local config during tests at all, but instead should be setting it with fixtures as suggested above when we need it. If a downstream project then wanted to customize what scheduler was run for a subset of tests, it would then be free to do it in code, rather than relying on a locally set value. But that's a rabbit hole which is outside of the scope of this PR.

@jcrist
Copy link
Member

jcrist commented Mar 23, 2022

Hi @charlesbluca, from my understanding the goal here is:

  • to override the default scheduler used by assert_eq to use some other specified scheduler
  • to use this behavior in the dask-sql test suite

Instead of adding a new config field (which I'm ok with, but not thrilled about), couldn't you define your own assert_eq shim that does what you want, and use that in the dask-sql test suite?

Something like:

from dask.dataframe import assert_eq as _assert_eq

def assert_eq(*args, **kwargs):
    kwargs.setdefault("scheduler", "your-new-default-scheduler-value")
    return _assert_eq(*args, **kwargs)

# Then use this assert_eq everywhere in the dask-sql test suite

This is simple to understand, and fully separates a user's dask config from the test suite itself.

Would this be sufficient to solve your use case?

@charlesbluca
Copy link
Member Author

Thanks @jcrist, I do think that shim should be sufficient for my case in that it doesn't require any modifications to existing assert_eq calls to control behavior - I will try that out and ping if it works well.

If that method succeeds, is there any consensus on what to do with this PR? My thought process was that this could potentially be useful if we wanted to control assert_eq behavior from outside the scope of tests with relatively minimal code changes, but @ian-r-rose and @jcrist have shown there are plenty of minimal ways to handle the various cases I had in mind, so don't mind closing 🙂

@jcrist
Copy link
Member

jcrist commented Mar 23, 2022

Yeah, if the shim is sufficient for your current use case, then I think we close this.

@jsignell
Copy link
Member

Closing this, since @charlesbluca has agreed on a different approach.

@jsignell jsignell closed this Mar 28, 2022
crusaderky added a commit to fjetter/dask that referenced this pull request Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants