
Add config for handling obj normalization with missing __dask_tokenize__ #7413

Merged · 18 commits · Oct 4, 2021

Conversation

@hristog (Contributor, author) commented on Mar 17, 2021:

@@ -8,6 +8,8 @@

from tlz import merge, partial, compose, curry

from unittest import mock
@hristog (Contributor, author) commented:

I'm not entirely sure which convention to employ here: whether to import mock at this level or within test_normalize_object itself (as was done with import warnings, which is imported within the test only because its general utility isn't as wide in this particular context).

Any feedback would be greatly appreciated.

(three resolved review threads on dask/base.py)
@hristog (Contributor, author) commented on Mar 17, 2021:

I need to go through the rest of the codebase a bit more to get an idea of how best to document the newly proposed config.

@hristog (Contributor, author) commented on Mar 17, 2021:

I'm considering documenting the config as part of this section, because I haven't been able to find a more suitable one.
Any suggestions would be greatly appreciated.

@jrbourbeau (Member) left a review:

Thanks for the PR @hristog!

We can probably scale the changes back here a bit to just support a boolean flag for whether or not we should raise an error. I suspect a relatively small group of users will utilize this setting, and when things are non-deterministic they'll want an error to be raised. We can always expand things in the future if needed.

Something like (I've not tested the snippet below)

if callable(o):
    return normalize_function(o)
elif dask.config.get("tokenize.allow-random", True):
    return uuid.uuid4().hex
else:
    raise RuntimeError("...")

should work for that case

@@ -224,6 +226,88 @@ def test_normalize_base():
assert normalize_token(i) is i


def test_normalize_object():
@jrbourbeau (Member) commented:

What do you think about something like (again I've not tested the snippet below):

with dask.config.set({"tokenize.allow-random": False}):
    with pytest.raises(RuntimeError, match="..."):
        tokenize(object())

@hristog (Contributor, author) commented on Mar 18, 2021:

Thanks for looking into this PR, @jrbourbeau!

I've scaled down the functionality associated with the newly proposed config, as suggested, and changed its value type to bool. The accompanying unit tests have been updated accordingly, too.

One tiny concern I've got about the naming (allow-random) is that, as mentioned in this comment of mine, there are other sources of non-determinism that aren't going to be impacted by the proposed logic.

I suppose support for those could always be added later, given actual demand via GitHub issues.

When you get a chance, please let me know your thoughts on the best way to document the newly introduced config.

@hristog (Contributor, author) commented on Apr 8, 2021:

Hi @jrbourbeau, checking in to see whether you've had a chance to look at the further changes I pushed in 41b135b, and whether you'd like any additional refinements on top of that.
Thanks!

@jrbourbeau (Member) left a review:

Thanks @hristog -- apologies for the delayed response.

> One tiny concern I've got about the naming (allow-random) is that there are other sources of non-determinism that aren't going to be impacted by the proposed logic.

This is a great point -- if you want to add a check for tokenize.allow-random in those locations too, that would be welcome

(resolved review thread on dask/base.py)
@@ -224,6 +226,49 @@ def test_normalize_base():
assert normalize_token(i) is i


def test_normalize_object():
@jrbourbeau (Member) commented:

This test can be simplified a bit. In particular, we can probably just test that an informative error is raised when tokenize.allow-random is False and we attempt to tokenize an object which can't be deterministically hashed. See this comment https://github.com/dask/dask/pull/7413/files#r596499598 for an outline of a simplified test.

@hristog (Contributor, author) commented on Apr 9, 2021:

Hmm, my concern here is that, without tests proving that true negatives are handled as such, we may not have sufficient confidence (viewing unit tests purely as documentation of expected behavior) that the newly introduced functionality doesn't break behavior it shouldn't affect. And I don't feel comfortable relying on other tests, external to this one, to cover that for us.

Of course, if you strongly believe we can make do without the extra tests and keep only one asserting a single aspect, then I'll remove them (no objections). I just wanted to share my concerns.

@hristog (Contributor, author) commented:

> This is a great point -- if you want to add a check for tokenize.allow-random in those locations too, that would be welcome

Hi @jrbourbeau - yes, I've opted to implement configurable behavior for those as well. The commits have already been added to this PR.

@hristog (Contributor, author) commented:

> This test can be simplified a bit.

I've made an attempt at simplifying the tests. Hopefully they look better from your point of view now. Please let me know if you'd like anything else changed.

@hristog (Contributor, author) commented on Apr 21, 2021:

Hi @jrbourbeau, checking in to see whether you've had a chance to look at the most recent updates. As discussed, I've scaled the tests down a bit further; a summary of my updates can be found in the two most recent comments in this thread.

Please let me know if the tests should be stripped down further, despite the concerns discussed.

@ncclementi (Member) commented:

@jrbourbeau checking in here, it looks like @hristog updated the code according to your requests and CI is green. Do you think this is ready to go, or are there any other issues that should be addressed?

@jsignell (Member) left a review:

This is looking good, you just need a change to the config file to support this change.

- Rename config field to `tokenize.ensure-deterministic`
- Add config field to `dask.yaml` and `dask-schema.yaml`
@jcrist (Member) commented on Oct 4, 2021:

I've renamed the config field to tokenize.ensure-deterministic to better reflect the meaning of the parameter. I've also added it to the dask.yaml config file and the dask-schema.yaml schema description. I believe this should be good to go once tests pass.
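The end state described above can be sketched as runnable plain Python. The config dict and normalize_object below are hypothetical stand-ins (in real dask the flag lives in dask.yaml and is read via dask.config.get), but the dispatch order is the one discussed in this thread: objects implementing __dask_tokenize__ normalize deterministically; everything else either gets a random uuid token or, with tokenize.ensure-deterministic set, raises RuntimeError.

```python
import uuid

# Hypothetical stand-in for dask.config.
config = {"tokenize.ensure-deterministic": False}

def normalize_object(o):
    # Objects can opt in to deterministic hashing via __dask_tokenize__.
    method = getattr(o, "__dask_tokenize__", None)
    if method is not None:
        return method()
    if not config["tokenize.ensure-deterministic"]:
        # Correct but non-deterministic fallback token.
        return uuid.uuid4().hex
    raise RuntimeError(
        f"Object {o!r} cannot be deterministically hashed"
    )

class Point:
    """Example class opting in to deterministic tokenization."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __dask_tokenize__(self):
        # A token built only from the object's data is reproducible.
        return ("Point", self.x, self.y)
```

Under this sketch, two equal Points always produce the same token, while a bare object() only tokenizes when the fallback is permitted.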

@hristog (Contributor, author) commented on Oct 4, 2021:

> I've renamed the config field to tokenize.ensure-deterministic to better reflect the meaning of the parameter. I've also added it to the dask.yaml config file and the dask-schema.yaml schema description. I believe this should be good to go once tests pass.

Thanks, @jcrist!

@jrbourbeau (Member) commented:

Thanks for the updates @jcrist! I pushed a small commit to avoid mocking in the tests added here. Will merge after CI finishes

@jcrist (Member) commented on Oct 4, 2021:

Test failure is unrelated - merging. Thanks @hristog!

@jcrist jcrist merged commit 049d803 into dask:main Oct 4, 2021
@hristog (Contributor, author) commented on Oct 5, 2021:

> Thanks for the updates @jcrist! I pushed a small commit to avoid mocking in the tests added here. Will merge after CI finishes

Thanks, @jrbourbeau! Thanks to your commit, I can much better understand what you meant in your previous comments on this PR.

@hristog hristog deleted the flexible-dask-tokenize branch October 5, 2021 10:07

Successfully merging this pull request may close these issues.

Flag for raising an error when normalize_object doesn't find __dask_tokenize__
5 participants