Ensure tokenize is consistent for pickle roundtrips #10808
Conversation
Unit Test Results — see the test report for an extended history of previous test failures (useful for diagnosing flaky tests). 15 files ±0, 15 suites ±0, 3h 20m 29s ⏱️ +8m 0s. For more details on these failures, see this check. Results for commit a9b17c4. Comparison against base commit c2c1ece. This pull request removes 12 and adds 17 tests. Note that renamed tests count towards both.
dask/tests/test_base.py
def test_empty_numpy_array():
    arr = np.array([])
    assert arr.strides
    assert tokenize(arr) == tokenize(arr)
This looks like something that's useful to push into the tokenize wrapper
good call... will update the PR shortly
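As an illustration of what pushing this into the tokenize machinery could look like, here is a minimal sketch of an array normalizer that handles empty arrays explicitly. The name `normalize_array` and the sha256 content hash are hypothetical stand-ins for dask's actual normalization, not the PR's code:

```python
from hashlib import sha256

import numpy as np


def normalize_array(arr):
    # Hypothetical sketch: empty arrays carry no data to hash, so the
    # token is derived from metadata alone; non-empty arrays hash their
    # contiguous data buffer plus dtype and shape.
    if arr.size == 0:
        return ("empty-array", arr.dtype.str, arr.shape)
    data = sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()
    return ("array", arr.dtype.str, arr.shape, data)
```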
Test is now failing
try:
    impl = lk[cls2]
except KeyError:
    pass
else:
    if cls is not cls2:
        # Cache lookup
        lk[cls] = impl
    return impl
The lazy registration sometimes registers more specific types that the dispatch would otherwise miss. I should probably add a dedicated test for this or break it into a dedicated PR.
This shows up when running some of the tests in isolation.
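For context, a simplified sketch of the MRO walk and cache around the snippet above, loosely modeled on dask.utils.Dispatch (details assumed):

```python
class Dispatch:
    """Simplified type dispatcher: walk the class MRO and cache hits on
    the concrete class. Loosely modeled on dask.utils.Dispatch."""

    def __init__(self):
        self._lookup = {}

    def register(self, cls, func):
        self._lookup[cls] = func

    def dispatch(self, cls):
        lk = self._lookup
        for cls2 in cls.__mro__:
            try:
                impl = lk[cls2]
            except KeyError:
                pass
            else:
                if cls is not cls2:
                    # Cache lookup: once this entry exists, a more
                    # specific handler registered lazily later can be
                    # missed, which is the issue described above.
                    lk[cls] = impl
                return impl
        raise TypeError(f"No dispatch for {cls}")
```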
dask/tests/test_base.py
@@ -485,7 +556,7 @@ class BDataClass:
def test_tokenize_dataclass():
    a1 = ADataClass(1)
    a2 = ADataClass(2)
    assert tokenize(a1) == tokenize(a1)
This was actually broken and shadowed by the cache; hence, the cache is cleared now. I corrected the dataclass implementation, but unfortunately I had to access some semi-private attributes. I think it should be OK, but this should be reviewed.
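The fix itself isn't shown here; purely as an illustration, normalizing a dataclass by its declared fields might look like the following. The helper name is hypothetical, and it uses the public `dataclasses.fields` API rather than the semi-private attributes mentioned above:

```python
import dataclasses


def normalize_dataclass(obj):
    # Hypothetical sketch: reduce a dataclass instance to its type name
    # plus (field name, value) pairs, so that equal instances of the
    # same dataclass normalize identically.
    fields = tuple(
        (f.name, getattr(obj, f.name)) for f in dataclasses.fields(obj)
    )
    return (type(obj).__name__, fields)
```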
dask/base.py
# co.co_filename,
# co.co_firstlineno,
# co.co_flags,
# co.co_lnotab,
I modeled this after an issue that was opened a couple of months ago. I don't have a strong preference whether line numbers should count here. See the test test_tokenize_local_functions.
This somewhat overlaps with dask/dask-expr#689 cc @milesgranger
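To make the trade-off concrete: a sketch of deriving a function token from code-object attributes, with the location-dependent fields above left out. This is illustrative only, not the PR's exact attribute set:

```python
def normalize_code(func):
    # Illustrative sketch: build a token from attributes that describe
    # what the function does. Location metadata (co_filename,
    # co_firstlineno) is deliberately excluded, so two identical
    # functions defined at different lines produce the same token.
    co = func.__code__
    return (
        co.co_code,      # raw bytecode
        co.co_consts,    # constants referenced by the bytecode
        co.co_names,
        co.co_varnames,
        co.co_argcount,
    )
```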
Force-pushed from 1ca49b3 to 9cac38c
So this got a little more intense than I thought. Good news first: the "non-determinism of cloudpickle" is actually not a thing, or at least not relevant for the cases we're covering here. To ensure that we're getting the kind of token determinism we're hoping for, I added two additional kinds of roundtrips.
It looks like all the cases we have covered here work for all three consistency levels. Bad news: I had to patch up a lot. There will also be a small patch in distributed, i.e. one test here will fail until that is in. Some tokens will also be more complex than before, e.g. I ended up tokenizing even a
dask/distributed#8480 will be needed for CI to pass
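The roundtrip checks aren't spelled out in this thread; one plausible shape for such a test helper, with names, the toy `tokenize` stand-in, and the exact checks all assumed, is:

```python
import pickle
from hashlib import sha256


def tokenize(obj):
    # Toy stand-in for dask.base.tokenize, illustrative only: hash of
    # the pickled payload.
    return sha256(pickle.dumps(obj)).hexdigest()


def check_tokenize(obj):
    # Hypothetical helper: the token must be deterministic across
    # repeated calls and must survive a pickle roundtrip of the object.
    token = tokenize(obj)
    assert tokenize(obj) == token
    roundtripped = pickle.loads(pickle.dumps(obj))
    assert tokenize(roundtripped) == token
    return token
```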
Force-pushed from 6b4e4db to c83b26f
Interestingly, in CI the local tokenization does fail, but only sporadically 🤔
Force-pushed from 1d83bf0 to d0c846e
Status update
Failing tests
False positives
This one is green now if you merge main.
This reverts commit 1b1136d.
This reverts commit dbb9278.
This reverts commit 4098e7f.
Force-pushed from ffbc27d to 1d4b876
Force-pushed from 8e8f560 to 06526e5
Force-pushed from 06526e5 to a9b17c4
Superseded by #10883
It looks like memory-mapped arrays implemented different semantics than other arrays. I changed the implementation to also hash the data instead of using file metadata. Memory-mapped arrays now tokenize the same as other arrays (e.g. when they point to the same data, they tokenize identically).
Beyond that, all tokenize calls are now tested for pickle roundtrips.
This is particularly important for dask/dask-expr#14, where we technically need even stronger guarantees (the same token across different processes), but I believe this should be sufficient.
Similar problems could already pop up with HLGs, but they are likely a little less susceptible to this.
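A sketch of the memory-mapped array behavior described above, using a toy content hash in place of dask.base.tokenize (assumed semantics: identical bytes mean identical tokens regardless of backing storage):

```python
import os
import tempfile
from hashlib import sha256

import numpy as np


def content_token(arr):
    # Toy stand-in for tokenize: hash dtype, shape, and the data buffer,
    # so backing storage (in-memory vs memory-mapped) does not matter.
    h = sha256(repr((arr.dtype.str, arr.shape)).encode())
    h.update(np.ascontiguousarray(arr).tobytes())
    return h.hexdigest()


data = np.arange(10, dtype="int64")
fd, path = tempfile.mkstemp()
os.close(fd)
data.tofile(path)
mm = np.memmap(path, dtype="int64", mode="r", shape=(10,))
```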