
Ensure tokenize is consistent for pickle roundtrips #10808

Closed (wanted to merge 2 commits)

Conversation

fjetter (Member) commented Jan 16, 2024

It looks like the memory-mapped arrays implemented different semantics than other arrays. I changed the implementation to also hash the data instead of using file metadata. Memory-mapped arrays now tokenize the same as other arrays (e.g. when they point to the same data, they tokenize identically).

Beyond that, all tokenize calls are now testing pickle roundtrips.

This is particularly important for dask/dask-expr#14 where we technically need even stronger guarantees (same token for different processes) but I believe this should be sufficient.

Similar problems could already pop up with HLGs, but they are likely a little less susceptible to this.
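The changed semantics can be illustrated with a small sketch (not dask's actual tokenize implementation): hash the dtype, shape, and raw buffer contents rather than file metadata, so a memory-mapped array and an in-memory array over the same data produce the same token. The helper name `hash_array_data` is hypothetical.

```python
import hashlib
import os
import tempfile

import numpy as np


def hash_array_data(arr):
    # Hash dtype, shape, and raw bytes -- not file metadata -- so the
    # token depends only on the data itself (illustrative helper).
    h = hashlib.sha256()
    h.update(str(arr.dtype).encode())
    h.update(str(arr.shape).encode())
    h.update(np.ascontiguousarray(arr).tobytes())
    return h.hexdigest()


fd, path = tempfile.mkstemp()
os.close(fd)
data = np.arange(10, dtype="i8")
data.tofile(path)
mm = np.memmap(path, dtype="i8", mode="r")

# A memmap and a plain array over the same data hash identically
assert hash_array_data(mm) == hash_array_data(data)
del mm
os.remove(path)
```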

github-actions bot (Contributor) commented Jan 16, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 3h 20m 29s ⏱️ +8m 0s
12,991 tests +5 · 12,060 ✅ +3 · 929 💤 ±0 · 2 ❌ +2
160,552 runs +75 · 144,044 ✅ +93 · 16,505 💤 −21 · 3 ❌ +3

For more details on these failures, see this check.

Results for commit a9b17c4. ± Comparison against base commit c2c1ece.

This pull request removes 12 and adds 17 tests. Note that renamed tests count towards both.
dask.tests.test_delayed ‑ test_name_consistent_across_instances
dask.tests.test_tokenize ‑ test_normalize_function_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[bsr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[coo]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csc]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[dia]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[lil]
dask.tests.test_tokenize ‑ test_tokenize_function_cloudpickle
…
dask.tests.test_delayed ‑ test_deterministic_name
dask.tests.test_tokenize ‑ test_empty_numpy_array
dask.tests.test_tokenize ‑ test_local_objects
dask.tests.test_tokenize ‑ test_tokenize_callable_class
dask.tests.test_tokenize ‑ test_tokenize_circular_recursion
dask.tests.test_tokenize ‑ test_tokenize_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other0]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other1]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other2]
dask.tests.test_tokenize ‑ test_tokenize_local_functions
…

♻️ This comment has been updated with latest results.

def test_empty_numpy_array():
    arr = np.array([])
    assert arr.strides
    assert tokenize(arr) == tokenize(arr)
A collaborator commented:

This looks like something that's useful to push into the tokenize wrapper

fjetter (Member, Author) replied:

good call... will update the PR shortly

crusaderky (Collaborator) left a comment:

Test is now failing

Comment on lines +751 to +765
try:
    impl = lk[cls2]
except KeyError:
    pass
else:
    if cls is not cls2:
        # Cache lookup
        lk[cls] = impl
    return impl
fjetter (Member, Author) replied:

The lazy registration sometimes registers more specific types, which the dispatch otherwise misses. I should probably add a dedicated test for this / break it into a dedicated PR.

This shows up when running some of the tests in isolation
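The snippet above is from dask's type dispatch machinery. As a rough, self-contained sketch (names and structure simplified for illustration), the MRO walk plus caching looks like this:

```python
class Dispatch:
    # Minimal sketch of type-based dispatch with an MRO cache, a
    # simplification of dask's Dispatch for illustration only.

    def __init__(self):
        self._lookup = {}

    def register(self, cls, func):
        self._lookup[cls] = func

    def dispatch(self, cls):
        lk = self._lookup
        for cls2 in cls.__mro__:
            try:
                impl = lk[cls2]
            except KeyError:
                pass
            else:
                if cls is not cls2:
                    # Cache the hit on the concrete class so later
                    # lookups skip the MRO walk
                    lk[cls] = impl
                return impl
        raise TypeError(f"No dispatch rule for {cls}")


d = Dispatch()
d.register(object, lambda x: "fallback")


class Foo:
    pass


impl = d.dispatch(Foo)   # walks Foo.__mro__, finds the rule for object
assert Foo in d._lookup  # ...and caches it on Foo for next time
```

The caching is what interacts badly with lazy registration: if a more specific type is registered only after a lookup has already cached a less specific implementation, the cached entry can shadow the better match.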

@@ -485,7 +556,7 @@ class BDataClass:
def test_tokenize_dataclass():
    a1 = ADataClass(1)
    a2 = ADataClass(2)
    assert tokenize(a1) == tokenize(a1)
fjetter (Member, Author) replied:

This was actually broken and shadowed by the cache; hence, the cache is now cleared. I corrected the dataclass implementation, but unfortunately I had to access some semi-private attributes. I think that should be OK, but it should be reviewed.
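For comparison, a public-API way to normalize a dataclass is by its field values via `dataclasses.fields` (a sketch of the general idea, not the exact fix in this PR, which reaches into semi-private attributes):

```python
import dataclasses


def normalize_dataclass(obj):
    # Normalize a dataclass instance to its type name plus
    # (field name, field value) pairs, so instances with equal
    # field values normalize identically (illustrative sketch).
    return (
        type(obj).__qualname__,
        tuple((f.name, getattr(obj, f.name)) for f in dataclasses.fields(obj)),
    )


@dataclasses.dataclass
class ADataClass:
    a: int


assert normalize_dataclass(ADataClass(1)) == normalize_dataclass(ADataClass(1))
assert normalize_dataclass(ADataClass(1)) != normalize_dataclass(ADataClass(2))
```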

dask/base.py Outdated
Comment on lines 1190 to 1193
# co.co_filename,
# co.co_firstlineno,
# co.co_flags,
# co.co_lnotab,
fjetter (Member, Author) replied:

I modeled this after an issue that was opened a couple of months ago. I don't have a strong preference on whether line numbers should count here. See the test test_tokenize_local_functions.

This somewhat overlaps with dask/dask-expr#689 cc @milesgranger
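The commented-out attributes above (co_filename, co_firstlineno, ...) are location metadata; excluding them means two textually identical functions defined in different places tokenize the same. A rough sketch of the idea (the attribute list is illustrative, not dask's exact choice):

```python
import hashlib


def normalize_code(func):
    # Hash location-independent pieces of the code object; deliberately
    # omit co_filename / co_firstlineno so *where* the function is
    # defined does not change the token. A real implementation would
    # also recurse into nested code objects inside co_consts.
    co = func.__code__
    parts = (co.co_argcount, co.co_nlocals, co.co_code,
             co.co_consts, co.co_names, co.co_varnames)
    return hashlib.sha256(repr(parts).encode()).hexdigest()


def f(x):
    return x + 1


def g(x):
    return x + 1


# Same body, different line numbers: identical token
assert normalize_code(f) == normalize_code(g)
```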

fjetter (Member, Author) commented Jan 25, 2024

So this got a little more intense than I thought.

Good news first: it looks like the "non-determinism of cloudpickle" is actually not a thing, or at least not relevant for the cases we're covering here. To ensure that we're getting the kind of token determinism we're hoping for, tokens are now checked with three kinds of roundtrips:

  1. We roundtrip the same object ordinarily and assert that the token is the same.
  2. We roundtrip the object through the distributed serializer, tokenize the deserialized object, and compare it to the original.
  3. Last but not least, we run this on another process as well, i.e. we spin up an actual cluster (reused for the entire module, so test runtime for this module is about 2s for me), generate the token over there, and compare it with the local version.

It looks like all the cases we have covered here are working for all three consistency levels.
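As a sketch of what the first two consistency levels assert (with a deliberately simplified stand-in tokenizer; dask's tokenize and the distributed serializer are much richer), the check looks roughly like:

```python
import hashlib
import pickle


def tokenize(obj):
    # Stand-in tokenizer for illustration only: hash the pickled payload.
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()


def check_tokenize(obj):
    tok = tokenize(obj)
    # Level 1: tokenizing the same object twice is deterministic
    assert tokenize(obj) == tok
    # Level 2: a serialization roundtrip preserves the token
    assert tokenize(pickle.loads(pickle.dumps(obj))) == tok
    # Level 3 (not shown): compute the token on a worker process in an
    # actual cluster and compare it with the local one
    return tok


check_tokenize({"x": (1, 2, 3)})
```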

Bad news: I had to patch up a lot. There will also be a small patch in distributed, i.e. one test here will fail until that is in. Some tokens will also be more complex than before; e.g. I ended up tokenizing even type objects (previously we often used str or __name__), so the tokens will be a little more nested, which can cause tokenization to degrade in performance. For now I don't worry about that, since I do not believe the impact to be substantial.

fjetter (Member, Author) commented Jan 25, 2024

dask/distributed#8480 will be needed for CI to pass

fjetter (Member, Author) commented Jan 26, 2024

Interestingly, in CI the local tokenization does fail but only sporadically 🤔

crusaderky added a commit to crusaderky/dask that referenced this pull request Jan 26, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Jan 26, 2024
crusaderky force-pushed the ensure_tokenize_pickle_consistent branch from 1d83bf0 to d0c846e on January 29, 2024 22:00
crusaderky (Collaborator) commented Jan 29, 2024

Status update

  • dask-expr CI ("Remove lambda tokenization hack", dask-expr#822) is very, very red due to the removal of dask-expr-specific lambda tokenization. Need to figure out why the new one in dask/dask isn't fit for purpose and why none of the tests in test_tokenize.py reproduce the issue. Also, why are the dask/dask tests with dask-expr enabled almost completely green?

  • Unknown objects are blindly passed to pickle protocol 4. This can be very slow for third-party wrappers around numpy, pandas, or any other large data. We should do a follow-up where we move to protocol 5 with out-of-band buffers.

  • If pickling an unknown object fails, we fall back to a very sophisticated recursive slots/dict introspection. I think this should be replaced with a blind cloudpickle call, as already happens for functions. I attempted to run the slots/dict introspection BEFORE blindly calling pickle, and it resulted in an extreme slowdown.

  • We're now pickling potentially a lot more unknown objects than before. A performance A/B test is in order.
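The proposed follow-up, pickle protocol 5 with out-of-band buffers, keeps large payloads out of the pickle stream instead of copying them into it. A minimal sketch (assuming a numpy array as the large payload):

```python
import pickle

import numpy as np

data = np.arange(1_000_000, dtype="f8")  # ~8 MB payload

# With protocol 5, large buffers are handed to buffer_callback
# instead of being copied into the pickle frame.
buffers = []
frame = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)
assert len(frame) < data.nbytes  # the frame itself stays small

# The out-of-band buffers are supplied back at load time (zero-copy
# transports can ship them separately).
restored = pickle.loads(frame, buffers=buffers)
assert (restored == data).all()
```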

Failing tests

  • Windows / 3.10: FAILED dask/tests/test_tokenize.py::test_tokenize_object_with_recursion_error
  • mindeps: dask/tests/test_tokenize.py::test_local_objects
  • dask_expr: FAILED dask/dataframe/io/tests/test_parquet.py::test_parquet_pyarrow_write_empty_metadata_append

False positives

  • dask_expr: FAILED dask/dataframe/io/tests/test_parquet.py::test_pyarrow_filter_divisions (this one fails in main too)

  • Closes "GPU Tokenization" (#6718)

phofl (Collaborator) commented Jan 29, 2024

> dask_expr: FAILED dask/dataframe/io/tests/test_parquet.py::test_pyarrow_filter_divisions (this one fails in main too)

This one is green now if you merge main

crusaderky added a commit to fjetter/dask that referenced this pull request Jan 31, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Jan 31, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Jan 31, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Jan 31, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Jan 31, 2024
crusaderky force-pushed the ensure_tokenize_pickle_consistent branch from ffbc27d to 1d4b876 on February 1, 2024 14:55
crusaderky closed this Feb 1, 2024
crusaderky reopened this Feb 1, 2024
crusaderky force-pushed the ensure_tokenize_pickle_consistent branch 4 times, most recently from 8e8f560 to 06526e5 on February 5, 2024 15:31
crusaderky force-pushed the ensure_tokenize_pickle_consistent branch from 06526e5 to a9b17c4 on February 6, 2024 09:32
This was referenced Feb 6, 2024
crusaderky (Collaborator):

Superseded by #10883

crusaderky closed this Feb 13, 2024