
Tokenize bag groupby keys #10734

Merged · 9 commits into dask:main on Apr 8, 2024

Conversation

@cisaacstern (Contributor)

I wanted to see if this approach could address the effect of non-deterministic hashing of None (and containers that contain it) on dask.bag.Bag.groupby keys, as discussed in #6723 (comment). It appears that it does!

Opening as a draft since discussion is ongoing in #6723 about the best solution. I thought it would be helpful to have something concrete to discuss, so I'm offering this in that spirit.

If we do take this route, we may be able to revert dask/distributed#8400 and #6660.
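
For anyone skimming, the core idea is to derive groupby partition assignments from dask.base.tokenize rather than from Python's built-in hash. The helper below is an illustrative sketch of that idea, not the exact code in this diff:

```python
from dask.base import tokenize


def stable_partition(key, npartitions):
    # tokenize returns a deterministic hex digest for the key, so the
    # resulting partition index does not depend on per-process hash state
    # (unlike hash(None), which can differ between worker processes).
    return int(tokenize(key), 16) % npartitions


# None-containing keys map to the same partition in every process/session.
print(stable_partition(("a", None), 8))
```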

@GPUtester (Collaborator)

Can one of the admins verify this patch?

Admins can comment "ok to test" to allow this one PR to run, or "add to allowlist" to allow all future PRs from the same author to run.

github-actions bot commented Mar 15, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0      15 suites ±0      3h 22m 11s ⏱️ +3m 47s
13,117 tests +12      12,171 ✅ +12      930 💤 ±0      16 ❌ ±0
162,419 runs +180     142,261 ✅ +184    20,142 💤 −4    16 ❌ ±0

For more details on these failures, see this check.

Results for commit 568f51c. ± Comparison against base commit f9310c4.

♻️ This comment has been updated with latest results.

@cisaacstern (Contributor Author) commented Mar 15, 2024

Aside from pandas-nightly, which IIUC is not related to this PR, the only failures are on:

  • (ubuntu-latest, 3.9)
  • (ubuntu-latest, 3.10)

And in both cases test_bag_groupby_none fails with timeouts:

FAILED dask/bag/tests/test_bag.py::test_bag_groupby_none[processes-disk] - Failed: Timeout >300.0s
FAILED dask/bag/tests/test_bag.py::test_bag_groupby_none[processes-tasks] - Failed: Timeout >300.0s

Based on the raw logs, this seems possibly thread-lock related. So far I have not been able to reproduce it locally, despite trying in an Ubuntu + Python 3.9 Docker container. I will take another look at this soon.

@cisaacstern marked this pull request as ready for review on March 19, 2024 at 23:41.
@cisaacstern (Contributor Author) commented Mar 19, 2024

👋 Dask team! This is now ready for review IMHO.

To recap for those who may be getting (back) up to speed on this: #6723 is a long-running issue about the (deliberate) inter-process/session non-determinism of Python's built-in hash, which dask.bag currently uses in a few places for groupby, causing inaccurate grouping when workers do not share a process/session.

dask/distributed#8400 was a partial solution for the case of string keys in groupby, but I later realized it did not go far enough, as it does not cover the case of None being non-deterministically hashed. While None as (part of) a key may initially seem uncommon, it's actually quite practical: frozen dataclasses (an otherwise reasonable key type, IMO) will hash non-deterministically if any of their fields are set to None when hashed.
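
To make that failure mode concrete, here is a small illustration (a sketch under stated assumptions: exact behavior depends on the Python build and version; on CPython older than 3.12, hash(None) is derived from the object's memory address, so it can differ between interpreter processes):

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Key:
    name: str
    group: Optional[int] = None


# A frozen dataclass hashes a tuple of its fields, so a None-valued field
# pulls in hash(None). On CPython < 3.12 that value is address-based and
# can therefore differ from one interpreter process to the next.
print(hash(Key("a")))

# Running the same expression in fresh interpreters can show the drift:
for _ in range(2):
    out = subprocess.run(
        [sys.executable, "-c", "print(hash(None))"],
        capture_output=True,
        text=True,
    )
    print(out.stdout.strip())
```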

Over in #6723, there has been extensive discussion, spanning years, about what the best stable hashing approach might be, but I think the fix proposed here avoids most of the aesthetic/design concerns raised there by doing something embarrassingly simple: it just uses Dask's own tokenize to generate a stable hash!
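
A quick sanity check of that claim (illustrative values; the caveats discussed in #6723 and #10905 around exotic user-defined objects still apply):

```python
import subprocess
import sys

from dask.base import tokenize

# Within one process the token is trivially stable...
assert tokenize(("a", None)) == tokenize(("a", None))

# ...and, unlike hash(), it should also agree across fresh interpreter
# processes, because None and other built-ins are normalized by value.
cmd = "from dask.base import tokenize; print(tokenize(('a', None)))"
tokens = {
    subprocess.run(
        [sys.executable, "-c", cmd], capture_output=True, text=True
    ).stdout.strip()
    for _ in range(2)
}
print(tokens)  # expected: a single-element set
```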

My hope is that this is relatively non-controversial, but of course I look forward to feedback from those more familiar with this code! In terms of my own motivations, this PR would go a long way toward unblocking my ongoing efforts on the Apache Beam Dask Runner, since Beam internals use None extensively as part of group keys.

cc @jacobtomlinson, since you have been following this work 😃

Edit: the only failing test now is pandas-nightly, which IIUC is not related to this PR.

@fjetter (Member) commented Mar 20, 2024

The dask tokenizer is exposed to a more or less similar problem. We recently tried to solve this for tokenize but couldn't come up with a version that is stable across machines; see #10905 for some references.
There are cases where we have to fall back to cloudpickle, which can cause issues.
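
(For context on that caveat: objects with no registered normalization are ultimately tokenized from their pickled bytes, which is not guaranteed to be identical across machines or library versions. User classes can opt in to deterministic tokens via the __dask_tokenize__ protocol; the class below is purely illustrative.)

```python
from dask.base import tokenize


class GroupKey:
    """Illustrative user-defined key that opts into deterministic tokens."""

    def __init__(self, name, group=None):
        self.name = name
        self.group = group

    def __dask_tokenize__(self):
        # Return something dask already knows how to tokenize by value.
        return ("GroupKey", self.name, self.group)


assert tokenize(GroupKey("a")) == tokenize(GroupKey("a"))
```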

@cisaacstern (Contributor Author)

Thanks for that additional context, @fjetter.

I wonder if this PR may still be worth merging, as an incremental improvement over the status quo?

If I understand the linked issue(s) correctly, the cases where tokenize still falls short on determinism mostly involve complex, user-defined types, and perhaps especially the distributed context?

Here, I am hoping we can simply find a way to get built-ins (e.g. NoneType) to hash deterministically, at the very least in the single-machine multiprocessing context. The current release of dask.bag cannot do this; tokenize as used here appears to solve that. I am not sure whether this solution generalizes to the (multi-machine) distributed case as well, but even the single-machine multiprocessing setting alone is a huge win IMHO.
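
For concreteness, this is the kind of workload that motivates the change (the data and scheduler choice are illustrative): grouping a bag on keys that may be None, under the process-based scheduler, where each worker hashes keys in its own interpreter.

```python
import dask.bag as db

if __name__ == "__main__":  # guard needed for the process-based scheduler
    b = db.from_sequence(
        [("a", 1), ("b", None), ("c", None), ("d", 1)],
        npartitions=2,
    )

    # Group on the second element; None is a legitimate key here.
    grouped = b.groupby(lambda record: record[1])

    # Each worker process hashes keys independently -- exactly the setting
    # where address-based hashing of None used to split one logical group
    # across partitions.
    result = dict(grouped.compute(scheduler="processes"))
    print(result)  # expected: two groups, keyed by 1 and None
```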

In the particular case of the Apache Beam runner I mentioned above, Beam notably lacks a performant and user-friendly single-machine multiprocessing runner. At a minimum, I believe this PR (or something like it) would unblock the Dask backend work there, e.g. by unblocking apache/beam#29365 for single-machine multiprocessing.

Very curious to hear your thoughts. Thanks so much.

@TomAugspurger (Member) left a comment

These changes seem reasonable to me. Would we need to make this change with a deprecation cycle? I'm leaning towards not needing one, but maybe I'm missing some use case where the outcome is sensitive to the hashing.

@cisaacstern (Contributor Author)

Thanks for weighing in, @TomAugspurger.

Of course, I will defer to the Dask folks on whether a deprecation cycle is needed.

@cisaacstern (Contributor Author)

Just a gentle ping to ask whether merging this as an incremental improvement seems like a possibility, or whether further changes are needed before that can happen.

This would still be very impactful to our work on Pangeo Forge if/when it's able to go in. Thanks so much!

@fjetter (Member) left a comment

Sorry, I didn't mean to hold up merging this. I agree this is an improvement, and the edge cases that are not handled well by tokenize are likely rarely, if ever, hit by this code.

I also don't think this needs a deprecation cycle. Thanks for your work, and sorry for the delay, @cisaacstern.

@fjetter merged commit 2fd6b7d into dask:main on Apr 8, 2024
26 of 27 checks passed
@cisaacstern (Contributor Author)

Thanks so much, @fjetter !

Successfully merging this pull request may close the following issue: Switch to different, stable hash algorithm in Bag