Tokenize bag groupby keys #10734
Conversation
Can one of the admins verify this patch? Admins can comment |
Unit Test Results: see the test report for an extended history of previous test failures (useful for diagnosing flaky tests). 15 files ±0, 15 suites ±0, 3h 22m 11s ⏱️ +3m 47s. For more details on these failures, see this check. Results for commit 568f51c; comparison against base commit f9310c4. ♻️ This comment has been updated with latest results. |
Based on the raw logs, this seems possibly thread-lock related? Thus far I have not been able to reproduce this locally, having attempted to do so with an Ubuntu + Python 3.9 Docker container. I will take another look at this soon. |
👋 Dask team! This is now ready for review IMHO. To recap for those who may be getting (back) up to speed on this: #6723 is a long-running issue related to the (deliberate) inter-process/session non-determinism of Python's built-in `hash`. dask/distributed#8400 was a partial solution for this for the case of string keys in groupby, but I later realized it did not go far enough, as it does not cover the case of non-string keys such as `None`.

Over in #6723, there has been extensive discussion spanning years about what the best stable hashing approach might be, but I think the fix proposed here avoids most of the aesthetic/design concerns there by doing something embarrassingly simple: it just uses Dask's own `tokenize`.

My hope is that this should be relatively non-controversial, but of course I look forward to feedback from those more familiar with this code! In terms of my own motivations, this PR will dramatically unblock my ongoing efforts on the Apache Beam Dask Runner, as Beam internals use such keys.

cc @jacobtomlinson, since you have been following this work 😃

Edit: The only failing test now is |
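To illustrate the underlying problem, here is a minimal sketch (not Dask's actual code; `builtin_hash_of_a` and `stable_token` are hypothetical helper names): Python's built-in `hash` of a string is seeded per interpreter session via `PYTHONHASHSEED`, so two worker processes can disagree on a groupby key's hash, whereas a content-based digest of a canonical serialization, in the spirit of `dask.base.tokenize`, stays stable across sessions.

```python
import hashlib
import os
import pickle
import subprocess
import sys


def builtin_hash_of_a(seed: str) -> str:
    """Compute hash('a') in a fresh interpreter with a given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('a'))"],
        capture_output=True, text=True, env=env, check=True,
    )
    return out.stdout.strip()


def stable_token(obj) -> str:
    """Hypothetical stand-in for a deterministic token (in the spirit of
    dask.base.tokenize): digest a canonical byte representation instead of
    relying on the process-seeded built-in hash()."""
    return hashlib.md5(pickle.dumps(obj, protocol=4)).hexdigest()


if __name__ == "__main__":
    # Built-in hash of a string differs between differently seeded sessions...
    print(builtin_hash_of_a("1") != builtin_hash_of_a("2"))
    # ...while the content-based token is identical in every session.
    print(stable_token(("key", None)) == stable_token(("key", None)))
```

Running the same seed twice yields the same built-in hash, which is exactly why single-process runs mask the bug: only a multi-process shuffle exposes the disagreement.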
The dask tokenizer is exposed to a more or less similar problem. We recently tried to solve this for `tokenize` but couldn't come up with a version that is stable across machines; see #10905 for some references. |
Thanks for that additional context, @fjetter. I wonder if this PR may still be worth merging as an incremental improvement over the status quo? If I understand those linked issue(s) correctly, the hard cases there concern stability across machines for arbitrary objects. Here, I am hoping we can find a way simply to get built-ins (i.e. `None` and containers which contain it) to group deterministically.

In the particular case of the Apache Beam runner I mentioned above, Beam is notably lacking a performant and user-friendly single-machine multiprocessing runner. At a minimum, I believe this PR (or something like it) would unblock the Dask backend work there, e.g. by unblocking apache/beam#29365 for single-machine multiprocessing.

Very curious to hear your thoughts. Thanks so much. |
These changes seem reasonable to me. Would we need to make this change with a deprecation cycle? I'm leaning towards not needing a deprecation, but maybe I'm missing out on some use case where the outcome is sensitive to the hashing.
Thanks for weighing in, @TomAugspurger. Of course, I will defer to Dask folks re: deprecation or not. |
Just a gentle ping to ask if merging this as an incremental improvement seems like a possibility, or if there would be further changes needed before that happens? This would still be very impactful to our work on Pangeo Forge if/when it's able to go in. Thanks so much! |
Sorry, I didn't mean to hold up merging this. I agree this is an improvement, and the edge cases that are not handled well by tokenize are likely rarely, if ever, hit by this code.
I also don't think this needs a deprecation cycle. Thanks for your work, and sorry for the delay, @cisaacstern
Thanks so much, @fjetter ! |
`pre-commit run --all-files`
I wanted to see if this approach could solve the effect of non-deterministic hashing of `None` (and containers which contain it) on `dask.bag.Bag.groupby` keys, as discussed in #6723 (comment). It appears that it does!

Opening as a draft, as discussion is ongoing in #6723 as to the best solution. I thought it would be helpful to have something concrete to discuss, so I am offering this in that spirit.
If we do take this route, it's possible we may be able to revert dask/distributed#8400 & #6660.
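As a sketch of why deterministic key tokens matter for a groupby shuffle (hypothetical code, not the implementation in this PR; `partition_for_key` is an illustrative name): if the destination partition for each key is derived from a stable content digest rather than the built-in `hash()`, every worker process routes `None`, strings, and containers thereof identically, so a key's group can never be split across partitions.

```python
import hashlib
import pickle


def partition_for_key(key, npartitions: int) -> int:
    # Hypothetical sketch: choose a shuffle partition from a deterministic
    # content digest, so all processes/sessions agree on where each key's
    # group lives (unlike the process-seeded built-in hash()).
    token = hashlib.md5(pickle.dumps(key, protocol=4)).hexdigest()
    return int(token, 16) % npartitions


if __name__ == "__main__":
    # Keys that previously hashed non-deterministically now route stably:
    for key in [None, "a", ("a", None)]:
        print(key, "->", partition_for_key(key, 8))
```

The same digest-then-modulo idea applies whatever stable token is used; the point is only that the token is a pure function of the key's contents.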