Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

P2P HashJoin #7514

Merged
merged 3 commits into from
Feb 24, 2023
Merged

P2P HashJoin #7514

merged 3 commits into from
Feb 24, 2023

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Jan 31, 2023

Partially closes #7496

The merge itself is pretty trivial but the boilerplate for the layer is a bit heavy...

I'll pass over it again in hopes to simplify it but the code is already ready for reviews.

I'll need to think about how to test this best. I would at least like to have a couple of basic tests, e.g. to ensure all parameters are properly serialized and forwarded, etc.

cc @hendrikmakait

@fjetter

This comment was marked as resolved.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 31, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       24 files  ±    0         24 suites  ±0   10h 43m 2s ⏱️ + 37m 14s
  3 412 tests +  62    3 308 ✔️ +  59     103 💤 +  3  1 ±  0 
40 157 runs  +664  38 317 ✔️ +662  1 839 💤 +12  1  - 10 

For more details on these failures, see this check.

Results for commit 9d2828d. ± Comparison against base commit 76d0104.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member Author

fjetter commented Feb 2, 2023

cc @rjzamora @jrbourbeau I wouldn't mind a review from one of you two.

Getting this running took a little more code than I wanted to but it should be working now. I'll see how I can remove a bit of boilerplate. I encountered many problems about serialization/stringification of arguments. Are there any best practices around this?

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test cases and executed merge logic look reasonable. The HLG stuff is somewhat beyond me at the moment.

dd._compat.assert_numpy_array_equal(av, bv)


@pytest.mark.parametrize("how", ["inner", "left", "right", "outer"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We parametrize multiple tests with how, should we turn this into a fixture?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied most of the tests over so I didn't feel like rewriting them like this

distributed/shuffle/tests/test_merge.py Outdated Show resolved Hide resolved
distributed/shuffle/tests/test_merge.py Show resolved Hide resolved
distributed/shuffle/tests/test_merge_column_and_index.py Outdated Show resolved Hide resolved
distributed/shuffle/_merge.py Show resolved Hide resolved
distributed/shuffle/_merge.py Outdated Show resolved Hide resolved
@fjetter
Copy link
Member Author

fjetter commented Feb 23, 2023

I intend to move forward with this before finishing #7564

We can follow up with a simplification of the layer class afterwards

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming CI is green.

@@ -0,0 +1,439 @@
# mypy: ignore-errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? If so, let's add a follow-up ticket to clean this up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be a follow up task to clean up this layer after the pack/unpack protocol is dropped. I didn't want to go through this exercise for almost-dead-code.
Refactoring the layer makes sense as a follow up ticket but I don't think we should have a ticket just for mypy.

Comment on lines -23 to -27
# FIXME: Serializing custom objects to PyArrow is not supported in P2P shuffling
if pd.api.types.is_object_dtype(column):
raise TypeError(
f"p2p does not support custom objects found in column '{name}'."
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hendrikmakait Sorry I missed this in review of #7425 but is_object_dtype is way too generic to raise here. This includes very simple string columns and we should definitely be able to shuffle dataframes with string columns. Since this is purely about failing early, I removed this check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

P2P shuffling and queuing combined may cause high memory usage with dask.dataframe.merge
2 participants