Fix merge with empty left DataFrame #9578

Merged: 5 commits into dask:main on Oct 18, 2022

ian-r-rose
Collaborator

Fixes #9577. The issue over there is that when the left df in a merge is empty, pandas can give back columns in the wrong order, resulting in a meta mismatch. This just identifies that case and re-orders the columns.
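
A minimal sketch of the re-ordering idea described above, using pandas only. The helper name `reorder_to_meta` and the sample frames are hypothetical and not taken from the PR; the sketch assumes the partition result and the meta share the same column labels and differ only in their order.

```python
import pandas as pd


def reorder_to_meta(result: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Return `result` with its columns in the order given by `meta`.

    Hypothetical helper: it only re-orders when both frames carry the same
    column labels; otherwise the frame is returned unchanged.
    """
    same_labels = (
        len(result.columns) == len(meta.columns)
        and set(result.columns) == set(meta.columns)
    )
    if same_labels and list(result.columns) != list(meta.columns):
        return result[list(meta.columns)]
    return result


# Expected (meta) column order for the merged output.
meta = pd.DataFrame(
    {
        "a": pd.Series([], dtype="int64"),
        "b": pd.Series([], dtype="float64"),
        "c": pd.Series([], dtype="float64"),
    }
)

# A partition result whose columns came back in a different order.
out_of_order = pd.DataFrame({"a": [1, 2], "c": [0.5, 1.5], "b": [3.0, 4.0]})

fixed = reorder_to_meta(out_of_order, meta)
```

Only the column order changes here; the values in each column are left as they were.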

Member

@jrbourbeau left a comment

Thanks @ian-r-rose! I left a few small comments / questions, but overall this looks good

Comment on lines +251 to +252
empty_index_dtype = result_meta.index.dtype
categorical_columns = result_meta.select_dtypes(include="category").columns
Member

Just to clarify, this isn't a functional change -- it's just cleanup. Instead of computing and passing around empty_index_dtype and categorical_columns as kwargs, we're passing result_meta around and computing empty_index_dtype and categorical_columns once, in the place they're used. Correct?

Collaborator Author

That's right, it's just consolidation of that logic. Now that I'm passing in result_meta directly to the chunk function, these derived-from-meta parameters can all just be computed once.
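
A rough sketch of that consolidation, assuming a chunk function that receives `result_meta` directly. The function name `merge_chunk_sketch` and the sample frames are hypothetical; the two derived values mirror the diff excerpt above, but what the sketch does with them afterwards is illustrative only and may not match the actual implementation.

```python
import pandas as pd


def merge_chunk_sketch(lhs, rhs, result_meta, **merge_kwargs):
    # Derive these from the meta once, at the point of use, instead of
    # threading them through as separate keyword arguments.
    empty_index_dtype = result_meta.index.dtype
    categorical_columns = result_meta.select_dtypes(include="category").columns

    out = lhs.merge(rhs, **merge_kwargs)

    # Illustrative use: align categorical columns with the dtype in the meta.
    for col in categorical_columns:
        out[col] = out[col].astype(result_meta[col].dtype)

    # Illustrative use: an empty result takes the index dtype of the meta.
    if len(out) == 0:
        out.index = out.index.astype(empty_index_dtype)
    return out


lhs = pd.DataFrame({"a": [1, 2], "k": ["x", "y"]})
rhs = pd.DataFrame({"a": [2, 3], "v": [10.0, 20.0]})

# Hypothetical meta: an empty frame with the expected columns and dtypes.
result_meta = pd.DataFrame(
    {
        "a": pd.Series([], dtype="int64"),
        "k": pd.Series(pd.Categorical([], categories=["x", "y"])),
        "v": pd.Series([], dtype="float64"),
    }
)

out = merge_chunk_sketch(lhs, rhs, result_meta, on="a", how="inner")
```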


merged = dd_left.merge(dd_right, on="a", how=how)
expected = left.merge(right, on="a", how=how)
assert_eq(merged, expected, check_index=False)
Member

Do we expect this to fail on main? When I run this test locally on main, this assert_eq passes.

Collaborator Author

That's correct, this doesn't fail on main, but the next line does.

Member

I see. In that case, would you mind adding a comment that the merge works, but subsequent operations may fail due to meta mismatches? This is a subtle enough point that I think a bit more context would be nice for future readers.

Collaborator Author

It may even be possible to construct a case where merged does not provide the correct answer (it probably depends on the order of concatenation of the result). I'll poke at it, and add a comment if I can't find anything
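
One way to see why the order of concatenation could matter, sketched with pandas only: when the inputs share the same labels, `pd.concat` appears to take its column order from the first frame, so a single out-of-order partition at the front changes the column order of the whole result. The frames below are made up for illustration and are not the PR's test data.

```python
import pandas as pd

in_order = pd.DataFrame({"a": [1], "b": [2.0]})
out_of_order = pd.DataFrame({"b": [3.0], "a": [4]})

# Same two partitions, concatenated in both possible orders.
first_ok = pd.concat([in_order, out_of_order], ignore_index=True)
first_bad = pd.concat([out_of_order, in_order], ignore_index=True)

# Values are aligned by label either way, so after normalising column and
# row order the two results hold the same data...
normalised = (
    first_bad[list(first_ok.columns)].sort_values("a").reset_index(drop=True)
)
same_data = first_ok.equals(normalised)

# ...but the column order follows whichever frame came first.
same_order = list(first_ok.columns) == list(first_bad.columns)
```

A comparison that ignores column order would treat both results as equal, while anything that relies on the columns matching the meta positionally would not.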

Collaborator Author

Okay, I tweaked the test data, and now it is covered by this line as well

Member

Even better -- thanks @ian-r-rose

Comment on lines +290 to +292
# Workaround for pandas bug where if the left frame of a merge operation is
# empty, the resulting dataframe can have columns in the wrong order.
# https://github.com/pandas-dev/pandas/issues/9937
Member

"Thanks for writing this detailed comment"

- Future me

@jrbourbeau changed the title from "Fix merge with empty left df" to "Fix merge with empty left DataFrame" on Oct 18, 2022
@jrbourbeau merged commit 7e263dd into dask:main on Oct 18, 2022
Development

Successfully merging this pull request may close these issues.

Outer or right merges with missing values can fail with meta mismatches
2 participants