New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix merge
with emtpy left DataFrame
#9578
Fix merge
with emtpy left DataFrame
#9578
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ian-r-rose! I left a few small comments / questions, but overall this looks good
empty_index_dtype = result_meta.index.dtype | ||
categorical_columns = result_meta.select_dtypes(include="category").columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, this isn't a functional change -- it's just cleanup. Instead of computing and passing around empty_index_dtype
and categorical_columns
as kwargs, we're passing result_meta
around and computing empty_index_dtype
and categorical_columns
once in the place it's used. Correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, it's just consolidation of that logic. Now that I'm passing in result_meta
directly to the chunk function, these derived-from-meta parameters can all just be computed once.
|
||
merged = dd_left.merge(dd_right, on="a", how=how) | ||
expected = left.merge(right, on="a", how=how) | ||
assert_eq(merged, expected, check_index=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect this to fail on main
? Running this test locally on main
this assert_eq
passes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, this doesn't fail on main
, but the next line does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In that case would you mind adding a comment that merge
works, but subsequent operation may fail due to meta mismatches? This is a subtle enough point that I think a bit more context would be nice for future readers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may even be possible to construct a case where merged
does not provide the correct answer (it probably depends on the order of concatenation of the result). I'll poke at it, and add a comment if I can't find anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I tweaked the test data, and now it is covered by this line as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even better -- thanks @ian-r-rose
# Workaround for pandas bug where if the left frame of a merge operation is | ||
# empty, the resulting dataframe can have columns in the wrong order. | ||
# https://github.com/pandas-dev/pandas/issues/9937 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Thanks for writing this detailed comment"
- Future me
merge
with emtpy left DataFrame
Fixes #9577. The issue over there is that when the left df in a merge is empty, pandas can give back columns in the wrong order, resulting in a meta mismatch. This just identifies that case and re-orders the columns.
pre-commit run --all-files