Fix optimize_dataframe_getitem bug #7698
Thanks for fixing so quickly @rjzamora!
dask/dataframe/optimize.py
# so we check for the first item of the tuple.
# See https://github.com/dask/dask/issues/5893
return dsk
if block.indices[1][1] is not None:
It wasn't immediately clear to me that this condition is what's needed for selecting out column projection layers. I wonder if there's something more direct we can check for here
I agree that it would be best to have a more-intuitive check here... Still thinking about this.
I guess this is more motivation for a formal getitem Layer... What we are looking for here is a Blockwise layer, based on operator.getitem, with the getitem key (block.indices[1][0]) being a str or list. We do want the getitem key to be a literal (so block.indices[1][1] should be None), but the only time it will not be a literal is if the key is another layer/collection key. Therefore, we can either check that the getitem key is not a layer key, or we can check that the index is None.
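To make the condition described above concrete, here is a minimal sketch of the check. It does not use dask itself: FakeBlockwise is a hypothetical stand-in that carries only a func and an indices field, and is_column_projection is an illustrative helper name, not a function in dask/dataframe/optimize.py. It assumes, as in the discussion, that indices holds (name, index) pairs and that a literal argument has an index of None.

```python
import operator
from collections import namedtuple

# Hypothetical stand-in for a Blockwise layer (NOT the real dask class).
# `indices` is assumed to be a tuple of (name, index) pairs, where a
# literal argument has index None and a collection has an index tuple.
FakeBlockwise = namedtuple("FakeBlockwise", ["func", "indices"])


def is_column_projection(block):
    """Return True if `block` looks like df[key] with a literal str/list key."""
    if block.func is not operator.getitem:
        return False
    if len(block.indices) != 2:
        return False
    key, index = block.indices[1]
    # A non-None index means the key names another layer/collection,
    # so this is not a column selection even if the key is a string.
    if index is not None:
        return False
    return isinstance(key, (str, list))


# df["x"]: the key is a literal column name.
projection = FakeBlockwise(
    operator.getitem, (("read-parquet-abc", ("i",)), ("x", None))
)
# df[other]: the key is the name of another collection that happens to be "x".
collection_key = FakeBlockwise(
    operator.getitem, (("read-parquet-abc", ("i",)), ("x", ("i",)))
)
```

Both layers have the string "x" in the key position, so checking only the key type would treat them the same; the index check is what tells them apart in this sketch.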
@jrbourbeau - Did you still have concerns about this fix? I wasn't planning on making further changes here, but I certainly can if you have reservations or suggestions.
Closes #7692
In the current implementation of optimize_dataframe_getitem, it is possible for a collection name to be confused with a column selection. This PR includes a simple fix.