Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix optimize_dataframe_getitem bug #7698

Merged
merged 4 commits into from May 27, 2021
Merged

Conversation

rjzamora
Copy link
Member

Closes #7692

In the current implementation of optimize_dataframe_getitem, it is possible for a collection name to be confused with a column selection. This PR includes a simple fix.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing so quickly @rjzamora!

dask/dataframe/optimize.py Show resolved Hide resolved
# so we check for the first item of the tuple.
# See https://github.com/dask/dask/issues/5893
return dsk
if block.indices[1][1] is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't immediately clear to me that this condition is what's needed for selecting out column projection layers. I wonder if there's something more direct we can check for here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be best to have a more-intuitive check here... Still thinking about this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is more motivation for a formal getitem Layer... What we are looking for here is a Blockwise layer, based on operator.getitem, with the getitem-key (block.indices[1][0]) being a str or list. We do want the getitem key to be a literal (so block.indices[1][0] should be None), but the only time it will not be a literal is if the key is another layer/collection key. Therefore, we can either check that the getitem key is not a layer key, or we can check that the indice is None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @rjzamora. I agree we might want to add some additional structure long-term to make this code easier to reason about. Though for now, the current set of changes look good and fix #7692

@rjzamora
Copy link
Member Author

@jrbourbeau - Did you stil have concerns about this fix? I wasn't planning on making further changes here, but I certainly can if you have reservations or suggestions.

@jrbourbeau jrbourbeau merged commit 94dae16 into dask:main May 27, 2021
@rjzamora rjzamora deleted the fix-getitem-opt branch May 27, 2021 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyError tracing to optimize_dataframe_getitem
2 participants