Skip to content

Fix divisions when joining two single-partition dataframes#4636

Merged
mrocklin merged 5 commits intodask:masterfrom
bluecoconut:single-parititon-dask-error
Mar 27, 2019
Merged

Fix divisions when joining two single-partition dataframes#4636
mrocklin merged 5 commits intodask:masterfrom
bluecoconut:single-parititon-dask-error

Conversation

@bluecoconut
Copy link
Copy Markdown
Contributor

@bluecoconut bluecoconut commented Mar 27, 2019

There was an error where if npartition=1 for both left and right on a merge, then the check inside of single_partition_join would short to using the right divisions, even if it was supposed to be left table's divisions that are used.

  • Tests added / passed
  • Passes flake8 dask

Note: the test I added first validated the bug (on previous code), and then was resolved with the fix in single_partition_join

This answers the bug identified in #4617

@bluecoconut bluecoconut changed the title Single parititon dask error Single parititon merge bug Mar 27, 2019
divisions = [None for _ in right.divisions]

elif right.npartitions == 1:
elif right.npartitions == 1 and kwargs['how'] in ('inner', 'left'):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where neither of these two branches is taken?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the conditions required to enter single_partition_join method.

dask.dataframe.multi.py#370-371

Copy link
Copy Markdown
Contributor Author

@bluecoconut bluecoconut Mar 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    # Single partition on one side
    elif (left.npartitions == 1 and how in ('inner', 'right') or
          right.npartitions == 1 and how in ('inner', 'left')):
        return single_partition_join(left, right, how=how, right_on=right_on,
                                     left_on=left_on, left_index=left_index,
                                     right_index=right_index,
                                     suffixes=suffixes, indicator=indicator)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add (if desired for future developer ease) an else: raise NotImplementedError... Not sure what best practices in this code base are. At the moment, the method is only called inside of that parent elif statement.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, sounds like it's safe then. An else: raise block seems reasonable to me. I'm happy to defer to your preferences here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the raise error~

@mrocklin
Copy link
Copy Markdown
Member

Thanks for taking this on @bluecoconut ! Sorry for the slow response on your issue. A combination of conferences and newborns I think is saturating most maintainer cycles these days :)

@mrocklin mrocklin changed the title Single parititon merge bug Fix divisions when joining two single-partition dataframes Mar 27, 2019
@mrocklin mrocklin merged commit a611e60 into dask:master Mar 27, 2019
@mrocklin
Copy link
Copy Markdown
Member

Thanks @bluecoconut !

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants