Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix divisions when joining two single-partition dataframes #4636

Merged
merged 5 commits into from Mar 27, 2019

Conversation

Projects
None yet
2 participants
@bluecoconut
Copy link
Contributor

commented Mar 27, 2019

There was an error where if npartition=1 for both left and right on a merge, then the check inside of single_partition_join would short to using the right divisions, even if it was supposed to be left table's divisions that are used.

  • Tests added / passed
  • Passes flake8 dask

Note: the test I added first validated the bug (on previous code), and then was resolved with the fix in single_partition_join

This answers the bug identified in #4617

@bluecoconut bluecoconut changed the title Single parititon dask error Single parititon merge bug Mar 27, 2019

@@ -295,7 +295,7 @@ def single_partition_join(left, right, **kwargs):
else:
divisions = [None for _ in right.divisions]

elif right.npartitions == 1:
elif right.npartitions == 1 and kwargs['how'] in ('inner', 'left'):

This comment has been minimized.

Copy link
@mrocklin

mrocklin Mar 27, 2019

Member

Are there cases where neither of these two branches is taken?

This comment has been minimized.

Copy link
@bluecoconut

bluecoconut Mar 27, 2019

Author Contributor

These are the conditions required to enter single_partition_join method.

dask.dataframe.multi.py#370-371

This comment has been minimized.

Copy link
@bluecoconut

bluecoconut Mar 27, 2019

Author Contributor
    # Single partition on one side
    elif (left.npartitions == 1 and how in ('inner', 'right') or
          right.npartitions == 1 and how in ('inner', 'left')):
        return single_partition_join(left, right, how=how, right_on=right_on,
                                     left_on=left_on, left_index=left_index,
                                     right_index=right_index,
                                     suffixes=suffixes, indicator=indicator)

This comment has been minimized.

Copy link
@bluecoconut

bluecoconut Mar 27, 2019

Author Contributor

Could add (if desired for future developer ease) an else: raise NotImplementedError... Not sure what best practices in this code base are. At the moment, the method is only called inside of that parent elif statement.

This comment has been minimized.

Copy link
@mrocklin

mrocklin Mar 27, 2019

Member

Ah, ok, sounds like it's safe then. An else: raise block seems reasonable to me. I'm happy to defer to your preferences here.

This comment has been minimized.

Copy link
@bluecoconut

bluecoconut Mar 27, 2019

Author Contributor

Added the raise error~

@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

Thanks for taking this on @bluecoconut ! Sorry for the slow response on your issue. A combination of conferences and newborns I think is saturating most maintainer cycles these days :)

@mrocklin mrocklin changed the title Single parititon merge bug Fix divisions when joining two single-partition dataframes Mar 27, 2019

@mrocklin mrocklin merged commit a611e60 into dask:master Mar 27, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

Thanks @bluecoconut !

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.