Broadcast join fails when passing a list of columns to merge on #9870

Closed
ayushdg opened this issue Jan 24, 2023 · 2 comments · Fixed by #9871
Labels
needs triage Needs a response from a contributor

Comments

@ayushdg
Contributor

ayushdg commented Jan 24, 2023

Describe the issue:

The broadcast merge code path for Dask DataFrames raises a KeyError when a list of columns is passed as the on argument.

Minimal Complete Verifiable Example:

import pandas as pd
import numpy as np
from dask import dataframe as dd
from distributed import Client, wait
c = Client()

df1 = pd.DataFrame({"a":np.arange(20),"b":np.arange(20),"c":[1,2]*10})
df2 = df1.copy(deep=True)
df1 = dd.from_pandas(df1,2)
df2 = dd.from_pandas(df2,5)

len(df1.merge(df2,on=["a"], how="inner", shuffle="tasks", broadcast=True))
# KeyError: "('a',)"
# Works with on="a"

Anything else we need to know?:
The same error occurs when joining on multiple columns.

Environment:

  • Dask version: 2022.12.0
  • Python version: 3.9
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): pip
@github-actions github-actions bot added the needs triage Needs a response from a contributor label Jan 24, 2023
@fjetter
Member

fjetter commented Jan 24, 2023

I can confirm and reproduce this. My best guess is that this is a serialization error, or that we are calling stringify or something similar too eagerly.

Basically, the list is cast to a tuple and then stringified somewhere, i.e. ["a"] becomes "('a',)". Passing a tuple directly or the plain string works as expected.

If I had to guess, this looks suspicious but I don't fully understand what's going on there. Maybe @rjzamora ?
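
A toy sketch of the mechanism guessed at above, in plain Python. This is not Dask's code: the `columns` dict and the `mangled` variable are hypothetical stand-ins, used only to show how a list cast to a tuple and then stringified would produce the exact key seen in the error.

```python
# Toy illustration only; none of this is Dask's real code.
on = ["a"]

# Suspected sequence: list -> tuple -> str
mangled = str(tuple(on))
print(mangled)  # ('a',)

# Hypothetical stand-in for a lookup keyed by column name
columns = {"a": [0, 1, 2]}

try:
    columns[mangled]
except KeyError as err:
    # str() of a KeyError is the repr of the missing key
    message = str(err)
    print("KeyError:", message)
```

If this guess is right, it would also explain why on="a" works: a plain string is never turned into a tuple, so the key stays "a".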

@rjzamora
Member

Thanks for raising this, @ayushdg! I ran into this bug yesterday but didn't get a chance to raise an issue and investigate yet. @fjetter is correct that this is likely another HLG-serialization edge case (probably related to msgpack not distinguishing lists from tuples). I will try to figure out a fix, but I also look forward to something like dask/distributed#6028 avoiding these problems altogether :)
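
For readers unfamiliar with this kind of edge case, here is a minimal sketch of how a serialization round trip can lose the list/tuple distinction. It uses the standard-library json module as a stand-in, on the assumption that msgpack behaves analogously (both formats have a single array type). It does not reproduce Dask's actual HLG serialization path, and which sequence type comes back depends on the decoder's settings.

```python
import json

# A tuple and a list serialize to the same array representation
original = ("a", "b")
encoded = json.dumps(original)
restored = json.loads(encoded)

print(encoded)
print(type(original).__name__, "->", type(restored).__name__)

# After the round trip, the receiver cannot tell which type was sent
same_encoding = json.dumps(["a", "b"]) == json.dumps(("a", "b"))
print(same_encoding)
```

Any code on the receiving side that branches on isinstance(x, list) versus isinstance(x, tuple) can therefore take a different path than it would have before serialization.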
