
[bug] dask.dataframe.DataFrame.merge fails for inner join #4643

Closed
trstovall opened this issue Mar 28, 2019 · 15 comments

@trstovall

Setup

from pandas import DataFrame
from dask.delayed import delayed
from dask.dataframe import from_delayed

A = from_delayed([
  delayed(DataFrame)({'x': range(i, i+5), 'a': range(i, i+5)})
  for i in range(0, 10, 5)
])

B = from_delayed([
  delayed(DataFrame)({'x': range(i, i+5), 'b': range(i, i+5)})
  for i in range(0, 10, 5)
])

C = A.merge(
  B, on=('x',), how='inner'
)
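
For comparison, here is the same merge done eagerly in plain pandas (a minimal sketch of the expected result; the concat mirrors the two delayed partitions above):

import pandas as pd

# Build the same two frames eagerly.
a = pd.concat([pd.DataFrame({'x': range(i, i + 5), 'a': range(i, i + 5)})
               for i in range(0, 10, 5)])
b = pd.concat([pd.DataFrame({'x': range(i, i + 5), 'b': range(i, i + 5)})
               for i in range(0, 10, 5)])

c = a.merge(b, on='x', how='inner')
print(len(c))  # 10 -- every x in A matches exactly one x in B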

Traceback

>>> C.compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/lib/python3.6/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/conda/lib/python3.6/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/conda/lib/python3.6/site-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/conda/lib/python3.6/site-packages/dask/local.py", line 462, in get_async
    raise_exception(exc, tb)
  File "/conda/lib/python3.6/site-packages/dask/compatibility.py", line 112, in reraise
    raise exc
  File "/conda/lib/python3.6/site-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/conda/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/conda/lib/python3.6/site-packages/dask/optimization.py", line 942, in __call__
    dict(zip(self.inkeys, args)))
  File "/conda/lib/python3.6/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/conda/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/conda/lib/python3.6/site-packages/dask/compatibility.py", line 93, in apply
    return func(*args, **kwargs)
  File "/conda/lib/python3.6/site-packages/dask/dataframe/core.py", line 3794, in apply_and_enforce
    df = func(*args, **kwargs)
  File "/conda/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 417, in partitioning_index
    return hash_pandas_object(df, index=False) % int(npartitions)
  File "/conda/lib/python3.6/site-packages/pandas/core/util/hashing.py", line 117, in hash_pandas_object
    h = Series(h, index=obj.index, dtype='uint64', copy=False)
  File "/conda/lib/python3.6/site-packages/pandas/core/series.py", line 262, in __init__
    .format(val=len(data), ind=len(index)))
ValueError: Length of passed values is 0, index implies 5

Version

>>> pandas.__version__
'0.23.4'
>>> dask.__version__
'1.1.4'
@mrocklin
Member

mrocklin commented Mar 28, 2019 via email

@trstovall
Author

This bug doesn't occur when pandas.DataFrame is replaced by cudf.DataFrame.

@jakirkham
Member

jakirkham commented Jun 3, 2019

Any thoughts on this one, @TomAugspurger?

@TomAugspurger
Member

Due to pandas-dev/pandas#24318, we need to avoid calling hash_pandas_object on empty DataFrames. It's not clear what pandas should do in that case, so I think dask needs to work around it.

I suspect shuffle and partitioning_index would both need to be updated to handle empty partitions somehow, but I haven't looked into what they should do (and don't plan to any time soon).
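
For illustration, the pandas-level failure can be triggered directly (a minimal sketch; the traceback above is consistent with hashing a frame that has rows but no columns, and the behavior described is as of pandas 0.23/0.24, per pandas-dev/pandas#24318):

import pandas as pd
from pandas.core.util.hashing import hash_pandas_object

df = pd.DataFrame({'x': range(5)})

# Selecting zero columns keeps the 5-row index but leaves nothing to hash,
# so the combined hash array has length 0. On pandas 0.23/0.24 this raises:
#   ValueError: Length of passed values is 0, index implies 5
hash_pandas_object(df[[]], index=False)

A dask-side guard would presumably need to special-case this shape before hashing, for example by returning zeros when there are no columns to hash.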

@saravananpsg

I am facing the same issue when joining multiple dask DataFrames with an inner join on a multi-core CPU.

Version:
Dask - '2.20.0'
Pandas - '0.24.2'

Code Snippet:

from functools import reduce

import dask.dataframe as dd
import pandas as pd

# csv_files and get_value are defined elsewhere in my pipeline.
dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    df = dd.from_pandas(df, npartitions=20)
    df['key'] = df.map_partitions(
        lambda mdf: mdf.apply(lambda row: get_value(row), axis=1)
    ).compute(scheduler='processes')
    df = df.set_index('key')  # set_index returns a new frame
    dfs.append(df)

df_merged = reduce(lambda left, right: dd.merge(left, right,
                                                left_index=True,
                                                right_index=True,
                                                on=['key'],
                                                how='inner',
                                                suffixes=('', '_y')), dfs)

The same inner join works fine with scheduler='sync'; however, scheduler='processes' yields incorrect results.
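
Concretely, the discrepancy shows up like this (a hypothetical check on df_merged from the snippet above):

# The same graph computed with both schedulers should agree.
sync_result = df_merged.compute(scheduler='sync')
proc_result = df_merged.compute(scheduler='processes')
assert sync_result.equals(proc_result)  # fails: extra unmatched rows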

@saravananpsg

@mrocklin @TomAugspurger any advice on this?

@TomAugspurger
Member

@saravananpsg does #4643 (comment) answer your question?

This issue raises a traceback, but you mention incorrect results. Are you sure your issue is the same?

@saravananpsg

saravananpsg commented Oct 14, 2020

@TomAugspurger Yes. It may be related to this issue.

In my case, the inner join results look more like an outer join: I get additional unmatched rows when joining multiple dataframes on a distributed system. I tried with and without task-level shuffling (shuffle='task') during the merge. No luck though :-(

I also tried to

  • convert back to a pandas dataframe and reset the index
  • merge the results in pandas; the results are still not what an inner join should produce

Is there any workaround for this?

Kindly provide your thoughts on this!

@lfdversluis

I see the merge function is still buggy, for both outer and inner joins. Is there any way to mitigate this? I tried using align to make sure the dataframes have the same indices/columns, but that seems unsupported at the moment.

@saravananpsg

One workaround is to break the pipeline into stages (for instance, wherever you compute intermediate data): write those results out to files and read them back in, which wipes away the problematic partition index. See the sketch below.
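
A minimal sketch of that staging, assuming df and other are the dask DataFrames being merged and using a hypothetical file name (to_parquet/read_parquet are standard dask.dataframe API):

import dask.dataframe as dd

# Stage 1: persist the keyed frame; this materializes the computation.
df.to_parquet('stage1.parquet')

# Stage 2: read it back. The frame comes back with fresh partitioning,
# so the merge no longer runs against the original partition index.
df = dd.read_parquet('stage1.parquet')
df_merged = df.merge(other, on='key', how='inner')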

@jsignell
Member

jsignell commented Dec 9, 2020

I think this is fixed on dask master. I was just unable to reproduce the MRE on dask master with pandas 1.1.4.

@jakirkham
Member

We also just released on Friday, so this may already be fixed in a release. I would make sure to upgrade and see if there's still an issue.

@jsignell
Member

I am going to close this, but feel free to reopen or comment if you can still reproduce.

@lampda

lampda commented Jan 8, 2022

Has this bug been fixed already?

When I do an outer merge, I get this error message:
"ValueError: Length of passed values is 0, index implies 48679."

My pandas version is '1.1.5' and my dask version is '2021.03.0'.

@jakirkham
Member

@lampda I would suggest raising a new issue with a minimal reproducer.
