[bug] dask.dataframe.DataFrame.merge fails for inner join #4643
Comments
Reproduced. Thanks for the excellent bug report.
…On Thu, Mar 28, 2019 at 2:39 PM trstovall ***@***.***> wrote:
Setup
from pandas import DataFrame
from dask.delayed import delayed
from dask.dataframe import from_delayed

A = from_delayed([
    delayed(DataFrame)({'x': range(i, i+5), 'a': range(i, i+5)})
    for i in range(0, 10, 5)
])
B = from_delayed([
    delayed(DataFrame)({'x': range(i, i+5), 'b': range(i, i+5)})
    for i in range(0, 10, 5)
])
C = A.merge(
    B, on=('x',), how='inner'
)
Traceback
>>> C.compute()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/conda/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/conda/lib/python3.6/site-packages/dask/base.py", line 398, in compute
results = schedule(dsk, keys, **kwargs)
File "/conda/lib/python3.6/site-packages/dask/threaded.py", line 76, in get
pack_exception=pack_exception, **kwargs)
File "/conda/lib/python3.6/site-packages/dask/local.py", line 462, in get_async
raise_exception(exc, tb)
File "/conda/lib/python3.6/site-packages/dask/compatibility.py", line 112, in reraise
raise exc
File "/conda/lib/python3.6/site-packages/dask/local.py", line 230, in execute_task
result = _execute_task(task, data)
File "/conda/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/conda/lib/python3.6/site-packages/dask/optimization.py", line 942, in __call__
dict(zip(self.inkeys, args)))
File "/conda/lib/python3.6/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/conda/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/conda/lib/python3.6/site-packages/dask/compatibility.py", line 93, in apply
return func(*args, **kwargs)
File "/conda/lib/python3.6/site-packages/dask/dataframe/core.py", line 3794, in apply_and_enforce
df = func(*args, **kwargs)
File "/conda/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 417, in partitioning_index
return hash_pandas_object(df, index=False) % int(npartitions)
File "/conda/lib/python3.6/site-packages/pandas/core/util/hashing.py", line 117, in hash_pandas_object
h = Series(h, index=obj.index, dtype='uint64', copy=False)
File "/conda/lib/python3.6/site-packages/pandas/core/series.py", line 262, in __init__
.format(val=len(data), ind=len(index)))
ValueError: Length of passed values is 0, index implies 5
Version
>>> pandas.__version__
'0.23.4'
>>> dask.__version__
'1.1.4'
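For context on the last dask frame in the traceback: `partitioning_index` assigns each row to an output partition by hashing the join columns and taking the result modulo `npartitions`. Below is a minimal pure-Python sketch of that idea, not dask's actual implementation. The function names are hypothetical, Python's built-in `hash` stands in for `hash_pandas_object`, and rows are plain dicts rather than DataFrames. It shows why a hash-partitioned inner join on the reproducer's data should return all 10 matching rows.

```python
def shuffle(rows, on, npartitions):
    """Group rows into partitions by hash(key) % npartitions (hypothetical helper)."""
    parts = [[] for _ in range(npartitions)]
    for row in rows:
        parts[hash(row[on]) % npartitions].append(row)
    return parts


def inner_join_partitioned(left, right, on, npartitions):
    """Inner join computed partition by partition after shuffling both sides.

    Rows with equal keys hash to the same partition index on both sides,
    so joining matching partitions pairwise gives the full inner join.
    """
    left_parts = shuffle(left, on, npartitions)
    right_parts = shuffle(right, on, npartitions)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        # Build a key -> rows lookup for the right-hand partition.
        lookup = {}
        for r in rp:
            lookup.setdefault(r[on], []).append(r)
        for l in lp:
            for r in lookup.get(l[on], []):
                out.append({**l, **r})
    return out


# Same data as the Setup section, as plain dicts instead of DataFrames.
A = [{"x": i, "a": i} for i in range(10)]
B = [{"x": i, "b": i} for i in range(10)]
C = inner_join_partitioned(A, B, on="x", npartitions=2)
```

Every `x` from 0 to 9 appears in both inputs, so `C` should contain one row per key. The reported `ValueError` instead comes from the hashing step receiving zero values for an index of length 5.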
This bug doesn't occur when |
Any thoughts on this one, @TomAugspurger? |
Due to pandas-dev/pandas#24318, we need to not call I suspect |
I am facing the same issue when joining multiple dask dataframes with an inner join on a multi-core CPU. Version: Code Snippet:
The same inner join works fine when we pass the compute='sync'; however, compute='processes' yields incorrect results. |
@mrocklin @TomAugspurger any advice on this? |
@saravananpsg does #4643 (comment) answer your question? This issue raises a traceback, but you mention incorrect results. Are you sure your issue is the same? |
@TomAugspurger Yes. It may be related to this issue. In my case, the inner join results look more like an outer join: I am getting additional unmatched rows when joining multiple dataframes on a distributed system. I tried merging with and without task-level shuffling (shuffle='task'). No luck though :-( Also tried to
Is there any workaround for this? Kindly provide your thoughts on this! |
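One way to confirm that an inner join result really contains unmatched rows is to compare per-key row counts against what an inner join must produce: a key occurring `m` times on the left and `n` times on the right should occur `m * n` times in the result. The sketch below is a hypothetical, dask-independent check on plain lists of key values; the function name and the sample data are assumptions for illustration only.

```python
from collections import Counter


def inner_join_discrepancies(left_keys, right_keys, result_keys):
    """Return {key: (expected_rows, actual_rows)} for keys whose counts differ."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    actual = Counter(result_keys)
    # A key missing from either side has an expected count of zero.
    expected = {k: left[k] * right[k] for k in left if k in right}
    problems = {}
    for key in set(expected) | set(actual):
        exp, act = expected.get(key, 0), actual.get(key, 0)
        if exp != act:
            problems[key] = (exp, act)
    return problems


left_keys = [1, 2, 2, 3]
right_keys = [2, 3, 4]

# A correct inner join keeps only keys 2 (twice) and 3 (once).
ok = inner_join_discrepancies(left_keys, right_keys, [2, 2, 3])

# An outer-join-like result also carries the unmatched keys 1 and 4.
bad = inner_join_discrepancies(left_keys, right_keys, [1, 2, 2, 3, 4])
```

An empty dict means the key counts match an inner join; any entry with an expected count of 0 marks rows that should not be in the result.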
I see the merge function is still bugged, both for outer and inner. Any way to mitigate this? I tried to use |
One workaround is to break the computation into stages (for example, if you compute data at different points) and write intermediate files as and when required; this way the partitioned index is wiped away. |
I think this is fixed on dask master. I just failed to reproduce the MRE on dask master with pandas 1.1.4. |
We also just released on Friday, so this may already be fixed in a release. I would make sure to upgrade and see if there’s still an issue. |
I am going to close this, but feel free to reopen or comment if you can still reproduce. |
Has this bug been fixed already? When I do an outer merge, I get this error message: My pandas version is '1.1.5' and my dask version is '2021.03.0'. |
@lampda would suggest raising a new issue with a minimal reproducer |