Random values/wrong type when grouping a filtered DataFrame by an unfiltered column #1876
Thanks for the error report @p-himik! Can I ask you to do a bit more and provide a reproducible example? Ideally this would be simple code that someone could copy-paste to easily get this same error.
import numpy as np
import pandas as pd
import dask.dataframe as df

NAMES = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank', 'George',
         'Hannah', 'Ingrid', 'Jerry', 'Kevin', 'Laura', 'Michael', 'Norbert', 'Oliver',
         'Patricia', 'Quinn', 'Ray', 'Sarah', 'Tim', 'Ursula', 'Victor', 'Wendy',
         'Xavier', 'Yvonne', 'Zelda']


def account_params(k):
    ids = np.arange(k, dtype=int)
    names2 = np.random.choice(NAMES, size=k, replace=True)
    wealth_mag = np.random.exponential(100, size=k)
    wealth_trend = np.random.normal(10, 10, size=k)
    freq = np.random.exponential(size=k)
    freq /= freq.sum()
    return ids, names2, wealth_mag, wealth_trend, freq


def account_entries(n, ids, names, wealth_mag, wealth_trend, freq):
    indices = np.random.choice(ids, size=n, replace=True, p=freq)
    amounts = (np.random.normal(size=n) + wealth_trend[indices]) * wealth_mag[indices]
    return pd.DataFrame({'id': indices,
                         'names': names[indices],
                         'amount': amounts.astype('i4')},
                        columns=['id', 'names', 'amount'])


def accounts_dataframes(num_dfs, n, k):
    args = account_params(k)
    return [account_entries(n, *args) for _ in range(num_dfs)]


# interesting - n=1000, k=50 does produce the correct result for each invocation
p_dfs = accounts_dataframes(3, 1000000, 500)
d_df = df.concat([df.from_pandas(d, npartitions=1, sort=False) for d in p_dfs])

p_df = d_df.compute()
p_df.index = range(len(p_df))

print(p_df[p_df.amount < 0].groupby(p_df.id).amount.mean().head())
print(d_df[d_df.amount < 0].groupby(d_df.id).amount.mean().head())
print(d_df[d_df.amount < 0].groupby(d_df[d_df.amount < 0].id).amount.mean().head())
Apologies for letting this sit so long. This appears to be a bug in pandas, see pandas-dev/pandas#15244. The issue seems isolated to grouping by unaligned indices. In the long run, it'd be best to get this fixed in pandas. For now, we may be able to avoid this ourselves by aligning before the call to groupby. I'll look into this.
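For illustration, here is a minimal pandas-level sketch of what "aligning before the call to groupby" could look like. The variable names are hypothetical and this is not the actual dask change:

import pandas as pd

# Hypothetical example: a filtered frame and an unfiltered key whose indices differ.
frame = pd.DataFrame({'id': [1, 1, 2, 2], 'amount': [-5, 3, -7, 4]})
filtered = frame[frame.amount < 0]   # index: [0, 2]
key = frame['id']                    # index: [0, 1, 2, 3] -- not aligned with `filtered`

# Align the key to the filtered frame's index before grouping, so pandas
# never has to perform the alignment inside groupby itself.
aligned_key = key.reindex(filtered.index)
print(filtered.groupby(aligned_key).amount.mean())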
Pandas allows grouping by a key that isn't aligned with the grouped DataFrame/Series. If this happens, then groupby first aligns the indices before performing the grouping. Unfortunately this operation isn't threadsafe, and can lead to incorrect results. Since grouping by an unaligned key isn't recommended, we just error loudly in these cases. Fixes dask#1876.
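A rough sketch of the kind of check described above (a hypothetical helper, not the actual dask code):

import pandas as pd

def check_aligned(frame, key):
    # Hypothetical helper: raise instead of letting groupby silently align,
    # since that alignment is not threadsafe and can give wrong results.
    if isinstance(key, pd.Series) and not frame.index.equals(key.index):
        raise ValueError("Grouping by an unaligned key is not supported; "
                         "filter the key the same way as the frame first.")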
Actually, using …
The issue was discovered while working through the 03a-DataFrame.ipynb notebook of the tutorial, where we have a Dask DataFrame df. As you can see, for some reason in Dask, if you group a filtered DataFrame by an unfiltered column, you don't receive any ValueError; instead you end up with a float64 column type instead of int64 for the groupby column, and the values are very different from the ones obtained with pandas. In fact, they are different each time you run the code.
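For reference, a minimal sketch of a workaround consistent with the last print statement in the reproducer above: filter the key the same way as the frame, or group by column name so no separate, unaligned key series is involved (using the d_df from the example):

# Both of these avoid grouping by an unaligned key:
negative = d_df[d_df.amount < 0]

# 1. Group by column name -- the key is taken from the already-filtered frame.
print(negative.groupby('id').amount.mean().head())

# 2. Group by the filtered column explicitly, as in the last print above.
print(negative.groupby(negative.id).amount.mean().head())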