fix groupby.std() with integer column names #5096
Conversation
The exception here is interesting. It looks like there are some versions where this approach has some issues.
Thanks for working on this @NicolasHug! Apologies for the delayed response.
The RuntimeWarning: '<' not supported between instances of 'str' and 'int' error we're seeing here seems to be because in _var_chunk, x still has integer columns (dask/dataframe/groupby.py, line 285 at a53d45e: x = g.sum()), while n and x2 have string columns.
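To make the mismatch concrete, here is a small pandas-only sketch (it mimics the relevant lines rather than calling the dask internals, and the data is made up) using an integer column name:

import pandas as pd

# Mimic the chunk step with an integer-named column.
df = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0]})
by = pd.Series([0, 0, 1, 1])

g = df.groupby(by)
x = g.sum()                                                 # keeps the integer column label 0
n = g.count().rename(columns=lambda c: str(c) + "-count")   # string label '0-count'
x2 = (df ** 2).groupby(by).sum().rename(columns=lambda c: str(c) + "-x2")  # string label '0-x2'

combined = pd.concat([x, x2, n], axis=1)
print(combined.columns.tolist())  # [0, '0-x2', '0-count'] -- int and str labels mixed
# Operations that later sort or compare these mixed labels can raise
# "'<' not supported between instances of 'str' and 'int'".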
On a related note, I suspect we should be able to support the tuples-based solution proposed in #3560 (comment) with the following changes:
diff --git a/dask/dataframe/groupby.py b/dask/dataframe/groupby.py
index f1b03b2a..59ac4955 100644
--- a/dask/dataframe/groupby.py
+++ b/dask/dataframe/groupby.py
@@ -284,11 +284,11 @@ def _var_chunk(df, *index):
     g = _groupby_raise_unaligned(df, by=index)
     x = g.sum()
-    n = g[x.columns].count().rename(columns=lambda c: str(c) + "-count")
+    n = g[x.columns].count().rename(columns=lambda c: (c, "count"))
     df[cols] = df[cols] ** 2
     g2 = _groupby_raise_unaligned(df, by=index)
-    x2 = g2.sum().rename(columns=lambda c: str(c) + "-x2")
+    x2 = g2.sum().rename(columns=lambda c: (c, "x2"))
     x2.index = x.index
     return concat([x, x2, n], axis=1)
@@ -302,8 +302,8 @@ def _var_agg(g, levels, ddof):
     g = g.groupby(level=levels, sort=False).sum()
     nc = len(g.columns)
     x = g[g.columns[: nc // 3]]
-    x2 = g[g.columns[nc // 3 : 2 * nc // 3]].rename(columns=lambda c: c[:-3])
-    n = g[g.columns[-nc // 3 :]].rename(columns=lambda c: c[:-6])
+    x2 = g[g.columns[nc // 3 : 2 * nc // 3]].rename(columns=lambda c: c[0])
+    n = g[g.columns[-nc // 3 :]].rename(columns=lambda c: c[0])
     # TODO: replace with _finalize_var?
     result = x2 - x ** 2 / n
This approach should be a little more robust. It will, for example, avoid the column 0 vs. column '0' issue you mentioned.
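For illustration, a pandas-only sketch (not code from this PR) of why tuple labels sidestep that ambiguity: with string concatenation, an integer column 0 and a string column '0' map to the same derived name, while tuples keep them distinct and let the original label be recovered as c[0].

import pandas as pd

# A frame with an integer column 0 and a string column '0'.
df = pd.DataFrame({0: [1.0, 2.0], "0": [3.0, 4.0]})

str_renamed = df.rename(columns=lambda c: str(c) + "-count")
print(str_renamed.columns.tolist())    # ['0-count', '0-count'] -- the two columns collide

tuple_renamed = df.rename(columns=lambda c: (c, "count"))
print(tuple_renamed.columns.tolist())  # [(0, 'count'), ('0', 'count')] -- still distinct
print([c[0] for c in tuple_renamed.columns])  # [0, '0'] -- original labels recovered

One thing to watch, and possibly the wrinkle with tuples mentioned elsewhere in this thread: pandas may coerce an index of tuple labels into a MultiIndex.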
@NicolasHug do you have time to apply @jrbourbeau's fix from #5096 (review)?
Sorry, I was short on time but I'll get back to it this week!
I think I figured out what was wrong with tuples; let's see what CI says.
df = pd.DataFrame({0: [5], 1: [5]})
ddf = dd.from_pandas(df, npartitions=2)
by = dask.array.from_array([0, 1]).to_dask_dataframe()
ddf.groupby(by).std()
I'm not sure this test is valid. The group key should, I think, be the same length as df.
Can you try something like
df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
ddf = dd.from_pandas(df, npartitions=2)
by = pd.Series([0, 1])
expected = df.groupby(by).std()
result = ddf.groupby(dd.from_pandas(by, 2)).std()
assert_eq(result, expected)
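For reference, assert_eq is dask's dataframe comparison helper; run standalone, the snippet also needs imports along these lines (the exact import path is an assumption):

import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq  # assumed location of dask's comparison helper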
I can confirm that the current test fails on master with the same error as in the original issue.
To make the suggested test pass I need to set the following args:
assert_eq(result, expected, check_meta=False, equal_nan=True)
All entries are NaN, so I'm not sure it makes a lot of sense to compare the results. Let me know which one you prefer.
Thanks. I suppose it's all NaN because the DataFrame is small, so the std calculation returns NaN? Can you make a larger DataFrame?
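For illustration (an example made up for this note, not from the PR): with one row per group, the sample standard deviation (ddof=1, pandas' default) is undefined, so every entry is NaN, while two or more rows per group give finite values.

import pandas as pd

small = pd.DataFrame({0: [0, 1], 1: [1, 2]})
print(small.groupby(pd.Series([0, 1])).std())         # one row per group -> all NaN

larger = pd.DataFrame({0: [0, 1, 2, 3], 1: [1, 2, 4, 8]})
print(larger.groupby(pd.Series([0, 0, 1, 1])).std())  # two rows per group -> finite values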
That said, we may be going past the original issue now. Remind me, that issue was specifically about column names, which this test exercises?
Yes, the original issue was about the column name mismatch, not about the expected results. Do you still need me to adapt your test?
Nope, I think we're good. Sorry about the misunderstanding.
Co-Authored-By: James Bourbeau <jrbourbeau@users.noreply.github.com>
thanks @NicolasHug!
Thanks for working on this and pushing it through @NicolasHug!
Fixes #3560
I tried using the tuples solution proposed in #3560 (comment) but pandas wasn't happy.
Note that the solution in this PR might cause collisions if we have a column 0 and another column '0'.