
fix groupby.std() with integer column names #5096

Merged
merged 7 commits into dask:master on Aug 12, 2019

Conversation

@NicolasHug (Contributor) commented Jul 13, 2019

  • Tests added / passed
  • Passes black dask / flake8 dask

Fixes #3560

I tried using the tuples solution proposed in #3560 (comment) but pandas wasn't happy.

Note that the solution in this PR might cause collisions if we have a column 0 and another column '0'.
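A minimal sketch of that collision, assuming the temporary names are built with str()-based suffixes as in _var_chunk:

# Hypothetical illustration: str() erases the difference between an integer
# column 0 and a string column '0', so both would get the same temporary name.
str(0) + "-count"    # -> '0-count'
str("0") + "-count"  # -> '0-count', identical to the line above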

@mrocklin (Member) commented Jul 13, 2019

The exception here is interesting. It looks like there are some versions where this approach has some issues.

@jrbourbeau (Member) left a comment

Thanks for working on this @NicolasHug! Apologies for the delayed response.

The RuntimeWarning: '<' not supported between instances of 'str' and 'int' we're seeing here seems to be because, in _var_chunk, x still has integer columns:

x = g.sum()

while n and x2 have string columns.
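A rough pandas-only illustration of how mixing label types can trigger that comparison (hypothetical frames, not dask's actual intermediates):

import pandas as pd

# x keeps the original integer labels, while x2 gets str()-suffixed labels,
# so putting them side by side yields a column index mixing ints and strs.
x = pd.DataFrame({0: [1.0], 1: [2.0]})
x2 = pd.DataFrame({"0-x2": [1.0], "1-x2": [4.0]})
combined = pd.concat([x, x2], axis=1)
print(list(combined.columns))   # [0, 1, '0-x2', '1-x2']
# Anything that then needs to sort these labels must compare an int with a
# str; depending on the pandas version that raises a TypeError or surfaces
# the RuntimeWarning quoted above.
combined.sort_index(axis=1)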

On a related note, I suspect we should be able to support the tuples-based solution proposed in #3560 (comment) with the following changes:

diff --git a/dask/dataframe/groupby.py b/dask/dataframe/groupby.py
index f1b03b2a..59ac4955 100644
--- a/dask/dataframe/groupby.py
+++ b/dask/dataframe/groupby.py
@@ -284,11 +284,11 @@ def _var_chunk(df, *index):
     g = _groupby_raise_unaligned(df, by=index)
     x = g.sum()

-    n = g[x.columns].count().rename(columns=lambda c: str(c) + "-count")
+    n = g[x.columns].count().rename(columns=lambda c: (c, "count"))

     df[cols] = df[cols] ** 2
     g2 = _groupby_raise_unaligned(df, by=index)
-    x2 = g2.sum().rename(columns=lambda c: str(c) + "-x2")
+    x2 = g2.sum().rename(columns=lambda c: (c, "x2"))

     x2.index = x.index
     return concat([x, x2, n], axis=1)
@@ -302,8 +302,8 @@ def _var_agg(g, levels, ddof):
     g = g.groupby(level=levels, sort=False).sum()
     nc = len(g.columns)
     x = g[g.columns[: nc // 3]]
-    x2 = g[g.columns[nc // 3 : 2 * nc // 3]].rename(columns=lambda c: c[:-3])
-    n = g[g.columns[-nc // 3 :]].rename(columns=lambda c: c[:-6])
+    x2 = g[g.columns[nc // 3 : 2 * nc // 3]].rename(columns=lambda c: c[0])
+    n = g[g.columns[-nc // 3 :]].rename(columns=lambda c: c[0])

     # TODO: replace with _finalize_var?
     result = x2 - x ** 2 / n

This approach should be a little more robust. It will, for example, avoid the column 0 vs. column '0' issue you mentioned.
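A quick pandas-level sketch of why the tuple labels round-trip cleanly (illustrative only, not the dask code path):

import pandas as pd

# Tuple labels keep an integer 0 and a string '0' distinct, and the original
# label is recovered exactly with c[0] -- no str() round trip involved.
df = pd.DataFrame({0: [1, 2], "0": [3, 4]})
counts = [(c, "count") for c in df.columns]
print(counts)                  # [(0, 'count'), ('0', 'count')] -- still distinct
print([c[0] for c in counts])  # [0, '0'] -- original labels, types preserved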

@TomAugspurger (Member) commented Aug 8, 2019

@NicolasHug do you have time to apply @jrbourbeau's fix from #5096 (review)?

@NicolasHug (Contributor, Author) commented Aug 8, 2019

Sorry I was short on time but I'll get back to it this week!

@NicolasHug (Contributor, Author) commented Aug 9, 2019

I think I figured out what was wrong with tuples; let's see what CI says:

import dask.array
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({0: [5], 1: [5]})
ddf = dd.from_pandas(df, npartitions=2)
by = dask.array.from_array([0, 1]).to_dask_dataframe()
ddf.groupby(by).std()

@TomAugspurger (Member) Aug 9, 2019

I'm not sure this test is valid. I think the group key should be the same length as df.

Can you try something like

import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
ddf = dd.from_pandas(df, npartitions=2)
by = pd.Series([0, 1])

expected = df.groupby(by).std()
result = ddf.groupby(dd.from_pandas(by, 2)).std()
assert_eq(result, expected)

@NicolasHug (Contributor, Author) Aug 12, 2019

I can confirm that the current test fails on master with the same error as in the original issue.

To make the suggested test pass, I need to set the following args:

assert_eq(result, expected, check_meta=False, equal_nan=True)

All entries are NaN, so I'm not sure it makes a lot of sense to compare the results. Let me know which one you prefer.

@TomAugspurger (Member) Aug 12, 2019

Thanks. I suppose it's all NaN because the DataFrame is small, so the std calculation returns NaN? Can you make a larger DataFrame?
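For context, a small pandas check of why the two-row test comes out all NaN: each group gets a single row, and the sample standard deviation (ddof=1) of one observation is undefined.

import pandas as pd

df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
by = pd.Series([0, 1])          # two groups, one row each
print(df.groupby(by).std())     # every entry is NaN
print(pd.Series([5]).std())     # nan -- ddof=1 needs at least two observations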

@TomAugspurger (Member) Aug 12, 2019

That said, we may be going past the original issue now. Remind me, that issue was specifically about column names, which this test exercises?

@NicolasHug (Contributor, Author) Aug 12, 2019

Yes, the original issue was about the column name mismatch, not about the expected results. Do you still need me to adapt your test?

@TomAugspurger (Member) Aug 12, 2019

Nope, I think we're good. Sorry about the misunderstanding.

@TomAugspurger merged commit e027912 into dask:master on Aug 12, 2019
2 checks passed

@TomAugspurger (Member) commented Aug 12, 2019

Thanks @NicolasHug!

@jrbourbeau (Member) commented Aug 12, 2019

Thanks for working on this and pushing it through @NicolasHug!
