
fix groupby.std() with integer column names #5096

Merged
merged 7 commits into dask:master on Aug 12, 2019

Conversation

@NicolasHug
Contributor

commented Jul 13, 2019

  • Tests added / passed
  • Passes black dask / flake8 dask

Fixes #3560

I tried using the tuples solution proposed in #3560 (comment) but pandas wasn't happy.

Note that the solution in this PR might cause collisions if we have a column 0 and another column '0'.
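As an editorial aside, a minimal sketch (mine, not from the PR) of the collision being described: with both an int `0` and a str `'0'` column, the `str(c) + "-count"` rename maps both labels to the same string.

```python
import pandas as pd

# Hypothetical illustration of the collision: dict keys 0 (int) and "0" (str)
# are distinct column labels, but stringifying them makes them identical.
df = pd.DataFrame({0: [1], "0": [2]})
renamed = df.rename(columns=lambda c: str(c) + "-count")
print(list(renamed.columns))  # ['0-count', '0-count'] -- duplicate labels
```

pandas allows duplicate column labels, so this would not raise immediately; the two columns just become indistinguishable downstream.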

NicolasHug added 5 commits Jul 13, 2019
@mrocklin

Member

commented Jul 13, 2019

The exception here is interesting. It looks like this approach has issues on some versions.

@jrbourbeau
Member

left a comment

Thanks for working on this @NicolasHug! Apologies for the delayed response.

The RuntimeWarning: '<' not supported between instances of 'str' and 'int' error we're seeing here seems to be because, in _var_chunk, x still has integer columns:

x = g.sum()

while n and x2 have string columns.
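A minimal sketch (my reconstruction, not from the PR) of the mixed label types that result: x keeps its integer labels while the renamed frames carry strings, so the concatenated chunk ends up with a mixed int/str column index, which later label comparisons can choke on.

```python
import pandas as pd

# x keeps integer column labels (as g.sum() leaves them), while the
# renamed count frame gets string labels, so the concatenated chunk
# carries a mixed int/str column index.
x = pd.DataFrame({0: [1.0]})            # integer label
n = pd.DataFrame({"0-count": [1]})      # string label from the rename
chunk = pd.concat([x, n], axis=1)
print(list(chunk.columns))  # [0, '0-count'] -- mixed label types
```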

On a related note, I suspect we should be able to support the tuples-based solution proposed in #3560 (comment) with the following changes:

diff --git a/dask/dataframe/groupby.py b/dask/dataframe/groupby.py
index f1b03b2a..59ac4955 100644
--- a/dask/dataframe/groupby.py
+++ b/dask/dataframe/groupby.py
@@ -284,11 +284,11 @@ def _var_chunk(df, *index):
     g = _groupby_raise_unaligned(df, by=index)
     x = g.sum()

-    n = g[x.columns].count().rename(columns=lambda c: str(c) + "-count")
+    n = g[x.columns].count().rename(columns=lambda c: (c, "count"))

     df[cols] = df[cols] ** 2
     g2 = _groupby_raise_unaligned(df, by=index)
-    x2 = g2.sum().rename(columns=lambda c: str(c) + "-x2")
+    x2 = g2.sum().rename(columns=lambda c: (c, "x2"))

     x2.index = x.index
     return concat([x, x2, n], axis=1)
@@ -302,8 +302,8 @@ def _var_agg(g, levels, ddof):
     g = g.groupby(level=levels, sort=False).sum()
     nc = len(g.columns)
     x = g[g.columns[: nc // 3]]
-    x2 = g[g.columns[nc // 3 : 2 * nc // 3]].rename(columns=lambda c: c[:-3])
-    n = g[g.columns[-nc // 3 :]].rename(columns=lambda c: c[:-6])
+    x2 = g[g.columns[nc // 3 : 2 * nc // 3]].rename(columns=lambda c: c[0])
+    n = g[g.columns[-nc // 3 :]].rename(columns=lambda c: c[0])

     # TODO: replace with _finalize_var?
     result = x2 - x ** 2 / n

This approach should be a little more robust. It will, for example, avoid the column 0 vs. column '0' issue you mentioned.
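As a sketch of why the tuple labels avoid that collision (my example, not part of the diff): the original label is kept intact as the first tuple element, so an int 0 and a str '0' stay distinct.

```python
import pandas as pd

# Tuple labels keep the original column label (int or str) unchanged
# as the first element, so 0 and "0" remain distinguishable.
df = pd.DataFrame({0: [1], "0": [2]})
counted = df.rename(columns=lambda c: (c, "count"))
print(list(counted.columns))  # [(0, 'count'), ('0', 'count')]
```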

@TomAugspurger

Member

commented Aug 8, 2019

@NicolasHug do you have time to apply @jrbourbeau's fix from #5096 (review)?

@NicolasHug

Contributor Author

commented Aug 8, 2019

Sorry, I was short on time, but I'll get back to it this week!

@NicolasHug

Contributor Author

commented Aug 9, 2019

I think I figured out what was wrong with tuples; let's see what CI says:

import pandas as pd
import dask.array
import dask.dataframe as dd

df = pd.DataFrame({0: [5], 1: [5]})
ddf = dd.from_pandas(df, npartitions=2)
by = dask.array.from_array([0, 1]).to_dask_dataframe()
ddf.groupby(by).std()

@TomAugspurger

TomAugspurger Aug 9, 2019

Member

I'm not sure this test is valid. I think the group key should be the same length as df.

Can you try something like

import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
ddf = dd.from_pandas(df, npartitions=2)
by = pd.Series([0, 1])

expected = df.groupby(by).std()
result = ddf.groupby(dd.from_pandas(by, 2)).std()
assert_eq(result, expected)

@NicolasHug

NicolasHug Aug 12, 2019

Author Contributor

I can confirm that the current test fails on master with the same error as in the original issue.

To make the suggested test pass I need to set the following args:

assert_eq(result, expected, check_meta=False, equal_nan=True)

All entries are NaN, so I'm not sure it makes a lot of sense to compare the results. Let me know which one you prefer.

@TomAugspurger

TomAugspurger Aug 12, 2019

Member

Thanks. I suppose it's all NaN because the DataFrame is small, so the std calculation returns NaN? Can you make a larger DataFrame?
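For context, a quick check (my sketch, assuming pandas' default ddof=1) that single-row groups explain the NaNs in the suggested test:

```python
import pandas as pd

# With the suggested two-row frame, each group key appears exactly once,
# and the sample std (ddof=1) of a single observation is NaN.
df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
out = df.groupby(pd.Series([0, 1])).std()
print(out.isna().all().all())  # True -- every entry is NaN
```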

@TomAugspurger

TomAugspurger Aug 12, 2019

Member

That said, we may be going past the original issue now. Remind me: that issue was specifically about column names, which this test exercises?

@NicolasHug

NicolasHug Aug 12, 2019

Author Contributor

Yes, the original issue was about a column name mismatch, not about the expected results. Do you still need me to adapt your test?

@TomAugspurger

TomAugspurger Aug 12, 2019

Member

Nope, I think we're good. Sorry about the misunderstanding.

Update dask/dataframe/groupby.py
Co-Authored-By: James Bourbeau <jrbourbeau@users.noreply.github.com>

@TomAugspurger TomAugspurger merged commit e027912 into dask:master Aug 12, 2019

2 checks passed

continuous-integration/appveyor/pr — AppVeyor build succeeded
continuous-integration/travis-ci/pr — The Travis CI build passed
@TomAugspurger

Member

commented Aug 12, 2019

Thanks, @NicolasHug!

@jrbourbeau

Member

commented Aug 12, 2019

Thanks for working on this and pushing it through @NicolasHug!
