Stats on numeric columns #422

nitlev · 2019-06-03T20:36:37Z

Solves #280

Ensure stats functions other than min/max/count are applied on numeric/boolean columns only by default.

codecov-io · 2019-06-03T20:46:30Z

Codecov Report

Merging #422 into master will decrease coverage by 0.28%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #422      +/-   ##
==========================================
- Coverage   94.73%   94.44%   -0.29%     
==========================================
  Files          42       43       +1     
  Lines        4746     5025     +279     
==========================================
+ Hits         4496     4746     +250     
- Misses        250      279      +29

| Impacted Files | Coverage Δ | |
|---|---|---|
| databricks/koalas/series.py | `92.05% <ø> (-1.13%)` | ⬇️ |
| databricks/koalas/frame.py | `94.74% <ø> (-0.85%)` | ⬇️ |
| databricks/koalas/tests/test_stats.py | `100% <100%> (ø)` | ⬆️ |
| databricks/koalas/generic.py | `94.02% <100%> (+0.13%)` | ⬆️ |
| databricks/koalas/metadata.py | `79.54% <0%> (-18.19%)` | ⬇️ |
| databricks/koalas/namespace.py | `90.17% <0%> (-2.78%)` | ⬇️ |
| databricks/koalas/indexes.py | `90.42% <0%> (-0.11%)` | ⬇️ |
| databricks/koalas/missing/frame.py | `100% <0%> (ø)` | ⬆️ |

... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 96906ea...50c6695.

rxin · 2019-06-04T09:25:55Z

databricks/koalas/frame.py

@@ -164,7 +164,7 @@ def _init_from_spark(self, sdf, metadata=None):
        else:
            self._metadata = metadata

-    def _reduce_for_stat_function(self, sfun):
+    def _reduce_for_stat_function(self, sfun, numeric_only=None):


Should this just default to True rather than None? Then you could remove that line below.

I agree that it would make more sense, but the signatures of all stats functions in pandas have None as the default value for numeric_only. So I could either replicate the conversion to a Boolean value inside every method that calls _reduce_for_stat_function, or keep the conversion inside _reduce_for_stat_function. I'd prefer keeping it inside to avoid duplicate code, but I am open to suggestions :-) wdyt?

@rxin Let me know if you want me to change anything :-)

I would also argue against None as the default, especially since pandas' way of handling None seems rather unintuitive to me.

When running, for example, pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}).max(), we get

    A    3
    B    c
    dtype: object

as a result, which shows that the default for numeric_only should be False (this can also be verified by looking at the code linked to above).

However, numeric_only=True should probably still be used for some aggregations since for example the mean and median of strings are hard to make sense of. In fact, pandas also doesn't output any value for column B when calling .mean() on the DataFrame above.
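To make the difference concrete, here is a small pandas-only sketch. It passes numeric_only explicitly wherever columns should be dropped, because the implicit default has not been the same across pandas releases; the variable names are purely illustrative.

```python
import pandas as pd

pdf = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})

# max is well defined for strings (lexicographic order), so both columns survive.
max_all = pdf.max()

# Restricting to numeric columns drops B before aggregating.
max_numeric = pdf.max(numeric_only=True)
mean_numeric = pdf.mean(numeric_only=True)

print(list(max_all.index))      # columns kept by the plain max()
print(list(max_numeric.index))  # columns kept when numeric_only=True
print(mean_numeric["A"])        # mean of column A
```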

Seems fair. So to recap: for some aggregation methods the default value should be False, and for others it should be True (which is fine, I guess). Consequently, there is no reasonable default value for _reduce_for_stat_function, which means the most reasonable choice here is probably to simply remove the default value and make sure we pass down a value for numeric_only every time we call it.
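A rough sketch of what this proposal could look like, using plain pandas as a stand-in for Spark so that it runs anywhere. The names reduce_for_stat_function, frame_max and frame_mean are hypothetical and this is not the code that was merged; it only illustrates a shared helper with no default, where each public method picks its own.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def reduce_for_stat_function(pdf, func, numeric_only):
    """Apply func column by column; numeric_only deliberately has no default."""
    result = {}
    for name in pdf.columns:
        col = pdf[name]
        is_numeric = is_numeric_dtype(col) or is_bool_dtype(col)
        if numeric_only and not is_numeric:
            continue  # skip strings, dates, etc. when asked to
        result[name] = func(col)
    return pd.Series(result, dtype=object)


def frame_max(pdf, numeric_only=False):
    # max also makes sense for non-numeric columns, so keep them by default.
    return reduce_for_stat_function(pdf, lambda col: col.max(), numeric_only=numeric_only)


def frame_mean(pdf, numeric_only=True):
    # mean is only meaningful for numeric/boolean columns.
    return reduce_for_stat_function(pdf, lambda col: col.mean(), numeric_only=numeric_only)


pdf = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
print(frame_max(pdf))   # keeps both A and B
print(frame_mean(pdf))  # keeps only A
```

Because the helper has no default, forgetting to pass numeric_only fails immediately with a TypeError instead of silently choosing a behavior.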

I agree - not having a default value seems better.

I'd argue that the default value should be False, since ignoring non-numerical values should only happen if explicitly requested. numeric_only=False, however, uses all columns of the DataFrame, which I'd say is what the user expects when nothing is specified.

Before making any changes to your code, I would probably wait for an opinion from @rxin though.

Sorry @floscha I don't have a strong opinion here. I think we should just follow your suggestion :)

floscha · 2019-06-07T22:12:53Z

databricks/koalas/tests/test_stats.py

+        ddf = koalas.from_pandas(df)
+
+        # min and max do not discard non-numeric columns by default
+        self.assertEqual(len(ddf.min()), 3)


Regarding all assertEqual statements: maybe it makes more sense to compare the results of the Koalas aggregations with those of pandas rather than with hard-coded values, so that line 133 could, for example, be replaced with self.assertEqual(len(kdf.min()), len(pdf.min())). This focuses more on the consistency between both libraries, and since I believe pandas is pretty well tested itself, we don't need to write all sorts of tests twice. Maybe @HyukjinKwon also has an opinion about this?
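A minimal sketch of such a comparison-style test follows. To keep it runnable without Spark, from_pandas here is a stand-in that just copies the pandas DataFrame; in the real test suite it would be koalas.from_pandas, and the class and method names are made up for illustration.

```python
import unittest

import pandas as pd


def from_pandas(pdf):
    # Stand-in for koalas.from_pandas (assumption: lets the sketch run without Spark).
    return pdf.copy()


class StatsConsistencyTest(unittest.TestCase):
    def test_min_max_match_pandas(self):
        pdf = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
        kdf = from_pandas(pdf)

        # Compare against pandas instead of hard-coding the expected length.
        for stat in ("min", "max"):
            with self.subTest(stat=stat):
                expected = getattr(pdf, stat)()
                actual = getattr(kdf, stat)()
                self.assertEqual(len(actual), len(expected))
```

If pandas ever changes which columns an aggregation keeps, a test written this way follows along without being edited.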

Fixing this led me to realise that Koalas' behavior differs from pandas', since in Koalas we can apply stats functions to non-numeric columns; the returned value is just NaN. In pandas, forcing to compute the mean on a string column raises a TypeError. Do we want to keep pandas' behavior and raise an error?
Personally, I'd favor the simplest path for now and would create another issue for this change later on, if needed. Any strong opinion about this?

Yeah, I agree with comparing against pandas in general. Handling NaN is a pretty annoying problem in Koalas currently.

> forcing to compute the mean on a string column raises

It would be better to match pandas' behavior for now. Creating another issue makes sense to me. We can fix it here too. I'll leave it to you, @nitlev.

I also prefer pandas' behavior as it keeps the user from producing unintended results by accident. Maybe it's cleaner to move this to a new PR though and merge this one if you approve @HyukjinKwon.
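For reference, the pandas behavior under discussion can be reproduced with the short sketch below. The checked_mean helper is hypothetical: it only shows one way a library could raise instead of returning NaN, and it is not taken from Koalas.

```python
import pandas as pd

strings = pd.Series(["a", "b", "c"], name="B")

# pandas refuses to average strings and raises a TypeError.
try:
    strings.mean()
except TypeError as exc:
    outcome = f"TypeError: {exc}"
else:
    outcome = "no error"
print(outcome)


def checked_mean(col):
    """Hypothetical guard: refuse non-numeric input instead of returning NaN."""
    if not pd.api.types.is_numeric_dtype(col):
        raise TypeError(f"cannot compute the mean of non-numeric column {col.name!r}")
    return col.mean()
```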

softagram-bot · 2019-06-08T00:05:02Z

Softagram Impact Report for pull/422 (head commit: `50c6695`)

floscha · 2019-06-08T07:16:17Z

databricks/koalas/frame.py

-        :param sfun: either an 1-arg function that takes a Column and returns a Column, or
+        Parameters
+        ----------
+        sfun : either an 1-arg function that takes a Column and returns a Column, or
        a 2-arg function that takes a Column and its DataType and returns a Column.


Line 175 needs to be indented so that it's under sfun.
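As an illustration of the layout being asked for, here is a stub whose docstring follows the numpydoc convention, with the continuation line indented under the parameter name. The function body and the numeric_only description are placeholders, not the project's actual text.

```python
def reduce_for_stat_function(sfun, numeric_only):
    """
    Apply sfun to the columns of a frame (stub used only to show docstring layout).

    Parameters
    ----------
    sfun : either a 1-arg function that takes a Column and returns a Column, or
        a 2-arg function that takes a Column and its DataType and returns a Column.
    numeric_only : bool
        Placeholder description: whether to restrict the computation to
        numeric and boolean columns.
    """
    raise NotImplementedError("layout example only")
```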

floscha · 2019-06-08T07:18:26Z

databricks/koalas/tests/test_stats.py

-            db = ddf.B
+            a = pdf.A
+            b = pdf.B
+            da = kdf.A


You could also rename da and db to ka and kb, since adding the k prefix to the Koalas version of pandas variables is what you commonly see in this project.

HyukjinKwon · 2019-06-11T11:05:13Z

Merged. Thanks for the thorough review, @floscha, and for addressing the comments nicely, @nitlev.

Veltin Dupont added 5 commits June 3, 2019 13:47

first naive implementation

10c4fbf

refine implementation

5fc7555

fix linter

56694b4

remove unrelated change

da46331

fix count method

31ecdf8

nitlev mentioned this pull request Jun 4, 2019

mean/kurt/var/std/skew should apply on numeric columns by default #280

Closed

rxin reviewed Jun 4, 2019

review: change numeric_only default values

b4ea17d

floscha suggested changes Jun 7, 2019

review: fixed docstring formats and renamed some variables

50c6695

floscha suggested changes Jun 8, 2019

HyukjinKwon approved these changes Jun 11, 2019

HyukjinKwon merged commit 8fcb209 into databricks:master Jun 11, 2019

rxin Jun 4, 2019

nitlev Jun 4, 2019

nitlev Jun 5, 2019

floscha Jun 5, 2019

nitlev Jun 6, 2019

rxin Jun 6, 2019

floscha Jun 6, 2019

rxin Jun 6, 2019

floscha Jun 7, 2019

nitlev Jun 9, 2019

HyukjinKwon Jun 11, 2019

floscha Jun 11, 2019
