Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stats on numeric columns #422

Merged
merged 7 commits into from Jun 11, 2019

Conversation

nitlev
Copy link
Contributor

@nitlev nitlev commented Jun 3, 2019

Solves #280

Ensure stats functions other than min/max/count are applied on numeric/boolean columns only by default.

@codecov-io
Copy link

codecov-io commented Jun 3, 2019

Codecov Report

Merging #422 into master will decrease coverage by 0.28%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #422      +/-   ##
==========================================
- Coverage   94.73%   94.44%   -0.29%     
==========================================
  Files          42       43       +1     
  Lines        4746     5025     +279     
==========================================
+ Hits         4496     4746     +250     
- Misses        250      279      +29
Impacted Files Coverage Δ
databricks/koalas/series.py 92.05% <ø> (-1.13%) ⬇️
databricks/koalas/frame.py 94.74% <ø> (-0.85%) ⬇️
databricks/koalas/tests/test_stats.py 100% <100%> (ø) ⬆️
databricks/koalas/generic.py 94.02% <100%> (+0.13%) ⬆️
databricks/koalas/metadata.py 79.54% <0%> (-18.19%) ⬇️
databricks/koalas/namespace.py 90.17% <0%> (-2.78%) ⬇️
databricks/koalas/indexes.py 90.42% <0%> (-0.11%) ⬇️
databricks/koalas/missing/frame.py 100% <0%> (ø) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96906ea...50c6695. Read the comment docs.

@@ -164,7 +164,7 @@ def _init_from_spark(self, sdf, metadata=None):
else:
self._metadata = metadata

def _reduce_for_stat_function(self, sfun):
def _reduce_for_stat_function(self, sfun, numeric_only=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this just default to True rather than None? Then you can remove that line below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would make more sense, but the signature for all stats function on pandas have None as a default value for numeric_only. So I could replicate the conversion to a Boolean value inside all methods that call _reduce_for_stat_function, or keep the conversion inside _reduce_for_stat_function. I'd prefer keeping it inside to remove duplicate code, but I am open to suggestions :-) wdyt ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin Let me know if you want me to change anything :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also argue against None as the default, especially since pandas' way of handling None seems rather unintuitive to me.

When for example running pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}).max(), we get

A    3
B    c
dtype: object

as a result which shows that by default numeric_only should be False (this can also be verified by looking at the code linked to above).

However, numeric_only=True should probably still be used for some aggregations since for example the mean and median of strings are hard to make sense of. In fact, pandas also doesn't output any value for column B when calling .mean() on the DataFrame above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fair. So if I recap, for some aggregations method, the default value should be False, for some it should be True (which is fine i guess). Consequently, there is no reasonable default value for _reduce_for_stat_function, which means the most reasonable choice here is probably to simply remove the default value, and make sure we pass down a value for numeric_only every time we call it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - not having a difficult value seems better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that the default value should be False since ignoring non-numerical values should only happen if explicitly stated. numeric_only=False however, uses all columns of the DataFrame which I'd say is what the user expects without specifying anything.

Before making any changes to your code, I would probably wait for an opinion from @rxin though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @floscha I don't have a strong opinion here. I think we should just follow your suggestion :)

databricks/koalas/series.py Show resolved Hide resolved
databricks/koalas/tests/test_stats.py Outdated Show resolved Hide resolved
ddf = koalas.from_pandas(df)

# min and max do not discard non-numeric columns by default
self.assertEqual(len(ddf.min()), 3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding all assertEqual statements: Maybe it makes more sense to compare the results from the Koalas aggregations with pandas rather than the actual results, such that line 133 could for example be replaced with self.assertEqual(len(kdf.min()), len(pdf.min())). This focuses more on the consistency between both libraries while I believe pandas is pretty well-tested itself so we don't need to write all sorts of tests twice. Maybe @HyukjinKwon also has an opinion about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing this led me to realise that koalas behavior differs from pandas since in koalas we can apply stats function on non-numeric columns; the returned value is just NaN. In pandas, forcing to compute the mean on a string column raises an TypeError. Do we want to keep pandas behavior, and raise an error ?
Personaly, I'd favor the simplest path for now, and would create another issue for this change later on, if needed. Any strong opinion about this ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I agree with comparing to Pandas' in general. Handling NaN is pretty annoying problem currently in Koalas.

forcing to compute the mean on a string column raises

It would be better to match it with Pandas' for now. Creating another issue makes sense to me. We can fix it here too. I'll leave it to you @nitlev

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer pandas' behavior as it keeps the user from producing unintended results by accident. Maybe it's cleaner to move this to a new PR though and merge this one if you approve @HyukjinKwon.

@softagram-bot
Copy link

Softagram Impact Report for pull/422 (head commit: 50c6695)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Give feedback on this report to support@softagram.com

:param sfun: either an 1-arg function that takes a Column and returns a Column, or
Parameters
----------
sfun : either an 1-arg function that takes a Column and returns a Column, or
a 2-arg function that takes a Column and its DataType and returns a Column.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 175 needs to be indented so that it's under sfun.

db = ddf.B
a = pdf.A
b = pdf.B
da = kdf.A
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also rename da and db to ka and kb since adding the k suffix for the Koalas version of pandas variables is what you commonly see in this project.

databricks/koalas/tests/test_stats.py Show resolved Hide resolved
@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jun 11, 2019

Merged. Thanks for thorough review @floscha and addressing them nicely @nitlev.

@HyukjinKwon HyukjinKwon merged commit 8fcb209 into databricks:master Jun 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants