New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stats on numeric columns #422
Stats on numeric columns #422
Conversation
Codecov Report
@@ Coverage Diff @@
## master #422 +/- ##
==========================================
- Coverage 94.73% 94.44% -0.29%
==========================================
Files 42 43 +1
Lines 4746 5025 +279
==========================================
+ Hits 4496 4746 +250
- Misses 250 279 +29
Continue to review full report at Codecov.
|
databricks/koalas/frame.py
Outdated
@@ -164,7 +164,7 @@ def _init_from_spark(self, sdf, metadata=None): | |||
else: | |||
self._metadata = metadata | |||
|
|||
def _reduce_for_stat_function(self, sfun): | |||
def _reduce_for_stat_function(self, sfun, numeric_only=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this just default to True rather than None? Then you can remove that line below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it would make more sense, but the signature for all stats function on pandas have None
as a default value for numeric_only
. So I could replicate the conversion to a Boolean value inside all methods that call _reduce_for_stat_function
, or keep the conversion inside _reduce_for_stat_function
. I'd prefer keeping it inside to remove duplicate code, but I am open to suggestions :-) wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin Let me know if you want me to change anything :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also argue against None
as the default, especially since pandas' way of handling None
seems rather unintuitive to me.
When for example running pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}).max()
, we get
A 3
B c
dtype: object
as a result which shows that by default numeric_only
should be False
(this can also be verified by looking at the code linked to above).
However, numeric_only=True
should probably still be used for some aggregations since for example the mean and median of strings are hard to make sense of. In fact, pandas also doesn't output any value for column B when calling .mean()
on the DataFrame above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fair. So if I recap, for some aggregations method, the default value should be False
, for some it should be True
(which is fine i guess). Consequently, there is no reasonable default value for _reduce_for_stat_function
, which means the most reasonable choice here is probably to simply remove the default value, and make sure we pass down a value for numeric_only
every time we call it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - not having a difficult value seems better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue that the default value should be False
since ignoring non-numerical values should only happen if explicitly stated. numeric_only=False
however, uses all columns of the DataFrame which I'd say is what the user expects without specifying anything.
Before making any changes to your code, I would probably wait for an opinion from @rxin though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @floscha I don't have a strong opinion here. I think we should just follow your suggestion :)
ddf = koalas.from_pandas(df) | ||
|
||
# min and max do not discard non-numeric columns by default | ||
self.assertEqual(len(ddf.min()), 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding all assertEqual
statements: Maybe it makes more sense to compare the results from the Koalas aggregations with pandas rather than the actual results, such that line 133 could for example be replaced with self.assertEqual(len(kdf.min()), len(pdf.min()))
. This focuses more on the consistency between both libraries while I believe pandas is pretty well-tested itself so we don't need to write all sorts of tests twice. Maybe @HyukjinKwon also has an opinion about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing this led me to realise that koalas behavior differs from pandas since in koalas we can apply stats function on non-numeric columns; the returned value is just NaN
. In pandas, forcing to compute the mean
on a string column raises an TypeError
. Do we want to keep pandas behavior, and raise an error ?
Personaly, I'd favor the simplest path for now, and would create another issue for this change later on, if needed. Any strong opinion about this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I agree with comparing to Pandas' in general. Handling NaN is pretty annoying problem currently in Koalas.
forcing to compute the mean on a string column raises
It would be better to match it with Pandas' for now. Creating another issue makes sense to me. We can fix it here too. I'll leave it to you @nitlev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also prefer pandas' behavior as it keeps the user from producing unintended results by accident. Maybe it's cleaner to move this to a new PR though and merge this one if you approve @HyukjinKwon.
Softagram Impact Report for pull/422 (head commit: 50c6695)⭐ Change Overview
📄 Full report
Give feedback on this report to support@softagram.com |
:param sfun: either an 1-arg function that takes a Column and returns a Column, or | ||
Parameters | ||
---------- | ||
sfun : either an 1-arg function that takes a Column and returns a Column, or | ||
a 2-arg function that takes a Column and its DataType and returns a Column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 175 needs to be indented so that it's under sfun
.
db = ddf.B | ||
a = pdf.A | ||
b = pdf.B | ||
da = kdf.A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also rename da
and db
to ka
and kb
since adding the k
suffix for the Koalas version of pandas variables is what you commonly see in this project.
Solves #280
Ensure stats functions other than min/max/count are applied on numeric/boolean columns only by default.