remove repeated sorting in describe dict #1682
Codecov Report
@@           Coverage Diff           @@
##             main    #1682   +/-   ##
=======================================
  Coverage   98.79%   98.80%
=======================================
  Files          98       98
  Lines       11809    11855   +46
=======================================
+ Hits        11667    11713   +46
  Misses        142      142
Can we get perf tests for this?
Yeah sure, is there a place in woodwork where I should put a performance test? Or should I just test it locally and post the results here?
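For a quick local check, something like the following sketch could work. It is not woodwork code: `interpolate`, `quartiles_resorting`, and `quartiles_sorting_once` are hypothetical stand-ins that only illustrate the idea in the PR title (sort once, then reuse the sorted values), timed with the standard library's `timeit`.

```python
import random
import timeit


def interpolate(sorted_values, percent):
    # Hypothetical helper: linear interpolation between the two closest ranks.
    position = (len(sorted_values) - 1) * percent
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def quartiles_resorting(values):
    # Sorts the data again for every quantile (the pattern being removed).
    return [interpolate(sorted(values), p) for p in (0.25, 0.5, 0.75)]


def quartiles_sorting_once(values):
    # Sorts once and reuses the result for every quantile.
    ordered = sorted(values)
    return [interpolate(ordered, p) for p in (0.25, 0.5, 0.75)]


random.seed(0)
data = [random.random() for _ in range(10_000)]

resort_seconds = timeit.timeit(lambda: quartiles_resorting(data), number=20)
once_seconds = timeit.timeit(lambda: quartiles_sorting_once(data), number=20)
print(f"re-sorting: {resort_seconds:.4f}s, sorting once: {once_seconds:.4f}s")
```

Both functions return the same quartiles, so any timing difference comes only from the extra sorts. Real numbers would need to come from running `describe()` on representative data.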
…into speed_up_describe_dict
LGTM
Can we get a test that confirms that if nans are present, the current way gets used? I'm thinking you could mock percentile and check if that gets called or not.
"""
k = (count - 1) * percent
f = math.floor(k)
c = math.ceil(k)
Could we get some more descriptive variable names and maybe some comments outlining what each step in the percentile calculation is? Or, since this seems to match the scipy implementation (from this stack overflow answer), maybe just linking to that would be enough.
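As a rough illustration of what more descriptive names could look like: only the three lines in the diff above come from the PR. The function name `percentile_sorted`, its parameter names, and the interpolation step after those lines are assumptions based on the usual "linear interpolation between closest ranks" approach, not the PR's actual code.

```python
import math


def percentile_sorted(sorted_values, percent, count):
    """Sketch of a percentile over already-sorted values (names are illustrative)."""
    # Fractional position of the requested percentile within the sorted data.
    position = (count - 1) * percent
    lower_index = math.floor(position)
    upper_index = math.ceil(position)

    # The position lands exactly on an element, so no interpolation is needed.
    if lower_index == upper_index:
        return sorted_values[lower_index]

    # Otherwise weight the two neighbours by how close the position is to each.
    weight_lower = upper_index - position
    weight_upper = position - lower_index
    return sorted_values[lower_index] * weight_lower + sorted_values[upper_index] * weight_upper


print(percentile_sorted([1, 2, 3, 4], 0.5, 4))  # 2.5
```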
I've included the source now
percent (float): float value from 0.0 to 1.0.
count (int): Count of values in series

@return - the percentile of the values
nit on return formatting:
Returns:
The percentile of the values.
semantic_tags={"numeric_col": "custom_tag"},
)
numeric_data.ww.describe()
assert mock_percentile.called
By doing mock_percentile.called in a for loop, you're only really confirming that the first logical type has percentile called (since mock_percentile.called will stay True for the rest of the test). My recommendation would be to parametrize over the different nullable logical types; then, each time the test is created it has a fresh mock_percentile.
Additionally, the check below with the non nullable types would run into a similar issue because once the mock has been called, we don't learn anything from confirming that it's still called. I think it may just be simplest to make a totally separate test for the non nullable ltypes and only check that it's not called prior to the describe call and is called afterwards. I'm not totally in love with that, though, and would be open to other solutions!
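To make the concern concrete, here is a small standalone sketch using only `unittest.mock`. `run_case` is a hypothetical stand-in for the describe call, not woodwork code. It shows that a mock shared across loop iterations keeps reporting `called == True`, while a fresh mock per case (which is what parametrizing gives you) reports each case independently.

```python
from unittest.mock import Mock


def run_case(mock_percentile, uses_percentile):
    # Hypothetical stand-in for describe(): only some cases reach percentile.
    if uses_percentile:
        mock_percentile()


cases = [True, False]

# One mock shared across the loop: `called` stays True after the first hit.
shared_mock = Mock()
shared_results = []
for uses_percentile in cases:
    run_case(shared_mock, uses_percentile)
    shared_results.append(shared_mock.called)

# A fresh mock per case, as a parametrized test would get.
fresh_results = []
for uses_percentile in cases:
    fresh_mock = Mock()
    run_case(fresh_mock, uses_percentile)
    fresh_results.append(fresh_mock.called)

print(shared_results)  # [True, True]
print(fresh_results)   # [True, False]
```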
@patch.object(sys.modules["woodwork.statistics_utils._get_describe_dict"], "percentile")
@pytest.mark.parametrize("non_nullable_numeric_type", [Integer, Age])
One more thing!
Can you add [Double, IntegerNullable, AgeNullable, AgeFractional] to the parametrization? It's a small thing, but we'll want to know that it's not just the type that is defining whether or not percentile gets used; it's whether nans are present. I would also update the test names to test_percentile_func_not_called_with_nans and test_percentile_func_called_without_nans.
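The point about nans rather than type can be sketched without woodwork. Everything below is hypothetical: `first_quartile` stands in for the branch inside the describe logic, and the two callables stand in for the new `percentile` helper and the existing quantile path. The same float data takes different paths depending only on whether a nan is present.

```python
import math
from unittest.mock import Mock


def first_quartile(values, fast_percentile, fallback_quantile):
    # Hypothetical dispatcher: use the fallback path whenever a nan is present.
    if any(math.isnan(v) for v in values):
        return fallback_quantile(values, 0.25)
    return fast_percentile(sorted(values), 0.25, len(values))


def fast_path_used(values):
    # Fresh mocks on every call, so earlier cases cannot leak into this one.
    fast = Mock(return_value=0.0)
    fallback = Mock(return_value=0.0)
    first_quartile(values, fast, fallback)
    return fast.called and not fallback.called


print(fast_path_used([1.0, 2.0, 3.0]))           # True
print(fast_path_used([1.0, float("nan"), 3.0]))  # False
```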
Once that last comment is done, you're good to go!
_get_describe_dict to minimize repetitive computation #1042