remove repeated sorting in describe dict #1682

Merged
simha104 merged 52 commits into main from speed_up_describe_dict
May 3, 2023

Conversation

Contributor

@simha104 simha104 commented Apr 18, 2023

@codecov

codecov bot commented Apr 18, 2023

Codecov Report

Merging #1682 (d0b209d) into main (a9f8652) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1682   +/-   ##
=======================================
  Coverage   98.79%   98.80%           
=======================================
  Files          98       98           
  Lines       11809    11855   +46     
=======================================
+ Hits        11667    11713   +46     
  Misses        142      142           
| Impacted Files | Coverage Δ |
|---|---|
| woodwork/statistics_utils/_get_describe_dict.py | 100.00% <100.00%> (ø) |
| woodwork/tests/accessor/test_statistics.py | 100.00% <100.00%> (ø) |

@ParthivNaresh
Contributor

Can we get perf tests for this?

@simha104
Contributor Author

> Can we get perf tests for this?

Yeah, sure. Is there a place in woodwork where I should put a performance test, or should I just test it locally and post the results here?
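
For a quick local comparison, something like the following `timeit` sketch could work. It is not woodwork code: `interpolate`, `quartiles_resorting`, and `quartiles_sort_once` are hypothetical helpers that only illustrate the difference between sorting before every quantile and sorting once, which is the change the PR title describes.

```python
import random
import timeit


def interpolate(sorted_values, percent):
    """Linear-interpolation percentile on data that is already sorted."""
    k = (len(sorted_values) - 1) * percent
    lower = int(k)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = k - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def quartiles_resorting(values):
    # Hypothetical "before": the data is sorted again for every quantile.
    return [interpolate(sorted(values), p) for p in (0.25, 0.5, 0.75)]


def quartiles_sort_once(values):
    # Hypothetical "after": sort a single time and reuse the result.
    ordered = sorted(values)
    return [interpolate(ordered, p) for p in (0.25, 0.5, 0.75)]


random.seed(0)
data = [random.random() for _ in range(20_000)]

# Both approaches must agree before their timings mean anything.
assert quartiles_resorting(data) == quartiles_sort_once(data)

time_resorting = timeit.timeit(lambda: quartiles_resorting(data), number=3)
time_sort_once = timeit.timeit(lambda: quartiles_sort_once(data), number=3)
print(f"re-sorting: {time_resorting:.4f}s, sort once: {time_sort_once:.4f}s")
```

The timings depend on the machine and the data size, so the numbers are only meaningful relative to each other.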

Contributor

@ParthivNaresh ParthivNaresh left a comment

LGTM

Contributor

@tamargrey tamargrey left a comment

Can we get a test that confirms that if nans are present, the current way gets used? I'm thinking you could mock percentile and check if that gets called or not.

"""
k = (count - 1) * percent
f = math.floor(k)
c = math.ceil(k)
Contributor

@tamargrey tamargrey May 3, 2023

Could we get more descriptive variable names, and maybe some comments outlining each step in the percentile calculation? Or, since this seems to match the scipy implementation (from this Stack Overflow answer), maybe just linking to that would be enough.

Contributor Author

I've included the source now
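
For readers following the thread, here is a sketch of how a linear-interpolation percentile over pre-sorted values can be written with more descriptive names. Only the first three lines of the calculation (`k`, `f`, `c`) appear in the diff above; the rest of the body, the name `percentile_sketch`, and the variable names are assumptions based on the commonly cited recipe, not the code that was merged.

```python
import math


def percentile_sketch(sorted_values, percent, count):
    """Percentile of already-sorted values using linear interpolation.

    Args:
        sorted_values (list): Values sorted in ascending order.
        percent (float): Float value from 0.0 to 1.0.
        count (int): Count of values in sorted_values.

    Returns:
        The percentile of the values.
    """
    # Fractional position of the requested percentile (k in the diff).
    position = (count - 1) * percent
    lower_index = math.floor(position)  # f in the diff
    upper_index = math.ceil(position)   # c in the diff

    # The position lands exactly on an element, so no interpolation is needed.
    if lower_index == upper_index:
        return sorted_values[int(position)]

    # Weight each neighbor by how close the position is to it.
    lower_part = sorted_values[lower_index] * (upper_index - position)
    upper_part = sorted_values[upper_index] * (position - lower_index)
    return lower_part + upper_part


print(percentile_sketch([1, 2, 3, 4], 0.5, 4))
```

Because the function expects sorted input, the caller can sort once and reuse the result for every quantile it needs.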

percent (float): float value from 0.0 to 1.0.
count (int): Count of values in series

@return - the percentile of the values
Contributor

nit on return formatting:

Returns:
     The percentile of the values.

semantic_tags={"numeric_col": "custom_tag"},
)
numeric_data.ww.describe()
assert mock_percentile.called
Contributor

By doing mock_percentile.called in a for loop, you're only really confirming that the first logical type has percentile called (since mock_percentile.called will stay True for the rest of the test). My recommendation would be to parametrize over the different nullable logical types; then, each time the test is created it has a fresh mock_percentile.

Additionally, the check below with the non-nullable types would run into a similar issue because once the mock has been called, we don't learn anything from confirming that it's still called. I think it may be simplest to make a totally separate test for the non-nullable logical types and only check that it's not called prior to the describe call and is called afterwards. I'm not totally in love with that, though, and would be open to other solutions!
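
The pitfall described above can be reproduced with the standard library alone. In this sketch, `describe_stand_in` is a hypothetical replacement for the real describe call: it invokes the percentile function only when no NaNs are present. A mock shared across loop iterations keeps `called == True` after the first case, while a fresh mock per case (which is what parametrizing the test provides) reports each case on its own.

```python
from unittest.mock import Mock


def describe_stand_in(values, percentile_func):
    """Hypothetical stand-in: use percentile_func only when there are no NaNs."""
    has_nan = any(value != value for value in values)  # NaN is not equal to itself
    if not has_nan:
        percentile_func(sorted(values), 0.5, len(values))


cases = [
    [1.0, 2.0, 3.0],           # no NaNs: percentile_func should be called
    [1.0, float("nan"), 3.0],  # NaNs present: percentile_func should not be called
]

# Pitfall: one mock shared by every case. `called` never resets.
shared_mock = Mock()
results_shared = []
for values in cases:
    describe_stand_in(values, shared_mock)
    results_shared.append(shared_mock.called)

# Fix: a fresh mock per case, as a parametrized test would get.
results_fresh = []
for values in cases:
    fresh_mock = Mock()
    describe_stand_in(values, fresh_mock)
    results_fresh.append(fresh_mock.called)

print(results_shared)  # [True, True]
print(results_fresh)   # [True, False]
```

Calling `reset_mock()` between cases would be another way to clear the call state, though a fresh mock per parametrized case keeps each test independent.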


@patch.object(sys.modules["woodwork.statistics_utils._get_describe_dict"], "percentile")
@pytest.mark.parametrize("non_nullable_numeric_type", [Integer, Age])
Contributor

one more thing!

Can you add [Double, IntegerNullable, AgeNullable, AgeFractional] to the parametrization? It's a small thing, but we'll want to know that it's not just the type that is defining whether or not percentile gets used; it's whether nans are present. I would also update the test names to test_percentile_func_not_called_with_nans and test_percentile_func_called_without_nans
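
To illustrate the point that the presence of NaNs, not the type, decides which path runs, here is a self-contained sketch that patches a module attribute through `sys.modules`, in the same style as the decorator in the diff. The module `describe_demo` and its `percentile` and `median` functions are hypothetical stand-ins created just for this example; they are not woodwork's functions, and the case labels only mimic logical type names.

```python
import statistics
import sys
import types
from unittest.mock import patch

# Build a tiny hypothetical module so there is something to patch.
demo = types.ModuleType("describe_demo")


def _percentile(sorted_values, percent, count):
    return sorted_values[int((count - 1) * percent)]


def _median(values):
    if any(value != value for value in values):
        # NaNs present: fall back to the slower path (stand-in for the current way).
        return statistics.median(v for v in values if v == v)
    # No NaNs: look up percentile on the module so a patch takes effect.
    return demo.percentile(sorted(values), 0.5, len(values))


demo.percentile = _percentile
demo.median = _median
sys.modules["describe_demo"] = demo


def percentile_was_called(values):
    # Each call gets a fresh mock, like one parametrized test case.
    with patch.object(sys.modules["describe_demo"], "percentile") as mock_percentile:
        demo.median(values)
        return mock_percentile.called


cases = {
    "Integer-like, no NaNs": [1, 2, 3],
    "Double-like, no NaNs": [1.0, 2.0, 3.0],
    "Double-like, with NaNs": [1.0, float("nan"), 3.0],
}
for label, values in cases.items():
    print(label, percentile_was_called(values))
```

The float column without NaNs still takes the fast path, which is the behavior the extra parametrization values are meant to confirm.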

Contributor

@tamargrey tamargrey left a comment

Once that last comment is done, you're good to go!

@simha104 simha104 enabled auto-merge (squash) May 3, 2023 21:28
@simha104 simha104 merged commit 36b6714 into main May 3, 2023
@simha104 simha104 deleted the speed_up_describe_dict branch May 3, 2023 21:47
@ParthivNaresh ParthivNaresh mentioned this pull request May 24, 2023
Development

Successfully merging this pull request may close these issues.

Optimize _get_describe_dict to minimize repetitive computation
4 participants