Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix ordering of quantiles in describe. #4647

Merged
merged 6 commits into from Apr 9, 2019

Conversation

Projects
None yet
3 participants
@gregrf
Copy link
Contributor

commented Mar 29, 2019

Previously dataframe.describe was using the built-in set to sort percentiles which doesn't sort properly. This led to unpredictably wrong results. For example percentiles=[0.25, 0.5, 0.75, 0.99] was reordered by set to [0.25, 0.5, 0.99, 0.75] which broke dataframe.quantiles which assumes quantiles are sorted by value.

After this patch the built-in sorted will be used to sort quantiles.

  • Tests added / passed
  • Passes flake8 dask
fix ordering of quantiles in describe.
Previously dataframe.describe was using the built-in set to sort quantiles which sorts by hash rather than by value. Now the built-in sorted is used instead.
@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Can you please include a test with your original set of quantiles, so we can prove this is fixed?

Fixes #4642

@gregrf

This comment has been minimized.

Copy link
Contributor Author

commented Mar 29, 2019

I added a simple test to catch this error.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Thank you, that's perfect.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Appears to fail on flake8 linting:

dask/dataframe/tests/test_dataframe.py:322:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:325:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:331:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:338:1: E302 expected 2 blank lines, found 1

gregrf added some commits Mar 29, 2019

@gregrf

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2019

Yesterday's patch, #4650, should help this problem but this patch is still important. Currently in describe, if the percentiles argument is not None, it passes to this line:

percentiles = list(set(sorted(percentiles + [0.5])))

But look what happens if percentiles is an ndarray:

A = dask.array.arange(101)
s = dask.dataframe.from_dask_array(A)
s.describe(percentiles=np.array([0.25])).compute()

count 101.000000
mean 50.000000
std 29.300171
min 0.000000
75% 75.000000
max 100.000000
dtype: float64

It's calculated the 75'th% instead of the 25'th% because numpy added [0.5] instead of concatenating like list would. I've added tests where percentiles is a list, tuple or ndarray to make sure it works for all of those list-like types.

I'm off on vacation without a computer. Back in a week.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 31, 2019

Have a nice holiday!
It would be good to have a test for the specific case you cite; although I note that the docstring says the input should be "list-like", which may or may not include array.

gregrf added some commits Apr 9, 2019

@gregrf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 9, 2019

Hi, I'm back.

As a user I'm certainly surprised whenever a function in the scipy ecosystem doesn't accept a 1d ndarray when a 1-d container is needed but I admit to having had to google "list-like." Is there a tight definition for this in dask? It turns out Pandas has a function pandas.api.types.is_list_like() that considers ndarrays to be list-like although it also considers the builtin set and pandas.series to be list-like which surprises me a little. I'm any event, this patch fixes the problem.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

Failure comes from seemingly unrelated bokeh thing

@gregrf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 9, 2019

Sorry, I'm new to this. Am I right that it's passing for PYTHON=3.5 NUMPY=1.11.1 PANDAS=0.19.2 and below and failing for newer packages? I don't know what to make of that.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

You do not need to do anything, I don't think this is your fault. @mrocklin @jakirkham , is the bokeh failure a known thing?
I am +1 on this PR, that aside.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

(Bokeh 1.1.0 was released today, with a lot of layout refactoring)

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

Should be fixed in #4680

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

OK, so merging this now. Thank you @gregrf

@martindurant martindurant merged commit 9d87de5 into dask:master Apr 9, 2019

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details

asmith26 added a commit to asmith26/dask that referenced this pull request Apr 22, 2019

fix ordering of quantiles in describe. (dask#4647)
* fix ordering of quantiles in describe.

Previously dataframe.describe was using the built-in set to sort quantiles which sorts by hash rather than by value. Now the built-in sorted is used instead.

* added test for describe with unsorted q parameter

* Update test_dataframe.py

* test describe percentiles arg for ndarray or list

* made code that handles quantile input in dataframe.describe more
explicit and fixed mistake in related test.

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

fix ordering of quantiles in describe. (dask#4647)
* fix ordering of quantiles in describe.

Previously dataframe.describe was using the built-in set to sort quantiles which sorts by hash rather than by value. Now the built-in sorted is used instead.

* added test for describe with unsorted q parameter

* Update test_dataframe.py

* test describe percentiles arg for ndarray or list

* made code that handles quantile input in dataframe.describe more
explicit and fixed mistake in related test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.