Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix ordering of quantiles in describe. #4647

Merged
merged 6 commits into from Apr 9, 2019
Merged

Conversation

@gregrf
Copy link
Contributor

@gregrf gregrf commented Mar 29, 2019

Previously dataframe.describe was using the built-in set to sort percentiles which doesn't sort properly. This led to unpredictably wrong results. For example percentiles=[0.25, 0.5, 0.75, 0.99] was reordered by set to [0.25, 0.5, 0.99, 0.75] which broke dataframe.quantiles which assumes quantiles are sorted by value.

After this patch the built-in sorted will be used to sort quantiles.

  • Tests added / passed
  • Passes flake8 dask
Previously dataframe.describe was using the built-in set to sort quantiles which sorts by hash rather than by value. Now the built-in sorted is used instead.
@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Can you please include a test with your original set of quantiles, so we can prove this is fixed?

Fixes #4642

@gregrf
Copy link
Contributor Author

@gregrf gregrf commented Mar 29, 2019

I added a simple test to catch this error.

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Thank you, that's perfect.

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Appears to fail on flake8 linting:

dask/dataframe/tests/test_dataframe.py:322:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:325:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:331:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:338:1: E302 expected 2 blank lines, found 1

@gregrf
Copy link
Contributor Author

@gregrf gregrf commented Mar 30, 2019

Yesterday's patch, #4650, should help this problem but this patch is still important. Currently in describe, if the percentiles argument is not None, it passes to this line:

percentiles = list(set(sorted(percentiles + [0.5])))

But look what happens if percentiles is an ndarray:

A = dask.array.arange(101)
s = dask.dataframe.from_dask_array(A)
s.describe(percentiles=np.array([0.25])).compute()

count 101.000000
mean 50.000000
std 29.300171
min 0.000000
75% 75.000000
max 100.000000
dtype: float64

It's calculated the 75'th% instead of the 25'th% because numpy added [0.5] instead of concatenating like list would. I've added tests where percentiles is a list, tuple or ndarray to make sure it works for all of those list-like types.

I'm off on vacation without a computer. Back in a week.

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 31, 2019

Have a nice holiday!
It would be good to have a test for the specific case you cite; although I note that the docstring says the input should be "list-like", which may or may not include array.

@gregrf
Copy link
Contributor Author

@gregrf gregrf commented Apr 9, 2019

Hi, I'm back.

As a user I'm certainly surprised whenever a function in the scipy ecosystem doesn't accept a 1d ndarray when a 1-d container is needed but I admit to having had to google "list-like." Is there a tight definition for this in dask? It turns out Pandas has a function pandas.api.types.is_list_like() that considers ndarrays to be list-like although it also considers the builtin set and pandas.series to be list-like which surprises me a little. I'm any event, this patch fixes the problem.

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 9, 2019

Failure comes from seemingly unrelated bokeh thing

@gregrf
Copy link
Contributor Author

@gregrf gregrf commented Apr 9, 2019

Sorry, I'm new to this. Am I right that it's passing for PYTHON=3.5 NUMPY=1.11.1 PANDAS=0.19.2 and below and failing for newer packages? I don't know what to make of that.

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 9, 2019

You do not need to do anything, I don't think this is your fault. @mrocklin @jakirkham , is the bokeh failure a known thing?
I am +1 on this PR, that aside.

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 9, 2019

(Bokeh 1.1.0 was released today, with a lot of layout refactoring)

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Apr 9, 2019

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 9, 2019

Should be fixed in #4680

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 9, 2019

OK, so merging this now. Thank you @gregrf

@martindurant martindurant merged commit 9d87de5 into dask:master Apr 9, 2019
1 of 2 checks passed
1 of 2 checks passed
continuous-integration/travis-ci/pr The Travis CI build failed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019
* fix ordering of quantiles in describe.

Previously dataframe.describe was using the built-in set to sort quantiles which sorts by hash rather than by value. Now the built-in sorted is used instead.

* added test for describe with unsorted q parameter

* Update test_dataframe.py

* test describe percentiles arg for ndarray or list

* made code that handles quantile input in dataframe.describe more
explicit and fixed mistake in related test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants