Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correct quantile to handle unsorted quantiles #4650

Merged
merged 7 commits into from Mar 29, 2019

Conversation

Projects
None yet
4 participants
@gregrf
Copy link
Contributor

commented Mar 29, 2019

Currently dask.dataframe.core.quantile(df, q) can silently give incorrect results when the list of quantiles, q, is not sorted. For instance quantile(dask.array.arange(100), [0.75, 0.50, 0.25]) gives incorrect results. This patch uses numpy's mergesort to ensure that the quantiles are sorted. Note that with the patch behavior still differs from that in pandas.DataFrame.quantile() where quantiles are calculated correctly while preserving order. While this patch does not duplicate the behavior of pandas because it does not preserve the order of the quantiles, it does at least avoids the silent errors.

  • Tests added / passed
  • Passes flake8 dask
correct quantile to handle unsorted quantiles
Currently dask.dataframe.core.quantile(df, q) can silently give incorrect results when the list of quantiles, q, is not sorted. For instance quantile(dask.array.arange(100), [0.75, 0.50, 0.25]) gives incorrect results. This patch uses numpy's mergesort to ensure that the quantiles are sorted. Note that with the patch behavior still differs from that in pandas.DataFrame.quantile() where quantiles are calculated correctly while preserving order. While this patch does not duplicate the behavior of pandas because it does not preserve the order of the quantiles, it does at least avoids the silent errors.
@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Should be tested with an input where the given percentiles are not sorted.

gregrf added some commits Mar 29, 2019

# current implementation needs qs to be sorted, sort in-place to make sure
q = np.asanyarray(q)
if q.ndim > 0:
q.sort(kind='mergesort')

This comment has been minimized.

Copy link
@mrocklin

mrocklin Mar 29, 2019

Member

Is it possible that this modifies the input provided by the user? If so then we probably don't want to sort in place.

In [1]: import numpy as np

In [2]: x = np.array([1, 2, 3])

In [3]: np.asanyarray(x) is x
Out[3]: True

In [4]: np.sort(x)
Out[4]: array([1, 2, 3])

In [5]: np.sort(x) is x
Out[5]: False

This comment has been minimized.

Copy link
@gregrf

gregrf Mar 29, 2019

Author Contributor

Glad you noticed that. You are right that it will sort the user's array if it was an ndarray to begin with. I'll fix it.

gregrf added some commits Mar 29, 2019

# current implementation needs q to be sorted so
# sort if array-like, otherwise leave it alone
q_ndarray = np.array(q)
if q_ndarray.ndim > 0:

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 29, 2019

Member

In this case, maybe want to test for the single-value/scalar case?

This comment has been minimized.

Copy link
@gregrf

gregrf Mar 29, 2019

Author Contributor

Is there a straightforward way to tell if an arg is a scalar in python? Maybe there is now that there are base classes everywhere. It used to be hard so I'm used to doing the thing where I try to convert it to an array with np.array, then check to see if it has non-zero dimension, sorting if it does then setting q=q_ndarray and simply abandoning q_ndarray if q_ndarray.ndim==0.

By the way, I'm off on vacation in a couple hours. Feel free to modify and commit in my absence. I don't care who GIT blames for this.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Ah, I hadn't realised you simply dropped the variable if it had no dimensions, that's fine. Please still add a case of a single quantile, and one of a scalar quantile into the test, just to be sure.

Still showing lint problems:

dask/dataframe/core.py:3949:53: W291 trailing whitespace
dask/dataframe/tests/test_dataframe.py:899:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:902:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:908:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:910:14: E128 continuation line under-indented for visual indent
dask/dataframe/tests/test_dataframe.py:915:1: W293 blank line contains whitespace

gregrf added some commits Mar 29, 2019

@codecov

This comment has been minimized.

Copy link

commented Mar 29, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (master@2391040). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #4650   +/-   ##
=========================================
  Coverage          ?   91.22%           
=========================================
  Files             ?       92           
  Lines             ?    17189           
  Branches          ?        0           
=========================================
  Hits              ?    15680           
  Misses            ?     1509           
  Partials          ?        0
Impacted Files Coverage Δ
dask/dataframe/core.py 95.77% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2391040...ccf0721. Read the comment docs.

@pentschev

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Not directly related to the PR itself, but we weren't supposed to have a codecov report as a comment. Also, we were supposed to have a report already, not sure if this is because the PR was opened before codecov was enabled. I'll check this, but for now, it's safe to go ahead and ignore codecov.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Thank you @gregrf

@martindurant martindurant merged commit e8e2d76 into dask:master Mar 29, 2019

4 checks passed

codecov/patch No report found to compare against
Details
codecov/project No report found to compare against
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@gregrf gregrf deleted the gregrf:gregrf-patch-2 branch Mar 30, 2019

@gregrf

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2019

Thanks for being so responsive to my pull requests.

@gregrf gregrf referenced this pull request Mar 30, 2019

Merged

fix ordering of quantiles in describe. #4647

2 of 2 tasks complete

asmith26 added a commit to asmith26/dask that referenced this pull request Apr 22, 2019

correct quantile to handle unsorted quantiles (dask#4650)
* correct quantile to handle unsorted quantiles

Currently dask.dataframe.core.quantile(df, q) can silently give incorrect results when the list of quantiles, q, is not sorted. For instance quantile(dask.array.arange(100), [0.75, 0.50, 0.25]) gives incorrect results. This patch uses numpy's mergesort to ensure that the quantiles are sorted. Note that with the patch behavior still differs from that in pandas.DataFrame.quantile() where quantiles are calculated correctly while preserving order. While this patch does not duplicate the behavior of pandas because it does not preserve the order of the quantiles, it does at least avoids the silent errors.

* added tests for unsorted arg to quantile

* moved sorting of quantiles before index is created

* avoid modifying argument to quantile()

* leave q arg to quantiles alone if scalar

* eliminated trailing whitespace

* testing quantile added length 1 list and scalar case

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

correct quantile to handle unsorted quantiles (dask#4650)
* correct quantile to handle unsorted quantiles

Currently dask.dataframe.core.quantile(df, q) can silently give incorrect results when the list of quantiles, q, is not sorted. For instance quantile(dask.array.arange(100), [0.75, 0.50, 0.25]) gives incorrect results. This patch uses numpy's mergesort to ensure that the quantiles are sorted. Note that with the patch behavior still differs from that in pandas.DataFrame.quantile() where quantiles are calculated correctly while preserving order. While this patch does not duplicate the behavior of pandas because it does not preserve the order of the quantiles, it does at least avoids the silent errors.

* added tests for unsorted arg to quantile

* moved sorting of quantiles before index is created

* avoid modifying argument to quantile()

* leave q arg to quantiles alone if scalar

* eliminated trailing whitespace

* testing quantile added length 1 list and scalar case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.