Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correct quantile to handle unsorted quantiles #4650

Merged
merged 7 commits into from Mar 29, 2019

Conversation

@gregrf
Copy link
Contributor

@gregrf gregrf commented Mar 29, 2019

Currently dask.dataframe.core.quantile(df, q) can silently give incorrect results when the list of quantiles, q, is not sorted. For instance quantile(dask.array.arange(100), [0.75, 0.50, 0.25]) gives incorrect results. This patch uses numpy's mergesort to ensure that the quantiles are sorted. Note that with the patch behavior still differs from that in pandas.DataFrame.quantile() where quantiles are calculated correctly while preserving order. While this patch does not duplicate the behavior of pandas because it does not preserve the order of the quantiles, it does at least avoids the silent errors.

  • Tests added / passed
  • Passes flake8 dask
Currently dask.dataframe.core.quantile(df, q) can silently give incorrect results when the list of quantiles, q, is not sorted. For instance quantile(dask.array.arange(100), [0.75, 0.50, 0.25]) gives incorrect results. This patch uses numpy's mergesort to ensure that the quantiles are sorted. Note that with the patch behavior still differs from that in pandas.DataFrame.quantile() where quantiles are calculated correctly while preserving order. While this patch does not duplicate the behavior of pandas because it does not preserve the order of the quantiles, it does at least avoids the silent errors.
@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Should be tested with an input where the given percentiles are not sorted.

# current implementation needs qs to be sorted, sort in-place to make sure
q = np.asanyarray(q)
if q.ndim > 0:
q.sort(kind='mergesort')
Copy link
Member

@mrocklin mrocklin Mar 29, 2019

Is it possible that this modifies the input provided by the user? If so then we probably don't want to sort in place.

In [1]: import numpy as np

In [2]: x = np.array([1, 2, 3])

In [3]: np.asanyarray(x) is x
Out[3]: True

In [4]: np.sort(x)
Out[4]: array([1, 2, 3])

In [5]: np.sort(x) is x
Out[5]: False

Copy link
Contributor Author

@gregrf gregrf Mar 29, 2019

Glad you noticed that. You are right that it will sort the user's array if it was an ndarray to begin with. I'll fix it.

# current implementation needs q to be sorted so
# sort if array-like, otherwise leave it alone
q_ndarray = np.array(q)
if q_ndarray.ndim > 0:
Copy link
Member

@martindurant martindurant Mar 29, 2019

In this case, maybe want to test for the single-value/scalar case?

Copy link
Contributor Author

@gregrf gregrf Mar 29, 2019

Is there a straightforward way to tell if an arg is a scalar in python? Maybe there is now that there are base classes everywhere. It used to be hard so I'm used to doing the thing where I try to convert it to an array with np.array, then check to see if it has non-zero dimension, sorting if it does then setting q=q_ndarray and simply abandoning q_ndarray if q_ndarray.ndim==0.

By the way, I'm off on vacation in a couple hours. Feel free to modify and commit in my absence. I don't care who GIT blames for this.

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Ah, I hadn't realised you simply dropped the variable if it had no dimensions, that's fine. Please still add a case of a single quantile, and one of a scalar quantile into the test, just to be sure.

Still showing lint problems:

dask/dataframe/core.py:3949:53: W291 trailing whitespace
dask/dataframe/tests/test_dataframe.py:899:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:902:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:908:1: W293 blank line contains whitespace
dask/dataframe/tests/test_dataframe.py:910:14: E128 continuation line under-indented for visual indent
dask/dataframe/tests/test_dataframe.py:915:1: W293 blank line contains whitespace

@codecov
Copy link

@codecov codecov bot commented Mar 29, 2019

Codecov Report

No coverage uploaded for pull request base (master@2391040). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #4650   +/-   ##
=========================================
  Coverage          ?   91.22%           
=========================================
  Files             ?       92           
  Lines             ?    17189           
  Branches          ?        0           
=========================================
  Hits              ?    15680           
  Misses            ?     1509           
  Partials          ?        0
Impacted Files Coverage Δ
dask/dataframe/core.py 95.77% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2391040...ccf0721. Read the comment docs.

@pentschev
Copy link
Member

@pentschev pentschev commented Mar 29, 2019

Not directly related to the PR itself, but we weren't supposed to have a codecov report as a comment. Also, we were supposed to have a report already, not sure if this is because the PR was opened before codecov was enabled. I'll check this, but for now, it's safe to go ahead and ignore codecov.

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Thank you @gregrf

@martindurant martindurant merged commit e8e2d76 into dask:master Mar 29, 2019
4 checks passed
4 checks passed
@codecov[bot]
codecov/patch No report found to compare against
Details
@codecov[bot]
codecov/project No report found to compare against
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@gregrf gregrf deleted the gregrf-patch-2 branch Mar 30, 2019
@gregrf
Copy link
Contributor Author

@gregrf gregrf commented Mar 30, 2019

Thanks for being so responsive to my pull requests.

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019
* correct quantile to handle unsorted quantiles

Currently dask.dataframe.core.quantile(df, q) can silently give incorrect results when the list of quantiles, q, is not sorted. For instance quantile(dask.array.arange(100), [0.75, 0.50, 0.25]) gives incorrect results. This patch uses numpy's mergesort to ensure that the quantiles are sorted. Note that with the patch behavior still differs from that in pandas.DataFrame.quantile() where quantiles are calculated correctly while preserving order. While this patch does not duplicate the behavior of pandas because it does not preserve the order of the quantiles, it does at least avoids the silent errors.

* added tests for unsorted arg to quantile

* moved sorting of quantiles before index is created

* avoid modifying argument to quantile()

* leave q arg to quantiles alone if scalar

* eliminated trailing whitespace

* testing quantile added length 1 list and scalar case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants