Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor array.percentile and dataframe.quantile to use t-digest #4677

Merged
merged 26 commits into from Apr 26, 2019

Conversation

Projects
None yet
3 participants
@Dimplexion
Copy link
Contributor

commented Apr 8, 2019

  • Tests added / passed
  • Passes flake8 dask

Related to #1225

This PR changes the public array.percentile and dataframe.quantile methods to use t-digest when possible. Meaning that depending on the input parameters it uses t-digest when it's possible and falls back to the old implementation for data types that are not integer or float and if interpolation is not allowed.

I also noticed that array.percentile was being used for calculating divisions and figured someone else might want to be able to do that also. So I added a parameter use_tdigest that defaults to True but can be disabled to revert back to the old behavior. I had to add it to a few places for it to make it sense, let me know if you think this is a good idea or if it should be changed.

I will still need to write some tests to cover cases when use_tdigest == True for good coverage before this is ready to be merged (if we will keep it).

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

cc @jcrist

@jcrist
Copy link
Member

left a comment

Thanks @Dimplexion for working on this. I've only left high-level api comments for now, will try to give a deeper review later.

@@ -30,6 +30,6 @@ source activate test-imports
(test_import "Core" "" "import dask, dask.threaded, dask.optimization") && \
(test_import "Delayed" "toolz" "import dask.delayed") && \
(test_import "Bag" "toolz partd cloudpickle" "import dask.bag") && \
(test_import "Array" "toolz numpy" "import dask.array") && \
(test_import "Array" "toolz numpy crick" "import dask.array") && \

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 8, 2019

Member

crick should not be a mandatory dependency for dask array - it could be optional for certain functionality (e.g. percentile), but should only be needed if requested.

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Removed crick from being a mandatory dependency.

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Btw is there a list of optional dependencies somewhere where I should add it?

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 10, 2019

Member

We should add it to the install scripts on travis and appveyor, but other than that no.

There's also dask.utils.import_required which is used in a few places to provide a nice error message when importing a dependency fails for some optional functionality. I'd probably add this to quantile and percentile to error nicely when they're called by crick isn't installed. See e.g.

dask/dask/bag/avro.py

Lines 95 to 97 in 9f870d1

import_required('fastavro',
"fastavro is a required dependency for using "
"bag.read_avro().")

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 11, 2019

Author Contributor

I added it to the install scripts and also added those import warnings.

@@ -4,6 +4,7 @@
from numbers import Number

import numpy as np
from crick import TDigest

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 8, 2019

Member

This import should be local to functions using it to prevent crick from being a required dependency.

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Moved it to be imported only where needed.

@@ -1494,7 +1494,7 @@ def sem(self, axis=None, skipna=None, ddof=1, split_every=False):
result.divisions = (min(self.columns), max(self.columns))
return result

def quantile(self, q=0.5, axis=0):
def quantile(self, q=0.5, axis=0, use_tdigest=True):

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 8, 2019

Member

Since we may add more methods in the future, I'd prefer this to be a string. Current values would be:

  • `method='tdigest': uses tdigest
  • `method='dask': uses dask's custom algorithm
  • method='default': the default value for this kwarg just means use the default for this version of dask. We should feel free to change this if we find a new best method. For now this would map to 'dask', but could be changed later.

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Implemented this change as described here. Although I must raise the concern here that defaulting to 'dask' can lead to some pretty nasty surprises. For example in the case that brought this issue to my knowledge I wanted to get the 95th percentile from data and the call ddf.quantile(0.95) was giving me a value of ~12. A week later I learned about the issues with the quantile method and ran it using t-digest which game me a result of ~8. The internal implementation sometimes fails massively and it can be very difficult to spot. I think it's dangerous to use it as the default without a warning. I'm not sure what would be the best way to handle it without breaking existing code, though, as t-digest can't handle all the cases the internal one can.

This comment has been minimized.

Copy link
@mrocklin

mrocklin Apr 9, 2019

Member

@jcrist what are your thoughts on what the default should be? Should we make crick mandatory for dask dataframe?

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 10, 2019

Member

Maybe? My thought was we should add the implementation's in this PR, let them sit for a bit so they can get some use, then maybe pull the switch and change the default. I think we do still want the current algorithm for partitioning logic (no external dependency, handles more dtypes), but for numerical results tdigest is likely better.

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 10, 2019

Author Contributor

Yeah I agree we still want to use the custom algorithm for handling partitioning. I'm fine with having it as 'dask' for now, just wanted to make sure I bring that up since it was a pretty nasty surprise for me when I was starting to use Dask.

@@ -1533,7 +1538,7 @@ def quantile(self, q=0.5, axis=0):
return DataFrame(graph, keyname, meta, quantiles[0].divisions)

@derived_from(pd.DataFrame)
def describe(self, split_every=False, percentiles=None):
def describe(self, split_every=False, percentiles=None, use_tdigest=True):

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 8, 2019

Member

perhaps percentiles_method=... here?

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Done.

@@ -2032,13 +2037,17 @@ def to_timestamp(self, freq=None, how='start', axis=0):
df.divisions = tuple(pd.Index(self.divisions).to_timestamp())
return df

def quantile(self, q=0.5):
def quantile(self, q=0.5, use_tdigest=True):

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 8, 2019

Member

perhaps method=... here?

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Done.

@@ -3938,13 +3947,16 @@ def _rename_dask(df, names):
return new_dd_object(graph, name, metadata, df.divisions)


def quantile(df, q):
def quantile(df, q, use_tdigest=True):

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 8, 2019

Member

perhaps method=... here?

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Done.

setup.py Outdated
@@ -8,7 +8,7 @@
# NOTE: These are tested in `continuous_integration/travis/test_imports.sh` If
# you modify these, make sure to change the corresponding line there.
extras_require = {
'array': ['numpy >= 1.11.0', 'toolz >= 0.7.3'],
'array': ['numpy >= 1.11.0', 'toolz >= 0.7.3', 'crick >= 0.0.3'],

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 8, 2019

Member

crick should be an optional dependency for dask array, and probably an optional dependency for dask dataframe (for now).

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 9, 2019

Author Contributor

Removed.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

It looks like tests are fine except for linting errors

(From travis-ci logs)

$ if [[ $LINT == 'true' ]]; then flake8 dask; fi
dask/dataframe/core.py:4016:17: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4020:45: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4027:17: E128 continuation line under-indented for visual indent
dask/array/percentile.py:92:17: E128 continuation line under-indented for visual indent
@Dimplexion

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2019

It looks like tests are fine except for linting errors

(From travis-ci logs)

$ if [[ $LINT == 'true' ]]; then flake8 dask; fi
dask/dataframe/core.py:4016:17: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4020:45: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4027:17: E128 continuation line under-indented for visual indent
dask/array/percentile.py:92:17: E128 continuation line under-indented for visual indent

Thanks for pointing that out, those should be fixed now. For some reason I'm getting this weird error when I'm trying to run flake8 locally so I'm gonna have to rely on travis for this.

The error I'm getting is this: pkg_resources.DistributionNotFound: The 'pyflakes<2.1.0,>=2.0.0' distribution was not found and is required by flake8. I'll see if I can get it fixed later.

@Dimplexion

This comment has been minimized.

Copy link
Contributor Author

commented Apr 11, 2019

The interface is probably set now so I'll start adding the tests so that we have similar coverage with method='tdigest' as we have with method='dask'.

@Dimplexion

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2019

Pushed some test cases now. Still need to add some tests for DataFrame.describe() and DataFrame.quantile().

@Dimplexion Dimplexion changed the title WIP: Refactor array.percentile and dataframe.quantile to use t-digest Refactor array.percentile and dataframe.quantile to use t-digest Apr 16, 2019

@Dimplexion

This comment has been minimized.

Copy link
Contributor Author

commented Apr 16, 2019

I pushed the tests for the other functions that were still missing. This PR is ready from my perspective so feel free to check it whenever you have time @jcrist. I also removed WIP from the title now.

@jcrist

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

Thanks @Dimplexion, I'll give this a review tomorrow. Looks like this has developed merge conflicts, if you have time at some point could you resolve these?

@jcrist
Copy link
Member

left a comment

Thanks @Dimplexion. Overall the implementation looks good - I left a few comments about the tests and docstrings but this is pretty close to mergeable.

t = TDigest()
t.merge(*digests)

return np.array([t.quantile(q / 100.0) for q in qs])

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 19, 2019

Member

quantile supports array arguments:

t.quantile(q / 100.0)

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 23, 2019

Author Contributor

Good catch, didn't notice that.

@@ -9,19 +9,35 @@

def test_percentile():

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 19, 2019

Member

I'd write this with pytest.mark.parametrize over the method arg:

@pytest.mark.parametrize('method', ['tdigest', 'dask'])
def test_percentile(method):
    ...

For arguments that don't support one type I'd just special case around them inside the test:

    if method != 'tdigest':
        x = np.array(['a', 'a', 'd', 'd', 'd', 'e'])
        d = da.from_array(x, chunks=(3,))
        assert_eq(da.percentile(d, [0, 50, 100]),
                  np.array(['a', 'd', 'e'], dtype=x.dtype))

Doing it this way ensures we have coverage across both methods and minimizes duplicating code.

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 23, 2019

Author Contributor

That's a really nice way of doing it, thanks for pointing that out!

@@ -59,6 +76,7 @@ def test_percentiles_with_scaler_percentile(q):
# See #3020
d = da.ones((16,), chunks=(4,))
assert_eq(da.percentile(d, q), np.array([1], dtype=d.dtype))
assert_eq(da.percentile(d, q, method='tdigest'), np.array([1], dtype=d.dtype))


def test_unknown_chunk_sizes():

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 19, 2019

Member

Could use pytest.mark.parametrize here.

Note: this implementation will use t-digest for columns with floating
dtype if axis is set to 0 and `method` is set to `tdigest`.
Otherwise it falls back to the internal implementation.

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 19, 2019

Member

Doesn't it use tdigest only if that's specified explicitly?

Also, should put this as a parameter docstring:

"""
...
method : {'default', 'tdigest', 'dask'}:
    What method to use. By default will use dask's internal custom algorithm (``'dask'``).
    If set to ``'tdigest'`` will use tdigest for floats and ints and fallback to the ``'dask'``
    otherwise.
...
"""

This comment has been minimized.

Copy link
@Dimplexion

Dimplexion Apr 23, 2019

Author Contributor

Yes it does, your version is much better.

""" Approximate quantiles of Series
q : list/array of floats, default 0.5 (50%)
Iterable of numbers ranging from 0 to 1 for the desired quantiles
Note: this implementation will use t-digest is `method` is set to `tdigest` and the

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 19, 2019

Member

Should put this as a parameter docstring:

"""
...
method : {'default', 'tdigest', 'dask'}:
    What method to use. By default will use dask's internal custom algorithm (``'dask'``).
    If set to ``'tdigest'`` will use tdigest for floats and ints and fallback to the ``'dask'``
    otherwise.
...
"""
"""Approximate quantiles of Series.
Parameters
----------
q : list/array of floats
Iterable of numbers ranging from 0 to 100 for the desired quantiles
Note: this implementation will use t-digest is `method` is set to `tdigest` and the
dtype of df is float. Otherwise it falls back to the internal implementation.

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 19, 2019

Member

Should put this as a parameter docstring:

"""
...
method : {'default', 'tdigest', 'dask'}:
    What method to use. By default will use dask's internal custom algorithm (``'dask'``).
    If set to ``'tdigest'`` will use tdigest for floats and ints and fallback to the ``'dask'``
    otherwise.
...
"""
@@ -300,13 +300,33 @@ def test_describe():
assert ds.describe(split_every=2)._name != ds.describe()._name
assert ddf.describe(split_every=2)._name != ddf.describe()._name

s = pd.Series(list(range(10)) * 6)

This comment has been minimized.

Copy link
@jcrist

jcrist Apr 19, 2019

Member

General comment for this whole file: it would be good to use pytest.mark.parametrize here to minimize duplicating code.

@Dimplexion

This comment has been minimized.

Copy link
Contributor Author

commented Apr 25, 2019

Everything that was listed has be fixed now, including the conflict with the base branch. I also added testing on some cases that were still missing previously. test_quantile_for_possibly_unsorted_q is not being run with 'tdigest' yet as it wasn't trivial to make it work the way it's now. I can look into that a bit later still.

Fixups
- Make tests not require crick to run
- Fix docstring formatting
- A few other nits
@jcrist

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Thanks @Dimplexion. I pushed an extra commit to make the tests run without crick installed, and also formatted a few of the docstrings. Everything looks good, will merge once tests pass.

@jcrist jcrist merged commit 87a46eb into dask:master Apr 26, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jcrist

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Thanks @Dimplexion!

@Dimplexion

This comment has been minimized.

Copy link
Contributor Author

commented Apr 27, 2019

Always happy to help!

@Dimplexion Dimplexion deleted the Dimplexion:refactor-public-quantile branch Apr 27, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Refactor array.percentile and dataframe.quantile to use t-digest (das…
…k#4677)

* Use t-digest for arrays when possible

* implement using t-digest for DataFrame quantiles when possible

* Change dataframe/io/from_bcolz to use the old percentile implementation

* Change tests to use the old DataFrame.quantile() and Array.percentile() versions to allow them to pass

* Add crick as a dependency

* Remove crick from being a mandatory requirement

* Change "use_tdigest" parameter to a more general "method"

* Update tests to work with the new "method" parameter to quantile and percentile functions

* Fix flake8 warnings.

* Add warning when attempting to use t-digest function without crick being installed

* Add crick as an optional dependency to appveyor and travis

* Add tests for array.percentile() when using 'tdigest' method

* Add some tests for DataFrame.quantile with method 'tdigest'

* Fix styling error.

* Add tests for DataFrame.quantile when method='tdigest'

* Add tests for DataFrame.describe() when using method='tdigest'

* Change array.percentile to use list parameter for crick.quantile().

* Change to use pytest.mark.parametrize when needed

* Fix doc strings in dataframe/core.

* Refactor quantile tests in test_dataframe to use pytest.mark.parametrize

* Fixups

- Make tests not require crick to run
- Fix docstring formatting
- A few other nits

Thomas-Z added a commit to Thomas-Z/dask that referenced this pull request May 17, 2019

Refactor array.percentile and dataframe.quantile to use t-digest (das…
…k#4677)

* Use t-digest for arrays when possible

* implement using t-digest for DataFrame quantiles when possible

* Change dataframe/io/from_bcolz to use the old percentile implementation

* Change tests to use the old DataFrame.quantile() and Array.percentile() versions to allow them to pass

* Add crick as a dependency

* Remove crick from being a mandatory requirement

* Change "use_tdigest" parameter to a more general "method"

* Update tests to work with the new "method" parameter to quantile and percentile functions

* Fix flake8 warnings.

* Add warning when attempting to use t-digest function without crick being installed

* Add crick as an optional dependency to appveyor and travis

* Add tests for array.percentile() when using 'tdigest' method

* Add some tests for DataFrame.quantile with method 'tdigest'

* Fix styling error.

* Add tests for DataFrame.quantile when method='tdigest'

* Add tests for DataFrame.describe() when using method='tdigest'

* Change array.percentile to use list parameter for crick.quantile().

* Change to use pytest.mark.parametrize when needed

* Fix doc strings in dataframe/core.

* Refactor quantile tests in test_dataframe to use pytest.mark.parametrize

* Fixups

- Make tests not require crick to run
- Fix docstring formatting
- A few other nits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.