[Feature request] aggregate syntax and quantile computation #5986

Open
Amyantis opened this issue Mar 6, 2020 · 6 comments

Amyantis commented Mar 6, 2020

Hi,

The Dask API provides a method to compute quantiles of Series:
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.quantile

This is already a great feature, since distributed quantile computation is known to be a real challenge.

Unfortunately, it is not available through the aggregate API:
https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate

Would it be possible to provide quantile computation to Dask aggregate syntax?
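
For context, a rough sketch contrasting what already works with what this request asks for (the data and column names below are just illustrative):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]}),
    npartitions=2,
)

# This works today: a quantile over a whole column (approximate by default).
print(ddf["value"].quantile(0.25).compute())

# This is the request; it is not currently supported, so it stays commented out:
# ddf.groupby("group").value.agg(["mean", "quantile"])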

TomAugspurger (Member) commented

Can you provide a snippet with the input data and expected output? Is this in a groupby context, i.e. the Dask version of

In [12]: df = pd.DataFrame({"A": ['a', 'a', 'b'], "B": [1, 2, 3], "C": [4, 5, 4]})

In [13]: df.groupby("A").quantile()
Out[13]:
     B    C
A
a  1.5  4.5
b  3.0  4.0

One slight complication is that quantile isn't always an aggregation.

In [14]: df.groupby("A").quantile([0.5, 0.75])
Out[14]:
           B     C
A
a 0.50  1.50  4.50
  0.75  1.75  4.75
b 0.50  3.00  4.00
  0.75  3.00  4.00

It may still be doable though.

Amyantis (Author) commented Mar 6, 2020

Thank you for your answer.

Correct, this is in a groupby context.

Here is a snippet of what I would do using pure Pandas:

In [3]: import pandas as pd
   ...:
   ...: def q25(s):
   ...:     return s.quantile(0.25)
   ...:
   ...: df_pandas = pd.DataFrame({"car": [1, 1, 2, 4, 4, 4], "speed": [1, 2, 3, 4, 5, 6]})
   ...: df_pandas.groupby("car").speed.agg(["mean", q25])

Out[3]:
     mean   q25
car
1     1.5  1.25
2     3.0  3.00
4     5.0  4.50
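
Not the requested agg syntax, but for reference, a rough sketch of how one might reproduce the q25 number with Dask today via groupby(...).apply (the npartitions and meta values below are illustrative assumptions):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"car": [1, 1, 2, 4, 4, 4], "speed": [1, 2, 3, 4, 5, 6]})
ddf = dd.from_pandas(pdf, npartitions=2)

# groupby(...).apply shuffles each group into a single task, so it is not a
# tree-style aggregation, but it does give the exact per-group quantile.
q25 = ddf.groupby("car")["speed"].apply(
    lambda s: s.quantile(0.25),
    meta=("speed", "f8"),
)
print(q25.compute())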

leonardokidd commented

When I used df.groupby("A").quantile(0.5), I got a 'column not found quantile' error.

jsignell (Member) commented

@leonardokidd this issue is still open because the feature that it is requesting has not been implemented yet. If you are interested in implementing it please open a Pull Request.

istorch commented Jul 17, 2021

I came up with a workaround for a particular case. If the groups that you want to calculate quantiles on are relatively small, you can separate them by partition, then use a dask Aggregation. Here is an example:

import dask.dataframe as dd

# this puts different user ids on different partitions
# (df is an existing Dask DataFrame with a "user_id" column)
df = df.set_index("user_id")

median_fun = dd.Aggregation(
    name="median",
    # this computes the median of each group within each partition
    chunk=lambda s: s.median(),
    # this combines the per-partition results; since each group lives on a
    # single partition, every group contributes exactly one value, so sum()
    # just passes that value through
    agg=lambda s0: s0.sum(),
)

median_df = df.groupby("user_id")["some_column"].agg(median_fun)

It seemed to work for the data I was working with, but I haven't thoroughly tested it. I am curious to see if anyone finds any issues with it.

jsignell (Member) commented

Yeah that should work as long as all the data for each group is on exactly one partition. Nice!
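
For anyone trying this workaround, a small self-contained check against plain pandas (the data below is made up; it relies on set_index placing each user_id on a single partition):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {"user_id": [1, 1, 2, 3, 3, 3], "some_column": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
)

# set_index shuffles the data so each user_id lands on exactly one partition
ddf = dd.from_pandas(pdf, npartitions=3).set_index("user_id")

median_fun = dd.Aggregation(
    name="median",
    chunk=lambda s: s.median(),
    agg=lambda s0: s0.sum(),
)

dask_result = ddf.groupby("user_id")["some_column"].agg(median_fun).compute()
pandas_result = pdf.groupby("user_id")["some_column"].median()

# these should match because every group sits on exactly one partition
pd.testing.assert_series_equal(
    dask_result.sort_index(), pandas_result.sort_index(), check_names=False
)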
