[Feature request] aggregate syntax and quantile computation #5986

Open
Amyantis opened this issue Mar 6, 2020 · 6 comments

Amyantis commented Mar 6, 2020

Hi,

The Dask API provides a method to compute quantiles of Series:
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.quantile

This is already a great feature, since distributed quantile computation is known to be a real challenge.

Unfortunately, it is not available through the aggregate API:
https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate

Would it be possible to provide quantile computation to Dask aggregate syntax?
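
For context, a rough sketch contrasting what already works with what this request asks for (the data and column names below are just illustrative):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]}),
    npartitions=2,
)

# This works today: a quantile over a whole column (approximate by default).
print(ddf["value"].quantile(0.25).compute())

# This is the request; it is not currently supported, so it stays commented out:
# ddf.groupby("group").value.agg(["mean", "quantile"])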

TomAugspurger (Member) commented

Can you provide a snippet with the input data and expected output? Is this in a groupby context, i.e. the Dask version of

In [12]: df = pd.DataFrame({"A": ['a', 'a', 'b'], "B": [1, 2, 3], "C": [4, 5, 4]})

In [13]: df.groupby("A").quantile()
Out[13]:
     B    C
A
a  1.5  4.5
b  3.0  4.0

One slight complication is that quantile isn't always an aggregation.

In [14]: df.groupby("A").quantile([0.5, 0.75])
Out[14]:
           B     C
A
a 0.50  1.50  4.50
  0.75  1.75  4.75
b 0.50  3.00  4.00
  0.75  3.00  4.00

It may still be doable though.

Amyantis (Author) commented Mar 6, 2020

Thank you for your answer.

Correct, this is in a groupby context.

Here is a snippet of what I would do using pure Pandas:

In [3]: import pandas as pd
   ...:
   ...: def q25(s):
   ...:     return s.quantile(0.25)
   ...:
   ...: df_pandas = pd.DataFrame({"car": [1, 1, 2, 4, 4, 4], "speed": [1, 2, 3, 4, 5, 6]})
   ...: df_pandas.groupby("car").speed.agg(["mean", q25])

Out[3]:
     mean   q25
car
1     1.5  1.25
2     3.0  3.00
4     5.0  4.50
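
Not the requested agg syntax, but for reference, a rough sketch of how one might reproduce the q25 number with Dask today via groupby(...).apply (the npartitions and meta values below are illustrative assumptions):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"car": [1, 1, 2, 4, 4, 4], "speed": [1, 2, 3, 4, 5, 6]})
ddf = dd.from_pandas(pdf, npartitions=2)

# groupby(...).apply shuffles each group into a single task, so it is not a
# tree-style aggregation, but it does give the exact per-group quantile.
q25 = ddf.groupby("car")["speed"].apply(
    lambda s: s.quantile(0.25),
    meta=("speed", "f8"),
)
print(q25.compute())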

leonardokidd commented

When I used df.groupby("A").quantile(0.5), I got a 'column not found quantile' error.

jsignell (Member) commented

@leonardokidd this issue is still open because the feature that it is requesting has not been implemented yet. If you are interested in implementing it please open a Pull Request.

istorch commented Jul 17, 2021

I came up with a workaround for a particular case. If the groups that you want to calculate quantiles on are relatively small, you can separate them by partition, then use a dask Aggregation. Here is an example:

import dask.dataframe as dd

# this puts different user ids on different partitions
# (df is an existing Dask DataFrame with a "user_id" column)
df = df.set_index("user_id")

median_fun = dd.Aggregation(
    name="median",
    # this computes the median of each group within each partition
    chunk=lambda s: s.median(),
    # this combines the per-partition results; since each group lives on a
    # single partition, every group contributes exactly one value, so sum()
    # just passes that value through
    agg=lambda s0: s0.sum(),
)

median_df = df.groupby("user_id")["some_column"].agg(median_fun)

It seemed to work for the data I was working with, but I haven't thoroughly tested it. I am curious to see if anyone finds any issues with it.

jsignell (Member) commented

Yeah that should work as long as all the data for each group is on exactly one partition. Nice!
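
For anyone trying this workaround, a small self-contained check against plain pandas (the data below is made up; it relies on set_index placing each user_id on a single partition):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {"user_id": [1, 1, 2, 3, 3, 3], "some_column": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
)

# set_index shuffles the data so each user_id lands on exactly one partition
ddf = dd.from_pandas(pdf, npartitions=3).set_index("user_id")

median_fun = dd.Aggregation(
    name="median",
    chunk=lambda s: s.median(),
    agg=lambda s0: s0.sum(),
)

dask_result = ddf.groupby("user_id")["some_column"].agg(median_fun).compute()
pandas_result = pdf.groupby("user_id")["some_column"].median()

# these should match because every group sits on exactly one partition
pd.testing.assert_series_equal(
    dask_result.sort_index(), pandas_result.sort_index(), check_names=False
)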
