[Feature request] aggregate syntax and quantile computation #5986
Can you provide a snippet with the input data and expected output? Is this in a groupby context, i.e. the Dask version of:

```python
In [12]: df = pd.DataFrame({"A": ['a', 'a', 'b'], "B": [1, 2, 3], "C": [4, 5, 4]})

In [13]: df.groupby("A").quantile()
Out[13]:
     B    C
A
a  1.5  4.5
b  3.0  4.0
```

One slight complication is that quantile isn't always an aggregation:

```python
In [14]: df.groupby("A").quantile([0.5, 0.75])
Out[14]:
           B     C
A
a 0.50  1.50  4.50
  0.75  1.75  4.75
b 0.50  3.00  4.00
  0.75  3.00  4.00
```

It may still be doable though.
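The complication runs deeper than the multi-quantile output shape: a quantile generally cannot be combined from per-partition results. Here is a toy standard-library sketch (no pandas or Dask involved, hypothetical data) showing that the median of per-chunk medians differs from the true median:

```python
from statistics import median

# one group's values, and the same values split across two hypothetical partitions
data = [1, 2, 3, 4, 100]
chunks = [[1, 2], [3, 4, 100]]

true_median = median(data)                    # median over all values: 3
chunk_medians = [median(c) for c in chunks]   # per-partition medians: [1.5, 4]
naive_combined = median(chunk_medians)        # naively combining them: 2.75, wrong

print(true_median, naive_combined)
```

This is why distributed quantiles typically need approximate algorithms or a shuffle, rather than the simple chunk/combine pattern that works for sums and counts.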
Thank you for your answer. Correct, this is in a groupby context. Here is a snippet of what I would do using pure pandas:

```python
In [3]: import pandas as pd
   ...:
   ...: def q25(s):
   ...:     return s.quantile(0.25)
   ...:
   ...: df_pandas = pd.DataFrame({"car": [1, 1, 2, 4, 4, 4], "speed": [1, 2, 3, 4, 5, 6]})
   ...: df_pandas.groupby("car").speed.agg(["mean", q25])
Out[3]:
     mean   q25
car
1     1.5  1.25
2     3.0  3.00
4     5.0  4.50
```
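For reference, those per-group numbers can be reproduced with the standard library alone. This is a sketch assuming pandas' default linear interpolation for `quantile`, with the groupby done by hand using a dict; the `quantile` helper here is an illustration, not a pandas or Dask API:

```python
from collections import defaultdict

def quantile(values, q):
    # linear interpolation between order statistics (pandas' default behavior)
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

cars = [1, 1, 2, 4, 4, 4]
speeds = [1, 2, 3, 4, 5, 6]

# group speeds by car
groups = defaultdict(list)
for car, speed in zip(cars, speeds):
    groups[car].append(speed)

for car, vals in sorted(groups.items()):
    print(car, sum(vals) / len(vals), quantile(vals, 0.25))
```

This matches the mean and q25 columns of the pandas output above.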
When I used df.groupby("A").quantile(0.5), I got a 'column not found quantile' error.
@leonardokidd this issue is still open because the feature it requests has not been implemented yet. If you are interested in implementing it, please open a Pull Request.
I came up with a workaround for a particular case. If the groups that you want to calculate quantiles on are relatively small, you can separate them by partition, then use a Dask Aggregation. Here is an example:

```python
# this puts different user ids on different partitions
df = df.set_index("user_id")

median_fun = dd.Aggregation(
    name="median",
    # this computes the median on each partition
    chunk=lambda s: s.median(),
    # this combines results across partitions; the input should just be a list of length 1
    agg=lambda s0: s0.sum(),
)

median_df = df.groupby("user_id")["some_column"].agg(median_fun)
```

It seemed to work for the data I was working with, but I haven't thoroughly tested it. I am curious to see if anyone finds any issues with it.
Yeah that should work as long as all the data for each group is on exactly one partition. Nice! |
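That one-partition condition can be illustrated without Dask. Below is a plain-Python sketch of the chunk/agg protocol the workaround relies on, with hypothetical data and `statistics.median` standing in for `s.median()`:

```python
from statistics import median

def workaround_median(partitions):
    # mimic the dd.Aggregation above: 'chunk' runs per partition, 'agg' sums the
    # chunk results, matching chunk=lambda s: s.median() and agg=lambda s0: s0.sum()
    chunk_results = [median(p) for p in partitions]
    return sum(chunk_results)

# all of a group's data on one partition: summing a single chunk result is the true median
print(workaround_median([[1, 2, 3, 4, 5]]))    # 3

# the same group split across two partitions: per-partition medians get summed, wrongly
print(workaround_median([[1, 2], [3, 4, 5]]))  # 5.5
```

So `set_index("user_id")` is doing the real work here: it guarantees each group's data lands on a single partition, making `agg`'s sum a no-op over a one-element list.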
Hi,

The Dask API provides a method to compute quantiles of a Series:
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.quantile

This is already a great thing, as distributed quantile computation is known to be a real challenge. Unfortunately, it is not available through the aggregate API:
https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate

Would it be possible to add quantile computation to Dask's aggregate syntax?