Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow expressions in group_by #494

Open
josevalim opened this issue Feb 1, 2023 · 6 comments
Open

Allow expressions in group_by #494

josevalim opened this issue Feb 1, 2023 · 6 comments

Comments

@josevalim
Copy link
Member

@cigrainger, instead of:

df
|> DF.mutate(day_of_week: day_of_week(date))
|> DF.group_by([:day_of_week]

what do you think about supporting this:

df
|> DF.group_by(day_of_week: day_of_week(date))

They will be basically equivalent.

Thoughts?

@ahacking
Copy link

ahacking commented Feb 2, 2023

Something to consider, is dplyr also permits per operation grouping using the by(selection)_ argument which is applied and scoped just for the operation. Ref:

Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of .by is to allow you to place that grouping specification alongside the code that actually uses it. As an added benefit, with .by you no longer need to remember to ungroup() after summarise(), and summarise() won't ever message you about how it's handling the groups!

I did run into this problem of locality of the grouping to the intent/assumptions of the operation I am performing. You may run into cases where you need to push and pop the active group so the callers assumptions on the dataframe returned are not violated.

Elixir favours explicitness so I see an operation local grouping being aligned to this ideal whereas group_by /ungroup is much easier to get wrong.

I would possibly go as far to say that the dataframe level group_by may be a bad idea and will lead to surprises and coding patterns of saving groups, ungrouping, applying new groups, ungrouping and restoring groups aka pushing and popping group specs.

In terms of group expressions is there any issue using any series however computed (including lead or lagged series) to be applied a group_by()/operation by() selection providing it has the same row count as the dataframe?

@josevalim
Copy link
Member Author

That's a nice API but unfortunately we would have ambiguity with mutate:

DF.mutate(new_column: ..., by: ...)

So we would need to come up with a solution to that.

In terms of group expressions is there any issue using any series however computed (including lead or lagged series) to be applied a group_by()/operation by() selection providing it has the same row count as the dataframe?

I don't think so, you can either compute it with mutate or put an existing series with put.

@ahacking
Copy link

ahacking commented Feb 2, 2023

Yes good point, I believe that why dplyr uses the dot ".by" to distinguish the parameter from columns. I don't think that approach is all that clean as it would put restrictions/reservations on column names.

I thought something like apply_grouping_for(df, grouping, fn) might work but its a bit ugly having to use a function for the operation it's applied to.

Then I considered something like push_grouping()/pop_grouping() which is nicer from a pipe perspective but is not adding a whole lot of value over the existing group_by/ungroup and it is still a bit clumsy as the pop can easily be forgotten.

So there is nothing really compelling in any of those options unfortunately.

@josevalim
Copy link
Member Author

One option I considered was:

mutate(
  foo: this |> that |> by(:user_id)
)

That would apply per column but we could optimize it a bit and do the grouping once in case several columns have the same by clause.

@ahacking
Copy link

ahacking commented Feb 3, 2023

I think that could work, intersting approach!

Would it be possible to also apply to a keyword list piping into the group by?

mutate([a: 7,  b: a * 2]
       |> this()
       |> that()
       |> group_by(:user_id)
)

That would ameliorate having to collect the group by clauses for efficiency and running into potential issues in changing the order of dependencies between the column values.

@cigrainger
Copy link
Member

cigrainger commented Oct 23, 2023

@josevalim I'm not sure how I overlooked this for most of a year (sorry!), but I don't have opposition to this but also don't see too much reason to do it. I think it's cool and reduces verbosity, seems clear enough to me. I also don't think I'd personally end up using it much. Maybe just because I'm old and crotchety and set in my ways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants