Allow expressions in `group_by` #494

josevalim · 2023-02-01T09:40:45Z

df
|> DF.mutate(day_of_week: day_of_week(date))
|> DF.group_by([:day_of_week]

what do you think about supporting this:

df
|> DF.group_by(day_of_week: day_of_week(date))

They will be basically equivalent.

Thoughts?

The text was updated successfully, but these errors were encountered:

ahacking · 2023-02-02T06:21:08Z

Something to consider, is dplyr also permits per operation grouping using the by(selection)_ argument which is applied and scoped just for the operation. Ref:

Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of .by is to allow you to place that grouping specification alongside the code that actually uses it. As an added benefit, with .by you no longer need to remember to ungroup() after summarise(), and summarise() won't ever message you about how it's handling the groups!

I did run into this problem of locality of the grouping to the intent/assumptions of the operation I am performing. You may run into cases where you need to push and pop the active group so the callers assumptions on the dataframe returned are not violated.

Elixir favours explicitness so I see an operation local grouping being aligned to this ideal whereas group_by /ungroup is much easier to get wrong.

I would possibly go as far to say that the dataframe level group_by may be a bad idea and will lead to surprises and coding patterns of saving groups, ungrouping, applying new groups, ungrouping and restoring groups aka pushing and popping group specs.

In terms of group expressions is there any issue using any series however computed (including lead or lagged series) to be applied a group_by()/operation by() selection providing it has the same row count as the dataframe?

josevalim · 2023-02-02T08:18:26Z

That's a nice API but unfortunately we would have ambiguity with mutate:

DF.mutate(new_column: ..., by: ...)

So we would need to come up with a solution to that.

In terms of group expressions is there any issue using any series however computed (including lead or lagged series) to be applied a group_by()/operation by() selection providing it has the same row count as the dataframe?

I don't think so, you can either compute it with mutate or put an existing series with put.

ahacking · 2023-02-02T16:01:44Z

Yes good point, I believe that why dplyr uses the dot ".by" to distinguish the parameter from columns. I don't think that approach is all that clean as it would put restrictions/reservations on column names.

I thought something like apply_grouping_for(df, grouping, fn) might work but its a bit ugly having to use a function for the operation it's applied to.

Then I considered something like push_grouping()/pop_grouping() which is nicer from a pipe perspective but is not adding a whole lot of value over the existing group_by/ungroup and it is still a bit clumsy as the pop can easily be forgotten.

So there is nothing really compelling in any of those options unfortunately.

josevalim · 2023-02-02T16:07:07Z

One option I considered was:

mutate(
  foo: this |> that |> by(:user_id)
)

That would apply per column but we could optimize it a bit and do the grouping once in case several columns have the same by clause.

ahacking · 2023-02-03T07:37:37Z

I think that could work, intersting approach!

Would it be possible to also apply to a keyword list piping into the group by?

mutate([a: 7,  b: a * 2]
       |> this()
       |> that()
       |> group_by(:user_id)
)

That would ameliorate having to collect the group by clauses for efficiency and running into potential issues in changing the order of dependencies between the column values.

cigrainger · 2023-10-23T12:40:45Z

@josevalim I'm not sure how I overlooked this for most of a year (sorry!), but I don't have opposition to this but also don't see too much reason to do it. I think it's cool and reduces verbosity, seems clear enough to me. I also don't think I'd personally end up using it much. Maybe just because I'm old and crotchety and set in my ways.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow expressions in `group_by` #494

Allow expressions in `group_by` #494

josevalim commented Feb 1, 2023

ahacking commented Feb 2, 2023

josevalim commented Feb 2, 2023

ahacking commented Feb 2, 2023

josevalim commented Feb 2, 2023

ahacking commented Feb 3, 2023

cigrainger commented Oct 23, 2023 •

edited

Allow expressions in group_by #494

Allow expressions in group_by #494

Comments

josevalim commented Feb 1, 2023

ahacking commented Feb 2, 2023

josevalim commented Feb 2, 2023

ahacking commented Feb 2, 2023

josevalim commented Feb 2, 2023

ahacking commented Feb 3, 2023

cigrainger commented Oct 23, 2023 • edited

Allow expressions in `group_by` #494

Allow expressions in `group_by` #494

cigrainger commented Oct 23, 2023 •

edited