Add `filter` and `filter_with` to `Series` #728

billylanchantin · 2023-11-02T15:44:23Z

Description

Adds

Series.filter
Series.filter_with

As was discussed in:

Revisit Series.mask vs. Series.filter #726

Support macros and `_with` variants?

If/how to include macro and _with-callback versions of functions is up for debate. Here's how I'm proposing it would work for filter:

require Explorer.Series, as: S

series = S.from_list([1, 2, 3])

S.filter_with(series, &(&1 |> S.remainder(2) |> S.equal(1)))
# vs.
S.filter(series, n: remainder(n, 2) == 1)

Happy to discuss :)

josevalim · 2023-11-02T16:49:24Z

So clean! 😍

The biggest question is the API. Personally, I am not the biggest fan of the macro version in this case, but I will gladly concede it is about taste.

Then we need to decide on the names. Today the names we use for DF and Series do not necessarily match. We have Series.sort and DF.arrange, Series.mask and DF.filter, and then DF.mutate with no equivalent for Series (which could be added in the same style as this function).

@billylanchantin / @cigrainger / @philss, so we need to decide:

1. Are we going to add macro versions? Yes or no?
2. If we add the macro version, they will most likely be named filter and filter_with. But if we don't add the macro version, do we call the non-macro version filter or filter_with?
3. What are we going to name the arrange and mutate versions of Series? Will we try to mirror the names, or is it OK to assume they are different?

Please let me know your thoughts!

billylanchantin · 2023-11-02T17:27:52Z

@josevalim Beautifully summarized!

I like being able to write: filter(s, n: n > 2) (or similar), so my vote is pro-macro. But it certainly introduces some headaches. I think it's worth tackling them, but I really want to hear other thoughts since it's a big decision.

Since the later decisions hinge on whether we include macros, I'll leave it at that for now. I think we can decide on the names once we reach a decision on macros.

cigrainger · 2023-11-02T20:14:11Z

This is great work. My 2 cents:

1. Not a huge fan of macros here. Maybe just because they smell like Python lambdas. I find it very un-Elixir feeling. In DataFrames the keyword list vibe feels like Elixir and makes sense because it corresponds to column names. With n: n > 2, it feels like a new anonymous function notation and I just don't feel good about that. But I'll admit the infix operators are a sticking point for a lot of people. And there's a continuity issue -- if you can use them in DF.filter you might be annoyed you can't in Series.filter.
2. I'm not a fan of the _with unless required for macros. I prefer overloading. But I know there's a possibility that there will be a function that can't be overloaded and then we're a bit stuck. I still prefer just filter.
3. I don't think it has to be consistent. Mutate in particular has quite a different meaning. I think map is great for series 'mutate'. Arrange is less different from sort and IIRC comes from the history of plyr and base R and avoiding namespace conflicts (more an issue in R). I'd actually lean more towards renaming arrange to sort and bringing them in line there.

cigrainger · 2023-11-02T20:29:50Z

Ah I've just put my finger on what bugs me about a macro here: it's too close to Enum and feels like the functions should apply to each element. But they apply to the entire series. I think it's confusing. I know it's a bit more verbose but I honestly don't think it's that ugly to do Series.filter(s, &Series.greater(&1, 2)).
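
The distinction drawn here (a callback applied once to the whole series, versus Enum-style application to each element) can be sketched with plain lists. This is only an illustration under stated assumptions: WholeSeriesSketch, greater/2, mask/2 and filter_with/2 below are hypothetical stand-ins written for this example, not Explorer's implementation.

```elixir
defmodule WholeSeriesSketch do
  # Hypothetical stand-in for a series-level comparison: it receives the
  # whole list and returns a boolean list of the same length.
  def greater(values, threshold), do: Enum.map(values, &(&1 > threshold))

  # Keeps the values whose corresponding flag is true.
  def mask(values, flags) do
    for {value, true} <- Enum.zip(values, flags), do: value
  end

  # The callback is invoked once with the whole list, not once per element.
  def filter_with(values, fun), do: mask(values, fun.(values))
end

# Whole-collection style: the callback sees [1, 2, 3] in one call.
WholeSeriesSketch.filter_with([1, 2, 3], &WholeSeriesSketch.greater(&1, 2))
# => [3]

# Enum style: the callback sees 1, then 2, then 3.
Enum.filter([1, 2, 3], &(&1 > 2))
# => [3]
```

Both calls return the same result, which is why the two styles are easy to confuse even though the callbacks receive different things.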

josevalim · 2023-11-02T20:46:01Z

So map/filter/sort and then transform which actually goes element by element (as it works today)?

map should be straightforward to implement, but sort will require adding support for nulls_first in arrange (which should also be straightforward).

billylanchantin · 2023-11-02T22:21:51Z

So I'm gonna try to make the case for macros. Sorry for the text wall, but I wanna give it a fair shake!

The goals of macros on Series are:

Readability
Consistency

Used well (and sparingly), macros are more readable. Here's a good example from this project:

DF.mutate(df, c: a + b)
# vs.
DF.mutate_with(df, &[c: Series.add(&1["a"], &1["b"])])

The mutate one is much easier to read. It uses way fewer characters, and it uses the math-y infix operators which most folks are used to.

If we were to support macros on Series, we'd get that same benefit:

S.mutate(s, x: 1 + x + x**2/2)
# vs.
S.mutate_with(s, &(&1 |> Series.add(1) |> Series.add(&1 |> Series.pow(2) |> Series.divide(2))))

It's also consistent with what's available in DataFrames. DataFrames allow this syntax, and it seems like the lack of availability on Series was driven mostly by what Polars happened to make easy. I think there's agreement on this point:

> And there's a continuity issue -- if you can use them in DF.filter you might be annoyed you can't in Series.filter.

Certainly when I started using the library, I wondered why macros were restricted to DataFrames.

> In DataFrames the keyword list vibe feels like Elixir and makes sense because it corresponds to column names. With n: n > 2, it feels like a new anonymous function notation and I just don't feel good about that.

I feel like this is the strongest objection. But I'd argue less that n: n > 2 is a perfect solution, but more that it (or something like it) is necessary and the readability is worth the cost. I considered a few other syntaxes:

# I'd be fine with this too, though it's not evocative of the `DF.mutate` syntax.
filter(s, n, n > 2)

# Less preferred: `filter_with` also uses anonymous functions and I think that's confusing.
filter(s, &(&1 > 2))

Also, I think a premise here is that an "Elixir purist" feel has some friction. That's why, I assume, macros were introduced to DataFrames in the first place (and same for Nx): long, math-y computations are hard to read when they're made up of multiple, piped function calls.

> Ah I've just put my finger on what bugs me about a macro here: it's too close to Enum and feels like the functions should apply to each element. But they apply to the entire series. I think it's confusing.

I think I see this too, though I'm less bothered by it. There is a learning curve when you start using Explorer with the difference between Series.add(s, 1) vs. Series.transform(s, &(&1 + 1)), and why the former is preferred. But that's true across the entire library. The fact that macros are syntactic sugar for Series operations isn't much more confusing in this context than with DataFrames IMHO.

I'll end by saying that, while obviously this is my first choice, I'll be happy with either decision! This is all good stuff and I don't think we can go wrong either way :)

cigrainger · 2023-11-03T05:09:37Z

Well that is a really convincing argument and I think I'm coming around. Thanks for laying it out so clearly @billylanchantin. I'm actually starting to come around to the x: x > 2 approach as well. If documented well I think it would be fine.

@josevalim @philss are there dangers lurking with macros here? For example, what happens if we use a macro Series.map inside a macro DF.mutate?

> So map/filter/sort and then transform which actually goes element by element (as it works today)?

Yep I think that's the cleanest.

josevalim · 2023-11-03T11:20:21Z

Re: macros.

In this example:

DF.mutate(df, c: a + b)
# vs.
DF.mutate_with(df, &[c: Series.add(&1["a"], &1["b"])])

The ugliest part of mutate comes from accessing the columns. For example, if we could somehow magically bind the function argument names to columns, we could write this instead:

DF.mutate_with(df, fn a, b -> [c: Series.add(a, b)] end)

Which is a bit more acceptable. The issue is that series have no name to access, so we need to introduce an artificial name:

S.mutate(s, x: 1 + x + x**2/2)

In the example above, x: is being used for binding and not for naming a new column, while x: in DF is used for naming new columns, never binding. The usages are different and using x: for binding is very uncommon in Elixir.

A more Elixirish approach would be:

S.mutate(s) do
  x -> 1 + x + x**2/2
end

Or mirroring Ecto:

S.mutate(x <- s, 1 + x + x**2/2)

The last one is syntactically cleaner, IMO, but for col <- ... is already supported in queries to mean traversing whole columns (instead of each entry).

The other aspect of the macros are operator conveniences. I don't disagree the operator conveniences help but for series they have limited use because I can only perform series operations against myself. So only operators that work against myself are useful (and they are not that many).

Although I can't deny doing this sort of stuff with series would be neat: https://hexdocs.pm/explorer/Explorer.Query.html#module-conditionals

If this is really a concern, we can always document the approach used in this PR and say "hey, you want to do crazy stuff, convert it to a DF like this".

> @josevalim @philss are there dangers lurking with macros here? For example, what happens if we use a macro Series.map inside a macro DF.mutate?

Series.map/filter/sort/transform for a lazy data frame should raise, so that's not a concern here.

josevalim · 2023-11-03T11:21:13Z

Btw, awesome input on the discussion @billylanchantin. You definitely brought up good points.

billylanchantin · 2023-11-03T14:36:34Z

> Well that is a really convincing argument and I think I'm coming around. Thanks for laying it out so clearly @billylanchantin.

> Btw, awesome input on the discussion @billylanchantin. You definitely brought up good points.

That made my morning. You all are very nice to work with :)

Ok, this is my takeaway: macros might be nice, but the syntax is a sticking point. There doesn't seem to be a way to do it without being confusing or non-idiomatic.

Assuming filter(s, n: n > 2) and filter(s, n, n > 2) are out, I've got two more ideas (then I'll drop it, I promise!).

Do what Ecto does for piped bindings.
```
# DataFrames
DF.mutate(df, [a, b], [c: Series.add(a, b)])

# Series
S.mutate(s, [x], 1 + x + x**2/2)
```
With the arity change (mutate and friends go from 2 to 3), it barely manages to be non-breaking. But it'd be a good amount of churn, so not ideal. Though the improved readability on the DataFrame macros is a nice bonus.
Re:

> If this is really a concern, we can always document the approach used in this PR and say "hey, you want to do crazy stuff, convert it to a DF like this".

What if that was the feature? Instead of:
```
S.filter(s, n: n > 2)
```
We supported:
```
S.wrap(s, DF.filter(n: n > 2))
```
or something. By making the promotion-then-demotion first class as opposed to an implementation detail, I think it makes the binding hack less magical. Plus, we wouldn't have to add any more Series macros besides that one. We get them for free:
```
S.wrap(s, DF.mutate(x: 1 + x + x**2))
```

josevalim · 2023-11-03T17:39:41Z

I thought about the S.wrap style or even allowing series in DF.mutate directly. The sticking point is that you would still need to name them in both cases, so S.wrap above is not enough. :( And then if you do mutate and use a different name, then your DF has two names and you would need to know which one to return. If you want to disambiguate it via an option, you basically reimplemented all of DF.new |> DF.mutate |> DF.pull :D

billylanchantin · 2023-11-03T21:07:43Z

Ok, if we don't think there's a way to syntactically introduce a name for binding (a hard requirement) without leading to confusion, then I concede on macros :)

Without macros, there's no need for _with.
I'm good with diverging on names. Though my background is Pandas. I think a dplyr expert should ultimately decide on how the names feel.

I'll wait for consensus on names, etc. before I make any changes to the PR.

benwilson512 · 2023-11-03T23:28:43Z

> In the example above, x: is being used for binding and not for naming a new column, while x: in DF is used for naming new columns, never binding. The usages are different and using x: for binding is very uncommon in Elixir.

At the risk of popping my head into an area wherein I am a novice, is this true? My sense is that the col: syntax in dataframes served as both an assignment and a binding. Specifically you can do:

iex(6)> df = Explorer.DataFrame.new(a: [4,5,6])
#Explorer.DataFrame<
  Polars[3 x 1]
  a integer [4, 5, 6]
>
iex(7)> Explorer.DataFrame.mutate(df, a: a * 2)
#Explorer.DataFrame<
  Polars[3 x 1]
  a integer [8, 10, 12]
>

If that's the case then the parallel for series seems actually rather clear. The name is more "anonymous" than a column but in both cases you are:

Binding a value to a variable
Assigning the result of the pseudo function to the column.

> The sticking point is that you would still need to name them in both cases, so S.wrap above is not enough.

Right. But isn't that just down to the nature of the data structure? The ordinary expectation of operating on a series is that you are talking about quasi-anonymous values n whereas if you operate on a dataframe the expectation is that you operate on named columns. In either case the structure of the macro is the same:

thing_to_be_bound_and_assigned: thing_to_be_bound_and_assigned + MATH_GOES_HERE)

and it's just the case that there is some variance between the structures in what constitutes thing_to_be_bound_and_assigned. When you operate on a series, it is not opinionated about the name of its values. When you operate on a dataframe, it expects you to name values after columns. The semantics of the macro quasi-function are the same either way.

EDIT: OK I see the flaw in my argument. In dataframes, the binding of column names to variables happens regardless of what goes before the :. If what's before the : already exists then it overwrites, if it's new then it writes. That technically makes its role entirely focused on assignment, not binding. STILL I question whether that distinction really makes that much difference from a DX standpoint.

josevalim · 2023-11-05T20:01:45Z

> At the risk of popping my head into an area wherein I am a novice, is this true? My sense is that the col: syntax in dataframes served as both an assignment and a binding.

It is an assignment after the fact but not for the current operation. The point is that in a dataframe, all bound names have been given before. There is no such thing for series, hence the need to bind and assign.

For example, I would prefer something like this filter(s, as: n, do: n > 2). But we are introducing another syntax and I am not really sure it is worth it. :( So to me, whatever we pick, I don't see the costs in increasing the API surface being worth the feature. But that's just my opinion :)

josevalim · 2023-11-05T20:05:03Z

Another idea. Since series are not named, we can do this:

filter(s, _ * 2)

where _ stands for the series.
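
To illustrate how a macro could let `_` stand for the series, here is a minimal sketch using only the Elixir standard library. Everything in it is an assumption made for illustration: UnderscoreSketch, its greater/2 and filter_with/2 helpers (which work on plain lists), and the filter/2 macro are hypothetical and do not show how Explorer implements this. The macro rewrites every `_` in the query into a variable bound to the whole collection and then delegates to the callback version.

```elixir
defmodule UnderscoreSketch do
  # Hypothetical helpers standing in for series operations on plain lists.
  def greater(values, threshold), do: Enum.map(values, &(&1 > threshold))

  def filter_with(values, fun) do
    for {value, true} <- Enum.zip(values, fun.(values)), do: value
  end

  # Hypothetical macro: replaces every `_` in `query` with a variable bound
  # to the whole collection, then delegates to filter_with/2.
  defmacro filter(values, query) do
    var = Macro.var(:series, __MODULE__)

    body =
      Macro.prewalk(query, fn
        {:_, _meta, context} when is_atom(context) -> var
        other -> other
      end)

    quote do
      UnderscoreSketch.filter_with(unquote(values), fn unquote(var) -> unquote(body) end)
    end
  end
end

defmodule UnderscoreDemo do
  require UnderscoreSketch

  def run do
    # `_` is rewritten at compile time, before Elixir would reject it.
    UnderscoreSketch.filter([1, 2, 3], UnderscoreSketch.greater(_, 2))
  end
end

UnderscoreDemo.run()
# => [3]
```

Because `_` cannot otherwise be read as a value in Elixir, using it as the placeholder avoids picking an arbitrary variable name for the binding.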

billylanchantin · 2023-11-05T22:02:50Z

> where _ stands for the series.

Scala has entered the chat... (also I'm totally down 😄)

> But we are introducing another syntax and ... I don't see the costs in increasing the API surface being worth the feature.

Yep this is the heart of it. Whatever we pick will introduce a new syntax. So the question to answer is: is the feature worth it? You get some cool stuff, but that's the price of admission.

josevalim · 2023-11-05T23:59:43Z

Of all syntaxes for filter proposed so far, which one is your favorite? Maybe we pick one and then put it for voting.

cigrainger · 2023-11-06T04:35:23Z

Could we set a default name with the as:, do: approach? E.g. I'm thinking it could default to s (which I prefer over x or n to communicate that you're operating on series, not elements), say, so in practice you would only have to use do:?

josevalim · 2023-11-06T07:56:10Z

@cigrainger we can but to me all names are arbitrary unless we use a special token (such as _) to reduce the arbitrariness.

cigrainger · 2023-11-06T08:44:26Z

That's fair. I'm actually on board with that even if a bit begrudgingly supporting macros here generally. I think it's familiar enough from other languages and everything. Would it not cause potential issues if someone inevitably pushes things too far and tries to write something longer that might have pattern matching?

cigrainger · 2023-11-06T08:48:20Z

Yeah okay I can get behind _.

josevalim · 2023-11-06T08:50:58Z

I don't think we can have pattern matching inside queries, so that wouldn't be a concern.

cigrainger · 2023-11-06T08:57:32Z

I'm on board then. I think it's the cleanest and most straightforward to use _. I appreciate that we've kept macros opt-in. I would love to use conditionals in queries for series. But generally I'm seeing this entirely as syntactic sugar to make things more approachable and create some continuity outside of dataframes.

josevalim · 2023-11-06T11:09:12Z

To recap:

> Are we going to add macro versions? Yes or no?

Yes, with _ as the name.

> If we add the macro version, they will most likely be named filter and filter_with.

filter and filter_with

> What are we going to name the arrange and mutate versions of Series?

Now this is tricky. We already have sort and we should make sort a macro. So we need to either roll with arrange and arrange_with or figure something else. But that's the next task, I think we can move ahead with this one.

billylanchantin · 2023-11-06T13:55:14Z

I'm also on board with:

filter(s, _ > 2)
filter_with(s, &Series.greater(&1, 2))

I think it's a clean way to get around the ambiguous variable issue. (Plus it wins code golf! 😄)

> We already have sort and we should make sort a macro. So we need to either roll with arrange and arrange_with or figure something else. But that's the next task, I think we can move ahead with this one.

Agreed. I'll make the change for the _ syntax on this PR and it'll hopefully be good to go. Then we can tackle the additional considerations elsewhere.

lib/explorer/series.ex

josevalim · 2023-11-07T22:22:54Z

Two nits then feel free to merge it!


billylanchantin · 2023-11-07T22:47:56Z

@josevalim Small hiccup: I can't merge PRs yet 😬

josevalim · 2023-11-07T23:24:24Z

Please try again.

billylanchantin added 2 commits November 2, 2023 11:17

filter_with

60cde77

filter (macro)

e3adf4b

switch to underscore syntax

daae376

billylanchantin commented Nov 7, 2023


josevalim reviewed Nov 7, 2023


josevalim approved these changes Nov 7, 2023


josevalim reviewed Nov 7, 2023


billylanchantin and others added 3 commits November 7, 2023 17:27

no need for curly braces

8850f12

Co-authored-by: José Valim <jose.valim@gmail.com>

you no what? no braces at all

1519639

reference mask/2

d17236d

billylanchantin merged commit f0d981d into elixir-explorer:main Nov 7, 2023
3 checks passed

billylanchantin deleted the series-filter-with branch November 7, 2023 23:27

This was referenced Nov 8, 2023

Revisit Series.mask vs. Series.filter #726

Closed

Additional _ macros in Series #730

Closed
