Add filter and filter_with to Series #728

Merged


billylanchantin
Contributor

Description

Adds

  • Series.filter
  • Series.filter_with

As was discussed in:

Support macros and _with variants?

If/how to include macro and _with-callback versions of functions is up for debate. Here's how I'm proposing it would work for filter:

require Explorer.Series, as: S

series = S.from_list([1, 2, 3])

S.filter_with(series, &(&1 |> S.remainder(2) |> S.equal(1)))
# vs.
S.filter(series, n: remainder(n, 2) == 1)

Happy to discuss :)

@josevalim
Member

So clean! 😍

The biggest question is the API. Personally, I am not the biggest fan of the macro version in this case, but I will gladly concede it is about taste.

Then we need to decide on the names. Today the names we use for DF and Series do not necessarily match. We have Series.sort and DF.arrange, Series.filter and DF.filter, and then DF.mutate with no equivalent for Series (which could be added in the same style as this function).

@billylanchantin / @cigrainger / @philss, so we need to decide:

  1. Are we going to add macro versions? Yes or no?

  2. If we add the macro version, they will most likely be named filter and filter_with. But if we don't add the macro version, do we call the non-macro version filter or filter_with?

  3. What are we going to name the arrange and mutate versions of Series? Will we try to mirror the names, or is it OK to assume they are different?

Please let me know your thoughts!

@billylanchantin
Contributor Author

@josevalim Beautifully summarized!

I like being able to write: filter(s, n: n > 2) (or similar), so my vote is pro-macro. But it certainly introduces some headaches. I think it's worth tackling them, but I really want to hear other thoughts since it's a big decision.

Since the later decisions hinge on including macros or not, I'll leave it at that for now. I think we can decide on the names once we reach a decision on macros.

@cigrainger
Member

This is great work. My 2 cents:

  1. Not a huge fan of macros here. Maybe just because they smell like Python lambdas. I find it very un-elixir feeling. In DataFrames the keyword list vibe feels like elixir and makes sense because it corresponds to column badges. With n: n > 2, it feels like a new anonymous function notation and I just don't feel good about that. But I'll admit the infix operators are a sticking point for a lot of people. And there's a continuity issue -- if you can use them in DF.filter you might be annoyed you can't in Series.filter.
  2. I'm not a fan of the _with unless required for macros. I prefer overloading. But I know there's a possibility where there will be a function that can't be overloaded and then we're a bit stuck. I still prefer just filter.
  3. I don't think it has to be consistent. Mutate in particular has quite a different meaning. I think map is great for series 'mutate'. Arrange is less different from sort and IIRC comes from the history of plyr and base R and avoiding namespace conflicts (more an issue in R). I'd actually lean more towards renaming arrange to sort and bringing them in line there.

@cigrainger
Member

Ah I've just put my finger on what bugs me about a macro here: it's too close to Enum and feels like the functions should apply to each element. But they apply to the entire series. I think it's confusing. I know it's a bit more verbose but I honestly don't think it's that ugly to do Series.filter(s, &Series.greater(&1, 2)).

@josevalim
Member

josevalim commented Nov 2, 2023

So map/filter/sort and then transform which actually goes element by element (as it works today)?

map should be straightforward to implement, but sort will require adding support for nulls_first in arrange (which should also be straightforward).

@billylanchantin
Contributor Author

billylanchantin commented Nov 2, 2023

So I'm gonna try to make the case for macros. Sorry for the text wall, but I wanna give it a fair shake!

The goals of macros on Series are:

  • Readability
  • Consistency

Used well (and sparingly), macros are more readable. Here's a good example from this project:

DF.mutate(df, c: a + b)
# vs.
DF.mutate_with(df, &[c: Series.add(&1["a"], &1["b"])])

The mutate one is much easier to read. It uses way fewer characters, and it uses the math-y infix operators which most folks are used to.

If we were to support macros on Series, we'd get that same benefit:

S.mutate(s, x: 1 + x + x**2/2)
# vs.
S.mutate_with(s, &(&1 |> Series.add(1) |> Series.add(&1 |> Series.pow(2) |> Series.divide(2))))

It's also consistent with what's available in DataFrames. DataFrames allow this syntax, and it seems like the lack of availability on Series was driven mostly by what Polars happened to make easy. I think there's agreement on this point:

And there's a continuity issue -- if you can use them in DF.filter you might be annoyed you can't in Series.filter.

Certainly when I started using the library, I wondered why macros were restricted to DataFrames.

In DataFrames the keyword list vibe feels like elixir and makes sense because it corresponds to column badges. With n: n > 2, it feels like a new anonymous function notation and I just don't feel good about that.

I feel like this is the strongest objection. I'd argue less that n: n > 2 is a perfect solution, and more that it (or something like it) is necessary and the readability is worth the cost. I considered a few other syntaxes:

# I'd be fine with this too, though it's not evocative of the `DF.mutate` syntax.
filter(s, n, n > 2)

# Less preferred: `filter_with` also uses anonymous functions and I think that's confusing.
filter(s, &(&1 > 2))

Also, I think a premise here is that an "Elixir purist" feel has some friction. That's why, I assume, macros were introduced to DataFrames in the first place (and same for Nx): long, math-y computations are hard to read when they're made up of multiple, piped function calls.

Ah I've just put my finger on what bugs me about a macro here: it's too close to Enum and feels like the functions should apply to each element. But they apply to the entire series. I think it's confusing.

I think I see this too, though I'm less bothered by it. There is a learning curve when you start using Explorer with the difference between Series.add(s, 1) vs. Series.transform(s, &(&1 + 1)), and why the former is preferred. But that's true across the entire library. The fact that macros are syntactic sugar for Series operations isn't much more confusing in this context than with DataFrames IMHO.
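To make the Series.add vs. Series.transform distinction concrete, here is a minimal sketch. It assumes Explorer can be fetched with Mix.install; the variable names are chosen only for illustration. Both calls produce the same values, but the first is a single series-level operation while the second calls the anonymous function once per element.

```elixir
# Assumption: network access so Mix.install can fetch Explorer.
Mix.install([:explorer])

alias Explorer.Series

s = Series.from_list([1, 2, 3])

# Series-level: one operation applied to the whole series by the backend.
added = Series.add(s, 1)

# Element-wise: the function runs in Elixir once for every value.
transformed = Series.transform(s, &(&1 + 1))

IO.inspect(Series.to_list(added))
IO.inspect(Series.to_list(transformed))
```

The series-level form is the one the query macros expand to, which is why the thread treats macros as sugar over Series functions rather than over per-element callbacks.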


I'll end by saying that, while obviously this is my first choice, I'll be happy with either decision! This is all good stuff and I don't think we can go wrong either way :)

@cigrainger
Member

Well that is a really convincing argument and I think I'm coming around. Thanks for laying it out so clearly @billylanchantin. I'm actually starting to come around to the x: x > 2 approach as well. If documented well I think it would be fine.

@josevalim @philss are there dangers lurking with macros here? For example, what happens if we use a macro Series.map inside a macro DF.mutate.

So map/filter/sort and then transform which actually goes element by element (as it works today)?

Yep I think that's the cleanest.

@josevalim
Member

Re: macros.

In this example:

DF.mutate(df, c: a + b)
# vs.
DF.mutate_with(df, &[c: Series.add(&1["a"], &1["b"])])

The ugliest part of mutate comes from accessing the columns. For example, if we could somehow magically bind the function argument names to columns, we could write this instead:

DF.mutate_with(df, fn a, b -> [c: Series.add(a, b)] end)

Which is a bit more acceptable. The issue is that series have no name to access, so we need to introduce an artificial name:

S.mutate(s, x: 1 + x + x**2/2)

In the example above, x: is being used for binding and not for naming a new column, while x: in DF is used for naming new columns, never binding. The usages are different and using x: for binding is very uncommon in Elixir.

A more Elixirish approach would be:

S.mutate(s) do
  x -> 1 + x + x**2/2
end

Or mirroring Ecto:

S.mutate(x <- s, 1 + x + x**2/2)

The last one is syntactically cleaner, IMO, but for col <- ... is already supported in queries to mean traversing whole columns (instead of each entry).

The other aspect of the macros are operator conveniences. I don't disagree the operator conveniences help but for series they have limited use because I can only perform series operations against myself. So only operators that work against myself are useful (and they are not that many).

Although I can't deny doing this sort of stuff with series would be neat: https://hexdocs.pm/explorer/Explorer.Query.html#module-conditionals

If this is really a concern, we can always document the approach used in this PR and say "hey, you want to do crazy stuff, convert it to a DF like this".


@josevalim @philss are there dangers lurking with macros here? For example, what happens if we use a macro Series.map inside a macro DF.mutate.

Series.map/filter/sort/transform for a lazy data frame should raise, so that's not a concern here.

@josevalim
Member

Btw, awesome input on the discussion @billylanchantin. You definitely brought up good points.

@billylanchantin
Contributor Author

Well that is a really convincing argument and I think I'm coming around. Thanks for laying it out so clearly @billylanchantin.

Btw, awesome input on the discussion @billylanchantin. You definitely brought up good points.

That made my morning. You all are very nice to work with :)

Ok, this is my takeaway: macros might be nice, but the syntax is a sticking point. There doesn't seem to be a way to do it without being confusing or non-idiomatic.

Assuming filter(s, n: n > 2) and filter(s, n, n > 2) are out, I've got two more ideas (then I'll drop it, I promise!).

  1. Do what Ecto does for piped bindings.

    # DataFrames
    DF.mutate(df, [a, b], [c: Series.add(a, b)])
    
    # Series
    S.mutate(s, [x], 1 + x + x**2/2)

    With the arity change (mutate and friends go from 2 to 3), it barely manages to be non-breaking. But it'd be a good amount of churn, so not ideal. Though the improved readability on the DataFrame macros is a nice bonus.

  2. Re:

    If this is really a concern, we can always document the approach used in this PR and say "hey, you want to do crazy stuff, convert it to a DF like this".

    What if that was the feature? Instead of:

    S.filter(s, n: n > 2)

    We supported:

    S.wrap(s, DF.filter(n: n > 2))

    or something. By making the promotion-then-demotion first class as opposed to an implementation detail, I think it makes the binding hack less magical. Plus, we wouldn't have to add any more Series macros besides that one. We get them for free:

    S.wrap(s, DF.mutate(x: 1 + x + x**2))

@josevalim
Member

I thought about the S.wrap style or even allowing series in DF.mutate directly. The sticking point is that you would still need to name them in both cases, so S.wrap above is not enough. :( And then if you do mutate and use a different name, then your DF has two names and you would need to know which one to return. If you want to disambiguate it via an option, you basically reimplemented all of DF.new |> DF.mutate |> DF.pull :D
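For reference, the DF.new |> DF.mutate |> DF.pull style of round trip mentioned above might look like the sketch below, shown here with DF.filter. It assumes Explorer can be fetched with Mix.install, and the column name n is an arbitrary choice made for this example. That arbitrary name is exactly the naming step the comment says cannot be avoided.

```elixir
# Assumption: network access so Mix.install can fetch Explorer.
Mix.install([:explorer])

require Explorer.DataFrame, as: DF
alias Explorer.Series

s = Series.from_list([1, 2, 3])

filtered =
  DF.new(n: s)          # promote the series to a one-column dataframe named "n"
  |> DF.filter(n > 2)   # use the dataframe query macro on that column
  |> DF.pull("n")       # demote back to a series by pulling the same name

IO.inspect(Series.to_list(filtered))
```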

@billylanchantin
Contributor Author

Ok, if we don't think there's a way to syntactically introduce a name for binding (a hard requirement) without leading to confusion, then I concede on macros :)


  1. Without macros, there's no need for _with.
  2. I'm good with diverging on names. Though my background is Pandas. I think a dplyr expert should ultimately decide on how the names feel.

I'll wait for consensus on names, etc. before I make any changes to the PR.

@benwilson512

benwilson512 commented Nov 3, 2023

In the example above, x: is being used for binding and not for naming a new column, while x: in DF is used for naming new columns, never binding. The usages are different and using x: for binding is very uncommon in Elixir.

At the risk of popping my head into an area wherein I am a novice, is this true? My sense is that the col: syntax in dataframes served as both an assignment and a binding. Specifically you can do:

iex(6)> df = Explorer.DataFrame.new(a: [4,5,6])
#Explorer.DataFrame<
  Polars[3 x 1]
  a integer [4, 5, 6]
>
iex(7)> Explorer.DataFrame.mutate(df, a: a * 2)
#Explorer.DataFrame<
  Polars[3 x 1]
  a integer [8, 10, 12]
>

If that's the case then the parallel for series seems actually rather clear. The name is more "anonymous" than a column but in both cases you are:

  1. Binding a value to a variable
  2. Assigning the result of the pseudo function to the column.

The sticking point is that you would still need to name them in both cases, so S.wrap above is not enough.

Right. But isn't that just down to the nature of the data structure? The ordinary expectation of operating on a series is that you are talking about quasi-anonymous values n whereas if you operate on a dataframe the expectation is that you operate on named columns. In either case the structure of the macro is the same:

thing_to_be_bound_and_assigned: thing_to_be_bound_and_assigned + MATH_GOES_HERE

and it's just the case that there is some variance in the structures between what constitutes thing_to_be_bound_and_assigned. When operating with a series it is not opinionated about the name of its values. When you operate on a dataframe it expects you to name values after columns. The semantics of the macro quasi-function is the same either way.

EDIT: OK I see the flaw in my argument. In dataframes, the binding of column names to variables happens regardless of what goes before the :. If what's before the : already exists then it overwrites, if it's new then it writes. That technically makes its role entirely focused on assignment, not binding. STILL I question whether that distinction really makes that much difference from a DX standpoint.

@josevalim
Member

josevalim commented Nov 5, 2023

At the risk of popping my head into an area wherein I am a novice, is this true? My sense is that the col: syntax in dataframes served as both an assignment and a binding.

It is an assignment after the fact but not for the current operation. The point is that in a dataframe, all bound names have been given before. There is no such thing for series, hence the need to bind and assign.

For example, I would prefer something like this filter(s, as: n, do: n > 2). But we are introducing another syntax and I am not really sure it is worth it. :( So to me, whatever we pick, I don't see the costs in increasing the API surface being worth the feature. But that's just my opinion :)

@josevalim
Member

Another idea. Since series are not named, we can do this:

filter(s, _ * 2)

where _ stands for the series.
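As a rough illustration of how a macro can give _ a meaning, here is a minimal sketch in plain Elixir. It is not Explorer's implementation: the module and macro names (PlaceholderSketch, with_placeholder) are hypothetical, and a plain integer stands in for the series. The point is only that the macro receives the quoted expression and can rewrite _ before the compiler ever sees it as an invalid expression.

```elixir
defmodule PlaceholderSketch do
  # Hypothetical macro: every bare `_` in `expr` is rewritten into a hidden
  # variable bound to `value`, so `_` reads as "the thing being operated on".
  defmacro with_placeholder(value, expr) do
    # A variable in this module's context, so it cannot clash with caller variables.
    var = Macro.var(:placeholder, __MODULE__)

    body =
      Macro.prewalk(expr, fn
        # A bare `_` is represented in the AST as {:_, meta, context}.
        {:_, _meta, ctx} when is_atom(ctx) -> var
        other -> other
      end)

    quote do
      unquote(var) = unquote(value)
      unquote(body)
    end
  end
end

defmodule PlaceholderDemo do
  require PlaceholderSketch

  # `_ * 2` is only valid here because the macro rewrites it at compile time.
  def run do
    PlaceholderSketch.with_placeholder(21, _ * 2)
  end
end

IO.inspect(PlaceholderDemo.run())
```

A real series macro would additionally need to translate operators such as * and > into the corresponding Series functions, which this sketch does not attempt.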

@billylanchantin
Contributor Author

where _ stands for the series.

Scala has entered the chat... (also I'm totally down 😄)

But we are introducing another syntax and ... I don't see the costs in increasing the API surface being worth the feature.

Yep this is the heart of it. Whatever we pick will introduce a new syntax. So the question to answer is: is the feature worth it? You get some cool stuff, but that's the price of admission.

@josevalim
Member

Of all syntaxes for filter proposed so far, which one is your favorite? Maybe we pick one and then put it for voting.

@cigrainger
Member

cigrainger commented Nov 6, 2023

Could we set a default name with the as:, do: approach? E.g. I'm thinking it could default to s (which I prefer over x or n to communicate that you're operating on series, not elements), say, so in practice you would only have to use do:?

@josevalim
Member

@cigrainger we can but to me all names are arbitrary unless we use a special token (such as _) to reduce the arbitrariness.

@cigrainger
Member

That's fair. I'm actually on board with that, even if I'm supporting macros here a bit begrudgingly. I think it's familiar enough from other languages. Would it not cause potential issues if someone inevitably pushes things too far and tries to write something longer that might have pattern matching?

@cigrainger
Member

Yeah okay I can get behind _.

@josevalim
Member

I don't think we can have pattern matching inside queries, so that wouldn't be a concern.

@cigrainger
Member

I'm on board then. I think it's the cleanest and most straightforward to use _. I appreciate that we've kept macros opt-in. I would love to use conditionals in queries for series. But generally I'm seeing this entirely as syntactic sugar to make things more approachable and create some continuity outside of dataframes.

@josevalim
Member

To recap:

Are we going to add macro versions? Yes or no?

Yes, with _ as the name.

If we add the macro version, they will most likely be named filter and filter_with.

filter and filter_with

What are we going to name the arrange and mutate versions of Series?

Now this is tricky. We already have sort and we should make sort a macro. So we need to either roll with arrange and arrange_with or figure something else. But that's the next task, I think we can move ahead with this one.

@billylanchantin
Contributor Author

I'm also on board with:

  • filter(s, _ > 2)
  • filter_with(s, &Series.greater(&1, 2))

I think it's a clean way to get around the ambiguous variable issue. (Plus it wins code golf! 😄)
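Put together, the two agreed forms might be used as in the sketch below. It assumes an Explorer release that includes this PR, fetched here with Mix.install; the variable names are illustrative only.

```elixir
# Assumption: network access, and an Explorer release that includes this PR.
Mix.install([:explorer])

# `require` is needed because Series.filter is a macro.
require Explorer.Series, as: S

s = S.from_list([1, 2, 3])

# Macro form: `_` stands for the series itself.
via_macro = S.filter(s, _ > 2)

# Callback form: the function receives the series and returns a boolean series.
via_callback = S.filter_with(s, &S.greater(&1, 2))

IO.inspect(S.to_list(via_macro))
IO.inspect(S.to_list(via_callback))
```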

We already have sort and we should make sort a macro. So we need to either roll with arrange and arrange_with or figure something else. But that's the next task, I think we can move ahead with this one.

Agreed. I'll make the change for the _ syntax on this PR and it'll hopefully be good to go. Then we can tackle the additional considerations elsewhere.

@josevalim
Member

Two nits then feel free to merge it!

@billylanchantin
Contributor Author

@josevalim Small hiccup: I can't merge PRs yet 😬

[Screenshot: Screen Shot 2023-11-07 at 5 46 42 PM]

@josevalim
Member

Please try again.

@billylanchantin billylanchantin merged commit f0d981d into elixir-explorer:main Nov 7, 2023
3 checks passed
@billylanchantin billylanchantin deleted the series-filter-with branch November 7, 2023 23:27