Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit Series.mask vs. Series.filter #726

Closed
billylanchantin opened this issue Oct 31, 2023 · 19 comments
Closed

Revisit Series.mask vs. Series.filter #726

billylanchantin opened this issue Oct 31, 2023 · 19 comments

Comments

@billylanchantin
Copy link
Contributor

Description

This discussion came up in another issue. I'm breaking it out into its own issue so it can be tracked independently.

Show/hide recap

@cigrainger:

And I also agree that it's confusing we have Series.mask/2 instead of Series.filter/2.

@philss or @josevalim I'm sure there was a convo about this but I can't remember what the reasoning was for this anymore. #326 (comment)

@josevalim

We changed the implementation and renamed at the same time but I am fine with reverting the name back to filter. :) It should be a quick change and we can add:

@deprecated "Use Explorer.Series.filter/2 instead"
def mask(s1, s2), do: filter(s1, s2) 

It will certainly be much easier to find.

@billylanchantin

It may be worth having both since mask/2 and filter/2 accept different datatypes: mask takes a boolean series while filter/2/filter_with/2 take a query/function. If you have the boolean series on hand, you'd want mask/2. But if you're finding that you need to build the boolean series e.g. with transform/2 only to pass it right into mask/2, filter/2 would be convenient.

@josevalim

The issue is that doing it with a function is horribly expensive and should be generally avoided.

@billylanchantin
Copy link
Contributor Author

billylanchantin commented Oct 31, 2023

The issue is that doing it with a function is horribly expensive and should be generally avoided.

I agree! I was mostly focused on making the DataFrame and Series APIs similar.

Polars as a few filter functions which take either an expression or a boolean series (mask). They're a bit inconsistent with how they do that, though:

Entity Lang Function Expr Mask
DataFrame Python DataFrame.filter
DataFrame Rust DataFrame::filter
LazyFrame Python LazyFrame.filter
LazyFrame Rust LazyFrame::filter
Series Python Series.filter
Series Rust Series::filter

Explorer OTOH introduced the concept of a mask function. I think this was a good call. It helps hint what the input should be:

  • filter functions accept expressions
  • mask functions accept masks

So if Explorer is to have a Series.filter, I think the least surprising choice would be to have it accept an expression as well. Unfortunately for the goal of a mask/filter distinction, I couldn't find a way in Polars to filter a Series on an expression (other than I guess wrapping it in a DataFrame then converting back?).

Here are the options I see:

# Option Pro Con
1. Keep things as they are Consistent meaning for mask Lack of filter surprises newcomers
2. Rename Series.mask to Series.filter Newcomers find filter quicker filter is inconsistent across Series and DataFrame
3. Keep Series.mask and hack together a Series.filter that accepts an expression Consistent and easy for newcomers to find Hacks are bad

My first choice is 3. Assuming the hack isn't terrible, it helps newcomers find the functionality and keeps the meanings consistent. We can also document that Series.mask should be preferred.

My second choice is 1. I think I favor consistency over newcomer surprise. We could possibly add an example to the docs to help newcomers find Series.mask if the question comes up often enough.

My last choice is 2. I think Series.filter working like DataFrame.mask while also having a DataFrame.filter would be confusing long term.

@josevalim
Copy link
Member

Honestly, an implementation of Series.filter(fn x -> x end) (or filter_with) where we put the series in a dataframe, filter it, and then read the series out, sounds very elegant to me and it would also be optimized quite cleanly and most likely the most efficient approach too. Polars may even have better APIs.

This would also allow us to introduce map (or mutate?) and arrange for individual series. filter/filter_with docs could point out to mask for when the lazy expression version is not enough. Thoughts?

@billylanchantin
Copy link
Contributor Author

Yeah I'd love to use Series.{filter,filter_with} if I could!

This would also allow us to introduce map (or mutate?) and arrange for individual series. filter/filter_with docs could point out to mask for when the lazy expression version is not enough. Thoughts?

For map/mutate, I know there's already Series.transform so we'll want to keep that in mind. But yes I agree. Assuming the overhead with wrapping-then-unwrapping is acceptable, we should be able to bring a lot of the DataFrame specific functionality over to Series.

I'm happy to try for a filter/filter_with PR to see how the idea shakes out.

@josevalim
Copy link
Member

mutate would be different than transform, because mutate would work on lazy series.

I would start with filter_with cause I am not sure how we can do filter without an anonymous function. In DFs we refer to them by column name but that's not an option here.

Please go ahead!

@billylanchantin
Copy link
Contributor Author

In DFs we refer to them by column name but that's not an option here.

Oh of course. I'll think about it, but nothing clean comes to immediately to mind.

@cigrainger
Copy link
Member

cigrainger commented Nov 1, 2023

Well that is a SUPER elegant solution. What a great idea. I'm in full support.

Edit: to clarify I mean the idea of converting to a df, applying functions, and converting back to a series.

@cigrainger
Copy link
Member

This would also allow us to introduce map (or mutate?) and arrange for individual series. filter/filter_with docs could point out to mask for when the lazy expression version is not enough. Thoughts?

What would arrange be for individual Series?

@josevalim
Copy link
Member

@cigrainger that was what i thought but i guess it doesn't make much sense since we would only arrange ourselves?

@cigrainger
Copy link
Member

cigrainger commented Nov 1, 2023

Exactly, I think it would just be a sort. Though maybe you could provide a sorter? E.g. on strings you could sort by ends_with or similar? (speaking of which I need to add some more string ops). But that could just be sort/2.

Edit: So in this case, in the background you're basically doing: DataFrame.new(s: s) |> DataFrame.mutate(new_s: some_func(s)) |> DataFrame.arrange(new_s) |> DataFrame.pull(s).

@josevalim
Copy link
Member

Correct. So I guess we could have some use cases?

@cigrainger
Copy link
Member

Definitely.

@josevalim
Copy link
Member

We will need to decide on the naming though. Today we have Series.sort. Should it be sort_with or arrange_with?

@cigrainger
Copy link
Member

Good question. I'm struggling to disentangle the macro aspect from the motivation to introduce _with variants. I don't have a strong opinion about sort vs arrange here (largely because in R they're different anyway -- sorting a vector in R is done with base R's sort). Do we need a _with or could we just make it multi-arity?

@josevalim
Copy link
Member

That's a good discussion. We don't need _with in series. It depends on how consistent we want to be with DF and within ourselves. The question is: can we overload sort? Would the two options (direction and nils) apply to our function-based version?

@cigrainger
Copy link
Member

I think they do apply. We have a direction selector in DataFrame.arrange. I think we could also use nils in DataFrame.arrange.

@billylanchantin
Copy link
Contributor Author

Thought: for the macro versions, what if we did like Ecto and had them provide the column as an argument? E.g. we make a filter/3:

dates =
  [~D[2023-11-01], ~D[2023-11-02], ~D[2023-11-03]]
  |> S.from_list()
  |> S.filter(date, date > ^~D[2023-11-01])

@josevalim
Copy link
Member

That would work but, at the same time, everyone is using pipelines to transfrom series today anyway, so requesting a pipeline inside the anonymous function is not that bad.

@billylanchantin
Copy link
Contributor Author

I think #728 closed this issue. We discussed several other additions to Series in that PR. Shall I close this issue and open another to track those additions?

@josevalim
Copy link
Member

Yeah!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants