Series.filter should work inside DataFrame.summarise #927

Open
billylanchantin opened this issue Jun 13, 2024 · 5 comments
@billylanchantin
Contributor

Originally noted here:

Example:

require Explorer.DataFrame, as: DF

DF.new(a: [1, 2, 2], b: ["x", "y", "z"])
|> DF.group_by(:a)
|> DF.summarise(c: filter(b, _ != "z"))

yields:

** (ArgumentError) expected a variable to be given to var!, got: Explorer.DataFrame.pull(var!(df, Explorer.Query), :df)
    (elixir 1.16.0) expanding macro: Kernel.var!/2
    iex:5: (file)
    (explorer 0.8.3-dev) expanding macro: Explorer.Query.query/1
    iex:5: (file)
    (elixir 1.16.0) expanding macro: Kernel.|>/2
    iex:5: (file)
    (elixir 1.16.0) expanding macro: Kernel.|>/2
    iex:5: (file)
@josevalim
Member

I am not sure I agree. Wouldn’t that be the same as a DF.filter beforehand? In any case, we should at least improve the error message. :)

@billylanchantin
Contributor Author

In this case it'd be the same, but mine is just a minimal example. The original example from elixirforum isn't equivalent.

I don't see why we shouldn't support it. But if we can't for some reason, then definitely an improved error message is the way to go.

@mhanberg
Contributor

> I am not sure I agree. Wouldn’t that be the same as a DF.filter beforehand? In any case, we should at least improve the error message. :)

The group_by makes DF.filter not entirely viable without backfilling some column values after the fact.

For example, our current approach looks like the code below. In the future, we will also have 3 more of these aggregations.

I have to get the distinct values of sim_idx to use in a join later, so that we can backfill any group that the drop_nil removes entirely.

I believe that filtering a series inside summarise would make that

Really, what I want to do for each column of interest inside the group is "give me the first non-nil value or, if the series only has nil, then 'none'."

    sim_idx = data_frame |> DataFrame.distinct([:sim_idx])

    any_data_frame =
      data_frame
      |> DataFrame.mutate(
        any_id:
          if result in ["one", "two", "three", "four"] do
            person_id
          else
            nil
          end
      )
      |> DataFrame.drop_nil([:any_id])
      |> DataFrame.group_by(["sim_idx"])
      |> DataFrame.summarise(any: first(any_id))
      |> DataFrame.join(sim_idx, on: [:sim_idx], how: :right)

    two_data_frame =
      data_frame
      |> DataFrame.mutate(
        two_id:
          if result == "two" do
            person_id
          else
            nil
          end
      )
      |> DataFrame.drop_nil([:two_id])
      |> DataFrame.group_by(["sim_idx"])
      |> DataFrame.summarise(two: first(two_id))
      |> DataFrame.join(sim_idx, on: [:sim_idx], how: :right)

    DataFrame.join(any_data_frame, two_data_frame, on: [:sim_idx])
    |> DataFrame.mutate(
      any: fill_missing(any, "none"),
      two: fill_missing(two, "none")
    )
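
To make the intended per-group semantics concrete ("first non-nil value, or 'none' if the group has no match"), here is a minimal sketch in plain Elixir over a list of maps. It does not use Explorer at all; the module `GroupedFirstSketch`, the function `first_match_per_group/2`, and the sample rows are all hypothetical and only illustrate the result the workaround above is trying to reach.

```elixir
defmodule GroupedFirstSketch do
  # Hypothetical helper, not part of Explorer. For each sim_idx, return the
  # first non-nil person_id whose result is in `wanted`, or "none" when the
  # group has no such row.
  def first_match_per_group(rows, wanted) do
    rows
    |> Enum.group_by(& &1.sim_idx)
    |> Map.new(fn {sim_idx, group} ->
      match =
        Enum.find(group, fn row ->
          row.result in wanted and row.person_id != nil
        end)

      {sim_idx, if(match, do: match.person_id, else: "none")}
    end)
  end
end

# Made-up sample data: group 2 has no row with a wanted result.
rows = [
  %{sim_idx: 1, result: "one", person_id: "p1"},
  %{sim_idx: 1, result: "two", person_id: "p2"},
  %{sim_idx: 2, result: "five", person_id: "p3"}
]

any = GroupedFirstSketch.first_match_per_group(rows, ["one", "two", "three", "four"])
two = GroupedFirstSketch.first_match_per_group(rows, ["two"])

IO.inspect(any) # %{1 => "p1", 2 => "none"}
IO.inspect(two) # %{1 => "p2", 2 => "none"}
```

Because the grouping happens before the filtering, every sim_idx keeps an entry, so no distinct/join/fill_missing step is needed to restore groups that had no match.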

I might be misunderstanding, but the dplyr docs seem to imply that their API can do grouped filtering: https://dplyr.tidyverse.org/articles/grouping.html?q=summ#filter

@mhanberg
Contributor

But, as I send that, I see that DF.filter works with groups... which is what I think José was saying.

let me try that out 🤦

@mhanberg
Contributor

Yeah, so that method can work, but it seems like it is just my previous workaround rearranged.

I think the key thing is that the call to DF.summarise after the call to DF.filter will not produce a row for any group whose rows were all filtered out.
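
The difference in ordering can be illustrated with a small plain-Elixir sketch. It uses `Enum` on a list of maps rather than Explorer, so it only models the behaviour described above; the sample rows and variable names are made up for illustration.

```elixir
# Made-up sample data: group 3 contains only the value that gets filtered out.
rows = [
  %{a: 1, b: "x"},
  %{a: 2, b: "y"},
  %{a: 3, b: "z"}
]

# Filter first, then group: group 3 has no surviving rows, so it disappears
# and a later per-group aggregation never sees it.
filter_then_group =
  rows
  |> Enum.filter(fn row -> row.b != "z" end)
  |> Enum.group_by(& &1.a, & &1.b)

# Group first, then filter inside each group: group 3 survives with an empty
# list, so an aggregation can still emit a default value for it.
group_then_filter =
  rows
  |> Enum.group_by(& &1.a, & &1.b)
  |> Map.new(fn {a, bs} -> {a, Enum.filter(bs, &(&1 != "z"))} end)

IO.inspect(filter_then_group) # %{1 => ["x"], 2 => ["y"]}
IO.inspect(group_then_filter) # %{1 => ["x"], 2 => ["y"], 3 => []}
```

The second shape is what filtering a series inside summarise would presumably allow: the group key is kept even when nothing in the group passes the filter.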
