Update Polars to v0.36 #797

philss · 2024-01-04T00:15:27Z

The Polars team released version v0.36.2 of the Rust crates yesterday (2024-01-02), and we should bump the version on our side.

I started this work on the ps-bump-polars-to-v0.36 branch, but I found some issues, as well as things that were removed upstream and that we need to implement on our side.

fix Series.window_median/3 - done in Ps bump polars to v0.36 #798
fix Series.frequencies/1 - done in Ps bump polars to v0.36 #798
fix Series comparisons operations - done in Ps bump polars to v0.36 #798
fix DataFrame.join/3 using the outer strategy - done in Bump to v0.36.0 - fix join outer #802
fix DataFrame.describe/ - probably implement on our side - done in Bump to v0.36.0 - implement describe function #803

I'll leave the issue open, so if anyone wants to work on it, feel free to do so.
I should finish #794 before coming back to this.


lkarthee · 2024-01-04T10:57:10Z

Two test cases related to DataFrame.join/3 are failing. Is this due to the changes from #784?

1) test join/3 with a custom 'on' but with repeated column on left side - outer join (Explorer.DataFrame.LazyTest)
     test/explorer/data_frame/lazy_test.exs:1224
     ** (RuntimeError) Polars Error: not found: d
     code: assert DF.to_columns(df, atom_keys: true) == %{
     stacktrace:
       (explorer 0.8.0-dev) lib/explorer/polars_backend/shared.ex:35: Explorer.PolarsBackend.Shared.apply_dataframe/3
       (explorer 0.8.0-dev) lib/explorer/data_frame.ex:1873: anonymous fn/3 in Explorer.DataFrame.to_columns/2
       (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
       (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
       (explorer 0.8.0-dev) lib/explorer/data_frame.ex:1872: Explorer.DataFrame.to_columns/2
       test/explorer/data_frame/lazy_test.exs:1233: (test)

7) test join/3 with a custom 'on' but with repeated column on left side (Explorer.DataFrameTest)
     test/explorer/data_frame_test.exs:2070
     ** (RuntimeError) Polars Error: not found: d
     code: assert DF.to_columns(df2, atom_keys: true) == %{
     stacktrace:
       (explorer 0.8.0-dev) lib/explorer/polars_backend/shared.ex:35: Explorer.PolarsBackend.Shared.apply_dataframe/3
       (explorer 0.8.0-dev) lib/explorer/data_frame.ex:1873: anonymous fn/3 in Explorer.DataFrame.to_columns/2
       (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
       (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
       (explorer 0.8.0-dev) lib/explorer/data_frame.ex:1872: Explorer.DataFrame.to_columns/2
       test/explorer/data_frame_test.exs:2098: (test)

josevalim · 2024-01-04T15:02:08Z

@lkarthee it is fine to mirror the new outer join from Polars. We are getting an exception because the current join assumes that all columns from both arguments will be in the output, but it seems Polars changed it to drop duplicate columns. You will have to mirror this logic in Explorer.DataFrame.join and fix the tests.

billylanchantin · 2024-01-04T15:56:35Z

@philss Is your sense that we can fix these issues on main? Or should we branch off ps-bump-polars-to-v0.36? (I probably won't have time to look into specifics until tonight.)

philss · 2024-01-04T16:01:55Z

@billylanchantin I was thinking of keeping the work outside the main branch - branching from, or working directly in my branch. But we could implement "our version" of DF.describe/2 branching from main and merge it separately. WDYT?

billylanchantin · 2024-01-04T16:08:24Z

> I was thinking of keeping the work outside the main branch - branching from, or working directly in my branch.

That's what I was thinking too. I just wanted to make sure :)

> But we could implement "our version" of DF.describe/2 branching from main and merge it separately. WDYT?

I think it's fine either way. If I tackle any of the pieces I think I'll branch off yours. But others should feel free to do that one off main.

philss · 2024-01-04T19:46:31Z

FYI, the DF.describe/2 implementation in Polars is now written in Python: https://github.com/pola-rs/polars/blob/67cb6923b8546fc96bda2d28fce293bfe47561c6/py-polars/polars/dataframe/frame.py#L4363

We can use that as a reference :)

lkarthee · 2024-01-05T03:51:29Z

Logged a bug for outer_coalesce: pola-rs/polars#13450.

lkarthee · 2024-01-06T14:48:04Z

@josevalim One question I have about the outer join: Polars now returns new columns. Should we forward them to Explorer?

> The behavior has been changed to include the original join keys. Name clashes are solved by appending a suffix (_right by default) to the right join key name.

For example, L1_right in the case below?

>>> df1.join(df2, on="L1", how="outer")
shape: (4, 4)
┌──────┬──────┬──────────┬──────┐
│ L1   ┆ L2   ┆ L1_right ┆ R2   │
│ ---  ┆ ---  ┆ ---      ┆ ---  │
│ str  ┆ i64  ┆ str      ┆ i64  │
╞══════╪══════╪══════════╪══════╡
│ a    ┆ 1    ┆ a        ┆ 7    │
│ b    ┆ 2    ┆ null     ┆ null │
│ c    ┆ 3    ┆ c        ┆ 8    │
│ null ┆ null ┆ d        ┆ 9    │
└──────┴──────┴──────────┴──────┘
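To make the new column layout concrete, here is a minimal pure-Python sketch of an outer join that keeps both join keys and suffixes the clashing right-hand name. It is not how Polars or Explorer implement the join; `outer_join` and the sample rows are hypothetical and only mirror the table above.

```python
def outer_join(left, right, on, suffix="_right"):
    """Full outer join of two lists of row dicts, keeping both join-key columns.

    Hypothetical helper for illustration only; assumes both inputs are non-empty.
    """
    left_cols = list(left[0].keys())
    right_cols = list(right[0].keys())
    # Right-hand columns whose names clash with a left-hand column get the suffix.
    renamed = {c: (c + suffix if c in left_cols else c) for c in right_cols}

    rows = []
    matched = set()
    for lrow in left:
        # Null keys never match, as in SQL-style joins.
        hits = [
            i for i, rrow in enumerate(right)
            if lrow[on] is not None and rrow[on] == lrow[on]
        ]
        for i in hits:
            matched.add(i)
            rows.append({**lrow, **{renamed[c]: right[i][c] for c in right_cols}})
        if not hits:
            # Left row without a partner: right-hand side is all null.
            rows.append({**lrow, **{renamed[c]: None for c in right_cols}})

    # Right rows without a partner: left-hand side is all null.
    for i, rrow in enumerate(right):
        if i not in matched:
            nulls = {c: None for c in left_cols}
            rows.append({**nulls, **{renamed[c]: rrow[c] for c in right_cols}})
    return rows


df1 = [{"L1": "a", "L2": 1}, {"L1": "b", "L2": 2}, {"L1": "c", "L2": 3}]
df2 = [{"L1": "a", "R2": 7}, {"L1": "c", "R2": 8}, {"L1": "d", "R2": 9}]
joined = outer_join(df1, df2, on="L1")
```

The point for Explorer is that the output has one more column (`L1_right`) than the old behaviour, so any code that precomputes the output names has to account for it.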

josevalim · 2024-01-06T15:08:05Z

Yes!

lkarthee · 2024-01-07T08:56:22Z

I have fixed the outer join in the latest PR.

Can I fix the describe function? @philss or @billylanchantin, are you working on it?

josevalim · 2024-01-07T08:58:10Z

@lkarthee please go ahead, I don't think either of them will reply soon due to timezones :)

billylanchantin · 2024-01-07T16:20:42Z

@lkarthee Yeah go for it! I actually tried yesterday morning, but I spun my wheels trying to make it "elegant". More than happy to let you take over :)

EDIT: Based on what I tried yesterday, my advice would be to just calculate what you need and move on. I kept trying to be clever with loops but I couldn't make it work.

lkarthee · 2024-01-07T18:11:46Z

Thank you @billylanchantin .

I have a draft Rust version (I have to test it more). I am exploring whether it can be implemented with DF.summarise() in Elixir. That took me down the rabbit hole of trying to achieve lit(None) (a null Series) in Explorer from Elixir. NullType is missing from the dtypes; I recently logged a bug, #783, relating to this.

Is there any way we can achieve lit(None) with existing api ?

billylanchantin · 2024-01-07T18:19:05Z

@lkarthee

This is what I tried that didn't work (I hadn't gotten to percentiles yet).


  def describe(%DataFrame{} = df, _percentiles) do
    require Explorer.DataFrame, as: DF

    numeric_dtypes = Explorer.Shared.numeric_types()

    ordered_dtypes =
      List.flatten([
        # [:date, :string],
        [:date],
        numeric_dtypes,
        Explorer.Shared.datetime_types(),
        Explorer.Shared.duration_types()
      ])

    metrics = [
      count: %{dtypes: nil, fun: &Explorer.Series.n_distinct/1},
      nil_count: %{dtypes: nil, fun: &Explorer.Series.nil_count/1},
      mean: %{dtypes: numeric_dtypes, fun: &Explorer.Series.mean/1},
      std: %{dtypes: numeric_dtypes, fun: &Explorer.Series.standard_deviation/1},
      min: %{dtypes: ordered_dtypes, fun: &Explorer.Series.min/1},
      max: %{dtypes: ordered_dtypes, fun: &Explorer.Series.max/1}
    ]

    metric_dfs =
      for {_metric, %{dtypes: dtypes, fun: fun}} <- metrics do
        if dtypes == nil do
          DF.summarise(df, for(s <- across(), do: {s.name, ^fun.(s)}))
        else
          metric_df =
            DF.summarise(df, for(s <- across(), s.dtype in ^dtypes, do: {s.name, ^fun.(s)}))

          # Manually add `nil` to all non-computed columns.
          metric_df =
            Enum.reduce(df.names, metric_df, fn col, acc ->
              if col not in acc.names, do: DF.put(acc, col, [nil]), else: acc
            end)

          metric_df[df.names]
        end
      end

    metric_df =
      metric_dfs
      |> DF.concat_rows()
      |> DF.put(:describe, metrics |> Keyword.keys() |> Enum.map(&Atom.to_string/1))

    metric_df[["describe"] ++ df.names]
  end

Which gives you (notice the string columns aren't handled right):

# test/explorer/data_frame_test.exs:3321
df = DF.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
df1 = DF.describe(df)

# +-------------------------------------------+
# | Explorer DataFrame: [rows: 6, columns: 4] |
# +-------------+---------+---------+---------+
# |  describe   |    a    |    b    |    c    |
# |  <string>   |  <f64>  |  <f64>  |  <f64>  |
# +=============+=========+=========+=========+
# | count       | 3.0     | 3.0     | 3.0     |
# +-------------+---------+---------+---------+
# | nil_count   | 1.0     | 0.0     | 0.0     |
# +-------------+---------+---------+---------+
# | mean        |         | 2.0     |         |
# +-------------+---------+---------+---------+
# | std         |         | 1.0     |         |
# +-------------+---------+---------+---------+
# | min         |         | 1.0     |         |
# +-------------+---------+---------+---------+
# | max         |         | 3.0     |         |
# +-------------+---------+---------+---------+

The issue was that our min/max functions don't work on dtype: :string (perhaps they should?). So I think we need to handle that case by hand.

One approach I thought of: use the Series.sort function to compute all order statistics (min, max, 25%, etc.). You'll need to compute nil_count first since those need to be excluded. But after that, arithmetic should give you the indices of each order statistic. Something like:

percentile_25_index = floor(0.25 * (length(s) - nil_count(s)))
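As a rough illustration of the sort-based idea above, here is a pure-Python sketch that drops nils, sorts once, and reads every order statistic by index using the same floor formula. The function name `order_statistics` is hypothetical, and the index rule is only the one from this comment; Polars' own quantile interpolation may return different values.

```python
import math


def order_statistics(values, percentiles=(0.25, 0.5, 0.75)):
    """Compute min, max and percentiles from a single sort, ignoring None.

    Hypothetical helper; uses index = floor(p * non_nil_count), clamped to the
    last valid position.
    """
    present = sorted(v for v in values if v is not None)
    n = len(present)  # same as length(s) - nil_count(s)
    if n == 0:
        # Nothing to order: every statistic is nil.
        return {"min": None, "max": None, **{p: None for p in percentiles}}

    stats = {"min": present[0], "max": present[-1]}
    for p in percentiles:
        index = min(math.floor(p * n), n - 1)
        stats[p] = present[index]
    return stats


string_stats = order_statistics(["d", None, "f"])
int_stats = order_statistics([1, 2, 3])
```

Because it only relies on sorting, the same code path would cover strings as well as numbers, which is the case the min/max functions currently do not handle.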

This was apparently attempted by someone on the Polars side, but they said they didn't get the performance improvements they expected:

Improve DataFrame.describe performance by sorting columns first pola-rs/polars#9368

> Is there any way we can achieve lit(None) with existing api ?

I got stuck on that too! You can see my workaround in my attempt. I don't know if there's a way we can do it easily on our side.

josevalim · 2024-01-07T18:28:57Z

👍 for adding null type. And I think it is safest to skip mean, std, min, and max for strings and other dtypes.

lkarthee · 2024-01-07T18:33:22Z

I have to expose this as Series.nil_(), and I'm still figuring out the cogs in the wheel. It's not working yet.

#[rustler::nif]
pub fn expr_nil_() -> ExExpr {
    ExExpr::new(Expr::Literal(LiteralValue::Null))
}

The code below works for a df with numeric types; the pivot is still pending. The exprs work, and currently the data is in columns.

  def describe(df, opts \\ []) do
    opts = Keyword.validate!(opts, percentiles: nil)

    if Enum.empty?(df.names) do
      raise ArgumentError, message: "cannot describe a DataFrame without any columns"
    end

    percentiles = process_percentiles(opts[:percentiles])
    numeric_dtypes = Shared.numeric_types()
    datetime_types = Shared.datetime_types()
    duration_types = Shared.duration_types()
    stat_cols = for {name, type} <- df.dtypes, type in numeric_dtypes, do: name

    min_max_cols =
      for {name, type} <- df.dtypes,
          type in numeric_dtypes or type in datetime_types or type in duration_types,
          do: name

    metrics = ["count", "null_count", "mean", "std", "min"]
    p_metrics = for p <- percentiles, do: "#{p * 100}%"
    # "max" goes last, after the percentiles, matching the order of the exprs below.
    metrics = metrics ++ p_metrics ++ ["max"]

    df_metrics =
      summarise_with(df, fn x ->
        counts_exprs = Enum.map(df.names, &{"count:#{&1}", Series.count(x[&1])})
        nil_counts_exprs = Enum.map(df.names, &{"nil_count:#{&1}", Series.nil_count(x[&1])})

        percentile_exprs =
          for p <- percentiles, c <- df.names do
            name = "#{p}:#{c}"

            if c in stat_cols do
              {name, Series.quantile(x[c], p)}
            else
              # Series.nil_() is written in Rust and exposed in expression.ex;
              # it probably has to be exposed on Series as well.
              {name, Series.nil_()}
            end
          end

        # TODO: handle Series.nil_() for the expressions below
        mean_exprs = for c <- stat_cols, do: {"mean:#{c}", Series.mean(x[c])}
        std_exprs = for c <- stat_cols, do: {"std:#{c}", Series.standard_deviation(x[c])}
        min_exprs = for c <- min_max_cols, do: {"min:#{c}", Series.min(x[c])}
        max_exprs = for c <- min_max_cols, do: {"max:#{c}", Series.max(x[c])}

        counts_exprs ++
          nil_counts_exprs ++
          mean_exprs ++ std_exprs ++ min_exprs ++ percentile_exprs ++ max_exprs
      end)
    # Reshape the wide result.
    row = head(df_metrics)
    # TODO: pivot columns to rows
  end

  def process_percentiles(nil), do: [0.25, 0.50, 0.75]

  def process_percentiles(percentiles) do
    Enum.each(percentiles, fn p ->
      if p < 0 or p > 1 do
        raise ArgumentError, message: "percentiles must all be in the range [0, 1]"
      end
    end)

    Enum.sort(percentiles)
  end

df = Explorer.DataFrame.new(a: [5,6,7], b: [1, 2, 3])
df2 = DF.describe(df)

#Explorer.DataFrame<
  Polars[1 x 18]
  count:a s64 [3]
  count:b s64 [3]
  nil_count:a s64 [0]
  nil_count:b s64 [0]
  mean:a f64 [6.0]
  mean:b f64 [2.0]
  std:a f64 [1.0]
  std:b f64 [1.0]
  min:a s64 [5]
  min:b s64 [1]
  0.25:a f64 [6.0]
  0.25:b f64 [2.0]
  0.5:a f64 [6.0]
  0.5:b f64 [2.0]
  0.75:a f64 [7.0]
  0.75:b f64 [3.0]
  max:a s64 [7]
  max:b s64 [3]
>
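The pending pivot step could look roughly like the following pure-Python sketch, which regroups the single wide row of `metric:column` values into one row per metric. It assumes metric names never contain a colon (true for the names above) and fills missing cells with `None`; `pivot_describe` and the sample dict are hypothetical and stand in for the DataFrame output.

```python
def pivot_describe(wide, columns):
    """Turn {"metric:column": value} into one row dict per metric.

    Hypothetical helper; rows keep the order in which metrics first appear.
    """
    rows = {}
    for key, value in wide.items():
        # Split on the first colon only, so column names may contain colons.
        metric, column = key.split(":", 1)
        if metric not in rows:
            # Start with every column set to None so skipped metrics stay nil.
            rows[metric] = {"describe": metric, **{c: None for c in columns}}
        rows[metric][column] = value
    return list(rows.values())


wide_row = {
    "count:a": 3, "count:b": 3,
    "nil_count:a": 0, "nil_count:b": 0,
    "mean:a": 6.0, "mean:b": 2.0,
    "min:a": 5, "min:b": 1,
    "max:a": 7, "max:b": 3,
}
long_rows = pivot_describe(wide_row, ["a", "b"])
```

Pre-filling each row with `None` also covers the non-numeric case: a column that only has count and nil_count entries simply keeps nil for every other metric.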

billylanchantin · 2024-01-07T18:33:51Z

👍 to skipping order statistics on strings. While technically possible, I don't think people usually care.

lkarthee · 2024-01-07T18:52:32Z

@billylanchantin Thank you for the pointers, I have completed the percentiles part. I have tried to mirror the Python code very closely. Hopefully I will figure out more about adding Series.nil_() tomorrow.

@josevalim One way to go is to exclude non_stat columns from describe and revisit this after the null type PR? Only two metrics will be relevant for non_stat columns: count and nil_count.

josevalim · 2024-01-07T19:51:54Z

> One way to go is to exclude non_stat columns from describe and revisit this after the null type PR? Only two metrics will be relevant for non_stat columns: count and nil_count.

Sounds good to me!

lkarthee · 2024-01-08T08:52:01Z

I went ahead with the Rust function; only the percentiles and pivot logic are in Elixir. It fails if there is a non-numeric column in the data frame. Please review the PR; I will fix the rest in the next PR.

lkarthee · 2024-01-08T17:51:24Z

Implemented describe for stat_cols in Elixir, and the tests are passing.

philss · 2024-01-08T21:13:33Z

The work here is complete, so I'm closing.

Thank you all for the contributions! 💜

lkarthee mentioned this issue Jan 4, 2024

Ps bump polars to v0.36 #798

Merged

philss mentioned this issue Jan 8, 2024

Update Polars to v0.36 #804

Merged

philss closed this as completed Jan 8, 2024
