Check duplicate keys dataframe rename column #850

pcapel · 2024-02-07T19:12:22Z

Adds a column pair check to the rename function prior to passing to Polars. Fixes #848

The root cause of the NIF panic was that Polars is unable to find the column a second time. So when given a rename with the same column name, the second parameter will always fail with ColumnNotFound.

That could be something that is fixed from within Polars, or we could mirror the error message more directly. Though I don't think that would work well as an error without being able to report the partially updated column names which are stuck down in the Rust code at that point.

I opted for a simple argument error which raises when a duplicated column is found. I assume that there are few enough parameters passed to a rename that iterating them isn't a performance hit in any way that matters.

Performing the check after the column pairs are parsed makes things simplest. I'm not sure if this is something that bears abstracting into the Shared module, but I named the function according to the patterns I saw there.

I'm not sold on my error message. If there's a consistent way that error messages are written, please advise on that.

I wrote a test for the case of a keyword list only, as the case of a map given to rename will raise a warning to the user alerting them to the duplicate key and it will be overwritten such that this error doesn't occur.

philss

Just a minor suggestion :)

lib/explorer/data_frame.ex

josevalim · 2024-02-09T09:01:30Z

lib/explorer/data_frame.ex

@@ -3609,6 +3609,7 @@ defmodule Explorer.DataFrame do
        df

      pairs ->
+        maybe_raise_column_duplicate(pairs)


pairs_map will already be free of duplicates. So maybe we can compare: Enum.count(pairs) == map_size(pairs_map) and raise if they mismatch?

Ah, that's a fair point. It doesn't allow us to call out the specific column though, right? I felt that was a useful addition to the developer experience.

I think it's helpful to not compute which are duplicates unless we have to. We could do something like:

if Enum.count(pairs) != map_size(pairs_map) do duplicates = pairs |> Enum.frequencies_by(fn {col, _} -> col end) |> Enum.filter(fn {_, count} -> count > 1 end) |> Enum.map(fn {col, _} -> col end) raise "found duplicate columns: #{duplicates}" end

That way we don't traverse the list unless we already know there's an issue.

In general I agree with this wholeheartedly, and I can make the change. However, there are two points I'm unclear on:

Traversal of the pairs is probably constrained by the normal size of a dataframe. I would imagine that it would be relatively small in 99% of cases, and the use of rename/2 is most likely a 1-time operation. That leads me to think that it's a bit of a boondoggle to try and optimize overly much. That's why I accepted the change to use a list, as it's concise, readable, and unlikely to be a performance issue. I admit I am fairly naive in this space though, so this assumption may be poor.

I wanted to verify that Enum.count/1 would not traverse the list. I'll trust if I'm told that's the case, but here's what I was able to dig up myself:

Enum.count/1 guarded for a list will call to Kernel.length/1

Kernel.length/1 will call :erlang.length/1, which is where I start to get a little shaky.

It looks like :erlang.length/1 is probably the ERTS length/1 function, whose docs don't mention complexity.

It's at this point I get a little lost, since I'm not great at digging into the erlang source code. I was hoping to find either the implementation for length/1 or a declaration for the List data structure which carried a len[gth]: field (similar to python) whereby it could be assumed that length/1 would be a constant time operation.

If anyone has the time to help point me in the right direction, I'd love to get a little more savvy on reading the "nuts and bolts" of erlang. Otherwise, as I mentioned, I'm happy to simply implement this as suggested.

p.s. Thank you for introducing the Enum.frequencies_by/2 function. Great Today I Learned (TIL)!

p.p.s to_column_pairs/3 actually traverses the list, so if we wanted to be particularly parsimonious, we could place the check there. I opted not do because it damages the readability, in my opinion.

Your assumption is correct AFAIK. Readable > optimized for this function, 100%.

Great point! I had it in my head that length was O(1). That appears to be incorrect: https://www.erlang.org/doc/efficiency_guide/commoncaveats#length-1. (Jose would know better than I, to please call me out if that's wrong.)

In light of those points, I retract my statement!

Though I'm glad it resulted in a TIL. And I wish I knew more about Erlang too. One thing at a time :)

I'd still like to request my changes, where we compare if Enum.count(pairs) != map_size(pairs_map) do. That's an O(n) operation (Enum.count/1 is O(n), map_size/1 is O(1)). I am afraid maybe_raise_column_duplicate/1 is O(n^2), as it traverses once and then again on col in seen.

Renames could be done programmatically too, for example, by downcasing 2048 columns. Still unlikely to matter but I think avoiding O(n^2) doesn't hurt.

Man, this has been a really engaging way to get into the Explorer/Elixir/Erlang internals. I wouldn't have seen that coming!

I'm happy to implement the simple solution proposed by José. I would note that it fails to deal with the fact that rename will also raise a NIF panic when there are duplicated target columns, e.g. rename(df, [a: "first", b: "first"]. This is a little trickier, because the most reasonable place to deal with that is in to_column_pairs/3. That leads down a rabbit hole dealing with the other places that this function is used.

The to_column_pairs/3 use in other locations expects to allow LazySeries as values. In that case, they basically all look the same to any Map or MapSet. So you end up throwing up errors from mutate_with and summarise_with, as well as their dependent macros.

All this is to say, I spent a lot of time thinking about how to get this to work before I saw a funny little output in one of the tests that I wrote:

thread '<unnamed>' panicked at src/dataframe.rs:536:39: should rename: Duplicate(ErrString("duplicate column names found")) note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

And that immediately caught my attention. That's where I found this:

#[rustler::nif(schedule = "DirtyCpu")] pub fn df_rename_columns( df: ExDataFrame, renames: Vec<(&str, &str)>, ) -> Result<ExDataFrame, ExplorerError> { let mut df = df.clone(); for (original, new_name) in renames { df.rename(original, new_name).expect("should rename"); } Ok(ExDataFrame::new(df)) }

The panic is called because of the expect and is causing the NIF panic. Swap that out for something like:

#[rustler::nif(schedule = "DirtyCpu")] pub fn df_rename_columns( df: ExDataFrame, renames: Vec<(&str, &str)>, ) -> Result<ExDataFrame, ExplorerError> { let mut df = df.clone(); for (original, new_name) in renames { match df.rename(original, new_name){ Ok(df) => df, Err(e) => return Err(ExplorerError::Polars(e)) }; } Ok(ExDataFrame::new(df)) }

And you get a runtime error, with the same test output looking more like:

1) test rename/2 with binary tuples and a target column that is duplicated (Explorer.DataFrameTest) test/explorer/data_frame_test.exs:2790 Expected exception ErlangError but got RuntimeError (Polars Error: duplicate: duplicate column names found) code: assert_raise ErlangError, ~r"Erlang error: :nif_panicked", fn -> stacktrace: (explorer 0.9.0-dev) lib/explorer/polars_backend/shared.ex:77: Explorer.PolarsBackend.Shared.apply_dataframe/4 test/explorer/data_frame_test.exs:2793: (test)

Now, I like the idea of knowing rust, but I honestly don't right now. This seems like the best solution to me because polars is already doing the check, but I would want someone who knows more about the interface between the two double checking this.

Ok, so I think after reading the docs on handling errors in Rust, this makes the most sense:

#[rustler::nif(schedule = "DirtyCpu")] pub fn df_rename_columns( df: ExDataFrame, renames: Vec<(&str, &str)>, ) -> Result<ExDataFrame, ExplorerError> { let mut df = df.clone(); for (original, new_name) in renames { df.rename(original, new_name)?; } Ok(ExDataFrame::new(df)) }

The downside from our perspective is that the error returned in several cases is still weird. You would end up seeing RuntimeError (Polars Error: not found: a) if you duplicated the :a column name, for example. It might make more sense to capture the error and derive the issue after the fact. This eliminates concern over performing a perhaps costly operation during the expected path, and pushes any extra computation into a sad path.

Thoughts?

My view is RuntimeError Polars Error is better than nif_panic as it at least hints error has some thing with column a and it is duplicated. User will figure out something sooner or later.

Also regarding O(1), O(n), etc does it really matter for columns names as they will be less. Time complexity is kind of relevant for rows than columns. I agree with Jose on this.

@pcapel great call on handling it at the Rust side as well. I think the Rust change + the conditional check is the way to go.

* elixir-explorer#848

* Check pairs for duplicated source column values

* Refactor the reduction to use a list instead of a map. This preserves the ability to name the duplicated column while simplifying the procedure. Co-authored-by: Philip Sampaio <philip.sampaio@gmail.com>

* df.rename propagates error, resulting in RuntimeError for duplicated columns

test/explorer/data_frame_test.exs

josevalim · 2024-02-14T19:40:18Z

💚 💙 💜 💛 ❤️

philss reviewed Feb 8, 2024

View reviewed changes

lib/explorer/data_frame.ex Outdated Show resolved Hide resolved

josevalim reviewed Feb 9, 2024

View reviewed changes

pcapel and others added 5 commits February 14, 2024 11:38

Add failing test reproduction of issue

96cfca5

* elixir-explorer#848

Add column check for pairs

4ede84f

* Check pairs for duplicated source column values

Refactor maybe_raise_column_duplicate

70bca93

* Refactor the reduction to use a list instead of a map. This preserves the ability to name the duplicated column while simplifying the procedure. Co-authored-by: Philip Sampaio <philip.sampaio@gmail.com>

Refactor to a more performant check

2ce4eb7

Check for Err in result of df_rename_columns

c0a5afb

* df.rename propagates error, resulting in RuntimeError for duplicated columns

pcapel force-pushed the fix/check-duplicate-keys-dataframe-rename-column branch from a30b1c9 to c0a5afb Compare February 14, 2024 17:41

pcapel requested a review from philss February 14, 2024 17:42

josevalim reviewed Feb 14, 2024

View reviewed changes

test/explorer/data_frame_test.exs Outdated Show resolved Hide resolved

josevalim reviewed Feb 14, 2024

View reviewed changes

test/explorer/data_frame_test.exs Outdated Show resolved Hide resolved

josevalim reviewed Feb 14, 2024

View reviewed changes

test/explorer/data_frame_test.exs Outdated Show resolved Hide resolved

Apply suggestions from code review

275f913

josevalim merged commit 424909e into elixir-explorer:main Feb 14, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check duplicate keys dataframe rename column #850

Check duplicate keys dataframe rename column #850

pcapel commented Feb 7, 2024

philss left a comment

josevalim Feb 9, 2024

pcapel Feb 13, 2024

billylanchantin Feb 13, 2024

pcapel Feb 13, 2024

billylanchantin Feb 13, 2024

josevalim Feb 13, 2024

pcapel Feb 13, 2024

pcapel Feb 14, 2024

lkarthee Feb 14, 2024 •

edited

josevalim Feb 14, 2024

josevalim commented Feb 14, 2024

Check duplicate keys dataframe rename column #850

Check duplicate keys dataframe rename column #850

Conversation

pcapel commented Feb 7, 2024

philss left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lkarthee Feb 14, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

josevalim commented Feb 14, 2024

lkarthee Feb 14, 2024 •

edited