Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check duplicate keys dataframe rename column #850

Conversation

pcapel
Copy link
Contributor

@pcapel pcapel commented Feb 7, 2024

Adds a column pair check to the rename function prior to passing to Polars. Fixes #848

The root cause of the NIF panic was that Polars is unable to find the column a second time. So when given a rename with the same column name, the second parameter will always fail with ColumnNotFound.

That could be something that is fixed from within Polars, or we could mirror the error message more directly. Though I don't think that would work well as an error without being able to report the partially updated column names which are stuck down in the Rust code at that point.

I opted for a simple argument error which raises when a duplicated column is found. I assume that there are few enough parameters passed to a rename that iterating them isn't a performance hit in any way that matters.

Performing the check after the column pairs are parsed makes things simplest. I'm not sure if this is something that bears abstracting into the Shared module, but I named the function according to the patterns I saw there.

I'm not sold on my error message. If there's a consistent way that error messages are written, please advise on that.

I wrote a test for the case of a keyword list only, as the case of a map given to rename will raise a warning to the user alerting them to the duplicate key and it will be overwritten such that this error doesn't occur.

Copy link
Member

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor suggestion :)

lib/explorer/data_frame.ex Outdated Show resolved Hide resolved
@@ -3609,6 +3609,7 @@ defmodule Explorer.DataFrame do
df

pairs ->
maybe_raise_column_duplicate(pairs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pairs_map will already be free of duplicates. So maybe we can compare: Enum.count(pairs) == map_size(pairs_map) and raise if they mismatch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a fair point. It doesn't allow us to call out the specific column though, right? I felt that was a useful addition to the developer experience.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's helpful to not compute which are duplicates unless we have to. We could do something like:

if Enum.count(pairs) != map_size(pairs_map) do
  duplicates =
    pairs
    |> Enum.frequencies_by(fn {col, _} -> col end)
    |> Enum.filter(fn {_, count} -> count > 1 end)
    |> Enum.map(fn {col, _} -> col end)

  raise "found duplicate columns: #{duplicates}"
end

That way we don't traverse the list unless we already know there's an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I agree with this wholeheartedly, and I can make the change. However, there are two points I'm unclear on:

  1. Traversal of the pairs is probably constrained by the normal size of a dataframe. I would imagine that it would be relatively small in 99% of cases, and the use of rename/2 is most likely a 1-time operation. That leads me to think that it's a bit of a boondoggle to try and optimize overly much. That's why I accepted the change to use a list, as it's concise, readable, and unlikely to be a performance issue. I admit I am fairly naive in this space though, so this assumption may be poor.
  2. I wanted to verify that Enum.count/1 would not traverse the list. I'll trust if I'm told that's the case, but here's what I was able to dig up myself:
  • Enum.count/1 guarded for a list will call to Kernel.length/1
  • Kernel.length/1 will call :erlang.length/1, which is where I start to get a little shaky.
  • It looks like :erlang.length/1 is probably the ERTS length/1 function, whose docs don't mention complexity.

It's at this point I get a little lost, since I'm not great at digging into the erlang source code. I was hoping to find either the implementation for length/1 or a declaration for the List data structure which carried a len[gth]: field (similar to python) whereby it could be assumed that length/1 would be a constant time operation.

If anyone has the time to help point me in the right direction, I'd love to get a little more savvy on reading the "nuts and bolts" of erlang. Otherwise, as I mentioned, I'm happy to simply implement this as suggested.

p.s. Thank you for introducing the Enum.frequencies_by/2 function. Great Today I Learned (TIL)!

p.p.s to_column_pairs/3 actually traverses the list, so if we wanted to be particularly parsimonious, we could place the check there. I opted not do because it damages the readability, in my opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Your assumption is correct AFAIK. Readable > optimized for this function, 100%.
  2. Great point! I had it in my head that length was O(1). That appears to be incorrect: https://www.erlang.org/doc/efficiency_guide/commoncaveats#length-1. (Jose would know better than I, to please call me out if that's wrong.)

In light of those points, I retract my statement!

Though I'm glad it resulted in a TIL. And I wish I knew more about Erlang too. One thing at a time :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to request my changes, where we compare if Enum.count(pairs) != map_size(pairs_map) do. That's an O(n) operation (Enum.count/1 is O(n), map_size/1 is O(1)). I am afraid maybe_raise_column_duplicate/1 is O(n^2), as it traverses once and then again on col in seen.

Renames could be done programmatically too, for example, by downcasing 2048 columns. Still unlikely to matter but I think avoiding O(n^2) doesn't hurt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man, this has been a really engaging way to get into the Explorer/Elixir/Erlang internals. I wouldn't have seen that coming!

I'm happy to implement the simple solution proposed by José. I would note that it fails to deal with the fact that rename will also raise a NIF panic when there are duplicated target columns, e.g. rename(df, [a: "first", b: "first"]. This is a little trickier, because the most reasonable place to deal with that is in to_column_pairs/3. That leads down a rabbit hole dealing with the other places that this function is used.

The to_column_pairs/3 use in other locations expects to allow LazySeries as values. In that case, they basically all look the same to any Map or MapSet. So you end up throwing up errors from mutate_with and summarise_with, as well as their dependent macros.

All this is to say, I spent a lot of time thinking about how to get this to work before I saw a funny little output in one of the tests that I wrote:

thread '<unnamed>' panicked at src/dataframe.rs:536:39:
should rename: Duplicate(ErrString("duplicate column names found"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

And that immediately caught my attention. That's where I found this:

#[rustler::nif(schedule = "DirtyCpu")]
pub fn df_rename_columns(
    df: ExDataFrame,
    renames: Vec<(&str, &str)>,
) -> Result<ExDataFrame, ExplorerError> {
    let mut df = df.clone();
    for (original, new_name) in renames {
        df.rename(original, new_name).expect("should rename");
    }

    Ok(ExDataFrame::new(df))
}

The panic is called because of the expect and is causing the NIF panic. Swap that out for something like:

#[rustler::nif(schedule = "DirtyCpu")]
pub fn df_rename_columns(
    df: ExDataFrame,
    renames: Vec<(&str, &str)>,
) -> Result<ExDataFrame, ExplorerError> {
    let mut df = df.clone();
    for (original, new_name) in renames {
        match df.rename(original, new_name){
            Ok(df) => df,
            Err(e) => return Err(ExplorerError::Polars(e))
        };

    }

    Ok(ExDataFrame::new(df))
}

And you get a runtime error, with the same test output looking more like:

  1) test rename/2 with binary tuples and a target column that is duplicated (Explorer.DataFrameTest)
     test/explorer/data_frame_test.exs:2790
     Expected exception ErlangError but got RuntimeError (Polars Error: duplicate: duplicate column names found)
     code: assert_raise ErlangError, ~r"Erlang error: :nif_panicked", fn ->
     stacktrace:
       (explorer 0.9.0-dev) lib/explorer/polars_backend/shared.ex:77: Explorer.PolarsBackend.Shared.apply_dataframe/4
       test/explorer/data_frame_test.exs:2793: (test)

Now, I like the idea of knowing rust, but I honestly don't right now. This seems like the best solution to me because polars is already doing the check, but I would want someone who knows more about the interface between the two double checking this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I think after reading the docs on handling errors in Rust, this makes the most sense:

#[rustler::nif(schedule = "DirtyCpu")]
pub fn df_rename_columns(
    df: ExDataFrame,
    renames: Vec<(&str, &str)>,
) -> Result<ExDataFrame, ExplorerError> {
    let mut df = df.clone();
    for (original, new_name) in renames {
        df.rename(original, new_name)?;
    }

    Ok(ExDataFrame::new(df))
}

The downside from our perspective is that the error returned in several cases is still weird. You would end up seeing RuntimeError (Polars Error: not found: a) if you duplicated the :a column name, for example. It might make more sense to capture the error and derive the issue after the fact. This eliminates concern over performing a perhaps costly operation during the expected path, and pushes any extra computation into a sad path.

Thoughts?

Copy link
Member

@lkarthee lkarthee Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My view is RuntimeError Polars Error is better than nif_panic as it at least hints error has some thing with column a and it is duplicated. User will figure out something sooner or later.

Also regarding O(1), O(n), etc does it really matter for columns names as they will be less. Time complexity is kind of relevant for rows than columns. I agree with Jose on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcapel great call on handling it at the Rust side as well. I think the Rust change + the conditional check is the way to go.

pcapel and others added 5 commits February 14, 2024 11:38
* Check pairs for duplicated source column values
* Refactor the reduction to use a list instead of a map.
  This preserves the ability to name the duplicated column
  while simplifying the procedure.

Co-authored-by: Philip Sampaio <philip.sampaio@gmail.com>
* df.rename propagates error, resulting in RuntimeError for duplicated
  columns
@pcapel pcapel force-pushed the fix/check-duplicate-keys-dataframe-rename-column branch from a30b1c9 to c0a5afb Compare February 14, 2024 17:41
@pcapel pcapel requested a review from philss February 14, 2024 17:42
@josevalim josevalim merged commit 424909e into elixir-explorer:main Feb 14, 2024
4 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NIF Panic with duplicated key in rename
5 participants