Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] In some cases ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other) does not treat NaN values as unequal consistently #37515

Closed
sgilmore10 opened this issue Sep 1, 2023 · 5 comments · Fixed by #37579
Assignees
Milestone

Comments

@sgilmore10
Copy link
Member

Describe the bug, including details regarding any error messages, version, and platform.

While working on #37448, I noticed the Equals() method of ChunkedArray treats NaNs as equal by default. I found this behavior surprising because the Equals() method of Array doesn't treat NaNs as equal by default.

Below is a program to reproduce this behavior:

#include <arrow/api.h>
#include <arrow/type.h>

#include <iostream>
#include <math.h>
#include <sstream>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() {
    arrow::NumericBuilder<arrow::DoubleType> builder;
    
    std::shared_ptr<arrow::Array> array;
    ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4}));
    ARROW_RETURN_NOT_OK(builder.Finish(&array));
    
    return arrow::ChunkedArray::Make({array});
}

int main(int argc, char *argv[])
{
    auto maybe_chunked_array = make_chunked_array();
    if (!maybe_chunked_array.ok()) {
        return -1;
    }
    auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe();
    auto array = chunked_array->chunk(0);
    
    std::stringstream stream;
    stream << "array contents: ";
    stream << "\n\n";
    stream << array->ToString();
    stream << "\n\n";
    stream << "array->Equals(array): ";
    stream << array->Equals(array);
    stream << "\n\n";

    stream << "chunked_array contents: ";
    stream << "\n\n";
    stream << chunked_array->ToString();
    stream << "\n\n";
    stream << "chunked_array->Equals(chunked_array): ";
    stream << chunked_array->Equals(chunked_array);
    
    std::cout << stream.str() << std::endl;
}

Here is the output of this program:

array contents: 

[
  0,
  1,
  nan,
  2,
  4
]

array->Equals(array): 0

chunked_array contents: 

[
  [
    0,
    1,
    nan,
    2,
    4
  ]
]

chunked_array->Equals(chunked_array): 1

Am I correct in that this is a bug?

Component(s)

C++

@js8544
Copy link
Collaborator

js8544 commented Sep 1, 2023

It doesn't treat NaNs as equal. But it does assumes that chunked arrays at the same address are equal, where Array doesn't have such assumption. See https://github.com/apache/arrow/blob/main/cpp/src/arrow/chunked_array.cc#L115

@sgilmore10
Copy link
Member Author

sgilmore10 commented Sep 1, 2023

Ah, thanks for pointing that out! It's a bit weird that Equals(ChunkedArray& other) and Equals(std::shared_ptr<ChunkedArray>& other) may return different results. For example, if you modified the program above to also call Equals(ChunkedArray& other), this would be the output:

chunked_array->Equals(chunked_array): 1
chunked_array->Equals(*chunked_array): 0

I wonder if it's worth removing this optimization for consistency? Or at least removing it for datatypes that can have NaNs?

@js8544
Copy link
Collaborator

js8544 commented Sep 4, 2023

I agree that this optimization can be removed for the sake of consistency.

@sgilmore10
Copy link
Member Author

I'll work on this.

@sgilmore10
Copy link
Member Author

take

@sgilmore10 sgilmore10 changed the title [C++] ChunkedArray::Equals() treats NaNs as equal by default [C++] ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>&other) doesn't check for NaN values when the two objects have the same memory addresss Sep 5, 2023
@sgilmore10 sgilmore10 changed the title [C++] ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>&other) doesn't check for NaN values when the two objects have the same memory addresss [C++] ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other) doesn't handle NaN values correctly if other has the same address as this Sep 5, 2023
@sgilmore10 sgilmore10 changed the title [C++] ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other) doesn't handle NaN values correctly if other has the same address as this [C++] ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other) sometimes incorrectly handle NaN values Sep 5, 2023
@sgilmore10 sgilmore10 changed the title [C++] ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other) sometimes incorrectly handle NaN values [C++] In some cases ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other) does not treat NaN values as unequal consistently Sep 5, 2023
@kou kou closed this as completed in #37579 Sep 7, 2023
kou pushed a commit that referenced this issue Sep 7, 2023
…::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (#37579)

### Rationale for this change

`ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)`  and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results.

 The program below illustrates this inconsistency:

```c++
#include <arrow/api.h>
#include <arrow/type.h>

#include <iostream>
#include <math.h>
#include <sstream>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() {
    arrow::NumericBuilder<arrow::DoubleType> builder;
    
    std::shared_ptr<arrow::Array> array;
    ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4}));
    ARROW_RETURN_NOT_OK(builder.Finish(&array));
    
    return arrow::ChunkedArray::Make({array});
}

int main(int argc, char *argv[])
{
    auto maybe_chunked_array = make_chunked_array();
    if (!maybe_chunked_array.ok()) {
        return -1;
    }
    auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe();
    auto array = chunked_array->chunk(0);
    
    std::stringstream stream;

    stream << "chunked_array contents: ";
    stream << "\n\n";
    stream << chunked_array->ToString();
    stream << "\n\n";
    stream << "chunked_array->Equals(chunked_array): ";
    stream << chunked_array->Equals(chunked_array);
    stream << "chunked_array->Equals(*chunked_array): ";
    stream << chunked_array->Equals(*chunked_array);
    
    std::cout << stream.str() << std::endl;
}
```

Here is the output of this program:

```shell
chunked_array contents: 

[
  [
    0,
    1,
    nan,
    2,
    4
  ]
]

chunked_array->Equals(chunked_array): 1
chunked_array->Equals(*chunked_array): 0
```

### What changes are included in this PR?

Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF:
   - The two share the same address AND
   - They cannot have `NaN` values 

If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison.

### Are these changes tested?

Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`.

### Are there any user-facing changes?

Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values.

* Closes: #37515

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 14.0.0 milestone Sep 7, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579)

### Rationale for this change

`ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)`  and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results.

 The program below illustrates this inconsistency:

```c++
#include <arrow/api.h>
#include <arrow/type.h>

#include <iostream>
#include <math.h>
#include <sstream>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() {
    arrow::NumericBuilder<arrow::DoubleType> builder;
    
    std::shared_ptr<arrow::Array> array;
    ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4}));
    ARROW_RETURN_NOT_OK(builder.Finish(&array));
    
    return arrow::ChunkedArray::Make({array});
}

int main(int argc, char *argv[])
{
    auto maybe_chunked_array = make_chunked_array();
    if (!maybe_chunked_array.ok()) {
        return -1;
    }
    auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe();
    auto array = chunked_array->chunk(0);
    
    std::stringstream stream;

    stream << "chunked_array contents: ";
    stream << "\n\n";
    stream << chunked_array->ToString();
    stream << "\n\n";
    stream << "chunked_array->Equals(chunked_array): ";
    stream << chunked_array->Equals(chunked_array);
    stream << "chunked_array->Equals(*chunked_array): ";
    stream << chunked_array->Equals(*chunked_array);
    
    std::cout << stream.str() << std::endl;
}
```

Here is the output of this program:

```shell
chunked_array contents: 

[
  [
    0,
    1,
    nan,
    2,
    4
  ]
]

chunked_array->Equals(chunked_array): 1
chunked_array->Equals(*chunked_array): 0
```

### What changes are included in this PR?

Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF:
   - The two share the same address AND
   - They cannot have `NaN` values 

If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison.

### Are these changes tested?

Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`.

### Are there any user-facing changes?

Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values.

* Closes: apache#37515

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579)

### Rationale for this change

`ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)`  and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results.

 The program below illustrates this inconsistency:

```c++
#include <arrow/api.h>
#include <arrow/type.h>

#include <iostream>
#include <math.h>
#include <sstream>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() {
    arrow::NumericBuilder<arrow::DoubleType> builder;
    
    std::shared_ptr<arrow::Array> array;
    ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4}));
    ARROW_RETURN_NOT_OK(builder.Finish(&array));
    
    return arrow::ChunkedArray::Make({array});
}

int main(int argc, char *argv[])
{
    auto maybe_chunked_array = make_chunked_array();
    if (!maybe_chunked_array.ok()) {
        return -1;
    }
    auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe();
    auto array = chunked_array->chunk(0);
    
    std::stringstream stream;

    stream << "chunked_array contents: ";
    stream << "\n\n";
    stream << chunked_array->ToString();
    stream << "\n\n";
    stream << "chunked_array->Equals(chunked_array): ";
    stream << chunked_array->Equals(chunked_array);
    stream << "chunked_array->Equals(*chunked_array): ";
    stream << chunked_array->Equals(*chunked_array);
    
    std::cout << stream.str() << std::endl;
}
```

Here is the output of this program:

```shell
chunked_array contents: 

[
  [
    0,
    1,
    nan,
    2,
    4
  ]
]

chunked_array->Equals(chunked_array): 1
chunked_array->Equals(*chunked_array): 0
```

### What changes are included in this PR?

Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF:
   - The two share the same address AND
   - They cannot have `NaN` values 

If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison.

### Are these changes tested?

Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`.

### Are there any user-facing changes?

Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values.

* Closes: apache#37515

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment