-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] In some cases ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other)
does not treat NaN
values as unequal consistently
#37515
Comments
It doesn't treat NaNs as equal. But it does assumes that chunked arrays at the same address are equal, where |
Ah, thanks for pointing that out! It's a bit weird that chunked_array->Equals(chunked_array): 1
chunked_array->Equals(*chunked_array): 0 I wonder if it's worth removing this optimization for consistency? Or at least removing it for datatypes that can have |
I agree that this optimization can be removed for the sake of consistency. |
I'll work on this. |
take |
ChunkedArray::Equals()
treats NaNs as equal by defaultChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>&other)
doesn't check for NaN values when the two objects have the same memory addresss
ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>&other)
doesn't check for NaN values when the two objects have the same memory addresssChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other)
doesn't handle NaN values correctly if other
has the same address as this
ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other)
doesn't handle NaN values correctly if other
has the same address as this
ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other)
sometimes incorrectly handle NaN values
ChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other)
sometimes incorrectly handle NaN valuesChunkedArray::Equals(std::shared_ptr<arrow::ChunkedArray>& other)
does not treat NaN
values as unequal consistently
…::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (#37579) ### Rationale for this change `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results. The program below illustrates this inconsistency: ```c++ #include <arrow/api.h> #include <arrow/type.h> #include <iostream> #include <math.h> #include <sstream> arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() { arrow::NumericBuilder<arrow::DoubleType> builder; std::shared_ptr<arrow::Array> array; ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4})); ARROW_RETURN_NOT_OK(builder.Finish(&array)); return arrow::ChunkedArray::Make({array}); } int main(int argc, char *argv[]) { auto maybe_chunked_array = make_chunked_array(); if (!maybe_chunked_array.ok()) { return -1; } auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe(); auto array = chunked_array->chunk(0); std::stringstream stream; stream << "chunked_array contents: "; stream << "\n\n"; stream << chunked_array->ToString(); stream << "\n\n"; stream << "chunked_array->Equals(chunked_array): "; stream << chunked_array->Equals(chunked_array); stream << "chunked_array->Equals(*chunked_array): "; stream << chunked_array->Equals(*chunked_array); std::cout << stream.str() << std::endl; } ``` Here is the output of this program: ```shell chunked_array contents: [ [ 0, 1, nan, 2, 4 ] ] chunked_array->Equals(chunked_array): 1 chunked_array->Equals(*chunked_array): 0 ``` ### What changes are included in this PR? Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF: - The two share the same address AND - They cannot have `NaN` values If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison. ### Are these changes tested? Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`. ### Are there any user-facing changes? Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values. * Closes: #37515 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579) ### Rationale for this change `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results. The program below illustrates this inconsistency: ```c++ #include <arrow/api.h> #include <arrow/type.h> #include <iostream> #include <math.h> #include <sstream> arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() { arrow::NumericBuilder<arrow::DoubleType> builder; std::shared_ptr<arrow::Array> array; ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4})); ARROW_RETURN_NOT_OK(builder.Finish(&array)); return arrow::ChunkedArray::Make({array}); } int main(int argc, char *argv[]) { auto maybe_chunked_array = make_chunked_array(); if (!maybe_chunked_array.ok()) { return -1; } auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe(); auto array = chunked_array->chunk(0); std::stringstream stream; stream << "chunked_array contents: "; stream << "\n\n"; stream << chunked_array->ToString(); stream << "\n\n"; stream << "chunked_array->Equals(chunked_array): "; stream << chunked_array->Equals(chunked_array); stream << "chunked_array->Equals(*chunked_array): "; stream << chunked_array->Equals(*chunked_array); std::cout << stream.str() << std::endl; } ``` Here is the output of this program: ```shell chunked_array contents: [ [ 0, 1, nan, 2, 4 ] ] chunked_array->Equals(chunked_array): 1 chunked_array->Equals(*chunked_array): 0 ``` ### What changes are included in this PR? Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF: - The two share the same address AND - They cannot have `NaN` values If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison. ### Are these changes tested? Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`. ### Are there any user-facing changes? Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values. * Closes: apache#37515 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579) ### Rationale for this change `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results. The program below illustrates this inconsistency: ```c++ #include <arrow/api.h> #include <arrow/type.h> #include <iostream> #include <math.h> #include <sstream> arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() { arrow::NumericBuilder<arrow::DoubleType> builder; std::shared_ptr<arrow::Array> array; ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4})); ARROW_RETURN_NOT_OK(builder.Finish(&array)); return arrow::ChunkedArray::Make({array}); } int main(int argc, char *argv[]) { auto maybe_chunked_array = make_chunked_array(); if (!maybe_chunked_array.ok()) { return -1; } auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe(); auto array = chunked_array->chunk(0); std::stringstream stream; stream << "chunked_array contents: "; stream << "\n\n"; stream << chunked_array->ToString(); stream << "\n\n"; stream << "chunked_array->Equals(chunked_array): "; stream << chunked_array->Equals(chunked_array); stream << "chunked_array->Equals(*chunked_array): "; stream << chunked_array->Equals(*chunked_array); std::cout << stream.str() << std::endl; } ``` Here is the output of this program: ```shell chunked_array contents: [ [ 0, 1, nan, 2, 4 ] ] chunked_array->Equals(chunked_array): 1 chunked_array->Equals(*chunked_array): 0 ``` ### What changes are included in this PR? Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF: - The two share the same address AND - They cannot have `NaN` values If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison. ### Are these changes tested? Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`. ### Are there any user-facing changes? Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values. * Closes: apache#37515 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Describe the bug, including details regarding any error messages, version, and platform.
While working on #37448, I noticed the
Equals()
method ofChunkedArray
treats NaNs as equal by default. I found this behavior surprising because theEquals()
method ofArray
doesn't treat NaNs as equal by default.Below is a program to reproduce this behavior:
Here is the output of this program:
Am I correct in that this is a bug?
Component(s)
C++
The text was updated successfully, but these errors were encountered: