New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1225: NaN values may lead to incorrect filtering under certai… #444
Conversation
I need to add a test when all values are NaNs. |
src/parquet/statistics.cc
Outdated
template <> | ||
inline int getValueEndOffset<float>(const float* values, int64_t count) { | ||
// Skip NaNs | ||
for (int64_t i = (count - 1); i > 0; i--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be "i >= 0" instead of "i > 0"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct!
src/parquet/statistics.cc
Outdated
for (int64_t i = (count - 1); i > 0; i--) { | ||
if (!std::isnan(values[i])) return (i + 1); | ||
} | ||
return count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it seems it should be "return 0"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
src/parquet/statistics.cc
Outdated
for (int64_t i = 0; i < count; i++) { | ||
if (!std::isnan(values[i])) return i; | ||
} | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it should be "return count"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
src/parquet/statistics.cc
Outdated
template <> | ||
inline int getValueEndOffset<double>(const double* values, int64_t count) { | ||
// Skip NaNs | ||
for (int64_t i = (count - 1); i > 0; i--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be i >= 0
instead of i > 0
Think of the case when only the first element is a number, e.g.:
{3.14}
, or
{3.14, NaN, NaN, NaN, NaN, ..., NaN}
For these inputs, this function will return 0.
getValueBeginOffset()
will also return 0.
Usually, C++ ranges are interpreted as [first, last)
, ie. they are open at the end. Therefore, [0, 0)
is an empty range.
However, this solution will work, because at L178 you don't return when begin_offset_ == end_offset
, and minmax_element()
returns make_pair(first, first)
if the range is empty.
But I feel like it currently works by accident, and would be better to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
src/parquet/statistics.cc
Outdated
@@ -107,8 +168,17 @@ void TypedRowGroupStatistics<DType>::Update(const T* values, int64_t num_not_nul | |||
// TODO: support distinct count? | |||
if (num_not_null == 0) return; | |||
|
|||
// PARQUET-1225: Handle NaNs | |||
// The problem arises only if the starting/ending value(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do NaN-s at the end actually cause trouble? I had the impression that only NaN-s at the beginning are problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are a problem even in the end. The max value becomes NaN. This is specific to the implementation of minmax_element
. A possible implementation is specified here http://en.cppreference.com/w/cpp/algorithm/minmax_element
|
||
ASSERT_EQ(min, -3.0); | ||
ASSERT_EQ(max, 4.0); | ||
} | ||
} // namespace test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest adding a test case for all values being NaN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry, I just noticed you listed this yourself as something that is still left to be done.
src/parquet/statistics.cc
Outdated
@@ -96,6 +96,67 @@ void TypedRowGroupStatistics<DType>::Reset() { | |||
has_min_max_ = false; | |||
} | |||
|
|||
template <typename T> | |||
inline int getValueBeginOffset(const T* values, int64_t count) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use compliant naming in these new functions (function names should be capitalized)?
src/parquet/statistics.cc
Outdated
inline int getValueBeginOffset<float>(const float* values, int64_t count) { | ||
// Skip NaNs | ||
for (int64_t i = 0; i < count; i++) { | ||
if (!std::isnan(values[i])) return i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add braces
src/parquet/statistics.cc
Outdated
inline int getValueEndOffset<float>(const float* values, int64_t count) { | ||
// Skip NaNs | ||
for (int64_t i = (count - 1); i > 0; i--) { | ||
if (!std::isnan(values[i])) return (i + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces
src/parquet/statistics.cc
Outdated
} | ||
|
||
template <typename T> | ||
inline bool notNaN (const T* value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass const T value
instead of pointer?
src/parquet/statistics.cc
Outdated
|
||
template <> | ||
inline bool notNaN<float>(const float* value) { | ||
return !std::isnan(*value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to use return value == value
here (with the pointer -> value change per above)?
src/parquet/statistics.cc
Outdated
inline int getValueBeginOffset<double>(const double* values, int64_t count) { | ||
// Skip NaNs | ||
for (int64_t i = 0; i < count; i++) { | ||
if (!std::isnan(values[i])) return i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces
src/parquet/statistics.cc
Outdated
inline int getValueEndOffset<double>(const double* values, int64_t count) { | ||
// Skip NaNs | ||
for (int64_t i = (count - 1); i > 0; i--) { | ||
if (!std::isnan(values[i])) return (i + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
braces
src/parquet/statistics.cc
Outdated
} | ||
|
||
template <> | ||
inline int getValueBeginOffset<float>(const float* values, int64_t count) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we used std::is_floating_point
and the functor pattern, we could avoid code duplication
template <typename T, typename Enable = void>
struct StatsHelper {
...
};
template <typename T>
struct StatsHelper<T, typename std::enable_if<std::is_floating_point<T>::value>::type> {
...
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this tip. Very helpful! Will include other review comments.
5a5e1c0
to
baf6f50
Compare
2ec37c9
to
1c229e2
Compare
All the changes have been made. @boroknagyz and @wesm please let me know if you have more feedback! |
Thanks for applying the changes! To me it looks good generally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
@majetideepak @boroknagyz @zivanfi I think with the PR we are now ready to make a new RC? |
I think we are ready to make a new RC. thanks! |
I agree, thanks for your efforts! |
parquet-cpp
does not implement filtering (predicate pushdown). Clients such as Vertica, read the statistics from the metadata and implement their own filtering based on these stats.Therefore, the read path does not require any changes. We should document that the min/max value can potentially contain NaNs.
I made changes to the write path to ignore the NaNs.