PARQUET-1225: NaN values may lead to incorrect filtering under certai… #444

majetideepak · 2018-02-20T23:50:47Z

parquet-cpp does not implement filtering (predicate pushdown). Clients such as Vertica read the statistics from the metadata and implement their own filtering based on those statistics.
Therefore, the read path does not require any changes. We should document that the min/max values can potentially contain NaNs.
I made changes to the write path to ignore NaNs.
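
To illustrate why a NaN in the min/max statistics matters to a client that does its own filtering, here is a minimal sketch. `ColumnStats`, `CanSkipOutside` and `CanSkipNotInside` are hypothetical names and are not part of parquet-cpp or any client; they only show how two range checks that agree on ordinary numbers diverge when a bound is NaN.

```cpp
#include <cassert>
#include <limits>

// Hypothetical min/max statistics for one row group, as a client might read them.
struct ColumnStats {
  double min;
  double max;
};

// Skip the row group when `value` is provably outside [min, max].
// Every ordered comparison with NaN is false, so a NaN bound never allows a skip.
inline bool CanSkipOutside(const ColumnStats& s, double value) {
  return value < s.min || value > s.max;
}

// Equivalent for ordinary numbers, but with a NaN bound the inner test is
// always false, so the row group is always skipped.
inline bool CanSkipNotInside(const ColumnStats& s, double value) {
  return !(value >= s.min && value <= s.max);
}

inline void Example() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  const ColumnStats ok{-3.0, 4.0};
  const ColumnStats bad{-3.0, nan};  // max polluted by a NaN in the data
  assert(!CanSkipOutside(ok, 1.0) && !CanSkipNotInside(ok, 1.0));
  assert(!CanSkipOutside(bad, 1.0));   // conservative: the row group is read
  assert(CanSkipNotInside(bad, 1.0));  // unsafe: a row group that may hold 1.0 is dropped
}
```

Which of the two forms a given client uses is not stated in this thread, which is one reason to keep NaN out of the written statistics altogether.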

majetideepak · 2018-02-20T23:55:00Z

I need to add a test when all values are NaNs.

boroknagyz · 2018-02-21T10:29:17Z

src/parquet/statistics.cc

+template <>
+inline int getValueEndOffset<float>(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {


I think it should be "i >= 0" instead of "i > 0"

boroknagyz · 2018-02-21T10:29:59Z

src/parquet/statistics.cc

+  for (int64_t i = (count - 1); i > 0; i--) {
+     if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;


To me it seems it should be "return 0"

boroknagyz · 2018-02-21T10:30:57Z

src/parquet/statistics.cc

+  for (int64_t i = 0; i < count; i++) {
+     if (!std::isnan(values[i])) return i;
+  }
+  return 0;


I think it should be "return count"

boroknagyz · 2018-02-21T10:49:35Z

src/parquet/statistics.cc

+template <>
+inline int getValueEndOffset<double>(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {


I think it should be i >= 0 instead of i > 0.
Consider the case where only the first element is a number, e.g.:
{3.14}, or
{3.14, NaN, NaN, NaN, NaN, ..., NaN}
For these inputs, this function will return 0.
getValueBeginOffset() will also return 0.
Usually, C++ ranges are interpreted as [first, last), i.e. they are open at the end. Therefore, [0, 0) is an empty range.
However, this solution still works, because at L178 you don't return when begin_offset_ == end_offset, and minmax_element() returns make_pair(first, first) if the range is empty.

But I feel like it currently works by accident, and it would be better to fix this.
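
The off-by-one discussed in these comments can be made concrete with a minimal sketch. `ValueBeginOffset`, `ValueEndOffset` and `HasValuesToScan` are illustrative names, not the code that was merged; the sketch assumes the half-open [begin, end) convention and applies the suggested `i >= 0`, `return 0` and `return count` changes.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Index of the first non-NaN value, or `count` if every value is NaN.
inline int64_t ValueBeginOffset(const double* values, int64_t count) {
  for (int64_t i = 0; i < count; i++) {
    if (!std::isnan(values[i])) return i;
  }
  return count;
}

// One past the index of the last non-NaN value, or 0 if every value is NaN.
inline int64_t ValueEndOffset(const double* values, int64_t count) {
  for (int64_t i = count - 1; i >= 0; i--) {  // i >= 0 so element 0 is checked too
    if (!std::isnan(values[i])) return i + 1;
  }
  return 0;
}

// True when [begin, end) holds at least one value to feed to a min/max scan.
inline bool HasValuesToScan(const double* values, int64_t count) {
  return ValueBeginOffset(values, count) < ValueEndOffset(values, count);
}
```

With these definitions, {3.14, NaN, NaN} gives the range [0, 1), and an all-NaN input gives begin == count and end == 0, so a caller can test begin < end and return early instead of relying on the empty-range behavior.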

zivanfi · 2018-02-21T11:52:46Z

src/parquet/statistics.cc

@@ -107,8 +168,17 @@ void TypedRowGroupStatistics<DType>::Update(const T* values, int64_t num_not_nul
  // TODO: support distinct count?
  if (num_not_null == 0) return;

+  // PARQUET-1225: Handle NaNs
+  // The problem arises only if the starting/ending value(s)


Do NaNs at the end actually cause trouble? I had the impression that only NaNs at the beginning are problematic.

They are a problem even at the end: the max value becomes NaN. This is specific to the implementation of minmax_element. A possible implementation is given at http://en.cppreference.com/w/cpp/algorithm/minmax_element
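
The position sensitivity can be reproduced with a simplified single-pass scan. `SimpleMinMax` below is a hypothetical model and not the standard library's code; it only assumes the documented rule that the last of several largest elements is reported, which leads to a "not less than max" update.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified model of a min/max scan; requires a non-empty input.
inline std::pair<double, double> SimpleMinMax(const std::vector<double>& v) {
  double min = v[0];
  double max = v[0];
  for (std::size_t i = 1; i < v.size(); ++i) {
    if (v[i] < min) {
      min = v[i];
    } else if (!(v[i] < max)) {
      // "Not less than max" is also true for NaN, so a NaN replaces max.
      max = v[i];
    }
  }
  return {min, max};
}
```

Under this model, {NaN, 1, 2} keeps NaN as the min because nothing compares less than NaN, and {1, 2, NaN} ends with NaN as the max because NaN is never less than the current max.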

zivanfi · 2018-02-21T11:54:52Z

src/parquet/statistics-test.cc

+
+  ASSERT_EQ(min, -3.0);
+  ASSERT_EQ(max, 4.0);
+}
 }  // namespace test


I would suggest adding a test case for all values being NaN.

Ah, sorry, I just noticed you listed this yourself as something that is still left to be done.
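
A minimal sketch of what an all-NaN test might check, assuming the intended behavior is that such a batch leaves min/max unset. `MinMaxResult` and `MinMaxIgnoringNaN` are hypothetical stand-ins and not the parquet-cpp `TypedRowGroupStatistics` API.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical result type; `has_min_max` mirrors the idea of the has_min_max_ flag.
struct MinMaxResult {
  bool has_min_max = false;
  double min = 0.0;
  double max = 0.0;
};

// Computes min/max over the non-NaN values only.
inline MinMaxResult MinMaxIgnoringNaN(const double* values, int64_t count) {
  MinMaxResult r;
  for (int64_t i = 0; i < count; i++) {
    const double v = values[i];
    if (std::isnan(v)) continue;  // NaNs never contribute to the statistics
    if (!r.has_min_max) {
      r.has_min_max = true;
      r.min = r.max = v;
    } else {
      if (v < r.min) r.min = v;
      if (v > r.max) r.max = v;
    }
  }
  return r;
}
```

For an input of only NaNs the flag stays false, which is the property an all-NaN test would assert; a mixed input such as {NaN, -3.0, NaN, 4.0, NaN} yields -3.0 and 4.0.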

wesm · 2018-02-21T15:27:17Z

src/parquet/statistics.cc

@@ -96,6 +96,67 @@ void TypedRowGroupStatistics<DType>::Reset() {
  has_min_max_ = false;
 }

+template <typename T>
+inline int getValueBeginOffset(const T* values, int64_t count) {


Can we use compliant naming in these new functions (function names should be capitalized)?

wesm · 2018-02-21T15:27:29Z

src/parquet/statistics.cc

+inline int getValueBeginOffset<float>(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+     if (!std::isnan(values[i])) return i;


wesm · 2018-02-21T15:28:07Z

src/parquet/statistics.cc

+inline int getValueEndOffset<float>(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+     if (!std::isnan(values[i])) return (i + 1);


wesm · 2018-02-21T15:29:08Z

src/parquet/statistics.cc

+}
+
+template <typename T>
+inline bool notNaN (const T* value) {


Pass const T value instead of pointer?

wesm · 2018-02-21T15:29:57Z

src/parquet/statistics.cc

+
+template <>
+inline bool notNaN<float>(const float* value) {
+  return !std::isnan(*value);


Would it be better to use return value == value here (with the pointer -> value change per above)?
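
A sketch of the by-value form, assuming IEEE 754 semantics, where NaN is the only value that compares unequal to itself. `IsNaNBySelfCompare` is an illustrative name. Compilers run with fast-math style flags may assume NaN never occurs, which can defeat this kind of check.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Takes the value directly instead of a pointer; works for any comparable T.
template <typename T>
inline bool IsNaNBySelfCompare(const T value) {
  return value != value;  // true only for NaN under IEEE 754 comparison rules
}
```

For integer types the comparison is always false, so the same template can be instantiated for every physical type without a separate specialization.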

wesm · 2018-02-21T15:30:03Z

src/parquet/statistics.cc

+inline int getValueBeginOffset<double>(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+     if (!std::isnan(values[i])) return i;


wesm · 2018-02-21T15:30:22Z

src/parquet/statistics.cc

+inline int getValueEndOffset<double>(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+     if (!std::isnan(values[i])) return (i + 1);


wesm · 2018-02-21T15:33:18Z

src/parquet/statistics.cc

+}
+
+template <>
+inline int getValueBeginOffset<float>(const float* values, int64_t count) {


If we used std::is_floating_point and the functor pattern, we could avoid code duplication

template <typename T, typename Enable = void>
struct StatsHelper { ... };

template <typename T>
struct StatsHelper<T, typename std::enable_if<std::is_floating_point<T>::value>::type> { ... };

Thanks for this tip. Very helpful! Will include other review comments.
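
The suggested pattern could be filled in roughly as follows. This is a sketch under assumptions: the member `IsNaN` and the routine `CountNaNs` are hypothetical and only demonstrate how a single generic function can replace the per-type float and double copies.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <type_traits>

// Primary template: non-floating-point types cannot hold NaN.
template <typename T, typename Enable = void>
struct StatsHelper {
  static bool IsNaN(const T) { return false; }
};

// Partial specialization selected for float, double and long double.
template <typename T>
struct StatsHelper<T, typename std::enable_if<std::is_floating_point<T>::value>::type> {
  static bool IsNaN(const T value) { return std::isnan(value); }
};

// One generic routine serves every element type without per-type copies.
template <typename T>
inline int64_t CountNaNs(const T* values, int64_t count) {
  int64_t n = 0;
  for (int64_t i = 0; i < count; i++) {
    if (StatsHelper<T>::IsNaN(values[i])) n++;
  }
  return n;
}
```

The compiler picks the specialization at compile time, so integer columns pay nothing for the NaN check.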

majetideepak · 2018-02-21T21:04:42Z

All the changes have been made. @boroknagyz and @wesm please let me know if you have more feedback!

boroknagyz · 2018-02-22T15:56:24Z

Thanks for applying the changes! To me it looks good generally.
The Update() and UpdateSpaced() functions share some common code (not necessarily introduced by this PR, e.g. the last if statements of the functions), so a little refactoring may be worthwhile.
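
One possible shape for such a refactoring, as a sketch only. The thread does not show the shared code, so this assumes the trailing if statement in both functions merges a batch's min/max into the running statistics; `RunningMinMax` and `Merge` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical running statistics shared by the dense and spaced update paths.
template <typename T>
struct RunningMinMax {
  bool has_min_max = false;
  T min{};
  T max{};

  // Folds one batch's min/max into the running values.
  void Merge(const T& batch_min, const T& batch_max) {
    if (!has_min_max) {
      has_min_max = true;
      min = batch_min;
      max = batch_max;
    } else {
      min = std::min(min, batch_min);
      max = std::max(max, batch_max);
    }
  }
};
```

Both update paths would then compute their batch min/max in their own way and finish with a single call to `Merge`.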

xhochy

+1, LGTM

xhochy · 2018-02-24T18:23:16Z

@majetideepak @boroknagyz @zivanfi I think with this PR we are now ready to make a new RC?

majetideepak · 2018-02-26T01:03:21Z

I think we are ready to make a new RC. Thanks!

zivanfi · 2018-02-26T08:08:26Z

I agree, thanks for your efforts!

majetideepak changed the title ~~PARQUET-1225: NaN values may lead to incorrect filtering under certai…~~ [WIP] PARQUET-1225: NaN values may lead to incorrect filtering under certai… Feb 20, 2018

boroknagyz reviewed Feb 21, 2018

zivanfi reviewed Feb 21, 2018

wesm reviewed Feb 21, 2018

Deepak Majeti added 4 commits February 21, 2018 13:22

PARQUET-1225: NaN values may lead to incorrect filtering under certai…

63e889b

…n circumstances

review comments and add tests

ba44611

clang format

3144a6a

change api from NotNaN to IsNaN

baf6f50

majetideepak force-pushed the PARQUET-1225 branch from 5a5e1c0 to baf6f50 Compare February 21, 2018 18:28

majetideepak changed the title ~~[WIP] PARQUET-1225: NaN values may lead to incorrect filtering under certai…~~ PARQUET-1225: NaN values may lead to incorrect filtering under certai… Feb 21, 2018

fix logic for UpdateSpaced

1c229e2

majetideepak force-pushed the PARQUET-1225 branch from 2ec37c9 to 1c229e2 Compare February 21, 2018 19:03

fix compiler error

c02adb2

refactor code

c29ede2

xhochy approved these changes Feb 24, 2018

xhochy closed this in 29a4b07 Feb 24, 2018

majetideepak deleted the PARQUET-1225 branch February 28, 2018 17:53
