ORC-1147: Use `isNaN` instead of `isFinite` to determine the contain NaN values #1080

guiyanakuang · 2022-04-05T03:55:23Z

What changes were proposed in this pull request?

This PR uses isNaN instead of isFinite to determine whether NaN values are present.
The goal is to exclude both Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY and match only NaN.

Why are the changes needed?

When the sum overflows, predicate pushdown can still be applied to skip the corresponding stripe.

How was this patch tested?

Added a unit test.

dongjoon-hyun

Oh, @guiyanakuang, this is not correct.

There exist multiple NaN values. Double.NaN is just one of them.

dongjoon-hyun · 2022-04-05T04:33:07Z

Please note that the IEEE 754 standard defines NaN as a range of values, not a single value.

guiyanakuang · 2022-04-05T04:55:47Z

> Oh, @guiyanakuang, this is not correct.
>
> There exist multiple NaN values. Double.NaN is just one of them.

Maybe I should use the JDK's own method, `new Double(dstas.getSum()).isNaN()`. I want to exclude both Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY and match only NaN.
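The difference between the two checks can be illustrated with a minimal sketch. The class name `NaNCheckSketch` and the helper `sumIsNaN` are hypothetical and are not taken from the patch; the sketch uses the static `Double.isNaN(double)`, which performs the same test as the instance method without boxing. Whether the merged code uses the static or the instance form is not stated in this conversation.

```java
public class NaNCheckSketch {
  // Hypothetical helper: true only when the sum is some NaN value.
  // Infinities (for example from overflow) are deliberately not matched.
  static boolean sumIsNaN(double sum) {
    return Double.isNaN(sum);
  }

  public static void main(String[] args) {
    // A NaN whose bit pattern differs from the canonical Double.NaN.
    double otherNaN = Double.longBitsToDouble(0x7ff8000000000001L);
    // Overflow produces positive infinity, not NaN.
    double overflow = Double.MAX_VALUE + Double.MAX_VALUE;

    System.out.println(sumIsNaN(Double.NaN));        // true
    System.out.println(sumIsNaN(otherNaN));          // true
    System.out.println(sumIsNaN(overflow));          // false
    // isFinite rejects both NaN and the infinities, so it cannot
    // distinguish an overflowed sum from a NaN sum.
    System.out.println(Double.isFinite(overflow));   // false
    System.out.println(Double.isFinite(Double.NaN)); // false
  }
}
```

With `isFinite`, an overflowed sum and a NaN sum look the same; with `isNaN`, only the NaN case is flagged.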

dongjoon-hyun · 2022-04-05T05:28:10Z

That sounds better.

dongjoon-hyun · 2022-04-05T15:22:40Z

java/core/src/test/org/apache/orc/TestVectorOrcFile.java

+    // First two rows of data cause sum overflow, sum is not a finite value,
+    // but this does not prevent pushing down (range comparisons work fine)
+    // The same applies to the middle stripe
+    fcol.vector[0] = dbcol.vector[0] = Double.MAX_VALUE / 2 + Double.MAX_VALUE / 4;


Please define a constant variable once for Double.MAX_VALUE / 2 + Double.MAX_VALUE / 4 and use it.
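As a rough illustration of this suggestion, the expression could be hoisted into a single constant. The names `SumOverflowSketch`, `LARGE_VALUE`, and `sumOf` below are hypothetical and are not the names used in the test file; the sketch also shows why two such rows make the sum non-finite without making it NaN.

```java
public class SumOverflowSketch {
  // Hypothetical constant name: three quarters of Double.MAX_VALUE,
  // defined once so the test data can reuse it.
  static final double LARGE_VALUE = Double.MAX_VALUE / 2 + Double.MAX_VALUE / 4;

  // Plain running sum; stands in for the sum kept in column statistics.
  static double sumOf(double[] values) {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    return sum;
  }

  public static void main(String[] args) {
    double sum = sumOf(new double[] {LARGE_VALUE, LARGE_VALUE});
    System.out.println(Double.isFinite(LARGE_VALUE)); // true
    System.out.println(Double.isInfinite(sum));       // true
    System.out.println(Double.isNaN(sum));            // false
  }
}
```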

dongjoon-hyun · 2022-04-05T15:29:08Z

java/core/src/test/org/apache/orc/TestVectorOrcFile.java

+    assertEquals(1000, batch.size);
+
+    rows.nextBatch(batch);
+    // Last strip should not be read, even if sum is not finite


strip -> stripe?

dongjoon-hyun · 2022-04-05T15:31:32Z

java/core/src/test/org/apache/orc/TestVectorOrcFile.java

+
+    // First two rows of data cause sum overflow, sum is not a finite value,
+    // but this does not prevent pushing down (range comparisons work fine)
+    // The same applies to the middle stripe


Could you add some more illustration about how many stripes are used here?

dongjoon-hyun · 2022-04-05T15:31:44Z

cc @williamhyun

dongjoon-hyun · 2022-04-06T04:06:59Z

java/core/src/test/org/apache/orc/TestVectorOrcFile.java

+
+    // Here we are writing 3500 rows of data, with stripeSize set to 400000
+    // and rowIndexStride set to 1000, so 1 stripe will be written,
+    // indexed in 4 strides.


Thank you for the details.
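The stride count in the comment above follows from simple ceiling division. The sketch below covers only that arithmetic; the class and method names are hypothetical, and it does not model how ORC decides stripe boundaries from stripeSize.

```java
public class StrideCountSketch {
  // Number of row-index strides needed to cover rowCount rows,
  // i.e. ceil(rowCount / rowIndexStride) using integer arithmetic.
  static long strideCount(long rowCount, long rowIndexStride) {
    return (rowCount + rowIndexStride - 1) / rowIndexStride;
  }

  public static void main(String[] args) {
    // 3500 rows with rowIndexStride = 1000: three full strides plus
    // one partial stride of 500 rows.
    System.out.println(strideCount(3500, 1000)); // 4
  }
}
```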

dongjoon-hyun

+1, LGTM. Thank you, @guiyanakuang .

…values

### What changes were proposed in this pull request?

This pr is aimed at using `isNaN` instead of `isFinite` to determine the contain NaN values. I want to exclude Double.POSITIVE_INFINITY / Double.NEGATIVE_INFINITY both cases, and only match NaN.

### Why are the changes needed?

In the case of a sum overflow we can also predicate down to skip the corresponding strip.

### How was this patch tested?

Added unit test.

Closes #1080

Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6b053d4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7763697)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>


github-actions bot added the JAVA label Apr 5, 2022

dongjoon-hyun previously requested changes Apr 5, 2022

Use isNaN instead of isFinite to determine if there is a NaN write

ab50a8f

guiyanakuang force-pushed the ORC-1147 branch from 1f8e4a0 to ab50a8f Compare April 5, 2022 12:34

guiyanakuang changed the title ~~ORC-1147: Use Objects.equals(dstas.getSum(), Double.NaN) instead of isFinite to determine if there is a NaN write~~ ORC-1147: Use isNaN instead of isFinite to determine the contain NaN values Apr 5, 2022

dongjoon-hyun reviewed Apr 5, 2022

dongjoon-hyun reviewed Apr 5, 2022

Fix review comments

56c80ce

dongjoon-hyun reviewed Apr 6, 2022

dongjoon-hyun approved these changes Apr 6, 2022

dongjoon-hyun closed this in 6b053d4 Apr 6, 2022

dongjoon-hyun added this to the 1.7.4 milestone Apr 6, 2022
