[2.4][SPARK-26021][SQL][FOLLOWUP] only deal with NaN and -0.0 in UnsafeWriter #23265

cloud-fan · 2018-12-09T06:26:09Z

backport #23239 to 2.4

What changes were proposed in this pull request?

A followup of #23043

There are 4 places we need to deal with NaN and -0.0:

comparison expressions. -0.0 and 0.0 should be treated as same. Different NaNs should be treated as same.
Join keys. -0.0 and 0.0 should be treated as same. Different NaNs should be treated as same.
grouping keys. -0.0 and 0.0 should be assigned to the same group. Different NaNs should be assigned to the same group.
window partition keys. -0.0 and 0.0 should be treated as same. Different NaNs should be treated as same.

The case 1 is OK. Our comparison already handles NaN and -0.0, and for struct/array/map, we will recursively compare the fields/elements.

Case 2, 3 and 4 are problematic, as they compare UnsafeRow binary directly, and different NaNs have different binary representation, and the same thing happens for -0.0 and 0.0.

To fix it, a simple solution is: normalize float/double when building unsafe data (UnsafeRow, UnsafeArrayData, UnsafeMapData). Then we don't need to worry about it anymore.

Following this direction, this PR moves the handling of NaN and -0.0 from Platform to UnsafeWriter, so that places like UnsafeRow.setFloat will not handle them, which reduces the perf overhead. It's also easier to add comments explaining why we do it in UnsafeWriter.

How was this patch tested?

existing tests

A followup of apache#23043 There are 4 places we need to deal with NaN and -0.0: 1. comparison expressions. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 2. Join keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 3. grouping keys. `-0.0` and `0.0` should be assigned to the same group. Different NaNs should be assigned to the same group. 4. window partition keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. The case 1 is OK. Our comparison already handles NaN and -0.0, and for struct/array/map, we will recursively compare the fields/elements. Case 2, 3 and 4 are problematic, as they compare `UnsafeRow` binary directly, and different NaNs have different binary representation, and the same thing happens for -0.0 and 0.0. To fix it, a simple solution is: normalize float/double when building unsafe data (`UnsafeRow`, `UnsafeArrayData`, `UnsafeMapData`). Then we don't need to worry about it anymore. Following this direction, this PR moves the handling of NaN and -0.0 from `Platform` to `UnsafeWriter`, so that places like `UnsafeRow.setFloat` will not handle them, which reduces the perf overhead. It's also easier to add comments explaining why we do it in `UnsafeWriter`. existing tests Closes apache#23239 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

cloud-fan · 2018-12-09T06:26:22Z

cc @dongjoon-hyun

SparkQA · 2018-12-09T08:05:01Z

Test build #99884 has finished for PR 23265 at commit 6a837c0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-12-09T08:14:32Z

retest this please

SparkQA · 2018-12-09T12:59:07Z

Test build #99885 has finished for PR 23265 at commit 6a837c0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thanks. Merged to branch-2.4.

…feWriter backport #23239 to 2.4 --------- ## What changes were proposed in this pull request? A followup of #23043 There are 4 places we need to deal with NaN and -0.0: 1. comparison expressions. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 2. Join keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 3. grouping keys. `-0.0` and `0.0` should be assigned to the same group. Different NaNs should be assigned to the same group. 4. window partition keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. The case 1 is OK. Our comparison already handles NaN and -0.0, and for struct/array/map, we will recursively compare the fields/elements. Case 2, 3 and 4 are problematic, as they compare `UnsafeRow` binary directly, and different NaNs have different binary representation, and the same thing happens for -0.0 and 0.0. To fix it, a simple solution is: normalize float/double when building unsafe data (`UnsafeRow`, `UnsafeArrayData`, `UnsafeMapData`). Then we don't need to worry about it anymore. Following this direction, this PR moves the handling of NaN and -0.0 from `Platform` to `UnsafeWriter`, so that places like `UnsafeRow.setFloat` will not handle them, which reduces the perf overhead. It's also easier to add comments explaining why we do it in `UnsafeWriter`. ## How was this patch tested? existing tests Closes #23265 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…tDouble/Float This PR reverts #23043 and its followup #23265, from branch 2.4, because it has behavior changes. existing tests Closes #23389 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…feWriter backport apache#23239 to 2.4 --------- ## What changes were proposed in this pull request? A followup of apache#23043 There are 4 places we need to deal with NaN and -0.0: 1. comparison expressions. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 2. Join keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 3. grouping keys. `-0.0` and `0.0` should be assigned to the same group. Different NaNs should be assigned to the same group. 4. window partition keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. The case 1 is OK. Our comparison already handles NaN and -0.0, and for struct/array/map, we will recursively compare the fields/elements. Case 2, 3 and 4 are problematic, as they compare `UnsafeRow` binary directly, and different NaNs have different binary representation, and the same thing happens for -0.0 and 0.0. To fix it, a simple solution is: normalize float/double when building unsafe data (`UnsafeRow`, `UnsafeArrayData`, `UnsafeMapData`). Then we don't need to worry about it anymore. Following this direction, this PR moves the handling of NaN and -0.0 from `Platform` to `UnsafeWriter`, so that places like `UnsafeRow.setFloat` will not handle them, which reduces the perf overhead. It's also easier to add comments explaining why we do it in `UnsafeWriter`. ## How was this patch tested? existing tests Closes apache#23265 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…tDouble/Float This PR reverts apache#23043 and its followup apache#23265, from branch 2.4, because it has behavior changes. existing tests Closes apache#23389 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…feWriter backport apache#23239 to 2.4 --------- ## What changes were proposed in this pull request? A followup of apache#23043 There are 4 places we need to deal with NaN and -0.0: 1. comparison expressions. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 2. Join keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. 3. grouping keys. `-0.0` and `0.0` should be assigned to the same group. Different NaNs should be assigned to the same group. 4. window partition keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same. The case 1 is OK. Our comparison already handles NaN and -0.0, and for struct/array/map, we will recursively compare the fields/elements. Case 2, 3 and 4 are problematic, as they compare `UnsafeRow` binary directly, and different NaNs have different binary representation, and the same thing happens for -0.0 and 0.0. To fix it, a simple solution is: normalize float/double when building unsafe data (`UnsafeRow`, `UnsafeArrayData`, `UnsafeMapData`). Then we don't need to worry about it anymore. Following this direction, this PR moves the handling of NaN and -0.0 from `Platform` to `UnsafeWriter`, so that places like `UnsafeRow.setFloat` will not handle them, which reduces the perf overhead. It's also easier to add comments explaining why we do it in `UnsafeWriter`. ## How was this patch tested? existing tests Closes apache#23265 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…tDouble/Float This PR reverts apache#23043 and its followup apache#23265, from branch 2.4, because it has behavior changes. existing tests Closes apache#23389 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…tDouble/Float This PR reverts apache/spark#23043 and its followup apache/spark#23265, from branch 2.4, because it has behavior changes. existing tests Closes #23389 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit fa1abe2)

dongjoon-hyun approved these changes Dec 9, 2018

View reviewed changes

cloud-fan closed this Dec 10, 2018

cloud-fan mentioned this pull request Dec 27, 2018

[2.4] revert [SPARK-26021][SQL] replace minus zero with zero in Platform.putDouble/Float #23389

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2.4][SPARK-26021][SQL][FOLLOWUP] only deal with NaN and -0.0 in UnsafeWriter #23265

[2.4][SPARK-26021][SQL][FOLLOWUP] only deal with NaN and -0.0 in UnsafeWriter #23265

cloud-fan commented Dec 9, 2018

cloud-fan commented Dec 9, 2018

SparkQA commented Dec 9, 2018

cloud-fan commented Dec 9, 2018

SparkQA commented Dec 9, 2018

dongjoon-hyun left a comment

[2.4][SPARK-26021][SQL][FOLLOWUP] only deal with NaN and -0.0 in UnsafeWriter #23265

[2.4][SPARK-26021][SQL][FOLLOWUP] only deal with NaN and -0.0 in UnsafeWriter #23265

Conversation

cloud-fan commented Dec 9, 2018

What changes were proposed in this pull request?

How was this patch tested?

cloud-fan commented Dec 9, 2018

SparkQA commented Dec 9, 2018

cloud-fan commented Dec 9, 2018

SparkQA commented Dec 9, 2018

dongjoon-hyun left a comment

Choose a reason for hiding this comment