
[SPARK-26021][SQL] replace minus zero with zero in Platform.putDouble/Float #23043

Conversation

@adoron adoron commented Nov 15, 2018

GROUP BY treats -0.0 and 0.0 as different values, which is unlike Hive's behavior.
In addition, the current behavior with codegen is unpredictable (see the example in the JIRA ticket).
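For illustration, a minimal reproduction sketch along the lines of the JIRA example (hypothetical, not part of this PR's test code; it assumes a Spark 2.x spark-shell session, so spark and its implicits are in scope):

import spark.implicits._

// Before the fix, -0.0 and 0.0 can land in different groups (and the exact
// result may vary with whole-stage codegen); after the fix there is one group.
val df = Seq(0.0d, 0.0d, -0.0d).toDF("i")
df.groupBy("i").count().show()
// expected after the fix: a single row [0.0, 3]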

What changes were proposed in this pull request?

In Platform.putDouble/Float(), check whether the value is -0.0 and, if so, replace it with 0.0.
Platform is used by UnsafeRow, so UnsafeRow will no longer contain -0.0 values.

How was this patch tested?

Added tests

@adoron
Author

adoron commented Nov 15, 2018

@cloud-fan
Contributor

This only works for attributes, not literals or intermediate results. Is there a better place to fix it?

@kiszk
Member

kiszk commented Nov 15, 2018

IIUC, we discussed handling +0.0 and -0.0 before in another PR.
@srowen do you remember the previous discussion? I found the discussion. Good to know the semantics in Hive.

@@ -56,17 +56,32 @@ case class BoundReference(ordinal: Int, dataType: DataType, nullable: Boolean)
val javaType = JavaCode.javaType(dataType)
val value = CodeGenerator.getValue(ctx.INPUT_ROW, dataType, ordinal.toString)
if (nullable) {
ev.copy(code =
var codeBlock =
Member

@kiszk kiszk Nov 15, 2018

nit: better to use val instead of var.

} else {
ev.copy(code = code"$javaType ${ev.value} = $value;", isNull = FalseLiteral)
var codeBlock = code"$javaType ${ev.value} = $value;"
Member

ditto


private def genReplaceMinusZeroWithZeroCode(javaType: String, value: String): Block = {
val code = s"\nif ($value == -0.0%c) $value = 0.0%c;"
var formattedCode = ""
Member

ditto

@srowen
Member

srowen commented Nov 15, 2018

@kiszk This spun out of https://issues.apache.org/jira/browse/SPARK-24834 and #21794 ; is that what you may be thinking of? I'm not aware of others.

@cloud-fan
Contributor

Before rushing to a fix that replaces -0.0 with 0.0, I'd like to know how this bug happens.

One possible reason might be that 0.0 and -0.0 have different binary formats. Spark uses the unsafe API to write float/double; maybe we can investigate that first.

@srowen
Member

srowen commented Nov 16, 2018

They do, FWIW:

scala> java.lang.Double.doubleToLongBits(0.0)
res1: Long = 0

scala> java.lang.Double.doubleToLongBits(-0.0)
res2: Long = -9223372036854775808

@cloud-fan
Contributor

Looking at UnsafeRow.putFloat, it normalizes the value of Float.NaN. I think we should do the same there for -0.0, and other related places (check how we handle Float.NaN).
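For context, a hedged Scala sketch of the normalization idea being discussed (illustrative only; the real code lives in UnsafeRow/UnsafeWriter and is written in Java):

// Write a canonical value for NaN and -0.0 before storing the raw bits.
def normalizeFloat(v: Float): Float =
  if (v.isNaN) Float.NaN        // collapse every NaN bit pattern into one canonical NaN
  else if (v == 0.0f) 0.0f      // 0.0f == -0.0f is true, so this maps -0.0 to +0.0
  else v

def normalizeDouble(v: Double): Double =
  if (v.isNaN) Double.NaN
  else if (v == 0.0d) 0.0d
  else v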

@adoron
Author

adoron commented Nov 16, 2018

@cloud-fan that's what I thought as well at first, but the flow doesn't go through that code -
I verified by running Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().collect() with a breakpoint set there.

The reason for -0.0 and 0.0 being put in different buckets of "group by" is in UnsafeFixedWidthAggregationMap::getAggregationBufferFromUnsafeRow():

public UnsafeRow getAggregationBufferFromUnsafeRow(UnsafeRow key) {
    return getAggregationBufferFromUnsafeRow(key, key.hashCode());
}

The hashing is done on the UnsafeRow, and by this point the whole row is hashed as a unit, so it's hard to find the double columns and their values.
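In other words, a small sketch restating @srowen's doubleToLongBits point in the row-hashing context:

// The two zeros compare equal as values but have different raw bits, so any
// hash computed over the row's bytes treats them as different grouping keys.
assert(0.0d == -0.0d)
assert(java.lang.Double.doubleToLongBits(0.0d) !=
  java.lang.Double.doubleToLongBits(-0.0d))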

@sabanas

sabanas commented Nov 16, 2018

@adoron indeed this doesn't pass through setFloat, but all values go through -
InternalRow.scala#L42

which goes through -
rows.scala#L209

so, for example, code like the following fixes your example -

  override def update(i: Int, value: Any): Unit = {
    val ignoreMinusZeroValue = value match {
      case v: Double => if (v == 0d) 0d else value
      case v: Float => if (v == 0f) 0f else value
      case _ => value
    }
    values(i) = ignoreMinusZeroValue
  }

not sure if that holds for other cases mentioned in this PR though.

@cloud-fan
Contributor

UnsafeRow.set is not the only place that writes float/double as binary data; can you check other places like UnsafeWriter?

@adoron
Author

adoron commented Nov 17, 2018

@cloud-fan changing writeDouble/writeFloat in UnsafeWriter indeed does the trick!
I'll fix the PR. I was thinking about making the change in Platform::putDouble so that all accesses are affected, in UnsafeRow and UnsafeWriter alike.

@@ -120,6 +120,9 @@ public static float getFloat(Object object, long offset) {
}

public static void putFloat(Object object, long offset, float value) {
if(value == -0.0f) {
Contributor

I'm fine with putting this trick here; shall we also move the IsNaN logic here as well?

byte[] floatBytes = new byte[Float.BYTES];
Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
Assert.assertEquals(0, Double.compare(0.0d, ByteBuffer.wrap(doubleBytes).getDouble()));
Contributor

Are you sure this test fails before the fix? IIUC 0.0 == -0.0 is true, but they have different binary formats.

Contributor

BTW thanks for adding the unit test! It's a good complement to the end-to-end test.

Author

Yeah, it fails. Indeed 0.0 == -0.0 is true, so I'm using Double.compare == 0 to test this.

def assertResult[T](result: Array[Row], zero: T)(implicit ordering: Ordering[T]): Unit = {
  assert(result.length == 1)
  // using compare since 0.0 == -0.0 is true
  assert(ordering.compare(result(0).getAs[T](0), zero) == 0)
}
Contributor

Instead of checking the result, I prefer the code snippet in the JIRA ticket, which makes it more obvious where the problem is.

Let's run a group-by query with both 0.0 and -0.0 in the input, then check the number of result rows: ideally 0.0 and -0.0 are the same, so we should only have one group (one result row).
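A hedged sketch of that kind of end-to-end check (the test name and surrounding suite are illustrative, assuming a SharedSparkSession-style test with implicits imported):

test("SPARK-26021: 0.0 and -0.0 fall into the same group") {
  // One result row means -0.0 and 0.0 were grouped together.
  val rows = Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().collect()
  assert(rows.length == 1)
  assert(rows(0).getLong(1) == 3L)
}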

Author

I'm not sure I follow. Below this I'm constructing Seqs with 0 and -0 like in the JIRA, and in the assertResult helper I'm checking that there's only one row, like you said.
Do you mean the check that the key is indeed 0.0 and not -0.0 is redundant?

Contributor

ah sorry I misread the code.

@kiszk
Member

kiszk commented Nov 19, 2018

@srowen #21794 is what I thought.

@kiszk
Member

kiszk commented Nov 19, 2018

Would it be better to update this PR title now?

@kiszk
Member

kiszk commented Nov 19, 2018

Do we need to consider GenerateSafeProjection, too? In other words, if the generated code or runtime does not use data in Unsafe, this +0.0/-0.0 problem may still exist.
Am I correct?

@adoron adoron changed the title [SPARK-26021][SQL] replace minus zero with zero in UnsafeProjection [SPARK-26021][SQL] replace minus zero with zero in Platform.putDouble/Float Nov 19, 2018
@adoron
Author

adoron commented Nov 19, 2018

@kiszk is there a use case where the preliminary RDD isn't created with UnsafeRows? If not, the data will already be corrected on read.

Anyway, looking at all the different implementations of InternalRow.setDouble, I found the following places that aren't handled (though I'm not sure there's a use case where a -0.0 can reach them after the fix):

OnHeapColumnVector.putDouble
MutableDouble.update
GenericInternalRow.update
SpecificInternalRow.setDouble

val doublesBoxed =
  groupByCollect(Seq(Double.box(0.0d), Double.box(0.0d), Double.box(-0.0d)).toDF(colName))
val floats =
  groupByCollect(Seq(0.0f, -0.0f, 0.0f).toDF(colName))
Contributor

Why do we have to turn off whole-stage codegen?

Author

Looks like leftovers from a different solution. Also, there's no need to test the boxed version now that the fix is not in the codegen. I'll simplify the test.

@cloud-fan
Contributor

ok to test

Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
Assert.assertEquals(0, Double.compare(0.0d, ByteBuffer.wrap(doubleBytes).getDouble()));
Assert.assertEquals(0, Float.compare(0.0f, ByteBuffer.wrap(floatBytes).getFloat()));
Contributor

Can we use Platform.getFloat to read the value back, to match how we write it?

Contributor

And it would be better to directly check that the binary representations of 0.0 and -0.0 are the same.
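A hedged sketch of that bit-level check (written here in Scala for brevity; it assumes only the existing Platform API, i.e. putDouble/getDouble and BYTE_ARRAY_OFFSET):

// After the fix, writing -0.0 through Platform must produce exactly the bit
// pattern of +0.0, which we verify by comparing raw bits rather than values.
val bytes = new Array[Byte](java.lang.Double.BYTES)
Platform.putDouble(bytes, Platform.BYTE_ARRAY_OFFSET, -0.0d)
val readBack = Platform.getDouble(bytes, Platform.BYTE_ARRAY_OFFSET)
assert(java.lang.Double.doubleToLongBits(readBack) ==
  java.lang.Double.doubleToLongBits(0.0d))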

@SparkQA

SparkQA commented Nov 22, 2018

Test build #99184 has finished for PR 23043 at commit 03408d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 23, 2018
…/Float

GROUP BY treats -0.0 and 0.0 as different values which is unlike hive's behavior.
In addition current behavior with codegen is unpredictable (see example in JIRA ticket).

## What changes were proposed in this pull request?

In Platform.putDouble/Float() checking if the value is -0.0, and if so replacing with 0.0.
This is used by UnsafeRow so it won't have -0.0 values.

## How was this patch tested?

Added tests

Closes #23043 from adoron/adoron-spark-26021-replace-minus-zero-with-zero.

Authored-by: Alon Doron <adoron@palantir.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0ec7b99)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master/2.4!

@asfgit asfgit closed this in 0ec7b99 Nov 23, 2018
asfgit pushed a commit that referenced this pull request Dec 8, 2018
## What changes were proposed in this pull request?

A followup of #23043

There are 4 places we need to deal with NaN and -0.0:
1. comparison expressions. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same.
2. Join keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same.
3. grouping keys. `-0.0` and `0.0` should be assigned to the same group. Different NaNs should be assigned to the same group.
4. window partition keys. `-0.0` and `0.0` should be treated as same. Different NaNs should be treated as same.

The case 1 is OK. Our comparison already handles NaN and -0.0, and for struct/array/map, we will recursively compare the fields/elements.

Case 2, 3 and 4 are problematic, as they compare `UnsafeRow` binary directly, and different NaNs have different binary representation, and the same thing happens for -0.0 and 0.0.

To fix it, a simple solution is: normalize float/double when building unsafe data (`UnsafeRow`, `UnsafeArrayData`, `UnsafeMapData`). Then we don't need to worry about it anymore.

Following this direction, this PR moves the handling of NaN and -0.0 from `Platform` to `UnsafeWriter`, so that places like `UnsafeRow.setFloat` will not handle them, which reduces the perf overhead. It's also easier to add comments explaining why we do it in `UnsafeWriter`.

## How was this patch tested?

existing tests

Closes #23239 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Dec 9, 2018
Closes apache#23239 from cloud-fan/minor.
asfgit pushed a commit that referenced this pull request Dec 9, 2018
…feWriter

backport #23239 to 2.4

Closes #23265 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@cloud-fan
Contributor

cloud-fan commented Dec 26, 2018

Given the behavior change, I think we should revert it from branch 2.4.

Although I have a different fix without the behavior change, it's a little risky to backport.

@cloud-fan cloud-fan reopened this Dec 26, 2018
@cloud-fan cloud-fan closed this Dec 26, 2018
asfgit pushed a commit that referenced this pull request Dec 27, 2018
…tDouble/Float

This PR reverts #23043 and its followup #23265, from branch 2.4, because it has behavior changes.

existing tests

Closes #23389 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Jan 9, 2019
## What changes were proposed in this pull request?

In #23043 , we introduced a behavior change: Spark users are not able to distinguish 0.0 and -0.0 anymore.

This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.

The idea is, we can rewrite the window partition key, join key and grouping key during logical phase, to normalize the special floating numbers. Thus only operators care about special floating numbers need to pay the perf overhead, and end users can distinguish -0.0.

## How was this patch tested?

existing test

Closes #23388 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23043 from adoron/adoron-spark-26021-replace-minus-zero-with-zero.
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

a followup of apache#23043. Add a test to show the minor behavior change introduced by apache#23043, and add a migration guide.

## How was this patch tested?

a new test

Closes apache#23141 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23239 from cloud-fan/minor.
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23388 from cloud-fan/minor.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
Closes apache#23043 from adoron/adoron-spark-26021-replace-minus-zero-with-zero.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
backport apache#23239 to 2.4

Closes apache#23265 from cloud-fan/minor.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
Closes apache#23389 from cloud-fan/revert.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
Closes apache#23043 from adoron/adoron-spark-26021-replace-minus-zero-with-zero.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
backport apache#23239 to 2.4

Closes apache#23265 from cloud-fan/minor.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
Closes apache#23389 from cloud-fan/revert.
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
backport apache/spark#23239 to 2.4

Closes #23265 from cloud-fan/minor.
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
Closes #23389 from cloud-fan/revert.