
[SPARK-26021][SQL][followup] add test for special floating point values #23141

Closed · wants to merge 2 commits

Conversation

@cloud-fan (Contributor) commented Nov 26, 2018

What changes were proposed in this pull request?

A followup of #23043. Adds a test to show the minor behavior change introduced by #23043, and adds a migration guide entry.

How was this patch tested?

A new test.

@cloud-fan (Contributor Author)

cc @adoron @kiszk @viirya

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a grouped dataset whose key attribute is wrongly named "value" if the key is of non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
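The -0.0 behavior described above can be seen in plain Java (a minimal standalone sketch, not Spark code): IEEE 754 comparison treats the two zeros as equal, but the bit patterns and string forms differ, which is exactly what `Dataset.show`/`Dataset.collect` used to expose.

```java
public class NegativeZeroDemo {
    public static void main(String[] args) {
        double pos = 0.0d;
        double neg = -0.0d;

        // IEEE 754 comparison treats the two zeros as equal...
        System.out.println(pos == neg);  // true

        // ...but their bit patterns differ, so they are distinguishable
        System.out.println(Long.toHexString(Double.doubleToLongBits(pos)));  // 0
        System.out.println(Long.toHexString(Double.doubleToLongBits(neg)));  // 8000000000000000

        // and they print differently, which is what Dataset.show exposed
        System.out.println(neg);  // -0.0

        // Spark 3.0's normalization effectively replaces -0.0 with 0.0
        double normalized = (neg == 0.0d) ? 0.0d : neg;
        System.out.println(normalized);  // 0.0
    }
}
```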
@cloud-fan (Contributor Author)

I checked Presto and Postgres; their behaviors are the same. Hive distinguishes -0.0 and 0.0, but it has the group-by bug.

A reviewer replied:

Which version of Hive did you test? It was fixed in https://issues.apache.org/jira/browse/HIVE-11174

A member replied:

I ran a few simple queries on Hive 2.1.

Simple comparison seems ok:

hive> select 1 where 0.0=-0.0;
OK
1
Time taken: 0.047 seconds, Fetched: 1 row(s)
hive> select 1 where -0.0<0.0;
OK
Time taken: 0.053 seconds

But the group-by behavior seems incorrect:

hive> select * from test;
OK
0.0
-0.0
0.0
Time taken: 0.11 seconds, Fetched: 3 row(s)
hive> select a, count(*) from test group by a;
-0.0	3
Time taken: 1.308 seconds, Fetched: 1 row(s)
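The kind of inconsistency behind a result like this can be reproduced in plain Java (a standalone sketch, not Hive's actual implementation): primitive `==` merges the two zeros, while boxed `Double.equals`/`hashCode` are defined on the bit pattern and separate them, so hash-based and comparison-based grouping can disagree on how many groups exist.

```java
import java.util.HashMap;
import java.util.Map;

public class GroupByZeroDemo {
    public static void main(String[] args) {
        // Although 0.0 == -0.0 under primitive comparison, boxed Doubles
        // disagree: equals() and hashCode() are based on the bit pattern.
        System.out.println(Double.valueOf(0.0).equals(Double.valueOf(-0.0)));  // false

        // So a hash-based group-by over boxed keys splits the rows into two
        // groups, while a ==-based group-by would merge them into one.
        Map<Double, Integer> counts = new HashMap<>();
        for (double d : new double[] {0.0, -0.0, 0.0}) {
            counts.merge(d, 1, Integer::sum);
        }
        System.out.println(counts.size());  // 2, not 1
    }
}
```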

@SparkQA commented Nov 26, 2018

Test build #99263 has finished for PR 23141 at commit 8a9103c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Nov 26, 2018

retest this please.

@SparkQA commented Nov 26, 2018

Test build #99269 has finished for PR 23141 at commit 8a9103c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Nov 26, 2018

LGTM

@SparkQA commented Nov 26, 2018

Test build #99279 has finished for PR 23141 at commit 7c590bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

thanks, merging to master!

@asfgit asfgit closed this in 09a91d9 Nov 28, 2018
byte[] doubleBytes2 = new byte[Double.BYTES];
byte[] floatBytes2 = new byte[Float.BYTES];
Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, 0.0d);
Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, 0.0f);
A member commented on the diff above:

Unfortunately, these should be `doubleBytes2` and `floatBytes2`.
I'll comment on the follow-up PR #23239 , too. We can fix there.

@cloud-fan (Contributor Author)

ah good catch! I'm surprised this test passed before...
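Why the typo went unnoticed can be sketched in plain Java (using `ByteBuffer` in place of Spark's `Platform`; the buffer names are illustrative, not the actual test code): the array that is never written stays all zero bytes, which happens to be exactly the IEEE 754 encoding of +0.0, so the byte-level comparison still succeeded.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class CopyPasteBugDemo {
    public static void main(String[] args) {
        byte[] doubleBytes = new byte[Double.BYTES];
        byte[] doubleBytes2 = new byte[Double.BYTES];

        // The typo pattern: both writes target doubleBytes, so
        // doubleBytes2 is never filled in...
        ByteBuffer.wrap(doubleBytes).putDouble(-0.0d);
        ByteBuffer.wrap(doubleBytes).putDouble(0.0d);

        // ...but an all-zero byte array decodes as +0.0 anyway, so
        // comparing the normalized buffer against it still passes.
        System.out.println(Arrays.equals(doubleBytes, doubleBytes2));  // true
    }
}
```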

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

a followup of apache#23043 . Add a test to show the minor behavior change introduced by apache#23043 , and add migration guide.

## How was this patch tested?

a new test

Closes apache#23141 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>