[SPARK-10999] [SQL] Coalesce should be able to handle UnsafeRow #9024

liancheng · 2015-10-08T06:09:28Z

No description provided.

liancheng · 2015-10-08T06:09:46Z

@JoshRosen This is the issue we discussed this afternoon.

rxin · 2015-10-08T06:13:50Z

LGTM

JoshRosen · 2015-10-08T06:28:45Z

LGTM as well.

JoshRosen · 2015-10-08T06:28:47Z

Jenkins, retest this please.

SparkQA · 2015-10-08T08:37:01Z

Test build #43388 has finished for PR 9024 at commit 5e0fd07.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2015-10-08T16:20:11Z

Thanks, I'm merging this to master.

… formats

## What changes were proposed in this pull request?

This patch addresses a correctness bug in Spark 1.6.x where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If `UnsafeRow` and other Row types are compared for equality then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618).

The problem is that the validity of our row format conversion rules depends on operators which handle `unsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In #9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison. Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design which coupled row format methods / metadata into a single method / class so that all three methods had to be overridden at the same time.

This patch addresses this issue by adding missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.

## How was this patch tested?

I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15185 from JoshRosen/SPARK-17618.
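For readers following the commit message above, here is a minimal sketch of the stricter equality check it describes. It does not reproduce Spark's actual `UnsafeRow` code: `BinaryRow`, `GenericRowLike`, and `RowEqualityDemo` are hypothetical stand-ins invented for illustration, and returning `false` for `null` is an assumption of this sketch rather than something the commit message states.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a binary row format such as UnsafeRow. */
final class BinaryRow {
    private final byte[] bytes;

    BinaryRow(byte[] bytes) {
        this.bytes = bytes.clone();
    }

    @Override
    public boolean equals(Object other) {
        if (other == null) {
            return false; // assumption of this sketch: null is simply "not equal"
        }
        if (other instanceof BinaryRow) {
            return Arrays.equals(bytes, ((BinaryRow) other).bytes);
        }
        // Fail loudly instead of returning a spurious `false` when the row
        // formats are mixed, so the misuse is detected where it happens.
        throw new IllegalArgumentException(
            "Cannot compare BinaryRow to " + other.getClass().getName());
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}

/** Hypothetical stand-in for a generic ("safe") row holding boxed values. */
final class GenericRowLike {
    private final Object[] values;

    GenericRowLike(Object[] values) {
        this.values = values.clone();
    }

    @Override
    public boolean equals(Object other) {
        // Lenient check: a row in a different format silently compares as unequal.
        return other instanceof GenericRowLike
            && Arrays.equals(values, ((GenericRowLike) other).values);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(values);
    }
}

final class RowEqualityDemo {
    public static void main(String[] args) {
        BinaryRow a = new BinaryRow(new byte[] {1, 2, 3});
        BinaryRow b = new BinaryRow(new byte[] {1, 2, 3});
        GenericRowLike g = new GenericRowLike(new Object[] {1, 2, 3});

        System.out.println("binary vs binary:  " + a.equals(b));
        System.out.println("generic vs binary: " + g.equals(a));
        try {
            a.equals(g);
        } catch (IllegalArgumentException e) {
            System.out.println("binary vs generic: " + e.getMessage());
        }
    }
}
```

The lenient check in `GenericRowLike` shows how a mixed-format comparison can quietly return `false`, which is the failure mode the commit message attributes to `distinct()` and `except()`. The strict check trades that silent wrong answer for an immediate exception, at the cost of departing from the usual expectation that `equals()` does not throw.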

… formats (same commit message as above, with the pull request references written as apache#9024 and apache#15185; cherry picked from commit e2ce0ca)

Lets Coalesce handle UnsafeRow

5e0fd07

asfgit closed this in 59b0606 Oct 8, 2015

liancheng deleted the spark-10999.coalesce-unsafe-row-handling branch October 8, 2015 16:23

JoshRosen mentioned this pull request Sep 21, 2016

[SPARK-17618] Fix invalid comparisons between UnsafeRow and other row formats #15185

Closed
