[SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns #7631
Test build #38297 has finished for PR 7631 at commit
Huh, it looks like this failed some Hive tests due to the new assertion that I added.
@@ -181,6 +181,7 @@ object HiveTypeCoercion {
       planName: String,
       left: LogicalPlan,
       right: LogicalPlan): (LogicalPlan, LogicalPlan) = {
+    require(left.output.length == right.output.length)
Someone tried to add this restriction before in #6174, but it failed because Hive supports different lengths.
My patch for this was motivated by the fact that my fuzz tester was throwing runtime errors for queries involving UNION ALL with differing numbers of columns: if you run such a UNION ALL query and then attempt to convert the result to a SchemaRDD you can get ArrayIndexOutOfBounds exceptions in CatalystTypeConverters. I'll see if there's a better way to fix this issue.
@cloud-fan what does Hive do? null for all the missing columns?
I tried it locally, hive will report error if will union 2 |
Test build #40999 has finished for PR 7631 at commit
There is no particular reason that |
Test build #41413 has finished for PR 7631 at commit
Test build #41414 has finished for PR 7631 at commit
@marmbrus, I've brought this up to date based on your suggestion above, so do you mind taking a look to check whether I've handled the |
LGTM
LGTM
Merging into master and branch-1.5
…re only performed on tables with equal numbers of columns

This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.

I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7631 from JoshRosen/SPARK-9293.

(cherry picked from commit 82268f0)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.
I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.
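As a rough illustration of the idea described above (not the code from this patch), the check can be sketched with toy plan types. Everything here is hypothetical: `Plan`, `Relation`, `SetOp`, and `CheckSetOps` are stand-ins invented for this example rather than Catalyst classes, and the real rule runs inside Spark's analyzer. The sketch only shows the shape of the check: a shared base class for union, intersect, and except, plus a validation pass that rejects mismatched column counts before execution.

```scala
// Toy stand-ins for logical plans; all names are hypothetical, not Catalyst's.
sealed trait Plan { def output: Seq[String] }
final case class Relation(output: Seq[String]) extends Plan

// Shared base class for binary set operations; output follows the left child.
abstract class SetOp(val name: String, val left: Plan, val right: Plan) extends Plan {
  def output: Seq[String] = left.output
}
final case class Union(l: Plan, r: Plan) extends SetOp("Union", l, r)
final case class Intersect(l: Plan, r: Plan) extends SetOp("Intersect", l, r)
final case class Except(l: Plan, r: Plan) extends SetOp("Except", l, r)

object CheckSetOps {
  // Returns Left(message) for the first set operation whose children have
  // different numbers of columns, otherwise Right(plan).
  def check(plan: Plan): Either[String, Plan] = plan match {
    case s: SetOp if s.left.output.length != s.right.output.length =>
      Left(s"${s.name} can only be performed on tables with the same number " +
        s"of columns, but the left table has ${s.left.output.length} columns " +
        s"and the right has ${s.right.output.length}")
    case s: SetOp =>
      // This node is fine, so validate both children recursively.
      check(s.left) match {
        case Left(err) => Left(err)
        case Right(_) =>
          check(s.right) match {
            case Left(err) => Left(err)
            case Right(_)  => Right(s)
          }
      }
    case other => Right(other)
  }
}
```

Compared with a bare `require(...)` inside type coercion, reporting the mismatch from a dedicated check lets the error name the operator and both column counts, so the query fails up front with a readable message instead of crashing later at runtime.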