Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns #7631

Closed
wants to merge 6 commits into from

Conversation

JoshRosen
Copy link
Contributor

This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.

I also performed a bit of cleanup to refactor some of those logical operators' code into a common SetOperation base class.

@SparkQA
Copy link

SparkQA commented Jul 24, 2015

Test build #38297 has finished for PR 7631 at commit 326d759.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ChangePrecision(child: Expression) extends UnaryExpression
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class DecimalType(precision: Int, scale: Int) extends FractionalType
    • case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion

@JoshRosen JoshRosen changed the title [SPARK-9293] Add analysis rule to ensure that set operations are only performed on tables with equal numbers of columns [SPARK-9293] Analysis should check that set operations are only performed on tables with equal numbers of columns Jul 24, 2015
@JoshRosen
Copy link
Contributor Author

Huh, it looks like this failed some Hive tests due to the new assertion that I added.

@@ -181,6 +181,7 @@ object HiveTypeCoercion {
planName: String,
left: LogicalPlan,
right: LogicalPlan): (LogicalPlan, LogicalPlan) = {
require(left.output.length == right.output.length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone has tried to add this restrict before in #6174, but failed as hive support different length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My patch for this was motivated by the fact that my fuzz tester was throwing runtime errors for queries involving UNION ALL with differing numbers of columns: if you run such a UNION ALL query and then attempt to convert the result to a SchemaRDD you can get ArrayIndexOutOfBounds exceptions in CatalystTypeConverters. I'll see if there's a better way to fix this issue.

@rxin
Copy link
Contributor

rxin commented Jul 24, 2015

@cloud-fan what does Hive do? null for all the missing columns?

@cloud-fan
Copy link
Contributor

I tried it locally, hive will report error if will union 2 select with different output length. But in our test we union 2 InsertIntoTable operations which seems a special case for hive(or maybe the output of InsertIntoTable is Nil in hive?).

@SparkQA
Copy link

SparkQA commented Aug 17, 2015

Test build #40999 has finished for PR 7631 at commit b92d817.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

@marmbrus
Copy link
Contributor

There is no particular reason that InsertIntoTable should have the same schema as its child. Initially I was thinking that you might want to chain the insert with other operations. However, [the current implementation(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L257) doesn't take advantage of this possibility so I'd propose we just make its output empty so that we can merge this patch.

@SparkQA
Copy link

SparkQA commented Aug 23, 2015

Test build #41413 has finished for PR 7631 at commit a0bd50c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

@SparkQA
Copy link

SparkQA commented Aug 23, 2015

Test build #41414 has finished for PR 7631 at commit 362e749.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

@JoshRosen
Copy link
Contributor Author

@marmbrus, I've brought this up to date based on your suggestion above, so do you mind taking a look to check whether I've handled the output correctly here?

@cloud-fan
Copy link
Contributor

LGTM

1 similar comment
@marmbrus
Copy link
Contributor

LGTM

@JoshRosen JoshRosen changed the title [SPARK-9293] Analysis should check that set operations are only performed on tables with equal numbers of columns [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns Aug 25, 2015
@marmbrus
Copy link
Contributor

Merging into master and branch-1.5

asfgit pushed a commit that referenced this pull request Aug 25, 2015
…re only performed on tables with equal numbers of columns

This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.

I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7631 from JoshRosen/SPARK-9293.

(cherry picked from commit 82268f0)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 82268f0 Aug 25, 2015
@JoshRosen JoshRosen deleted the SPARK-9293 branch August 29, 2016 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants