
[SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer. #1706

Closed (wants to merge 5 commits)

Conversation

@staple (Contributor) commented Aug 1, 2014

LogicalPlan contains a ‘resolved’ attribute indicating that all of its execution requirements have been resolved. This attribute is not checked before query execution. The analyzer contains a step to check that all Expressions are resolved, but this is not equivalent to checking all LogicalPlans. In particular, the Union plan’s implementation of ‘resolved’ verifies that the types of its children’s columns are compatible. Because the analyzer does not check that a Union plan is resolved, it is possible to execute a Union plan that outputs different types in the same column. See SPARK-2781 for an example.

This patch adds two checks to the analyzer’s CheckResolution rule. First, each logical plan is checked to see if it is not resolved despite its children being resolved. This allows the ‘problem’ unresolved plan to be included in the TreeNodeException for reporting. Then, as a backstop, the root plan is checked to see if it is resolved, which recursively checks that the entire plan tree is resolved. Note that because the resolved attribute is implemented recursively, explicitly checking it on each logical plan in the tree introduces some redundant work; I assume query plan trees will not be large enough for this redundancy to meaningfully impact performance.
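The checks described above can be sketched with a toy model (illustrative names only, not Spark’s actual catalyst classes): a plan’s `resolved` attribute recurses over its children, `Union` adds a column-type compatibility requirement, and the analyzer reports the lowest plan that is unresolved despite having resolved children.

```scala
// Toy model of the resolution check (illustrative, not Spark's classes).
sealed trait LogicalPlan {
  def children: Seq[LogicalPlan]
  def outputTypes: Seq[String]
  def childrenResolved: Boolean = children.forall(_.resolved)
  def resolved: Boolean = childrenResolved
}

// A leaf relation with known column types is always resolved.
case class Relation(outputTypes: Seq[String]) extends LogicalPlan {
  def children: Seq[LogicalPlan] = Nil
}

case class Union(left: LogicalPlan, right: LogicalPlan) extends LogicalPlan {
  def children: Seq[LogicalPlan] = Seq(left, right)
  def outputTypes: Seq[String] = left.outputTypes
  // Union is resolved only if, additionally, its children's column types match.
  override def resolved: Boolean =
    childrenResolved && left.outputTypes == right.outputTypes
}

// The analyzer check: report the *lowest* plan that is unresolved even
// though its children are resolved -- that node is the actual problem,
// which makes for a useful error message.
def findProblem(plan: LogicalPlan): Option[LogicalPlan] =
  plan.children.flatMap(findProblem).headOption.orElse {
    if (!plan.resolved && plan.childrenResolved) Some(plan) else None
  }
```

A `Union` of two relations with matching column types yields no problem node, while one with mismatched types is reported itself, even when it is buried inside a larger tree.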

Because this patch starts validating that LogicalPlans are resolved before execution, I had to fix some cases where unresolved plans were passing through the analyzer as part of the hive query system’s implementation. In particular, HiveContext applies the CreateTables, PreInsertionCasts, and ExtractPythonUdfs rules manually after the analyzer runs. I moved these rules to the analyzer stage (for hive queries only), in the process completing a code TODO indicating the rules should be moved there.

It’s worth noting that moving the CreateTables rule means introducing an analyzer rule with a significant side effect - in this case, creating a hive table. The rule will only attempt to create a table once even if its batch is executed multiple times, because it converts the InsertIntoCreatedTable plan it matches against into an InsertIntoTable. Additionally, these hive rules must be added to the Resolution batch rather than as a separate batch, because hive rules may be needed to resolve non-root nodes, leaving the root to be resolved on a subsequent batch iteration. For example, the hive compatibility test auto_smb_mapjoin_14, among others, uses a query plan whose root is a Union and whose children are each a hive InsertIntoTable.

Mixing the custom hive rules with standard analyzer rules initially resulted in an additional failure because of policy differences between spark sql and hive when casting a boolean to a string. Hive casts booleans to strings as “true” / “false” while spark sql casts booleans to strings as “1” / “0” (causing the cast1.q test to fail). This behavior comes from the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at its implementation I think converting to “1”/“0” is likely a programming mistake. (If the BooleanCasts rule is disabled, casting produces “true”/“false” instead.) I believe “true” / “false” is the correct behavior for spark sql, so I changed the cast to produce “true”/“false”, consistent with hive, and none of the existing spark tests failed.

Finally, in some initial testing with hive it appears that an implicit type coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, “” ) -> “true” while an explicit cast produces an all caps string, e.g. CAST( TRUE AS STRING ) -> “TRUE”. The change I’ve made just converts to lowercase strings in all cases. I believe it is at least more correct than the existing spark sql implementation where all Cast expressions become “1” / “0”.
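For concreteness, the behavior change amounts to the following (an illustrative sketch, not Spark’s actual Cast implementation); both explicit CAST expressions and implicit coercions go through this path, hence lowercase strings in all cases:

```scala
// Sketch of the new boolean-to-string cast behavior (illustrative only).
// Previous spark sql behavior: if (b) "1" else "0".
def castBooleanToString(b: Boolean): String =
  if (b) "true" else "false"
```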

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell (Contributor) commented Aug 1, 2014

@staple can you add [SQL] to the title of this PR? That way it gets filtered properly by our internal sorting tools.

@staple changed the title from [SPARK-2781] Check resolution of LogicalPlans in Analyzer. to [SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer. on Aug 1, 2014
@staple (Contributor, Author) commented Aug 1, 2014

Sure, fixed it.

@marmbrus (Contributor) commented Sep 3, 2014

Hi @staple, thanks for working on this and sorry it took me so long to review it! Now that the 1.1 release is on its way it would be great to revisit this. Overall, the changes you propose seem pretty reasonable to me. A few notes:

@SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@staple force-pushed the SPARK-2781 branch 2 times, most recently from 0b9878c to 80a27dc, on September 9, 2014 17:08
@staple (Contributor, Author) commented Sep 9, 2014

Ok, I merged with master.

The primary master change resulting in a merge conflict with this patch was the addition of a call to the new ExtractPythonUdfs rule in HiveContext. For now I decided to leave the explicit call to the ExtractPythonUdfs rule in place, rather than incorporate it into the Analyzer along with the CreateTables and PreInsertionCasts rules I moved there. I am not aware that ExtractPythonUdfs can affect query plan resolution, and I haven’t checked what the consequences would be if the ExtractPythonUdfs rule were called multiple times (on multiple PythonUDFs) which could happen if the rule were added to an Analyzer batch.

@staple (Contributor, Author) commented Sep 9, 2014

For #1846, either before or after works for me.

/**
* Override to provide additional rules for the "Resolution" batch.
*/
val extendedRules: List[Rule[LogicalPlan]] = Nil
Contributor (review comment):

In general I'd use the more general Seq instead of List for any declared interfaces.

Contributor Author (reply):

I'd made it a List in order to use the triple colon concat operator in Analyzer for consistency with the existing code that uses double colon, but sure I can change to Seq instead since that's preferred.

@marmbrus (Contributor):

Oh, hmmm, it's actually probably a bug that ExtractPythonUdfs is not run to fixed point. I think it is safe for you to add it using extendedRules instead of how it's done now.
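Running a rule to fixed point, as analyzer batches do, can be sketched like this (illustrative; Spark's RuleExecutor similarly caps the number of iterations per batch). A rule applied once may leave work for a later pass - e.g. a plan containing several Python UDFs - which is why batch execution repeats until the plan stops changing.

```scala
// Apply a rewrite rule repeatedly until the plan stops changing or the
// iteration cap is hit (a sketch of fixed-point batch execution).
def fixedPoint[A](rule: A => A, maxIters: Int = 100)(plan: A): A = {
  var current = plan
  var iters = 0
  var changed = true
  while (changed && iters < maxIters) {
    val next = rule(current)
    changed = next != current
    current = next
    iters += 1
  }
  current
}
```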

@marmbrus (Contributor):

ok to test

@SparkQA commented Sep 10, 2014

QA tests have started for PR 1706 at commit 80a27dc.

  • This patch merges cleanly.

@staple (Contributor, Author) commented Sep 10, 2014

Ok, sure I'll move ExtractPythonUdfs as you suggest. Thanks for taking a look!

@SparkQA commented Sep 10, 2014

QA tests have finished for PR 1706 at commit 80a27dc.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Last(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression]
    • case class LastFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction
    • case class Abs(child: Expression) extends UnaryExpression

@staple (Contributor, Author) commented Sep 10, 2014

Ok, I fixed the test failure (I forgot to update a commit after a merge) and addressed the review comments.

@SparkQA commented Sep 10, 2014

QA tests have started for PR 1706 at commit dd71904.

  • This patch merges cleanly.

@SparkQA commented Sep 10, 2014

QA tests have finished for PR 1706 at commit dd71904.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@staple (Contributor, Author) commented Sep 10, 2014

Hi, it looks like the above failure came from upstream, not from this patch. It was fixed in commit 26503fd: [HOTFIX] Fix scala style issue introduced by #2276.

@marmbrus (Contributor):

Yeah, sorry, our testing infra has been kind of flaky recently. Asked to test again.

@marmbrus (Contributor):

Jenkins, test this please.

@SparkQA commented Sep 10, 2014

QA tests have started for PR 1706 at commit dd71904.

  • This patch merges cleanly.

@SparkQA commented Sep 10, 2014

QA tests have finished for PR 1706 at commit dd71904.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@staple (Contributor, Author) commented Sep 10, 2014

I fixed a compilation error caused by a merge issue with an upstream commit from earlier today; the issue did not surface as an explicit merge conflict.

@marmbrus (Contributor):

Jenkins, test this please.

@SparkQA commented Sep 10, 2014

QA tests have started for PR 1706 at commit 32683c4.

  • This patch merges cleanly.

@SparkQA commented Sep 11, 2014

QA tests have finished for PR 1706 at commit 32683c4.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus (Contributor):

Thanks for doing this! Merged to master.

5 participants