[SPARK-14354][SQL] Let Expand take name expressions and infer output attributes #12138

viirya · 2016-04-03T13:39:43Z

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14354

Currently we create the Expand operator by specifying its projections (Seq[Seq[Expression]]) and its output. Expand is allowed to reuse the child operator's attributes, which makes its constraints invalid when a projection changes the values of those attributes (e.g., setting them to null for a roll up). We should let Expand take named expressions and infer its output itself.

The problem is that we reuse the child's output as Expand's output. We should create new attributes in Expand because Expand actually performs multiple projections. However, the projections in Expand are typed as Expression instead of NamedExpression, and Expand reuses the child's output attributes. Thus there is an inconsistency between Expand's output attributes and the projected values.

The obvious example of this inconsistency is constraints. Previously Expand inherited the child's constraints. Because the projections change the child's output values (e.g., setting them to null), the constraints bound to the child's attributes are no longer valid.

In a previous PR we just set Expand's validConstraints to empty to avoid this inconsistency. As a result, we don't have reliable constraints after the Expand operator.
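
As a rough illustration of the idea above, here is a minimal toy sketch of inferring output from named projections. None of these types are Catalyst classes: `Attr`, `Named`, `Ref`, `NullValue`, `reusedOutput`, `inferredOutput` and `stillApplies` are hypothetical stand-ins, and the ids are chosen by hand.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst classes.
object InferOutputSketch {
  // An attribute is identified by its id, not by its name.
  final case class Attr(name: String, id: Long)

  sealed trait Value
  final case class Ref(attr: Attr) extends Value // pass a child column through
  case object NullValue extends Value            // null the column out (roll up)

  // A named projection entry: a value plus the name and id it is exposed under.
  final case class Named(value: Value, name: String, id: Long) {
    def toAttr: Attr = Attr(name, id)
  }

  // Behaviour described as the problem: output is simply the child's attributes.
  def reusedOutput(childOutput: Seq[Attr]): Seq[Attr] = childOutput

  // Proposed behaviour: derive the output from the first named projection.
  def inferredOutput(projections: Seq[Seq[Named]]): Seq[Attr] =
    projections.head.map(_.toAttr)

  // A constraint bound to an attribute only carries over if that id is still in the output.
  def stillApplies(constraintOn: Attr, output: Seq[Attr]): Boolean =
    output.exists(_.id == constraintOn.id)

  val a = Attr("a", 1L)
  val b = Attr("b", 2L)

  // Two projections; the second one nulls out b. Ids 11 and 12 are new ids picked by hand.
  val projections: Seq[Seq[Named]] = Seq(
    Seq(Named(Ref(a), "a", 11L), Named(Ref(b), "b", 12L)),
    Seq(Named(Ref(a), "a", 11L), Named(NullValue, "b", 12L)))

  val reused: Seq[Attr] = reusedOutput(Seq(a, b))
  val inferred: Seq[Attr] = inferredOutput(projections)
}
```

With reused output, a child constraint on `b` still appears to apply to Expand even though one projection nulls `b` out. With inferred output the names are the same but the ids differ, so the child's constraint no longer matches.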

How was this patch tested?

Modified ConstraintPropagationSuite.

SparkQA · 2016-04-03T15:04:37Z

Test build #54806 has finished for PR 12138 at commit 376d2e1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-04T13:11:07Z

Test build #54845 has finished for PR 12138 at commit c9a3887.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-05T14:57:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1659,11 +1665,12 @@ object TimeWindowing extends Rule[LogicalPlan] {
          val windowEnd = windowStart + window.windowDuration

          CreateNamedStruct(
-            Literal(WINDOW_START) :: windowStart ::


Previously we manually set the output of Expand here as TimestampType (windowAttr). Because windowStart and windowEnd produce long values, inferring the output from Expand's projections yields LongType instead of TimestampType. So we need to explicitly convert the LongType values to TimestampType.
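
A small toy sketch of why the explicit conversion is needed once output types are inferred. The `DataType` and `Expr` hierarchy below is a hypothetical stand-in, not Catalyst; only the name `TimestampFromLong` comes from this PR, and its behaviour here is assumed.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst classes.
object WindowTypeSketch {
  sealed trait DataType
  case object LongType extends DataType
  case object TimestampType extends DataType

  sealed trait Expr { def dataType: DataType }
  final case class LongLit(v: Long) extends Expr {
    def dataType: DataType = LongType
  }
  final case class Add(l: Expr, r: Expr) extends Expr {
    // Arithmetic on longs stays a long in this toy model.
    def dataType: DataType = l.dataType
  }
  // Name taken from the class this PR adds; the semantics are an assumption.
  final case class TimestampFromLong(child: Expr) extends Expr {
    def dataType: DataType = TimestampType
  }

  val windowStart: Expr = LongLit(0L)
  val windowEnd: Expr = Add(windowStart, LongLit(10L))

  // Types inferred directly from the projected expressions.
  val inferredBefore: Seq[DataType] = Seq(windowStart, windowEnd).map(_.dataType)

  // Types inferred after wrapping each expression in the conversion.
  val inferredAfter: Seq[DataType] =
    Seq(windowStart, windowEnd).map(e => TimestampFromLong(e).dataType)
}
```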

SparkQA · 2016-04-05T15:05:01Z

Test build #54991 has finished for PR 12138 at commit 8a18acd.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class TimestampFromLong(child: Expression) extends UnaryExpression with ExpectsInputTypes

viirya · 2016-04-05T15:12:48Z

retest this please.

SparkQA · 2016-04-05T17:02:24Z

Test build #54993 has finished for PR 12138 at commit 8a18acd.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class TimestampFromLong(child: Expression) extends UnaryExpression with ExpectsInputTypes

viirya · 2016-04-09T00:29:28Z

ping @marmbrus @yhuai @cloud-fan

cloud-fan · 2016-04-09T02:50:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala

    child: LogicalPlan) extends UnaryNode {
+  override def output: Seq[Attribute] = {
+    // Take the first projection as output
+    val preOutput = projections.head.map(_.toAttribute)


It seems a waste that we make all projections Seq[NamedExpression] but only use the first one to produce attributes.

viirya · 2016-04-13

Yea, kind of. If we make only the first projection a Seq[NamedExpression], I think it might be a little confusing.

cloud-fan · 2016-04-09T02:56:36Z

If my understanding is right, the problem is: when the child output changes (e.g., making columns null when doing a roll up), the output of Expand can't reflect it. I have a simpler idea: when setting the output for Expand, use a placeholder to refer to a child column, and define an output method in Expand that replaces the placeholder with the child attribute. How about it?

viirya · 2016-04-11T04:16:12Z

@cloud-fan Thanks for the comment.

The problem is that we reuse the child's output as Expand's output. We should create new attributes in Expand because Expand actually performs multiple projections. However, the projections in Expand are typed as Expression instead of NamedExpression, and Expand reuses the child's output attributes. Thus there is an inconsistency between Expand's output attributes and the projected values.

The obvious example of this inconsistency is constraints. Previously Expand inherited the child's constraints. Because the projections change the child's output values (e.g., setting them to null), the constraints bound to the child's attributes are no longer valid.

In a previous PR we just set Expand's validConstraints to empty to avoid this inconsistency. As a result, we don't have reliable constraints after the Expand operator.

cloud-fan · 2016-04-11T05:26:14Z

Yea, and in this PR we use NamedExpression to avoid reusing the child's output, which should work. My proposal is that we use placeholders (maybe BoundReference) to refer to the child's output, and define an output method in Expand that replaces each placeholder with the child attribute. I haven't looked through all cases, but if this proposal works, it should be simpler to implement.

viirya · 2016-04-11T06:15:05Z

@cloud-fan Yea. This solution looks complicated because I need to fit it into the current usages of Expand... I will try your proposal and see if it works. Thanks!

viirya · 2016-04-11T07:11:42Z

@cloud-fan If you replace the placeholder with the child's attributes, doesn't that mean we reuse the child's output too?

cloud-fan · 2016-04-11T08:46:33Z

No, it's different when we do it in a method. Every time the child output changes, Expand.output will change too (think about Filter).

viirya · 2016-04-11T09:01:37Z

We may not be talking about the same thing. It is not a change of the child's output that causes the problem. Expand creates new attributes, but we use the child's output as Expand's output.

For example, suppose the child output is [a, b, c]. Currently we set it as Expand's output too.
When we do a roll up, we may set a, b, c to null, but the output of Expand doesn't reflect this change. So the constraints referring to [a, b, c] become invalid.
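
To make the [a, b, c] example concrete, here is a toy sketch over plain Scala collections. It does not use Spark; `Row`, `keepSets`, `expand` and `isNotNull` are hypothetical helpers, and the grouping sets are just one plausible roll-up shape.

```scala
// Toy model only: rows are plain maps, None stands for SQL NULL.
object RollupConstraintSketch {
  type Row = Map[String, Option[Int]]

  // Child rows: every column is non-null, so "a IS NOT NULL" holds on the child.
  val childRows: Seq[Row] = Seq(
    Map("a" -> Some(1), "b" -> Some(2), "c" -> Some(3)))

  // One entry per projection: the columns it keeps; the rest are nulled out.
  val keepSets: Seq[Set[String]] = Seq(
    Set("a", "b", "c"), Set("a", "b"), Set("a"), Set.empty[String])

  // Emit one output row per (input row, projection) pair, as Expand does.
  def expand(rows: Seq[Row]): Seq[Row] =
    for (row <- rows; keep <- keepSets)
      yield row.map { case (k, v) => k -> (if (keep(k)) v else None) }

  def isNotNull(col: String, row: Row): Boolean = row(col).isDefined

  val expanded: Seq[Row] = expand(childRows)

  // The constraint is valid on the child's rows...
  val holdsOnChild: Boolean = childRows.forall(r => isNotNull("a", r))
  // ...but not on the rows Expand produces, although the column is still called "a".
  val holdsOnExpand: Boolean = expanded.forall(r => isNotNull("a", r))
}
```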

cloud-fan · 2016-04-11T09:22:58Z

> But our output in Expand doesn't reflect this change.

That's because Expand.output is a val. If we make it a def and use child.output to construct the output of Expand every time the output method is called, the problem should be fixed.
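
The val-versus-def distinction can be sketched with a toy model. `Child`, `ExpandWithVal` and `ExpandWithDef` are hypothetical classes; the mutable child exists only to make the difference observable, since real plan nodes are immutable and are replaced rather than mutated.

```scala
// Toy model only: hypothetical classes, not Catalyst plan nodes.
object OutputValVsDefSketch {
  final case class Attr(name: String, id: Long)

  // Mutable purely for demonstration purposes.
  final class Child(var output: Seq[Attr])

  final class ExpandWithVal(child: Child) {
    val output: Seq[Attr] = child.output // captured once, at construction
  }

  final class ExpandWithDef(child: Child) {
    def output: Seq[Attr] = child.output // recomputed on every call
  }

  // Returns (output seen through the val, output seen through the def)
  // after the child's output has changed.
  def demo(): (Seq[Attr], Seq[Attr]) = {
    val child = new Child(Seq(Attr("a", 1L)))
    val withVal = new ExpandWithVal(child)
    val withDef = new ExpandWithDef(child)
    child.output = Seq(Attr("a", 2L))
    (withVal.output, withDef.output)
  }
}
```

In both variants the attributes are still the child's own, so this sketch says nothing about projections that null out values.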

viirya · 2016-04-11T09:33:50Z

Why? Every time we call the output method on Expand, it still uses the child's output to construct its own output. But the child's output doesn't know that the values have been changed, does it?

viirya · 2016-04-11T09:37:52Z

I think the output attributes of the Expand operator should be based on its projections, not on the child's output.

cloud-fan · 2016-04-11T11:56:36Z

I may be misunderstanding the problem. Could you add a test case in this PR to show what was wrong before?

viirya · 2016-04-13T06:58:04Z

@cloud-fan The obviously wrong case is Expand's constraints. I've modified the test in ConstraintPropagationSuite. Basically, the previous usage works except for constraint inference.

Infer Expand output from projections.

376d2e1

Fix tests.

c9a3887

viirya added 2 commits April 5, 2016 08:35

Use correct exprId.

606b100

Fix test due to wrongly set output in Expand in previous implementation.

8a18acd

viirya reviewed Apr 5, 2016

cloud-fan reviewed Apr 9, 2016

viirya closed this Oct 6, 2016

viirya deleted the expand-name-expr branch December 27, 2023 18:33
