
[SPARK-28227][SQL] Support TRANSFORM with aggregation. #25028

Closed
wants to merge 6 commits

Conversation

AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Jul 2, 2019

What changes were proposed in this pull request?

Spark SQL can't support SQL like:
SELECT TRANSFORM ( d2, max(d1) as maxd1, cast(sum(d3) as string))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
HAVING maxd1 > 0

But Hive can support this kind of SQL.
This makes our SQL migration difficult and complex.
This PR adds support for using aggregation with TRANSFORM, making SQL migration from Hive to Spark easier.

How was this patch tested?

Added unit test.

@AngersZhuuuu AngersZhuuuu reopened this Jul 2, 2019
@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-28227][SQL] Support TRANSFORM with aggregation. [SPARK-28227][SQL] Support TRANSFORM with aggregation. Jul 3, 2019
@AngersZhuuuu
Contributor Author

@cloud-fan @gatorsmile @HyukjinKwon @jerryshao @wangyum
Hi all, could you help review this and give some advice?

@AngersZhuuuu
Contributor Author

@dongjoon-hyun
Could you help review this, or @-mention whoever works on this part?

@wangyum
Member

wangyum commented Jul 12, 2019

Could we implement ScriptTransformation in sql/core first(SPARK-15694)?

@HyukjinKwon
Member

Yea, I agree with @wangyum's suggestion. Can we? @AngersZhuuuu

@AngersZhuuuu
Contributor Author

Yea, I agree with @wangyum's suggestion. Can we? @AngersZhuuuu

Currently I am working on enabling the Spark Thrift Server to proxy the client user's authentication, and have opened a PR against master:
#25179

After that, I will focus on this problem.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

@AngersZhuuuu, let's fix #25028 (comment) first. Feel free to open another PR.

@AngersZhuuuu
Contributor Author

@AngersZhuuuu, let's fix #25028 (comment) first. Feel free to open another PR.

It's OK.

@alfozan

alfozan commented Oct 29, 2019

@AngersZhuuuu one small regression:

test query:
MAP k / 10 USING 'cat' AS (one) from (select 10 as k);

Results:
with master:
1.0

with this PR:

Exception:
Error in query: cannot resolve '(k / 10)' given input columns: [(CAST(k AS DOUBLE) / CAST(10 AS DOUBLE))];;

This is a simplified test case from HiveCompatibilitySuite:
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/mapreduce1.q

@AngersZhuuuu
Contributor Author

@AngersZhuuuu one small regression:

test query:
MAP k / 10 USING 'cat' AS (one) from (select 10 as k);

Results:
with master:
1.0

with this PR:

Exception:
Error in query: cannot resolve '(k / 10)' given input columns: [(CAST(k AS DOUBLE) / CAST(10 AS DOUBLE))];;

This is a simplified test case from HiveCompatibilitySuite:
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/mapreduce1.q

Yeah, I found this problem too. For some complex column expressions we can't find a direct attribute name for the column, so I used the pretty SQL name. But some complex columns' final names are not the same as toPrettySQL's output, which causes this problem.

@alfozan

alfozan commented Oct 29, 2019

Regarding the issue in #25028 (comment)

Instead of

    val namedExpressions = expressions.map {
      case e: NamedExpression => e
      case e: Expression => UnresolvedAlias(e)
    }

How about also using a manual aliasing function to create an alias for complex column expressions (e.g. BinaryExpression), instead of relying on toPrettySQL?

Example:

   val aliasFunc = (position: Int, e: Expression) => s"gen_alias_${position}"

    val namedExpressions = expressions.zipWithIndex.map {
      case (e: NamedExpression, _) => e
      case (e: BinaryExpression, index) => UnresolvedAlias(e, Some(aliasFunc(index, _)))
      case (e: Expression, _) => UnresolvedAlias(e)
    }

Test query runs successfully.

@AngersZhuuuu
Contributor Author

Regarding the issue in #25028 (comment)

Instead of

    val namedExpressions = expressions.map {
      case e: NamedExpression => e
      case e: Expression => UnresolvedAlias(e)
    }

How about also using a manual aliasing function to create an alias for complex column expressions (e.g. BinaryExpression), instead of relying on toPrettySQL?

Example:

   val aliasFunc = (position: Int, e: Expression) => s"gen_alias_${position}"

    val namedExpressions = expressions.zipWithIndex.map {
      case (e: NamedExpression, _) => e
      case (e: BinaryExpression, index) => UnresolvedAlias(e, Some(aliasFunc(index, _)))
      case (e: Expression, _) => UnresolvedAlias(e)
    }

Test query runs successfully.

Good suggestion: giving the subquery's output and the transform's input the same alias solves this problem. Nice method.

@alfozan

alfozan commented Oct 29, 2019

Follow-up: I think it's even better to always use a manual aliasing function, not just for a subset of expressions:

   val aliasFunc = (position: Int, e: Expression) => s"gen_alias_${position}"

    val namedExpressions = expressions.zipWithIndex.map {
      case (e: NamedExpression, _) => e
      case (e: Expression, index) => UnresolvedAlias(e, Some(aliasFunc(index, _)))
    }
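The generated-alias idea above can be sketched with a minimal, self-contained model. Note these are toy stand-ins for Spark's Expression/NamedExpression classes (the names `Named`, `Binary`, and `nameAll` are hypothetical, not Spark's API); the point is that positional aliases stay stable regardless of how toPrettySQL would render the expression:

```scala
// Toy expression model: an expression either already carries a name,
// or is a "complex" expression (e.g. a binary op) with no stable name.
sealed trait Expr
final case class Named(name: String) extends Expr
final case class Binary(op: String, l: String, r: String) extends Expr

// Mirrors the comment's aliasFunc: derive the alias from the position only.
def aliasFunc(position: Int): String = s"gen_alias_$position"

// Keep existing names; alias everything else by its position in the list.
def nameAll(exprs: Seq[Expr]): Seq[String] =
  exprs.zipWithIndex.map {
    case (Named(n), _)      => n
    case (_: Binary, index) => aliasFunc(index)
  }

// e.g. MAP k / 10 ... from (select 10 as k)
val names = nameAll(Seq(Named("k"), Binary("/", "k", "10")))
// names == Seq("k", "gen_alias_1")
```

Because the alias depends only on the position, the subquery's output column and the transform's input reference get the same name, sidestepping the toPrettySQL mismatch.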

Comment on lines +514 to +517
val namedExpressions = expressions.map {
case e: NamedExpression => e
case e: Expression => UnresolvedAlias(e)
}

@alfozan alfozan Oct 29, 2019


re discussion in #25028 (comment) and #25028 (comment)

Alternative suggestion:

   val aliasFunc = (position: Int, e: Expression) => s"gen_alias_${position}"

    val namedExpressions = expressions.zipWithIndex.map {
      case (e: NamedExpression, _) => e
      case (e: Expression, index) => UnresolvedAlias(e, Some(aliasFunc(index, _)))
    }

Contributor Author


It is OK to rename all expressions with aliasFunc here.
After my current busy job, I will restart this PR and follow your nice suggestion.

@alfozan

alfozan commented Oct 29, 2019

Another issue:

test query:

FROM (select 1 as key, 100 as value) src
MAP src.*, src.key, CAST(src.key / 10 AS INT), CAST(src.key % 10 AS INT), src.value
USING 'cat' AS (k, v, tkey, ten, one, tvalue);

Results:
with master:
1 100 1 0 1 100

with this PR:
1 100 1 100 1 0

This is a simplified test case from HiveCompatibilitySuite:
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/mapreduce7.q

@alfozan

alfozan commented Oct 29, 2019

simpler test case:

FROM (select 1 as key, 100 as value) src
MAP src.*, CAST(src.key % 10 AS INT), src.value
USING 'cat' AS (k, v, one, tvalue);

@AngersZhuuuu
Contributor Author

simpler test case:

FROM (select 1 as key, 100 as value) src
MAP src.*, CAST(src.key % 10 AS INT), src.value
USING 'cat' AS (k, v, one, tvalue);

The star expansion causes the problem. Trying to solve it.

@AngersZhuuuu
Contributor Author

simpler test case:

FROM (select 1 as key, 100 as value) src
MAP src.*, CAST(src.key % 10 AS INT), src.value
USING 'cat' AS (k, v, one, tvalue);

Found the reason. In the current transform mode, here:

case t: ScriptTransformation if containsStar(t.input) =>

it expands the star with t.child.output. After t.child is analyzed, it has the output src.key, src.value, gen_alias_2 src.value; the expand method then takes all of t.child's output matching src as the transform's input. That is how this error happens.

Change it to:

      // If the script transformation input contains Stars, expand it.
      case t: ScriptTransformation if containsStar(t.input) =>
        t.copy(
          input = t.child.output
        )

This change is reasonable since the transform's input is its child's output.
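The fix above can be illustrated with a toy model (the types `Attr`, `Star`, `Col`, and `expandInput` are hypothetical simplifications, not Spark's actual analyzer classes): when the input list contains a star, replace the whole list with the child's output rather than re-resolving each entry against the child.

```scala
// Minimal stand-ins for attributes and transform-input entries.
final case class Attr(name: String)
sealed trait Input
case object Star extends Input
final case class Col(attr: Attr) extends Input

// Sketch of the fixed rule: a star means "use the child's output as-is",
// analogous to t.copy(input = t.child.output).
def expandInput(input: Seq[Input], childOutput: Seq[Attr]): Seq[Attr] =
  if (input.contains(Star)) childOutput
  else input.collect { case Col(a) => a }

// Child analyzed output for `MAP src.*, CAST(src.key % 10 AS INT), src.value`
val childOut = Seq(Attr("key"), Attr("value"), Attr("gen_alias_2"))
val expanded = expandInput(Seq(Star, Col(Attr("value"))), childOut)
// expanded == childOut
```

Duplicated or doubly-resolved columns disappear because the child plan has already produced exactly the columns the transform should consume.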

@alfozan

alfozan commented Nov 1, 2019

Another test query:

SELECT TRANSFORM(key, abs(key))
USING 'cat'
FROM (SELECT DISTINCT key FROM src);

The physical plan looks different with this PR applied. In particular, HashAggregate gets another output column, while it should have just one:

HashAggregate(keys=[key#18], functions=[], output=[key#18, **another_output**])

@AngersZhuuuu
Contributor Author

Another test query:

SELECT TRANSFORM(key, abs(key))
USING 'cat'
FROM (SELECT DISTINCT key FROM src);

The physical plan looks different with this PR applied. In particular, HashAggregate gets another output column, while it should have just one:

HashAggregate(keys=[key#18], functions=[], output=[key#18, **another_output**])

We can solve everything by passing a Seq(UnresolvedStar(None)) to ScriptTransformation.

@alfozan

alfozan commented Nov 1, 2019

Here's what I mean:

test query:

SELECT TRANSFORM(key, abs(key))
USING 'cat'
FROM (SELECT DISTINCT key FROM src);

The physical plan without this PR (on master):

ScriptTransformation [key#7, abs(key#7)], cat, [key#9, value#10], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
+- *(2) HashAggregate(keys=[1#16], functions=[], output=[key#7])
   +- Exchange hashpartitioning(1#16, 1), true, [id=#27]
      +- *(1) HashAggregate(keys=[1 AS 1#16], functions=[], output=[1#16])
         +- *(1) Scan OneRowRelation[]

The physical plan with this PR:

ScriptTransformation [key#0, abs(key)#9], cat, [key#2, value#3], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
+- *(2) HashAggregate(keys=[1#10], functions=[], output=[key#0, abs(key)#9])
   +- Exchange hashpartitioning(1#10, 1), true, [id=#21]
      +- *(1) HashAggregate(keys=[1 AS 1#10], functions=[], output=[1#10])
         +- *(1) Scan OneRowRelation[]

Difference:
+- *(2) HashAggregate(keys=[1#16], functions=[], output=[key#7]) vs
+- *(2) HashAggregate(keys=[1#10], functions=[], output=[key#0, abs(key)#9])

@AngersZhuuuu
Contributor Author

Another issue:

query:
MAP k / 10 USING 'cat' AS (aa) from (select 10 as k);

Error in query: cannot resolve '(k / 10)' given input columns: [(CAST(k AS DOUBLE) / CAST(10 AS DOUBLE))];;

This can be solved by #25028 (comment)

@alfozan

alfozan commented Nov 1, 2019

Another issue:
query:
MAP k / 10 USING 'cat' AS (aa) from (select 10 as k);
Error in query: cannot resolve '(k / 10)' given input columns: [(CAST(k AS DOUBLE) / CAST(10 AS DOUBLE))];;

This can be solved by #25028 (comment)

Yes, we already discussed a solution in #25028 (comment), which I can confirm works.

Currently the open issue is #25028 (comment)

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Nov 1, 2019

Here's what I mean:

test query:

SELECT TRANSFORM(key, abs(key))
USING 'cat'
FROM (SELECT DISTINCT key FROM src);

The physical plan without this PR (on master):

ScriptTransformation [key#7, abs(key#7)], cat, [key#9, value#10], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
+- *(2) HashAggregate(keys=[1#16], functions=[], output=[key#7])
   +- Exchange hashpartitioning(1#16, 1), true, [id=#27]
      +- *(1) HashAggregate(keys=[1 AS 1#16], functions=[], output=[1#16])
         +- *(1) Scan OneRowRelation[]

The physical plan with this PR:

ScriptTransformation [key#0, abs(key)#9], cat, [key#2, value#3], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
+- *(2) HashAggregate(keys=[1#10], functions=[], output=[key#0, abs(key)#9])
   +- Exchange hashpartitioning(1#10, 1), true, [id=#21]
      +- *(1) HashAggregate(keys=[1 AS 1#10], functions=[], output=[1#10])
         +- *(1) Scan OneRowRelation[]

Difference:
+- *(2) HashAggregate(keys=[1#16], functions=[], output=[key#7]) vs
+- *(2) HashAggregate(keys=[1#10], functions=[], output=[key#0, abs(key)#9])

This is right, since we now treat the transform's child as an entire logical plan. Its final output is two columns, which is correct, and the transform uses that output as its input. Reasonable.

@AngersZhuuuu
Contributor Author

@alfozan
You can see my branch's newest change and check the update.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Nov 1, 2019

Difference:
+- *(2) HashAggregate(keys=[1#16], functions=[], output=[key#7]) vs
+- *(2) HashAggregate(keys=[1#10], functions=[], output=[key#0, abs(key)#9])

It seems that after my PR the output is right.

@alfozan

alfozan commented Nov 1, 2019

@alfozan
You can see my branch's newest change and check the update.

Could you share a link to the new branch/PR?

@AngersZhuuuu
Contributor Author

@alfozan
You can see my branch's newest change and check the update.

Could you share a link to the new branch/PR?

AngersZhuuuu@f18a6f9

@alfozan

alfozan commented Nov 1, 2019

Looks good! Thank you.

@AngersZhuuuu
Contributor Author

Looks good! Thank you.

Thank you too for providing so many error cases to make this PR better.

@AngersZhuuuu
Contributor Author

@alfozan
Seems you are working on https://issues.apache.org/jira/browse/SPARK-15694.
But I am confused: the current ScriptTransformation execution process has already been implemented, so are you clear on what we need to do?

@alfozan

alfozan commented Nov 5, 2019

@alfozan
Seems you are working on https://issues.apache.org/jira/browse/SPARK-15694
But I am confused: the current ScriptTransformation execution process has already been implemented, so are you clear on what we need to do?

Yes, we presented our work in https://databricks.com/session_eu19/powering-custom-apps-at-facebook-using-spark-script-transformation. I plan on sending the PRs later this month.

@AngersZhuuuu
Contributor Author

yes - we presented the work in https://databricks.com/session_eu19/powering-custom-apps-at-facebook-using-spark-script-transformation and I'll send the PRs later this month.

Please ping me when you raise the PR, so I can learn from your great work.

@AngersZhuuuu
Contributor Author

@HyukjinKwon
Can I reopen this, since @alfozan is working on the transform script part?

@HyukjinKwon
Member

@AngersZhuuuu, I didn't read it all fully, but can we get https://issues.apache.org/jira/browse/SPARK-15694 done first, per @wangyum's advice?

@AngersZhuuuu
Contributor Author

@AngersZhuuuu, I didn't read it all fully, but can we get https://issues.apache.org/jira/browse/SPARK-15694 done first, per @wangyum's advice?

Seems @alfozan is doing that part?

@HyukjinKwon
Member

Shall we wait for that first? I would like to see how it's implemented first, and possibly we should match the impl.

@alfozan

alfozan commented Dec 24, 2019

@AngersZhuuuu, I didn't read it all fully, but can we get https://issues.apache.org/jira/browse/SPARK-15694 done first, per @wangyum's advice?

Seems @alfozan is doing that part?

Yes, I am.

6 participants