New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-28227][SQL] Support TRANSFORM with aggregation. #25028
Conversation
@cloud-fan @gatorsmile @HyukjinKwon @jerryshao @wangyum |
@dongjoon-hyun |
Could we implement |
Yea, I agree with @wangyum's. Can we? @AngersZhuuuu |
Nowadays I am doing for support SparkThriftServer to enable proxy client user's authentication and make a PR for master. After that, I will focus on this problem. |
Can one of the admins verify this patch? |
@AngersZhuuuu, let's fix #25028 (comment) first. Feel free to open another PR. |
It's OK. |
@AngersZhuuuu one small regression: test query: Results: with this PR: Exception: This is a simplified test case from |
Yeah, i found this problem too. In some complex column expressions, we can't find a direct attribute name for a col, so I use sql pretty name. But some complex col's final name is not the same as |
Regarding the issue in #25028 (comment) Instead of
How about also using a manual aliasing function to create an alias for complex column expressions (i.e. BinaryExpression) instead of relying on Example:
Test query runs successfully. |
Good suggestion, make subquery 's output and transform's input with same alias. Can solve this problem. Nice method. |
Follow up - I think It's even better to always use a manual aliasing function and not just for a subset of expressions:
|
val namedExpressions = expressions.map { | ||
case e: NamedExpression => e | ||
case e: Expression => UnresolvedAlias(e) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re discussion in #25028 (comment) and #25028 (comment)
Alternative suggestion:
val aliasFunc = (position: Int, e: Expression) => s"gen_alias_${position}"
val namedExpressions = expressions.zipWithIndex.map {
case (e: NamedExpression, _) => e
case (e: Expression, index) => UnresolvedAlias(e, Some(aliasFunc(index, _)))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is ok even to rename all expression with aliasFunc
in this place.
After current busy job I will restart this pr and follow your nice suggestion.
Another issue: test query:
Results: with this PR: This is a simplified test case from HiveCompatibilitySuite: |
simpler test case:
|
Star cause the problem . Trying solve this . |
Found the reason, in current mode of trasform, in here spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala Line 1098 in 095f7b0
It will expand star with t.child.output, after t.child was analyzed, it will have change to
This change is reasonable since transform's input is it's child's output. |
Another test query:
The physical plan looks different with this PR applied. in particular, HashAggregate gets another output column, while it should be just one:
|
Since we can solve every thing with path a |
Here's what I mean: test query:
The physical plan without this PR (on master):
The physical plan with this PR:
Difference: |
This can be solved by #25028 (comment) |
Yes we already discussed a solution in #25028 (comment) Currently the open issue is #25028 (comment) |
This right since we now tread transform's child as an entire logical plan. It 's final output is two column, it's right. And transform use it 's out put as transform 's input. Reasonable |
@alfozan |
Seems after my pr. the output is right. |
Could you share a link to the new branch/PR? |
|
Looks good! Thank you. |
Also thank you for so many error case to make this pr better. |
@alfozan |
yes - we presented our work in https://databricks.com/session_eu19/powering-custom-apps-at-facebook-using-spark-script-transformation. I plan on sending the PRs later this month. |
please ping me when you raise a pr to learn your great job. |
@HyukjinKwon |
@AngersZhuuuu, I didn't read all fully but can we make https://issues.apache.org/jira/browse/SPARK-15694 done first as @wangyum's advice? |
Seems @alfozan is doing that part? |
Shall we wait for that first? I would like to see how it's implemented first, and possibly we should match the impl. |
Yes I'm |
What changes were proposed in this pull request?
For Spark SQL, it can't support sql like :
SELECT TRANSFORM ( d2, max(d1) as maxd1, cast(sum(d3) as string))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
HAVING maxd1 > 0
But in hive, it can support this kind SQL.
This makes our SQL migration difficult and complex。
This PR is to support use Aggregation with TRANSFORM and make SQL migration from Hive to Sparkeasier.
How was this patch tested?
Added unit test.