[SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect #40116
I guess you may need to
Eh, this does not explain the issue at all. Please do so.
I have enabled the workflows on the branch. Is there something else that I need to do?
Sean, I am not sure which issue you were referring to. I updated the "Why are the changes needed?" section of the pull request to mirror what Zheng had already put in his pull request.
Is this about SPARK-41391? It also doesn't contain a simple description of what you're reporting, just code snippets. I can work it out, but this could be explained in just a few sentences.
Please fix the PR description too: https://spark.apache.org/contributing.html
Sean, I tried to correct the two things you pointed out. Let me know if that works.
Looks better. Title should start with
Not sure how my check-ins are causing the javadoc generation error.
It's the
Is there anything else that I need to do for the fix to be accepted?
@cloud-fan or @HyukjinKwon do you have an opinion?
/**
 * Returns true if `exprs` contains a star.
 */
def containsStar(exprs: Seq[Expression]): Boolean = |
It should be private.
Since it is only used once, I think we can inline it.
@@ -89,9 +89,18 @@ class RelationalGroupedDataset protected[sql](
case expr: NamedExpression => expr
case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
  UnresolvedAlias(a, Some(Column.generateAlias))
case ag: UnresolvedFunction if (containsStar(Seq(ag))) || ag.isDistinct => |
It's weird to have this special case. Shall we always use UnresolvedAlias?
Not sure why the suggested changes made the build fail in the
/**
 * Returns true if `exprs` contains a star.
 */
@inline final private def containsStar(exprs: Seq[Expression]): Boolean = |
Let's probably remove this.
Any comments? Apparently, turning every expression into an UnresolvedAlias does not work.
Can you share the test failures? Maybe we just need to update the tests with the different alias name.
I think the test is easy to fix. It wants to test the aggregate function result, but not the generated alias, so we just change the testing query to add alias explicitly.
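The suggestion above can be sketched as follows. This is only an illustration, assuming Spark 3.2 or later is on the classpath (for `count_distinct`); the object name, the sample data, and the alias `n_distinct` are hypothetical and do not come from the PR's test suite. The point is that a test which selects the aggregate by an explicit alias keeps passing regardless of how the auto-generated name changes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count_distinct}

// Hypothetical helper, not part of Spark or of this PR.
object ExplicitAliasSketch {
  // The explicit alias gives the aggregate column a stable name,
  // so assertions do not depend on the auto-generated alias.
  def distinctValuesPerKey(df: DataFrame): DataFrame =
    df.groupBy("key").agg(count_distinct(col("value")).as("n_distinct"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-alias-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
    val result = distinctValuesPerKey(df)

    // Select by the explicit alias and check only the aggregate values.
    val counts = result.select("key", "n_distinct").as[(String, Long)].collect().toMap
    println(counts)

    spark.stop()
  }
}
```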
Couple of questions
023-03-03T04:05:16.9822203Z 04:05:16.978 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 393.0 failed 1 times; aborting job
The auto-generated alias name is fragile and we are trying to improve it in #40126. Can you give some examples of how the new update changes the alias name? If the change is not reasonable, we should keep the previous code.
I am attaching a file showing some failures that occurred when all the aggregate expressions were made UnresolvedAlias. My latest check-in, where I only turn the aggregate expressions that contain "*" into UnresolvedAlias, works, and the build went through. So the problem is essentially the unresolvedstar() that toPrettySQL produces for an aggregate expression with a star, which the Analyzer is not able to resolve.
The single quote indicates that the expression is unresolved; I think it doesn't matter here.
@@ -40,12 +40,15 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
*/
implicit class StringToColumn(val sc: StringContext) {
def $(args: Any*): ColumnName = {
new ColumnName(sc.s(args: _*))
if (sc.parts.length == 1 && sc.parts.contains("*")) { |
what does this change fix?
Yup, this is redundant. Removed it in the latest build.
if (containsStar(Seq(expr))) {
  UnresolvedAlias(expr, None)
} else {
Alias(expr, toPrettySQL(expr))() |
If we want a surgical fix, shall we fix how toPrettySQL handles star?
If we want a surgical fix, shall we fix how toPrettySQL handles star?
Sure, we can fix toPrettySQL. But the best we can get that way is count(distinct *), which is not the same as what spark.sql produces.
If we want to match the spark.sql behavior, the best option is to create an UnresolvedAlias for any expression containing "*", as pushed in the latest build.
Perhaps the following would be a better solution: instead of looking for a star, any UnresolvedFunction should get an UnresolvedAlias. Any comments? private[this] def alias(expr: Expression): NamedExpression = expr match {
SGTM. Or more aggressively, any expression should have
Right. This is a simple one-file fix plus a test case, versus the other approach, which may involve a number of files.
Please see if this fix can be pulled.
@@ -89,7 +89,12 @@ class RelationalGroupedDataset protected[sql](
case expr: NamedExpression => expr
case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
  UnresolvedAlias(a, Some(Column.generateAlias))
case expr: Expression => Alias(expr, toPrettySQL(expr))()
case expr: Expression => |
nit:
case u: UnresolvedFunction => UnresolvedAlias(expr, None)
case expr: Expression => Alias(expr, toPrettySQL(expr))()
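To illustrate why the order of these two cases matters, here is a self-contained toy model. It is a sketch only: `Expr`, `UnresolvedFn`, `EagerAlias`, `DeferredAlias`, `pretty`, and `resolveName` are hypothetical stand-ins and not Catalyst's real classes or analyzer. The assumption is that an eagerly computed name cannot expand `*`, because the input columns are not known yet, while a deferred name can be computed after the star is expanded.

```scala
// Toy stand-ins for expression classes; NOT Spark's real definitions.
sealed trait Expr
case object Star extends Expr
final case class Attr(name: String) extends Expr
final case class Plus(left: Expr, right: Expr) extends Expr
final case class UnresolvedFn(name: String, args: Seq[Expr], isDistinct: Boolean) extends Expr

sealed trait Named
// Name fixed eagerly, before the input columns are known.
final case class EagerAlias(child: Expr, name: String) extends Named
// Name left for a later resolution step to fill in.
final case class DeferredAlias(child: Expr) extends Named

object AliasSketch {
  // Eager pretty-printer: it can only render `*` literally.
  def pretty(e: Expr): String = e match {
    case Star => "*"
    case Attr(n) => n
    case Plus(l, r) => s"(${pretty(l)} + ${pretty(r)})"
    case UnresolvedFn(n, args, distinct) =>
      val prefix = if (distinct) "DISTINCT " else ""
      s"$n($prefix${args.map(pretty).mkString(", ")})"
  }

  // Same shape as the suggested match: unresolved functions first, everything else after.
  def alias(e: Expr): Named = e match {
    case f: UnresolvedFn => DeferredAlias(f)
    case other => EagerAlias(other, pretty(other))
  }

  // Stand-in for the resolution step: expand `*` with the known columns, then name.
  def resolveName(n: Named, columns: Seq[String]): String = n match {
    case EagerAlias(_, name) => name
    case DeferredAlias(child) => pretty(expandStar(child, columns))
  }

  private def expandStar(e: Expr, columns: Seq[String]): Expr = e match {
    case UnresolvedFn(n, args, d) =>
      val expanded: Seq[Expr] = args.flatMap {
        case Star => columns.map(c => Attr(c))
        case a => Seq(expandStar(a, columns))
      }
      UnresolvedFn(n, expanded, d)
    case Plus(l, r) => Plus(expandStar(l, columns), expandStar(r, columns))
    case other => other
  }
}
```

In this toy model, `AliasSketch.resolveName(AliasSketch.alias(UnresolvedFn("count", Seq(Star), isDistinct = true)), Seq("id", "value"))` yields `count(DISTINCT id, value)`, whereas naming the same expression eagerly with `pretty` yields `count(DISTINCT *)`.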
@@ -40,12 +40,11 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
*/
implicit class StringToColumn(val sc: StringContext) {
def $(args: Any*): ColumnName = {
new ColumnName(sc.s(args: _*))
new ColumnName(sc.s(args: _*)) |
unnecessary change
@@ -45,7 +45,7 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
}

// Primitives
I think this will fail scalastyle check
Thanks, merging to master!
What changes were proposed in this pull request?
Correct the output column name of groupBy.agg(count_distinct), so that "*" is expanded into the column names and the output column name includes the distinct keyword.
Why are the changes needed?
The output column name for groupBy.agg(count_distinct) is incorrect, whereas the equivalent Spark SQL query returns the correct output column name. For groupBy.agg queries on a DataFrame, "*" is not expanded in the output column name and the distinct keyword is missing from it.
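A minimal way to compare the two code paths, assuming Spark 3.2 or later on the classpath, could look like the sketch below. The object name, the temp view name `t`, and the sample data are hypothetical. The sketch only prints the column names from each path and does not assert what they are, since the generated names differ before and after this change.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count_distinct, lit}

// Hypothetical comparison harness, not part of Spark or of this PR.
object CountDistinctColumnName {
  // Returns (column names via the DataFrame API, column names via SQL).
  def columnNames(spark: SparkSession): (Array[String], Array[String]) = {
    // Hypothetical sample data with two columns, `id` and `value`.
    val df = spark.range(1, 10).withColumn("value", lit(1))
    df.createOrReplaceTempView("t")

    // DataFrame API path: the auto-generated aggregate column name is what this PR changes.
    val viaApi = df.groupBy("id").agg(count_distinct(col("*")))

    // SQL path: the reference behavior the description compares against.
    val viaSql = spark.sql("SELECT id, COUNT(DISTINCT *) FROM t GROUP BY id")

    (viaApi.columns, viaSql.columns)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("count-distinct-column-name")
      .getOrCreate()
    val (api, sql) = columnNames(spark)
    println(s"DataFrame API: ${api.mkString(", ")}")
    println(s"SQL:           ${sql.mkString(", ")}")
    spark.stop()
  }
}
```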
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test.