[SPARK-32159][SQL] Fix integration between Aggregator[Array[_], _, _] and UnresolvedMapObjects #28983
Conversation
The fix I coded here works if the function being applied to array elements is "identity". Experimentally, it also seems to work if values are being cast (e.g. array elements are float but the aggregator is expecting an array of double). The casting expressions I see are being added externally, like …
@cloud-fan if we do this, and resolve these on the driver, does that avoid having to resolve these again on the executors?
Yes. An encoder is a container of expressions. If the expressions are resolved, then when we serialize and send encoders to executors, we don't need to resolve them again at the executor side and can use them directly.
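To make the mechanics concrete, here is a minimal sketch (not from this PR) of resolving an encoder eagerly on the driver so that the serialized copy shipped to executors is already usable:

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Freshly created encoders contain unresolved expressions.
val enc = ExpressionEncoder[(Double, Array[Double])]()

// Resolving and binding on the driver produces an encoder whose expressions
// no longer need analysis; it can be serialized to executors and used directly.
val resolved = enc.resolveAndBind()
```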
@cloud-fan thanks, I will try adding such a rule for ScalaAggregator.
When trying to refer to either ScalaAggregator or Aggregator over in catalyst, I'm running into some scoping problems, which are all similar to:

```
[error] /home/eje/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala:27: object expressions is not a member of package org.apache.spark.sql
```

I believe these are related to a scoping firewall around catalyst, for example:

```scala
package org.apache.spark.sql

/**
 * The physical execution component of Spark SQL. Note that this is a private package.
 * All classes in catalyst are considered an internal API to Spark SQL and are subject
 * to change between minor releases.
 */
package object execution
```

I tried moving ScalaAggregator over to …
I think you don't need to move the classes; how about using the extensions API (`SparkSessionExtensions`) instead?
Would that be safe? My reading of the extensions API is that a user could completely reset any pre-applied extensions. I don't see any other pre-defined rules being applied this way in the Spark code currently.
Have you checked the existing predefined ones? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala#L174-L180
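For context, a minimal sketch of what injecting a resolution rule through the extensions API looks like; the class name here is illustrative, and the import assumes the rule lives in `org.apache.spark.sql.execution.aggregate` as in this PR's udaf.scala:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.execution.aggregate.ResolveEncodersInScalaAgg

// Illustrative extensions class: registers the PR's rule as an extra
// analyzer resolution rule for every session built with these extensions.
class AggEncoderExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectResolutionRule(_ => ResolveEncodersInScalaAgg)
  }
}
```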
Tentatively, this looks to be working in my REPL testing. The unit tests appear to bypass the session builder, and they are currently failing. I'm experimenting with configuring an instance of SparkSessionExtensions in the unit-test Spark session.
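One way that wiring could look in a test (a hedged sketch; the master URL, app name, and setup are illustrative, not from the PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.aggregate.ResolveEncodersInScalaAgg

// Build a test session whose analyzer picks up the rule, since the test
// harness bypasses the normal BaseSessionStateBuilder path.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("scala-aggregator-test")
  .withExtensions(_.injectResolutionRule(_ => ResolveEncodersInScalaAgg))
  .getOrCreate()
```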
Test build #124931 has finished for PR 28983 at commit
Passing the existing aggregation unit tests, but I still need to write a new test for array input types.
Test build #124971 has finished for PR 28983 at commit
Test build #124975 has finished for PR 28983 at commit
@cloud-fan @maropu using an extension rule works. The main caveat is that if a Spark session is constructed via a non-standard path that sidesteps BaseSessionStateBuilder, it won't pick this rule up, as with TestHive for example.
Probably, you also need to add a new rule in …
Yeah, this is not good and should be refactored, but this is the case for now. The extra analyzer rules have to be repeated in …
Test build #125220 has finished for PR 28983 at commit
Test build #125259 has started for PR 28983 at commit
Test build #125269 has finished for PR 28983 at commit
retest this please
Test build #125284 has finished for PR 28983 at commit
retest this please
Test build #125323 has finished for PR 28983 at commit
retest this please
Test build #125359 has started for PR 28983 at commit
Test build #125397 has finished for PR 28983 at commit
thanks, merging to master/3.0!
… and UnresolvedMapObjects

Context: The fix for SPARK-27296 introduced by #25024 allows `Aggregator` objects to appear in queries. This works fine for aggregators with atomic input types, e.g. `Aggregator[Double, _, _]`. However it can cause a null pointer exception if the input type is `Array[_]`. This was historically considered an ignorable case for serialization of `UnresolvedMapObjects`, but the new ScalaAggregator class causes these expressions to be serialized over to executors because the resolve-and-bind is being deferred.

### What changes were proposed in this pull request?
A new rule `ResolveEncodersInScalaAgg` that performs the resolution of the expressions contained in the encoders, so that properly resolved expressions are serialized over to executors.

### Why are the changes needed?
Applying an aggregator of the form `Aggregator[Array[_], _, _]` using `functions.udaf()` currently causes a null pointer error in Catalyst.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A unit test has been added that does aggregation with array types for input, buffer, and output. I have done additional testing with my own custom aggregators in the Spark REPL.

Closes #28983 from erikerlandson/fix-spark-32159.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1cb5bfc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
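To illustrate the scenario being fixed, here is a sketch of an array-input aggregator registered through `functions.udaf()`; the aggregator and the DataFrame `df` (with a `key` column and an `array<double>` column `values`) are illustrative, not taken from the PR's test suite:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

// Illustrative aggregator with an array input type, the shape that
// previously triggered the null pointer error in UnresolvedMapObjects.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Array[Double]): Double = b + a.sum
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(b: Double): Double = b
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// With this PR, the UDAF can be applied to array-typed columns without
// hitting the executor-side NPE.
val arraySum = udaf(ArraySum)
df.groupBy(col("key")).agg(arraySum(col("values")))
```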
```scala
case p => p.transformExpressionsUp {
  case agg: ScalaAggregator[_, _, _] =>
    agg.copy(
      inputEncoder = agg.inputEncoder.resolveAndBind(),
```
A follow-up we can do is to resolve and bind using the actual input data types, so that we can do casting or reorder fields.
That would be nice. I tried this, but the way I did it wasn't having any effect.
@cloud-fan what I had done earlier was:

```scala
object ResolveEncodersInScalaAgg extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
    case p if !p.resolved => p
    case p => p.transformExpressionsUp {
      case agg: ScalaAggregator[_, _, _] =>
        val children = agg.children
        require(children.length > 0, "Missing aggregator input")
        val dataType: DataType = if (children.length == 1) children.head.dataType else {
          StructType(children.map(_.dataType).zipWithIndex.map { case (dt, j) =>
            StructField(s"_$j", dt, true)
          })
        }
        val attrs = if (agg.inputEncoder.isSerializedAsStructForTopLevel) {
          dataType.asInstanceOf[StructType].toAttributes
        } else {
          (new StructType().add("input", dataType)).toAttributes
        }
        agg.copy(
          inputEncoder = agg.inputEncoder.resolveAndBind(attrs),
          bufferEncoder = agg.bufferEncoder.resolveAndBind())
    }
  }
}
```

This also passes unit tests, but it would still fail if I tried to give it `Float` data, so it's not automatically casting.