
[SPARK-32159][SQL] Fix integration between Aggregator[Array[_], _, _] and UnresolvedMapObjects #28983

Closed
erikerlandson wants to merge 19 commits into apache:master from erikerlandson:fix-spark-32159

Conversation

@erikerlandson
Contributor

erikerlandson commented Jul 2, 2020

Context: The fix for SPARK-27296 introduced by #25024 allows `Aggregator` objects to appear in queries. This works fine for aggregators with atomic input types, e.g. `Aggregator[Double, _, _]`.

However, it can cause a null pointer exception if the input type is `Array[_]`. This was historically considered an ignorable case for serialization of `UnresolvedMapObjects`, but the new `ScalaAggregator` class causes these expressions to be serialized over to executors because the resolve-and-bind is being deferred.

### What changes were proposed in this pull request?

This PR adds a new rule, `ResolveEncodersInScalaAgg`, that resolves the expressions contained in the encoders, so that properly resolved expressions are serialized over to executors.

### Why are the changes needed?

Applying an aggregator of the form `Aggregator[Array[_], _, _]` using `functions.udaf()` currently causes a null pointer error in Catalyst.
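
For concreteness, here is a minimal sketch of an aggregator of this shape; the `ArraySum` name and logic are illustrative, not from the PR:

import org.apache.spark.sql.{Encoder, Encoders, functions}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical aggregator with an Array input type. Registering it via
// functions.udaf() and applying it in a query is the pattern that hit
// the null pointer error before this fix.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Array[Double]): Double = b + a.sum
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(r: Double): Double = r
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val arraySum = functions.udaf(ArraySum)
// df.agg(arraySum(df("values")))   // values: array<double>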

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A unit test has been added that does aggregation with array types for input, buffer, and output. I have done additional testing with my own custom aggregators in the Spark REPL.

@erikerlandson
Contributor Author

The fix I coded here works if the function being applied to the array elements is the identity. Experimentally, it also seems to work if values are being cast (e.g. the array elements are float but the aggregator expects an array of double); the casting expressions I see are added externally, like `(do map objects).toDoubleArray`. This fix will not work if some non-identity transform is applied to the array elements. I'm not sure whether that case actually arises, but I do not think it does in the scenario of aggregator inputs.

@erikerlandson
Contributor Author

We can follow ResolveEncodersInUDF: add a rule to resolve the encoders in ScalaAggregator on the driver side.

@cloud-fan if we do this and resolve these on the driver, does that avoid having to resolve these `UnresolvedMapObjects` on the executor side?

@cloud-fan
Contributor

does that avoid having to resolve these `UnresolvedMapObjects` on the executor side?

Yes. An encoder is a container of expressions. If the expressions are resolved, then when we serialize and send encoders to executors, we don't need to resolve them again on the executor side and can use them directly.
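
A rough illustration of the encoder lifecycle described above, using the standard `ExpressionEncoder` API:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// An encoder is created with unresolved (de)serializer expressions.
val enc = ExpressionEncoder[Array[Double]]()

// resolveAndBind() resolves those expressions (against the encoder's own
// schema by default), so the encoder can be serialized to executors and
// used directly, with no executor-side resolution.
val resolved = enc.resolveAndBind()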

@erikerlandson
Contributor Author

@cloud-fan thanks, I will try adding such a rule for `ScalaAggregator`.

@erikerlandson
Contributor Author

erikerlandson commented Jul 3, 2020

When trying to refer to either `ScalaAggregator` or `Aggregator` over in catalyst, I'm running into scoping problems, all similar to:

[error] /home/eje/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala:27: object expressions is not a member of package org.apache.spark.sql

I believe these are related to a scoping firewall around catalyst, for example:

package org.apache.spark.sql

/**
 * The physical execution component of Spark SQL. Note that this is a private package.
 * All classes in catalyst are considered an internal API to Spark SQL and are subject
 * to change between minor releases.
 */
package object execution

I tried moving `ScalaAggregator` over to `org.apache.spark.sql.catalyst.expressions`, but then it can't see `Aggregator`, and I can't move that without breaking backward compatibility.

@maropu
Member

maropu commented Jul 4, 2020

I don't think you need to move the classes; how about using `extendedResolutionRules` for that purpose?
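
For reference, a sketch of what injecting a rule through the extensions API looks like; the `AggExtensions` class name is illustrative, while `injectResolutionRule` is the `SparkSessionExtensions` hook that feeds the analyzer's `extendedResolutionRules`:

import org.apache.spark.sql.SparkSessionExtensions

// Illustrative extension that registers the new analyzer rule. It can be
// wired in via the spark.sql.extensions config or Builder.withExtensions().
class AggExtensions extends (SparkSessionExtensions => Unit) {
  def apply(ext: SparkSessionExtensions): Unit = {
    ext.injectResolutionRule(_ => ResolveEncodersInScalaAgg)
  }
}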

@erikerlandson
Contributor Author

how about using `extendedResolutionRules` for that purpose?

Would that be safe? My reading of the extensions API is that a user could completely reset any pre-applied extensions. I don't see any other predefined rules being applied this way in the Spark code currently.

@maropu
Member

maropu commented Jul 4, 2020

Have you checked the existing predefined ones?

@erikerlandson
Contributor Author

Have you checked the existing predefined ones?

Tentatively, this looks to be working in my REPL testing. The unit tests appear to bypass the use of the session builder and are currently failing. I'm playing with configuring an instance of SparkSessionExtensions in the unit-test Spark session.

@SparkQA

SparkQA commented Jul 5, 2020

Test build #124931 has finished for PR 28983 at commit 1a501d9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@erikerlandson
Contributor Author

Passing the existing aggregation unit tests, but I still need to write a new test for array input types.

@SparkQA

SparkQA commented Jul 5, 2020

Test build #124971 has finished for PR 28983 at commit bc2d880.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TestHiveExtensions extends (SparkSessionExtensions => Unit)

@SparkQA

SparkQA commented Jul 5, 2020

Test build #124975 has finished for PR 28983 at commit 399cbab.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@erikerlandson
Contributor Author

@cloud-fan @maropu using an extension rule works. The main caveat is that if a Spark session is constructed via a non-standard path that sidesteps BaseSessionStateBuilder, it won't pick this rule up; TestHive is an example.

@maropu
Member

maropu commented Jul 6, 2020

Probably, you also need to add a new rule in HiveSessionStateBuilder.

@cloud-fan
Contributor

Probably, you also need to add a new rule in HiveSessionStateBuilder.

Yeah, this is not good and should be refactored, but it's the case for now: the extra analyzer rules have to be repeated in HiveSessionStateBuilder.
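
As context, a sketch of the pattern being described: each session state builder overrides the analyzer and prepends its extra rules. This is abbreviated; the real builders add several other rules as well:

// Inside BaseSessionStateBuilder (and repeated in HiveSessionStateBuilder):
protected def analyzer: Analyzer = new Analyzer(catalogManager, conf) {
  override val extendedResolutionRules: Seq[Rule[LogicalPlan]] =
    ResolveEncodersInScalaAgg +:    // the rule added by this PR
      customResolutionRules         // rules injected via SparkSessionExtensions
}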

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125220 has finished for PR 28983 at commit d3c5d4d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125259 has started for PR 28983 at commit aca7b51.

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125269 has finished for PR 28983 at commit ee96cc0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 8, 2020

retest this please

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125284 has finished for PR 28983 at commit ee96cc0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125323 has finished for PR 28983 at commit ee96cc0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125359 has started for PR 28983 at commit ee96cc0.

@SparkQA

SparkQA commented Jul 9, 2020

Test build #125397 has finished for PR 28983 at commit 622ac1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Jul 9, 2020
[SPARK-32159][SQL] Fix integration between Aggregator[Array[_], _, _] and UnresolvedMapObjects

Closes #28983 from erikerlandson/fix-spark-32159.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1cb5bfc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan closed this in 1cb5bfc Jul 9, 2020
case p => p.transformExpressionsUp {
  case agg: ScalaAggregator[_, _, _] =>
    agg.copy(
      inputEncoder = agg.inputEncoder.resolveAndBind(),
@cloud-fan
Contributor

A follow-up we can do is to resolve and bind using the actual input data types, so that we can do casting or reorder fields.

@erikerlandson
Contributor Author

That would be nice. I tried this, but the way I did it wasn't having any effect.

@erikerlandson
Contributor Author

@cloud-fan what I had done earlier was:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.aggregate.ScalaAggregator
import org.apache.spark.sql.types.{DataType, StructField, StructType}

object ResolveEncodersInScalaAgg extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
    // Skip operators whose children are not yet resolved.
    case p if !p.resolved => p
    case p => p.transformExpressionsUp {
      case agg: ScalaAggregator[_, _, _] =>
        val children = agg.children
        require(children.nonEmpty, "Missing aggregator input")
        // Reconstruct the input schema from the aggregator's child
        // expressions: one child contributes its type directly; multiple
        // children become a struct with positional field names.
        val dataType: DataType = if (children.length == 1) children.head.dataType else {
          StructType(children.map(_.dataType).zipWithIndex.map { case (dt, j) =>
            StructField(s"_$j", dt, nullable = true)
          })
        }
        val attrs = if (agg.inputEncoder.isSerializedAsStructForTopLevel) {
          dataType.asInstanceOf[StructType].toAttributes
        } else {
          new StructType().add("input", dataType).toAttributes
        }
        // Resolve and bind on the driver, so the expressions shipped to
        // executors are already resolved.
        agg.copy(
          inputEncoder = agg.inputEncoder.resolveAndBind(attrs),
          bufferEncoder = agg.bufferEncoder.resolveAndBind())
    }
  }
}

This also passes unit tests, but it would still fail if I tried to give it Float data, so it's not automatically casting.
