
[SPARK-22739][Catalyst][WIP] Additional Expression Support for Objects #20085

Closed
wants to merge 3 commits

Conversation

bdrillard

What changes were proposed in this pull request?

This PR is a work in progress adding additional Expression support for object types. It intends to provide the expressions necessary to support custom encoders (see discussion in Spark-Avro).

This is an initial review, looking for feedback concerning a few questions and guidance concerning best unit-testing practices for new Expression classes in Catalyst.

@bdrillard
Author

cc: @marmbrus

* Invokes a static function, returning the result. By default, any of the arguments being null
* @param returnNullable When false, indicating the invoked method will always return
* non-null value.
*/
Member


Why did the comment style change? It looks inconsistent with the others.

Author


Those additional spaces shouldn't be there, I've fixed them.
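As an aside, the null contract described in the quoted scaladoc (null arguments short-circuit the call, and `returnNullable = false` asserts a non-null result) can be sketched in plain Scala. This is an illustrative mimic, not Catalyst's actual `StaticInvoke` implementation; the object and parameter names are made up for the sketch:

```scala
// Illustrative sketch (not Catalyst code) of StaticInvoke's null handling:
// if any argument is null the call is skipped and null is returned, and
// returnNullable = false treats a null result as a contract violation.
object StaticInvokeSemantics {
  def invoke(
      args: Seq[Any],
      f: Seq[Any] => Any,
      propagateNull: Boolean = true,
      returnNullable: Boolean = true): Any = {
    if (propagateNull && args.contains(null)) {
      null
    } else {
      val result = f(args)
      // When the method is declared non-nullable, a null result is a bug.
      if (!returnNullable) assert(result != null, "non-nullable method returned null")
      result
    }
  }
}
```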

@@ -390,8 +391,8 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {

test("SPARK-22696: InitializeJavaBean should not use global variables") {
Member


InitializeJavaBean -> InitializeObject.

Author


Fixed.

case class ValueIfType(
value: Expression,
checkedType: Class[_],
dataType: DataType) extends Expression with NonSQLExpression {
Member


Will we ever have a data type other than value.dataType here?

* @param dataType The type returned by the expression
*/
case class ValueIfType(
value: Expression,
Member


Should we limit the data type of value to ObjectType?

@bdrillard
Author

@viirya I've found the same intent of a ValueIfType function can be attained by adding a simpler InstanceOf expression that can be used as the predicate to the existing If expression, and then using ObjectCast on the results. That approach handles your first question. To your second question, it makes sense that the input value expression should always have a DataType of ObjectType. Is there a way you'd prefer to make that check? Or throw some kind of exception if value.dataType != ObjectType?
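The composition described here, an `InstanceOf` predicate feeding the existing `If` expression, can be sketched outside Catalyst with plain reflection. The object and the `evalIf` helper below are illustrative stand-ins, not the actual Catalyst expressions:

```scala
// Illustrative sketch: InstanceOf as a predicate composed with an If-style
// branch, using Java reflection in place of Catalyst codegen.
object InstanceOfSketch {
  // The essence of the proposed InstanceOf expression.
  def isInstance(value: Any, checkedType: Class[_]): Boolean =
    value != null && checkedType.isInstance(value)

  // The essence of If(InstanceOf(value, checkedType), thenBranch, elseBranch).
  def evalIf[T](value: Any, checkedType: Class[_])(thenB: => T)(elseB: => T): T =
    if (isInstance(value, checkedType)) thenB else elseB
}
```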

@marmbrus
Contributor

marmbrus commented Jan 3, 2018

/cc @cloud-fan @sameeragarwal

@sameeragarwal
Member

jenkins add to whitelist

@SparkQA

SparkQA commented Jan 3, 2018

Test build #85642 has finished for PR 20085 at commit 4b07b66.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InstanceOf(

@@ -1237,47 +1342,91 @@ case class DecodeUsingSerializer[T](child: Expression, tag: ClassTag[T], kryo: B
}

Author

@bdrillard bdrillard Jan 3, 2018


In order to support initializations on more complicated objects, it makes sense to generalize InitializeJavaBean to an InitializeObject that can take a sequence of method names associated with a sequence of those methods' arguments. It seems, though, that during plan analysis, Spark fails to resolve the column names against the Expression children when those child expressions are gathered from a Seq[Expression], yielding errors like:

Resolved attribute(s) 'field1,'field2 missing from field1#2,field2#3 in operator 'DeserializeToObject initializeobject(newInstance(class org.apache.spark.sql.catalyst.expressions.GenericBean), (setField1,List(assertnotnull('field1))), (setField2,List('field2.toString))), obj#4: org.apache.spark.sql.catalyst.expressions.GenericBean. Attribute(s) with the same name appear in the operation: field1,field2. Please check if the right attribute(s) are used.;
org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'field1,'field2 missing from field1#2,field2#3 in operator 'DeserializeToObject initializeobject(newInstance(class org.apache.spark.sql.catalyst.expressions.GenericBean), (setField1,List(assertnotnull('field1))), (setField2,List('field2.toString))), obj#4: org.apache.spark.sql.catalyst.expressions.GenericBean. Attribute(s) with the same name appear in the operation: field1,field2. Please check if the right attribute(s) are used.;
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)

Interestingly, if we change the setters signature from Seq[(String, Seq[Expression])] to Seq[(String, (Expression, Expression))] (which is the use case for Spark-Avro, where objects are initialized by calling put with an integer index argument and then some object argument), the plan will resolve. But of course, such a function signature would in a sense be hard-coded for Avro.

Any ideas why passing a sequence of child expression arguments would yield the analysis error above, while a tuple of those same arguments would not?
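One plausible explanation, stated here as an assumption worth verifying against Catalyst's TreeNode source rather than a confirmed diagnosis, is that the tree-rewrite machinery only recognizes certain container shapes among a node's constructor fields, so expressions nested as a Seq inside a tuple inside a Seq are never visited and therefore never resolved. The self-contained mimic below (with a stand-in `E` for Expression and made-up names) shows how such a traversal would rewrite a tuple-of-expressions shape but silently skip a tuple-of-Seq shape:

```scala
// Stand-in for Expression: `resolved` flips to true when the rewrite visits it.
case class E(name: String, resolved: Boolean = false)

object ChildRewrite {
  // Mimic of a container-aware rewrite that recognizes bare expressions and
  // (name, (expr, expr)) pairs inside a Seq field, but lets a Seq of
  // expressions nested inside a tuple fall through untouched.
  def resolveAll(field: Any): Any = field match {
    case e: E => e.copy(resolved = true)
    case s: Seq[_] => s.map {
      case e: E              => e.copy(resolved = true)
      case (n, (a: E, b: E)) => (n, (a.copy(resolved = true), b.copy(resolved = true)))
      case other             => other // (name, Seq(exprs)) is never descended into
    }
    case other => other
  }
}
```

Under this (assumed) traversal, the `Seq[(String, Seq[Expression])]` shape leaves its inner expressions unresolved, while the `Seq[(String, (Expression, Expression))]` shape resolves, matching the behavior observed above.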

result = 31 * result + (field2 != null ? field2.hashCode() : 0);
return result;
}
}
Author


This object exists just as an easy unit test for the InitializeObject problem I describe above; it doesn't necessarily need to stay as a test resource.


assert(beanFromRow.getField1 == bean.getField1)
assert(beanFromRow.getField2 == bean.getField2)
}
Author


This test case above demonstrates the issue I encountered with using a sequence of initialization arguments on an object.

@bdrillard
Author

I've added some comments describing an issue I've had with generalizing InitializeJavaBean, which I thought I'd added to this PR earlier but which seem not to have been submitted.

case class InitializeJavaBean(beanInstance: Expression, setters: Map[String, Expression])
case class InitializeObject(
objectInstance: Expression,
setters: Seq[(String, Seq[Expression])])
Contributor


To generalize, I think we can just have a NewObject expression, which just does `new SomeClass`; the setters are then just a bunch of Invokes.

Author


We can make use of NewInstance, which just creates an object of a class, but it's not clear how we can make use of a sequence of Invokes: since all these setter methods would have void return types, we can't chain them in a fluent manner.
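The constraint described here, that `Unit`-returning setters can't be composed as nested `Invoke`s, and the usual workaround of binding the instance once and returning it after the side-effecting calls, can be sketched as follows. The `Bean` class and `InitializeSketch` helper are illustrative, not PR code:

```scala
// Illustrative: a Java-bean-style setter returns Unit, so Invoke-style
// chaining (each call's result feeding the next) has nothing to chain on.
class Bean {
  private var field1: String = null
  def setField1(v: String): Unit = { field1 = v } // returns Unit, not Bean
  def getField1: String = field1
}

object InitializeSketch {
  // The shape the generated code takes instead: bind the instance once,
  // apply each setter for its side effect, then return the same instance.
  def initialize(instance: Bean, setters: Seq[Bean => Unit]): Bean = {
    setters.foreach(s => s(instance))
    instance
  }
}
```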

@sameeragarwal
Member

@bdrillard @viirya @cloud-fan are we still targeting this for 2.3?

@viirya
Member

viirya commented Jan 14, 2018

It seems to me this isn't urgent enough to need to be in 2.3.

@marmbrus
Contributor

This blocks better support for encoders on spark-avro, and seems safe, so I'd really like to include it if possible.

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan
Contributor

Hi @bdrillard, sorry for the late reply; I was thinking hard about this problem. I think we all agree that we should have more object-related expressions, so that Spark and other projects have more flexibility to do many things with codegen-able expressions.

However, we should think hard about which object-related expressions Spark should provide, and make sure they are orthogonal and composable. I'm OK with most of the expressions you added, but have some other thoughts about InitializeObject.

I propose we improve the existing NewObject and introduce a new phase: an initialize phase. NewObject would have a list of expressions as its constructor parameters, and another list of expressions for post-hoc initialization. To create a case class, the constructor-parameter expressions are non-empty and the initializing expressions are empty. To create a Java bean, it's the opposite.
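The two-phase construction proposed here can be sketched in plain Scala. `NewObjectSketch` and its field names are illustrative (the real expression would hold Catalyst expressions and generate code, not closures):

```scala
// Illustrative sketch of the proposed two-phase NewObject: constructor
// arguments build the instance, then post-hoc initializers mutate it.
// Case-class style: ctorArgs non-empty, initializers empty; bean style:
// the opposite.
case class NewObjectSketch[T](
    construct: Seq[Any] => T,
    ctorArgs: Seq[Any],
    initializers: Seq[T => Unit]) {
  def build(): T = {
    val obj = construct(ctorArgs)        // phase 1: construction
    initializers.foreach(init => init(obj)) // phase 2: initialization
    obj
  }
}
```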

@bdrillard
Author

Closing this PR in favor of #21348.

7 participants