
[SPARK-22739][Catalyst][WIP] Additional Expression Support for Objects #20085

Closed
wants to merge 3 commits

Conversation

bdrillard

What changes were proposed in this pull request?

This PR is a work in progress adding additional Expression support for object types. It intends to provide the expressions necessary to support custom encoders (see discussion in Spark-Avro).

This is an initial review, looking for feedback concerning a few questions and guidance concerning best unit-testing practices for new Expression classes in Catalyst.

@bdrillard
Author

cc: @marmbrus

* Invokes a static function, returning the result. By default, any of the arguments being null
* @param returnNullable When false, indicating the invoked method will always return
* non-null value.
*/
Member


Why did the comment style change? It looks inconsistent with the others.

Author


Those additional spaces shouldn't be there, I've fixed them.
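As an aside, the null contract described in the quoted scaladoc (null arguments short-circuit the call, and `returnNullable = false` asserts a non-null result) can be sketched in plain Scala. This is an illustrative mimic, not Catalyst's actual `StaticInvoke` implementation; the object and parameter names are made up for the sketch:

```scala
// Illustrative sketch (not Catalyst code) of StaticInvoke's null handling:
// if any argument is null the call is skipped and null is returned, and
// returnNullable = false treats a null result as a contract violation.
object StaticInvokeSemantics {
  def invoke(
      args: Seq[Any],
      f: Seq[Any] => Any,
      propagateNull: Boolean = true,
      returnNullable: Boolean = true): Any = {
    if (propagateNull && args.contains(null)) {
      null
    } else {
      val result = f(args)
      // When the method is declared non-nullable, a null result is a bug.
      if (!returnNullable) assert(result != null, "non-nullable method returned null")
      result
    }
  }
}
```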

@@ -390,8 +391,8 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {

test("SPARK-22696: InitializeJavaBean should not use global variables") {
Member


InitializeJavaBean -> InitializeObject.

Author


Fixed.

case class ValueIfType(
value: Expression,
checkedType: Class[_],
dataType: DataType) extends Expression with NonSQLExpression {
Member


Will we ever have a data type other than value.dataType here?

* @param dataType The type returned by the expression
*/
case class ValueIfType(
value: Expression,
Member


Should we limit the data type of value to ObjectType?

@bdrillard
Author

@viirya I've found the same intent of a ValueIfType function can be attained by adding a simpler InstanceOf expression that can be used as the predicate to the existing If expression, and then using ObjectCast on the results. That approach handles your first question. To your second question, it makes sense that the input value expression should always have a DataType of ObjectType. Is there a way you'd prefer to make that check? Or throw some kind of exception if value.dataType != ObjectType?
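The composition described here, an `InstanceOf` predicate feeding the existing `If` expression, can be sketched outside Catalyst with plain reflection. The object and the `evalIf` helper below are illustrative stand-ins, not the actual Catalyst expressions:

```scala
// Illustrative sketch: InstanceOf as a predicate composed with an If-style
// branch, using Java reflection in place of Catalyst codegen.
object InstanceOfSketch {
  // The essence of the proposed InstanceOf expression.
  def isInstance(value: Any, checkedType: Class[_]): Boolean =
    value != null && checkedType.isInstance(value)

  // The essence of If(InstanceOf(value, checkedType), thenBranch, elseBranch).
  def evalIf[T](value: Any, checkedType: Class[_])(thenB: => T)(elseB: => T): T =
    if (isInstance(value, checkedType)) thenB else elseB
}
```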

@marmbrus
Contributor

marmbrus commented Jan 3, 2018

/cc @cloud-fan @sameeragarwal

@sameeragarwal
Member

jenkins add to whitelist

@SparkQA

SparkQA commented Jan 3, 2018

Test build #85642 has finished for PR 20085 at commit 4b07b66.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InstanceOf(

@@ -1237,47 +1342,91 @@ case class DecodeUsingSerializer[T](child: Expression, tag: ClassTag[T], kryo: B
}

Author

@bdrillard bdrillard Jan 3, 2018


In order to support initializations on more complicated objects, it makes sense to generalize InitializeJavaBean to an InitializeObject that can take a sequence of method names associated with a sequence of those methods' arguments. It seems, though, that during plan analysis, Spark fails to resolve the column names against the Expression children when those child expressions are gathered from a Seq[Expression], yielding errors like:

Resolved attribute(s) 'field1,'field2 missing from field1#2,field2#3 in operator 'DeserializeToObject initializeobject(newInstance(class org.apache.spark.sql.catalyst.expressions.GenericBean), (setField1,List(assertnotnull('field1))), (setField2,List('field2.toString))), obj#4: org.apache.spark.sql.catalyst.expressions.GenericBean. Attribute(s) with the same name appear in the operation: field1,field2. Please check if the right attribute(s) are used.;
org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'field1,'field2 missing from field1#2,field2#3 in operator 'DeserializeToObject initializeobject(newInstance(class org.apache.spark.sql.catalyst.expressions.GenericBean), (setField1,List(assertnotnull('field1))), (setField2,List('field2.toString))), obj#4: org.apache.spark.sql.catalyst.expressions.GenericBean. Attribute(s) with the same name appear in the operation: field1,field2. Please check if the right attribute(s) are used.;
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)

Interestingly, if we change the setters signature from Seq[(String, Seq[Expression])] to Seq[(String, (Expression, Expression))] (which is the use case for Spark-Avro, where objects are initialized by calling put with an integer index argument and then some object argument), the plan will resolve. But of course, such a function signature would in a sense be hard-coded for Avro.

Any ideas why passing a sequence of child expression arguments would yield the analysis error above, while a tuple of those same arguments would not?
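One plausible explanation, stated here as an assumption worth verifying against Catalyst's TreeNode source rather than a confirmed diagnosis, is that the tree-rewrite machinery only recognizes certain container shapes among a node's constructor fields, so expressions nested as a Seq inside a tuple inside a Seq are never visited and therefore never resolved. The self-contained mimic below (with a stand-in `E` for Expression and made-up names) shows how such a traversal would rewrite a tuple-of-expressions shape but silently skip a tuple-of-Seq shape:

```scala
// Stand-in for Expression: `resolved` flips to true when the rewrite visits it.
case class E(name: String, resolved: Boolean = false)

object ChildRewrite {
  // Mimic of a container-aware rewrite that recognizes bare expressions and
  // (name, (expr, expr)) pairs inside a Seq field, but lets a Seq of
  // expressions nested inside a tuple fall through untouched.
  def resolveAll(field: Any): Any = field match {
    case e: E => e.copy(resolved = true)
    case s: Seq[_] => s.map {
      case e: E              => e.copy(resolved = true)
      case (n, (a: E, b: E)) => (n, (a.copy(resolved = true), b.copy(resolved = true)))
      case other             => other // (name, Seq(exprs)) is never descended into
    }
    case other => other
  }
}
```

Under this (assumed) traversal, the `Seq[(String, Seq[Expression])]` shape leaves its inner expressions unresolved, while the `Seq[(String, (Expression, Expression))]` shape resolves, matching the behavior observed above.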

result = 31 * result + (field2 != null ? field2.hashCode() : 0);
return result;
}
}
Author


This object exists just as an easy unit test for the InitializeObject problem I describe above; it doesn't necessarily need to stay as a test resource.


assert(beanFromRow.getField1 == bean.getField1)
assert(beanFromRow.getField2 == bean.getField2)
}
Author


This test case above demonstrates the issue I encountered with using a sequence of initialization arguments on an object.

@bdrillard
Author

I've added some comments describing an issue I've had with generalizing InitializeJavaBean, which I thought I'd added to this PR earlier but which seem not to have been submitted.

case class InitializeJavaBean(beanInstance: Expression, setters: Map[String, Expression])
case class InitializeObject(
objectInstance: Expression,
setters: Seq[(String, Seq[Expression])])
Contributor


To generalize, I think we can just have a NewObject expression, which just does `new SomeClass`; the setters are then just a bunch of Invokes.

Author


We can make use of NewInstance, which just creates an object of a class, but it's not clear how we can make use of a sequence of Invokes: since all these setter methods would have void return types, we can't chain them in a fluent manner.
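The constraint described here, that `Unit`-returning setters can't be composed as nested `Invoke`s, and the usual workaround of binding the instance once and returning it after the side-effecting calls, can be sketched as follows. The `Bean` class and `InitializeSketch` helper are illustrative, not PR code:

```scala
// Illustrative: a Java-bean-style setter returns Unit, so Invoke-style
// chaining (each call's result feeding the next) has nothing to chain on.
class Bean {
  private var field1: String = null
  def setField1(v: String): Unit = { field1 = v } // returns Unit, not Bean
  def getField1: String = field1
}

object InitializeSketch {
  // The shape the generated code takes instead: bind the instance once,
  // apply each setter for its side effect, then return the same instance.
  def initialize(instance: Bean, setters: Seq[Bean => Unit]): Bean = {
    setters.foreach(s => s(instance))
    instance
  }
}
```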

@sameeragarwal
Member

@bdrillard @viirya @cloud-fan are we still targeting this for 2.3?

@viirya
Member

viirya commented Jan 14, 2018

It seems to me this isn't urgent enough to need to be in 2.3.

@marmbrus
Contributor

This blocks better support for encoders on spark-avro, and seems safe, so I'd really like to include it if possible.

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan
Contributor

Hi @bdrillard, sorry for the late reply; I was thinking hard about this problem. I think we all agree that we should have more object-related expressions, so that Spark and other projects have more flexibility to do many things with codegen-able expressions.

However, we should think hard about which object-related expressions Spark should provide, and make sure they are orthogonal and composable. I'm OK with most of the expressions you added, but have some other thoughts about InitializeObject.

I propose we improve the existing NewObject and introduce a new phase: an initialize phase. NewObject would have a list of expressions as its constructor parameters, and another list of expressions for post-hoc initialization. To create a case class, the constructor-parameter expressions are non-empty and the initializing expressions are empty. To create a Java bean, it's the opposite.
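The two-phase construction proposed here can be sketched in plain Scala. `NewObjectSketch` and its field names are illustrative (the real expression would hold Catalyst expressions and generate code, not closures):

```scala
// Illustrative sketch of the proposed two-phase NewObject: constructor
// arguments build the instance, then post-hoc initializers mutate it.
// Case-class style: ctorArgs non-empty, initializers empty; bean style:
// the opposite.
case class NewObjectSketch[T](
    construct: Seq[Any] => T,
    ctorArgs: Seq[Any],
    initializers: Seq[T => Unit]) {
  def build(): T = {
    val obj = construct(ctorArgs)        // phase 1: construction
    initializers.foreach(init => init(obj)) // phase 2: initialization
    obj
  }
}
```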

@bdrillard
Author

Closing this PR in favor of #21348.

7 participants