New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22739][Catalyst][WIP] Additional Expression Support for Objects #20085
Conversation
cc: @marmbrus |
52c63f8
to
167f3a7
Compare
* @param returnNullable When false, indicating the invoked method will always return | ||
* non-null value. | ||
*/ | ||
* Invokes a static function, returning the result. By default, any of the arguments being null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we change the comment style? Looks not consistent with others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those additional spaces shouldn't be there, I've fixed them.
@@ -390,8 +391,8 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
|
|||
test("SPARK-22696: InitializeJavaBean should not use global variables") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InitializeJavaBean
-> InitializeObject
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
case class ValueIfType( | ||
value: Expression, | ||
checkedType: Class[_], | ||
dataType: DataType) extends Expression with NonSQLExpression { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we have different data type other than value.dataType
?
* @param dataType The type returned by the expression | ||
*/ | ||
case class ValueIfType( | ||
value: Expression, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we limit the data type of value
to ObjectType
?
5c35737
to
5fc052c
Compare
bb40336
to
cca458d
Compare
228c1dc
to
4b07b66
Compare
@viirya I've found the same intent of a |
jenkins add to whitelist |
Test build #85642 has finished for PR 20085 at commit
|
@@ -1237,47 +1342,91 @@ case class DecodeUsingSerializer[T](child: Expression, tag: ClassTag[T], kryo: B | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to support initializations on more complicated objects, it makes sense to generalize InitializeJavaBean
to an InitializeObject
that can take a sequence of method names associated with a sequence of those methods' arguments. It seems thought that on plan analysis, Spark fails to resolve the column names against the Expression children
when those child expressions are gathered from a Seq[Expression]
, yielding errors like:
Resolved attribute(s) 'field1,'field2 missing from field1#2,field2#3 in operator 'DeserializeToObject initializeobject(newInstance(class org.apache.spark.sql.catalyst.expressions.GenericBean), (setField1,List(assertnotnull('field1))), (setField2,List('field2.toString))), obj#4: org.apache.spark.sql.catalyst.expressions.GenericBean. Attribute(s) with the same name appear in the operation: field1,field2. Please check if the right attribute(s) are used.;
org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'field1,'field2 missing from field1#2,field2#3 in operator 'DeserializeToObject initializeobject(newInstance(class org.apache.spark.sql.catalyst.expressions.GenericBean), (setField1,List(assertnotnull('field1))), (setField2,List('field2.toString))), obj#4: org.apache.spark.sql.catalyst.expressions.GenericBean. Attribute(s) with the same name appear in the operation: field1,field2. Please check if the right attribute(s) are used.;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
Interestingly, if we change the setters
signature from Seq[(String, Seq[Expression])]
to Seq[(String, (Expression, Expression))]
(which is the use case for Spark-Avro, where objects are initialized by calling put
with an integer index argument and then some object argument), the plan will resolve. But of course, such a function signature would in a sense be hard-coded for Avro.
Any ideas why passing a sequence of child expression arguments would yield the analysis error above, while a tuple of those same arguments would not?
result = 31 * result + (field2 != null ? field2.hashCode() : 0); | ||
return result; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This object here exists just as an easy unit test for the InitializeObject
problem I describe above, it doesn't necessarily need to stay as a test resource.
|
||
assert(beanFromRow.getField1 == bean.getField1) | ||
assert(beanFromRow.getField2 == bean.getField2) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case above demonstrates the issue I encountered with using a sequence of initialization arguments on an object.
I've added some comments describing an issue I've had with generalizing |
case class InitializeJavaBean(beanInstance: Expression, setters: Map[String, Expression]) | ||
case class InitializeObject( | ||
objectInstance: Expression, | ||
setters: Seq[(String, Seq[Expression])]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To generalize, I think we can just have a NewObject
expression, which just do new SomeClass
, the setters are just a bunch of Invoke
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make use of NewInstance
which just creates an object of a class, but it's not clear how we can make use of a sequence of Invoke
, since all these setter methods would have void
return types, we can't chain them in a fluent manner.
@bdrillard @viirya @cloud-fan are we still targeting this for 2.3? |
Seems to me this doesn't need to be urgent to be in 2.3. |
This blocks better support for encoders on spark-avro, and seems safe, so I'd really like to include it in possible. |
Can one of the admins verify this patch? |
Hi @bdrillard , sorry for the late reply, as I was thinking hard about this problem. I think we all agree that we should have more object-related expressions, so that it's more flexible for Spark and other projects to do many things with the codegen-able expressions. However we should think hard about what object-related expressions Spark should provide, and make sure they are orthogonal and composable. I'm OK with most of the expressions you added, but have some other thoughts about I propose to improve the existing |
Closing this PR in favor of #21348. |
What changes were proposed in this pull request?
This PR is a work-in-progress adding additional
Expression
support for object types. It intends to provide necessary expressions to support custom encoders (see discussion in Spark-Avro).This is an initial review, looking for feedback concerning a few questions and guidance concerning best unit-testing practices for new
Expression
classes in Catalyst.