-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23595][SQL] ValidateExternalType should support interpreted execution #20757
Conversation
Test build #88037 has finished for PR 20757 at commit
|
private lazy val checkType = expected match { | ||
case _: DecimalType => | ||
(value: Any) => { | ||
Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], classOf[Decimal]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do 3 instanceOf checks instead? This seems a bit expensive.
} | ||
case _: ArrayType => | ||
(value: Any) => { | ||
value.getClass.isAssignableFrom(classOf[Seq[_]]) || value.getClass.isArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is cheaper? isAssignableFrom
or isArray
? If it is the latter then we should swap the order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea for that and what do u know about it in jvm? @kiszk @rednaxelafx I feel isArray
seems cheaper...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi guys, sorry I'm late.
In your new code you're doing:
+ case _: ArrayType =>
+ (value: Any) => {
+ value.getClass.isArray || value.isInstanceOf[Seq[_]]
+ }
which is good. xxx.getClass().isAssignableFrom(some_class_literal)
in the old version of this PR is actually backwards, it should have been some_class_literal.isAssignableFrom(xxx.getClass())
, e.g.
scala> classOf[String].isAssignableFrom(classOf[Object])
res0: Boolean = false
scala> classOf[Object].isAssignableFrom(classOf[String])
res1: Boolean = true
and the latter is semantically the same as xxx.isInstanceOf[some_class]
. isInstanceOf[]
is guaranteed to be at least as fast as some_class_literal.isAssignableFrom(xxx.getClass())
, and in general isInstanceOf[]
is faster.
xxx.getClass().isArray()
has a fixed overhead, whereas isInstanceOf[]
can have a fast path slightly faster than the isArray
and a slow path that can be much slower than isArray
. So putting the isArray
check first in your new code makes more sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For those curious:
In HotSpot, the straightforward interpreter/C1 implementation of xxx.getClass().isArray()
path is actually something like:
// for getClass()
klazz = xxx._klass; // read the hidden klass pointer field from the object header
clazz = klazz._java_mirror; // read the java.lang.Class reference from the Klass
// for clazz.isArray(): go through JNI and call the native JVM_IsArrayClass() inside HotSpot
klazz1 = clazz->_klass;
result = klazz1->oop_is_array();
So a JNI native method call is involved and that's not really fast. But C2 will optimize this into something similar to:
klazz = xxx._klass;
result = inlined klazz->oop_is_array();
So that's pretty fast. No need to load the java.lang.Class
(aka "Java Mirror") reference anymore.
In the xxx.isInstanceOf[Seq[_]]
case, again the interpreter version would go through a JNI native method call, whereas the C1/C2 versions will inline a fast path logic and do a quick comparison against a per-type cache. This fast path check has similar overhead to the C2 isArray()
overhead, and the slow path is a slow linear search over an array of implemented interfaces of the klass which can be much slower than the simple isArray
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the valuable info., kris! Based on the comment, the current version is ok to me.
} | ||
case _ if ScalaReflection.isNativeType(expected) => | ||
(value: Any) => { | ||
value.getClass.isAssignableFrom(ScalaReflection.classForNativeTypeOf(expected)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move the ScalaReflection.classForNativeTypeOf(expected)
out of the function body and into a val
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also be a bit faster if we use isInstance
here. All of these classes do not have subclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
} | ||
case _ => | ||
(value: Any) => { | ||
value.getClass.isAssignableFrom(dataType.asInstanceOf[ObjectType].cls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar as above. Let's move dataType.asInstanceOf[ObjectType].cls
out of the function and into a val
.
Also where is this code path in the code generated version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I thought the previous version is some difficult to read , I cleaned up the code along with the generated version. How about it?
} | ||
case _ => | ||
(value: Any) => { | ||
value.getClass.isAssignableFrom(dataType.asInstanceOf[ObjectType].cls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The external type for PythonUserDefinedType
may not necessarily be ObjectType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh..., great catch.
@maropu looks pretty solid. I left a few comments. |
@hvanhovell @viirya thanks for kindly reviews. All the comments addressed and check again? thanks! |
Test build #88046 has finished for PR 20757 at commit
|
Test build #88047 has finished for PR 20757 at commit
|
Test build #88069 has finished for PR 20757 at commit
|
@hvanhovell ping |
|
||
private val errMsg = s" is not a valid external type for schema of ${expected.simpleString}" | ||
|
||
private lazy val dataTypeClazz = if (dataType.isInstanceOf[ObjectType]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we fold this into the checkType
functions. I'd rather not deference a lazy val every time we check a type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll try
ScalaReflection.classForNativeTypeOf(dataType) | ||
} | ||
|
||
private lazy val checkType = expected match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for readability add the type here.
value.getClass.isArray || value.isInstanceOf[Seq[_]] | ||
} | ||
case _ => | ||
val dataTypeClazz = if (dataType.isInstanceOf[ObjectType]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, two more things.
- Let's just move this logic into the
classForNativeTypeOf
function. - Where are we dealing with
PythonUserDefinedType
orUserDefinedType
? TheclassForNativeTypeOf
function does not appear to be handling those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both types (PythonUserDefinedType
or UserDefinedType
) are converted into native types or ObjectType in dataType
by externalDataTypeForInput
. So, IIUC we can handle the two types here, too.
@@ -121,6 +121,19 @@ object ScalaReflection extends ScalaReflection { | |||
case _ => false | |||
} | |||
|
|||
def classForNativeTypeOf(dt: DataType): Class[_] = dt match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this match CodeGenerator.boxedType
in terms of functionality? See my previous comments, but I am also missing complex types (struct, map & array).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this function would be better to only handles the native
types that have the same format between the Spark SQL internals formats and externals (struct, map, and array have different formats between them). https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala#L209 So, this function needs to handle more wider types than CodeGenerator.boxedType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a difference between how we implement an expression and how we use an expression. In this case the implementations should behave the same, and not only in the context in which it is being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll brush up the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell How about the latest commit?
Test build #88088 has finished for PR 20757 at commit
|
Test build #88107 has finished for PR 20757 at commit
|
3662fb6
to
32a8dae
Compare
Test build #88111 has finished for PR 20757 at commit
|
value.getClass.isArray || value.isInstanceOf[Seq[_]] | ||
} | ||
case _ => | ||
val dataTypeClazz = RowEncoder.getClassFromExternalType(dataType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this always return the same result as CodeGenerator.boxedType(dataType)
in terms of functionality? I don't think it does, since it misses support for DateType
, TimestampType
, DecimalType
, StringType
, StructType
, MapType
, ArrayType
and UserDefinedType
. The thing is that it does not really matter how this expression is currently used (for datasets), what matters is how the code generated version is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll recheck again based on the comment. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, could you check again?
@@ -1440,7 +1463,7 @@ case class ValidateExternalType(child: Expression, expected: DataType) | |||
Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], classOf[Decimal]) | |||
.map(cls => s"$obj instanceof ${cls.getName}").mkString(" || ") | |||
case _: ArrayType => | |||
s"$obj instanceof ${classOf[Seq[_]].getName} || $obj.getClass().isArray()" | |||
s"$obj.getClass().isArray() || $obj instanceof ${classOf[Seq[_]].getName}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to change codegen implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the @rednaxelafx 's comment, I changed this codegen, too. #20757 (comment)
Test build #88169 has finished for PR 20757 at commit
|
retest this please |
Test build #88170 has finished for PR 20757 at commit
|
Test build #88171 has finished for PR 20757 at commit
|
ping |
Test build #88919 has finished for PR 20757 at commit
|
Test build #89084 has finished for PR 20757 at commit
|
retest this please |
Test build #89090 has finished for PR 20757 at commit
|
ping |
Test build #89550 has finished for PR 20757 at commit
|
retest this please |
Test build #89566 has finished for PR 20757 at commit
|
|
||
private val errMsg = s" is not a valid external type for schema of ${expected.simpleString}" | ||
|
||
// This function is corresponding to `CodeGenerator.boxedType` | ||
private def boxedType(dt: DataType): Class[_] = dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to ScalaReflection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@maropu can you update the PR? I have one trivial comment. LGTM otherwise. |
Many thanks for your checks! I'll fix soon! |
Test build #89599 has finished for PR 20757 at commit
|
LGTM - merging to master. Thanks! |
Thanks! |
What changes were proposed in this pull request?
This pr supported interpreted mode for
ValidateExternalType
.How was this patch tested?
Added tests in
ObjectExpressionsSuite
.