Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23595][SQL] ValidateExternalType should support interpreted execution #20757

Closed
wants to merge 7 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Mar 7, 2018

What changes were proposed in this pull request?

This pr supported interpreted mode for ValidateExternalType.

How was this patch tested?

Added tests in ObjectExpressionsSuite.

@SparkQA
Copy link

SparkQA commented Mar 7, 2018

Test build #88037 has finished for PR 20757 at commit d53cfea.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

private lazy val checkType = expected match {
case _: DecimalType =>
(value: Any) => {
Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], classOf[Decimal])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do 3 instanceOf checks instead? This seems a bit expensive.

}
case _: ArrayType =>
(value: Any) => {
value.getClass.isAssignableFrom(classOf[Seq[_]]) || value.getClass.isArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is cheaper? isAssignableFrom or isArray? If it is the latter then we should swap the order.

Copy link
Member Author

@maropu maropu Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea for that and what do u know about it in jvm? @kiszk @rednaxelafx I feel isArray seems cheaper...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi guys, sorry I'm late.

In your new code you're doing:

+    case _: ArrayType =>
+      (value: Any) => {
+        value.getClass.isArray || value.isInstanceOf[Seq[_]]
+      }

which is good. xxx.getClass().isAssignableFrom(some_class_literal) in the old version of this PR is actually backwards, it should have been some_class_literal.isAssignableFrom(xxx.getClass()), e.g.

scala> classOf[String].isAssignableFrom(classOf[Object])
res0: Boolean = false

scala> classOf[Object].isAssignableFrom(classOf[String])
res1: Boolean = true

and the latter is semantically the same as xxx.isInstanceOf[some_class]. isInstanceOf[] is guaranteed to be at least as fast as some_class_literal.isAssignableFrom(xxx.getClass()), and in general isInstanceOf[] is faster.

xxx.getClass().isArray() has a fixed overhead, whereas isInstanceOf[] can have a fast path slightly faster than the isArray and a slow path that can be much slower than isArray. So putting the isArray check first in your new code makes more sense to me.

Copy link
Contributor

@rednaxelafx rednaxelafx Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For those curious:

In HotSpot, the straightforward interpreter/C1 implementation of xxx.getClass().isArray() path is actually something like:

// for getClass()
klazz = xxx._klass; // read the hidden klass pointer field from the object header
clazz = klazz._java_mirror; // read the java.lang.Class reference from the Klass
// for clazz.isArray(): go through JNI and call the native JVM_IsArrayClass() inside HotSpot
klazz1 = clazz->_klass;
result = klazz1->oop_is_array();

So a JNI native method call is involved and that's not really fast. But C2 will optimize this into something similar to:

klazz = xxx._klass;
result = inlined klazz->oop_is_array();

So that's pretty fast. No need to load the java.lang.Class (aka "Java Mirror") reference anymore.

In the xxx.isInstanceOf[Seq[_]] case, again the interpreter version would go through a JNI native method call, whereas the C1/C2 versions will inline a fast path logic and do a quick comparison against a per-type cache. This fast path check has similar overhead to the C2 isArray() overhead, and the slow path is a slow linear search over an array of implemented interfaces of the klass which can be much slower than the simple isArray check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the valuable info., kris! Based on the comment, the current version is ok to me.

}
case _ if ScalaReflection.isNativeType(expected) =>
(value: Any) => {
value.getClass.isAssignableFrom(ScalaReflection.classForNativeTypeOf(expected))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the ScalaReflection.classForNativeTypeOf(expected) out of the function body and into a val.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be a bit faster if we use isInstance here. All of these classes do not have subclasses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

}
case _ =>
(value: Any) => {
value.getClass.isAssignableFrom(dataType.asInstanceOf[ObjectType].cls)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar as above. Let's move dataType.asInstanceOf[ObjectType].cls out of the function and into a val.

Also where is this code path in the code generated version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I thought the previous version is some difficult to read , I cleaned up the code along with the generated version. How about it?

}
case _ =>
(value: Any) => {
value.getClass.isAssignableFrom(dataType.asInstanceOf[ObjectType].cls)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The external type for PythonUserDefinedType may not necessarily be ObjectType.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh..., great catch.

@hvanhovell
Copy link
Contributor

@maropu looks pretty solid. I left a few comments.

@maropu
Copy link
Member Author

maropu commented Mar 7, 2018

@hvanhovell @viirya thanks for kindly reviews. All the comments addressed and check again? thanks!

@SparkQA
Copy link

SparkQA commented Mar 7, 2018

Test build #88046 has finished for PR 20757 at commit a204197.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 7, 2018

Test build #88047 has finished for PR 20757 at commit d951cd7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 8, 2018

Test build #88069 has finished for PR 20757 at commit dbd8b2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Mar 8, 2018

@hvanhovell ping


private val errMsg = s" is not a valid external type for schema of ${expected.simpleString}"

private lazy val dataTypeClazz = if (dataType.isInstanceOf[ObjectType]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fold this into the checkType functions. I'd rather not deference a lazy val every time we check a type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll try

ScalaReflection.classForNativeTypeOf(dataType)
}

private lazy val checkType = expected match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for readability add the type here.

value.getClass.isArray || value.isInstanceOf[Seq[_]]
}
case _ =>
val dataTypeClazz = if (dataType.isInstanceOf[ObjectType]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, two more things.

  1. Let's just move this logic into the classForNativeTypeOf function.
  2. Where are we dealing with PythonUserDefinedType or UserDefinedType? The classForNativeTypeOf function does not appear to be handling those.

Copy link
Member Author

@maropu maropu Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both types (PythonUserDefinedType or UserDefinedType) are converted into native types or ObjectType in dataType by externalDataTypeForInput. So, IIUC we can handle the two types here, too.

@@ -121,6 +121,19 @@ object ScalaReflection extends ScalaReflection {
case _ => false
}

def classForNativeTypeOf(dt: DataType): Class[_] = dt match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this match CodeGenerator.boxedType in terms of functionality? See my previous comments, but I am also missing complex types (struct, map & array).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this function would be better to only handles the native types that have the same format between the Spark SQL internals formats and externals (struct, map, and array have different formats between them). https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala#L209 So, this function needs to handle more wider types than CodeGenerator.boxedType.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a difference between how we implement an expression and how we use an expression. In this case the implementations should behave the same, and not only in the context in which it is being used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll brush up the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell How about the latest commit?

@SparkQA
Copy link

SparkQA commented Mar 8, 2018

Test build #88088 has finished for PR 20757 at commit 638a8ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 9, 2018

Test build #88107 has finished for PR 20757 at commit 66141ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 9, 2018

Test build #88111 has finished for PR 20757 at commit 32a8dae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

value.getClass.isArray || value.isInstanceOf[Seq[_]]
}
case _ =>
val dataTypeClazz = RowEncoder.getClassFromExternalType(dataType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this always return the same result as CodeGenerator.boxedType(dataType) in terms of functionality? I don't think it does, since it misses support for DateType, TimestampType, DecimalType, StringType, StructType, MapType, ArrayType and UserDefinedType. The thing is that it does not really matter how this expression is currently used (for datasets), what matters is how the code generated version is implemented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll recheck again based on the comment. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, could you check again?

@@ -1440,7 +1463,7 @@ case class ValidateExternalType(child: Expression, expected: DataType)
Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], classOf[Decimal])
.map(cls => s"$obj instanceof ${cls.getName}").mkString(" || ")
case _: ArrayType =>
s"$obj instanceof ${classOf[Seq[_]].getName} || $obj.getClass().isArray()"
s"$obj.getClass().isArray() || $obj instanceof ${classOf[Seq[_]].getName}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to change codegen implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the @rednaxelafx 's comment, I changed this codegen, too. #20757 (comment)

@SparkQA
Copy link

SparkQA commented Mar 12, 2018

Test build #88169 has finished for PR 20757 at commit 6dc1fc2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Mar 12, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Mar 12, 2018

Test build #88170 has finished for PR 20757 at commit 6dc1fc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 12, 2018

Test build #88171 has finished for PR 20757 at commit 6dc1fc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Mar 20, 2018

ping

@SparkQA
Copy link

SparkQA commented Apr 5, 2018

Test build #88919 has finished for PR 20757 at commit dd1159b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89084 has finished for PR 20757 at commit a787f9a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Apr 10, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89090 has finished for PR 20757 at commit a787f9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Apr 11, 2018

ping

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89550 has finished for PR 20757 at commit 24f2879.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Apr 19, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89566 has finished for PR 20757 at commit 24f2879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private val errMsg = s" is not a valid external type for schema of ${expected.simpleString}"

// This function is corresponding to `CodeGenerator.boxedType`
private def boxedType(dt: DataType): Class[_] = dataType match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to ScalaReflection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@hvanhovell
Copy link
Contributor

@maropu can you update the PR? I have one trivial comment. LGTM otherwise.

@maropu
Copy link
Member Author

maropu commented Apr 19, 2018

Many thanks for your checks! I'll fix soon!

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89599 has finished for PR 20757 at commit 7474811.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

LGTM - merging to master. Thanks!

@asfgit asfgit closed this in 0dd97f6 Apr 20, 2018
@maropu
Copy link
Member Author

maropu commented Apr 20, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants