[SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum #18488
Conversation
@@ -0,0 +1,32 @@
package org.apache.spark.sql.catalyst;
(Copy the copyright header you see in other files)
Or, it looks like Encoders-related tests are otherwise all in JavaDatasetSuite. That may be a better place anyway.
done
This fix looks good. Would it be possible to put details on why this problem happens and how it is fixed in the description of this PR, as written in the JIRA entry?
@kiszk check it out
@mike0sv looks good, thanks. It will help us understand this more easily in the future.
ok to test
Hm.. sounds like the current test fails during code generation.
final test.org.apache.spark.sql.JavaDatasetSuite$EnumBean value1 = false ? null : new test.org.apache.spark.sql.JavaDatasetSuite$EnumBean();
This looks like it is because it tries to create the object via the enum's private constructor. @kiszk, I believe you know this bit well. What do you think about this?
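(A small illustration of why generated code of that shape can never work for an enum, sketched with TimeUnit standing in for the test's enum: an enum's constructor is not accessible, and the declared constants are the only instances a deserializer could produce.)

```scala
// Illustration only; TimeUnit stands in for the hypothetical EnumBean.
val enumClass = classOf[java.util.concurrent.TimeUnit]
val constants = enumClass.getEnumConstants // NANOSECONDS, MICROSECONDS, ... -- the only instances
// enumClass.getDeclaredConstructor().newInstance() would fail: there is no accessible no-arg constructor
```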
}

@Test
public void testEnum() throws Exception {
Do we need throws Exception?
List<EnumBean> data = Arrays.asList(EnumBean.B);
Dataset<EnumBean> ds = spark.createDataset(data, Encoders.bean(EnumBean.class));

Assert.assertEquals(ds.collectAsList().size(), 1);
I would go Assert.assertEquals(data, ds.collectAsList());
The reproducer in the JIRA used
ok to test
Test build #79070 has finished for PR 18488 at commit
Hm, backing up, this shouldn't actually work, should it? Enums aren't beans, and this newer error that gets uncovered demonstrates that more directly. Do you need to define an encoder to use an enum? It should just be serialized as a string or int already. If not, that's what needs to be fixed.
Test build #79104 has finished for PR 18488 at commit
I reworked the code to ser/de enums into ints (according to declaration order). However, I recreate the mapping for each object, which is obviously very bad. I need to create the mapping once (per partition, I guess) and then use it for all objects. Please tell me how that can be achieved.
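A hedged sketch of what that ordinal-based approach amounts to, with illustrative helper names rather than the PR's actual code, including the per-record waste being described:

```scala
// Int-based enum ser/de keyed on declaration order; names here are illustrative.
def serializeOrdinal[T <: Enum[T]](value: T): Int = value.ordinal()

def deserializeOrdinal[T <: Enum[T]](clazz: Class[T], ordinal: Int): T = {
  // Rebuilding this mapping for every record is the waste being described:
  // ideally it would be computed once per class (or per partition) and reused.
  val mapping: Map[Int, T] = clazz.getEnumConstants.zipWithIndex.map(_.swap).toMap
  mapping(ordinal)
}
```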
How about just using Encoders.javaSerialization() with enums?
It won't work if I have an enum field inside a regular Java bean.
I see, so the bean encoder assumes every property is a bean.
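For reference, a sketch of the javaSerialization suggestion (TimeUnit stands in for a user enum): it covers a top-level enum dataset, but cannot be applied to an enum-typed property, because the bean encoder infers every property's schema itself.

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.Encoders

// Works when the enum itself is the dataset's element type...
val enumEncoder = Encoders.javaSerialization(classOf[TimeUnit])
// ...but a bean with a TimeUnit field still goes through Encoders.bean(...), which
// derives the field's encoding on its own and has no hook for a custom field encoder.
```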
Test build #79105 has finished for PR 18488 at commit
val returnType = typeToken.method(property.getReadMethod).getReturnType
val (dataType, nullable) = inferDataType(returnType, seenTypeSet + other)
new StructField(property.getName, dataType, nullable)
if (typeToken.getRawType.isEnum) {
Could this be a case rawType if rawType.isEnum => case instead?
I'd also imagine the most natural type for an enum is a string, not a struct containing an int, but maybe I haven't thought that through.
Probably, but int is cheaper. However, if we use string, there is a possibility for meaningful queries. Also, there are strange cases with complex enums that have different fields, even with complex types. I'd say a string with the constant name is enough, though.
}

/** Returns a mapping from int to enum value for given enum type */
def enumDeserializer[T](enum: Class[T]): Int => T = {
The enum's values() is already effectively a mapping from int to enum values -- much cheaper to access?
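In other words (a small sketch, with TimeUnit standing in for the enum under discussion):

```scala
import java.util.concurrent.TimeUnit

// values() already is the int -> constant mapping, in declaration order,
// and ordinal() is its inverse, so no separate map has to be built.
val byIndex: Array[TimeUnit] = TimeUnit.values()
val unit: TimeUnit = byIndex(1)      // MICROSECONDS
val index: Int = unit.ordinal()      // 1
```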
Good point. But if we use string values, there is already a hash map inside the enum implementation, accessible via valueOf.
But it's still not very good, because we have to look up the enum by class name for each object we serialize/deserialize. Do you have any ideas?
Ran into something strange. I changed ints to strings and it worked fine. But then I added a test for encoding a bean with an enum inside, and the test failed. It failed because in my implementation the deserializer returns Object (AnyRef): if the enum is a top-level object it works fine, but if the enum is a field, it is set via a setter, which does not accept an arbitrary object. So I changed the implementation of the deserializer to the following.
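(The snippet referred to as "the following" is not preserved in this excerpt; based on the diff further down, it was presumably the typed deserializer along these lines:)

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Presumed shape of the reworked deserializer (compare the enumDeserializer shown in
// the diff below): it returns T rather than AnyRef so the generated setter call can
// type-check -- which is what then surfaced the assignment error quoted next.
def enumDeserializer[T <: Enum[T]](enum: Class[T]): InternalRow => T = {
  assert(enum.isEnum)
  (value: InternalRow) => Enum.valueOf(enum, value.getUTF8String(0).toString)
}
```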
Now it failed with "Assignment conversion not possible from type "java.lang.Enum" to type "test.org.apache.spark.sql.JavaDatasetSuite$EnumBean"" at
which is odd, because if I call it from regular code it compiles just fine.
I even tried moving the code to a Java class just in case, but that did no good.
Test build #80163 has finished for PR 18488 at commit
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
Test build #80166 has finished for PR 18488 at commit
@srowen @HyukjinKwon hey guys, I think I got this, take a look. Some SparkR tests failed for some reason, but I think it's not my fault =|
@@ -79,7 +79,7 @@ public ExpressionInfo(
        assert name != null;
        assert arguments != null;
        assert examples != null;
-       assert examples.isEmpty() || examples.startsWith("\n Examples:");
+       assert examples.isEmpty() || examples.startsWith(System.lineSeparator() + " Examples:");
I guess this one is not related?
No, but without this it's not possible to run the tests if you have a different line separator (on Windows, for example).
I don't think we support Windows for dev. This assertion should probably be weakened anyway but that's a separate issue from this PR.
ok, I got rid of it
Test build #80489 has finished for PR 18488 at commit
@srowen @HyukjinKwon it seems like it's all OK now
Test build #80685 has finished for PR 18488 at commit
@srowen @HyukjinKwon, retest this please :)
retest this please
Test build #80700 has finished for PR 18488 at commit
@srowen @HyukjinKwon what's your status on this? Anything else I can do?
I don't feel especially qualified to review this, and I'm hesitant because I know the core encoder/decoder framework is an important piece that needs some care. Your change looks good though. It does seem carefully targeted at only changing the enum cases, has tests, and does fix an identified problem that prevents a normal use case from working. Am I right that this serializes enums as their string value? Are there any downsides to this change -- am I missing any behavior it changes or breaks?
case other if other.isEnum =>
  (StructType(Seq(StructField(typeToken.getRawType.getSimpleName,
    StringType, nullable = false))), true)
We use a struct type with a string field to store the enum type and its value.
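So, for a hypothetical enum EnumBean { A, B }, the inferred field schema and stored value look roughly like this (a sketch of what the case above produces):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One-field struct named after the enum class; the single string column holds the
// constant's name, so EnumBean.B is stored as the string "B".
val enumFieldSchema = StructType(Seq(StructField("EnumBean", StringType, nullable = false)))
```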
StaticInvoke(JavaTypeInference.getClass, ObjectType(other), "deserializeEnumName",
  expressions.Literal.create(other.getEnumConstants.apply(0), ObjectType(other))
  :: getPath :: Nil)
We pass the literal value of the first enum constant to resolve the type parameter of the deserializeEnumName method.
/** Returns value index for given enum type and value */
def serializeEnumName[T <: Enum[T]](enum: UTF8String, inputObject: T): UTF8String = {
  enumSerializer(Utils.classForName(enum.toString).asInstanceOf[Class[T]])(inputObject)
}
Utils.classForName delegates to Class.forName, which operates at the native level, so additional optimizations like caching are not required.
def enumSerializer[T <: Enum[T]](enum: Class[T]): T => UTF8String = {
  assert(enum.isEnum)
  inputObject: T =>
    UTF8String.fromString(inputObject.name())
We use the enum constant name as the field value.
def enumDeserializer[T <: Enum[T]](enum: Class[T]): InternalRow => T = {
  assert(enum.isEnum)
  value: InternalRow =>
    Enum.valueOf(enum, value.getUTF8String(0).toString)
Enum.valueOf uses a cached string-to-value map.
CreateNamedStruct(expressions.Literal("enum") ::
  StaticInvoke(JavaTypeInference.getClass, StringType, "serializeEnumName",
    expressions.Literal.create(other.getName, StringType) :: inputObject :: Nil) :: Nil)
We pass the enum class name to the serializer via a literal.
// TODO: improve error message for java bean encoder.
def javaBean[T](beanClass: Class[T]): ExpressionEncoder[T] = {
-  val schema = JavaTypeInference.inferDataType(beanClass)._1
+  val schema = if (beanClass.isEnum) {
+    javaEnumSchema(beanClass)
If we use an enum as a top-level object, we need another level of StructType for it to be compatible with our ser/de structure.
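The body of javaEnumSchema is not shown in this excerpt; a hedged reconstruction of what the description implies:

```scala
import org.apache.spark.sql.catalyst.JavaTypeInference
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical reconstruction: wrap the inferred enum struct in one more StructType so
// that a top-level enum round-trips through the same nested layout as an enum field.
def javaEnumSchema[T](beanClass: Class[T]): StructType = {
  val (dataType, _) = JavaTypeInference.inferDataType(beanClass)
  StructType(Seq(StructField("enum", dataType, nullable = false)))
}
```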
@@ -154,13 +154,13 @@ case class StaticInvoke(
    val evaluate = if (returnNullable) {
      if (ctx.defaultValue(dataType) == "null") {
        s"""
-         ${ev.value} = $callFunc;
+         ${ev.value} = (($javaType) ($callFunc));
Explicitly cast the value to the needed type; without this, the generated code didn't compile, failing with something like "cannot assign value of type Enum to %RealEnumClassName%".
From the Janino documentation: "Type arguments: Are parsed, but otherwise ignored. The most significant restriction that follows is that you must cast return values from method invocations, e.g. "(String) myMap.get(key)"."
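Put differently, the cast added in the diff above is what turns the generated assignment into something Janino will compile; a sketch of the before/after shapes of the generated Java, built here as strings with an illustrative target type:

```scala
// Because Janino ignores generic type arguments, the invoked method's return type is
// effectively the raw java.lang.Enum, so the plain assignment fails to compile and the
// explicitly cast version succeeds.
val javaType = "test.org.apache.spark.sql.JavaDatasetSuite$EnumBean"        // illustrative
val callFunc = "JavaTypeInference.deserializeEnumName(/* literal, path */)"  // illustrative
val before   = s"value = $callFunc;"                  // "Assignment conversion not possible..."
val after    = s"value = (($javaType) ($callFunc));"  // compiles
```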
@srowen you are right, we store the string values of the constant names (for the test example, we would get the values A and B, not google/elgoog).
Found this in the Janino documentation; it explains the need for explicit casting: "Type arguments: Are parsed, but otherwise ignored. The most significant restriction that follows is that you must cast return values from method invocations, e.g. "(String) myMap.get(key)"."
Test build #3898 has finished for PR 18488 at commit
Merged to master
@@ -118,6 +119,10 @@ object JavaTypeInference {
        val (valueDataType, nullable) = inferDataType(valueType, seenTypeSet)
        (MapType(keyDataType, valueDataType, nullable), true)

+     case other if other.isEnum =>
+       (StructType(Seq(StructField(typeToken.getRawType.getSimpleName,
Why do we map an enum to a struct type? Shouldn't an enum always map to a single field?
## What changes were proposed in this pull request?

This is a follow-up for apache#18488, to simplify the code. The major change is that we should map a Java enum to string type, instead of a struct type with a single string field.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19066 from cloud-fan/fix.
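A sketch of what that follow-up simplification means for the inference case quoted above (not the exact apache#19066 code): the enum-typed property maps straight to a string column.

```scala
import com.google.common.reflect.TypeToken
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch of the simplified mapping: an enum is inferred as a plain StringType column
// (holding the constant name) instead of a struct with a single string field.
def inferEnumType(typeToken: TypeToken[_]): Option[(DataType, Boolean)] =
  if (typeToken.getRawType.isEnum) Some((StringType, false)) else None
```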
What changes were proposed in this pull request?
Fixed NPE when creating encoder for enum.
When you try to create an encoder for an enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495.
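A minimal reproducer, sketched with a stock JDK enum standing in for the JIRA's original class:

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.Encoders

// Before this change, deriving a bean encoder for an enum type (or for a bean with an
// enum-typed property) failed with the NPE described above; with the fix in place an
// encoder can be derived, as exercised by the JavaDatasetSuite test added in this PR.
val enumEncoder = Encoders.bean(classOf[TimeUnit])
```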
I did a little research, and it turns out that the property-inference code in JavaTypeInference filters out properties named "class" (a paraphrased sketch of this filter appears below), because we wouldn't want to serialize those. But enum types have another property of type Class, named "declaringClass", which the inference then tries to inspect recursively. Eventually it tries to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to the NPE at TypeToken:495.
I added the property name "declaringClass" to the filtering to resolve this.
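The filtering referred to above lives in JavaTypeInference's bean-property discovery; a paraphrased sketch (not a verbatim quote of the Spark source) with the added exclusion:

```scala
import java.beans.{Introspector, PropertyDescriptor}

// Paraphrased sketch: "class" was already excluded from the readable properties; this
// change also excludes the enum-only "declaringClass" property, so inference no longer
// recurses into Class / ClassLoader and no longer hits the NPE at TypeToken:495.
def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(beanClass)
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class")
    .filterNot(_.getName == "declaringClass")
    .filter(_.getReadMethod != null)
}
```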
How was this patch tested?
A unit test in JavaDatasetSuite which creates an encoder for an enum.