
[SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum #18488

Closed
wants to merge 17 commits

Conversation

mike0sv
Contributor

@mike0sv mike0sv commented Jun 30, 2017

What changes were proposed in this pull request?

Fixed NPE when creating encoder for enum.

When you try to create an encoder for an Enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495.
I did a little research, and it turns out that in JavaTypeInference the following code

  def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
    val beanInfo = Introspector.getBeanInfo(beanClass)
    beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
      .filter(_.getReadMethod != null)
  }

filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class, named "declaringClass", which we try to inspect recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, and that leads to the NPE at TypeToken:495.

I added the property name "declaringClass" to the filter to resolve this.
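For context, a minimal reproducer sketch (not the PR's test; java.util.concurrent.TimeUnit stands in for any Java enum, and the session setup is illustrative):

  import java.util.concurrent.TimeUnit

  import org.apache.spark.sql.{Encoders, SparkSession}

  object EnumEncoderRepro {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("enum-npe-repro").getOrCreate()
      // Before the fix, just building the encoder triggered the recursive
      // bean inspection that ended in the NPE at TypeToken:495.
      val encoder = Encoders.bean(classOf[TimeUnit])
      val ds = spark.createDataset(java.util.Arrays.asList(TimeUnit.SECONDS, TimeUnit.HOURS))(encoder)
      ds.show()
      spark.stop()
    }
  }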

How was this patch tested?

A unit test in JavaDatasetSuite which creates an encoder for an enum.

@mike0sv mike0sv changed the title Enum support [SPARK-21255][SQL] Fixed NPE when creating encoder for enum Jun 30, 2017
@@ -0,0 +1,32 @@
package org.apache.spark.sql.catalyst;
Member

(Copy the copyright header you see in other files)
Or, it looks like Encoders-related tests are otherwise all in JavaDatasetSuite. That may be a better place anyway.

Contributor Author

done

@kiszk
Member

kiszk commented Jun 30, 2017

This fix looks good. Would it be possible to put the details on why this problem happens and how it is fixed into the description of this PR, as written in the JIRA entry?

@mike0sv
Contributor Author

mike0sv commented Jun 30, 2017

@kiszk check it out

@kiszk
Member

kiszk commented Jul 1, 2017

@mike0sv looks good, thanks. It will make this easier to understand in the future.

@kiszk
Member

kiszk commented Jul 1, 2017

ok to test

Member

@HyukjinKwon HyukjinKwon left a comment

Hm.. it sounds like the current test fails during code generation.

final test.org.apache.spark.sql.JavaDatasetSuite$EnumBean value1 = false ? null : new test.org.apache.spark.sql.JavaDatasetSuite$EnumBean();

This looks like it's because it tries to create the object via the enum's private constructor. @kiszk, I believe you know this bit well. What do you think about this?
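For illustration, a sketch (not from the PR; TimeUnit stands in for EnumBean) of why that can never work: enum instances only ever come from the predefined constants, and java.lang.reflect.Constructor.newInstance rejects enum types outright, so generated code equivalent to new EnumBean() cannot succeed.

  import java.util.concurrent.TimeUnit

  object EnumInstancesDemo {
    def main(args: Array[String]): Unit = {
      // The only supported ways to obtain an enum instance:
      val byOrdinal = TimeUnit.values()(3)                        // declaration order: SECONDS
      val byName    = Enum.valueOf(classOf[TimeUnit], "SECONDS")  // cached name lookup
      println(s"$byOrdinal / $byName")                            // SECONDS / SECONDS
    }
  }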

}

@Test
public void testEnum() throws Exception {
Member

Do we need throws Exception {?

List<EnumBean> data = Arrays.asList(EnumBean.B);
Dataset<EnumBean> ds = spark.createDataset(data, Encoders.bean(EnumBean.class));

Assert.assertEquals(ds.collectAsList().size(), 1);
Member

I would go Assert.assertEquals(data, ds.collectAsList());

@HyukjinKwon
Member

The reproducer in the JIRA used show() with rows and was fine, but this looks like a problem with type-specific operations on the Dataset.

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Jul 3, 2017

Test build #79070 has finished for PR 18488 at commit 120bb32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 3, 2017

Hm, backing up, this shouldn't actually work, should it? Enums aren't beans, and this newer error that gets uncovered more directly demonstrates that.

Do you need to define an encoder to use an enum? It should just be serialized as a string or int already. If not, that's what needs to be fixed.

See also https://issues.apache.org/jira/browse/SPARK-17248

@SparkQA

SparkQA commented Jul 3, 2017

Test build #79104 has finished for PR 18488 at commit 2cb0834.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mike0sv
Contributor Author

mike0sv commented Jul 3, 2017

I reworked the code to ser/de enums as ints (according to declaration order). However, I recreate the mapping for each object, which is obviously very bad. I need to create the mapping once (per partition, I guess) and then use it for all objects. Please tell me how that can be achieved.
Also, is there a better place for my ser/de methods?
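A rough sketch of the ordinal-based idea described above (names are illustrative, not the code in this PR):

  import java.util.concurrent.TimeUnit

  object EnumOrdinalCodec {
    // Serialize: the constant's position in declaration order.
    def enumToInt[T <: Enum[T]](value: T): Int = value.ordinal()

    // Deserialize: getEnumConstants (i.e. values()) is already an int -> enum mapping.
    def intToEnum[T <: Enum[T]](clazz: Class[T], index: Int): T =
      clazz.getEnumConstants.apply(index)

    def main(args: Array[String]): Unit = {
      val i = enumToInt(TimeUnit.MINUTES)
      assert(intToEnum(classOf[TimeUnit], i) == TimeUnit.MINUTES)
      println(s"MINUTES <-> $i")
    }
  }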

@mike0sv mike0sv changed the title [SPARK-21255][SQL] Fixed NPE when creating encoder for enum [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum Jul 3, 2017
@srowen
Member

srowen commented Jul 3, 2017

How about just using Encoders.javaSerialization() with enums?

@mike0sv
Contributor Author

mike0sv commented Jul 3, 2017

It won't work if I have an enum field inside a regular Java bean.

@srowen
Member

srowen commented Jul 3, 2017

I see, so the bean encoder assumes every property is a bean.

@SparkQA

SparkQA commented Jul 3, 2017

Test build #79105 has finished for PR 18488 at commit 637daa4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val returnType = typeToken.method(property.getReadMethod).getReturnType
val (dataType, nullable) = inferDataType(returnType, seenTypeSet + other)
new StructField(property.getName, dataType, nullable)
if (typeToken.getRawType.isEnum) {
Member

Could this be a case rawType if rawType.isEnum => case instead?

Member

I'd also imagine the most natural type for an enum is a string, not a struct containing an int, but maybe I haven't thought that through.

Contributor Author

Probably, but an int is cheaper. However, with a string there is the possibility of meaningful queries. Also, there are strange cases with complex enums carrying different fields, even with complex types. I'd say a string with the constant name is enough, though.

}

/** Returns a mapping from int to enum value for given enum type */
def enumDeserializer[T](enum: Class[T]): Int => T = {
Member

The enum's values() value is already effectively a mapping from int to enum values -- much cheaper to access?

Contributor Author

Good point. But if we use string values, there is already a hash map inside the enum implementation, accessible via valueOf.

Contributor Author

But it's still not very good, because we have to look up the enum by class name for each object we serialize/deserialize. Do you have any ideas?

@mike0sv
Contributor Author

mike0sv commented Jul 4, 2017

Ran into something strange. I changed ints to strings and it worked fine. But then I added a test for encoding a bean with an enum inside, and the test failed. It failed because in my implementation the deserializer returns Object (AnyRef): if the enum is the top-level object it works fine, but if the enum is a field, it gets set via a setter, which does not accept an arbitrary object. So I changed the implementation of the deserializer to the following.

  def enumDeserializer[T <: Enum[T]](enum: Class[T]): InternalRow => T = {
    assert(enum.isEnum)
    value: InternalRow =>
      Enum.valueOf(enum, value.toString)
  }

  def deserializeEnumName[T <: Enum[T]](typeDummy: T, inputObject: InternalRow): T = {
    enumDeserializer(typeDummy.getClass.asInstanceOf[Class[T]])(inputObject)
  }

Now it failed with "Assignment conversion not possible from type "java.lang.Enum" to type "test.org.apache.spark.sql.JavaDatasetSuite$EnumBean"" at

private test.org.apache.spark.sql.JavaDatasetSuite$EnumBean argValue;
private InternalRow argValue1;
........
final test.org.apache.spark.sql.JavaDatasetSuite$EnumBean value2 = resultIsNull ? null : org.apache.spark.sql.types.DataTypes.deserializeEnumName(argValue, argValue1);

which is odd, because if I call it from regular code it compiles just fine.

    EnumBean argValue = EnumBean.values()[0];
    InternalRow argValue1 = null;
    final EnumBean bean = deserializeEnumName(argValue, argValue1);

I even tried moving the code to a Java class just in case, but that did no good.
Is there any difference in the compiling environments that could be the cause of this?

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80163 has finished for PR 18488 at commit 8587437.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80166 has finished for PR 18488 at commit 7f2019f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mike0sv
Contributor Author

mike0sv commented Aug 10, 2017

@srowen @HyukjinKwon hey guys, I think I got this, take a look. Some SparkR tests failed for some reason, but I think it's not my fault =|

@@ -79,7 +79,7 @@ public ExpressionInfo(
         assert name != null;
         assert arguments != null;
         assert examples != null;
-        assert examples.isEmpty() || examples.startsWith("\n Examples:");
+        assert examples.isEmpty() || examples.startsWith(System.lineSeparator() + " Examples:");
Member

I guess this one is not related?

Contributor Author

@mike0sv mike0sv Aug 10, 2017

No, but without this it's not possible to run the tests if you have different line separators (on Windows, for example).

Member

I don't think we support Windows for dev. This assertion should probably be weakened anyway but that's a separate issue from this PR.

Contributor Author

ok, I got rid of it

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80489 has finished for PR 18488 at commit bc049b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mike0sv
Contributor Author

mike0sv commented Aug 14, 2017

@srowen @HyukjinKwon it seems like it's all ok now

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80685 has finished for PR 18488 at commit fbdc599.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mike0sv
Contributor Author

mike0sv commented Aug 15, 2017

@srowen @HyukjinKwon , retest this please :)

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Aug 16, 2017

Test build #80700 has finished for PR 18488 at commit fbdc599.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mike0sv
Contributor Author

mike0sv commented Aug 22, 2017

@srowen @HyukjinKwon what's your status on this? anything else I can do?

@srowen
Member

srowen commented Aug 22, 2017

I don't feel especially qualified to review this, and I'm hesitant because I know the core encoder/decoder framework is an important piece that needs some care.

Your change looks good though. It does seem carefully targeted at only changing the enum cases, has tests, and does fix an identified problem that prevents a normal use case from working.

Am I right that this serializes enums as their string value?

Are there any downsides to this change -- am I missing any behavior it changes or breaks?

case other if other.isEnum =>
  (StructType(Seq(StructField(typeToken.getRawType.getSimpleName,
    StringType, nullable = false))), true)

Contributor Author

We use a struct type with a string field to store the enum type and its value.
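For illustration, a sketch of what that schema looks like, assuming a hypothetical enum Color { RED, GREEN } (the single string field is named after the enum class):

  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  object EnumSchemaSketch {
    def main(args: Array[String]): Unit = {
      // Roughly the schema this inference produces for an enum-typed value.
      val colorSchema = StructType(Seq(
        StructField("Color", StringType, nullable = false)))
      println(colorSchema.simpleString) // struct<Color:string>
    }
  }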

StaticInvoke(JavaTypeInference.getClass, ObjectType(other), "deserializeEnumName",
  expressions.Literal.create(other.getEnumConstants.apply(0), ObjectType(other))
    :: getPath :: Nil)

Contributor Author

We pass the literal value of the first enum constant to resolve the type parameter of the deserializeEnumName method.

/** Returns the constant name for the given enum class name and value */
def serializeEnumName[T <: Enum[T]](enum: UTF8String, inputObject: T): UTF8String = {
  enumSerializer(Utils.classForName(enum.toString).asInstanceOf[Class[T]])(inputObject)
}
Contributor Author

Utils.classForName delegates to Class.forName, which operates at the native level, so additional optimizations like caching are not required.

def enumSerializer[T <: Enum[T]](enum: Class[T]): T => UTF8String = {
  assert(enum.isEnum)
  inputObject: T =>
    UTF8String.fromString(inputObject.name())
}
Contributor Author

We use the enum constant name as the field value.

def enumDeserializer[T <: Enum[T]](enum: Class[T]): InternalRow => T = {
  assert(enum.isEnum)
  value: InternalRow =>
    Enum.valueOf(enum, value.getUTF8String(0).toString)
}
Contributor Author

Enum.valueOf uses a cached string-to-value map.
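A round-trip sketch of this scheme (illustrative, not PR code; TimeUnit stands in for the user's enum): serialize with name(), deserialize through Enum.valueOf's cached lookup.

  import java.util.concurrent.TimeUnit

  object EnumNameRoundTrip {
    def main(args: Array[String]): Unit = {
      val original = TimeUnit.HOURS
      val asString = original.name()                           // "HOURS"
      val restored = Enum.valueOf(classOf[TimeUnit], asString) // cached map lookup
      assert(original eq restored)                             // enum constants are singletons
      println(restored)
    }
  }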

CreateNamedStruct(expressions.Literal("enum") ::
  StaticInvoke(JavaTypeInference.getClass, StringType, "serializeEnumName",
    expressions.Literal.create(other.getName, StringType) :: inputObject :: Nil) :: Nil)

Contributor Author

We pass the enum class name to the serializer via a literal.

// TODO: improve error message for java bean encoder.
def javaBean[T](beanClass: Class[T]): ExpressionEncoder[T] = {
-  val schema = JavaTypeInference.inferDataType(beanClass)._1
+  val schema = if (beanClass.isEnum) {
+    javaEnumSchema(beanClass)
Contributor Author

If we use an enum as the top-level object, we need another level of StructType for it to be compatible with our ser/de structure.
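Roughly like this (a sketch; the outer field name here is illustrative, not necessarily what javaEnumSchema produces):

  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  object TopLevelEnumSchemaSketch {
    def main(args: Array[String]): Unit = {
      // The inner struct matches the field-level enum representation above.
      val enumStruct = StructType(Seq(StructField("Color", StringType, nullable = false)))
      // A top-level enum gets one more wrapping level.
      val topLevel   = StructType(Seq(StructField("value", enumStruct, nullable = false)))
      println(topLevel.simpleString) // struct<value:struct<Color:string>>
    }
  }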

@@ -154,13 +154,13 @@ case class StaticInvoke(
     val evaluate = if (returnNullable) {
       if (ctx.defaultValue(dataType) == "null") {
         s"""
-          ${ev.value} = $callFunc;
+          ${ev.value} = (($javaType) ($callFunc));
Contributor Author

@mike0sv mike0sv Aug 22, 2017

Explicitly cast the value to the needed type; without this, the generated code didn't compile, failing with something like "cannot assign value of type Enum to %RealEnumClassName%".

Contributor Author

From the janino documentation: "Type arguments: Are parsed, but otherwise ignored. The most significant restriction that follows is that you must cast return values from method invocations, e.g. (String) myMap.get(key)".

@mike0sv
Contributor Author

mike0sv commented Aug 22, 2017

@srowen you are right, we store the string values of the constant names (for the test example, we would get the values A and B, not google/elgoog).
I commented on some of the changes for clarity, but I don't see any downsides.
There is only one change that affects non-enum-specific code, the casting in code generation, and it's completely harmless.

@mike0sv
Contributor Author

mike0sv commented Aug 22, 2017

Found this in the janino documentation; it explains the need for the explicit cast: "Type arguments: Are parsed, but otherwise ignored. The most significant restriction that follows is that you must cast return values from method invocations, e.g. (String) myMap.get(key)"

@SparkQA

SparkQA commented Aug 23, 2017

Test build #3898 has finished for PR 18488 at commit fbdc599.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 25, 2017

Merged to master

@asfgit asfgit closed this in 7d16776 Aug 25, 2017
@@ -118,6 +119,10 @@ object JavaTypeInference {
         val (valueDataType, nullable) = inferDataType(valueType, seenTypeSet)
         (MapType(keyDataType, valueDataType, nullable), true)

+      case other if other.isEnum =>
+        (StructType(Seq(StructField(typeToken.getRawType.getSimpleName,
Contributor

Why do we map enums to a struct type? Shouldn't an enum always be a single field?

ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 29, 2017
## What changes were proposed in this pull request?

This is a follow-up for apache#18488, to simplify the code.

The major change is, we should map java enum to string type, instead of a struct type with a single string field.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19066 from cloud-fan/fix.