[SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum #18488
Conversation
@@ -0,0 +1,32 @@
package org.apache.spark.sql.catalyst;
(Copy the copyright header you see in other files)
Or, it looks like Encoders-related tests are otherwise all in JavaDatasetSuite. That may be a better place anyway.
done
This fix looks good. Would it be possible to put details on why this problem happens and how it is fixed in the description of this PR, as written in the JIRA entry?
@kiszk check it out
@mike0sv looks good, thanks. It will help us understand this more easily in the future.
ok to test
Hm.. sounds like the current test fails during code generation.
final test.org.apache.spark.sql.JavaDatasetSuite$EnumBean value1 = false ? null : new test.org.apache.spark.sql.JavaDatasetSuite$EnumBean();
This looks like it is because it tries to create the object via the enum's private constructor. @kiszk, I believe you know this bit well. What do you think about this?
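(A small illustration of why generated code of that shape can never work for an enum, sketched with TimeUnit standing in for the test's enum: an enum's constructor is not accessible, and the declared constants are the only instances a deserializer could produce.)

```scala
// Illustration only; TimeUnit stands in for the hypothetical EnumBean.
val enumClass = classOf[java.util.concurrent.TimeUnit]
val constants = enumClass.getEnumConstants // NANOSECONDS, MICROSECONDS, ... -- the only instances
// enumClass.getDeclaredConstructor().newInstance() would fail: there is no accessible no-arg constructor
```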
}

@Test
public void testEnum() throws Exception {
Do we need throws Exception?
List<EnumBean> data = Arrays.asList(EnumBean.B);
Dataset<EnumBean> ds = spark.createDataset(data, Encoders.bean(EnumBean.class));

Assert.assertEquals(ds.collectAsList().size(), 1);
I would go Assert.assertEquals(data, ds.collectAsList());
The reproducer in the JIRA used
ok to test
Test build #79070 has finished for PR 18488 at commit
Hm, backing up, this shouldn't actually work, should it? Enums aren't beans, and this newer error that gets uncovered demonstrates that more directly. Do you need to define an encoder to use an enum? It should just be serialized as a string or int already. If not, that's what needs to be fixed.
Test build #79104 has finished for PR 18488 at commit
I reworked the code to ser/de enums into ints (according to declaration order). However, I recreate the mapping for each object, which is obviously very bad. I need to create the mapping once (per partition, I guess) and then use it for all objects. Please tell me how that can be achieved.
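A hedged sketch of what that ordinal-based approach amounts to, with illustrative helper names rather than the PR's actual code, including the per-record waste being described:

```scala
// Int-based enum ser/de keyed on declaration order; names here are illustrative.
def serializeOrdinal[T <: Enum[T]](value: T): Int = value.ordinal()

def deserializeOrdinal[T <: Enum[T]](clazz: Class[T], ordinal: Int): T = {
  // Rebuilding this mapping for every record is the waste being described:
  // ideally it would be computed once per class (or per partition) and reused.
  val mapping: Map[Int, T] = clazz.getEnumConstants.zipWithIndex.map(_.swap).toMap
  mapping(ordinal)
}
```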
How about just using Encoders.javaSerialization() with enums?
It won't work if I have an enum field inside a regular Java bean.
I see, so the bean encoder assumes every property is a bean.
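For reference, a sketch of the javaSerialization suggestion (TimeUnit stands in for a user enum): it covers a top-level enum dataset, but cannot be applied to an enum-typed property, because the bean encoder infers every property's schema itself.

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.Encoders

// Works when the enum itself is the dataset's element type...
val enumEncoder = Encoders.javaSerialization(classOf[TimeUnit])
// ...but a bean with a TimeUnit field still goes through Encoders.bean(...), which
// derives the field's encoding on its own and has no hook for a custom field encoder.
```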
Test build #79105 has finished for PR 18488 at commit
val returnType = typeToken.method(property.getReadMethod).getReturnType
val (dataType, nullable) = inferDataType(returnType, seenTypeSet + other)
new StructField(property.getName, dataType, nullable)
if (typeToken.getRawType.isEnum) {
Could this be a case rawType if rawType.isEnum => case instead?
I'd also imagine the most natural type for an enum is a string, not a struct containing an int, but maybe I haven't thought that through.
Probably, but int is cheaper. However, if we use string, there is a possibility for meaningful queries. Also, there are strange cases with complex enums that have different fields, even with complex types. I'd say a string with the constant name is enough, though.
}

/** Returns a mapping from int to enum value for given enum type */
def enumDeserializer[T](enum: Class[T]): Int => T = {
The enum's values() is already effectively a mapping from int to enum values -- much cheaper to access?
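In other words (a small sketch, with TimeUnit standing in for the enum under discussion):

```scala
import java.util.concurrent.TimeUnit

// values() already is the int -> constant mapping, in declaration order,
// and ordinal() is its inverse, so no separate map has to be built.
val byIndex: Array[TimeUnit] = TimeUnit.values()
val unit: TimeUnit = byIndex(1)      // MICROSECONDS
val index: Int = unit.ordinal()      // 1
```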
Good point. But if we use string values, there is already a hash map inside the enum implementation, accessible via valueOf.
But it's still not very good, because we have to look up the enum by class name for each object we serialize/deserialize. Do you have any ideas?
Ran into something strange. I changed ints to strings and it worked fine. But then I added a test for encoding a bean with an enum inside, and the test failed. It failed because in my implementation the deserializer returns Object (AnyRef): if the enum is a top-level object it works fine, but if the enum is a field, it is set via a setter, which does not accept an arbitrary object. So I changed the implementation of the deserializer to the following.
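(The snippet referred to as "the following" is not preserved in this excerpt; based on the diff further down, it was presumably the typed deserializer along these lines:)

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Presumed shape of the reworked deserializer (compare the enumDeserializer shown in
// the diff below): it returns T rather than AnyRef so the generated setter call can
// type-check -- which is what then surfaced the assignment error quoted next.
def enumDeserializer[T <: Enum[T]](enum: Class[T]): InternalRow => T = {
  assert(enum.isEnum)
  (value: InternalRow) => Enum.valueOf(enum, value.getUTF8String(0).toString)
}
```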
Now it failed with "Assignment conversion not possible from type "java.lang.Enum" to type "test.org.apache.spark.sql.JavaDatasetSuite$EnumBean"" at
which is odd, because if I call it from regular code it compiles just fine.
I even tried moving the code to a Java class just in case, but that did no good.
Test build #80163 has finished for PR 18488 at commit
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
Test build #80166 has finished for PR 18488 at commit
@srowen @HyukjinKwon hey guys, I think I got this, take a look. Some SparkR tests failed for some reason, but I think it's not my fault =|
@@ -79,7 +79,7 @@ public ExpressionInfo(
        assert name != null;
        assert arguments != null;
        assert examples != null;
-       assert examples.isEmpty() || examples.startsWith("\n Examples:");
+       assert examples.isEmpty() || examples.startsWith(System.lineSeparator() + " Examples:");
I guess this one is not related?
No, but without this it's not possible to run the tests if you have a different line separator (on Windows, for example).
I don't think we support Windows for dev. This assertion should probably be weakened anyway but that's a separate issue from this PR.
ok, I got rid of it
Test build #80489 has finished for PR 18488 at commit
@srowen @HyukjinKwon it seems like it's all OK now
Test build #80685 has finished for PR 18488 at commit
@srowen @HyukjinKwon, retest this please :)
retest this please
Test build #80700 has finished for PR 18488 at commit
@srowen @HyukjinKwon what's your status on this? Anything else I can do?
I don't feel especially qualified to review this, and I'm hesitant because I know the core encoder/decoder framework is an important piece that needs some care. Your change looks good though. It does seem carefully targeted at only changing the enum cases, has tests, and does fix an identified problem that prevents a normal use case from working. Am I right that this serializes enums as their string value? Are there any downsides to this change -- am I missing any behavior it changes or breaks?
case other if other.isEnum =>
  (StructType(Seq(StructField(typeToken.getRawType.getSimpleName,
    StringType, nullable = false))), true)
We use a struct type with a string field to store the enum type and its value.
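So, for a hypothetical enum EnumBean { A, B }, the inferred field schema and stored value look roughly like this (a sketch of what the case above produces):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One-field struct named after the enum class; the single string column holds the
// constant's name, so EnumBean.B is stored as the string "B".
val enumFieldSchema = StructType(Seq(StructField("EnumBean", StringType, nullable = false)))
```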
StaticInvoke(JavaTypeInference.getClass, ObjectType(other), "deserializeEnumName",
  expressions.Literal.create(other.getEnumConstants.apply(0), ObjectType(other))
  :: getPath :: Nil)
We pass the literal value of the first enum constant to resolve the type parameter of the deserializeEnumName method.
/** Returns value index for given enum type and value */
def serializeEnumName[T <: Enum[T]](enum: UTF8String, inputObject: T): UTF8String = {
  enumSerializer(Utils.classForName(enum.toString).asInstanceOf[Class[T]])(inputObject)
}
Utils.classForName delegates to Class.forName, which operates at the native level, so additional optimizations like caching are not required.
def enumSerializer[T <: Enum[T]](enum: Class[T]): T => UTF8String = {
  assert(enum.isEnum)
  inputObject: T =>
    UTF8String.fromString(inputObject.name())
We use the enum constant name as the field value.
def enumDeserializer[T <: Enum[T]](enum: Class[T]): InternalRow => T = {
  assert(enum.isEnum)
  value: InternalRow =>
    Enum.valueOf(enum, value.getUTF8String(0).toString)
Enum.valueOf uses a cached string-to-value map.
CreateNamedStruct(expressions.Literal("enum") ::
  StaticInvoke(JavaTypeInference.getClass, StringType, "serializeEnumName",
    expressions.Literal.create(other.getName, StringType) :: inputObject :: Nil) :: Nil)
We pass the enum class name to the serializer via a literal.
// TODO: improve error message for java bean encoder.
def javaBean[T](beanClass: Class[T]): ExpressionEncoder[T] = {
-  val schema = JavaTypeInference.inferDataType(beanClass)._1
+  val schema = if (beanClass.isEnum) {
+    javaEnumSchema(beanClass)
If we use an enum as a top-level object, we need another level of StructType for it to be compatible with our ser/de structure.
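The body of javaEnumSchema is not shown in this excerpt; a hedged reconstruction of what the description implies:

```scala
import org.apache.spark.sql.catalyst.JavaTypeInference
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical reconstruction: wrap the inferred enum struct in one more StructType so
// that a top-level enum round-trips through the same nested layout as an enum field.
def javaEnumSchema[T](beanClass: Class[T]): StructType = {
  val (dataType, _) = JavaTypeInference.inferDataType(beanClass)
  StructType(Seq(StructField("enum", dataType, nullable = false)))
}
```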
@@ -154,13 +154,13 @@ case class StaticInvoke(
    val evaluate = if (returnNullable) {
      if (ctx.defaultValue(dataType) == "null") {
        s"""
-         ${ev.value} = $callFunc;
+         ${ev.value} = (($javaType) ($callFunc));
Explicitly cast the value to the needed type; without this, the generated code didn't compile, failing with something like "cannot assign value of type Enum to %RealEnumClassName%".
From the Janino documentation: "Type arguments: Are parsed, but otherwise ignored. The most significant restriction that follows is that you must cast return values from method invocations, e.g. "(String) myMap.get(key)"."
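Put differently, the cast added in the diff above is what turns the generated assignment into something Janino will compile; a sketch of the before/after shapes of the generated Java, built here as strings with an illustrative target type:

```scala
// Because Janino ignores generic type arguments, the invoked method's return type is
// effectively the raw java.lang.Enum, so the plain assignment fails to compile and the
// explicitly cast version succeeds.
val javaType = "test.org.apache.spark.sql.JavaDatasetSuite$EnumBean"        // illustrative
val callFunc = "JavaTypeInference.deserializeEnumName(/* literal, path */)"  // illustrative
val before   = s"value = $callFunc;"                  // "Assignment conversion not possible..."
val after    = s"value = (($javaType) ($callFunc));"  // compiles
```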
@srowen you are right, we store the string values of the constant names (for the test example, we would get the values A and B, not google/elgoog).
Found this in the Janino documentation; it explains the need for explicit casting: "Type arguments: Are parsed, but otherwise ignored. The most significant restriction that follows is that you must cast return values from method invocations, e.g. "(String) myMap.get(key)"."
Test build #3898 has finished for PR 18488 at commit
Merged to master
@@ -118,6 +119,10 @@ object JavaTypeInference {
        val (valueDataType, nullable) = inferDataType(valueType, seenTypeSet)
        (MapType(keyDataType, valueDataType, nullable), true)

+     case other if other.isEnum =>
+       (StructType(Seq(StructField(typeToken.getRawType.getSimpleName,
Why do we map an enum to a struct type? Shouldn't an enum always map to a single field?
## What changes were proposed in this pull request?

This is a follow-up for apache#18488, to simplify the code. The major change is that we should map a Java enum to string type, instead of a struct type with a single string field.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19066 from cloud-fan/fix.
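A sketch of what that follow-up simplification means for the inference case quoted above (not the exact apache#19066 code): the enum-typed property maps straight to a string column.

```scala
import com.google.common.reflect.TypeToken
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch of the simplified mapping: an enum is inferred as a plain StringType column
// (holding the constant name) instead of a struct with a single string field.
def inferEnumType(typeToken: TypeToken[_]): Option[(DataType, Boolean)] =
  if (typeToken.getRawType.isEnum) Some((StringType, false)) else None
```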
What changes were proposed in this pull request?
Fixed NPE when creating encoder for enum.
When you try to create an encoder for an enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495.
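A minimal reproducer, sketched with a stock JDK enum standing in for the JIRA's original class:

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.Encoders

// Before this change, deriving a bean encoder for an enum type (or for a bean with an
// enum-typed property) failed with the NPE described above; with the fix in place an
// encoder can be derived, as exercised by the JavaDatasetSuite test added in this PR.
val enumEncoder = Encoders.bean(classOf[TimeUnit])
```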
I did a little research, and it turns out that the property-inference code in JavaTypeInference filters out properties named "class" (a paraphrased sketch of this filter appears below), because we wouldn't want to serialize those. But enum types have another property of type Class, named "declaringClass", which the inference then tries to inspect recursively. Eventually it tries to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to the NPE at TypeToken:495.
I added the property name "declaringClass" to the filtering to resolve this.
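The filtering referred to above lives in JavaTypeInference's bean-property discovery; a paraphrased sketch (not a verbatim quote of the Spark source) with the added exclusion:

```scala
import java.beans.{Introspector, PropertyDescriptor}

// Paraphrased sketch: "class" was already excluded from the readable properties; this
// change also excludes the enum-only "declaringClass" property, so inference no longer
// recurses into Class / ClassLoader and no longer hits the NPE at TypeToken:495.
def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(beanClass)
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class")
    .filterNot(_.getName == "declaringClass")
    .filter(_.getReadMethod != null)
}
```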
How was this patch tested?
A unit test in JavaDatasetSuite which creates an encoder for an enum.