[SPARK-19666][SQL] Skip a property without getter in Java schema inference and allow empty bean in encoder creation #17013
Conversation
@@ -123,7 +123,11 @@ object JavaTypeInference {
  val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
  val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
  val fields = properties.map { property =>
    val returnType = typeToken.method(property.getReadMethod).getReturnType
    val readMethod = Option(property.getReadMethod).getOrElse {
This:

> May return null if the property can't be read.

It seems, ideally, we should filter all such properties out, judging from this.
I am willing to do this, but it seems it would also need a behaviour change and (I guess) a big sweep. So this PR just proposes to fix the exception message for now.
cc @cloud-fan, whom I saw in other related PRs I found while tracking down the history.
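For context, the `PropertyDescriptor.getReadMethod` Javadoc does say it may return null, which is exactly what made `typeToken.method(property.getReadMethod)` throw the NPE. A minimal, Spark-free sketch of the behaviour (the `WriteOnlyBean` and `getterOf` names are hypothetical, for illustration only):

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;

public class ReadMethodDemo {

    // A bean property declared with a setter but no getter.
    public static class WriteOnlyBean {
        private String a;
        public void setA(String a) { this.a = a; }
    }

    // Returns the getter of the named bean property, or null if it has none.
    public static Method getterOf(Class<?> beanClass, String name) {
        try {
            for (PropertyDescriptor pd : Introspector.getBeanInfo(beanClass).getPropertyDescriptors()) {
                if (pd.getName().equals(name)) {
                    return pd.getReadMethod(); // null for a write-only property
                }
            }
            return null;
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The Introspector still reports "a" as a property, but its read
        // method is absent, which is what the code above must guard against.
        System.out.println(getterOf(WriteOnlyBean.class, "a")); // prints null
    }
}
```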
    val returnType = typeToken.method(property.getReadMethod).getReturnType
    val readMethod = Option(property.getReadMethod).getOrElse {
      throw new UnsupportedOperationException(
        s"Cannot read the property ${property.getName} because it does not have the getter")
nit: I think printing the class is better than the property, because the property is internal and somewhat meaningless for users. At the bottom, another error message uses the class: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L433.
Sure, actually, let me try to print both. Thank you.
@@ -123,7 +123,11 @@ object JavaTypeInference {
  val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
  val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
Don't we need to guard against empty properties here? `JavaTypeInference#serializerFor` takes that case into account: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L421.
Yes, it seems loosely related to the JIRA here. I am fine with changing it if a committer can confirm; otherwise, I would like to avoid a behaviour change here.
It seems like behavior should be the same in this case, in this same file. I think you could also improve that error handling.
Is there anywhere else in the code that checks the getter? Is the error message consistent with that?
> It seems like behavior should be the same in this case, in this same file. I think you could also improve that error handling.
The check he pointed out is related only to the `Encoders.bean(...)` API, as it seems only used in `deserializerFor`/`serializerFor` via ExpressionEncoder.scala#L85, which is always executed after the code I propose to fix here. That code just disallows a bean with empty properties: properties without getters are filtered out in JavaTypeInference.scala#L137, and JavaTypeInference.scala#L421 then checks whether the result is empty.
With

spark.createDataset(data, Encoders.bean(EmptyBean.class));

it throws an exception as below:

java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant
  at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:437)
  at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:342)

If we do not use `Encoders.bean(...)`, then it allows an empty schema, as below:

spark.createDataFrame(data, EmptyBean.class);
For the sake of consistency, I guess we might have to consider removing the check at JavaTypeInference.scala#L421 and allowing this empty case, because it seems possible in the Scala API too, as below:
import spark.implicits._
case class A()
spark.createDataset(Seq(A()))
spark.createDataFrame(Seq(A()))
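As a sanity check on the empty case: once the synthetic `class` property is excluded, an empty bean simply has zero property descriptors, so supporting it only means not rejecting an empty schema. A Spark-free sketch (the `EmptyBeanDemo` helper is hypothetical, for illustration):

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.util.Arrays;

public class EmptyBeanDemo {

    // A bean with no properties at all, like the EmptyBean discussed above.
    public static class EmptyBean implements java.io.Serializable {}

    // Counts bean properties, excluding the synthetic "class" property
    // that every Object exposes via getClass().
    public static long propertyCount(Class<?> beanClass) {
        try {
            return Arrays.stream(Introspector.getBeanInfo(beanClass).getPropertyDescriptors())
                    .filter(pd -> !pd.getName().equals("class"))
                    .count();
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(propertyCount(EmptyBean.class)); // prints 0
    }
}
```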
> Is there anywhere else in the code that checks the getter? Is the error message consistent with that?
It seems there are several places:

- JavaTypeInference.scala#L137, in the `Encoders.bean(...)` code path, where these are being filtered out; `JavaTypeInference.inferDataType` is called ahead of it. (The same check is not applied in `JavaTypeInference.inferDataType` due to #17013 (comment).)
- SparkSession.scala#L345-L354 uses `SQLContext.beansToRows`, but `JavaTypeInference.inferDataType` is called ahead of it in `getSchema`.
- SparkSession.scala#L375-L380 uses `SQLContext.beansToRows`, but `JavaTypeInference.inferDataType` is called ahead of it in `getSchema`.
It seems the error message is consistent across these places, because the change I proposed here is executed first, before any of them.
In Java,

spark.createDataFrame(data, BeanWithoutGetter.class);
spark.createDataset(data, Encoders.bean(BeanWithoutGetter.class));

and in Scala,

spark.createDataFrame(rdd, classOf[BeanWithoutGetter])

print

java.lang.UnsupportedOperationException: Cannot infer type for class BeanWithoutGetter because property a does not have the getter
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2$$anonfun$3.apply(JavaTypeInference.scala:127)
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2$$anonfun$3.apply(JavaTypeInference.scala:127)
Honestly, I would like to either fix the empty-property issue and sweep all related tests if there are any (I guess this relates to #17013 (comment)), or avoid this change here, because it seems loosely related.
(I edited the comments above to add more detail and prevent ambiguity.)
Ideally, a property without a getter is not a bean property; let's fix the empty-property problem.
I am sorry for the long comment that may be a bit messy. So, there are two code paths:

- `JavaTypeInference.serializerFor`/`deserializerFor`: non-empty properties required & getter/setter required
- `JavaTypeInference.inferDataType`: an empty bean OK & getter-only OK

Could I please ask which case you want, just to double-check?
I mean `JavaTypeInference`. Why do we throw an exception for a bean property without a getter, instead of just not treating it as a property?
Ah, sure. Then, let me ignore properties without a getter in `JavaTypeInference.inferDataType`, and allow empty properties in `JavaTypeInference.serializerFor`/`deserializerFor`.
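The behaviour agreed on here can be sketched without Spark: schema inference filters out getter-less properties instead of throwing, and an empty result is acceptable. (`readableProperties` below is a hypothetical stand-in for the filtering in `JavaTypeInference`, not Spark's actual code.)

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.Arrays;

public class SkipWriteOnlyDemo {

    // Same shape as the BeanWithoutGetter in the PR description: setter only.
    public static class BeanWithoutGetter implements java.io.Serializable {
        private String a;
        public void setA(String a) { this.a = a; }
    }

    // Keep only properties that can actually be read; a setter-only
    // property is silently skipped instead of triggering an exception.
    public static PropertyDescriptor[] readableProperties(Class<?> beanClass) {
        try {
            return Arrays.stream(Introspector.getBeanInfo(beanClass).getPropertyDescriptors())
                    .filter(pd -> !pd.getName().equals("class"))
                    .filter(pd -> pd.getReadMethod() != null)
                    .toArray(PropertyDescriptor[]::new);
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // "a" is dropped, leaving an empty (but now permitted) schema.
        System.out.println(readableProperties(BeanWithoutGetter.class).length); // prints 0
    }
}
```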
Thanks for your review @maropu.
Test build #73210 has finished for PR 17013 at commit
Test build #73216 has finished for PR 17013 at commit
Test build #73257 has finished for PR 17013 at commit
Test build #73256 has finished for PR 17013 at commit
Test build #73258 has finished for PR 17013 at commit
It's looking fairly good to me.
def getJavaBeanPropertiesWithGetters(beanClass: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(beanClass)
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class").filter(p => p.getReadMethod != null)
Just a brevity thing, but `(_.getReadMethod != null)`?
beanInfo.getPropertyDescriptors
  .filterNot(_.getName == "class").filter(p => p.getReadMethod != null)
}

private def getJavaBeanProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
Can this call the new method above and additionally filter for a setter?
The naming ends up a bit funny, because the first method sounds like it returns a subset of the second method, but it's the other way around. The first one returns "readable properties" and the second returns "read-write properties"? Not sure; might be worth clarifying.
Yes, it seems so. Let me try to address this.
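On the naming point: the read-write properties are indeed a subset of the readable ones, so the second helper can be defined in terms of the first, which also makes the relationship obvious. A hypothetical sketch (class and method names assumed, not Spark's actual API):

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.Arrays;

public class BeanPropsDemo {

    // "a" is readable only; "b" is readable and writable.
    public static class Mixed implements java.io.Serializable {
        private String a;
        private String b;
        public String getA() { return a; }
        public String getB() { return b; }
        public void setB(String b) { this.b = b; }
    }

    // "Readable" properties: those with a getter (and not the synthetic "class").
    public static PropertyDescriptor[] readable(Class<?> c) {
        try {
            return Arrays.stream(Introspector.getBeanInfo(c).getPropertyDescriptors())
                    .filter(pd -> !pd.getName().equals("class"))
                    .filter(pd -> pd.getReadMethod() != null)
                    .toArray(PropertyDescriptor[]::new);
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    // "Read-write" properties: the subset of readable ones that also have a setter.
    public static PropertyDescriptor[] readWrite(Class<?> c) {
        return Arrays.stream(readable(c))
                .filter(pd -> pd.getWriteMethod() != null)
                .toArray(PropertyDescriptor[]::new);
    }

    public static void main(String[] args) {
        System.out.println(readable(Mixed.class).length);  // prints 2
        System.out.println(readWrite(Mixed.class).length); // prints 1
    }
}
```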
Test build #73283 has finished for PR 17013 at commit
Test build #73284 has finished for PR 17013 at commit
LGTM, merging to master!
…rence and allow empty bean in encoder creation

## What changes were proposed in this pull request?

This PR proposes to fix two.

**Skip a property without a getter in beans**

Currently, if we use a JavaBean without the getter as below:

```java
public static class BeanWithoutGetter implements Serializable {
  private String a;

  public void setA(String a) {
    this.a = a;
  }
}

BeanWithoutGetter bean = new BeanWithoutGetter();
List<BeanWithoutGetter> data = Arrays.asList(bean);
spark.createDataFrame(data, BeanWithoutGetter.class).show();
```

- Before: it throws an exception as below:

```
java.lang.NullPointerException
  at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
```

- After:

```
++
||
++
||
++
```

**Supports empty bean in encoder creation**

```java
public static class EmptyBean implements Serializable {}

EmptyBean bean = new EmptyBean();
List<EmptyBean> data = Arrays.asList(bean);
spark.createDataset(data, Encoders.bean(EmptyBean.class)).show();
```

- Before: throws an exception as below:

```
java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant
  at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:436)
  at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:341)
```

- After:

```
++
||
++
||
++
```

## How was this patch tested?

Unit test in `JavaDataFrameSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#17013 from HyukjinKwon/SPARK-19666.