
[SPARK-19666][SQL] Skip a property without getter in Java schema inference and allow empty bean in encoder creation #17013

Closed

wants to merge 7 commits
Conversation

@HyukjinKwon (Member) commented Feb 21, 2017

What changes were proposed in this pull request?

This PR proposes two fixes.

**Skip a property without a getter in beans**

Currently, if we use a JavaBean without a getter, as below:

```java
public static class BeanWithoutGetter implements Serializable {
  private String a;

  public void setA(String a) {
    this.a = a;
  }
}

BeanWithoutGetter bean = new BeanWithoutGetter();
List<BeanWithoutGetter> data = Arrays.asList(bean);
spark.createDataFrame(data, BeanWithoutGetter.class).show();
```
- Before

It throws an exception as below:

```
java.lang.NullPointerException
	at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
```
- After

```
++
||
++
||
++
```
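In essence, the inference side now skips setter-only properties before inspecting their return types. A minimal sketch of that filter, using only the standard JavaBeans Introspector (`inferredFieldNames` is an illustrative helper, not Spark's API):

```scala
import java.beans.Introspector

// Drop setter-only properties (null getReadMethod) before touching their
// return types, so BeanWithoutGetter yields an empty schema instead of an NPE.
def inferredFieldNames(beanClass: Class[_]): Seq[String] =
  Introspector.getBeanInfo(beanClass).getPropertyDescriptors.toSeq
    .filterNot(_.getName == "class")
    .filter(_.getReadMethod != null)
    .map(_.getName)
```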

**Supports empty bean in encoder creation**

```java
public static class EmptyBean implements Serializable {}

EmptyBean bean = new EmptyBean();
List<EmptyBean> data = Arrays.asList(bean);
spark.createDataset(data, Encoders.bean(EmptyBean.class)).show();
```
- Before

It throws an exception as below:

```
java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant
	at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:436)
	at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:341)
```

- After

```
++
||
++
||
++
```

How was this patch tested?

Unit test in `JavaDataFrameSuite`.
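For context, a hedged Scala approximation of that kind of test (the merged test is written in Java; `RegressionSketch` and the bean here are illustrative stand-ins):

```scala
import java.util.Arrays
import org.apache.spark.sql.SparkSession

// Stand-in for the Java test bean: a single setter-only property.
class BeanWithoutGetter extends java.io.Serializable {
  private var a: String = _
  def setA(a: String): Unit = { this.a = a }
}

object RegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    val df = spark.createDataFrame(Arrays.asList(new BeanWithoutGetter), classOf[BeanWithoutGetter])
    assert(df.schema.isEmpty) // the setter-only property is skipped rather than causing an NPE
    assert(df.count() == 1)   // the row itself is still there, just with no columns
    spark.stop()
  }
}
```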

```
@@ -123,7 +123,11 @@ object JavaTypeInference {
    val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
    val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
    val fields = properties.map { property =>
      val returnType = typeToken.method(property.getReadMethod).getReturnType
      val readMethod = Option(property.getReadMethod).getOrElse {
```
Member Author

This:

> May return null if the property can't be read.
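To see why the guard is needed, a minimal sketch using only java.beans (`SetterOnly` is a stand-in class, not from the PR):

```scala
import java.beans.Introspector

// A setter-only property is still discovered by the Introspector, but its
// read method is null; Option(property.getReadMethod) guards exactly that.
class SetterOnly {
  private var a: String = _
  def setA(v: String): Unit = { this.a = v }
}

object NullGetterSketch {
  def main(args: Array[String]): Unit = {
    val prop = Introspector.getBeanInfo(classOf[SetterOnly])
      .getPropertyDescriptors
      .find(_.getName == "a")
      .get
    assert(prop.getReadMethod == null)  // this null is what TypeToken.method(...) tripped over
    assert(prop.getWriteMethod != null) // the setter is found
  }
}
```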

@HyukjinKwon (Member Author) commented Feb 21, 2017

Ideally, it seems we should filter them all out, judging from:

```scala
// TODO: we should only collect properties that have getter and setter. However, some tests
// pass in scala case class as java bean class which doesn't have getter and setter.
```

I am willing to do this, but it also seems to need a behaviour change and (I guess) a big sweep. So, this PR just proposes to fix the exception message for now.
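As an aside, a small sketch of why that stricter rule touches the case-class tests (my reading, not from the PR): a plain Scala case class exposes accessors like `x()` rather than `getX()`/`setX()`, so the Introspector reports no bean properties for it at all.

```scala
import java.beans.Introspector

// A plain case class compiles to accessors x()/y(), not getX()/setX(),
// so the JavaBeans Introspector finds no bean properties on it.
case class Point(x: Int, y: Int)

object CaseClassSketch {
  def main(args: Array[String]): Unit = {
    val names = Introspector.getBeanInfo(classOf[Point])
      .getPropertyDescriptors
      .filterNot(_.getName == "class")
      .map(_.getName)
    assert(names.isEmpty)
  }
}
```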

@HyukjinKwon (Member Author) commented Feb 21, 2017

cc @cloud-fan, whom I saw in other related PRs (found while tracking down the history).

```scala
val returnType = typeToken.method(property.getReadMethod).getReturnType
val readMethod = Option(property.getReadMethod).getOrElse {
  throw new UnsupportedOperationException(
    s"Cannot read the property ${property.getName} because it does not have the getter")
```
Member

nit: I think the class is better than the property, because the property is internal and somewhat meaningless for users. At the bottom, an error message uses the class: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L433.

Member Author

Sure, actually, let me try to print both. Thank you.

```
@@ -123,7 +123,11 @@ object JavaTypeInference {
    val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
    val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
```
Member

Don't we need to guard against empty properties? JavaTypeInference#serializerFor (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L421) takes that case into account.

Member Author

Yes, it seems loosely related to the JIRA here. I am fine with changing it if any committer can confirm; otherwise, I would like to avoid a behaviour change here.

Member

It seems like behavior should be the same in this case, in this same file. I think you could also improve that error handling.

Is there anywhere else in the code that checks the getter? Is the error message consistent with that?

Member Author

> It seems like behavior should be the same in this case, in this same file. I think you could also improve that error handling.

The check he pointed out relates only to the Encoders.bean(...) API, as it seems to be used only in deserializerFor/serializerFor via ExpressionEncoder.scala#L85, which is always executed after the code I propose to fix here.

The code below simply disallows a bean with empty properties: properties without getters are filtered out in JavaTypeInference.scala#L137, and JavaTypeInference.scala#L421 then checks whether the remainder is empty.

```java
spark.createDataset(data, Encoders.bean(EmptyBean.class));
```

It throws an exception as below:

```
java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant
	at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:437)
	at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:342)
```

If we do not use Encoders.bean(...), then it allows an empty schema, as below:

```java
spark.createDataFrame(data, EmptyBean.class);
```

For the sake of consistency, I guess we might have to consider removing the check there (JavaTypeInference.scala#L421) and allowing the empty case, because this seems possible in the Scala API too, as below:

```scala
import spark.implicits._

case class A()
spark.createDataset(Seq(A()))
spark.createDataFrame(Seq(A()))
```

> Is there anywhere else in the code that checks the getter? Is the error message consistent with that?

It seems there are several places.

It seems the error message is consistent with them, because the change I proposed here is executed before those places.

Java

```java
spark.createDataFrame(data, BeanWithoutGetter.class);
spark.createDataset(data, Encoders.bean(BeanWithoutGetter.class));
```

Scala

```scala
spark.createDataFrame(rdd, classOf[BeanWithoutGetter])
```

prints

```
java.lang.UnsupportedOperationException: Cannot infer type for class BeanWithoutGetter because property a does not have the getter
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2$$anonfun$3.apply(JavaTypeInference.scala:127)
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2$$anonfun$3.apply(JavaTypeInference.scala:127)
```

Member Author

Honestly, I would like to either fix the empty-property case and sweep all related tests if there are any (I guess this relates to #17013 (comment)), or avoid this change here, because it seems loosely related.

Member Author

(I edited the comments above to add more details and prevent ambiguity.)

Contributor

Ideally, a property without a getter is not a bean property; let's fix the empty-property problem.

Member Author

I am sorry for the long comment; it may be a bit messy. There are two code paths:

  • JavaTypeInference.serializerFor/deserializerFor: a non-empty bean is required, and both getter and setter are required

  • JavaTypeInference.inferDataType: an empty bean is OK, and getter-only is OK

Could I ask which case you mean, just to double-check?
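For reference, a minimal sketch of which user-facing call exercises which path (assuming a local SparkSession; `TwoPaths` and `EmptyBean` are illustrative stand-ins):

```scala
import java.util.Arrays
import org.apache.spark.sql.{Encoders, SparkSession}

class EmptyBean extends java.io.Serializable

object TwoPaths {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("two-paths").getOrCreate()
    val data = Arrays.asList(new EmptyBean)

    // inferDataType path: schema inference only, used by createDataFrame(data, beanClass).
    spark.createDataFrame(data, classOf[EmptyBean]).show()

    // serializerFor/deserializerFor path: exercised through Encoders.bean(...).
    spark.createDataset(data)(Encoders.bean(classOf[EmptyBean])).show()

    spark.stop()
  }
}
```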

Contributor

I mean JavaTypeInference. Why do we throw an exception for a bean property without a getter instead of simply not treating it as a property?

Member Author

Ah, sure. Then let me ignore properties without getters in JavaTypeInference.inferDataType, and allow empty beans in JavaTypeInference.serializerFor/deserializerFor.

@HyukjinKwon (Member Author)

Thanks for your review @maropu.

@SparkQA commented Feb 21, 2017

Test build #73210 has finished for PR 17013 at commit 9ad4789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 21, 2017

Test build #73216 has finished for PR 17013 at commit ae4c9aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • s\"Cannot infer type for class $

@HyukjinKwon changed the title [SPARK-19666][SQL] Improve error message for JavaBean without getter → [SPARK-19666][SQL] Skip a property without getter in Java schema inference and allow empty bean in encoder creation Feb 22, 2017
@SparkQA commented Feb 22, 2017

Test build #73257 has finished for PR 17013 at commit 5808d71.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2017

Test build #73256 has finished for PR 17013 at commit 91cee26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2017

Test build #73258 has finished for PR 17013 at commit ed686fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class BeanWithoutGetter implements Serializable

@srowen (Member) left a comment

It's looking fairly good to me.

```scala
def getJavaBeanPropertiesWithGetters(beanClass: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(beanClass)
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class").filter(p => p.getReadMethod != null)
```
Member

Just a brevity thing, but: `(_.getReadMethod != null)`?

```scala
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class").filter(p => p.getReadMethod != null)
}

private def getJavaBeanProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
```
Member

Can this call the new method above and additionally filter for a setter?
The naming ends up a bit funny, because the first method sounds like it returns a subset of the second, but it's the other way around. The first one returns "readable properties" and the second "read-write properties"? Not sure; it might be worth clarifying.

Member Author

Yes, it seems so. Let me try to address this.
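A sketch of the shape that suggestion implies (illustrative only; the method names follow the review discussion and may differ from the merged code): the read-write variant reuses the readable one and adds the setter filter.

```scala
import java.beans.{Introspector, PropertyDescriptor}

// Readable properties: everything with a getter.
def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(beanClass)
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class")
    .filter(_.getReadMethod != null)
}

// Read-write properties: readable properties that also have a setter.
def getJavaBeanReadableAndWritableProperties(beanClass: Class[_]): Array[PropertyDescriptor] =
  getJavaBeanReadableProperties(beanClass).filter(_.getWriteMethod != null)
```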

@SparkQA commented Feb 22, 2017

Test build #73283 has finished for PR 17013 at commit 3604855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2017

Test build #73284 has finished for PR 17013 at commit ac5cc7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, merging to master!

@asfgit closed this in 37112fc Feb 22, 2017
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#17013 from HyukjinKwon/SPARK-19666.
@HyukjinKwon deleted the SPARK-19666 branch January 2, 2018 03:38