
[SPARK-19666][SQL] Skip a property without getter in Java schema inference and allow empty bean in encoder creation #17013

Closed

wants to merge 7 commits
Conversation

@HyukjinKwon (Member) commented Feb 21, 2017

What changes were proposed in this pull request?

This PR proposes two fixes.

**Skip a property without a getter in beans**

Currently, if we use a JavaBean without a getter, as below:

```java
public static class BeanWithoutGetter implements Serializable {
  private String a;

  public void setA(String a) {
    this.a = a;
  }
}

BeanWithoutGetter bean = new BeanWithoutGetter();
List<BeanWithoutGetter> data = Arrays.asList(bean);
spark.createDataFrame(data, BeanWithoutGetter.class).show();
```
- Before

It throws an exception as below:

```
java.lang.NullPointerException
	at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
```
- After

```
++
||
++
||
++
```
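In essence, the inference side now skips setter-only properties before inspecting their return types. A minimal sketch of that filter, using only the standard JavaBeans Introspector (`inferredFieldNames` is an illustrative helper, not Spark's API):

```scala
import java.beans.Introspector

// Drop setter-only properties (null getReadMethod) before touching their
// return types, so BeanWithoutGetter yields an empty schema instead of an NPE.
def inferredFieldNames(beanClass: Class[_]): Seq[String] =
  Introspector.getBeanInfo(beanClass).getPropertyDescriptors.toSeq
    .filterNot(_.getName == "class")
    .filter(_.getReadMethod != null)
    .map(_.getName)
```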

**Supports empty bean in encoder creation**

```java
public static class EmptyBean implements Serializable {}

EmptyBean bean = new EmptyBean();
List<EmptyBean> data = Arrays.asList(bean);
spark.createDataset(data, Encoders.bean(EmptyBean.class)).show();
```
- Before

It throws an exception as below:

```
java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant
	at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:436)
	at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:341)
```

- After

```
++
||
++
||
++
```

How was this patch tested?

Unit test in `JavaDataFrameSuite`.
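For context, a hedged Scala approximation of that kind of test (the merged test is written in Java; `RegressionSketch` and the bean here are illustrative stand-ins):

```scala
import java.util.Arrays
import org.apache.spark.sql.SparkSession

// Stand-in for the Java test bean: a single setter-only property.
class BeanWithoutGetter extends java.io.Serializable {
  private var a: String = _
  def setA(a: String): Unit = { this.a = a }
}

object RegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    val df = spark.createDataFrame(Arrays.asList(new BeanWithoutGetter), classOf[BeanWithoutGetter])
    assert(df.schema.isEmpty) // the setter-only property is skipped rather than causing an NPE
    assert(df.count() == 1)   // the row itself is still there, just with no columns
    spark.stop()
  }
}
```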

```
@@ -123,7 +123,11 @@ object JavaTypeInference {
    val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
    val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
    val fields = properties.map { property =>
      val returnType = typeToken.method(property.getReadMethod).getReturnType
      val readMethod = Option(property.getReadMethod).getOrElse {
```
Member Author

This:

> May return null if the property can't be read.
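To see why the guard is needed, a minimal sketch using only java.beans (`SetterOnly` is a stand-in class, not from the PR):

```scala
import java.beans.Introspector

// A setter-only property is still discovered by the Introspector, but its
// read method is null; Option(property.getReadMethod) guards exactly that.
class SetterOnly {
  private var a: String = _
  def setA(v: String): Unit = { this.a = v }
}

object NullGetterSketch {
  def main(args: Array[String]): Unit = {
    val prop = Introspector.getBeanInfo(classOf[SetterOnly])
      .getPropertyDescriptors
      .find(_.getName == "a")
      .get
    assert(prop.getReadMethod == null)  // this null is what TypeToken.method(...) tripped over
    assert(prop.getWriteMethod != null) // the setter is found
  }
}
```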

@HyukjinKwon (Member Author) commented Feb 21, 2017

Ideally, it seems we should filter them all out, judging from:

```scala
// TODO: we should only collect properties that have getter and setter. However, some tests
// pass in scala case class as java bean class which doesn't have getter and setter.
```

I am willing to do this, but it also seems to need a behaviour change and (I guess) a big sweep. So, this PR just proposes to fix the exception message for now.
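As an aside, a small sketch of why that stricter rule touches the case-class tests (my reading, not from the PR): a plain Scala case class exposes accessors like `x()` rather than `getX()`/`setX()`, so the Introspector reports no bean properties for it at all.

```scala
import java.beans.Introspector

// A plain case class compiles to accessors x()/y(), not getX()/setX(),
// so the JavaBeans Introspector finds no bean properties on it.
case class Point(x: Int, y: Int)

object CaseClassSketch {
  def main(args: Array[String]): Unit = {
    val names = Introspector.getBeanInfo(classOf[Point])
      .getPropertyDescriptors
      .filterNot(_.getName == "class")
      .map(_.getName)
    assert(names.isEmpty)
  }
}
```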

@HyukjinKwon (Member Author) commented Feb 21, 2017

cc @cloud-fan, whom I saw in other related PRs (found while tracking down the history).

```scala
val returnType = typeToken.method(property.getReadMethod).getReturnType
val readMethod = Option(property.getReadMethod).getOrElse {
  throw new UnsupportedOperationException(
    s"Cannot read the property ${property.getName} because it does not have the getter")
```
Member

nit: I think the class is better than the property, because the property is internal and somewhat meaningless for users. At the bottom, an error message uses the class: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L433.

Member Author

Sure, actually, let me try to print both. Thank you.

```
@@ -123,7 +123,11 @@ object JavaTypeInference {
    val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
    val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
```
Member

Don't we need to guard against empty properties? JavaTypeInference#serializerFor (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L421) takes that case into account.

Member Author

Yes, it seems loosely related to the JIRA here. I am fine with changing it if any committer can confirm; otherwise, I would like to avoid a behaviour change here.

Member

It seems like behavior should be the same in this case, in this same file. I think you could also improve that error handling.

Is there anywhere else in the code that checks the getter? Is the error message consistent with that?

Member Author

> It seems like behavior should be the same in this case, in this same file. I think you could also improve that error handling.

The check he pointed out relates only to the Encoders.bean(...) API, as it seems to be used only in deserializerFor/serializerFor via ExpressionEncoder.scala#L85, which is always executed after the code I propose to fix here.

The code below simply disallows a bean with empty properties: properties without getters are filtered out in JavaTypeInference.scala#L137, and JavaTypeInference.scala#L421 then checks whether the remainder is empty.

```java
spark.createDataset(data, Encoders.bean(EmptyBean.class));
```

It throws an exception as below:

```
java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant
	at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:437)
	at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:342)
```

If we do not use Encoders.bean(...), then it allows an empty schema, as below:

```java
spark.createDataFrame(data, EmptyBean.class);
```

For the sake of consistency, I guess we might have to consider removing the check there (JavaTypeInference.scala#L421) and allowing the empty case, because this seems possible in the Scala API too, as below:

```scala
import spark.implicits._

case class A()
spark.createDataset(Seq(A()))
spark.createDataFrame(Seq(A()))
```

> Is there anywhere else in the code that checks the getter? Is the error message consistent with that?

It seems there are several places.

It seems the error message is consistent with them, because the change I proposed here is executed before those places.

Java

```java
spark.createDataFrame(data, BeanWithoutGetter.class);
spark.createDataset(data, Encoders.bean(BeanWithoutGetter.class));
```

Scala

```scala
spark.createDataFrame(rdd, classOf[BeanWithoutGetter])
```

prints

```
java.lang.UnsupportedOperationException: Cannot infer type for class BeanWithoutGetter because property a does not have the getter
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2$$anonfun$3.apply(JavaTypeInference.scala:127)
	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2$$anonfun$3.apply(JavaTypeInference.scala:127)
```

Member Author

Honestly, I would like to either fix the empty-property case and sweep all related tests if there are any (I guess this relates to #17013 (comment)), or avoid this change here, because it seems loosely related.

Member Author

(I edited the comments above to add more details and prevent ambiguity.)

Contributor

Ideally, a property without a getter is not a bean property; let's fix the empty-property problem.

Member Author

I am sorry for the long comment; it may be a bit messy. There are two code paths:

  • JavaTypeInference.serializerFor/deserializerFor: a non-empty bean is required, and both getter and setter are required

  • JavaTypeInference.inferDataType: an empty bean is OK, and getter-only is OK

Could I ask which case you mean, just to double-check?
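For reference, a minimal sketch of which user-facing call exercises which path (assuming a local SparkSession; `TwoPaths` and `EmptyBean` are illustrative stand-ins):

```scala
import java.util.Arrays
import org.apache.spark.sql.{Encoders, SparkSession}

class EmptyBean extends java.io.Serializable

object TwoPaths {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("two-paths").getOrCreate()
    val data = Arrays.asList(new EmptyBean)

    // inferDataType path: schema inference only, used by createDataFrame(data, beanClass).
    spark.createDataFrame(data, classOf[EmptyBean]).show()

    // serializerFor/deserializerFor path: exercised through Encoders.bean(...).
    spark.createDataset(data)(Encoders.bean(classOf[EmptyBean])).show()

    spark.stop()
  }
}
```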

Contributor

I mean JavaTypeInference. Why do we throw an exception for a bean property without a getter instead of simply not treating it as a property?

Member Author

Ah, sure. Then let me ignore properties without getters in JavaTypeInference.inferDataType, and allow empty beans in JavaTypeInference.serializerFor/deserializerFor.

@HyukjinKwon (Member Author)

Thanks for your review @maropu.

@SparkQA commented Feb 21, 2017

Test build #73210 has finished for PR 17013 at commit 9ad4789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 21, 2017

Test build #73216 has finished for PR 17013 at commit ae4c9aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • s\"Cannot infer type for class $

@HyukjinKwon changed the title [SPARK-19666][SQL] Improve error message for JavaBean without getter → [SPARK-19666][SQL] Skip a property without getter in Java schema inference and allow empty bean in encoder creation Feb 22, 2017
@SparkQA commented Feb 22, 2017

Test build #73257 has finished for PR 17013 at commit 5808d71.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2017

Test build #73256 has finished for PR 17013 at commit 91cee26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2017

Test build #73258 has finished for PR 17013 at commit ed686fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class BeanWithoutGetter implements Serializable

@srowen (Member) left a comment

It's looking fairly good to me.

```scala
def getJavaBeanPropertiesWithGetters(beanClass: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(beanClass)
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class").filter(p => p.getReadMethod != null)
```
Member

Just a brevity thing, but: `(_.getReadMethod != null)`?

```scala
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class").filter(p => p.getReadMethod != null)
}

private def getJavaBeanProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
```
Member

Can this call the new method above and additionally filter for a setter?
The naming ends up a bit funny, because the first method sounds like it returns a subset of the second, but it's the other way around. The first one returns "readable properties" and the second "read-write properties"? Not sure; it might be worth clarifying.

Member Author

Yes, it seems so. Let me try to address this.
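A sketch of the shape that suggestion implies (illustrative only; the method names follow the review discussion and may differ from the merged code): the read-write variant reuses the readable one and adds the setter filter.

```scala
import java.beans.{Introspector, PropertyDescriptor}

// Readable properties: everything with a getter.
def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(beanClass)
  beanInfo.getPropertyDescriptors
    .filterNot(_.getName == "class")
    .filter(_.getReadMethod != null)
}

// Read-write properties: readable properties that also have a setter.
def getJavaBeanReadableAndWritableProperties(beanClass: Class[_]): Array[PropertyDescriptor] =
  getJavaBeanReadableProperties(beanClass).filter(_.getWriteMethod != null)
```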

@SparkQA commented Feb 22, 2017

Test build #73283 has finished for PR 17013 at commit 3604855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2017

Test build #73284 has finished for PR 17013 at commit ac5cc7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, merging to master!

@asfgit closed this in 37112fc Feb 22, 2017
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#17013 from HyukjinKwon/SPARK-19666.
@HyukjinKwon deleted the SPARK-19666 branch January 2, 2018 03:38