
[SPARK-27776][SQL]Avoid duplicate Java reflection in DataSource. #24647

Closed

wants to merge 7 commits

Conversation

beliefer
Contributor

@beliefer commented May 20, 2019

What changes were proposed in this pull request?

I checked the code of org.apache.spark.sql.execution.datasources.DataSource and found duplicated Java reflection: sourceSchema, createSource, createSink, resolveRelation, and writeAndRead all call providingClass.getConstructor().newInstance().
The instances of providingClass are stateless, for example:
KafkaSourceProvider
RateSourceProvider
TextSocketSourceProvider
JdbcRelationProvider
ConsoleSinkProvider

AFAIK, Java reflection can result in a significant performance cost.
The Oracle tutorial at https://docs.oracle.com/javase/tutorial/reflect/index.html describes the performance implications of Java reflection:

Performance Overhead
Because reflection involves types that are dynamically resolved, certain Java virtual machine optimizations can not be performed. Consequently, reflective operations have slower performance than their non-reflective counterparts, and should be avoided in sections of code which are called frequently in performance-sensitive applications.

I have found some discussions of the performance cost of Java reflection:
https://blog.frankel.ch/performance-cost-of-reflection/ contains a performance cost benchmark.
https://stackoverflow.com/questions/435553/java-reflection-performance has a discussion of Java reflection performance.

So I think we should avoid the duplicated Java reflection and reuse the instance of providingClass.
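As a rough sketch of the proposal (illustration only, not the exact patch; ProviderHolder is a hypothetical stand-in for DataSource with the same providingClass: Class[_] field), the reflective instantiation could be centralized and reused like this:

class ProviderHolder(providingClass: Class[_]) {
  // Sketch only: perform the reflective lookup and instantiation once,
  // then reuse the instance across sourceSchema/createSource/createSink/etc.
  private lazy val providingInstance: Any =
    providingClass.getConstructor().newInstance()

  def provider: Any = providingInstance
}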

How was this patch tested?

Existing UTs.

@beliefer
Contributor Author

cc @dongjoon-hyun @srowen

@srowen
Member

srowen commented May 20, 2019

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

Generally reflection isn't particularly slow now, and certainly trivial compared to the other computation here. I am not sure we should make this change.

Contributor

@gaborgsomogyi left a comment

AFAIK, Java reflection can result in a significant performance cost.

Since it's not called too many times the gain may only be a couple of milliseconds, but I think this makes the code simpler.

LGTM (pending tests).

@gaborgsomogyi
Contributor

I'm not sure about the contract here, whether providers are required to be stateless.

Since providingClass is already state, having a similar thing shouldn't be an issue. The gain is questionable though...

@SparkQA

SparkQA commented May 20, 2019

Test build #105566 has finished for PR 24647 at commit daff979.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented May 21, 2019

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

First of all, I am glad to see your reply.
I had the same question. I investigated all the providers and found that every implementation is indeed stateless. I even investigated providers from third parties, such as com.mongodb.spark.sql.DefaultSource and org.elasticsearch.spark.sql.DefaultSource.
We could guarantee it by adding a contract, such as a Stateless annotation.

Another reason I created this PR is that providingClass.getConstructor().newInstance() is only called inside org.apache.spark.sql.execution.datasources.DataSource, so the behavior is already well encapsulated.
I made providingInstance private, so as to improve the encapsulation further.

@beliefer
Contributor Author

I'm not sure about the contract here, whether providers are required to be stateless.

Since providingClass is already state, having a similar thing shouldn't be an issue. The gain is questionable though...

AFAIK, Java reflection can result in a significant performance cost.

Since it's not called too many times the gain may only be a couple of milliseconds, but I think this makes the code simpler.

LGTM (pending tests).

Yes, I agree. Many a little makes a mickle.

@HeartSaVioR
Contributor

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

While agreeing with this point, extracting these calls into a method would help readability. I'm not sure the Spark community is open to such a minor refactoring patch.

@beliefer
Contributor Author

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

While agreeing with this point, extracting these calls into a method would help readability. I'm not sure the Spark community is open to such a minor refactoring patch.

Thanks for your review. I made providingInstance private, so as to improve the encapsulation.

@HeartSaVioR
Contributor

I'm not sure you're understanding the points here. Assuming we don't want to change the current behavior (while staying open to a performance boost), this patch differs from the current code in two ways:

  1. This patch will behave differently if the provider yields a different instance per constructor call (which is actually obvious) and the instance, or DataSource, relies on that fact.
  2. This patch will behave differently if the instance is stateful, e.g. calling a method changes its internal state.

That might be what you already checked (as you say, all the implementations you've checked are stateless), but it's not guaranteed by the interface contract. I'm not saying one option is better or worse - I'd just like you to understand the point @srowen is making (and which I've agreed with).

If this change brings enough value it might be possible to add the contract to the interface, but I'm not the right person to weigh that value. That's why I only see room for a refactor (extract the call into a method) to simplify the code, nothing else.

@beliefer
Contributor Author

beliefer commented May 21, 2019

  1. This patch will behave differently if the provider yields a different instance per constructor call (which is actually obvious) and the instance, or DataSource, relies on that fact.

First of all, I'm glad to see your detailed description.
I understand and agree with your second point, but not your first.
We only ever call the no-argument constructor, so I think every provider only provides one behavior. Maybe I missed some detail.

For the second point, we could guarantee it by adding a contract, such as a Stateless annotation.

@SparkQA

SparkQA commented May 21, 2019

Test build #105601 has finished for PR 24647 at commit e61c3f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2019

Test build #105611 has finished for PR 24647 at commit a173068.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented May 21, 2019

To be clear, you could cache the constructor object, but still make a new instance each time. And the creation of the object could be in a private method. I think that much is fine.
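A hedged sketch of that variant (illustration only; ProviderFactory is a hypothetical stand-in for DataSource, and this is not the code that was eventually merged):

class ProviderFactory(providingClass: Class[_]) {
  // The reflective constructor lookup happens once and is cached here...
  private val ctor = providingClass.getConstructor()

  // ...while each call still builds a fresh instance, preserving the
  // current per-call semantics behind a private helper.
  private def providingInstance(): Any = ctor.newInstance()
}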

@beliefer
Contributor Author

Based on the above discussion, none of us is sure about the contract, i.e. whether providers are required to be stateless. So I agree with @srowen that we should not assume statelessness and should just improve the encapsulation a little.

@SparkQA

SparkQA commented May 22, 2019

Test build #105651 has finished for PR 24647 at commit e3014cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -105,6 +105,8 @@ case class DataSource(
case _ => cls
}
}
private def providingInstance = providingClass.getConstructor().newInstance()
Member

This should be declared like a method for clarity, with a return type. Also add a new line above. I'd use () in the declaration and invocations for clarity too.

Contributor Author

@beliefer May 23, 2019

If we add a return type, only Any could be used here. This method is private, so can the return type be omitted?

Member

The type would be something like _ <: DataSource, not Any. But OK.

Contributor Author

I used Any temporarily, because the type of providingClass is Class[_]. If we really want to use _ <: DataSourceRegister, we have to adjust the implementation of DataSource.lookupDataSource.

@SparkQA

SparkQA commented May 24, 2019

Test build #105742 has finished for PR 24647 at commit e5e98cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -105,6 +105,9 @@ case class DataSource(
case _ => cls
}
}

private def providingInstance(): Any = providingClass.getConstructor().newInstance()
Member

It's definitely at least an AnyRef. Can you write private def [T <: AnyRef] providingInstance(): T? You can even tighten the bound with a cast, as we know the supertype of what's being returned.
However I don't think it's worth it. You can use AnyRef or omit the type here.
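For illustration only (a hypothetical typed variant, not what was merged; note that in Scala the type parameter sits after the method name, and a cast is needed because providingClass is only known as Class[_]):

class TypedProviderFactory(providingClass: Class[_]) {
  // Hypothetical typed helper: callers would pick T (e.g. a provider trait)
  // and the cast narrows the reflectively created instance to it.
  private def providingInstance[T <: AnyRef](): T =
    providingClass.getConstructor().newInstance().asInstanceOf[T]
}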

Contributor Author

Yes, I don't think it's worth it either. I omitted the return type here.

@SparkQA

SparkQA commented May 27, 2019

Test build #105811 has finished for PR 24647 at commit b3c0f34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen left a comment

I think this ends up being kind of trivial, but OK

@srowen
Member

srowen commented May 28, 2019

Merged to master

@srowen closed this in c30b529 on May 28, 2019
@beliefer
Contributor Author

@srowen Thanks for your help. @gaborgsomogyi @HeartSaVioR Thanks for your review.
