
[SPARK-27776][SQL]Avoid duplicate Java reflection in DataSource. #24647

Closed

wants to merge 7 commits

Conversation

beliefer
Contributor

@beliefer commented May 20, 2019

What changes were proposed in this pull request?

I checked the code of org.apache.spark.sql.execution.datasources.DataSource and found duplicated Java reflection: sourceSchema, createSource, createSink, resolveRelation, and writeAndRead all call providingClass.getConstructor().newInstance().
The instances of providingClass are stateless, for example:
KafkaSourceProvider
RateSourceProvider
TextSocketSourceProvider
JdbcRelationProvider
ConsoleSinkProvider

AFAIK, Java reflection can result in a significant performance cost.
The Oracle tutorial at https://docs.oracle.com/javase/tutorial/reflect/index.html describes the performance implications of Java reflection:

Performance Overhead
Because reflection involves types that are dynamically resolved, certain Java virtual machine optimizations can not be performed. Consequently, reflective operations have slower performance than their non-reflective counterparts, and should be avoided in sections of code which are called frequently in performance-sensitive applications.

I have found some discussions of the performance cost of Java reflection:
https://blog.frankel.ch/performance-cost-of-reflection/ contains a performance cost benchmark.
https://stackoverflow.com/questions/435553/java-reflection-performance has a discussion of Java reflection performance.

So I think we should avoid the duplicated Java reflection and reuse the instance of providingClass.
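As a rough sketch of the proposal (illustration only, not the exact patch; ProviderHolder is a hypothetical stand-in for DataSource with the same providingClass: Class[_] field), the reflective instantiation could be centralized and reused like this:

class ProviderHolder(providingClass: Class[_]) {
  // Sketch only: perform the reflective lookup and instantiation once,
  // then reuse the instance across sourceSchema/createSource/createSink/etc.
  private lazy val providingInstance: Any =
    providingClass.getConstructor().newInstance()

  def provider: Any = providingInstance
}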

How was this patch tested?

Existing UTs.

@beliefer
Contributor Author

cc @dongjoon-hyun @srowen

@srowen
Member

srowen commented May 20, 2019

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

Generally reflection isn't particularly slow now, and certainly trivial compared to the other computation here. I am not sure we should make this change.

Contributor

@gaborgsomogyi left a comment

AFAIK, Java reflection can result in a significant performance cost.

Since it's not called too many times the gain may only be a couple of milliseconds, but I think this makes the code simpler.

LGTM (pending tests).

@gaborgsomogyi
Contributor

I'm not sure about the contract here, whether providers are required to be stateless.

Since providingClass is already state, having a similar thing shouldn't be an issue. The gain is questionable though...

@SparkQA

SparkQA commented May 20, 2019

Test build #105566 has finished for PR 24647 at commit daff979.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented May 21, 2019

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

First of all, I am glad to see your reply.
I had the same question. I investigated all the providers and found that every implementation is indeed stateless. I even investigated providers from third parties, such as com.mongodb.spark.sql.DefaultSource and org.elasticsearch.spark.sql.DefaultSource.
We could guarantee it by adding a contract, such as a Stateless annotation.

Another reason I created this PR is that providingClass.getConstructor().newInstance() is only called inside org.apache.spark.sql.execution.datasources.DataSource, so the behavior is already well encapsulated.
I made providingInstance private, so as to improve the encapsulation further.

@beliefer
Contributor Author

I'm not sure about the contract here, whether providers are required to be stateless.

Since providingClass is already state, having a similar thing shouldn't be an issue. The gain is questionable though...

AFAIK, Java reflection can result in a significant performance cost.

Since it's not called too many times the gain may only be a couple of milliseconds, but I think this makes the code simpler.

LGTM (pending tests).

Yes, I agree. Many a little makes a mickle.

@HeartSaVioR
Contributor

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

While agreeing with this point, extracting these calls into a method would help readability. I'm not sure the Spark community is open to such a minor refactoring patch.

@beliefer
Contributor Author

I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point.

While agreeing with this point, extracting these calls into a method would help readability. I'm not sure the Spark community is open to such a minor refactoring patch.

Thanks for your review. I made providingInstance private, so as to improve the encapsulation.

@HeartSaVioR
Contributor

I'm not sure you're understanding the points here. Assuming we don't want to change the current behavior (while staying open to a performance boost), this patch differs from the current code in two ways:

  1. This patch will behave differently if the provider yields a different instance per constructor call (which is actually obvious) and the instance, or DataSource, relies on that fact.
  2. This patch will behave differently if the instance is stateful, e.g. calling a method changes its internal state.

That might be what you already checked (as you say, all the implementations you've checked are stateless), but it's not guaranteed by the interface contract. I'm not saying one option is better or worse - I'd just like you to understand the point @srowen is making (and which I've agreed with).

If this change brings enough value it might be possible to add the contract to the interface, but I'm not the right person to weigh that value. That's why I only see room for a refactor (extract the call into a method) to simplify the code, nothing else.

@beliefer
Contributor Author

beliefer commented May 21, 2019

  1. This patch will behave differently if the provider yields a different instance per constructor call (which is actually obvious) and the instance, or DataSource, relies on that fact.

First of all, I'm glad to see your detailed description.
I understand and agree with your second point, but not your first.
We only ever call the no-argument constructor, so I think every provider only provides one behavior. Maybe I missed some detail.

For the second point, we could guarantee it by adding a contract, such as a Stateless annotation.

@SparkQA

SparkQA commented May 21, 2019

Test build #105601 has finished for PR 24647 at commit e61c3f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2019

Test build #105611 has finished for PR 24647 at commit a173068.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented May 21, 2019

To be clear, you could cache the constructor object, but still make a new instance each time. And the creation of the object could be in a private method. I think that much is fine.
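A hedged sketch of that variant (illustration only; ProviderFactory is a hypothetical stand-in for DataSource, and this is not the code that was eventually merged):

class ProviderFactory(providingClass: Class[_]) {
  // The reflective constructor lookup happens once and is cached here...
  private val ctor = providingClass.getConstructor()

  // ...while each call still builds a fresh instance, preserving the
  // current per-call semantics behind a private helper.
  private def providingInstance(): Any = ctor.newInstance()
}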

@beliefer
Contributor Author

Based on the above discussion, none of us is sure about the contract, i.e. whether providers are required to be stateless. So I agree with @srowen that we should not assume statelessness and should just improve the encapsulation a little.

@SparkQA

SparkQA commented May 22, 2019

Test build #105651 has finished for PR 24647 at commit e3014cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -105,6 +105,8 @@ case class DataSource(
case _ => cls
}
}
private def providingInstance = providingClass.getConstructor().newInstance()
Member

This should be declared like a method for clarity, with a return type. Also add a new line above. I'd use () in the declaration and invocations for clarity too.

Contributor Author

@beliefer May 23, 2019

If we add a return type, only Any could be used here. This method is private, so can the return type be omitted?

Member

The type would be something like _ <: DataSource, not Any. But OK.

Contributor Author

I used Any temporarily, because the type of providingClass is Class[_]. If we really want to use _ <: DataSourceRegister, we have to adjust the implementation of DataSource.lookupDataSource.

@SparkQA

SparkQA commented May 24, 2019

Test build #105742 has finished for PR 24647 at commit e5e98cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -105,6 +105,9 @@ case class DataSource(
case _ => cls
}
}

private def providingInstance(): Any = providingClass.getConstructor().newInstance()
Member

It's definitely at least an AnyRef. Can you write private def [T <: AnyRef] providingInstance(): T? You can even tighten the bound with a cast, as we know the supertype of what's being returned.
However I don't think it's worth it. You can use AnyRef or omit the type here.
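For illustration only (a hypothetical typed variant, not what was merged; note that in Scala the type parameter sits after the method name, and a cast is needed because providingClass is only known as Class[_]):

class TypedProviderFactory(providingClass: Class[_]) {
  // Hypothetical typed helper: callers would pick T (e.g. a provider trait)
  // and the cast narrows the reflectively created instance to it.
  private def providingInstance[T <: AnyRef](): T =
    providingClass.getConstructor().newInstance().asInstanceOf[T]
}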

Contributor Author

Yes, I don't think it's worth it either. I omitted the return type here.

@SparkQA

SparkQA commented May 27, 2019

Test build #105811 has finished for PR 24647 at commit b3c0f34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen left a comment

I think this ends up being kind of trivial, but OK

@srowen
Member

srowen commented May 28, 2019

Merged to master

@srowen closed this in c30b529 on May 28, 2019
@beliefer
Contributor Author

@srowen Thanks for your help. @gaborgsomogyi @HeartSaVioR Thanks for your review.
