[SPARK-27776][SQL] Avoid duplicate Java reflection in DataSource. #24647
I'm not sure about the contract here, whether providers are required to be stateless. If they're not then this would be a problem for another instance that has state, or if these acquire state at some point. Generally reflection isn't particularly slow now, and certainly trivial compared to the other computation here. I am not sure we should make this change.
AFAIK, Java reflection will result in significant performance issues.
Since it's not called too many times, the gain could be a couple of milliseconds, but I think this makes the code simpler.
LGTM (pending tests).
Since |
Test build #105566 has finished for PR 24647 at commit
|
First of all, I am glad to see your reply. Another reason I created this PR is |
Yes, I agree. Many a little makes a mickle. |
While agreeing with this point, extracting these calls into a method would help readability. I'm not sure the Spark community is open to allowing a minor refactoring patch.
Thanks for your review. I modified |
I'm not sure you're understanding the points here. Just assuming we don't want to change the current behavior (but open to performance boost), this patch is different from current for two points of view:
That might be what you already checked before (as you're saying all the implementations you've checked are stateless), but it's not guaranteed by the interface contract. I'm not saying something is better and something is worse; I would just like you to understand the point @srowen is making (and which I've agreed with). If this change brings enough value it might be possible to add the contract to the interface, but I'm not the right one to weigh the value. That's why I only see the chance to refactor (extract the call into a method) to simplify the code, nothing else.
First of all, I'm glad to see your detailed description. To second point, we gain to ensure it by adding a contract, such as an annotation |
Test build #105601 has finished for PR 24647 at commit
|
Test build #105611 has finished for PR 24647 at commit
|
To be clear, you could cache the constructor object, but still make a new instance each time. And the creation of the object could be in a private method. I think that much is fine. |
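For illustration, here is a minimal Scala sketch of the suggestion above: look up the constructor once, but still build a new instance on every call. The names CountingProvider, ProviderFactory, and providingConstructor are hypothetical and are not part of Spark; the provider is deliberately stateful to show that the state is not shared between instances.

```scala
import java.lang.reflect.Constructor

// Hypothetical stand-in for a provider class; it carries mutable state on purpose.
class CountingProvider {
  private var calls = 0
  def createSource(): Int = { calls += 1; calls }
}

// Hypothetical holder, loosely modeled on the shape discussed in this thread.
final class ProviderFactory(providingClass: Class[_]) {
  // The reflective constructor lookup happens once, on first use.
  private lazy val providingConstructor: Constructor[_] =
    providingClass.getConstructor()

  // A fresh instance per call, so no state leaks between callers.
  def providingInstance(): Any = providingConstructor.newInstance()
}

object ProviderFactoryDemo {
  // Returns the first counter value of two separately created providers,
  // plus whether the two references point at the same object.
  def run(): (Int, Int, Boolean) = {
    val factory = new ProviderFactory(classOf[CountingProvider])
    val a = factory.providingInstance().asInstanceOf[CountingProvider]
    val b = factory.providingInstance().asInstanceOf[CountingProvider]
    (a.createSource(), b.createSource(), a eq b)
  }
}
```

Under these assumptions each call sees its own counter, so a provider that acquires state later would not affect other callers, while the constructor lookup is no longer repeated.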
Based on the above discussion, none of us is sure whether the contract requires providers to be stateless. So I agree with @srowen: we should not assume statelessness, and should just improve the encapsulation a little.
Test build #105651 has finished for PR 24647 at commit
|
@@ -105,6 +105,8 @@ case class DataSource(
case _ => cls
}
}
private def providingInstance = providingClass.getConstructor().newInstance()
This should be declared like a method for clarity, with a return type. Also add a new line above. I'd use () in the declaration and invocations for clarity too.
If we add a return type, only Any could be used here. Since this method is private, can the return type be omitted?
The type would be something like _ <: DataSource, not Any, but, OK.
I used Any temporarily, because the type of providingClass is Class[_]. If we really want to use _ <: DataSourceRegister, we have to adjust the implementation of DataSource.lookupDataSource.
Test build #105742 has finished for PR 24647 at commit
|
@@ -105,6 +105,9 @@ case class DataSource(
case _ => cls
}
}

private def providingInstance(): Any = providingClass.getConstructor().newInstance()
It's definitely at least an AnyRef. Can you write private def providingInstance[T <: AnyRef](): T? You can even tighten the bound with a cast, as we know the supertype of what's being returned.
However I don't think it's worth it. You can use AnyRef or omit the type here.
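As a rough sketch of the options weighed in this review thread, the following Scala compares an AnyRef return type, a caller-chosen type parameter backed by a cast, and a bounded class parameter that needs no cast. Provider, ConsoleProvider, and ReturnTypeSketch are hypothetical names used only for illustration; they are not Spark's DataSourceRegister or its providers.

```scala
// Hypothetical marker trait standing in for a common provider supertype.
trait Provider { def shortName(): String }

class ConsoleProvider extends Provider {
  override def shortName(): String = "console"
}

object ReturnTypeSketch {
  // With an unbounded Class[_], all we can claim is that the result is an object.
  def newUntyped(cls: Class[_]): AnyRef =
    cls.getConstructor().newInstance().asInstanceOf[AnyRef]

  // A caller-chosen type parameter tightens the declared type, but it relies on
  // an unchecked cast: a wrong T only fails later, where the value is used.
  def newAs[T <: AnyRef](cls: Class[_]): T =
    cls.getConstructor().newInstance().asInstanceOf[T]

  // If the class reference itself is bounded, the compiler can infer the
  // result type and no cast is needed.
  def newBounded[T <: AnyRef](cls: Class[_ <: T]): T =
    cls.getConstructor().newInstance()
}
```

The third form corresponds to the point made earlier in the thread: it only becomes available if the lookup that produces the class returns a bounded Class type rather than Class[_].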
Yes, I think it's not worth it either. I omitted the return type here.
Test build #105811 has finished for PR 24647 at commit
|
I think this ends up being kind of trivial, but OK
Merged to master |
@srowen Thanks for your help. @gaborgsomogyi @HeartSaVioR Thanks for your review. |
What changes were proposed in this pull request?
I checked the code of org.apache.spark.sql.execution.datasources.DataSource and found duplicate Java reflection. The methods sourceSchema, createSource, createSink, resolveRelation, and writeAndRead all call providingClass.getConstructor().newInstance().
The instances of providingClass are stateless, for example:
KafkaSourceProvider
RateSourceProvider
TextSocketSourceProvider
JdbcRelationProvider
ConsoleSinkProvider
AFAIK, Java reflection will result in significant performance issues.
The Oracle website https://docs.oracle.com/javase/tutorial/reflect/index.html contains some description of the performance overhead of Java reflection.
I have found some performance cost tests of Java reflection as follows:
https://blog.frankel.ch/performance-cost-of-reflection/ contains performance cost test.
https://stackoverflow.com/questions/435553/java-reflection-performance has a discussion of java reflection.
So I think we should avoid duplicate Java reflection and reuse the instance of providingClass.
How was this patch tested?
Existing UT.