[SPARK-26739][SQL] Standardized Join Types for DataFrames #24151
agrawalpooja wants to merge 1 commit into apache:master
Conversation
I have a quick question here:
Isn't this a breaking change? Are we allowed to do that? cc @HyukjinKwon
I also have a question about the Python and R APIs. Is there an equivalent mechanism for those APIs?
Yes, of course it does break, and we shouldn't do this.
Technically the MiMa check should fail. If we do this, we should add an overridden definition and probably deprecate the existing one.
@HyukjinKwon do you mean to maintain both definitions for now (the string one as well as the JoinType enum) and then eventually deprecate the string one?
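(For illustration, a minimal sketch of that migration path, assuming a typed overload is added alongside the existing string one; the `TypedJoinOps` wrapper and the space-to-underscore normalisation below are assumptions for this sketch, not code from this PR:)

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.catalyst.plans.JoinType

object TypedJoins {
  // Hypothetical: adds a JoinType-based overload next to the existing
  // String-based Dataset.join; the old signature stays, so binary
  // compatibility holds and could later carry a @deprecated annotation.
  implicit class TypedJoinOps[T](private val ds: Dataset[T]) extends AnyVal {
    def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame =
      // JoinType.sql yields e.g. "LEFT OUTER"; normalise it to the
      // "left_outer" form that the String overload already parses.
      ds.join(right, usingColumns, joinType.sql.replace(' ', '_'))
  }
}
```

Usage would then look like `df1.join(df2, Seq("id"), LeftOuter)` with `org.apache.spark.sql.catalyst.plans.LeftOuter` imported — though, as noted below, that still leaks a catalyst type into user code.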
agrawalpooja force-pushed from d6023a8 to 663dc1b
import org.apache.spark.mllib.optimization.NNLS
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.catalyst.plans._
When I first read the JIRA, I thought you wanted to provide some kind of enum without touching the APIs at all. catalyst is an internal API; it's not supposed to be a user-facing interface either.
@HyukjinKwon yep, initially I created an enum and was using that. But later someone pointed out in the JIRA that we already have a JoinType class which we can reuse here.
Is it fine if I use an enum here?
(The motive is to have standardised join types and detect invalid join types at compile time.)
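(For illustration, a minimal sketch of the enum-style alternative; every name below is hypothetical, not code from this PR:)

```scala
// Hypothetical: a public, enum-style JoinTypes object that avoids exposing
// catalyst; each constant carries the string the current API already accepts.
object JoinTypes {
  sealed abstract class JoinType(val name: String)
  case object Inner      extends JoinType("inner")
  case object Cross      extends JoinType("cross")
  case object LeftOuter  extends JoinType("left_outer")
  case object RightOuter extends JoinType("right_outer")
  case object FullOuter  extends JoinType("full_outer")
  case object LeftSemi   extends JoinType("left_semi")
  case object LeftAnti   extends JoinType("left_anti")
}

// A typo like JoinTypes.Innr is a compile error, whereas the string
// "innr" only fails once the query runs.
```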
I guess this idea probably came from save modes. I don't have a strong opinion on this either.
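(For context, the save-mode precedent: `org.apache.spark.sql.SaveMode` is a public enum, and `DataFrameWriter.mode` accepts either the enum or a string, so the typed form catches typos at compile time. A small usage sketch — the app name and output path are illustrative:)

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("save-mode-precedent").getOrCreate()
val df = spark.range(10).toDF("id")

// Enum constant: a misspelled mode is a compile-time error.
df.write.mode(SaveMode.Overwrite).parquet("/tmp/out")

// String overload: a misspelled mode only fails at runtime.
df.write.mode("overwrite").parquet("/tmp/out")
```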
Can one of the admins verify this patch?

ping @agrawalpooja to update or close

We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
Tries to address the concern mentioned in SPARK-26739.
To summarise: currently, the join functions on DataFrames define the join type via a string parameter called joinType. To know which joins are possible, a developer must look up the API documentation for join. While this works, a typo in the string can produce an improper join and/or an unexpected error that is not evident at compile time. The objective of this improvement is to give developers a common definition of the join types (an enum or constants) called JoinTypes. This would enumerate the possible joins, remove the possibility of a typo, and allow Spark to alter the names of the joins in the future without impacting end-users.
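To make the failure mode concrete, a small spark-shell-style sketch; the `JoinTypes.LeftOuter` call shown in the comment is the proposed API, not something that exists today:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-types").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val right = Seq((2, "x"), (3, "y")).toDF("id", "v2")

// Today: the join type is a free-form string, so this typo compiles fine
// and only fails at runtime with an IllegalArgumentException.
left.join(right, Seq("id"), "left_outter")

// Proposed: a shared definition turns the same mistake into a compile error.
// left.join(right, Seq("id"), JoinTypes.LeftOuter)   // hypothetical API
```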
How was this patch tested?
Tested via unit tests.