Skip to content

[SPARK-26739][SQL] Standardized Join Types for DataFrames#24151

Closed
agrawalpooja wants to merge 1 commit intoapache:masterfrom
agrawalpooja:SPARK-26739-sql-jointype
Closed

[SPARK-26739][SQL] Standardized Join Types for DataFrames#24151
agrawalpooja wants to merge 1 commit intoapache:masterfrom
agrawalpooja:SPARK-26739-sql-jointype

Conversation

@agrawalpooja
Copy link

What changes were proposed in this pull request?

Tries the address the concern mentioned in SPARK-26739
To summarise, currently, in the join functions on DataFrames, the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo resulting in improper joins and/or unexpected errors that aren't evident at compile time. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users.

How was this patch tested?

Tested via Unit tests

@agrawalpooja
Copy link
Author

I have a quick question here:
Should the existing methods with string joinType be deprecated?
In this PR, I have deprecated it for now. But can keep it to maintain backward compatibility, if needed.

Copy link
Contributor

@dilipbiswal dilipbiswal Mar 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a breaking change ? Are we allowed to do that ? cc @HyukjinKwon
I also have a question for Python and R APIs. Is there an equivalent mechanism for those APIs ?

Copy link
Member

@HyukjinKwon HyukjinKwon Mar 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course it does break, and we shouldn't do this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically mima check should fail. If we should do this, we should add an overriden definition and probably deprecate existing one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon do you mean to maintain both the definitions for now (with string as well as JoinType enum)? and then eventually deprecate the string one?

@agrawalpooja agrawalpooja force-pushed the SPARK-26739-sql-jointype branch from d6023a8 to 663dc1b Compare March 20, 2019 07:45
@agrawalpooja agrawalpooja changed the title [SPARK-26739][SQL][WIP] Standardized Join Types for DataFrames [SPARK-26739][SQL] Standardized Join Types for DataFrames Mar 20, 2019
import org.apache.spark.mllib.optimization.NNLS
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.catalyst.plans._
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first read the JIRA, I thought you wanted to provide some kind of enums without fixing APIs at all. catalyst is an internal API. it's not supposed to be a user face interface as well.

Copy link
Author

@agrawalpooja agrawalpooja Mar 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon yep, initially I created an enum and was using that. But, later someone pointed out in JIRA that we already have a JoinType class which we can reuse here.
Is it fine if I use a enum here?
(The motive is to have a standardised join types and detect the invalid join types at compile time itself)

@maropu
Copy link
Member

maropu commented Mar 22, 2019

JoinType is a internal class now, so I think we should discuss whether its ok to expose JoinType first as @HyukjinKwon suggested .... cc: @gatorsmile @cloud-fan

@HyukjinKwon
Copy link
Member

Probably, I guess this idea came from save modes. I don't have strong opinion on this too ..

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

ping @agrawalpooja to update or close

@github-actions
Copy link

github-actions bot commented Jan 1, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 1, 2020
@github-actions github-actions bot closed this Jan 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants