[SPARK-26739][SQL] Standardized Join Types for DataFrames #24151
agrawalpooja wants to merge 1 commit into apache:master
Conversation
I have a quick question here:
Isn't this a breaking change? Are we allowed to do that? cc @HyukjinKwon
I also have a question about the Python and R APIs. Is there an equivalent mechanism for those APIs?
Yes, of course it does break, and we shouldn't do this.
Technically the MiMa check should fail. If we do this, we should add an overridden definition and probably deprecate the existing one.
@HyukjinKwon do you mean to maintain both definitions for now (the string one as well as the JoinType enum) and then eventually deprecate the string one?
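(For illustration, a minimal sketch of that migration path, assuming a typed overload is added alongside the existing string one; the `TypedJoinOps` wrapper and the space-to-underscore normalisation below are assumptions for this sketch, not code from this PR:)

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.catalyst.plans.JoinType

object TypedJoins {
  // Hypothetical: adds a JoinType-based overload next to the existing
  // String-based Dataset.join; the old signature stays, so binary
  // compatibility holds and could later carry a @deprecated annotation.
  implicit class TypedJoinOps[T](private val ds: Dataset[T]) extends AnyVal {
    def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame =
      // JoinType.sql yields e.g. "LEFT OUTER"; normalise it to the
      // "left_outer" form that the String overload already parses.
      ds.join(right, usingColumns, joinType.sql.replace(' ', '_'))
  }
}
```

Usage would then look like `df1.join(df2, Seq("id"), LeftOuter)` with `org.apache.spark.sql.catalyst.plans.LeftOuter` imported — though, as noted below, that still leaks a catalyst type into user code.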
agrawalpooja force-pushed from d6023a8 to 663dc1b
import org.apache.spark.mllib.optimization.NNLS
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.catalyst.plans._
When I first read the JIRA, I thought you wanted to provide some kind of enum without touching the APIs at all. catalyst is an internal API; it's not supposed to be a user-facing interface either.
@HyukjinKwon yep, initially I created an enum and was using that. But later someone pointed out in the JIRA that we already have a JoinType class which we can reuse here.
Is it fine if I use an enum here?
(The motive is to have standardised join types and detect invalid join types at compile time.)
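(For illustration, a minimal sketch of the enum-style alternative; every name below is hypothetical, not code from this PR:)

```scala
// Hypothetical: a public, enum-style JoinTypes object that avoids exposing
// catalyst; each constant carries the string the current API already accepts.
object JoinTypes {
  sealed abstract class JoinType(val name: String)
  case object Inner      extends JoinType("inner")
  case object Cross      extends JoinType("cross")
  case object LeftOuter  extends JoinType("left_outer")
  case object RightOuter extends JoinType("right_outer")
  case object FullOuter  extends JoinType("full_outer")
  case object LeftSemi   extends JoinType("left_semi")
  case object LeftAnti   extends JoinType("left_anti")
}

// A typo like JoinTypes.Innr is a compile error, whereas the string
// "innr" only fails once the query runs.
```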
I guess this idea probably came from save modes. I don't have a strong opinion on this either.
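(For context, the save-mode precedent: `org.apache.spark.sql.SaveMode` is a public enum, and `DataFrameWriter.mode` accepts either the enum or a string, so the typed form catches typos at compile time. A small usage sketch — the app name and output path are illustrative:)

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("save-mode-precedent").getOrCreate()
val df = spark.range(10).toDF("id")

// Enum constant: a misspelled mode is a compile-time error.
df.write.mode(SaveMode.Overwrite).parquet("/tmp/out")

// String overload: a misspelled mode only fails at runtime.
df.write.mode("overwrite").parquet("/tmp/out")
```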
Can one of the admins verify this patch?

ping @agrawalpooja to update or close

We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
Tries to address the concern mentioned in SPARK-26739.
To summarise: currently, the join functions on DataFrames define the join type via a string parameter called joinType. To know which joins are possible, a developer must look up the API documentation for join. While this works, a typo in the string can produce an improper join and/or an unexpected error that is not evident at compile time. The objective of this improvement is to give developers a common definition of the join types (an enum or constants) called JoinTypes. This would enumerate the possible joins, remove the possibility of a typo, and allow Spark to alter the names of the joins in the future without impacting end-users.
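To make the failure mode concrete, a small spark-shell-style sketch; the `JoinTypes.LeftOuter` call shown in the comment is the proposed API, not something that exists today:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-types").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val right = Seq((2, "x"), (3, "y")).toDF("id", "v2")

// Today: the join type is a free-form string, so this typo compiles fine
// and only fails at runtime with an IllegalArgumentException.
left.join(right, Seq("id"), "left_outter")

// Proposed: a shared definition turns the same mistake into a compile error.
// left.join(right, Seq("id"), JoinTypes.LeftOuter)   // hypothetical API
```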
How was this patch tested?
Tested via unit tests.