[SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates #28830

Closed
wants to merge 3 commits into from

maropu
Member

@maropu maropu commented Jun 15, 2020

What changes were proposed in this pull request?

This PR partially reverts SPARK-31292 to provide a hot-fix for a bug in Dataset.dropDuplicates; we must preserve the order of colNames that was previously used for groupCols because the Structured Streaming state store depends on the groupCols order.

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests in DataFrameSuite.

@maropu
Member Author

maropu commented Jun 15, 2020

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 15, 2020

Thank you for the swift fix, @maropu.
cc @dbtsai and @holdenk

@xuanyuanking
Member

xuanyuanking commented Jun 15, 2020 via email

@HeartSaVioR
Contributor

Let's use [SS] instead, as this is an SS-specific issue.

@maropu changed the title from "[SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates" to "[SPARK-31990][SQL][SS] Preserves the input order of colNames in dropDuplicates" on Jun 15, 2020
@maropu changed the title from "[SPARK-31990][SQL][SS] Preserves the input order of colNames in dropDuplicates" to "[SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates" on Jun 15, 2020
@dongjoon-hyun
Member

dongjoon-hyun commented Jun 15, 2020

Hi, all.
This issue is marked as a hotfix for the blocker issue, but validating it looks non-trivial. toSet.toSeq has been used since Apache Spark 2.2.0 (SPARK-19497), and SPARK-31292 is just an Improvement issue with Trivial priority, so I'd like to propose reverting SPARK-31292 from branch-3.0 first. We will still keep SPARK-31292 in the master branch and proceed with @maropu's PR to find a better way for Apache Spark 3.1.0.

I know that reverting is not a good solution for the original author, as @HeartSaVioR mentioned on the dev mailing list, but I believe it is the proper way in this case to cut Apache Spark 3.0.1. What do you think?

@gatorsmile
Member

Yes. I prefer reverting the original fix in 3.0.1 and then discussing how to solve or avoid the problems in a proper way.

@maropu
Member Author

maropu commented Jun 15, 2020

okay, I'll revert that part in this PR first.

Member

@xuanyuanking xuanyuanking left a comment

Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix by combining it together with #28707. WDYT? I'll also reference this PR with #28707.

This reverts commit 3e02ad6.
@HeartSaVioR
Contributor

HeartSaVioR commented Jun 15, 2020

+1 to the partial revert, which should also be OK with the author. (I guess it was applied simply by pattern and wasn't an intended improvement, so it should be no problem for the author either.)

@xuanyuanking
Member

Yep, I think just reverting that part is good enough. I will give more context and details on #28707.

@dongjoon-hyun
Member

Ya. +1 for partial revert in this PR.

@maropu changed the title from "[SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates" to "[SPARK-31990][SPARK-31292][SS] Revert: Use toSet.toSeq in Dataset.dropDuplicates" on Jun 15, 2020
@dongjoon-hyun changed the title from "[SPARK-31990][SPARK-31292][SS] Revert: Use toSet.toSeq in Dataset.dropDuplicates" to "[SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates" on Jun 15, 2020
@maropu
Member Author

maropu commented Jun 15, 2020

Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix by combining it together with #28707. WDYT? I'll also reference this PR with #28707.

@xuanyuanking Yeah, looks fine to me. Could you take this over? Thanks, anyway!

Member

@xuanyuanking xuanyuanking left a comment

@maropu Sure, I'm writing comments in #28707. Will cc all of us and reference it with SPARK-31990 and this PR soon.

@HeartSaVioR
Contributor

HeartSaVioR commented Jun 15, 2020

How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker?

Things would be simpler if we merged the partial revert as it is and spent our efforts discussing how to communicate known issues; this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0, so it is worth having its own JIRA issue and its own commit. Sure, this may need to go into the migration guide or release notes as well.

It would do no harm for #28707 to wait for this patch to be merged and then rebase to fix the test failure.

@xuanyuanking
Member

How do we plan to consolidate both? How will we write the JIRA title/description and the PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker?

Here's my plan to consolidate both: #28707 (comment). I will also add it to the JIRA and the PR description.
Yes, #28707 is blocked by this fix.

@HyukjinKwon
Member

I am okay with reverting it for now, but I couldn't fully follow why we expect an explicit order from a set. Has it ever been guaranteed somewhere? Using distinct, we can expect a deterministic order, but we're reverting to a set because of the deterministic order (?).
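
To make the distinction raised here concrete, the following is a minimal sketch in plain Scala (no Spark involved). The object name and the column names are hypothetical and chosen only for illustration. It shows that `distinct` keeps first occurrences in input order, while `toSet.toSeq` yields the same elements in whatever order the `Set` implementation happens to iterate, which the collection API does not specify.

```scala
object ColumnOrderSketch {
  // Hypothetical input: column names with a repeat, as a caller might pass them.
  val colNames: Seq[String] =
    Seq("ts", "user", "ts", "device", "region", "session")

  // `distinct` keeps the first occurrence of each name, in input order.
  val viaDistinct: Seq[String] = colNames.distinct

  // `toSet.toSeq` also removes repeats, but the resulting order is whatever
  // the Set implementation iterates in; nothing here relies on that order.
  val viaSet: Seq[String] = colNames.toSet.toSeq

  def main(args: Array[String]): Unit = {
    println(s"distinct:    $viaDistinct")
    println(s"toSet.toSeq: $viaSet")
  }
}
```

Both results contain the same five names; only the ordering contract differs. Running this under different Scala versions is one way to check whether the `toSet.toSeq` order is stable across an upgrade.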

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 15, 2020

The last commit tries to preserve the previous behavior (whatever it was) of Apache Spark 2.2.0 ~ 2.4.6, although there is no guarantee that it is safe. We will revisit the correct way later, after 3.0.1.

@HyukjinKwon
Member

HyukjinKwon commented Jun 15, 2020

I am okay with that; I was just wondering whether the previous behaviour was even deterministic, and SPARK-31292 looked more correct to me. Given that we're going to revisit this anyway, LGTM from me too.

-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>
Contributor

@cloud-fan cloud-fan Jun 15, 2020

I think this is good for now.

In the future, this may still be broken by a Scala version upgrade, and hopefully @xuanyuanking's unsafe row validation can detect it. Then we can change it to use a deterministic order, as it will be broken anyway.

Member

Yep, I also mentioned this at #28830 (comment); we can rely on the validation check and integration tests.

Contributor

@HeartSaVioR HeartSaVioR Jun 15, 2020

Worth noting that we need a "concrete" solution eventually: if all columns have the same type, neither #28830 nor #24173 catches the change, and the result becomes silently incorrect. I roughly remember a similar issue in PySpark, which was about order vs. name; I don't remember how it ended up. cc @HyukjinKwon
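
The same-type case described above can be sketched without Spark. This is only an illustrative model under stated assumptions: `encodeKey` is a hypothetical stand-in for a positional state-store key (values only, no column names), and the two column orders are made up. It is not how the state store is actually implemented.

```scala
object StateKeyOrderSketch {
  type Record = Map[String, Int]

  // Hypothetical stand-in for a state-store key: values laid out by position,
  // with no column names stored alongside them.
  def encodeKey(groupCols: Seq[String], record: Record): Seq[Int] =
    groupCols.map(record)

  val oldOrder: Seq[String] = Seq("a", "b") // order in effect when state was written
  val newOrder: Seq[String] = Seq("b", "a") // order after the iteration order changes

  // A record seen before the restart; its key is persisted using the old order.
  val seen: Record = Map("a" -> 1, "b" -> 2)
  val storedKeys: Set[Seq[Int]] = Set(encodeKey(oldOrder, seen))

  // The same record arrives again: it should be recognized as a duplicate.
  val repeatDetected: Boolean = storedKeys.contains(encodeKey(newOrder, seen))

  // A different record whose values mirror the first one.
  val mirrored: Record = Map("a" -> 2, "b" -> 1)
  val falseDuplicate: Boolean = storedKeys.contains(encodeKey(newOrder, mirrored))
}
```

In this model `repeatDetected` is false and `falseDuplicate` is true. Because both columns are integers, a purely type-based row check would see nothing wrong with either key, which is the silent incorrectness the comment warns about.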

Member

Ah, that was fixed, in a way, by adding an env variable. That case was also specific to Python 2, which is deprecated now, so it's rather a corner case.

Contributor

@HeartSaVioR HeartSaVioR left a comment

LGTM as it is.

@HeartSaVioR
Contributor

Btw now we know it is broken in Spark 3.0.0, and we will fix it again in Spark 3.0.1. Do we have a best practice to follow for communicating such a change to end users?

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124020 has finished for PR 28830 at commit 3e02ad6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124027 has finished for PR 28830 at commit 7546ba4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

retest this please

@cloud-fan
Contributor

Btw now we know it is broken in Spark 3.0.0, and we will fix it again in Spark 3.0.1.

I think we should list it as a known issue of 3.0.0, and release 3.0.1 soon.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124036 has finished for PR 28830 at commit 7546ba4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jun 15, 2020
### What changes were proposed in this pull request?

This PR partially reverts SPARK-31292 to provide a hot-fix for a bug in `Dataset.dropDuplicates`; we must preserve the order of `colNames` that was previously used for `groupCols` because the Structured Streaming state store depends on the `groupCols` order.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests in `DataFrameSuite`.

Closes #28830 from maropu/SPARK-31990.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7f7b4dd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Thank you all. Merged to master/3.0.

@maropu
Member Author

maropu commented Jun 15, 2020

Thanks, all!

holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020