[SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates #28830

maropu · 2020-06-15T01:21:52Z

What changes were proposed in this pull request?

This PR partially reverts SPARK-31292 in order to provide a hot-fix for a bug in Dataset.dropDuplicates; we must preserve the input order of colNames for groupCols because the streaming state store depends on the groupCols order.

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests in DataFrameSuite.

maropu · 2020-06-15T01:26:27Z

cc: @xuanyuanking @HeartSaVioR @srowen @gatorsmile

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

dongjoon-hyun · 2020-06-15T01:54:00Z

Thank you for making a fix swiftly, @maropu .
cc @dbtsai and @holdenk

xuanyuanking · 2020-06-15T02:00:07Z

Yes, this incompatibility bug was found by a WIP validation logic. I will reply with the details and reference the PR soon.

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

HeartSaVioR · 2020-06-15T02:34:00Z

Let's use [SS] instead, as this is an SS-specific issue.

dongjoon-hyun · 2020-06-15T03:11:00Z

Hi, All.
This issue is marked as a hotfix for the blocker issue, but validating it looks non-trivial. Since toSet.toSeq has been used since Apache Spark 2.2.0 (SPARK-19497) and SPARK-31292 is just an Improvement issue with Trivial priority, I'd like to propose reverting SPARK-31292 from branch-3.0 first. We will still keep SPARK-31292 in the master branch and proceed with @maropu's PR to find a better way for Apache Spark 3.1.0.

I know that reverting is not a good solution for the original author, as mentioned by @HeartSaVioR on the dev mailing list, but I believe it is the proper way in this case to cut Apache Spark 3.0.1. What do you think about that?

gatorsmile · 2020-06-15T03:13:05Z

Yes. I prefer reverting the original fix in 3.0.1 and then discussing how to solve/avoid the problems in a proper way.

maropu · 2020-06-15T03:13:36Z

okay, I'll revert that part in this PR first.

xuanyuanking

Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix by combining it together with #28707. WDYT? I'll also reference this PR with #28707.

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

This reverts commit 3e02ad6.

HeartSaVioR · 2020-06-15T03:18:12Z

+1 to the partial revert, which should also be OK with the author. (I guess it was applied simply by pattern and wasn't an intended improvement, so it should be no problem for the author either.)

xuanyuanking · 2020-06-15T03:18:16Z

Yep, I think just reverting that part is good enough. I will give more context and details on #28707.

dongjoon-hyun · 2020-06-15T03:18:38Z

Ya. +1 for partial revert in this PR.

maropu · 2020-06-15T03:39:59Z

Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix by combining it together with #28707. WDYT? I'll also reference this PR with #28707.

@xuanyuanking yea, looks fine to me. Could you take this over? Thanks, anyway!

xuanyuanking

@maropu Sure, I'm writing comments in #28707. Will cc all of us and reference it with SPARK-31990 and this PR soon.

HeartSaVioR · 2020-06-15T03:57:14Z

How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker?

Things would be simpler if we merge the partial revert as it is and spend our efforts on discussing how to communicate known issues; this is one of the candidates for Spark 3.0.0. This is clearly a bugfix for a "blocker" that prevents some end users from migrating to Spark 3.0.0, so it deserves its own JIRA issue and its own commit. Sure, this may need to be mentioned in the migration guide or release notes as well.

It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure.

xuanyuanking · 2020-06-15T05:49:34Z

How do we plan to consolidate both? How will we write the JIRA title/description and PR title/description? What is the type of the consolidated issue? Is the consolidated issue a blocker?

Here's my plan to consolidate both: #28707 (comment). I will also add this to the JIRA and the PR description.
Yes, #28707 is blocked by this fix.

HyukjinKwon · 2020-06-15T06:21:50Z

I am okay with reverting it for now, but I couldn't fully follow why we expect an explicit order from a set. Has it ever been guaranteed anywhere? Using distinct, we can expect a deterministic order, but we're reverting to a set because of the deterministic order (?).
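To make the question concrete, here is a minimal sketch in plain Scala (no Spark involved) of the two deduplication styles under discussion. The object and method names are hypothetical and only for illustration; they are not taken from Dataset.scala.

```scala
object DedupOrder {
  // Deduplicate while keeping first-occurrence order (the style SPARK-31292 introduced).
  def viaDistinct(colNames: Seq[String]): Seq[String] = colNames.distinct

  // Deduplicate through an unordered Set (the style used since Spark 2.2.0).
  // The iteration order is an implementation detail of the Set, not a documented contract.
  def viaSet(colNames: Seq[String]): Seq[String] = colNames.toSet.toSeq

  def main(args: Array[String]): Unit = {
    val cols = Seq("a", "b", "c", "d", "e", "a")
    println(viaDistinct(cols)) // List(a, b, c, d, e)
    println(viaSet(cols))      // same five elements; order depends on the Set implementation
  }
}
```

As far as I know, small immutable sets (up to four elements) in the Scala 2.12 and 2.13 standard library happen to iterate in insertion order, while larger ones are hash-based, so the two styles can agree for short column lists and diverge for longer ones. Neither behavior is something the Set API promises.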

dongjoon-hyun · 2020-06-15T06:25:07Z

The last commit tries to preserve the previous behavior (whatever it was) of Apache Spark 2.2.0 ~ 2.4.6, although there is no guarantee as to which is safe. We will revisit the correct way later, after 3.0.1.

HyukjinKwon · 2020-06-15T06:27:55Z

I am okay with that; I was just wondering whether the previous behaviour was even deterministic, and SPARK-31292 looked more correct to me. Given that we're going to revisit it anyway, LGTM from me too.

cloud-fan · 2020-06-15T06:33:28Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>


I think this is good for now.

In the future, this may still be broken by a Scala version upgrade, and hopefully @xuanyuanking's unsafe row validation can detect it. Then we can change it and use a deterministic order, as it will be broken anyway.

Yep, I also mentioned this at #28830 (comment); we can rely on the validation check and the integration tests.

Worth noting that we eventually need a "concrete" solution: if the columns all have the same type, neither #28830 nor #24173 catches the change and the result becomes silently incorrect. I roughly remember a similar issue in PySpark, which tried to fix an order-vs-name problem; I don't remember how it ended up. cc @HyukjinKwon

Ah, that was fixed in a way by adding an env variable. That case also was specific to Python 2 which is deprecated now so it's rather a corner case.
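The silent-corruption concern raised above (all grouping columns sharing one type) can be sketched in plain Scala. This is a toy model under stated assumptions, not Spark's state store: `Row`, `key`, and `typesMatch` are hypothetical helpers, and the "state" is just a set of positional keys.

```scala
object KeyOrderSketch {
  // Hypothetical stand-in for a row: column name -> value (every column is a string).
  type Row = Map[String, String]

  // Build a positional key from the grouping columns, in the given order.
  def key(row: Row, groupCols: Seq[String]): Seq[String] = groupCols.map(c => row(c))

  // A type-only compatibility check: column names and their order are invisible to it.
  def typesMatch(oldTypes: Seq[String], newTypes: Seq[String]): Boolean =
    oldTypes == newTypes

  val oldOrder: Seq[String] = Seq("second", "first") // order in effect when state was written
  val newOrder: Seq[String] = Seq("first", "second") // order after a hypothetical upgrade

  def main(args: Array[String]): Unit = {
    val seen: Row = Map("first" -> "x", "second" -> "y")
    val state: Set[Seq[String]] = Set(key(seen, oldOrder)) // stores Seq("y", "x")

    // Both layouts are (String, String), so a type-only check passes.
    println(typesMatch(Seq("String", "String"), Seq("String", "String"))) // true

    // The same row is no longer recognized as a duplicate...
    println(state.contains(key(seen, newOrder))) // false

    // ...while a different row collides with the stored key.
    val other: Row = Map("first" -> "y", "second" -> "x")
    println(state.contains(key(other, newOrder))) // true
  }
}
```

In this model nothing fails loudly: a check that only looks at types accepts both layouts, yet lookups miss real duplicates and match unrelated rows.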

HeartSaVioR

LGTM as it is.

HeartSaVioR · 2020-06-15T07:01:51Z

Btw now we know it is broken in Spark 3.0.0, and we will fix it again in Spark 3.0.1. Do we have a best practice for communicating such a change to end users?

SparkQA · 2020-06-15T07:05:02Z

Test build #124020 has finished for PR 28830 at commit 3e02ad6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-15T07:05:02Z

Test build #124027 has finished for PR 28830 at commit 7546ba4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2020-06-15T07:09:50Z

retest this please

cloud-fan · 2020-06-15T07:19:17Z

Btw now we know it is broken in Spark 3.0.0, and we will fix it again in Spark 3.0.1.

I think we should list it as a known issue of 3.0.0, and release 3.0.1 soon.

SparkQA · 2020-06-15T13:19:54Z

Test build #124036 has finished for PR 28830 at commit 7546ba4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

### What changes were proposed in this pull request?

This PR partially revert SPARK-31292 in order to provide a hot-fix for a bug in `Dataset.dropDuplicates`; we must preserve the input order of `colNames` for `groupCols` because the Streaming's state store depends on the `groupCols` order.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests in `DataFrameSuite`.

Closes #28830 from maropu/SPARK-31990.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7f7b4dd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2020-06-15T14:49:26Z

Thank you all. Merged to master/3.0.

maropu · 2020-06-15T15:13:03Z

Thanks, all!

Fix

3e02ad6

probot-autolabeler bot added the SQL label Jun 15, 2020

maropu commented Jun 15, 2020

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

dongjoon-hyun reviewed Jun 15, 2020

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

maropu changed the title ~~[SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates~~ [SPARK-31990][SQL][SS] Preserves the input order of colNames in dropDuplicates Jun 15, 2020

maropu changed the title ~~[SPARK-31990][SQL][SS] Preserves the input order of colNames in dropDuplicates~~ [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates Jun 15, 2020

xuanyuanking reviewed Jun 15, 2020

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

Revert "Fix"

2bcb5c5

This reverts commit 3e02ad6.

maropu changed the title ~~[SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates~~ [SPARK-31990][SPARK-31292][SS] Revert: Use toSet.toSeq in Dataset.dropDuplicates Jun 15, 2020

Revert

7546ba4

maropu force-pushed the SPARK-31990 branch from 0990c1b to 7546ba4 June 15, 2020 03:24

dongjoon-hyun changed the title ~~[SPARK-31990][SPARK-31292][SS] Revert: Use toSet.toSeq in Dataset.dropDuplicates~~ [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates Jun 15, 2020

xuanyuanking approved these changes Jun 15, 2020

xuanyuanking mentioned this pull request Jun 15, 2020

[SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store #28707

Closed

dongjoon-hyun approved these changes Jun 15, 2020

HyukjinKwon approved these changes Jun 15, 2020

cloud-fan reviewed Jun 15, 2020

HeartSaVioR approved these changes Jun 15, 2020

gengliangwang approved these changes Jun 15, 2020

dongjoon-hyun closed this in 7f7b4dd Jun 15, 2020

xuanyuanking mentioned this pull request Jun 17, 2020

[SPARK-31905][SS] Add compatibility tests for streaming state store format #28725

Closed
