-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates #28830
Conversation
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Outdated
Show resolved
Hide resolved
Yes, this incompatible bug is found by a WIP validation logic. I will reply
the details and reference the PR soon.
Dongjoon Hyun <notifications@github.com>于2020年6月15日 周一09:54写道:
… Thank you for making a swift fix, @maropu <https://github.com/maropu> .
cc @dbtsai <https://github.com/dbtsai> and @holdenk
<https://github.com/holdenk>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28830 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABE4DZNXKFRD2FJJRKBWICDRWV5MJANCNFSM4N5W4OKA>
.
|
Let's use |
Hi, All. I know that the reverting is not a good solution for the original author as mentioned by @HeartSaVioR in the dev mailing list, but I believe that is the proper way in this case to cut Apache Spark 3.0.1. How do you think about that? |
Yes. I prefer to reverting the original fix in 3.0.1. and then discuss how to solve/avoid the problems in a proper way. |
okay, I'll revert that part in this PR first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Outdated
Show resolved
Hide resolved
This reverts commit 3e02ad6.
+1 to partial revert which should be also OK with author. (I guess it was applied simply by pattern, and it wasn’t for some intended improvement, so no problem for author as well.) |
Yep, I think just revert that part is good enough. I will give more context and details on #28707. |
Ya. +1 for partial revert in this PR. |
@xuanyuanking yea, looks fine to me. Could you takes this over? Thanks, anyway! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Sure, I'm writing comments in #28707. Will cc all of us and reference it with SPARK-31990 and this PR soon.
How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker? Things would be simpler if we merge the partial revert as it is, and spend our efforts to discuss how to guide known issues - this is one of candidates for Spark 3.0.0. This is clearly a bugfix which is a "blocker" preventing some of end users migrate to Spark 3.0.0, worth to have its own JIRA issue, and also commit. Sure, this may need to be placed on migration guide or release note as well. It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure. |
Here's my plan to consolidate both: #28707 (comment), this will also comment in JIRA & PR description. |
I am okay to revert it for now but I couldn't fully follow why we expect an explicit order from a set. Has it been ever guaranteed somewhere? Using |
The last commit is to trying to preserve the previous behavior (whatever it was) of Apache Spark 2.2.0 ~ 2.4.6 although there is no guarantee which is safe or not. We will revisit the correct way later after 3.0.1. |
I am okay with that; I was just wondering even the previous behaviour was deterministic or not, and SPARK-31292 looked righter to me. Given that we're going to revisit anyway, LGTM from me too. |
val groupCols = colNames.distinct.flatMap { (colName: String) => | ||
// SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue | ||
// (the Streaming's state store depends on the `groupCols` order). | ||
val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good for now.
In the future, this may still be broken by Scala version upgrade, and hopefully @xuanyuanking 's unsafe row validation can detect it. Then we can change it and use a deterministic order, as it will be broken anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I also mentioned this at #28830 (comment), we can relay on the validation checking and integrated tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting that we need to have "concrete" solution eventually - if columns are all having same type neither #28830 nor #24173 catch the change and the result becomes silently incorrect. I roughly remember the similar issue on pyspark, which was trying to fix the issue on order vs name, don't remember how it ended up. cc. @HyukjinKwon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that was fixed in a way by adding an env variable. That case also was specific to Python 2 which is deprecated now so it's rather a corner case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as it is.
Btw now we know it is broken in Spark 3.0.0, and we will fix it again in Spark 3.0.1. Do we have some best practice to follow on guiding such change to end users? |
Test build #124020 has finished for PR 28830 at commit
|
Test build #124027 has finished for PR 28830 at commit
|
retest this please |
I think we should list it as a known issue of 3.0.0, and release 3.0.1 soon. |
Test build #124036 has finished for PR 28830 at commit
|
### What changes were proposed in this pull request? This PR partially revert SPARK-31292 in order to provide a hot-fix for a bug in `Dataset.dropDuplicates`; we must preserve the input order of `colNames` for `groupCols` because the Streaming's state store depends on the `groupCols` order. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests in `DataFrameSuite`. Closes #28830 from maropu/SPARK-31990. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7f7b4dd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Thank you all. Merged to master/3.0. |
Thanks, all! |
### What changes were proposed in this pull request? This PR partially revert SPARK-31292 in order to provide a hot-fix for a bug in `Dataset.dropDuplicates`; we must preserve the input order of `colNames` for `groupCols` because the Streaming's state store depends on the `groupCols` order. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests in `DataFrameSuite`. Closes apache#28830 from maropu/SPARK-31990. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7f7b4dd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR partially revert SPARK-31292 in order to provide a hot-fix for a bug in
Dataset.dropDuplicates
; we must preserve the input order ofcolNames
forgroupCols
because the Streaming's state store depends on thegroupCols
order.Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added tests in
DataFrameSuite
.