[SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior #27478
xuanyuanking wants to merge 7 commits into apache:master from xuanyuanking:SPARK-25892-follow
Test build #117998 has finished for PR 27478 at commit
Retest this please.
Can we have one test that checks the exception when this configuration is false?
Thanks, added in ArrayBasedMapBuilderSuite.
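For readers skimming the thread, the shape of that exception check can be sketched as a standalone Python model (the real test is Scala code in `ArrayBasedMapBuilderSuite`; the toy builder below and all of its names are invented for illustration only):

```python
class ArrayBasedMapBuilder:
    """Toy stand-in for Spark's ArrayBasedMapBuilder: collects key/value
    pairs and detects duplicated keys as they are put."""

    def __init__(self, fail_on_duplicates=True):
        self.fail_on_duplicates = fail_on_duplicates
        self.entries = {}

    def put(self, key, value):
        if key in self.entries and self.fail_on_duplicates:
            raise RuntimeError(f"Duplicate map key {key!r} was found")
        self.entries[key] = value  # last occurrence wins when duplicates are allowed


def test_fails_on_duplicate_keys_by_default():
    builder = ArrayBasedMapBuilder()
    builder.put(1, "a")
    try:
        builder.put(1, "b")
    except RuntimeError:
        return  # expected: the duplicated key is rejected
    raise AssertionError("expected a RuntimeError for the duplicated key")


test_fails_on_duplicate_keys_by_default()
```

With `fail_on_duplicates=False` the same builder instead keeps the last value per key, which mirrors the legacy last-wins behavior discussed below.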
Test build #118005 has finished for PR 27478 at commit
docs/sql-migration-guide.md
Hi, @cloud-fan, @gatorsmile, @rxin, @marmbrus.
So, throwing a RuntimeException by default is better for Apache Spark 3.0.0?
The principle is still no failure by default, but a version upgrade is a different case. We should try our best to avoid "silent result changing".
We may want to introduce a new category of configs that are for migration only and should be removed in the next major version.
Thoughts? @srowen @viirya @maropu @bart-samwel
Should we mention that this spark.sql.deduplicateMapKey.lastWinsPolicy.enabled config is only useful for migration and could be removed in a future version?
This idea sounds reasonable. If we think a behavior change is correct but, for migration reasons, we don't want it to be silent, a config like this one is an option we can use to notify users of the change.
I agree with the idea of avoiding "silent result changing". Btw, couldn't we keep the old (Spark 2.4) behaviour for duplicate keys behind a legacy option? If we can't for some reason, the proposed one (the runtime exception) looks reasonable to me, too.
+1 for the idea of highlighting that this config is migration-only.
I'm okay with the rule, @cloud-fan. At this time, could you add a guideline to our website? Actually, many correctness bug fixes in the migration guide have been silent result changes with a legacy fallback config.
Agreed on having a guideline if we need the new rule, to clarify that the silent result change here is caused by the behavior change, not by bug fixes.
What I have in mind is:
- if it's a bug fix (the previous result is definitely wrong), then no config is needed. If the impact is big, we can add a legacy config which is false by default.
- if it makes the behavior better, we should either add a config and use the old behavior by default, or fail by default and ask users to set config explicitly and pick the desired behavior.
I'm trying to think of more cases; I will send an email to the dev list soon.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Test build #118040 has finished for PR 27478 at commit
Test build #118052 has finished for PR 27478 at commit
I google searched "deduplicate map key" and there is no matching result.
Thanks Gengliang; as the config is still related to legacy behavior, I renamed it to
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
LGTM, can you update the PR title as well?
Test build #118423 has finished for PR 27478 at commit
The PR title was changed from `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled and change the default behavior` to `spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior`.
Test build #118468 has finished for PR 27478 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
Can you check the map-related builtin expressions? It seems they use wrong examples that have duplicated keys.
Test build #118547 has finished for PR 27478 at commit
retest this please
Test build #118537 has finished for PR 27478 at commit
retest it please
retest this please
Test build #118567 has finished for PR 27478 at commit
retest it please
retest this please
Test build #118551 has finished for PR 27478 at commit
thanks, merging to master/3.0!
…s` and change the default behavior

### What changes were proposed in this pull request?
This is a follow-up for #23124, adding a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark will throw a RuntimeException when duplicated keys are found.

### Why are the changes needed?
Prevent silent behavior changes.

### Does this PR introduce any user-facing change?
Yes, a new config is added, and the default behavior for duplicated map keys is changed to throwing a RuntimeException.

### How was this patch tested?
Modified existing UTs.

Closes #27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ab186e3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Test build #118578 has finished for PR 27478 at commit
Thanks all for the review.
What changes were proposed in this pull request?
This is a follow-up for #23124, adding a new config
`spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark will throw a RuntimeException when duplicated keys are found.
Why are the changes needed?
Prevent silent behavior changes.
Does this PR introduce any user-facing change?
Yes, a new config is added, and the default behavior for duplicated map keys is changed to throwing a RuntimeException.
How was this patch tested?
Modified existing UTs.
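To make the user-facing change concrete, here is a minimal self-contained Python model of the two modes. This only illustrates the semantics described above; `create_map` and its keyword argument are invented stand-ins for Spark's built-in map functions and the `spark.sql.legacy.allowDuplicatedMapKeys` config, not real Spark APIs:

```python
def create_map(entries, allow_duplicated_map_keys=False):
    """Toy model of duplicate-map-key handling in Spark 3.0's built-in
    map functions.

    allow_duplicated_map_keys mirrors spark.sql.legacy.allowDuplicatedMapKeys:
      False (new default): a duplicated key raises a RuntimeError.
      True (legacy): duplicates are removed with a last-wins policy.
    """
    result = {}
    for key, value in entries:
        if key in result and not allow_duplicated_map_keys:
            raise RuntimeError(f"Duplicate map key {key!r} was found")
        result[key] = value  # later entries overwrite earlier ones
    return result


# New default: duplicated keys fail fast instead of silently changing
# the result between versions.
try:
    create_map([("a", 1), ("a", 2)])
except RuntimeError as e:
    print(e)  # Duplicate map key 'a' was found

# Legacy behavior (config set to true): the last value wins.
print(create_map([("a", 1), ("a", 2)], allow_duplicated_map_keys=True))  # {'a': 2}
```

This is the trade-off discussed in the thread: failing by default avoids a silent result change on upgrade, while the legacy config gives migrating users an explicit escape hatch.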