[SPARK-37575][SQL][FOLLOWUP] Update migration guide for null values saving in CSV data source #34905
itholic wants to merge 2 commits into apache:master from itholic:SPARK-37575-followup
Test build #146213 has finished for PR 34905 at commit
Kubernetes integration test starting
- Since Spark 3.3, the `strfmt` in `format_string(strfmt, obj, ...)` and `printf(strfmt, obj, ...)` no longer supports using `0$` to specify the first argument. The first argument should always be referenced by `1$` when an argument index is used to indicate the position of the argument in the argument list.
- Since Spark 3.3, nulls are written as empty strings in CSV data source by default. In Spark 3.2 or earlier, nulls were written as quoted empty strings, `""`. To restore the previous behavior, set `nullValue` to `""`.
To restore the previous behavior, set `nullValue` to `""`
Actually, it is correct, but if a user sets the option naively, as recommended:
scala> val df = Seq("abc", null, "def").toDF()
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.repartition(1).write.option("nullValue", "").mode("overwrite").csv("/Users/maximgekk/tmp/csv3")
$ csv3 cat ./part-00000-5830ac7c-3653-41ec-a2f7-c56934ef56d9-c000.csv
abc
def
but:
scala> df.repartition(1).write.option("nullValue", "\"\"").mode("overwrite").csv("/Users/maximgekk/tmp/csv4")
$ csv4 cat ./part-00000-6a5b0628-8924-4300-9699-89f4df903db9-c000.csv
abc
""
def
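The comparison above can be reproduced without a shell by reading the written files back as plain text. The following is only a minimal sketch, assuming a local Spark 3.3+ session; the object name `NullValueCsvDemo`, the temporary directory, and the `writeAndShow` helper are hypothetical names introduced for illustration and are not part of the PR.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; not part of the Spark code base.
object NullValueCsvDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-value-csv-demo")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the review comment: one null between two strings.
    val df = Seq("abc", null, "def").toDF()

    // Assumption: a throwaway temp directory stands in for the reviewer's paths.
    val base = Files.createTempDirectory("csv-null-demo").toString

    // Writes the frame with the given nullValue, then prints each raw line
    // in brackets so that empty lines stay visible.
    def writeAndShow(label: String, nullValue: String): Unit = {
      val path = s"$base/$label"
      df.repartition(1)
        .write
        .option("nullValue", nullValue)
        .mode("overwrite")
        .csv(path)
      println(s"--- nullValue = [$nullValue]")
      spark.read.text(path).collect().foreach { row =>
        println(s"[${row.getString(0)}]")
      }
    }

    writeAndShow("empty", "")        // the option value is a zero-length string
    writeAndShow("quoted", "\"\"")   // the option value is two quote characters

    spark.stop()
  }
}
```

The point of the two calls is the same as in the comment: the Scala literal `""` is an empty option value, whereas `"\"\""` passes the two quote characters themselves.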
Let me just merge as is ...
Test build #146217 has finished for PR 34905 at commit
Kubernetes integration test status failure
Merged to master.
Kubernetes integration test starting
Kubernetes integration test status failure
…of write null value in csv to unquoted empty string

### What changes were proposed in this pull request?
Add a legacy flag `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` for the breaking change introduced in #34853 and #34905 (followup). The flag is disabled by default, so null values written to CSV are output as unquoted empty strings. When the legacy flag is enabled, nulls are output as quoted empty strings.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes are encouraged to come with a flag that turns them off, for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference. If users turn this conf on, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36110 from anchovYu/flags-null-to-csv.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 965f872)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
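As an illustration of the flag described in the commit message above, the sketch below enables it on a session before writing. It assumes a Spark build that already contains that commit; the object name `LegacyNullCsvFlagDemo` and the temporary output directory are hypothetical and chosen only for this example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; not part of the Spark code base.
object LegacyNullCsvFlagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("legacy-null-csv-flag-demo")
      .getOrCreate()
    import spark.implicits._

    // Flag name taken from the commit message; it is disabled by default.
    // Enabling it asks the CSV writer for the pre-change (quoted) output.
    spark.conf.set("spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv", "true")

    // Assumption: a throwaway temp directory is used as the output location.
    val out = Files.createTempDirectory("csv-legacy-flag").toString + "/out"

    Seq("abc", null, "def").toDF()
      .repartition(1)
      .write
      .mode("overwrite")
      .csv(out)

    // Print the raw lines in brackets to inspect how the null was written.
    spark.read.text(out).collect().foreach { row =>
      println(s"[${row.getString(0)}]")
    }

    spark.stop()
  }
}
```

Setting the flag per session, rather than changing `nullValue` on every write, keeps existing jobs unchanged while they migrate.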
What changes were proposed in this pull request?
This is a follow-up for #34853, to mention the behavior change in the migration guide as well.
See also #34853 (comment)
Why are the changes needed?
We should mention the behavior change in the migration guide, even though it is a bug fix.
Does this PR introduce any user-facing change?
The explanation is added to the migration guide as below:
How was this patch tested?
Manually built the docs.