[SPARK-37575][SQL][FOLLOWUP] Update migration guide for null values saving in CSV data source #34905

Closed
itholic wants to merge 2 commits into apache:master from itholic:SPARK-37575-followup

Contributor

@itholic itholic commented Dec 15, 2021

What changes were proposed in this pull request?

This is a follow-up to #34853 that also documents the behavior change in the migration guide.

See also #34853 (comment)

Why are the changes needed?

The behavior change should be mentioned in the migration guide, even though it is a bug fix.

Does this PR introduce any user-facing change?

The explanation is added to the migration guide as below:

(Screenshot of the rendered migration guide entry, taken 2021-12-15.)

How was this patch tested?

Manually built docs

SparkQA commented Dec 15, 2021

Test build #146213 has finished for PR 34905 at commit c90d819.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@itholic itholic changed the title [SPARK-37575][FOLLOWUP][SQL] Update migration guide for null values saving [SPARK-37575][FOLLOWUP][CSV] Update migration guide for null values saving in CSV data source Dec 15, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-37575][FOLLOWUP][CSV] Update migration guide for null values saving in CSV data source [SPARK-37575][SQL][FOLLOWUP] Update migration guide for null values saving in CSV data source Dec 15, 2021
SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50687/


- Since Spark 3.3, the `strfmt` in `format_string(strfmt, obj, ...)` and `printf(strfmt, obj, ...)` no longer supports using `0$` to specify the first argument; when an argument index is used to indicate the position of an argument in the argument list, the first argument must always be referenced by `1$`.

- Since Spark 3.3, nulls are written as empty strings in CSV data source by default. In Spark 3.2 or earlier, nulls were written as quoted empty strings, `""`. To restore the previous behavior, set `nullValue` to `""`.
Member

@MaxGekk MaxGekk Dec 15, 2021

> To restore the previous behavior, set `nullValue` to `""`.

Actually, it is correct, but consider a user who naively sets the option to an empty string, reading the recommendation literally:

scala> val df = Seq("abc", null, "def").toDF()
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.repartition(1).write.option("nullValue", "").mode("overwrite").csv("/Users/maximgekk/tmp/csv3")
$ csv3 cat ./part-00000-5830ac7c-3653-41ec-a2f7-c56934ef56d9-c000.csv
abc
def

but:

scala> df.repartition(1).write.option("nullValue", "\"\"").mode("overwrite").csv("/Users/maximgekk/tmp/csv4")
$ csv4 cat ./part-00000-6a5b0628-8924-4300-9699-89f4df903db9-c000.csv
abc
""
def
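
The two runs above differ only in the string literal passed to `nullValue`: an empty string in the first, and a two-character string made of two double quotes in the second. A minimal standalone sketch of that comparison follows. The object name, the local master setting, and the temporary output directory are assumptions for illustration, not part of the PR.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical standalone driver; the transcript above used spark-shell instead.
object CsvNullValueSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("csv-null-value-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the review comment: one null between two strings.
    val df = Seq("abc", null, "def").toDF()

    // Assumed output location; any writable directory works.
    val base = Files.createTempDirectory("csv-null-value").toString

    // nullValue set to the empty string (zero characters).
    df.repartition(1).write
      .option("nullValue", "")
      .mode("overwrite")
      .csv(s"$base/empty")

    // nullValue set to a two-character string: two double quotes.
    df.repartition(1).write
      .option("nullValue", "\"\"")
      .mode("overwrite")
      .csv(s"$base/quoted")

    // Read the raw lines back as text so the quoting is visible.
    spark.read.text(s"$base/empty").show(false)
    spark.read.text(s"$base/quoted").show(false)

    spark.stop()
  }
}
```

Reading the files back with `spark.read.text` rather than `spark.read.csv` avoids the CSV parser reinterpreting the quotes, so the printed rows reflect what was actually written.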

Member

Let me just merge as is ...

SparkQA commented Dec 15, 2021

Test build #146217 has finished for PR 34905 at commit 253afac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50687/

@HyukjinKwon
Member

Merged to master.

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50691/

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50691/

cloud-fan pushed a commit that referenced this pull request Apr 15, 2022
…of write null value in csv to unquoted empty string

### What changes were proposed in this pull request?

Add a legacy flag `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` for the breaking change introduced in #34853 and #34905 (followup).

The flag is disabled by default, so null values written to CSV are output as unquoted empty strings. When the legacy flag is enabled, nulls are output as quoted empty strings.

### Why are the changes needed?
The original commit is a breaking change, and a breaking change should come with a flag to turn it off, for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference.
If users enable this conf, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36110 from anchovYu/flags-null-to-csv.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
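
As a rough illustration of the flag described in this commit message, the sketch below enables it before writing a DataFrame that contains a null. The flag name is taken from the commit message; setting it at runtime through `spark.conf.set` is an assumption, as are the object name, the local master setting, and the temporary output directory.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical driver, not taken from the commit itself.
object LegacyCsvNullFlagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("legacy-csv-null-flag-sketch")
      .getOrCreate()
    import spark.implicits._

    // Flag name as given in the commit message. It is disabled by default;
    // setting it at runtime like this is an assumption.
    spark.conf.set("spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv", "true")

    val df = Seq("abc", null, "def").toDF()

    // Assumed output location; any writable directory works.
    val out = Files.createTempDirectory("csv-legacy-null").toString + "/out"

    // No nullValue option is set, so the flag alone decides how nulls are written.
    df.repartition(1).write.mode("overwrite").csv(out)

    // Print the raw lines; per the commit message, nulls should appear as
    // quoted empty strings while the flag is enabled.
    spark.read.text(out).show(false)

    spark.stop()
  }
}
```

Compared with setting `nullValue` on each writer, a session-level flag restores the old output for every CSV write without touching call sites, which fits the migration purpose stated in the commit message.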
cloud-fan pushed a commit that referenced this pull request Apr 15, 2022
…of write null value in csv to unquoted empty string

### What changes were proposed in this pull request?

Add a legacy flag `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` for the breaking change introduced in #34853 and #34905 (followup).

The flag is disabled by default, so null values written to CSV are output as unquoted empty strings. When the legacy flag is enabled, nulls are output as quoted empty strings.

### Why are the changes needed?
The original commit is a breaking change, and a breaking change should come with a flag to turn it off, for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference.
If users enable this conf, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36110 from anchovYu/flags-null-to-csv.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 965f872)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>