[SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files by wayneguow · Pull Request #34860 · apache/spark

wayneguow · 2021-12-10T05:18:33Z

What changes were proposed in this pull request?

This PR changes the effect of the emptyValueInRead option in CSVOptions so that any field matching this string is set to "" when reading CSV files. Before this change, the option converts quoted empty strings ("\"\"") to the configured emptyValueInRead string.

Why are the changes needed?

After this change, the emptyValueInRead option better matches what users expect, and it behaves like the nullValue option when reading CSV files. A DataFrame that is written and then read back with the same emptyValue setting (emptyValueInWrite == emptyValueInRead) is parsed back to the original.

Before this change, consider the following code:

val data = Seq(("Tesla", "")).toDF("make", "comment")
data.show()
data.write.options(Map("header" -> "true", "emptyValue" -> "EMPTY")).csv(path)

The DataFrame shows as:

make	comment
Tesla

The CSV data is written as:

Tesla,EMPTY

If we read this CSV file with the same emptyValue setting:

spark.read.options(Map("header" -> "true", "emptyValue" -> "EMPTY")).csv(path)

we get a DataFrame that shows as:

make	comment
Tesla	EMPTY

After this change, reading the CSV file yields the original DataFrame:

make	comment
Tesla
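The round trip described above can be sketched as a toy model in plain Scala. This is not Spark's CSV code: the object EmptyValueRoundTrip and its three functions are hypothetical names used only for illustration, and the model looks only at already-tokenized, unquoted fields.

```scala
// Toy model of the write/read round trip; names are hypothetical, not Spark APIs.
object EmptyValueRoundTrip {
  // Write path: an empty string is emitted as the configured emptyValue marker.
  def writeField(value: String, emptyValue: String): String =
    if (value.isEmpty) emptyValue else value

  // Read path before this PR, as described above: an unquoted token that
  // happens to equal the marker is kept verbatim.
  def readFieldBefore(token: String, emptyValue: String): String = token

  // Read path proposed by this PR: a token equal to the marker becomes "".
  def readFieldAfter(token: String, emptyValue: String): String =
    if (token == emptyValue) "" else token

  def main(args: Array[String]): Unit = {
    val marker = "EMPTY"
    val written = Seq("Tesla", "").map(writeField(_, marker))
    println(written.mkString(","))
    println(written.map(readFieldBefore(_, marker)).mkString(","))
    println(written.map(readFieldAfter(_, marker)).mkString(","))
  }
}
```

Only readFieldAfter restores the original row, which is the behavior this PR asks for.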

Does this PR introduce any user-facing change?

Yes. Users should be aware of this change and take care when setting the emptyValue parameter to a custom value.

How was this patch tested?

Updated the related test cases that already exist in CSVSuite.

HyukjinKwon · 2021-12-12T06:58:08Z

ok to test

HyukjinKwon

Hm,

emptyValue in read path reads the value as "".
emptyValue in write path writes "" as the value.

But the current change will change the behaviour of emptyValue in read to:

emptyValue in read path reads the value as "" as the value.

I think either way makes sense, so I wouldn't introduce an unnecessary breaking change here.

SparkQA · 2021-12-12T07:50:31Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50579/

SparkQA · 2021-12-12T08:34:55Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50579/

wayneguow · 2021-12-12T09:11:12Z

Yes, I agree that it would be a breaking change. The emptyValue parameter, as designed in Univocity's CsvParserSettings, is the string used in place of an empty string when reading CSV files.

What confuses me is whether there is a better way to read CSV files that were written with a custom emptyValue. If emptyValue is set to "EMPTY" when writing, then on reading we cannot recognize "EMPTY" as an empty string; we get the string "EMPTY" rather than "". We have to handle the "EMPTY" strings with hard-coded logic after reading instead of setting an option.
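For reference, the hard-coded handling mentioned above might look roughly like the sketch below. It assumes Spark is on the classpath; the object RestoreEmptyStrings and its restore helper are hypothetical names, and the in-memory DataFrame stands in for the result of reading the CSV file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RestoreEmptyStrings {
  // Hypothetical helper: map the marker string back to "" in the named string columns.
  def restore(df: DataFrame, marker: String, columns: Seq[String]): DataFrame =
    df.na.replace(columns, Map(marker -> ""))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("restore-empty-strings")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame obtained by reading the CSV written with emptyValue = "EMPTY".
    val readBack = Seq(("Tesla", "EMPTY")).toDF("make", "comment")
    restore(readBack, "EMPTY", Seq("comment")).show()

    spark.stop()
  }
}
```

The drawback is the one raised in the comment: the marker has to be repeated in application code, and a legitimate field whose value is literally "EMPTY" is replaced as well.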

SparkQA · 2021-12-12T09:19:51Z

Test build #146105 has finished for PR 34860 at commit 36bb86d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-15T18:11:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50712/

SparkQA · 2021-12-15T19:10:49Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50712/

SparkQA · 2021-12-15T22:12:12Z

Test build #146238 has finished for PR 34860 at commit 52eef09.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wayneguow · 2021-12-16T04:17:28Z

One possible approach that avoids a breaking change is to add another boolean option to CSVOptions, such as parseEmptyValueAsEmpty, which defaults to false.

/**
 * If set to true, emptyValueInRead strings are converted to empty strings "" when reading.
 */
val parseEmptyValueAsEmpty = getBool("parseEmptyValueAsEmpty", false)

And with this option, we can change the nullSafeDatum method in UnivocityParser as follows:

private def nullSafeDatum(
     datum: String,
     name: String,
     nullable: Boolean,
     options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
    if (!nullable) {
      throw new RuntimeException(s"null value found but field $name is not nullable.")
    }
    null
  } else if (options.parseEmptyValueAsEmpty && datum == options.emptyValueInRead) {
    converter.apply("")
  } else {
    converter.apply(datum)
  }
}

With this option, the default behavior is the same as before, but users like me can have emptyValue strings parsed as "" when reading CSV files.
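To make the effect of the flag concrete, here is a self-contained sketch of the same branching outside Spark. The Opts case class is a hypothetical stand-in for the three CSVOptions fields involved, and the converter is simplified to a plain String => Any function.

```scala
object NullSafeDatumSketch {
  // Hypothetical stand-in for the CSVOptions fields used by the proposal.
  final case class Opts(
      nullValue: String,
      emptyValueInRead: String,
      parseEmptyValueAsEmpty: Boolean)

  def nullSafeDatum(datum: String, name: String, nullable: Boolean, options: Opts)(
      converter: String => Any): Any = {
    if (datum == options.nullValue || datum == null) {
      if (!nullable) {
        throw new RuntimeException(s"null value found but field $name is not nullable.")
      }
      null
    } else if (options.parseEmptyValueAsEmpty && datum == options.emptyValueInRead) {
      // Opt-in branch: the marker string is converted as if it were "".
      converter("")
    } else {
      converter(datum)
    }
  }

  def main(args: Array[String]): Unit = {
    val keep: String => Any = (s: String) => s
    val off = Opts(nullValue = "NULL", emptyValueInRead = "EMPTY", parseEmptyValueAsEmpty = false)
    val on = off.copy(parseEmptyValueAsEmpty = true)
    println(nullSafeDatum("EMPTY", "comment", nullable = true, off)(keep) == "EMPTY")
    println(nullSafeDatum("EMPTY", "comment", nullable = true, on)(keep) == "")
  }
}
```

With the flag off, the marker passes through unchanged, so existing jobs see no difference; only callers that enable the flag get "" back.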

wayneguow · 2021-12-17T15:52:15Z

For now, this may not be widely needed.

Change the behavior of emptyValueInRead parameter in CSVOptions

36bb86d

github-actions bot added the SQL label Dec 10, 2021

HyukjinKwon changed the title ~~[SPARK-37604][SQL]Change the behavior of emptyValueInRead parameter in CSVOptions~~ [SPARK-37604][SQL] Change the behavior of emptyValueInRead parameter in CSVOptions Dec 12, 2021

HyukjinKwon reviewed Dec 12, 2021


wayneguow added 2 commits December 16, 2021 00:54

Merge branch 'master' into SPARK-37604

5a06775

upgrade test cases

52eef09

wayneguow changed the title ~~[SPARK-37604][SQL] Change the behavior of emptyValueInRead parameter in CSVOptions~~ [SPARK-37604][SQL] Change emptyValueInRead to that any fields matching this string will be set as "" when reading Dec 15, 2021

wayneguow changed the title ~~[SPARK-37604][SQL] Change emptyValueInRead to that any fields matching this string will be set as "" when reading~~ [SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading Dec 15, 2021

wayneguow changed the title ~~[SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading~~ [SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files Dec 15, 2021

wayneguow closed this Dec 17, 2021

wayneguow deleted the SPARK-37604 branch February 11, 2025 04:25

wayneguow wants to merge 3 commits into apache:master from wayneguow:SPARK-37604
