[SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files#34860

Closed
wayneguow wants to merge 3 commits into apache:master from wayneguow:SPARK-37604

@wayneguow wayneguow commented Dec 10, 2021

What changes were proposed in this pull request?

This PR changes the effect of the emptyValueInRead option in CSVOptions so that any field matching this string is set to "" when reading CSV files. Before this change, the option converted quoted empty strings ("\"\"") to the configured emptyValueInRead string.

Why are the changes needed?

After this change, the emptyValueInRead option behaves more in line with what users expect, and it behaves similarly to the nullValue option when reading CSV files. A DataFrame written and read back with the same emptyValue setting (emptyValueInWrite == emptyValueInRead) is parsed identically to the original.

Before this change, given the following code:

val data = Seq(("Tesla", "")).toDF("make", "comment")
data.show()
data.write.options(Map("header" -> "true", "emptyValue" -> "EMPTY")).csv(path)

the DataFrame shows as:

make comment
Tesla

and the CSV data shows as:

Tesla,EMPTY

If we read this CSV file back with the same emptyValue setting:

spark.read.options(Map("header" -> "true", "emptyValue" -> "EMPTY")).csv(path)

we end up with a DataFrame that shows as:

make comment
Tesla EMPTY

After this change, reading the CSV file yields the same DataFrame as the original:

make comment
Tesla
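
The round trip described above can be sketched with a toy model in plain Scala. This is not Spark's or Univocity's implementation: EmptyValueModel and its functions are hypothetical names that only mimic the read and write semantics discussed in this PR, treating each field as the raw token found in the file.

```scala
// Toy model of the emptyValue semantics discussed in this PR.
// All names here are hypothetical; this is not Spark or Univocity code.
object EmptyValueModel {
  // Write path: an empty string is written as the configured marker.
  def write(value: String, emptyValueInWrite: String): String =
    if (value.isEmpty) emptyValueInWrite else value

  // Read path before this PR: only a quoted empty string ("") in the file
  // is replaced, and it is replaced by the configured marker.
  def readBefore(token: String, emptyValueInRead: String): String =
    if (token == "\"\"") emptyValueInRead else token

  // Read path proposed by this PR: a token equal to the marker becomes "".
  def readAfter(token: String, emptyValueInRead: String): String =
    if (token == emptyValueInRead) "" else token
}
```

Under this model, writing "" with the marker "EMPTY" produces the token EMPTY; readBefore returns it unchanged as "EMPTY", while readAfter restores "".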

Does this PR introduce any user-facing change?

Yes. Users should be aware of this change and take care when setting the emptyValue parameter to a custom value.

How was this patch tested?

Updated the related existing test cases in CSVSuite.

@github-actions github-actions bot added the SQL label Dec 10, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-37604][SQL]Change the behavior of emptyValueInRead parameter in CSVOptions [SPARK-37604][SQL] Change the behavior of emptyValueInRead parameter in CSVOptions Dec 12, 2021
@HyukjinKwon

ok to test

@HyukjinKwon HyukjinKwon left a comment

Hm,

  • emptyValue in the read path reads "" as the configured value.
  • emptyValue in the write path writes "" as the configured value.

But the current change will change the behaviour of emptyValue in read to:

  • emptyValue in the read path reads the configured value as "".

I think either way makes sense, so I wouldn't introduce an unnecessary breaking change here.

SparkQA commented Dec 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50579/

SparkQA commented Dec 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50579/

wayneguow commented Dec 12, 2021

Yes, I agree that this would be a breaking change. The emptyValue setting in Univocity's CsvParserSettings is designed to be substituted for empty strings when reading CSV files.

What confuses me is whether there is a better way to read CSV files that were written with a custom emptyValue. If emptyValue is set to "EMPTY" when writing, then on reading the "EMPTY" marker is not recognized as an empty string, and we get the string "EMPTY" rather than "". We have to handle "EMPTY" strings with hardcoded logic after reading instead of setting an option.

SparkQA commented Dec 12, 2021

Test build #146105 has finished for PR 34860 at commit 36bb86d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wayneguow wayneguow changed the title [SPARK-37604][SQL] Change the behavior of emptyValueInRead parameter in CSVOptions [SPARK-37604][SQL] Change emptyValueInRead to that any fields matching this string will be set as "" when reading Dec 15, 2021
@wayneguow wayneguow changed the title [SPARK-37604][SQL] Change emptyValueInRead to that any fields matching this string will be set as "" when reading [SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading Dec 15, 2021
@wayneguow wayneguow changed the title [SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading [SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files Dec 15, 2021
SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50712/

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50712/

SparkQA commented Dec 15, 2021

Test build #146238 has finished for PR 34860 at commit 52eef09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wayneguow commented Dec 16, 2021

One possible way to avoid a breaking change is to add another boolean option to CSVOptions, such as parseEmptyValueAsEmpty, which defaults to false.

/**
 * If set to true, emptyValueInRead strings are converted to empty strings "" when reading.
 */
val parseEmptyValueAsEmpty = getBool("parseEmptyValueAsEmpty", false)

With this option, the nullSafeDatum method in UnivocityParser could be changed as follows:

private def nullSafeDatum(
     datum: String,
     name: String,
     nullable: Boolean,
     options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
    if (!nullable) {
      throw new RuntimeException(s"null value found but field $name is not nullable.")
    }
    null
  } else if (options.parseEmptyValueAsEmpty && datum == options.emptyValueInRead) {
    converter.apply("")
  } else {
    converter.apply(datum)
  }
}

With this option, the default behavior stays the same as before, but users like me can have emptyValue strings parsed as "" when reading CSV files.
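
To try the proposed branch without a Spark build, here is a self-contained sketch. Opts is a hypothetical stand-in for the three CSVOptions fields used above, and ValueConverter is assumed to be a plain String => Any function. parseEmptyValueAsEmpty is the option proposed in this comment, not an existing Spark option.

```scala
// Standalone sketch of the proposed nullSafeDatum branch.
// Opts is a hypothetical stand-in for CSVOptions; it is not Spark code.
object NullSafeDatumSketch {
  final case class Opts(
      nullValue: String,
      emptyValueInRead: String,
      parseEmptyValueAsEmpty: Boolean)

  // Assumed shape of a converter: raw field text in, typed value out.
  type ValueConverter = String => Any

  def nullSafeDatum(
      datum: String,
      name: String,
      nullable: Boolean,
      options: Opts)(converter: ValueConverter): Any = {
    if (datum == options.nullValue || datum == null) {
      // Null marker (or a missing value): only allowed for nullable fields.
      if (!nullable) {
        throw new RuntimeException(s"null value found but field $name is not nullable.")
      }
      null
    } else if (options.parseEmptyValueAsEmpty && datum == options.emptyValueInRead) {
      // Opt-in branch: the empty-value marker is converted as "".
      converter("")
    } else {
      converter(datum)
    }
  }
}
```

With the flag off, a field containing "EMPTY" is passed to the converter unchanged; with the flag on, the converter receives "" instead. The null check comes first, so a field equal to nullValue still yields null in both cases.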

@wayneguow wayneguow closed this Dec 17, 2021
@wayneguow

It seems this is not widely needed at the moment.

@wayneguow wayneguow deleted the SPARK-37604 branch February 11, 2025 04:25