[SPARK-25241][SQL] Configurable empty values when reading/writing CSV files #22234
Conversation
Should the new option be taken into account there: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala Line 94 in b461acb
and here: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala Line 82 in 5264164
python/pyspark/sql/readwriter.py (outdated)

```python
maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
samplingRatio=None, enforceSchema=None):
ignoreTrailingWhiteSpace=None, nullValue=None, emptyValue=None, nanValue=None,
```
It should be put last; otherwise, it's going to break existing Python apps when the arguments are given positionally.
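To illustrate the concern, here is a minimal, self-contained sketch (the function names and the simplified signature are hypothetical, not Spark's actual API) of why inserting a keyword argument mid-signature breaks callers that pass arguments positionally:

```python
# Hypothetical, heavily simplified stand-in for a reader's csv() signature.
def csv_v1(path, schema=None, sep=None, header=None):
    return (path, schema, sep, header)

# Inserting emptyValue in the middle shifts every later positional argument.
def csv_v2_bad(path, schema=None, emptyValue=None, sep=None, header=None):
    return (path, schema, emptyValue, sep, header)

# Appending it at the end keeps old positional calls working.
def csv_v2_good(path, schema=None, sep=None, header=None, emptyValue=None):
    return (path, schema, sep, header, emptyValue)

# An existing call that passed sep positionally:
print(csv_v1("data.csv", None, ";"))       # sep is ";"
print(csv_v2_bad("data.csv", None, ";"))   # ";" now lands in emptyValue, sep is None
print(csv_v2_good("data.csv", None, ";"))  # sep is still ";"
```

This is exactly the failure mode for any Python app calling `csv(path, schema, sep, ...)` positionally, which is why new parameters go at the end.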
We should add the new parameter at the end. +1
Done!
ok to test
In light of the discussion in the ticket https://issues.apache.org/jira/browse/SPARK-17916, could you write a test checking the case when empty values are written without quotes, as it was by default in Spark 2.3?
Test build #95259 has finished for PR 22234 at commit
@MaxGekk I added what you suggested as well.
python/pyspark/sql/readwriter.py (outdated)

```python
nanValue=nanValue, positiveInf=positiveInf, negativeInf=negativeInf,
dateFormat=dateFormat, timestampFormat=timestampFormat, maxColumns=maxColumns,
maxCharsPerColumn=maxCharsPerColumn,
emptyValue=emptyValue, nanValue=nanValue, positiveInf=positiveInf,
```
I would put this at the end as well for readability.
Done!
```scala
@@ -117,6 +117,9 @@ class CSVOptions(

  val nullValue = parameters.getOrElse("nullValue", "")

  val emptyValueInRead = parameters.getOrElse("emptyValue", "")
```
I would just call it `emptyValue` for consistency with other options here.
I thought that as well. Just for the sake of backwards compatibility, as we already have with `ignoreLeadingWhiteSpaceInRead` and `ignoreLeadingWhiteSpaceFlagInWrite`, I implemented it that way. What do you say?
I had to name them differently because the default values are different. Ah, yeah, then it makes sense here. I rushed to read.
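A minimal sketch of that naming rationale, assuming the defaults described in this thread (an empty string on read, `""` on write); the helper names mirror the Scala vals, but the code itself is illustrative, not Spark's implementation:

```python
def empty_value_in_read(parameters: dict) -> str:
    # Mirrors: parameters.getOrElse("emptyValue", "")
    # On read, an unconfigured emptyValue defaults to the empty string.
    return parameters.get("emptyValue", "")

def empty_value_in_write(parameters: dict) -> str:
    # Assumed write-side default per the discussion: empty strings are
    # written as "" unless the user overrides emptyValue.
    return parameters.get("emptyValue", '""')

print(empty_value_in_read({}))                         # ''
print(empty_value_in_write({}))                        # '""'
print(empty_value_in_write({"emptyValue": "EMPTY"}))   # 'EMPTY'
```

Because the two defaults differ, a single `emptyValue` val could not serve both code paths, hence the `InRead`/`InWrite` suffixes.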
Test build #95270 has finished for PR 22234 at commit

Test build #95271 has finished for PR 22234 at commit

Test build #95274 has finished for PR 22234 at commit
Seems okay, but I or someone else should take a closer look before getting this in.
```diff
@@ -79,7 +79,8 @@ private[csv] object CSVInferSchema {
    * point checking if it is an Int, as the final type must be Double or higher.
    */
   def inferField(typeSoFar: DataType, field: String, options: CSVOptions): DataType = {
-    if (field == null || field.isEmpty || field == options.nullValue) {
+    if (field == null || field.isEmpty || field == options.nullValue ||
+        field == options.emptyValueInRead) {
```
I wouldn't do this for now. It needs another review iteration. Let's revert this back.
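For context, the guard being debated can be sketched in plain Python (a simplification of the Scala `inferField` check above, not the actual implementation): the proposed change widens the set of tokens treated as null-like during schema inference to include the configured `emptyValue`.

```python
def treat_as_null(field, null_value="", empty_value_in_read=""):
    # A field is skipped during type inference when it is absent, empty,
    # or equal to the configured nullValue / emptyValue tokens.
    return (field is None
            or field == ""
            or field == null_value
            or field == empty_value_in_read)

print(treat_as_null(None))                      # True: missing field
print(treat_as_null("NA", null_value="NA"))     # True: matches nullValue
print(treat_as_null("EMPTY",
                    empty_value_in_read="EMPTY"))  # True: the widened check
print(treat_as_null("x"))                       # False: a real value
```

The reviewer's point is that this widening changes inference behavior and deserves its own review pass, which is why it was reverted here.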
```scala
// When there are empty strings or the values set in `nullValue`, put the
// index as the suffix.
if (value == null || value.isEmpty || value == options.nullValue ||
    value == options.emptyValueInRead) {
```
ditto for excluding.
Do I revert both of these changes then, @HyukjinKwon?
Looks good. Let me take another look before getting this in.
Did we introduce any behavior change in #21273? Does this PR resolve it?
From my understanding, yeah. The problem here sounds like ambiguity in empty strings, since they can be interpreted as empty strings and also as nulls. This PR proposes the ability to explicitly set the empty value to work around the behaviour change.
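The ambiguity can be demonstrated with Python's standard `csv` module (illustrative only, not Spark code): an empty string and a missing value serialize to the same unquoted empty token, so a reader cannot tell them apart without a convention like `emptyValue`.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["", "a"])    # an empty string in the first column
writer.writerow([None, "b"])  # a "missing" value in the first column
print(buf.getvalue())
# ,a
# ,b   <- both rows start with the same unquoted empty token

buf.seek(0)
rows = list(csv.reader(buf))
print(rows)  # [['', 'a'], ['', 'b']] -> empty string and None are indistinguishable
```

Writing empty strings as quoted `""` (and letting users configure the token) is one way to break this tie on the Spark side.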
Have we documented the behavior changes in the migration guide? If not, can we do it?
This is rather a corner case (see the elaborated cases in the JIRA SPARK-17916), and there's ambiguity about whether to treat this as a bug or a proper behaviour change; however, I don't object if this is worth mentioning. cc @MaxGekk for a followup
@HyukjinKwon Do you mean to update the migration guide in master and probably in Spark 2.4? I don't think this should be considered a bug, because current and previous versions of Spark can read saved CSV files correctly. Yes, for now empty strings are saved as `""`.
Oh, no, I mean we fixed a bug.
@MaxGekk Could you take this PR over? I think we need to merge this to Spark 2.4. Users can switch back to the previous behavior with this new option.
@gatorsmile @HyukjinKwon Please, take a look at #22367
…sed as null when nullValue is set.

What changes were proposed in this pull request?

In the PR, I propose a new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert the previous behavior, when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes #22234
Closes #22367

How was this patch tested?

It was tested by `CSVSuite` and new tests added in the PR #22234.

Closes #22389 from MaxGekk/csv-empty-value-master.

Lead-authored-by: Mario Molina <mmolimar@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit c9cb393)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
What changes were proposed in this pull request?
There is an option in the CSV parser to set values when we have empty values in the CSV files or in our dataframes. Currently, this option cannot be configured and always uses a default value (an empty string for reading and `""` for writing). This PR enables a new CSV option in the reader/writer to set custom empty values when reading/writing CSV files.
How was this patch tested?
The changes were tested by `CSVSuite`, adding two unit tests.
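To make the option's semantics concrete, here is a stand-alone sketch using only Python's standard library (not Spark's implementation; the helper names and the exact substitution rules are assumptions for illustration): on write, empty strings are emitted as the configured token; on read, fields matching the token are mapped back to the empty string.

```python
import csv
import io

def write_csv(rows, empty_value="EMPTY"):
    # Writer side: substitute the emptyValue token for empty-string cells.
    buf = io.StringIO()
    w = csv.writer(buf)
    for row in rows:
        w.writerow([empty_value if cell == "" else cell for cell in row])
    return buf.getvalue()

def read_csv(text, empty_value="EMPTY"):
    # Reader side: map the emptyValue token back to the empty string.
    return [["" if cell == empty_value else cell for cell in row]
            for row in csv.reader(io.StringIO(text))]

out = write_csv([["", "x"]])
print(out)             # EMPTY,x
print(read_csv(out))   # [['', 'x']]
```

With a configurable token, round-tripping preserves empty strings unambiguously, which is the behavior gap the `emptyValue` option closes.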