
[SPARK-25241][SQL] Configurable empty values when reading/writing CSV files #22234

Closed
mmolimar wants to merge 6 commits

Conversation

mmolimar (Contributor)

What changes were proposed in this pull request?

The CSV parser has an option that controls how empty values are represented when reading CSV files or writing DataFrames. Currently this option cannot be configured and always uses a default value (an empty string for reading and "" for writing).
This PR adds a new CSV option to the reader/writer so that custom empty values can be set when reading/writing CSV files.
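
For illustration, here is a minimal PySpark sketch of how the proposed option might be used; the option name follows this PR, while the paths and marker values are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading: map empty CSV fields to a custom marker instead of the built-in default.
df = (spark.read
      .option("header", "true")
      .option("emptyValue", "EMPTY")   # option proposed by this PR; marker chosen for illustration
      .csv("/tmp/people.csv"))

# Writing: control what an empty string column value becomes in the output file.
(df.write
   .option("emptyValue", "")           # e.g. write empty strings as nothing at all
   .csv("/tmp/people_out"))
```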

How was this patch tested?

The changes were tested by adding two unit tests to CSVSuite.

@MaxGekk (Member) commented Aug 26, 2018

Should the new option be taken into account there:

if (value == null || value.isEmpty || value == options.nullValue) {

and here:
if (field == null || field.isEmpty || field == options.nullValue) {
?

Diff excerpt (PySpark csv() signature; the changed line adds emptyValue):

ignoreTrailingWhiteSpace=None, nullValue=None, emptyValue=None, nanValue=None,
...
maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
samplingRatio=None, enforceSchema=None):
Reviewer (Member):

It should be put last; otherwise it's going to break existing Python apps when the arguments are given positionally.
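
To illustrate the concern about positional arguments, here is a contrived sketch; the signatures below are simplified stand-ins, not the actual readwriter.py code:

```python
# Simplified, hypothetical signatures (not the real readwriter.py code) showing why
# the new keyword has to go last.
def csv_old(path, sep=None, quote=None, nullValue=None, nanValue=None):
    return (path, sep, quote, nullValue, nanValue)

# An existing caller passing arguments positionally:
print(csv_old("data.csv", ",", '"', "NULL", "NaN"))

# Inserting emptyValue in the middle silently rebinds the caller's "NaN" to emptyValue:
def csv_broken(path, sep=None, quote=None, nullValue=None, emptyValue=None, nanValue=None):
    return (path, sep, quote, nullValue, emptyValue, nanValue)

print(csv_broken("data.csv", ",", '"', "NULL", "NaN"))  # "NaN" now lands in emptyValue

# Appending the new parameter at the end keeps existing positional calls working unchanged:
def csv_fixed(path, sep=None, quote=None, nullValue=None, nanValue=None, emptyValue=None):
    return (path, sep, quote, nullValue, nanValue, emptyValue)

print(csv_fixed("data.csv", ",", '"', "NULL", "NaN"))
```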

Reviewer (Member):

We should add the new parameter at the end. +1

mmolimar (Contributor, author):

Done!

@HyukjinKwon (Member)

ok to test

@MaxGekk (Member) left a comment:

In light of the discussion in the ticket https://issues.apache.org/jira/browse/SPARK-17916, could you write a test that checks the case where empty values are written without quotes, as was the default in Spark 2.3?
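
For reference, a rough PySpark sketch of the kind of round-trip check being asked for; the real tests added by this PR live in the Scala CSVSuite, and the paths and data below are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", ""), ("b", None)], ["id", "value"])

# With emptyValue set to an empty (unquoted) string, empty cells should be written
# as nothing at all, matching the Spark 2.3 output format.
df.write.option("emptyValue", "").csv("/tmp/csv_empty_check")

# Inspect the raw lines rather than parsed values, since how empty fields come back
# also depends on the reader-side options.
for row in spark.read.text("/tmp/csv_empty_check").collect():
    print(row.value)
```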

@SparkQA

SparkQA commented Aug 26, 2018

Test build #95259 has finished for PR 22234 at commit 8b51800.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mmolimar (Contributor, author)

@MaxGekk I added what you suggested as well.

Diff excerpt (PySpark csv() option forwarding; the added line contains emptyValue):

nanValue=nanValue, positiveInf=positiveInf, negativeInf=negativeInf,
dateFormat=dateFormat, timestampFormat=timestampFormat, maxColumns=maxColumns,
maxCharsPerColumn=maxCharsPerColumn,
emptyValue=emptyValue, nanValue=nanValue, positiveInf=positiveInf,
Reviewer (Member):

I would put this at the end as well for readability.

mmolimar (Contributor, author):

Done!

@@ -117,6 +117,9 @@ class CSVOptions(

val nullValue = parameters.getOrElse("nullValue", "")

val emptyValueInRead = parameters.getOrElse("emptyValue", "")
Reviewer (Member):
I would just call it emptyValue for consistency with other options here.

mmolimar (Contributor, author):

I thought that as well. Just for the sake of providing backwards compatibility, as we already have with ignoreLeadingWhiteSpaceInRead and ignoreLeadingWhiteSpaceFlagInWrite, I implemented it that way.
What do you say?

Reviewer (Member):

I had to name them differently because the default values are different. Ah, yeah, then it makes sense here; I read it too quickly.

@SparkQA

SparkQA commented Aug 27, 2018

Test build #95270 has finished for PR 22234 at commit 3d3f178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 27, 2018

Test build #95271 has finished for PR 22234 at commit bb28db9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 27, 2018

Test build #95274 has finished for PR 22234 at commit 0bcdb2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Seems okay but I or someone else should take a closer look before getting this in.

@@ -79,7 +79,8 @@ private[csv] object CSVInferSchema {
    * point checking if it is an Int, as the final type must be Double or higher.
    */
   def inferField(typeSoFar: DataType, field: String, options: CSVOptions): DataType = {
-    if (field == null || field.isEmpty || field == options.nullValue) {
+    if (field == null || field.isEmpty || field == options.nullValue ||
+        field == options.emptyValueInRead) {
Reviewer (Member):

I wouldn't do this for now; it needs another review iteration. Let's revert this.

// When there are empty strings or the values set in `nullValue`, put the
// index as the suffix.
if (value == null || value.isEmpty || value == options.nullValue ||
    value == options.emptyValueInRead) {
Reviewer (Member):

ditto for excluding.

mmolimar (Contributor, author):

Do I revert both of these changes then, @HyukjinKwon?

@HyukjinKwon (Member) left a comment:

Looks good. Let me take another look before getting this in.

@gatorsmile (Member)

Did we introduce any behavior change in #21273? Does this PR resolve it?

@HyukjinKwon (Member)

From my understanding, yes. The problem here sounds like ambiguity around empty strings, since they can be interpreted as either empty strings or nulls. To me this is actually rather a bug, since we can't distinguish the two and we don't respect the empty value. If empty strings are written, they should be read back as empty strings.

This PR proposes the ability to explicitly set the empty value to work around the behaviour change.
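
As a contrived illustration of the ambiguity (the markers below are arbitrary examples, not Spark defaults):

```python
# A raw CSV line whose middle field is simply empty:
line = "a,,c"

# Without extra conventions, the empty field could mean either "" (an empty string)
# or a missing value (null) -- the two cases look identical on disk.
print(line.split(","))  # ['a', '', 'c']; the parser alone cannot tell which was meant

# Giving nulls and empty strings distinct, configurable representations removes the
# ambiguity, e.g. (markers picked arbitrarily for illustration):
#   null         -> written as \N   (nullValue)
#   empty string -> written as ""   (emptyValue)
```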

@gatorsmile (Member)

Have we documented the behavior changes in the migration guide? If not, can we do it?

@HyukjinKwon (Member)

This is quite a corner case (see the elaborated cases in the JIRA SPARK-17916), and there's ambiguity about whether to treat this as a bug or as a proper behaviour change; however, I don't object if it's considered worth mentioning.

cc @MaxGekk for a followup

@MaxGekk (Member) commented Sep 7, 2018

> cc @MaxGekk for a followup

@HyukjinKwon Do you mean updating the migration guide in master and probably in Spark 2.4? I don't think this should be considered a bug, because the current and previous versions of Spark can read saved CSV files correctly. Yes, for now empty strings are saved as "" and nulls as nothing, but this is intended to distinguish empty strings from nulls on read. And the produced CSV files are valid and can be read by any mature CSV library.

@HyukjinKwon (Member)

Oh, no, I mean we fixed a bug.

@gatorsmile (Member)

@MaxGekk Could you take this PR over? I think we need to merge this into Spark 2.4. Users can restore the previous behavior with the new emptyValue option, if needed. Also, update the migration guide about the behavior change and explain how to set emptyValue.

@MaxGekk (Member) commented Sep 8, 2018

@gatorsmile @HyukjinKwon Please, take a look at #22367

@HyukjinKwon (Member)

@mmolimar, let's leave this closed since the newer one is open, BTW. You will be credited as the primary author of #22367 anyway.

asfgit pushed a commit that referenced this pull request Sep 11, 2018
…sed as null when nullValue is set.

## What changes were proposed in this pull request?

In the PR, I propose new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert previous behavior when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes #22234
Closes #22367

## How was this patch tested?

It was tested by `CSVSuite` and new tests added in the PR #22234

Closes #22389 from MaxGekk/csv-empty-value-master.

Lead-authored-by: Mario Molina <mmolimar@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit c9cb393)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
asfgit closed this in c9cb393 on Sep 11, 2018
fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Sep 13, 2018
…sed as null when nullValue is set.
