
[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly #14118

Closed

lw-lin wants to merge 6 commits into master from lw-lin:csv-cast-null

Conversation

@lw-lin (Contributor) commented Jul 9, 2016

Problem

CSV in Spark 2.0.0:

  • does not read null values back correctly for certain data types such as Boolean, TimestampType, DateType -- this is a regression compared to 1.6;
  • does not read empty values (specified by options.nullValue) as nulls for StringType -- this is compatible with 1.6 but leads to problems like SPARK-16903 (see the repro sketch below).
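
A minimal repro sketch of the two symptoms (the session setup, file name, and schema here are illustrative assumptions, not part of the patch):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object CsvNullRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv-null-repro").getOrCreate()

    // Hypothetical input nulls.csv:
    //   true,abc
    //   ,
    val schema = StructType(Seq(
      StructField("b", BooleanType, nullable = true),
      StructField("s", StringType, nullable = true)))

    // In 2.0.0 the empty Boolean field on the second line is not read back as
    // null (the regression vs 1.6), and the empty String field comes back as
    // "" rather than null.
    spark.read.schema(schema).csv("nulls.csv").show()
  }
}
```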

What changes were proposed in this pull request?

This patch makes changes to read all empty values back as nulls.

How was this patch tested?

New test cases.

@shivaram (Contributor) commented Jul 9, 2016

cc @rxin

@HyukjinKwon (Member) commented Jul 9, 2016

Actually, #12921 includes changes duplicated here, but I will close mine since I like this one more than mine. It would be great if this had [SPARK-15144] in the title so that the two can be closed together.

```scala
if (datum == options.nullValue && nullable) {
  null
} else {
```

```scala
if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
```
@HyukjinKwon (Member):

Do you mind if I ask why StringType is excluded?

Contributor:

It'd be great to document why string type is ignored here.

@lw-lin (Contributor, Author):

> ... why StringType is excluded?

Hi @HyukjinKwon, it's just to keep consistency with what we did in spark-csv for 1.6. Actually I don't have a strong preference here -- maybe we should not ignore StringType? @rxin could you share some thoughts? Thanks!

@SparkQA commented Jul 9, 2016

Test build #62025 has finished for PR 14118 at commit e782616.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 9, 2016

@shivaram did you review this?

@shivaram (Contributor):

No - I just noticed a JIRA that said it was a regression, so I wanted to make sure you caught this in the RC cycle

@lw-lin (Contributor, Author) commented Jul 10, 2016

The diff that GitHub shows for CSVInferSchema.scala is a mess. The actual diff (which is quite small) is:

[diff screenshot]

@lw-lin (Contributor, Author) commented Jul 10, 2016

FYI, before SPARK-14143, null values had been handled this way:

```scala
if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
  null
} else {
  castType match ...
}
```

Then in SPARK-14143, it was first broken down by numeric data type in 93ac6bb, to handle a byte-specific null value, a short-specific null value, an int-specific null value, and so on:

```scala
castType match {
  case _: ByteType => if (datum == params.byteNullValue && nullable) null else datum.toByte
  case _: ShortType => if (datum == params.shortNullValue && nullable) null else datum.toShort
  case _: IntegerType => if (datum == params.integerNullValue && nullable) null else datum.toInt
  ...
}
```

Then in 698b4b4, the byte-specific, short-specific, int-specific, ... null values were reduced back to one single null value:

```scala
castType match {
  case _: ByteType => if (datum == params.nullValue && nullable) null else datum.toByte
  case _: ShortType => if (datum == params.nullValue && nullable) null else datum.toShort
  case _: IntegerType => if (datum == params.nullValue && nullable) null else datum.toInt
  ...
}
```

Along with that change, we introduced a regression in handling non-numeric data types like BooleanType. Since we don't need to handle type-specific null values, this patch switches back to the way we handled null values in the 1.6 days (and thus fixes the regression):

```scala
if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
  null
} else {
  castType match ...
}
```

@HyukjinKwon (Member) commented Jul 10, 2016

I just wonder why strings should be ignored in the case above. I mean, you just said "we don't need to handle type-specific null values", and it seems strings could be handled together as well.

@lw-lin (Contributor, Author) commented Jul 10, 2016

Hi @HyukjinKwon. The explanation above is intended to help reviewers better understand how we introduced the regression. Regarding whether StringType should be ignored or not, I don't have a strong preference :)

@lw-lin changed the title [SPARK-16462][SPARK-16460][SQL] Make CSV cast null values properly → [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly on Jul 10, 2016
@rxin (Contributor) commented Jul 11, 2016

Thanks for the information. I'm still confused. From an end-user perspective, do we need to handle StringType there?

@HyukjinKwon (Member) commented Jul 11, 2016

IMHO, handling StringType would at least let users round-trip nulls through writing and reading. CSV writes null according to nullValue, but it can't read those values back (as strings) if it does not handle StringType. (I don't mind either way, but it would be great if this behaviour were decided and confirmed.)

@lw-lin (Contributor, Author) commented Jul 11, 2016

I think @HyukjinKwon has made a good point: it's kind of strange that null strings can be written out but cannot be read back as nulls.

So for StringType:

| | nulls write & read | consistent with 1.6? |
| --- | --- | --- |
| option (a) | null strings can be written out, but can NOT be read back as nulls | yes |
| option (b) | null strings can be written out, and can be read back as nulls | NO |

@HyukjinKwon and I are somewhat inclined toward option (b) because it sounds reasonable to end-users. @rxin would you mind making a final decision? Thanks!
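
For illustration, a sketch of the round-trip that option (b) would enable (the session, schema, and output path are assumptions, not code from this patch):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("s", StringType, nullable = true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("abc"), Row(null))), schema)

// With the default nullValue of "", the null row is written as an empty field.
df.write.mode("overwrite").csv("/tmp/roundtrip")

// Option (a): the empty field is read back as "" (write and read asymmetric).
// Option (b): the empty field is read back as null.
spark.read.schema(schema).csv("/tmp/roundtrip").show()
```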

@deanchen (Contributor):

It would be great to get a resolution to this. We're running into issues in production attempting to parse CSVs with nullable dates. Personally I prefer option (b) for our use case.

@lw-lin (Contributor, Author) commented Aug 1, 2016

Some findings as I dug a little:

  1. Since Support for nullable schema types (databricks/spark-csv#102, 2015), we would cast "" as null for all types other than strings. For strings, "" would still be "";
  2. Then we added treatEmptyValuesAsNulls in Roundtrip null values of any type (databricks/spark-csv#147, 2015), after which "" would still be "" when treatEmptyValuesAsNulls == false but would be null otherwise;
  3. Then we added nullValue in Add nullValue being respected when parsing CSVs (databricks/spark-csv#224, 2015), so people could specify some string like "MISSING" other than the default "" to represent null values.

After 1-3 above, spark-csv handled strings as follows, which seemed reasonable and was backward-compatible:

| | (default) when nullValue == "" | when nullValue == "MISSING" |
| --- | --- | --- |
| (default) when treatEmptyValuesAsNulls == false | "" would cast to "" | "" would cast to "" |
| when treatEmptyValuesAsNulls == true | "" would cast to null | "" would cast to "" |
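
For reference, a sketch of how those 1.6-era options were used with spark-csv (option names as documented in that package; the path and surrounding context are hypothetical):

```scala
// Spark 1.6 + spark-csv: empty string fields become null only when
// treatEmptyValuesAsNulls is true; nullValue handles custom markers like "MISSING".
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "MISSING")
  .load("data.csv")
```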

However we don't have this treatEmptyValuesAsNulls in Spark 2.0. @falaki would it be OK with you if I add it back?

@falaki (Contributor) commented Aug 5, 2016

@lw-lin thanks a lot for the clear summary.
After seeing some use cases, I think it is better to apply nullValue to all types, including StringType. treatEmptyValuesAsNulls seems like a special case of nullValue = "" and adds to the confusion of all these options.

@SparkQA commented Aug 5, 2016

Test build #63260 has finished for PR 14118 at commit bf01cea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin (Contributor, Author) commented Aug 5, 2016

@falaki could you take a look at the latest update: [bf01cea] StringType should also respect nullValue? Thanks!


```scala
assert(
  CSVTypeCast.castTo(null, StringType, nullable = true, CSVOptions("nullValue", "null")) ==
    null)
```
Member:

Maybe we can use assertNull as you just did above?

@lw-lin (Contributor, Author) commented Aug 5, 2016:

Oh, thanks! I did this intentionally so that the diff is minimal and clear to reviewers. Let's see what others think; I'm glad to change this if necessary. :)

nvm, this indeed should be assertNull.

@HyukjinKwon (Member):

This change looks reasonable to me.

@djk121 commented Aug 5, 2016

Is there a way to fall back to the old Databricks CSV library in Spark 2.0 to work around this? Round-tripping worked there with .option("nullValue", "null"), but I don't see a way to get round-tripping working with any combination of options in 2.0.

@rxin (Contributor) commented Aug 5, 2016

You can specify "com.databricks.spark.csv" as the source.


@djk121 commented Aug 6, 2016

I'm doing this:

```scala
val dataframe = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "null")
  .schema(schema)
  .load(csvPath)
```

I then take that dataframe and attempt to write it out to parquet like so:

```scala
dataframe.write
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .parquet(outputPath)
```

When the Parquet writes run, I get the same traceback as above. I can see from that traceback that it's org.apache.spark.sql.execution.datasources.csv, so for whatever reason com.databricks.spark.csv isn't being used. Do I need to do something different to force it to be used?

@HyukjinKwon (Member) commented Aug 6, 2016

BTW, this problem exists in the external CSV data source as well (but only for StringType). The root cause of https://github.com/databricks/spark-csv/issues/370 is this issue, and if my understanding is correct, the external CSV data source would not work in Spark 2.0 anyway (due to incompatibility).

@SparkQA commented Aug 10, 2016

Test build #63493 has finished for PR 14118 at commit f58e33d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@devmanhinton:
Just as a +1: I would at least like the option to have "" autocast to null when read in from CSV. This is helpful for me in production, given that UDFs skip function application when the input is null but not when it is an empty string.

@SparkQA commented Aug 19, 2016

Test build #64040 has finished for PR 14118 at commit 74b4dd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin (Contributor, Author) commented Aug 19, 2016

@rxin yes, all empty values become null values once they are read back!

E.g. given test.csv:

```
1,,3,
```

`spark.read.csv("test.csv").show` would produce:

```
+---+----+---+----+
|_c0| _c1|_c2| _c3|
+---+----+---+----+
|  1|null|  3|null|
+---+----+---+----+
```

@rxin (Contributor) commented Aug 19, 2016

What if I am explicitly writing an empty string out? Does it become just 1,,2?

Can you also clarify whether this is a behavior change, or something else?

@HyukjinKwon (Member) commented Aug 19, 2016

@rxin please let me explain why I thought this looks good, in case it is helpful.

Yes, but we should set nullValue for writing nulls. So, I think, setting nullValue to "" means treating "" as null.

For example, if we have the dataframe as below:

```
+------+
|     a|
+------+
|   abc|
|  null|
+------+
```

with nullValue set to "abc", this will write:

```
abc
abc
```

Here we end up with no difference between null and abc, but since users set nullValue to abc for output, they would understand this behaviour.

I mean, as far as I know there is no (standard) expression for an actual null as a string, so we are explicitly giving it a representation; that is why I thought it is okay even if we can't differentiate nullValue from an actual null.
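
A sketch of the write side being described (assuming a DataFrame `df` with a nullable string column, as above; the output path is hypothetical):

```scala
// With nullValue set to "abc", both the literal string "abc" and an actual
// null are written out as "abc", so the two are indistinguishable on disk.
df.write
  .option("nullValue", "abc")
  .mode("overwrite")
  .csv("/tmp/null-as-abc")
```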

@HyukjinKwon (Member) commented Aug 19, 2016

I re-edited this as I found this comment super confusing. I meant suggesting \u0000 to express null (which apparently looks like an empty string) if we need to differentiate the empty string from null.

@lw-lin (Contributor, Author) commented Aug 29, 2016

> What if I am explicitly writing an empty string out? Does it become just 1,,2?

Yes. It becomes 1,,2 in 2.0, and the same 1,,2 with this patch -- no behavior change.

> Can you also clarify whether this is a behavior change, or something else?

This patch behaves differently from 2.0 when reading 1,,2 back: given the default nullValue (the empty string ""), 1,,2 is read back as 1,,2 in 2.0, but as 1,[null],2 with this patch.

@rxin ~

@lw-lin (Contributor, Author) commented Aug 29, 2016

Jenkins retest this please

@SparkQA commented Aug 29, 2016

Test build #64576 has finished for PR 14118 at commit d5357f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor):

retest this please

@SparkQA commented Sep 7, 2016

Test build #65042 has finished for PR 14118 at commit d5357f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Sep 12, 2016

@lw-lin just checking that you think this is still good to go? @HyukjinKwon do you have an opinion on the current state?

@HyukjinKwon (Member) commented Sep 12, 2016

I support this PR. But just to make sure, I'd like to bring up a reference.

It seems that, at least, the na.strings option of read.csv in R does what is proposed here:

```r
bt <- "A,B,C,D
10,20,NaN
30,,40
40,30,20
,NA,20"

b <- read.csv(text = bt, na.strings = c("NA", "NaN", ""))
b
```

prints

```
   A  B  C  D
1 10 20 NA NA
2 30 NA 40 NA
3 40 30 20 NA
4 NA NA 20 NA
```

@lw-lin (Contributor, Author) commented Sep 14, 2016

@HyukjinKwon thanks for the information!

@srowen yea I still think this is good to go.

@lw-lin (Contributor, Author) commented Sep 14, 2016

Jenkins retest this please

@SparkQA commented Sep 14, 2016

Test build #65343 has finished for PR 14118 at commit d5357f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -329,7 +329,8 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
                           being read should be skipped. If None is set, it uses
                           the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
```
@srowen (Member):

I think you can omit the "since x.y.z" in this PR. The new text will be in the docs for the version it applies to and not earlier ones.

@lw-lin (Contributor, Author):

This patch introduces a behavior change, i.e. how we deal with nullValue for the string type. So let's keep the "since x.y.z" note so that people can find a clue?

```scala
// If it fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility.
DateTimeUtils.stringToTime(datum).getTime * 1000L
```

```scala
if (datum == options.nullValue && nullable) {
```
@srowen (Member):

Is it possibly worth checking if (nullable && ...) to avoid the comparison if it's not nullable?

@lw-lin (Contributor, Author):

Yea, let me do that. Thanks.
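
A sketch of the reordered check (illustrative, not the actual Spark code): since `&&` short-circuits in Scala, testing the cheap `nullable` flag first skips the string comparison entirely for non-nullable fields.

```scala
// Hypothetical standalone helper mirroring the names in the diff above.
def castToOrNull(datum: String, nullable: Boolean, nullValue: String): Any =
  if (nullable && datum == nullValue) null // `datum == nullValue` never runs when nullable is false
  else datum                               // the real code would dispatch on castType here
```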

```scala
case _: IntegerType => datum.toInt
case _: LongType => datum.toLong
case _: FloatType =>
  if (datum == options.nanValue) {
```
@srowen (Member):

Can these nested if-else statements be a match statement? Or is there some overhead to it that is too significant?

@lw-lin (Contributor, Author) commented Sep 16, 2016:

Yea, they should be a match statement -- let me update this, thanks!
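
For illustration, a sketch of the match-based rewrite (only options.nanValue appears in the diff above; the fallback branch and the surrounding setup are assumptions). In Scala, a stable identifier such as a val field can be used directly as a pattern, matching by equality:

```scala
// Hypothetical standalone fragment, not the actual Spark code.
case class Options(nanValue: String)
val options = Options("NaN")

def toFloat(datum: String): Float = datum match {
  case options.nanValue => Float.NaN // stable-identifier pattern: equality check
  case _                => datum.toFloat
}
```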

@hvanhovell (Contributor):

@lw-lin could you address @srowen's comments? Otherwise this is good to go.

@SparkQA commented Sep 16, 2016

Test build #65466 has finished for PR 14118 at commit 365cbfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Sep 18, 2016

Merged to master/2.0

@asfgit closed this in 1dbb725 Sep 18, 2016
asfgit pushed a commit that referenced this pull request Sep 18, 2016
[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly

## Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression compared to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

## How was this patch tested?

New test cases.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14118 from lw-lin/csv-cast-null.

(cherry picked from commit 1dbb725)
Signed-off-by: Sean Owen <sowen@cloudera.com>
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016 (same commit message as above; closes apache#14118 from lw-lin/csv-cast-null).
@lw-lin deleted the csv-cast-null branch November 7, 2016