
[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly #14118

Closed

lw-lin wants to merge 6 commits into master from lw-lin:csv-cast-null

Conversation

@lw-lin (Contributor) commented Jul 9, 2016

Problem

CSV in Spark 2.0.0:

  • does not read null values back correctly for certain data types such as Boolean, TimestampType, DateType -- this is a regression compared to 1.6;
  • does not read empty values (specified by options.nullValue) as nulls for StringType -- this is compatible with 1.6 but leads to problems like SPARK-16903 (see the repro sketch below).
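
A minimal repro sketch of the two symptoms (the session setup, file name, and schema here are illustrative assumptions, not part of the patch):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object CsvNullRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv-null-repro").getOrCreate()

    // Hypothetical input nulls.csv:
    //   true,abc
    //   ,
    val schema = StructType(Seq(
      StructField("b", BooleanType, nullable = true),
      StructField("s", StringType, nullable = true)))

    // In 2.0.0 the empty Boolean field on the second line is not read back as
    // null (the regression vs 1.6), and the empty String field comes back as
    // "" rather than null.
    spark.read.schema(schema).csv("nulls.csv").show()
  }
}
```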

What changes were proposed in this pull request?

This patch makes changes to read all empty values back as nulls.

How was this patch tested?

New test cases.

@shivaram (Contributor) commented Jul 9, 2016

cc @rxin

@HyukjinKwon (Member) commented Jul 9, 2016

Actually, #12921 includes changes duplicated here, but I will close mine since I like this one more than mine. It would be great if this had [SPARK-15144] in the title so that the two can be closed together.

```scala
if (datum == options.nullValue && nullable) {
  null
} else {
```

```scala
if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
```
@HyukjinKwon (Member):

Do you mind if I ask why StringType is excluded?

Contributor:

It'd be great to document why string type is ignored here.

@lw-lin (Contributor, Author):

> ... why StringType is excluded?

Hi @HyukjinKwon, it's just to keep consistency with what we did in spark-csv for 1.6. Actually I don't have a strong preference here -- maybe we should not ignore StringType? @rxin could you share some thoughts? Thanks!

@SparkQA commented Jul 9, 2016

Test build #62025 has finished for PR 14118 at commit e782616.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 9, 2016

@shivaram did you review this?

@shivaram (Contributor):

No - I just noticed a JIRA that said it was a regression, so I wanted to make sure you caught this in the RC cycle

@lw-lin (Contributor, Author) commented Jul 10, 2016

The diff that GitHub shows for CSVInferSchema.scala is a mess. The actual diff (which is quite small) is:

[diff screenshot]

@lw-lin (Contributor, Author) commented Jul 10, 2016

FYI, before SPARK-14143, null values had been handled this way:

```scala
if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
  null
} else {
  castType match ...
}
```

Then in SPARK-14143, it was first broken down by numeric data type in 93ac6bb, to handle a byte-specific null value, a short-specific null value, an int-specific null value, and so on:

```scala
castType match {
  case _: ByteType => if (datum == params.byteNullValue && nullable) null else datum.toByte
  case _: ShortType => if (datum == params.shortNullValue && nullable) null else datum.toShort
  case _: IntegerType => if (datum == params.integerNullValue && nullable) null else datum.toInt
  ...
}
```

Then in 698b4b4, the byte-specific, short-specific, int-specific, ... null values were reduced back to one single null value:

```scala
castType match {
  case _: ByteType => if (datum == params.nullValue && nullable) null else datum.toByte
  case _: ShortType => if (datum == params.nullValue && nullable) null else datum.toShort
  case _: IntegerType => if (datum == params.nullValue && nullable) null else datum.toInt
  ...
}
```

Along with that change, we introduced a regression in handling non-numeric data types like BooleanType. Since we don't need to handle type-specific null values, this patch switches back to the way we handled null values in the 1.6 days (and thus fixes the regression):

```scala
if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
  null
} else {
  castType match ...
}
```

@HyukjinKwon (Member) commented Jul 10, 2016

I just wonder why strings should be ignored in the case above. I mean, you just said "we don't need to handle type-specific null values", and it seems strings could be handled together as well.

@lw-lin (Contributor, Author) commented Jul 10, 2016

Hi @HyukjinKwon. The explanation above is intended to help reviewers better understand how we introduced the regression. Regarding whether StringType should be ignored or not, I don't have a strong preference :)

@lw-lin changed the title [SPARK-16462][SPARK-16460][SQL] Make CSV cast null values properly → [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly on Jul 10, 2016
@rxin (Contributor) commented Jul 11, 2016

Thanks for the information. I'm still confused. From an end-user perspective, do we need to handle StringType there?

@HyukjinKwon (Member) commented Jul 11, 2016

IMHO, handling StringType would at least let users round-trip nulls through writing and reading. CSV writes null according to nullValue, but it can't read those values back (as strings) if it does not handle StringType. (I don't mind either way, but it would be great if this behaviour were decided and confirmed.)

@lw-lin (Contributor, Author) commented Jul 11, 2016

I think @HyukjinKwon has made a good point: it's kind of strange that null strings can be written out but cannot be read back as nulls.

So for StringType:

| | nulls write & read | consistent with 1.6? |
| --- | --- | --- |
| option (a) | null strings can be written out, but can NOT be read back as nulls | yes |
| option (b) | null strings can be written out, and can be read back as nulls | NO |

@HyukjinKwon and I are somewhat inclined toward option (b) because it sounds reasonable to end-users. @rxin would you mind making a final decision? Thanks!
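
For illustration, a sketch of the round-trip that option (b) would enable (the session, schema, and output path are assumptions, not code from this patch):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("s", StringType, nullable = true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("abc"), Row(null))), schema)

// With the default nullValue of "", the null row is written as an empty field.
df.write.mode("overwrite").csv("/tmp/roundtrip")

// Option (a): the empty field is read back as "" (write and read asymmetric).
// Option (b): the empty field is read back as null.
spark.read.schema(schema).csv("/tmp/roundtrip").show()
```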

@deanchen (Contributor):

It would be great to get a resolution to this. We're running into issues in production attempting to parse CSVs with nullable dates. Personally I prefer option (b) for our use case.

@lw-lin (Contributor, Author) commented Aug 1, 2016

Some findings as I dug a little:

  1. Since Support for nullable schema types (databricks/spark-csv#102, 2015), we would cast "" as null for all types other than strings. For strings, "" would still be "";
  2. Then we added treatEmptyValuesAsNulls in Roundtrip null values of any type (databricks/spark-csv#147, 2015), after which "" would still be "" when treatEmptyValuesAsNulls == false but would be null otherwise;
  3. Then we added nullValue in Add nullValue being respected when parsing CSVs (databricks/spark-csv#224, 2015), so people could specify some string like "MISSING" other than the default "" to represent null values.

After 1-3 above, spark-csv handled strings as follows, which seemed reasonable and was backward-compatible:

| | (default) when nullValue == "" | when nullValue == "MISSING" |
| --- | --- | --- |
| (default) when treatEmptyValuesAsNulls == false | "" would cast to "" | "" would cast to "" |
| when treatEmptyValuesAsNulls == true | "" would cast to null | "" would cast to "" |
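
For reference, a sketch of how those 1.6-era options were used with spark-csv (option names as documented in that package; the path and surrounding context are hypothetical):

```scala
// Spark 1.6 + spark-csv: empty string fields become null only when
// treatEmptyValuesAsNulls is true; nullValue handles custom markers like "MISSING".
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "MISSING")
  .load("data.csv")
```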

However we don't have this treatEmptyValuesAsNulls in Spark 2.0. @falaki would it be OK with you if I add it back?

@falaki (Contributor) commented Aug 5, 2016

@lw-lin thanks a lot for the clear summary.
After seeing some use cases, I think it is better to apply nullValue to all types, including StringType. treatEmptyValuesAsNulls seems like a special case of nullValue = "" and adds to the confusion of all these options.

@SparkQA commented Aug 5, 2016

Test build #63260 has finished for PR 14118 at commit bf01cea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin (Contributor, Author) commented Aug 5, 2016

@falaki could you take a look at the latest update: [bf01cea] StringType should also respect nullValue? Thanks!


```scala
assert(
  CSVTypeCast.castTo(null, StringType, nullable = true, CSVOptions("nullValue", "null")) ==
    null)
```
Member:

Maybe we can use assertNull as you just did above?

@lw-lin (Contributor, Author) commented Aug 5, 2016:

Oh, thanks! I did this intentionally so that the diff is minimal and clear to reviewers. Let's see what others think; I'm glad to change this if necessary. :)

nvm, this indeed should be assertNull.

@HyukjinKwon (Member):

This change looks reasonable to me.

@djk121 commented Aug 5, 2016

Is there a way to fall back to the old Databricks CSV library in Spark 2.0 to work around this? Round-tripping worked there with .option("nullValue", "null"), but I don't see a way to get round-tripping working with any combination of options in 2.0.

@rxin (Contributor) commented Aug 5, 2016

You can specify "com.databricks.spark.csv" as the source.


@djk121 commented Aug 6, 2016

I'm doing this:

```scala
val dataframe = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "null")
  .schema(schema)
  .load(csvPath)
```

I then take that dataframe and attempt to write it out to parquet like so:

```scala
dataframe.write
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .parquet(outputPath)
```

When the Parquet writes run, I get the same traceback as above. I can see from that traceback that it's org.apache.spark.sql.execution.datasources.csv, so for whatever reason com.databricks.spark.csv isn't being used. Do I need to do something different to force it to be used?

@HyukjinKwon (Member) commented Aug 6, 2016

BTW, this problem exists in the external CSV data source as well (but only for StringType). The root cause of https://github.com/databricks/spark-csv/issues/370 is this issue, and if my understanding is correct, the external CSV data source would not work in Spark 2.0 anyway (due to incompatibility).

@SparkQA commented Aug 10, 2016

Test build #63493 has finished for PR 14118 at commit f58e33d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@devmanhinton:
Just as a +1: I would at least like the option to have "" autocast to null when read in from CSV. This is helpful for me in production, given that UDFs skip function application when the input is null but not when it is an empty string.

@SparkQA commented Aug 19, 2016

Test build #64040 has finished for PR 14118 at commit 74b4dd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin (Contributor, Author) commented Aug 19, 2016

@rxin yes, all empty values become null values once they are read back!

E.g. given test.csv:

```
1,,3,
```

`spark.read.csv("test.csv").show` would produce:

```
+---+----+---+----+
|_c0| _c1|_c2| _c3|
+---+----+---+----+
|  1|null|  3|null|
+---+----+---+----+
```

@rxin (Contributor) commented Aug 19, 2016

What if I am explicitly writing an empty string out? Does it become just 1,,2?

Can you also clarify whether this is a behavior change, or something else?

@HyukjinKwon (Member) commented Aug 19, 2016

@rxin please let me explain why I thought this looks good, in case it is helpful.

Yes, but we should set nullValue for writing nulls. So, I think, setting nullValue to "" means treating "" as null.

For example, if we have the dataframe as below:

```
+------+
|     a|
+------+
|   abc|
|  null|
+------+
```

with nullValue set to "abc", this will write:

```
abc
abc
```

Here we end up with no difference between null and abc, but since users set nullValue to abc for output, they would understand this behaviour.

I mean, as far as I know there is no (standard) expression for an actual null as a string, so we are explicitly giving it a representation; that is why I thought it is okay even if we can't differentiate nullValue from an actual null.
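
A sketch of the write side being described (assuming a DataFrame `df` with a nullable string column, as above; the output path is hypothetical):

```scala
// With nullValue set to "abc", both the literal string "abc" and an actual
// null are written out as "abc", so the two are indistinguishable on disk.
df.write
  .option("nullValue", "abc")
  .mode("overwrite")
  .csv("/tmp/null-as-abc")
```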

@HyukjinKwon (Member) commented Aug 19, 2016

I re-edited this as I found this comment super confusing. I meant suggesting \u0000 to express null (which apparently looks like an empty string) if we need to differentiate the empty string from null.

@lw-lin (Contributor, Author) commented Aug 29, 2016

> What if I am explicitly writing an empty string out? Does it become just 1,,2?

Yes. It becomes 1,,2 in 2.0, and the same 1,,2 with this patch -- no behavior change.

> Can you also clarify whether this is a behavior change, or something else?

This patch behaves differently from 2.0 when reading 1,,2 back: given the default nullValue (the empty string ""), 1,,2 is read back as 1,,2 in 2.0, but as 1,[null],2 with this patch.

@rxin ~

@lw-lin (Contributor, Author) commented Aug 29, 2016

Jenkins retest this please

@SparkQA commented Aug 29, 2016

Test build #64576 has finished for PR 14118 at commit d5357f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor):

retest this please

@SparkQA commented Sep 7, 2016

Test build #65042 has finished for PR 14118 at commit d5357f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Sep 12, 2016

@lw-lin just checking that you think this is still good to go? @HyukjinKwon do you have an opinion on the current state?

@HyukjinKwon (Member) commented Sep 12, 2016

I support this PR. But just to make sure, I'd like to bring up a reference.

It seems that, at least, the na.strings option of read.csv in R does what is proposed here:

```r
bt <- "A,B,C,D
10,20,NaN
30,,40
40,30,20
,NA,20"

b <- read.csv(text = bt, na.strings = c("NA", "NaN", ""))
b
```

prints

```
   A  B  C  D
1 10 20 NA NA
2 30 NA 40 NA
3 40 30 20 NA
4 NA NA 20 NA
```

@lw-lin (Contributor, Author) commented Sep 14, 2016

@HyukjinKwon thanks for the information!

@srowen yea I still think this is good to go.

@lw-lin (Contributor, Author) commented Sep 14, 2016

Jenkins retest this please

@SparkQA commented Sep 14, 2016

Test build #65343 has finished for PR 14118 at commit d5357f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -329,7 +329,8 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
                           being read should be skipped. If None is set, it uses
                           the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
```
@srowen (Member):

I think you can omit the "since x.y.z" in this PR. The new text will be in the docs for the version it applies to and not earlier ones.

@lw-lin (Contributor, Author):

This patch introduces a behavior change, i.e. how we deal with nullValue for the string type. So let's keep the "since x.y.z" note so that people can find a clue?

```scala
// If it fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility.
DateTimeUtils.stringToTime(datum).getTime * 1000L
```

```scala
if (datum == options.nullValue && nullable) {
```
@srowen (Member):

Is it possibly worth checking if (nullable && ...) to avoid the comparison if it's not nullable?

@lw-lin (Contributor, Author):

Yea, let me do that. Thanks.
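
A sketch of the reordered check (illustrative, not the actual Spark code): since `&&` short-circuits in Scala, testing the cheap `nullable` flag first skips the string comparison entirely for non-nullable fields.

```scala
// Hypothetical standalone helper mirroring the names in the diff above.
def castToOrNull(datum: String, nullable: Boolean, nullValue: String): Any =
  if (nullable && datum == nullValue) null // `datum == nullValue` never runs when nullable is false
  else datum                               // the real code would dispatch on castType here
```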

```scala
case _: IntegerType => datum.toInt
case _: LongType => datum.toLong
case _: FloatType =>
  if (datum == options.nanValue) {
```
@srowen (Member):

Can these nested if-else statements be a match statement? Or is there some overhead to it that is too significant?

@lw-lin (Contributor, Author) commented Sep 16, 2016:

Yea, they should be a match statement -- let me update this, thanks!
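
For illustration, a sketch of the match-based rewrite (only options.nanValue appears in the diff above; the fallback branch and the surrounding setup are assumptions). In Scala, a stable identifier such as a val field can be used directly as a pattern, matching by equality:

```scala
// Hypothetical standalone fragment, not the actual Spark code.
case class Options(nanValue: String)
val options = Options("NaN")

def toFloat(datum: String): Float = datum match {
  case options.nanValue => Float.NaN // stable-identifier pattern: equality check
  case _                => datum.toFloat
}
```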

@hvanhovell (Contributor):

@lw-lin could you address @srowen's comments? Otherwise this is good to go.

@SparkQA commented Sep 16, 2016

Test build #65466 has finished for PR 14118 at commit 365cbfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Sep 18, 2016

Merged to master/2.0

@asfgit closed this in 1dbb725 Sep 18, 2016
asfgit pushed a commit that referenced this pull request Sep 18, 2016
[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly

## Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression compared to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

## How was this patch tested?

New test cases.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14118 from lw-lin/csv-cast-null.

(cherry picked from commit 1dbb725)
Signed-off-by: Sean Owen <sowen@cloudera.com>
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016 (same commit message as above; closes apache#14118 from lw-lin/csv-cast-null).
@lw-lin deleted the csv-cast-null branch November 7, 2016