[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly #14118
Conversation
cc @rxin
Actually, #12921 includes changes that duplicate these, but I will close mine since I like this one better. It would be great if it has …
if (datum == options.nullValue && nullable) {
  null
} else {

if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
Do you mind if I ask why StringType is excluded?
It'd be great to document why string type is ignored here.
... why StringType is excluded?
Hi @HyukjinKwon, it's just to keep consistency with what we did in spark-csv for 1.6. Actually I don't have a strong preference here -- maybe we should not ignore StringType? @rxin could you share some thoughts? Thanks!
Test build #62025 has finished for PR 14118 at commit
@shivaram did you review this?
No - I just noticed a JIRA that said it was a regression, so I wanted to make sure you caught this in the RC cycle.
FYI, before SPARK-14143, null values had been handled this way:

if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
  null
} else {
  castType match ...
}

Then in SPARK-14143, this was first broken down in 93ac6bb into type-specific null values (a byte-specific null value, a short-specific null value, an int-specific null value, ...):

castType match
  case _: ByteType => if (datum == params.byteNullValue && nullable) null else datum.toByte
  case _: ShortType => if (datum == params.shortNullValue && nullable) null else datum.toShort
  case _: IntegerType => if (datum == params.integerNullValue && nullable) null else datum.toInt
  ...

Then in 698b4b4 the type-specific null values were reduced back to one single nullValue:

castType match
  case _: ByteType => if (datum == params.nullValue && nullable) null else datum.toByte
  case _: ShortType => if (datum == params.nullValue && nullable) null else datum.toShort
  case _: IntegerType => if (datum == params.nullValue && nullable) null else datum.toInt
  ...

Along with that change, we introduced a regression for non-numeric data types: the original guard

if (datum == options.nullValue && nullable && (!castType.isInstanceOf[StringType])) {
  null
} else {
  castType match ...
}

was dropped, so non-numeric types no longer get the null check at all.
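To make the regression concrete, here is a minimal Python sketch (an emulation for illustration only -- not Spark's actual CSVTypeCast code; the function names and type tags are invented) contrasting the old unified null guard with the regressed per-numeric-type check:

```python
# Hypothetical, simplified emulation -- not Spark's actual code.

def cast_pre_spark_14143(datum, cast_type, nullable=True, null_value=""):
    """Old behavior: one null guard covering every type except string."""
    if datum == null_value and nullable and cast_type != "string":
        return None
    casters = {"int": int, "float": float,
               "bool": lambda s: s == "true", "string": str}
    return casters[cast_type](datum)

def cast_post_698b4b4(datum, cast_type, nullable=True, null_value=""):
    """Regressed behavior: the null guard survives only in numeric branches."""
    if cast_type == "int":
        return None if (datum == null_value and nullable) else int(datum)
    if cast_type == "float":
        return None if (datum == null_value and nullable) else float(datum)
    if cast_type == "bool":
        return datum == "true"  # no null guard: "" silently becomes False
    return datum                # string passes through unchanged

print(cast_pre_spark_14143("", "bool"))  # None -- read back as null
print(cast_post_698b4b4("", "bool"))     # False -- the regression
```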
I just wonder why string should be ignored in the case above. I mean, you just said "we don't need to handle type-specific null values", and it seems strings are okay to be handled together.

@HyukjinKwon hi. The explanation above is intended to help reviewers better understand how we introduced the regression. Regarding whether …

Thanks for the information. I'm still confused. From an end-user perspective, do we need to handle StringType there?

IMHO, handling …

I think @HyukjinKwon has made a good point: it's kind of strange that null strings can be written out but cannot be read back as nulls. So for …

@HyukjinKwon and I are somewhat inclined to option (b) because it sounds reasonable to end-users. @rxin would you mind making a final decision? Thanks!

Would be great to get a resolution to this. We're running into issues in production attempting to parse CSVs with nullable dates. Personally I prefer option (b) for our use case.
Some findings as I dug a little:

Then after the above 1./2./3., in …

However we don't have this …
@lw-lin thanks a lot for the clear summary.
Test build #63260 has finished for PR 14118 at commit
@falaki could you take a look at the latest update: [bf01cea] StringType should also respect …
assert(
  CSVTypeCast.castTo(null, StringType, nullable = true, CSVOptions("nullValue", "null")) ==
    null)
Maybe we can use assertNull as you just did above?
Oh thanks!
I did this intentionally so that the diff is minimal and clear to reviewers. Maybe let's see what others think and I'm glad to change this if necessary. :)
nvm, this indeed should be assertNull
This change looks reasonable to me.
Is there a way to fall back to the old Databricks CSV library in Spark 2.0 to work around this? Round-tripping worked there with .option("nullValue", "null"), but I don't see a way to get round-tripping working with any combination of options in 2.0.
You can specify "com.databricks.spark.csv" as the source.
I'm doing this:

val dataframe = sparkSession.read …

I then take that dataframe and attempt to write it out to Parquet like so:

dataframe.write …

When the Parquet writes go, I get the same traceback as above. I can see from that traceback that it's org.apache.spark.sql.execution.datasources.csv, so for whatever reason com.databricks.spark.csv isn't being used. Do I need to do something different to force it to be used?
BTW, this problem exists in the external CSV data source as well (but only for …).
Test build #63493 has finished for PR 14118 at commit
Just as a +1: I would at least like the option to have …
Test build #64040 has finished for PR 14118 at commit
@rxin yes, all empty values become null values once they are read back! E.g. given …
What if I am explicitly writing an empty string out? Does it become just 1,,2? Can you also clarify whether this is a behavior change, or something else?
@rxin Please let me leave my thoughts on why I thought it looked good to me, in case it is helpful. Yes, but we should set … For example, if we have the dataframe as below:

with …

Here, we ended up with no diff between … I mean, as far as I know, there is no (standard) expression for an actual …
I re-edited this as I found my comment super confusing. I meant to suggest …
Yes. It becomes
This patch behaves differently from 2.0 when reading … @rxin ~
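The round-trip ambiguity under discussion can be sketched without Spark: with nullValue set to the empty string, both None and "" serialize to the same CSV field, so a reader must pick one interpretation. (Illustrative Python only; the helper names are made up, not Spark's API.)

```python
# Illustrative helpers only -- not Spark's API.

def write_field(value, null_value=""):
    """Serialize one field the way a CSV writer with nullValue="" would."""
    return null_value if value is None else value

def read_field(field, null_value=""):
    """Under this patch, any field equal to nullValue reads back as null."""
    return None if field == null_value else field

row = [1, None, 2]
line = ",".join(write_field(None if v is None else str(v)) for v in row)
print(line)                                      # 1,,2
print([read_field(f) for f in line.split(",")])  # ['1', None, '2']
```

Note that on the write side, a null and an empty string are indistinguishable once serialized, which is exactly why option (a) vs. option (b) is a judgment call.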
Jenkins retest this please |
Test build #64576 has finished for PR 14118 at commit
retest this please |
Test build #65042 has finished for PR 14118 at commit
@lw-lin just checking that you think this is still good to go? @HyukjinKwon do you have an opinion on the current state?
I support this PR. But just to make sure, I'd like to bring a reference. It seems at least …:

bt <- "A,B,C,D
10,20,NaN
30,,40
40,30,20
,NA,20"
b <- read.csv(text=bt, na.strings=c("NA","NaN", ""))
b

prints …
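For comparison, the multi-token null handling shown in the R snippet above can be sketched in plain Python (hypothetical helper, not R's or Spark's API):

```python
import csv
import io

# Hypothetical helper mirroring R's na.strings -- not pandas or Spark code.
NA_STRINGS = {"NA", "NaN", ""}

def read_csv_with_nulls(text, na_strings=NA_STRINGS):
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [[None if field in na_strings else field for field in row]
            for row in reader]

text = "A,B,C\n10,20,NaN\n30,,40\n,NA,20"
print(read_csv_with_nulls(text))
# [['10', '20', None], ['30', None, '40'], [None, None, '20']]
```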
@HyukjinKwon thanks for the information! @srowen yea I still think this is good to go.
Jenkins retest this please |
Test build #65343 has finished for PR 14118 at commit
@@ -329,7 +329,8 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         being read should be skipped. If None is set, it uses
         the default value, ``false``.
     :param nullValue: sets the string representation of a null value. If None is set, it uses
-        the default value, empty string.
+        the default value, empty string. Since 2.0.1, this ``nullValue`` param
I think you can omit the "since x.y.z" in this PR. The new text will be in the docs for the version it applies to and not earlier ones.
This patch introduces a behavior change, i.e. how we deal with nullValue for the string type. So let's keep the "since x.y.z" thing for people to find a clue?
// If it fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility.
DateTimeUtils.stringToTime(datum).getTime * 1000L

if (datum == options.nullValue && nullable) {
Is it possibly worth checking if (nullable && ...) to avoid the comparison if it's not nullable?
yea let me do that. thanks.
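The suggested reordering is plain short-circuit evaluation: with the cheap nullable check first, the string comparison never runs for non-nullable columns. A tiny Python sketch of the idea (the helper name is invented):

```python
# Tiny illustration of the short-circuit; the helper name is invented.

def is_null_token(datum, null_value, nullable):
    # With `nullable` first, the string comparison is skipped entirely
    # for non-nullable columns.
    return nullable and datum == null_value

print(is_null_token("", "", nullable=True))   # True
print(is_null_token("", "", nullable=False))  # False
```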
case _: IntegerType => datum.toInt
case _: LongType => datum.toLong
case _: FloatType =>
  if (datum == options.nanValue) {
Can these nested if-else statements be a match statement, or is there some overhead to that which is too significant?
Yea they should be a match statement -- let me update this, thanks!
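As a sketch of the collapsed dispatch, here is a rough Python analogue of the float branch (the helper itself is invented, though the token defaults mirror Spark's nanValue/positiveInf/negativeInf option names):

```python
# Rough Python analogue of the float branch -- the helper is invented;
# only the option names are borrowed from the discussion above.

def parse_float(datum, nan_value="NaN", positive_inf="Inf", negative_inf="-Inf"):
    """Map the special tokens first, then fall through to a plain cast."""
    specials = {
        nan_value: float("nan"),
        positive_inf: float("inf"),
        negative_inf: float("-inf"),
    }
    if datum in specials:
        return specials[datum]
    return float(datum)

print(parse_float("1.5"))  # 1.5
print(parse_float("Inf"))  # inf
```

A single lookup table replaces the nested if-else chain with no extra passes over the input, which is roughly what a match statement buys in the Scala version.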
Test build #65466 has finished for PR 14118 at commit
Merged to master/2.0
…s properly

## Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression comparing to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

## How was this patch tested?

New test cases.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14118 from lw-lin/csv-cast-null.

(cherry picked from commit 1dbb725)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression comparing to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.

What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

How was this patch tested?

New test cases.