[SPARK-16310][SPARKR] R na.string-like default for csv source#13984
[SPARK-16310][SPARKR] R na.string-like default for csv source#13984felixcheung wants to merge 3 commits intoapache:masterfrom
Conversation
|
Test build #61511 has finished for PR 13984 at commit
|
|
@shivaram what do you think? |
R/pkg/R/SQLContext.R
Outdated
| source <- getDefaultSqlSource() | ||
| } | ||
| if (source == "csv" && is.null(options[["nullValue"]])) { | ||
| options[["nullValue"]] <- "NA" |
There was a problem hiding this comment.
instead of hard coding this can we add a na.strings argument to read.df ? thats more similar to read.table in R https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
|
Test build #61650 has finished for PR 13984 at commit
|
|
Test build #61655 has finished for PR 13984 at commit
|
| if (is.null(source)) { | ||
| source <- getDefaultSqlSource() | ||
| } | ||
| if (source == "csv" && is.null(options[["nullValue"]])) { |
There was a problem hiding this comment.
I think na.strings works for read.table and not just for read.csv in R ? Is the concern that NA is not a good default for other formats like JSON etc. ?
There was a problem hiding this comment.
AFAIK, R read.table is equivalent to read.csv, read.csv2 or read.delim - and only for delimited text file:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
Unlike in delimited/csv file, R NA is typically null in JSON, represented as
"myString": null
But there is no consistent approach from what I can see in R. There is no support for JSON in Base. There are jsonlite, RJSONIO, rjson, and it could be na or .na (but again typically default to "null")
I think it will be an interesting to support custom null/NA mapping for other text-based data sources.
From what I can see nullValue is only supported in Spark for csv data source.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L373
Unlike R packages, Jackson doesn't seem to support custom "null" value:
http://fasterxml.github.io/jackson-core/javadoc/2.0.0/com/fasterxml/jackson/core/JsonToken.html#VALUE_NULL
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L83
There was a problem hiding this comment.
Thanks for that pointer to the SQL code. I guess my point is that we could always pass the nullValue to the scala side and not do additional filtering on the R side for source == csv ?
There was a problem hiding this comment.
Possibly. I wonder if we should be conservative - since data source API is extensible - perhaps a new source nullValue could cause an unexpected behavior change?
There was a problem hiding this comment.
Yeah this is the more conservative option - I guess thats fine for now and we can revisit this if required.
|
LGTM. Merging this to master and branch-2.0 |
## What changes were proposed in this pull request? Apply default "NA" as null string for R, like R read.csv na.string parameter. https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html na.strings = "NA" An user passing a csv file with NA value should get the same behavior with SparkR read.df(... source = "csv") (couldn't open JIRA, will do that later) ## How was this patch tested? unit tests shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13984 from felixcheung/rcsvnastring. (cherry picked from commit f4767bc) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
What changes were proposed in this pull request?
Apply default "NA" as null string for R, like R read.csv na.string parameter.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"
An user passing a csv file with NA value should get the same behavior with SparkR read.df(... source = "csv")
(couldn't open JIRA, will do that later)
How was this patch tested?
unit tests
@shivaram