
[SPARK-16310][SPARKR] R na.string-like default for csv source #13984

Closed
felixcheung wants to merge 3 commits into apache:master from felixcheung:rcsvnastring

Conversation

@felixcheung
Member

@felixcheung felixcheung commented Jun 29, 2016

What changes were proposed in this pull request?

Apply "NA" as the default null string for the csv source in SparkR, matching the na.strings parameter of R's read.csv.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

A user reading a csv file containing NA values should get the same behavior from SparkR read.df(..., source = "csv").

(couldn't open JIRA, will do that later)
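For illustration, a sketch of the intended behavior (the file name is hypothetical, and a running Spark session is assumed):

```r
library(SparkR)
sparkR.session()

# people.csv contains:
#   name,age
#   Alice,30
#   Bob,NA

df <- read.df("people.csv", source = "csv", header = "true")

# With this change, the literal "NA" cell is read back as a null by default,
# matching base R's read.csv (where the cell becomes NA). Previously it came
# back as the string "NA".
collect(where(df, isNull(df$age)))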

How was this patch tested?

unit tests

@shivaram

@felixcheung felixcheung changed the title [SPARKR] R na.string like default for css source [SPARKR] R na.string-like default for csv source Jun 29, 2016
@felixcheung felixcheung changed the title [SPARKR] R na.string-like default for csv source [SPARK-16310][SPARKR] R na.string-like default for csv source Jun 29, 2016
@SparkQA

SparkQA commented Jun 29, 2016

Test build #61511 has finished for PR 13984 at commit 7927973.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member Author

@shivaram what do you think?

  source <- getDefaultSqlSource()
}
if (source == "csv" && is.null(options[["nullValue"]])) {
  options[["nullValue"]] <- "NA"
Contributor

Instead of hard-coding this, can we add a na.strings argument to read.df? That's more similar to read.table in R: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

Member Author

Great point. Updated.
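The updated API can be sketched like this (argument name per the review suggestion; assumed to map onto the csv source's nullValue option rather than hard-coding "NA" inside read.df):

```r
# Default: "NA" cells become nulls, as in base R's read.csv
df <- read.df("people.csv", source = "csv", header = "true", na.strings = "NA")

# Callers can override the token used for nulls, e.g. for files that write "null"
df2 <- read.df("people.csv", source = "csv", header = "true", na.strings = "null")
```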

@SparkQA

SparkQA commented Jul 2, 2016

Test build #61650 has finished for PR 13984 at commit ebc0dfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2016

Test build #61655 has finished for PR 13984 at commit aaa6707.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (is.null(source)) {
source <- getDefaultSqlSource()
}
if (source == "csv" && is.null(options[["nullValue"]])) {
Contributor

I think na.strings works for read.table and not just for read.csv in R? Is the concern that NA is not a good default for other formats like JSON?

Member Author

@felixcheung felixcheung Jul 4, 2016

AFAIK, R's read.table is equivalent to read.csv, read.csv2, or read.delim, and applies only to delimited text files:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

Unlike in a delimited/csv file, R's NA is typically null in JSON, represented as

  "myString": null

But from what I can see there is no consistent approach in R. Base R has no JSON support; there are jsonlite, RJSONIO, and rjson, and the token could be na or .na (but again typically defaults to "null").

I think it would be interesting to support custom null/NA mapping for other text-based data sources.

From what I can see, nullValue is only supported in Spark for the csv data source.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L373

Unlike the R packages, Jackson doesn't seem to support a custom "null" value:
http://fasterxml.github.io/jackson-core/javadoc/2.0.0/com/fasterxml/jackson/core/JsonToken.html#VALUE_NULL
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L83
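In other words, the R-side default is equivalent to passing the csv option explicitly, while JSON needs no such option because its null token is fixed (hypothetical paths; a running Spark session is assumed):

```r
# csv: the nullValue option controls which token is read back as null
df <- read.df("people.csv", source = "csv", header = "true", nullValue = "NA")

# json: nulls are already the literal token `null`, so there is nothing to configure
jdf <- read.df("people.json", source = "json")
```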

Contributor

Thanks for that pointer to the SQL code. I guess my point is that we could always pass nullValue through to the Scala side and not do the additional filtering on the R side for source == "csv"?

Member Author

@felixcheung felixcheung Jul 7, 2016

Possibly. I wonder if we should be conservative: since the data source API is extensible, could passing nullValue to a new source cause an unexpected behavior change?

Contributor

Yeah, this is the more conservative option. I guess that's fine for now, and we can revisit if required.

@shivaram
Contributor

shivaram commented Jul 7, 2016

LGTM. Merging this to master and branch-2.0

@asfgit asfgit closed this in f4767bc Jul 7, 2016
asfgit pushed a commit that referenced this pull request Jul 7, 2016
## What changes were proposed in this pull request?

Apply "NA" as the default null string for the csv source in SparkR, matching the na.strings parameter of R's read.csv.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

A user reading a csv file containing NA values should get the same behavior from SparkR read.df(..., source = "csv").

(couldn't open JIRA, will do that later)

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13984 from felixcheung/rcsvnastring.

(cherry picked from commit f4767bc)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>