[SPARK-16310][SPARKR] R na.string-like default for csv source by felixcheung · Pull Request #13984 · apache/spark

felixcheung · 2016-06-29T23:09:34Z

What changes were proposed in this pull request?

Apply "NA" as the default null string for R, like the na.strings parameter of R's read.csv.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

A user passing a csv file with NA values should get the same behavior from SparkR read.df(..., source = "csv").
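For reference, the base R behavior being matched can be reproduced with a small inline example. This is a minimal sketch using only base R (no Spark involved); the sample data and variable names are made up for illustration.

```r
# Inline CSV text so the example needs no file on disk.
csv_text <- "name,age\nalice,30\nbob,NA\nNA,25"

# na.strings already defaults to "NA" in read.csv; it is spelled out for clarity.
df <- read.csv(text = csv_text, na.strings = "NA", stringsAsFactors = FALSE)

# The literal string NA is read as a missing value in both the numeric
# column (age) and the character column (name).
print(is.na(df$age))
print(is.na(df$name))
```

The point of the change is that a csv file like this one should yield nulls in the same places when read through SparkR.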

(couldn't open JIRA, will do that later)

How was this patch tested?

unit tests

@shivaram

SparkQA · 2016-06-29T23:46:10Z

Test build #61511 has finished for PR 13984 at commit 7927973.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-07-01T18:21:54Z

@shivaram what do you think?

shivaram · 2016-07-01T21:42:40Z

R/pkg/R/SQLContext.R

    source <- getDefaultSqlSource()
  }
+  if (source == "csv" && is.null(options[["nullValue"]])) {
+    options[["nullValue"]] <- "NA"


Instead of hard-coding this, can we add a na.strings argument to read.df? That's more similar to read.table in R: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

great point. updated
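
The option handling under discussion can be sketched in plain R without Spark. The helper name buildReadOptions and the use of a plain list are assumptions for illustration, not SparkR's actual implementation; only the nullValue option name, the "NA" default, and the csv check come from the diff above.

```r
# Hypothetical helper (not SparkR's actual code): builds the option list that
# a reader such as read.df would forward to the data source.
buildReadOptions <- function(source, na.strings = "NA", ...) {
  options <- list(...)
  # Only the csv source gets the default, and an explicit nullValue wins.
  if (identical(source, "csv") && is.null(options[["nullValue"]])) {
    options[["nullValue"]] <- na.strings
  }
  options
}

csv_default  <- buildReadOptions("csv", header = "true")   # nullValue falls back to "NA"
csv_custom   <- buildReadOptions("csv", na.strings = "-")  # caller picks the null string
csv_explicit <- buildReadOptions("csv", nullValue = "NULL") # explicit option is kept
json_opts    <- buildReadOptions("json")                    # non-csv source is untouched
```

Exposing na.strings as an argument keeps the R-facing name familiar to read.table users, while the value still travels to the data source under the nullValue option name.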

SparkQA · 2016-07-02T06:57:45Z

Test build #61650 has finished for PR 13984 at commit ebc0dfe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-02T14:01:56Z

Test build #61655 has finished for PR 13984 at commit aaa6707.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2016-07-04T19:32:15Z

R/pkg/R/SQLContext.R

  if (is.null(source)) {
    source <- getDefaultSqlSource()
  }
+  if (source == "csv" && is.null(options[["nullValue"]])) {


I think na.strings works for read.table and not just for read.csv in R? Is the concern that NA is not a good default for other formats like JSON?

AFAIK, R's read.table is equivalent to read.csv, read.csv2, or read.delim, and applies only to delimited text files:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

Unlike in a delimited/csv file, an R NA is typically null in JSON, represented as

"myString": null

But from what I can see, there is no consistent approach in R. Base R has no support for JSON. There are jsonlite, RJSONIO, and rjson, and the relevant parameter could be na or .na (but again, it typically defaults to "null").

I think it would be interesting to support custom null/NA mapping for other text-based data sources.

From what I can see, nullValue is supported in Spark only for the csv data source.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L373

Unlike the R packages, Jackson doesn't seem to support a custom "null" value:
http://fasterxml.github.io/jackson-core/javadoc/2.0.0/com/fasterxml/jackson/core/JsonToken.html#VALUE_NULL
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L83

Thanks for that pointer to the SQL code. I guess my point is that we could always pass nullValue to the Scala side and not do additional filtering on the R side for source == csv?

Possibly. I wonder if we should be conservative: since the data source API is extensible, perhaps passing nullValue to a new source could cause an unexpected behavior change?

Yeah, this is the more conservative option. I guess that's fine for now, and we can revisit this if required.

shivaram · 2016-07-07T22:19:09Z

LGTM. Merging this to master and branch-2.0

## What changes were proposed in this pull request?

Apply default "NA" as null string for R, like R read.csv na.string parameter.

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"

An user passing a csv file with NA value should get the same behavior with SparkR read.df(... source = "csv")

(couldn't open JIRA, will do that later)

## How was this patch tested?

unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13984 from felixcheung/rcsvnastring.

(cherry picked from commit f4767bc)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

R na.string like default for css source

7927973

felixcheung changed the title ~~[SPARKR] R na.string like default for css source~~ [SPARKR] R na.string-like default for csv source Jun 29, 2016

felixcheung changed the title ~~[SPARKR] R na.string-like default for csv source~~ [SPARK-16310][SPARKR] R na.string-like default for csv source Jun 29, 2016

shivaram reviewed Jul 1, 2016

update to add as parameter

ebc0dfe

take out debug print in test

aaa6707

shivaram reviewed Jul 4, 2016

asfgit closed this in f4767bc Jul 7, 2016

felixcheung wants to merge 3 commits into apache:master from felixcheung:rcsvnastring
