
[SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR #37549

Closed
wants to merge 1 commit into from

Conversation

@deshanxiao (Contributor) commented Aug 17, 2022

What changes were proposed in this pull request?

Support read csv file in SparkR.

Why are the changes needed?

Today, almost all languages Spark supports have a DataFrameReader.csv() API except R. R users usually use read.df() to read CSV files, so we need a higher-level API for it.

Java:
DataFrameReader.csv()

Scala:
DataFrameReader.csv()

Python:
DataFrameReader.csv()

The R base library "utils" already exports read.csv, so this API is named read.spark.csv instead of read.csv to avoid the conflict.
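
For reference, the conflict comes from base R itself, where utils::read.csv is already exported; a quick illustration (the inline CSV text is just an example):

# Base R already exports read.csv from the "utils" package:
utils::read.csv(text = "a,b\n1,2\n3,4")
# A SparkR function with the same name would mask (or be masked by) this
# function, hence the read.spark.csv name proposed in this PR.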

Does this PR introduce any user-facing change?

Yes, read.spark.csv and write.spark.csv are introduced.
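
A minimal usage sketch of the proposed functions, assuming they forward reader/writer options to the CSV data source the same way read.df()/write.df() do (the paths, options, and the write.spark.csv signature are illustrative):

library(SparkR)
sparkR.session()

# Read a CSV file with the proposed high-level reader (options are illustrative).
df <- read.spark.csv("examples/src/main/resources/people.csv",
                     sep = ";", inferSchema = TRUE, header = TRUE)

# Write it back out with the proposed high-level writer (signature assumed).
write.spark.csv(df, path = "/tmp/people-csv-out", header = TRUE)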

How was this patch tested?

UT (TODO)

#' }
#' @name read.spark.csv
#' @note read.spark.csv since 3.3.0
read.spark.csv <- function(path, ...) {
Member

The question is whether it's worthwhile adding this separate signature, given that we can already do this with read.df() and the format argument.

@deshanxiao (Contributor Author)

Yes, we can read a CSV file with the following code:
df <- read.df("examples/src/main/resources/people.csv", "csv", sep = ";", inferSchema = TRUE, header = TRUE)

However, considering that other formats have corresponding high-level functions (read.text(), etc.), it seems worth adding a high-level API here as well.
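
For comparison, SparkR already ships format-specific high-level readers, so CSV is the odd one out today; a short sketch of the existing convention next to a hypothetical delegating wrapper (the wrapper body is an illustration, not this PR's actual implementation):

# Assumes an active SparkR session (library(SparkR); sparkR.session()).

# Existing high-level readers in SparkR take a path directly:
txt_df  <- read.text("examples/src/main/resources/people.txt")
json_df <- read.json("examples/src/main/resources/people.json")

# CSV currently goes through the generic reader instead:
csv_df <- read.df("examples/src/main/resources/people.csv", source = "csv",
                  sep = ";", inferSchema = TRUE, header = TRUE)

# A CSV wrapper could simply delegate to read.df (illustrative only):
read.spark.csv <- function(path, ...) {
  read.df(path = path, source = "csv", ...)
}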


Member

I have the same feelings as @HyukjinKwon here.

If it weren't for the name conflict, it would be a nice addition. However, adding another naming convention just for the sake of omitting csv in a call doesn't seem worth it (spark.read.csv would be a better choice, but then it wouldn't be an "obvious" reading utility).

@deshanxiao (Contributor Author)

@zero323 I totally agree. Do we have a more elegant way to support read.csv while maintaining compatibility? If not, I will close this PR.

Member

I don't see any clean solution that would follow existing SparkR conventions, @deshanxiao.

@deshanxiao (Contributor Author)

Closing this PR. If anyone has any good solutions or suggestions, please feel free to reopen it.

@deshanxiao closed this Aug 18, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@deshanxiao deleted the add-csv branch August 18, 2022 10:20