
[SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR #37549

Closed
wants to merge 1 commit into from

Conversation

@deshanxiao (Contributor) commented Aug 17, 2022

What changes were proposed in this pull request?

Support read csv file in SparkR.

Why are the changes needed?

Today, almost all languages Spark supports have a DataFrameReader.csv() API except R. R users usually use read.df() to read CSV files, so we need a higher-level API for it.

Java:
DataFrameReader.csv()

Scala:
DataFrameReader.csv()

Python:
DataFrameReader.csv()

The R base library "utils" already exports read.csv, so this API is named read.spark.csv instead of read.csv to avoid the conflict.
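
For reference, the conflict comes from base R itself, where utils::read.csv is already exported; a quick illustration (the inline CSV text is just an example):

# Base R already exports read.csv from the "utils" package:
utils::read.csv(text = "a,b\n1,2\n3,4")
# A SparkR function with the same name would mask (or be masked by) this
# function, hence the read.spark.csv name proposed in this PR.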

Does this PR introduce any user-facing change?

Yes, read.spark.csv and write.spark.csv are introduced.
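
A minimal usage sketch of the proposed functions, assuming they forward reader/writer options to the CSV data source the same way read.df()/write.df() do (the paths, options, and the write.spark.csv signature are illustrative):

library(SparkR)
sparkR.session()

# Read a CSV file with the proposed high-level reader (options are illustrative).
df <- read.spark.csv("examples/src/main/resources/people.csv",
                     sep = ";", inferSchema = TRUE, header = TRUE)

# Write it back out with the proposed high-level writer (signature assumed).
write.spark.csv(df, path = "/tmp/people-csv-out", header = TRUE)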

How was this patch tested?

UT (TODO)

#' }
#' @name read.spark.csv
#' @note read.spark.csv since 3.3.0
read.spark.csv <- function(path, ...) {
Member

The question is whether it's worthwhile adding this separate signature, given that we can already do this with read.df() and the format argument.

@deshanxiao (Contributor Author)

Yes, we can read a CSV file with the following code:
df <- read.df("examples/src/main/resources/people.csv", "csv", sep = ";", inferSchema = TRUE, header = TRUE)

However, considering that other formats have corresponding high-level functions (read.text(), etc.), it seems worth adding a high-level API here as well.
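
For comparison, SparkR already ships format-specific high-level readers, so CSV is the odd one out today; a short sketch of the existing convention next to a hypothetical delegating wrapper (the wrapper body is an illustration, not this PR's actual implementation):

# Assumes an active SparkR session (library(SparkR); sparkR.session()).

# Existing high-level readers in SparkR take a path directly:
txt_df  <- read.text("examples/src/main/resources/people.txt")
json_df <- read.json("examples/src/main/resources/people.json")

# CSV currently goes through the generic reader instead:
csv_df <- read.df("examples/src/main/resources/people.csv", source = "csv",
                  sep = ";", inferSchema = TRUE, header = TRUE)

# A CSV wrapper could simply delegate to read.df (illustrative only):
read.spark.csv <- function(path, ...) {
  read.df(path = path, source = "csv", ...)
}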


Member

I have the same feelings as @HyukjinKwon here.

If it weren't for the name conflict, it would be a nice addition. However, adding another naming convention just for the sake of omitting csv in a call doesn't seem worth it (spark.read.csv would be a better choice, but then it wouldn't be an "obvious" reading utility).

@deshanxiao (Contributor Author)

@zero323 I totally agree. Do we have a more elegant way to support read.csv while maintaining compatibility? If not, I will close this PR.

Member

I don't see any clean solution that would follow existing SparkR conventions, @deshanxiao.

@deshanxiao (Contributor Author)

Closing this PR. If anyone has any good solutions or suggestions, please feel free to reopen it.

@deshanxiao closed this Aug 18, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@deshanxiao deleted the add-csv branch August 18, 2022 10:20