[SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR #37549
Conversation
#' }
#' @name read.spark.csv
#' @note read.spark.csv since 3.3.0
read.spark.csv <- function(path, ...) {
The question is whether it's worthwhile adding this different signature, given that we're already able to do it with `format`.
Yes, we can read a CSV file with the following code:
df <- read.df("examples/src/main/resources/people.csv", "csv", sep = ";", inferSchema = TRUE, header = TRUE)
However, considering that other formats already have corresponding high-level functions (read.text() etc.), it would be worthwhile to add a high-level API here as well.
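A high-level wrapper along these lines could simply delegate to read.df(). This is a hedged sketch, not the merged implementation; the parameter handling is assumed, and running it requires an active SparkR session:

```r
library(SparkR)

#' Sketch of a high-level CSV reader for SparkR.
#' Delegates to read.df() with source = "csv"; extra options
#' (sep, header, inferSchema, ...) are passed through unchanged.
read.spark.csv <- function(path, ...) {
  read.df(path, source = "csv", ...)
}

# Usage, mirroring the read.df() call above:
# df <- read.spark.csv("examples/src/main/resources/people.csv",
#                      sep = ";", inferSchema = TRUE, header = TRUE)
```

The wrapper only fixes the source format; everything else stays identical to read.df(), which is what makes the "is it worth a new signature?" question reasonable.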
The `format` approach appears to be documented only for Scala and Java in the manual load/save guide:
https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options
I have the same feelings as @HyukjinKwon here.
If it weren't for the name conflict, this would be a nice addition. However, introducing another naming convention (spark.read.csv would be a better choice, but then it wouldn't be an "obvious" reading utility) just for the sake of omitting csv in a call doesn't seem worth it.
@zero323 I totally agree. Do we have a more elegant way to support read.csv while maintaining compatibility? If not, I will close this PR.
I don't see any clean solution that would follow existing SparkR conventions @deshanxiao.
Closing this PR. If anyone has a good solution or suggestion, please feel free to reopen.
Can one of the admins verify this patch?
What changes were proposed in this pull request?
Support reading and writing CSV files in SparkR.
Why are the changes needed?
Today, almost all languages Spark supports have the DataFrameReader.csv() API except R. R users usually rely on read.df() to read CSV files, so a higher-level API is needed.
Java: DataFrameReader.csv()
Scala: DataFrameReader.csv()
Python: DataFrameReader.csv()
The base R package "utils" already defines "read.csv" in its namespace, so this API is named "read.spark.csv" instead of "read.csv" to avoid the conflict.
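To illustrate the conflict and the renamed API side by side (a hedged sketch: the file path and the write.spark.csv signature are assumptions, with the writer presumed to mirror write.df(); running the SparkR calls requires an active Spark session):

```r
library(SparkR)

# utils::read.csv() is part of base R and reads a local file
# into an in-memory data.frame:
local_df <- utils::read.csv("people.csv", sep = ";")

# A SparkR function also named read.csv() would mask the base R one,
# so this PR uses a distinct name (illustrative call; options are
# passed through as in read.df()):
spark_df <- read.spark.csv("people.csv", sep = ";", header = "true")

# Companion writer introduced by the PR (assumed signature,
# mirroring write.df()):
write.spark.csv(spark_df, path = "out/people_csv", header = "true")
```

Because R resolves unqualified names by search-path order, attaching SparkR last would otherwise silently shadow utils::read.csv for every script in the session, which is the compatibility concern driving the rename.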
Does this PR introduce any user-facing change?
Yes, read.spark.csv and write.spark.csv are introduced.
How was this patch tested?
UT (TODO)