
[SPARK-20550][SPARKR] R wrapper for Dataset.alias #17825

Status: Closed. Wants to merge 19 commits.
Changes from 17 commits
23 changes: 23 additions & 0 deletions R/pkg/R/DataFrame.R
@@ -3745,3 +3745,26 @@ setMethod("hint",
jdf <- callJMethod(x@sdf, "hint", name, parameters)
dataFrame(jdf)
})

#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias
#' @name alias
#' @examples
Member:
add @family SparkDataFrame functions
I think we should probably review all these @family at one point...

Member Author:
In general it would be nice to sweep all the files to make them more consistent: capitalization, punctuation, examples, @return and such.

Member:
yes!

Member:
add @export

Member Author:
Done, but do we actually need this? We don't use roxygen to maintain NAMESPACE, and (I believe I mentioned this before) we @export objects which are not really exported. Just saying...

Member:
true, it's more for tracking it manually

#' \dontrun{
#' df <- alias(createDataFrame(mtcars), "mtcars")
#' avg_mpg <- alias(agg(groupBy(df, df$cyl), avg(df$mpg)), "avg_mpg")
#'
#' head(select(df, column("mtcars.mpg")))
#' head(join(df, avg_mpg, column("mtcars.cyl") == column("avg_mpg.cyl")))
#' }
#' @note alias(SparkDataFrame) since 2.3.0
setMethod("alias",
signature(object = "SparkDataFrame"),
function(object, data) {
stopifnot(is.character(data))
sdf <- callJMethod(object@sdf, "alias", data)
dataFrame(sdf)
})
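To make the new method concrete, here is a usage sketch (hypothetical; it assumes a running SparkR session created with sparkR.session(), so it is not runnable standalone): aliasing the same table twice lets a self-join refer to columns unambiguously.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark installation is available

df <- createDataFrame(mtcars)

# Two aliases over the same underlying data
a <- alias(df, "a")
b <- alias(df, "b")

# Qualified column references disambiguate the self-join
joined <- join(a, b, column("a.cyl") == column("b.cyl"))
head(select(joined, column("a.mpg"), column("b.hp")))
```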
16 changes: 8 additions & 8 deletions R/pkg/R/column.R
@@ -130,19 +130,19 @@ createMethods <- function() {

createMethods()

#' alias
#'
#' Set a new name for a column
#'
#' @param object Column to rename
#' @param data new name to use
#'
#' @rdname alias
#' @name alias
#' @aliases alias,Column-method
#' @family colum_func
#' @export
#' @note alias since 1.4.0
#' @examples \dontrun{
Member:
I think generally we put \dontrun on the next line

#' df <- createDataFrame(iris)
#'
#' head(select(
#' df, alias(df$Sepal_Length, "slength"), alias(df$Petal_Length, "plength")
#' ))
#' }
#' @note alias(Column) since 1.4.0
setMethod("alias",
signature(object = "Column"),
function(object, data) {
11 changes: 11 additions & 0 deletions R/pkg/R/generics.R
@@ -387,6 +387,17 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
#' @export
setGeneric("agg", function (x, ...) { standardGeneric("agg") })

#' alias
#'
#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#'
#' @name alias
#' @rdname alias
#' @param object x a Column or a SparkDataFrame
#' @param data new name to use
Member:
shouldn't we have a @return here? perhaps to say

Returns a new SparkDataFrame or Column with an alias set.
For Column, equivalent to SQL "AS" keyword.

@return a new SparkDataFrame or Column

Member Author:
Wouldn't it be better to annotate the actual implementations? To get something like this:

[screenshot]

Member (@felixcheung, May 5, 2017):
That we did, at one point. I think the feedback was that we could have one line for the parameter (object), and the return value could span more than one line, but then which line matches which input parameter type?

Member Author:
To be honest I find both equally confusing, so if you think that a single annotation is better, I am happy to oblige.

Member:
That's true, actually.
If you think it's useful we could always have them in separate Rd files.
I'm pretty sure @rdname needs to match @aliases to fix the multiple-link bug (https://issues.apache.org/jira/browse/SPARK-18825), which means we can't have multiple functions in the same Rd; each has to have its own.

Member Author:
On the bright side it looks like matching @rdname and @aliases like:

#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...

and

#' alias
#'
#' @aliases alias,Column-method
#' @family colum_func
#' @rdname alias,Column-method
#' @name alias
...

(I hope this is what you mean) indeed solves SPARK-18825. But it doesn't generate any docs for these two and makes the CRAN checker unhappy:

Undocumented S4 methods:
  generic 'alias' and siglist 'Column'
  generic 'alias' and siglist 'SparkDataFrame'

Docs for the generic are created, but that doesn't help us here. Even if we bring @examples there, we still have to deal with CRAN.

There is also my favorite, "\name must exist and be unique in Rd files", which doesn't give us much room here, does it?

I am open to suggestions, but personally I am out of ideas. I've been digging through the roxygen docs, but between CRAN, S4 requirements, roxygen limitations, and our own rules there is not much room left.

Member:
sigh, sadly I think you have captured all the constraints we are working with here.

let's get the 3 lines in the same order

#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a Column or a SparkDataFrame
#' @return a Column or a SparkDataFrame

to

#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a SparkDataFrame or Column
#' @return a SparkDataFrame or a Column

#' @return a Column or a SparkDataFrame
NULL

#' @rdname arrange
#' @export
setGeneric("arrange", function(x, col, ...) { standardGeneric("arrange") })
10 changes: 10 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -1223,6 +1223,16 @@ test_that("select with column", {
expect_equal(columns(df4), c("name", "age"))
expect_equal(count(df4), 3)

# Test select with alias
df5 <- alias(df, "table")

expect_equal(columns(select(df5, column("table.name"))), "name")
expect_equal(columns(select(df5, "table.name")), "name")

# Test that stats::alias is not masked
expect_is(alias(aov(yield ~ block + N * P * K, npk)), "listof")


expect_error(select(df, c("name", "age"), "name"),
"To select multiple columns, use a character vector or list for col")
})
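The masking check above can also be exercised in plain R, without Spark: the aov example used in the test comes straight from base R's npk dataset (in the datasets package), and calling stats::alias explicitly sidesteps any masking question. A minimal base-R sketch:

```r
# Base R only: verify stats::alias behaviour on an aov fit.
# npk ships with the datasets package, so no extra data is needed.
fit <- aov(yield ~ block + N * P * K, npk)

a <- stats::alias(fit)   # explicit namespace avoids any S4 masking
stopifnot(inherits(a, "listof"))
```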