
[SPARK-20550][SPARKR] R wrapper for Dataset.alias #17825

Closed
wants to merge 19 commits

Conversation

zero323
Member

@zero323 zero323 commented May 1, 2017

What changes were proposed in this pull request?

  • Add SparkR wrapper for Dataset.alias.
  • Adjust roxygen annotations for functions.alias (including example usage).

How was this patch tested?

Unit tests, check_cran.sh.
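For context, a minimal sketch of what the new wrapper enables (a hedged illustration only — it assumes a running SparkSession; the self-join pattern mirrors the example added to the roxygen docs in this PR):

```r
library(SparkR)
sparkR.session()

# Alias both a SparkDataFrame and a derived aggregate so each can be
# referenced unambiguously by qualified column names in a self-join.
df <- alias(createDataFrame(mtcars), "mtcars")
avg_mpg <- alias(agg(groupBy(df, df$cyl), avg(df$mpg)), "avg_mpg")

head(select(df, column("mtcars.mpg")))
head(join(df, avg_mpg, column("mtcars.cyl") == column("avg_mpg.cyl")))
```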

@zero323
Member Author

zero323 commented May 1, 2017

This may require some discussion. Right now we get generic docs like this:

[screenshot of the generated documentation page]

Does it make sense to put both in the same file? If not, where should Dataset.alias go?

Names are not optimal, but I guess we'll keep it as is to avoid issues with stats::alias.

@SparkQA

SparkQA commented May 1, 2017

Test build #76365 has finished for PR 17825 at commit b7d079b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 2, 2017

Test build #76369 has finished for PR 17825 at commit 7abe6ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2253,6 +2253,15 @@ test_that("mutate(), transform(), rename() and names()", {
detach(airquality)
})

test_that("alias on SparkDataFrame", {
df <- alias(read.df(jsonPath, "json"), "table")
Member

Instead of adding a new test, could you add to one that already names things, to reuse an existing df?

Member

Because I'm trying to make a set of tests that makes sense for CRAN:
#17817

#' head(select(df, column("mtcars.mpg")))
#' head(join(df, avg_mpg, column("mtcars.cyl") == column("avg_mpg.cyl")))
#' }
#' @note alias since 2.3.0
Member

R/pkg/R/column.R Outdated
#' @param data new name to use
#'
#' @rdname alias
#' @name alias
#' @aliases alias,Column-method
#' @family colum_func
#' @export
#' @examples \dontrun{
Member

I think generally we put \dontrun on the next line.

Member

@felixcheung felixcheung left a comment

We have a few similar cases like this, with some discussions at the time.

There really isn't a single good place for it, so we elected to put the doc in the generic, and both "overloads" share the same Rd page, like many other R methods do.

I think that's ok, but we should probably be clearer in the documentation - the existing description is quite vague. We should really make it clear that a new "thing" is being returned with the new name, not that the existing one is renamed.

#' @aliases alias,SparkDataFrame-method
#' @rdname alias
#' @name alias
#' @examples \dontrun{
Member

same here

@zero323
Member Author

zero323 commented May 2, 2017

That makes sense I guess. It would be great to have more control over the layout though. One can dream, right? :)

Thank you so much for all the reviews and information.

@SparkQA

SparkQA commented May 2, 2017

Test build #76373 has finished for PR 17825 at commit d892f30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/column.R Outdated
@@ -132,17 +132,24 @@ createMethods()

#' alias
#'
#' Set a new name for a column
#' Set a new name for an object. Equivalent to SQL "AS" keyword.
Member

Right - this is the Scala doc: Column.alias is "Gives the column an alias" (which is not very concise), and Dataset.alias is "Returns a new Dataset with an alias set."

I think we need to say "Set a new name to return as a new object" or similar. Actually I think we should say "Column or SparkDataFrame" in place of "object" - what do you think?

I think the SQL "AS" part is fine, but perhaps it will be clearer if we lead with "for Column, ..."?

Member

Also, I think this doc block (the description and param list specifically) should be moved to DataFrame.R or generics.R as mentioned before.

Member Author

Moving to generics.R sounds good. "Column or SparkDataFrame" in place of "object" as well.

Regarding "AS"... In SQL it can be used with both expressions and tables, so I deliberately didn't qualify this with Column.

I am not sure if we really need to state that it returns a new object. Maybe Return a new Column or SparkDataFrame with an alias. Equivalent to SQL "AS" keyword.? But it doesn't sound great.

@SparkQA

SparkQA commented May 3, 2017

Test build #76400 has finished for PR 17825 at commit 32fd836.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @aliases alias,SparkDataFrame-method
#' @rdname alias
#' @name alias
#' @examples
Member

add @family SparkDataFrame functions
I think we should probably review all these @family at one point...

Member Author

In general it would be nice to sweep all the files to make them more consistent: capitalization, punctuation, examples, @return and such.

Member

yes!

@zero323
Member Author

zero323 commented May 4, 2017

I wonder if it would make more sense to make alias generic for both object and data:

 signature(object = "SparkDataFrame", data = "character")

and skip the type checks.
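The idea can be sketched roughly as below. This is hypothetical and was never merged - as noted further down in the thread, it conflicts with stats::alias being an S3 generic. callJMethod, @sdf and dataFrame are SparkR internals:

```r
# Hypothetical sketch: dispatch on both arguments, so S4 method dispatch
# replaces an explicit is.character(data) check inside the method body.
setMethod("alias",
          signature(object = "SparkDataFrame", data = "character"),
          function(object, data) {
            # data is guaranteed to be character by dispatch
            dataFrame(callJMethod(object@sdf, "alias", data))
          })
```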

@SparkQA

SparkQA commented May 4, 2017

Test build #76437 has finished for PR 17825 at commit e06544f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

You know - it would definitely be a better experience for the R user, so we should try that - it might break with the generic in stats::alias though.

And speaking of, we should probably add a test for stats::alias to see that it is callable without the stats:: prefix.

@zero323
Member Author

zero323 commented May 4, 2017

Oh right, stats::alias is S3. Scratch that. Added test.
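For reference, this is the kind of behavior such a test guards (a plain base-R sketch, no Spark needed): with SparkR attached, calling alias() on an lm fit should still reach stats::alias rather than being masked by SparkR's generic.

```r
library(SparkR)  # masks several base/stats functions when attached

# stats::alias reports linear dependencies in a linear model;
# it must remain reachable without the stats:: prefix.
fit <- lm(mpg ~ cyl + disp, data = mtcars)
alias(fit)  # should dispatch to stats::alias via S3
```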

@SparkQA

SparkQA commented May 4, 2017

Test build #76444 has finished for PR 17825 at commit 60fcd8a.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2017

Test build #76445 has finished for PR 17825 at commit 875921b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2017

Test build #76451 has finished for PR 17825 at commit 09f9cca.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2017

Test build #76452 has finished for PR 17825 at commit f1c74f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Could you close/reopen to trigger AppVeyor again?

@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
#' @export
setGeneric("agg", function (x, ...) { standardGeneric("agg") })

#' alias
#'
#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.
Member

right - I think again we should emphasize on returning a new SparkDataFrame

Member Author

How about?

#' Return a new Column or a SparkDataFrame with a name set. Equivalent to SQL "AS" keyword.

Is the Column new?

Member

I guess we don't say "return a new Column" but more generally "return a Column", and in other cases we say "return a new SparkDataFrame", so I guess it's a difference in wording.

I think what you propose is fine, though do you think it's confusing to say Equivalent to SQL "AS" keyword., because that makes sense only for Column and not the whole dataframe?

Member Author

@zero323 zero323 May 5, 2017

I still believe that AS is applicable to both. Essentially what we do is:

SELECT old_column AS new_column FROM table

and

(SELECT * FROM old_table) AS new_table
--or
SELECT * FROM old_table AS new_table

@zero323 zero323 closed this May 5, 2017
@zero323 zero323 reopened this May 5, 2017
#' @name alias
#' @rdname alias
#' @param object x a Column or a SparkDataFrame
#' @param data new name to use
Member

shouldn't we have a @return here? perhaps to say

Returns a new SparkDataFrame or Column with an alias set.
For Column, equivalent to SQL "AS" keyword.

@return a new SparkDataFrame or Column

Member Author

Wouldn't it be better to annotate the actual implementations? To get something like this:

[screenshot of per-implementation documentation]

Member

@felixcheung felixcheung May 5, 2017

That we did, at one point. I think the feedback was that we could have one line for the parameter (object), and the return value could take more than one line - but then which line matches which input parameter type?

Member Author

To be honest I find both equally confusing, so if you think that a single annotation is better, I am happy to oblige.

Member

That's true actually.
If you think it's useful we could always have them in separate Rd files.
I'm pretty sure @rdname needs to match @aliases to fix the multiple-link bug https://issues.apache.org/jira/browse/SPARK-18825, which means we can't have multiple functions in the same Rd - each has to have its own.

Member Author

On the bright side it looks like matching @rdname and @aliases like:

#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...

and

#' alias
#'
#' @aliases alias,Column-method
#' @family colum_func
#' @rdname alias,Column-method
#' @name alias
...

(I hope this is what you mean) indeed solves SPARK-18825. But it doesn't generate any docs for these two methods, and makes the CRAN checker unhappy:

Undocumented S4 methods:
  generic 'alias' and siglist 'Column'
  generic 'alias' and siglist 'SparkDataFrame'

Docs for the generic are created, but that doesn't help us here. Even if we bring @examples there, we still have to deal with CRAN.

There is also my favorite, \name must exist and be unique in Rd files, which doesn't give us much room here, does it?

I am open to suggestions, but personally I am out of ideas. I've been digging through the roxygen docs, but between CRAN, S4 requirements, roxygen limitations, and our own rules there is not much room left.

Member

sigh, sadly I think you have captured all the constraints we are working with here.

let's get the 3 lines in the same order

#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a Column or a SparkDataFrame
#' @return a Column or a SparkDataFrame

to

#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a SparkDataFrame or Column
#' @return a SparkDataFrame or a Column

@SparkQA

SparkQA commented May 5, 2017

Test build #76503 has finished for PR 17825 at commit 505561a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @family SparkDataFrame functions
#' @rdname alias
#' @name alias
#' @examples
Member

add @export

Member Author

Done, but do we actually need this? We don't use roxygen to maintain NAMESPACE, and (I believe I mentioned this before) we @export objects which are not really exported. Just saying...

Member

True, it's more for tracking it manually.

@SparkQA

SparkQA commented May 6, 2017

Test build #76514 has finished for PR 17825 at commit 1f1e72b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Could you keep the description and return type in the same order as in this #17825 (comment)?

it's not great, but it's the best we can do

@SparkQA

SparkQA commented May 7, 2017

Test build #76552 has finished for PR 17825 at commit 2b8f288.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Merged to master.
Thank you for working on this, and hopefully we can really improve a lot of the things we have discussed. 👍

@asfgit asfgit closed this in 1f73d35 May 7, 2017
@zero323
Member Author

zero323 commented May 9, 2017

Thanks @felixcheung

liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?

- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).

## How was this patch tested?

Unit tests, `check_cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes apache#17825 from zero323/SPARK-20550.
@zero323 zero323 deleted the SPARK-20550 branch February 2, 2020 17:49