New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other DataFrame stats functions #13109

Closed
wants to merge 6 commits into
base: master
from

Conversation

Projects
None yet
6 participants
@felixcheung
Member

felixcheung commented May 13, 2016

What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
image

After
image
(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats function instead of putting everything on one page. At least now it is clearer which description is on which function.

How was this patch tested?

Build doc

@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung
Member

felixcheung commented May 13, 2016

@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram May 13, 2016

Contributor

LGTM. Thanks @felixcheung

Contributor

shivaram commented May 13, 2016

LGTM. Thanks @felixcheung

@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA May 13, 2016

Test build #58596 has finished for PR 13109 at commit 756262c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 13, 2016

Test build #58596 has finished for PR 13109 at commit 756262c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@sun-rui

This comment has been minimized.

Show comment
Hide comment
@sun-rui

sun-rui May 14, 2016

Contributor

This looks better. but the roxygen style is a little bit deviated. The previous is like:
#' function name
#' description

Current is like:
#' function name - description

We may need a consistent roxygen style documentation. At least for two styles:
one function for one RD
multiple functions for one RD

And also if you type '?corr' in R, only corr() for Column functions is displayed. Since R is function oriented, I think two corr() descriptions better to be displayed together in one page?

Contributor

sun-rui commented May 14, 2016

This looks better. but the roxygen style is a little bit deviated. The previous is like:
#' function name
#' description

Current is like:
#' function name - description

We may need a consistent roxygen style documentation. At least for two styles:
one function for one RD
multiple functions for one RD

And also if you type '?corr' in R, only corr() for Column functions is displayed. Since R is function oriented, I think two corr() descriptions better to be displayed together in one page?

@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung

felixcheung May 15, 2016

Member

right - that's why my comment on SPARK-15237 is that we should have a different rd for each of "statsfunctions" instead of having all of them on one rd. To clarify, currently, we have

  1. column function corr in its rd file
  2. DataFrame function corr, crosstab, cov, freqItems in one rd file

What I think we should have instead is

  1. DataFrame function corr in one rd file
  2. DataFrame function crosstab in one rd file etc..
  3. column function corr should group with other column functions (or least column stats functions) in one rd file with appropriate "multiple functions in one rd" formatting as you have suggested

I think it is rather confusing to put both DataFrame corr on the same pages as column function corr?
Thoughts?

Member

felixcheung commented May 15, 2016

right - that's why my comment on SPARK-15237 is that we should have a different rd for each of "statsfunctions" instead of having all of them on one rd. To clarify, currently, we have

  1. column function corr in its rd file
  2. DataFrame function corr, crosstab, cov, freqItems in one rd file

What I think we should have instead is

  1. DataFrame function corr in one rd file
  2. DataFrame function crosstab in one rd file etc..
  3. column function corr should group with other column functions (or least column stats functions) in one rd file with appropriate "multiple functions in one rd" formatting as you have suggested

I think it is rather confusing to put both DataFrame corr on the same pages as column function corr?
Thoughts?

@sun-rui

This comment has been minimized.

Show comment
Hide comment
@sun-rui

sun-rui May 16, 2016

Contributor

My opinion is that since R supports generic function and a generic function can have multiple methods for it, it is natural to put both corr() in the same page. Is there a mechanism that descriptions can be aggregated even if methods of the same name are distributed in different RD files?

Contributor

sun-rui commented May 16, 2016

My opinion is that since R supports generic function and a generic function can have multiple methods for it, it is natural to put both corr() in the same page. Is there a mechanism that descriptions can be aggregated even if methods of the same name are distributed in different RD files?

@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung

felixcheung May 16, 2016

Member

that's fine, I don't know the history of putting stats column function into one rd page though.
I agree it is fine to group function by the name corr DataFrame & corr column going to the same rd page. Are we ok with that approach?

Member

felixcheung commented May 16, 2016

that's fine, I don't know the history of putting stats column function into one rd page though.
I agree it is fine to group function by the name corr DataFrame & corr column going to the same rd page. Are we ok with that approach?

@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram May 24, 2016

Contributor

Chiming in a little late here -- from my R usage, I've definitely seen two patterns commonly used

Right now my take is the following:

  • For more complex functions like corr (i.e. where we have DataFrame and column versions), lets group by function name and lets exclude them from the generic stats column rd file
  • For simple functions like mean or sum lets keep them in a stats functions rd.

Let me know what you think of this proposal.

Contributor

shivaram commented May 24, 2016

Chiming in a little late here -- from my R usage, I've definitely seen two patterns commonly used

Right now my take is the following:

  • For more complex functions like corr (i.e. where we have DataFrame and column versions), lets group by function name and lets exclude them from the generic stats column rd file
  • For simple functions like mean or sum lets keep them in a stats functions rd.

Let me know what you think of this proposal.

@sun-rui

This comment has been minimized.

Show comment
Hide comment
@sun-rui

sun-rui May 24, 2016

Contributor

@shivaram, I checked your two examples. It seems that the rule is that:
For a generic function, methods for it are documented together
For non-generic functions (isGeneric() returns FALSE), functions having related functionalities can be documented together.

I have no strong opinion on this. so +1

Contributor

sun-rui commented May 24, 2016

@shivaram, I checked your two examples. It seems that the rule is that:
For a generic function, methods for it are documented together
For non-generic functions (isGeneric() returns FALSE), functions having related functionalities can be documented together.

I have no strong opinion on this. so +1

vectorijk added a commit to vectorijk/spark that referenced this pull request Jun 3, 2016

revert changes in R/pkg/R/stats.R
 - this changes might happen in #13109

vectorijk added a commit to vectorijk/spark that referenced this pull request Jun 14, 2016

revert changes in R/pkg/R/stats.R
 - this changes might happen in #13109
@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram Jun 20, 2016

Contributor

@felixcheung Is this PR still relevant ?

Contributor

shivaram commented Jun 20, 2016

@felixcheung Is this PR still relevant ?

@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung

felixcheung Jun 20, 2016

Member

I think it does. I will try to update this today.

Member

felixcheung commented Jun 20, 2016

I think it does. I will try to update this today.

@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung

felixcheung Jun 21, 2016

Member

Updated cov, corr into their rd page and kept others.

Member

felixcheung commented Jun 21, 2016

Updated cov, corr into their rd page and kept others.

@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA Jun 21, 2016

Test build #60897 has finished for PR 13109 at commit df0851c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2016

Test build #60897 has finished for PR 13109 at commit df0851c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram
Contributor

shivaram commented Jun 21, 2016

cc @dongjoon-hyun

Show outdated Hide outdated R/pkg/R/stats.R
@dongjoon-hyun

This comment has been minimized.

Show comment
Hide comment
@dongjoon-hyun

dongjoon-hyun Jun 21, 2016

Member

Oh, the root cause exists in generics.R. Nice catch!

Member

dongjoon-hyun commented Jun 21, 2016

Oh, the root cause exists in generics.R. Nice catch!

@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung

felixcheung Jun 21, 2016

Member

It's nasty! 😄

Member

felixcheung commented Jun 21, 2016

It's nasty! 😄

@dongjoon-hyun

This comment has been minimized.

Show comment
Hide comment
@dongjoon-hyun

dongjoon-hyun Jun 21, 2016

Member

In line 333 of functions.R, @rdname covar_pop -> @rdname cov?

#' covar_pop
#'
#' Compute the population covariance between two expressions.
#'
#' @rdname covar_pop
#' @name covar_pop
Member

dongjoon-hyun commented Jun 21, 2016

In line 333 of functions.R, @rdname covar_pop -> @rdname cov?

#' covar_pop
#'
#' Compute the population covariance between two expressions.
#'
#' @rdname covar_pop
#' @name covar_pop
@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung

felixcheung Jun 21, 2016

Member

That's intentional - covar_pop has a separate page.
cov == covar_samp
covar_samp != covar_pop
putting covar_pop on the same rd would have 2 different descriptions there. (I think we really should avoid putting multiple things in one rd..)

Member

felixcheung commented Jun 21, 2016

That's intentional - covar_pop has a separate page.
cov == covar_samp
covar_samp != covar_pop
putting covar_pop on the same rd would have 2 different descriptions there. (I think we really should avoid putting multiple things in one rd..)

@dongjoon-hyun

This comment has been minimized.

Show comment
Hide comment
@dongjoon-hyun

dongjoon-hyun Jun 21, 2016

Member

Yes. Indeed, we had better keep each function on own RD generally.

Member

dongjoon-hyun commented Jun 21, 2016

Yes. Indeed, we had better keep each function on own RD generally.

@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA Jun 21, 2016

Test build #60905 has finished for PR 13109 at commit 9104bbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2016

Test build #60905 has finished for PR 13109 at commit 9104bbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
Show outdated Hide outdated R/pkg/R/stats.R
Show outdated Hide outdated R/pkg/R/stats.R
@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA Jun 21, 2016

Test build #60908 has finished for PR 13109 at commit 10a4dba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2016

Test build #60908 has finished for PR 13109 at commit 10a4dba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@mengxr

This comment has been minimized.

Show comment
Hide comment
@mengxr

mengxr Jun 21, 2016

Contributor

I just checked the generated R doc and I felt that we shouldn't group many methods together. For example, in this PR, the DESCRIPTION section looks okay because we used crosstab - ... and freqItems - .... But the Arguments section becomes very messy and the Value section is even worse. The example code are chained together. See attached screenshots. summary is another example here. It might be better to only group methods are closely related to each other and share the same arguments.

screen shot 2016-06-20 at 11 21 48 pm

Contributor

mengxr commented Jun 21, 2016

I just checked the generated R doc and I felt that we shouldn't group many methods together. For example, in this PR, the DESCRIPTION section looks okay because we used crosstab - ... and freqItems - .... But the Arguments section becomes very messy and the Value section is even worse. The example code are chained together. See attached screenshots. summary is another example here. It might be better to only group methods are closely related to each other and share the same arguments.

screen shot 2016-06-20 at 11 21 48 pm

@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram Jun 21, 2016

Contributor

LGTM. This version looks good to me. Thanks for iterating on this. Will wait for Jenkins and then merge.

Contributor

shivaram commented Jun 21, 2016

LGTM. This version looks good to me. Thanks for iterating on this. Will wait for Jenkins and then merge.

@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram Jun 21, 2016

Contributor

@mengxr Yes - this is true and in #13798 we are making a few more of the methods into individual Rd files. At a high level there is a tradition in R to group together similar methods (https://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html for example) but with roxygen2 that leads to some issues.

I think we can even split the statfunctions ones as they are named differently -- For summary I think its more tricky as the function name is the same ?

Contributor

shivaram commented Jun 21, 2016

@mengxr Yes - this is true and in #13798 we are making a few more of the methods into individual Rd files. At a high level there is a tradition in R to group together similar methods (https://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html for example) but with roxygen2 that leads to some issues.

I think we can even split the statfunctions ones as they are named differently -- For summary I think its more tricky as the function name is the same ?

@mengxr

This comment has been minimized.

Show comment
Hide comment
@mengxr

mengxr Jun 21, 2016

Contributor

Methods documented in colSums share the same parameters and each was only documented once. Roxygen2 supports that if each param doc only appears once in the comment. That grouping looks okay to me, which is different from statfunctions or summary.

For summary, I'm thinking about documenting the corresponding summary methods in the fit methods such as spark.glm, spark.naiveBayes. And then put see alsos in the generic summary doc page.

Contributor

mengxr commented Jun 21, 2016

Methods documented in colSums share the same parameters and each was only documented once. Roxygen2 supports that if each param doc only appears once in the comment. That grouping looks okay to me, which is different from statfunctions or summary.

For summary, I'm thinking about documenting the corresponding summary methods in the fit methods such as spark.glm, spark.naiveBayes. And then put see alsos in the generic summary doc page.

@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA Jun 21, 2016

Test build #60910 has finished for PR 13109 at commit 3760d03.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2016

Test build #60910 has finished for PR 13109 at commit 3760d03.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram Jun 21, 2016

Contributor

@mengxr @felixcheung Can we open a new issue of the form Separate out rd files for SparkR functions ? We can then make a list there of everything thats sharing a rd file right now and see what is the way around things.

@mengxr - also let me know if you think this is good to merge. I think for 2.0 RC1 having this PR is better than not having it ?

Contributor

shivaram commented Jun 21, 2016

@mengxr @felixcheung Can we open a new issue of the form Separate out rd files for SparkR functions ? We can then make a list there of everything thats sharing a rd file right now and see what is the way around things.

@mengxr - also let me know if you think this is good to merge. I think for 2.0 RC1 having this PR is better than not having it ?

@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram Jun 21, 2016

Contributor

Jenkins, retest this please

Contributor

shivaram commented Jun 21, 2016

Jenkins, retest this please

@felixcheung

This comment has been minimized.

Show comment
Hide comment
@felixcheung

felixcheung Jun 21, 2016

Member

Right @shivaram @mengxr I agree it would be nicer to put predict, summary or even write.ml to their corresponding mllib methods.

Member

felixcheung commented Jun 21, 2016

Right @shivaram @mengxr I agree it would be nicer to put predict, summary or even write.ml to their corresponding mllib methods.

@mengxr

This comment has been minimized.

Show comment
Hide comment
@mengxr

mengxr Jun 21, 2016

Contributor

Yes, we should merge this PR first and discuss the grouping later.

Contributor

mengxr commented Jun 21, 2016

Yes, we should merge this PR first and discuss the grouping later.

@mengxr

This comment has been minimized.

Show comment
Hide comment
@mengxr
Contributor

mengxr commented Jun 21, 2016

@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA Jun 21, 2016

Test build #60912 has finished for PR 13109 at commit 3760d03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2016

Test build #60912 has finished for PR 13109 at commit 3760d03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@shivaram

This comment has been minimized.

Show comment
Hide comment
@shivaram

shivaram Jun 21, 2016

Contributor

Cool. Thanks - LGTM. Merging this to master, branch-2.0

Contributor

shivaram commented Jun 21, 2016

Cool. Thanks - LGTM. Merging this to master, branch-2.0

@asfgit asfgit closed this in 843a1eb Jun 21, 2016

asfgit pushed a commit that referenced this pull request Jun 21, 2016

[SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other …
…DataFrame stats functions

## What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)

After
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats function instead of putting everything on one page. At least now it is clearer which description is on which function.

## How was this patch tested?

Build doc

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13109 from felixcheung/rstatdoc.

(cherry picked from commit 843a1eb)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

nchammas added a commit to nchammas/spark that referenced this pull request Jul 11, 2016

[SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other …
…DataFrame stats functions

## What changes were proposed in this pull request?

Doc only changes. Please see screenshots.

Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)

After
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
(please ignore the style differences - this is due to not having the css in my local copy)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats function instead of putting everything on one page. At least now it is clearer which description is on which function.

## How was this patch tested?

Build doc

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13109 from felixcheung/rstatdoc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment