Skip to content

[SPARK-21780][R] Simpler Dataset.sample API in R#19243

Closed
HyukjinKwon wants to merge 2 commits intoapache:masterfrom
HyukjinKwon:SPARK-21780
Closed

[SPARK-21780][R] Simpler Dataset.sample API in R#19243
HyukjinKwon wants to merge 2 commits intoapache:masterfrom
HyukjinKwon:SPARK-21780

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Sep 15, 2017

What changes were proposed in this pull request?

This PR make sample(...) able to omit withReplacement defaulting to FALSE.

In short, the following examples are allowed:

> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, fraction=1.0))
[1] 10

In addition, this PR also adds some type checking logics as below:

> sample(df, fraction = "a")
Error in sample(df, fraction = "a") :
  fraction must be numeric; however, got character
> sample(df, fraction = 1, seed = NULL)
Error in sample(df, fraction = 1, seed = NULL) :
  seed must not be NULL or NA; however, got NULL
> sample(df, list(1), 1.0)
Error in sample(df, list(1), 1) :
  withReplacement must be logical; however, got list
> sample(df, fraction = -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement

How was this patch tested?

Manually tested, unit tests added in R/pkg/tests/fulltests/test_sparkSQL.R.

@HyukjinKwon
Copy link
Member Author

@felixcheung, could you take a look when you have some time please?

@SparkQA
Copy link

SparkQA commented Sep 15, 2017

Test build #81818 has finished for PR 19243 at commit 3debd9f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 15, 2017

Test build #81817 has finished for PR 19243 at commit 680157e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 15, 2017

Test build #81819 has finished for PR 19243 at commit 3debd9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

let me think about this a bit...

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Sep 15, 2017

Sure. This one is a bit tricky. Let me try to find out a better way too.

@felixcheung
Copy link
Member

felixcheung commented Sep 17, 2017

thinking about this, I wonder if it is more common in R to skip param with default values and the rest of param by names, like sample(df, fraction=1.0)

with

signature(x = "SparkDataFrame"),
function(x, withReplacement = FALSE, fraction, seed = NULL) 

for instance, these just work (tested this)

sample(df, fraction=0.5, seed=3)
sample(df, withReplacement=TRUE, fraction=0.5, seed=3)
sample(df, fraction=1.0)
sample(df, FALSE, fraction=1.0)
sample(df, 1.0, withReplacement=FALSE)

these don't

sample(df, 0.5, 3) # this sort of does work actually, with withReplacement set to 0.5
sample(df, 1.0)

@HyukjinKwon
Copy link
Member Author

Sure, let me minimise the changes as you suggested for now and keep the current change somewhere in my local just in case. That makes sense to me too.

@SparkQA
Copy link

SparkQA commented Sep 19, 2017

Test build #81903 has finished for PR 19243 at commit 287f2b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we actually want to change to not documenting the default value if it is already in the signature - because then it would be the roxygen2 generated doc

Copy link
Member

@felixcheung felixcheung Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the style here is to have one space between the param name and its value, like as seed = 3 and fraction = 0.5 for the line above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd then wrap withReplacement as as.logical(withReplacement) and fraction as as.numeric(fraction)
because it might be coercible (note the L)

> is.numeric(1L)
[1] TRUE

but passing as integer could cause callJMethod to match to a different signature on the JVM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should be careful only as.integer if it isn't NULL or NA

> as.integer(NULL)
integer(0)
> as.integer(NA)
[1] NA

@HyukjinKwon
Copy link
Member Author

Thanks @felixcheung, will address the comments soon.

@SparkQA
Copy link

SparkQA commented Sep 19, 2017

Test build #81925 has finished for PR 19243 at commit 580ae5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 19, 2017

Test build #81926 has finished for PR 19243 at commit cea3625.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM one comment


if (!missing(seed)) {
if (is.null(seed) || is.na(seed)) {
stop(paste("seed must not be NULL or NA; however, got", class(seed)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually doesn't work for NA

> class(NULL)
[1] "NULL"
> class(NA)
[1] "logical"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see ..

@SparkQA
Copy link

SparkQA commented Sep 20, 2017

Test build #81978 has finished for PR 19243 at commit b1a86ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Merged to master.

@asfgit asfgit closed this in a8d9ec8 Sep 21, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-21780 branch January 2, 2018 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants