Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading the fst packages initializes the RNG even though it is not used #251

Closed
mlell opened this issue Sep 20, 2020 · 7 comments
Closed
Assignees
Labels
Milestone

Comments

@mlell
Copy link

mlell commented Sep 20, 2020

To make sure that all my scripts are reproducible, I ensure that they either receive a command line argument that sets the RNG seed via set.seed() or that they do not use the RNG. As the R variable .Random.seed is defined on the first RNG run in a session, I throw an error if .Random.seed exists but no SEED was provided to the script.

However, loading the fst packages creates .Random.seed even though the RNG is not used. I show this in the script below: Running sample() before and after loading the packages does not give a different result, but .Random.seed is created.

To provide a means of writing verifiably deterministic R scripts, fst should not initialize the RNG if it is not needed.

I installed the development version of fst using remotes::install_github("fstpackage/fst", ref = "59a18110").
This is the script I use:

RNG_status <- function(id){
  message(sprintf("%10s: .Random.seed exists: ", id), exists(".Random.seed"))
}

cmd <- commandArgs(TRUE)[1]
if(cmd == ".Random.seed"){

  RNG_status("begin")
  message("FST RNG tester")
  RNG_status("message")
  loadNamespace("dplyr")
  RNG_status("loadNamespace(dplyr)")
  loadNamespace("fst")
  RNG_status("loadNamespace(fst)")

}else if(cmd == "samplertest_1"){

  set.seed(123)
  message("Random number: ", sample(1:100, 1))

}else if(cmd == "samplertest_2"){

  set.seed(123)
  loadNamespace("fst")
  message("Random number: ", sample(1:100, 1))

}

And this is the script output in my shell ($ = my input)

$ Rscript test.R .Random.seed
     begin: .Random.seed exists: FALSE
FST RNG tester
   message: .Random.seed exists: FALSE
loadNamespace(dplyr): .Random.seed exists: FALSE
loadNamespace(fst): .Random.seed exists: TRUE
$ Rscript test.R samplertest_1
Random number: 31
$ Rscript test.R samplertest_2
Random number: 31

you can see that .Random.seed is set by the loadNamespace("fst") call but not for (example) loadNamespace("dplyr"). Moreover, a random number between 1 and 100 is chosen identically regardless of whether it is before or after the loadNamespace("fst") call.

@MarcusKlik MarcusKlik self-assigned this Sep 22, 2020
@MarcusKlik
Copy link
Collaborator

Hi @mlell, thanks for submitting your issue!

Parameter .Random.seed is not set directly in fst, but perhaps it's set in some of it's dependencies? Also, I found that when I run your script in a fresh R session after starting rstudio, parameter .Random.seed is already initialized. But starting a new session from the R GUI apparently does not initialize .Random.seed, so something is going on with rstudio as well.

I'm not sure I understand your samplertest_1 and samplertest_2 tests, doesn't that show that you can set the random seed and it's not changed by loading the fst namespace?

(so loading the namespace only initializes the seed when it's not initialized yet?)

@mlell
Copy link
Author

mlell commented Sep 25, 2020

@MarcusKlik , samplertest_1 and samplertest_2 show that the RNG is not used by fst, no random number is actually generated during package init. This was meant to show that fst (or one of the dependencies) does not need the RNG for anything.

But the example above (".Random.seed") shows that it (or a dep) does initialize the RNG, in contrast to other packages like dplyr (i chose that because dplyr makes heavy use of C code. I suspected that the RNG is initialized by Rcpp or so, but apparently it isn't)

Yes I also suspected first that RStudio initializes the RNG, that's why I included the "negative controls" and ran the scripts in base R to eliminate this source of errors.

I hope I managed to be clearer this time...

@riccardoporreca
Copy link

riccardoporreca commented Sep 26, 2020

@MarcusKlik, @mlell, this is due to the functions called .onLoad():
https://github.com/fstpackage/fstcore/blob/c96028e320e8d723f991ef6134dc2245b6e5f1b0/R/package.R#L209
Those are C++ functions exported to R using // [[Rcpp::export]], which are wrapped in RcppExports.cpp with code handling the RNG scope, which I believe implicitly initialize the internal state .Random.seed.

I did a quick check on a local checkout of branch master, replacing all // [[Rcpp::export]] // [[Rcpp::export(rng = false)]]: .Random.seed is then not created upon package loading. It might be a good idea to switch to [[Rcpp::export(rng = false)]], provided the Rcpp-exported functions never use the R random number generator. See the Rcpp-attributes vignette (https://github.com/RcppCore/Rcpp/blob/d6594bae57d796f0385bf9552532422e59a6e56d/vignettes/rmd/Rcpp-attributes.Rmd#L351).

A nice example of this is package qs, with rng = false set / not set e.g. in https://github.com/traversc/qs/blob/953af431514481a25cfe74093fa3a74220e653d8/src/extra_functions.h

NOTE: I am referring to the master branch of fst, develop is now based on fstcore, where the same considerations apply.

@mlell
Copy link
Author

mlell commented Oct 2, 2020

@riccardoporreca, thank you for this research! So is it as simple as adding this attribute, or should we include tests that make sure that the RNG is not used? But I don't know how to do that. First I thought of something similar like in the script above:

set.seed(123) # we know that the first random number in {1..100} with this seed will be 31
C_function_to_test()
expect_equal(sample(1:100, 1), 31)

But would that work? If a C function uses the RNG without initializing, maybe the RNG state will not be forwarded to R so the test cannot detect that? Did I understand that correctly?

Or am I overthinking this?

@riccardoporreca
Copy link

@mlell, I think one should just follow what the vignettes recommends

If you are certain that no C++ code will make use of random number generation and the 2ms of execution time is meaningful in your context, you can disable the automatic injection of RNGScope using the rng parameter of the Rcpp::export attribute

I doubt package maintainers would have to rely on unit tests to know if their C++ code uses RNG, and if that is the case (unless completely expected, like in C++ functions doing random things) this should be investigated and at least documented.

Overall I think we can summarize the matter as follows, which makes it ultimately a matter of choice for @MarcusKlik.

  • .Random.seed being set when fst is loaded is an artifact of the default RNG protection defined by Rcpp when creating R wrappers.
  • Assuming that none of the fst functionality makes use of RNG (I would be surprised of the contrary), the fact that .Random.seed is set upon loading is really just an artifact w/o any consequence on future RNG usage (.Random.seed will be eventually set). This makes the usage of [[Rcpp::export(rng = false)]] a matter of choice to spare the little overhead mentioned in the vignette.

@MarcusKlik
Copy link
Collaborator

Hi @riccardoporreca, great, thanks a lot for clearing that up, I didn't know about Rcpp's automatic establishing of the RNGScope!

In that case, I think we can safely use the Rcpp::export(rng = false) option and claim those 2 ms of startup time :-)

@MarcusKlik MarcusKlik added this to the fst v0.9.6 milestone Oct 5, 2020
MarcusKlik added a commit to MarcusKlik/fstcore that referenced this issue Oct 5, 2020
@MarcusKlik
Copy link
Collaborator

Hi @mlell and @riccardoporreca, with the latest commit (in fstcore), the random seed is not initialized anymore, thanks a lot for the find and solution!

exists(".Random.seed")
#> [1] FALSE

require(fstcore)
#> Loading required package: fstcore

exists(".Random.seed")
#> [1] FALSE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants