# daqana/dqrng

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

# Design of seeds with more than 32 bits of randomness #11

Closed
opened this Issue Feb 1, 2019 · 9 comments

Projects
None yet
Contributor

### LTLA commented Feb 1, 2019 • edited

As suggested by @rstub on #10, I'll try to justify why I chose raw vectors for generating >32 bit seeds.

# Why raw vectors?

Raw vectors are `unsigned char` vectors represented in R, with size in bytes equal to its length (plus some overhead for R vectors). This provides a nice representation in which one can specify the settings for an arbitrary number of bits. Each entry of the raw vector stores 8 bits, so to store more bits, one just uses a longer vector. This is easy to construct in R (see the code in #10); it is also easy to reason with in C++ requiring only a few bit shifts and masks to create a `uintX_t` seed integer from a raw vector.

Let's now consider the integer as an alternative. R uses a 32-bit signed integer representation, so we would need to use two integers to generate a 64-bit seed. Thus, we could use an integer vector instead of a raw vector. Generation of this seed in R could then feasibly be done with:

`sample((-.Machine\$integer.max):(.Machine\$integer.max), 2, replace=TRUE)`

... which would then be converted into unsigned integers and, with some bit operations, used to generate a 64-bit seed. The problem is that the above code will not span the entire 32-bit range of possible signed integers, as `-INT_MAX` is lost to represent `NA_INTEGER` in R. The correct way to do it:

```X <- c(NA, (-.Machine\$integer.max):(.Machine\$integer.max))
sample(X, 2, replace=TRUE)```

... causes a `Error: cannot allocate vector of size 16.0 Gb`. And, to be honest, I find reasoning about signed-to-unsigned integer conversions to be difficult, though this might just be my problem.

The final numerical alternative is to use a `double`. My first attempt was to actually generate a number within the range of possible 64-bit unsigned integers, and pass that to C++:

``````runif(1, 0, 2^64-1)
``````

The problem here is that I am not guaranteed to sample from the full space of seeds, as double-precision numbers cannot represent every 64-bit integer perfectly. There will be some integers that simply will never be sampled, i.e., `runif` cannot generate numbers within `[i, i+1)` for some integers `i`. The coercion from `double` to `uintX_t` in C++ is also difficult as we need to check that the number can fit within the unsigned range; this is not straightforward if we cannot accurately represent the maximum unsigned number as a double for comparison. (And the penalty of not doing so is extreme - undefined behaviour!)

My second, much shorter-lived attempt was to try to directly convert the bits in a floating point number to a 64-bit unsigned integer. This is probably not a good idea - IIRC, it gets into implementation-specific details about how data are represented on different platforms. Moreover, even if this were portably possible, it is not clear how to sample a floating point number in a way that would get an equal chance of setting each bit.

So, by process of elimination, I ended up with a design using raw vectors. I have not considered using strings because `\0` would need to be sampled to offer 8 bits of randomness per character.

# Why is `generateRawSeeds()` an R function?

This is motivated by the usage of PRNGs in R packages. As a R package developer, I use dqrng's PRNGs in my C++ code, seeded to ensure reproducible execution. However, I also want my functions to respond to the state of R's seed, so that users running my functions multiple times (e.g., to check the variance due to the calculations) will get different results. A natural solution to both of these constraints is to randomly define seeds in R and then pass those seeds to the C++ code for seeding and further random number generation.

In this context, it is sensible to have `generateRawSeeds()` as an R function. I can use this function in my package functions to generate seeds in a manner that responds to user-specified `set.seed()`. As an added bonus, it also allows users in interactive sessions to call `generateRawSeeds()` to seed dqrng's R-visible distribution functions with 64-bit of randomness. (Possibly want to set `bits=64` by default).

It is possible to translate this into a C++ function, and just have the R function as a wrapper. This would allow generation of raw-vector seeds from C++ as well, which might be desirable in some settings. I have not done so because... well, I guess I didn't really need it. A direct translation will also have some issues with replacing `sample` with `std::shuffle` due to the latter's lack of reproducibility across platforms. One could use `boost::random::uniform_int_distribution` if R's API exposed the raw PRNG... which it doesn't seem to. In scran, I have added my own C++ shuffling algorithm to ensure that I get the same result across platforms, but this was more of a change than I was willing to make here without further approval. It should be possible to do something fairly simple with `R::unif_rand() * 256` if this is considered to be necessary.

Member

### rstub commented Feb 1, 2019 • edited

Let's see what's currently implemented:

• C++ users that only use the header files are left alone.
• When an R or C++ user uses the library, an RNG is instantiated and seeded from `std::random_device` and time (the latter because of mingw where `std::random_device` is deterministic).
• When an R or C++ user changes the RNG, a 64bit integer is drawn from the current RNG and used to seed the new one.
• Both R and C++ users can use a 32 bit integer (signed for R, unsigned for C++) to re-seed an existing RNG.

I see two additional uses cases from what you are describing.

## Seeding with more than 32 bits

You are correctly pointing out that seeding with only 32 bits is not very good. I would still prefer it, though, if the user could make some sense from the command. Something like

``````dqset.seed(c(4711, 42))
``````

could be used to specify two (or more) integers. This is the sort of usage that I am missing when using raw vectors.

The integer vector might then either be used directly or fed to `sed_seq_fe` to provide the seeds. Alternatively, the `bit64`package could be used. This provides (signed) 64bit integers for R users by reinterpreting the bit pattern of doubles. Using them is quite simple:

```#include <cstdint>
#include <Rcpp.h>

// [[Rcpp::export]]
void printSeed1(Rcpp::NumericVector seed_vec) {
uint64_t _seed;
double seed;
static_assert(sizeof seed == sizeof _seed, "Size does not match!");
for (int i = 0; i < seed_vec.length(); ++i) {
seed = seed_vec(i);
std::memcpy(&_seed, &seed, sizeof seed);
Rcpp::Rcout << "double: " << seed << std::endl;
Rcpp::Rcout << "uint64: " << _seed << std::endl;
}
}

// [[Rcpp::export]]
void printSeed2(Rcpp::NumericVector seed_vec) {
int64_t _seed;
double seed;
static_assert(sizeof seed == sizeof _seed, "Size does not match!");
for (int i = 0; i < seed_vec.length(); ++i) {
seed = seed_vec(i);
std::memcpy(&_seed, &seed, sizeof seed);
Rcpp::Rcout << "double: " << seed << std::endl;
Rcpp::Rcout << "int64: " << _seed << std::endl;
}
}

/*** R
library(bit64)
seed <- as.integer64(c("1234567890987654321", "12345", "-1234567890987654321"))
seed
unclass(seed)
printSeed1(seed)
printSeed2(seed)
*/```

Output:

``````> library(bit64)

> seed <- as.integer64(c("1234567890987654321", "12345", "-1234567890987654321"))

> seed
integer64
 1234567890987654321  12345                -1234567890987654321

> unclass(seed)
  3.813124e-226  6.099240e-320 -1.107996e+226

> printSeed1(seed)
double: 3.81312e-226
uint64: 1234567890987654321
double: 6.09924e-320
uint64: 12345
double: -1.108e+226
uint64: 17212176182721897295

> printSeed2(seed)
double: 3.81312e-226
int64: 1234567890987654321
double: 6.09924e-320
int64: 12345
double: -1.108e+226
int64: -1234567890987654321
``````

So an `integer64` vector is actually a `double` vector with an additional class and can be moved to C++ as `Rcpp::NumericVector`. Reinterpreting these doubles as signed 64 bit integers gives us the original interpretation in R. Reinterpreting them as unsigned 64 bit integers gives a different interpretation for negative/large values, but that's more useful when using them as seed.

Overall I would prefer using an `IntegerVector`, a single `integer64` or an `integer64` vector as input for `dqset.seed`.

## Seeding using R's RNG as source

You want some `randomize` function which uses R's RNG to provide you with a new seed for the RNG in use. Getting a random `uint32_t` from R is quite easy at the C++ level:

```uint32_t random32() {
return R::runif(0, 1) * 4294967296; /* 2^32 */
}```

This works correctly for the default Mersenne-Twister. For other RNGs implemented in R, this will probably miss some possible integer values, since the RNG does not provide 32 bits of randomness. I don't see an easy way to solve this, though. So I would find it easiest to have a C++ function which calls R's RNG to provide whatever extended seed structure is used. That structure can then either be returned to R, so that the user can safe the values for reproducing the results in the future. Or it can be fed to the RNGs, just like above.

BTW, I like the idea of an `randomize` function, since I could use that instead of `std::random_device`. That's from the `random` header which one is not supposed to use in R extensions. And the analogue from boost requires linking so that the `BH` package is not enough. Of course, this would also mean that I should no longer use Mersenne-Twister as default RNG, since using MT (default in R) to generate seeds for MT is considered a bad idea. But then there are other reasons to remove MT: Currently it also comes from the `random` header. And the version from boost misses a few `constexpr`. In addition, all other RNGs support multiple streams some way or other, which I would like to make available from R. With MT one can only use poor-man's streams, i.e. consecutive seeds, and hope for the best ...

This is motivated by the usage of PRNGs in R packages. As a R package developer, I use dqrng's PRNGs in my C++ code, seeded to ensure reproducible execution. However, I also want my functions to respond to the state of R's seed, so that users running my functions multiple times (e.g., to check the variance due to the calculations) will get different results. A natural solution to both of these constraints is to randomly define seeds in R and then pass those seeds to the C++ code for seeding and further random number generation.

Hmm, I do not quite understand what you mean by that. Do you create a new RNG on every function call?

BTW, is the source code of these two packages on GitHub or similar. I am curious how you are using the package. :-)

Contributor Author

### LTLA commented Feb 1, 2019 • edited

Thanks for the comments @rstub.

# Using raw vectors as the seed

I can empathize with your wariness towards raw vectors, as I felt the same way when I first saw them. (What's this `0e f6 a3` nonsense? Must be a segmentation fault!) But I have grudgingly come around to concede that they are useful in some cases. I'll start by pointing out that you can directly create raw-vector seeds:

`dqset.seed(as.raw(c(0x10, 0x20, 0x30, 0x40)))`

This is about as interpretable as a integer-vector seed; the values aren't very user-friendly in either case. Yes, it's more typing, but someone going for an easy manual initialization should just supply an integer scalar directly; they'll never do it enough times to run into statistical biases from the limited state space.

If manual usability is critical, integer vectors are probably better than the `bit64` approach. The latter requires an additional `Depends` and doesn't allow us to avoid using vectors (e.g., if you're seeding one of the PRNGs with 128 bits of state in your package's C++ code). As you say, sampling these integers across their full range would be relatively easy in C++, though passing them to R would require some care with the signed/unsigned transition. And I, at least, do want some method of getting the random seeds in my R session - see below.

I'm also surprised that `bit64` involves reinterpreting a `double` like that, I thought that would be implementation-specific behaviour. Maybe it's always the same on the platforms on which R compiles...

# Using R's RNG to define the seed

The clearest example is probably in DropletUtils in the `emptyDrops` function. This operates on single-cell genomics data where the C++ code loops across cells and does a series of Monte Carlo simulations to compute a per-cell p-value. Normally, I would just use R's PRNG within C++, i.e., initialize the PRNG at the start of the C++ function and run through all cells without resetting the PRNG or anything.

The problem is that this code is potentially parallelized at the R-level. This is achieved via the the `bplapply` call (from the BiocParallel package), with a variety of parallelization schemes defined by the `BPPARAM` argument. This means that I get a different series of random numbers depending on the parallelization scheme, i.e., number of workers and how the jobs are partitioned across those workers. Obviously, this is not desirable as I would like to get the same results regardless of the chosen parallelization.

The solution is to seed a fresh PRNG for each cell. This guarantees that I get the same results regardless of how the jobs (one job per cell) are distributed across workers, as each cell's series of random numbers are unrelated to that of another cell's. Moreover, I'm using PCG32 with different streams, which provides an added level of reassurance with respect to statistical independence. The construction of the PRNG is done within the cell-by-cell loop in C++, but the generation of the seeds is still performed in the R code where (i) it can be affected by `set.seed()` and (ii) it is still serially executed and thus avoids weird changes in seeds in workers under some parallelization schemes, e.g., those using cluster job schedulers.

This approach does, however, rely on a lot of re-seeding, which means that statistical biases may manifest in the random numbers if I can only seed a part of the state space. Which brings us to #10 and this issue.

Contributor Author

### LTLA commented Feb 2, 2019 • edited

 If raw vectors are too unpalatable, I just did a refactor to use integer vectors - this is on the `integerseed` branch of my fork here. As anticipated, some work was required to guarantee a sensible `uint32_t` to `int` conversion, which would otherwise be implementation-defined if we left it to a `static_cast`. The relevant file is `convert_seed.h`, which contains all the new goodies.
Member

### rstub commented Feb 2, 2019

 Thanks for your comments, @LTLA. You almost had me convinced that raw vectors are a good idea. However, your code triggered my urge to experiment with Melissa O'Neil's `seed_seeq_fe` mentioned above. And I think this looks quite promising: One can feed it any number of `int`s, which might be supplied by the user or derived from R's RNG. One can get a vector of `int`s from it that can used to reconstruct an identical copy. Supposedly one can use it to seed any C++11 compatible RNG. Example: ```#include #include "randutils.hpp" int32_t R_random_32 () { constexpr double upper_limit=2147483648; double val=R::unif_rand() * upper_limit * 2 - upper_limit; if (val >= upper_limit || val < -upper_limit) { val=-upper_limit; } // Absolutely avoid over-/underflow. return static_cast(val); } // [[Rcpp::export]] Rcpp::IntegerVector getParam() { Rcpp::IntegerVector raw_seeds(4); std::generate(raw_seeds.begin(), raw_seeds.end(), R_random_32); randutils::seed_seq_fe128 seeder(raw_seeds.begin(), raw_seeds.end()); Rcpp::IntegerVector param(4); seeder.param(param.begin()); return param; } // [[Rcpp::export]] Rcpp::IntegerVector getSeeds(int n, Rcpp::Nullable param = R_NilValue) { Rcpp::IntegerVector raw_seeds(4); if (param.isNull()) { std::generate(raw_seeds.begin(), raw_seeds.end(), R_random_32); } else { raw_seeds = param.as(); } randutils::seed_seq_fe128 seeder(raw_seeds.begin(), raw_seeds.end()); Rcpp::IntegerVector seeds(n); seeder.generate(seeds.begin(), seeds.end()); return seeds; } /*** R set.seed(42) getSeeds(6) getSeeds(6) set.seed(42) getSeeds(6) getSeeds(6) set.seed(42) one <- getParam() two <- getParam() getSeeds(6, one) getSeeds(6, two) getSeeds(6, 42) getSeeds(6, c(4711, 42)) */``` Output: ``````>set.seed(42) >getSeeds(6)  -1368039275 -1100785627 1327213203 1033422429 1573423832 475391043 > getSeeds(6)  909206077 1132750064 872894922 151901752 1480135365 -947383956 > set.seed(42) > getSeeds(6)  -1368039275 -1100785627 1327213203 1033422429 1573423832 475391043 > getSeeds(6)  909206077 1132750064 872894922 151901752 1480135365 -947383956 > set.seed(42) > one <- getParam() > two <- getParam() > getSeeds(6, one)  -1368039275 -1100785627 1327213203 1033422429 1573423832 475391043 > getSeeds(6, two)  909206077 1132750064 872894922 151901752 1480135365 -947383956 > getSeeds(6, 42)  -850130249 -1625411987 2046530742 -713526308 1691623607 2099784219 > getSeeds(6, c(4711, 42))  1485050188 -994439850 532523273 -1671404795 -1234644656 810452082 `````` Right now I am unsure, though, how that handles the unsigned to signed conversion. So the above results might be implementation specific (at least until C++20 is used as I just learned). BTW, did you consider `std::memcopy` for that? Concerning you parallel RNG usage: The solution is to seed a fresh PRNG for each cell. This guarantees that I get the same results regardless of how the jobs (one job per cell) are distributed across workers, as each cell's series of random numbers are unrelated to that of another cell's. Moreover, I'm using PCG32 with different streams, which provides an added level of reassurance with respect to statistical independence. The construction of the PRNG is done within the cell-by-cell loop in C++, but the generation of the seeds is still performed in the R code where (i) it can be affected by set.seed() and (ii) it is still serially executed and thus avoids weird changes in seeds in workers under some parallelization schemes, e.g., those using cluster job schedulers. I am not sure that this is the way PCG should be used in parallel. My understanding is that one would use the same seed for all the RNGs. The different workers would then do one of two things: switch to a different stream skip ahead for different amounts In both cases the random numbers generated should be independent, although skipping ahead is probably unpractical given the (rather short) period of pcg32. Using different seeds is similar to skipping ahead, but one does not know how many numbers one can draw. You are mitigating this using in addition different streams. But the different streams alone should be enough.
Contributor Author

### LTLA commented Feb 3, 2019 • edited

 Regarding `seed_seq_fe`: I don't have much of an opinion here. It seems cool but I was happy to rely on R's MT stream being "random enough" for seeding purposes. Happy to go with whatever you decide. BTW, did you consider `std::memcopy` for that? Briefly. But I didn't know whether a `memcpy` from an unsigned to signed type would behave in a standard way, and I didn't want to take responsibility for introducing difficult-to-track implementation-specific behaviour. But the different streams alone should be enough. That's probably true. I don't think it hurts to have different seeds either, but I'm willing to be told otherwise. (Do you know someone on the PCG team who could give an authoritative answer?) It would make it slightly easier for me if I just had to create one PCG seed vector in my functions.... but not that much, and it would have to be done in the serially executed part of R anyway.
Contributor Author

### LTLA commented Feb 3, 2019

 Edit, 5 hours later: I spent some time thinking about the use of `seed_seq_fe` here - and I don't think it's necessary. From what I understand, the seed sequence is intended to "inflate" a small amount of entropy to get enough bits for seeding. But here, R's MT stream is already giving us decent random numbers for seeding. There's no need to complicate the dqrng API with extra work and bug potential, e.g., signed vs. unsigned, how to guarantee randomness/full state space exploration when more than four integers are requested. Perhaps the use of the seed sequence would be useful for handling the case where `dqset.seed()` is supplied with a single integer. Otherwise, the current changes in #10 (now switched to using integer vectors, as suggested) seem sufficient for all but the most advanced PRNG users.
Member

### rstub commented Feb 5, 2019

 I spent some time thinking about the use of seed_seq_fe here - and I don't think it's necessary. From what I understand, the seed sequence is intended to "inflate" a small amount of entropy to get enough bits for seeding. But here, R's MT stream is already giving us decent random numbers for seeding. ACK. It might make sense for seeding MT, but MT support in dqrng is going to be removed anyway. I don't think it hurts to have different seeds either, but I'm willing to be told otherwise. (Do you know someone on the PCG team who could give an authoritative answer?) Unfortunately I have no such contact. But I have seen the issue you raised in the GitHub repository. Let's hope you will get an answer there. The blog article you linked is indeed rather unclear. Section 4.3.2 of her paper reads quite different: Every choice of c [the additive constant in the LCG] results in a different sequence of numbers that has none of its pairs of successive outputs in common with another sequence. That's why I used constant seed and consecutive stream IDs in the parallel vignette.

Open

Member

### rstub commented Feb 7, 2019

 Fixed in #10.

### rstub closed this Feb 7, 2019

Contributor Author

### LTLA commented Feb 25, 2019 • edited

For posterity's sake, I had a go at using dqrng with multiple streams vs multiple seeds:

## streamer.cpp

```// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>

// [[Rcpp::depends(dqrng)]]
#include "convert_seed.h"
#include "pcg_random.hpp"

// [[Rcpp::export(rng=false)]]
Rcpp::IntegerVector spew_raw_numbers(Rcpp::IntegerVector seed, int stream, int N=10000) {
pcg32 rng(dqrng::convert_seed<uint64_t>(seed), stream);
Rcpp::IntegerVector output(N);

for (auto& o : output) {
uint32_t sampled=rng();

// Converting uint to int.
constexpr uint32_t max_int=2147483647;
if (sampled <= max_int) {
o=static_cast<int>(sampled);
} else {
constexpr uint32_t max_uint=-1;
o=-static_cast<int>(max_uint - sampled) - 1;
}
}

return output;
}```

## R code

```Rcpp::sourceCpp("streamer.cpp")

collected0 <- collected1 <- collected2 <- numeric(1000)
for (i in seq_along(collected1)) {
out <- dqrng::generateSeedVectors(2)

x1_1 <- spew_raw_numbers(out[], 1, 20000)
x1_1a <- head(x1_1, 10000)
x1_1b <- tail(x1_1, 10000)

x2_2 <- spew_raw_numbers(out[], 2)
x1_2 <- spew_raw_numbers(out[], 2)

# Switch functions to test different types of correlations:
#    FUN <- cor
#    FUN <- function(...) cor(..., method="spearman")
FUN <- function(...) max(abs(ccf(..., plot=FALSE)\$acf))

collected0[i] <- FUN(x1_1a, x1_1b)
collected1[i] <- FUN(x1_1a, x2_2)
collected2[i] <- FUN(x1_1a, x1_2)
}

mean(abs(collected0))
mean(abs(collected1))
mean(abs(collected2))

max(abs(collected0))
max(abs(collected1))
max(abs(collected2))```

Yeah, it's not very sophisticated, but it's the best I've got without further knowledge of the failure modes of this PRNGs. I really wish someone would chime in on imneme/pcg-cpp#49...

## tl;dr

For the amount of random numbers I'm generating, there are no observable differences in mean or maximum correlation from a jump-ahead approach on a single stream compared to multiple streams with the same seed or multiple streams with different seeds. So both our approaches are probably fine. Probably.

to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.