Skip to content

Commit

Permalink
Feature/weighted sampling (#47)
Browse files Browse the repository at this point in the history
Weighted sampling

* Move sampling code to separate header file
* Add weighted sampling methods
* Document sampling algorithm
* Add vignette on sampling methods
  • Loading branch information
rstub committed Jul 30, 2023
1 parent 98e0da3 commit 547ed8e
Show file tree
Hide file tree
Showing 24 changed files with 527 additions and 140 deletions.
9 changes: 6 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Package: dqrng
Type: Package
Title: Fast Pseudo Random Number Generators
Version: 0.3.0.1
Version: 0.3.0.2
Authors@R: c(
person("Ralf", "Stubner", email = "ralf.stubner@gmail.com", role = c("aut", "cre")),
person("daqana GmbH", role = "cph"),
Expand All @@ -19,10 +19,13 @@ Description: Several fast random number generators are provided as C++
numbers according to a uniform, normal and exponential distribution
are included. The latter two use the Ziggurat algorithm originally
proposed by Marsaglia and Tsang (2000, <doi:10.18637/jss.v005.i08>).
These functions are exported to R and as a C++ interface and are
The fast sampling methods support weighted and unweighted sampling both with
and without replacement. For weighted sampling a stochastic acceptance method
suggested by Lipowski and Lipowska (2012, <doi:10.1016/j.physa.2011.12.004>)
is used. These functions are exported to R and as a C++ interface and are
enabled for use with the default 64 bit generator from the PCG family,
Xoroshiro128+ and Xoshiro256+ as well as the 64 bit version of the 20 rounds
Threefry engine (Salmon et al., 2011 <doi:10.1145/2063384.2063405>) as
Threefry engine (Salmon et al., 2011, <doi:10.1145/2063384.2063405>) as
provided by the package 'sitmo'.
License: AGPL-3 | file LICENSE
Depends: R (>= 3.1.0)
Expand Down
1 change: 1 addition & 0 deletions LICENSE
Original file line number Diff line number Diff line change
Expand Up @@ -16,4 +16,5 @@ License: AGPL-3

Files: *
Copyright: 2018-2019 Ralf Stubner (daqana GmbH)
2022-2023 Ralf Stubner
License: AGPL-3
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
# dgrng (unreleased)

* new method `dqrrademacher` for drawing Rademacher weights (Kyle Butts in [#50](https://github.com/daqana/dqrng/pull/50 fixing [#49](https://github.com/daqana/dqrng/pull/49))
* Implement weighted sampling with and without replacement.
Move sampling methods to separate header file, allowing for parallel usage.
([#47](https://github.com/daqana/dqrng/pull/47) fixing [#18](https://github.com/daqana/dqrng/issues/18) and [#26](https://github.com/daqana/dqrng/issues/26))

# dqrng 0.3.0

Expand Down
7 changes: 2 additions & 5 deletions R/dqsample.R
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
##' @param size a non-negative integer giving the number of items to choose.
##' @param replace should sampling be with replacement?
##' @param prob a vector of probability weights for obtaining the elements of the vector being sampled.
##' @seealso \code{\link{sample}} and \code{\link{sample.int}}
##' @seealso \code{vignette("sample", package = "dqrng")}, \code{\link{sample}} and \code{\link{sample.int}}
##' @export
dqsample <- function(x, size, replace = FALSE, prob = NULL) {
if(length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 1) {
Expand All @@ -19,10 +19,7 @@ dqsample <- function(x, size, replace = FALSE, prob = NULL) {
##' @rdname dqsample
##' @export
dqsample.int <- function(n, size = n, replace = FALSE, prob = NULL) {
if (!is.null(prob)) {
warning("Using 'prob' is not supported yet. Using default 'sample.int'.")
sample.int(n, size, replace, prob)
} else if (n <= .Machine$integer.max)
if (n <= .Machine$integer.max)
dqsample_int(n, size, replace, prob, 1L)
else
dqsample_num(n, size, replace, prob, 1L)
Expand Down
32 changes: 28 additions & 4 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -35,12 +35,13 @@ install.packages("dqrng")
```

Intermediate releases can also be obtained via
[drat](https://cran.r-project.org/package=drat):
[r-universe](https://rstub.r-universe.dev/dqrng):

```r
if (!requireNamespace("drat", quietly = TRUE)) install.packages("drat")
drat::addRepo("daqana")
install.packages("dqrng")
options(repos = c(
rstub = 'https://rstub.r-universe.dev',
CRAN = 'https://cloud.r-project.org'))
install.packages('dqrng')
```

## Example
Expand Down Expand Up @@ -89,6 +90,29 @@ bm[, 1:4]

Note that sampling from `10^10` elements triggers "long-vector support" in R.

It is also possible to use weighted sampling both with replacement:

```{r sampling3}
m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
```

And without replacement:

```{r sampling4}
bm <- bench::mark(sample.int(m, n, prob = prob),
dqsample.int(m, n, prob = prob),
check = FALSE)
bm[, 1:4]
```

Especially for weighted sampling without replacement the performance advantage compared with R's default methods is particularly large.

In addition the RNGs provide support for multiple independent streams for parallel usage:

```{r parallel}
Expand Down
74 changes: 55 additions & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,12 +29,13 @@ install.packages("dqrng")
```

Intermediate releases can also be obtained via
[drat](https://cran.r-project.org/package=drat):
[r-universe](https://rstub.r-universe.dev/dqrng):

``` r
if (!requireNamespace("drat", quietly = TRUE)) install.packages("drat")
drat::addRepo("daqana")
install.packages("dqrng")
options(repos = c(
rstub = 'https://rstub.r-universe.dev',
CRAN = 'https://cloud.r-project.org'))
install.packages('dqrng')
```

## Example
Expand All @@ -57,11 +58,11 @@ They are quite a bit faster, though:
N <- 1e4
bm <- bench::mark(rnorm(N), dqrnorm(N), check = FALSE)
bm[, 1:4]
#> # A tibble: 2 x 4
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 rnorm(N) 598.9µs 670µs 1414.
#> 2 dqrnorm(N) 85.5µs 89µs 9663.
#> 1 rnorm(N) 612µs 685.2µs 1397.
#> 2 dqrnorm(N) 86µs 88.6µs 10388.
```

This is also true for the provided sampling functions with replacement:
Expand All @@ -75,13 +76,13 @@ bm <- bench::mark(sample.int(m, n, replace = TRUE),
dqsample.int(1e3*m, n, replace = TRUE),
check = FALSE)
bm[, 1:4]
#> # A tibble: 4 x 4
#> # A tibble: 4 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE) 6.94ms 7.52ms 131.
#> 2 sample.int(1000 * m, n, replace = TRUE) 8.8ms 9.64ms 101.
#> 3 dqsample.int(m, n, replace = TRUE) 304.75µs 444.96µs 2207.
#> 4 dqsample.int(1000 * m, n, replace = TRUE) 397.96µs 675.24µs 1502.
#> 1 sample.int(m, n, replace = TRUE) 6.88ms 7.63ms 114.
#> 2 sample.int(1000 * m, n, replace = TRUE) 8.72ms 9.55ms 96.1
#> 3 dqsample.int(m, n, replace = TRUE) 482.21µs 810.29µs 1254.
#> 4 dqsample.int(1000 * m, n, replace = TRUE) 492.79µs 822.86µs 1275.
```

And without replacement:
Expand All @@ -93,21 +94,56 @@ bm <- bench::mark(sample.int(m, n),
dqsample.int(m, n),
dqsample.int(1e3*m, n),
check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
bm[, 1:4]
#> # A tibble: 5 x 4
#> # A tibble: 5 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n) 38.59ms 51.73ms 19.6
#> 2 sample.int(1000 * m, n) 11.98ms 15.34ms 63.9
#> 3 sample.int(m, n, useHash = TRUE) 9.94ms 12.73ms 71.5
#> 4 dqsample.int(m, n) 942.04µs 1.05ms 755.
#> 5 dqsample.int(1000 * m, n) 1.86ms 2.44ms 315.
#> 1 sample.int(m, n) 40.1ms 42.54ms 23.5
#> 2 sample.int(1000 * m, n) 12.19ms 14.38ms 67.8
#> 3 sample.int(m, n, useHash = TRUE) 9.43ms 11.17ms 81.9
#> 4 dqsample.int(m, n) 1.22ms 1.35ms 638.
#> 5 dqsample.int(1000 * m, n) 1.98ms 2.51ms 358.
```

Note that sampling from `10^10` elements triggers “long-vector support”
in R.

It is also possible to use weighted sampling both with replacement:

``` r
m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 22.02ms 23.82ms 41.7
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 5.05ms 5.41ms 183.
```

And without replacement:

``` r
bm <- bench::mark(sample.int(m, n, prob = prob),
dqsample.int(m, n, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, prob = prob) 13.63s 13.63s 0.0734
#> 2 dqsample.int(m, n, prob = prob) 5.16ms 5.63ms 175.
```

Especially for weighted sampling without replacement the performance
advantage compared with R’s default methods is particularly large.

In addition the RNGs provide support for multiple independent streams
for parallel usage:

Expand Down
14 changes: 14 additions & 0 deletions inst/include/dqrng_generator.h
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
// Copyright 2018-2019 Ralf Stubner (daqana GmbH)
// Copyright 2023 Ralf Stubner
//
// This file is part of dqrng.
//
Expand Down Expand Up @@ -73,6 +74,7 @@ class random_64bit_wrapper : public random_64bit_generator {
public:
random_64bit_wrapper() : gen() {};
random_64bit_wrapper(result_type seed) : gen(seed) {};
random_64bit_wrapper(result_type seed, result_type stream) : gen(seed, stream) {};
virtual result_type operator() () {return this->bit64();}
virtual void seed(result_type seed) {cache = false; gen.seed(seed);}
virtual void seed(result_type seed, result_type stream) {throw std::runtime_error("Stream handling not supported for this RNG!");}
Expand Down Expand Up @@ -189,6 +191,18 @@ typename std::enable_if<std::is_base_of<random_64bit_generator, RNG>::value, rng
generator (uint64_t seed) {
return std::make_shared<RNG>(seed);
}

template<typename RNG = default_64bit_generator>
typename std::enable_if<!std::is_base_of<random_64bit_generator, RNG>::value, rng64_t>::type
generator (uint64_t seed, uint64_t stream) {
return std::make_shared<random_64bit_wrapper<RNG>>(seed, stream);
}

template<typename RNG = default_64bit_generator>
typename std::enable_if<std::is_base_of<random_64bit_generator, RNG>::value, rng64_t>::type
generator (uint64_t seed, uint64_t stream) {
return std::make_shared<RNG>(seed, stream);
}
} // namespace dqrng

#endif // DQRNG_GENERATOR_H
107 changes: 107 additions & 0 deletions inst/include/dqrng_sample.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
// Copyright 2018-2019 Ralf Stubner (daqana GmbH)
// Copyright 2022 Ralf Stubner
//
// This file is part of dqrng.
//
// dqrng is free software: you can redistribute it and/or modify it
// under the terms of the GNU Affero General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// dqrng is distributed in the hope that it will be useful, but
// WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with dqrng. If not, see <http://www.gnu.org/licenses/>.

#ifndef DQRNG_SAMPLE_H
#define DQRNG_SAMPLE_H 1

#include <mystdint.h>
#include <Rcpp.h>
#include <dqrng_generator.h>
#include <dqrng_distribution.h>
#include <minimal_int_set.h>

namespace dqrng {
namespace sample {
template<int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> replacement(std::function<INT(INT)> selector, INT m, INT n, int offset) {
using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
Rcpp::Vector<RTYPE> result(Rcpp::no_init(n));
std::generate(result.begin(), result.end(),
[m, offset, selector] () {return static_cast<storage_t>(offset + selector(m));});
return result;
}

template<int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> no_replacement_shuffle(std::function<INT(INT)> selector, INT m, INT n, int offset) {
using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
Rcpp::Vector<RTYPE> tmp(Rcpp::no_init(m));
std::iota(tmp.begin(), tmp.end(), static_cast<storage_t>(offset));
for (INT i = 0; i < n; ++i) {
std::swap(tmp[i], tmp[i + selector(m - i)]);
}
if (m == n)
return tmp;
else
return Rcpp::Vector<RTYPE>(tmp.begin(), tmp.begin() + n);
}

template<int RTYPE, typename INT, typename SET>
inline Rcpp::Vector<RTYPE> no_replacement_set(std::function<INT(INT)> selector, INT m, INT n, int offset) {
using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
Rcpp::Vector<RTYPE> result(Rcpp::no_init(n));
SET elems(m, n);
for (INT i = 0; i < n; ++i) {
INT v = selector(m);
while (!elems.insert(v)) {
v = selector(m);
}
result(i) = static_cast<storage_t>(offset + v);
}
return result;
}

template<typename INT>
inline INT roulette_wheel_selection(const dqrng::rng64_t &rng, INT m, Rcpp::NumericVector probs, double max_probs) {
while (true) {
INT index = (*rng)(m);
if (dqrng::uniform01((*rng)()) < probs(index) / max_probs)
return index;
}
}

template<int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> sample(const dqrng::rng64_t &rng,
INT m,
INT n,
bool replace,
Rcpp::Nullable<Rcpp::NumericVector> probs = R_NilValue,
int offset = 0) {
std::function<INT(INT)> selector = [rng] (INT m) {return (*rng)(m);};
if (probs.isNotNull()) {
Rcpp::NumericVector tmp = probs.as();
double max_probs = Rcpp::max(tmp);
selector = [rng, tmp, max_probs] (INT m) {return roulette_wheel_selection<INT>(rng, m, tmp, max_probs);};
}
if (replace || n <= 1) {
return dqrng::sample::replacement<RTYPE, INT>(selector, m, n, offset);
} else {
if (!(m >= n))
Rcpp::stop("Argument requirements not fulfilled: m >= n");
if (m < 2 * n) {
return dqrng::sample::no_replacement_shuffle<RTYPE, INT>(selector, m, n, offset);
} else if (m < 1000 * n) {
return dqrng::sample::no_replacement_set<RTYPE, INT, dqrng::minimal_bit_set>(selector, m, n, offset);
} else {
return dqrng::sample::no_replacement_set<RTYPE, INT, dqrng::minimal_hash_set<INT>>(selector, m, n, offset);
}
}
}
} // sample
} // dqrng

#endif // DQRNG_SAMPLE_H
3 changes: 2 additions & 1 deletion man/dqrng-package.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/dqsample.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit 547ed8e

Please sign in to comment.