Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/weighted sampling #47

Merged
merged 19 commits into from Jul 30, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
9 changes: 6 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: dqrng
Type: Package
Title: Fast Pseudo Random Number Generators
Version: 0.3.0.1
Version: 0.3.0.2
Authors@R: c(
person("Ralf", "Stubner", email = "ralf.stubner@gmail.com", role = c("aut", "cre")),
person("daqana GmbH", role = "cph"),
Expand All @@ -19,10 +19,13 @@ Description: Several fast random number generators are provided as C++
numbers according to a uniform, normal and exponential distribution
are included. The latter two use the Ziggurat algorithm originally
proposed by Marsaglia and Tsang (2000, <doi:10.18637/jss.v005.i08>).
These functions are exported to R and as a C++ interface and are
The fast sampling methods support weighted and unweighted sampling both with
and without replacement. For weighted sampling a stochastic acceptance method
suggested by Lipowski and Lipowska (2012, <doi:10.1016/j.physa.2011.12.004>)
is used. These functions are exported to R and as a C++ interface and are
enabled for use with the default 64 bit generator from the PCG family,
Xoroshiro128+ and Xoshiro256+ as well as the 64 bit version of the 20 rounds
Threefry engine (Salmon et al., 2011 <doi:10.1145/2063384.2063405>) as
Threefry engine (Salmon et al., 2011, <doi:10.1145/2063384.2063405>) as
provided by the package 'sitmo'.
License: AGPL-3 | file LICENSE
Depends: R (>= 3.1.0)
Expand Down
1 change: 1 addition & 0 deletions LICENSE
Expand Up @@ -16,4 +16,5 @@ License: AGPL-3

Files: *
Copyright: 2018-2019 Ralf Stubner (daqana GmbH)
2022-2023 Ralf Stubner
License: AGPL-3
3 changes: 3 additions & 0 deletions NEWS.md
@@ -1,6 +1,9 @@
# dgrng (unreleased)

* new method `dqrrademacher` for drawing Rademacher weights (Kyle Butts in [#50](https://github.com/daqana/dqrng/pull/50 fixing [#49](https://github.com/daqana/dqrng/pull/49))
* Implement weighted sampling with and without replacement.
Move sampling methods to separate header file, allowing for parallel usage.
([#47](https://github.com/daqana/dqrng/pull/47) fixing [#18](https://github.com/daqana/dqrng/issues/18) and [#26](https://github.com/daqana/dqrng/issues/26))

# dqrng 0.3.0

Expand Down
7 changes: 2 additions & 5 deletions R/dqsample.R
Expand Up @@ -4,7 +4,7 @@
##' @param size a non-negative integer giving the number of items to choose.
##' @param replace should sampling be with replacement?
##' @param prob a vector of probability weights for obtaining the elements of the vector being sampled.
##' @seealso \code{\link{sample}} and \code{\link{sample.int}}
##' @seealso \code{vignette("sample", package = "dqrng")}, \code{\link{sample}} and \code{\link{sample.int}}
##' @export
dqsample <- function(x, size, replace = FALSE, prob = NULL) {
if(length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 1) {
Expand All @@ -19,10 +19,7 @@ dqsample <- function(x, size, replace = FALSE, prob = NULL) {
##' @rdname dqsample
##' @export
dqsample.int <- function(n, size = n, replace = FALSE, prob = NULL) {
if (!is.null(prob)) {
warning("Using 'prob' is not supported yet. Using default 'sample.int'.")
sample.int(n, size, replace, prob)
} else if (n <= .Machine$integer.max)
if (n <= .Machine$integer.max)
dqsample_int(n, size, replace, prob, 1L)
else
dqsample_num(n, size, replace, prob, 1L)
Expand Down
32 changes: 28 additions & 4 deletions README.Rmd
Expand Up @@ -35,12 +35,13 @@ install.packages("dqrng")
```

Intermediate releases can also be obtained via
[drat](https://cran.r-project.org/package=drat):
[r-universe](https://rstub.r-universe.dev/dqrng):

```r
if (!requireNamespace("drat", quietly = TRUE)) install.packages("drat")
drat::addRepo("daqana")
install.packages("dqrng")
options(repos = c(
rstub = 'https://rstub.r-universe.dev',
CRAN = 'https://cloud.r-project.org'))
install.packages('dqrng')
```

## Example
Expand Down Expand Up @@ -89,6 +90,29 @@ bm[, 1:4]

Note that sampling from `10^10` elements triggers "long-vector support" in R.

It is also possible to use weighted sampling both with replacement:

```{r sampling3}
m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
```

And without replacement:

```{r sampling4}
bm <- bench::mark(sample.int(m, n, prob = prob),
dqsample.int(m, n, prob = prob),
check = FALSE)
bm[, 1:4]
```

Especially for weighted sampling without replacement the performance advantage compared with R's default methods is particularly large.

In addition the RNGs provide support for multiple independent streams for parallel usage:

```{r parallel}
Expand Down
74 changes: 55 additions & 19 deletions README.md
Expand Up @@ -29,12 +29,13 @@ install.packages("dqrng")
```

Intermediate releases can also be obtained via
[drat](https://cran.r-project.org/package=drat):
[r-universe](https://rstub.r-universe.dev/dqrng):

``` r
if (!requireNamespace("drat", quietly = TRUE)) install.packages("drat")
drat::addRepo("daqana")
install.packages("dqrng")
options(repos = c(
rstub = 'https://rstub.r-universe.dev',
CRAN = 'https://cloud.r-project.org'))
install.packages('dqrng')
```

## Example
Expand All @@ -57,11 +58,11 @@ They are quite a bit faster, though:
N <- 1e4
bm <- bench::mark(rnorm(N), dqrnorm(N), check = FALSE)
bm[, 1:4]
#> # A tibble: 2 x 4
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 rnorm(N) 598.9µs 670µs 1414.
#> 2 dqrnorm(N) 85.5µs 89µs 9663.
#> 1 rnorm(N) 612µs 685.2µs 1397.
#> 2 dqrnorm(N) 86µs 88.6µs 10388.
```

This is also true for the provided sampling functions with replacement:
Expand All @@ -75,13 +76,13 @@ bm <- bench::mark(sample.int(m, n, replace = TRUE),
dqsample.int(1e3*m, n, replace = TRUE),
check = FALSE)
bm[, 1:4]
#> # A tibble: 4 x 4
#> # A tibble: 4 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE) 6.94ms 7.52ms 131.
#> 2 sample.int(1000 * m, n, replace = TRUE) 8.8ms 9.64ms 101.
#> 3 dqsample.int(m, n, replace = TRUE) 304.75µs 444.96µs 2207.
#> 4 dqsample.int(1000 * m, n, replace = TRUE) 397.96µs 675.24µs 1502.
#> 1 sample.int(m, n, replace = TRUE) 6.88ms 7.63ms 114.
#> 2 sample.int(1000 * m, n, replace = TRUE) 8.72ms 9.55ms 96.1
#> 3 dqsample.int(m, n, replace = TRUE) 482.21µs 810.29µs 1254.
#> 4 dqsample.int(1000 * m, n, replace = TRUE) 492.79µs 822.86µs 1275.
```

And without replacement:
Expand All @@ -93,21 +94,56 @@ bm <- bench::mark(sample.int(m, n),
dqsample.int(m, n),
dqsample.int(1e3*m, n),
check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
bm[, 1:4]
#> # A tibble: 5 x 4
#> # A tibble: 5 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n) 38.59ms 51.73ms 19.6
#> 2 sample.int(1000 * m, n) 11.98ms 15.34ms 63.9
#> 3 sample.int(m, n, useHash = TRUE) 9.94ms 12.73ms 71.5
#> 4 dqsample.int(m, n) 942.04µs 1.05ms 755.
#> 5 dqsample.int(1000 * m, n) 1.86ms 2.44ms 315.
#> 1 sample.int(m, n) 40.1ms 42.54ms 23.5
#> 2 sample.int(1000 * m, n) 12.19ms 14.38ms 67.8
#> 3 sample.int(m, n, useHash = TRUE) 9.43ms 11.17ms 81.9
#> 4 dqsample.int(m, n) 1.22ms 1.35ms 638.
#> 5 dqsample.int(1000 * m, n) 1.98ms 2.51ms 358.
```

Note that sampling from `10^10` elements triggers “long-vector support”
in R.

It is also possible to use weighted sampling both with replacement:

``` r
m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 22.02ms 23.82ms 41.7
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 5.05ms 5.41ms 183.
```

And without replacement:

``` r
bm <- bench::mark(sample.int(m, n, prob = prob),
dqsample.int(m, n, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, prob = prob) 13.63s 13.63s 0.0734
#> 2 dqsample.int(m, n, prob = prob) 5.16ms 5.63ms 175.
```

Especially for weighted sampling without replacement the performance
advantage compared with R’s default methods is particularly large.

In addition the RNGs provide support for multiple independent streams
for parallel usage:

Expand Down
14 changes: 14 additions & 0 deletions inst/include/dqrng_generator.h
@@ -1,4 +1,5 @@
// Copyright 2018-2019 Ralf Stubner (daqana GmbH)
// Copyright 2023 Ralf Stubner
//
// This file is part of dqrng.
//
Expand Down Expand Up @@ -73,6 +74,7 @@ class random_64bit_wrapper : public random_64bit_generator {
public:
random_64bit_wrapper() : gen() {};
random_64bit_wrapper(result_type seed) : gen(seed) {};
random_64bit_wrapper(result_type seed, result_type stream) : gen(seed, stream) {};
virtual result_type operator() () {return this->bit64();}
virtual void seed(result_type seed) {cache = false; gen.seed(seed);}
virtual void seed(result_type seed, result_type stream) {throw std::runtime_error("Stream handling not supported for this RNG!");}
Expand Down Expand Up @@ -189,6 +191,18 @@ typename std::enable_if<std::is_base_of<random_64bit_generator, RNG>::value, rng
generator (uint64_t seed) {
return std::make_shared<RNG>(seed);
}

template<typename RNG = default_64bit_generator>
typename std::enable_if<!std::is_base_of<random_64bit_generator, RNG>::value, rng64_t>::type
generator (uint64_t seed, uint64_t stream) {
return std::make_shared<random_64bit_wrapper<RNG>>(seed, stream);
}

template<typename RNG = default_64bit_generator>
typename std::enable_if<std::is_base_of<random_64bit_generator, RNG>::value, rng64_t>::type
generator (uint64_t seed, uint64_t stream) {
return std::make_shared<RNG>(seed, stream);
}
} // namespace dqrng

#endif // DQRNG_GENERATOR_H
107 changes: 107 additions & 0 deletions inst/include/dqrng_sample.h
@@ -0,0 +1,107 @@
// Copyright 2018-2019 Ralf Stubner (daqana GmbH)
// Copyright 2022 Ralf Stubner
//
// This file is part of dqrng.
//
// dqrng is free software: you can redistribute it and/or modify it
// under the terms of the GNU Affero General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// dqrng is distributed in the hope that it will be useful, but
// WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with dqrng. If not, see <http://www.gnu.org/licenses/>.

#ifndef DQRNG_SAMPLE_H
#define DQRNG_SAMPLE_H 1

#include <mystdint.h>
#include <Rcpp.h>
#include <dqrng_generator.h>
#include <dqrng_distribution.h>
#include <minimal_int_set.h>

namespace dqrng {
namespace sample {
template<int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> replacement(std::function<INT(INT)> selector, INT m, INT n, int offset) {
using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
Rcpp::Vector<RTYPE> result(Rcpp::no_init(n));
std::generate(result.begin(), result.end(),
[m, offset, selector] () {return static_cast<storage_t>(offset + selector(m));});
return result;
}

template<int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> no_replacement_shuffle(std::function<INT(INT)> selector, INT m, INT n, int offset) {
using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
Rcpp::Vector<RTYPE> tmp(Rcpp::no_init(m));
std::iota(tmp.begin(), tmp.end(), static_cast<storage_t>(offset));
for (INT i = 0; i < n; ++i) {
std::swap(tmp[i], tmp[i + selector(m - i)]);
}
if (m == n)
return tmp;
else
return Rcpp::Vector<RTYPE>(tmp.begin(), tmp.begin() + n);
}

template<int RTYPE, typename INT, typename SET>
inline Rcpp::Vector<RTYPE> no_replacement_set(std::function<INT(INT)> selector, INT m, INT n, int offset) {
using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
Rcpp::Vector<RTYPE> result(Rcpp::no_init(n));
SET elems(m, n);
for (INT i = 0; i < n; ++i) {
INT v = selector(m);
while (!elems.insert(v)) {
v = selector(m);
}
result(i) = static_cast<storage_t>(offset + v);
}
return result;
}

template<typename INT>
inline INT roulette_wheel_selection(const dqrng::rng64_t &rng, INT m, Rcpp::NumericVector probs, double max_probs) {
while (true) {
INT index = (*rng)(m);
if (dqrng::uniform01((*rng)()) < probs(index) / max_probs)
return index;
}
}

template<int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> sample(const dqrng::rng64_t &rng,
INT m,
INT n,
bool replace,
Rcpp::Nullable<Rcpp::NumericVector> probs = R_NilValue,
int offset = 0) {
std::function<INT(INT)> selector = [rng] (INT m) {return (*rng)(m);};
if (probs.isNotNull()) {
Rcpp::NumericVector tmp = probs.as();
double max_probs = Rcpp::max(tmp);
selector = [rng, tmp, max_probs] (INT m) {return roulette_wheel_selection<INT>(rng, m, tmp, max_probs);};
}
if (replace || n <= 1) {
return dqrng::sample::replacement<RTYPE, INT>(selector, m, n, offset);
} else {
if (!(m >= n))
Rcpp::stop("Argument requirements not fulfilled: m >= n");
if (m < 2 * n) {
return dqrng::sample::no_replacement_shuffle<RTYPE, INT>(selector, m, n, offset);
} else if (m < 1000 * n) {
return dqrng::sample::no_replacement_set<RTYPE, INT, dqrng::minimal_bit_set>(selector, m, n, offset);
} else {
return dqrng::sample::no_replacement_set<RTYPE, INT, dqrng::minimal_hash_set<INT>>(selector, m, n, offset);
}
}
}
} // sample
} // dqrng

#endif // DQRNG_SAMPLE_H
3 changes: 2 additions & 1 deletion man/dqrng-package.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/dqsample.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.