New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of RandomSample to use Jeffrey Scott Vitter's Algorithm D. #553

Merged
merged 8 commits into from Jul 2, 2012

Conversation

Projects
None yet
4 participants
@WebDrake
Contributor

WebDrake commented Apr 24, 2012

This algorithm calculates a random sample of size n in O(n) time,
generating O(n) random variates, as opposed to the previously
implemented Algorithm S where both scale with the size of the
data being sampled.

Variable names in the implementation follow those in Vitter's
papers introducing the algorithm:

  • Vitter, J. S. (1984) Faster methods for random sampling.
    Commun. ACM 27(7): 703--718.
  • Vitter, J. S. (1987) An efficient algorithm for sequential
    random sampling. ACM Trans. Math. Softw. 13(1): 58--67.
Rewrite of RandomSample to use Jeffrey Scott Vitter's Algorithm D.
This algorithm calculates a random sample of size n in O(n) time,
generating O(n) random variates, as opposed to the previously
implemented Algorithm S where both scale with the size of the
data being sampled.

Variable names in the implementation follow those in Vitter's
papers introducing the algorithm:

  * Vitter, J. S. (1984) Faster methods for random sampling.
    Commun. ACM 27(7): 703--718.

  * Vitter, J. S. (1987) An efficient algorithm for sequential
    random sampling.  ACM Trans. Math. Softw. 13(1): 58--67.
@donc

This comment has been minimized.

Show comment
Hide comment
@donc

donc Apr 26, 2012

Collaborator

Looks great!
But despite what someone on these newsgroup said, I've never contributed to that module, and have never even used it..
So I will leave it to Andrei to merge in.

Collaborator

donc commented Apr 26, 2012

Looks great!
But despite what someone on these newsgroup said, I've never contributed to that module, and have never even used it..
So I will leave it to Andrei to merge in.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Apr 26, 2012

Contributor

Great! I will look forward to his feedback. :-)

There are some benchmarks for the code at https://github.com/WebDrake/RandomSample which should serve to demonstrate that the code put in place here gives a big speedup. I'd be happy to adapt these as inbuilt benchmarks in line with Andrei's proposed std.benchmark framework.

I can also send copies of the original research papers to anyone who wants or needs them to review the implementation of the algorithm.

Contributor

WebDrake commented Apr 26, 2012

Great! I will look forward to his feedback. :-)

There are some benchmarks for the code at https://github.com/WebDrake/RandomSample which should serve to demonstrate that the code put in place here gives a big speedup. I'd be happy to adapt these as inbuilt benchmarks in line with Andrei's proposed std.benchmark framework.

I can also send copies of the original research papers to anyone who wants or needs them to review the implementation of the algorithm.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Apr 26, 2012

Contributor

Couple of small queries:

(i) I've included comments on some of the private internal functions of the algorithm. Should those be Ddoc comments starting /** or just regular comments?

(ii) What (if anything) should I add to author, copyright, credits lines?

Contributor

WebDrake commented Apr 26, 2012

Couple of small queries:

(i) I've included comments on some of the private internal functions of the algorithm. Should those be Ddoc comments starting /** or just regular comments?

(ii) What (if anything) should I add to author, copyright, credits lines?

@donc

This comment has been minimized.

Show comment
Hide comment
@donc

donc Apr 26, 2012

Collaborator

(i) Just use regular comments on the internal functions.
(ii) If you think you've made a major contribution to the module (which kind of implies some responsibility for it) then you can add your name if you like. As far as attribution goes, it doesn't really matter much, since it's in the revision history anyway. There's no problem with adding it later on, too.

Collaborator

donc commented Apr 26, 2012

(i) Just use regular comments on the internal functions.
(ii) If you think you've made a major contribution to the module (which kind of implies some responsibility for it) then you can add your name if you like. As far as attribution goes, it doesn't really matter much, since it's in the revision history anyway. There's no problem with adding it later on, too.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Apr 26, 2012

Contributor

OK, I've corrected and improved the comments and added my name. I do intend to take responsibility for this code, and besides, I'll be contributing more in other areas (e.g. random distributions).

Contributor

WebDrake commented Apr 26, 2012

OK, I've corrected and improved the comments and added my name. I do intend to take responsibility for this code, and besides, I'll be contributing more in other areas (e.g. random distributions).

Show outdated Hide outdated std/random.d
$(WEB http://dx.doi.org/10.1145/23002.23003, 1987) which selects
a sample of size $(D n) in O(n) steps and requiring O(n) random
variates, regardless of the size of the data being sampled.
*/
struct RandomSample(R, Random = void)
if(isUniformRNG!Random || is(Random == void))

This comment has been minimized.

@jmdavis

jmdavis May 4, 2012

Member

This should be if(isInputRange!R && (isUniformRNG!Random || is(Random == void)). Right now, it's failing to test the first template parameter.

@jmdavis

jmdavis May 4, 2012

Member

This should be if(isInputRange!R && (isUniformRNG!Random || is(Random == void)). Right now, it's failing to test the first template parameter.

This comment has been minimized.

@WebDrake

WebDrake May 4, 2012

Contributor

That "if" is not my code, but preserved from the old RandomSample. :-) I'm presuming the assumption is that users will use randomSample to return the sample, rather than RandomSample directly; randomSample does perform these checks.

I'll happily make your suggested tweak, but can we wait for Andrei's input before doing so, as this was his code?

@WebDrake

WebDrake May 4, 2012

Contributor

That "if" is not my code, but preserved from the old RandomSample. :-) I'm presuming the assumption is that users will use randomSample to return the sample, rather than RandomSample directly; randomSample does perform these checks.

I'll happily make your suggested tweak, but can we wait for Andrei's input before doing so, as this was his code?

This comment has been minimized.

@jmdavis

jmdavis May 4, 2012

Member

I'm well aware that the if is like that in the current code, but it's a mistake. In almost all cases, every template parameter should be checked in a template constraint. Yes, randomSample should catch it, but that doesn't excuse RandomSample from checking its parameters - especially when it's public and could be constructed separately, even if that wasn't the intent (if randomSample were written now, RandomSample would almost certainly be internal to it, but that could break code, so we've left it outside).

@jmdavis

jmdavis May 4, 2012

Member

I'm well aware that the if is like that in the current code, but it's a mistake. In almost all cases, every template parameter should be checked in a template constraint. Yes, randomSample should catch it, but that doesn't excuse RandomSample from checking its parameters - especially when it's public and could be constructed separately, even if that wasn't the intent (if randomSample were written now, RandomSample would almost certainly be internal to it, but that could break code, so we've left it outside).

This comment has been minimized.

@WebDrake

WebDrake May 4, 2012

Contributor

OK, I'll fix. Sorry if it sounded like I was questioning your judgement -- I'm just leery as a new contributor of rewriting anyone else's code except where strictly necessary for the new functionality.

@WebDrake

WebDrake May 4, 2012

Contributor

OK, I'll fix. Sorry if it sounded like I was questioning your judgement -- I'm just leery as a new contributor of rewriting anyone else's code except where strictly necessary for the new functionality.

Show outdated Hide outdated std/random.d
if (empty) return;
assert(_available && _available >= _toSelect);
for (;;)
size_t S;

This comment has been minimized.

@jmdavis

jmdavis May 4, 2012

Member

Please don't use PascalCase for variables. Types should be PascalCased, and variables should be camelcased.

@jmdavis

jmdavis May 4, 2012

Member

Please don't use PascalCase for variables. Types should be PascalCased, and variables should be camelcased.

This comment has been minimized.

@WebDrake

WebDrake May 4, 2012

Contributor

OK. I made a handful of exceptions here so that variable names matched precisely those of the research paper where the algorithm is described; I'll correct them.

@WebDrake

WebDrake May 4, 2012

Contributor

OK. I made a handful of exceptions here so that variable names matched precisely those of the research paper where the algorithm is described; I'll correct them.

This comment has been minimized.

@WebDrake

WebDrake May 4, 2012

Contributor

Query: should the variables be likethis or likeThis? Just for clarity :-)

@WebDrake

WebDrake May 4, 2012

Contributor

Query: should the variables be likethis or likeThis? Just for clarity :-)

This comment has been minimized.

@WebDrake

WebDrake May 4, 2012

Contributor

Second query: is _Vprime acceptable? I ask because it's more convenient to my eye to have _Vprime (variable) and newVprime (function to set its value) rather than _vPrime and newVprime.

@WebDrake

WebDrake May 4, 2012

Contributor

Second query: is _Vprime acceptable? I ask because it's more convenient to my eye to have _Vprime (variable) and newVprime (function to set its value) rather than _vPrime and newVprime.

This comment has been minimized.

@jmdavis

jmdavis May 4, 2012

Member

Variable names should be likeThis. The first letter is lowercase, the first letter of any other words are uppercase, and all the other letters are lowercase. I really think that it's better to do _vPrime and newVPrime. The capitalization of _Vprime is backwards. If they were public variables, then they'd definitely have to be changed. As they're not, it's probably getting too picky to absolutely insist that they be _vPrime and newVPrime. Having S and V though definitely can't stay, because they're captilized like type names, which makes reading the code much harder.

@jmdavis

jmdavis May 4, 2012

Member

Variable names should be likeThis. The first letter is lowercase, the first letter of any other words are uppercase, and all the other letters are lowercase. I really think that it's better to do _vPrime and newVPrime. The capitalization of _Vprime is backwards. If they were public variables, then they'd definitely have to be changed. As they're not, it's probably getting too picky to absolutely insist that they be _vPrime and newVPrime. Having S and V though definitely can't stay, because they're captilized like type names, which makes reading the code much harder.

This comment has been minimized.

@WebDrake

WebDrake May 4, 2012

Contributor

OK, great. S, V, X and U have all been lowercased; the Vprimes have been left alone for reasons described.

@WebDrake

WebDrake May 4, 2012

Contributor

OK, great. S, V, X and U have all been lowercased; the Vprimes have been left alone for reasons described.

Show outdated Hide outdated std/random.d
V = uniform!("()")(0.0, 1.0, gen);
}
while (quot > V) {

This comment has been minimized.

@jmdavis

jmdavis May 4, 2012

Member

The style of bracing that we use in Phobos is to have braces on their own line (any place in Phobos which does otherwise is generally because it slipped past in the review phase or because it's older code).

@jmdavis

jmdavis May 4, 2012

Member

The style of bracing that we use in Phobos is to have braces on their own line (any place in Phobos which does otherwise is generally because it slipped past in the review phase or because it's older code).

This comment has been minimized.

@WebDrake

WebDrake May 4, 2012

Contributor

Yea, that's an error, I thought I'd tweaked them all ... :-(

@WebDrake

WebDrake May 4, 2012

Contributor

Yea, that's an error, I thought I'd tweaked them all ... :-(

Show outdated Hide outdated std/random.d
{
// chosen!
return;
V = uniform!("()")(0.0, 1.0);

This comment has been minimized.

@jmdavis

jmdavis May 4, 2012

Member

It's not a big deal, but in case you didn't know, if a template only has a single argument, you can forgoe the parens, which tends to make the code more legible. In this case, that would mean

v = uniform!"()"(0.0, 1.0);
@jmdavis

jmdavis May 4, 2012

Member

It's not a big deal, but in case you didn't know, if a template only has a single argument, you can forgoe the parens, which tends to make the code more legible. In this case, that would mean

v = uniform!"()"(0.0, 1.0);
@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 4, 2012

Member

As it stands, this code does not compile. Please build and run the Phobos unit tests (e.g. make -f posix.mak unittest). At minimum, you are using skipA and skip as if they were properties, and they aren't. They must be called with parens. I assume that you wrote and tested your version of RandomSample separately before putting it into Phobos and did not compile with -property. Phobos is compiled with -property, and -property will eventually be the normal behavior.

Member

jmdavis commented May 4, 2012

As it stands, this code does not compile. Please build and run the Phobos unit tests (e.g. make -f posix.mak unittest). At minimum, you are using skipA and skip as if they were properties, and they aren't. They must be called with parens. I assume that you wrote and tested your version of RandomSample separately before putting it into Phobos and did not compile with -property. Phobos is compiled with -property, and -property will eventually be the normal behavior.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

Thanks for the detailed feedback. I did compile the version uploaded, I think including the unittests -- it's actually the Phobos currently running on my system -- but had not used the -property flag. I'll fix that and upload further patches.

Contributor

WebDrake commented May 4, 2012

Thanks for the detailed feedback. I did compile the version uploaded, I think including the unittests -- it's actually the Phobos currently running on my system -- but had not used the -property flag. I'll fix that and upload further patches.

skip() and skipA() should not be called as properties.
Ensures code builds with -property and -unittest flags
both enabled.
@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

The just-uploaded commit compiles with the -property and -unittest flags. I'll follow up with further patches for your other feedback.

Contributor

WebDrake commented May 4, 2012

The just-uploaded commit compiles with the -property and -unittest flags. I'll follow up with further patches for your other feedback.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

This second commit tweaks variable names, bracing and the one-argument-to-a-template style.

I've left _Vprime as-is for reasons described above, as only a private member variable can begin with _ so the capital V is not ambiguous. However, I'll tweak it if requested.

Contributor

WebDrake commented May 4, 2012

This second commit tweaks variable names, bracing and the one-argument-to-a-template style.

I've left _Vprime as-is for reasons described above, as only a private member variable can begin with _ so the capital V is not ambiguous. However, I'll tweak it if requested.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

@jmdavis: OK, I think with this last commit all your comments have been addressed.

Re your comments on bracing: there are a bunch of places elsewhere in random.d where the same opening-brace-on-same-line issue occurs. If you like I'll fix these too, but I'll do it in a separate pull request.

Contributor

WebDrake commented May 4, 2012

@jmdavis: OK, I think with this last commit all your comments have been addressed.

Re your comments on bracing: there are a bunch of places elsewhere in random.d where the same opening-brace-on-same-line issue occurs. If you like I'll fix these too, but I'll do it in a separate pull request.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 4, 2012

Member

I've actually fixed the braces in another pull request to fix a bug. The pull request which gets merged in second will likely have some minor merging to do (though not much).

Member

jmdavis commented May 4, 2012

I've actually fixed the braces in another pull request to fix a bug. The pull request which gets merged in second will likely have some minor merging to do (though not much).

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

I'll happily take on that merge responsibility if it makes things easier for you or others -- just let me know.

Contributor

WebDrake commented May 4, 2012

I'll happily take on that merge responsibility if it makes things easier for you or others -- just let me know.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 4, 2012

Member

It's pretty much purely a question of what gets merged in first. The merge shouldn't be a big deal in this case. I was just warning you that it might be necessary.

Member

jmdavis commented May 4, 2012

It's pretty much purely a question of what gets merged in first. The merge shouldn't be a big deal in this case. I was just warning you that it might be necessary.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 9, 2012

Contributor

I've included an extra update that deals with Bug 7396. This should handle the case better than Jerro's fix as it covers not only the case of differently-seeded RNGs but also the case where no RNG is passed to randomSample.

To see the problem, try running the following code with and without the latest patch applied:

import std.random, std.range, std.stdio;

void main()
{
    auto sample = randomSample(iota(0, 100), 5);
    foreach(i; 0..5)
    {
        foreach(s; sample)
            writeln(s);
        writeln();
    }
}

You will note that without the latest patch, the first 1 or 2 sample elements are always identical.

Contributor

WebDrake commented May 9, 2012

I've included an extra update that deals with Bug 7396. This should handle the case better than Jerro's fix as it covers not only the case of differently-seeded RNGs but also the case where no RNG is passed to randomSample.

To see the problem, try running the following code with and without the latest patch applied:

import std.random, std.range, std.stdio;

void main()
{
    auto sample = randomSample(iota(0, 100), 5);
    foreach(i; 0..5)
    {
        foreach(s; sample)
            writeln(s);
        writeln();
    }
}

You will note that without the latest patch, the first 1 or 2 sample elements are always identical.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 30, 2012

Contributor

Any update/feedback here? Just checking, as it's now been 3 weeks without further comment.

Contributor

WebDrake commented May 30, 2012

Any update/feedback here? Just checking, as it's now been 3 weeks without further comment.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 30, 2012

Member

Andrei's been too busy to comment on much of anything of late, and I've been wanting to look at the papers that you took this from before merging it, since I really know nothing about the algorithm in question, and I haven't had the time yet.

Member

jmdavis commented May 30, 2012

Andrei's been too busy to comment on much of anything of late, and I've been wanting to look at the papers that you took this from before merging it, since I really know nothing about the algorithm in question, and I haven't had the time yet.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 30, 2012

Contributor

No worries, I guessed it was something like that. Do you already have copies of the papers? If not I can get them to you.

Contributor

WebDrake commented May 30, 2012

No worries, I guessed it was something like that. Do you already have copies of the papers? If not I can get them to you.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 30, 2012

Member

I don't. I was going to grab them via the ACM, but I had just renewed my membership when I tried, and I couldn't get my login to work. I haven't had the time since. If you want to e-mail me the papers, that would be fine. I'll try to make time to look them over.

Member

jmdavis commented May 30, 2012

I don't. I was going to grab them via the ACM, but I had just renewed my membership when I tried, and I couldn't get my login to work. I haven't had the time since. If you want to e-mail me the papers, that would be fine. I'll try to make time to look them over.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 30, 2012

Contributor

Sent. :-)

Contributor

WebDrake commented May 30, 2012

Sent. :-)

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Jun 30, 2012

Contributor

Just checking in to ask if there are any further updates, in particular in light of the discussion earlier in June:
http://forum.dlang.org/thread/4FD735EB.70404@webdrake.net

Contributor

WebDrake commented Jun 30, 2012

Just checking in to ask if there are any further updates, in particular in light of the discussion earlier in June:
http://forum.dlang.org/thread/4FD735EB.70404@webdrake.net

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis Jun 30, 2012

Member

I started reading the relevant paper but haven't gotten around to finishing it, since I've been fairly busy and that's something that takes a fair bit of brain power for, and I need to be well rested when I read it. I'll try to get around to it this weekend, since it really has been too long.

Member

jmdavis commented Jun 30, 2012

I started reading the relevant paper but haven't gotten around to finishing it, since I've been fairly busy and that's something that takes a fair bit of brain power for, and I need to be well rested when I read it. I'll try to get around to it this weekend, since it really has been too long.

andralex added a commit that referenced this pull request Jul 2, 2012

Merge pull request #553 from WebDrake/master
Rewrite of RandomSample to use Jeffrey Scott Vitter's Algorithm D.

@andralex andralex merged commit ae15e0e into dlang:master Jul 2, 2012

@MartinNowak MartinNowak referenced this pull request Jul 2, 2012

Merged

fix issue 8314 #654

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Jul 4, 2012

Contributor

Many thanks. Very happy to have been able to make this contribution. :-)

Contributor

WebDrake commented Jul 4, 2012

Many thanks. Very happy to have been able to make this contribution. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment