New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of RandomSample to use Jeffrey Scott Vitter's Algorithm D. #553

Merged
merged 8 commits into from Jul 2, 2012

Conversation

Projects
None yet
4 participants
@WebDrake
Contributor

WebDrake commented Apr 24, 2012

This algorithm calculates a random sample of size n in O(n) time,
generating O(n) random variates, as opposed to the previously
implemented Algorithm S where both scale with the size of the
data being sampled.

Variable names in the implementation follow those in Vitter's
papers introducing the algorithm:

  • Vitter, J. S. (1984) Faster methods for random sampling.
    Commun. ACM 27(7): 703--718.
  • Vitter, J. S. (1987) An efficient algorithm for sequential
    random sampling. ACM Trans. Math. Softw. 13(1): 58--67.
Rewrite of RandomSample to use Jeffrey Scott Vitter's Algorithm D.
This algorithm calculates a random sample of size n in O(n) time,
generating O(n) random variates, as opposed to the previously
implemented Algorithm S where both scale with the size of the
data being sampled.

Variable names in the implementation follow those in Vitter's
papers introducing the algorithm:

  * Vitter, J. S. (1984) Faster methods for random sampling.
    Commun. ACM 27(7): 703--718.

  * Vitter, J. S. (1987) An efficient algorithm for sequential
    random sampling.  ACM Trans. Math. Softw. 13(1): 58--67.
@donc

This comment has been minimized.

Show comment
Hide comment
@donc

donc Apr 26, 2012

Collaborator

Looks great!
But despite what someone on these newsgroup said, I've never contributed to that module, and have never even used it..
So I will leave it to Andrei to merge in.

Collaborator

donc commented Apr 26, 2012

Looks great!
But despite what someone on these newsgroup said, I've never contributed to that module, and have never even used it..
So I will leave it to Andrei to merge in.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Apr 26, 2012

Contributor

Great! I will look forward to his feedback. :-)

There are some benchmarks for the code at https://github.com/WebDrake/RandomSample which should serve to demonstrate that the code put in place here gives a big speedup. I'd be happy to adapt these as inbuilt benchmarks in line with Andrei's proposed std.benchmark framework.

I can also send copies of the original research papers to anyone who wants or needs them to review the implementation of the algorithm.

Contributor

WebDrake commented Apr 26, 2012

Great! I will look forward to his feedback. :-)

There are some benchmarks for the code at https://github.com/WebDrake/RandomSample which should serve to demonstrate that the code put in place here gives a big speedup. I'd be happy to adapt these as inbuilt benchmarks in line with Andrei's proposed std.benchmark framework.

I can also send copies of the original research papers to anyone who wants or needs them to review the implementation of the algorithm.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Apr 26, 2012

Contributor

Couple of small queries:

(i) I've included comments on some of the private internal functions of the algorithm. Should those be Ddoc comments starting /** or just regular comments?

(ii) What (if anything) should I add to author, copyright, credits lines?

Contributor

WebDrake commented Apr 26, 2012

Couple of small queries:

(i) I've included comments on some of the private internal functions of the algorithm. Should those be Ddoc comments starting /** or just regular comments?

(ii) What (if anything) should I add to author, copyright, credits lines?

@donc

This comment has been minimized.

Show comment
Hide comment
@donc

donc Apr 26, 2012

Collaborator

(i) Just use regular comments on the internal functions.
(ii) If you think you've made a major contribution to the module (which kind of implies some responsibility for it) then you can add your name if you like. As far as attribution goes, it doesn't really matter much, since it's in the revision history anyway. There's no problem with adding it later on, too.

Collaborator

donc commented Apr 26, 2012

(i) Just use regular comments on the internal functions.
(ii) If you think you've made a major contribution to the module (which kind of implies some responsibility for it) then you can add your name if you like. As far as attribution goes, it doesn't really matter much, since it's in the revision history anyway. There's no problem with adding it later on, too.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Apr 26, 2012

Contributor

OK, I've corrected and improved the comments and added my name. I do intend to take responsibility for this code, and besides, I'll be contributing more in other areas (e.g. random distributions).

Contributor

WebDrake commented Apr 26, 2012

OK, I've corrected and improved the comments and added my name. I do intend to take responsibility for this code, and besides, I'll be contributing more in other areas (e.g. random distributions).

Show outdated Hide outdated std/random.d Outdated
Show outdated Hide outdated std/random.d Outdated
Show outdated Hide outdated std/random.d Outdated
Show outdated Hide outdated std/random.d Outdated
@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 4, 2012

Member

As it stands, this code does not compile. Please build and run the Phobos unit tests (e.g. make -f posix.mak unittest). At minimum, you are using skipA and skip as if they were properties, and they aren't. They must be called with parens. I assume that you wrote and tested your version of RandomSample separately before putting it into Phobos and did not compile with -property. Phobos is compiled with -property, and -property will eventually be the normal behavior.

Member

jmdavis commented May 4, 2012

As it stands, this code does not compile. Please build and run the Phobos unit tests (e.g. make -f posix.mak unittest). At minimum, you are using skipA and skip as if they were properties, and they aren't. They must be called with parens. I assume that you wrote and tested your version of RandomSample separately before putting it into Phobos and did not compile with -property. Phobos is compiled with -property, and -property will eventually be the normal behavior.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

Thanks for the detailed feedback. I did compile the version uploaded, I think including the unittests -- it's actually the Phobos currently running on my system -- but had not used the -property flag. I'll fix that and upload further patches.

Contributor

WebDrake commented May 4, 2012

Thanks for the detailed feedback. I did compile the version uploaded, I think including the unittests -- it's actually the Phobos currently running on my system -- but had not used the -property flag. I'll fix that and upload further patches.

skip() and skipA() should not be called as properties.
Ensures code builds with -property and -unittest flags
both enabled.
@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

The just-uploaded commit compiles with the -property and -unittest flags. I'll follow up with further patches for your other feedback.

Contributor

WebDrake commented May 4, 2012

The just-uploaded commit compiles with the -property and -unittest flags. I'll follow up with further patches for your other feedback.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

This second commit tweaks variable names, bracing and the one-argument-to-a-template style.

I've left _Vprime as-is for reasons described above, as only a private member variable can begin with _ so the capital V is not ambiguous. However, I'll tweak it if requested.

Contributor

WebDrake commented May 4, 2012

This second commit tweaks variable names, bracing and the one-argument-to-a-template style.

I've left _Vprime as-is for reasons described above, as only a private member variable can begin with _ so the capital V is not ambiguous. However, I'll tweak it if requested.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

@jmdavis: OK, I think with this last commit all your comments have been addressed.

Re your comments on bracing: there are a bunch of places elsewhere in random.d where the same opening-brace-on-same-line issue occurs. If you like I'll fix these too, but I'll do it in a separate pull request.

Contributor

WebDrake commented May 4, 2012

@jmdavis: OK, I think with this last commit all your comments have been addressed.

Re your comments on bracing: there are a bunch of places elsewhere in random.d where the same opening-brace-on-same-line issue occurs. If you like I'll fix these too, but I'll do it in a separate pull request.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 4, 2012

Member

I've actually fixed the braces in another pull request to fix a bug. The pull request which gets merged in second will likely have some minor merging to do (though not much).

Member

jmdavis commented May 4, 2012

I've actually fixed the braces in another pull request to fix a bug. The pull request which gets merged in second will likely have some minor merging to do (though not much).

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 4, 2012

Contributor

I'll happily take on that merge responsibility if it makes things easier for you or others -- just let me know.

Contributor

WebDrake commented May 4, 2012

I'll happily take on that merge responsibility if it makes things easier for you or others -- just let me know.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 4, 2012

Member

It's pretty much purely a question of what gets merged in first. The merge shouldn't be a big deal in this case. I was just warning you that it might be necessary.

Member

jmdavis commented May 4, 2012

It's pretty much purely a question of what gets merged in first. The merge shouldn't be a big deal in this case. I was just warning you that it might be necessary.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 9, 2012

Contributor

I've included an extra update that deals with Bug 7396. This should handle the case better than Jerro's fix as it covers not only the case of differently-seeded RNGs but also the case where no RNG is passed to randomSample.

To see the problem, try running the following code with and without the latest patch applied:

import std.random, std.range, std.stdio;

void main()
{
    auto sample = randomSample(iota(0, 100), 5);
    foreach(i; 0..5)
    {
        foreach(s; sample)
            writeln(s);
        writeln();
    }
}

You will note that without the latest patch, the first 1 or 2 sample elements are always identical.

Contributor

WebDrake commented May 9, 2012

I've included an extra update that deals with Bug 7396. This should handle the case better than Jerro's fix as it covers not only the case of differently-seeded RNGs but also the case where no RNG is passed to randomSample.

To see the problem, try running the following code with and without the latest patch applied:

import std.random, std.range, std.stdio;

void main()
{
    auto sample = randomSample(iota(0, 100), 5);
    foreach(i; 0..5)
    {
        foreach(s; sample)
            writeln(s);
        writeln();
    }
}

You will note that without the latest patch, the first 1 or 2 sample elements are always identical.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 30, 2012

Contributor

Any update/feedback here? Just checking, as it's now been 3 weeks without further comment.

Contributor

WebDrake commented May 30, 2012

Any update/feedback here? Just checking, as it's now been 3 weeks without further comment.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 30, 2012

Member

Andrei's been too busy to comment on much of anything of late, and I've been wanting to look at the papers that you took this from before merging it, since I really know nothing about the algorithm in question, and I haven't had the time yet.

Member

jmdavis commented May 30, 2012

Andrei's been too busy to comment on much of anything of late, and I've been wanting to look at the papers that you took this from before merging it, since I really know nothing about the algorithm in question, and I haven't had the time yet.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 30, 2012

Contributor

No worries, I guessed it was something like that. Do you already have copies of the papers? If not I can get them to you.

Contributor

WebDrake commented May 30, 2012

No worries, I guessed it was something like that. Do you already have copies of the papers? If not I can get them to you.

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis May 30, 2012

Member

I don't. I was going to grab them via the ACM, but I had just renewed my membership when I tried, and I couldn't get my login to work. I haven't had the time since. If you want to e-mail me the papers, that would be fine. I'll try to make time to look them over.

Member

jmdavis commented May 30, 2012

I don't. I was going to grab them via the ACM, but I had just renewed my membership when I tried, and I couldn't get my login to work. I haven't had the time since. If you want to e-mail me the papers, that would be fine. I'll try to make time to look them over.

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake May 30, 2012

Contributor

Sent. :-)

Contributor

WebDrake commented May 30, 2012

Sent. :-)

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Jun 30, 2012

Contributor

Just checking in to ask if there are any further updates, in particular in light of the discussion earlier in June:
http://forum.dlang.org/thread/4FD735EB.70404@webdrake.net

Contributor

WebDrake commented Jun 30, 2012

Just checking in to ask if there are any further updates, in particular in light of the discussion earlier in June:
http://forum.dlang.org/thread/4FD735EB.70404@webdrake.net

@jmdavis

This comment has been minimized.

Show comment
Hide comment
@jmdavis

jmdavis Jun 30, 2012

Member

I started reading the relevant paper but haven't gotten around to finishing it, since I've been fairly busy and that's something that takes a fair bit of brain power for, and I need to be well rested when I read it. I'll try to get around to it this weekend, since it really has been too long.

Member

jmdavis commented Jun 30, 2012

I started reading the relevant paper but haven't gotten around to finishing it, since I've been fairly busy and that's something that takes a fair bit of brain power for, and I need to be well rested when I read it. I'll try to get around to it this weekend, since it really has been too long.

andralex added a commit that referenced this pull request Jul 2, 2012

Merge pull request #553 from WebDrake/master
Rewrite of RandomSample to use Jeffrey Scott Vitter's Algorithm D.

@andralex andralex merged commit ae15e0e into dlang:master Jul 2, 2012

@MartinNowak MartinNowak referenced this pull request Jul 2, 2012

Merged

fix issue 8314 #654

@WebDrake

This comment has been minimized.

Show comment
Hide comment
@WebDrake

WebDrake Jul 4, 2012

Contributor

Many thanks. Very happy to have been able to make this contribution. :-)

Contributor

WebDrake commented Jul 4, 2012

Many thanks. Very happy to have been able to make this contribution. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment