[MRG] Use randomkit in standalone #706

Merged
merged 30 commits into master on Jun 9, 2016

Conversation

2 participants
@mstimberg
Member

mstimberg commented Jun 2, 2016

[This continues the discussion from #671]

This PR includes the use of randomkit from standalone as in the earlier PR but it does not try to use it from weave/cython. It does implement a shared buffer for random numbers across code objects, though, as discussed.

Standalone

The open issue is thread-safety when used with OpenMP. I had a look, but apart from rewriting randomkit.c completely with locks on rk_state (numpy's random functions use locking in the Cython wrapper around randomkit), I did not see an easy way of doing it. We could wrap the calls to rk_gauss and rk_double in rand and randn with a `#pragma omp critical`, but this slows down simulations dramatically and is also not 100% thread-safe (code executed in parallel could call both rand() and randn()). I think we'll therefore have to revert our decision and use one rk_state object per thread.

In a paper discussing the problems surrounding initialization [1], the authors claim that the non-linear initialization of the Mersenne Twister makes it unaffected by seeding issues: "According to experiments, this seems to solve these problems, although it is highly heuristic." In another paper, which introduces the new algorithm I mentioned earlier (the one that derives new parameters for each PRNG) [2], they claim on the other hand that using the same algorithm with different seeds is problematic (even though this seems to be what almost everyone is doing...). However, they also say: "The danger becomes non-negligible if the number of parallel streams becomes large compared to the size of the state space." -- this is certainly not the case here for OpenMP.
I'd therefore propose to seed with seed + thread_id for each stream (we'll still have to expose the seed function, though). This means that you can reproduce random numbers even with OpenMP (for the same number of threads), and that there is an easy rule for which seeds to avoid when re-seeding (e.g. if you seed with 1234 and have 8 threads, you shouldn't use any seed between 1234 and 1241 for an independent run). Incidentally, this seems to be the approach that NEST is using [3].
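The proposed scheme can be sketched with numpy's `RandomState` (which wraps the same randomkit Mersenne Twister) standing in for one `rk_state` per OpenMP thread; the function name here is illustrative, not Brian's API:

```python
import numpy as np

def make_streams(seed, n_threads):
    # one independent generator per OpenMP thread, seeded with seed + thread_id
    return [np.random.RandomState(seed + tid) for tid in range(n_threads)]

streams = make_streams(1234, 8)
# Thread 0 reproduces a single-threaded run seeded with 1234, which is why an
# independent run should avoid any seed in the range 1234..1241.
```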

weave

I did not do very systematic benchmarking, but the new approach (with the shared buffer across code objects) does not really seem to give a performance improvement -- if anything, it seems to be slightly slower... I looked at a modified COBAHH example where I added a noise term (and reduced the dt to 0.05ms so that Euler was stable) and realized that the buffer size does not really matter, because it is not the overhead of going back to Python that takes most of the time but the random number generation itself. On my machine, the state updater takes 7.63s without and 11.25s with noise (i.e. with calls to randn), a difference of 3.62s. Now, creating the necessary 80 million normally distributed random numbers in a single call to numpy.random.randn already takes 3.12s, so there is not much margin for the buffer size to matter in practice.

I'd probably prefer the new approach anyway: it makes the whole buffering mechanism more explicit and even allows a workaround for the problem of non-reproducible random numbers with weave/cython in the same thread. Actually, if we write a device.seed function for standalone anyway, we could do this for runtime as well, where it would set numpy's seed and reset the buffer -- that would solve a few issues. I did not implement the shared buffer for Cython yet, but it should be straightforward.

[1] http://dl.acm.org/citation.cfm?id=1276928
[2] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf
[3] http://www.nest-simulator.org/random_numbers/#Seeding_the_Random_Generators


Member

thesamovar commented Jun 2, 2016

I'm convinced!


Member

mstimberg commented Jun 2, 2016

I realized that it is only logical that the new approach with the shared buffer should be somewhat slower than the previous one: we are now copying the result from numpy.random.rand into the buffer, whereas previously we were only updating a pointer. Do you think it is worth the effort to make this work without a copy again?
Note to self: the current code probably leaks memory and needs a Py_DECREF for the result of rand/randn.

Member

mstimberg commented Jun 2, 2016

I implemented the seed function for runtime and standalone (no proper tests yet). Two syntax questions:

  1. Currently you'll have to call `device.seed` -- probably just `seed` is more intuitive (it would then delegate to `device.seed`)?
  2. Is `seed` a good name, or rather `random_seed` to be more explicit? Or `set_seed`, `set_random_seed`, ...?
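The delegation sketched in question 1 could look roughly like the following -- a minimal sketch with entirely hypothetical names (`Device`, `active_device`), not Brian's actual API:

```python
class Device:
    def __init__(self):
        self.rand_buffer = []
        self.seed_value = None

    def seed(self, value=None):
        # store the seed and invalidate any buffered random numbers
        self.seed_value = value
        self.rand_buffer.clear()

active_device = Device()

def seed(value=None):
    # top-level convenience function, simply delegating to the active device
    active_device.seed(value)

seed(1234)
```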

Member

thesamovar commented Jun 2, 2016

I think it's probably worth the effort to eliminate the copy if it's a noticeable bottleneck and it isn't horribly complicated to implement.

For seed syntax, I think seed or random_seed is good (not set_... and not device.). Probably seed is fine: it's pretty common usage.

Incidentally, isn't seeding from OS-level randomness better because you get more bits: the full rk_state rather than an integer that is probably quite small (less than 32 bits)?

mstimberg added some commits Jun 2, 2016

Introduce a top-level `seed` function that calls `Device.seed`
The type-check for the parameter is only done in the top-level function

Member

mstimberg commented Jun 3, 2016

> I think it's probably worth the effort to eliminate the copy if it's a noticeable bottleneck and it isn't horribly complicated to implement.

Ok, I implemented the change. It took me a while to get this right (we have to make sure that the numpy array object created by numpy.random.rand gets garbage collected but that its data buffer stays alive), but it seems to work fine now, and at least in a 10min run that did nothing but generate random numbers (see below), I did not observe a memory leak.

For the speed, I ran the following simple program which should not do much apart from the random number generation:

```python
G = NeuronGroup(100000, 'v:1')
G.run_regularly('v = randn() + rand()')
run(1*second, report='text')
```

Results are looking good, apparently there was some inefficiency in the previous Cython code:

|                  | numpy | weave | cython |
|------------------|-------|-------|--------|
| runtime (master) | 48s   | 54s   | 77s    |
| runtime (PR)     | 48s   | 54s   | 51s    |
| memory usage     | 138MB | 109MB | 107MB  |

> For seed syntax, I think seed or random_seed is good (not set_... and not device.). Probably seed is fine: it's pretty common usage.

Ok, there is now a top-level function `seed`. It shadows numpy's seed function (from the pylab import), but for runtime it will set numpy's seed anyway, so I don't think this is a problem. We should make clear in the documentation that this is not the case for standalone, i.e. if someone uses numpy's random numbers to generate values in the Python code, then using `seed` will not set the seed for those.

> Incidentally, isn't seeding from OS-level randomness better because you get more bits: the full rk_state rather than an integer that is probably quite small (less than 32 bits)?

I am not quite sure what you mean. You can either use an integer value for reproducible seeds, or call seed() (or seed(None)), which will seed from /dev/urandom (or whatever the equivalent is on Windows).

What remains to be done is to implement the multiple streams for OpenMP (and adapt the seeding), plus documentation and tests.
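The two seeding modes described above behave as follows in runtime mode, where `seed` sets numpy's seed (plain numpy shown here as a stand-in):

```python
import numpy as np

# an integer seed gives reproducible streams
np.random.seed(1234)
a = np.random.rand(3)
np.random.seed(1234)
b = np.random.rand(3)   # identical to a

# None (or no argument) re-seeds from OS entropy, i.e. non-reproducibly
np.random.seed(None)
c = np.random.rand(3)
```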


Member

mstimberg commented Jun 3, 2016

Ok, looking at the test results so far there is also still some issue surrounding the pointer type that makes it fail on OS-X and on 32bit Linux.


Member

thesamovar commented Jun 3, 2016

Looking good. Will be worth it for Cython alone by the looks of it!

Do you have any idea what is causing weave and cython to be so much slower than numpy? Is it just Python overheads or is it actually an inefficiency in the code?

Member

mstimberg commented Jun 3, 2016

> Do you have any idea what is causing weave and cython to be so much slower than numpy? Is it just Python overheads or is it actually an inefficiency in the code?

It is 12% and 6% slower, respectively, which is not too bad, I think. Since the network has 100000 neurons, the weave/cython code calls back to Python about 200 times per time step. I did not investigate further whether a change in buffer size improves things -- is it worth turning this into a preference, maybe?

With a small network (e.g. 100 neurons for 100s), the relation between numpy/weave/cython is inverted (now also about half of the time is spent in the general Python loop overhead):

|              | numpy | weave | cython |
|--------------|-------|-------|--------|
| runtime (PR) | 36s   | 25s   | 26s    |
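As a back-of-the-envelope check of the "about 200 times per time step" figure (assuming the 1024-value buffer and two streams, one for `rand` and one for `randn`):

```python
import math

n_neurons = 100000
buffer_size = 1024

# each of the two streams needs ceil(100000 / 1024) buffer refills per step,
# and every refill is one callback into Python
refills_per_step = 2 * math.ceil(n_neurons / buffer_size)
print(refills_per_step)  # 196, i.e. about 200 callbacks per time step
```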

Member

thesamovar commented Jun 3, 2016

Interesting. I still don't feel like I have a good feel for it. Maybe try pushing the buffer size up very large and see if it makes a difference?


Member

mstimberg commented Jun 3, 2016

> Interesting. I still don't feel like I have a good feel for it.

Same for me... I checked a few buffer sizes (in the 100000 neurons, 1s example), but each only with a single run, so let's not rely on these numbers too much:

| buffer size | 100 | 500 | 1024 | 5000 | 10000 | 20000 | 50000 | 100000 | 100000000 |
|-------------|-----|-----|------|------|-------|-------|-------|--------|-----------|
| runtime     | 67s | 56s | 54s  | 53s  | 52s   | 52s   | 52s   | 53s    | 53s       |

So it does not really seem to matter much, except that it should not be too small. Given that, I don't think that having a preference for the size is going to help a lot (it would also need some care to implement correctly, e.g. when the preference is changed between runs). How about setting the default size to 10000 or 20000? The numbers take <1ms to generate, so we certainly don't have to worry about slowing down simulations that don't need any (or don't need that many) random numbers.


Member

thesamovar commented Jun 3, 2016

Agreed.

Given that, I'm still mystified about the timing difference. Why would weave be faster than numpy for small N but slower for large N? Maybe it is unrelated to random number generation?


Member

mstimberg commented Jun 3, 2016

> Still mystified about the timing difference then given that. Why would weave be faster than numpy for small N but slower for large N? Maybe it is unrelated to random number generation?

I think for small N the advantage is the fact that we pre-generate the random numbers in weave/cython, no? weave/cython call np.random.randn(1024) once every ~10 time steps and then fetch 100 values per time step from the buffer, while numpy calls np.random.randn(100) every time step.

In general, weave/cython and numpy do some operations not in exactly the same order, maybe cache effects explain the difference for large N?


Member

thesamovar commented Jun 3, 2016

Good points. Can you run the same benchmark without random number generation and see what happens?

Member

thesamovar commented Jun 3, 2016

e.g. something like

```python
G = NeuronGroup(100000, 'v:1\nx:1\ny:1')
G.run_regularly('v = x+y')
run(1*second, report='text')
```

Member

mstimberg commented Jun 3, 2016

Will do so soon -- I am currently in the middle of dealing with the 32-bit issue.

Set the buffer size for random number generation to 20000
The value is rather arbitrary, see discussion in github PR #706

Member

thesamovar commented Jun 3, 2016

OK, so for these examples weave and cython are always faster than numpy, as you'd hope. That means there is something specific to the rand code that is slowing down weave and cython more than numpy. What could it be?

Maybe function call cost? For numpy you'd have one function call regardless of N, whereas for weave/cython you'd have N function calls (although they'd be C function calls, so relatively fast). Could try declaring `_rand` as inline to see if this makes a difference? No way of doing that for Cython (is there?), but for weave we'd get an idea.

We also have an additional check that numpy doesn't have (`if (buffer_index == 0)`). I'd be surprised if that added up to such a substantial time difference, though, because it'll be a very predictable pathway (the if will almost always evaluate to false and the processor should normally pick up on that).

We could potentially also replace the "if the buffer index equals the buffer size, reset it to 0" lines with a simple modulo to avoid a branch -- or, even better, combine the two lines into one: `*buffer_index = (*buffer_index + 1) % BUFFER_SIZE;`. Again, I'd be surprised if that adds up to much, though.

It could also be the case that the if statement messes up the compiler's ability to use SSE instructions. I can think of hand-crafted ways to get round this, but a general solution would be quite tricky.
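The branch-free index update suggested above can be sketched as follows (a Python stand-in for the generated C, where it would read `*buffer_index = (*buffer_index + 1) % BUFFER_SIZE;`; the helper names are illustrative):

```python
BUFFER_SIZE = 1024

def advance_branchy(idx):
    # current two-line form with a branch
    idx += 1
    if idx == BUFFER_SIZE:
        idx = 0
    return idx

def advance_branchless(idx):
    # proposed single-expression form, no branch
    return (idx + 1) % BUFFER_SIZE

# both forms agree for every valid index in the buffer
```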


Member

mstimberg commented Jun 3, 2016

I think most of the proposed changes could go one way or the other -- I won't try to optimise this any further, but please go ahead if you feel motivated :-) I don't think that using inline helps much; the compiler should either have already realized that inlining is a good idea here, or, well, maybe it isn't.

> Could also be the case that putting the if statement messes up the compiler's ability to use SSE instructions. I can think of hand-crafted ways to get round this, but a general solution would be quite tricky.

I think it may well be something around SSE and caching. From what I understand, what our numpy code is doing at the C level is basically the following (somewhat pseudocode-like):

```c
for (int i=0; i<N; i++)
    ar1[i] = rk_double(state);
for (int i=0; i<N; i++)
    ar2[i] = rk_gauss(state);
for (int i=0; i<N; i++)
    v[i] = ar1[i] + ar2[i];
```

The addition in the end should be very easy to vectorize/cache-optimize for the compiler.

What we are doing in our C code is more like (idealized with buffer size == N):

```c
for (int i=0; i<N; i++)
    ar1[i] = rk_double(state);
for (int i=0; i<N; i++)
    ar2[i] = rk_gauss(state);

inline double rand() {
    static int counter = 0;
    return ar1[counter++];
}
inline double randn() {
    static int counter = 0;
    return ar2[counter++];
}

for (int i=0; i<N; i++)
    v[i] = rand() + randn();
```

The last loop cannot be easily vectorized by the compiler, I guess.
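As a quick sanity check of the equivalence of the two code paths above, a numpy stand-in shows that the buffered scalar fetches produce exactly the same values as the vectorized sum -- only the access pattern (and hence vectorizability) differs:

```python
import numpy as np

N = 100
rs = np.random.RandomState(42)
ar1 = rs.rand(N)    # pre-filled uniform buffer
ar2 = rs.randn(N)   # pre-filled normal buffer

# the numpy path: one vectorized addition over both buffers
vectorized = ar1 + ar2

# the weave/cython path: one scalar fetch per stream, per element
buffered = np.empty(N)
for i in range(N):
    buffered[i] = ar1[i] + ar2[i]

# identical results, bit for bit
assert np.array_equal(vectorized, buffered)
```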


Member

mstimberg commented Jun 3, 2016

Performance issues aside, and assuming that what I did to fix 32-bit Linux and OS X actually works, this should cover everything now. OpenMP seems to work as well, but everything is still lacking tests and documentation.


Member

thesamovar commented Jun 4, 2016

I will have a little play with it to see if I can improve performance, but I don't think we should let that hold us back from going ahead with this merge when it's ready.


Member

mstimberg commented Jun 7, 2016

It took a bit of fiddling to get things working correctly on 32-bit Linux and OS X, but it now seems to work and all tests pass on my machine. Let's wait and see what travis/appveyor say.

I also added tests for the seed mechanism and some documentation -- from my side this is now ready (assuming that the test suites pass of course).

@mstimberg mstimberg changed the title from [WIP] Use randomkit in standalone to [MRG] Use randomkit in standalone Jun 7, 2016

mstimberg added some commits Jun 8, 2016

Try replacing `int64` by `int32` for the buffer indices again -- the workaround is hopefully no longer necessary

Member

thesamovar commented Jun 9, 2016

Good to merge I think!


Member

mstimberg commented Jun 9, 2016

> Good to merge I think!

Ok, great, going ahead with the merge then.

@mstimberg mstimberg merged commit 02ce467 into master Jun 9, 2016

@mstimberg mstimberg removed the in progress label Jun 9, 2016

@mstimberg mstimberg deleted the randomkit_standalone branch Jun 9, 2016
