Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ISAAC with HC-128 #53

Closed
pitdicker opened this issue Nov 16, 2017 · 17 comments
Closed

Replace ISAAC with HC-128 #53

pitdicker opened this issue Nov 16, 2017 · 17 comments

Comments

@pitdicker
Copy link

Okay, this is somewhat funny. After putting quite some effort into optimising our implementation of ISAAC, I now propose we replace it with HC-128.

Just like ISAAC and RC-4 before it, HC-128 is an array based stream cipher that we use as an RNG. Comparison:

Security / predictability

ISAAC is designed to be usable for applications that need a cryptographically secure PRNG. But it does not really meet the current standards for one. This makes ISAAC overkill for applications that just need a statistically good PRNG, and not good enough for anything that needs guarantees for security.

HC-128 is designed by Hongjun Wu, one of the experts in the field. It is well received, and selected as one of the "stream ciphers suitable for widespread adoption" by eSTREAM. A very comprehensive analysis of the current state of attacks / known weaknesses of HC-128 is given in "Some Results On Analysis And Implementation Of HC-128 Stream Cipher" by Shashwat Raizada.

HC-128 has no known weaknesses that are easier to exploit than doing a brute-force search of 2^128. Is makes clear security promises, and also has a clear story around initialization.

Performance

HC-128 is an 32-bit algorithm. It performs usually 30% faster than the 32-bit ISAAC, and mostly the same as the 64-bit ISAAC-64. Only for fill_bytes ISAAC-64 is about 45% faster. Note that this makes it still as fast as our implementation until a week ago.

The results here may improve a little bit, because I have not been able yet to have to compiler optimise out all bounds checks. It now does one per u32.

x86 (all numbers in Mb/s):

PRNG u32 u64 byte stream
ISAAC (rand 0.3.16) 743 764 432
ISAAC (current) 916 1043 1219
ISAAC-64 (rand 0.3.16) 668 1307 743
ISAAC-64 (current) 937 1410 1451
HC-128 1231 1417 1231

x86_64 (all numbers in Mb/s):

PRNG u32 u64 byte stream
ISAAC (rand 0.3.16) 826 831 488
ISAAC (current) 980 1118 1434
ISAAC-64 (rand 0.3.16) 937 1874 620
ISAAC-64 (current) 1287 1930 2640
HC-128 1300 1810 1833

For the time it takes to initialise I don't yet have good numbers. With the unoptimised routine from hc128 it takes about as much time as ISAAC-64. It can probably be improved by ~30%.

Memory

ISAAC uses two arrays: one to hold 256 state words, and one to hold 256 results. HC-128 needs two arrays of 512 words to hold its state. As an optimisation we include a 16-word array of results.

Memory usage (excl. counters etc.):

PRNG state results total
ISAAC 256*4 = 1024 256*4 = 1024 2048 bytes
ISAAC-64 256*8 = 2048 256*8 = 2048 4096 bytes
HC-128 25124 = 4096 16*4 = 64 4160 bytes

Overall HC-128 seems like a clear improvement over ISAAC. For ISAAC it is not really clear what its place is in the world. HC-128 is a super-fast cryptographic PRNG that works similar.

@dhardy
Copy link
Owner

dhardy commented Nov 16, 2017

That's a good argument.

The other thing that jumps out from your results is that there is only one reason to use 32-bit Isaac over the 64-bit version: memory usage (which is in any case still poor).

At some point in time rand should probably promise that XYZRng will remain permanently available to avoid breakage where reproducibility is required. Nothing is stabilised yet. It may still be more sensible to add a new generator along-side, update StdRng, and propose removal of ISAAC separately.

@dhardy
Copy link
Owner

dhardy commented Dec 6, 2017

HC-128 is at least 13 years old and appears to be designed for 32-bit hardware. Most comparable generators are also 32-bit and of similar age, if not older. Can the arithmetic be parallelised well for modern CPUs? Are there any more recent stream cyphers we should consider?

@pitdicker
Copy link
Author

pitdicker commented Dec 7, 2017

Good questions. For cryptographic RNG's it seems sensible to not use something very recent, because there has not been enough time to analyse it. The variant of HC-128 I worked from is from 2009, and has seen a few small improvements. It produces different results from the one in the eSTREAM contest, mostly for performance it seems.

It is written to take good advantage of instruction level parallelism, much more than ISAAC. SIMD does not make sense here, because what it does is mostly reading from different parts of an array. Something like seven times.

The other ciphers from eSTREAM also seem interesting. HC-128 is the said to be clearly fastest one, although I haven't seen benchmarks comparing them. I don't know how optimal our implementation of ChaCha is. LLVM seems to be able to vectorize it... HC128 is 4~6x faster.

It seems all (?) cryptographic RNG's / ciphers get designed for 32 bit, so they work on all sorts of devices well. I don't expect a good 64-bit RNG to get the necessary analysis soon. But I am just an outsider doing some reading... Interestingly HC-128 with just 32-bit arithmetic can rival the performance of 64-bit ISAAC.

I can give updated benchmarks on my implementation of HC-128:
x86 (all numbers in Mb/s):

PRNG u32 u64 byte stream
ISAAC (rand 0.3.16) 743 764 432
ISAAC (current) 916 1043 1219
ISAAC-64 (rand 0.3.16) 668 1307 743
ISAAC-64 (current) 937 1410 1451
HC-128 (previous) 1231 1417 1231
HC-128 1254 1326 1770

x86_64 (all numbers in Mb/s):

PRNG u32 u64 byte stream
ISAAC (rand 0.3.16) 826 831 488
ISAAC (current) 980 1118 1434
ISAAC-64 (rand 0.3.16) 937 1874 620
ISAAC-64 (current) 1287 1930 2640
HC-128 (previous) 1300 1810 1833
HC-128 1478 1935 2397

I have hold off making a PR because I use unsafe to do unchecked indexing, and would like to find a trick to avoid bounds checking with just safe code. But with every variant I tried, LLVM just can't figure it out.

I think I will just make a PR and leave that as an exercise for later.

@zackw
Copy link

zackw commented Dec 8, 2017

I would like to voice my support for making thread_rng a full-strength cryptographic PRNG. I do not have an opinion on which specific algorithm should be used, though.

In another context (backing up the C library's rand with a CSPRNG) someone suggested to me that it would be good to choose the output cache size to make the amortized cost of each call to sample<u32>(Uniform) roughly what the per-call cost with a linear congruential generator would be.

@pitdicker
Copy link
Author

someone suggested to me that it would be good to choose the output cache size to make the amortized cost of each call to sample(Uniform) roughly what the per-call cost with a linear congruential generator would be.

That is a good suggestion, and already what ISAAC does, and HC-128 to a smaller extend.
ISAAC calculates 256 values at a time, and HC-128 16 at a time.

The small amount of logic to read the next result, and generate new ones if necessary is already 20~40% of the time it takes to generate the next random number.

@pitdicker
Copy link
Author

pitdicker commented Dec 16, 2017

I measured the initialization time weeks ago, but forgot to post it.
HC-128 takes about twice as long to do the calculations to set up its table. But because it request fewer bytes from OsRng, it ends up being similar to ISAAC (32-bit variant).

Initialization time (ns):

PRNG OS RNG combined
ChaCha20 1989 17 2006 (for comparison)
HC-128 2482 339 4347 4686
ISAAC 2636 2280 4916
ISAAC-64 5007 2418 7425

@dhardy
Copy link
Owner

dhardy commented Dec 16, 2017

But 2482 + 4347 >> 4686, and anyway 2482 is not much different that 2636? Those numbers make no sense.

Anyway, I don't think we have to worry too much since thread_rng only seeds rarely.

@pitdicker
Copy link
Author

Sloppy cope & paste, sorry for the confusion. The column 'OS' is my estimate of the times spend outside the control of the RNG init function, by measuring the difference between initializing from OsRng and XorShiftRng.

The impact of reseeding the RNG in tread_rng is surprisingly high. tread_rng reseeded it every 2^15 generated bytes.
With a total init of time 4686ns, and with 2.733ns to generate an u32, the overhead is:
4686ns / (2^15 / 4 * 2.733ns) * 100% = 21%

This matches the benchmarks surprisingly well.

Maybe it is better to set the reseeding threshold of thread_rng a little higher, like 2^17 bytes. This makes the overhead 4× less, 5%. And reseeding happens every 131 kB.

But I suppose any threshold that is less than the number of rounds it takes to recover the internal state of an RNG ('crack' it) is good. HC-128 claims there is no better algorithm than a brute-force search of 2^128 values. Such numbers do not drop quickly, an improvement that could crack it in half the time would still take 2^127 tries.

A quick look the much weaker (and not actually recognised as cryptographically secure) RC-4 shows it has an attack that can recover the state after 2^26 rounds. I think we can safely say that for an actual cryptographic RNG the number should never drop that low. 2^26 rounds of 4 bytes means reseeding after generating 128 MB. That seems like a nicer compromise to me between not reseeding, and reseeding unnecessary often. With HC-128 running at its best speed it still means reseeding 14 times per second.

@dhardy
Copy link
Owner

dhardy commented Jan 1, 2018

Should we also consider Kravatte? I don't know a lot about it, but it's supposed to be a pseudo-random function from the Keccak team.

@pitdicker
Copy link
Author

Nice find! I think there are quite a few cryptographically good RNG's to choose from.

I saw replacing ISAAC with HC-128 as a simple step because they are so similar. They follow roughly the same method, both are based on indirect array indexing. And they have roughly the seems performance, reportedly much faster than the alternatives. I probably sell it short by saying this, but HC-128 can be seen as an improved variant of ISAAC.

So it seems like a not too controversial change.

Exploring / collecting more CryptoRng's would be interesting though.

@dhardy
Copy link
Owner

dhardy commented Jan 22, 2018

@Lokathor can I get you to do some more Rasperry Pi benchmarks? HC-128 is merged into upstream master now; it would be nice to see the result of cargo bench --bench generators on ARM.

@Lokathor
Copy link

yeah, I'll get to this when I have a free moment.

@Lokathor
Copy link

pi@retropie:~/dev/rand $ cargo +nightly bench --bench generators
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/generators-e972b27ec2312ee9

running 31 tests
test gen_bytes_chacha      ... bench:  17,644,512 ns/iter (+/- 398,746) = 58 MB/s
test gen_bytes_hc128       ... bench:   7,899,742 ns/iter (+/- 44,517) = 129 MB/s
test gen_bytes_isaac       ... bench:   4,246,930 ns/iter (+/- 80,216) = 241 MB/s
test gen_bytes_isaac64     ... bench:   3,641,228 ns/iter (+/- 35,539) = 281 MB/s
test gen_bytes_os          ... bench:  24,481,923 ns/iter (+/- 65,632) = 41 MB/s
test gen_bytes_std         ... bench:   4,244,826 ns/iter (+/- 50,662) = 241 MB/s
test gen_bytes_xorshift    ... bench:   2,685,297 ns/iter (+/- 5,213) = 381 MB/s
test gen_u32_chacha        ... bench:      68,660 ns/iter (+/- 292) = 58 MB/s
test gen_u32_hc128         ... bench:      30,628 ns/iter (+/- 222) = 130 MB/s
test gen_u32_isaac         ... bench:      22,280 ns/iter (+/- 95) = 179 MB/s
test gen_u32_isaac64       ... bench:      20,794 ns/iter (+/- 151) = 192 MB/s
test gen_u32_os            ... bench:   1,687,689 ns/iter (+/- 3,675) = 2 MB/s
test gen_u32_std           ... bench:      28,134 ns/iter (+/- 142) = 142 MB/s
test gen_u32_xorshift      ... bench:      10,853 ns/iter (+/- 32) = 368 MB/s
test gen_u64_chacha        ... bench:     141,694 ns/iter (+/- 704) = 56 MB/s
test gen_u64_hc128         ... bench:      56,631 ns/iter (+/- 288) = 141 MB/s
test gen_u64_isaac         ... bench:      42,022 ns/iter (+/- 213) = 190 MB/s
test gen_u64_isaac64       ... bench:      33,237 ns/iter (+/- 1,177) = 240 MB/s
test gen_u64_jitter        ... bench:     108,299 ns/iter (+/- 2,685)
test gen_u64_os            ... bench:   1,693,601 ns/iter (+/- 5,160) = 4 MB/s
test gen_u64_std           ... bench:      47,928 ns/iter (+/- 224) = 166 MB/s
test gen_u64_xorshift      ... bench:      17,547 ns/iter (+/- 1,078) = 455 MB/s
test init_chacha           ... bench:         493 ns/iter (+/- 2)
test init_hc128            ... bench:      40,840 ns/iter (+/- 117)
test init_isaac            ... bench:      16,890 ns/iter (+/- 40)
test init_isaac64          ... bench:      15,887 ns/iter (+/- 179)
test init_jitter           ... bench:     640,362 ns/iter (+/- 10,337)
test init_xorshift         ... bench:         105 ns/iter (+/- 0)
test reseeding_hc128_bytes ... bench:   7,903,153 ns/iter (+/- 32,645) = 129 MB/s
test reseeding_hc128_u32   ... bench:      35,869 ns/iter (+/- 56) = 111 MB/s
test reseeding_hc128_u64   ... bench:      70,728 ns/iter (+/- 147) = 113 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 31 measured; 0 filtered out

@dhardy
Copy link
Owner

dhardy commented Jan 29, 2018

Thanks @Lokathor. Looks like @pitdicker was right that ISAAC64 is faster than the 32-bit version everywhere; unfortunately HC-128 doesn't perform well. Comparison on my laptop (Haswell):

$ RUST_VER=nightly cargo bench --bench generators
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/generators-0cb36d8f63ab1872

running 34 tests
test gen_bytes_chacha12 ... bench:   1,756,129 ns/iter (+/- 178,500) = 583 MB/s
test gen_bytes_chacha20 ... bench:   2,693,987 ns/iter (+/- 249,770) = 380 MB/s
test gen_bytes_chacha8  ... bench:   1,281,627 ns/iter (+/- 47,023) = 798 MB/s
test gen_bytes_hc128    ... bench:     629,247 ns/iter (+/- 81,467) = 1627 MB/s
test gen_bytes_isaac    ... bench:   1,108,569 ns/iter (+/- 31,206) = 923 MB/s
test gen_bytes_isaac64  ... bench:     591,091 ns/iter (+/- 57,612) = 1732 MB/s
test gen_bytes_os       ... bench:   6,163,858 ns/iter (+/- 307,591) = 166 MB/s
test gen_bytes_std      ... bench:     591,572 ns/iter (+/- 56,295) = 1730 MB/s
test gen_bytes_xorshift ... bench:     887,066 ns/iter (+/- 29,150) = 1154 MB/s

@pitdicker
Copy link
Author

Wow! I did not expect such a large difference between ISAAC and HC-128. I wonder what causes it?

test gen_bytes_hc128       ... bench:   7,899,742 ns/iter (+/- 44,517) = 129 MB/s
test gen_bytes_isaac       ... bench:   4,246,930 ns/iter (+/- 80,216) = 241 MB/s
test gen_bytes_isaac64     ... bench:   3,641,228 ns/iter (+/- 35,539) = 281 MB/s

If the problem is code size or register pressure HC-128 can be optimized some more for this target. But if the performance difference is just because of the larger tables or the number of instructions this is it...

@pitdicker
Copy link
Author

Looks like @pitdicker was right that ISAAC64 is faster than the 32-bit version everywhere

I should be careful not to claim too much... Actually I didn't notice it before your comment #53 (comment) 😄.

The other thing that jumps out from your results is that there is only one reason to use 32-bit Isaac over the 64-bit version: memory usage (which is in any case still poor).

@dhardy
Copy link
Owner

dhardy commented Mar 11, 2018

Implemented: rust-random#277

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants