Use hash table instead of binary search to look up 32-bit pseudoprimes in n_is_prime #2509
Conversation
Does looking up the pseudoprimes perform favorably compared to looking up the witness to use, like Forisek and Jancina's 32-bit test? If you want to reduce the number of pseudoprimes to store, you can use a different initial witness. 2 is actually very average when used by itself; its strength is that it eliminates the Monier-Rabin semiprimes { (2x+1)(4x+1) where x is odd } that many other witnesses are weak to. Given a pair (witness, odd_strong_pseudoprimes_under_2^32) we have (15, 1883), (34, 2009) and (37, 1959), and given their proximity to a power of two (16-1, 32+2, 32+5) you might be able to exploit their form for faster Montgomery exponentiation, as is done with 2. I'm not aware of any witness that has only 1024 pseudoprimes under 2^32, as convenient as that might be.
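For context, counting "strong pseudoprimes to a witness a" refers to composites that pass a strong probable prime test to base a. Here is a minimal, generic sketch of such a test (plain 64-bit modular arithmetic, no Shoup or Montgomery tricks), not FLINT's actual implementation:

```c
#include <stdint.h>

/* Generic sketch of a strong probable prime test to base a, for odd n > 2.
   Uses the unsigned __int128 extension (GCC/Clang) for the modular multiply. */
static uint64_t mulmod64(uint64_t x, uint64_t y, uint64_t n)
{
    return (uint64_t)(((unsigned __int128) x * y) % n);
}

static int is_strong_probable_prime(uint64_t n, uint64_t a)
{
    uint64_t d = n - 1;
    int s = 0;

    while ((d & 1) == 0)          /* write n - 1 = d * 2^s with d odd */
    {
        d >>= 1;
        s++;
    }

    uint64_t x = 1, base = a % n, e = d;
    while (e)                     /* x = a^d mod n by square-and-multiply */
    {
        if (e & 1)
            x = mulmod64(x, base, n);
        base = mulmod64(base, base, n);
        e >>= 1;
    }

    if (x == 1 || x == n - 1)
        return 1;                 /* n is a (strong) probable prime to base a */

    for (int r = 1; r < s; r++)   /* check whether a^(d*2^r) == -1 mod n */
    {
        x = mulmod64(x, x, n);
        if (x == n - 1)
            return 1;
    }

    return 0;                     /* n is definitely composite */
}
```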
Yes, though this may differ depending on the CPU and minute implementation differences. Random input will be declared composite by the base-2 test with high probability, so in the average case we benefit from the fast powering of base 2 and from the fact that we don't need any hash table lookup at all. For primes (or, rarely, base-2 pseudoprimes), the additional hash table lookup after the strong probable prime test should have comparable cost to a hash table lookup before the test. BTW, note that we use Shoup modular reduction for the 32-bit powering and Montgomery reduction only above 32 bits. This was faster on my machine, but again, this may differ on other machines (and with any micro-optimizations I've overlooked).
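The control flow being described might be sketched as follows; the helper names are hypothetical, not FLINT's actual API:

```c
#include <stdint.h>

/* Assumed helpers (not FLINT's real functions): a base-2 strong probable
   prime test and a membership test against the table of the base-2
   strong pseudoprimes below 2^32. */
int sprp_base2_u32(uint32_t n);
int pseudoprime_table_contains(uint32_t n);

int is_prime_u32_sketch(uint32_t n)
{
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;

    /* Fast path: almost all random composites fail the base-2 test,
       so the pseudoprime table is never touched for them. */
    if (!sprp_base2_u32(n))
        return 0;

    /* Slow path: n is either prime or one of the base-2 strong
       pseudoprimes below 2^32; the hash table lookup decides which. */
    return !pseudoprime_table_contains(n);
}
```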
Instead of a branchy binary search, do a branch-free O(1) hash table lookup. Inspired in part by a discussion with David Sparks.
Gives a 5-10% speedup for random integers and 15-40% speedup for primes:
I'm not an expert on designing perfect hash functions; these are the parameters I came up with after a bit of trial and error. The hash table stores the 2314 pseudoprimes in an array of 2560 entries, requiring 10% zero padding (plus a bit of extra space for hash keys).
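A minimal sketch of what such a branch-free lookup might look like is below; the table size matches the description above, but the multiplier and shift are placeholders, not the hash parameters actually used in the PR, and the table contents are omitted:

```c
#include <stdint.h>

#define TABLE_SIZE 2560   /* 2314 pseudoprimes plus roughly 10% zero padding */

/* Base-2 strong pseudoprimes below 2^32, each stored in the slot chosen by
   the hash; empty slots hold 0. Real contents omitted in this sketch. */
static const uint32_t pseudoprime_table[TABLE_SIZE] = { 0 /* ... */ };

/* Hypothetical branch-free membership test: with a true perfect hash each
   pseudoprime lands in a distinct slot, so one comparison suffices. */
static int pseudoprime_table_contains(uint32_t n)
{
    uint32_t h = (uint32_t)((n * UINT64_C(0x9E3779B97F4A7C15)) >> 32) % TABLE_SIZE;
    return pseudoprime_table[h] == n;
}
```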
Using an array of length 4096 instead would probably be a little faster by allowing a cheaper modulo operation, but it would waste 6 KB of precious L1 cache. Maybe one could instead split by the size of n into one small 512-entry table and one large 2048-entry table, if someone else wants to revisit this after me.
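For reference, the trade-off is that a power-of-two table length lets the slot index be computed with a single mask instead of a modulo; a toy comparison (sizes and hash value purely illustrative):

```c
#include <stdint.h>

/* Illustrative only: 2560 slots need a modulo (division or a
   reciprocal-multiply trick), while 4096 slots need a single AND. */
uint32_t index_2560(uint32_t hash) { return hash % 2560; }
uint32_t index_4096(uint32_t hash) { return hash & 4095; }
```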