More efficient phi(x, a) caching #48
It's definitely useful, and I greatly appreciate the comment! Yes, the phi_walk was interesting for its time, but its memory use makes it fairly limited. Your caching phi is extremely well optimized, making the difference even smaller. I'll take a look!
I just realized that this not only increases the cache's capacity but also increases the cache hit rate. E.g. suppose you cache phi(501, a) and at some later point you need to compute phi(502, a). Then you're lucky, because phi(502, a) = phi(501, a) (502 is even, and the result only changes at odd x), so the lookup hits the entry already stored for 501.
It is a nice improvement, thank you! I also saw that your main loop in src/phi.cpp stops recursing when we know all further answers come from the cache. That's a big help. I further split it to stop looping entirely when xp <= primes[i] (we can just add in the right number of -SIGN values). Something I noticed earlier:
I think this is both a performance and a correctness issue. Both of these should take essentially zero time to return 1. If you can get a reasonable prime count upper bound or nth prime lower bound, you can use it in the main phi(x, a, threads) function before generating all the primes, to quickly decide that you can return 1. Whether this is a case worth optimizing is up to you. The correctness problem with negative return values seems to be primes[a] being an int32_t, so it overflows to negative values; the if statement after that therefore doesn't get used like it should. Perhaps add some limit checks before generating the primes.
Thanks for your bug report! I was actually hoping that we could jointly improve phi(x, a) and it has worked out great so far 😄 I have fixed the integer overflow and also massively improved my phi(x, a) for large values of a (i.e. if prime[a] > sqrt(x)) by using code that is similar to yours. My prime counting function implementations do not directly benefit from this, but since phi(x, a) is part of primecount's API I want it to be correct, and it should also not cause out-of-memory errors as was the case before. Unfortunately my fix contains many edge cases and is full of pitfalls, but it seems to be necessary. I will add more tests on the weekend. There is one optimization that I have added that seems to be missing in your implementation. Here is your code:

```c
UV legendre_phi(UV x, UV a)
{
  ...
  /* If a > prime_count(2^32), then we need not be concerned with composite
   * x values with all factors > 2^32, as x is limited to 64-bit. */
  if (a > 203280221) { /* prime_count(2**32) */
    UV pc = LMO_prime_count(x);
    return (a >= pc) ? 1 : pc - a + 1;
  }
```

I changed that to something like:

```c
if (a >= sqrt(x) || nth_prime(a) > sqrt(x)) {
  UV pc = LMO_prime_count(x);
  return (a >= pc) ? 1 : pc - a + 1;
}
```

The tricky part is that, just like your code, my pi_legendre(x) implementation uses legendre_phi(x, a) under the hood. If we are not very careful then legendre_phi(x, a) will switch to LMO_prime_count(x) when pi_legendre(x) is called. This is surely not what our users expect. It seems that by using …
I don't know if it is practical in the caching situation, but for the walk I used a reduction for 2, 3, and 5:

For phi(1e13, 1e5) it was a nice savings in the number of walk inserts, but naively applying it to caching didn't help. It should collapse more values to the same x so we get a better cache hit rate, but I didn't look into it enough to find out what was happening. I'll look at the preface. I think I was trying hard to avoid calling the actual prime count or nth prime early, but I doubt it's a concern since we already just ensured a < sqrt(x), therefore a < 2^32, so that's ~10ms for the exact answer (worst case, and yours is even faster). nth_prime_upper is under a microsecond, so I might try that first. That leaves a tiny window where we'd calculate the exact nth_prime but not use it.
In my code I don't call nth prime either. Instead what I do is: while filling up the primes array (with the first a primes) I abort as soon as the current prime > sqrt(x) and switch to the prime counting function based algorithm. As you mentioned, I think it is a good idea to add another nth_prime_upper check somewhere earlier in phi(x, a).
I tried it as well, using a modulo 2310 wheel that I am already using in the computation of the hard special leaves (see ModuloWheel.hpp & ModuloWheel.cpp). In my tests it actually significantly improves performance, especially if x >= 1e14: I measured a 30% speed improvement for pi_legendre(1e15). Note that in my test I cached all phi(x, a) values with (x / 2310 * 481) <= UINT16_MAX (and a < 100).
Nice! I just tested collapsing the values themselves, but used the identical 2x reduction (ceil_div(x, 2)) for the actual cache storage. So the memory use would be the same, but the cache hit rate should have increased. It sounds like you did it properly. I've got a todo item to retest my code with legendre_pi, as I haven't tested it and I'm sure I haven't caught all the cases.
It looks like there were a few conflicting things preventing this from showing up as obviously for me. Importantly, I allow caching past 65535, which isn't necessarily good or bad, but it made the results harder to follow. As an aside, if I increase the cache size in src/phi.cpp (PhiCache constructor), it starts giving incorrect answers. Not a user-modifiable thing, but might be worth noting. After some code refactoring, it is showing a speedup for me. On my M1 MacBook, primecount's single-threaded Legendre prime count for 1e15 went from 59.7s (v6.2) to 24.1s (v6.3). My version of Legendre prime count went from 1m37s to 23.8s. Meissel also improved greatly in both our codes.
I know this is dangerous. As I mentioned, I currently cache up to UINT16_MAX / 480 * 2310 + 2309 = 316469. This is safe because phi(316469, 6) = 60702 < UINT16_MAX. I have put an assertion in my code (that checks before inserting into the cache), but you are right: if somebody slightly increased the cache size we would start getting integer overflows. Ideally we should calculate the exact limit for our cache size and have an assertion next to the code that sets the cache size...
Awesome 😄
I have achieved another 2.5x speedup! While playing with my phi(x, a) implementation I realized that the cache is extremely important for performance: whenever I increased the size of the cache, my implementation ran faster. So in order to improve our phi(x, a) implementation we need to find a way to increase our cache's capacity, ideally without increasing the memory usage (because this hurts in multi-threading). We can increase the cache capacity by about 5x without increasing its memory usage using an array with prefix sums. For each a that we cache, we create a bit sieve array (where each bit represents an odd number) from which we remove the first a primes and their multiples. Next we generate the prefix sums array by iterating over the bit sieve array and counting the 1 bits (which correspond to numbers not divisible by any of the first a primes). This is implemented here. After the cache has been initialized this way, we can compute phi(x, a) using the code below:

```cpp
int64_t phi_cache(uint64_t x, uint64_t a)
{
  // Unset bits which correspond to numbers > x
  uint64_t bitmask = unset_bits_[x % 128];
  uint64_t bits = sieve_[a][x / 128];
  return sieve_counts_[a][x / 128] + popcnt64(bits & bitmask);
}
```

I have found one more trick that slightly improves performance. Currently in both our implementations we use c=6 or c=7 and we start iterating from there up to a. We do this because if c is tiny we have an implementation that can compute phi(x, c) in O(1). However, using our cache we can actually compute phi(x, c) in O(1) for much larger values of c. Below is my code for using a larger c:

```cpp
int64_t c = 6;
int64_t larger_c = min(a, max_a_cached_);

if (c >= larger_c ||
    !is_cached(x, larger_c))
  sum = phi_tiny(x, c) * SIGN;
else
{
  c = larger_c;
  assert(larger_c <= a);
  sum = phi_cache(x, c) * SIGN;
}
```
Hi Dana,

I have seen that you have significantly reworked your legendre_phi(x, a) implementation on the weekend. This algorithm is also quite important in primecount because it is used in my pi_simple(x) implementation (which is used all over the place internally), and it is also frequently used in the computation of the hard special leaves (to initialize an array of phi(x, a) values for each new thread). I have known for many years that your phi_walk(x, a) is much faster than primecount's implementation for small values of a. This benefits mainly the pi_lehmer(x) algorithm (where a is small) but not so much the pi_legendre(x) and pi_meissel(x) algorithms (where a is larger). In primecount I am mainly interested in speeding up legendre_phi(x, a) for relatively large values of a, hence your current phi_walk(x, a) implementation is not very useful to me.

After having seen your code changes, I re-read the prime counting function section in Hans Riesel's book, as I remembered that this book contains some information about caching phi(x, a) results. The phi(x, a) cache in Hans Riesel's book is a compressed lookup table which only stores information when the function changes its value. This obviously saves a lot of memory (I guess a factor of log(x)), but I think it deteriorates accessing the cache from O(1) to O(log(x)), and filling up this cache is nontrivial, as can be seen in your phi_walk(x, a) implementation.

When analyzing the results of the phi(x, a) function I realized that this function only ever changes its result when x is odd (for the same value of a). So for example phi(501, a) and phi(502, a) will always have the same result, but phi(503, a) may be different. For the previous example, both our phi(x, a) implementations currently cache phi(501, a) and phi(502, a) even though both have the same result. Since our cache uses the uint16_t type, which is very narrow, we cannot cache a lot of numbers, and because of this doubling the cache capacity provides a significant speedup.

I have implemented this improvement (see kimwalisch/primecount@fb3a643) and I have measured a speedup of up to 30% for both pi_legendre(x) and pi_meissel(x). Note that part of the speedup (up to 10%) is due to an unrelated change where I split up the main phi(x, a) loop into 2 loops (have a look at my commit).

I hope this is useful to you.