
More efficient phi(x, a) caching #48 (Open)

kimwalisch opened this issue Feb 9, 2021 · 11 comments

@kimwalisch

Hi Dana,

I have seen that you significantly reworked your legendre_phi(x, a) implementation over the weekend. This algorithm is quite important in primecount as well: it is used in my pi_simple(x) implementation (which is used all over the place internally), and it is also frequently used in the computation of the hard special leaves (to initialize an array of phi(x, a) values for each new thread).

I have known for many years that your phi_walk(x, a) is much faster than primecount's implementation for small values of a. This benefits mainly the pi_lehmer(x) algorithm (where a is small) but not so much the pi_legendre(x) and pi_meissel(x) algorithms (where a is larger). In primecount I am mainly interested in speeding up legendre_phi(x, a) for relatively large values of a, hence your current phi_walk(x, a) implementation is not very useful to me.

After having seen your code changes, I re-read the prime counting function section in Hans Riesel's book, as I remembered that it contains some information about caching phi(x, a) results. The phi(x, a) cache in Riesel's book is a compressed lookup table which only stores information when the function changes its value. This obviously saves a lot of memory (I guess a factor of log(x)), but I think it degrades cache lookups from O(1) to O(log(x)), and filling up this cache is nontrivial, as can be seen in your phi_walk(x, a) implementation.

When analyzing the results of the phi(x, a) function I realized that, for a fixed a, the function only ever changes its result when x is odd. So for example phi(501, a) and phi(502, a) will always have the same result, but phi(503, a) may be different. For the previous example, both our phi(x, a) implementations currently cache phi(501, a) and phi(502, a) even though both calls have the same result. Since our cache uses the very narrow uint16_t type, we cannot cache a lot of numbers, so doubling the cache capacity provides a significant speedup.
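The observation can be checked with a small self-contained sketch (not primecount's actual code): a naive recursive phi(x, a) with a few hard-coded primes, plus the ceil_div(x, 2) cache key that collapses an even x with the odd number below it.

```cpp
#include <cassert>
#include <cstdint>

// Naive Legendre phi(x, a): counts integers in [1, x] that are not
// divisible by any of the first a primes. Hypothetical sketch, only
// valid for a <= 6 with this hard-coded prime table (index 0 unused).
static const int64_t small_primes[] = {0, 2, 3, 5, 7, 11, 13};

int64_t naive_phi(int64_t x, int64_t a)
{
    if (a == 0) return x;
    // Standard recurrence: phi(x, a) = phi(x, a-1) - phi(x / p_a, a-1)
    return naive_phi(x, a - 1) - naive_phi(x / small_primes[a], a - 1);
}

// Cache key: for a >= 1, phi(x, a) only changes at odd x, so an even x
// and the odd number below it can share one cache slot.
uint64_t ceil_div(uint64_t x, uint64_t y) { return (x + y - 1) / y; }
```

For example, naive_phi(501, 4) == naive_phi(502, 4), and both map to the same key since ceil_div(501, 2) == ceil_div(502, 2) == 251, while ceil_div(503, 2) == 252.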

I have implemented this improvement (see kimwalisch/primecount@fb3a643) and I have measured a speedup of up to 30% for both pi_legendre(x) and pi_meissel(x). Note that part of the speedup (up to 10%) is due to an unrelated change where I split up the main phi(x, a) loop into 2 loops (have a look at my commit).

I hope this is useful to you.

@danaj danaj self-assigned this Feb 9, 2021
@danaj (Owner) commented Feb 9, 2021

It's definitely useful, and I greatly appreciate the comment!

Yes, the phi_walk was interesting for its time, but the memory use makes it fairly limited. Your caching phi is extremely well optimized, making the difference even smaller.

I'll take a look!

@kimwalisch (Author)

I just realized that this not only increases the cache's capacity but it also increases the cache hit rate. E.g. suppose you cache phi(501, a) and at some later point you need to compute phi(502, a). Well then you're lucky because ceil_div(502, 2) evaluates to the same value as ceil_div(501, 2), so phi(502, a) will be a cache hit even if it has not been computed/cached before.

@danaj (Owner) commented Feb 11, 2021

It is a nice improvement, thank you!

I also saw your main loop in src/phi.cpp stops recursing when we know all further answers come from the cache. That's a big help. I further split it to stop looping entirely when xp <= primes[i] (we can just add in the right number of -SIGN values).

Something I noticed earlier:

time ./primecount --phi 1e9 100858536
1

real	0m0.579s
user	0m0.532s
sys	0m0.046s

time ./primecount --phi 1e9 900858536
-850011001

real	0m5.578s
user	0m5.143s
sys	0m0.422s

I think this is both a performance and a correctness issue. Both of these should take essentially zero time and return 1.

If you can get a reasonable prime count upper bound or nth prime lower bound, you can use it in the main phi(x,a,threads) function before generating all the primes, to quickly decide you can return 1. E.g.
if (a >= prime_count_upper(x)) return 1;

Whether this is a case worth optimizing is up to you. The correctness problem with the negative return value seems to be that primes[a] is an int32_t, so it overflows to negative values; therefore the if statement after that doesn't get used like it should. Perhaps add some limit checks before generating the primes.
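A concrete version of that early-exit guard might look like the following (a hypothetical sketch, not either project's code — any valid upper bound on pi(x) works; this one is Dusart's explicit bound):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Upper bound on pi(x). Dusart: pi(x) < x / (ln(x) - 1.1) for
// x >= 60184. For smaller x we fall back to pi(2^16) = 6542, which
// bounds pi(x) for all x < 2^16.
uint64_t prime_count_upper(uint64_t x)
{
    if (x < 60184)
        return 6542;
    // +1 guards against floating-point truncation
    return (uint64_t)(x / (std::log((double)x) - 1.1)) + 1;
}

// At the top of phi(x, a): if a >= pi(x), all primes <= x are sieved
// out, so only the number 1 remains and phi(x, a) == 1:
//
//   if (a >= prime_count_upper(x)) return 1;
```

This check is cheap (one log and one division), so it can run before any primes are generated.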

@kimwalisch (Author)

Thanks for your bug report! I was actually hoping that we could jointly improve phi(x, a) and it has worked out great so far 😄

I have fixed the integer overflow and also massively improved my phi(x, a) for large values of a (i.e. if prime[a] > sqrt(x)) by using code that is similar to yours. My prime counting function implementations do not directly benefit from this, but since phi(x, a) is part of primecount's API I want it to be correct and it should also not cause out of memory errors as was the case before. Unfortunately my fix contains many edge cases and it is full of pitfalls, but it seems to be necessary. I will add more tests on the weekend.

There is one optimization that I have added that seems to be missing in your implementation, here is your code:

UV legendre_phi(UV x, UV a)
{
...

  /* If a > prime_count(2^32), then we need not be concerned with composite
   * x values with all factors > 2^32, as x is limited to 64-bit. */
  if (a > 203280221) {  /* prime_count(2**32) */
    UV pc = LMO_prime_count(x);
    return (a >= pc)  ?  1  :  pc - a + 1;
  }

I changed that to something like:

  if (a >= sqrt(x) || nth_prime(a) > sqrt(x)) {
    UV pc = LMO_prime_count(x);
    return (a >= pc)  ?  1  :  pc - a + 1;
  }

The tricky part is that, just like your code, my pi_legendre(x) implementation uses legendre_phi(x, a) under the hood. If we are not very careful then legendre_phi(x, a) will switch to LMO_prime_count(x) when pi_legendre(x) is called. This is surely not what our users expect. It seems that by using if (nth_prime(a) > sqrt(x)) instead of if (nth_prime(a) >= sqrt(x)) this can be prevented though.
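The identity behind that switch can be verified by brute force for small inputs (a throwaway check, not either library's code): once nth_prime(a) >= sqrt(x), the only integers <= x that survive the sieve are 1 and the primes above the a-th prime, hence phi(x, a) = pi(x) - a + 1.

```cpp
#include <cassert>
#include <cstdint>

// Brute-force helpers, fine for small x only.
bool is_prime(uint64_t n)
{
    if (n < 2) return false;
    for (uint64_t d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

uint64_t brute_pi(uint64_t x)  // pi(x) by trial division
{
    uint64_t count = 0;
    for (uint64_t n = 2; n <= x; n++)
        count += is_prime(n);
    return count;
}

// phi(x, a) by brute force: count n <= x not divisible by any of
// the first a primes.
uint64_t brute_phi(uint64_t x, uint64_t a)
{
    uint64_t count = 0;
    for (uint64_t n = 1; n <= x; n++) {
        uint64_t seen = 0;  // how many of the first a primes checked
        bool survives = true;
        for (uint64_t p = 2; seen < a; p++) {
            if (!is_prime(p)) continue;
            seen++;
            if (n % p == 0) { survives = false; break; }
        }
        count += survives;
    }
    return count;
}
```

For instance with x = 100 and a = 5 (nth_prime(5) = 11 > sqrt(100)): brute_phi(100, 5) equals brute_pi(100) - 5 + 1 = 21, since any composite <= 100 free of the first 5 primes would need a least factor >= 13, and 13^2 > 100.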

@danaj (Owner) commented Feb 12, 2021

I don't know if it is practical in the caching situation, but for the walk I used a reduction for 2, 3, and 5:

// Valid for a >= 3 and x >= 1.
static const unsigned char _pred5[30] = {1,0,1,2,3,4,5,0,1,2,3,0,1,0,1,2,3,0,1,0,1,2,3,0,1,2,3,4,5,0};
x -= _pred5[x % 30];

For phi(1e13,1e5) the number of walk inserts was:
44.9M normal
34.3M reduce even numbers to odd
25.7M mod 30 reduction
23.8M mod 210 reduction

It is a nice savings for the walk routine, but naively applying it to caching didn't help. It should collapse more values to the same x so we get a better cache hit rate, but I didn't look into it enough to find out what was happening.
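The table's intent (restating the snippet above as a self-contained sketch) is to step x down to the largest y <= x that is coprime to 30, which can be verified exhaustively over a few full periods:

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>  // std::gcd

// Same table as above: pred5[r] is the distance from a number x with
// x % 30 == r down to the previous integer coprime to 30 (valid x >= 1).
static const unsigned char pred5[30] =
    {1,0,1,2,3,4,5,0,1,2,3,0,1,0,1,2,3,0,1,0,1,2,3,0,1,2,3,4,5,0};

uint64_t reduce_mod30(uint64_t x) { return x - pred5[x % 30]; }
```

Since phi(x, a) with a >= 3 is constant between consecutive integers coprime to 30, this reduction collapses roughly 30/8 of the x values to the same representative.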


I'll look at the preface. I think I was trying hard to avoid calling the actual prime count or nth prime early, but I doubt it's a concern since we already just ensured a < sqrt(x), therefore a < 2^32, so that's ~10ms for the exact answer (worst case, and yours is even faster). nth_prime_upper is under a microsecond, so I might try that first. That leaves a tiny window where we'd calculate the exact nth_prime but not use it.

@kimwalisch (Author)

I think I was trying hard to avoid calling the actual prime count or nth prime early...

In my code I don't call nth prime either. Instead what I do is: while filling up the primes array (with the first a primes) I abort as soon as the current_prime > sqrt(x) and switch to the prime counting function based algorithm. As you mentioned, I think it is a good idea to add another nth_prime_upper check somewhere earlier in phi(x, a).

@kimwalisch (Author) commented Feb 12, 2021

I used a reduction for both 2, 3, and 5:

I tried it as well, using a modulo 2310 wheel that I am already using in the computation of the hard special leaves (see ModuloWheel.hpp & ModuloWheel.cpp). In my tests it significantly improves performance, especially for x >= 1e14; e.g. I measured a 30% speed improvement for pi_legendre(1e15). Note that in my test I cached all phi(x, a) values if (x / 2310 * 481) <= UINT16_MAX (and a < 100).

@danaj (Owner) commented Feb 12, 2021

Nice! I just tested collapsing the values themselves, but used the identical 2x reduction (ceil_div(x,2)) for the actual cache storage. So the memory use would be the same, but the cache hit rate should have increased. It sounds like you did it properly.

I've got a todo item to retest my code with legendre_pi, as I haven't tested it yet and I'm sure I haven't caught all the cases.

@danaj (Owner) commented Feb 13, 2021

It looks like there were a few conflicting things preventing this from showing up as obviously for me. Importantly, I allow caching past 65535, which isn't necessarily good or bad, but it made the results harder to follow. As an aside, if I increase the cache size in src/phi.cpp's PhiCache constructor, it starts giving incorrect answers. Not a user-modifiable thing, but might be worth noting.

After some code refactoring, it is showing a speedup for me.

On my M1 Macbook, primecount's single threaded Legendre primecount for 1e15 went from 59.7s (v6.2) to 24.1s (v6.3).
Both built locally with system clang and without OpenMP.
Of course threaded (gcc-10) they are much faster in real time.

My version of legendre primecount went from 1m37s to 23.8s.

Meissel also improved greatly in both our codes.

@kimwalisch (Author)

It looks like there were a few conflicting things preventing this from showing up as obviously for me. Importantly, I allow caching past 65535, which isn't necessarily good or bad, but it made the results harder to follow. As an aside, if I increase the cache size in src/phi.cpp's PhiCache constructor, it starts giving incorrect answers. Not a user-modifiable thing, but might be worth noting.

I know this is dangerous. I mentioned that I currently cache up to UINT16_MAX/480*2310+2309 = 316469. This is safe as phi(316469, 6) = 60702 < UINT16_MAX. I have put an assertion in my code (that checks before inserting into the cache), but you are right: if somebody slightly increased the cache size, we would start getting integer overflows. Ideally we should calculate the exact limit for our cache size and have an assertion next to the code that sets the cache size...
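The arithmetic behind that limit can even be pinned down at compile time (a minimal sketch; it assumes, per the figure above, that 480 is the number of residues kept per 2310-wide wheel period, i.e. phi(2310) = 480):

```cpp
#include <cassert>
#include <cstdint>

// UINT16_MAX / 480 full wheel periods of length 2310, plus the last
// (partial) period: the largest x whose compressed index still fits
// in the uint16_t cache.
constexpr uint64_t kCacheLimit =
    (uint64_t)65535 / 480 * 2310 + 2309;

static_assert(kCacheLimit == 316469,
              "cache size limit no longer matches the wheel layout");
```

A static_assert like this placed next to the code that sets the cache size would catch a careless resize at build time instead of producing silent overflows.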

My version of legendre primecount went from 1m37s to 23.8s.

Awesome 😄

@kimwalisch (Author)

I have achieved another 2.5x speedup!

While playing with my phi(x, a) implementation I realized that the cache is extremely important for performance, whenever I increased the size of the cache my implementation ran faster. So in order to improve our phi(x, a) implementation we need to find a way to increase our cache's capacity, ideally without increasing the memory usage (because this hurts in multi-threading).

We can increase the cache capacity by about 5x without increasing its memory usage by using an array of prefix sums. For each a that we cache, we create a bit sieve array (where each bit represents an odd number) from which we remove the first a primes and their multiples. Next, we generate the prefix sums array by iterating over the bit sieve array and counting the 1 bits (which correspond to numbers not divisible by any of the first a primes). This is implemented here.

After the cache has been initialized this way, we can compute phi(x, a) using the code below:

int64_t phi_cache(uint64_t x, uint64_t a)
{
    // Unset bits which correspond to numbers > x
    uint64_t bitmask = unset_bits_[x % 128];
    uint64_t bits = sieve_[a][x / 128];
    return sieve_counts_[a][x / 128] + popcnt64(bits & bitmask);
}
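The layout can be illustrated with a self-contained sketch (assumed from the snippet above, not the actual primecount code): one bit per odd number, 128 integers per 64-bit word, with a per-word prefix count, so phi(x, a) for a >= 1 becomes a prefix count plus one masked popcount.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For a fixed small a: bit k of sieve[w] represents the odd number
// 128*w + 2*k + 1, set iff it is not divisible by any of the first a
// primes. counts[w] holds the number of surviving integers below 128*w.
struct PhiSieve {
    std::vector<uint64_t> sieve, counts;

    PhiSieve(uint64_t limit, const std::vector<uint64_t>& first_a_primes)
    {
        uint64_t words = limit / 128 + 1;
        sieve.assign(words, 0);
        counts.assign(words, 0);
        for (uint64_t n = 1; n <= limit; n += 2) {
            bool survives = true;
            for (uint64_t p : first_a_primes)
                if (n % p == 0) { survives = false; break; }
            if (survives)
                sieve[n / 128] |= 1ULL << (n % 128 / 2);
        }
        for (uint64_t w = 1; w < words; w++)
            counts[w] = counts[w - 1] + popcnt(sieve[w - 1]);
    }

    static uint64_t popcnt(uint64_t v)  // portable popcount
    {
        uint64_t c = 0;
        for (; v; v &= v - 1) c++;
        return c;
    }

    // phi(x, a) in O(1): prefix count plus a masked popcount of the
    // word containing x. The mask keeps bits of odd numbers <= x.
    uint64_t phi(uint64_t x) const
    {
        uint64_t r = x % 128;
        uint64_t mask = (r == 0) ? 0 : ~0ULL >> (64 - (r + 1) / 2);
        return counts[x / 128] + popcnt(sieve[x / 128] & mask);
    }
};
```

Compared with storing one uint16_t per (x, a) pair, each surviving number here costs a single bit plus its share of one 64-bit prefix count, which is where the roughly 5x capacity gain comes from.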

I have found one more trick that slightly improves performance. Currently in both our implementations we use c=6 or c=7 and start iterating from there up to a. We do this because if c is tiny we have an implementation that can compute phi(x, c) in O(1). However, using our cache we can actually compute phi(x, c) in O(1) for much larger values of c.

Below is my code for using a larger c:

int64_t c = 6;
int64_t larger_c = min(a, max_a_cached_);

if (c >= larger_c ||
    !is_cached(x, larger_c))
    sum = phi_tiny(x, c) * SIGN;
else
{
    c = larger_c;
    assert(larger_c <= a);
    sum = phi_cache(x, c) * SIGN;
}
