Improve performance of BigInteger.Pow/ModPow #2182

Merged
merged 1 commit into dotnet:master from axelheer:biginteger-performance on Jul 9, 2015

6 participants
@axelheer
Contributor

axelheer commented Jun 27, 2015

To introduce further performance tweaks, the exponentiation algorithms are ported to BigIntegerCalculator. Furthermore the newly introduced FastReducer triggers a bad corner case within the division algorithm, which gets fixed too.

A basic performance comparison based on this code yields the following results:

ModPow

# of bits | # of vals | before (ms) | after (ms)
--- | --- | --- | ---
16 | 100,000 | 64 | 23
64 | 10,000 | 145 | 202
256 | 1,000 | 298 | 181
1,024 | 100 | 1,344 | 637
4,096 | 10 | 7,778 | 2,548
16,384 | 1 | 48,292 | 9,595
#1307
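For reference, the exponentiation algorithms being ported follow the classic square-and-multiply scheme. A minimal Python sketch of the idea (illustrative only -- the actual BigIntegerCalculator code is C# and operates on raw 32-bit digit arrays):

```python
def mod_pow(value, exponent, modulus):
    """Right-to-left binary (square-and-multiply) modular exponentiation."""
    result = 1 % modulus              # the "% modulus" handles modulus == 1
    base = value % modulus
    while exponent > 0:
        if exponent & 1:              # multiply step for each set exponent bit
            result = (result * base) % modulus
        base = (base * base) % modulus  # square step for every bit
        exponent >>= 1
    return result
```

Each `% modulus` above is exactly the operation a reducer such as FastReducer replaces for large operands.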
@stephentoub

Member

stephentoub commented Jun 27, 2015

Thanks, @axelheer. Will take a look, but in the meantime, do you know why the 64/10K case suffers a 33% regression?

@axelheer

Contributor

axelheer commented Jun 27, 2015

@stephentoub nope, that's a bit odd and needs further investigation. Maybe the Barrett reduction backfires for smaller numbers.
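For context, Barrett reduction replaces each division by the modulus with two multiplications against a precomputed reciprocal, which only pays off once the operands are large enough to amortize that setup. A minimal Python sketch of the idea (illustrative only, not the actual FastReducer code; `mu` and `k` are the per-modulus precomputations):

```python
def barrett_setup(m):
    k = m.bit_length()
    mu = (1 << (2 * k)) // m       # precomputed reciprocal: one division, ever
    return mu, k

def barrett_reduce(x, m, mu, k):
    # Valid for 0 <= x < m*m; uses shifts and multiplies instead of division.
    q = ((x >> (k - 1)) * mu) >> (k + 1)   # q slightly underestimates x // m
    r = x - q * m
    while r >= m:                          # at most a couple of corrections
        r -= m
    return r
```

For small moduli the two multiplications cost more than a single hardware-assisted division, which would be consistent with the regression showing up only at 64 bits.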

@axelheer

Contributor

axelheer commented Jun 29, 2015

@stephentoub is it of equal value to just "amend" further changes? Or does pushing further commits and squashing at the end have any advantages?

@stephentoub

Member

stephentoub commented Jun 29, 2015

In this case I'd suggest adding additional commits and then squashing later. Personally, I prefer to just amend the existing commit when the changes are small / trivial and I don't expect anyone to need to re-review, e.g. fixing up typos or making simple one-line fixes. For larger things, I prefer to have the additional commits so that it's clear to a reviewer what's changed. That's just me, though.

@axelheer

Contributor

axelheer commented Jun 30, 2015

@stephentoub @mellinoe the two internal helpers are structs now; the reducer seems to pay off from 256 bits on -- below that an ordinary remainder is used. But I still measure that regression at 64 bits; it's a bit smaller but still there. I have no idea what's wrong here...

@jasonwilliams200OK a DNX console app doesn't produce an .exe. And a classic .NET 4.6 console application does not work, since there are type forwards from System.Runtime.Numerics to System.Numerics. Thus, just putting the modified DLL into the build / run directory as described here doesn't work.

@ghost

ghost commented Jun 30, 2015

Ah, I thought you were using coreclr: CoreRun.exe yours.exe.

@axelheer

Contributor

axelheer commented Jul 1, 2015

@dotnet-bot test this please

@axelheer

Contributor

axelheer commented Jul 2, 2015

Update: it seems the current mod implementation is faster at 64 bits, and since that is the most used operation here (when we don't use the "reducer"), we have a regression. (And the "reducer" isn't that fast at 64 bits either...) But the mod operation was already replaced in #1618, where I only compared performance data for div. (I assumed no difference, because the algorithms are the same, but BigIntegerBuilder is better in this special case...) Yeah, that sucks.

Thus, I don't think there is much more to optimize right here. But I should review integer division for 64 bit numbers (for 64 bit divisors?) and prepare a separate PR once I've figured out a solution.

@stephentoub What do you think?
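The 64-bit case under discussion is dominated by division with very few digits. As a point of reference, the schoolbook fast path for dividing a multi-word number by a single word can be sketched like this in Python (a hedged illustration of the classic short division; the corefx code works on uint arrays in C# and this is not that implementation):

```python
def divide_by_word(digits, divisor):
    """Divide a little-endian base-2**32 number by one 32-bit word.

    Returns (quotient_digits, remainder). This is the short division
    that makes small divisors cheap: one machine-word divide per digit.
    """
    quotient = [0] * len(digits)
    remainder = 0
    for i in reversed(range(len(digits))):       # most significant digit first
        value = (remainder << 32) | digits[i]
        quotient[i], remainder = divmod(value, divisor)
    return quotient, remainder
```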

@stephentoub

Member

stephentoub commented Jul 2, 2015

But the mod operation has already been replaced with #1618, because I only compared performance data of div there

Just so I understand, you're saying we already took a regression in the previous change, and that's what you'll be looking to repair in a subsequent one? Can you quantify that regression?

@axelheer

Contributor

axelheer commented Jul 2, 2015

Just so I understand, you're saying we already took a regression in the previous change, and that's what you'll be looking to repair in a subsequent one?

Yeah. If that's a problem we can roll back this stuff and wait until I've figured out why it gets stuck at 64 bits.

Can you quantify that regression?

It's the same situation for mod as here for modpow: only a regression at 64 bits, but about 80%.

@axelheer

Contributor

axelheer commented Jul 3, 2015

I finally rewrote the integer division; I combined the algorithms of BigIntegerBuilder and BigIntegerCalculator. Bonus: no more allocations within integer division.

The performance regression seems to be fixed now:

ModPow

# of bits | # of vals | before (ms) | after (ms)
--- | --- | --- | ---
16 | 100,000 | 64 | 23
64 | 10,000 | 145 | 131
256 | 1,000 | 298 | 241
1,024 | 100 | 1,344 | 634
4,096 | 10 | 7,778 | 2,542
16,384 | 1 | 48,292 | 9,573

I'm still not happy with the division code, maybe I tweak it a bit tomorrow.

@axelheer

Contributor

axelheer commented Jul 5, 2015

@stephentoub since the last commit the temporary memory for the buffers can be shared between them, which means n / 2 - 1 fewer allocations (okay, just 3 instead of 4 in our case). I'm quite sure doing it with fewer allocations isn't possible.

Agreed?

Debug.Assert(leftLength >= 0);
Debug.Assert(rightLength >= 0);
Debug.Assert(leftLength >= rightLength);
Debug.Assert(q <= 0xFFFFFFFF);

@stephentoub

stephentoub Jul 7, 2015

Member

Debug.Assert(q <= uint.MaxValue) ?

@axelheer

axelheer Jul 7, 2015

Contributor

I prefer explicit values, is that ok?

@stephentoub

stephentoub Jul 7, 2015

Member

It's ok, and you can leave it if you like. I just worry about typos, accidentally having one too few or one too many Fs, etc.

Debug.Assert(shift >= 0 && shift < 32);
Debug.Assert(leftLength >= 0);
Debug.Assert(rightLength >= 0);
Debug.Assert(leftLength >= rightLength);

@stephentoub

stephentoub Jul 7, 2015

Member

Doesn't matter much, but if we're asserting that leftLength >= rightLength and that rightLength >= 0, is it necessary to also assert that leftLength >= 0 as is done above?

@axelheer

axelheer Jul 7, 2015

Contributor

Nope, not necessary; I just try to write asserts as verbosely as is meaningful.

@stephentoub

Member

stephentoub commented Jul 7, 2015

Also, I know we don't have any formal performance harnesses set up, but are your perf tests checked in anywhere? If not, could you please add them? You could check them in as unit tests with an [ActiveIssue("PerfTest")] or something like that for now, until we have a better performance-measurement system in place.

@axelheer

Contributor

axelheer commented Jul 7, 2015

@stephentoub thanks, I pushed a few changes addressing your input so far. For your debug / release concern I haven't come up with a solution; maybe you have an idea what I should do? The performance tests I've made are based on a very primitive console app (I linked to its gist in the PR message); it's far from a bunch of unit tests. If you want me to check in that code (although I don't think it's that exciting), where? If you want me to write some tests based on it, how?

@stephentoub

Member

stephentoub commented Jul 7, 2015

For your debug / release concern I haven't come up with a solution; maybe you have an idea what I should do?

What is the time distribution like amongst the various tests? Are there any we could move to be [OuterLoop] such that we'd be able to always use the 32 value, still have reasonably good inner-loop test coverage, have that inner-loop coverage run relatively quickly, but still have the full suite available to run as part of outer loop?

If you want me to write some tests based on it, how?

I was thinking you could do something like this:

  1. Create a PerformanceTests.cs file in the tests project.
  2. Add to it one [Fact] per perf test, something like:
[Fact]
[ActiveIssue("PerformanceTest")]
public static void ModPow()
{
    ... // most of the contents of your existing console test for ModPow
}

Then when someone wants to run the test, they can just remove the [ActiveIssue] attribute, or something like that. And it should hopefully make it easier for us to promote these to real and harnessed performance tests when we have the infrastructure available publicly.

Is that reasonable? I was aiming for something easy for you to do and with minimal ceremony.
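The disabled-by-default pattern proposed above is the same trick other test frameworks offer. For comparison, a Python unittest sketch of the idea (an illustrative analogy only, not the xunit code; the timing body is a placeholder):

```python
import time
import unittest

@unittest.skip("PerformanceTest")  # analogous to [ActiveIssue]: remove to run
class ModPowPerf(unittest.TestCase):
    def test_mod_pow(self):
        # Placeholder perf body: time a batch of modular exponentiations.
        start = time.perf_counter()
        for _ in range(10_000):
            pow(0xDEADBEEF, 0xCAFE, 0xFFFFFFFB)
        print("elapsed:", time.perf_counter() - start)
```

The runner still discovers the test, so it stays visible in reports but never burdens an ordinary run.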

@axelheer

Contributor

axelheer commented Jul 7, 2015

What is the time distribution like amongst the various tests?

Just ModPow3LargeInt is already a bit slow for values < 32.

Are there any we could move to be [OuterLoop] such that we'd be able to always use the 32 value

What does OuterLoop mean? I didn't find any documentation on it. Note that we don't have tests for values >= 32 -- if I remove the debug / release stuff, we won't cover the reducer code. And skipping the whole reducer code for an "ordinary" test run...?

I might add that the changed behavior should be quite harmless: it's just a threshold for when to switch to the "more sophisticated" code, which doesn't depend on that value. On the contrary, if it works even for smaller values, that's a good thing.

I was thinking you could do something like this [...]

I see; I can add that for all those operations, but I'd prefer to do it separately at a later date.

By the way, I did more cleanup on the "modpow" code based on your inputs and will push that shortly.

@stephentoub

Member

stephentoub commented Jul 7, 2015

What does OuterLoop mean?

We currently have two main classes of tests. The first, called "inner loop", are meant to be fast, do basic validation, and run whenever you do a build locally or as part of CI on the server. The second, called "outer loop", are meant to be more robust and are ok to take longer; these don't run by default when you build locally, nor as part of normal CI runs... we have a special job on the server that runs them a few times a day. You can also run them locally with xunit by not passing "-notrait category=outerloop" on the command-line (this argument is specified by default when you do a test run locally).

Note that we don't have tests for >= 32 values -- if I remove the debug / release stuff, we won't cover the reducer code

That feels like a gap... doesn't that mean that the reducer stuff won't be used at all when tests are run on release builds?

it's just a threshold when to switch to the "more sophisticated" code

That also means that we're not testing the normal algorithm in debug for values between 8 and 32 bits, yet that's the algorithm we'll be using on that range in production. That feels wrong.

How important is it to performance that ReducerThreshold be a const? If it were, for example, a static int that was read each time it was needed, would that be measurable? If that wouldn't affect anything, we could potentially test things by mucking with the threshold value from the tests via reflection at runtime. Not ideal, for sure, but could be the least evil thing.

I can add that for all those operations, but I'd prefer to do that separately on a subsequent date.

That's fine, thanks.

@axelheer

Contributor

axelheer commented Jul 7, 2015

Hm, we have that debug / release thing three times already: multiply, square, modpow.

I can remove this "trick" and change the thresholds to static values as you've suggested -- which should work, since xunit runs the tests single threaded at class level -- and add further tests / extend existing tests to change the thresholds accordingly. That should do it, yeah. Not ideal, but better than the current approach.
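The pattern being agreed on -- a mutable threshold that a test lowers temporarily to force the sophisticated path onto small, fast inputs, then restores -- can be sketched like this (Python stand-ins for illustration; the real code uses a static ReducerThreshold-style field in C#, adjusted from the xunit tests):

```python
THRESHOLD = 32   # stand-in for a static threshold such as ReducerThreshold
paths = []       # records which branch ran, so a test can verify coverage

def mod_pow(base, exponent, modulus):
    if modulus.bit_length() < THRESHOLD:
        paths.append("ordinary")         # plain remainder per step
    else:
        paths.append("reducer")          # Barrett-style reducer path
    return pow(base, exponent, modulus)  # both branches agree on the result

def test_reducer_path_with_small_inputs():
    global THRESHOLD
    saved, THRESHOLD = THRESHOLD, 1      # force the reducer path everywhere
    try:
        assert mod_pow(3, 5, 7) == 5 and paths[-1] == "reducer"
    finally:
        THRESHOLD = saved                # restore; safe since tests in a class
                                         # run single threaded
```

This only covers the branch selection; the correctness of each branch still needs its own tests, which is what the extended threshold tests provide.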

@axelheer

Contributor

axelheer commented Jul 7, 2015

@stephentoub the last commit should test things as discussed. I actually like that solution now. The (new) tests for three big values became OuterLoop tests, since they only add coverage for a very small function but take a long time.

@stephentoub

Member

stephentoub commented Jul 8, 2015

Left a few minor comments, but otherwise LGTM. Feel free to squash when ready.

@stephentoub

Member

stephentoub commented Jul 8, 2015

LGTM. Thanks for doing this, @axelheer.

@KrzysztofCwalina, look good to you?

@KrzysztofCwalina

Member

KrzysztofCwalina commented Jul 8, 2015

Looks good. Thanks!

Improve performance of BigInteger.Pow/ModPow
- To introduce further performance tweaks, the exponentiation algorithms
  are rewritten and ported to `BigIntegerCalculator`.
- To scale better for bigger numbers, a `FastReducer` based on
  multiplications is used instead of "ordinary" modulo operations.
- Furthermore the newly introduced `FastReducer` triggers a bad corner
  case within the division algorithm, which gets fixed too.
- A performance regression at 64 bits within integer division was found,
  which gets fixed too (no more allocations within that code).
- The test code for threshold values of square / multiply / modpow now
  modifies these thresholds for more thorough testing.
@axelheer

Contributor

axelheer commented Jul 9, 2015

@dotnet-bot test this please

@stephentoub

Member

stephentoub commented Jul 9, 2015

Thanks!

stephentoub added a commit that referenced this pull request Jul 9, 2015

Merge pull request #2182 from axelheer/biginteger-performance
Improve performance of BigInteger.Pow/ModPow

@stephentoub stephentoub merged commit 3279ca8 into dotnet:master Jul 9, 2015

1 check passed

default Build finished. No test results found.

@axelheer axelheer deleted the axelheer:biginteger-performance branch Jul 9, 2015

@karelz karelz modified the milestone: 1.0.0-rtm Dec 3, 2016
