secp256k1: Add TinyGo support. #3223

Merged: 1 commit into decred:master, Mar 19, 2024
Conversation

@seedhammer (Contributor) commented Mar 16, 2024

The pre-computed table for speeding up ScalarBaseMultNonConst is several hundred kilobytes in the binary and even more when unpacked into working memory. Special-case ScalarBaseMultNonConst to fall back to ScalarMultNonConst when the 'tinygo' tag is specified, which is true when building a Go program with TinyGo.
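To make the mechanism concrete, here is a minimal sketch of how such a build-tag fallback can be wired up. The helper and function names are illustrative, not the PR's actual code; only ScalarMultNonConst, JacobianPoint, ModNScalar, and the FieldVal setters are the package's real API.

```go
//go:build tinygo

// Sketch of the idea behind the PR's curve_embedded.go: when the 'tinygo'
// build tag is set, skip the large generator table and route the base-point
// multiplication through the generic point multiplication instead.
package secp256k1

import "encoding/hex"

// generatorJacobian returns the secp256k1 generator G in Jacobian
// coordinates (Z = 1), built from its well-known affine coordinates.
// Hypothetical helper, for illustration only.
func generatorJacobian() JacobianPoint {
	gx, _ := hex.DecodeString("79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798")
	gy, _ := hex.DecodeString("483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8")
	var g JacobianPoint
	g.X.SetByteSlice(gx)
	g.Y.SetByteSlice(gy)
	g.Z.SetInt(1)
	return g
}

// scalarBaseMultNonConstSlow computes k*G without the pre-computed table by
// routing through the generic scalar multiplication, trading speed for a
// much smaller binary and RAM footprint.
func scalarBaseMultNonConstSlow(k *ModNScalar, result *JacobianPoint) {
	g := generatorJacobian()
	ScalarMultNonConst(k, &g, result)
}
```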

@davecgh (Member) left a comment

Thanks for the PR. This looks good other than a couple of inline nits I've identified.

For what it's worth, while this approach is fine as a method to immediately allow it to work with TinyGo, I think it would ultimately make more sense to modify the core window-based logic in the innards of ScalarBaseMultNonConst to support a smaller window size in exchange for a bit of calculation so that it still works much more quickly on TinyGo than doing a round trip through the arbitrary point multiplication as this PR does. In that way, it would essentially allow a tradeoff between the memory usage and calculation speed. Moreover, it would allow it to avoid memory allocations which have GC implications.

For example, currently it has a window size of 256 with pure lookups which results in about 240KiB memory usage. If instead it went with a window size of something like 32 while calculating each window, it would only need about 1KiB memory usage and would still be quite a bit faster. A rough guess is that it would probably only be twice as slow using that approach versus the current method which is around 5x slower.

BenchmarkScalarBaseMultNonConstFast             64776             18540 ns/op               0 B/op          0 allocs/op
BenchmarkScalarBaseMultNonConstSlow             13072             91975 ns/op              64 B/op          2 allocs/op
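For readers curious what the suggested tradeoff could look like in code, below is a rough, untested sketch of a fixed-window base multiply with a small on-the-fly table. It is deliberately simple (no wNAF, no endomorphism, no constant-time concerns), is written as if it sits inside the secp256k1 package, and reuses the hypothetical generatorJacobian helper sketched under the PR description; only AddNonConst, DoubleNonConst, JacobianPoint, and ModNScalar are the package's real API.

```go
// Sketch only: plain left-to-right fixed-window multiplication of the base
// point. With w = 4 the table is 16 JacobianPoints (~1.9 KiB) instead of
// the ~960 KiB pre-computed table, at the cost of extra doublings/additions.
func scalarBaseMultWindowed(k *ModNScalar, result *JacobianPoint) {
	const w = 4

	// Build table[i] = i*G (table[0] stays at infinity, i.e. the zero
	// value with Z = 0).
	var table [1 << w]JacobianPoint
	g := generatorJacobian() // hypothetical helper, returns G with Z = 1
	table[1].Set(&g)
	DoubleNonConst(&g, &table[2])
	for i := 3; i < len(table); i++ {
		AddNonConst(&table[i-1], &g, &table[i])
	}

	// Left to right: for each 4-bit window, shift the accumulator by 2^w
	// (w doublings) and add the table entry for that window.
	kb := k.Bytes() // 32 big-endian bytes
	var acc JacobianPoint
	for _, b := range kb {
		for _, window := range [2]byte{b >> w, b & (1<<w - 1)} {
			for j := 0; j < w; j++ {
				var d JacobianPoint
				DoubleNonConst(&acc, &d)
				acc = d
			}
			if window != 0 {
				var s JacobianPoint
				AddNonConst(&acc, &table[window], &s)
				acc = s
			}
		}
	}
	result.Set(&acc)
}
```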

Inline review threads on dcrec/secp256k1/curve_embedded.go and dcrec/secp256k1/curve_precompute.go (outdated, resolved).
@davecgh changed the title from "secp256k1: add support for resource constrained environments (TinyGo)" to "secp256k1: Add TinyGo support." on Mar 19, 2024
@seedhammer (Contributor, Author)

> For what it's worth, while this approach is fine as a method to immediately allow it to work with TinyGo, I think it would ultimately make more sense to modify the core window-based logic in the innards of ScalarBaseMultNonConst to support a smaller window size in exchange for a bit of calculation so that it still works much more quickly on TinyGo than doing a round trip through the arbitrary point multiplication as this PR does. In that way, it would essentially allow a tradeoff between the memory usage and calculation speed. Moreover, it would allow it to avoid memory allocations which have GC implications.

Tweaking the window size is ideal, but I couldn't figure out a window size small enough to matter. See below.

> For example, currently it has a window size of 256 with pure lookups which results in about 240KiB memory usage. If instead it went with a window size of something like 32 while calculating each window, it would only need about 1KiB memory usage and would still be quite a bit faster.

fmt.Println("unsafe.Sizeof([32][256]JacobianPoint{})", unsafe.Sizeof([32][256]JacobianPoint{}))
fmt.Println("unsafe.Sizeof([32][32]JacobianPoint{})", unsafe.Sizeof([32][32]JacobianPoint{}))
fmt.Println("unsafe.Sizeof([32][1]JacobianPoint{})", unsafe.Sizeof([32][1]JacobianPoint{}))

results in

unsafe.Sizeof([32][256]JacobianPoint{}) 983040
unsafe.Sizeof([32][32]JacobianPoint{}) 122880
unsafe.Sizeof([32][1]JacobianPoint{}) 3840

which are larger than your 240KiB/1KiB numbers. What did I miss?

@davecgh (Member) commented Mar 19, 2024

> which are larger than your 240KiB/1KiB numbers. What did I miss?

You're right. I left off a factor of 4 there for the uint32s when calculating, so both should be 4x higher. Specifically, it should've been 32*256*3*10*4 = 983040 ~= 960 KiB which matches the printout from unsafe.Sizeof.

With a window size of 2^5 = 32 instead, it would only need to store 32*3*10*4 = 3840 ~= 3.75KiB in exchange for the extra calculations that would be needed to do the windowed NAF conversions and multiplications of the 256/32 = 8 windows.
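For reference, those factors line up with the in-memory layout (each FieldVal is ten uint32 words and a JacobianPoint is three FieldVals), which can be confirmed the same way as the earlier snippet:

```go
fmt.Println(unsafe.Sizeof(FieldVal{}))               // 40     (10 uint32 words)
fmt.Println(unsafe.Sizeof(JacobianPoint{}))          // 120    (X, Y, Z)
fmt.Println(unsafe.Sizeof([32][256]JacobianPoint{})) // 983040 (~960 KiB)
fmt.Println(unsafe.Sizeof([32]JacobianPoint{}))      // 3840   (~3.75 KiB)
```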

@davecgh merged commit 2ee2ebe into decred:master on Mar 19, 2024
2 checks passed
@davecgh (Member) commented Mar 19, 2024

For reference, I opened #3225 to make it so the slow path (i.e. on TinyGo) remains zero-allocation as well, to avoid GC implications.

@seedhammer (Contributor, Author) commented Mar 19, 2024

> With a window size of 2^5 = 32 instead, it would only need to store 32*3*10*4 = 3840 ~= 3.75KiB in exchange for the extra calculations that would be needed to do the windowed NAF conversions and multiplications of the 256/32 = 8 windows.

Thanks. I took another look at implementing this, but it seems to be quite some work. Am I right that the existing naf function is binary and the optimization needs a w-naf with 2^w digits?

It also seems to me (from casually glancing over the Bitcoin secp256k1 implementation) that pre-computing a small (say, w=4 or 5) window is advantageous even for the general ScalarMultNonConst. If so, ScalarBaseMultNonConstSlow would be faster even without extra RAM or flash ROM space. Pre-computing multiples of the generator offline would simply be an additional optimization on top.

I suppose I either missed something, or the reason ScalarMultNonConst doesn't pre-compute a window is because there's nowhere to stash the extra values in a GC-free manner. The Bitcoin secp256k1 library takes a context value in its API, presumably for this reason.

@davecgh (Member) commented Mar 19, 2024

Yes, it would be quite a bit of work to implement. That's a big reason why I didn't have a big issue with the approach in this PR, but I figured it was worth mentioning.

> Am I right that the existing naf function is binary and the optimization needs a w-naf with 2^w digits?

Correct. GECC (Guide to Elliptic Curve Cryptography) section 3.30 provides some information and algorithms for window methods, but the width-w NAF algorithms it provides really aren't very optimized from what I recall. The existing binary NAF uses a significantly faster algorithm (#2695, ~93%) introduced by Prodinger, but it doesn't apply directly to width-w NAF.

I mention that because it may or may not be the case that a small window that needs a much more expensive NAF calculation (as well as some other calculations for dealing with the endomorphism in Jacobian coords) will be all that much faster in practice. I suspect that it would be faster since point additions are relatively costly and using a window would cut down on those, but without actually implementing and testing it, it's hard to say with any certainty.
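For context, the textbook width-w NAF recoding (in the spirit of GECC's algorithm, not the package's code) looks roughly like the sketch below using math/big; compared to the Prodinger-style binary recoding it needs a modular reduction and a signed adjustment per nonzero digit, which is the extra cost being weighed here.

```go
package wnaf

import "math/big"

// wNAF returns the width-w NAF digits of k, least significant first.
// Every nonzero digit is odd with absolute value below 2^(w-1) (for w >= 2),
// and each nonzero digit is followed by at least w-1 zeros. Textbook sketch
// only; assumes k >= 0 and w >= 2.
func wNAF(k *big.Int, w uint) []int {
	digits := []int{}
	n := new(big.Int).Set(k)
	mod := new(big.Int).Lsh(big.NewInt(1), w) // 2^w
	half := new(big.Int).Rsh(mod, 1)          // 2^(w-1)
	for n.Sign() > 0 {
		var d int64
		if n.Bit(0) == 1 {
			// Signed residue of n modulo 2^w.
			r := new(big.Int).Mod(n, mod)
			if r.Cmp(half) > 0 {
				r.Sub(r, mod)
			}
			d = r.Int64()
			// Subtracting the digit clears the low w bits of n.
			n.Sub(n, r)
		}
		digits = append(digits, int(d))
		n.Rsh(n, 1)
	}
	return digits
}
```

A windowed multiply would then scan these digits most-significant first, doubling once per digit and adding or subtracting the pre-computed odd multiple for each nonzero digit.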

> I suppose I either missed something, or the reason ScalarMultNonConst doesn't pre-compute a window is because there's nowhere to stash the extra values in a GC-free manner.

Well, the primary reason is just that I never got around to putting the effort into implementing and testing it, given it's already extremely fast and I had reached the point of diminishing returns on optimizations. I believe there are also some other considerations and normalizations that would probably have to occur for a window approach due to the use of the endomorphism along with Jacobian projective space, but it's been a few years since I wrote and optimized all of that code, so I'd need to dig in again to verify all the math.

On the topic of optimizations and diminishing returns, there are some others that would also likely result in some additional speedups in signature verification such as Shamir's trick (multiple point multiplication).
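As a concrete illustration of that last point, Shamir's trick computes u1*G + u2*Q with a single shared chain of doublings rather than two independent multiplications. Here is a bare-bones, bit-at-a-time sketch using the package's public primitives, written as if inside the secp256k1 package; the function name and structure are illustrative, and a real version would layer NAF and the endomorphism on top.

```go
// Sketch of Shamir's trick for u1*G + u2*Q: share one chain of 256
// doublings between both scalars instead of doubling 256 times for each.
// Not the package's code; bit-at-a-time for clarity.
func doubleScalarMult(u1, u2 *ModNScalar, g, q, result *JacobianPoint) {
	// Precompute the combined addend used when both bits are set.
	var gPlusQ JacobianPoint
	AddNonConst(g, q, &gPlusQ)

	b1, b2 := u1.Bytes(), u2.Bytes() // 32 big-endian bytes each
	var acc JacobianPoint            // starts at infinity (Z = 0)
	for i := 0; i < 32; i++ {
		for bit := 7; bit >= 0; bit-- {
			var d JacobianPoint
			DoubleNonConst(&acc, &d)
			acc = d

			k1 := (b1[i] >> uint(bit)) & 1
			k2 := (b2[i] >> uint(bit)) & 1
			var addend *JacobianPoint
			switch {
			case k1 == 1 && k2 == 1:
				addend = &gPlusQ
			case k1 == 1:
				addend = g
			case k2 == 1:
				addend = q
			}
			if addend != nil {
				var s JacobianPoint
				AddNonConst(&acc, addend, &s)
				acc = s
			}
		}
	}
	result.Set(&acc)
}
```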

> The Bitcoin secp256k1 library takes a context value in its API, presumably for this reason.

I've not looked at their implementation for any of this, but if there is a need to store additional information, a context value is a tried and true method for sure.

Our existing Go code effectively more or less does that by housing the NAF state in a struct that is kept on the stack.
