secp256k1: Add TinyGo support. #3223

Merged: 1 commit into decred:master, Mar 19, 2024
Conversation

@seedhammer (Contributor) commented Mar 16, 2024

The pre-computed table for speeding up ScalarBaseMultNonConst is several hundred kilobytes in the binary and even more when unpacked into working memory. Special-case ScalarBaseMultNonConst to fall back to ScalarMultNonConst when the 'tinygo' tag is specified, which is true when building a Go program with TinyGo.
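To make the mechanism concrete, here is a minimal sketch of how such a build-tag fallback can be wired up. The helper and function names are illustrative, not the PR's actual code; only ScalarMultNonConst, JacobianPoint, ModNScalar, and the FieldVal setters are the package's real API.

```go
//go:build tinygo

// Sketch of the idea behind the PR's curve_embedded.go: when the 'tinygo'
// build tag is set, skip the large generator table and route the base-point
// multiplication through the generic point multiplication instead.
package secp256k1

import "encoding/hex"

// generatorJacobian returns the secp256k1 generator G in Jacobian
// coordinates (Z = 1), built from its well-known affine coordinates.
// Hypothetical helper, for illustration only.
func generatorJacobian() JacobianPoint {
	gx, _ := hex.DecodeString("79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798")
	gy, _ := hex.DecodeString("483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8")
	var g JacobianPoint
	g.X.SetByteSlice(gx)
	g.Y.SetByteSlice(gy)
	g.Z.SetInt(1)
	return g
}

// scalarBaseMultNonConstSlow computes k*G without the pre-computed table by
// routing through the generic scalar multiplication, trading speed for a
// much smaller binary and RAM footprint.
func scalarBaseMultNonConstSlow(k *ModNScalar, result *JacobianPoint) {
	g := generatorJacobian()
	ScalarMultNonConst(k, &g, result)
}
```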

@davecgh (Member) left a comment

Thanks for the PR. This looks good other than a couple of inline nits I've identified.

For what it's worth, while this approach is fine as a method to immediately allow it to work with TinyGo, I think it would ultimately make more sense to modify the core window-based logic in the innards of ScalarBaseMultNonConst to support a smaller window size in exchange for a bit of calculation so that it still works much more quickly on TinyGo than doing a round trip through the arbitrary point multiplication as this PR does. In that way, it would essentially allow a tradeoff between the memory usage and calculation speed. Moreover, it would allow it to avoid memory allocations which have GC implications.

For example, currently it has a window size of 256 with pure lookups which results in about 240KiB memory usage. If instead it went with a window size of something like 32 while calculating each window, it would only need about 1KiB memory usage and would still be quite a bit faster. A rough guess is that it would probably only be twice as slow using that approach versus the current method which is around 5x slower.

BenchmarkScalarBaseMultNonConstFast             64776             18540 ns/op               0 B/op          0 allocs/op
BenchmarkScalarBaseMultNonConstSlow             13072             91975 ns/op              64 B/op          2 allocs/op
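For readers curious what the suggested tradeoff could look like in code, below is a rough, untested sketch of a fixed-window base multiply with a small on-the-fly table. It is deliberately simple (no wNAF, no endomorphism, no constant-time concerns), is written as if it sits inside the secp256k1 package, and reuses the hypothetical generatorJacobian helper sketched under the PR description; only AddNonConst, DoubleNonConst, JacobianPoint, and ModNScalar are the package's real API.

```go
// Sketch only: plain left-to-right fixed-window multiplication of the base
// point. With w = 4 the table is 16 JacobianPoints (~1.9 KiB) instead of
// the ~960 KiB pre-computed table, at the cost of extra doublings/additions.
func scalarBaseMultWindowed(k *ModNScalar, result *JacobianPoint) {
	const w = 4

	// Build table[i] = i*G (table[0] stays at infinity, i.e. the zero
	// value with Z = 0).
	var table [1 << w]JacobianPoint
	g := generatorJacobian() // hypothetical helper, returns G with Z = 1
	table[1].Set(&g)
	DoubleNonConst(&g, &table[2])
	for i := 3; i < len(table); i++ {
		AddNonConst(&table[i-1], &g, &table[i])
	}

	// Left to right: for each 4-bit window, shift the accumulator by 2^w
	// (w doublings) and add the table entry for that window.
	kb := k.Bytes() // 32 big-endian bytes
	var acc JacobianPoint
	for _, b := range kb {
		for _, window := range [2]byte{b >> w, b & (1<<w - 1)} {
			for j := 0; j < w; j++ {
				var d JacobianPoint
				DoubleNonConst(&acc, &d)
				acc = d
			}
			if window != 0 {
				var s JacobianPoint
				AddNonConst(&acc, &table[window], &s)
				acc = s
			}
		}
	}
	result.Set(&acc)
}
```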

Inline review threads on dcrec/secp256k1/curve_embedded.go and dcrec/secp256k1/curve_precompute.go (outdated, resolved).
@davecgh changed the title from "secp256k1: add support for resource constrained environments (TinyGo)" to "secp256k1: Add TinyGo support." on Mar 19, 2024
@seedhammer (Contributor, Author)

> For what it's worth, while this approach is fine as a method to immediately allow it to work with TinyGo, I think it would ultimately make more sense to modify the core window-based logic in the innards of ScalarBaseMultNonConst to support a smaller window size in exchange for a bit of calculation so that it still works much more quickly on TinyGo than doing a round trip through the arbitrary point multiplication as this PR does. In that way, it would essentially allow a tradeoff between the memory usage and calculation speed. Moreover, it would allow it to avoid memory allocations which have GC implications.

Tweaking the window size is ideal, but I couldn't figure out a window size small enough to matter. See below.

> For example, currently it has a window size of 256 with pure lookups which results in about 240KiB memory usage. If instead it went with a window size of something like 32 while calculating each window, it would only need about 1KiB memory usage and would still be quite a bit faster.

fmt.Println("unsafe.Sizeof([32][256]JacobianPoint{})", unsafe.Sizeof([32][256]JacobianPoint{}))
fmt.Println("unsafe.Sizeof([32][32]JacobianPoint{})", unsafe.Sizeof([32][32]JacobianPoint{}))
fmt.Println("unsafe.Sizeof([32][1]JacobianPoint{})", unsafe.Sizeof([32][1]JacobianPoint{}))

results in

unsafe.Sizeof([32][256]JacobianPoint{}) 983040
unsafe.Sizeof([32][32]JacobianPoint{}) 122880
unsafe.Sizeof([32][1]JacobianPoint{}) 3840

which are larger than your 240KiB/1KiB numbers. What did I miss?

@davecgh (Member) commented Mar 19, 2024

> which are larger than your 240KiB/1KiB numbers. What did I miss?

You're right. I left off a factor of 4 there for the uint32s when calculating, so both should be 4x higher. Specifically, it should've been 32*256*3*10*4 = 983040 ~= 960 KiB which matches the printout from unsafe.Sizeof.

With a window size of 2^5 = 32 instead, it would only need to store 32*3*10*4 = 3840 ~= 3.75KiB in exchange for the extra calculations that would be needed to do the windowed NAF conversions and multiplications of the 256/32 = 8 windows.
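For reference, those factors line up with the in-memory layout (each FieldVal is ten uint32 words and a JacobianPoint is three FieldVals), which can be confirmed the same way as the earlier snippet:

```go
fmt.Println(unsafe.Sizeof(FieldVal{}))               // 40     (10 uint32 words)
fmt.Println(unsafe.Sizeof(JacobianPoint{}))          // 120    (X, Y, Z)
fmt.Println(unsafe.Sizeof([32][256]JacobianPoint{})) // 983040 (~960 KiB)
fmt.Println(unsafe.Sizeof([32]JacobianPoint{}))      // 3840   (~3.75 KiB)
```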

@davecgh merged commit 2ee2ebe into decred:master on Mar 19, 2024
2 checks passed
@davecgh (Member) commented Mar 19, 2024

For reference, I opened #3225 to make it so the slow path (i.e. on TinyGo) remains zero-allocation as well, to avoid GC implications.

@seedhammer (Contributor, Author) commented Mar 19, 2024

> With a window size of 2^5 = 32 instead, it would only need to store 32*3*10*4 = 3840 ~= 3.75KiB in exchange for the extra calculations that would be needed to do the windowed NAF conversions and multiplications of the 256/32 = 8 windows.

Thanks. I took another look at implementing this, but it seems to be quite some work. Am I right that the existing naf function is binary and the optimization needs a w-naf with 2^w digits?

It also seems to me (from casually glancing over the Bitcoin secp256k1 implementation) that pre-computing a small (say, w=4 or 5) window is advantageous even for the general ScalarMultNonConst. If so, ScalarBaseMultNonConstSlow would be faster even without extra RAM or flash ROM space. Pre-computing multiples of the generator offline would simply be an additional optimization on top.

I suppose I either missed something, or the reason ScalarMultNonConst doesn't pre-compute a window is because there's nowhere to stash the extra values in a GC-free manner. The Bitcoin secp256k1 library takes a context value in its API, presumably for this reason.

@davecgh (Member) commented Mar 19, 2024

Yes, it would be quite a bit of work to implement. That's a big reason why I didn't have a big issue with the approach in this PR, but I figured it was worth mentioning.

> Am I right that the existing naf function is binary and the optimization needs a w-naf with 2^w digits?

Correct. GECC (Guide to Elliptic Curve Cryptography) section 3.30 provides some information and algorithms for window methods, but the width-w NAF algorithms it provides really aren't very optimized from what I recall. The existing binary NAF uses a significantly faster algorithm (#2695, ~93%) introduced by Prodinger, but it doesn't apply directly to width-w NAF.

I mention that because it may or may not be the case that a small window that needs a much more expensive NAF calculation (as well as some other calculations for dealing with the endomorphism in Jacobian coords) will be all that much faster in practice. I suspect that it would be faster since point additions are relatively costly and using a window would cut down on those, but without actually implementing and testing it, it's hard to say with any certainty.
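For context, the textbook width-w NAF recoding (in the spirit of GECC's algorithm, not the package's code) looks roughly like the sketch below using math/big; compared to the Prodinger-style binary recoding it needs a modular reduction and a signed adjustment per nonzero digit, which is the extra cost being weighed here.

```go
package wnaf

import "math/big"

// wNAF returns the width-w NAF digits of k, least significant first.
// Every nonzero digit is odd with absolute value below 2^(w-1) (for w >= 2),
// and each nonzero digit is followed by at least w-1 zeros. Textbook sketch
// only; assumes k >= 0 and w >= 2.
func wNAF(k *big.Int, w uint) []int {
	digits := []int{}
	n := new(big.Int).Set(k)
	mod := new(big.Int).Lsh(big.NewInt(1), w) // 2^w
	half := new(big.Int).Rsh(mod, 1)          // 2^(w-1)
	for n.Sign() > 0 {
		var d int64
		if n.Bit(0) == 1 {
			// Signed residue of n modulo 2^w.
			r := new(big.Int).Mod(n, mod)
			if r.Cmp(half) > 0 {
				r.Sub(r, mod)
			}
			d = r.Int64()
			// Subtracting the digit clears the low w bits of n.
			n.Sub(n, r)
		}
		digits = append(digits, int(d))
		n.Rsh(n, 1)
	}
	return digits
}
```

A windowed multiply would then scan these digits most-significant first, doubling once per digit and adding or subtracting the pre-computed odd multiple for each nonzero digit.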

> I suppose I either missed something, or the reason ScalarMultNonConst doesn't pre-compute a window is because there's nowhere to stash the extra values in a GC-free manner.

Well, the primary reason is just that I never got around to putting the effort into implementing and testing it, given it's already extremely fast and I had reached the point of diminishing returns on optimizations. I believe there are also some other considerations and normalizations that would probably have to occur for a window approach due to the use of the endomorphism along with Jacobian projective space, but it's been a few years since I wrote and optimized all of that code, so I'd need to dig in again to verify all the math.

On the topic of optimizations and diminishing returns, there are some others that would also likely result in some additional speedups in signature verification such as Shamir's trick (multiple point multiplication).
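As a concrete illustration of that last point, Shamir's trick computes u1*G + u2*Q with a single shared chain of doublings rather than two independent multiplications. Here is a bare-bones, bit-at-a-time sketch using the package's public primitives, written as if inside the secp256k1 package; the function name and structure are illustrative, and a real version would layer NAF and the endomorphism on top.

```go
// Sketch of Shamir's trick for u1*G + u2*Q: share one chain of 256
// doublings between both scalars instead of doubling 256 times for each.
// Not the package's code; bit-at-a-time for clarity.
func doubleScalarMult(u1, u2 *ModNScalar, g, q, result *JacobianPoint) {
	// Precompute the combined addend used when both bits are set.
	var gPlusQ JacobianPoint
	AddNonConst(g, q, &gPlusQ)

	b1, b2 := u1.Bytes(), u2.Bytes() // 32 big-endian bytes each
	var acc JacobianPoint            // starts at infinity (Z = 0)
	for i := 0; i < 32; i++ {
		for bit := 7; bit >= 0; bit-- {
			var d JacobianPoint
			DoubleNonConst(&acc, &d)
			acc = d

			k1 := (b1[i] >> uint(bit)) & 1
			k2 := (b2[i] >> uint(bit)) & 1
			var addend *JacobianPoint
			switch {
			case k1 == 1 && k2 == 1:
				addend = &gPlusQ
			case k1 == 1:
				addend = g
			case k2 == 1:
				addend = q
			}
			if addend != nil {
				var s JacobianPoint
				AddNonConst(&acc, addend, &s)
				acc = s
			}
		}
	}
	result.Set(&acc)
}
```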

> The Bitcoin secp256k1 library takes a context value in its API, presumably for this reason.

I've not looked at their implementation for any of this, but if there is a need to store additional information, a context value is a tried and true method for sure.

Our existing Go code effectively more or less does that by housing the NAF state in a struct that is kept on the stack.
