-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make f1600x4 public #117
Make f1600x4 public #117
Conversation
5f52126
to
816b794
Compare
@armfazh Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments above
@@ -1,8 +1,8 @@ | |||
package internal | |||
|
|||
import ( | |||
"github.com/cloudflare/circl/hash/f1600x4" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming the package to some standard name following FIPS202 (e.g. KeccakF1600) will be appreciated.
f1600x4_other.go -> f1600x4_generic.go
The generic implementation must work too, for example using circl/internal/shake/keccakf.go.
OTOH: (If we want people don't use the generic implementation, the entire package must be marked as only for amd64.)
Handle the case of AMD64 machines without support for AVX2. -- The Available
flag delegates this case to the user, who might not handle this situation properly.
Some guidance must be given in the documentation that shows how to effectively load four states interleaved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming the package to some standard name following FIPS202 (e.g. KeccakF1600) will be appreciated.
I renamed it to keccakf1600x4
.
f1600x4_other.go -> f1600x4_generic.go
The generic implementation must work too, for example using circl/internal/shake/keccakf.go.
OTOH: (If we want people don't use the generic implementation, the entire package must be marked as only for amd64.)
Even on amd64, people should not use the package if AVX2 is not available. I dislike a generic implementation as it's slower (due to transposing.) Also I would like to add an ARM optimized version.
Handle the case of AMD64 machines without support for AVX2. -- The
Available
flag delegates this case to the user, who might not handle this situation properly.
Same argument as above: if AVX2 is not available a generic implementation would be prohibitively slow.
Some guidance must be given in the documentation that shows how to effectively load four states interleaved.
I'll add an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the example and some more documentation on its intended purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree with @armfazh here. Ideally there's a generic implementation, doesn't have to be fast, but it works. If not, guard the entire package with amd64
build tags, so it's literally impossible to build against this package on other platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not, guard the entire package with
amd64
build tags, so it's literally impossible to build against this package on other platforms.
That doesn't really solve the issue as not every amd64 system has AVX2 to begin with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is clear that the generic implementation will be slower, but it should have one because amd64 pre-AVX2 machines still need to be able to compile and use this external package.
It would be great to have build tags like machine+feature, so one can make the package compiles only for amd64+avx2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for the generic implementation, you don't need to perform transpose every time provided you read data from the right place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is clear that the generic implementation will be slower, but it should have one because amd64 pre-AVX2 machines still need to be able to compile and use this external package.
They can use it! A pre-AVX2 machine can compile this code just fine. If the use is guarded by the Available
flag, everything is just fine. (Dilithium, for instance, runs just fine on my pre-AVX2 MacBook because it uses the Available
flag to check before using f1600x4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for the generic implementation, you don't need to perform transpose every time provided you read data from the right place.
But you can't view this function in isolation: for every application there is some kind of transposition going on. The generic Dilithium code is faster than the code that uses f1600x4 if the latter has a generic implementation. This is because the latter needs to interleave the data into the state whereas the former doesn't.
481e608
to
5829ed1
Compare
I also wonder if it's better to name this package Suppose there were |
It doesn't make sense to have a separate package for plain f[1600]. It does make sense to expose X2 and X8 if available. I'll rename accordingly. |
hash/keccakf1600x4/example_test.go
Outdated
// state is initialized with zeroes. As the messages fit within one | ||
// block, we only need to write the messages, domain separators | ||
// and padding. | ||
for i := 0; i < 4; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic can be abstrated by a function beloging to the package.
It could be a function that receives four pointers and loads them into the internal state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that will be a trap! I sped up SPHINCS+ by a factor of 2 precisely by removing such a function and doing it ad hoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And that is in C. In Go the difference will be more stark.
I'm likely missing context from your team discussions, but I'm wondering why this package is being made public? My feeling is:
Generic implementation can also be useful for differential testing #4. Relevant: |
The comments of Michael and me comes from the fact that a go package must provide the same functionality regardless the architecture. This is the expected behaviour of a public, single package. I understand the benefits of using the vectorized code branch. and I also understand the risk of providing a generic implementation, e.g. some user could think that the permutation is faster even on a pre-AVX2 machine. However, in a pre-AVX2 machine and in all other architectures, this package doesn't provide any functionality, instead the user has to write code for that (as you did it for dilithium). I think one point in the middle of this trade-off is to provide an implementation only for AMD64, with two code branches. The vectorized and the scalar one. |
I use it in go-xmssmt. Currently I have a copy of the package there, but I don't it's good to maintain a copy.
The target audience is very small: it will be used by implementors of cryptographic primitives. I suspect there might be 10 packages max that will refer to it. I think the panic is actually helpful: it points out a performance bug and wrong use of the package which you might have otherwise missed.
I'll do a benchmark. |
So I ran the Dilithium public key unpacking (which contains the derivation of the matrix A, which uses a lot of SHAKE) using either the standard SHAKE code or the generic slow f1600x4 code. The result surprised me: the generic f1600x4 has a non negligible overhead as compared to simple f1600, but it's still faster than the standard SHAKE API.
|
I've renamed the package (again) and I've added a generic fallback. I would like to protest again: people will use the package without checking the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simd package, it opens the possibility to add other simd implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apply the fix, and restore the go.mod files.
Then it will be ready to merge.
b3e85b7
to
7026f54
Compare
Co-authored-by: Armando Faz <armfazh@users.noreply.github.com>
No description provided.