Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make f1600x4 public #117

Merged
merged 3 commits into from
Jul 1, 2020
Merged

Make f1600x4 public #117

merged 3 commits into from
Jul 1, 2020

Conversation

bwesterb
Copy link
Member

No description provided.

@bwesterb bwesterb requested a review from armfazh May 21, 2020 17:41
@bwesterb bwesterb force-pushed the public-f1600x4 branch 2 times, most recently from 5f52126 to 816b794 Compare June 2, 2020 15:31
@bwesterb
Copy link
Member Author

bwesterb commented Jun 2, 2020

@armfazh Rebased.

Copy link
Contributor

@armfazh armfazh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments above

@@ -1,8 +1,8 @@
package internal

import (
"github.com/cloudflare/circl/hash/f1600x4"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming the package to some standard name following FIPS202 (e.g. KeccakF1600) will be appreciated.

f1600x4_other.go -> f1600x4_generic.go
The generic implementation must work too, for example using circl/internal/shake/keccakf.go.
OTOH: (If we want people don't use the generic implementation, the entire package must be marked as only for amd64.)

Handle the case of AMD64 machines without support for AVX2. -- The Available flag delegates this case to the user, who might not handle this situation properly.

Some guidance must be given in the documentation that shows how to effectively load four states interleaved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming the package to some standard name following FIPS202 (e.g. KeccakF1600) will be appreciated.

I renamed it to keccakf1600x4.

f1600x4_other.go -> f1600x4_generic.go
The generic implementation must work too, for example using circl/internal/shake/keccakf.go.
OTOH: (If we want people don't use the generic implementation, the entire package must be marked as only for amd64.)

Even on amd64, people should not use the package if AVX2 is not available. I dislike a generic implementation as it's slower (due to transposing.) Also I would like to add an ARM optimized version.

Handle the case of AMD64 machines without support for AVX2. -- The Available flag delegates this case to the user, who might not handle this situation properly.

Same argument as above: if AVX2 is not available a generic implementation would be prohibitively slow.

Some guidance must be given in the documentation that shows how to effectively load four states interleaved.

I'll add an example.

Copy link
Member Author

@bwesterb bwesterb Jun 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the example and some more documentation on its intended purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree with @armfazh here. Ideally there's a generic implementation, doesn't have to be fast, but it works. If not, guard the entire package with amd64 build tags, so it's literally impossible to build against this package on other platforms.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, guard the entire package with amd64 build tags, so it's literally impossible to build against this package on other platforms.

That doesn't really solve the issue as not every amd64 system has AVX2 to begin with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is clear that the generic implementation will be slower, but it should have one because amd64 pre-AVX2 machines still need to be able to compile and use this external package.

It would be great to have build tags like machine+feature, so one can make the package compiles only for amd64+avx2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for the generic implementation, you don't need to perform transpose every time provided you read data from the right place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is clear that the generic implementation will be slower, but it should have one because amd64 pre-AVX2 machines still need to be able to compile and use this external package.

They can use it! A pre-AVX2 machine can compile this code just fine. If the use is guarded by the Available flag, everything is just fine. (Dilithium, for instance, runs just fine on my pre-AVX2 MacBook because it uses the Available flag to check before using f1600x4.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for the generic implementation, you don't need to perform transpose every time provided you read data from the right place.

But you can't view this function in isolation: for every application there is some kind of transposition going on. The generic Dilithium code is faster than the code that uses f1600x4 if the latter has a generic implementation. This is because the latter needs to interleave the data into the state whereas the former doesn't.

hash/f1600x4/asm/go.mod Outdated Show resolved Hide resolved
@bwesterb bwesterb force-pushed the public-f1600x4 branch 3 times, most recently from 481e608 to 5829ed1 Compare June 7, 2020 11:33
@mmcloughlin
Copy link
Contributor

I also wonder if it's better to name this package keccakf1600, and then the types/functions in the package have X4 on them?

Suppose there were x1, x2, x8 variants too... would you expect them to be in the same or different packages?

@bwesterb
Copy link
Member Author

bwesterb commented Jun 7, 2020

I also wonder if it's better to name this package keccakf1600, and then the types/functions in the package have X4 on them?

Suppose there were x1, x2, x8 variants too... would you expect them to be in the same or different packages?

It doesn't make sense to have a separate package for plain f[1600]. It does make sense to expose X2 and X8 if available. I'll rename accordingly.

// state is initialized with zeroes. As the messages fit within one
// block, we only need to write the messages, domain separators
// and padding.
for i := 0; i < 4; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic can be abstrated by a function beloging to the package.
It could be a function that receives four pointers and loads them into the internal state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that will be a trap! I sped up SPHINCS+ by a factor of 2 precisely by removing such a function and doing it ad hoc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that is in C. In Go the difference will be more stark.

sphincs/sphincsplus#9

@mmcloughlin
Copy link
Contributor

I'm likely missing context from your team discussions, but I'm wondering why this package is being made public? My feeling is:

  • If it's a public package, it should have a generic fallback, regardless of how slow that is. It can be documented that you probably don't want to use it without amd64+avx2, but it's very unfriendly if the package crashes without it.
  • If it's an internal package, code that works on avx2 only is completely fine.

Generic implementation can also be useful for differential testing #4.

Relevant:
#28
https://github.com/golang/go/wiki/TargetSpecific

@armfazh
Copy link
Contributor

armfazh commented Jun 9, 2020

But you can't view this function in isolation: for every application there is some kind of transposition going on.

The comments of Michael and me comes from the fact that a go package must provide the same functionality regardless the architecture. This is the expected behaviour of a public, single package.

I understand the benefits of using the vectorized code branch. and I also understand the risk of providing a generic implementation, e.g. some user could think that the permutation is faster even on a pre-AVX2 machine.

However, in a pre-AVX2 machine and in all other architectures, this package doesn't provide any functionality, instead the user has to write code for that (as you did it for dilithium).

I think one point in the middle of this trade-off is to provide an implementation only for AMD64, with two code branches. The vectorized and the scalar one.
It's clear that performing 4 times the scalar version versus calling 4 times the stdlib function can have different performance, but I guess, it would be small.

@bwesterb
Copy link
Member Author

bwesterb commented Jun 9, 2020

I'm likely missing context from your team discussions, but I'm wondering why this package is being made public?

I use it in go-xmssmt. Currently I have a copy of the package there, but I don't it's good to maintain a copy.

  • If it's a public package, it should have a generic fallback, regardless of how slow that is. It can be documented that you probably don't want to use it without amd64+avx2, but it's very unfriendly if the package crashes without it.

The target audience is very small: it will be used by implementors of cryptographic primitives. I suspect there might be 10 packages max that will refer to it. I think the panic is actually helpful: it points out a performance bug and wrong use of the package which you might have otherwise missed.

It's clear that performing 4 times the scalar version versus calling 4 times the stdlib function can have different performance, but I guess, it would be small.

I'll do a benchmark.

@bwesterb
Copy link
Member Author

It's clear that performing 4 times the scalar version versus calling 4 times the stdlib function can have different performance, but I guess, it would be small.

I'll do a benchmark.

So I ran the Dilithium public key unpacking (which contains the derivation of the matrix A, which uses a lot of SHAKE) using either the standard SHAKE code or the generic slow f1600x4 code.

The result surprised me: the generic f1600x4 has a non negligible overhead as compared to simple f1600, but it's still faster than the standard SHAKE API.

PkUnpack-8  96.7µs ± 1%  88.8µs ± 1%  -8.21%  (p=0.000 n=10+8)

@bwesterb
Copy link
Member Author

I've renamed the package (again) and I've added a generic fallback. I would like to protest again: people will use the package without checking the AvailableX4 which is a performance bug. Not panic()ing is an implicit API: we might break packages in the future if we panic() if the fast vectorized versions are not supported.

Copy link
Contributor

@armfazh armfazh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simd package, it opens the possibility to add other simd implementations.

simd/keccakf1600/f1600x4.go Outdated Show resolved Hide resolved
simd/keccakf1600/f1600x4.go Outdated Show resolved Hide resolved
simd/keccakf1600/f1600x4_amd64.go Outdated Show resolved Hide resolved
simd/keccakf1600/f1600x4_test.go Outdated Show resolved Hide resolved
simd/keccakf1600/internal/asm/go.mod Outdated Show resolved Hide resolved
go.mod Show resolved Hide resolved
Copy link
Contributor

@armfazh armfazh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply the fix, and restore the go.mod files.
Then it will be ready to merge.

simd/keccakf1600/example_test.go Outdated Show resolved Hide resolved
@bwesterb bwesterb force-pushed the public-f1600x4 branch 2 times, most recently from b3e85b7 to 7026f54 Compare July 1, 2020 12:30
Co-authored-by: Armando Faz <armfazh@users.noreply.github.com>
@bwesterb bwesterb merged commit f0d671b into cloudflare:master Jul 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants