Make f1600x4 public #117

bwesterb · 2020-05-21T17:41:20Z

No description provided.

bwesterb · 2020-06-02T15:40:20Z

armfazh

Some comments above

armfazh · 2020-06-04T17:26:19Z

sign/dilithium/mode3/internal/sample.go

@@ -1,8 +1,8 @@
 package internal

 import (
+	"github.com/cloudflare/circl/hash/f1600x4"


Renaming the package to some standard name following FIPS202 (e.g. KeccakF1600) will be appreciated.

f1600x4_other.go -> f1600x4_generic.go
The generic implementation must work too, for example using circl/internal/shake/keccakf.go.
OTOH: (If we want people don't use the generic implementation, the entire package must be marked as only for amd64.)

Handle the case of AMD64 machines without support for AVX2. -- The Available flag delegates this case to the user, who might not handle this situation properly.

Some guidance must be given in the documentation that shows how to effectively load four states interleaved.

Renaming the package to some standard name following FIPS202 (e.g. KeccakF1600) will be appreciated.

I renamed it to keccakf1600x4.

f1600x4_other.go -> f1600x4_generic.go
The generic implementation must work too, for example using circl/internal/shake/keccakf.go.
OTOH: (If we want people don't use the generic implementation, the entire package must be marked as only for amd64.)

Even on amd64, people should not use the package if AVX2 is not available. I dislike a generic implementation as it's slower (due to transposing.) Also I would like to add an ARM optimized version.

Handle the case of AMD64 machines without support for AVX2. -- The Available flag delegates this case to the user, who might not handle this situation properly.

Same argument as above: if AVX2 is not available a generic implementation would be prohibitively slow.

Some guidance must be given in the documentation that shows how to effectively load four states interleaved.

I'll add an example.

I've added the example and some more documentation on its intended purpose.

I'd agree with @armfazh here. Ideally there's a generic implementation, doesn't have to be fast, but it works. If not, guard the entire package with amd64 build tags, so it's literally impossible to build against this package on other platforms.

If not, guard the entire package with amd64 build tags, so it's literally impossible to build against this package on other platforms.

That doesn't really solve the issue as not every amd64 system has AVX2 to begin with.

It is clear that the generic implementation will be slower, but it should have one because amd64 pre-AVX2 machines still need to be able to compile and use this external package.

It would be great to have build tags like machine+feature, so one can make the package compiles only for amd64+avx2.

Also for the generic implementation, you don't need to perform transpose every time provided you read data from the right place.

It is clear that the generic implementation will be slower, but it should have one because amd64 pre-AVX2 machines still need to be able to compile and use this external package.

They can use it! A pre-AVX2 machine can compile this code just fine. If the use is guarded by the Available flag, everything is just fine. (Dilithium, for instance, runs just fine on my pre-AVX2 MacBook because it uses the Available flag to check before using f1600x4.

Also for the generic implementation, you don't need to perform transpose every time provided you read data from the right place.

But you can't view this function in isolation: for every application there is some kind of transposition going on. The generic Dilithium code is faster than the code that uses f1600x4 if the latter has a generic implementation. This is because the latter needs to interleave the data into the state whereas the former doesn't.

hash/f1600x4/asm/go.mod

mmcloughlin · 2020-06-07T19:28:55Z

I also wonder if it's better to name this package keccakf1600, and then the types/functions in the package have X4 on them?

Suppose there were x1, x2, x8 variants too... would you expect them to be in the same or different packages?

bwesterb · 2020-06-07T22:46:55Z

I also wonder if it's better to name this package keccakf1600, and then the types/functions in the package have X4 on them?

Suppose there were x1, x2, x8 variants too... would you expect them to be in the same or different packages?

It doesn't make sense to have a separate package for plain f[1600]. It does make sense to expose X2 and X8 if available. I'll rename accordingly.

armfazh · 2020-06-08T05:34:38Z

hash/keccakf1600x4/example_test.go

+		// state is initialized with zeroes.  As the messages fit within one
+		// block, we only need to write the messages, domain separators
+		// and padding.
+		for i := 0; i < 4; i++ {


This logic can be abstrated by a function beloging to the package.
It could be a function that receives four pointers and loads them into the internal state.

But that will be a trap! I sped up SPHINCS+ by a factor of 2 precisely by removing such a function and doing it ad hoc.

And that is in C. In Go the difference will be more stark.

sphincs/sphincsplus#9

mmcloughlin · 2020-06-08T21:01:55Z

I'm likely missing context from your team discussions, but I'm wondering why this package is being made public? My feeling is:

If it's a public package, it should have a generic fallback, regardless of how slow that is. It can be documented that you probably don't want to use it without amd64+avx2, but it's very unfriendly if the package crashes without it.
If it's an internal package, code that works on avx2 only is completely fine.

Generic implementation can also be useful for differential testing #4.

Relevant:
#28
https://github.com/golang/go/wiki/TargetSpecific

armfazh · 2020-06-09T06:12:14Z

But you can't view this function in isolation: for every application there is some kind of transposition going on.

The comments of Michael and me comes from the fact that a go package must provide the same functionality regardless the architecture. This is the expected behaviour of a public, single package.

I understand the benefits of using the vectorized code branch. and I also understand the risk of providing a generic implementation, e.g. some user could think that the permutation is faster even on a pre-AVX2 machine.

However, in a pre-AVX2 machine and in all other architectures, this package doesn't provide any functionality, instead the user has to write code for that (as you did it for dilithium).

I think one point in the middle of this trade-off is to provide an implementation only for AMD64, with two code branches. The vectorized and the scalar one.
It's clear that performing 4 times the scalar version versus calling 4 times the stdlib function can have different performance, but I guess, it would be small.

bwesterb · 2020-06-09T12:56:55Z

I'm likely missing context from your team discussions, but I'm wondering why this package is being made public?

I use it in go-xmssmt. Currently I have a copy of the package there, but I don't it's good to maintain a copy.

If it's a public package, it should have a generic fallback, regardless of how slow that is. It can be documented that you probably don't want to use it without amd64+avx2, but it's very unfriendly if the package crashes without it.

The target audience is very small: it will be used by implementors of cryptographic primitives. I suspect there might be 10 packages max that will refer to it. I think the panic is actually helpful: it points out a performance bug and wrong use of the package which you might have otherwise missed.

It's clear that performing 4 times the scalar version versus calling 4 times the stdlib function can have different performance, but I guess, it would be small.

I'll do a benchmark.

bwesterb · 2020-06-11T21:27:57Z

It's clear that performing 4 times the scalar version versus calling 4 times the stdlib function can have different performance, but I guess, it would be small.

I'll do a benchmark.

So I ran the Dilithium public key unpacking (which contains the derivation of the matrix A, which uses a lot of SHAKE) using either the standard SHAKE code or the generic slow f1600x4 code.

The result surprised me: the generic f1600x4 has a non negligible overhead as compared to simple f1600, but it's still faster than the standard SHAKE API.

PkUnpack-8  96.7µs ± 1%  88.8µs ± 1%  -8.21%  (p=0.000 n=10+8)

bwesterb · 2020-06-11T21:56:07Z

I've renamed the package (again) and I've added a generic fallback. I would like to protest again: people will use the package without checking the AvailableX4 which is a performance bug. Not panic()ing is an implicit API: we might break packages in the future if we panic() if the fast vectorized versions are not supported.

armfazh

I like the simd package, it opens the possibility to add other simd implementations.

simd/keccakf1600/f1600x4.go

simd/keccakf1600/f1600x4_amd64.go

simd/keccakf1600/f1600x4_test.go

simd/keccakf1600/internal/asm/go.mod

go.mod

armfazh

Apply the fix, and restore the go.mod files.
Then it will be ready to merge.

simd/keccakf1600/example_test.go

Co-authored-by: Armando Faz <armfazh@users.noreply.github.com>

bwesterb requested a review from armfazh May 21, 2020 17:41

bwesterb force-pushed the public-f1600x4 branch from 90967c0 to 853997a Compare May 21, 2020 18:10

bwesterb force-pushed the public-f1600x4 branch 2 times, most recently from 5f52126 to 816b794 Compare June 2, 2020 15:31

armfazh requested changes Jun 4, 2020

View reviewed changes

armfazh reviewed Jun 4, 2020

View reviewed changes

hash/f1600x4/asm/go.mod Outdated Show resolved Hide resolved

bwesterb force-pushed the public-f1600x4 branch 3 times, most recently from 481e608 to 5829ed1 Compare June 7, 2020 11:33

armfazh reviewed Jun 8, 2020

View reviewed changes

bwesterb force-pushed the public-f1600x4 branch from 5829ed1 to c7cb77e Compare June 11, 2020 21:54

armfazh reviewed Jun 15, 2020

View reviewed changes

bwesterb force-pushed the public-f1600x4 branch from c7cb77e to 4a336b0 Compare June 16, 2020 09:36

bwesterb commented Jun 26, 2020

View reviewed changes

go.mod Show resolved Hide resolved

bwesterb force-pushed the public-f1600x4 branch from 51a6635 to 3691a59 Compare June 26, 2020 12:10

bwesterb and others added 2 commits June 26, 2020 14:11

Expose f1600x4 and add a generic slow fallback

396f743

Unify files and testing/bench both generic and SIMD implementation.

4b79db5

bwesterb force-pushed the public-f1600x4 branch from 3691a59 to 97444af Compare June 26, 2020 12:12

armfazh self-requested a review June 26, 2020 18:38

armfazh approved these changes Jun 26, 2020

View reviewed changes

simd/keccakf1600/example_test.go Outdated Show resolved Hide resolved

bwesterb force-pushed the public-f1600x4 branch 2 times, most recently from b3e85b7 to 7026f54 Compare July 1, 2020 12:30

Satisfy linter, fix merge and go.mod

72f85a8

Co-authored-by: Armando Faz <armfazh@users.noreply.github.com>

bwesterb force-pushed the public-f1600x4 branch from 7026f54 to 72f85a8 Compare July 1, 2020 12:34

bwesterb merged commit f0d671b into cloudflare:master Jul 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make f1600x4 public #117

Make f1600x4 public #117

bwesterb commented May 21, 2020

bwesterb commented Jun 2, 2020 •

edited

Loading

armfazh left a comment •

edited

Loading

armfazh Jun 4, 2020

bwesterb Jun 7, 2020

bwesterb Jun 7, 2020 •

edited

Loading

mmcloughlin Jun 7, 2020

bwesterb Jun 7, 2020

armfazh Jun 8, 2020

armfazh Jun 8, 2020

bwesterb Jun 8, 2020

bwesterb Jun 8, 2020

mmcloughlin commented Jun 7, 2020

bwesterb commented Jun 7, 2020

armfazh Jun 8, 2020

bwesterb Jun 8, 2020

bwesterb Jun 8, 2020

mmcloughlin commented Jun 8, 2020

armfazh commented Jun 9, 2020

bwesterb commented Jun 9, 2020

bwesterb commented Jun 11, 2020

bwesterb commented Jun 11, 2020

armfazh left a comment

armfazh left a comment

Make f1600x4 public #117

Make f1600x4 public #117

Conversation

bwesterb commented May 21, 2020

bwesterb commented Jun 2, 2020 • edited Loading

armfazh left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bwesterb Jun 7, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mmcloughlin commented Jun 7, 2020

bwesterb commented Jun 7, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mmcloughlin commented Jun 8, 2020

armfazh commented Jun 9, 2020

bwesterb commented Jun 9, 2020

bwesterb commented Jun 11, 2020

bwesterb commented Jun 11, 2020

armfazh left a comment

Choose a reason for hiding this comment

armfazh left a comment

Choose a reason for hiding this comment

bwesterb commented Jun 2, 2020 •

edited

Loading

armfazh left a comment •

edited

Loading

bwesterb Jun 7, 2020 •

edited

Loading