Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HKDF performance #52

Closed
ektrah opened this issue Dec 27, 2020 · 5 comments · Fixed by #53
Closed

HKDF performance #52

ektrah opened this issue Dec 27, 2020 · 5 comments · Fixed by #53
Labels

Comments

@ektrah
Copy link
Owner

ektrah commented Dec 27, 2020

HKDF.Standard claims to be 2.6 - 6.7 times faster than NSec.

@andreimilto Any quick thoughts on why NSec is so much slower?

https://github.com/andreimilto/HKDF.Standard/blob/67f5f1078b4be61746047a57842d2fa1ea042377/src/HkdfStandard.Benchmark/KeyDerivationBenchmark.cs#L121-L127

Is it the SharedSecret.Import or the DeriveBytes call?

@andreimilto
Copy link

Is it the SharedSecret.Import or the DeriveBytes call?

It's mostly SharedSecret.Import when deriving a short key, and both of them when deriving a long key.
I ran another benchmark on the same hardware that was used for measuring HKDF.Standard performance, so the results can be compared directly. The benchmark derives 128-bit and 4096-bit keys, uses SHA-256 and SHA-512, includes and excludes the time taken for importing the shared secret from measurements.

Benchmark class (click to open)
public class HkdfBenchmark
{
    private static byte[] rawSharedSecret = new byte[32];
    private static SharedSecret importedSharedSecret = SharedSecret.Import(rawSharedSecret);

    private static byte[] salt = new byte[32];
    private static byte[] info = new byte[32];

    [Params(16, 512)]
    public int OutputLength;

    [ParamsSource(nameof(Hkdfs))]
    public KeyDerivationAlgorithm Hkdf;

    public static KeyDerivationAlgorithm[] Hkdfs => new KeyDerivationAlgorithm[]
    {
        KeyDerivationAlgorithm.HkdfSha256,
        KeyDerivationAlgorithm.HkdfSha512,
    };


    [Benchmark]
    public void BenchmarkHkdfWithSecretImport()
    {
        var sharedSecret = SharedSecret.Import(rawSharedSecret);
        Hkdf.DeriveBytes(sharedSecret, salt, info, OutputLength);
    }

    [Benchmark]
    public void BenchmarkHkdfWithoutSecretImport()
    {
        Hkdf.DeriveBytes(importedSharedSecret, salt, info, OutputLength);
    }
}

Benchmark results:

Method OutputLength Hkdf Mean Error StdDev
BenchmarkHkdfWithSecretImport 16 NSec.(...)ha256 [28] 12.882 us 0.1887 us 0.1765 us
BenchmarkHkdfWithoutSecretImport 16 NSec.(...)ha256 [28] 3.258 us 0.0280 us 0.0262 us
BenchmarkHkdfWithSecretImport 16 NSec.(...)ha512 [28] 13.627 us 0.1709 us 0.1515 us
BenchmarkHkdfWithoutSecretImport 16 NSec.(...)ha512 [28] 4.297 us 0.0676 us 0.0632 us
BenchmarkHkdfWithSecretImport 512 NSec.(...)ha256 [28] 42.546 us 0.7893 us 0.7383 us
BenchmarkHkdfWithoutSecretImport 512 NSec.(...)ha256 [28] 33.149 us 0.2092 us 0.1855 us
BenchmarkHkdfWithSecretImport 512 NSec.(...)ha512 [28] 29.399 us 0.4734 us 0.4196 us
BenchmarkHkdfWithoutSecretImport 512 NSec.(...)ha512 [28] 19.230 us 0.3780 us 0.3713 us

SharedSecret.Import always takes about 9 microseconds, which is a lot when deriving a 128-bit key because the key derivation itself takes only 3-4 microseconds, and becomes less expensive in relative terms but still significant when deriving a 4096-key as the key derivation time increases to 19-29 microseconds. The DeriveBytes alone takes longer than HKDF.Standard's DeriveKey or NET 5's HKDF.DeriveKey.


Any quick thoughts on why NSec is so much slower?

I don't know, maybe it's worth taking a look not only at SharedSecret.Import and DeriveBytes themselves, but also at the underlying cryptographic primitives - HMAC and SHA.

For example, on my PC the NSec's implementation of SHA-512 is faster that the .NET 5's implementation for small inputs:

Benchmark class for SHA-512 (click to open)
public class Sha512Benchmark
{
    [ParamsSource(nameof(Inputs))]
    public byte[] Input;

    public IEnumerable<byte[]> Inputs => new[]
    {
        new byte[16],
        new byte[64],
        new byte[256],
        new byte[1024],
        new byte[4096],
        new byte[16384],
        new byte[65536]
    };


    [Benchmark]
    public void BenchmarkNetSha512()
    {
        using var sha = SHA512.Create();
        sha.ComputeHash(Input);
    }

    [Benchmark]
    public void BenchmarkNSecSha512()
    {
        var sha = new Sha512();
        sha.Hash(Input);
    }
}
Benchmark results (click to open)
Method Input Mean Error StdDev
BenchmarkNetSha512 Byte[1024] 3,138.1 ns 49.23 ns 46.05 ns
BenchmarkNSecSha512 Byte[1024] 3,507.6 ns 28.86 ns 27.00 ns
BenchmarkNetSha512 Byte[16384] 37,040.9 ns 101.85 ns 95.28 ns
BenchmarkNSecSha512 Byte[16384] 45,727.5 ns 351.83 ns 329.10 ns
BenchmarkNetSha512 Byte[16] 881.3 ns 8.59 ns 8.04 ns
BenchmarkNSecSha512 Byte[16] 551.5 ns 5.17 ns 4.84 ns
BenchmarkNetSha512 Byte[256] 1,445.5 ns 5.15 ns 4.56 ns
BenchmarkNSecSha512 Byte[256] 1,379.2 ns 11.47 ns 10.73 ns
BenchmarkNetSha512 Byte[4096] 9,935.6 ns 71.31 ns 63.22 ns
BenchmarkNSecSha512 Byte[4096] 11,922.2 ns 57.82 ns 54.08 ns
BenchmarkNetSha512 Byte[64] 884.6 ns 3.74 ns 3.50 ns
BenchmarkNSecSha512 Byte[64] 560.6 ns 4.92 ns 4.60 ns
BenchmarkNetSha512 Byte[65536] 145,972.0 ns 1,901.19 ns 1,778.38 ns
BenchmarkNSecSha512 Byte[65536] 180,436.8 ns 830.70 ns 693.67 ns

So, I'd naturally expect NSec's HMAC-SHA-512 be faster than .NET 5's for small inputs, but it turns out not to be the case. This makes me think that there's probably a potential for optimisation in NSec's HMAC (both the Key.Import and HmacSha512.Mac judging from the benchmark results), which, if released, may also speed-up the HKDF primitive (since under the hood HKDF uses HMAC).

Benchmark class for HMAC-SHA-512 (click to open)
public class HmacSha512Benchmark
{
    private static byte[] rawKey = new byte[64];
    private static Key importedKey = Key.Import(MacAlgorithm.HmacSha512, rawKey, KeyBlobFormat.RawSymmetricKey);

    [ParamsSource(nameof(Messages))]
    public byte[] Message;

    public IEnumerable<byte[]> Messages => new[]
    {
        new byte[16],
        new byte[64],
        new byte[256],
        new byte[1024],
        new byte[4096],
        new byte[16384],
        new byte[65536]
    };
        
    [Benchmark]
    public void BenchmarkNetHmacSha512()
    {
        using var hmac = new HMACSHA512(rawKey);
        hmac.ComputeHash(Message);
    }

    [Benchmark]
    public void BenchmarkNSecHmacSha512WithKeyImport()
    {
        var hmac = MacAlgorithm.HmacSha512;
        hmac.Mac(Key.Import(hmac, rawKey, KeyBlobFormat.RawSymmetricKey), Message);
    }

    [Benchmark]
    public void BenchmarkNSecHmacSha512WithoutKeyImport()
    {
        var hmac = MacAlgorithm.HmacSha512;
        hmac.Mac(importedKey, Message);
    }
}
Benchmark results (click to open)
Method Message Mean Error StdDev
BenchmarkNetHmacSha512 Byte[1024] 4.319 us 0.0527 us 0.0467 us
BenchmarkNSecHmacSha512WithKeyImport Byte[1024] 15.385 us 0.2583 us 0.2290 us
BenchmarkNSecHmacSha512WithoutKeyImport Byte[1024] 5.101 us 0.0144 us 0.0127 us
BenchmarkNetHmacSha512 Byte[16384] 38.452 us 0.2644 us 0.2474 us
BenchmarkNSecHmacSha512WithKeyImport Byte[16384] 53.303 us 0.8284 us 0.7749 us
BenchmarkNSecHmacSha512WithoutKeyImport Byte[16384] 47.552 us 0.3483 us 0.3258 us
BenchmarkNetHmacSha512 Byte[16] 2.025 us 0.0204 us 0.0191 us
BenchmarkNSecHmacSha512WithKeyImport Byte[16] 12.690 us 0.1718 us 0.1434 us
BenchmarkNSecHmacSha512WithoutKeyImport Byte[16] 2.151 us 0.0116 us 0.0103 us
BenchmarkNetHmacSha512 Byte[256] 2.588 us 0.0097 us 0.0086 us
BenchmarkNSecHmacSha512WithKeyImport Byte[256] 13.498 us 0.2566 us 0.2746 us
BenchmarkNSecHmacSha512WithoutKeyImport Byte[256] 2.995 us 0.0167 us 0.0156 us
BenchmarkNetHmacSha512 Byte[4096] 11.097 us 0.1234 us 0.1154 us
BenchmarkNSecHmacSha512WithKeyImport Byte[4096] 23.129 us 0.4410 us 0.4529 us
BenchmarkNSecHmacSha512WithoutKeyImport Byte[4096] 13.547 us 0.1934 us 0.1714 us
BenchmarkNetHmacSha512 Byte[64] 2.045 us 0.0189 us 0.0177 us
BenchmarkNSecHmacSha512WithKeyImport Byte[64] 12.964 us 0.2339 us 0.2073 us
BenchmarkNSecHmacSha512WithoutKeyImport Byte[64] 2.145 us 0.0158 us 0.0140 us
BenchmarkNetHmacSha512 Byte[65536] 146.429 us 0.9867 us 0.9230 us
BenchmarkNSecHmacSha512WithKeyImport Byte[65536] 188.959 us 0.9673 us 0.9049 us
BenchmarkNSecHmacSha512WithoutKeyImport Byte[65536] 181.751 us 0.8287 us 0.7346 us

So, in total it makes four candidates for optimisation:

  • SharedSecret.Import
  • HKDF itself
  • Key.Import (IDK whether it will also improve HKDF's performance, but definitely good for HMAC)
  • HMAC itself

Not sure though that all four really need it and even can be optimised. Maybe some of them, like Key.Import, deserve their separate issues.
It's all IMHO, I know nothing about the internals of NSec.

@ektrah
Copy link
Owner Author

ektrah commented Jan 9, 2021

Wow, that's quite a detailed analysis. Thanks a lot!

SharedSecret.Import is somewhat expensive as NSec allocates the memory for shared secrets outside the managed heap and applies a few protections for a little bit of extra defense in depth (libsodium's guarded heap allocations). There is not much that could be optimized here, besides removing those protections. Of course, these probably don't help a lot when the shared secret is already inside the managed heap and then imported. So maybe the solution here is to provide additional method overloads in the HKDF classes that accept Span<byte> instead of going through SharedSecret.Import.

Key.Import works the same as SharedSecret.Import, but shouldn't be of relevance here, since the NSec HKDF classes don't use the Key class internally. So the benchmark results for HMAC WithoutKeyImport are the better indicator. I'm not sure why HMAC is so much slower than NSec for large inputs; maybe libsodium's HMAC implementation isn't as well optimized.

That leaves NSec's HKDF implementation itself. At first glance, it doesn't look too different from what HKDF.Standard does or from libsodium's upcoming implementation. I'll have to dig a bit more into this sometime.

Overall, I'm a bit wary of implementing cryptographic primitives myself. Even though it's "just" HKDF and probably not very difficult to get right, I think it would be better to defer to established, well-tested implementations. So, maybe, the "solution" to the performance problem here is just to use .NET's HKDF where available and/or libsodium's HKDF when that's released...

@andreimilto
Copy link

Wow, that's quite a detailed analysis. Thanks a lot!

Happy to help!


That leaves NSec's HKDF implementation itself. At first glance, it doesn't look too different from what HKDF.Standard does or from libsodium's upcoming implementation. I'll have to dig a bit more into this sometime.

It must be the ability to initialize HMAC with a key once and compute multiple MACs that gives .NET's HMAC-based implementations of HKDF noticeable performance boost. NSec's HKDF reinitializes HMAC on every iteration during the Expand stage, and it looks like currently there's no way around this - libsodium's documentation specifically requires so:

The state must be initialized with crypto_auth_hmacsha*_init() before updating or finalizing it. After crypto_auth_hmacsha*_final() returns, the state should not be used any more, unless it is reinitialized using crypto_auth_hmacsha*_init().

For example, the derivation of 4096-bit key (as in the HKDF.Standard benchmark) via SHA-512 requires 8 MAC computations with the same key during the Expand stage. Initializing HMAC once instead of eight times more than doubles the speed of this group of operations:

Benchmark class
public class BatchHmacSha512Benchmark
{
    private static byte[] rawKey = new byte[64];
    private static Key importedKey = Key.Import(MacAlgorithm.HmacSha512, rawKey, KeyBlobFormat.RawSymmetricKey);

    private static byte[] message = new byte[97];


    [Benchmark]
    public void BenchmarkNSecHmacSha512_8Inits_8Macs()
    {
        for (int i = 0; i < 8; i++)
            MacAlgorithm.HmacSha512.Mac(importedKey, message);
    }

    [Benchmark]
    public void BenchmarkNetHmacSha512_8Inits_8Macs()
    {
        for (int i = 0; i < 8; i++)
        {
            using var hmac = new HMACSHA512(rawKey);
            hmac.ComputeHash(message);
        }
    }

    [Benchmark]
    public void BenchmarkNetHmacSha512_1Init_8Macs()
    {
        using var hmac = new HMACSHA512(rawKey);
        for (int i = 0; i < 8; i++)
            hmac.ComputeHash(message);
    }
}
Method Mean Error StdDev
BenchmarkNSecHmacSha512_8Inits_8Macs 17.189 us 0.0676 us 0.0599 us
BenchmarkNetHmacSha512_8Inits_8Macs 16.479 us 0.2641 us 0.2470 us
BenchmarkNetHmacSha512_1Init_8Macs 7.768 us 0.1242 us 0.1162 us

So, it's the key import and MAC computation operations that contribute mostly to the execution time, maybe there are other optimizations possible (e.g., moving crypto_auth_hmacsha512_state state; outside of the while loop) but they won't bring much benefit:

Derivation of 4096-bit key using SHA-512 (as in the HKDF.Standard benchmark)

Operation NSec HKDF.Standard
Key Import 9 us 0 us
1 x HMAC(32-byte key, 32-byte msg) during extraction 2 us 2 us
1 x HMAC(64-byte key, 33-byte msg), 7 x HMAC (64-byte key, 97-byte msg) during expansion 17 us 8 us
Total 28 us 10 us

Derivation of 128-bit key using SHA-512 (as in the HKDF.Standard benchmark)

Operation NSec HKDF.Standard
Key Import 9 us 0 us
1 x HMAC(32-byte key, 32-byte msg) during extraction 2 us 2 us
1 x HMAC(64-byte key, 33-byte msg) during expansion 2 us 2 us
Total 13 us 4 us

@ektrah
Copy link
Owner Author

ektrah commented Jan 10, 2021

I've added new overloads to the HKDF class and factored out the HMAC initialization in a new branch.
(If you want to try it without checking out the code, here's the latest build artifact: artifact.zip)

It must be the ability to initialize HMAC with a key once and compute multiple MACs that gives .NET's HMAC-based implementations of HKDF noticeable performance boost. NSec's HKDF reinitializes HMAC on every iteration during the Expand stage, and it looks like currently there's no way around this - libsodium's documentation specifically requires so:

libsodium mutates the state in place, so, after updating or finalizing it, the original state is gone and needs to be recreated in the next iteration. I didn't think that initialization would take so much time, so I just reinitialized the original state in every iteration. But, of course, it's also possible to keep a copy of the original state and just initialize the state from that.

(e.g., moving crypto_auth_hmacsha512_state state; outside of the while loop)

As far as I know, after checking the scopes of variables, the C# compiler just hoists all variables to the top of the method. So this shouldn't make a difference.

but they won't bring much benefit:

Thanks again for the detailed benchmarks!

@andreimilto
Copy link

I've added new overloads to the HKDF class and factored out the HMAC initialization in a new branch.

Great result!

Hash Output Before After
SHA-256 128 bit 12 us 3 us
SHA-512 128 bit 13 us 4 us
SHA-256 4096 bit 41 us 21 us
SHA-512 4096 bit 29 us 12 us

Thanks again for the detailed benchmarks!

You are welcome :)

@ektrah ektrah added the bug label Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants