ChaCha7539/ChaCha20/Salsa20/Poly1305 SIMD #86
Merged
Overview
A broad throughput pass across the ChaCha20/ChaCha7539, Salsa20, and Poly1305 primitives. Key additions: a 4-block (256-byte) fast path for ChaCha7539 with a dedicated AVX2 kernel, a fully pipelined ProcessBytes override for ChaCha7539 and Salsa20, a UInt64-wide XOR tail that replaces the per-byte loop throughout, AVX2 4-way bulk evaluation for Poly1305, a ProcessBlocks2 override on the Salsa20 base engine, and a shared ChaChaSse2_DoubleRoundBody.inc that deduplicates the SSE2 double-round body across all consumers. The ProcessBlocks2Sse2 stub is promoted from an inline two-call sequence into its own proper SIMD kernel covering both x86_64 and i386.

TChaCha7539Engine (ClpChaCha7539Engine)

ProcessBlocks4 (new)
256-byte (4-block) variant. Dispatches to a new dedicated ChaCha7539ProcessBlocks4Avx2 kernel on AVX2; falls back to two sequential ProcessBlocks2 calls on everything else. The counter overflow guard is checked before the AVX2 path is entered. New include files: ChaCha7539ProcessBlocks4Avx2_x86_64.inc, ChaCha7539ProcessBlocks4Avx2_i386.inc.

ProcessBlocks2 refactored
The previous SSE2 fallback (two separate ChaCha7539BlockSse2 calls + manual XOR loops) is replaced by a proper dedicated two-block SSE2 kernel, ChaCha7539ProcessBlocks2Sse2, covering both x86_64 and i386. The {$IFDEF CRYPTOLIB_X86_64_ASM} guard around the AVX2 dispatch is removed: both stubs are now bi-arch. New include files: ChaCha7539ProcessBlocks2Sse2_x86_64.inc, ChaCha7539ProcessBlocks2Sse2_i386.inc.

ProcessBytes override (new)
A fully pipelined ProcessBytes replaces the inherited Salsa20 implementation. The dispatch ladder in priority order:
1. Drain leftover FKeyStream from a partial block.
2. ProcessBlocks4 while ALen >= 256.
3. ProcessBlocks2 while ALen >= 128.
4. ImplProcessBlock while ALen >= 64.
5. GenerateKeyStream + AdvanceCounter + UInt64 XOR loop (8 words) + byte-level cleanup for the remainder.
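
The ladder above can be modelled in a few lines. This is only a Python sketch of the ordering, not the Pascal code in this PR: the function name, the step labels, and the `buffered` parameter (unused keystream bytes left over from an earlier partial block) are assumptions made for illustration.

```python
BLOCK = 64  # ChaCha/Salsa20 block size in bytes


def dispatch_plan(length, buffered=0):
    """Return the ordered (step, byte_count) pairs the ladder would take.

    `buffered` models leftover keystream from a previous partial block
    (hypothetical parameter; the real engine tracks this internally).
    """
    plan = []
    # Step 1: use up leftover keystream before touching whole blocks.
    take = min(buffered, length)
    if take:
        plan.append(("keystream", take))
        length -= take
    # Steps 2-4: widest kernel first, then progressively narrower ones.
    for name, size in (("blocks4", 4 * BLOCK),
                       ("blocks2", 2 * BLOCK),
                       ("block1", BLOCK)):
        while length >= size:
            plan.append((name, size))
            length -= size
    # Step 5: fewer than 64 bytes remain; one more keystream block covers them.
    if length:
        plan.append(("tail", length))
    return plan
```

For example, a 500-byte call with 10 buffered keystream bytes resolves to one step of each kind: 10 + 256 + 128 + 64 + 42 bytes.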

DoFinal: ProcessBlocks4 drain + UInt64 XOR tail
DoFinal now drains 256-byte chunks via ProcessBlocks4 before the existing 128-byte drain loop. The per-byte XOR in the tail keystream path is replaced with a UInt64-wide XOR over AInLen shr 3 words followed by a byte-level cleanup loop for the remaining 0–7 bytes.

ImplProcessBlock: UInt64 XOR
The per-byte XOR loop (64 iterations) is replaced with 8 PUInt64 pointer-cast XOR operations via LInP, LOutP, LKeyP locals.

IChaCha7539Engine: ProcessBlocks2 removed
ProcessBlocks2 is removed from the interface. The method now lives solely on TChaCha7539Engine (and its base TSalsa20Engine), letting TChaCha20Poly1305 hold a concrete TChaCha7539Engine reference rather than IChaCha7539Engine, giving it direct access to ProcessBlocks4 without an extra interface method.

TChaChaEngine (ClpChaChaEngine)

ProcessBlocks2 override (new)
A ProcessBlocks2 override is added at the TChaChaEngine level. It calls AssertInitialisedAndBlockAligned then delegates to two ImplProcessBlock calls. This plugs the virtual dispatch slot for non-7539 ChaCha variants so the base TSalsa20Engine.ProcessBlocks2 SSE4.1 path does not run on ChaCha (which uses a different state format).
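
To make the "different state format" point concrete, the sketch below builds the 16-word initial state the way the published Salsa20 and original (64-bit nonce) ChaCha designs lay it out. That the engines in this PR store their state in exactly these positions is an assumption; the function names are hypothetical.

```python
import struct

# "expand 32-byte k" as four little-endian 32-bit words.
SIGMA = struct.unpack("<4I", b"expand 32-byte k")


def salsa20_state(key: bytes, nonce: bytes, counter: int = 0):
    """Salsa20 layout: constants sit on the diagonal (words 0, 5, 10, 15)."""
    k = struct.unpack("<8I", key)
    n = struct.unpack("<2I", nonce)
    c = (counter & 0xFFFFFFFF, counter >> 32)
    return [SIGMA[0], *k[:4], SIGMA[1], *n, *c, SIGMA[2], *k[4:], SIGMA[3]]


def chacha_state(key: bytes, nonce: bytes, counter: int = 0):
    """Original ChaCha layout: constants fill the first row (words 0..3)."""
    k = struct.unpack("<8I", key)
    n = struct.unpack("<2I", nonce)
    c = (counter & 0xFFFFFFFF, counter >> 32)
    return [*SIGMA, *k, *c, *n]
```

A kernel that assumes one of these word orders would read key, counter, and constant words from the wrong slots if handed the other, which is why the ChaCha engine needs its own ProcessBlocks2 slot.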

TSalsa20Engine (ClpSalsa20Engine)

AssertInitialisedAndBlockAligned (new)
A shared inline helper that checks FInitialised and FIndex = 0, raising EInvalidOperationCryptoLibException with the appropriate message for either failure. Replaces the inline duplicated checks in ProcessBlock, ProcessBlocks2, and ProcessBlocks4.

ImplProcessBlock (new, promoted to strict protected)
Encapsulates GenerateKeyStream + AdvanceCounter + the 8-wide PUInt64 XOR + write path. Used internally by ProcessBlocks2 (scalar fallback) and TChaChaEngine.ProcessBlocks2.

ProcessBlocks2 override (new, virtual)
Added to TSalsa20Engine as a virtual method. Calls AssertInitialisedAndBlockAligned, then dispatches to Salsa20ProcessBlocks2Sse41 on SSE4.1; falls back to two ImplProcessBlock calls. New include files: Salsa20ProcessBlocks2Sse41_x86_64.inc, Salsa20ProcessBlocks2Sse41_i386.inc.

ProcessBytes: batch dispatch rewrite
The old per-byte loop (one ReturnByte equivalent per iteration) is replaced with a structured while-loop that works through these steps:
1. Partial block first (FIndex <> 0).
2. ProcessBlocks2 while ALen >= 128.
3. ImplProcessBlock while ALen >= 64.
4. GenerateKeyStream + AdvanceCounter.

SNotBlockAligned resource string added for the block-alignment error.

TPoly1305 (ClpPoly1305)

State consolidation: TPoly1305State
A new TPoly1305State record (72 bytes) collects R0..R4, S1..S4, H0..H4, and K0..K3 into a single value. The previous 13 separate FR*, FS*, FH*, FK* fields are replaced by FState: TPoly1305State. A new FPowTable: TCryptoLibByteArray field holds the AVX2 power table, and doubles as a dispatch flag (FPowTable <> nil iff a SIMD path is active).

New scalar primitives (unit-level, not methods)

Poly1305StateReset: zeroes H0..H4 in FState.

Poly1305StateAbsorbR: clamps and packs the first 16 key bytes into R0..R4 / S1..S4 using the RFC 7539 clamp masks.

Poly1305StateAbsorbS: packs the Poly1305 "s" key into K0..K3.

Poly1305StateProcessBlock: the (H + M) * r mod p step in radix-2^26, operating on TPoly1305State. Replaces the inline expansion previously duplicated in ProcessBlock and DoFinal.

Poly1305StateProcessBlocksScalar: iterates Poly1305StateProcessBlock N times; used as scalar bulk path and as tail handler for the AVX2 path.
AVX2 4-way bulk evaluation

Poly1305MulLimbs: 5-limb radix-2^26 field multiply used at SetKey time to derive r^2, r^3, r^4 for the power table.

Poly1305Avx2InitPowerTable: allocates a 320-byte power table and packs r^1..r^4 limbs + their 5x wraparound multipliers in the post-VPERMD layout expected by the assembly kernel.

Poly1305BlocksBulkAvx2Core: Pascal ABI wrapper around Poly1305BlocksBulkAvx2Core_x86_64.inc / Poly1305BlocksBulkAvx2Core_i386.inc. Takes a TPoly1305State pointer, the power table pointer, the input buffer pointer, byte length, and a padding flag.
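
The reason a table of r^1..r^4 is enough for 4-way evaluation is the usual Horner unrolling: four sequential steps collapse to ((h + m1) * r^4 + m2 * r^3 + m3 * r^2 + m4 * r) mod p, so the four products are independent. The Python sketch below checks that identity with big integers; it says nothing about the register layout of the actual kernel, and the function names are hypothetical.

```python
P = (1 << 130) - 5  # the Poly1305 prime


def block_value(block16: bytes) -> int:
    """A full 16-byte block as an integer with the 2^128 padding bit set."""
    return int.from_bytes(block16, "little") + (1 << 128)


def absorb_scalar(h: int, r: int, blocks) -> int:
    """Reference path: one (h + m) * r mod p step per block."""
    for b in blocks:
        h = (h + block_value(b)) * r % P
    return h


def absorb_4way(h: int, r: int, blocks) -> int:
    """Absorb four blocks per step using r^4, r^3, r^2, r^1."""
    powers = [pow(r, k, P) for k in (4, 3, 2, 1)]
    bulk = len(blocks) - len(blocks) % 4   # round down to a multiple of 4
    for i in range(0, bulk, 4):
        m = [block_value(b) for b in blocks[i:i + 4]]
        m[0] += h                          # the accumulator rides on lane 0
        h = sum(x * p for x, p in zip(m, powers)) % P
    # The 0-3 leftover blocks go through the one-block step.
    return absorb_scalar(h, r, blocks[bulk:])
```

Both paths return the same accumulator for any block count, which is the property a vectorised kernel has to preserve.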

Poly1305ProcessBlocksAvx2: rounds ANumBlocks down to a multiple of 4, dispatches the AVX2 kernel for that bulk, forwards the 0–3 remainder to Poly1305StateProcessBlocksScalar.

SetKey: SIMD table allocation
After populating FState via Poly1305StateAbsorbR / Poly1305StateAbsorbS, SetKey resets FPowTable to nil then conditionally calls Poly1305Avx2InitPowerTable when HasAVX2() is true.

BlockUpdate: batch dispatch
The inner while-loop is replaced with a block-count calculation (LNb := LRemaining shr 4); when LNb > 0, the batch is routed to Poly1305ProcessBlocksAvx2 (when FPowTable <> nil and HasAVX2()) or Poly1305StateProcessBlocksScalar.

ProcessBlock reduced to a one-liner
Delegates to Poly1305StateProcessBlock(FState, ABuf, AOff).

TChaCha20Poly1305 (ClpChaCha20Poly1305)

- FChaCha20 type changed from IChaCha7539Engine to TChaCha7539Engine (concrete reference, enabling ProcessBlocks4). ClpIChaCha7539Engine removed from the uses clause.
- ProcessBlocks4 added (delegates to FChaCha20.ProcessBlocks4, increments FDataCount by 256).
- ProcessBytes encrypt loop now drives a ProcessBlocks4 / AVX2 drain pass (while AInOff <= LInLimit2 - BufSize * 2) before the existing ProcessBlocks2 loop. Decrypt path mirrors the same change.
- LChaCha20Params construction simplified: as IParametersWithIV cast removed (the class reference is now used directly).
- destructor Destroy added to free the owned TChaCha7539Engine.

Include file refactoring

ChaChaSse2_DoubleRoundBody.inc (new, shared)
The 28-instruction SSE2 double-round body (column round + diagonal round, each using paddd/pxor/movdqa/psrld/pslld/por/pshufd) is extracted into a single shared include. Previously duplicated verbatim in both ChaCha20BlockSse2_x86_64.inc and ChaCha20BlockSse2_i386.inc; both now reduce to {$I ChaChaSse2_DoubleRoundBody.inc} inside their loop.

Shared AVX2 sub-includes (new)
The 128B AVX2 state-load and XOR-store sequences are factored out of the monolithic ChaCha7539ProcessBlocks2Avx2_x86_64.inc into:
- ChaCha7539_Avx2_128B_LoadState_x86_64.inc / _i386.inc
- ChaCha7539_Avx2_128B_XorStore_x86_64.inc / _i386.inc
- ChaCha7539_Avx2_DoubleRound.inc (architecture-neutral VEX db-encoded body)
These are composed by both ChaCha7539ProcessBlocks2Avx2_*.inc and ChaCha7539ProcessBlocks4Avx2_*.inc, eliminating a second verbatim copy of the AVX2 double-round body.
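
For readers who have not looked inside the shared includes, this is what a ChaCha double round computes, written as a scalar Python reference following RFC 7539. It is a sketch of the arithmetic only: the SIMD bodies operate on whole rows at once and build each rotation from a shift pair plus an OR, and the function names here are hypothetical.

```python
MASK = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) & MASK) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """ChaCha quarter round on state words s[a], s[b], s[c], s[d] (in place)."""
    s[a] = (s[a] + s[b]) & MASK; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK; s[b] = rotl32(s[b] ^ s[c], 7)


def double_round(s):
    """One column round followed by one diagonal round (in place)."""
    for a, b, c, d in ((0, 4, 8, 12), (1, 5, 9, 13),
                       (2, 6, 10, 14), (3, 7, 11, 15)):
        quarter_round(s, a, b, c, d)
    for a, b, c, d in ((0, 5, 10, 15), (1, 6, 11, 12),
                       (2, 7, 8, 13), (3, 4, 9, 14)):
        quarter_round(s, a, b, c, d)
```

ChaCha20 applies this double round ten times per block, which is why a single shared body covers every consumer.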
New Tests
- TTestChaCha7539ProcessBlocks2.Test256ByteProcessBlocks4VsTwoProcessBlocks2: ProcessBlocks4 (256B) == two sequential ProcessBlocks2 (128B) calls from the same key/nonce.
- TTestChaCha7539ProcessBlocks2.TestProcessBytesVsReturnByte: ProcessBytes (300B) == 300 sequential ReturnByte calls.
- TTestSalsa20.TestProcessBytesVsReturnByte
- TTestPoly1305.TestBlockUpdateOneShotVsChunked: one-shot BlockUpdate == chunked BlockUpdate with a {1,5,16,32,13} chunk pattern across 13 message lengths (0–2048 bytes).
- TTestPoly1305.TestLcgMessageBulkLengths
- TTestChaCha20Poly1305.TestAeadInputChunking: one-shot ProcessBytes (600B) == chunked ProcessBytes at offsets {1, 255, 256, 88} for both ciphertext and tag.

TTestChaCha7539ProcessBlocks2 also switches from IChaCha7539Engine to TChaCha7539Engine (concrete), adds try/finally guards around all engine instances, and removes the now-unused ClpIChaCha7539Engine import.
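
The one-shot versus chunked tests share one shape: feed the same input whole and in uneven pieces, then compare outputs. The Python sketch below shows that shape with hashlib.sha256 standing in for the streaming primitive; the stand-in, the function names, and the list of lengths are assumptions for illustration and are not taken from the test suite.

```python
import hashlib
from itertools import cycle


def one_shot_digest(data: bytes) -> bytes:
    """Feed the whole message in a single update."""
    return hashlib.sha256(data).digest()


def chunked_digest(data: bytes, pattern=(1, 5, 16, 32, 13)) -> bytes:
    """Feed the message in pieces whose sizes cycle through `pattern`."""
    h = hashlib.sha256()
    off = 0
    for size in cycle(pattern):
        if off >= len(data):
            break
        h.update(data[off:off + size])  # the last slice may be shorter
        off += size
    return h.digest()


# Hypothetical length set: 13 values between 0 and 2048 that straddle
# 16-byte and 64-byte boundaries.
LENGTHS = (0, 1, 15, 16, 17, 63, 64, 65, 255, 256, 1024, 2047, 2048)
```

Chunk sizes that are not multiples of the block size are what exercise the internal buffering paths, so a pattern like {1,5,16,32,13} is more useful than fixed-size chunks.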