
ChaCha7539/ChaCha20/Salsa20/Poly1305 SIMD #86

Merged
Xor-el merged 6 commits into vNext from
enhancement/chacha-and-salsa-simd-enhancements
Apr 28, 2026

Xor-el commented Apr 28, 2026

Overview

A broad throughput pass across the ChaCha20/ChaCha7539, Salsa20, and
Poly1305 primitives. Key additions: a 4-block (256-byte) fast path for
ChaCha7539 with a dedicated AVX2 kernel, a fully pipelined
ProcessBytes override for ChaCha7539 and Salsa20, a UInt64-wide XOR
tail that replaces the per-byte loop throughout, AVX2 4-way bulk
evaluation for Poly1305, a ProcessBlocks2 override on the Salsa20
base engine, and a shared ChaChaSse2_DoubleRoundBody.inc that
deduplicates the SSE2 double-round body across all consumers. The
ProcessBlocks2Sse2 stub is promoted from an inline two-call sequence
into its own proper SIMD kernel covering both x86_64 and i386.


TChaCha7539Engine (ClpChaCha7539Engine)

ProcessBlocks4 (new)

256-byte (4-block) variant. Dispatches to a new dedicated
ChaCha7539ProcessBlocks4Avx2 kernel on AVX2; falls back to two
sequential ProcessBlocks2 calls on everything else. The counter
overflow guard is checked before the AVX2 path is entered. New include
files: ChaCha7539ProcessBlocks4Avx2_x86_64.inc,
ChaCha7539ProcessBlocks4Avx2_i386.inc.

ProcessBlocks2 refactored

The previous SSE2 fallback (two separate ChaCha7539BlockSse2 calls +
manual XOR loops) is replaced by a proper dedicated two-block SSE2
kernel ChaCha7539ProcessBlocks2Sse2, covering both x86_64 and i386.
The {$IFDEF CRYPTOLIB_X86_64_ASM} guard around the AVX2 dispatch is
removed — both stubs are now bi-arch. New include files:
ChaCha7539ProcessBlocks2Sse2_x86_64.inc,
ChaCha7539ProcessBlocks2Sse2_i386.inc.

ProcessBytes override (new)

A fully pipelined ProcessBytes replaces the inherited Salsa20
implementation. The dispatch ladder, in priority order, is:

  1. Drain any bytes remaining in FKeyStream from a partial block.
  2. ProcessBlocks4 while ALen >= 256.
  3. ProcessBlocks2 while ALen >= 128.
  4. ImplProcessBlock while ALen >= 64.
  5. Scalar tail: GenerateKeyStream + AdvanceCounter + UInt64 XOR
    loop (8 words) + byte-level cleanup for the remainder.
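
The ladder above can be sketched in Python as follows. This is an illustration only, not the Pascal implementation: ToyStream, its SHA-512 counter keystream, and the calls bookkeeping are hypothetical stand-ins, and the 4-block, 2-block, and 1-block steps all share one helper here, whereas the real engine dispatches to separate SIMD kernels.

```python
import hashlib

BLOCK = 64  # bytes per keystream block


class ToyStream:
    """Toy counter-mode stream (SHA-512 of key + counter). NOT ChaCha."""

    def __init__(self, key: bytes):
        self.key = key
        self.counter = 0
        self.leftover = b""  # unused keystream from a partial block
        self.calls = {"x4": 0, "x2": 0, "x1": 0, "tail": 0}

    def _keystream(self, nblocks: int) -> bytes:
        out = b"".join(
            hashlib.sha512(
                self.key + (self.counter + i).to_bytes(8, "little")
            ).digest()
            for i in range(nblocks)
        )
        self.counter += nblocks
        return out

    def process_bytes(self, data: bytes) -> bytes:
        out = bytearray()
        # 1. Drain keystream left over from a previous partial block.
        take = min(len(self.leftover), len(data))
        out += bytes(a ^ b for a, b in zip(data[:take], self.leftover))
        self.leftover = self.leftover[take:]
        pos = take
        # 2-4. Widest step first: 256-byte, then 128-byte, then 64-byte.
        for name, nblocks in (("x4", 4), ("x2", 2), ("x1", 1)):
            size = nblocks * BLOCK
            while len(data) - pos >= size:
                ks = self._keystream(nblocks)
                out += bytes(a ^ b for a, b in zip(data[pos:pos + size], ks))
                self.calls[name] += 1
                pos += size
        # 5. Sub-block tail: generate one block, keep the unused part.
        if pos < len(data):
            ks = self._keystream(1)
            tail = data[pos:]
            out += bytes(a ^ b for a, b in zip(tail, ks))
            self.leftover = ks[len(tail):]
            self.calls["tail"] += 1
        return bytes(out)
```

Because step 1 always runs first, the block-sized steps only ever start on a block boundary, which is why chunked and one-shot calls yield the same output.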

DoFinal: ProcessBlocks4 drain + UInt64 XOR tail

DoFinal now drains 256-byte chunks via ProcessBlocks4 before the
existing 128-byte drain loop. The per-byte XOR in the tail keystream
path is replaced with a UInt64-wide XOR over AInLen shr 3 words
followed by a byte-level cleanup loop for the remaining 0–7 bytes.
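
A minimal Python sketch of the wide-XOR pattern described above: whole 8-byte words first, then a byte loop for the 0–7 leftover bytes. The function name xor_wide is hypothetical, and struct stands in for the PUInt64 pointer casts used in the Pascal code.

```python
import struct


def xor_wide(data: bytes, keystream: bytes) -> bytes:
    """XOR data with keystream, 8 bytes at a time, then byte by byte.

    Assumes len(keystream) >= len(data).
    """
    n = len(data)
    out = bytearray(n)
    words = n >> 3  # number of whole 64-bit words (AInLen shr 3)
    for i in range(words):
        off = i * 8
        (d,) = struct.unpack_from("<Q", data, off)
        (k,) = struct.unpack_from("<Q", keystream, off)
        struct.pack_into("<Q", out, off, d ^ k)
    # Byte-level cleanup for the remaining 0..7 bytes.
    for j in range(words * 8, n):
        out[j] = data[j] ^ keystream[j]
    return bytes(out)
```

XOR is bitwise, so the byte order chosen for the 64-bit loads does not affect the result as long as loads and stores agree.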

ImplProcessBlock — UInt64 XOR

The per-byte XOR loop (64 iterations) is replaced with 8 PUInt64
pointer-cast XOR operations via LInP, LOutP, LKeyP locals.

IChaCha7539Engine: ProcessBlocks2 removed

ProcessBlocks2 is removed from the interface. The method now lives
solely on TChaCha7539Engine (and its base TSalsa20Engine), letting
TChaCha20Poly1305 hold a concrete TChaCha7539Engine reference
rather than IChaCha7539Engine, giving it direct access to
ProcessBlocks4 without an extra interface method.


TChaChaEngine (ClpChaChaEngine)

ProcessBlocks2 override (new)

A ProcessBlocks2 override is added at the TChaChaEngine level. It
calls AssertInitialisedAndBlockAligned then delegates to two
ImplProcessBlock calls. This plugs the virtual dispatch slot for
non-7539 ChaCha variants so the base TSalsa20Engine.ProcessBlocks2
SSE4.1 path does not run on ChaCha (which uses a different state
format).


TSalsa20Engine (ClpSalsa20Engine)

AssertInitialisedAndBlockAligned (new)

A shared inline helper that checks FInitialised and FIndex = 0,
raising EInvalidOperationCryptoLibException with the appropriate
message for either failure. Replaces the inline duplicated checks in
ProcessBlock, ProcessBlocks2, and ProcessBlocks4.

ImplProcessBlock (new, promoted to strict protected)

Encapsulates GenerateKeyStream + AdvanceCounter + the 8-wide
PUInt64 XOR + write path. Used internally by ProcessBlocks2 (scalar
fallback) and TChaChaEngine.ProcessBlocks2.

ProcessBlocks2 override (new, virtual)

Added to TSalsa20Engine as a virtual method. Calls
AssertInitialisedAndBlockAligned, then dispatches to
Salsa20ProcessBlocks2Sse41 on SSE4.1; falls back to two
ImplProcessBlock calls. New include files:
Salsa20ProcessBlocks2Sse41_x86_64.inc,
Salsa20ProcessBlocks2Sse41_i386.inc.

ProcessBytes — batch dispatch rewrite

The old per-byte loop (one ReturnByte equivalent per iteration) is
replaced with a structured while-loop that:

  1. Drains the partial-block keystream buffer (FIndex <> 0).
  2. Dispatches ProcessBlocks2 while ALen >= 128.
  3. Dispatches ImplProcessBlock while ALen >= 64.
  4. Handles the sub-64 tail with GenerateKeyStream + AdvanceCounter
    + UInt64 XOR loop + byte cleanup.

SNotBlockAligned resource string added for the block-alignment error.


TPoly1305 (ClpPoly1305)

State consolidation — TPoly1305State

A new TPoly1305State record (72 bytes) collects R0..R4, S1..S4,
H0..H4, and K0..K3 into a single value. The previous 13 separate
FR*, FS*, FH*, FK* fields are replaced by FState: TPoly1305State.
A new FPowTable: TCryptoLibByteArray field holds the AVX2 power table,
and doubles as a dispatch flag (FPowTable <> nil iff a SIMD path is active).

New scalar primitives (unit-level, not methods)

  • Poly1305StateReset — zeroes H0..H4 in FState.
  • Poly1305StateAbsorbR — clamps and packs the first 16 key bytes into
    R0..R4 / S1..S4 using the RFC 7539 clamp masks.
  • Poly1305StateAbsorbS — packs the Poly1305 "s" key into K0..K3.
  • Poly1305StateProcessBlock — the (H + M) * r mod p step in
    radix-2^26, operating on TPoly1305State. Replaces the inline
    expansion previously duplicated in ProcessBlock and DoFinal.
  • Poly1305StateProcessBlocksScalar — iterates
    Poly1305StateProcessBlock N times; used as scalar bulk path and as
    tail handler for the AVX2 path.
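
To make the radix-2^26 step concrete, here is a hedged Python sketch of the clamp, the limb split, and one (H + M) * r mod p block step. The function names are hypothetical and do not mirror the Pascal signatures; Python's unbounded integers hide the carry-width concerns the real code must handle, and the final reduction uses a plain modulo as a shortcut.

```python
MASK26 = (1 << 26) - 1
P = (1 << 130) - 5
CLAMP = 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF  # RFC 7539 clamp for r


def to_limbs(x: int) -> list:
    """Split an integer below 2^130 into five 26-bit limbs."""
    return [(x >> (26 * i)) & MASK26 for i in range(5)]


def from_limbs(limbs) -> int:
    return sum(v << (26 * i) for i, v in enumerate(limbs))


def absorb_r(key16: bytes) -> list:
    """Clamp the first 16 key bytes and pack them into R0..R4."""
    return to_limbs(int.from_bytes(key16, "little") & CLAMP)


def process_block(h, r, block: bytes) -> list:
    """One step h <- (h + m) * r mod p on limbs (partially reduced)."""
    # Append the 0x01 pad byte just past the block's last byte.
    m = to_limbs(int.from_bytes(block, "little") | (1 << (8 * len(block))))
    h = [a + b for a, b in zip(h, m)]
    d = [0] * 5
    for k in range(5):
        for i in range(5):
            # Limbs that wrap past 2^130 pick up a factor of 5 (the S values).
            d[k] += h[i] * (r[k - i] if i <= k else 5 * r[5 + k - i])
    carry = 0
    for k in range(5):
        d[k] += carry
        carry, d[k] = d[k] >> 26, d[k] & MASK26
    d[0] += 5 * carry  # fold the top carry back in: 2^130 = 5 (mod p)
    d[1] += d[0] >> 26
    d[0] &= MASK26
    return d


def poly1305_mac(key32: bytes, msg: bytes) -> bytes:
    r = absorb_r(key32[:16])
    s = int.from_bytes(key32[16:], "little")  # the "s" half (K0..K3)
    h = [0] * 5
    for off in range(0, len(msg), 16):
        h = process_block(h, r, msg[off:off + 16])
    tag = (from_limbs(h) % P + s) & ((1 << 128) - 1)
    return tag.to_bytes(16, "little")
```

The precomputed 5 * r values correspond to S1..S4 in the state record: they fold the 2^130 wraparound into the multiply instead of reducing afterwards.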

AVX2 4-way bulk evaluation

  • Poly1305MulLimbs — 5-limb radix-2^26 field multiply used at SetKey
    time to derive r^2, r^3, r^4 for the power table.
  • Poly1305Avx2InitPowerTable — allocates a 320-byte power table and
    packs r^1..r^4 limbs + their 5x wraparound multipliers in the
    post-VPERMD layout expected by the assembly kernel.
  • Poly1305BlocksBulkAvx2Core — Pascal ABI wrapper around
    Poly1305BlocksBulkAvx2Core_x86_64.inc /
    Poly1305BlocksBulkAvx2Core_i386.inc. Takes a TPoly1305State
    pointer, the power table pointer, the input buffer pointer, byte
    length, and a padding flag.
  • Poly1305ProcessBlocksAvx2 — rounds ANumBlocks down to a multiple
    of 4, dispatches the AVX2 kernel for that bulk, forwards the 0–3
    remainder to Poly1305StateProcessBlocksScalar.
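
The algebra that makes a power table of r^1..r^4 useful can be checked with a short Python sketch. It uses plain big integers rather than limbs or vector lanes, the function names are hypothetical, and the lane arrangement inside the actual AVX2 kernel may differ from this straightforward form.

```python
P = (1 << 130) - 5


def horner_scalar(h: int, r: int, blocks) -> int:
    """Reference: one (h + m) * r mod p step per block."""
    for m in blocks:
        h = (h + m) * r % P
    return h


def horner_4way(h: int, r: int, blocks) -> int:
    """Four blocks per step using r^4, r^3, r^2, r; scalar tail for 0..3."""
    r1 = r % P
    r2 = r1 * r1 % P
    r3 = r2 * r1 % P
    r4 = r3 * r1 % P
    bulk = len(blocks) & ~3  # round down to a multiple of 4
    for i in range(0, bulk, 4):
        m0, m1, m2, m3 = blocks[i:i + 4]
        # Four sequential Horner steps expanded into one expression.
        h = ((h + m0) * r4 + m1 * r3 + m2 * r2 + m3 * r1) % P
    return horner_scalar(h, r, blocks[bulk:])
```

The four products in the bulk step are independent of one another, which is what allows them to be evaluated in parallel lanes.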

SetKey — SIMD table allocation

After populating FState via Poly1305StateAbsorbR / Poly1305StateAbsorbS,
SetKey resets FPowTable to nil then conditionally calls
Poly1305Avx2InitPowerTable when HasAVX2() is true.

BlockUpdate — batch dispatch

The inner while-loop is replaced with a block-count calculation
(LNb := LRemaining shr 4); when LNb > 0, the batch is routed to
Poly1305ProcessBlocksAvx2 (when FPowTable <> nil and HasAVX2())
or Poly1305StateProcessBlocksScalar.

ProcessBlock reduced to a one-liner

Delegates to Poly1305StateProcessBlock(FState, ABuf, AOff).


TChaCha20Poly1305 (ClpChaCha20Poly1305)

  • FChaCha20 type changed from IChaCha7539Engine to
    TChaCha7539Engine (concrete reference, enabling ProcessBlocks4).
  • ClpIChaCha7539Engine removed from the uses clause.
  • ProcessBlocks4 added (delegates to FChaCha20.ProcessBlocks4,
    increments FDataCount by 256).
  • ProcessBytes encrypt loop now drives a ProcessBlocks4 / AVX2
    drain pass (while AInOff <= LInLimit2 - BufSize * 2) before the
    existing ProcessBlocks2 loop. Decrypt path mirrors the same change.
  • LChaCha20Params construction simplified: as IParametersWithIV
    cast removed (the class reference is now used directly).
  • destructor Destroy added to free the owned TChaCha7539Engine.

Include file refactoring

ChaChaSse2_DoubleRoundBody.inc (new, shared)

The 28-instruction SSE2 double-round body (column round + diagonal
round, each using paddd/pxor/movdqa/psrld/pslld/por/pshufd)
is extracted into a single shared include. Previously duplicated
verbatim in both ChaCha20BlockSse2_x86_64.inc and
ChaCha20BlockSse2_i386.inc; both now reduce to
{$I ChaChaSse2_DoubleRoundBody.inc} inside their loop.
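
For readers unfamiliar with the row-vector formulation, the following Python sketch shows why a column round plus lane rotations reproduces the diagonal round. The helper names are hypothetical; rot_lanes plays the role that a lane shuffle such as pshufd would play in SSE2, and each list of four words stands in for one 128-bit register.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a, b, c, d):
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d


def double_round_scalar(state):
    """Column round then diagonal round, indexed word by word."""
    s = list(state)
    for a, b, c, d in ((0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                       (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)):
        s[a], s[b], s[c], s[d] = quarter_round(s[a], s[b], s[c], s[d])
    return s


def rows_quarter_round(a, b, c, d):
    """Apply the quarter-round to all four lanes of the rows at once."""
    lanes = [quarter_round(*t) for t in zip(a, b, c, d)]
    return [list(row) for row in zip(*lanes)]


def rot_lanes(row, n):
    return row[n:] + row[:n]


def double_round_rows(state):
    a, b, c, d = (list(state[i:i + 4]) for i in range(0, 16, 4))
    a, b, c, d = rows_quarter_round(a, b, c, d)               # column round
    b, c, d = rot_lanes(b, 1), rot_lanes(c, 2), rot_lanes(d, 3)
    a, b, c, d = rows_quarter_round(a, b, c, d)               # diagonal round
    b, c, d = rot_lanes(b, 3), rot_lanes(c, 2), rot_lanes(d, 1)
    return a + b + c + d
```

Rotating rows b, c, d by 1, 2, 3 lanes lines the diagonals up as columns, so the same vector quarter-round serves both halves of the double round.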

Shared AVX2 sub-includes (new)

The 128B AVX2 state-load and XOR-store sequences are factored out of
the monolithic ChaCha7539ProcessBlocks2Avx2_x86_64.inc into:

  • ChaCha7539_Avx2_128B_LoadState_x86_64.inc / _i386.inc
  • ChaCha7539_Avx2_128B_XorStore_x86_64.inc / _i386.inc
  • ChaCha7539_Avx2_DoubleRound.inc (architecture-neutral VEX db-encoded body)

These are composed by both ChaCha7539ProcessBlocks2Avx2_*.inc and
ChaCha7539ProcessBlocks4Avx2_*.inc, eliminating a second verbatim
copy of the AVX2 double-round body.


New Tests

| Test | What it covers |
| --- | --- |
| TTestChaCha7539ProcessBlocks2.Test256ByteProcessBlocks4VsTwoProcessBlocks2 | Asserts ProcessBlocks4(256B) == two sequential ProcessBlocks2(128B) calls from the same key/nonce. |
| TTestChaCha7539ProcessBlocks2.TestProcessBytesVsReturnByte | Asserts ProcessBytes(300B) == 300 sequential ReturnByte calls. |
| TTestSalsa20.TestProcessBytesVsReturnByte | Same parity check for Salsa20 over 400 bytes. |
| TTestPoly1305.TestBlockUpdateOneShotVsChunked | Asserts one-shot BlockUpdate == chunked BlockUpdate with a {1,5,16,32,13} chunk pattern across 13 message lengths (0–2048 bytes). |
| TTestPoly1305.TestLcgMessageBulkLengths | Same parity check using LCG-generated messages at bulk sizes (240, 4000, 8000, 272 bytes). |
| TTestChaCha20Poly1305.TestAeadInputChunking | Asserts one-shot ProcessBytes(600B) == chunked ProcessBytes at offsets {1, 255, 256, 88} for both ciphertext and tag. |

TTestChaCha7539ProcessBlocks2 also switches from IChaCha7539Engine
to TChaCha7539Engine (concrete), adds try/finally guards around
all engine instances, and removes the now-unused ClpIChaCha7539Engine
import.
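
The one-shot versus chunked parity pattern used by several of these tests can be sketched generically in Python. Keyed BLAKE2b from hashlib stands in for Poly1305 here because the standard library has no Poly1305; the helper names and the message lengths in the usage line are hypothetical, and only the {1,5,16,32,13} chunk pattern comes from the test description.

```python
import hashlib
from itertools import cycle


def one_shot(key: bytes, msg: bytes) -> bytes:
    """MAC the whole message in a single update."""
    return hashlib.blake2b(msg, key=key, digest_size=16).digest()


def chunked(key: bytes, msg: bytes, pattern=(1, 5, 16, 32, 13)) -> bytes:
    """MAC the same message fed in chunks whose sizes cycle through pattern."""
    mac = hashlib.blake2b(key=key, digest_size=16)
    pos = 0
    for size in cycle(pattern):
        if pos >= len(msg):
            break
        mac.update(msg[pos:pos + size])
        pos += size
    return mac.digest()


def parity_holds(key: bytes, lengths) -> bool:
    """True when chunked and one-shot tags agree for every length."""
    return all(
        one_shot(key, bytes(i & 0xFF for i in range(n)))
        == chunked(key, bytes(i & 0xFF for i in range(n)))
        for n in lengths
    )
```

A call such as parity_holds(b"demo-key", (0, 1, 15, 16, 17, 63, 64, 65, 2048)) exercises lengths on both sides of the 16-byte and 64-byte boundaries, which is where a batch dispatcher is most likely to mishandle buffered partial blocks.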

Xor-el merged commit 4e2e832 into vNext on Apr 28, 2026
11 checks passed
Xor-el deleted the enhancement/chacha-and-salsa-simd-enhancements branch on April 28, 2026 16:49