ChaCha7539/ChaCha20/Salsa20/Poly1305 SIMD by Xor-el · Pull Request #86 · Xor-el/CryptoLib4Pascal

Xor-el · 2026-04-28T16:41:48Z

Overview

A broad throughput pass across the ChaCha20/ChaCha7539, Salsa20, and
Poly1305 primitives. Key additions: a 4-block (256-byte) fast path for
ChaCha7539 with a dedicated AVX2 kernel, a fully pipelined
ProcessBytes override for ChaCha7539 and Salsa20, a UInt64-wide XOR
tail that replaces the per-byte loop throughout, AVX2 4-way bulk
evaluation for Poly1305, a ProcessBlocks2 override on the Salsa20
base engine, and a shared ChaChaSse2_DoubleRoundBody.inc that
deduplicates the SSE2 double-round body across all consumers. The
ProcessBlocks2Sse2 stub is promoted from an inline two-call sequence
into its own proper SIMD kernel covering both x86_64 and i386.

`TChaCha7539Engine` (`ClpChaCha7539Engine`)

`ProcessBlocks4` (new)

256-byte (4-block) variant. Dispatches to a new dedicated
ChaCha7539ProcessBlocks4Avx2 kernel on AVX2; falls back to two
sequential ProcessBlocks2 calls on everything else. The counter
overflow guard is checked before the AVX2 path is entered. New include
files: ChaCha7539ProcessBlocks4Avx2_x86_64.inc,
ChaCha7539ProcessBlocks4Avx2_i386.inc.

`ProcessBlocks2` refactored

The previous SSE2 fallback (two separate ChaCha7539BlockSse2 calls +
manual XOR loops) is replaced by a proper dedicated two-block SSE2
kernel ChaCha7539ProcessBlocks2Sse2, covering both x86_64 and i386.
The {$IFDEF CRYPTOLIB_X86_64_ASM} guard around the AVX2 dispatch is
removed — both stubs are now bi-arch. New include files:
ChaCha7539ProcessBlocks2Sse2_x86_64.inc,
ChaCha7539ProcessBlocks2Sse2_i386.inc.

`ProcessBytes` override (new)

A fully pipelined ProcessBytes replaces the inherited Salsa20
implementation. The dispatch ladder in priority order:

1. Drain any bytes remaining in FKeyStream from a partial block.
2. ProcessBlocks4 while ALen >= 256.
3. ProcessBlocks2 while ALen >= 128.
4. ImplProcessBlock while ALen >= 64.
5. Scalar tail: GenerateKeyStream + AdvanceCounter + UInt64 XOR
   loop (8 words) + byte-level cleanup for the remainder.
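
The ladder can be pictured with a small planning routine. This is only a sketch, not the engine code: `TLadderPlan` and `PlanLadder` are hypothetical names, the keystream buffer is assumed to be already drained (step 1 is skipped), and the routine merely counts how often each tier would fire for a given length.

```pascal
program DispatchLadderSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical record: how many times each tier of the ladder would fire.
  TLadderPlan = record
    Blocks4Calls: Integer; // 256-byte ProcessBlocks4 calls
    Blocks2Calls: Integer; // 128-byte ProcessBlocks2 calls
    Block1Calls: Integer;  // 64-byte ImplProcessBlock calls
    TailBytes: Integer;    // bytes left for the scalar tail (0..63)
  end;

// Walks the ladder for ALen bytes, widest tier first.
// Assumes no partial-block keystream is pending.
function PlanLadder(ALen: Integer): TLadderPlan;
begin
  Result.Blocks4Calls := 0;
  Result.Blocks2Calls := 0;
  Result.Block1Calls := 0;
  while ALen >= 256 do
  begin
    Inc(Result.Blocks4Calls);
    Dec(ALen, 256);
  end;
  while ALen >= 128 do
  begin
    Inc(Result.Blocks2Calls);
    Dec(ALen, 128);
  end;
  while ALen >= 64 do
  begin
    Inc(Result.Block1Calls);
    Dec(ALen, 64);
  end;
  Result.TailBytes := ALen;
end;

var
  LPlan: TLadderPlan;
begin
  LPlan := PlanLadder(500);
  WriteLn('ProcessBlocks4 calls:   ', LPlan.Blocks4Calls);
  WriteLn('ProcessBlocks2 calls:   ', LPlan.Blocks2Calls);
  WriteLn('ImplProcessBlock calls: ', LPlan.Block1Calls);
  WriteLn('scalar tail bytes:      ', LPlan.TailBytes);
end.
```

For 500 bytes the plan is one call per tier (256 + 128 + 64) with 52 bytes left for the scalar tail.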

`DoFinal` — `ProcessBlocks4` drain + UInt64 XOR tail

DoFinal now drains 256-byte chunks via ProcessBlocks4 before the
existing 128-byte drain loop. The per-byte XOR in the tail keystream
path is replaced with a UInt64-wide XOR over AInLen shr 3 words
followed by a byte-level cleanup loop for the remaining 0–7 bytes.
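
A minimal sketch of the word-wide XOR with byte cleanup, assuming a target that tolerates unaligned 64-bit loads (as x86 does). `XorWithKeyStream` and the `PU64` alias are illustrative names rather than the library's; the program checks the wide result against a plain per-byte XOR.

```pascal
program XorTailSketch;
{$mode objfpc}{$H+}

type
  PU64 = ^UInt64; // local alias, so the sketch does not rely on a predefined PUInt64

// XORs ALen bytes of AIn with AKey into AOut: whole 64-bit words first,
// then the remaining 0..7 bytes one at a time.
procedure XorWithKeyStream(AIn, AKey, AOut: PByte; ALen: Integer);
var
  LInP, LKeyP, LOutP: PU64;
  LWords, LI: Integer;
begin
  LInP := PU64(AIn);
  LKeyP := PU64(AKey);
  LOutP := PU64(AOut);
  LWords := ALen shr 3;
  for LI := 0 to LWords - 1 do
    LOutP[LI] := LInP[LI] xor LKeyP[LI];
  for LI := LWords shl 3 to ALen - 1 do
    AOut[LI] := AIn[LI] xor AKey[LI];
end;

const
  CLen = 21; // two full words plus five cleanup bytes
var
  LIn, LKey, LWide, LNarrow: array [0 .. CLen - 1] of Byte;
  LI: Integer;
  LMatch: Boolean;
begin
  // Arbitrary test pattern; LNarrow is the per-byte reference result.
  for LI := 0 to CLen - 1 do
  begin
    LIn[LI] := (LI * 7 + 1) and $FF;
    LKey[LI] := (LI * 13 + 5) and $FF;
    LNarrow[LI] := LIn[LI] xor LKey[LI];
  end;
  XorWithKeyStream(@LIn[0], @LKey[0], @LWide[0], CLen);
  LMatch := True;
  for LI := 0 to CLen - 1 do
    if LWide[LI] <> LNarrow[LI] then
      LMatch := False;
  if LMatch then
    WriteLn('wide XOR matches per-byte XOR')
  else
    WriteLn('MISMATCH');
end.
```

The fixed 64-byte case in ImplProcessBlock would correspond to running the first loop with exactly 8 words and no cleanup loop.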

`ImplProcessBlock` — UInt64 XOR

The per-byte XOR loop (64 iterations) is replaced with 8 PUInt64
pointer-cast XOR operations via LInP, LOutP, LKeyP locals.

`IChaCha7539Engine` — `ProcessBlocks2` removed

ProcessBlocks2 is removed from the interface. The method now lives
solely on TChaCha7539Engine (and its base TSalsa20Engine), letting
TChaCha20Poly1305 hold a concrete TChaCha7539Engine reference
rather than IChaCha7539Engine, giving it direct access to
ProcessBlocks4 without an extra interface method.

`TChaChaEngine` (`ClpChaChaEngine`)

`ProcessBlocks2` override (new)

A ProcessBlocks2 override is added at the TChaChaEngine level. It
calls AssertInitialisedAndBlockAligned then delegates to two
ImplProcessBlock calls. This plugs the virtual dispatch slot for
non-7539 ChaCha variants so the base TSalsa20Engine.ProcessBlocks2
SSE4.1 path does not run on ChaCha (which uses a different state
format).

`TSalsa20Engine` (`ClpSalsa20Engine`)

`AssertInitialisedAndBlockAligned` (new)

A shared inline helper that checks FInitialised and FIndex = 0,
raising EInvalidOperationCryptoLibException with the appropriate
message for either failure. Replaces the inline duplicated checks in
ProcessBlock, ProcessBlocks2, and ProcessBlocks4.
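
The shape of such a helper might look like the following sketch. Everything except the names `AssertInitialisedAndBlockAligned`, `FInitialised`, `FIndex`, and `SNotBlockAligned` is hypothetical: the exception class is a stand-in for EInvalidOperationCryptoLibException, the message texts are placeholders, and `TEngineSketch` only models the two flags.

```pascal
program AssertHelperSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

resourcestring
  // Placeholder texts, not the library's actual strings.
  SNotInitialised = 'engine not initialised';
  SNotBlockAligned = 'engine is not at a block boundary';

type
  // Stand-in for EInvalidOperationCryptoLibException.
  EInvalidOperationSketch = class(Exception);

  TEngineSketch = class
  private
    FInitialised: Boolean;
    FIndex: Integer;
    procedure AssertInitialisedAndBlockAligned;
  public
    procedure Init;
    procedure ReturnByte;   // consumes one keystream byte, moving FIndex off 0
    procedure ProcessBlock; // block entry point guarded by the shared check
  end;

procedure TEngineSketch.AssertInitialisedAndBlockAligned;
begin
  if not FInitialised then
    raise EInvalidOperationSketch.Create(SNotInitialised);
  if FIndex <> 0 then
    raise EInvalidOperationSketch.Create(SNotBlockAligned);
end;

procedure TEngineSketch.Init;
begin
  FInitialised := True;
  FIndex := 0;
end;

procedure TEngineSketch.ReturnByte;
begin
  FIndex := (FIndex + 1) and 63;
end;

procedure TEngineSketch.ProcessBlock;
begin
  AssertInitialisedAndBlockAligned;
  // ... the real block work would follow here ...
end;

// Calls ProcessBlock and reports whether the guard let it through.
procedure TryBlock(AEngine: TEngineSketch);
begin
  try
    AEngine.ProcessBlock;
    WriteLn('ok');
  except
    on E: EInvalidOperationSketch do
      WriteLn('rejected: ', E.Message);
  end;
end;

var
  LEngine: TEngineSketch;
begin
  LEngine := TEngineSketch.Create;
  try
    TryBlock(LEngine);  // rejected: not initialised
    LEngine.Init;
    TryBlock(LEngine);  // passes the guard
    LEngine.ReturnByte;
    TryBlock(LEngine);  // rejected: mid-block
  finally
    LEngine.Free;
  end;
end.
```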

`ImplProcessBlock` (new, promoted to `strict protected`)

Encapsulates GenerateKeyStream + AdvanceCounter + the 8-wide
PUInt64 XOR + write path. Used internally by ProcessBlocks2 (scalar
fallback) and TChaChaEngine.ProcessBlocks2.

`ProcessBlocks2` (new, virtual)

Added to TSalsa20Engine as a virtual method. Calls
AssertInitialisedAndBlockAligned, then dispatches to
Salsa20ProcessBlocks2Sse41 on SSE4.1; falls back to two
ImplProcessBlock calls. New include files:
Salsa20ProcessBlocks2Sse41_x86_64.inc,
Salsa20ProcessBlocks2Sse41_i386.inc.

`ProcessBytes` — batch dispatch rewrite

The old per-byte loop (one ReturnByte equivalent per iteration) is
replaced with a structured while-loop that:

1. Drains the partial-block keystream buffer (FIndex <> 0).
2. Dispatches ProcessBlocks2 while ALen >= 128.
3. Dispatches ImplProcessBlock while ALen >= 64.
4. Handles the sub-64 tail with GenerateKeyStream + AdvanceCounter
   + UInt64 XOR loop + byte cleanup.

SNotBlockAligned resource string added for the block-alignment error.

`TPoly1305` (`ClpPoly1305`)

State consolidation — `TPoly1305State`

A new TPoly1305State record (72 bytes) collects R0..R4, S1..S4,
H0..H4, and K0..K3 into a single value. The previous 18 separate
FR*, FS*, FH*, FK* fields are replaced by FState: TPoly1305State.
A new FPowTable: TCryptoLibByteArray field holds the AVX2 power table,
and doubles as a dispatch flag (FPowTable <> nil iff a SIMD path is active).

New scalar primitives (unit-level, not methods)

- Poly1305StateReset: zeroes H0..H4 in FState.
- Poly1305StateAbsorbR: clamps and packs the first 16 key bytes into
  R0..R4 / S1..S4 using the RFC 7539 clamp masks.
- Poly1305StateAbsorbS: packs the Poly1305 "s" key into K0..K3.
- Poly1305StateProcessBlock: the (H + M) * r mod p step in
  radix-2^26, operating on TPoly1305State. Replaces the inline
  expansion previously duplicated in ProcessBlock and DoFinal.
- Poly1305StateProcessBlocksScalar: iterates
  Poly1305StateProcessBlock N times; used as the scalar bulk path and as
  the tail handler for the AVX2 path.
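
For reference, the RFC 7539 clamp can be written at byte level as below. This is a sketch of the rule from the RFC, not of Poly1305StateAbsorbR itself, which per the description applies equivalent masks while packing into radix-2^26 limbs. `TKeyHalf` and `Clamped` are illustrative names; the input bytes are the "r" half of the key from the RFC 7539 section 2.5.2 example.

```pascal
program ClampSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TKeyHalf = array [0 .. 15] of Byte;

// Returns a clamped copy of the 16-byte little-endian "r" value:
// bytes 3, 7, 11, 15 lose their top four bits,
// bytes 4, 8, 12 lose their bottom two bits.
function Clamped(const AR: TKeyHalf): TKeyHalf;
begin
  Result := AR;
  Result[3] := Result[3] and 15;
  Result[7] := Result[7] and 15;
  Result[11] := Result[11] and 15;
  Result[15] := Result[15] and 15;
  Result[4] := Result[4] and 252;
  Result[8] := Result[8] and 252;
  Result[12] := Result[12] and 252;
end;

const
  // "r" half of the example key in RFC 7539 section 2.5.2.
  CR: TKeyHalf = ($85, $d6, $be, $78, $57, $55, $6d, $33,
                  $7f, $44, $52, $fe, $42, $d5, $06, $a8);
var
  LR: TKeyHalf;
  LI: Integer;
begin
  LR := Clamped(CR);
  for LI := 0 to 15 do
    Write(IntToHex(LR[LI], 2), ' ');
  WriteLn;
end.
```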

AVX2 4-way bulk evaluation

- Poly1305MulLimbs: 5-limb radix-2^26 field multiply used at SetKey
  time to derive r^2, r^3, r^4 for the power table.
- Poly1305Avx2InitPowerTable: allocates a 320-byte power table and
  packs r^1..r^4 limbs + their 5x wraparound multipliers in the
  post-VPERMD layout expected by the assembly kernel.
- Poly1305BlocksBulkAvx2Core: Pascal ABI wrapper around
  Poly1305BlocksBulkAvx2Core_x86_64.inc /
  Poly1305BlocksBulkAvx2Core_i386.inc. Takes a TPoly1305State
  pointer, the power table pointer, the input buffer pointer, byte
  length, and a padding flag.
- Poly1305ProcessBlocksAvx2: rounds ANumBlocks down to a multiple
  of 4, dispatches the AVX2 kernel for that bulk, forwards the 0–3
  remainder to Poly1305StateProcessBlocksScalar.
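
Why the table needs r^1..r^4 follows from unrolling the scalar step h ← (h + m) · r mod p four times. As a general identity (the kernel's actual lane arrangement may differ), for four consecutive padded blocks m_1..m_4:

$$
h' = \bigl((h + m_1)\,r^4 + m_2\,r^3 + m_3\,r^2 + m_4\,r\bigr) \bmod p,
\qquad p = 2^{130} - 5
$$

The four products are independent of one another, which is what lets them run in parallel lanes. The 5x multipliers come from the same modulus: since 2^130 ≡ 5 (mod p), any partial product that overflows past the fifth 26-bit limb wraps around multiplied by 5.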

`SetKey` — SIMD table allocation

After populating FState via Poly1305StateAbsorbR / Poly1305StateAbsorbS,
SetKey resets FPowTable to nil then conditionally calls
Poly1305Avx2InitPowerTable when HasAVX2() is true.

`BlockUpdate` — batch dispatch

The inner while-loop is replaced with a block-count calculation
(LNb := LRemaining shr 4); when LNb > 0, the batch is routed to
Poly1305ProcessBlocksAvx2 (when FPowTable <> nil and HasAVX2())
or Poly1305StateProcessBlocksScalar.
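
The arithmetic behind the batch split can be sketched as follows. `TBatchSplit` and `SplitBatch` are hypothetical names; only the `shr 4` block count and the round-down-to-4 rule are taken from the description.

```pascal
program BatchSplitSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical record describing where each part of the input would go.
  TBatchSplit = record
    BulkBlocks: Integer;    // multiple of 4, for the AVX2 kernel
    ScalarBlocks: Integer;  // 0..3 blocks for the scalar path
    LeftoverBytes: Integer; // partial block (0..15 bytes) kept for later
  end;

function SplitBatch(ARemaining: Integer): TBatchSplit;
var
  LNb: Integer;
begin
  LNb := ARemaining shr 4;               // whole 16-byte blocks
  Result.BulkBlocks := LNb and not 3;    // rounded down to a multiple of 4
  Result.ScalarBlocks := LNb and 3;      // remainder after the bulk
  Result.LeftoverBytes := ARemaining and 15;
end;

procedure Show(ARemaining: Integer);
var
  LSplit: TBatchSplit;
begin
  LSplit := SplitBatch(ARemaining);
  WriteLn(ARemaining, ' bytes: ', LSplit.BulkBlocks, ' bulk, ',
    LSplit.ScalarBlocks, ' scalar, ', LSplit.LeftoverBytes, ' leftover');
end;

begin
  Show(240);
  Show(272);
  Show(37);
end.
```

With 240 bytes the split is 12 bulk blocks plus 3 scalar blocks; with 272 bytes it is 16 bulk blocks plus 1 scalar block, so both lengths exercise the scalar remainder path after an AVX2 bulk.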

`ProcessBlock` reduced to a one-liner

Delegates to Poly1305StateProcessBlock(FState, ABuf, AOff).

`TChaCha20Poly1305` (`ClpChaCha20Poly1305`)

- FChaCha20 type changed from IChaCha7539Engine to
  TChaCha7539Engine (concrete reference, enabling ProcessBlocks4).
- ClpIChaCha7539Engine removed from the uses clause.
- ProcessBlocks4 added (delegates to FChaCha20.ProcessBlocks4,
  increments FDataCount by 256).
- ProcessBytes encrypt loop now drives a ProcessBlocks4 / AVX2
  drain pass (while AInOff <= LInLimit2 - BufSize * 2) before the
  existing ProcessBlocks2 loop. Decrypt path mirrors the same change.
- LChaCha20Params construction simplified: as IParametersWithIV
  cast removed (the class reference is now used directly).
- destructor Destroy added to free the owned TChaCha7539Engine.
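
The destructor is the natural consequence of swapping an interface field for a class field: an interface reference is released automatically by reference counting, while a class reference has to be freed by its owner. A minimal sketch of that ownership pattern, with hypothetical `TEngineSketch` and `TAeadSketch` types standing in for TChaCha7539Engine and TChaCha20Poly1305:

```pascal
program OwnershipSketch;
{$mode objfpc}{$H+}

type
  // Stand-in for the concrete engine class.
  TEngineSketch = class
  public
    destructor Destroy; override;
    procedure ProcessBlocks4;
  end;

  // Stand-in for the AEAD wrapper that owns the engine.
  TAeadSketch = class
  private
    FEngine: TEngineSketch; // class reference: no automatic release
    FDataCount: Int64;
  public
    constructor Create;
    destructor Destroy; override;
    procedure ProcessBlocks4;
    property DataCount: Int64 read FDataCount;
  end;

destructor TEngineSketch.Destroy;
begin
  WriteLn('engine freed');
  inherited Destroy;
end;

procedure TEngineSketch.ProcessBlocks4;
begin
  // ... 256 bytes of keystream work would happen here ...
end;

constructor TAeadSketch.Create;
begin
  inherited Create;
  FEngine := TEngineSketch.Create;
  FDataCount := 0;
end;

destructor TAeadSketch.Destroy;
begin
  FEngine.Free; // the owner must free the concrete engine explicitly
  inherited Destroy;
end;

procedure TAeadSketch.ProcessBlocks4;
begin
  FEngine.ProcessBlocks4; // direct call, no interface method needed
  Inc(FDataCount, 256);
end;

var
  LAead: TAeadSketch;
begin
  LAead := TAeadSketch.Create;
  try
    LAead.ProcessBlocks4;
    WriteLn('data count: ', LAead.DataCount);
  finally
    LAead.Free;
  end;
end.
```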

Include file refactoring

`ChaChaSse2_DoubleRoundBody.inc` (new, shared)

The 28-instruction SSE2 double-round body (column round + diagonal
round, each using paddd/pxor/movdqa/psrld/pslld/por/pshufd)
is extracted into a single shared include. Previously duplicated
verbatim in both ChaCha20BlockSse2_x86_64.inc and
ChaCha20BlockSse2_i386.inc; both now reduce to
{$I ChaChaSse2_DoubleRoundBody.inc} inside their loop.

Shared AVX2 sub-includes (new)

The 128B AVX2 state-load and XOR-store sequences are factored out of
the monolithic ChaCha7539ProcessBlocks2Avx2_x86_64.inc into:

- ChaCha7539_Avx2_128B_LoadState_x86_64.inc / _i386.inc
- ChaCha7539_Avx2_128B_XorStore_x86_64.inc / _i386.inc
- ChaCha7539_Avx2_DoubleRound.inc (architecture-neutral VEX db-encoded body)

These are composed by both ChaCha7539ProcessBlocks2Avx2_*.inc and
ChaCha7539ProcessBlocks4Avx2_*.inc, eliminating a second verbatim
copy of the AVX2 double-round body.

New Tests

| Test | What it covers |
| --- | --- |
| `TTestChaCha7539ProcessBlocks2.Test256ByteProcessBlocks4VsTwoProcessBlocks2` | Asserts `ProcessBlocks4(256B)` == two sequential `ProcessBlocks2(128B)` calls from the same key/nonce. |
| `TTestChaCha7539ProcessBlocks2.TestProcessBytesVsReturnByte` | Asserts `ProcessBytes(300B)` == 300 sequential `ReturnByte` calls. |
| `TTestSalsa20.TestProcessBytesVsReturnByte` | Same parity check for Salsa20 over 400 bytes. |
| `TTestPoly1305.TestBlockUpdateOneShotVsChunked` | Asserts one-shot `BlockUpdate` == chunked `BlockUpdate` with a `{1,5,16,32,13}` chunk pattern across 13 message lengths (0–2048 bytes). |
| `TTestPoly1305.TestLcgMessageBulkLengths` | Same parity check using LCG-generated messages at bulk sizes (240, 4000, 8000, 272 bytes). |
| `TTestChaCha20Poly1305.TestAeadInputChunking` | Asserts one-shot `ProcessBytes(600B)` == chunked `ProcessBytes` at offsets `{1, 255, 256, 88}` for both ciphertext and tag. |

TTestChaCha7539ProcessBlocks2 also switches from IChaCha7539Engine
to TChaCha7539Engine (concrete), adds try/finally guards around
all engine instances, and removes the now-unused ClpIChaCha7539Engine
import.

Xor-el added 6 commits April 27, 2026 10:38

- initial optimizations (ddf315e)
- some more perf enhancements (fe012d0)
- minor cleanup (cff01a6)
- add Avx2 for I386 ChaCha (99feb04)
- Avx2 Implementation of Poly1305 (0751e3a)
- minor cleanup in Poly1305Tests (a29cf16)

Xor-el merged commit 4e2e832 into vNext Apr 28, 2026
11 checks passed

Xor-el deleted the enhancement/chacha-and-salsa-simd-enhancements branch April 28, 2026 16:49
