v0.6.1 - vectorized executor follow-ups + mypy strict fix

@dataeducator dataeducator released this 13 May 01:11
· 23 commits to main since this release

Patch release. Three fixes layered on v0.6.0:

  1. mypy --strict regression in update_swarm_vectorized (51c87f9)
  2. Sprint 8 / G11 — batched fairness gradient (a3eda6e)
  3. Sprint 8 / G12 — threads chunked dispatch + honest backend docs (dad8f9e)

No public-API change. Defaults unchanged. pip install fairswarm==0.6.1 is a drop-in upgrade from 0.6.0.

What this release closes

G11 — Batched fairness gradient

update_swarm_vectorized shipped in v0.6.0 with a residual per-particle
Python loop calling compute_fairness_gradient once per particle. That
loop was the dominant cost in the vectorized path and capped its
speedup at ~1.7x serial.

v0.6.1 adds compute_fairness_gradient_batched(positions, clients, target)
that returns the entire (P, n) gradient matrix in a single matmul plus
a handful of vector ops:

S    = X.sum(axis=1) + eps             # (P,)    per-particle position sum
W    = X / S[:, None]                  # (P, n)  soft selection weights
C    = normalize(W @ D)                # (P, k)  soft coalition demographics
G    = log(C / target_safe) + 1        # (P, k)  KL gradient w.r.t. C
M    = G @ D.T                         # (P, n)  one matmul
q    = (G * C).sum(axis=1)             # (P,)    per-particle scalar
grad = -(M - q[:, None]) / S[:, None]  # (P, n)  output

Equivalence: batched output agrees with the per-particle reference
to within 4.12e-17 absolute error on a (P=25, n=30, k=5) sweep, i.e.
identical up to floating-point rounding. Per-particle norm clipping
(max_grad_norm=10.0) is preserved row by row, so each particle's
restoring force stays proportional to its own divergence (Theorem 2's
drift analysis is per-particle).
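
The formulas above can be sketched in NumPy roughly as follows. This is an illustrative reconstruction, not the library's code: the function names, the eps value, and the exact form of the normalize step (floor at eps, then rescale each row to sum to 1) are assumptions. The per-particle function plays the role of the reference that the batched version is compared against.

```python
import numpy as np

EPS = 1e-12  # assumed value; the release notes do not state eps


def fairness_gradient_single(x, D, target, max_grad_norm=10.0):
    """Hypothetical per-particle reference. x: (n,), D: (n, k), target: (k,)."""
    target_safe = np.maximum(target, EPS)
    s = x.sum() + EPS
    c = np.maximum((x / s) @ D, EPS)
    c = c / c.sum()                      # assumed form of the normalize step
    g = np.log(c / target_safe) + 1.0    # KL gradient w.r.t. c
    grad = -(D @ g - g @ c) / s
    norm = np.linalg.norm(grad)
    if norm > max_grad_norm:             # clip this particle on its own norm
        grad = grad * (max_grad_norm / norm)
    return grad


def fairness_gradient_batched(X, D, target, max_grad_norm=10.0):
    """Hypothetical batched version. X: (P, n), D: (n, k), target: (k,)."""
    target_safe = np.maximum(target, EPS)
    S = X.sum(axis=1) + EPS                    # (P,)
    W = X / S[:, None]                         # (P, n)
    C = np.maximum(W @ D, EPS)                 # (P, k)
    C = C / C.sum(axis=1, keepdims=True)       # assumed form of the normalize step
    G = np.log(C / target_safe) + 1.0          # (P, k)
    M = G @ D.T                                # (P, n), the single large matmul
    q = (G * C).sum(axis=1)                    # (P,)
    grad = -(M - q[:, None]) / S[:, None]      # (P, n)
    # Row-wise clipping: each row is rescaled using only its own L2 norm.
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_grad_norm / np.maximum(norms, EPS))
    return grad * scale


rng = np.random.default_rng(0)
P, n, k = 25, 30, 5
X = rng.random((P, n))
D = rng.dirichlet(np.ones(k), size=n)   # one demographic distribution per client
target = np.full(k, 1.0 / k)

batched = fairness_gradient_batched(X, D, target)
reference = np.vstack([fairness_gradient_single(x, D, target) for x in X])
max_abs_err = np.abs(batched - reference).max()
```

Because the clip scale is computed per row, stacking more particles into X never changes the gradient any single particle receives, which is the property the row-by-row clipping is meant to keep.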

Wall-clock impact (commodity CPU, 30 iter, 3 repeats):

  P     n    serial      vectorized v0.6.0    vectorized v0.6.1    speedup (v0.6.1 vs serial)
 30    50     1284 ms     299 ms (4.3x)        257 ms               5.0x
100    50     3278 ms    2945 ms (1.1x)        965 ms               3.4x
200    50     7232 ms    7721 ms (0.9x)       1291 ms               5.6x
 30   200     2722 ms    2806 ms (1.0x)        306 ms               8.9x
100   200     9650 ms    8973 ms (1.1x)       1324 ms               7.3x
200   200    18439 ms    7821 ms (1.7x)       1750 ms              10.5x

Speedup over serial now falls in the 3-10x range across the whole
(P, n) grid rather than being concentrated at the single largest cell.

G12 — Threads chunked dispatch

The threads executor previously submitted one Future per particle.
ThreadPoolExecutor.submit + Future.result has ~50-200us of round-trip
overhead per dispatch on a 4-worker pool (Windows numbers; lower on
Linux). For lightweight fitness functions like DemographicFitness
(~30-100us per evaluation), per-particle dispatch loses to the serial
baseline because the dispatch tax exceeds the work.

v0.6.1 changes the threads dispatch to bundle particles into
n_workers contiguous chunks and submit one task per chunk. Each
task processes its slice serially within one worker thread, then
returns the slice; the executor re-stitches the slices in input order.
Worst-case threads performance improved from 0.14x to 0.41x of serial throughput.
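
The chunked dispatch described above can be sketched with the standard library alone. This is a minimal illustration, not the library's executor: chunk_bounds and evaluate_chunked are hypothetical names, and the rule for distributing leftover particles across chunks is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n_items, n_chunks):
    """Split range(n_items) into at most n_chunks contiguous (start, stop) pairs."""
    n_chunks = max(1, min(n_chunks, n_items))
    base, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def evaluate_chunked(fitness, particles, n_workers=4):
    """Evaluate fitness over particles with one submitted task per chunk."""

    def run_slice(start, stop):
        # Each task walks its slice serially inside a single worker thread.
        return [fitness(p) for p in particles[start:stop]]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(run_slice, start, stop)
            for start, stop in chunk_bounds(len(particles), n_workers)
        ]
        # Collect in submission order so the stitched result matches input order.
        results = []
        for future in futures:
            results.extend(future.result())
    return results


particles = list(range(10))
scores = evaluate_chunked(lambda p: p * p, particles, n_workers=4)
```

With this layout the number of submit/result round trips is bounded by n_workers rather than by the number of particles, which is where the dispatch overhead is saved.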

Threads still loses to serial on lightweight fitness, and that is
expected. The honest answer (now in the module docstring) is that
threads is for fitness functions that release the GIL for a
substantial share of their wall time:

  • a federated training round (network or disk I/O bound),
  • a large BLAS-heavy accuracy evaluation,
  • a fitness that calls into a C extension for milliseconds.

For fast closed-form scores (the common case in coalition-selection
benchmarks), use vectorized (3-10x speedup) or serial.

mypy --strict regression fix

The check if ctx0.target_distribution is not None in
update_swarm_vectorized narrowed only ctx0.target_distribution,
not the loop-local ctx.target_distribution, so mypy --strict still
saw an Optional value inside the loop. v0.6.1 binds the narrowed
reference once outside the loop. All contexts in one iteration share
the same snapshot anyway, so this is also a tiny performance win
(no attribute access on each pass through the loop).
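
The narrowing pattern can be illustrated with a small stand-alone example. The Context class and weighted_scores function below are hypothetical stand-ins, not the library's types; only the bind-once pattern is the point.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Context:
    """Hypothetical stand-in for a per-particle update context."""
    position: List[float]
    target_distribution: Optional[List[float]]


def weighted_scores(contexts: Sequence[Context]) -> List[float]:
    ctx0 = contexts[0]
    if ctx0.target_distribution is None:
        return [0.0 for _ in contexts]
    # Bind the narrowed value once. `target` is List[float] from here on,
    # whereas `ctx.target_distribution` inside the loop would still be
    # Optional[List[float]] as far as the type checker is concerned.
    target = ctx0.target_distribution
    scores = []
    for ctx in contexts:
        scores.append(sum(p * t for p, t in zip(ctx.position, target)))
    return scores


shared_target = [0.5, 0.5]
contexts = [
    Context(position=[1.0, 2.0], target_distribution=shared_target),
    Context(position=[4.0, 0.0], target_distribution=shared_target),
]
scores = weighted_scores(contexts)
```

This sketch relies on the same assumption the release notes state: every context in one iteration carries the same target snapshot, so reading it from the first context is safe.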

Tests & quality

  • 845 passing (+4 vs v0.6.0), 1 skipped, 0 failing
  • mypy --strict: clean across 48 source files
  • Four new tests pin the new contracts:
    • TestBatchedFairnessGradient::test_batched_matches_per_particle_machine_precision
    • TestBatchedFairnessGradient::test_batched_clip_matches_per_particle_clip
    • TestThreadsChunkedDispatch::test_threads_chunked_result_in_input_order
    • TestThreadsChunkedDispatch::test_threads_chunked_matches_serial_n_workers_invariant

Install

pip install --upgrade fairswarm==0.6.1

Reproducibility

Re-run the vectorized executor benchmark:

python experiments/bench_vectorized.py

The post-G11+G12 artifact is committed at
results/parallel_speedup/bench_vectorized_20260513_010046.json.

Reference

T. Norwood, D. Das, P. Chatterjee, E. Bentley, and U. Ghosh,
"FairSwarm: Trustworthy Coalition Selection for Fair and Secure
Federated Intelligence," IEEE Trans. Consum. Electron., 2026 (Submitted).