v0.6.1 - vectorized executor follow-ups + mypy strict fix
Patch release. Three fixes layered on v0.6.0:
- mypy --strict regression in update_swarm_vectorized (51c87f9)
- Sprint 8 / G11: batched fairness gradient (a3eda6e)
- Sprint 8 / G12: threads chunked dispatch + honest backend docs (dad8f9e)
No public-API change. Defaults unchanged. pip install fairswarm==0.6.1 is a drop-in upgrade from 0.6.0.
What this release closes
G11 — Batched fairness gradient
update_swarm_vectorized shipped in v0.6.0 with a residual per-particle
Python loop calling compute_fairness_gradient once per particle. That
loop was the dominant cost in the vectorized path and capped its
speedup at ~1.7x serial.
v0.6.1 adds compute_fairness_gradient_batched(positions, clients, target)
that returns the entire (P, n) gradient matrix in a single matmul plus
a handful of vector ops:
S = X.sum(axis=1) + eps                  # (P,)   per-particle position sum
W = X / S[:, None]                       # (P, n) soft selection weights
C = W @ D                                # (P, k) soft coalition demographics (normalized)
G = log(C / target_safe) + 1             # (P, k) KL gradient w.r.t. C
M = G @ D.T                              # (P, n) one matmul
q = (G * C).sum(axis=1)                  # (P,)   per-particle scalar
grad = -(M - q[:, None]) / S[:, None]    # (P, n) output
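The recipe above can be sketched in NumPy and checked against a per-particle loop. This is a minimal illustration, not the library's implementation: the function names kl_gradient_single and kl_gradient_batched are hypothetical, D is assumed to be an (n, k) matrix of per-client demographic rows, and the normalization step is omitted in favor of a small clamp before the log.

```python
import numpy as np

EPS = 1e-12  # assumed guard value; the release notes only name it "eps"


def kl_gradient_single(x, D, target):
    """Per-particle reference for one position vector x of shape (n,)."""
    S = x.sum() + EPS
    c = np.maximum((x / S) @ D, EPS)            # (k,) soft coalition demographics
    g = np.log(c / np.maximum(target, EPS)) + 1.0
    return -((D @ g) - (g @ c)) / S             # (n,)


def kl_gradient_batched(X, D, target):
    """Same computation for all P particles at once; X has shape (P, n)."""
    S = X.sum(axis=1) + EPS                     # (P,)
    W = X / S[:, None]                          # (P, n)
    C = np.maximum(W @ D, EPS)                  # (P, k)
    G = np.log(C / np.maximum(target, EPS)) + 1.0
    M = G @ D.T                                 # (P, n), the single large matmul
    q = (G * C).sum(axis=1)                     # (P,)
    return -(M - q[:, None]) / S[:, None]       # (P, n)


rng = np.random.default_rng(0)
P, n, k = 25, 30, 5
X = rng.uniform(0.05, 1.0, size=(P, n))
D = rng.dirichlet(np.ones(k), size=n)           # each client row sums to 1
target = rng.dirichlet(np.ones(k))

batched = kl_gradient_batched(X, D, target)
looped = np.stack([kl_gradient_single(x, D, target) for x in X])
max_abs_err = float(np.abs(batched - looped).max())
```

The exact value of max_abs_err depends on the BLAS build and the random inputs, so it should be compared against a tolerance rather than a fixed number.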
Equivalence: batched output agrees with the per-particle reference
to 4.12e-17 absolute error on a (P=25, n=30, k=5) sweep, i.e. the two
paths are identical up to floating-point rounding. Per-particle norm
clipping (max_grad_norm=10.0) is preserved row-by-row so each particle's
restoring force stays proportional to its own divergence (Theorem 2's
drift analysis is per-particle).
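Row-by-row clipping on a (P, n) gradient matrix can be sketched as follows. The helper name clip_rows_by_norm is hypothetical, and the choice of the L2 norm is an assumption; the release notes only state that clipping is per-particle with max_grad_norm=10.0.

```python
import numpy as np


def clip_rows_by_norm(grad, max_grad_norm=10.0):
    """Scale each row independently so that its L2 norm is at most max_grad_norm."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)            # (P, 1)
    # Rows already inside the bound get scale 1.0; the floor avoids 0/0.
    scale = np.minimum(1.0, max_grad_norm / np.maximum(norms, 1e-12))
    return grad * scale


grad = np.array([[3.0, 4.0],      # norm 5: left untouched
                 [30.0, 40.0]])   # norm 50: scaled down to norm 10
clipped = clip_rows_by_norm(grad)
```

Clipping the whole matrix by a single global norm would instead shrink the first row because of the second, which is the coupling that per-row clipping avoids.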
Wall-clock impact (commodity CPU, 30 iterations, 3 repeats):
| P | n | serial | vectorized v0.6.0 (vs serial) | vectorized v0.6.1 | v0.6.1 speedup vs serial |
|---|---|---|---|---|---|
| 30 | 50 | 1284 ms | 299 ms (4.3x) | 257 ms | 5.0x |
| 100 | 50 | 3278 ms | 2945 ms (1.1x) | 965 ms | 3.4x |
| 200 | 50 | 7232 ms | 7721 ms (0.9x) | 1291 ms | 5.6x |
| 30 | 200 | 2722 ms | 2806 ms (1.0x) | 306 ms | 8.9x |
| 100 | 200 | 9650 ms | 8973 ms (1.1x) | 1324 ms | 7.3x |
| 200 | 200 | 18439 ms | 7821 ms (1.7x) | 1750 ms | 10.5x |
Speedup over serial now ranges from 3.4x to 10.5x across the whole
(P, n) grid rather than being concentrated at the single largest cell.
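The v0.6.1 speedup column is the serial time divided by the v0.6.1 vectorized time. A short check using only the serial and v0.6.1 timings from the table:

```python
# (P, n, serial_ms, vectorized_v061_ms), copied from the table above
rows = [
    (30, 50, 1284, 257),
    (100, 50, 3278, 965),
    (200, 50, 7232, 1291),
    (30, 200, 2722, 306),
    (100, 200, 9650, 1324),
    (200, 200, 18439, 1750),
]

speedups = {(P, n): round(serial / v061, 1) for P, n, serial, v061 in rows}
lowest, highest = min(speedups.values()), max(speedups.values())
print(lowest, highest)  # 3.4 10.5
```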
G12 — Threads chunked dispatch
The threads executor previously submitted one Future per particle.
ThreadPoolExecutor.submit + Future.result has ~50-200us of round-trip
overhead per dispatch on a 4-worker pool (Windows numbers; lower on
Linux). For lightweight fitness functions like DemographicFitness
(~30-100us per evaluation), per-particle dispatch loses to the serial
baseline because the dispatch tax exceeds the work.
v0.6.1 changes the threads dispatch to bundle particles into
n_workers contiguous chunks and submit one task per chunk. Each
task processes its slice serially within one worker thread, then
returns the slice; the executor re-stitches the slices in input order.
Worst-case threads improved from 0.14x to 0.41x of serial.
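The chunked dispatch described above can be sketched with the standard library. This is an illustration under assumptions, not the executor's actual code: split_contiguous and evaluate_chunked are hypothetical names, and Executor.map is used because it submits one task per chunk and yields results in input order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_contiguous(items: Sequence[T], n_chunks: int) -> List[Sequence[T]]:
    """Split items into at most n_chunks contiguous, near-equal slices."""
    n_chunks = max(1, min(n_chunks, len(items)))
    base, extra = divmod(len(items), n_chunks)
    chunks: List[Sequence[T]] = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)   # first `extra` chunks get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks


def evaluate_chunked(
    fitness: Callable[[T], R], particles: Sequence[T], n_workers: int = 4
) -> List[R]:
    """Evaluate fitness over particles with one pool task per chunk."""
    if not particles:
        return []

    def run_chunk(chunk: Sequence[T]) -> List[R]:
        # Each task walks its slice serially inside a single worker thread.
        return [fitness(p) for p in chunk]

    chunks = split_contiguous(particles, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves chunk order, so flattening restores input order.
        return [r for part in pool.map(run_chunk, chunks) for r in part]


particles = list(range(10))
scores = evaluate_chunked(lambda p: p * p, particles, n_workers=4)
```

With 10 particles and 4 workers this submits 4 tasks (slice sizes 3, 3, 2, 2) instead of 10, so the per-dispatch overhead is paid once per chunk rather than once per particle.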
Threads still loses to serial on lightweight fitness — and that's
expected. The honest answer (now in the module docstring) is that
threads is for fitness functions that release the GIL for substantial
wall-time:
- a federated training round (network or disk I/O bound),
- a large BLAS-heavy accuracy evaluation,
- a fitness that calls into a C extension for milliseconds.
For fast closed-form scores (the common case in coalition-selection
benchmarks), use vectorized (3-10x speedup) or serial.
mypy --strict regression fix
The narrowing check if ctx0.target_distribution is not None in
update_swarm_vectorized narrowed only ctx0.target_distribution,
not the loop-local ctx.target_distribution, so mypy --strict still
saw an Optional value inside the loop. v0.6.1 binds the narrowed
reference to a local once, outside the loop. All contexts in one
iteration share the same snapshot anyway, so this is also a tiny
performance win (no repeated attribute access inside the loop).
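The narrowing pattern can be illustrated in isolation. The ParticleContext class and scale_targets function below are hypothetical stand-ins, not FairSwarm types; the sketch only shows why binding the narrowed value to a local satisfies a strict type checker.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ParticleContext:  # hypothetical stand-in for the per-particle context
    target_distribution: Optional[Sequence[float]]


def scale_targets(contexts: List[ParticleContext], factor: float) -> List[List[float]]:
    if not contexts:
        return []
    # Bind once, then narrow the local. Checking contexts[0].target_distribution
    # directly would tell the type checker nothing about ctx.target_distribution
    # on the loop variable, which would still be Optional inside the loop.
    target = contexts[0].target_distribution
    if target is None:
        return []
    out: List[List[float]] = []
    for _ctx in contexts:
        # `target` is Sequence[float] here, so iterating it type-checks.
        out.append([factor * t for t in target])
    return out


ctxs = [ParticleContext([0.5, 0.5]), ParticleContext([0.5, 0.5])]
scaled = scale_targets(ctxs, 2.0)
```

This relies on the same premise as the fix itself: every context in the batch carries the same target snapshot, so reading it from the first context is sufficient.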
Tests & quality
- 845 passing (+4 vs v0.6.0), 1 skipped, 0 failing
- mypy --strict: clean across 48 source files
- Four new tests pin the new contracts:
  - TestBatchedFairnessGradient::test_batched_matches_per_particle_machine_precision
  - TestBatchedFairnessGradient::test_batched_clip_matches_per_particle_clip
  - TestThreadsChunkedDispatch::test_threads_chunked_result_in_input_order
  - TestThreadsChunkedDispatch::test_threads_chunked_matches_serial_n_workers_invariant
Install
pip install --upgrade fairswarm==0.6.1
Reproducibility
Re-run the vectorized executor benchmark:
python experiments/bench_vectorized.py
The post-G11+G12 artifact is committed at
results/parallel_speedup/bench_vectorized_20260513_010046.json.
Reference
T. Norwood, D. Das, P. Chatterjee, E. Bentley, and U. Ghosh,
"FairSwarm: Trustworthy Coalition Selection for Fair and Secure
Federated Intelligence," IEEE Trans. Consum. Electron., 2026 (Submitted).