You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This commit was created on GitHub.com and signed with GitHub’s verified signature.
feat: k-mer-seeded top-k ungapped local alignment (0.6.0) (#32)
* feat: add k-mer-seeded top-k ungapped local alignment
`top_k_ungapped_local_align_kmer(seqa, seqb, score_matrix, k,
kmer_size, max_hits_per_kmer, filter_overlap_a=True,
filter_overlap_b=True)` is a faster alternative to
`top_k_ungapped_local_align` for callers that work on long inputs
over a moderate-sized alphabet.
Algorithm: BLAST-style k-mer seeding plus the existing per-diagonal
positive-segment scan as the extension.
1. Build a rolling-hash k-mer index of `seqa`
(packed k-mer -> sa positions; `kmer_size` in `[1, 8]`, packed
into a `u64`).
2. Walk `seqb`'s rolling k-mer; for each lookup, collect the set of
hit diagonals (`diag = sa_pos - sb_pos`). Skip k-mers whose
`seqa`-occurrence count exceeds `max_hits_per_kmer` to bound
low-complexity blowup (long runs of spaces, common bigrams, etc.).
3. Run the per-diagonal positive-segment scan *only* on diagonals
with at least one hit, into the same candidate heap.
4. Pop top-`k` non-overlapping via the existing overlap-filter logic.
The output `Alignment` shape and overlap semantics are identical to
`top_k_ungapped_local_align`, so the new function is a drop-in
replacement at the seed step of any caller.
Implementation reuses the existing diagonal scan by extracting two
helpers out of `_top_k_ungapped_local_align_core`:
- `process_diagonal_into_candidates`: SW-style positive-segment
scan along one diagonal of (sb x sa), pushing every positive HSP
peak into a `BinaryHeap<Candidate>`.
- `select_top_k_with_overlap_filter`: pop top-`k` from the heap,
skipping HSPs that overlap an already-accepted one on `A` and/or
`B`.
Both `_top_k_ungapped_local_align_core` and the new k-mer variant
share these helpers. Behaviour-preserving refactor — all existing
top-k tests pass unchanged.
Correctness caveat (documented on the new function): an HSP is found
iff at least one window of `kmer_size` consecutive exact-match bytes
lies on its diagonal between the two sequences. HSPs whose match
runs are all strictly shorter than `kmer_size` are missed. Over
text-like alphabets at `kmer_size <= 5` this only happens at very
small HSP scores; callers that need exhaustive coverage at the
boundary should keep using `top_k_ungapped_local_align` or fall
through to a full local SW (as `anchorite.chained_alignment`'s
recursive gap-fill does).
New tests in `tests/test_seq_smith.py`:
- simple HSP recovery (matches the existing
`test_top_k_ungapped_simple` case),
- near-identity equivalence with the full-scan top-k on random
DNA at the high end of the score distribution,
- overlap filter on `B` (matches `test_top_k_ungapped_overlap`),
- top-`k` limit (matches `test_top_k_ungapped_limit`),
- `kmer_size` validation rejects 0 and 9+ with `ValueError`,
- sequences shorter than `kmer_size` return `[]`,
- `max_hits_per_kmer=0` returns `[]` even on a poly-A identity
pair (the cap that protects against low-complexity blowup).
* feat: add `min_kmer_hits_per_diagonal` threshold to k-mer seeder
`top_k_ungapped_local_align_kmer(..., min_kmer_hits_per_diagonal=1)`
now exposes a per-diagonal hit-count floor that runs the extension
scan only on diagonals whose accumulated k-mer hit count is at least
the given threshold. Defaults to `1` (preserves the previous "any
hit triggers extension" behaviour exactly).
Why: profiling on text-like inputs (markdown vs PDF-extracted text
over a ~37-character alphabet) showed `max_hits_per_kmer` alone is a
weak filter. At cap=16 on a 17K x 17K pair, ~34% of grid diagonals
still see a hit; most of them carry only 1-2 random k-mer collisions
and contribute no real HSP, but every one still pays a full
positive-segment scan.
Per-diagonal hit count, in contrast, is a principled signal:
under +1/-1-style scoring a match-run of length L deposits
`L - kmer_size + 1` hits on its diagonal, so thresholding
"hits >= T" filters out diagonals whose best possible ungapped HSP
score cannot exceed a small constant. This is a per-diagonal
score-floor pre-filter that costs ~O(1) per diagonal versus running
the extension.
End-to-end on a 39-pair markdown/PDF corpus (kmer_size=5,
max_hits_per_kmer=16, k=500), full-scan vs. k-mer-seeded with the
new threshold:
pair (n x m) full T=1 T=5 T=8
17K x 17K 2364 ms 896 ms 206 ms 91 ms
(2.6x) (11x) (25x)
73K x 72K 13038 ms 4362 ms 1209 ms 599 ms
(2.7x) (10x) (20x)
50K x 55K 4379 ms 2479 ms 778 ms 413 ms
(1.8x) (5.6x) (11x)
Score-band recall vs. the full-scan baseline:
- HSPs with score >= 50: **100% recall on every pair at every T
tested (1..21)**. This is the band that drives chain anchoring
in seed-and-extend callers like `anchorite.chained_alignment`.
- HSPs with 10 <= score < 50: > 0.92 recall up to T=5, degrades
gradually thereafter.
Setting `min_kmer_hits_per_diagonal=0` is the escape hatch: the
function then bypasses k-mer seeding entirely and delegates to
`_top_k_ungapped_local_align_core` (the exhaustive scan). Callers
who need exhaustive coverage at the low-score boundary can use this
without switching APIs.
Internal: the per-diagonal hit-count table is a dense
`Vec<u32>` of length `sa_len + sb_len - 1`, indexed by
`diag + (sb_len - 1)`. At text-alphabet densities most diagonals
see >= 1 hit so a HashMap is mostly populated and pays allocator
overhead for no benefit; the Vec is ~580 KB worst case on the
corpus and is ~10% faster end-to-end at low T.
New tests in `tests/test_seq_smith.py`:
- `test_top_k_ungapped_kmer_min_hits_per_diagonal_filters`:
a perfect HSP of length L over `kmer_size` k deposits
`L - k + 1` hits; threshold above that drops it, threshold
at/below keeps it.
- `test_top_k_ungapped_kmer_min_hits_zero_falls_back_to_exhaustive`:
on random DNA, `min_kmer_hits_per_diagonal=0` returns exactly
the same HSPs as `top_k_ungapped_local_align`, including HSPs
whose match runs are shorter than `kmer_size`.
All 45 tests pass.
* feat: span-based extension with X-drop backward for k-mer seeder
Replace the per-diagonal hit-count + full-diagonal scan with BLAST-style
span emission and X-drop-bounded extension. The forward extension is the
same Kadane positive-segment scan as `top_k_ungapped_local_align`, so
within the seeded regime the kmer fn now reports the exact same HSP
scores as the all-diagonals scan -- not just an approximation.
Algorithm change. Per-diagonal seeding becomes per-span:
1. Walk seqb's k-mers; for each hit at diagonal `d` and position `p`,
extend an open `(start, last, count)` span if `p - last <= max_hit_gap`,
otherwise close it and start a new one. Per-diagonal open-span state
plus a global `Vec<ClosedSpan>`.
2. Sort spans by `(diag, start_pos)`, filter by `min_kmer_hits_per_span`.
3. For each span, walk backward along the diagonal until
`back_sum < back_max - max_hit_gap` (X-drop with `max_hit_gap` as the
X budget). The position with maximum back_sum is `new_start` -- the
position from which a forward Kadane scan reproduces the same peak
as a full-from-zero scan.
4. Forward Kadane from `new_start`, terminating at the first reset
that occurs at-or-past the span's last hit. Resume position becomes
the dedup floor for subsequent spans on the same diagonal.
The strict zero-crossing rule on the backward walk is too tight: a
lead-in of the form `(k-1) matches, 1 mismatch, k matches, ...` dips
`back_sum` to -1 transiently before climbing higher, and we'd terminate
prematurely with an undercounted score. Reusing `max_hit_gap` as the
X-drop budget keeps a single user-facing knob ("how many missing bytes
am I willing to tunnel through?") and lets `back_sum` recover from
short dips.
API changes.
- Rename `min_kmer_hits_per_diagonal` -> `min_kmer_hits_per_span`
(semantic shift: each emitted span is now the unit of filtering,
not the diagonal). `=0` still falls back to the exhaustive scan.
- New `max_hit_gap` parameter (default 20) -- governs both span
emission (close span when next hit is > max_hit_gap past last) and
backward X-drop tolerance. Tune to scoring scheme: 20 is suitable
for +1 / -1 over text-like alphabets.
Measured end-to-end on the 39-pair markdown/PDF corpus (kmer_size=5,
max_hits_per_kmer=16, max_hit_gap=20, min_kmer_hits_per_span=1,
k=500), full-scan vs new kmer-seeded:
pair (n x m) full new speedup recall(s>=50)
17K x 17K 2410 ms 4.2 ms 580x 1.000
11K x 11K 955 ms 2.5 ms 381x 1.000
73K x 72K 12712 ms 17.5 ms 728x 1.000
50K x 55K 4367 ms 12.0 ms 365x 1.000
Top-band (score >= 50) recall is 100% on every pair: HSPs that anchor
chained alignment are all preserved. Mid-band (10 <= s < 50) recall
is 0.88-0.96 at T=1 and degrades with T.
Test changes.
- `test_top_k_ungapped_kmer_matches_full_scan_random_dna` now passes
again exactly (was failing under min/max bounding due to lead-in
truncation; X-drop backward fixes it).
- Rename `min_kmer_hits_per_diagonal_filters` test to
`min_kmer_hits_per_span_filters`, update assertion semantics.
- `min_hits_zero_falls_back_to_exhaustive` updated to the new param
name; behaviour unchanged.
All 45 tests pass.
* Bump version to 0.6.0
Minor release covering the k-mer-seeded top-k ungapped local alignment
work merged on this branch:
- 3be4d56 feat: add k-mer-seeded top-k ungapped local alignment
(top_k_ungapped_local_align_kmer; BLAST-style k-mer index
of seqa + per-diagonal positive-segment scan; behaviour-
preserving extraction of process_diagonal_into_candidates
and select_top_k_with_overlap_filter as shared helpers.)
- 51207a0 feat: add min_kmer_hits_per_diagonal threshold
(Per-diagonal hit-count floor as a score-floor pre-filter;
min_kmer_hits_per_diagonal=0 falls back to the exhaustive
all-diagonals scan; dense Vec<u32> instead of HashMap.)
- 7b8e7d2 feat: span-based extension with X-drop backward
(Replaced per-diagonal hit-count + full-diagonal scan with
BLAST-style span emission and X-drop-bounded extension;
renamed min_kmer_hits_per_diagonal -> min_kmer_hits_per_span;
added max_hit_gap parameter; 365-728x end-to-end speedup
vs. top_k_ungapped_local_align on the markdown/PDF corpus
with 100% recall of HSPs at score >= 50.)
No breaking changes for existing callers of `top_k_ungapped_local_align`
or other public APIs. `top_k_ungapped_local_align_kmer` is a new entry
point introduced in this release.
* Fix pre-commit: ruff-format + ANN001 on test key fn
- ruff-format reformatted multi-arg call sites to one-arg-per-line in
the new k-mer tests (matches the file's prevailing style).
- Add `Alignment` type annotation on the inner `key` function in
`test_top_k_ungapped_kmer_matches_full_scan_random_dna` (ANN001).
* Apply cargo fmt
* Add rustfmt pre-commit hook and Rust toolchain to lint CI
`rustfmt --check --edition 2021` runs as a local pre-commit hook on
every staged `.rs` file. The CI lint job now installs the stable Rust
toolchain with the `rustfmt` component before running `pre-commit run
--all-files`, so the new hook is exercised on PR.
Mechanical formatting drift caught by this hook was committed in the
previous commit ('Apply cargo fmt').
---------
Co-authored-by: Leonhard Gruenschloss <l.gruenschloss@vcgs.org.au>