Skip to content

v0.6.0

Latest

Choose a tag to compare

@folded folded released this 12 May 22:19
bf66af7
feat: k-mer-seeded top-k ungapped local alignment (0.6.0) (#32)

* feat: add k-mer-seeded top-k ungapped local alignment

`top_k_ungapped_local_align_kmer(seqa, seqb, score_matrix, k,
kmer_size, max_hits_per_kmer, filter_overlap_a=True,
filter_overlap_b=True)` is a faster alternative to
`top_k_ungapped_local_align` for callers that work on long inputs
over a moderate-sized alphabet.

Algorithm: BLAST-style k-mer seeding plus the existing per-diagonal
positive-segment scan as the extension.

1. Build a rolling-hash k-mer index of `seqa`
   (packed k-mer -> sa positions; `kmer_size` in `[1, 8]`, packed
   into a `u64`).
2. Walk `seqb`'s rolling k-mer; for each lookup, collect the set of
   hit diagonals (`diag = sa_pos - sb_pos`).  Skip k-mers whose
   `seqa`-occurrence count exceeds `max_hits_per_kmer` to bound
   low-complexity blowup (long runs of spaces, common bigrams, etc.).
3. Run the per-diagonal positive-segment scan *only* on diagonals
   with at least one hit, into the same candidate heap.
4. Pop top-`k` non-overlapping via the existing overlap-filter logic.

The output `Alignment` shape and overlap semantics are identical to
`top_k_ungapped_local_align`, so the new function is a drop-in
replacement at the seed step of any caller.

Implementation reuses the existing diagonal scan by extracting two
helpers out of `_top_k_ungapped_local_align_core`:

  - `process_diagonal_into_candidates`: SW-style positive-segment
    scan along one diagonal of (sb x sa), pushing every positive HSP
    peak into a `BinaryHeap<Candidate>`.
  - `select_top_k_with_overlap_filter`: pop top-`k` from the heap,
    skipping HSPs that overlap an already-accepted one on `A` and/or
    `B`.

Both `_top_k_ungapped_local_align_core` and the new k-mer variant
share these helpers.  Behaviour-preserving refactor — all existing
top-k tests pass unchanged.

Correctness caveat (documented on the new function): an HSP is found
iff at least one window of `kmer_size` consecutive exact-match bytes
lies on its diagonal between the two sequences.  HSPs whose match
runs are all strictly shorter than `kmer_size` are missed.  Over
text-like alphabets at `kmer_size <= 5` this only happens at very
small HSP scores; callers that need exhaustive coverage at the
boundary should keep using `top_k_ungapped_local_align` or fall
through to a full local SW (as `anchorite.chained_alignment`'s
recursive gap-fill does).

New tests in `tests/test_seq_smith.py`:

  - simple HSP recovery (matches the existing
    `test_top_k_ungapped_simple` case),
  - near-identity equivalence with the full-scan top-k on random
    DNA at the high end of the score distribution,
  - overlap filter on `B` (matches `test_top_k_ungapped_overlap`),
  - top-`k` limit (matches `test_top_k_ungapped_limit`),
  - `kmer_size` validation rejects 0 and 9+ with `ValueError`,
  - sequences shorter than `kmer_size` return `[]`,
  - `max_hits_per_kmer=0` returns `[]` even on a poly-A identity
    pair (the cap that protects against low-complexity blowup).

* feat: add `min_kmer_hits_per_diagonal` threshold to k-mer seeder

`top_k_ungapped_local_align_kmer(..., min_kmer_hits_per_diagonal=1)`
now exposes a per-diagonal hit-count floor that runs the extension
scan only on diagonals whose accumulated k-mer hit count is at least
the given threshold.  Defaults to `1` (preserves the previous "any
hit triggers extension" behaviour exactly).

Why: profiling on text-like inputs (markdown vs PDF-extracted text
over a ~37-character alphabet) showed `max_hits_per_kmer` alone is a
weak filter.  At cap=16 on a 17K x 17K pair, ~34% of grid diagonals
still see a hit; most of them carry only 1-2 random k-mer collisions
and contribute no real HSP, but every one still pays a full
positive-segment scan.

Per-diagonal hit count, in contrast, is a principled signal:
under +1/-1-style scoring a match-run of length L deposits
`L - kmer_size + 1` hits on its diagonal, so thresholding
"hits >= T" filters out diagonals whose best possible ungapped HSP
score cannot exceed a small constant.  This is a per-diagonal
score-floor pre-filter that costs ~O(1) per diagonal versus running
the extension.

End-to-end on a 39-pair markdown/PDF corpus (kmer_size=5,
max_hits_per_kmer=16, k=500), full-scan vs. k-mer-seeded with the
new threshold:

  pair (n x m)        full       T=1         T=5         T=8
  17K x 17K        2364 ms    896 ms     206 ms       91 ms
                                (2.6x)    (11x)       (25x)
  73K x 72K       13038 ms   4362 ms    1209 ms      599 ms
                                (2.7x)    (10x)       (20x)
  50K x 55K        4379 ms   2479 ms     778 ms      413 ms
                                (1.8x)     (5.6x)    (11x)

Score-band recall vs. the full-scan baseline:
  - HSPs with score >= 50: **100% recall on every pair at every T
    tested (1..21)**.  This is the band that drives chain anchoring
    in seed-and-extend callers like `anchorite.chained_alignment`.
  - HSPs with 10 <= score < 50: > 0.92 recall up to T=5, degrades
    gradually thereafter.

Setting `min_kmer_hits_per_diagonal=0` is the escape hatch: the
function then bypasses k-mer seeding entirely and delegates to
`_top_k_ungapped_local_align_core` (the exhaustive scan).  Callers
who need exhaustive coverage at the low-score boundary can use this
without switching APIs.

Internal: the per-diagonal hit-count table is a dense
`Vec<u32>` of length `sa_len + sb_len - 1`, indexed by
`diag + (sb_len - 1)`.  At text-alphabet densities most diagonals
see >= 1 hit so a HashMap is mostly populated and pays allocator
overhead for no benefit; the Vec is ~580 KB worst case on the
corpus and is ~10% faster end-to-end at low T.

New tests in `tests/test_seq_smith.py`:

  - `test_top_k_ungapped_kmer_min_hits_per_diagonal_filters`:
    a perfect HSP of length L over `kmer_size` k deposits
    `L - k + 1` hits; threshold above that drops it, threshold
    at/below keeps it.
  - `test_top_k_ungapped_kmer_min_hits_zero_falls_back_to_exhaustive`:
    on random DNA, `min_kmer_hits_per_diagonal=0` returns exactly
    the same HSPs as `top_k_ungapped_local_align`, including HSPs
    whose match runs are shorter than `kmer_size`.

All 45 tests pass.

* feat: span-based extension with X-drop backward for k-mer seeder

Replace the per-diagonal hit-count + full-diagonal scan with BLAST-style
span emission and X-drop-bounded extension.  The forward extension is the
same Kadane positive-segment scan as `top_k_ungapped_local_align`, so
within the seeded regime the kmer fn now reports the exact same HSP
scores as the all-diagonals scan -- not just an approximation.

Algorithm change.  Per-diagonal seeding becomes per-span:

  1. Walk seqb's k-mers; for each hit at diagonal `d` and position `p`,
     extend an open `(start, last, count)` span if `p - last <= max_hit_gap`,
     otherwise close it and start a new one.  Per-diagonal open-span state
     plus a global `Vec<ClosedSpan>`.
  2. Sort spans by `(diag, start_pos)`, filter by `min_kmer_hits_per_span`.
  3. For each span, walk backward along the diagonal until
     `back_sum < back_max - max_hit_gap` (X-drop with `max_hit_gap` as the
     X budget).  The position with maximum back_sum is `new_start` -- the
     position from which a forward Kadane scan reproduces the same peak
     as a full-from-zero scan.
  4. Forward Kadane from `new_start`, terminating at the first reset
     that occurs at-or-past the span's last hit.  Resume position becomes
     the dedup floor for subsequent spans on the same diagonal.

The strict zero-crossing rule on the backward walk is too tight: a
lead-in of the form `(k-1) matches, 1 mismatch, k matches, ...` dips
`back_sum` to -1 transiently before climbing higher, and we'd terminate
prematurely with an undercounted score.  Reusing `max_hit_gap` as the
X-drop budget keeps a single user-facing knob ("how many missing bytes
am I willing to tunnel through?") and lets `back_sum` recover from
short dips.

API changes.

  - Rename `min_kmer_hits_per_diagonal` -> `min_kmer_hits_per_span`
    (semantic shift: each emitted span is now the unit of filtering,
    not the diagonal).  `=0` still falls back to the exhaustive scan.
  - New `max_hit_gap` parameter (default 20) -- governs both span
    emission (close span when next hit is > max_hit_gap past last) and
    backward X-drop tolerance.  Tune to scoring scheme: 20 is suitable
    for +1 / -1 over text-like alphabets.

Measured end-to-end on the 39-pair markdown/PDF corpus (kmer_size=5,
max_hits_per_kmer=16, max_hit_gap=20, min_kmer_hits_per_span=1,
k=500), full-scan vs new kmer-seeded:

  pair (n x m)        full       new       speedup    recall(s>=50)
  17K x 17K        2410 ms     4.2 ms      580x       1.000
  11K x 11K         955 ms     2.5 ms      381x       1.000
  73K x 72K       12712 ms    17.5 ms      728x       1.000
  50K x 55K        4367 ms    12.0 ms      365x       1.000

Top-band (score >= 50) recall is 100% on every pair: HSPs that anchor
chained alignment are all preserved.  Mid-band (10 <= s < 50) recall
is 0.88-0.96 at T=1 and degrades with T.

Test changes.

  - `test_top_k_ungapped_kmer_matches_full_scan_random_dna` now passes
    again exactly (was failing under min/max bounding due to lead-in
    truncation; X-drop backward fixes it).
  - Rename `min_kmer_hits_per_diagonal_filters` test to
    `min_kmer_hits_per_span_filters`, update assertion semantics.
  - `min_hits_zero_falls_back_to_exhaustive` updated to the new param
    name; behaviour unchanged.

All 45 tests pass.

* Bump version to 0.6.0

Minor release covering the k-mer-seeded top-k ungapped local alignment
work merged on this branch:

  - 3be4d56 feat: add k-mer-seeded top-k ungapped local alignment
            (top_k_ungapped_local_align_kmer; BLAST-style k-mer index
            of seqa + per-diagonal positive-segment scan; behaviour-
            preserving extraction of process_diagonal_into_candidates
            and select_top_k_with_overlap_filter as shared helpers.)

  - 51207a0 feat: add min_kmer_hits_per_diagonal threshold
            (Per-diagonal hit-count floor as a score-floor pre-filter;
            min_kmer_hits_per_diagonal=0 falls back to the exhaustive
            all-diagonals scan; dense Vec<u32> instead of HashMap.)

  - 7b8e7d2 feat: span-based extension with X-drop backward
            (Replaced per-diagonal hit-count + full-diagonal scan with
            BLAST-style span emission and X-drop-bounded extension;
            renamed min_kmer_hits_per_diagonal -> min_kmer_hits_per_span;
            added max_hit_gap parameter; 365-728x end-to-end speedup
            vs. top_k_ungapped_local_align on the markdown/PDF corpus
            with 100% recall of HSPs at score >= 50.)

No breaking changes for existing callers of `top_k_ungapped_local_align`
or other public APIs.  `top_k_ungapped_local_align_kmer` is a new entry
point introduced in this release.

* Fix pre-commit: ruff-format + ANN001 on test key fn

- ruff-format reformatted multi-arg call sites to one-arg-per-line in
  the new k-mer tests (matches the file's prevailing style).
- Add `Alignment` type annotation on the inner `key` function in
  `test_top_k_ungapped_kmer_matches_full_scan_random_dna` (ANN001).

* Apply cargo fmt

* Add rustfmt pre-commit hook and Rust toolchain to lint CI

`rustfmt --check --edition 2021` runs as a local pre-commit hook on
every staged `.rs` file.  The CI lint job now installs the stable Rust
toolchain with the `rustfmt` component before running `pre-commit run
--all-files`, so the new hook is exercised on PR.

Mechanical formatting drift caught by this hook was committed in the
previous commit ('Apply cargo fmt').

---------

Co-authored-by: Leonhard Gruenschloss <l.gruenschloss@vcgs.org.au>