v0.2: multi-witness master alignment graph #10

Merged
bsesic merged 27 commits into develop from feature/v0.2-multi-witness on May 28, 2026
@bsesic bsesic commented May 28, 2026

Implements Stage 2 of the long-term TRACE vision (see `docs/ROADMAP.md`): simultaneous multi-witness alignment producing a canonical variant graph plus a derived aligned table view, built on top of v0.1's pairwise aligner.

Closes #9 (Epic).

Summary

  • New `tracealign.multi` subpackage with five language-agnostic modules:
    • `distance.py` — pairwise distance matrix via v0.1 `align()`
    • `guide_tree.py` — UPGMA construction with deterministic tie-breaking, post-order traversal
    • `graph.py` — `VariantGraph` DAG with `GraphNode` / `GraphEdge`, lossless `witness_path()` and `variants()` queries
    • `table.py` — `AlignedTable` derived view with `re_anchor()` and `format_text()`
    • `merge.py` — `node_match_score`, topological order, POA forward DP, traceback, full `align_sequence_to_graph` and `progressive_merge` wrapper
  • Public API: `tracealign.align_multi(witnesses, lang, config) -> MultiAlignmentResult`. `MultiAlignerConfig` exposes `node_match` (`max` / `mean` / `min`), `guide_tree_method` (`upgma`), `gap_penalty_multi`, and a nested `pairwise` v0.1 `AlignerConfig`. All multi-witness types are re-exported at the top level.
  • JSON persistence via the new `tracealign.io.multi_result` module (separate from `io.result` for the pairwise case). Round-trip preserves the guide tree's distance matrix end-to-end.
  • Correctness gates pinned by property tests:
    • Lossless reconstruction — every input witness's path through the result graph reproduces its original token sequence exactly.
    • Permutation invariance — the result does not depend on input dict insertion order.
  • End-to-end Hebrew golden test with four constructed witnesses exercising agreement, niqqud-stripping, plene/defective, abbreviation, and insertion.
  • Documentation: `docs/usage.md` gains a multi-witness section; `docs/details.md` gains the three-phase algorithm description; `docs/index.md` and `README.md` are refreshed to advertise both pairwise and multi-witness alignment; `docs/faq.md` adds six new entries covering scope, determinism, scale, UPGMA-vs-NJ, incremental-add status, and persistence.
  • Roadmap marks Stage 2 as in-progress on `feature/v0.2-multi-witness`.

Algorithm (three phases)

  1. Pairwise distances — N(N-1)/2 calls to v0.1 `align()`, distance = `1 - total_score`. Witness ids sorted lexicographically for determinism.
  2. UPGMA guide tree — closest-pair merging with tie-breaking on canonical `(min, max)` of cluster members. Produces a binary tree carrying cumulative-distance heights and the original distance matrix.
  3. Progressive POA-based merge — walks the guide tree in post-order, aligning each successive witness against the current graph via partial-order alignment (Lee et al. 2002 style). Matches grow existing nodes, insertions add new nodes, deletions cause the new witness's edge to bypass nodes.

`node_match_score` aggregates the v0.1 tiered score across the witnesses already in a target node; default `max` (CollateX-aligned permissive), with `mean` and `min` configurable.
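
The three aggregation modes can be illustrated with a minimal sketch. The function name `aggregate_node_score` and its signature are hypothetical, and the input scores stand in for the v0.1 tiered scores of the incoming token against each witness token already in the node; this is not the `tracealign` implementation.

```python
from statistics import mean
from typing import Iterable


def aggregate_node_score(scores: Iterable[float], mode: str = "max") -> float:
    """Collapse per-witness scores for one graph node into a single value.

    Hypothetical helper: `scores` holds one similarity score per witness
    token already stored in the node.
    """
    values = list(scores)
    if not values:
        raise ValueError("a graph node must hold at least one witness token")
    reducers = {"max": max, "mean": mean, "min": min}
    if mode not in reducers:
        raise ValueError(f"unknown node_match mode: {mode!r}")
    return float(reducers[mode](values))


# One exact match, one partial match, one mismatch inside the same node.
scores = [1.0, 0.5, 0.0]
print(aggregate_node_score(scores, "max"))   # permissive: best constituent wins
print(aggregate_node_score(scores, "mean"))
print(aggregate_node_score(scores, "min"))   # strict: worst constituent wins
```

With `max`, a new token joins a node as soon as it matches any constituent well; `min` requires it to match all of them.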

Stats

  • 26 implementation tasks done end-to-end (Tasks 0-25 from the plan), plus one docs-revision follow-up
  • 73 new tests; total suite at 182 passing (was 109 at v0.1.3)
  • `flake8 src/ tests/` clean
  • `sphinx-build -W docs/ ...` clean

Reference documents

Test plan

  • CI matrix passes on Python 3.10 / 3.11 / 3.12
  • `pytest -q` locally → 182 passed
  • `flake8 src/ tests/` → no output
  • `sphinx-build -W -b html docs docs/_build/html` → build succeeded
  • Manual smoke test of `align_multi` on the README example
  • Property tests pass: `test_lossless_reconstruction` and `test_permutation_invariance`

Out of scope (deferred to later stages)

Transposition detection, apparatus generation (Stage 5), Geniza fragment anchor detection (Stage 3), stemmatic reconstruction proper (Stage 7), cross-tradition alignment (Stage 6), incremental add-witness API (v0.2.x or later), CollateX interoperability (future v0.x), visualisation, performance pass.

bsesic added 27 commits May 22, 2026 14:13
Adds empty src/tracealign/multi/ and tests/multi/ packages. Subsequent
tasks fill in distance, guide_tree, graph, table, merge, and api
modules per docs/superpowers/specs/2026-05-21-trace-v0.2-multi-witness-design.md.
GraphNode carries the witness_id -> Token mapping at one aligned position.
GraphEdge carries the set of witnesses that traverse the directed edge.
Both are Pydantic models with extra='forbid' for typo safety.
VariantGraph holds a topologically sorted node list plus the edge list and
the witness_ids that contribute to the alignment. START and END are
sentinel nodes with empty tokens dicts.
from_sequence builds the initial linear graph for a single witness; this
is the seed of progressive_merge. witness_path walks the edges where a
given witness appears, returning the non-sentinel nodes traversed; it is
the basis of the lossless-reconstruction correctness property.
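
A minimal sketch of the seed graph and the path walk, assuming plain dicts instead of the Pydantic models. `ToyGraph` and the function signatures are hypothetical stand-ins; only the `n:NNNNNN` id shape and the START/END sentinels are taken from this PR.

```python
from dataclasses import dataclass, field

START, END = "START", "END"


@dataclass
class ToyGraph:
    # node id -> {witness id -> token text}; sentinels carry empty dicts
    nodes: dict = field(default_factory=dict)
    # (source id, target id) -> set of witness ids traversing the edge
    edges: dict = field(default_factory=dict)


def from_sequence(witness_id: str, tokens: list) -> ToyGraph:
    """Build the linear START -> t0 -> t1 -> ... -> END chain for one witness."""
    graph = ToyGraph(nodes={START: {}, END: {}})
    previous = START
    for index, text in enumerate(tokens):
        node_id = f"n:{index:06d}"
        graph.nodes[node_id] = {witness_id: text}
        graph.edges[(previous, node_id)] = {witness_id}
        previous = node_id
    graph.edges[(previous, END)] = {witness_id}
    return graph


def witness_path(graph: ToyGraph, witness_id: str) -> list:
    """Follow the edges carrying `witness_id`; return the non-sentinel node ids."""
    successor = {
        src: dst for (src, dst), ws in graph.edges.items() if witness_id in ws
    }
    path, current = [], successor[START]
    while current != END:
        path.append(current)
        current = successor[current]
    return path


graph = from_sequence("W1", ["shalom", "olam"])
tokens = [graph.nodes[n]["W1"] for n in witness_path(graph, "W1")]
print(tokens)  # the original token sequence, recovered from the graph
```

The lossless-reconstruction property is this round trip, checked for every witness after all merges.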
variants() yields the variant loci: non-sentinel nodes whose constituent
witnesses disagree on the token text. Nodes carrying identical text from
multiple witnesses count as agreement, not variation.
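
The agreement-versus-variation rule can be sketched as follows, assuming nodes are stored as a plain mapping from node id to per-witness token text. `variant_loci` is a hypothetical name, not the library's `variants()` method.

```python
def variant_loci(nodes: dict) -> list:
    """Return ids of nodes whose witnesses disagree on the token text.

    `nodes` maps node id -> {witness id -> token text}. Sentinels have an
    empty mapping and therefore never count as variant loci.
    """
    return [
        node_id
        for node_id, tokens in nodes.items()
        if len(set(tokens.values())) > 1
    ]


nodes = {
    "START": {},
    "n:000000": {"W1": "שלום", "W2": "שלום"},  # agreement
    "n:000001": {"W1": "דויד", "W2": "דוד"},    # plene/defective variation
    "END": {},
}
print(variant_loci(nodes))
```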
AlignedTable is the tabular view over a VariantGraph. Each TableCell
carries either a Token or a None gap, plus a backreference to the
VariantGraph node it came from.
re_anchor returns a new table with the requested witness as the first
row; columns and cells are preserved verbatim. format_text renders the
table as ASCII for inspection, with a gap marker for missing cells.
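
A rough sketch of both operations, assuming the table is a plain mapping from witness id to a row of cells with `None` for gaps. The helper names mirror the methods above but the signatures, the `-` gap marker, and the ` | ` column separator are assumptions for illustration.

```python
def re_anchor(rows: dict, anchor: str) -> dict:
    """Return a new mapping with `anchor` as the first row; cells are untouched."""
    if anchor not in rows:
        raise KeyError(anchor)
    reordered = {anchor: rows[anchor]}
    reordered.update((w, cells) for w, cells in rows.items() if w != anchor)
    return reordered


def format_text(rows: dict, gap: str = "-") -> str:
    """Render rows as aligned ASCII; assumes every row has the same length."""
    rendered = {
        w: [gap if cell is None else cell for cell in cells]
        for w, cells in rows.items()
    }
    n_cols = max((len(cells) for cells in rendered.values()), default=0)
    widths = [
        max(len(cells[i]) for cells in rendered.values()) for i in range(n_cols)
    ]
    label_width = max((len(w) for w in rendered), default=0)
    lines = []
    for w, cells in rendered.items():
        padded = " | ".join(c.ljust(width) for c, width in zip(cells, widths))
        lines.append(f"{w.ljust(label_width)} | {padded}".rstrip())
    return "\n".join(lines)


rows = {"W1": ["a", None, "c"], "W4": ["a", "b", "c"]}
print(format_text(re_anchor(rows, "W4")))
```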
Pydantic recursive type for the binary UPGMA tree; the GuideTree carries
the method, the original distance matrix, and the witness_ids labelling
the matrix rows. format_text renders the tree as an indented ASCII tree.
pairwise_distances computes the N x N symmetric distance matrix as
1 - total_score of each pairwise alignment. Returns the matrix plus the
canonical (sorted) witness_id ordering used for rows and columns,
guaranteeing input-order independence.
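
A minimal sketch of the matrix construction. `toy_score` is a Jaccard overlap used purely as a stand-in for the v0.1 `align()` total score, and `toy_pairwise_distances` is a hypothetical name; only the `1 - score` definition and the sorted witness ids come from this PR.

```python
from itertools import combinations


def toy_score(a: list, b: list) -> float:
    """Stand-in similarity in [0, 1]; NOT the v0.1 tiered alignment score."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 1.0


def toy_pairwise_distances(witnesses: dict):
    """Return (matrix, ids) with ids sorted so dict insertion order is irrelevant."""
    ids = sorted(witnesses)
    index = {wid: i for i, wid in enumerate(ids)}
    matrix = [[0.0] * len(ids) for _ in ids]
    for a, b in combinations(ids, 2):  # N(N-1)/2 scorer calls
        distance = 1.0 - toy_score(witnesses[a], witnesses[b])
        matrix[index[a]][index[b]] = distance
        matrix[index[b]][index[a]] = distance  # mirror to keep it symmetric
    return matrix, ids


matrix, ids = toy_pairwise_distances({"W2": ["a", "b"], "W1": ["a", "c"]})
print(ids)
print(matrix)
```

Scoring each unordered pair once and mirroring the value keeps the matrix symmetric even if the underlying scorer is not.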
build_upgma builds a binary GuideTree from a symmetric distance matrix.
Ties on minimum distance are broken on the canonical (min, max)
lexicographic order of the cluster members, making the algorithm
order-independent at the input level.
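
The tie-breaking rule can be sketched with a small average-linkage loop. This is an illustration under simplifying assumptions: the tree is a nested tuple, node heights are omitted, and `toy_upgma` is a hypothetical name rather than the `build_upgma` signature.

```python
from itertools import combinations


def toy_upgma(matrix: list, ids: list):
    """Return a nested-tuple binary tree; leaves are witness ids."""
    index = {wid: i for i, wid in enumerate(ids)}
    # sorted tuple of member ids -> subtree built so far
    clusters = {(wid,): wid for wid in ids}

    def avg_distance(a: tuple, b: tuple) -> float:
        # UPGMA distance: mean of all original cross-cluster distances.
        total = sum(matrix[index[x]][index[y]] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Pairs come out as (min, max) because the keys are sorted first;
        # the pair itself is the tie-breaker when distances are equal.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: (avg_distance(*pair), pair),
        )
        merged = tuple(sorted(a + b))
        clusters[merged] = (clusters.pop(a), clusters.pop(b))
    return next(iter(clusters.values()))


ids = ["A", "B", "C"]
matrix = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.8],
    [0.8, 0.8, 0.0],
]
print(toy_upgma(matrix, ids))
```

Because candidate pairs are ranked by `(distance, pair)`, equal distances always resolve to the lexicographically smallest pair of member tuples.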
Returns the witness ids in canonical post-order traversal of the binary
guide tree. progressive_merge uses this as its canonical merge order, so
witnesses that are similar (siblings or close cousins in the tree)
are merged early, building consensus structure before distant witnesses
are added.
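
As a small illustration, assuming the guide tree is represented as nested tuples with witness ids at the leaves (a hypothetical representation, not the Pydantic `GuideTreeNode`), the merge order falls out of a left-to-right post-order walk:

```python
def post_order_leaves(tree) -> list:
    """Return leaf ids in post-order; `tree` is a leaf id or a (left, right) pair."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return post_order_leaves(left) + post_order_leaves(right)


# Siblings A and B are merged before the more distant C and D.
print(post_order_leaves((("A", "B"), ("C", "D"))))
```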
node_match_score aggregates the per-constituent tiered score across the
witnesses already in a graph node. Default mode 'max' is permissive
(CollateX-style); 'mean' and 'min' are also available via
MultiAlignerConfig.node_match.
Stable topological sort over graph nodes; ties broken by sorted node id.
Provides the canonical traversal order for the POA DP that follows.
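
One way to get a stable order is Kahn's algorithm with a min-heap over the ready set, so that ties always resolve to the smallest node id. The sketch below assumes string node ids and an edge list of `(source, target)` pairs; the function name is hypothetical.

```python
import heapq


def stable_topological_order(nodes: list, edges: list) -> list:
    """Kahn's algorithm; among ready nodes the smallest id is emitted first."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = [n for n, degree in indegree.items() if degree == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order


nodes = ["s", "b", "a", "e"]
edges = [("s", "b"), ("s", "a"), ("a", "e"), ("b", "e")]
print(stable_topological_order(nodes, edges))
```

The result does not depend on the order in which nodes or edges are listed, only on the graph and the ids.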
_run_poa_dp implements the three POA transitions (match, insertion in
sequence, deletion of graph node) over a topologically ordered DAG.
Returns the DP table, backpointer table, and best score; traceback
follows in the next task.

Sentinel transitions are free so that reaching END after exactly m
consumed sequence tokens does not pay an extra gap penalty.
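
The recurrence can be sketched compactly. This is a simplified illustration, not the `_run_poa_dp` implementation: it returns only the best score (no backpointers), uses a toy +1/-1 scorer instead of the tiered score, and all names and the default gap value are assumptions.

```python
def toy_match(token: str, label: str) -> float:
    """Stand-in scorer: +1 for identical text, -1 otherwise."""
    return 1.0 if token == label else -1.0


def poa_best_score(seq, order, preds, labels, gap=-1.0):
    """Best score for aligning `seq` to a DAG whose nodes are listed in
    topological `order` ("START" first, "END" last)."""
    m = len(seq)
    dp = {}  # (tokens consumed, last graph node visited) -> best score
    for v in order:
        sentinel = v in ("START", "END")
        for i in range(m + 1):
            best = 0.0 if (i == 0 and v == "START") else float("-inf")
            for p in preds.get(v, ()):
                # Deletion: pass through v without consuming a token
                # (free when v is a sentinel).
                best = max(best, dp[(i, p)] + (0.0 if sentinel else gap))
                # Match: consume seq[i-1] and align it to v.
                if i > 0 and not sentinel:
                    step = toy_match(seq[i - 1], labels[v])
                    best = max(best, dp[(i - 1, p)] + step)
            # Insertion: consume seq[i-1] without aligning it to a node.
            if i > 0 and v != "END":
                best = max(best, dp[(i - 1, v)] + gap)
            dp[(i, v)] = best
    return dp[(m, "END")]


# A bubble: START -> a -> (b | c) -> d -> END
order = ["START", "a", "b", "c", "d", "END"]
preds = {"a": ["START"], "b": ["a"], "c": ["a"], "d": ["b", "c"], "END": ["d"]}
labels = {"a": "a", "b": "b", "c": "c", "d": "d"}
print(poa_best_score(["a", "c", "d"], order, preds, labels))
```

Because the step into END costs nothing, a sequence that matches a full path through the graph scores exactly the sum of its matches.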
_traceback_ops walks from (m, END) to (0, START), returning a
forward-ordered list of (op, prev_i, prev_node_id, curr_node_id) tuples
that the merge step will apply to the variant graph.
align_sequence_to_graph combines _run_poa_dp, _traceback_ops, and the
actual graph mutation: matches grow existing nodes, insertions add new
nodes, deletions cause the new witness's edge to bypass nodes. Re-numbers
node ids by the final topological sort so callers always work with stable
n:NNNNNN identifiers.
progressive_merge walks the guide tree in post-order and incrementally
merges each witness into the variant graph via align_sequence_to_graph.
The first witness seeds the graph as a linear chain; subsequent witnesses
extend it.
MultiAlignerConfig mirrors v0.1's AlignerConfig style. pairwise is nested
for the Phase 1 distance calculations; node_match defaults to 'max' with
'mean' and 'min' available; guide_tree_method is 'upgma' in v0.2.
_validate_config rejects unknown enum values at the API boundary.
align_multi runs the end-to-end pipeline: pairwise_distances ->
build_upgma -> progressive_merge, then derives an AlignedTable from the
graph. The params snapshot carries trace_version, language_pack_version,
and the full config. summary is empty in v0.2.0 and can be enriched in
v0.2.x without an API break.
align_multi, MultiAlignerConfig, MultiAlignmentResult, VariantGraph,
GraphNode, GraphEdge, AlignedTable, TableColumn, TableCell, GuideTree,
GuideTreeNode are now reachable as tracealign.<name>.
tracealign.io.multi_result is a dedicated module, separate from
io/result.py, which handles v0.1's AlignmentResult. It provides
dumps/loads for string round-trips and dump/load for file round-trips.
The round-trip preserves the guide tree's distance matrix end-to-end.
For every witness w in the input, the path through the result graph must
yield exactly the original token sequence. This is the v0.2 correctness
guarantee against information loss during merging.
The same set of witnesses in different input dict insertion order must
yield the same alignment (same witness paths, same variant loci). This
justifies the guide tree as the algorithmic foundation of v0.2.
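
The shape of such a check can be sketched generically. `check_permutation_invariance` and the two toy functions are hypothetical; in the real test the function under test would be the multi-witness aligner and the compared value would be the witness paths and variant loci.

```python
from itertools import permutations


def check_permutation_invariance(func, witnesses: dict) -> bool:
    """Return True if `func` gives the same result for every insertion order.

    Exhaustive over N! orders, so only suitable for a handful of witnesses.
    """
    baseline = func(dict(witnesses))
    for order in permutations(witnesses):
        shuffled = {wid: witnesses[wid] for wid in order}
        if func(shuffled) != baseline:
            return False
    return True


def canonical_ids(witnesses: dict) -> list:
    return sorted(witnesses)  # sorts first, so insertion order cannot leak


def insertion_ids(witnesses: dict) -> list:
    return list(witnesses)  # leaks dict insertion order


witnesses = {"W2": ["a", "b"], "W1": ["a", "c"], "W3": ["a"]}
print(check_permutation_invariance(canonical_ids, witnesses))
print(check_permutation_invariance(insertion_ids, witnesses))
```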
Four constructed witnesses exercising agreement (שלום עולם), niqqud
stripping (רַבִּי vs רבי), plene/defective (דויד vs דוד), abbreviation
(ר"י vs רבי), and insertion (טוב in W4 only). Pins the lossless
reconstruction property and that the W4 insertion surfaces as a column
in the re-anchored AlignedTable.
The new multi-witness section in docs/usage.md covers the align_multi
entry point, MultiAlignmentResult fields, MultiAlignerConfig, and JSON
persistence via io/multi_result.
Adds a v0.2 algorithm details section to details.md describing the
three-phase pipeline (pairwise distances, UPGMA guide tree,
POA-based progressive merge) and the two correctness properties.

Marks Stage 2 in ROADMAP.md as in-progress on
feature/v0.2-multi-witness.
- README: extends the tagline and Highlights to cover both pairwise
  and multi-witness alignment; adds a from-source install section
  and a verify step; adds a Quick-start section for align_multi
  alongside the existing pairwise example; updates the documentation
  table to mention multi-witness usage and the POA algorithm; refreshes
  the Project status block with the current PyPI release, v0.2 spec
  link, released-vs-in-progress stages, and the full long-term
  sub-project list; rewrites the Citation block to reference the
  Zenodo concept DOI (now live).
- docs/index.md: same tagline / At-a-glance refresh plus a project
  status block update that lists the full ten-stage roadmap.
- docs/faq.md: new entries explaining how multi-witness alignment
  differs from pairwise, the determinism guarantees, the scale target,
  the UPGMA-vs-Neighbor-Joining choice, the incremental-add API status,
  and the persistence module.

Sphinx -W build is clean; the test suite stays at 182 passing.
@bsesic bsesic merged commit a04509a into develop May 28, 2026
3 checks passed