v0.2: multi-witness master alignment graph #10
Merged
Adds empty src/tracealign/multi/ and tests/multi/ packages. Subsequent tasks fill in distance, guide_tree, graph, table, merge, and api modules per docs/superpowers/specs/2026-05-21-trace-v0.2-multi-witness-design.md.
GraphNode carries the witness_id -> Token mapping at one aligned position. GraphEdge carries the set of witnesses that traverse the directed edge. Both are Pydantic models with extra='forbid', so a misspelled field name raises a validation error instead of being silently ignored.
VariantGraph holds a topologically sorted node list plus the edge list and the witness_ids that contribute to the alignment. START and END are sentinel nodes with empty tokens dicts.
from_sequence builds the initial linear graph for a single witness; this is the seed of progressive_merge. witness_path walks the edges where a given witness appears, returning the non-sentinel nodes traversed; it is the basis of the lossless-reconstruction correctness property.
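A minimal sketch of the seeding and path-walking idea described above. It uses plain dataclasses rather than the project's Pydantic models, and the class names `Node`, `Edge`, and `Graph`, the plain-string tokens, and the id format are illustrative assumptions, not the actual tracealign API.

```python
from dataclasses import dataclass, field

START, END = "START", "END"  # sentinel node ids (assumed spelling)


@dataclass
class Node:
    id: str
    tokens: dict[str, str] = field(default_factory=dict)  # witness_id -> token text


@dataclass
class Edge:
    source: str
    target: str
    witnesses: set[str] = field(default_factory=set)


@dataclass
class Graph:
    nodes: list[Node]
    edges: list[Edge]
    witness_ids: list[str]


def from_sequence(witness_id: str, tokens: list[str]) -> Graph:
    """Build the linear chain START -> t1 -> ... -> tn -> END for one witness."""
    nodes = [Node(START)]
    nodes += [Node(f"n:{i:06d}", {witness_id: tok}) for i, tok in enumerate(tokens)]
    nodes.append(Node(END))
    edges = [Edge(a.id, b.id, {witness_id}) for a, b in zip(nodes, nodes[1:])]
    return Graph(nodes, edges, [witness_id])


def witness_path(graph: Graph, witness_id: str) -> list[Node]:
    """Follow the edges carrying witness_id; return the non-sentinel nodes visited."""
    by_id = {n.id: n for n in graph.nodes}
    step = {e.source: e.target for e in graph.edges if witness_id in e.witnesses}
    path, current = [], step[START]
    while current != END:
        path.append(by_id[current])
        current = step[current]
    return path


g = from_sequence("W1", ["hello", "world"])
recovered = [n.tokens["W1"] for n in witness_path(g, "W1")]
```

Reading the tokens back off the path returns the input sequence, which is the property the later lossless-reconstruction tests rely on.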
Yields the variant loci — non-sentinel nodes whose constituent witnesses disagree on the token text. Nodes carrying identical text from multiple witnesses count as agreement, not variation.
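As a rough illustration of the rule above, the sketch below treats each non-sentinel node as a plain `witness_id -> text` dict and yields the positions where the texts differ. The function name `iter_variant_loci` and the dict representation are assumptions made for the example.

```python
from typing import Iterator


def iter_variant_loci(nodes: list[dict[str, str]]) -> Iterator[int]:
    """Yield the index of every node whose witnesses carry differing text."""
    for position, tokens in enumerate(nodes):
        # A node with one distinct text (or a single witness) is agreement.
        if len(set(tokens.values())) > 1:
            yield position


nodes = [
    {"W1": "peace", "W2": "peace"},   # agreement
    {"W1": "dvyd", "W2": "dvd"},      # disagreement on spelling
    {"W4": "good"},                   # single witness, nothing to disagree with
]
loci = list(iter_variant_loci(nodes))
```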
AlignedTable is the tabular view over a VariantGraph. Each TableCell carries either a Token or a None gap, plus a backreference to the VariantGraph node it came from.
re_anchor returns a new table with the requested witness as the first row; columns and cells are preserved verbatim. format_text renders the table as ASCII for inspection, with a gap marker for missing cells.
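The following sketch shows one way re-anchoring and ASCII rendering could behave, assuming the table is a dict of `witness_id -> list of cells` where `None` stands for a gap. The function signatures, the `-` gap marker, and the ` | ` separator are illustrative choices, not the real AlignedTable interface.

```python
from typing import Optional

Row = list[Optional[str]]


def re_anchor(rows: dict[str, Row], anchor: str) -> dict[str, Row]:
    """Return a new table with `anchor` as the first row; cells are untouched."""
    if anchor not in rows:
        raise KeyError(f"unknown witness: {anchor}")
    reordered = {anchor: rows[anchor]}
    reordered.update((w, cells) for w, cells in rows.items() if w != anchor)
    return reordered


def format_text(rows: dict[str, Row], gap: str = "-") -> str:
    """Render the table as padded ASCII columns, using `gap` for missing cells."""
    def show(cell: Optional[str]) -> str:
        return gap if cell is None else cell

    n_cols = len(next(iter(rows.values())))
    widths = [max(len(show(cells[c])) for cells in rows.values()) for c in range(n_cols)]
    label_width = max(len(w) for w in rows)
    lines = []
    for witness, cells in rows.items():
        body = " | ".join(show(cell).ljust(widths[c]) for c, cell in enumerate(cells))
        lines.append(f"{witness.ljust(label_width)} | {body}".rstrip())
    return "\n".join(lines)


table = {"W1": ["a", None, "c"], "W4": ["a", "good", "c"]}
anchored = re_anchor(table, "W4")
rendered = format_text(anchored)
```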
Pydantic recursive type for the binary UPGMA tree; the GuideTree carries the method, the original distance matrix, and the witness_ids labelling the matrix rows. format_text renders the tree as an indented ASCII tree.
Computes the N x N symmetric distance matrix as 1 - total_score of each pairwise alignment. Returns the matrix plus the canonical (sorted) witness_id ordering used for rows and columns, guaranteeing input-order independence.
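A small sketch of the distance computation, under the assumption that the pairwise score lies in [0, 1]. Here `difflib.SequenceMatcher.ratio` stands in for the v0.1 aligner's `total_score`, which is not reproduced; the function name and return shape follow the description above but are otherwise assumed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: list[str], b: list[str]) -> float:
    """Stand-in for the pairwise aligner's total_score (assumed to be in [0, 1])."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pairwise_distances(witnesses: dict[str, list[str]]) -> tuple[list[list[float]], list[str]]:
    """Return (matrix, ids) where ids is sorted and matrix[i][j] = 1 - similarity."""
    ids = sorted(witnesses)  # canonical ordering, independent of dict insertion order
    index = {w: i for i, w in enumerate(ids)}
    matrix = [[0.0] * len(ids) for _ in ids]
    # combinations over the sorted ids always scores each pair in the same direction
    for a, b in combinations(ids, 2):
        d = 1.0 - similarity(witnesses[a], witnesses[b])
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = d
    return matrix, ids


matrix, ids = pairwise_distances({"W3": ["a", "c"], "W1": ["a", "b"], "W2": ["a", "b"]})
```

Sorting the ids before any scoring is what makes the matrix identical for any insertion order of the input dict.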
build_upgma builds a binary GuideTree from a symmetric distance matrix. Ties on minimum distance are broken on the canonical (min, max) lexicographic order of the cluster members, making the algorithm order-independent at the input level.
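The sketch below illustrates UPGMA with a deterministic tie-break, using nested tuples as the tree instead of the Pydantic GuideTree. The tie-break key `(distance, smaller cluster, larger cluster)` is one reading of the "(min, max) lexicographic order" described above and should be treated as an assumption.

```python
from itertools import combinations


def build_upgma(matrix: list[list[float]], ids: list[str]):
    """Return a binary tree of nested 2-tuples whose leaves are witness ids."""
    clusters = {(w,): w for w in ids}  # sorted member tuple -> subtree
    dist = {
        frozenset([(a,), (b,)]): matrix[i][j]
        for (i, a), (j, b) in combinations(enumerate(ids), 2)
    }
    while len(clusters) > 1:
        # Pairs come out as (smaller, larger); equal distances fall back to
        # the lexicographic order of the member tuples.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: (dist[frozenset(pair)], pair[0], pair[1]),
        )
        merged = tuple(sorted(a + b))
        for c in clusters:
            if c not in (a, b):
                # UPGMA: size-weighted average of the two merged clusters.
                dist[frozenset([merged, c])] = (
                    len(a) * dist[frozenset([a, c])] + len(b) * dist[frozenset([b, c])]
                ) / (len(a) + len(b))
        clusters[merged] = (clusters.pop(a), clusters.pop(b))
    return next(iter(clusters.values()))


ids = ["W1", "W2", "W3"]
tree = build_upgma([[0.0, 0.2, 0.6], [0.2, 0.0, 0.6], [0.6, 0.6, 0.0]], ids)
tied = build_upgma([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]], ids)
```

With all distances tied, the key still selects `("W1",), ("W2",)` first, so the tree does not depend on how the caller happened to order equal-distance pairs.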
Returns the witness ids in canonical post-order traversal of the binary guide tree. progressive_merge uses this as its canonical merge order, so witnesses that are similar (siblings or close cousins in the tree) are merged early, building consensus structure before distant witnesses are added.
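To make the merge order concrete, here is a minimal sketch that reads the leaves of a binary tree in post-order. The nested-tuple tree shape and the name `merge_order` are assumptions for the example; the real GuideTree is a Pydantic model.

```python
def merge_order(tree) -> list[str]:
    """Return leaf witness ids left-to-right, children before their parent."""
    if isinstance(tree, str):  # a leaf is just a witness id
        return [tree]
    left, right = tree
    return merge_order(left) + merge_order(right)


order = merge_order((("W1", "W2"), ("W3", "W4")))
```

Siblings appear next to each other in the result, so the most similar witnesses are the first to be merged.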
Aggregates the per-constituent tiered score across the witnesses already in a graph node. Default mode 'max' is permissive (CollateX-style); 'mean' and 'min' are also available via MultiAlignerConfig.node_match.
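A sketch of the aggregation step, assuming a per-token scoring function is passed in. The toy `tiered_score` below (exact match, case-insensitive match, mismatch) is a hypothetical stand-in for the v0.1 tiered score, and the function signature is illustrative.

```python
from statistics import mean
from typing import Callable

_AGGREGATORS = {"max": max, "mean": mean, "min": min}


def tiered_score(a: str, b: str) -> float:
    """Hypothetical stand-in: 1.0 exact, 0.8 case-insensitive, 0.0 otherwise."""
    if a == b:
        return 1.0
    if a.casefold() == b.casefold():
        return 0.8
    return 0.0


def node_match_score(
    candidate: str,
    node_tokens: dict[str, str],
    score: Callable[[str, str], float] = tiered_score,
    mode: str = "max",
) -> float:
    """Aggregate the candidate's score against every token already in the node."""
    if mode not in _AGGREGATORS:
        raise ValueError(f"unknown node_match mode: {mode!r}")
    if not node_tokens:
        raise ValueError("node has no tokens to score against")
    scores = [score(candidate, text) for text in node_tokens.values()]
    return _AGGREGATORS[mode](scores)


node = {"W1": "Rabbi", "W2": "rabbi"}
permissive = node_match_score("rabbi", node, mode="max")
strict = node_match_score("rabbi", node, mode="min")
average = node_match_score("rabbi", node, mode="mean")
```

Under `max`, one good match among the node's witnesses is enough to align; `min` requires every witness in the node to match well.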
Stable topological sort over graph nodes; ties broken by sorted node id. Provides the canonical traversal order for the POA DP that follows.
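One common way to get a stable topological order is Kahn's algorithm with a min-heap of ready nodes, so that ties are always resolved by the smallest id. The sketch below assumes nodes are string ids and edges are `(source, target)` pairs; it is not the project's implementation.

```python
import heapq


def topological_sort(node_ids: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Kahn's algorithm; among ready nodes the smallest id is emitted first."""
    indegree = {n: 0 for n in node_ids}
    successors: dict[str, list[str]] = {n: [] for n in node_ids}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for target in successors[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                heapq.heappush(ready, target)
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order


edges = [
    ("START", "n:000002"),
    ("START", "n:000001"),
    ("n:000001", "END"),
    ("n:000002", "END"),
]
order = topological_sort(["END", "n:000002", "n:000001", "START"], edges)
```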
Implements the three POA transitions (match, insertion in sequence, deletion of graph node) over a topologically ordered DAG. Returns the DP table, backpointer table, and best score; traceback follows in the next task. Sentinel transitions are free so that reaching END after exactly m consumed sequence tokens does not pay an extra gap penalty.
Walks from (m, END) to (0, START), returning a forward-ordered list of (op, prev_i, prev_node_id, curr_node_id) tuples that the merge step will apply to the variant graph.
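The DP and traceback described in the two preceding commit messages can be sketched as follows. This is a simplified illustration, not the project's `_run_poa_dp` or `_traceback_ops`: the argument shapes, the `"sentinel"` op label, the toy scoring function, and the gap penalty of -1 are all assumptions.

```python
NEG_INF = float("-inf")


def run_poa_dp(order, preds, text, seq, score, gap=-1.0):
    """dp[v][i] = best score ending at node v with i sequence tokens consumed."""
    m = len(seq)
    dp = {v: [NEG_INF] * (m + 1) for v in order}
    back = {v: [None] * (m + 1) for v in order}
    dp["START"][0] = 0.0
    for v in order:  # order must be topological
        sentinel = v in ("START", "END")
        for i in range(m + 1):
            candidates = []
            for p in preds.get(v, ()):
                if sentinel:
                    # Entering a sentinel is free.
                    candidates.append((dp[p][i], "sentinel", i, p))
                else:
                    if i > 0:  # match seq[i-1] against graph node v
                        s = score(seq[i - 1], text[v])
                        candidates.append((dp[p][i - 1] + s, "match", i - 1, p))
                    # deletion: pass through v without consuming a token
                    candidates.append((dp[p][i] + gap, "delete", i, p))
            if i > 0 and v != "END":  # insertion: consume a token, stay at v
                candidates.append((dp[v][i - 1] + gap, "insert", i - 1, v))
            for value, op, prev_i, prev_v in candidates:
                if value > dp[v][i]:
                    dp[v][i], back[v][i] = value, (op, prev_i, prev_v)
    return dp, back, dp["END"][m]


def traceback_ops(back, m):
    """Walk from (m, END) to (0, START); return ops in forward order."""
    ops, i, v = [], m, "END"
    while (i, v) != (0, "START"):
        op, prev_i, prev_v = back[v][i]
        ops.append((op, prev_i, prev_v, v))
        i, v = prev_i, prev_v
    ops.reverse()
    return ops


order = ["START", "n1", "n2", "n3", "END"]
preds = {"n1": ["START"], "n2": ["n1"], "n3": ["n2"], "END": ["n3"]}
text = {"n1": "a", "n2": "b", "n3": "c"}
seq = ["a", "b", "x", "c"]
exact = lambda tok, node_text: 1.0 if tok == node_text else -1.0

dp, back, best = run_poa_dp(order, preds, text, seq, exact)
ops = traceback_ops(back, len(seq))
```

In this example the extra token `x` is handled as an insertion after `n2`, and the final hop into END costs nothing, so the best score is three matches minus one gap.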
Combines _run_poa_dp + _traceback_ops + the actual graph mutation: matches grow existing nodes, insertions add new nodes, deletions cause the new witness's edge to bypass nodes. Re-numbers node ids by the final topological sort so callers always work with stable n:NNNNNN identifiers.
Walks the guide tree in post-order and incrementally merges each witness into the variant graph via align_sequence_to_graph. The first witness seeds the graph as a linear chain; subsequent witnesses extend it.
Mirrors v0.1's AlignerConfig style. pairwise is nested for the Phase 1 distance calculations; node_match defaults to 'max' with 'mean' and 'min' available; guide_tree_method is 'upgma' in v0.2. _validate_config rejects unknown enum values at the API boundary.
End-to-end pipeline: pairwise_distances -> build_upgma -> progressive_merge, then derive an AlignedTable from the graph. params snapshot carries trace_version, language_pack_version, and the full config. summary is empty in v0.2.0 and can be enriched in v0.2.x without an API break.
align_multi, MultiAlignerConfig, MultiAlignmentResult, VariantGraph, GraphNode, GraphEdge, AlignedTable, TableColumn, TableCell, GuideTree, GuideTreeNode are now reachable as tracealign.<name>.
Dedicated module separate from io/result.py for v0.1's AlignmentResult. dumps/loads for string round-trips, dump/load for file round-trips. Round-trip preserves the guide tree's distance matrix end-to-end.
For every witness w in the input, the path through the result graph must yield exactly the original token sequence. This is the v0.2 correctness guarantee against information loss during merging.
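A minimal sketch of how this property could be checked on a small hand-built merged graph. Nodes are plain `id -> {witness_id: text}` dicts and edges are `(source, target, witnesses)` triples; these shapes and the helper names are assumptions, not the project's test code.

```python
def reconstruct(nodes, edges, witness_id):
    """Read a witness's tokens back by following the edges that carry it."""
    step = {src: tgt for src, tgt, ws in edges if witness_id in ws}
    tokens, current = [], step["START"]
    while current != "END":
        tokens.append(nodes[current][witness_id])
        current = step[current]
    return tokens


def is_lossless(nodes, edges, witnesses):
    """True when every witness's path yields exactly its original tokens."""
    return all(reconstruct(nodes, edges, w) == toks for w, toks in witnesses.items())


witnesses = {"W1": ["a", "b", "c"], "W4": ["a", "b", "good", "c"]}
nodes = {
    "n1": {"W1": "a", "W4": "a"},
    "n2": {"W1": "b", "W4": "b"},
    "n3": {"W4": "good"},              # insertion present in W4 only
    "n4": {"W1": "c", "W4": "c"},
}
edges = [
    ("START", "n1", {"W1", "W4"}),
    ("n1", "n2", {"W1", "W4"}),
    ("n2", "n3", {"W4"}),
    ("n3", "n4", {"W4"}),
    ("n2", "n4", {"W1"}),              # W1 bypasses the inserted node
    ("n4", "END", {"W1", "W4"}),
]
ok = is_lossless(nodes, edges, witnesses)
```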
The same set of witnesses supplied in a different input dict insertion order must yield the same alignment (same witness paths, same variant loci). This justifies the guide tree as the algorithmic foundation of v0.2.
Four constructed witnesses exercising agreement (שלום עולם), niqqud stripping (רַבִּי vs רבי), plene/defective (דויד vs דוד), abbreviation (ר"י vs רבי), and insertion (טוב in W4 only). Pins the lossless reconstruction property and that the W4 insertion surfaces as a column in the re-anchored AlignedTable.
Covers the align_multi entry point, MultiAlignmentResult fields, MultiAlignerConfig, and JSON persistence via io/multi_result.
Adds a v0.2 algorithm details section to details.md describing the three-phase pipeline (pairwise distances, UPGMA guide tree, POA-based progressive merge) and the two correctness properties. Marks stage 2 in ROADMAP.md as in-progress on feature/v0.2-multi-witness.
- README: extends the tagline and Highlights to cover both pairwise and multi-witness alignment; adds a from-source install section and a verify step; adds a Quick-start section for align_multi alongside the existing pairwise example; updates the documentation table to mention multi-witness usage and the POA algorithm; refreshes the Project status block with the current PyPI release, v0.2 spec link, released-vs-in-progress stages, and the full long-term sub-project list; rewrites the Citation block to reference the Zenodo concept DOI (now live).
- docs/index.md: same tagline / At-a-glance refresh plus a project status block update that lists the full ten-stage roadmap.
- docs/faq.md: new entries explaining how multi-witness alignment differs from pairwise, the determinism guarantees, the scale target, the UPGMA-vs-Neighbor-Joining choice, the incremental-add API status, and the persistence module.

Sphinx -W build is clean; the test suite stays at 182 passing.
Implements Stage 2 of the long-term TRACE vision (see `docs/ROADMAP.md`): simultaneous multi-witness alignment producing a canonical variant graph plus a derived aligned table view, built on top of v0.1's pairwise aligner.
Closes #9 (Epic).
Summary
Algorithm (three phases)
`node_match_score` aggregates the v0.1 tiered score across the witnesses already in a target node; default `max` (CollateX-aligned permissive), with `mean` and `min` configurable.
Stats
Reference documents
Test plan
Out of scope (deferred to later stages)
Transposition detection, apparatus generation (Stage 5), Geniza fragment anchor detection (Stage 3), stemmatic reconstruction proper (Stage 7), cross-tradition alignment (Stage 6), incremental add-witness API (v0.2.x or later), CollateX interoperability (future v0.x), visualisation, performance pass.