align matches for correctness and readability #111

thatbudakguy · 2020-10-16T14:52:02Z

using a matcher based on edit distance ratios results in matches that may include irrelevant material at the end; alignment is necessary to remove this material from the match.

additionally, alignment allows for padding gaps with spaces for easier comparison of the results, and can aid in colorizing the results to highlight differences (#54). it could play into the structured output (#39).
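a minimal sketch of the padding idea, assuming the aligner returns two equal-length sequences with a gap symbol. the `-` gap symbol and the names `pad_gaps` and `mark_differences` are hypothetical, not part of any library:

```python
GAP = "-"        # assumed gap symbol emitted by the aligner (hypothetical)
PAD = "\u3000"   # full-width space, so padded columns line up with CJK glyphs
MARK = "\uff3e"  # full-width circumflex used to flag a differing column


def pad_gaps(aligned, gap=GAP, pad=PAD):
    """Render one aligned sequence, replacing each gap with padding."""
    return "".join(pad if ch == gap else ch for ch in aligned)


def mark_differences(aligned_a, aligned_b, gap=GAP, pad=PAD, mark=MARK):
    """Return one marker per column: padding where both sides hold the
    same character, a mark where they differ or either side has a gap."""
    return "".join(
        pad if a == b and a != gap else mark
        for a, b in zip(aligned_a, aligned_b)
    )


if __name__ == "__main__":
    a = "可使治其賦也"
    b = "可使治其賦-"
    print(pad_gaps(a))
    print(pad_gaps(b))
    print(mark_differences(a, b))
```

the marker line could drive colorization instead of being printed, e.g. by wrapping marked columns in terminal color codes.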

thatbudakguy · 2020-10-23T19:17:40Z

candidates for libraries to do this:

- ~~nwalign3: doesn't look particularly maintained but has a C backend~~ no support for utf-8
- ~~minineedle: very small, pure python, maybe maintained?~~ no custom matrices
- ~~uta-align: small, C-backed, not passing its tests?~~ no docs, no custom matrices?
- ~~parasail-python: backed by popular SIMD C library, somewhat larger~~ no support for utf-8
- ~~abydos: huge, well-maintained kitchen-sink NLP library~~ only scoring, no alignments
- cltk: pure python, arbitrary matrices & utf-8 supported
- gotoh2: C-backed, supports scoring matrices via csv
- lingpy: C-backed, supports custom scorer dict – using this one

thatbudakguy · 2020-11-13T21:54:46Z

rather than attempting to "trim" low-similarity regions from the two sequences, it might be better to switch to Smith-Waterman or another local alignment, since ultimately we're concerned with all of the high-similarity regions within the two sequences wherever they are (not truly a global alignment).

currently, using Needleman-Wunsch results in alignments like:

由也千乘之國可使治其賦也不知其仁也求也何如子曰求也千室之邑百乘之家可使為之宰也不知其仁也赤也何如子曰赤也束帶立於朝可使與賓客言　　　　　　　　　　　　　　　　　　　　　　也　
由也千乘之國可使治其賦　　　　　　　也　　　　求也千室之邑百乘之家可使為之宰　　　　　　　　　　　　赤也束帶立於朝可使與賓客言也又曰子謂子產有君子之道四焉其行己也恭其事上也敬

where both outlying 也 are undesirable, and we certainly want to avoid the second one since it's far outside the main high-scoring region. the optimal alignment is:

由也千乘之國可使治其賦也不知其仁也求也何如子曰求也千室之邑百乘之家可使為之宰也不知其仁也赤也何如子曰赤也束帶立於朝可使與賓客言也　
由也千乘之國可使治其賦也　　　　　　　　　　　求也千室之邑百乘之家可使為之宰　　　　　　　　　　　　赤也束帶立於朝可使與賓客言也
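for reference, a pure-python sketch of Smith-Waterman local alignment that also returns the span of each input covered by the alignment, which is what would be needed to update match bounds. this is not lingpy's implementation: the function name, the `-` gap symbol, the default scores, and the `scorer` dict keyed by character pairs are all assumptions for illustration.

```python
GAP = "-"  # assumed gap symbol (hypothetical; must not occur in the inputs)


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1, scorer=None):
    """Local alignment of strings a and b.

    Returns (aligned_a, aligned_b, (start_a, end_a), (start_b, end_b), score),
    where the spans are half-open index ranges into the original strings.
    """
    def score(x, y):
        # Use the custom scorer when it defines this pair, else the defaults.
        if scorer is not None and (x, y) in scorer:
            return scorer[(x, y)]
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,  # a local alignment may restart anywhere
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_a, end_b = i, j
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    aligned_a = "".join(reversed(out_a))
    aligned_b = "".join(reversed(out_b))
    return aligned_a, aligned_b, (i, end_a), (j, end_b), best


if __name__ == "__main__":
    print(smith_waterman("XX由也千乘之國YY", "ZZZ由也千乘之國Q"))
```

because the traceback stops where the score reaches zero, material outside the high-scoring region is simply not part of the result, so no separate trimming step is needed. the table is O(len(a) × len(b)) in time and memory, which is one reason a C-backed library is attractive for long passages.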

thatbudakguy added the enhancement New feature or request label Oct 16, 2020

thatbudakguy added this to the v2.0 milestone Oct 16, 2020

thatbudakguy added a commit that referenced this issue Oct 16, 2020

Skip aligner tests until #111 is complete

2c1e035

thatbudakguy mentioned this issue Oct 17, 2020

supporting filtering results to only those with graphic variation #117

Closed

thatbudakguy added a commit that referenced this issue Nov 18, 2020

Implement match alignment

da87e00

- Imports lingpy's Smith-Waterman implementation
- Adds classes for auto-derived and custom scoring aligners

See #111

thatbudakguy mentioned this issue Nov 18, 2020

Implement match alignment #132

Merged

thatbudakguy linked a pull request Nov 18, 2020 that will close this issue

Implement match alignment #132

Merged

thatbudakguy added a commit that referenced this issue Nov 20, 2020

Update alignment strategy

49b1fb0

- Make alignment non-mutating operation
- Combine Smith-Waterman aligner variants into one class
- Use alignment values to update match bounds

See #111

thatbudakguy added a commit that referenced this issue Nov 20, 2020

Implement match alignment

abfe6c9

- Imports lingpy's Smith-Waterman implementation
- Adds classes for auto-derived and custom scoring aligners

See #111

thatbudakguy added a commit that referenced this issue Nov 20, 2020

Update alignment strategy

ad003ec

- Make alignment non-mutating operation
- Combine Smith-Waterman aligner variants into one class
- Use alignment values to update match bounds

See #111

thatbudakguy closed this as completed Nov 20, 2020
