align matches for correctness and readability #111

Closed
thatbudakguy opened this issue Oct 16, 2020 · 2 comments · Fixed by #132
Labels
enhancement New feature or request

thatbudakguy commented Oct 16, 2020

using a matcher based on edit distance ratios results in matches that may include irrelevant material at the end; alignment is necessary to remove this material from the match.

additionally, alignment allows for padding gaps with spaces for easier comparison of the results, and can aid in colorizing the results to highlight differences (#54). it could play into the structured output (#39).
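a rough sketch of the gap-padding idea, assuming an alignment is available as a list of column pairs where `None` marks a gap. the names `pad_alignment` and `diff_markers` are hypothetical and not part of this project or any library; the ideographic space is only a guess at a pad character that keeps CJK columns lined up.

```python
PAD = "\u3000"  # ideographic (full-width) space; assumed pad character


def pad_alignment(pairs, pad=PAD):
    """Render (a, b) column pairs as two equal-length strings.

    A value of None in either position marks a gap and is shown as `pad`.
    """
    top = "".join(pad if a is None else a for a, _ in pairs)
    bottom = "".join(pad if b is None else b for _, b in pairs)
    return top, bottom


def diff_markers(pairs, same=PAD, diff="\uff3e"):
    """Return a marker line flagging columns where the two sides differ."""
    return "".join(same if a == b else diff for a, b in pairs)


# Hypothetical alignment columns for illustration only.
pairs = [("賦", "賦"), ("也", "也"), ("不", None), ("知", None), ("求", "求")]
top, bottom = pad_alignment(pairs)
print(top)
print(bottom)
print(diff_markers(pairs))
```

the marker line is one possible input for colorizing: any column flagged as different could be wrapped in a highlight instead of printed plainly.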

thatbudakguy commented Oct 23, 2020

candidates for libraries to do this:

  • nwalign3: doesn't look particularly maintained, but has a C backend; no support for utf-8
  • minineedle: very small, pure python, maybe maintained? no custom matrices
  • uta-align: small, C-backed, not passing its tests? no docs, no custom matrices?
  • parasail-python: backed by popular SIMD C library, somewhat larger; no support for utf-8
  • abydos: huge, well-maintained kitchen-sink NLP library; only scoring, no alignments
  • cltk: pure python; arbitrary matrices & utf-8 supported
  • gotoh2: C-backed; supports scoring matrices via csv
  • lingpy: C-backed; supports custom scorer dict (using this one)

thatbudakguy commented Nov 13, 2020

rather than attempting to "trim" low-similarity regions from the two sequences, it might be better to switch to Smith-Waterman or another local alignment, since ultimately we're concerned with all of the high-similarity regions within the two sequences wherever they are (not truly a global alignment).

currently, using Needleman-Wunsch results in alignments like:

由也千乘之國可使治其賦也不知其仁也求也何如子曰求也千室之邑百乘之家可使為之宰也不知其仁也赤也何如子曰赤也束帶立於朝可使與賓客言                      也 
由也千乘之國可使治其賦       也    求也千室之邑百乘之家可使為之宰            赤也束帶立於朝可使與賓客言也又曰子謂子產有君子之道四焉其行己也恭其事上也敬

where both outlying 也 are undesirable, and we certainly want to avoid the second one since it's far outside the main high-scoring region. the optimal alignment is:

由也千乘之國可使治其賦也不知其仁也求也何如子曰求也千室之邑百乘之家可使為之宰也不知其仁也赤也何如子曰赤也束帶立於朝可使與賓客言也 
由也千乘之國可使治其賦也           求也千室之邑百乘之家可使為之宰            赤也束帶立於朝可使與賓客言也
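to illustrate why a local alignment drops the outlying material, here is a minimal pure-python sketch of Smith-Waterman with a linear gap penalty. it is not lingpy's implementation; the function name, the default scores, and the `scorer` dict keyed by character pairs are all assumptions made for illustration. the returned bounds are the kind of values that could be used to tighten a match.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1, scorer=None, gap_char="-"):
    """Local alignment of sequences a and b (linear gap penalty).

    Returns (score, aligned_a, aligned_b, (a_start, a_end), (b_start, b_end)),
    where the bounds are half-open slices into the original sequences.
    `scorer` is an optional dict mapping (x, y) pairs to custom scores.
    """
    def score(x, y):
        if scorer is not None and (x, y) in scorer:
            return scorer[(x, y)]
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; clamping at 0 is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1

    aligned_a = "".join(reversed(out_a))
    aligned_b = "".join(reversed(out_b))
    return best, aligned_a, aligned_b, (i, best_pos[0]), (j, best_pos[1])


# Unrelated material on either side is excluded from the alignment.
result = smith_waterman("xxHELLOyy", "HELLOzz")
print(result)
```

with a linear penalty like this, long gaps are expensive, so whether the regions in the example above come out as one alignment or several depends on the scores chosen.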

thatbudakguy added a commit that referenced this issue Nov 18, 2020
- Imports lingpy's Smith-Waterman implementation
- Adds classes for auto-derived and custom scoring aligners

See #111
@thatbudakguy thatbudakguy linked a pull request Nov 18, 2020 that will close this issue
thatbudakguy added a commit that referenced this issue Nov 20, 2020
- Make alignment non-mutating operation
- Combine Smith-Waterman aligner variants into one class
- Use alignment values to update match bounds

See #111
thatbudakguy added a commit that referenced this issue Nov 20, 2020
- Imports lingpy's Smith-Waterman implementation
- Adds classes for auto-derived and custom scoring aligners

See #111