Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algorithm details #5

Open
de-code opened this issue May 5, 2017 · 6 comments
Open

Algorithm details #5

de-code opened this issue May 5, 2017 · 6 comments

Comments

@de-code
Copy link
Contributor

de-code commented May 5, 2017

Hi,

Thank you for providing the module.
I was wondering whether you could provide information on the algorithms used?
e.g. Smith Watermann?
(While your code is well structured I am getting a bit lost with some of the variables)

I am considering using / modifying it, but have currently the following blockers:

  • It doesn't give me back the indices or the original sequence elements (because I am meant to encode them)
  • Recursion is in general good but in this case it leads to a too deep stack level with long sequences
@eseraygun
Copy link
Owner

eseraygun commented May 5, 2017

Sorry for the lack of documentation. I should improve it sometime.

Global sequence alignment uses Needleman-Wunsch algorithm. Local sequence alignment uses Smith-Waterman algorithm.

About your blockers:

  • It does give you the original elements if you decode the aligned sequences. Check out the final loop in the example. As for the indices, you're right; it doesn't give you the original indices. This can definitely be improved.

  • I was a little bit obsessed about generating all alignments with the max score when I wrote the backtracing algorithm. Hence the recursion. It should be pretty easy to write an iterative, constant space backtracer that stops when it founds one solution.

So there seems to be three issues that needs fixing:

  1. Document the code
  2. Store original indices in the alignments or at least provide a way to infer them.
  3. Implement a greedy, constant space backtracing.

I hope I can find time to address these but please feel free to contribute if you have the motivation.

@de-code
Copy link
Contributor Author

de-code commented May 5, 2017

Thank you for getting back to me and clarifying it. What algorithm is StrictGlobalSequenceAligner meant to be?

Regarding getting the original elements back. I tried that. It didn't work. I realised it is because of the Vocabulary as it is converting it to a number. So I can get back some object that was put into the sequence but not the one from the particular sequence. (e.g. my object might contain the index, or a reference where the element originally came from).

Just some further question before I embark on a PR:

How would you feel about making the vocabulary optional? I can see it being a good optimisation when matching but doesn't seem to be required for the algorithm.

How would you feel about making breaking changes?
e.g. if I was to address it, then I would probably change SequenceAlignment how it stores sequences and the backtracing (the latter is probably more internal).

@eseraygun
Copy link
Owner

eseraygun commented May 5, 2017

Both GlobalSequenceAligner and StrictGlobalSequenceAligner use Needleman-Wunsch. GlobalSequenceAligner is modified to not include the initial and terminal gaps in the result. This is a common practice in bioinformatics. It is sometimes called semiglobal or glocal alignment.

Elements are assumed to have the equivalence relationship: They represent the same thing if and only if they are equal. If you have non-trivial objects as elements, you can implement __eq__ and __hash__ methods in a way that satisfies the equivalence relationship.

I wrote this library out of necessity and one of my requirements was to compare all pairs of sequences in a set. That's why I needed the vocabulary: You can encode all sequences once, perform the alignments on fast integer arrays and decode only the alignments that you find useful. I think that's a very common use case for alignment tasks and I wouldn't want to lose the performance gain that comes with vocabularies. However, if you think you can generalize the existing algorithms without causing extra lag and without duplicating the code, you can go for it. I'm sure Python's duck typing can handle it :)

I'm not sure how many people are actively using the library but I'm aware that several people used it in the past. I would want to break their code (nor mine). Constant time backtracing can definitely be implemented as an extension to the existing code. If you need to get rid of the vocabulary though, I would like to see the end result before committing to it.

@de-code
Copy link
Contributor Author

de-code commented May 5, 2017

Okay, you have a PR.

I have to admit I don't know much about bioinformatics. I just want to align text from different sources.

I did use __eq__ and __hash__ - but the issue with the vocabulary is that those elements are then equal (otherwise they wouldn't align). As it will then use the integer representation it won't be possible to translate the integer back to the original sequence object (unless I have the index or it is getting the object from the passed in sequence and not the vocabulary - both will work with the PR)

@eseraygun
Copy link
Owner

I see your problem. I guess you're right. Vocabulary is not ideal in cases where you cannot simply enumerate all possible elements.

I'm pretty much convinced that an interface that supports arbitrary lists of objects would be more "pythonic" and useful for non-bioinformatics-related problems. So, I suggest we refactor the whole library without caring too much about backwards compatibility and release a version 2.0. I'll try to organize this through issues and labels.

@de-code
Copy link
Contributor Author

de-code commented May 6, 2017

Okay, sure. I think it would be possible to be largely backward compatible by keeping the Sequence wrappers as an optional extra. But I guess it would be good to allow a fresh start. I was also considering Cython.

If you prefer to allow more time for the changes then I could keep my changes in a branch on my fork for now (that will allow me to continue with what I actually wanted to use it for).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants