New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiple sequence alignment gives sub-optimal results #288
Comments
Multiple sequence alignments are never guaranteed to give an optimal solution and to define an optimal solution of an alignment the score of each possible alignment is needed. The observed differences here might be caused by the different order of incorporation of the sequences into the alignment and differences in gap penalties. If the difference between gap opening and extension penalties in BioJava is lower than a C-A mismatch, the BioJava alignment can score higher than the clustal one. @sbliven, do you remember the scoring matrix and gap penalties you used for the alignment? |
can we close this? |
@lafita if I try to analyse it in detail, there indeed seems to be a dependence on the order in which the sequences are provided: public static void main(String[] args) {
SubstitutionMatrix<NucleotideCompound> matrix = SubstitutionMatrixHelper.getNuc4_4();
SimpleGapPenalty gapP = new SimpleGapPenalty((short)5, (short)2);
DNASequence a = null,b = null,c = null;
try {
a = new DNASequence("TTGGGGCCTCTAAACGGGGTCTT");
b = new DNASequence("TTGGGGCTCTAACGGGTCTT");
c = new DNASequence("TTGGGGCCTCTAAACGGGTCTT");
} catch (CompoundNotFoundException e) {
e.printStackTrace();
}
SequencePair<DNASequence, NucleotideCompound> psa =
Alignments.getPairwiseAlignment(a, b, PairwiseSequenceAlignerType.LOCAL, gapP, matrix);
System.out.println("alignment a-b: \n" + psa);
psa = Alignments.getPairwiseAlignment(b, a, PairwiseSequenceAlignerType.LOCAL, gapP, matrix);
System.out.println("alignment b-a: \n" + psa);
} results in:
while the same analysis with emboss needle is submission order independent (with the same gap penalty 5/2):
I do not quite understand why the order of submission should result in two different alignments? And this result throws a stick in my algorithm, as I do not know the order in which a bunch of sequences will be aligned. Could you eventually point out a way around this problem? Regards |
Hi @stefanharjes, Actually, the example you provided does not demonstrate that there is an order dependence in the sequence alignment, since both results have the same score (they are both optimal solutions). A sequence alignment problem can have multiple optimal solutions (sometimes all solutions cannot even be enumerated because of the combinatorial explosion). The algorithm is guaranteed to return an optimal solution, but it will be only one of the possible. If you calculate the scores for both results, you can see that they are exactly the same, no matter what scoring matrix or gap penalty is used: alignment a-b: TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAA-CGGG-TCTT
7 T-T pairs
7 G-G pairs
4 C-C pairs
2 A-A pairs
3 gap opening penalty
Score(a,b) = 7*S(T,T) + 7*S(G,G) + 4*S(C,C) + 2*S(A,A) - 3*Gap_Opening alignment b-a: TTGGGG-CTCT-AAC-GGGTCTT
TTGGGGCCTCTAAACGGGGTCTT
7 T-T pairs
7 G-G pairs
4 C-C pairs
2 A-A pairs
3 gap opening penalty
Score(b,a) = 7*S(T,T) + 7*S(G,G) + 4*S(C,C) + 2*S(A,A) - 3*Gap_Opening Probably the emboss needle program sorts the sequences before the alignment, or maybe they have some heuristic on which of the optimal results to return (in the solution reconstruction step of DP). I do not think this is something to worry about, but we could search and/or ask how to return always the same solution for the same two sequences when multiple optimal solutions exist. This issue, however, is about multiple sequence alignments, where the optimal solution is not guaranteed, since the search space is much bigger. In the progressive alignment algorithm, pairwise alignments are combined to generate the final multiple sequence alignment. When I pointed to the order of alignment I was referring to the order in which the pairwise alignments are combined. To compute the order of combination a guide tree of the sequences is made. The original article of the algorithm is from Feng and Doolittle, 1987. The way in which the tree is computed (order of alignment) can affect the solution, but this is an algorithm design (can change between implementations) and not a bug. |
I have been testing the code and the difference that @sbliven was getting seem to be related to the gap penalties. In biojava, using a gap penalty of 5/2 the clustal alignment is obtained and using the default biojava gap penalty of 10/1 the biojava alignment is obtained. The scoring matrix used is nuc-4_4, which gives a -4 for a missmatch and a 5 for a match. When calculating the scores, in both gap penalties the clustal alignment is the optimal one, so something is wrong as @sbliven pointed out: BioJava: TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAAC--GGGTCTT
TTGGGGCCTCTAAACGGG-TCTT
Score = matches + missmatches + gap_open + gap_extend
= (18 * 2 + 1) * 5 + (-4) + 3 * (-10) + (-1) = 150 Clustal: TTGGGGCCTCTAAACGGGGTCTT
TTGGGG-CTCT-AACGGG-TCTT
TTGGGGCCTCTAAACGGG-TCTT
Score = matches + missmatches + gap_open + gap_extend
= (20 * 2) * 5 + 0 + 4 * (-10) + 0 = 160 I created a test that uses both gap penalties, and they should return the same solution. However, the 10/1 gap penalty is not passing right now, so I set it to be ignored until this issue is fixed. |
One of the tests is ignored until issue is fixed.
It seems that a refinement stage was intended for MSA but never implemented. See Alignments.getMultipleSequenceAlignment |
nothing is really wrong here |
As reported on the mailing list, BioJava gives different results than expected for some DNASequence multiple alignments:
biojava:
clustal:
The choice of scoring matrix might have some influence on the result (e.g. implementing the IUB matrix might help), but it is more likely to come from the missing implementation of the refinement step (currently a TODO). Either way we should look into the example more.
The text was updated successfully, but these errors were encountered: