Multiple sequence alignment gives sub-optimal results #288

sbliven · 2015-06-25T14:02:28Z

As reported on the mailing list, BioJava gives different results than expected for some DNASequence multiple alignments:

biojava:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAAC--GGGTCTT
TTGGGGCCTCTAAACGGG-TCTT

clustal:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGG-CTCT-AACGGG-TCTT
TTGGGGCCTCTAAACGGG-TCTT

The choice of scoring matrix might have some influence on the result (e.g. implementing the IUB matrix might help), but it is more likely to come from the missing implementation of the refinement step (currently a TODO). Either way we should look into the example more.

The text was updated successfully, but these errors were encountered:

lafita · 2016-02-25T16:36:22Z

Multiple sequence alignments are never guaranteed to give an optimal solution and to define an optimal solution of an alignment the score of each possible alignment is needed.

The observed differences here might be caused by the different order of incorporation of the sequences into the alignment and differences in gap penalties. If the difference between gap opening and extension penalties in BioJava is lower than a C-A mismatch, the BioJava alignment can score higher than the clustal one.

@sbliven, do you remember the scoring matrix and gap penalties you used for the alignment?

andreasprlic · 2016-03-02T16:01:05Z

can we close this?

stefanharjes · 2016-03-03T03:44:34Z

@lafita if I try to analyse it in detail, there indeed seems to be a dependence on the order in which the sequences are provided:

public static void main(String[] args) { 
        SubstitutionMatrix<NucleotideCompound> matrix = SubstitutionMatrixHelper.getNuc4_4();
        SimpleGapPenalty gapP = new SimpleGapPenalty((short)5, (short)2);
        DNASequence a = null,b = null,c = null;
        try {
            a = new DNASequence("TTGGGGCCTCTAAACGGGGTCTT");
            b = new DNASequence("TTGGGGCTCTAACGGGTCTT");
            c = new DNASequence("TTGGGGCCTCTAAACGGGTCTT");
        } catch (CompoundNotFoundException e) {
            e.printStackTrace();
        }
        SequencePair<DNASequence, NucleotideCompound> psa =
                Alignments.getPairwiseAlignment(a, b, PairwiseSequenceAlignerType.LOCAL, gapP, matrix);
        System.out.println("alignment a-b: \n" + psa);
        psa = Alignments.getPairwiseAlignment(b, a, PairwiseSequenceAlignerType.LOCAL, gapP, matrix);
        System.out.println("alignment b-a: \n" + psa); 
       }

results in:

alignment a-b: 
TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAA-CGGG-TCTT

alignment b-a: 
TTGGGG-CTCT-AAC-GGGTCTT
TTGGGGCCTCTAAACGGGGTCTT

while the same analysis with emboss needle is submission order independent (with the same gap penalty 5/2):

a                  1  TTGGGGCCTCTAAACGGGGTCTT     22
b                  1  TTGGGG-CTCT-AAC-GGGTCTT     20
#########  versus ###########################
b                  1  TTGGGG-CTCT-AAC-GGGTCTT     20
a                  1  TTGGGGCCTCTAAACGGGGTCTT     22

I do not quite understand why the order of submission should result in two different alignments? And this result throws a stick in my algorithm, as I do not know the order in which a bunch of sequences will be aligned. Could you eventually point out a way around this problem?

Regards
Stefan

lafita · 2016-03-08T09:09:16Z

Hi @stefanharjes,

Actually, the example you provided does not demonstrate that there is an order dependence in the sequence alignment, since both results have the same score (they are both optimal solutions). A sequence alignment problem can have multiple optimal solutions (sometimes all solutions cannot even be enumerated because of the combinatorial explosion). The algorithm is guaranteed to return an optimal solution, but it will be only one of the possible.

If you calculate the scores for both results, you can see that they are exactly the same, no matter what scoring matrix or gap penalty is used:

alignment a-b:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAA-CGGG-TCTT

7 T-T pairs
7 G-G pairs
4 C-C pairs
2 A-A pairs
3 gap opening penalty

Score(a,b) = 7*S(T,T) + 7*S(G,G) + 4*S(C,C) + 2*S(A,A) - 3*Gap_Opening

alignment b-a:

TTGGGG-CTCT-AAC-GGGTCTT
TTGGGGCCTCTAAACGGGGTCTT

7 T-T pairs
7 G-G pairs
4 C-C pairs
2 A-A pairs
3 gap opening penalty

Score(b,a) = 7*S(T,T) + 7*S(G,G) + 4*S(C,C) + 2*S(A,A) - 3*Gap_Opening

Probably the emboss needle program sorts the sequences before the alignment, or maybe they have some heuristic on which of the optimal results to return (in the solution reconstruction step of DP). I do not think this is something to worry about, but we could search and/or ask how to return always the same solution for the same two sequences when multiple optimal solutions exist.

This issue, however, is about multiple sequence alignments, where the optimal solution is not guaranteed, since the search space is much bigger. In the progressive alignment algorithm, pairwise alignments are combined to generate the final multiple sequence alignment. When I pointed to the order of alignment I was referring to the order in which the pairwise alignments are combined. To compute the order of combination a guide tree of the sequences is made. The original article of the algorithm is from Feng and Doolittle, 1987. The way in which the tree is computed (order of alignment) can affect the solution, but this is an algorithm design (can change between implementations) and not a bug.

lafita · 2016-03-08T10:34:59Z

I have been testing the code and the difference that @sbliven was getting seem to be related to the gap penalties. In biojava, using a gap penalty of 5/2 the clustal alignment is obtained and using the default biojava gap penalty of 10/1 the biojava alignment is obtained.

The scoring matrix used is nuc-4_4, which gives a -4 for a missmatch and a 5 for a match. When calculating the scores, in both gap penalties the clustal alignment is the optimal one, so something is wrong as @sbliven pointed out:

BioJava:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAAC--GGGTCTT
TTGGGGCCTCTAAACGGG-TCTT

Score = matches + missmatches + gap_open + gap_extend 
      = (18 * 2 + 1) * 5 + (-4) + 3 * (-10) + (-1) = 150

Clustal:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGG-CTCT-AACGGG-TCTT
TTGGGGCCTCTAAACGGG-TCTT

Score = matches + missmatches + gap_open + gap_extend 
      = (20 * 2) * 5 + 0 + 4 * (-10) + 0 = 160

I created a test that uses both gap penalties, and they should return the same solution. However, the 10/1 gap penalty is not passing right now, so I set it to be ignored until this issue is fixed.

One of the tests is ignored until issue is fixed.

sbliven · 2016-03-08T17:25:04Z

It seems that a refinement stage was intended for MSA but never implemented. See Alignments.getMultipleSequenceAlignment

andreasprlic · 2016-09-14T15:48:24Z

nothing is really wrong here

sbliven added the bug Bugs and bugfixes label Jun 25, 2015

andreasprlic added this to the BioJava 4.2.0 milestone Mar 2, 2016

andreasprlic closed this as completed Mar 2, 2016

abradle reopened this Mar 2, 2016

abradle modified the milestones: BioJava 5.0.0, BioJava 4.2.0 Mar 2, 2016

lafita mentioned this issue Mar 8, 2016

Alignments do not allow consecutive gaps #213

Open

lafita added a commit that referenced this issue Mar 8, 2016

Create a test for issue #288 sub-optimal MSA

e449976

One of the tests is ignored until issue is fixed.

andreasprlic closed this as completed Sep 14, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple sequence alignment gives sub-optimal results #288

Multiple sequence alignment gives sub-optimal results #288

sbliven commented Jun 25, 2015

lafita commented Feb 25, 2016

andreasprlic commented Mar 2, 2016

stefanharjes commented Mar 3, 2016

lafita commented Mar 8, 2016

lafita commented Mar 8, 2016

sbliven commented Mar 8, 2016

andreasprlic commented Sep 14, 2016

Multiple sequence alignment gives sub-optimal results #288

Multiple sequence alignment gives sub-optimal results #288

Comments

sbliven commented Jun 25, 2015

lafita commented Feb 25, 2016

andreasprlic commented Mar 2, 2016

stefanharjes commented Mar 3, 2016

lafita commented Mar 8, 2016

lafita commented Mar 8, 2016

sbliven commented Mar 8, 2016

andreasprlic commented Sep 14, 2016