Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple sequence alignment gives sub-optimal results #288

Closed
sbliven opened this issue Jun 25, 2015 · 7 comments
Closed

Multiple sequence alignment gives sub-optimal results #288

sbliven opened this issue Jun 25, 2015 · 7 comments
Labels
bug Bugs and bugfixes
Milestone

Comments

@sbliven
Copy link
Member

sbliven commented Jun 25, 2015

As reported on the mailing list, BioJava gives different results than expected for some DNASequence multiple alignments:

biojava:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAAC--GGGTCTT
TTGGGGCCTCTAAACGGG-TCTT

clustal:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGG-CTCT-AACGGG-TCTT
TTGGGGCCTCTAAACGGG-TCTT

The choice of scoring matrix might have some influence on the result (e.g. implementing the IUB matrix might help), but it is more likely to come from the missing implementation of the refinement step (currently a TODO). Either way we should look into the example more.

@sbliven sbliven added the bug Bugs and bugfixes label Jun 25, 2015
@lafita
Copy link
Member

lafita commented Feb 25, 2016

Multiple sequence alignments are never guaranteed to give an optimal solution and to define an optimal solution of an alignment the score of each possible alignment is needed.

The observed differences here might be caused by the different order of incorporation of the sequences into the alignment and differences in gap penalties. If the difference between gap opening and extension penalties in BioJava is lower than a C-A mismatch, the BioJava alignment can score higher than the clustal one.

@sbliven, do you remember the scoring matrix and gap penalties you used for the alignment?

@andreasprlic
Copy link
Member

can we close this?

@andreasprlic andreasprlic added this to the BioJava 4.2.0 milestone Mar 2, 2016
@abradle abradle reopened this Mar 2, 2016
@abradle abradle modified the milestones: BioJava 5.0.0, BioJava 4.2.0 Mar 2, 2016
@stefanharjes
Copy link
Contributor

@lafita if I try to analyse it in detail, there indeed seems to be a dependence on the order in which the sequences are provided:

public static void main(String[] args) { 
        SubstitutionMatrix<NucleotideCompound> matrix = SubstitutionMatrixHelper.getNuc4_4();
        SimpleGapPenalty gapP = new SimpleGapPenalty((short)5, (short)2);
        DNASequence a = null,b = null,c = null;
        try {
            a = new DNASequence("TTGGGGCCTCTAAACGGGGTCTT");
            b = new DNASequence("TTGGGGCTCTAACGGGTCTT");
            c = new DNASequence("TTGGGGCCTCTAAACGGGTCTT");
        } catch (CompoundNotFoundException e) {
            e.printStackTrace();
        }
        SequencePair<DNASequence, NucleotideCompound> psa =
                Alignments.getPairwiseAlignment(a, b, PairwiseSequenceAlignerType.LOCAL, gapP, matrix);
        System.out.println("alignment a-b: \n" + psa);
        psa = Alignments.getPairwiseAlignment(b, a, PairwiseSequenceAlignerType.LOCAL, gapP, matrix);
        System.out.println("alignment b-a: \n" + psa); 
       }

results in:

alignment a-b: 
TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAA-CGGG-TCTT

alignment b-a: 
TTGGGG-CTCT-AAC-GGGTCTT
TTGGGGCCTCTAAACGGGGTCTT

while the same analysis with emboss needle is submission order independent (with the same gap penalty 5/2):

a                  1  TTGGGGCCTCTAAACGGGGTCTT     22
b                  1  TTGGGG-CTCT-AAC-GGGTCTT     20
#########  versus ###########################
b                  1  TTGGGG-CTCT-AAC-GGGTCTT     20
a                  1  TTGGGGCCTCTAAACGGGGTCTT     22

I do not quite understand why the order of submission should result in two different alignments? And this result throws a stick in my algorithm, as I do not know the order in which a bunch of sequences will be aligned. Could you eventually point out a way around this problem?

Regards
Stefan

@lafita
Copy link
Member

lafita commented Mar 8, 2016

Hi @stefanharjes,

Actually, the example you provided does not demonstrate that there is an order dependence in the sequence alignment, since both results have the same score (they are both optimal solutions). A sequence alignment problem can have multiple optimal solutions (sometimes all solutions cannot even be enumerated because of the combinatorial explosion). The algorithm is guaranteed to return an optimal solution, but it will be only one of the possible.

If you calculate the scores for both results, you can see that they are exactly the same, no matter what scoring matrix or gap penalty is used:

alignment a-b:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAA-CGGG-TCTT

7 T-T pairs
7 G-G pairs
4 C-C pairs
2 A-A pairs
3 gap opening penalty

Score(a,b) = 7*S(T,T) + 7*S(G,G) + 4*S(C,C) + 2*S(A,A) - 3*Gap_Opening

alignment b-a:

TTGGGG-CTCT-AAC-GGGTCTT
TTGGGGCCTCTAAACGGGGTCTT

7 T-T pairs
7 G-G pairs
4 C-C pairs
2 A-A pairs
3 gap opening penalty

Score(b,a) = 7*S(T,T) + 7*S(G,G) + 4*S(C,C) + 2*S(A,A) - 3*Gap_Opening

Probably the emboss needle program sorts the sequences before the alignment, or maybe they have some heuristic on which of the optimal results to return (in the solution reconstruction step of DP). I do not think this is something to worry about, but we could search and/or ask how to return always the same solution for the same two sequences when multiple optimal solutions exist.

This issue, however, is about multiple sequence alignments, where the optimal solution is not guaranteed, since the search space is much bigger. In the progressive alignment algorithm, pairwise alignments are combined to generate the final multiple sequence alignment. When I pointed to the order of alignment I was referring to the order in which the pairwise alignments are combined. To compute the order of combination a guide tree of the sequences is made. The original article of the algorithm is from Feng and Doolittle, 1987. The way in which the tree is computed (order of alignment) can affect the solution, but this is an algorithm design (can change between implementations) and not a bug.

@lafita
Copy link
Member

lafita commented Mar 8, 2016

I have been testing the code and the difference that @sbliven was getting seem to be related to the gap penalties. In biojava, using a gap penalty of 5/2 the clustal alignment is obtained and using the default biojava gap penalty of 10/1 the biojava alignment is obtained.

The scoring matrix used is nuc-4_4, which gives a -4 for a missmatch and a 5 for a match. When calculating the scores, in both gap penalties the clustal alignment is the optimal one, so something is wrong as @sbliven pointed out:

BioJava:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGGC-TCTAAC--GGGTCTT
TTGGGGCCTCTAAACGGG-TCTT

Score = matches + missmatches + gap_open + gap_extend 
      = (18 * 2 + 1) * 5 + (-4) + 3 * (-10) + (-1) = 150

Clustal:

TTGGGGCCTCTAAACGGGGTCTT
TTGGGG-CTCT-AACGGG-TCTT
TTGGGGCCTCTAAACGGG-TCTT

Score = matches + missmatches + gap_open + gap_extend 
      = (20 * 2) * 5 + 0 + 4 * (-10) + 0 = 160

I created a test that uses both gap penalties, and they should return the same solution. However, the 10/1 gap penalty is not passing right now, so I set it to be ignored until this issue is fixed.

lafita added a commit that referenced this issue Mar 8, 2016
One of the tests is ignored until issue is fixed.
@sbliven
Copy link
Member Author

sbliven commented Mar 8, 2016

It seems that a refinement stage was intended for MSA but never implemented. See Alignments.getMultipleSequenceAlignment

@andreasprlic
Copy link
Member

nothing is really wrong here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and bugfixes
Projects
None yet
Development

No branches or pull requests

5 participants