Alignments do not allow consecutive gaps #213

sbliven · 2014-12-02T16:42:15Z

The alignments are defined in such a way that it is not possible to have consecutive gaps. Here's a test case (to go in NeedlemanWunschTest. Requires 4bc1d20.):

@Test
    public void testSingleMismatch() throws CompoundNotFoundException {
        SubstitutionMatrix<AminoAcidCompound> matrix = SubstitutionMatrixHelper.getIdentity();


        ProteinSequence s1 = new ProteinSequence("AAAKKDDDFDD");
        ProteinSequence s2 = new ProteinSequence("AAAFFDDDFDD");

        GapPenalty penalty = new SimpleGapPenalty((short)9,(short)1);

        NeedlemanWunsch<ProteinSequence,AminoAcidCompound> nw = 
                new NeedlemanWunsch<ProteinSequence,AminoAcidCompound>(s1,s2, penalty, matrix);

        String expected1 = "AAAKK--DDDFDD";
        String expected2 = "AAA--FFDDDFDD";

        assertTrue(String.format("Incorrect alignment. Expected:%n  %s%n  %s%nFound:%n  %s%n  %s%n",
                expected1, expected2,
                nw.getPair().getAlignedSequence(1),nw.getPair().getAlignedSequence(2) ),
                expected1.equals(nw.getPair().getAlignedSequence(1).toString()) &&
                expected2.equals(nw.getPair().getAlignedSequence(2).toString()) );
    }

The key point here is that the 'KK' and 'FF' sections don't align to anything, and we would like both to be considered gaps, giving a score of -4 (9 matches, 1 gap initiation, 4 gap extensions). The current NW alignment is the following:

AAAKKDDDF-----DD
AAA-----FFDDDFDD

with a score of -22 (6 matches, 2 initiations, 10 gaps). That is because it has to find at least one residue after the gap to align. It gets even worse if such a residue is not available. If the trailing 'F' were not there, we get an alignment with a score of -10012 (8 matches, 2 initiations, 2 gaps, and a forbidden mismatch):

AAA-KKDDDDD
AAAFF-DDDDD

This problem is fundamental to the recursion equations we selected for dynamic programming. Changing it would be a big project, but I think that this is a moderately serious bug.

The text was updated successfully, but these errors were encountered:

sbliven · 2014-12-02T16:44:39Z

Note that this is also a problem in CE alignments (FATCAT is ok). Here's a contrived example that shows the problem quite clearly. The G/N residues greatly increase the overall RMSD, but they have to be included because of the gap definition.

@ignore

Currently fails, so marked @ignore

@ignore

Currently fails, so marked @ignore

@ignore

Currently fails, so marked @ignore

dmyersturnbull · 2015-02-08T23:29:10Z

The issue is mostly in setScorePoint. It should be:

    public static Last[] setScorePoint(int x, int y, int gop, int gep, int sub, int[][][] scores) {
        Last[] pointers = new Last[3];

        // substitution
        {
            int fromSubstitution = scores[x - 1][y - 1][0] + sub;
            int fromDeletion = scores[x - 1][y - 1][1] + sub;
            int fromInsertion = scores[x - 1][y - 1][2] + sub;
            if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
                scores[x][y][0] = fromDeletion;
                pointers[0] = Last.DELETION;
            } else if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
                scores[x][y][0] = fromInsertion;
                pointers[0] = Last.INSERTION;
            } else { // includes if strictly equal, so this is the preferred option
                scores[x][y][0] = fromSubstitution;
                pointers[0] = Last.SUBSTITUTION;
            }
        }

        // deletion
        {
            int fromDeletion = scores[x - 1][y][1] + gep;
            int fromSubstitution = scores[x - 1][y][0] + gop + gep;
            int fromInsertion = scores[x - 1][y][2] + gop + gep; // insertion to deletion; still requires gap open
            if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
                scores[x][y][1] = fromDeletion;
                pointers[1] = Last.DELETION;
            } else if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
                scores[x][y][1] = fromInsertion;
                pointers[1] = Last.INSERTION;
            } else { // includes if strictly equal, so this is the preferred option
                scores[x][y][1] = fromSubstitution;
                pointers[1] = Last.SUBSTITUTION;
            }
        }

        // insertion
        {
            int fromInsertion = scores[x][y - 1][2] + gep;
            int fromSubstitution = scores[x][y - 1][0] + gop + gep;
            int fromDeletion = scores[x][y - 1][1] + gop + gep;
            if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
                scores[x][y][2] = fromSubstitution;
                pointers[2] = Last.INSERTION;
            } else if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
                scores[x][y][2] = fromDeletion;
                pointers[2] = Last.DELETION;
            } else {
                scores[x][y][2] = fromInsertion;
                pointers[2] = Last.SUBSTITUTION;
            }
        }

        return pointers;
    }

Unfortunately, this change results in NullPointerExceptions that I haven't sorted out.

stefanharjes · 2015-06-19T05:05:51Z

I do not quite understand this issue of extra gaps. If I compare the output
of 'standard' alignment programs like needle, clustalx and muscle, all these
programs produce basically the same syntax, where the two not matching
amino acids are marked as a mismatch, however there are no gaps introduced.

needle:
1 AAAKKDDDFDD     11
  |||..||||||
1 AAAFFDDDFDD     11

clustal:
ta              AAAKKDDDFDD
tb              AAAFFDDDFDD
                ***  ******

In structural alignments as you showed I think the syntax is different and I
have seen gaps introduced for non-structurally aligned parts but in order to
avoid using them for the RMSD, the non matching parts are often cut away
entierly. However, I think for normal sequence based alignments there should be
no gaps!

sbliven · 2015-06-19T08:15:42Z

@stefanharjes The examples you showed introduce mismatches, but they still align the K & F regions. The 'Identity' substitution matrix I'm using has a mismatch score of -1000, which effectively disallows mismatches.

lafita · 2016-02-25T14:42:21Z

This is not a bug in biojava, but a feature or characteristic of the dynamic programming algorithms for sequence alignment. The solution would involve changing the design of the DP table calculation and the way the solution is reconstructed.

The explanation of why such consecutive gaps are not allowed is in the DP table calculation: only substitution, insertion and deletion are allowed.

T(i,j) = max{ T(i-1,j-1) + Score(i,j), T(i-1,j) - Gap Penalty, T(i,j-1) - Gap Penalty }

Note that no term accounts for introducing a consecutive gap (gap at both i and j), which would be done by adding to the T(i,j) the following term:

T(i-1, j-1) - Consecutive Gap Penalty

Thus, using the identity matrix that @sbliven described will not be useful to find consecutive gaps (and thus will not disallow mismatches), because they are not allowed by algorithm definition. This will not be a problem for normal substitution matrices, but only with this identity matrix where only identical residues are "allowed".

My opinion is that consecutive gaps make no sense evolutionarily, because both compared sequences are assumed to share a common ancestor (be evolutionarily related), so sequences can evolve by aminoacid substitution and deletion or insertion of a sequence region or single aminoacid. A consecutive gap would mean an independent insertion at the same position in both sequences once they have diverged. Even accepting this possible, the probability would be much lower compared to the other events so that the algorithm performance might be compromised (false positives/negatives) and an extra parameter for consecutive gaps will have to be determined.

The same applies for structural alignment algorithms using DP.

stefanharjes · 2016-02-27T02:23:13Z

HI Aleix,
it has been quite a while, and I will look up the issue again. As fas as I remember, I compared the pairwise alignment to emboss needle and the multiple alignment to clustalx. If I recall correctly the pairwise homology returned different gaps from the multiple alignment?

Regarding evolution, I believe that it is commonly agreed upon, that eukaryotic evolution is not driven by single AS substitution, but by domain exchange. I think you can introduce quite big changes using homologous recombination. Also my directed evolution approaches often involve the removal of whole loops...

RegardsStefan

Aleix Lafita <notifications@github.com> schrieb am 3:42 Freitag, 26.Februar 2016:

This is not a bug in biojava, but a feature or characteristic of the dynamic programming algorithms for sequence alignment. The solution would involve changing the design of the DP table calculation and the way the solution is reconstructed.The explanation of why such consecutive gaps are not allowed is in the DP table calculation: only substitution, insertion and deletion are allowed.T(i,j) = max{ T(i-1,j-1) + Score(i,j), T(i-1,j) - Gap Penalty, T(i,j-1) - Gap Penalty }
Note that no term accounts for introducing a consecutive gap (gap at both i and j), which would be done by adding to the T(i,j) the following term:T(i-1, j-1) - Consecutive Gap Penalty
Thus, using the identity matrix that @sbliven described will not be useful to find consecutive gaps (and thus will not disallow mismatches), because they are not allowed by algorithm definition. This will not be a problem for normal substitution matrices, but only with this identity matrix where only identical residues are "allowed".My opinion is that consecutive gaps make no sense evolutionarily, because both compared sequences are assumed to share a common ancestor (be evolutionarily related), so sequences can evolve by aminoacid substitution and deletion or insertion of a sequence region or single aminoacid. A consecutive gap would mean an independent insertion at the same position in both sequences once they have diverged. Even accepting this possible, the probability would be much lower compared to the other events so that the algorithm performance might be compromised (false positives/negatives) and an extra parameter for consecutive gaps will have to be determined.—
Reply to this email directly or view it on GitHub.

lafita · 2016-02-29T14:59:02Z

Hi @stefanharjes,

I was actually arguing to close the issue (without any further modification), because DP algorithms for sequence alignment have the underlying model of AA substitution. My point is that the matrix called here identity is not achieving what it is intended for, but it is not because of a mistake in BioJava, but because of a limitation of the alignment algorithms themselves. This means that the alignment algorithms are correctly implemented in BioJava, so there is no bug to fix.
Would you agree in closing it?

stefanharjes · 2016-03-08T03:46:09Z

Hi Aleix,
as I wrote on github, I think that the issue is caused basically because the order in which the sequences are submitted causes different alignments. I always expect that "a+b=b+a" and deduced this to be the case for alignments as well. You calculate two different alignments depending on the order of which the sequences are submitted? I think that this result makes the algorithm a lot less usable. Also my personal benchmark for pairwise alignment: emboss needle does calculate order independent alignments.
RegardsStefan

Aleix Lafita <notifications@github.com> schrieb am 3:59 Dienstag, 1.März 2016:

Hi @stefanharjes,I was actually arguing to close the issue (without any further modification), because DP algorithms for sequence alignment have the underlying model of AA substitution. My point is that the matrix called here identity is not achieving what it is intended for, but it is not because of a mistake in BioJava, but because of a limitation of the alignment algorithms themselves. This means that the alignment algorithms are correctly implemented in BioJava, so there is no bug to fix.
Would you agree in closing it?—
Reply to this email directly or view it on GitHub.

lafita · 2016-03-08T08:37:57Z

@stefanharjes I think you confused this issue with #288 (about sub-optimal results). I will answer to your comment in that issue instead of here.

This issue is about consecutive gaps, and I was arguing for closing it since I see the expected behavior of the DP sequence alignment algorithms.

sbliven added the bug Bugs and bugfixes label Dec 2, 2014

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Dec 3, 2014

Adding test for biojava#213

156d2ec

Currently fails, so marked @ignore

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Dec 3, 2014

Adding test for biojava#213

d8cf66a

Currently fails, so marked @ignore

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Dec 9, 2014

Adding test for biojava#213

1a35e5d

Currently fails, so marked @ignore

dmyersturnbull mentioned this issue Dec 19, 2015

Biojava handles contiguous gaps incorrectly dmyersturnbull/sequence-alignment#4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alignments do not allow consecutive gaps #213

Alignments do not allow consecutive gaps #213

sbliven commented Dec 2, 2014

sbliven commented Dec 2, 2014

dmyersturnbull commented Feb 8, 2015

stefanharjes commented Jun 19, 2015

sbliven commented Jun 19, 2015

lafita commented Feb 25, 2016

stefanharjes commented Feb 27, 2016

lafita commented Feb 29, 2016

stefanharjes commented Mar 8, 2016

lafita commented Mar 8, 2016

Alignments do not allow consecutive gaps #213

Alignments do not allow consecutive gaps #213

Comments

sbliven commented Dec 2, 2014

sbliven commented Dec 2, 2014

dmyersturnbull commented Feb 8, 2015

stefanharjes commented Jun 19, 2015

sbliven commented Jun 19, 2015

lafita commented Feb 25, 2016

stefanharjes commented Feb 27, 2016

lafita commented Feb 29, 2016

stefanharjes commented Mar 8, 2016

lafita commented Mar 8, 2016