Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignments do not allow consecutive gaps #213

Open
sbliven opened this issue Dec 2, 2014 · 9 comments
Open

Alignments do not allow consecutive gaps #213

sbliven opened this issue Dec 2, 2014 · 9 comments
Labels
bug Bugs and bugfixes

Comments

@sbliven
Copy link
Member

sbliven commented Dec 2, 2014

The alignments are defined in such a way that it is not possible to have consecutive gaps. Here's a test case (to go in NeedlemanWunschTest. Requires 4bc1d20.):

@Test
    public void testSingleMismatch() throws CompoundNotFoundException {
        SubstitutionMatrix<AminoAcidCompound> matrix = SubstitutionMatrixHelper.getIdentity();


        ProteinSequence s1 = new ProteinSequence("AAAKKDDDFDD");
        ProteinSequence s2 = new ProteinSequence("AAAFFDDDFDD");

        GapPenalty penalty = new SimpleGapPenalty((short)9,(short)1);

        NeedlemanWunsch<ProteinSequence,AminoAcidCompound> nw = 
                new NeedlemanWunsch<ProteinSequence,AminoAcidCompound>(s1,s2, penalty, matrix);

        String expected1 = "AAAKK--DDDFDD";
        String expected2 = "AAA--FFDDDFDD";

        assertTrue(String.format("Incorrect alignment. Expected:%n  %s%n  %s%nFound:%n  %s%n  %s%n",
                expected1, expected2,
                nw.getPair().getAlignedSequence(1),nw.getPair().getAlignedSequence(2) ),
                expected1.equals(nw.getPair().getAlignedSequence(1).toString()) &&
                expected2.equals(nw.getPair().getAlignedSequence(2).toString()) );
    }

The key point here is that the 'KK' and 'FF' sections don't align to anything, and we would like both to be considered gaps, giving a score of -4 (9 matches, 1 gap initiation, 4 gap extensions). The current NW alignment is the following:

AAAKKDDDF-----DD
AAA-----FFDDDFDD

with a score of -22 (6 matches, 2 initiations, 10 gaps). That is because it has to find at least one residue after the gap to align. It gets even worse if such a residue is not available. If the trailing 'F' were not there, we get an alignment with a score of -10012 (8 matches, 2 initiations, 2 gaps, and a forbidden mismatch):

AAA-KKDDDDD
AAAFF-DDDDD

This problem is fundamental to the recursion equations we selected for dynamic programming. Changing it would be a big project, but I think that this is a moderately serious bug.

@sbliven sbliven added the bug Bugs and bugfixes label Dec 2, 2014
@sbliven
Copy link
Member Author

sbliven commented Dec 2, 2014

Note that this is also a problem in CE alignments (FATCAT is ok). Here's a contrived example that shows the problem quite clearly. The G/N residues greatly increase the overall RMSD, but they have to be included because of the gap definition.
d7bab312-3877-49d8-95e9-d5140862d861

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Dec 3, 2014
Currently fails, so marked @ignore
sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Dec 3, 2014
Currently fails, so marked @ignore
sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Dec 9, 2014
Currently fails, so marked @ignore
@dmyersturnbull
Copy link
Contributor

The issue is mostly in setScorePoint. It should be:

    public static Last[] setScorePoint(int x, int y, int gop, int gep, int sub, int[][][] scores) {
        Last[] pointers = new Last[3];

        // substitution
        {
            int fromSubstitution = scores[x - 1][y - 1][0] + sub;
            int fromDeletion = scores[x - 1][y - 1][1] + sub;
            int fromInsertion = scores[x - 1][y - 1][2] + sub;
            if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
                scores[x][y][0] = fromDeletion;
                pointers[0] = Last.DELETION;
            } else if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
                scores[x][y][0] = fromInsertion;
                pointers[0] = Last.INSERTION;
            } else { // includes if strictly equal, so this is the preferred option
                scores[x][y][0] = fromSubstitution;
                pointers[0] = Last.SUBSTITUTION;
            }
        }

        // deletion
        {
            int fromDeletion = scores[x - 1][y][1] + gep;
            int fromSubstitution = scores[x - 1][y][0] + gop + gep;
            int fromInsertion = scores[x - 1][y][2] + gop + gep; // insertion to deletion; still requires gap open
            if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
                scores[x][y][1] = fromDeletion;
                pointers[1] = Last.DELETION;
            } else if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
                scores[x][y][1] = fromInsertion;
                pointers[1] = Last.INSERTION;
            } else { // includes if strictly equal, so this is the preferred option
                scores[x][y][1] = fromSubstitution;
                pointers[1] = Last.SUBSTITUTION;
            }
        }

        // insertion
        {
            int fromInsertion = scores[x][y - 1][2] + gep;
            int fromSubstitution = scores[x][y - 1][0] + gop + gep;
            int fromDeletion = scores[x][y - 1][1] + gop + gep;
            if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
                scores[x][y][2] = fromSubstitution;
                pointers[2] = Last.INSERTION;
            } else if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
                scores[x][y][2] = fromDeletion;
                pointers[2] = Last.DELETION;
            } else {
                scores[x][y][2] = fromInsertion;
                pointers[2] = Last.SUBSTITUTION;
            }
        }

        return pointers;
    }

Unfortunately, this change results in NullPointerExceptions that I haven't sorted out.

@stefanharjes
Copy link
Contributor

I do not quite understand this issue of extra gaps. If I compare the output
of 'standard' alignment programs like needle, clustalx and muscle, all these
programs produce basically the same syntax, where the two not matching
amino acids are marked as a mismatch, however there are no gaps introduced.

needle:
1 AAAKKDDDFDD     11
  |||..||||||
1 AAAFFDDDFDD     11

clustal:
ta              AAAKKDDDFDD
tb              AAAFFDDDFDD
                ***  ******

In structural alignments as you showed I think the syntax is different and I
have seen gaps introduced for non-structurally aligned parts but in order to
avoid using them for the RMSD, the non matching parts are often cut away
entierly. However, I think for normal sequence based alignments there should be
no gaps!

@sbliven
Copy link
Member Author

sbliven commented Jun 19, 2015

@stefanharjes The examples you showed introduce mismatches, but they still align the K & F regions. The 'Identity' substitution matrix I'm using has a mismatch score of -1000, which effectively disallows mismatches.

@lafita
Copy link
Member

lafita commented Feb 25, 2016

This is not a bug in biojava, but a feature or characteristic of the dynamic programming algorithms for sequence alignment. The solution would involve changing the design of the DP table calculation and the way the solution is reconstructed.

The explanation of why such consecutive gaps are not allowed is in the DP table calculation: only substitution, insertion and deletion are allowed.

T(i,j) = max{ T(i-1,j-1) + Score(i,j), T(i-1,j) - Gap Penalty, T(i,j-1) - Gap Penalty }

Note that no term accounts for introducing a consecutive gap (gap at both i and j), which would be done by adding to the T(i,j) the following term:

T(i-1, j-1) - Consecutive Gap Penalty

Thus, using the identity matrix that @sbliven described will not be useful to find consecutive gaps (and thus will not disallow mismatches), because they are not allowed by algorithm definition. This will not be a problem for normal substitution matrices, but only with this identity matrix where only identical residues are "allowed".

My opinion is that consecutive gaps make no sense evolutionarily, because both compared sequences are assumed to share a common ancestor (be evolutionarily related), so sequences can evolve by aminoacid substitution and deletion or insertion of a sequence region or single aminoacid. A consecutive gap would mean an independent insertion at the same position in both sequences once they have diverged. Even accepting this possible, the probability would be much lower compared to the other events so that the algorithm performance might be compromised (false positives/negatives) and an extra parameter for consecutive gaps will have to be determined.

The same applies for structural alignment algorithms using DP.

@stefanharjes
Copy link
Contributor

HI Aleix,
it has been quite a while, and I will look up the issue again. As fas as I remember, I compared the pairwise alignment to emboss needle and the multiple alignment to clustalx. If I recall correctly the pairwise homology returned different gaps from the multiple alignment?

Regarding evolution, I believe that it is commonly agreed upon, that eukaryotic evolution is not driven by single AS substitution, but by domain exchange. I think you can introduce quite big changes using homologous recombination. Also my directed evolution approaches often involve the removal of whole loops...

RegardsStefan

Aleix Lafita <notifications@github.com> schrieb am 3:42 Freitag, 26.Februar 2016:

This is not a bug in biojava, but a feature or characteristic of the dynamic programming algorithms for sequence alignment. The solution would involve changing the design of the DP table calculation and the way the solution is reconstructed.The explanation of why such consecutive gaps are not allowed is in the DP table calculation: only substitution, insertion and deletion are allowed.T(i,j) = max{ T(i-1,j-1) + Score(i,j), T(i-1,j) - Gap Penalty, T(i,j-1) - Gap Penalty }
Note that no term accounts for introducing a consecutive gap (gap at both i and j), which would be done by adding to the T(i,j) the following term:T(i-1, j-1) - Consecutive Gap Penalty
Thus, using the identity matrix that @sbliven described will not be useful to find consecutive gaps (and thus will not disallow mismatches), because they are not allowed by algorithm definition. This will not be a problem for normal substitution matrices, but only with this identity matrix where only identical residues are "allowed".My opinion is that consecutive gaps make no sense evolutionarily, because both compared sequences are assumed to share a common ancestor (be evolutionarily related), so sequences can evolve by aminoacid substitution and deletion or insertion of a sequence region or single aminoacid. A consecutive gap would mean an independent insertion at the same position in both sequences once they have diverged. Even accepting this possible, the probability would be much lower compared to the other events so that the algorithm performance might be compromised (false positives/negatives) and an extra parameter for consecutive gaps will have to be determined.—
Reply to this email directly or view it on GitHub.

@lafita
Copy link
Member

lafita commented Feb 29, 2016

Hi @stefanharjes,

I was actually arguing to close the issue (without any further modification), because DP algorithms for sequence alignment have the underlying model of AA substitution. My point is that the matrix called here identity is not achieving what it is intended for, but it is not because of a mistake in BioJava, but because of a limitation of the alignment algorithms themselves. This means that the alignment algorithms are correctly implemented in BioJava, so there is no bug to fix.
Would you agree in closing it?

@stefanharjes
Copy link
Contributor

Hi Aleix,
as I wrote on github, I think that the issue is caused basically because the order in which the sequences are submitted causes different alignments. I always expect that "a+b=b+a" and deduced this to be the case for alignments as well. You calculate two different alignments depending on the order of which the sequences are submitted? I think that this result makes the algorithm a lot less usable. Also my personal benchmark for pairwise alignment: emboss needle does calculate order independent alignments.
RegardsStefan

Aleix Lafita <notifications@github.com> schrieb am 3:59 Dienstag, 1.März 2016:

Hi @stefanharjes,I was actually arguing to close the issue (without any further modification), because DP algorithms for sequence alignment have the underlying model of AA substitution. My point is that the matrix called here identity is not achieving what it is intended for, but it is not because of a mistake in BioJava, but because of a limitation of the alignment algorithms themselves. This means that the alignment algorithms are correctly implemented in BioJava, so there is no bug to fix.
Would you agree in closing it?—
Reply to this email directly or view it on GitHub.

@lafita
Copy link
Member

lafita commented Mar 8, 2016

@stefanharjes I think you confused this issue with #288 (about sub-optimal results). I will answer to your comment in that issue instead of here.

This issue is about consecutive gaps, and I was arguing for closing it since I see the expected behavior of the DP sequence alignment algorithms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and bugfixes
Projects
None yet
Development

No branches or pull requests

4 participants