New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alignments do not allow consecutive gaps #213
Comments
Currently fails, so marked @ignore
Currently fails, so marked @ignore
Currently fails, so marked @ignore
The issue is mostly in setScorePoint. It should be: public static Last[] setScorePoint(int x, int y, int gop, int gep, int sub, int[][][] scores) {
Last[] pointers = new Last[3];
// substitution
{
int fromSubstitution = scores[x - 1][y - 1][0] + sub;
int fromDeletion = scores[x - 1][y - 1][1] + sub;
int fromInsertion = scores[x - 1][y - 1][2] + sub;
if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
scores[x][y][0] = fromDeletion;
pointers[0] = Last.DELETION;
} else if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
scores[x][y][0] = fromInsertion;
pointers[0] = Last.INSERTION;
} else { // includes if strictly equal, so this is the preferred option
scores[x][y][0] = fromSubstitution;
pointers[0] = Last.SUBSTITUTION;
}
}
// deletion
{
int fromDeletion = scores[x - 1][y][1] + gep;
int fromSubstitution = scores[x - 1][y][0] + gop + gep;
int fromInsertion = scores[x - 1][y][2] + gop + gep; // insertion to deletion; still requires gap open
if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
scores[x][y][1] = fromDeletion;
pointers[1] = Last.DELETION;
} else if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
scores[x][y][1] = fromInsertion;
pointers[1] = Last.INSERTION;
} else { // includes if strictly equal, so this is the preferred option
scores[x][y][1] = fromSubstitution;
pointers[1] = Last.SUBSTITUTION;
}
}
// insertion
{
int fromInsertion = scores[x][y - 1][2] + gep;
int fromSubstitution = scores[x][y - 1][0] + gop + gep;
int fromDeletion = scores[x][y - 1][1] + gop + gep;
if (fromInsertion > fromSubstitution && fromInsertion > fromDeletion) {
scores[x][y][2] = fromSubstitution;
pointers[2] = Last.INSERTION;
} else if (fromDeletion > fromSubstitution && fromDeletion > fromInsertion) {
scores[x][y][2] = fromDeletion;
pointers[2] = Last.DELETION;
} else {
scores[x][y][2] = fromInsertion;
pointers[2] = Last.SUBSTITUTION;
}
}
return pointers;
} Unfortunately, this change results in NullPointerExceptions that I haven't sorted out. |
I do not quite understand this issue of extra gaps. If I compare the output
In structural alignments as you showed I think the syntax is different and I |
@stefanharjes The examples you showed introduce mismatches, but they still align the K & F regions. The 'Identity' substitution matrix I'm using has a mismatch score of -1000, which effectively disallows mismatches. |
This is not a bug in biojava, but a feature or characteristic of the dynamic programming algorithms for sequence alignment. The solution would involve changing the design of the DP table calculation and the way the solution is reconstructed. The explanation of why such consecutive gaps are not allowed is in the DP table calculation: only substitution, insertion and deletion are allowed. T(i,j) = max{ T(i-1,j-1) + Score(i,j), T(i-1,j) - Gap Penalty, T(i,j-1) - Gap Penalty } Note that no term accounts for introducing a consecutive gap (gap at both i and j), which would be done by adding to the T(i,j) the following term: T(i-1, j-1) - Consecutive Gap Penalty Thus, using the identity matrix that @sbliven described will not be useful to find consecutive gaps (and thus will not disallow mismatches), because they are not allowed by algorithm definition. This will not be a problem for normal substitution matrices, but only with this identity matrix where only identical residues are "allowed". My opinion is that consecutive gaps make no sense evolutionarily, because both compared sequences are assumed to share a common ancestor (be evolutionarily related), so sequences can evolve by aminoacid substitution and deletion or insertion of a sequence region or single aminoacid. A consecutive gap would mean an independent insertion at the same position in both sequences once they have diverged. Even accepting this possible, the probability would be much lower compared to the other events so that the algorithm performance might be compromised (false positives/negatives) and an extra parameter for consecutive gaps will have to be determined. The same applies for structural alignment algorithms using DP. |
HI Aleix, Regarding evolution, I believe that it is commonly agreed upon, that eukaryotic evolution is not driven by single AS substitution, but by domain exchange. I think you can introduce quite big changes using homologous recombination. Also my directed evolution approaches often involve the removal of whole loops... RegardsStefan
This is not a bug in biojava, but a feature or characteristic of the dynamic programming algorithms for sequence alignment. The solution would involve changing the design of the DP table calculation and the way the solution is reconstructed.The explanation of why such consecutive gaps are not allowed is in the DP table calculation: only substitution, insertion and deletion are allowed.T(i,j) = max{ T(i-1,j-1) + Score(i,j), T(i-1,j) - Gap Penalty, T(i,j-1) - Gap Penalty } |
Hi @stefanharjes, I was actually arguing to close the issue (without any further modification), because DP algorithms for sequence alignment have the underlying model of AA substitution. My point is that the matrix called here identity is not achieving what it is intended for, but it is not because of a mistake in BioJava, but because of a limitation of the alignment algorithms themselves. This means that the alignment algorithms are correctly implemented in BioJava, so there is no bug to fix. |
Hi Aleix,
Hi @stefanharjes,I was actually arguing to close the issue (without any further modification), because DP algorithms for sequence alignment have the underlying model of AA substitution. My point is that the matrix called here identity is not achieving what it is intended for, but it is not because of a mistake in BioJava, but because of a limitation of the alignment algorithms themselves. This means that the alignment algorithms are correctly implemented in BioJava, so there is no bug to fix. |
@stefanharjes I think you confused this issue with #288 (about sub-optimal results). I will answer to your comment in that issue instead of here. This issue is about consecutive gaps, and I was arguing for closing it since I see the expected behavior of the DP sequence alignment algorithms. |
The alignments are defined in such a way that it is not possible to have consecutive gaps. Here's a test case (to go in NeedlemanWunschTest. Requires 4bc1d20.):
The key point here is that the 'KK' and 'FF' sections don't align to anything, and we would like both to be considered gaps, giving a score of -4 (9 matches, 1 gap initiation, 4 gap extensions). The current NW alignment is the following:
with a score of -22 (6 matches, 2 initiations, 10 gaps). That is because it has to find at least one residue after the gap to align. It gets even worse if such a residue is not available. If the trailing 'F' were not there, we get an alignment with a score of -10012 (8 matches, 2 initiations, 2 gaps, and a forbidden mismatch):
This problem is fundamental to the recursion equations we selected for dynamic programming. Changing it would be a big project, but I think that this is a moderately serious bug.
The text was updated successfully, but these errors were encountered: