New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More consistent coordinate specification for Maf alignment access in MafIO.py #504
Conversation
Is the full test file available? |
I generated the test file as follows:
|
That sounds like a good test case to have. Can you make this PR pass the Travis tests? |
Is the mafindex rebuilt anew during the Travis tests ? |
Reading the logs, I realize that some of my changes of |
It seems that removing the mafindex files present in Tests/MAF enables the Travis tests to go further. Now it fails as follows:
I looked in the maf file, and find the following potentially concerned blocks:
For each (exonstart, exonend) pair (in this case
|
TravisCI starts from a clean VM, so there should be no pre-existing stale indexes (unless they are explicitly checked into git's version tracking). In general our tests should ideally use temp files for the indexes (to ensure unique) or explicitly delete any pre-existing indexes if you want to test with a specific filename. See |
I modified the tests to use the "zero-based and inclusive" coordinate specification. I'm still not fully sure that a "zero-based and inclusive" coordinate specification is a good thing: It is neither the "human-readable" way ("1-based and inclusive" coordinates), neither the bedtools / "python slice" way. I decided to do "zero-based" because MAF format is zero-based. But I decided to make the end coordinate inclusive, because it is not very intuitive for the human user to have to provide a list of "inclusive" start coordinates, but "exclusive" end coordinates when calling MafIndex.search and MafIndex.get_spliced. Do you have an opinion on the issue? |
Thank you for your work on this branch! IMO since there is no standardized coordinate specification in sequence data, it doesn't really matter. The user of any script and data format must bear the burden of ensuring they are taking into account different base systems. I think this is fine. I'd like to again voice my opinion that this pull request be merged. I am regularly contacted by users wondering when this will happen. |
Do you have a strong reason not to follow a more Python-slicing like coordinate system? |
I would also like this to be merged. On Wed, Sep 30, 2015 at 2:03 AM, Peter Cock notifications@github.com
|
Good point Peter. Maybe blaiseli can comment on why it was necessary to change from 0-based, end exclusive, which is both the UCSC spec and Pythonic? The test case at the beginning of this pull request makes sense. An error should be thrown when start and end coordinates are the same. Adam, #350 seems to not pass the build tests. Any idea why? |
350 is probably a victim of over-eager Github merging. The Travis tests
There's that file of expected test output, and if it doesn't contain I bet the tests were re-run against a new master commit, the results file On Wed, Sep 30, 2015 at 2:48 PM, Andrew Sczesnak notifications@github.com
|
As I wrote in an earlier comment: I decided to do "zero-based" because MAF format is zero-based. But I decided to make the end coordinate inclusive, because it is not very intuitive for the human user to have to provide a list of "inclusive" start coordinates, but "exclusive" end coordinates when calling MafIndex.search and MafIndex.get_spliced. So these are not at all "strong" reasons not to use classic python slicing coordinates. When I find some time, I can work on switching to the python way. |
I would disagree--I think it's intuitive for the human user to have consistency across the BioPython package and within Python. |
I've been looking back at the code of The Let's imagine the coordinates were end-exclusive. How do we handle a case where a query segment We have A similar problem happens when the start of a query segment begins just the position after the end of an alignment block. Therefore, we will somehow have to shift some coordinates somewhere for the code to be correct, or use a different test in the SQL query. I think it is less error-prone and more readable to use inclusive coordinates as "early" as possible. I mean: we should avoid coordinates "conversions" deep in the code. It's easier to ensure the coordinates are in the correct system when reading or writing MAF format. If my intuition is correct, the code for determining which bins a segment belongs to also uses inclusive start and end coordinates internally (see Another point I would like to make is the following: I don't think that asking the user to provide end-inclusive coordinates will bring much consistency with respect to python syntax in general. The Maybe other Biopython parts also deal with lists of starts and ends. In this case, I agree that the user interface should be consistent regarding coordinates inclusive- or exclusiveness. I hope the above comments are not too badly phrased. |
I just added comments and modified the end coordinates in the
I don't know how these tests work, but it looks like if some expected results were hard-coded and no test with "maf" format was anticipated. A test using this format may have been automatically activated and inserted before the test for "nexus". Strange that this did not happen before. |
That's a print-and-compare style test, use |
I regenerated
|
Strange - it is and should be under version control: https://github.com/biopython/biopython/blob/master/Tests/output/test_SeqIO You would need to explicitly include the changes to
|
Actually, I was mistaken when I wrote that the file was not under version control. What appears to have happened is that So I'm still clueless as to how to make the tests pass. |
0bb1f58
to
ebeee03
Compare
In an attempt to force another try of the Travis tests, I tried to rebase on the more recent commits of the main biopython, based on explanations found here https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request, but I'm lost: my last commits do not seem to appear in the log any more, however, I can see that the modifications I had made indeed have been taken into account. Sorry for the mess. |
I see your commits. Did you want something after "Expected output from test_SeqIO.py has changed."? Maybe |
The commit "Expected output from test_SeqIO.py has changed" was done after rebasing.
So to summarize: the MafIO.py file is as it should, but the history of its modifications may be messy due to manual merging during rebasing. |
Don't worry about the history too much: I could probably squash down some of this as part of any merge/rebase to the master. |
So what outstanding work is there to do on this? Has a coordinate system On Thu, Dec 3, 2015 at 4:06 AM, Peter Cock notifications@github.com wrote:
|
Regarding coordinates, I presented some arguments for using end-inclusive coordinate in earlier comments: #504 (comment) and #504 (comment) (where I give reasons why I don't see it as inconsistent). The current version of the code uses zero-based end-inclusive coordinate both internally (including bin determination, now) and in its interface with the user. There's no list-like interface where slice syntax with end-exclusive coordinates would be required, that's why I don't consider the current implementation inconsistent with respect to the python way. Feel free to argue against my point of view if you really think the way I decided to treat coordinates is a problem. |
So what is the status of this PR? I ran across this on the website, which cites a forked branch that hasn't been updated since 2012. Tests seem to be passing, so what's the hold-up? |
I reverted from the "avoid dots" to the normal method call approach, and also fixed more issues (some of which I had introduced when trying to remove conflicts using the online editor). There is still at least one issue in the tests. |
I'm reading back the commits before merge and conflict resolutions, and I now understand the reason for the error type inconsistency. I will review more of the earlier commits to check that the present version is as intended by the other contributors. |
- from commit 49e7da7: Remove unwanted white space after docstrings (etc) $ pydocstyle Bio/ BioSQL/ Tests/ Scripts/ Doc/ --select D202 ... D202: No blank lines allowed after function docstring - from commit a8cf8c0: Avoid ValueError for file not found Aim here was to be user friendly, but a file-system relevant exception is probably better for error handling.
Bio/AlignIO/MafIO.py
Outdated
@@ -123,7 +121,6 @@ def write_alignment(self, alignment): | |||
Writes every SeqRecord in a MultipleSeqAlignment object to its own | |||
MAF block (beginning with an 'a' line, containing 's' lines) | |||
""" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a deliberate change to follow PEP257 docstring style guidelines as now enforced in TravisCI with pydocstyle - see 49e7da7
Tests/test_AlignIO.py
Outdated
else: | ||
assert str(r1.seq) == str(r2.seq), \ | ||
"Seq does not match %s vs %s (%s vs %s)" \ | ||
% (r1.seq, r2.seq, r1.id, r2.id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block is not needed anymore, slightly further up we have:
assert str(r1.seq) == str(r2.seq)
Based on squashed commit of pull request biopython#504, with edits by Peter (mostly leaving out a few changes, and to line wrapping). This should have *NO* functional change.
…nto alignio-maf
…nto alignio-maf
…nto alignio-maf
Based on squashed commit of pull request #504, with edits by Peter (mostly leaving out a few changes, and to line wrapping). This should have *NO* functional change.
Most of the remaining changes have now been applied via pull request #1086 (including a fix for #1083). There are still some potential changes to do with indexing boundaries (including the special case of asking for one column), for which @blaiseli is going to make a new pull request as discussed on #1086. |
The use of inclusive coodinates for the sqlite MAX index is an attempt to fix issues discussed in the following pull requests: biopython#504 biopython#1086 A test has been added in `test_MafIO_index.py` to check that the number of MAF alignment blocks returned when querying for a single position will be 1 at the boundary between blocks (it could be 2 with the previous MAF index, which is a bug). The indices built with end-exclusive coordinates will not be compatible with the present version.
The use of inclusive coodinates for the sqlite MAX index is an attempt to fix issues discussed in the following pull requests: biopython#504 biopython#1086 A test has been added in `test_MafIO_index.py` to check that the number of MAF alignment blocks returned when querying for a single position will be 1 at the boundary between blocks (it could be 2 with the previous MAF index, which is a bug). The indices built with end-exclusive coordinates will not be compatible with the present version.
The use of inclusive coodinates for the sqlite MAX index is an attempt to fix issues discussed in the following pull requests: biopython#504 biopython#1086 A test has been added in `test_MafIO_index.py` to check that the number of MAF alignment blocks returned when querying for a single position will be 1 at the boundary between blocks (it could be 2 with the previous MAF index, which is a bug). The indices built with end-exclusive coordinates will not be compatible with the present version.
Based on squashed commit of pull request biopython#504, with edits by Peter (mostly leaving out a few changes, and to line wrapping). This should have *NO* functional change.
The use of inclusive coodinates for the sqlite MAX index is an attempt to fix issues discussed in the following pull requests: biopython#504 biopython#1086 A test has been added in `test_MafIO_index.py` to check that the number of MAF alignment blocks returned when querying for a single position will be 1 at the boundary between blocks (it could be 2 with the previous MAF index, which is a bug). The indices built with end-exclusive coordinates will not be compatible with the present version.
The use of inclusive coodinates for the sqlite MAX index is an attempt to fix issues discussed in the following pull requests: biopython#504 biopython#1086 A test has been added in `test_MafIO_index.py` to check that the number of MAF alignment blocks returned when querying for a single position will be 1 at the boundary between blocks (it could be 2 with the previous MAF index, which is a bug). The indices built with end-exclusive coordinates will not be compatible with the present version.
Squashed commit of pull request #1088. The use of inclusive coodinates for the sqlite MAX index is an attempt to fix issues discussed in the following pull requests: #504 #1086 A test has been added in `test_MafIO_index.py` to check that the number of MAF alignment blocks returned when querying for a single position will be 1 at the boundary between blocks (it could be 2 with the previous MAF index, which is a bug). The indices built with end-exclusive coordinates will not be compatible with the present version. * Boundary tests with more than 2 columns. * Tests that get_spliced gets correct sequences. * Check only MAF files base names equality. This enables loading an index from a different working directory than the one from which it was built. * Resolving index conflicts. * Better index incompatibility error message.
Trying to fuse #503 with #350
I had started working with polyatail version of MafIO.py, but after making my pull request today, I discovered that another one existed based on a more thorough re-working of polyatail's code. So I tried to fuse both. I'm not an experienced git user, so I hope it's not too much of a mess.
I tried to make the MAF parser more internally consistent with respect to start end end coordinates and checked that the alignments returned by MafIndex.search and MafIndex.get_spliced were correct.
I show a test in message commit 4be9fd1
Maybe it's better to reproduce it here:
Fixed coordinate specifications inconsistences.
The coordinate specification system was inconsistent, as the following test shows:
The test.maf file ends with the following two blocks:
Depending on how coordinates have to be specified, we can expect various
behaviour from
MafIndex.search
. But what is observed with the previousimplementation seems incoherent.
Tests using commit 02120d1:
If
start == end
, then there is an error. So we whould assume thatone-character colums have to be specified with
end = start + 1
But the with
start, end = 13315, 13316
andstart, end = 13316, 13317
, twoalignment blocks are yielded.
Now we allow start and end to be the same, which specifies a single position.
I will probably close my other pull request