Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for off-by-1 errors #8

Closed
brentp opened this issue Apr 15, 2011 · 7 comments
Closed

check for off-by-1 errors #8

brentp opened this issue Apr 15, 2011 · 7 comments

Comments

@brentp
Copy link
Contributor

brentp commented Apr 15, 2011

you guys have probably already handled this but let's add some tests that mix gff and bed input/output and make sure everything is as we expect for length, overlap amount, etc.

@daler
Copy link
Owner

daler commented Apr 15, 2011

good idea.

@brentp
Copy link
Contributor Author

brentp commented May 4, 2011

so all starts in bedFile.h are converted to 0-based (gff, vcf have 1 subtracted) and ends are unchanged. So everything will have BED-like coordinates.
Nowhere in cbedtools.pyx do we account for this. so,

import pybedtools
b = pybedtools.BedTool('pybedtools/test/data/c.gff')
d = iter(b).next()
print d
d.start = d.start
print d

gives:
chr1 ucb gene 465 805 . + . ID=thaliana_1_465_805;match=scaffold_801404.1;rname=thaliana_1_465_805
chr1 ucb gene 464 805 . + . ID=thaliana_1_465_805;match=scaffold_801404.1;rname=thaliana_1_465_805

so the start has changed. Thoughts on how to address this?

@daler
Copy link
Owner

daler commented May 4, 2011

Phew, nice catch.

It seems that the fundamental issue is that there are 2 different 'start' values -- one is in _bed.start and the other is in _bed.fields[idx].

import pybedtools
b = pybedtools.BedTool('pybedtools/test/data/c.gff')
d = iter(b).next()
print d.fields[3] # 465
print d.start     # 464

Is there a good reason for bedFile.h to subtract 1?

@brentp
Copy link
Contributor Author

brentp commented May 4, 2011

| Is there a good reason for bedFile.h to subtract 1?

then it makes things like .length work the same for all.
maybe it's easiest just to have the .start property setter do:

if self.file_type != "bed": start -= 1

will that solve everything?

@daler
Copy link
Owner

daler commented May 4, 2011

looks like it won't . . .
the start -= 1 will continuously decrement the start position of d if you do:

d.start = d.start
d.start = d.start
etc

i just added a test for starts to be able to check this, 50974ad

@daler
Copy link
Owner

daler commented May 5, 2011

OK, I think this is fixed now. I assumed these conventions:

  • Interval.start always contains the 0-based coord. This makes things like len() internally consistent for all feature types, and leaves bedFile.h alone.
  • even when setting Interval.start for a GFF feature, the user-provided value provided is still assumed to be 0-based
  • Interval.fields always contains the "string representation" to make the Interval a valid line for whatever format it represents. So for GFF features, Interval.fields[3] will always contain the 1-based start position
  • For GFF files, when you set .start, Interval.fields[3] is updated to be str(start + 1)

If a user tries to use .fields[3] for something, hopefully the fact that they have to do an extra int() on it will be a reminder that it's different than the already-as-an-int Interval.start.

(see c0163ce for this, which includes BED- and GFF-specific tests)

@brentp
Copy link
Contributor Author

brentp commented May 17, 2011

closing as this is tested and documented (thanks @daler).

@brentp brentp closed this as completed May 17, 2011
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants