-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
start site off-by-one errors when creating BedTool from list of intervals #53
Comments
Hi Wolfgang, thanks for reporting this. Which version are you using? I pasted in your exact commands in doctest_mode and can't reproduce this:
Can you try this short script to see what results you get? import pybedtools
x = pybedtools.BedTool('test.gtf')
xm = x.merge()
iter_xm = pybedtools.BedTool(i for i in x).merge()
print x
print xm
print x[0].start
print xm[0].start
print iter_xm[0].start i get the following:
|
Hi Ryan,
Program: mergeBed (v2.12.0) output from your short script: chr5 137924248 137924468 137924247 Odd. Thanks, On Tue, Feb 28, 2012 at 4:33 PM, Ryan Dale <
|
Can you pull the latest pybedtools from GitHub? I actually haven't changed the version number since 0.5.5 despite many changes elsewhere in the code base (though I'm about to release 0.6). You also might want to upgrade BEDTools to 2.15 if possible. It's working on my end, using the github versions of pybedtools and BEDTools, so it's likely this issue has been fixed -- just not in released versions yet. |
Actually just upgrading to bedtools 2.15 solved it. I had not realized i Thanks for the very quick reply! Wolf chr5 ucsc_refseq exon 137924248 137924468 . - chr5 137924247 137924468 137924247 On Tue, Feb 28, 2012 at 5:03 PM, Ryan Dale <
|
Good to hear. By the way, import pybedtools
import itertools
ex = pybedtools.BedTool('mm9.gtf.chr18')\
.filter(lambda x: x[2]=='exon')\
.saveas()
def key(x):
return x['gene_name']
exons_by_gene = sorted(ex, key=key)
for gene, exons in itertools.groupby(exons_by_gene, key=key):
print gene,
print pybedtools.BedTool(exons).sort().total_coverage() |
That is indeed better than what i was doing. On Tue, Feb 28, 2012 at 5:14 PM, Ryan Dale <
|
My issue:
I'm reading in a GTF file with exons. I want to collect the exons for all transcripts of each gene (as intervals in a list), create a BedTool from the list of exons for each gene, merge exons, and determine the total number of nucleotides covered by the exons. The problem is that when i create a bedtool from the list of intervals and call .merge(), the start sites end up incremented by one (i.e. as if the temporary file created was using the string fields instead of the correct .start attribute. That surprised me. Is this intended? If I'm missing something fundamental, feel free to ignore me and delete the issue.
Thanks.
Wolfgang
Reproducible example:
GTF file:
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4
IPython session:
In [1]: import pybedtools
In [2]: gtf = pybedtools.BedTool("test.gtf")
In [3]: gtf.file_type
Out[3]: 'gff'
In [4]: gtf_intervals = list(gtf)
In [5]: gtf_intervals
Out[5]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
In [6]: print "\n".join(str(x) for x in gtf)
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4
In [7]: gtf_intervals[0].start
Out[7]: 137924247L
In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
In [25]: new_bt
Out[25]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
In [26]: new_bt_merged = new_bt.merge()
In [27]: new_bt_merged
Out[27]: <BedTool(/tmp/pybedtools.aVwERJ.tmp)>
In [28]: new_bt_merged[0]
Out[28]: Interval(chr5:137924248-137924468)
In [29]: new_bt_merged[0].start
Out[29]: 137924248L
The text was updated successfully, but these errors were encountered: