start site off-by-one errors when creating BedTool from list of intervals #53

wresch · 2012-02-28T21:04:37Z

My issue:
I'm reading in a GTF file with exons. I want to collect the exons for all transcripts of each gene (as intervals in a list), create a BedTool from the list of exons for each gene, merge exons, and determine the total number of nucleotides covered by the exons. The problem is that when i create a bedtool from the list of intervals and call .merge(), the start sites end up incremented by one (i.e. as if the temporary file created was using the string fields instead of the correct .start attribute. That surprised me. Is this intended? If I'm missing something fundamental, feel free to ignore me and delete the issue.

Thanks.
Wolfgang

Reproducible example:

GTF file:
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

IPython session:
In [1]: import pybedtools

In [2]: gtf = pybedtools.BedTool("test.gtf")

In [3]: gtf.file_type
Out[3]: 'gff'

In [4]: gtf_intervals = list(gtf)

In [5]: gtf_intervals
Out[5]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]

In [6]: print "\n".join(str(x) for x in gtf)
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

In [7]: gtf_intervals[0].start
Out[7]: 137924247L

In [24]: new_bt = pybedtools.BedTool(gtf_intervals)

In [25]: new_bt
Out[25]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]

In [26]: new_bt_merged = new_bt.merge()

In [27]: new_bt_merged
Out[27]: <BedTool(/tmp/pybedtools.aVwERJ.tmp)>

In [28]: new_bt_merged[0]
Out[28]: Interval(chr5:137924248-137924468)

In [29]: new_bt_merged[0].start
Out[29]: 137924248L

daler · 2012-02-28T21:33:26Z

Hi Wolfgang, thanks for reporting this. Which version are you using? I pasted in your exact commands in doctest_mode and can't reproduce this:

>>> In [1]: import pybedtools
>>> In [2]: gtf = pybedtools.BedTool("test.gtf")
>>> In [4]: gtf_intervals = list(gtf)
>>> In [5]: gtf_intervals
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
>>> new_bt
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [26]: new_bt_merged = new_bt.merge()
>>> In [27]: new_bt_merged
<BedTool(/tmp/pybedtools.kHuwlf.tmp)>
>>> In [28]: new_bt_merged[0]
Interval(chr5:137924247-137924468)
>>> In [29]: new_bt_merged[0].start
137924247L

Can you try this short script to see what results you get?

import pybedtools
x = pybedtools.BedTool('test.gtf')
xm = x.merge()
iter_xm = pybedtools.BedTool(i for i in x).merge()
print x
print xm
print x[0].start
print xm[0].start
print iter_xm[0].start

i get the following:

chr5    ucsc_refseq exon    137924248   137924468   .   -   .   gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5    ucsc_refseq exon    137925209   137925388   .   -   .   gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

chr5    137924247   137924468
chr5    137925208   137925388

137924247
137924247
137924247

wresch · 2012-02-28T21:55:01Z

Hi Ryan,

import pybedtools
pybedtools.version
pybedtools.version
'0.5.5'

mergeBed -h

Program: mergeBed (v2.12.0)
Author: Aaron Quinlan (aaronquinlan@gmail.com)
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.

output from your short script:
chr5 ucsc_refseq exon 137924248 137924468 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5 137924248 137924468
chr5 137925209 137925388

137924247
137924248
137924248

Odd.

Thanks,
Wolf

On Tue, Feb 28, 2012 at 4:33 PM, Ryan Dale <
reply@reply.github.com

wrote:

Hi Wolfgang, thanks for reporting this. Which version are you using? I
pasted in your exact commands in doctest_mode and can't reproduce this:

>>> In [1]: import pybedtools
>>> In [2]: gtf = pybedtools.BedTool("test.gtf")
>>> In [4]: gtf_intervals = list(gtf)
>>> In [5]: gtf_intervals
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
>>> new_bt
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [26]: new_bt_merged = new_bt.merge()
>>> In [27]: new_bt_merged
<BedTool(/tmp/pybedtools.kHuwlf.tmp)>
>>> In [28]: new_bt_merged[0]
Interval(chr5:137924247-137924468)
>>> In [29]: new_bt_merged[0].start
137924247L

Can you try this short script to see what results you get?

import pybedtools
x = pybedtools.BedTool('test.gtf')
xm = x.merge()
iter_xm = pybedtools.BedTool(i for i in x).merge()
print x
print xm
print x[0].start
print xm[0].start
print iter_xm[0].start

i get the following:

chr5    ucsc_refseq     exon    137924248       137924468       .       -
      .       gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5    ucsc_refseq     exon    137925209       137925388       .       -
      .       gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5    137924247       137924468
chr5    137925208       137925388

137924247
137924247
137924247

Reply to this email directly or view it on GitHub:
#53 (comment)

daler · 2012-02-28T22:03:35Z

Can you pull the latest pybedtools from GitHub? I actually haven't changed the version number since 0.5.5 despite many changes elsewhere in the code base (though I'm about to release 0.6).

You also might want to upgrade BEDTools to 2.15 if possible. It's working on my end, using the github versions of pybedtools and BEDTools, so it's likely this issue has been fixed -- just not in released versions yet.

wresch · 2012-02-28T22:11:16Z

Actually just upgrading to bedtools 2.15 solved it. I had not realized i
was behind the curve by that much...

Thanks for the very quick reply!

Wolf

chr5 ucsc_refseq exon 137924248 137924468 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5 137924247 137924468
chr5 137925208 137925388

137924247
137924247
137924247

On Tue, Feb 28, 2012 at 5:03 PM, Ryan Dale <
reply@reply.github.com

wrote:

Can you pull the latest pybedtools from GitHub? I actually haven't
changed the version number since 0.5.5 despite many changes elsewhere in
the code base (though I'm about to release 0.6).

You also might want to upgrade BEDTools to 2.15 if possible. It's working
on my end, using the github versions of pybedtools and BEDTools, so it's
likely this issue has been fixed -- just not in released versions yet.

Reply to this email directly or view it on GitHub:
#53 (comment)

daler · 2012-02-28T22:14:56Z

Good to hear.

By the way, itertools.groupBy and the BedTool.total_coverage method will probably come in handy for what you're doing . . . just tested this on a mouse GTF file. It should run a lot faster if you can assume the input file is already sorted by gene name.

import pybedtools
import itertools

ex = pybedtools.BedTool('mm9.gtf.chr18')\
        .filter(lambda x: x[2]=='exon')\
        .saveas()

def key(x):
    return x['gene_name']

exons_by_gene = sorted(ex, key=key)
for gene, exons in itertools.groupby(exons_by_gene, key=key):
    print gene,
    print pybedtools.BedTool(exons).sort().total_coverage()

wresch · 2012-02-28T22:24:17Z

That is indeed better than what i was doing.

On Tue, Feb 28, 2012 at 5:14 PM, Ryan Dale <
reply@reply.github.com

wrote:

Good to hear.

By the way, itertools.groupBy and the BedTool.total_coverage method
will probably come in handy for what you're doing . . . just tested this on
a mouse GTF file. It should run a lot faster if you can assume the input
file is already sorted by gene name.
import pybedtools
import itertools

ex = pybedtools.BedTool('mm9.gtf.chr18')\
       .filter(lambda x: x[2]=='exon')\
       .saveas()

def key(x):
   return x['gene_name']

exons_by_gene = sorted(ex, key=key)
for gene, exons in itertools.groupby(exons_by_gene, key=key):
   print gene,
   print pybedtools.BedTool(exons).sort().total_coverage()
Reply to this email directly or view it on GitHub:
#53 (comment)

daler closed this as completed Feb 28, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

start site off-by-one errors when creating BedTool from list of intervals #53

start site off-by-one errors when creating BedTool from list of intervals #53

wresch commented Feb 28, 2012

daler commented Feb 28, 2012

wresch commented Feb 28, 2012

daler commented Feb 28, 2012

wresch commented Feb 28, 2012

daler commented Feb 28, 2012

wresch commented Feb 28, 2012

start site off-by-one errors when creating BedTool from list of intervals #53

start site off-by-one errors when creating BedTool from list of intervals #53

Comments

wresch commented Feb 28, 2012

daler commented Feb 28, 2012

wresch commented Feb 28, 2012

daler commented Feb 28, 2012

wresch commented Feb 28, 2012

daler commented Feb 28, 2012

wresch commented Feb 28, 2012