Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

start site off-by-one errors when creating BedTool from list of intervals #53

Closed
wresch opened this issue Feb 28, 2012 · 6 comments
Closed

Comments

@wresch
Copy link

wresch commented Feb 28, 2012

My issue:
I'm reading in a GTF file with exons. I want to collect the exons for all transcripts of each gene (as intervals in a list), create a BedTool from the list of exons for each gene, merge exons, and determine the total number of nucleotides covered by the exons. The problem is that when i create a bedtool from the list of intervals and call .merge(), the start sites end up incremented by one (i.e. as if the temporary file created was using the string fields instead of the correct .start attribute. That surprised me. Is this intended? If I'm missing something fundamental, feel free to ignore me and delete the issue.

Thanks.
Wolfgang

Reproducible example:

GTF file:
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

IPython session:
In [1]: import pybedtools

In [2]: gtf = pybedtools.BedTool("test.gtf")

In [3]: gtf.file_type
Out[3]: 'gff'

In [4]: gtf_intervals = list(gtf)

In [5]: gtf_intervals
Out[5]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]

In [6]: print "\n".join(str(x) for x in gtf)
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

In [7]: gtf_intervals[0].start
Out[7]: 137924247L

In [24]: new_bt = pybedtools.BedTool(gtf_intervals)

In [25]: new_bt
Out[25]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]

In [26]: new_bt_merged = new_bt.merge()

In [27]: new_bt_merged
Out[27]: <BedTool(/tmp/pybedtools.aVwERJ.tmp)>

In [28]: new_bt_merged[0]
Out[28]: Interval(chr5:137924248-137924468)

In [29]: new_bt_merged[0].start
Out[29]: 137924248L

@daler
Copy link
Owner

daler commented Feb 28, 2012

Hi Wolfgang, thanks for reporting this. Which version are you using? I pasted in your exact commands in doctest_mode and can't reproduce this:

>>> In [1]: import pybedtools
>>> In [2]: gtf = pybedtools.BedTool("test.gtf")
>>> In [4]: gtf_intervals = list(gtf)
>>> In [5]: gtf_intervals
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
>>> new_bt
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [26]: new_bt_merged = new_bt.merge()
>>> In [27]: new_bt_merged
<BedTool(/tmp/pybedtools.kHuwlf.tmp)>
>>> In [28]: new_bt_merged[0]
Interval(chr5:137924247-137924468)
>>> In [29]: new_bt_merged[0].start
137924247L

Can you try this short script to see what results you get?

import pybedtools
x = pybedtools.BedTool('test.gtf')
xm = x.merge()
iter_xm = pybedtools.BedTool(i for i in x).merge()
print x
print xm
print x[0].start
print xm[0].start
print iter_xm[0].start

i get the following:

chr5    ucsc_refseq exon    137924248   137924468   .   -   .   gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5    ucsc_refseq exon    137925209   137925388   .   -   .   gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

chr5    137924247   137924468
chr5    137925208   137925388

137924247
137924247
137924247

@wresch
Copy link
Author

wresch commented Feb 28, 2012

Hi Ryan,

import pybedtools
pybedtools.version
pybedtools.version
'0.5.5'


mergeBed -h

Program: mergeBed (v2.12.0)
Author: Aaron Quinlan (aaronquinlan@gmail.com)
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.


output from your short script:
chr5 ucsc_refseq exon 137924248 137924468 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5 137924248 137924468
chr5 137925209 137925388

137924247
137924248
137924248

Odd.

Thanks,
Wolf

On Tue, Feb 28, 2012 at 4:33 PM, Ryan Dale <
reply@reply.github.com

wrote:

Hi Wolfgang, thanks for reporting this. Which version are you using? I
pasted in your exact commands in doctest_mode and can't reproduce this:

>>> In [1]: import pybedtools
>>> In [2]: gtf = pybedtools.BedTool("test.gtf")
>>> In [4]: gtf_intervals = list(gtf)
>>> In [5]: gtf_intervals
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
>>> new_bt
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [26]: new_bt_merged = new_bt.merge()
>>> In [27]: new_bt_merged
<BedTool(/tmp/pybedtools.kHuwlf.tmp)>
>>> In [28]: new_bt_merged[0]
Interval(chr5:137924247-137924468)
>>> In [29]: new_bt_merged[0].start
137924247L

Can you try this short script to see what results you get?

import pybedtools
x = pybedtools.BedTool('test.gtf')
xm = x.merge()
iter_xm = pybedtools.BedTool(i for i in x).merge()
print x
print xm
print x[0].start
print xm[0].start
print iter_xm[0].start

i get the following:

chr5    ucsc_refseq     exon    137924248       137924468       .       -
      .       gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5    ucsc_refseq     exon    137925209       137925388       .       -
      .       gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5    137924247       137924468
chr5    137925208       137925388

137924247
137924247
137924247

Reply to this email directly or view it on GitHub:
#53 (comment)

@daler
Copy link
Owner

daler commented Feb 28, 2012

Can you pull the latest pybedtools from GitHub? I actually haven't changed the version number since 0.5.5 despite many changes elsewhere in the code base (though I'm about to release 0.6).

You also might want to upgrade BEDTools to 2.15 if possible. It's working on my end, using the github versions of pybedtools and BEDTools, so it's likely this issue has been fixed -- just not in released versions yet.

@wresch
Copy link
Author

wresch commented Feb 28, 2012

Actually just upgrading to bedtools 2.15 solved it. I had not realized i
was behind the curve by that much...

Thanks for the very quick reply!

Wolf

chr5 ucsc_refseq exon 137924248 137924468 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5 137924247 137924468
chr5 137925208 137925388

137924247
137924247
137924247

On Tue, Feb 28, 2012 at 5:03 PM, Ryan Dale <
reply@reply.github.com

wrote:

Can you pull the latest pybedtools from GitHub? I actually haven't
changed the version number since 0.5.5 despite many changes elsewhere in
the code base (though I'm about to release 0.6).

You also might want to upgrade BEDTools to 2.15 if possible. It's working
on my end, using the github versions of pybedtools and BEDTools, so it's
likely this issue has been fixed -- just not in released versions yet.


Reply to this email directly or view it on GitHub:
#53 (comment)

@daler
Copy link
Owner

daler commented Feb 28, 2012

Good to hear.

By the way, itertools.groupBy and the BedTool.total_coverage method will probably come in handy for what you're doing . . . just tested this on a mouse GTF file. It should run a lot faster if you can assume the input file is already sorted by gene name.

import pybedtools
import itertools

ex = pybedtools.BedTool('mm9.gtf.chr18')\
        .filter(lambda x: x[2]=='exon')\
        .saveas()

def key(x):
    return x['gene_name']

exons_by_gene = sorted(ex, key=key)
for gene, exons in itertools.groupby(exons_by_gene, key=key):
    print gene,
    print pybedtools.BedTool(exons).sort().total_coverage()

@daler daler closed this as completed Feb 28, 2012
@wresch
Copy link
Author

wresch commented Feb 28, 2012

That is indeed better than what i was doing.

On Tue, Feb 28, 2012 at 5:14 PM, Ryan Dale <
reply@reply.github.com

wrote:

Good to hear.

By the way, itertools.groupBy and the BedTool.total_coverage method
will probably come in handy for what you're doing . . . just tested this on
a mouse GTF file. It should run a lot faster if you can assume the input
file is already sorted by gene name.

import pybedtools
import itertools

ex = pybedtools.BedTool('mm9.gtf.chr18')\
       .filter(lambda x: x[2]=='exon')\
       .saveas()

def key(x):
   return x['gene_name']

exons_by_gene = sorted(ex, key=key)
for gene, exons in itertools.groupby(exons_by_gene, key=key):
   print gene,
   print pybedtools.BedTool(exons).sort().total_coverage()

Reply to this email directly or view it on GitHub:
#53 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants