Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tabix indexed BedTool evaluates to None #132

Closed
roryk opened this issue Apr 7, 2015 · 14 comments
Closed

tabix indexed BedTool evaluates to None #132

roryk opened this issue Apr 7, 2015 · 14 comments

Comments

@roryk
Copy link

roryk commented Apr 7, 2015

Hi Ryan,

Before tabix indexing, if a BedTool is non-empty, it will be treated as not None, but after tabix indexing it will not.

import pybedtools
foo = pybedtools.example_bedtool('b.bed')
if foo:
    print "Before tabix indexing."
foo = foo.tabix()
if foo:
    print "After tabix indexing."

I'm cool with 'don't do that' as a valid answer if it wasn't intended to be used that way. :)

Thanks!

@daler
Copy link
Owner

daler commented Apr 8, 2015

Hi Rory -
Not sure I understand . . . just to be clear, in both the development version of pybedtools and v0.6.9 (and with tabix 0.2.5 and 1.2.1), the example code:

import pybedtools
foo = pybedtools.example_bedtool('b.bed')
if foo:
    print "Before tabix indexing; foo is None?:", foo is None
foo = foo.tabix()
if foo:
    print "After tabix indexing; foo is None?:", foo is None

prints:

Before tabix indexing; foo is None?: False
After tabix indexing; foo is None?: False

So in both cases foo is treated as not None, which is the behavior I'd expect. Can you confirm this is the output you're getting, and if not, which versions of tabix and pybedtools you're using?

@roryk
Copy link
Author

roryk commented Apr 8, 2015

Hi Ryan,

Thanks for the reply. Weird, pybedtools is version 0.6.9 and tabix is version 1.2.1. If I run the code you posted I just get the first line and not the second. This is on OSX 10.10.2. The development version of pybedtools has the same behavior as well. I was one tic behind with cython and tried installing it again with the most recent version but it has the same behavior. Any other ideas?

@daler
Copy link
Owner

daler commented Apr 8, 2015

Hmm. My tests are on Ubuntu 14.04, but I have an OSX 10.9 machine that I'll try tomorrow. If I can't reproduce with that, I'll track down one with 10.10.2.

The only thing I can think of off the top of my head is that the bgzipped file is empty for some reason. Here's a a paranoid test case that ensures everything is being cleaned up and using force=True when calling tabix():

import pybedtools
from pybedtools.bedtool import isBGZIP
import gzip
import os

template = """
fn          : {0}
isBGZIP     : {1}
._tabixed() : {2}
length      : {3}
is None     : {4}
contents    : 
{5}
"""

ex_fn = 'example.bed'
os.system('cp %s %s' % (pybedtools.example_filename('b.bed'), ex_fn))


def clean():
    """
    delete the .gz and .gz.tbi files for the example filename
    """
    for f in [ex_fn + '.gz', ex_fn + '.gz.tbi']:
        if os.path.exists(f):
            os.unlink(f)

def check(i):
    if i.fn.endswith('.gz'):
        contents = gzip.open(i.fn).read()
    else:
        contents = open(i.fn).read()
    print template.format(i.fn, isBGZIP(i.fn), i._tabixed(), len(i), i is None, contents)

# base case: not BGZIPed, not tabixed()
a = pybedtools.BedTool(ex_fn)
clean()
check(a)
clean()

# convert `a` to gzip. Should not be BGZIPed, not tabixed()
with gzip.open(a.fn + '_gzipped_not_bgzipped.gz', 'w') as fout:
    fout.write(open(a.fn).read())
b = pybedtools.BedTool(a.fn + '_gzipped_not_bgzipped.gz')
check(b)
clean()

# BGZIPed and tabixed()
c = a.tabix(force=True, in_place=False)
check(c)
clean()

# BGZIPed and tabixed()
d = a.tabix(force=True, in_place=True)
check(d)
clean()

Output for me on Ubuntu is:

fn          : example.bed
isBGZIP     : False
._tabixed() : None
length      : 2
is None     : False
contents    : 
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +



fn          : example.bed_gzipped_not_bgzipped.gz
isBGZIP     : False
._tabixed() : None
length      : 2
is None     : False
contents    : 
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +



fn          : /tmp/pybedtools.p_sn70.tmp.gz
isBGZIP     : True
._tabixed() : True
length      : 2
is None     : False
contents    : 
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +



fn          : example.bed.gz
isBGZIP     : True
._tabixed() : True
length      : 2
is None     : False
contents    : 
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +

@roryk
Copy link
Author

roryk commented Apr 8, 2015

fn          : example.bed
isBGZIP     : False
._tabixed() : None
length      : 2
is None     : False
contents    :
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +



fn          : example.bed_gzipped_not_bgzipped.gz
isBGZIP     : False
._tabixed() : None
length      : 2
is None     : False
contents    :
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +



fn          : /var/folders/z6/vlq46wcn68d_knbqlyzd__w40000gn/T/pybedtools.AgmFvy.tmp.gz
isBGZIP     : True
._tabixed() : True
length      : 0
is None     : False
contents    :
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +



fn          : example.bed.gz
isBGZIP     : True
._tabixed() : True
length      : 0
is None     : False
contents    :
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +

@daler
Copy link
Owner

daler commented Apr 8, 2015

Hmm, so the tabixed versions are evaluating to None-like because, despite actually having contents in the file, pybedtools sees it as empty.

Would you send me the last example.bed.gz file that's created so I can try debugging it?

@roryk
Copy link
Author

roryk commented Apr 8, 2015

Hi Ryan,

Thanks for debugging this. It doesn't actually create the last two files. If I bgzip and tabix index the example.bed file, I get the same behavior. I stuck the bgzipped file and the tabix index up here: https://dl.dropboxusercontent.com/u/2822886/example.tar

@daler
Copy link
Owner

daler commented Apr 8, 2015

Thanks for the file. But . . . argh, it's working fine for me on OSX 10.9.5.

My guess is that it's on the Cython side. So 1) I'll test this on an OSX 10.10 machine and 2) I'll get together some more tests for the Cython side that you can try.

@daler
Copy link
Owner

daler commented Apr 12, 2015

Just reporting back that I get the same results as you reported here on OSX 10.10 after installing Xcode 6.3, tabix 1.2, and this gcc:

gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 6.1.0 (clang-602.0.49) (based on LLVM 3.6.0svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

@roryk
Copy link
Author

roryk commented Apr 13, 2015

Hi Ryan,

I'm rocking a slightly older version of clang, I haven't updated Xcode in a bit:

gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.3.0
Thread model: posix

@daler
Copy link
Owner

daler commented Apr 13, 2015

OK, thanks. I'm trying to reproduce this on a machine I can debug on, so I just upgraded my development Mac from 10.9 to 10.10 . . . and . . . the above test passes. The 10.10 machine I was able to reproduce on before was borrowed, and never had 10.9 on it.

Next up is installing a fresh 10.10 onto a new partition to see how that goes. Back down the rabbit hole!

@roryk
Copy link
Author

roryk commented Apr 14, 2015

So confusing; this machine was upgraded from 10.9 -> 10.10.

@daler daler closed this as completed in 933417d Apr 15, 2015
@daler
Copy link
Owner

daler commented Apr 15, 2015

No kidding confusing . . fresh install of 10.10 still works for me.

I'm still stumped on the platform/version specificity, but I have a fix that hopefully avoids the issue. The problem is actually an inconsistency in samtools.

Short answer

try the latest commit (933417d) to see if that helps.

Long answer

(for my own reference in the future)

Fundamentally, the problem was that the bgzipped file was being seen as length 0 by pybedtools, but there didn't seem to be anything funky with the file itself or with tabix.

I tracked it down to improper creation of a BAM IntervalIterator when it should be just a regular IntervalIterator. That is, BedTool._isbam() wasn't working right.

In BedTool._isbam() I'm differentiating a BAM from a non-bam BGZIP format file by checking the error code of samtools view -H. If that command returns exit code 0, then checking the header worked -- so it must be a BAM. Non-zero exit code? Failed, so not a BAM.

And there's the problem.

# Mac OSX 10.10
samtools view -H a_non_BAM_filename   # exit = 0 (sometimes)

# Other platforms / OSX versions
samtools view -H a_non_BAM_filename   # exit = 1

In a platform- and OS version-specific manner (still can't reliably reproduce but only has been seen on Mac OSX 10.10 Yosemite), that command incorrectly reports a exit code of 0. The same command on other platforms and other Mac OSX versions correctly return non-zero exit codes and print error messages to stderr.

So the fix in 933417d is to avoid looking at the return code and instead look at the contents of the reported header. Now, the presence of an actual header means it's a BAM and the exit code is ignored.

@roryk
Copy link
Author

roryk commented Apr 15, 2015

Hi Ryan,

You are awesome, this fixes the issue for me. I Pinboarded this issue as my canonical example for open source software being amazing, I would have never figured this out. Thank you so much, you're a beast.

@daler
Copy link
Owner

daler commented Apr 15, 2015

Phew, glad it worked! That was a sneaky little bug. Nice easy fix in the end though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants