New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot read 10x bam file #15

Closed
olgabot opened this Issue Sep 6, 2018 · 4 comments

Comments

Projects
None yet
2 participants
@olgabot
Copy link

olgabot commented Sep 6, 2018

Describe the bug

Cannot read 10x bam file

To Reproduce
Steps to reproduce the behavior:

Using the possorted_genome_bam.bam file from here, the following code fails:

import bamnostic as bs

bam_file = bs.AlignmentFile(
    os.path.join(folder, 'possorted_genome_bam.bam'), mode='rb')

Expected behavior
Should be able to read the file and iterate through alignments

Screenshots

(sourmash) 
 ⚙  Thu  6 Sep - 13:33  ~/code/sourmash   origin ☊ olgabot/10x-singlecell-bam ↑1 8☀ 3● 3‒ 1✚ ⚑ 
  sourmash compute -k 31 --input-is-10x  --force tests/test-data/lung_ptprc
computing signatures for files: tests/test-data/lung_ptprc
Computing signature for ksizes: [31]
Computing only DNA (and not protein) signatures.
Computing a total of 1 signature(s).
Traceback (most recent call last):
  File "/anaconda3/envs/sourmash/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/Users/olgabot/code/sourmash/sourmash/__main__.py", line 77, in main
    cmd(sys.argv[2:])
  File "/Users/olgabot/code/sourmash/sourmash/commands.py", line 268, in compute
    barcodes, bam_file = read_10x_folder(filename)
  File "/Users/olgabot/code/sourmash/sourmash/tenx.py", line 17, in read_10x_folder
    os.path.join(folder, 'possorted_genome_bam.bam'), mode='rb')
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/core.py", line 153, in __init__
    bgzf.BgzfReader.__init__(self, **kwargs)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 471, in __init__
    self._load_header(check_sq)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 652, in _load_header
    self._header = BAMheader(self)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 311, in __init__
    tag, value = field.split(':')
ValueError: too many values to unpack (expected 2)

Desktop (please complete the following information):

  • OS: macOS
  • Python Version: 3.6.5
  • bamnostic Version: 0.8.11

Additional context
None

@olgabot

This comment has been minimized.

Copy link
Author

olgabot commented Sep 6, 2018

Bamnostic was suggested (dib-lab/sourmash#539) as a replacement to pysam but I'm not able to use it for 10x-generated single cell bam files.

@betteridiot

This comment has been minimized.

Copy link
Owner

betteridiot commented Sep 6, 2018

I saw your issue. I thought by the tag, field part that it might have been an unpacking issue in the binary part of the data. The problem was actually with how it parses the SAM header. The fields of the SAM header aren't expected to be 3-field because the field type is only really necessary for the BAM format to figure out how to unpack the tag data.

I fixed this to handle 3-field tags in SAM header though. Thank you for the heads-up.

I made an update to PyPI and conda. Conda might take awhile until it is finalized. If you want the updates now, feel free to just github install it.

@betteridiot

This comment has been minimized.

Copy link
Owner

betteridiot commented Sep 6, 2018

Oops, forgot to quote something. Should be fixed with f7cfb0b

Sorry.

@olgabot

This comment has been minimized.

Copy link
Author

olgabot commented Sep 6, 2018

betteridiot added a commit that referenced this issue Sep 14, 2018

Fixes #16: Detecting EOF for serial access
The program was originally looking for the EOF marker, but I forgot that when the EOF marker BGZF block is decompressed, it evaluates to an empty bytestring `b''`. I put in some extra checks so that if an empty block is found, it ensures it is at the file and is a legitimate end of file.

If the file is truncated, it will `raise` a slightly different `StopIteration` error stating that the *potential* end of file is reached.

I ran this on the file you mentioned in #15 (btw, awesome Issue suggestion. Thank you for following the template). My debugging is properly running through the file now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment