New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate signatures from 10x single cell bam file #539

Merged
merged 24 commits into from Oct 1, 2018

Conversation

Projects
None yet
3 participants
@olgabot
Copy link
Contributor

olgabot commented Sep 1, 2018

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?
@luizirber

This comment has been minimized.

Copy link
Member

luizirber commented Sep 2, 2018

Hi @olgabot! This is pretty cool =]

I created a PR for your branch to fix the tests and avoid a direct dependency on pysam. When you write the tests for 10x you can do like the tests in tests/test_sbt.py, where they check if ipfsapi and redis are installed:

ipfsapi = pytest.importorskip('ipfsapi')

(so it will be pysam = pytest.importorskip('pysam')

Another thing: did you try bamnostic? It's a pure Python SAM/BAM reader, and avoids problems like pysam not working on Python 3.7 at the moment...

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 2, 2018

Glad you like it! @jamestwebber helped me get started with the code :)

(so it will be pysam = pytest.importorskip('pysam')

Great! I'll add that change.

Another thing: did you try bamnostic? It's a pure Python SAM/BAM reader, and avoids problems like pysam not working on Python 3.7 at the moment...

I hadn't heard of bamnostic before! My concern is that running this code took ~3 hours on a smallish 10X run of ~600 QC-passing cells and since bamnostic doesn't use the C-level API, it will be even slower. I can play around with using joblib to parallelize, though. But wider usability may win over speed...

olgabot added some commits Sep 2, 2018

Merge pull request #1 from luizirber/fix_tests
avoid direct dep on pysam
@luizirber

This comment has been minimized.

Copy link
Member

luizirber commented Sep 2, 2018

almost there, we are missing an
import pytest
in the beginning of tests/test_sourmash.py, and then tests will work =]

@luizirber

This comment has been minimized.

Copy link
Member

luizirber commented Sep 5, 2018

Cool, tests passing! \o/

On the performance side, you're building one minhash per barcode, and then saving all the barcodes on a signature:
https://github.com/dib-lab/sourmash/blob/ef25bcf36a2db9c55ef6b5812abd6d2128c5ad8c/sourmash/commands.py#L250L266
I think you can rewrite this block and use a multiprocessing.pool to process them in parallel, but I wouldn't necessarily require that for accepting this PR

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 6, 2018

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 6, 2018

Hello! I'm trying to figure out whether it's the iterating over the alignment file or creating the hashes that's taking the most time but I can't seem to get the profiler working due to the relative imports in commands.py, e.g.

from . import DEFAULT_SEED, MinHash, load_sbt_index, create_sbt_index

Here's the full stack trace below:

 ⚙  Thu  6 Sep - 11:01  ~/code/sourmash   origin ☊ olgabot/10x-singlecell-bam 3☀ 1● 
  python -m cProfile -o sourmash_10x.prof sourmash/commands.py compute -k 31 --input-is-10x tests/test-data/lung_ptprc/
Traceback (most recent call last):
  File "/anaconda3/envs/sourmash/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/envs/sourmash/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 160, in <module>
    main()
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 153, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 20, in runctx
    filename, sort)
  File "/anaconda3/envs/sourmash/lib/python3.6/profile.py", line 64, in runctx
    prof.runctx(statement, globals, locals)
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 100, in runctx
    exec(cmd, globals, locals)
  File "sourmash/commands.py", line 12, in <module>
    from . import DEFAULT_SEED, MinHash, load_sbt_index, create_sbt_index
ImportError: cannot import name 'DEFAULT_SEED'

Do you know what may be happening?

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 6, 2018

omg nevermind .. somehow I moved the __init__.py file to the root dir:

⚙  Thu  6 Sep - 11:47  ~/code/sourmash   origin ☊ olgabot/10x-singlecell-bam ↑1 5☀ 4● 1‒ 1✚ ⚑ 
  git status
On branch olgabot/10x-singlecell-bam
Your branch is ahead of 'origin/olgabot/10x-singlecell-bam' by 1 commit.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	renamed:    sourmash/__init__.py -> __init__.py

-_-;;;;;

🤦‍♀️

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Sep 6, 2018

Codecov Report

Merging #539 into master will increase coverage by 0.04%.
The diff coverage is 97.22%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #539      +/-   ##
=========================================
+ Coverage   90.76%   90.8%   +0.04%     
=========================================
  Files          33      34       +1     
  Lines        5011    5047      +36     
  Branches       36      36              
=========================================
+ Hits         4548    4583      +35     
- Misses        463     464       +1
Impacted Files Coverage Δ
sourmash/tenx.py 100% <100%> (ø)
sourmash/commands.py 91.75% <96.15%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12aa76f...3f03839. Read the comment docs.

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 6, 2018

BTW, bamnostic seems to not be able to deal with 10x bam files, which have 3-field tags like this: CB:Z:TACTTACAGCGAGAAA-1

(sourmash) 
 ⚙  Thu  6 Sep - 13:33  ~/code/sourmash   origin ☊ olgabot/10x-singlecell-bam ↑1 8☀ 3● 3‒ 1✚ ⚑ 
  sourmash compute -k 31 --input-is-10x  --force tests/test-data/lung_ptprc
computing signatures for files: tests/test-data/lung_ptprc
Computing signature for ksizes: [31]
Computing only DNA (and not protein) signatures.
Computing a total of 1 signature(s).
Traceback (most recent call last):
  File "/anaconda3/envs/sourmash/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/Users/olgabot/code/sourmash/sourmash/__main__.py", line 77, in main
    cmd(sys.argv[2:])
  File "/Users/olgabot/code/sourmash/sourmash/commands.py", line 268, in compute
    barcodes, bam_file = read_10x_folder(filename)
  File "/Users/olgabot/code/sourmash/sourmash/tenx.py", line 17, in read_10x_folder
    os.path.join(folder, 'possorted_genome_bam.bam'), mode='rb')
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/core.py", line 153, in __init__
    bgzf.BgzfReader.__init__(self, **kwargs)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 471, in __init__
    self._load_header(check_sq)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 652, in _load_header
    self._header = BAMheader(self)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 311, in __init__
    tag, value = field.split(':')
ValueError: too many values to unpack (expected 2)

@olgabot olgabot referenced this pull request Sep 6, 2018

Closed

Cannot read 10x bam file #15

@luizirber

This comment has been minimized.

Copy link
Member

luizirber commented Sep 6, 2018

I see, thanks for creating an issue in their repo!

Another profiling approach you can take is https://github.com/benfred/py-spy (saw it yesterday, tested and it works pretty well!)

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 13, 2018

Update from me: I'm migrating to bamnostic because pysam alignment objects cannot be pickled. But there's still issues with reading the 10x bam files: betteridiot/bamnostic#16

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 14, 2018

Alright, the bamnostic issue has been fixed! It does require the latest bamnostic git version (reflected in the requirements.txt) and I asked them to do a version bump soon so the requirements are simpler.

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 14, 2018

ahh I spoke too soon... fixing now

olgabot added some commits Sep 14, 2018

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 14, 2018

Now this seems to be failing due to a deprecation error that happens for every read. Hopefully fixed by: betteridiot/bamnostic#18

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 18, 2018

Ugh Python2.7 gets in the way of everything!! @betteridiot is working on fixing a RecursionError in Python2.7 in bamnostic here: betteridiot/bamnostic#20

@codecov

This comment has been minimized.

Copy link

codecov bot commented Sep 18, 2018

Codecov Report

Merging #539 into master will decrease coverage by 0.16%.
The diff coverage is 70.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #539      +/-   ##
==========================================
- Coverage   90.78%   90.62%   -0.17%     
==========================================
  Files          33       34       +1     
  Lines        5016     5057      +41     
  Branches       37       37              
==========================================
+ Hits         4554     4583      +29     
- Misses        462      474      +12
Impacted Files Coverage Δ
sourmash/tenx.py 100% <100%> (ø)
sourmash/commands.py 90.52% <61.29%> (-1.25%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2534989...f0cf1b1. Read the comment docs.

@olgabot olgabot changed the title [WIP] Generate signatures from 10x single cell bam file Generate signatures from 10x single cell bam file Sep 19, 2018

@olgabot

This comment has been minimized.

Copy link
Contributor Author

olgabot commented Sep 19, 2018

The build is passing now! Though it looks like I didn't cover everything with the tests.. let me know if you want me to add more.

@luizirber

This comment has been minimized.

Copy link
Member

luizirber commented Sep 21, 2018

Hi @olgabot! This is looking great =]

Two things:

  • czbiohub#2 makes 10x support optional (more comments there).
  • The test takes some time to run, do you think we can make a smaller test case?

@luizirber luizirber merged commit 05584d2 into dib-lab:master Oct 1, 2018

2 checks passed

WIP ready for review
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@luizirber

This comment has been minimized.

Copy link
Member

luizirber commented Oct 1, 2018

Thanks @olgabot! 🎊

I've punted the test issue to #551, and made 10x support optional in #550

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment