Diamond on unclassified kraken leftover reads #737

Closed
yesimon wants to merge 1 commit from the sy-diamond-leftover branch

Conversation

@yesimon (Contributor) commented Dec 12, 2017

Unfortunately, kraken output is in an undefined order, so this requires a sort.

@dpark01 (Member) left a comment

Some smaller things, and the main bigger question I have is: can you elaborate further on why diamond requires sorted bam input here?

@@ -834,13 +868,18 @@ def diamond(inBam, db, taxDb, outReport, outReads=None, threads=None):

if outReads is not None:
# Interstitial save of stdout to output file
cmd += ' | tee >(gzip --best > {out})'.format(out=outReads)
reads_cmd = 'cat {inf} | tee >(gzip --best > {out}) > {pipe}'.format(

@dpark01 (Member):

we now have pigz as part of requirements-conda.txt, can you replace gzip --best with pigz --best -p {threads}?
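
For concreteness, a minimal sketch of that substitution in the existing format string (assuming threads has already been resolved to an integer at this point in the function):

    # hedged sketch: swap single-threaded gzip for parallel pigz in the tee branch
    cmd += ' | tee >(pigz --best -p {threads} > {out})'.format(threads=threads, out=outReads)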

@dpark01 (Member):

Also, I've never really experienced problems in the past with standard pipefitting of Popen objects (and I've never needed to use the codecs package for any of it)... I really think the shell-based pipes could be cleaned up at some point (doesn't have to be this PR though.. more for the future).

@yesimon (Contributor, Author):

My thoughts are that diamond is slow enough that even single-threaded gzip can keep up with the output. The codecs module is necessary for operating on text pipes in py3, but it seems pipefitting can work if done carefully.

metagenomics.py Outdated
env['LC_ALL'] = 'C'

return subprocess.Popen('''
samtools view -h {bam} > {output} && \

@dpark01 (Member):

Okay ... a few comments here.

  • this level of shell piping is making me uncomfortable (more than my previous comment in the diamond method).
  • at the very minimum it needs a set -o pipefail (see the sketch after this list); I've seen previous bits of code like this fail silently before I changed them.
  • are you sure it truly needs a sort before going into diamond? diamond doesn't require sorted input. If all you're trying to do is filter on unclassified input, that shouldn't require a sort, just a filter on taxid:32644.
  • kraken_leftover_reads seems to be doing two things at once, and it creates a weird user interface where the user is asked for a kraken_db in order to run diamond on the leftover kraken reads, and it isn't clear why that's necessary. It looks like the kraken_db is needed for kraken-filter (I'm not sure why, but oh well). Can we break up kraken_leftover_reads into kraken_filter and kraken_unclassified (or something like that), where the first step calls kraken-filter and the second does not require a kraken_db, and then kraken_leftover_diamond can pipe-glue them all together (if filterThreshold is not None; if filterThreshold is None, just pipe-glue the first and last steps and then you don't need a kraken_db at all)? In the most typical scenarios, I think users won't be messing with re-filtering much unless they're experimenting (since the normal kraken command includes a baseline filter), so any time we can avoid having to provide a 100GB db, that would be great.
  • separately, since we have our own fork of kraken, do you know if it's really necessary for kraken-filter to have a kraken_db as input?
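
A minimal sketch of the pipefail point from the second bullet, reusing the names from the snippet above (the command itself is illustrative, not the actual pipeline in this PR; inBam, outSam, and env are assumed to exist in the surrounding scope):

    # hedged sketch: fail the whole shell pipeline if any stage fails
    return subprocess.Popen(
        'set -o pipefail; samtools view -h {bam} | gzip --best > {output}'.format(bam=inBam, output=outSam),
        shell=True,
        executable='/bin/bash',  # pipefail and >() process substitution need bash, not /bin/sh
        env=env)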

@yesimon (Contributor, Author):

Broke it up into kraken_leftover_reads; doing both the leftover-reads subset and diamond together now happens only in the snakemake rule.

It only needs the taxonomy files from the kraken_db, so it doesn't load any of the big files. The goal is to use the kraken db's included taxonomy files to prevent any potential conflicts.

metagenomics.py Outdated


def parser_kraken_build(parser=argparse.ArgumentParser()):
parser.add_argument('db', help='Kraken database output directory.')

@dpark01 (Member):

this section looks like unintentionally duplicated code... I moved all the parser_ methods before the main methods in metagenomics.py to keep it consistent with other top-level python scripts (it keeps the description and self-documentation close to the method spec and docstring), and also moved the __commands__.append line up next to each main method (again, keeping in line with other top-level python scripts).
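
Roughly, the layout convention being described looks like this (an illustrative skeleton only, not the actual metagenomics.py code; it assumes the module already imports argparse and defines __commands__):

    # parser_ method sits directly before its main method, so the CLI description
    # stays next to the method spec and docstring
    def parser_kraken_build(parser=argparse.ArgumentParser()):
        parser.add_argument('db', help='Kraken database output directory.')
        # ... remaining arguments ...
        return parser

    def kraken_build(db, *args, **kwargs):
        '''Build a Kraken database.'''
        ...

    # the registration line lives right next to the main method it registers
    __commands__.append(('kraken_build', parser_kraken_build))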

@yesimon (Contributor, Author) commented Dec 13, 2017

Diamond doesn't require sorted input; rather, kraken outputs the classifications in random order relative to the input bam. Therefore we can find the unclassified reads in the input bam either by loading all read ids into memory and searching the input bam, or by doing a join on QNAME. The join to subset the input bam by the unclassified read ids requires sorting both lists, which is inefficient. Instead, modifying the kraken fork to output reads in order (at a small performance cost) means we can do a merge-sort subset of the input bam, which is O(n) time and O(1) memory.
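
A sketch of that merge-style subset, assuming the modified kraken fork emits one classification line per read in the same order as the input reads; the helper name is hypothetical, reads are assumed to be pysam-style records with a query_name attribute, and paired-read bookkeeping is glossed over:

    # hedged sketch: stream both inputs in lockstep and keep only unclassified reads
    def stream_unclassified(kraken_lines, reads):
        '''Yield reads kraken marked 'U', given both streams in the same order.'''
        for read, line in zip(reads, kraken_lines):
            status, read_id = line.split('\t')[:2]  # kraken line: C/U <tab> read id <tab> taxid ...
            assert read_id == read.query_name       # same order, so no sort or in-memory id set
            if status == 'U':
                yield read                          # O(n) time, O(1) extra memory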

@dpark01 (Member) commented Dec 13, 2017

Modifying the kraken-fork to sort the output with minimal cost sounds like a useful modification anyway. Alternatively, you could also just grep/cut the unclassified read IDs out of the kraken output and feed that list to Picard FilterSamReads on the original input bam--that's pretty much what every tool in taxon_filter.py does.
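
A rough sketch of that alternative (file names are placeholders; kraken's per-read output puts C/U in the first column and the read ID in the second, and picard here is the conda-provided wrapper script):

    # hedged sketch: pull unclassified read IDs, then subset the original bam with Picard
    awk -F'\t' '$1 == "U" {print $2}' kraken.classifications.txt > unclassified_ids.txt
    picard FilterSamReads I=input.bam O=unclassified.bam READ_LIST_FILE=unclassified_ids.txt FILTER=includeReadList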

@dpark01 (Member) commented Dec 13, 2017

I can't say I've ever used the codecs library before, and I've got a number of Popen pipes working in Python3 just fine... you just call .decode('UTF-8') on the bytes and you're good to go.
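
For example, a minimal sketch of that pattern (the input file name is a placeholder):

    # hedged sketch: read a subprocess's stdout in py3 without the codecs module
    import subprocess

    p = subprocess.Popen(['samtools', 'view', '-h', 'input.bam'], stdout=subprocess.PIPE)
    for raw in p.stdout:              # each line arrives as bytes in py3
        line = raw.decode('UTF-8')    # plain str from here on
        # ... process the SAM line ...
    p.wait()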

@yesimon (Contributor, Author) commented Dec 13, 2017

Yeah .decode('UTF-8') should have the same effect with a different API.

@dpark01 added this to Backlog (not for next release) in v1.19.1 on Jan 18, 2018
@dpark01 added this to Backlog (not necessarily this release) in v1.19.2 on Jan 29, 2018
@yesimon force-pushed the sy-diamond-leftover branch 3 times, most recently from e89f625 to 0a87acf on March 1, 2018 at 01:57