Diamond on unclassified kraken leftover reads #737

Closed
yesimon wants to merge 1 commit from the sy-diamond-leftover branch

Conversation

@yesimon (Contributor) commented Dec 12, 2017

Unfortunately, kraken output is in an undefined order, so this requires a sort.

@dpark01 (Member) left a comment

Some smaller things, and the main bigger question I have is: can you elaborate further on why diamond requires sorted bam input here?

@@ -834,13 +868,18 @@ def diamond(inBam, db, taxDb, outReport, outReads=None, threads=None):

if outReads is not None:
# Interstitial save of stdout to output file
cmd += ' | tee >(gzip --best > {out})'.format(out=outReads)
reads_cmd = 'cat {inf} | tee >(gzip --best > {out}) > {pipe}'.format(

@dpark01 (Member):

we now have pigz as part of requirements-conda.txt, can you replace gzip --best with pigz --best -p {threads}?
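
For concreteness, a minimal sketch of that substitution in the existing format string (assuming threads has already been resolved to an integer at this point in the function):

    # hedged sketch: swap single-threaded gzip for parallel pigz in the tee branch
    cmd += ' | tee >(pigz --best -p {threads} > {out})'.format(threads=threads, out=outReads)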

@dpark01 (Member):

Also, I've never really experienced problems in the past with standard pipefitting of Popen objects (and I've never needed to use the codecs package for any of it)... I really think the shell-based pipes could be cleaned up at some point (doesn't have to be this PR though.. more for the future).

@yesimon (Contributor, Author):

My thoughts are that diamond is slow enough that even single-threaded gzip can keep up with the output. The codecs module is necessary for operating on text pipes in py3, but it seems pipefitting can work if done carefully.

metagenomics.py Outdated
env['LC_ALL'] = 'C'

return subprocess.Popen('''
samtools view -h {bam} > {output} && \

@dpark01 (Member):

Okay ... a few comments here.

  • this level of shell piping is making me uncomfortable (more than my previous comment in the diamond method).
  • at the very minimum it needs a set -o pipefail (see the sketch after this list); I've seen previous bits of code like this fail silently before I changed them.
  • are you sure it truly needs a sort before going into diamond? diamond doesn't require sorted input. If all you're trying to do is filter on unclassified input, that shouldn't require a sort, just a filter on taxid:32644.
  • kraken_leftover_reads seems to be doing two things at once, and it creates a weird user interface where the user is asked for a kraken_db in order to run diamond on the leftover kraken reads, and it isn't clear why that's necessary. It looks like the kraken_db is needed for kraken-filter (I'm not sure why, but oh well). Can we break up kraken_leftover_reads into kraken_filter and kraken_unclassified (or something like that), where the first step calls kraken-filter and the second does not require a kraken_db, and then kraken_leftover_diamond can pipe-glue them all together (if filterThreshold is not None; if filterThreshold is None, just pipe-glue the first and last steps and then you don't need a kraken_db at all)? In the most typical scenarios, I think users won't be messing with re-filtering much unless they're experimenting (since the normal kraken command includes a baseline filter), so any time we can avoid having to provide a 100GB db, that would be great.
  • separately, since we have our own fork of kraken, do you know if it's really necessary for kraken-filter to have a kraken_db as input?
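
A minimal sketch of the pipefail point from the second bullet, reusing the names from the snippet above (the command itself is illustrative, not the actual pipeline in this PR; inBam, outSam, and env are assumed to exist in the surrounding scope):

    # hedged sketch: fail the whole shell pipeline if any stage fails
    return subprocess.Popen(
        'set -o pipefail; samtools view -h {bam} | gzip --best > {output}'.format(bam=inBam, output=outSam),
        shell=True,
        executable='/bin/bash',  # pipefail and >() process substitution need bash, not /bin/sh
        env=env)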

@yesimon (Contributor, Author):

Broke it up into kraken_leftover_reads; doing both the leftover-reads subset and diamond together now happens only in the snakemake rule.

It only needs the taxonomy files from the kraken_db, so it doesn't load any of the big files. The goal is to use the kraken db's included taxonomy files to prevent any potential conflicts.

metagenomics.py Outdated


def parser_kraken_build(parser=argparse.ArgumentParser()):
parser.add_argument('db', help='Kraken database output directory.')

@dpark01 (Member):

this section looks like unintentionally duplicated code... I moved all the parser_ methods before the main methods in metagenomics.py to keep it consistent with other top-level python scripts (it keeps the description and self-documentation close to the method spec and docstring), and also moved the __commands__.append line up next to each main method (again, keeping in line with other top-level python scripts).
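
Roughly, the layout convention being described looks like this (an illustrative skeleton only, not the actual metagenomics.py code; it assumes the module already imports argparse and defines __commands__):

    # parser_ method sits directly before its main method, so the CLI description
    # stays next to the method spec and docstring
    def parser_kraken_build(parser=argparse.ArgumentParser()):
        parser.add_argument('db', help='Kraken database output directory.')
        # ... remaining arguments ...
        return parser

    def kraken_build(db, *args, **kwargs):
        '''Build a Kraken database.'''
        ...

    # the registration line lives right next to the main method it registers
    __commands__.append(('kraken_build', parser_kraken_build))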

@yesimon (Contributor, Author) commented Dec 13, 2017

Diamond doesn't require sorted input; rather, kraken outputs the classifications in random order relative to the input bam. Therefore we can find the unclassified reads in the input bam either by loading all read ids into memory and searching the input bam, or by doing a join on QNAME. The join to subset the input bam by the unclassified read ids requires sorting both lists, which is inefficient. Instead, modifying the kraken fork to output reads in order (at a small performance cost) means we can do a merge-sort subset of the input bam, which is O(n) time and O(1) memory.
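
A sketch of that merge-style subset, assuming the modified kraken fork emits one classification line per read in the same order as the input reads; the helper name is hypothetical, reads are assumed to be pysam-style records with a query_name attribute, and paired-read bookkeeping is glossed over:

    # hedged sketch: stream both inputs in lockstep and keep only unclassified reads
    def stream_unclassified(kraken_lines, reads):
        '''Yield reads kraken marked 'U', given both streams in the same order.'''
        for read, line in zip(reads, kraken_lines):
            status, read_id = line.split('\t')[:2]  # kraken line: C/U <tab> read id <tab> taxid ...
            assert read_id == read.query_name       # same order, so no sort or in-memory id set
            if status == 'U':
                yield read                          # O(n) time, O(1) extra memory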

@dpark01 (Member) commented Dec 13, 2017

Modifying the kraken-fork to sort the output with minimal cost sounds like a useful modification anyway. Alternatively, you could also just grep/cut the unclassified read IDs out of the kraken output and feed that list to Picard FilterSamReads on the original input bam--that's pretty much what every tool in taxon_filter.py does.
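
A rough sketch of that alternative (file names are placeholders; kraken's per-read output puts C/U in the first column and the read ID in the second, and picard here is the conda-provided wrapper script):

    # hedged sketch: pull unclassified read IDs, then subset the original bam with Picard
    awk -F'\t' '$1 == "U" {print $2}' kraken.classifications.txt > unclassified_ids.txt
    picard FilterSamReads I=input.bam O=unclassified.bam READ_LIST_FILE=unclassified_ids.txt FILTER=includeReadList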

@dpark01 (Member) commented Dec 13, 2017

I can't say I've ever used the codecs library before, and I've got a number of Popen pipes working in Python3 just fine... you just call .decode('UTF-8') on the bytes and you're good to go.
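
For example, a minimal sketch of that pattern (the input file name is a placeholder):

    # hedged sketch: read a subprocess's stdout in py3 without the codecs module
    import subprocess

    p = subprocess.Popen(['samtools', 'view', '-h', 'input.bam'], stdout=subprocess.PIPE)
    for raw in p.stdout:              # each line arrives as bytes in py3
        line = raw.decode('UTF-8')    # plain str from here on
        # ... process the SAM line ...
    p.wait()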

@yesimon (Contributor, Author) commented Dec 13, 2017

Yeah .decode('UTF-8') should have the same effect with a different API.

@dpark01 added this to Backlog (not for next release) in v1.19.1 on Jan 18, 2018
@dpark01 added this to Backlog (not necessarily this release) in v1.19.2 on Jan 29, 2018
@yesimon force-pushed the sy-diamond-leftover branch 3 times, most recently from e89f625 to 0a87acf on March 1, 2018 at 01:57