-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diamond on unclassified kraken leftover reads #737
Conversation
24a1ef4
to
7d4f6c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some smaller things and the main bigger question I have is: can you elaborate further on why diamond requires sorted bam input here?
@@ -834,13 +868,18 @@ def diamond(inBam, db, taxDb, outReport, outReads=None, threads=None): | |||
|
|||
if outReads is not None: | |||
# Interstitial save of stdout to output file | |||
cmd += ' | tee >(gzip --best > {out})'.format(out=outReads) | |||
reads_cmd = 'cat {inf} | tee >(gzip --best > {out}) > {pipe}'.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we now have pigz
as part of requirements-conda.txt
, can you replace gzip --best
with pigz --best -p {threads}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I've never really experienced problems in the past with standard pipefitting of Popen
objects (and I've never needed to use the codecs
package for any of it)... I really think the shell-based pipes could be cleaned up at some point (doesn't have to be this PR though.. more for the future).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts are that diamond is slow enough that even single threaded gzip
can keep up with the output. The codecs
is necessary for operating on pipes in py3 but it seems pipefitting can work if careful.
metagenomics.py
Outdated
env['LC_ALL'] = 'C' | ||
|
||
return subprocess.Popen(''' | ||
samtools view -h {bam} > {output} && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay ... a few comments here.
- this level of shell piping is making me uncomfortable (more than my previous comment below in the
diamond
method). - at the very minimum it needs a
set -o pipefail
.. I've seen previous bits of code like this fail silently before I changed them - are you sure it truly needs a sort before going into diamond? diamond doesn't require sorted input. If all you're trying to do is filter on unclassified input, that shouldn't require a sort, just a filter on taxid:32644.
kraken_leftover_reads
seems to be doing two things at once.. and it creates a weird user interface where the user is being asked for a kraken_db to run diamond on the leftover kraken reads, which isn't clear why that's necessary. It looks like the kraken_db is needed forkraken-filter
(I'm not sure why, but oh well). Can we break upkraken_leftover_reads
intokraken_filter
andkraken_unclassified
(or something like that) where the first step callskraken-filter
and the second does not require a kraken_db, and thenkraken_leftover_diamond
can pipe-glue them all together (if filterThreshold is not None... but if filterThreshold is None, just pipe-glue the first and last step and then you don't even need a kraken_db at all). In the most typical scenarios, I think users won't be messing with re-filtering much unless they're experimenting with things (since the normal kraken command includes a baseline filter) so any time we can avoid having to provide a 100GB db that would be great.- separately, since we have our own fork of kraken, do you know if it's really necessary for
kraken-filter
to have a kraken_db as input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broke it up to kraken_leftover_reads
and doing both the leftover reads subset and diamond is now only in the snakemake rule.
It only needs the taxonomy files from the kraken_db
so it doesn't load any of the big files. The goal is to use the kraken db's included taxonomy files to prevent any potential conflicts
metagenomics.py
Outdated
|
||
|
||
def parser_kraken_build(parser=argparse.ArgumentParser()): | ||
parser.add_argument('db', help='Kraken database output directory.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this section looks like unintentionally duplicated code... I moved all the parser_
methods before the main methods in metagenomics.py to keep it consistent with other top-level python scripts (keeps the description and self-documentation close to the method spec and docstring) and also moved the __commands__.append
line up next to each main method as well (again, keeping in line with other top level python scripts).
7d4f6c5
to
d6fe03b
Compare
Diamond doesn't require sorted input, but rather kraken outputs the classifications in random order compared to the input bam. Therefore we can find the unclassified reads in the input bam either by loading all read ids in memory and searching the input bam or do a join on QNAME. The |
Modifying the kraken-fork to sort the output with minimal cost sounds like a useful modification anyway. Alternatively, you could also just grep/cut the list of read IDs from the kraken output that is unclassified and feed that to Picard FilterSamReads on the original input bam--that's pretty much what every tool in taxon_filter.py does. |
I can't say I've ever used the |
Yeah |
e89f625
to
0a87acf
Compare
0a87acf
to
9c350f4
Compare
Unfortunately kraken output is undefined order so requires a sort