add metagenomic_report_merge() #342

tomkinsc · 2016-06-14T21:08:26Z

This addresses #314, and adds a function, metagenomic_report_merge(), to metagenomics.py to combine multiple metagenomic reports into a single file usable as Krona input, with an option to recreate the Kraken summary (db required).

tomkinsc · 2016-06-14T21:08:44Z

(This is a WIP)

dpark01 · 2016-06-15T12:56:51Z

Hm, since you're starting to deal with the diamond output (I haven't touched any of this yet), I was going to say that one thing I'd really like to do is modify the current diamond entry point to throw a blank/empty column in front of the whole output file, so that the format actually matches a format that we want (not the format the tool gives us), similar to how much muscle we put into forcing vphaser to conform to our desired input/output file formats. Then we can drop all this stuff about specifying column numbers and the user having to know whether a report was generated with one tool or another (which really kills the usability/portability if they have to keep track of that kind of thing).

tomkinsc · 2016-06-15T13:30:47Z

We could modify the Diamond output; it would certainly simplify our code and calls. One argument against is that we'd be increasing the report size (due to extra delimiters or column info) without adding useful information. For a MiSeq run it wouldn't be much, but for HiSeq runs, it could add up. Compression would help, of course. I'm also not sure (yet) how Krona handles empty columns: whether it will honor them as it should, or whether it will treat multiple consecutive delimiters as a single delimiter. Another concern is whether we should worry about users calling Diamond from our metagenomics.py wrapper and getting an output format that's different from what they may expect from Diamond. Our modified format could be optional. I guess I'll add a flag for it to the parser, and see how Krona deals with empty columns. If it all checks out we can modify our calls to include the flag and strip out our column specifications.

I can say that Krona works fine on two-column files, and the Krona input part of the merger is functional as it is currently.
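
The empty-column concern comes down to how a parser splits fields. A minimal Python illustration of the two behaviours follows; the row contents are made up, and this says nothing about what Krona itself does, which still needs to be tested:

```python
# A hypothetical row with a blank first column, as proposed for the Diamond output.
row = "\tread_1\t9606\n"

# Splitting on the exact delimiter preserves the empty leading field.
strict = row.rstrip("\n").split("\t")

# Splitting on runs of whitespace silently drops it, shifting every column left.
collapsed = row.split()

print(strict)     # ['', 'read_1', '9606']
print(collapsed)  # ['read_1', '9606']
```

If a downstream tool behaves like the second case, a dummy value in the first column would be needed instead of a blank one.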

dpark01 · 2016-06-15T14:00:18Z

I'm much less concerned about simplifying our code and more concerned about simplifying usability from the user side. Our use cases will always compress the raw read-based output anyway, and an extra byte per read (the tab character) should compress nicely. We do need to double-check whether kraken-report and krona handle a blank column okay or whether we have to put a dummy value there.

dpark01 · 2016-06-15T14:54:37Z

Oh also, our diamond wrapper already emits something that is very different (both in format and semantically) from what comes out of raw diamond... there's a whole bunch of custom LCA code after the raw diamond call to reinterpret the output. These command line utilities are less about replicating the functionality of some baseline tool (no one needs us for that; just use bioconda) and more about performing a functional task of some kind, using some tool underneath, but hiding away the ugly interactions with that tool so people don't have to think about all the file conversions and peculiarities and foibles of each program.

The current output file format from metagenomics.kraken happens to match kraken's native output format, but it's not on purpose, it's by coincidence. I asked Simon to pick a generic metagenomic output format that should be portable across all tools, and he picked the kraken formats because they were pretty good and conceptually universal.

tomkinsc · 2016-06-15T20:44:25Z

Ok, the merger interface has been simplified to accept metagenomic reports in Kraken format without separate arguments for kraken/diamond reports, and the Diamond output now prepends an extra column (\t) so the columns align with Kraken format.
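
As a rough sketch of what this change amounts to (the helper names and the abbreviated three-column rows below are hypothetical, not the code in metagenomics.py): once every per-read row carries the same leading column, merging reduces to concatenating the inputs.

```python
import io


def prepend_blank_column(rows):
    """Yield each row with a leading tab so its columns line up with Kraken-style rows."""
    for row in rows:
        yield "\t" + row


def merge_reports(reports, out):
    """Concatenate already-aligned reports into one stream (essentially `cat`)."""
    for report in reports:
        for row in report:
            # Guard against a report whose last line lacks a trailing newline.
            out.write(row if row.endswith("\n") else row + "\n")


# Made-up, abbreviated example data: status, read ID, taxon ID.
kraken_like = io.StringIO("C\tread_1\t9606\nU\tread_2\t0\n")
diamond_like = io.StringIO("".join(prepend_blank_column(["read_3\t562\n"])))

merged = io.StringIO()
merge_reports([kraken_like, diamond_like], merged)
print(merged.getvalue())
```

After the merge, the read ID and taxon ID sit in the same column positions on every row, so a consumer no longer needs to know which tool produced which input.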

dpark01 · 2016-06-15T21:01:49Z

metagenomics.py

+
+def parser_metagenomic_report_merge(parser=argparse.ArgumentParser()):
+    parser.add_argument("metagenomic_reports", help="Input metagenomic reports created by Kraken", nargs='+', type=argparse.FileType('r'))
+    parser.add_argument("--outKrakenSummary", dest="out_kraken_summary", help="Input metagenomic reports created by Diamond") #, type=argparse.FileType('w'))


A few questions about the argparse signature on metagenomic_report_merge:

outKrakenSummary has a help signature that says "Input metagenomic reports created by Diamond"

krakenDB has nargs='+' which is weird

can we call the outputs outSummaryReport and outByReads or something like that? remove the vendor-specific implications of what these are good for (for example, I'd hope that metagenomic_report_merge could take as input files that were created by metagenomic_report_merge --outKronaInput)

tomkinsc · 2016-06-15

My fault. I'll fix these.
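
For illustration only, here is one way the signature could look with the review points applied. The option names `--outSummaryReport` and `--outByReads` are the reviewer's suggestions, not necessarily the names that were merged; the help strings are assumptions; and the positional argument takes plain paths here (the diff uses `argparse.FileType('r')`) so the sketch runs without real files.

```python
import argparse


def parser_metagenomic_report_merge(parser=None):
    # Create the parser inside the function instead of sharing a default instance.
    if parser is None:
        parser = argparse.ArgumentParser()
    parser.add_argument("metagenomic_reports", nargs="+",
                        help="Input metagenomic reports in Kraken-style format")
    parser.add_argument("--outSummaryReport", dest="out_summary_report",
                        help="Path for the summary report (requires --krakenDB)")
    # A single database path, so no nargs here.
    parser.add_argument("--krakenDB", dest="kraken_db",
                        help="Kraken database, needed only for the summary report")
    parser.add_argument("--outByReads", dest="out_by_reads",
                        help="Path for the merged per-read output")
    return parser


# Hypothetical invocation with made-up file names.
args = parser_metagenomic_report_merge().parse_args(
    ["a.txt", "b.txt", "--outByReads", "merged.txt"])
print(args.metagenomic_reports, args.out_by_reads)
```

Keeping the output names tool-neutral leaves room for the merged per-read file to be fed back in as an input later.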

add metagenomic_report_merge()

9f8b0f5

add a function, metagenomic_report_merge(), to metagenomics.py to combine multiple metagenomic reports into a single file usable as Krona input, with an option to recreate the Kraken summary (db required)

dpark01 added the 1 - Ready and 2 - Working labels and removed the 1 - Ready label Jun 15, 2016

tomkinsc and others added 3 commits June 15, 2016 13:48

Merge branch 'master' into ct-metagenomic-report-merger

85f5b76

Diamond output prepends a '\t' to each output row so columns align with Kraken format

2486fd7

remove column prepend option from diamond parser

aa4f70a

dpark01 reviewed Jun 15, 2016

made metagenomic report merger parameters more generic

1e98a5b

made metagenomic report merger parameters more generic. The merge operation is now mostly a file cat, rather than a two-column cut

dpark01 merged commit 7695b5b into master Jun 16, 2016

tomkinsc mentioned this pull request Jun 16, 2016

create metagenomic report merger #314

Closed

tomkinsc deleted the ct-metagenomic-report-merger branch June 16, 2016 15:17

dpark01 removed the 2 - Working label Jun 16, 2016
