-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add metagenomic_report_merge() #342
Conversation
add a function, metagenomic_report_merge(), to metagenomics.py to combine multiple metagenomic reports into a single file usable as Krona input, with an option to recreate the Kraken summary (db required)
(This is a WIP) |
Hm, since you're starting to deal with the diamond output (I haven't touched any of this yet), I was going to say that one thing I'd really like to do is modify the current diamond entry point to throw a blank/empty column in front of the whole output file, so that the format actually matches a format that we want (not the format the tool gives us), similar to how much muscle we put into forcing vphaser to conform to our desired input/output file formats. Then we can drop all this stuff about specifying column numbers and the user having to know whether a report was generated with one tool or another (which really kills the usability/portability if they have to keep track of that kind of thing). |
We could modify the Diamond output; it would certainly simplify our code and calls. One argument against is that we'd be increasing the report size (due to extra delimiters or column info) without adding useful information. For a MiSeq run it wouldn't be much, but for HiSeq runs, it could add up. Compression would help, of course. I'm also not sure (yet) how the krona tool handles empty columns—whether it will honor empty columns as it should, or whether it will treat multiple delimiters as single delimiters. Another concern is whether we should worry about users calling Diamond from our I can say that Krona works fine on two-column files, and the Krona input part of the merger is functional as it is currently. |
I'm much less concerned about simplifying our code and more concerned about simplifying usability from the user side. Our use cases will always compress the raw read-based output anyway, and an extra byte per read (the tab character) should compress nicely. Do need to double check whether kraken-report and krona handle a blank column okay or whether we have to put a dummy value there. |
Oh also, our diamond wrapper already emits something that is very different (both format and semantically) from what comes out of raw diamond... there's a whole bunch of custom LCA code after the raw diamond call to reinterpret the output. These command line utilities are less about replicating functionality of some baseline tool (no one needs us for that.. just use bioconda). It's about performing a functional task of some kind, using some tool underneath, but hiding away the ugly interactions with that tool so people don't have to think about all the file conversions and peculiarities and foibles of each program. The current output file format from metagenomics.kraken happens to match kraken's native output format, but it's not on purpose, it's by coincidence. I asked Simon to pick a generic metagenomic output format that should be portable across all tools, and he picked the kraken formats because they were pretty good and conceptually universal. |
Ok, the merger interface has been simplified to accept metagenomic reports in Kraken format without separate arguments for kraken/diamond reports, and the Diamond output now prepends an extra column ( |
|
||
def parser_metagenomic_report_merge(parser=argparse.ArgumentParser()): | ||
parser.add_argument("metagenomic_reports", help="Input metagenomic reports created by Kraken", nargs='+', type=argparse.FileType('r')) | ||
parser.add_argument("--outKrakenSummary", dest="out_kraken_summary", help="Input metagenomic reports created by Diamond") #, type=argparse.FileType('w')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions about the argparse signature on metagenomic_report_merge:
- outKrakenSummary has a help signature that says "Input metagenomic reports created by Diamond"
- krakenDB has nargs='+' which is weird
- can we call the outputs outSummaryReport and outByReads or something like that? remove the vendor-specific implications of what these are good for (for example, I'd hope that metagenomic_report_merge could take as input files that were created by metagenomic_report_merge --outKronaInput)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My fault. I'll fix these.
made metagenomic report merger parameters more generic. The merge operation is now mostly a file cat, rather than a two-column cut
This addresses #314, and adds a function,
metagenomic_report_merge()
, tometagenomics.py
to combine multiple metagenomic reports into a single file usable as Krona input, with an option to recreate the Kraken summary (db required).