Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metagenomic_report_merge() #342

Merged
merged 5 commits into from
Jun 16, 2016
Merged

Conversation

tomkinsc
Copy link
Member

@tomkinsc tomkinsc commented Jun 14, 2016

This addresses #314, and adds a function, metagenomic_report_merge(), to metagenomics.py to combine multiple metagenomic reports into a single file usable as Krona input, with an option to recreate the Kraken summary (db required).

add a function, metagenomic_report_merge(),  to metagenomics.py to
combine multiple metagenomic reports into a single file usable as Krona
input, with an option to recreate the Kraken summary (db required)
@tomkinsc
Copy link
Member Author

(This is a WIP)

@dpark01
Copy link
Member

dpark01 commented Jun 15, 2016

Hm, since you're starting to deal with the diamond output (I haven't touched any of this yet), I was going to say that one thing I'd really like to do is modify the current diamond entry point to throw a blank/empty column in front of the whole output file, so that the format actually matches a format that we want (not the format the tool gives us), similar to how much muscle we put into forcing vphaser to conform to our desired input/output file formats. Then we can drop all this stuff about specifying column numbers and the user having to know whether a report was generated with one tool or another (which really kills the usability/portability if they have to keep track of that kind of thing).

@tomkinsc
Copy link
Member Author

tomkinsc commented Jun 15, 2016

We could modify the Diamond output; it would certainly simplify our code and calls. One argument against is that we'd be increasing the report size (due to extra delimiters or column info) without adding useful information. For a MiSeq run it wouldn't be much, but for HiSeq runs, it could add up. Compression would help, of course. I'm also not sure (yet) how the krona tool handles empty columns—whether it will honor empty columns as it should, or whether it will treat multiple delimiters as single delimiters. Another concern is whether we should worry about users calling Diamond from our metagenomics.py wrapper and getting an output format that's different from what they may expect from Diamond. Our modified format could be optional. I guess I'll add a flag for it to the parser, and see how Krona deals with empty columns. If it all checks out we can modify our calls to include the flag and strip out our column specifications.

I can say that Krona works fine on two-column files, and the Krona input part of the merger is functional as it is currently.

@dpark01
Copy link
Member

dpark01 commented Jun 15, 2016

I'm much less concerned about simplifying our code and more concerned about simplifying usability from the user side. Our use cases will always compress the raw read-based output anyway, and an extra byte per read (the tab character) should compress nicely. Do need to double check whether kraken-report and krona handle a blank column okay or whether we have to put a dummy value there.

@dpark01
Copy link
Member

dpark01 commented Jun 15, 2016

Oh also, our diamond wrapper already emits something that is very different (both format and semantically) from what comes out of raw diamond... there's a whole bunch of custom LCA code after the raw diamond call to reinterpret the output. These command line utilities are less about replicating functionality of some baseline tool (no one needs us for that.. just use bioconda). It's about performing a functional task of some kind, using some tool underneath, but hiding away the ugly interactions with that tool so people don't have to think about all the file conversions and peculiarities and foibles of each program.

The current output file format from metagenomics.kraken happens to match kraken's native output format, but it's not on purpose, it's by coincidence. I asked Simon to pick a generic metagenomic output format that should be portable across all tools, and he picked the kraken formats because they were pretty good and conceptually universal.

@tomkinsc
Copy link
Member Author

Ok, the merger interface has been simplified to accept metagenomic reports in Kraken format without separate arguments for kraken/diamond reports, and the Diamond output now prepends an extra column (\t) so the columns align with Kraken format.


def parser_metagenomic_report_merge(parser=argparse.ArgumentParser()):
parser.add_argument("metagenomic_reports", help="Input metagenomic reports created by Kraken", nargs='+', type=argparse.FileType('r'))
parser.add_argument("--outKrakenSummary", dest="out_kraken_summary", help="Input metagenomic reports created by Diamond") #, type=argparse.FileType('w'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions about the argparse signature on metagenomic_report_merge:

  • outKrakenSummary has a help signature that says "Input metagenomic reports created by Diamond"
  • krakenDB has nargs='+' which is weird
  • can we call the outputs outSummaryReport and outByReads or something like that? remove the vendor-specific implications of what these are good for (for example, I'd hope that metagenomic_report_merge could take as input files that were created by metagenomic_report_merge --outKronaInput)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fault. I'll fix these.

made metagenomic report merger parameters more generic. The merge
operation is now mostly a file cat, rather than a two-column cut
@dpark01 dpark01 merged commit 7695b5b into master Jun 16, 2016
@tomkinsc tomkinsc deleted the ct-metagenomic-report-merger branch June 16, 2016 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants