Additonal output stats to CollectInsertSizeMetrics. #87

travc · 2014-09-29T23:04:14Z

Small addition to the output of CollectInsertSizeMetrics.
Gives 2.5%, 25%, 75%, and 97.5% percentile values useful for making boxplots and such.

Just uses the methods already available in the Histogram class, so it is a very simple addition.
New output is at the end of the current stats fields (though before SAMPLE, LIBRARY, and READ_GROUP), so impact on tools which process the output might exist, but should be minimal.

…75%, and 97.5% percentile values useful for making boxplots and such.

nh13 · 2014-10-03T18:39:01Z

One potential issue with these metrics is that if the insert size is not well-behaved (say Gaussian), then they may not be particularly useful. This was the motivation behind the WIDTH_OF_10_PERCENT and associated metrics, which are centered around the median, and computed by extending 1bp out from the median until the fractional width is achieved. Could those metrics be used instead, and/or do you want some additional values there?

travc · 2014-10-03T22:01:23Z

The WIDTH_OF_*_PERCENT values give no indication of skew, which we find is a critical indicator of quality for our libraries. On the other hand, the standard 5 or 7 number summary statistics (mean, stdev, median, quantiles) do. Furthermore, they are concise, easy to visualize (boxplot) and familiar to pretty much everyone.
I agree that insert size (and lots of other NGS values) are potentially problematic since they aren't normally distributed, but that is an argument for quantile like summary statistics as opposed to just using mean and stdev or metrics which imply a balanced distribution around the median.

nh13 · 2014-10-08T14:36:26Z

Thanks for your thoughtful comments and explanation.

For each new metric, could you add a descriptive comment? That comment will be shown upon release on the website. Please see http://broadinstitute.github.io/picard/picard-metric-definitions.html

Once ready, we can merge it in.

… for proper documentation.

travc · 2014-10-08T18:05:07Z

Ok... I think the comments are more sensible now.
I really like the way the metric outputs are generated from the code and auto documented. Java (not a language I normally use) can be pretty nifty at times.

…case 'N' Added: SOFT_MASKED option: Treat all lower-case letters like 'N' Added: INTERVAL_LIST option: Interval list file specifiying regions to include in metrics comutation Added: INTERVAL_LIST_STRING option: String of whitespace separated chrom:[start[-end]] (like samtools regions) specifying regions to include in metrics computation. Can be used with or without the INTERVAL_LIST file option.

travc · 2014-10-13T03:11:30Z

I've added a bugfix and a couple of simple (but useful) features to CollectWgsMetrics. These are more generally useful and important than the changes to CollectInsertSizeMetrics, so I'm kind of regretting not making a new fork and a separate (hopefully expedited) pull request for them.

They a documented in the commit and the code.
The only wobbly part is the parsing of the command line option INTERVAL_LIST_STRING... The code is only a few lines, but really should be a utility function someplace outside of CollectWgsMetrics.java. However, I don't know where the proper place to put it would be. Anyways, it is an optional option.

vdauwera · 2015-04-26T23:24:43Z

What's the status of this pull request? If still relevant, is it waiting on us (@nh13) or @travc ?

travc · 2015-04-27T01:04:45Z

I haven't been keeping up with the code, but think it was pretty much done from my POV.

yfarjoun · 2015-04-27T02:13:14Z

Thanks for the bug fix and the extra options.

I am wondering what is the use-case for an interval list for WGS metrics?

cheers,

Yossi.

On Sun, Apr 26, 2015 at 7:24 PM, Geraldine Van der Auwera <
notifications@github.com> wrote:

What's the status of this pull request? If still relevant, is it waiting
on us (@nh13 https://github.com/nh13) or @travc
https://github.com/travc ?

—
Reply to this email directly or view it on GitHub
#87 (comment).

travc · 2015-04-27T04:54:49Z

I'm mapping to a fairly gappy ref. It also includes some scaffolds which are unplaced as well as a male-only chromosome. When trying to evaluate the quality of mapping, it really only makes sense to look at the "good" portions of the reference. It is also a big speedup to just compute metrics for a few characteristic regions.

Generally, the use case will be when you want to metrics for different regions... probably different sorts or regions. So one might run metrics on a just some relatively conserved regions. Alternatively, one might want to look at the metrics just for a region (or some regions) which are highly variable to check that reference is close enough to generate reasonable results.

I can think of lots of other cases when you want metrics just on specific regions, but I think you get the idea. The interval list implementation seemed to be the most general way of doing it as well as fitting in with how other picard tools does it.

Additonal output stats to CollectInsertSizeMetrics. Gives 2.5%, 25%, …

6ca19ed

…75%, and 97.5% percentile values useful for making boxplots and such.

nh13 self-assigned this Oct 8, 2014

travc added 2 commits October 8, 2014 10:55

Fixed comments for PERCENT_??_INSERT_SIZE metrics so they can be used…

7706d43

… for proper documentation.

Just use '<=' instead of '≤' in comments.

c1379bc

travc mentioned this pull request Oct 13, 2014

Feature request: Collect* only for specified chromosomes/contigs #86

Closed

travc closed this Apr 27, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Additonal output stats to CollectInsertSizeMetrics. #87

Additonal output stats to CollectInsertSizeMetrics. #87

travc commented Sep 29, 2014

nh13 commented Oct 3, 2014

travc commented Oct 3, 2014

nh13 commented Oct 8, 2014

travc commented Oct 8, 2014

travc commented Oct 13, 2014

vdauwera commented Apr 26, 2015

travc commented Apr 27, 2015

yfarjoun commented Apr 27, 2015

travc commented Apr 27, 2015

Additonal output stats to CollectInsertSizeMetrics. #87

Additonal output stats to CollectInsertSizeMetrics. #87

Conversation

travc commented Sep 29, 2014

nh13 commented Oct 3, 2014

travc commented Oct 3, 2014

nh13 commented Oct 8, 2014

travc commented Oct 8, 2014

travc commented Oct 13, 2014

vdauwera commented Apr 26, 2015

travc commented Apr 27, 2015

yfarjoun commented Apr 27, 2015

travc commented Apr 27, 2015