Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additonal output stats to CollectInsertSizeMetrics. #87

Closed
wants to merge 4 commits into from

Conversation

travc
Copy link

@travc travc commented Sep 29, 2014

Small addition to the output of CollectInsertSizeMetrics.
Gives 2.5%, 25%, 75%, and 97.5% percentile values useful for making boxplots and such.

Just uses the methods already available in the Histogram class, so it is a very simple addition.
New output is at the end of the current stats fields (though before SAMPLE, LIBRARY, and READ_GROUP), so impact on tools which process the output might exist, but should be minimal.

…75%, and 97.5% percentile values useful for making boxplots and such.
@nh13
Copy link
Collaborator

nh13 commented Oct 3, 2014

One potential issue with these metrics is that if the insert size is not well-behaved (say Gaussian), then they may not be particularly useful. This was the motivation behind the WIDTH_OF_10_PERCENT and associated metrics, which are centered around the median, and computed by extending 1bp out from the median until the fractional width is achieved. Could those metrics be used instead, and/or do you want some additional values there?

@travc
Copy link
Author

travc commented Oct 3, 2014

The WIDTH_OF_*_PERCENT values give no indication of skew, which we find is a critical indicator of quality for our libraries. On the other hand, the standard 5 or 7 number summary statistics (mean, stdev, median, quantiles) do. Furthermore, they are concise, easy to visualize (boxplot) and familiar to pretty much everyone.
I agree that insert size (and lots of other NGS values) are potentially problematic since they aren't normally distributed, but that is an argument for quantile like summary statistics as opposed to just using mean and stdev or metrics which imply a balanced distribution around the median.

@nh13
Copy link
Collaborator

nh13 commented Oct 8, 2014

Thanks for your thoughtful comments and explanation.

For each new metric, could you add a descriptive comment? That comment will be shown upon release on the website. Please see http://broadinstitute.github.io/picard/picard-metric-definitions.html

Once ready, we can merge it in.

@nh13 nh13 self-assigned this Oct 8, 2014
@travc
Copy link
Author

travc commented Oct 8, 2014

Ok... I think the comments are more sensible now.
I really like the way the metric outputs are generated from the code and auto documented. Java (not a language I normally use) can be pretty nifty at times.

…case 'N'

Added: SOFT_MASKED option: Treat all lower-case letters like 'N'
Added: INTERVAL_LIST option: Interval list file specifiying regions to include in metrics comutation
Added: INTERVAL_LIST_STRING option: String of whitespace separated chrom:[start[-end]] (like samtools regions) specifying regions to include in metrics computation.  Can be used with or without the INTERVAL_LIST file option.
@travc
Copy link
Author

travc commented Oct 13, 2014

I've added a bugfix and a couple of simple (but useful) features to CollectWgsMetrics. These are more generally useful and important than the changes to CollectInsertSizeMetrics, so I'm kind of regretting not making a new fork and a separate (hopefully expedited) pull request for them.

They a documented in the commit and the code.
The only wobbly part is the parsing of the command line option INTERVAL_LIST_STRING... The code is only a few lines, but really should be a utility function someplace outside of CollectWgsMetrics.java. However, I don't know where the proper place to put it would be. Anyways, it is an optional option.

@vdauwera
Copy link
Contributor

What's the status of this pull request? If still relevant, is it waiting on us (@nh13) or @travc ?

@travc
Copy link
Author

travc commented Apr 27, 2015

I haven't been keeping up with the code, but think it was pretty much done from my POV.

@travc travc closed this Apr 27, 2015
@yfarjoun
Copy link
Contributor

Thanks for the bug fix and the extra options.

I am wondering what is the use-case for an interval list for WGS metrics?

cheers,

Yossi.

On Sun, Apr 26, 2015 at 7:24 PM, Geraldine Van der Auwera <
notifications@github.com> wrote:

What's the status of this pull request? If still relevant, is it waiting
on us (@nh13 https://github.com/nh13) or @travc
https://github.com/travc ?


Reply to this email directly or view it on GitHub
#87 (comment).

@travc
Copy link
Author

travc commented Apr 27, 2015

I'm mapping to a fairly gappy ref. It also includes some scaffolds which are unplaced as well as a male-only chromosome. When trying to evaluate the quality of mapping, it really only makes sense to look at the "good" portions of the reference. It is also a big speedup to just compute metrics for a few characteristic regions.

Generally, the use case will be when you want to metrics for different regions... probably different sorts or regions. So one might run metrics on a just some relatively conserved regions. Alternatively, one might want to look at the metrics just for a region (or some regions) which are highly variable to check that reference is close enough to generate reasonable results.

I can think of lots of other cases when you want metrics just on specific regions, but I think you get the idea. The interval list implementation seemed to be the most general way of doing it as well as fitting in with how other picard tools does it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants