Depth-based confidence intervals around copy number estimates #28
Hi Miika, here are a couple of things that come to mind that can be done with the current code base.

Modelling the reliability or confidence interval of each segment based on the underlying read depths, as you described, is really more a feature of the segmentation algorithm. For example, the recently published CODEX introduces a segmentation method based on raw read counts and their expected distribution, not entirely unlike cn.MOPS.

CNVkit has a modular approach to segmentation: just implement a wrapper for a method in

After the next point release I'm going to change the

I'm also working on replacing the internal representation of genomic intervals to use Pandas instead of NumPy structured arrays, so that .cnr and .cns files can more easily store arbitrary additional columns -- more like Bioconductor's GenomicRanges package. (If you know of an existing Python package that provides pandas-based GenomicRanges, please let me know!) This would provide a place to store the segment-level metadata other segmentation algorithms emit, if any.

If you'd like to work on something more exploratory I'd be happy to help with that, too -- just e-mail me.
@etal Thanks for the very comprehensive reply, much appreciated. I'll take a look at the

I hadn't seen the CODEX paper before; I'll give it a read. I'm not an expert in Python numerical packages, but I'll keep an eye out for any that may reproduce GenomicRanges-like functionality.
Just a quick update: I've been downsampling a publicly available cell line (pair) with a known amplification to observe what happens to the estimated copy number. When going from an average coverage of 5x down to 0.5x, the log2 copy number stays roughly the same (nice). I did a very crude estimation of the lower bound of a 95% confidence interval (if you can call it that) based on the 2.5th percentile of all the log2 values in the .cnr file for the region. Given that the individual intervals are roughly the same width for all the values, I didn't do any weighting. At 5x the lower bound is roughly 0.8 log2 units below the copy number estimate, and at 0.5x it's roughly 2 log2 units below. Nice to see the numbers support the fact that lower coverage means more variability.
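The crude percentile lower bound described above can be sketched in Python like this. It's a minimal illustration with synthetic numbers; the noise scales below are made-up stand-ins for the 5x and 0.5x data, not anything from the actual cell line:

```python
import numpy as np

def crude_lower_bound(log2_values, q=2.5):
    """Crude lower bound of a ~95% interval: the q-th percentile of the
    per-bin log2 ratios in a region (no weighting, since the bins here
    are of roughly equal width)."""
    return np.percentile(log2_values, q)

# Synthetic illustration only: noisier bins (lower coverage) widen the
# gap between the segment mean and this percentile-based lower bound.
rng = np.random.default_rng(0)
hi_cov = rng.normal(loc=1.0, scale=0.3, size=200)  # stand-in for 5x noise
lo_cov = rng.normal(loc=1.0, scale=0.8, size=200)  # stand-in for 0.5x noise
print(hi_cov.mean() - crude_lower_bound(hi_cov))
print(lo_cov.mean() - crude_lower_bound(lo_cov))
```

As in the real data, the gap between the mean and the lower bound grows as the per-bin noise increases.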
@etal I've been toying around with interval estimation (see https://github.com/etal/cnvkit/blob/master/cnvlib/segmentation/__init__.py#L148). The following code would estimate a prediction/confidence interval from the distribution of the bin means:
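The pasted snippet did not survive in this transcript (it was R code). As a rough guess at what it computed, here is a hedged Python sketch of an empirical-percentile interval over the per-bin log2 means; the function name and exact parameters are assumptions:

```python
import numpy as np

def bin_level_interval(bin_log2, alpha=0.05):
    """Empirical interval from the distribution of per-bin log2 means
    within one segment: the alpha/2 and 1 - alpha/2 percentiles."""
    lower = np.percentile(bin_log2, 100 * alpha / 2)
    upper = np.percentile(bin_log2, 100 * (1 - alpha / 2))
    return lower, upper

# Example: bins evenly spread over 0..100 give the 2.5/97.5 percentiles.
lo, hi = bin_level_interval(np.arange(101.0))
print(lo, hi)  # 2.5 97.5
```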
However, I'm not sure how best to handle this at https://github.com/etal/cnvkit/blob/master/cnvlib/segmentation/__init__.py#L47 to get the interval printed in the .cns file. Do you have any recommendations or suggestions? I've tried a few ways to inject the interval into the output, but I get cryptic errors from NumPy.
It looks like you're formatting the confidence interval as a string, so you could put it in the "gene" column of the .cns file, as it seems you were aiming to do given the line you selected. The R code you pasted could be replicated in Python using the function

Incidentally, or maybe more importantly: is the 2.5-97.5 percentile range in the observed data really the 95% confidence interval as we usually think of it?
Thanks @etal, I'll look into piggybacking the interval in the gene column and will prepare a pull request. I was thinking of using the Python percentile function too, but not all of the required information is returned by the R function, so rather than bloating the table R returns, I'd say it's easier to carry out the calculations in R.

I've been thinking about the philosophical side of the percentile range too, and what it means. On one hand, I think what I'm proposing is a sort of resampling-based confidence interval for a (one) bin, as it's based on observing the distribution of the (log2) bin means from the same segment. The bin means are, in a way, repeated measurements of the same thing (within a segment). On the other hand, it's also a prediction interval of the segment's total mean.
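The resampling interpretation above could be made literal with a bootstrap over the bins. A sketch (not code from the thread; the function name and parameters are made up):

```python
import numpy as np

def bootstrap_segment_ci(bin_log2, n_boot=2000, alpha=0.05, seed=0):
    """Resampling-based CI for the segment mean log2: resample the bins
    with replacement and take percentiles of the resampled means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(bin_log2, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    return (np.percentile(boot_means, 100 * alpha / 2),
            np.percentile(boot_means, 100 * (1 - alpha / 2)))

rng = np.random.default_rng(1)
bins = rng.normal(loc=1.0, scale=0.5, size=100)  # synthetic segment
lo, hi = bootstrap_segment_ci(bins)
print(lo, hi)
```

Unlike the raw percentile range of the bin values (which is closer to a prediction interval for a single bin), this interval shrinks as the number of bins in the segment grows.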
Which information is missing from the table returned by the R function? Is it the mapping of probes to segments? If so, it would make sense to add that. I can do so. I just used a script to discover:
I noticed yesterday too that the mean reported by CBS is literally just the mean of the values, ignoring the weights and the x-axis (I wasn't sure whether that was intentional)! I had a go with your script; here's an example output table:
Here are the results of my approach (ignoring weights and trusting the CBS means):
There are a few issues that would be good to discuss:
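One of the points above, the unweighted CBS means, could be addressed by recomputing the segment means with bin weights. A minimal sketch, assuming per-bin log2 values and weights are available as arrays (the helper name is made up):

```python
import numpy as np

def weighted_segment_mean(log2, weights):
    """Weighted mean of per-bin log2 ratios, in contrast to the plain
    unweighted mean that the CBS output reports."""
    return np.average(np.asarray(log2, dtype=float),
                      weights=np.asarray(weights, dtype=float))

# A bin with 3x the weight pulls the mean toward its value.
print(weighted_segment_mean([0.0, 1.0], [1.0, 3.0]))  # 0.75
```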
I see now. I think it might be wise to add a
About those missing probes: CNVkit's R script that runs PSCBS filters out probes with log2 value < -20. These are usually very-low-coverage targets near telomeres and centromeres, and are more likely to appear in long segments that cover a whole chromosome arm. I'll confirm manually. If that's it, then the same filter should be applied when recalculating segment means. |
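Applying the same filter when recalculating segment means could look like this sketch. The -20 cutoff comes from the comment above; the helper name is made up:

```python
import numpy as np

LOG2_FLOOR = -20  # the PSCBS wrapper drops probes below this, per above

def filtered_segment_mean(log2):
    """Mean of the bin log2 values after dropping the very-low-coverage
    probes that the R script also filters out."""
    values = np.asarray(log2, dtype=float)
    kept = values[values >= LOG2_FLOOR]
    return kept.mean() if kept.size else float("nan")

# The -30 outlier (e.g. a near-telomere bin) no longer drags the mean down.
print(filtered_segment_mean([-30.0, 1.0, 3.0]))  # 2.0
```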
Nice debugging! Agree with you on
Dropping by to say that the fact that the mean levels returned by
No confidence interval calculation yet, though.
Command-line help: show stats options in an argument group
I've added a
Fantastic, thanks @etal, I'll test these in the next few days! |
There might be a small oversight somewhere, as my .cns doesn't have segments in all chromosomes and produces the following error:
Should the positional arguments include both the .cnn and .cnr file, or just one (and not the other)? Does order matter?
The command line you used was right. Only one .cnr or .cnn file can be used at a time, and then the -s segments option is required. I fixed a bug where it would crash on .cnn files because of the missing "weight" column, so the .cnr and .cnn files that were used to create the .cns file should all work now. The missing-chromosome bug is going to be fixed when the

Did CNVkit generate this .cns file with the missing chromosomes? Any idea how that happened, or do you think it's a bug?
Thanks Eric. The data is from a small targeted panel, and it looks like even though chr16 has a bit of data, nothing was detected as having a copy number change, and thus (I believe) there are no entries in the .cns file for chr16 (or chr18, chr21 and chrY).
Upon further testing (different data) I got this error too:
Do you know what this could be about at all? |
Thanks for the clarification again! I've emailed you (first.last@ucsf.edu?) the two cases, let me know if you don't get my email. |
Got it, thanks! I'll take a look and follow up here. |
I've updated the

I'm working on a proper fix to ensure segmentation emits better-behaved .cns files.
With the change I just pushed, if you redo segmentation with the example .cnr files that previously triggered these bugs, then
Hi Eric, |
Hi @etal, I wanted to pick your brain on how to implement some kind of "confidence intervals" around the copy number estimates. Ideally there could be an interval around the point estimate that would reflect the depth of sequencing and the smoothness of the copy number signal around the interval (it could be based on, e.g., MAD, http://en.wikipedia.org/wiki/Median_absolute_deviation, with a provision for depth).
For example, a log2 copy number of 1 surely should be more meaningful in 20x coverage data compared to 1x coverage data, right?
I'm more than happy to implement something and help but wanted to hear your thoughts first.
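One concrete form of the MAD idea is sketched below. The normal-consistency constant 1.4826 and the 1.96 multiplier are conventional choices, not anything from CNVkit, and a provision for depth would still have to enter through the per-bin noise itself:

```python
import numpy as np

def mad_interval(log2, z=1.96, k=1.4826):
    """Interval around the median of the bin log2 values, with spread
    measured by the (normal-consistent) median absolute deviation."""
    values = np.asarray(log2, dtype=float)
    center = np.median(values)
    mad = k * np.median(np.abs(values - center))
    return center - z * mad, center + z * mad

# Noisy (low-coverage) bins give a wider interval around the same median.
lo, hi = mad_interval([1.0, 2.0, 3.0, 4.0, 5.0])
print(lo, hi)
```

Because the MAD is driven by the scatter of the bins, the interval naturally widens for 1x data relative to 20x data, matching the intuition in the question.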