GATK CNV on WGS. #2858

Closed · droazen opened this issue Jun 5, 2017 · 43 comments · Fixed by #3913
@droazen (Collaborator) commented Jun 5, 2017

@LeeTL1220 commented on Mon May 23 2016

  • File issues for evaluations (and make sure each evaluation is described, including ground truth). Make sure that there is a milestone and assignee for each.

Need to show some measure of performance for the WGS evaluation (GATK CNV only, not ACNV).

@droazen droazen mentioned this issue Jun 5, 2017
@samuelklee samuelklee added this to Legacy - Improvements / Miscellaneous in Copy Number Variants - General Jun 12, 2017
@samuelklee samuelklee self-assigned this Jul 9, 2017
@samuelklee (Contributor) commented Jul 9, 2017

Reviving this. This will essentially be a major refactor/rewrite of CreatePanelOfNormals to make it scalable enough to handle WGS.

  • CombineReadCounts is too cumbersome for large matrices. Change CreatePanelOfNormals to take in multiple -I instead.
  • Rename NormalizeSomaticReadCounts to DenoiseReadCounts and require integer read counts as input. These will still be backed by a ReadCountCollection until @asmirnov239's changes are in.
  • Remove optional outputs (factor-normalized and beta-hats) from DenoiseReadCounts. For now, TN and PTN output will remain in the same format (log2) to maintain compatibility with downstream tools.
  • Maximum number of eigensamples K to retain in the PoN is specified; the smaller of this or the number of samples remaining after filtering is used. The number actually used to denoise can be specified in DenoiseReadCounts. If we are going to spend energy computing K eigensamples, there is no reason we shouldn't expose all of them in the PoN, even if we don't want to use all of them for denoising. (Also, the current SVD utility methods do not allow for specification of K < N when performing SVD on an MxN matrix, even though the backend implementations that are called do allow for this; this is terrible. In any case, randomized SVD should be much faster than the currently available implementations, even when K = N).
  • Rename CreatePanelOfNormals to CreateReadCountPanelOfNormals
  • Refer to "targets" as intervals. See #3246.
  • Remove QC.
  • Refer to proportional coverage as fractional coverage.
  • Perform optional GC-bias correction internally if annotated intervals are passed as input.
  • Make the standardization process for panel and case samples identical. Currently, a sample mean is taken at one point in the PoN standardization process, while a sample median is taken in the case standardization process. (A rough sketch of the overall standardize-and-project denoising appears after this list.)
  • HDF5 PoN will store version number, all integer read counts, all/panel intervals, all/panel sample paths/names, all annotated intervals (if GC-bias correction was performed), fractional-coverage medians for all intervals, relevant SVD results (eigenvalues and left-singular vectors) for the specified number of eigensamples, and command line.
  • In a future iteration, we could allow an input PoN to be the source of read counts. This would allow iteration on filter parameters without needing output from CombineReadCounts. The code should easily allow for this.
  • ReadCountCollection is too memory intensive; minimize use in DenoiseReadCounts when writing results.
  • Optimize and clean up HDF5 writing (e.g., transpose skinny matrices, make writing of intervals as a string matrix faster).
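
For concreteness, here's a rough numpy sketch of the standardize-and-project denoising referenced above. The exact normalization steps (fractional coverage, PoN interval medians, log2, sample-median centering) and the function shape are my own shorthand here, not the tool's actual code:

```python
import numpy as np

def denoise_sample(counts, pon_interval_medians, eigensamples):
    """Hedged sketch of PCA-based denoising against a PoN.

    counts: integer read counts for one case sample, shape (n_intervals,)
    pon_interval_medians: fractional-coverage medians stored in the PoN
    eigensamples: left-singular vectors from the PoN SVD, shape (n_intervals, K)
    """
    # Standardize: fractional coverage -> normalize by PoN interval medians
    # -> log2 -> center at the sample median.
    fractional = counts / counts.sum()
    standardized = np.log2(fractional / pon_interval_medians)
    standardized -= np.median(standardized)

    # Denoise: subtract the projection onto the span of the K eigensamples.
    U = eigensamples
    denoised = standardized - U @ (U.T @ standardized)
    return standardized, denoised
```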
@samuelklee samuelklee changed the title from "WGS workflow evaluation" to "GATK CNV on WGS." Jul 10, 2017
@samuelklee (Contributor) commented Jul 11, 2017

@vruano may also have some insight on the PCOV vs RAW issue.

@samuelklee (Contributor) commented Jul 11, 2017

FYI @sooheelee I pointed you at the docs recently, but realized they're slightly out of date. CreatePanelOfNormals currently expects proportional coverage, upon which it initially performs a series of filtering steps. The docs state that these steps are performed on integer counts, which is incorrect. The fact that filtering yields different post-filter coverage for the two types of input ultimately results in slightly different denoising. Not a big deal, but we should decide what the actual method should be doing and clarify in the code/docs.

@sooheelee (Contributor) commented Jul 12, 2017

I'll keep an eye out for the "integer counts" passage in the docs and change this to reflect our recommendations.

@samuelklee (Contributor) commented Jul 19, 2017

Can someone explain this pseudoinverse business to me? We standardize the counts to give a matrix C, calculate the SVD to get the left-singular matrix U and the pseudoinverse of C, but then perform another SVD on U to get the pseudoinverse of U. Why do we need these pseudoinverses? @LeeTL1220 @davidbenjamin?

@samuelklee (Contributor) commented Jul 19, 2017

The memory footprint used by ReadCountCollectionUtils.parse is extremely large compared to the size of the file. I suspect there may be memory leaks in TableReader.

Speed is also an issue; pandas is 4x faster (taking ~10s on my desktop) on a 300bp-bin WGS read-count file with ~11.5 million rows if all columns are read, and 10x faster if the Target name field (which is pretty much useless) is ignored.

(Also, for some reason, the speed difference is much greater when running within my Ubuntu VM on my Broad laptop; what takes pandas ~15s to load takes ReadCountCollectionUtils.parse so long that I just end up killing it... not sure why.)
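
For reference, the pandas timing above was from something like the following; the column names here are assumptions about the legacy read-count TSV layout, and skipping the name column via usecols is what gives the extra speedup:

```python
import pandas as pd

# Read an ~11.5M-row read-count TSV; omitting the (essentially useless)
# target-name column avoids materializing millions of strings.
counts = pd.read_csv(
    "sample.readCounts.tsv",
    sep="\t",
    comment="#",                                   # skip comment/header lines
    usecols=["contig", "start", "stop", "count"],  # drop the "name" column
)
```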

@vruano (Contributor) commented Jul 19, 2017

TableReader is a dumb one-record-at-a-time reader, so it shouldn't suffer from memory leaks.

In contrast, the parse() method uses an incremental "buffer" that accumulates the counts until the end, when the actual returned table is created... The reason for this is to keep the ReadCountCollection class immutable.

So at some point you need at least twice the amount of memory compared to a solution that would simply use the returned object as the accumulator.

@samuelklee (Contributor) commented Jul 19, 2017

You might be right---and I think it's worse, in that the ReadCountCollection argument validation (for uniqueness of targets, etc.) also adds to the overhead.

@vruano (Contributor) commented Jul 19, 2017

By default we aim for correctness over efficiency... Given what you're saying about the target names being useless, perhaps you should consider generalizing ReadCountCollection so that in some subimplementations the targets are implicit, based on their index in the collection; tools that don't care about target names would then run much faster.

So RCC could be an interface rather than a concrete class: one child class would implement the current RCC behavior, whereas a new child class would implement a more efficient solution for WGS fixed-size interval collections.

@vruano (Contributor) commented Jul 19, 2017

As for TableReader... we could make it a bit more efficient by reusing DataLine instances. Currently it creates one per input line, but the same instance could be reused by loading each new line's data onto it before calling createRecord.

We delegate to an external library to parse the lines into String[] arrays (one element per cell)... We could save on that by implementing it ourselves more efficiently, but of course that would take up someone's precious development time...

In any case, I don't know how much we would gain, considering that these operations are done close to I/O, which typically dominates the time cost.

The DataLine reuse might save some memory churn and wouldn't take long to code.

@samuelklee (Contributor) commented Jul 21, 2017

First cut at a rewrite seems to be working fine and is much, much leaner.

Building a small PoN with 4 simulated normals with 5M bins each, CombineReadCounts took ~1 min, CreatePanelOfNormals (with no QC) took ~4.5 minutes (although ~1 minute of this is writing target weights, which I haven't added to the new version yet) and generated a 2.7GB PoN, and NormalizeSomaticReadCounts took ~8 minutes (~7.5 minutes of which was spent composing/writing results, thanks to overhead from ReadCountCollection).

In comparison, the new CreateReadCountPanelOfNormals took ~1 minute (which includes combining read-count files, which takes ~30s of I/O) and generated a 520MB PoN, and DenoiseReadCounts took ~30s (~10s of which was composing/writing results, as we are still forced to generate two ReadCountCollections).

Resulting PTN and TN copy ratios were identical down to 1E-16 levels. Differences are only due to removing the unnecessary pseudoinverse computation. Results after filtering and before SVD are identical, despite the code being rewritten from scratch to be more memory efficient (e.g., filtering is performed in place)---phew!

If we stored read counts as HDF5 instead of as plain text, this would make things much faster. Perhaps it would be best to make TSV an optional output of the new coverage collection tool. As a bonus, it would then only take a little bit more code to allow previous PoNs to be provided via -I as an additional source of read counts.

Remaining TODO's:

  • Allow DenoiseReadCounts to be run without a PoN. This will just perform standardization and optional GC correction. This gives us the ability to run the case sample pipeline without a PoN, which will give users more options and might be good enough if the data is not too noisy.
  • Actually, I'm not sure why we perform SVD on the intervals x samples matrix and take the left-singular vectors. Typically, SVD is performed on the samples x intervals matrix and the right-singular vectors are taken, which saves the extra computation needed to calculate the left-singular vectors. (EDIT: Actually, it looks like Spark's SVD is faster on tall-and-skinny matrices, which might be because the underlying implementation calls Fortran code. I still think that representing samples as row vectors has some benefits, so I've changed things to reflect this; I now just take a transpose before performing the SVD, so that we still operate on the same intervals x samples matrix.) This will also save us some transposing, which we do anyway to make HDF5 writes faster.
  • Change HDF5 matrix writing to allow matrices with NxM > MAX_INT, which can be done naively by chunking and writing to multiple HDF5 subdirectories. This will allow for smaller bin sizes. (EDIT: I implemented this in a way that allows one to set the maximum number of values allowed per chunk, so that heap usage can be controlled, but the downside is that this translates into a corresponding limit on the number of columns (i.e., intervals). On the other hand, you could theoretically crank this number up to Integer.MAX_VALUE, as long as you set -Xmx high enough... In practice, it's very unlikely that we'll need to go to bins smaller than a read length.)
  • Check that CreatePanelOfNormals works correctly on a Spark cluster. Implement randomized SVD, which should give better performance on large matrices; see https://arxiv.org/pdf/1007.5510.pdf and https://research.fb.com/fast-randomized-svd/. For now, I'll require that the coverage matrix fit in RAM, but more sophisticated versions of the algorithm could be implemented in the future. (A sketch of the randomized approach appears after this list.)
  • Update the methods doc. Note that some of the CNV section is out of date and incorrect. In particular, we have been taking in PCOV as input to CreatePanelOfNormals for some time now, but the doc states that we take integer read counts. This already yields different results at the first filtering step (filtering intervals by interval median). @LeeTL1220 @davidbenjamin what is the "official ReCapSeg" behavior, and do we want to keep the current behavior? In general, I think all of the standardization (i.e., filtering/imputation/truncation/transformation) steps could stand some revisiting.
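
A minimal sketch of the randomized SVD from the references above (Halko et al.); the oversampling and power-iteration counts are illustrative defaults, and a production version would re-orthonormalize between power iterations:

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of an m x n matrix A via random projection."""
    rng = np.random.default_rng(seed)
    # Sample the range of A with a Gaussian test matrix.
    Q = A @ rng.standard_normal((A.shape[1], k + n_oversamples))
    # A few power iterations sharpen the approximation for noisy spectra.
    for _ in range(n_iter):
        Q = A @ (A.T @ Q)
    Q, _ = np.linalg.qr(Q)
    # Exact SVD of the small (k + n_oversamples) x n projected matrix.
    U_proj, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_proj
    return U[:, :k], s[:k], Vt[:k]
```

The cost is dominated by a handful of matrix-matrix products against A, which is why this scales so much better than a full SVD when K << N.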

Evaluation:

  • Revisit standardization procedure by checking with simulated data. We should make sure that the centering of the data does not rescale the true copy ratio.
  • Investigate the effect of keeping duplicates. I am still not sure why we do this, and it may have a more drastic impact on WGS data. Turns out we don't keep duplicates for WGS; see #3367.
  • Check that GC-bias-correction+PCA and PCA-only perform comparably, even at small bin sizes (~300bp). From what I've seen, this is true for larger bin sizes (~3kbp), so explicit GC-bias correction may not be necessary. (That is, even at these (purportedly) large bin sizes, the effect of the read-based GC-bias correction is obvious for those samples where it is important. However, the end result is not very different from PCA-only denoising with no GC-bias correction performed.)
  • Check that changing the CBS alpha parameter sufficiently reduces hypersegmentation. Looks like the hybrid p-value calculation in DNAcopy is not accurate enough to handle WGS-size data. (Also, it's relatively slow, taking ~30 minutes on ~10M intervals.) Even if I set alpha to 0, I still get a ridiculous number of segments! So I think it's finally time to scrap CBS. I'll look into other R segmentation packages that might give us a quick drop-in solution, but we may want to roll our own simple algorithm (which we would scrap anyway once the coverage model is in for somatic). I've implemented a fast kernel-segmentation method that seems very promising; see below.
  • Investigate performance vs. CGA ReCapSeg pipeline on THCA samples.
  • Investigate concordance with Genome STRiP.
@samuelklee (Contributor) commented Jul 25, 2017

I created a panel of normals from 90 WGS TCGA samples with 250bp (~11.5M) bins, which took ~57 minutes total and produced an 11GB PoN (this file includes all of the input read counts---which take up 20GB as a combined TSV file and a whopping 63GB as individual TSV files---as well as the eigenvectors, filtering results, etc.). The run can be found in /dsde/working/slee/wgs-pon-test/tieout/no-gc. It completed successfully with -Xmx32G (in comparison, CreatePanelOfNormals crashed after 40 minutes with -Xmx128G). The runtime breakdown was as follows:

  • ~45 minutes simply from reading of the 90 TSV read-count files in serial. Hopefully #3349 should greatly speed this up. (In comparison, CombineReadCounts reading 10 files in parallel at a time took ~100 minutes to create the aforementioned 20GB combined TSV file, creating 25+GB of temp files along the way.)

  • ~5 minutes from the preprocessing and filtering steps. We could probably further optimize some of this code in terms of speed and heap usage. (I had to throw in a call to System.gc() to avoid an OOM with -Xmx32G, which I encountered in my first attempt at the run...)

  • ~5 minutes from performing the SVD of the post-filtering 8643028 x 86 matrix, maintaining 30 eigensamples. I could write a quick implementation of randomized SVD, which I think could bring this down a bit (the scikit-learn implementation takes <2 minutes on a 10M x 100 matrix), but this can probably wait.

Clearly, making I/O faster and more space efficient is the highest priority. Luckily, it's also low-hanging fruit. The 8643028 x 30 matrix of eigenvectors takes <2 minutes to read from HDF5 when the WGS PoN is used in DenoiseReadCounts, which gives us a rough idea of how long it should take to read in the original ~11.5M x 90 counts from HDF5. So once #3349 is in, I think that a ~15 minute single-core WGS PoN could easily be viable.
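
(For a sense of why HDF5 wins here: the whole matrix comes back as a single typed array read rather than ~11.5M parsed TSV lines. A python illustration with h5py; the dataset path is hypothetical:)

```python
import h5py

# One contiguous typed read instead of parsing millions of text lines.
with h5py.File("wgs.pon.hdf5", "r") as f:
    counts = f["/read_counts/values"][:]  # hypothetical dataset path
```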

I believe that a PoN on the order of this size will be all that is required for WGS denoising, if it is not already overkill. To go bigger by more than an order of magnitude, we'll have to go out of core, which will require more substantial changes to the code. But since the real culprit responsible for hypersegmentation is CBS, rather than insufficient denoising, I'd rather focus on finding a viable segmentation alternative.

@samuelklee (Contributor) commented Jul 27, 2017

With HDF5 read-count file input enabled by #3349, a WGS PoN with 39 samples x ~1M bins (i.e., 3kb) took <1 minute to build. Thanks @LeeTL1220!

@LeeTL1220 (Contributor) commented Jul 27, 2017

Sorry, I did not see the previous comment.

@samuelklee (Contributor) commented Jul 27, 2017

Note that the number of samples and the bin size are different from those in that comment.

@LeeTL1220 (Contributor) commented Jul 27, 2017

@samuelklee I certainly agree about focusing on CBS alternative.

@samuelklee (Contributor) commented Aug 9, 2017

I've implemented the Gaussian-kernel binary-segmentation algorithm from this paper: https://hal.inria.fr/hal-01413230/document. This method uses a low-rank approximation to the kernel to obtain an approximate segmentation with linear complexity in time and space. In practice, performance is actually quite impressive!

The implementation is relatively straightforward, clocking in at ~100 lines of python. Time complexity is O(log(maximum number of segments) * number of data points) and space complexity is O(number of data points * dimension of the kernel approximation), which makes WGS use feasible.
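
Here is a condensed sketch of the idea (my actual implementation differs in details; in particular, I'm using random Fourier features here as a stand-in for the paper's low-rank kernel approximation). Each segment's kernel cost reduces to prefix sums over the feature map, so any candidate split can be scored in O(dimension):

```python
import numpy as np

def gaussian_features(x, dim=100, sigma=1.0, seed=0):
    # Random Fourier features: phi(x) . phi(y) ~= exp(-(x - y)^2 / (2 sigma^2)).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, dim)
    return np.sqrt(2.0 / dim) * np.cos(np.outer(x, w) + b)

def kernel_binary_segmentation(x, max_changepoints, dim=100, sigma=1.0):
    phi = gaussian_features(x, dim, sigma)
    psum = np.vstack([np.zeros(phi.shape[1]), np.cumsum(phi, axis=0)])
    sqsum = np.concatenate([[0.0], np.cumsum(np.sum(phi ** 2, axis=1))])

    def cost(a, b):  # kernel cost of segment [a, b), O(dim) via prefix sums
        s = psum[b] - psum[a]
        return (sqsum[b] - sqsum[a]) - s @ s / (b - a)

    segments, changepoints = [(0, len(x))], []
    for _ in range(max_changepoints):
        best = None  # (cost gain, segment index, split point)
        for i, (a, b) in enumerate(segments):
            for m in range(a + 1, b):
                gain = cost(a, b) - cost(a, m) - cost(m, b)
                if best is None or gain > best[0]:
                    best = (gain, i, m)
        if best is None:
            break  # every segment is a single point
        _, i, m = best
        a, b = segments.pop(i)
        segments[i:i] = [(a, m), (m, b)]
        changepoints.append(m)
    return sorted(changepoints)
```

(This naive version rescans every segment at each step; caching each segment's best split, as in the paper, is what yields the stated complexity.)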

Segmentation of 10^6 simulated points with 100 segments takes about a minute and tends to recover segments accurately. Compare this with CBS, where segmentation of a WGS sample with ~700k points takes ~10 minutes---and note that these ~700k points are split up amongst ~20 chromosomes to start!

There are a small number of parameters that can affect the segmentation, but we can probably find good defaults in practice. What's also nice is that this method can find changepoints in moments of the distribution other than the mean, which means that it can straightforwardly be used for alternate-allele fraction segmentation. For example, all segments were recovered in the following simulated multimodal data, even though all of the segments have zero mean:

[image: baf]

Replacing the SNP segmentation in ACNV (which performs expensive maximum-likelihood estimation of the allele-fraction model) with this method would give a significant speedup there.

Joint segmentation is straightforward and is simply given by addition of the kernels. However, complete data is still required.
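
(In terms of the sketch above, adding kernels just corresponds to concatenating feature maps, so the joint cost uses exactly the same prefix-sum machinery. A hedged fragment, reusing the gaussian_features helper from that sketch:)

```python
import numpy as np

def joint_features(copy_ratio, allele_fraction, dim=100, sigma=1.0):
    # Kernel addition == feature-map concatenation; both inputs must cover
    # the same loci (hence the complete-data requirement).
    return np.hstack([gaussian_features(copy_ratio, dim, sigma, seed=0),
                      gaussian_features(allele_fraction, dim, sigma, seed=1)])
```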

Given such a fast heuristic, I'm more amenable to augmenting this method with additional heuristics to clean up or improve the segmentation if necessary. We can also use it to initialize our more sophisticated HMM models.

@LeeTL1220 @mbabadi @davidbenjamin I'd be interested to hear your thoughts, if you have any.

@LeeTL1220 (Contributor) commented Aug 9, 2017

@samuelklee This looks really good and the paper is a great find. I am unclear on a couple of things.

  • How accurate was CBS on the 700k points? 100 ground-truth segments became how many in CBS?
  • I'm a little unclear on how we would derive a segment mean for this approach. If I am missing something obvious, then I'd ask: how did the "segment means" from this approach compare to those from CBS?
  • Would it be easier to make a version that only addressed total copy ratio segments (GATK CNV) and then implement joint segmentation later?
  • What is the name of this approach? "KernSeg"?
@mbabadi (Contributor) commented Aug 9, 2017

@samuelklee this looks awesome!

  • What variant of the algorithm did you implement? The paper lists several.
  • I haven't read the paper in detail yet, but is it possible to choose a conservatively large number of possible break points and then filter bad break points, possibly based on the rapid decline of the change point probability? i.e. does the algorithm naturally produce change point probabilities?
  • Is it possible to throw in additional change points incrementally, without doing extra work, until a certain criterion is met? (see above)
@samuelklee (Contributor) commented Aug 9, 2017

> How accurate was CBS on the 700k points? 100 ground-truth segments became how many in CBS?

The 700k came from a real WGS sample (as opposed to the simulated data with 100 segments and 1M points), so there is no ground truth there. I was just trying to get an idea of real-world runtime for CBS.

When I run CBS on the simulated data, the results are pretty much the same (both miss a changepoint where the adjacent segment means are close by chance; the runtime is also ~1.2 minutes in both cases):

CBS:
[image: 1m-cbs]

Kernel segmentation:
[image: 1m]

However, compare how CBS does on the simulated zero-mean multimodal data (terrible, as expected!):
[image: baf-cbs]

> I'm a little unclear on how we would derive a segment mean for this approach. If I am missing something obvious, then I'd ask: how did the "segment means" from this approach compare to those from CBS?

I'm pretty sure that the segment means in CBS are literally just that. So if the breakpoints are good, then the segment means are the same.

> Would it be easier to make a version that only addressed total copy ratio segments (GATK CNV) and then implement joint segmentation later?

Sure, but joint segmentation is much cheaper and easier with this method than our previous probabilistic approaches. Even SNP segmentation will be much cheaper.

> What is the name of this approach? "KernSeg"?

Not sure... I couldn't find an R package, although an R/C implementation is mentioned in the paper. But the python implementation is straightforward, and a pure Java implementation should not be so bad. My python implementation used some cythonized numpy methods, but I think equivalent implementations of these should be relatively fast in pure Java as well.

> What variant of the algorithm did you implement? The paper lists several.

I implemented what they call ApproxKSeg. It's an approximate version that combines binary segmentation with the low-rank approximation to the Gaussian kernel.

> I haven't read the paper in detail yet, but is it possible to choose a conservatively large number of possible break points and then filter bad break points, possibly based on the rapid decline of the change point probability? I.e., does the algorithm naturally produce change point probabilities?

Yes, you can oversegment and then choose which breakpoints to retain. However, there are no proper changepoint probabilities, only changepoint costs. Adding a penalty term based on the number of changepoints seems to perform relatively well in simple tests, but one could certainly devise other ways to filter changepoints (some of which could yield probabilities, if you are willing to assume a probabilistic model). I think we should just think of this as a fast, heuristic, non-parametric method for finding breakpoints in multidimensional data.
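
For example, the kind of penalized selection I have in mind (the linear penalty is just an illustration; the paper's penalty has a different form):

```python
import numpy as np

def select_num_changepoints(total_costs, penalty):
    # total_costs[k]: total kernel cost of the best segmentation found with
    # k changepoints (k = 0..K, from a deliberately oversegmented run).
    total_costs = np.asarray(total_costs, dtype=float)
    return int(np.argmin(total_costs + penalty * np.arange(len(total_costs))))
```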

> Is it possible to throw in additional change points incrementally, without doing extra work, until a certain criterion is met? (see above)

The version I implemented adds changepoints via binary segmentation. The time complexity required to split a segment is linear in the number of points contained in the segment, although some care must be taken in the implementation to ensure this.

@samuelklee (Contributor) commented Aug 9, 2017

I naively cythonized part of my python implementation to use memoryviews, resulting in a ~60x speedup!

The simulated data with 100 segments and 1M points ran in 1.3 seconds!!

THIS IS BANANAS!!!

@samuelklee (Contributor) commented Sep 29, 2017

The new pipeline is in a complete state. Nearly all tools and scripts were rewritten, many from scratch. I've tried to minimize interaction with old tools/exome code (notably, ReadCountCollection and SegmentUtils). There are still some improvements that can be made (especially in segment union and the subsequent modeling), but we should be ready to go for a first validation. Some notes:

WDL:

  • I've moved the old pipeline to somatic_old folders.
  • There is now just a matched-pair workflow and a panel workflow. We can add a single BAM case workflow or expand the matched-pair workflow to handle this, depending on the discussion at #3657.
  • WES/WGS is toggled by providing an optional target-file input.
  • For all workflows, we always collect integer read counts; for WGS, these are output as both HDF5 and TSV and the HDF5 is used for subsequent input.
  • For the case workflow, we always collect allelic counts at all sites and output as TSV.
  • We should output all data files as HDF5 by default and as TSV optionally. EDIT: This is done for CollectFragmentCounts.
  • We will need to update the workflows when @MartonKN and @asmirnov239 get PreprocessIntervals and CollectReadCounts merged, respectively. These tools will remove the awkwardness required by PadTargets and CalculateTargetCoverage/SparkGenomeReadCounts.

Denoising:

  • All parameters are exposed in the PoN creation tool (#3356).
  • Without a PoN, standardization and optional GC correction are performed (#3570).
  • Other than the minor point about the sample mean/median being used inconsistently (noted above), the denoising process is mathematically essentially the same as before ("superficial" differences include the vastly improved memory usage and I/O, the ability to adjust the number of principal components used, the removal of redundant SVDs, and the enforcement of consistent GC-bias correction).
  • That said, I'll carry over this TODO from above: Revisit standardization procedure by checking with simulated data. We should make sure that the centering of the data does not rescale the true copy ratio.
  • The only major difference is we no longer make a QC PoN or check for large events. This was performed awkwardly in the old pipeline, so I'd rather not port it over. Eventually we will do all denoising with the gCNV coverage model anyway.
  • Pre/tangent-normalization copy ratio are now referred to as standardized/denoised copy ratio.
  • Old code is still used for GC-bias correction in CreateReadCountPanelOfNormals, and we still use the AnnotateTargets tool. We should port this over (possibly as part of PreprocessIntervals) at some point (actually, I think we will be forced to, since PreprocessIntervals will output a Picard interval list, and AnnotateTargets outputs a target file).
  • Integration tests are still needed for CreateReadCountPanelOfNormals. These might not test for correctness, but we could possibly compare to old PoNs.

Segmentation/modeling:

  • Instead of separate tools for copy-ratio segmentation (PerformSegmentation) and allele-fraction segmentation/union/modeling (AllelicCNV), there is now just a single segmentation/modeling tool (ModelSegments).
  • Input is denoised copy ratio and/or allelic counts. If only one input is provided, then we model only the corresponding quantity.
  • There is no separate allele-fraction workflow. Unlike the old approach, we do not perform any genotyping or modeling before doing kernel segmentation.
  • Old code and classes are used for segment union. We should port or possibly replace this with a simple method that uses kernel segmentation. EDIT: Actually, just tried running a WGS sample and this is still a major bottleneck. EDIT 2: Hmm...actually doesn't seem to be an issue on my desktop (compared to my laptop, on which the run hangs here). Will try to track down the source of the discrepancy. EDIT 3: Added segment union based on single-changepoint detection using kernel segmentation.
  • Segment union should be replaced by a proper joint kernel segmentation. EDIT: I've added this, but there could be some minor improvements. Right now, we use only one het per copy-ratio interval and throw away those that fall off-target/bin. There are a few percent of targets/bins that have more than one het, and we could potentially rescue the off-target/bin hets as well with some care.
  • Old code and models are used for modeling. Since the old allele-fraction model only models hets, we perform a GetHetCoverage-like binomial genotyping step (and output the results) before modeling. However, instead of assuming the null hypothesis of het (f = 0.5) and accepting a site when we cannot reject the null, we assume the null hypothesis of hom (f = error rate or 1 - error rate) and accept a site when we can reject the null. This entire process is very similar to what @davidbenjamin does in #3638. We should consider combining this code (along with AllelicCount/PileupSummary) at some point. (This genotyping rule is sketched after this list.)
  • Added option to use matched normal.
  • Rather than port over the old modeling code, I would rather expand the allele-fraction model to allow for the modeling of hom sites. I wrote up such a model in some notes I sent around a few months back. This model allows for an allelic PoN that uses all sites to learn reference bias, not just hets. Depending on how our python development proceeds, I may try to implement this model using the old GibbsSampler code instead.
  • In the meantime, we can try to speed up the old allele-fraction model, which is now the main bottleneck. An easy (lazy) strategy might simply be to downsample and scale likelihoods when estimating global parameters. Addresses #2884.
  • Even though the simple copy-ratio model is much faster, it still takes ~15-20 minutes for 100 iterations on WGS, so we can downsample here too.
  • Integration tests are still needed; again, these might not test for correctness.
  • I've added the ability to specify a prior for the minor-allele fraction, which alleviates the problem of residual bias in balanced segments.
  • I've reduced the verbosity of the modeled-segments file. I only report the posterior mode and the 10th, 50th, and 90th percentiles. Global parameters have the full deciles output in the .param files, but I removed the mode and highest-density credible interval (because of the item below).
  • Some residual bias remains in the estimate of the minor-allele fraction posterior mode. This is simply because we are performing kernel density estimation of a bounded quantity. One possibility would be to logit transform to an unbounded support, perform the estimation, then transform back. EDIT: Just removed kernel density estimation for now, partly due to #3599 as well.
  • Hmm, actually still a tiny bit of residual bias. This is apparent e.g. in WGS normals. I think focusing on a new allele-fraction model rather than trying to figure out where the old one is failing would be best.
  • For small bins (250bp), the copy-ratio model is currently a bit memory intensive, since it stores an outlier-indicator boolean for every data point (it gets by with -Xmx12g for 100 iterations at 250bp). There is no easy way around storing this at the GibbsSampler level (although we could make some non-trivial changes to that code, as @davidbenjamin suggested long ago at broadinstitute/gatk-protected#195). However, I got rid of these at the CopyRatioModeller level. If we want to go down in memory, we could move to a BitSet, but I'm not sure what the performance hit will be. EDIT: It was trivial to switch over to a BitSet, which seems to let us get away with -Xmx8g instead of -Xmx12g.
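
A scipy sketch of the hom-null genotyping rule mentioned in the list above; the error rate and threshold are placeholder values:

```python
from scipy.stats import binom

def is_het(alt_count, total_count, error_rate=0.01, alpha=0.05):
    # Null hypotheses are hom-ref (alt fraction ~= error_rate) and hom-alt
    # (alt fraction ~= 1 - error_rate); call a het only when both nulls
    # can be rejected, rather than accepting hets we fail to reject.
    p_hom_ref = binom.sf(alt_count - 1, total_count, error_rate)     # P(X >= alt)
    p_hom_alt = binom.cdf(alt_count, total_count, 1.0 - error_rate)  # P(X <= alt)
    return p_hom_ref < alpha and p_hom_alt < alpha
```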

Calling:

  • I've ported over the naive ReCapSegCaller wholesale. This can take in the output of ModelSegments, so we can take advantage of the improved segmentation as before, but we still don't use the modeled minor-allele fractions when making calls. The method for copy-ratio calling is also extremely naive, with hardcoded bounds for identifying the copy-neutral state.
  • @MartonKN is going to work on an improved caller for his next project. This caller should also make simple calls (not full allelic copy number, but just 0, +, -), but should also take advantage of the copy-ratio and minor-allele fraction posteriors estimated by ModelSegments to generate quality scores.

Plotting:

  • Other than the allele-fraction model, the limiting factor was the original plotting code (some plotting runs originally took several hours for a single WGS sample...). We can now make plotting much faster with the ordering enforced by TSVLocatableCollection (see below).
  • There are now two plotting tools, PlotDenoisedCopyRatios and PlotModeledSegments. This is in contrast to the old PlotSegmentedCopyRatio and PlotACNVResults.
  • Because ModelSegments optionally takes denoised copy ratio and/or allelic counts, PlotModeledSegments outputs only the corresponding plots appropriately.
  • I added a dependency on the R package data.table to slightly speed up the reading of input files.
  • Setting pch="." also sped up the generation of scatter plots.
  • Plotting now takes a couple of minutes, most of which is I/O (#3554).
  • AAF (rather than MAF) is now plotted for allele fraction (#2957).

Other:

  • I've introduced a LocatableCollection class to unify how allelic counts, copy ratios, and segments are stored and read/written from/to TSV (#2836). Intervals are always output in lexicographical order for now, to be consistent with the old coverage collection (#2951). Once @asmirnov239's CollectReadCounts is in, we can change everything over to ordering determined by the sequence dictionary.
  • Column headers and log2 copy ratio output have been standardized throughout (#2886).
  • I've also introduced a NamedSampleFile abstract class to tag files that have #SAMPLE_NAME=... as the first comment line. For CollectAllelicCounts, this simply uses code borrowed from GetSampleName. We should unify the reading and storing of sample names at some point (#2910).
  • We will need to replace SimpleReadCountCollection (which currently serves as the interface between the old coverage collection files and the new code) with one of these subclasses when CollectReadCounts is in. We can also change NamedSampleFile depending on what he's implemented.
  • We should eventually write proper SAM headers with useful tags to all TSV and HDF5 files generated by our tools that represent annotated intervals that can be associated with a single sample.

Documentation:

  • I need to update class javadoc and example invocations throughout. The initial PR will already be quite massive, so I'll leave this until later. Perhaps @sooheelee might want to be involved?
  • I will update the white paper at some point.
@sooheelee (Contributor) commented Oct 2, 2017

Yes, I'd like to test out the new tools in the workflow. All of these changes seem major and I'm excited to see how the workflow performs. I'll stop by to chat with you.

@samuelklee (Contributor) commented Oct 7, 2017

The HCC1143 WES sample from the tutorial data runs in < 1.5 minutes and looks pretty good!

Copy-ratio only (using a PoN with 40 samples):
[image: hcc1143_t_clean no-ac modeled]

Matched-normal mode, allele-fraction only (note that there are not very many hets in this data, since the common-sites list we used was relatively small):
[image: hcc1143_t_clean matched-normal no-cr modeled]

Matched-normal mode (copy ratio + allele fraction):
[image: hcc1143_t_clean matched-normal modeled]

@sooheelee (Contributor) commented Oct 7, 2017

Thanks @samuelklee!

@samuelklee (Contributor) commented Oct 7, 2017

TCGA WGS at 250bp bins (~8.5M copy ratios after filtering during the denoising step, from ~11.5M read counts; ~840k hets after filtering on total counts < 30 and homozygosity, from ~5.5M allelic counts) took ~27 minutes to run in matched-normal mode. Results look very reasonable. You can see the result at /dsde/working/slee/CNV-1.5_TCGA_WGS/TCGA-05-4389-01A-01D-1931-08.matched-normal.modeled.png

@sooheelee (Contributor) commented Oct 10, 2017

Here's a first pass draft of the figure for the ASHG poster. For the good/bad PoN illustration at bottom, I use figures I had originally generated with what is now the older workflow.


[image: somatic_cnv]

@samuelklee (Contributor) commented Jan 4, 2018

Just noticed an updated version of the kernel segmentation paper: https://hal.inria.fr/hal-01413230v2/document (the original version referenced in this issue is at https://hal.inria.fr/hal-01413230v1/document).

A couple of noteworthy changes include the form of the penalty function and the fact that they use an estimate of the variance to normalize the data and set the kernel variance to unity (they also symmetrize the BAF, which I guess is required for the latter). I don't think either change will make a huge difference to the final segmentation.
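
For reference, the sort of preprocessing this implies; both the BAF symmetrization and the difference-based variance estimator below are my guesses at their construction, not pulled from the paper:

```python
import numpy as np

def normalize_for_unit_kernel_variance(baf):
    # Symmetrize alternate-allele fractions about 0.5, then rescale by a
    # robust difference-based noise estimate so that the kernel variance
    # can be fixed to unity.
    sym = np.abs(np.asarray(baf) - 0.5)
    sigma = np.sqrt(np.median(np.diff(sym) ** 2) / 2.0)
    return sym / sigma
```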
