Improvements to Mutect2 germline filtering #4509

davidbenjamin · 2018-03-07T18:25:14Z

@takutosato This uses minor allele fraction segmentation, which was already done internally in CalculateContamination, to improve tumor-only calling a lot. I also saw modest improvements in some tumor-normal validations.

Also, @chandrans @sooheelee this hopefully does away with the problems with af-of-alleles-not-in-resource by deriving a defensible default that doesn't result in all calls in tumor-only mode getting filtered.

takutosato

Finished reviewing changes to the docs. Wanted to publish these comments before I jump into the code.

takutosato · 2018-03-07T20:23:40Z

docs/mutect/mutect.tex

 \end{equation}

+The above equation, in which the factors of $\ell_t$ could cancel if we wished, is not quite right.  The tumor likelihood $\ell_t$ is the probability of the tumor data given that the allele exists in the tumor \textit{as a somatic variant}.  If the allele is in the tumor as a germline het we must modify $\ell_t$ to account for the fact that the allele fraction is determined by the ploidy -- it must be either $f_g$ or $1- f_g$with equal probability, where $f_g$ is the minor allele fraction of germline hets.  It would be awkward to recalculate the tumor likelihood constrained the allele frequency in the model to these two values, but we can estimate a correction factor as follows:  assuming that the posterior on the allele fraction in the somatic likelihoods model is fairly tight, the likelihood of $a$ alt reads out of $n$ total reads is $\binom na (1-f_t)^{n-a}f^a$, where $f_t$ is the tumor alt allele fraction.  That is, our sophisticated model that marginalizes over $f_t$ reduces to something more naive.  If the variant is a germline event, the likelihood becomes $\frac{1}{2} \binom na  \left[(1-f_g)^{n-a}f_g^a + f_g^{n-a}(1-f_g)^a \right]$.  Thus, in case (1) we have $\ell_t \rightarrow \chi \ell_t$, where


I think you meant "It would be awkward to recalculate the tumor likelihood with the allele frequencies constrained to these two values", or something along those lines
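
For readers following the derivation, the correction factor implied by the two likelihoods quoted above can be written out. This is a sketch reconstructed from the quoted paragraph, not copied from the final docs: it takes $\chi$ to be the ratio of the germline likelihood to the naive somatic likelihood, and it reads the $f^a$ in the quoted text as $f_t^a$. The binomial coefficient $\binom na$ appears in both and cancels.

```latex
\begin{equation}
\chi \approx \frac{\frac{1}{2} \left[ (1-f_g)^{n-a} f_g^a + f_g^{n-a} (1-f_g)^a \right]}{(1-f_t)^{n-a} f_t^a}
\end{equation}
```

When $f_t$ is close to $f_g$ or $1 - f_g$, $\chi$ is of order one half; when the tumor allele fraction is far from both, $\chi$ becomes very small and the germline hypothesis is penalized accordingly.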

takutosato · 2018-03-07T21:42:44Z

docs/mutect/mutect.tex

 To filter, we set a threshold on this posterior probability.

+So far we have assumed that the population allele frequency $f$ is known, which is the case if it is found in our germline resource, such as gnomAD.  If $f$ is not known we must make a reasonable guess as follows.  Suppose the prior distribution on $f$ is ${\rm Beta}(\alpha, \beta)$.  The mean $\alpha/(\alpha +\beta)$ of this prior is the average human heterozygosity $\theta \approx 10^{-3}$, so we have $\beta \approx \alpha / \theta$.  We need one more constraint to determine $\alpha$ and $\beta$, and since we are concerned with imputing $f$ when $f$ is small we use a condition based on rare variants.  Specifically, the number of variant alleles $n$ at some site in a germline resource with $N/2$ samples, hence $N$ chromosomes, is given by $f \sim {\rm Beta}(\alpha, \beta), n \sim {\rm Binom}(N,f)$.  That is, $n \sim {\rm BetaBinom}(\alpha, \beta, N)$.  The probability of a site being non-variant in every sample is then $P(n = 0) = {\rm BetaBinom}(0 | \alpha, \beta, N)$, which we equate to the empirical proportion of non-variant sites in our resource, about $7/8$ for exonic sites in gnomAD.  Solving, we obtain approximately $\alpha = 0.01, \beta = 10$ for gnomAD.  Now, given that some allele found by Mutect is not in the resource, the posterior on $f$ is ${\rm Beta}(\alpha, \beta + N)$, the mean of which is, since $\beta << N$, about $\alpha / N$.  By default, Mutect uses this value.


which we equate to the empirical proportion of non-variant sites in our resource, about $7/8$ for exonic sites in gnomAD

Does this mean that in gnomAD there's a variant every 8 bp within the exome?

That's right.
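
As a rough numerical check of the derivation quoted above, here is a minimal Java sketch that evaluates $P(n = 0)$ for the Beta-binomial and the posterior mean of $f$ for an allele absent from the resource. The class and method names are hypothetical and not part of GATK. The value `numChromosomes = 200_000` is an assumption chosen for illustration so that $\alpha / N$ equals $5 \times 10^{-8}$; the quoted docs do not state $N$. The probability is computed from the identity $P(n=0) = \prod_{i=0}^{N-1} (\beta + i)/(\alpha + \beta + i)$, which avoids needing a log-gamma function.

```java
/** Sketch: Beta-binomial default for the allele frequency of alleles absent from a germline resource. */
final class MissingAlleleFrequencyPrior {

    /**
     * P(n = 0) under n ~ BetaBinom(alpha, beta, N), written as the product
     * prod_{i=0}^{N-1} (beta + i) / (alpha + beta + i) and accumulated in log space.
     */
    static double probabilityOfNoVariantAlleles(final double alpha, final double beta, final int numChromosomes) {
        double logProbability = 0.0;
        for (int i = 0; i < numChromosomes; i++) {
            logProbability += Math.log(beta + i) - Math.log(alpha + beta + i);
        }
        return Math.exp(logProbability);
    }

    /** Mean of the posterior Beta(alpha, beta + N) after seeing zero variant alleles in N chromosomes. */
    static double posteriorMeanGivenAbsent(final double alpha, final double beta, final int numChromosomes) {
        return alpha / (alpha + beta + numChromosomes);
    }

    public static void main(final String[] args) {
        final double alpha = 0.01;           // value quoted in the derivation
        final double beta = 10.0;            // value quoted in the derivation
        final int numChromosomes = 200_000;  // ASSUMPTION: illustrative N, not stated in the docs

        System.out.printf("P(n = 0)       = %.4f%n", probabilityOfNoVariantAlleles(alpha, beta, numChromosomes));
        System.out.printf("posterior mean = %.3e%n", posteriorMeanGivenAbsent(alpha, beta, numChromosomes));
        System.out.printf("alpha / N      = %.3e%n", alpha / numChromosomes);
    }
}
```

Under these assumed inputs the probability of a non-variant site comes out at roughly 0.9, in the neighborhood of the quoted $7/8$, and the posterior mean is nearly indistinguishable from $\alpha / N$ because $\beta \ll N$.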

takutosato

Thanks for your patience. A few comments. Back to you

takutosato · 2018-03-09T00:46:08Z

...n/java/org/broadinstitute/hellbender/tools/walkers/contamination/CalculateContamination.java


+    private double calculateMinorAlleleFraction(List<PileupSummary> segment) {


make segment final

done in two places

takutosato · 2018-03-09T01:08:47Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/M2ArgumentCollection.java

-                    "1/(2* number of samples in resource) if a germline resource is available; otherwise an average " +
-                    "heterozygosity rate such as 0.001 is reasonable.", optional = true)
-    public double afOfAllelesNotInGermlineResource = 0.001;
+            doc="Population allele fraction assigned to alleles not found in germline resource.  Please see docs/mutect2.pdf for" +


docs/mutect2.pdf -> docs/mutect/mutect2.pdf

takutosato · 2018-03-09T01:16:48Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/M2ArgumentCollection.java

-    public double afOfAllelesNotInGermlineResource = 0.001;
+            doc="Population allele fraction assigned to alleles not found in germline resource.  Please see docs/mutect2.pdf for" +
+                    "a derivation of the default value.", optional = true)
+    public double afOfAllelesNotInGermlineResource = 0.00000005;


5e-8 may be easier to read

takutosato · 2018-03-09T15:12:07Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/Mutect2FilteringEngine.java

@@ -126,20 +130,48 @@ private void applyReadPositionFilter(final M2FiltersArgumentCollection MTFAC, fi
        }
    }

+    private void applyGermlineVariantFilter(final M2FiltersArgumentCollection MTFAC, final VariantContext vc, final VariantContextBuilder vcb) {
+        if (vc.hasAttribute(GATKVCFConstants.TUMOR_LOD_KEY) && vc.hasAttribute(GATKVCFConstants.POPULATION_AF_VCF_ATTRIBUTE)) {


This makes population allele frequency not optional. I was under the impression that it is.

In Mutect2 it assigns a population AF to alleles that aren't in gnomAD, so if there's nothing by this point it means they weren't running our pipeline and we have no idea what's going on.

I see, thanks

takutosato · 2018-03-09T15:22:41Z

docs/mutect/mutect.tex


 We can determine the posterior probability that the variant exists in the normal genotype by calculating the unnormalized probabilities of four possibilities:
 \begin{enumerate}
-\item The variant exists is both the normal and the tumor samples.  This has unnormalized probability $\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi)$.
+\item The variant exists in the tumor and the normal as a germline het.  This has unnormalized probability $2f(1-f) \ell_n \ell_t (1 - \pi)$.
+\item The variant exists in the tumor and the normal as a germline hom alt.  This has unnormalized probability $f^2 \ell_n \ell_t (1 - \pi)$.


The normal likelihood ell_n assumes that the variant is het, and I'm not sure that we can use it for the hom alt case

ell_n is the total likelihood to be present, as either het or hom alt, in the normal. I added a footnote that clarifies this.

One could argue that we lose some power by not modeling the normal more precisely, i.e. the model as it stands allows a variant to be a germline het in the normal and a germline hom alt in the tumor. However, when we have a matched normal and ell_n isn't small we always filter the variant regardless of anything else and the details of the model don't matter. They only matter when we're in tumor-only mode (or have no coverage in the normal), in which case ell_n is basically deactivated.

Got it, the variant won't even be emitted when normal is hom alt.

But in the code, \ell_n seems to be coming from SomaticGenotypingEngine::diploidAltLog10Odds, which assumes that the germline variant is het, so wouldn't it be incorrect to say that \ell_n is the total likelihood for het or hom alt?

I see your point. That method is the log odds of het vs hom ref, but if the variant is hom alt these log odds will also be overwhelmingly large, because het is so much closer to the truth than hom ref. So, in a slightly sloppy way, \ell_n really is coming from both het and hom alt.

As a germline genotyper it's primitive, but for Mutect's purposes it's good enough.
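
The point about het-vs-hom-ref odds can be illustrated with a toy calculation. The sketch below is not `SomaticGenotypingEngine::diploidAltLog10Odds`; it is a simplified binomial model with hypothetical names, in which a het has alt fraction 1/2 and a hom ref site produces alt reads only at an assumed error rate of 0.001.

```java
/** Sketch: why het-vs-hom-ref log odds are also large when the normal is really hom alt. */
final class HetVersusHomRefOdds {

    /** log10 of f^alt * (1 - f)^ref; the binomial coefficient is dropped because it cancels in the odds. */
    static double log10Likelihood(final int refCount, final int altCount, final double altFraction) {
        return altCount * Math.log10(altFraction) + refCount * Math.log10(1 - altFraction);
    }

    /** log10 odds of het (alt fraction 1/2) versus hom ref (alt reads arise only from errors). */
    static double log10HetVsHomRefOdds(final int refCount, final int altCount, final double errorRate) {
        return log10Likelihood(refCount, altCount, 0.5) - log10Likelihood(refCount, altCount, errorRate);
    }

    public static void main(final String[] args) {
        final double errorRate = 0.001; // ASSUMPTION: illustrative per-read error rate

        // A het-looking pileup: 15 ref reads, 15 alt reads.
        System.out.println("het-like pileup:     " + log10HetVsHomRefOdds(15, 15, errorRate));
        // A hom-alt-looking pileup: 0 ref reads, 30 alt reads.
        System.out.println("hom-alt-like pileup: " + log10HetVsHomRefOdds(0, 30, errorRate));
    }
}
```

In this toy model the hom-alt-like pileup yields even larger odds than the het-like one, because every additional alt read favors het over hom ref. That is the sense in which a het-vs-hom-ref score also flags hom alt sites.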

takutosato · 2018-03-09T16:17:33Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/Mutect2FilteringEngine.java

+                final double log10GermlineAltMajorLikelihood = refCount * Math.log10(maf) + altCounts[n] * Math.log10(1 - maf);
+                final double log10GermlineLikelihood = MathUtils.LOG10_ONE_HALF + MathUtils.log10SumLog10(log10GermlineAltMinorLikelihood, log10GermlineAltMajorLikelihood);
+
+                final double f = altAlleleFractions[n];


IntelliJ tells me f is not used
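
For context, the two lines in the hunk above compute the germline het likelihood as an equal mixture of "alt is the minor allele" and "alt is the major allele". A self-contained sketch of that calculation follows. The class name is hypothetical, and `log10SumLog10` here is a local stand-in written for this example rather than the GATK `MathUtils` method. The binomial coefficient is omitted, as in the hunk, and `maf` is assumed to lie strictly between 0 and 1.

```java
/** Sketch: log10 likelihood of read counts under a germline het with minor allele fraction maf. */
final class GermlineHetLikelihood {
    static final double LOG10_ONE_HALF = Math.log10(0.5);

    /** Numerically stable log10(10^a + 10^b); a local helper, not the GATK implementation. */
    static double log10SumLog10(final double a, final double b) {
        final double max = Math.max(a, b);
        if (max == Double.NEGATIVE_INFINITY) {
            return max;
        }
        return max + Math.log10(Math.pow(10, a - max) + Math.pow(10, b - max));
    }

    /** Equal mixture of alt-minor (alt fraction maf) and alt-major (alt fraction 1 - maf). */
    static double log10GermlineLikelihood(final int refCount, final int altCount, final double maf) {
        final double log10AltMinor = altCount * Math.log10(maf) + refCount * Math.log10(1 - maf);
        final double log10AltMajor = refCount * Math.log10(maf) + altCount * Math.log10(1 - maf);
        return LOG10_ONE_HALF + log10SumLog10(log10AltMinor, log10AltMajor);
    }

    public static void main(final String[] args) {
        // Example counts and maf are illustrative only.
        System.out.println(log10GermlineLikelihood(12, 8, 0.4));
    }
}
```

Working in log10 space with a max-shifted sum keeps the mixture stable at high depth, where the individual likelihoods underflow as plain doubles.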

takutosato · 2018-03-09T16:29:11Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/Mutect2FilteringEngine.java


 /**
 * Created by David Benjamin on 9/15/16.
 */
 public class Mutect2FilteringEngine {
+    public static final double MIN_ALLELE_FRACTION_FOR_GERMLINE_HET = 0.9;


Should this be a MAX instead of MIN?

Equivalently, it should be a HOM_ALT instead of HET, and that phrasing has a more direct link to the docs. Fixed it.

takutosato · 2018-03-09T17:44:45Z

I also noticed a couple of things in the docs for CalculateContamination that are not in this PR:

- "with peaks for hom ref, alt minor het, alt major het, and hom ref": the second hom ref should probably be hom alt?
- (at the end of the section) "ho alt sites" -> "hom alt sites"

davidbenjamin · 2018-03-09T18:09:18Z

Fixed both of the things you noticed in the CalculateContamination docs.

davidbenjamin · 2018-03-09T18:18:23Z

@takutosato back to you.

codecov-io · 2018-03-09T19:06:44Z

Codecov Report

Merging #4509 into master will increase coverage by 0.008%.
The diff coverage is 89.706%.

@@               Coverage Diff               @@
##              master     #4509       +/-   ##
===============================================
+ Coverage     79.095%   79.104%   +0.008%     
+ Complexity     16631     16621       -10     
===============================================
  Files           1049      1050        +1     
  Lines          60115     59747      -368     
  Branches        9856      9792       -64     
===============================================
- Hits           47548     47262      -286     
+ Misses          8751      8682       -69     
+ Partials        3816      3803       -13

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...ute/hellbender/utils/variant/GATKVCFConstants.java | `80% <ø> (ø)` | `4 <0> (ø)` | ⬇️ |
| ...der/tools/walkers/mutect/M2ArgumentCollection.java | `100% <ø> (ø)` | `1 <0> (ø)` | ⬇️ |
| ...bender/tools/walkers/mutect/FilterMutectCalls.java | `95.833% <100%> (ø)` | `7 <0> (ø)` | ⬇️ |
| .../walkers/mutect/GermlineProbabilityCalculator.java | `90.323% <100%> (+1.037%)` | `12 <3> (+3)` | ⬆️ |
| ...e/hellbender/utils/variant/GATKVCFHeaderLines.java | `99.291% <100%> (+0.005%)` | `10 <0> (ø)` | ⬇️ |
| .../tools/walkers/mutect/SomaticGenotypingEngine.java | `92.958% <100%> (-0.285%)` | `56 <0> (-2)` | |
| ...ls/walkers/mutect/M2FiltersArgumentCollection.java | `100% <100%> (ø)` | `1 <0> (ø)` | ⬇️ |
| ...lkers/contamination/MinorAlleleFractionRecord.java | `84.091% <84.091%> (ø)` | `5 <5> (?)` | |
| ...r/tools/walkers/mutect/Mutect2FilteringEngine.java | `82.759% <89.474%> (+1.159%)` | `40 <7> (+4)` | ⬆️ |
| .../walkers/contamination/CalculateContamination.java | `95.522% <93.75%> (-0.345%)` | `41 <5> (+3)` | |

... and 47 more

takutosato · 2018-03-09T19:10:00Z

I'd love to talk about \ell_n in person on Monday, but merging the PR doesn't need to wait.

davidbenjamin added this to the Popularize Mutect 2 at the Broad milestone Mar 7, 2018

davidbenjamin assigned takutosato Mar 7, 2018

davidbenjamin requested a review from takutosato March 7, 2018 18:25

takutosato reviewed Mar 7, 2018

takutosato requested changes Mar 9, 2018

takutosato approved these changes Mar 9, 2018

davidbenjamin force-pushed the db_m2_germline_filter branch from 363d694 to 45ffaae Compare March 9, 2018 20:40

CalculateContamination emits its segmentation and Mutect2 germline mo…

86c166d

…del uses it

davidbenjamin force-pushed the db_m2_germline_filter branch from 45ffaae to 86c166d Compare March 9, 2018 20:43

davidbenjamin merged commit c4f8cdc into master Mar 11, 2018

davidbenjamin deleted the db_m2_germline_filter branch March 11, 2018 01:06
