Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to Mutect2 germline filtering #4509

Merged
merged 1 commit into from Mar 11, 2018

Conversation

davidbenjamin
Copy link
Contributor

@takutosato This uses minor allele fraction segmentation, which was already done internally in CalculateContamination, to improve tumor-only calling a lot. I also sw modest improvements in some tumor-normal validations.

Also, @chandrans @sooheelee this hopefully does away with the problems with af-of-alleles-not-in-resource by deriving a defensible default that doesn't result in all calls in tumor-only mode getting filtered.

@davidbenjamin davidbenjamin added this to the Popularize Mutect 2 at the Broad milestone Mar 7, 2018
Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reviewing changes to the docs. Wanted to publish these comments before I jump into the code.

\end{equation}

The above equation, in which the factors of $\ell_t$ could cancel if we wished, is not quite right. The tumor likelihood $\ell_t$ is the probability of the tumor data given that the allele exists in the tumor \textit{as a somatic variant}. If the allele is in the tumor as a germline het we must modify $\ell_t$ to account for the fact that the allele fraction is determined by the ploidy -- it must be either $f_g$ or $1- f_g$with equal probability, where $f_g$ is the minor allele fraction of germline hets. It would be awkward to recalculate the tumor likelihood constrained the allele frequency in the model to these two values, but we can estimate a correction factor as follows: assuming that the posterior on the allele fraction in the somatic likelihoods model is fairly tight, the likelihood of $a$ alt reads out of $n$ total reads is $\binom na (1-f_t)^{n-a}f^a$, where $f_t$ is the tumor alt allele fraction. That is, our sophisticated model that marginalizes over $f_t$ reduces to something more naive. If the variant is a germline event, the likelihood becomes $\frac{1}{2} \binom na \left[(1-f_g)^{n-a}f_g^a + f_g^{n-a}(1-f_g)^a \right]$. Thus, in case (1) we have $\ell_t \rightarrow \chi \ell_t$, where
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant "It would be awkward to recalculate the tumor likelihood with the allele frequencies constrained to these two values", or something along those lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

To filter, we set a threshold on this posterior probability.

So far we have assumed that the population allele frequency $f$ is known, which is the case if it is found in our germline resource, such as gnomAD. If $f$ is not known we must make a reasonable guess as follows. Suppose the prior distribution on $f$ is ${\rm Beta}(\alpha, \beta)$. The mean $\alpha/(\alpha +\beta)$ of this prior is the average human heterozygosity $\theta \approx 10^{-3}$, so we have $\beta \approx \alpha / \theta$. We need one more constraint to determine $\alpha$ and $\beta$, and since we are concerned with imputing $f$ when $f$ is small we use a condition based on rare variants. Specifically, the number of variant alleles $n$ at some site in a germline resource with $N/2$ samples, hence $N$ chromosomes, is given by $f \sim {\rm Beta}(\alpha, \beta), n \sim {\rm Binom}(N,f)$. That is, $n \sim {\rm BetaBinom}(\alpha, \beta, N)$. The probability of a site being non-variant in every sample is then $P(n = 0) = {\rm BetaBinom}(0 | \alpha, \beta, N)$, which we equate to the empirical proportion of non-variant sites in our resource, about $7/8$ for exonic sites in gnomAD. Solving, we obtain approximately $\alpha = 0.01, \beta = 10$ for gnomAD. Now, given that some allele found by Mutect is not in the resource, the posterior on $f$ is ${\rm Beta}(\alpha, \beta + N)$, the mean of which is, since $\beta << N$, about $\alpha / N$. By default, Mutect uses this value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which we equate to the empirical proportion of non-variant sites in our resource, about $7/8$ for exonic sites in gnomAD

Does this mean that in gnomAD there's a variant every 8 bp within the exome?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right.

Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience. A few comments. Back to you


private double calculateMinorAlleleFraction(List<PileupSummary> segment) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make segment final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in two places

"1/(2* number of samples in resource) if a germline resource is available; otherwise an average " +
"heterozygosity rate such as 0.001 is reasonable.", optional = true)
public double afOfAllelesNotInGermlineResource = 0.001;
doc="Population allele fraction assigned to alleles not found in germline resource. Please see docs/mutect2.pdf for" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs/mutect2.pdf -> docs/mutect/mutect2.pdf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public double afOfAllelesNotInGermlineResource = 0.001;
doc="Population allele fraction assigned to alleles not found in germline resource. Please see docs/mutect2.pdf for" +
"a derivation of the default value.", optional = true)
public double afOfAllelesNotInGermlineResource = 0.00000005;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5e-8 may be easier to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -126,20 +130,48 @@ private void applyReadPositionFilter(final M2FiltersArgumentCollection MTFAC, fi
}
}

private void applyGermlineVariantFilter(final M2FiltersArgumentCollection MTFAC, final VariantContext vc, final VariantContextBuilder vcb) {
if (vc.hasAttribute(GATKVCFConstants.TUMOR_LOD_KEY) && vc.hasAttribute(GATKVCFConstants.POPULATION_AF_VCF_ATTRIBUTE)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes population allele frequency not optional. I was under the impression that it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Mutect2 it assigns a population AF to alleles that aren't in gnomAD, so if there's nothing by this point it means they weren't running our pipeline and we have no idea what's going on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks


We can determine the posterior probability that the variant exists in the normal genotype by calculating the unnormalized probabilities of four possibilities:
\begin{enumerate}
\item The variant exists is both the normal and the tumor samples. This has unnormalized probability $\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi)$.
\item The variant exists in the tumor and the normal as a germline het. This has unnormalized probability $2f(1-f) \ell_n \ell_t (1 - \pi)$.
\item The variant exists in the tumor and the normal as a germline hom alt. This has unnormalized probability $f^2 \ell_n \ell_t (1 - \pi)$.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normal likelihood ell_n assumes that the variant is het, and I'm not sure that we can use it for the hom alt case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ell_n is the total likelihood to be present, as either het or hom alt, in the normal. I added a footnote that clarifies this.

One could argue that we use some power by not modeling the normal more precisely i.e. the model as it stands allows a variant to be a germline het in the normal and a germline hom alt in the tumor. However, when we have a matched normal and ell_n isn't small we always filter the variant regardless of anything else and the details of the model don't matter. They only matter when we're in tumor-only mode (or have no coverage in the normal), in which case ell_n is basically deactivated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, the variant won't even be emitted when normal is hom alt.

But in the code, \ell_n seems to be coming from SomaticGenotypingEngine::diploidAltLog10Odds, which assumes that the germline variant is het, so wouldn't it be incorrect to say that \ell_n is the total likelihood for het or hom alt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. That method is the log odds of het vs hom ref, but if the variant is hom alt these log odds will also be overwhelmingly large, because het is so much closer to the truth than hom ref. So really, in a slightly sloppy way, \ell_n really is coming from both het and hom alt.

As a germline genotyper it's primitive, but for Mutect's purposes it's good enough.

final double log10GermlineAltMajorLikelihood = refCount * Math.log10(maf) + altCounts[n] * Math.log10(1 - maf);
final double log10GermlineLikelihood = MathUtils.LOG10_ONE_HALF + MathUtils.log10SumLog10(log10GermlineAltMinorLikelihood, log10GermlineAltMajorLikelihood);

final double f = altAlleleFractions[n];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntelliJ tells me f is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


/**
* Created by David Benjamin on 9/15/16.
*/
public class Mutect2FilteringEngine {
public static final double MIN_ALLELE_FRACTION_FOR_GERMLINE_HET = 0.9;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a MAX instead of MIN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Equivalently, it should be a HOM_ALT instead of HET, and that phrasing has a more direct link to the docs. Fixed it.

@takutosato
Copy link
Contributor

I also noticed a couple things in the docs for Calculation Contamination that are not in this PR:

  • "with peaks for hom ref, alt minor het, alt major het, and hom ref". The second hom ref should probably be hom alt?
  • (at the end of section) ho alt sites > hom alt site

@davidbenjamin
Copy link
Contributor Author

Fixed both of the things you noticed in the CalculateContamination docs.

@davidbenjamin
Copy link
Contributor Author

@takutosato back to you.

@codecov-io
Copy link

codecov-io commented Mar 9, 2018

Codecov Report

Merging #4509 into master will increase coverage by 0.008%.
The diff coverage is 89.706%.

@@               Coverage Diff               @@
##              master     #4509       +/-   ##
===============================================
+ Coverage     79.095%   79.104%   +0.008%     
+ Complexity     16631     16621       -10     
===============================================
  Files           1049      1050        +1     
  Lines          60115     59747      -368     
  Branches        9856      9792       -64     
===============================================
- Hits           47548     47262      -286     
+ Misses          8751      8682       -69     
+ Partials        3816      3803       -13
Impacted Files Coverage Δ Complexity Δ
...ute/hellbender/utils/variant/GATKVCFConstants.java 80% <ø> (ø) 4 <0> (ø) ⬇️
...der/tools/walkers/mutect/M2ArgumentCollection.java 100% <ø> (ø) 1 <0> (ø) ⬇️
...bender/tools/walkers/mutect/FilterMutectCalls.java 95.833% <100%> (ø) 7 <0> (ø) ⬇️
.../walkers/mutect/GermlineProbabilityCalculator.java 90.323% <100%> (+1.037%) 12 <3> (+3) ⬆️
...e/hellbender/utils/variant/GATKVCFHeaderLines.java 99.291% <100%> (+0.005%) 10 <0> (ø) ⬇️
.../tools/walkers/mutect/SomaticGenotypingEngine.java 92.958% <100%> (-0.285%) 56 <0> (-2)
...ls/walkers/mutect/M2FiltersArgumentCollection.java 100% <100%> (ø) 1 <0> (ø) ⬇️
...lkers/contamination/MinorAlleleFractionRecord.java 84.091% <84.091%> (ø) 5 <5> (?)
...r/tools/walkers/mutect/Mutect2FilteringEngine.java 82.759% <89.474%> (+1.159%) 40 <7> (+4) ⬆️
.../walkers/contamination/CalculateContamination.java 95.522% <93.75%> (-0.345%) 41 <5> (+3)
... and 47 more

@takutosato
Copy link
Contributor

I'd love to talk about \ell_n in person on Monday, but merging the PR doesn't need to wait.

@davidbenjamin davidbenjamin merged commit c4f8cdc into master Mar 11, 2018
@davidbenjamin davidbenjamin deleted the db_m2_germline_filter branch March 11, 2018 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants