New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to Mutect2 germline filtering #4509
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished reviewing changes to the docs. Wanted to publish these comments before I jump into the code.
docs/mutect/mutect.tex
Outdated
\end{equation} | ||
|
||
The above equation, in which the factors of $\ell_t$ could cancel if we wished, is not quite right. The tumor likelihood $\ell_t$ is the probability of the tumor data given that the allele exists in the tumor \textit{as a somatic variant}. If the allele is in the tumor as a germline het we must modify $\ell_t$ to account for the fact that the allele fraction is determined by the ploidy -- it must be either $f_g$ or $1- f_g$with equal probability, where $f_g$ is the minor allele fraction of germline hets. It would be awkward to recalculate the tumor likelihood constrained the allele frequency in the model to these two values, but we can estimate a correction factor as follows: assuming that the posterior on the allele fraction in the somatic likelihoods model is fairly tight, the likelihood of $a$ alt reads out of $n$ total reads is $\binom na (1-f_t)^{n-a}f^a$, where $f_t$ is the tumor alt allele fraction. That is, our sophisticated model that marginalizes over $f_t$ reduces to something more naive. If the variant is a germline event, the likelihood becomes $\frac{1}{2} \binom na \left[(1-f_g)^{n-a}f_g^a + f_g^{n-a}(1-f_g)^a \right]$. Thus, in case (1) we have $\ell_t \rightarrow \chi \ell_t$, where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you meant "It would be awkward to recalculate the tumor likelihood with the allele frequencies constrained to these two values", or something along those lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
To filter, we set a threshold on this posterior probability. | ||
|
||
So far we have assumed that the population allele frequency $f$ is known, which is the case if it is found in our germline resource, such as gnomAD. If $f$ is not known we must make a reasonable guess as follows. Suppose the prior distribution on $f$ is ${\rm Beta}(\alpha, \beta)$. The mean $\alpha/(\alpha +\beta)$ of this prior is the average human heterozygosity $\theta \approx 10^{-3}$, so we have $\beta \approx \alpha / \theta$. We need one more constraint to determine $\alpha$ and $\beta$, and since we are concerned with imputing $f$ when $f$ is small we use a condition based on rare variants. Specifically, the number of variant alleles $n$ at some site in a germline resource with $N/2$ samples, hence $N$ chromosomes, is given by $f \sim {\rm Beta}(\alpha, \beta), n \sim {\rm Binom}(N,f)$. That is, $n \sim {\rm BetaBinom}(\alpha, \beta, N)$. The probability of a site being non-variant in every sample is then $P(n = 0) = {\rm BetaBinom}(0 | \alpha, \beta, N)$, which we equate to the empirical proportion of non-variant sites in our resource, about $7/8$ for exonic sites in gnomAD. Solving, we obtain approximately $\alpha = 0.01, \beta = 10$ for gnomAD. Now, given that some allele found by Mutect is not in the resource, the posterior on $f$ is ${\rm Beta}(\alpha, \beta + N)$, the mean of which is, since $\beta << N$, about $\alpha / N$. By default, Mutect uses this value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which we equate to the empirical proportion of non-variant sites in our resource, about
$7/8$ for exonic sites in gnomAD
Does this mean that in gnomAD there's a variant every 8 bp within the exome?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience. A few comments. Back to you
|
||
private double calculateMinorAlleleFraction(List<PileupSummary> segment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make segment
final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in two places
"1/(2* number of samples in resource) if a germline resource is available; otherwise an average " + | ||
"heterozygosity rate such as 0.001 is reasonable.", optional = true) | ||
public double afOfAllelesNotInGermlineResource = 0.001; | ||
doc="Population allele fraction assigned to alleles not found in germline resource. Please see docs/mutect2.pdf for" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/mutect2.pdf -> docs/mutect/mutect2.pdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public double afOfAllelesNotInGermlineResource = 0.001; | ||
doc="Population allele fraction assigned to alleles not found in germline resource. Please see docs/mutect2.pdf for" + | ||
"a derivation of the default value.", optional = true) | ||
public double afOfAllelesNotInGermlineResource = 0.00000005; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5e-8
may be easier to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -126,20 +130,48 @@ private void applyReadPositionFilter(final M2FiltersArgumentCollection MTFAC, fi | |||
} | |||
} | |||
|
|||
private void applyGermlineVariantFilter(final M2FiltersArgumentCollection MTFAC, final VariantContext vc, final VariantContextBuilder vcb) { | |||
if (vc.hasAttribute(GATKVCFConstants.TUMOR_LOD_KEY) && vc.hasAttribute(GATKVCFConstants.POPULATION_AF_VCF_ATTRIBUTE)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes population allele frequency not optional. I was under the impression that it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Mutect2
it assigns a population AF to alleles that aren't in gnomAD, so if there's nothing by this point it means they weren't running our pipeline and we have no idea what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks
|
||
We can determine the posterior probability that the variant exists in the normal genotype by calculating the unnormalized probabilities of four possibilities: | ||
\begin{enumerate} | ||
\item The variant exists is both the normal and the tumor samples. This has unnormalized probability $\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi)$. | ||
\item The variant exists in the tumor and the normal as a germline het. This has unnormalized probability $2f(1-f) \ell_n \ell_t (1 - \pi)$. | ||
\item The variant exists in the tumor and the normal as a germline hom alt. This has unnormalized probability $f^2 \ell_n \ell_t (1 - \pi)$. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The normal likelihood ell_n
assumes that the variant is het, and I'm not sure that we can use it for the hom alt case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ell_n
is the total likelihood to be present, as either het or hom alt, in the normal. I added a footnote that clarifies this.
One could argue that we use some power by not modeling the normal more precisely i.e. the model as it stands allows a variant to be a germline het in the normal and a germline hom alt in the tumor. However, when we have a matched normal and ell_n
isn't small we always filter the variant regardless of anything else and the details of the model don't matter. They only matter when we're in tumor-only mode (or have no coverage in the normal), in which case ell_n
is basically deactivated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, the variant won't even be emitted when normal is hom alt.
But in the code, \ell_n
seems to be coming from SomaticGenotypingEngine::diploidAltLog10Odds
, which assumes that the germline variant is het, so wouldn't it be incorrect to say that \ell_n
is the total likelihood for het or hom alt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. That method is the log odds of het vs hom ref, but if the variant is hom alt these log odds will also be overwhelmingly large, because het is so much closer to the truth than hom ref. So really, in a slightly sloppy way, \ell_n
really is coming from both het and hom alt.
As a germline genotyper it's primitive, but for Mutect's purposes it's good enough.
final double log10GermlineAltMajorLikelihood = refCount * Math.log10(maf) + altCounts[n] * Math.log10(1 - maf); | ||
final double log10GermlineLikelihood = MathUtils.LOG10_ONE_HALF + MathUtils.log10SumLog10(log10GermlineAltMinorLikelihood, log10GermlineAltMajorLikelihood); | ||
|
||
final double f = altAlleleFractions[n]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntelliJ tells me f
is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
/** | ||
* Created by David Benjamin on 9/15/16. | ||
*/ | ||
public class Mutect2FilteringEngine { | ||
public static final double MIN_ALLELE_FRACTION_FOR_GERMLINE_HET = 0.9; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a MAX instead of MIN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equivalently, it should be a HOM_ALT
instead of HET
, and that phrasing has a more direct link to the docs. Fixed it.
I also noticed a couple things in the docs for Calculation Contamination that are not in this PR:
|
Fixed both of the things you noticed in the |
@takutosato back to you. |
Codecov Report
@@ Coverage Diff @@
## master #4509 +/- ##
===============================================
+ Coverage 79.095% 79.104% +0.008%
+ Complexity 16631 16621 -10
===============================================
Files 1049 1050 +1
Lines 60115 59747 -368
Branches 9856 9792 -64
===============================================
- Hits 47548 47262 -286
+ Misses 8751 8682 -69
+ Partials 3816 3803 -13
|
I'd love to talk about |
363d694
to
45ffaae
Compare
45ffaae
to
86c166d
Compare
@takutosato This uses minor allele fraction segmentation, which was already done internally in
CalculateContamination
, to improve tumor-only calling a lot. I also sw modest improvements in some tumor-normal validations.Also, @chandrans @sooheelee this hopefully does away with the problems with
af-of-alleles-not-in-resource
by deriving a defensible default that doesn't result in all calls in tumor-only mode getting filtered.