
Improvements to Mutect2 germline filtering #4509

Merged: 1 commit merged on Mar 11, 2018
Binary file modified docs/mutect/mutect.pdf
Binary file not shown.
27 changes: 20 additions & 7 deletions docs/mutect/mutect.tex
@@ -37,6 +37,7 @@
\newcommand{\vA}{{\bf A}}

\newcommand{\code}[1]{\texttt{#1}}
\newcommand*{\Comb}[2]{{}^{#1}C_{#2}}

\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
@@ -207,23 +208,35 @@ \section{Strand Artifact Model}
where $\pi_o = P(z_o = 1)$ is the fixed prior probability of no artifact. We normalize the posteriors and filter the variant if the posterior probability of $z_+ = 1$ or $z_- = 1$ exceeds the threshold.

\section{Germline Filter}\label{germline-filter}
Suppose we have detected an allele such that its (somatic) likelihood in the tumor is $\ell_t$ and its (diploid) likelihood in the normal is $\ell_n$. By convention, both of these are relative to a likelihood of $1$ for the allele \textit{not} to be found. If we have no matched normal, $\ell_n = 1$. Suppose we also have the population allele frequency $f$ of this allele. Then the prior probability for the normal to be heterozygous or homozygous alt for the allele is $2f(1-f) + f^2$ and the prior probability for the normal genotype not to contain the allele is $(1-f)^2$. Finally, suppose that the prior for this allele to arise as a somatic variant is $\pi$.
Suppose we have detected an allele such that its (somatic) likelihood in the tumor is $\ell_t$ and its (diploid) likelihood in the normal is $\ell_n$\footnote{This is the total likelihood for het and hom alt in the normal.}. By convention, both of these are relative to a likelihood of $1$ for the allele \textit{not} to be found. If we have no matched normal, $\ell_n = 1$. Suppose we also have the population allele frequency $f$ of this allele. Then the prior probabilities for the normal to be heterozygous and homozygous alt for the allele are $2f(1-f)$ and $f^2$, respectively, and the prior probability for the normal genotype not to contain the allele is $(1-f)^2$. Finally, suppose that the prior for this allele to arise as a somatic variant is $\pi$.

We can determine the posterior probability that the variant exists in the normal genotype by calculating the unnormalized probabilities of four possibilities:
\begin{enumerate}
\item The variant exists in both the normal and the tumor samples. This has unnormalized probability $\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi)$.
\item The variant exists in the tumor and the normal as a germline het. This has unnormalized probability $2f(1-f) \ell_n \ell_t (1 - \pi)$.
\item The variant exists in the tumor and the normal as a germline hom alt. This has unnormalized probability $f^2 \ell_n \ell_t (1 - \pi)$.

Contributor:
The normal likelihood ell_n assumes that the variant is het, and I'm not sure that we can use it for the hom alt case

Contributor Author:
ell_n is the total likelihood to be present, as either het or hom alt, in the normal. I added a footnote that clarifies this.

One could argue that we lose some power by not modeling the normal more precisely, i.e., the model as it stands allows a variant to be a germline het in the normal and a germline hom alt in the tumor. However, when we have a matched normal and ell_n isn't small we always filter the variant regardless of anything else, and the details of the model don't matter. They only matter when we're in tumor-only mode (or have no coverage in the normal), in which case ell_n is basically deactivated.

Contributor:
Got it, the variant won't even be emitted when normal is hom alt.

But in the code, \ell_n seems to be coming from SomaticGenotypingEngine::diploidAltLog10Odds, which assumes that the germline variant is het, so wouldn't it be incorrect to say that \ell_n is the total likelihood for het or hom alt?

Contributor Author:
I see your point. That method is the log odds of het vs hom ref, but if the variant is hom alt these log odds will also be overwhelmingly large, because het is so much closer to the truth than hom ref. So really, in a slightly sloppy way, \ell_n really is coming from both het and hom alt.

As a germline genotyper it's primitive, but for Mutect's purposes it's good enough.

\item The variant exists in the tumor but not the normal. This has unnormalized probability $(1-f)^2 \ell_t \pi$.
\item The variant exists in neither the tumor nor the normal. This has unnormalized probability $(1-f)^2 (1 - \pi)$.
\item The variant exists in the normal but not the tumor. This is biologically very unlikely. Furthermore, if it \textit{did} occur we wouldn't care about filtering the variant as a germline event because we wouldn't call it as a somatic event. Thus we neglect this possibility.
\end{enumerate}

We exclude possibilities in which the variant does not exist in the tumor sample because we really want the conditional probability that the variant is germline given that it would otherwise be called.

Normalizing, we obtain the following posterior probability that an allele is a germline variant:
\begin{equation}
P({\rm germline}) = \frac{(1)}{(1) + (2) + (3)} = \frac{\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi)}{\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi) + (1-f)^2 \ell_t \pi + (1-f)^2 (1 - \pi)}.
P({\rm germline}) = \frac{(1) + (2)}{(1) + (2) + (3)} = \frac{\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi)}{\left(2f(1-f) + f^2 \right) \ell_n \ell_t (1 - \pi) + \ell_t (1-f)^2 \pi}.
\end{equation}
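For concreteness, here is a minimal Java sketch of this posterior (the helper and its names are hypothetical, not the FilterMutectCalls implementation):

```java
// Minimal sketch of the germline posterior above, before the allele-fraction
// correction described next. Illustrative only; not GATK's implementation.
static double germlinePosterior(final double ellT,  // tumor likelihood ratio l_t
                                final double ellN,  // normal likelihood ratio l_n (1 if no matched normal)
                                final double f,     // population allele frequency
                                final double pi) {  // prior probability of somatic variation
    // cases (1) + (2): allele in the tumor, and in the normal as a het or hom alt
    final double germline = (2 * f * (1 - f) + f * f) * ellN * ellT * (1 - pi);
    // case (3): allele in the tumor only, as a somatic variant
    final double somatic = (1 - f) * (1 - f) * ellT * pi;
    return germline / (germline + somatic);
}
```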

The above equation, in which the factors of $\ell_t$ could cancel if we wished, is not quite right. The tumor likelihood $\ell_t$ is the probability of the tumor data given that the allele exists in the tumor \textit{as a somatic variant}. If the allele is in the tumor as a germline het we must modify $\ell_t$ to account for the fact that the allele fraction is determined by the ploidy -- it must be either $f_g$ or $1 - f_g$ with equal probability, where $f_g$ is the minor allele fraction of germline hets. It would be awkward to recalculate the tumor likelihood with the allele frequency constrained to these two values\footnote{The model could easily accommodate this change, but the likelihoods are long gone from memory once the germline computation occurs.}, but we can estimate a correction factor as follows: assuming that the posterior on the allele fraction in the somatic likelihoods model is fairly tight, the likelihood of $a$ alt reads out of $n$ total reads is $\binom na (1-f_t)^{n-a}f_t^a$, where $f_t$ is the tumor alt allele fraction. That is, our sophisticated model that marginalizes over $f_t$ reduces to something more naive. If the variant is a germline event, the likelihood becomes $\frac{1}{2} \binom na \left[(1-f_g)^{n-a}f_g^a + f_g^{n-a}(1-f_g)^a \right]$. Thus, in case (1) we have $\ell_t \rightarrow \chi \ell_t$, where
\begin{equation}
\chi = \frac{1}{2} \frac{(1-f_g)^{n-a}f_g^a + f_g^{n-a}(1-f_g)^a}{(1-f_t)^{n-a}f_t^a}.
\end{equation}
For germline hom alts, both the tumor and normal allele fractions will be similarly large, so to a decent approximation we don't have to modify $\ell_t$. Of course, this only applies if the allele fraction is large. Rather than try to model the count of ref reads within a germline hom alt site, we simply set a threshold of allele fraction 0.9, so that in case (2) $\ell_t \rightarrow {\rm I}[f_t > 0.9] \ell_t$, and the corrected germline probability is
\begin{equation}
P({\rm germline}) = \frac{(1) + (2)}{(1) + (2) + (3)} = \frac{\left( 2f(1-f) \chi + {\rm I}[f_t > 0.9] f^2 \right) \ell_n (1 - \pi)}{\left( 2f(1-f) \chi + {\rm I}[f_t > 0.9] f^2 \right) \ell_n (1 - \pi) + (1-f)^2 \pi}.
\end{equation}
To filter, we set a threshold on this posterior probability.
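Putting the pieces together, a sketch of the corrected posterior might look as follows (again hypothetical, with the tumor allele fraction estimated as $a/n$; note that $\ell_t$ and the binomial coefficients cancel, and that a real implementation would work in log space to avoid underflow):

```java
// Sketch of the corrected germline posterior: chi rescales the het term and the
// indicator I[f_t > 0.9] gates the hom alt term. The factors of l_t cancel, as do
// the binomial coefficients inside chi. Illustrative only.
static double correctedGermlinePosterior(final double ellN, final double f, final double pi,
                                         final int a, final int n,  // alt and total read counts
                                         final double fG) {         // minor allele fraction of germline hets
    final double fT = (double) a / n;  // tight-posterior estimate of the tumor allele fraction
    final double somaticLik = Math.pow(1 - fT, n - a) * Math.pow(fT, a);
    final double hetLik = 0.5 * (Math.pow(1 - fG, n - a) * Math.pow(fG, a)
            + Math.pow(fG, n - a) * Math.pow(1 - fG, a));
    final double chi = hetLik / somaticLik;
    final double germline = (2 * f * (1 - f) * chi + (fT > 0.9 ? f * f : 0.0)) * ellN * (1 - pi);
    return germline / (germline + (1 - f) * (1 - f) * pi);
}
```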

So far we have assumed that the population allele frequency $f$ is known, which is the case if it is found in our germline resource, such as gnomAD. If $f$ is not known we must make a reasonable guess as follows. Suppose the prior distribution on $f$ is ${\rm Beta}(\alpha, \beta)$. The mean $\alpha/(\alpha +\beta)$ of this prior is the average human heterozygosity $\theta \approx 10^{-3}$, so we have $\beta \approx \alpha / \theta$. We need one more constraint to determine $\alpha$ and $\beta$, and since we are concerned with imputing $f$ when $f$ is small we use a condition based on rare variants. Specifically, the number of variant alleles $n$ at some site in a germline resource with $N/2$ samples, hence $N$ chromosomes, is given by $f \sim {\rm Beta}(\alpha, \beta), n \sim {\rm Binom}(N,f)$. That is, $n \sim {\rm BetaBinom}(\alpha, \beta, N)$. The probability of a site being non-variant in every sample is then $P(n = 0) = {\rm BetaBinom}(0 | \alpha, \beta, N)$, which we equate to the empirical proportion of non-variant sites in our resource, about $7/8$ for exonic sites in gnomAD. Solving, we obtain approximately $\alpha = 0.01, \beta = 10$ for gnomAD. Now, given that some allele found by Mutect is not in the resource, the posterior on $f$ is ${\rm Beta}(\alpha, \beta + N)$, the mean of which is, since $\beta \ll N$, about $\alpha / N$. By default, Mutect uses this value.

Contributor:
> which we equate to the empirical proportion of non-variant sites in our resource, about $7/8$ for exonic sites in gnomAD

Does this mean that in gnomAD there's a variant every 8 bp within the exome?

Contributor Author:
That's right.
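
As a sanity check on the calibration $\alpha = 0.01, \beta = 10$, here is a hedged sketch using Apache Commons Math (a GATK dependency); the helper names are illustrative:

```java
import org.apache.commons.math3.special.Gamma;

// Sketch of the calibration above: BetaBinom(0 | alpha, beta, N) equals
// B(alpha, beta + N) / B(alpha, beta), evaluated here in log space; equating
// its exponential to ~7/8 for gnomAD pins down alpha and beta.
static double logBetaBinomZero(final double alpha, final double beta, final int N) {
    return Gamma.logGamma(beta + N) + Gamma.logGamma(alpha + beta)
            - Gamma.logGamma(beta) - Gamma.logGamma(alpha + beta + N);
}

// Posterior mean of f for an allele absent from a resource of N chromosomes:
// alpha / (alpha + beta + N), which is about alpha / N since beta is much smaller than N.
static double imputedAlleleFrequency(final double alpha, final double beta, final int N) {
    return alpha / (alpha + beta + N);
}
```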


\section{Contamination Filter}\label{contamination-filter}
Suppose our tumor bam has contamination fraction $\alpha$ and that at some site we have $a$ alt reads out of $d$ total reads. Suppose further that the alt allele has population allele frequency $f$. We will compute a simple estimate of the posterior probability that these alt reads came from a contaminating sample and not from a true somatic variant. Let $\pi$ be the prior probability of somatic variation as above. Our crude model for the alt count distribution of somatic variation is a uniform distribution. That is, we assume that any value of $a$ from $0$ to $d$ is equally likely. Then the likelihood of the data given a true somatic variant is
\begin{equation}
@@ -309,13 +322,13 @@ \section{Calculating Contamination}
\end{equation}
This standard deviation is quite small. For example, in an exome with 1000 hom alt sites (as is typical) and an average depth of 30 reads, a true contamination of 0.05 is estimated with an error of 0.0013.
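Assuming the error estimate reduces to a binomial standard deviation over the pooled depth of the hom alt sites (the preceding equation is not shown in this view, so this form is an assumption), the quoted figure can be checked in a few lines:

```java
// Hedged check of the quoted error bar, assuming the elided equation above is
// the binomial standard deviation sqrt(c(1-c)/D) over pooled depth D.
public class ContaminationErrorCheck {
    public static void main(String[] args) {
        final double c = 0.05;                 // true contamination
        final double pooledDepth = 1000 * 30;  // 1000 hom alt sites at average depth 30
        System.out.println(Math.sqrt(c * (1 - c) / pooledDepth));  // prints ~0.0013
    }
}
```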

It remains to describe how we determine which sites are hom alt. The fundamental challenge here is that in cancer samples loss of heterozygosity may cause het sites to look like hom alt sites. Our strategy is to partition the genome into allelic copy-number segments, then infer the minor allele fraction of those segments. We segment the genome just as in GATK CNV, using a kernel segmenter with a Gaussian kernel computed on the alt fraction. A nonlinear kernel is important because each segment is multimodal, with peaks for hom ref, alt minor het, alt major het, and hom ref. In practice we remove most hom refs heuristically by filtering sites with alt fraction below some small constant in order to enrich the meaningful signal, which are the alt minor and alt major hets. We infer the minor allele fraction of each segment by heuristically assigning its hets to be those sites with allele fraction between 0.2 and 0.8 and then maximizing the following likelihood by brute force, that is, by plugging in different values of the minor allele fraction:
It remains to describe how we determine which sites are hom alt. The fundamental challenge here is that in cancer samples loss of heterozygosity may cause het sites to look like hom alt sites. Our strategy is to partition the genome into allelic copy-number segments, then infer the minor allele fraction of those segments. We segment the genome just as in GATK CNV, using a kernel segmenter with a Gaussian kernel computed on the alt fraction. A nonlinear kernel is important because each segment is multimodal, with peaks for hom ref, alt minor het, alt major het, and hom alt. In practice we remove most hom refs heuristically by filtering sites with alt fraction below some small constant in order to enrich the meaningful signal, which are the alt minor and alt major hets. We infer the minor allele fraction of each segment by heuristically assigning its hets to be those sites with allele fraction between 0.2 and 0.8 and then maximizing the following likelihood by brute force, that is, by plugging in different values of the minor allele fraction:
\begin{equation}
\prod_{{\rm hets~}s} \frac{1}{2} \left[ {\rm Binom}(a_s | d_s, f) + {\rm Binom}(a_s | d_s, 1 - f) \right] \label{het-maf}
\end{equation}
where $a_s$ and $d_s$ are the alt and total counts at site $s$ and $f$ is the minor allele fraction. This approach might seem to break down in the case of loss of heterozygosity when many hets are outside of the heuristic range, but if this happens the likelihood will be maximized at the end of the range and we will still detect loss of heterozygosity. Also, due to contamination the true distributions are overdispersed relative to a pure binomial. However, the important thing is inferring $f$ and not obtaining a perfect model fit. Overdispersion worsens the fit but does not significantly affect the estimate of $f$.
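A minimal sketch of this brute-force maximization as a simple grid search (the helper name and grid bounds are illustrative; the tool itself maximizes with OptimizationUtils.argmax, as in the code changes below):

```java
import org.apache.commons.math3.distribution.BinomialDistribution;

// Brute-force fit of the minor allele fraction f: scan a grid of candidate
// values and keep the one maximizing the het log likelihood above.
static double fitMinorAlleleFraction(final int[] altCounts, final int[] totalCounts) {
    double bestMaf = 0.5;
    double bestLogLik = Double.NEGATIVE_INFINITY;
    for (double maf = 0.1; maf <= 0.5; maf += 0.01) {
        double logLik = 0;
        for (int s = 0; s < altCounts.length; s++) {
            final double p = 0.5 * (new BinomialDistribution(totalCounts[s], maf).probability(altCounts[s])
                    + new BinomialDistribution(totalCounts[s], 1 - maf).probability(altCounts[s]));
            logLik += Math.log(p);
        }
        if (logLik > bestLogLik) {
            bestLogLik = logLik;
            bestMaf = maf;
        }
    }
    return bestMaf;
}
```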

Once the minor allele fraction is determined we compare the posterior probabilities of hom alt and het. For hets we use the above likelihood, Equation \ref{het-maf}, with a prior of $2 q (1 - q)$, where $q$ is the population allele frequency. For hom alts we use a binomial likelihood ${\rm Binom}(a_s | d_s, 1 - \alpha)$, where $\alpha$ is the contamination, and a prior of $q^2$. These priors are not normalized since at this point we have thrown away obvious hom ref sites, but only their ratio matters. The fact that this likelihood requires the contamination means that after initializing with some guess for the contamination we iterate the entire procedure described above\footnote{This excludes segmentation, which is independent.} to obtain successive estimates of the contamination. These likelihoods are crude, as they must be because we do not know how many contaminant samples there are, or what their copy numbers are at each site. This generally doesn't matter because for most segments hom alts are easily distinguished from hets. Since low minor allele fractions can lead to confusion, we only use hom alt sites from segments with minor allele fraction as close to 1/2 as possible. Specifically, we use the largest cutoff of minor allele fraction that gives us at least 100 ho alt sites.
Once the minor allele fraction is determined we compare the posterior probabilities of hom alt and het. For hets we use the above likelihood, Equation \ref{het-maf}, with a prior of $2 q (1 - q)$, where $q$ is the population allele frequency. For hom alts we use a binomial likelihood ${\rm Binom}(a_s | d_s, 1 - \alpha)$, where $\alpha$ is the contamination, and a prior of $q^2$. These priors are not normalized since at this point we have thrown away obvious hom ref sites, but only their ratio matters. The fact that this likelihood requires the contamination means that after initializing with some guess for the contamination we iterate the entire procedure described above\footnote{This excludes segmentation, which is independent.} to obtain successive estimates of the contamination. These likelihoods are crude, as they must be because we do not know how many contaminant samples there are, or what their copy numbers are at each site. This generally doesn't matter because for most segments hom alts are easily distinguished from hets. Since low minor allele fractions can lead to confusion, we only use hom alt sites from segments with minor allele fraction as close to 1/2 as possible. Specifically, we use the largest cutoff of minor allele fraction that gives us at least 100 hom alt sites.
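At a single site, the comparison reduces to something like the following sketch (illustrative; in the actual code, homAltProbability plays this role with a 0.5 posterior threshold):

```java
import org.apache.commons.math3.distribution.BinomialDistribution;

// Sketch of the het vs. hom alt call at one site, with unnormalized priors
// 2q(1-q) and q^2 and the likelihoods described above. Illustrative only.
static boolean isHomAlt(final int a, final int d,  // alt and total read counts
                        final double q,            // population allele frequency
                        final double maf,          // minor allele fraction of the segment
                        final double contamination) {
    final double hetLik = 0.5 * (new BinomialDistribution(d, maf).probability(a)
            + new BinomialDistribution(d, 1 - maf).probability(a));
    final double homAltLik = new BinomialDistribution(d, 1 - contamination).probability(a);
    final double het = 2 * q * (1 - q) * hetLik;
    final double homAlt = q * q * homAltLik;
    return homAlt / (het + homAlt) > 0.5;
}
```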



6 changes: 5 additions & 1 deletion scripts/mutect2_wdl/mutect2.wdl
@@ -270,6 +270,7 @@ workflow Mutect2 {
compress = compress,
preemptible_attempts = preemptible_attempts,
contamination_table = CalculateContamination.contamination_table,
maf_segments = CalculateContamination.maf_segments,
m2_extra_filtering_args = m2_extra_filtering_args,
disk_space = ceil(size(MergeVCFs.merged_vcf, "GB") * small_input_to_output_multiplier) + disk_pad
}
@@ -650,7 +651,7 @@ task CalculateContamination {
fi

gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries -R ${ref_fasta} -I ${tumor_bam} ${"-L " + intervals} -V ${variants_for_contamination} -O pileups.table
gatk --java-options "-Xmx${command_mem}m" CalculateContamination -I pileups.table -O contamination.table $NORMAL_CMD
gatk --java-options "-Xmx${command_mem}m" CalculateContamination -I pileups.table -O contamination.table --tumor-segmentation segments.table $NORMAL_CMD
}

runtime {
@@ -663,6 +664,7 @@ task CalculateContamination {
output {
File pileups = "pileups.table"
File contamination_table = "contamination.table"
File maf_segments = "segments.table"
}
}

@@ -676,6 +678,7 @@ task Filter {
String output_vcf = output_name + if compress then ".vcf.gz" else ".vcf"
String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
File? contamination_table
File? maf_segments
String? m2_extra_filtering_args

File? gatk_override
@@ -700,6 +703,7 @@ task Filter {
gatk --java-options "-Xmx${command_mem}m" FilterMutectCalls -V ${unfiltered_vcf} \
-O ${output_vcf} \
${"--contamination-table " + contamination_table} \
${"--tumor-segmentation " + maf_segments}
${m2_extra_filtering_args}
}

@@ -126,6 +126,13 @@ public class CalculateContamination extends CommandLineProgram {
doc="The output table")
private final File outputTable = null;

public static final String TUMOR_SEGMENTATION_LONG_NAME = "tumor-segmentation";
public static final String TUMOR_SEGMENTATION_SHORT_NAME = "segments";
@Argument(fullName= TUMOR_SEGMENTATION_LONG_NAME,
shortName= TUMOR_SEGMENTATION_SHORT_NAME,
doc="The output table containing segmentation of the tumor by minor allele fraction", optional = true)
private final File outputTumorSegmentation = null;

public static final String LOW_COVERAGE_RATIO_THRESHOLD_NAME = "low-coverage-ratio-threshold";
@Argument(fullName = LOW_COVERAGE_RATIO_THRESHOLD_NAME,
doc="The minimum coverage relative to the median.", optional = true)
@@ -181,6 +188,15 @@ public Object doWork() {
genotypingContamination.setValue(newGenotypingContamination);
}

if (outputTumorSegmentation != null) {
final List<List<PileupSummary>> tumorSegments = matchedPileupSummariesTable == null ?
genotypingSegments : findSegments(sites);
List<MinorAlleleFractionRecord> tumorMinorAlleleFractions = tumorSegments.stream()
.map(this::makeMinorAlleleFractionRecord).collect(Collectors.toList());
MinorAlleleFractionRecord.writeToFile(tumorMinorAlleleFractions, outputTumorSegmentation);

}

final List<PileupSummary> homAltSites = subsetSites(sites, homAltGenotypingSites);
final Pair<Double, Double> contaminationAndError = calculateContamination(homAltSites, errorRate(sites));
final double contamination = contaminationAndError.getLeft();
@@ -214,13 +230,23 @@ private static List<PileupSummary> subsetSites(final List<PileupSummary> sites,
}

private List<PileupSummary> segmentHomAlts(final List<PileupSummary> segment, final double contamination, double minimiumMinorAlleleFraction) {
final double minorAlleleFraction = calculateMinorAlleleFraction(segment);
return minorAlleleFraction < minimiumMinorAlleleFraction ? Collections.emptyList() :
segment.stream().filter(site -> homAltProbability(site, minorAlleleFraction, contamination) > 0.5).collect(Collectors.toList());
}

private double calculateMinorAlleleFraction(final List<PileupSummary> segment) {
final List<PileupSummary> hets = getLikelyHetsBasedOnAlleleFraction(segment);
final Function<Double, Double> objective = maf -> logLikelihoodOfHetsInSegment(hets, maf);
final double minorAlleleFraction = OptimizationUtils.argmax(objective, ALT_FRACTIONS_FOR_SEGMENTATION.getMinimum(), 0.5, 0.4, 0.01, 0.01, 20);
return OptimizationUtils.argmax(objective, ALT_FRACTIONS_FOR_SEGMENTATION.getMinimum(), 0.5, 0.4, 0.01, 0.01, 20);
}

return minorAlleleFraction < minimiumMinorAlleleFraction ? Collections.emptyList() :
segment.stream().filter(site -> homAltProbability(site, minorAlleleFraction, contamination) > 0.5).collect(Collectors.toList());
private MinorAlleleFractionRecord makeMinorAlleleFractionRecord(final List<PileupSummary> segment) {
final String contig = segment.get(0).getContig();
final int start = segment.get(0).getStart();
final int end = segment.get(segment.size() - 1).getEnd();
final double minorAlleleFraction = calculateMinorAlleleFraction(segment);
return new MinorAlleleFractionRecord(new SimpleInterval(contig, start, end), minorAlleleFraction);
}

