New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Orientation Bias Filter #4895
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to review LearnReadOrientationModelEngine and the docs but here are comments on everything else to start with.
docs/mutect/mutect.tex
Outdated
|
||
Let $z_i$ denote a 1-hot encoding of the genotype extended to include the possibility of read orientation-based artifact at locus $i$. For example, at a locus where the 3 base-pair reference context is ACT, we have | ||
|
||
$$z \in \{ \text{F1R2}_A, \text{F1R2}_G, \text{F1R2}_T, \text{F2R1}_A, \text{F2R1}_G, \text{F2R1}_T, \text{Hom Ref}, \text{Germline Het}, \text{Somatic Het}, \text{Hom Var} \}$$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use \begin{equation}
docs/mutect/mutect.tex
Outdated
|
||
$z_i$ has an associated vector of prior probabilities $\pi$. | ||
|
||
Assignment to $z_i$ dictates hyperparameters $\alpha, \beta$ for the conditional distribution over the number of alt reads, $m_i | z_i \sim \mathrm{BetaBinomial}(n, \alpha, \beta)$, where $n_i$ is the depth at site $i$. The number of F1R2 alt reads, denoted $x_i$, is also a beta-binomial distribution with parameters $\alpha'$ and $\beta'$ that depend on $z_i$, but it draws out of $m$ alt reads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assignment to
* | ||
* Returns null if at the end of a contig we cannot expand the window | ||
*/ | ||
public String getKmerAround(final int center, final int numBasesOnEachSide){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like some code that calls this method doesn't handle the possibility of returning null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (but not in tests)
* Created by tsato on 10/20/17. | ||
*/ | ||
public class ReadOrientationArtifact extends GenotypeAnnotation implements StandardMutectAnnotation { | ||
private File artifactPriorTable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be a member variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
@Override | ||
public List<VCFFormatHeaderLine> getDescriptions() { | ||
return Arrays.asList(new VCFFormatHeaderLine(GATKVCFConstants.ROF_POSTERIOR_KEY, 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put these in GATKVCFHeaderLines
so that you can use GATKVCFHeaderLines.getFormatLine(KEY)
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @param n number of coin flips | ||
* @return probability density function evaluated at k | ||
*/ | ||
public static double log10BetaBinomialProbability(final int k, final int n, final double alpha, final double beta){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of the last comment, just checking that this is intentional code for this PR and not a rebasing artifact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup all is well I believe
@@ -60,6 +59,9 @@ private static void addFilterLine(final VCFFilterHeaderLine line) { | |||
addFilterLine(new VCFFilterHeaderLine(READ_POSITION_FILTER_NAME, "median distance of alt variants from end of reads")); | |||
addFilterLine(new VCFFilterHeaderLine(CONTAMINATION_FILTER_NAME, "contamination")); | |||
addFilterLine(new VCFFilterHeaderLine(DUPLICATED_EVIDENCE_FILTER_NAME, "evidence for alt allele is overrepresented by apparent duplicates")); | |||
addFilterLine(new VCFFilterHeaderLine(F1R2_ARTIFACT_FILTER_NAME, "over-representation of F1R2 reads among alt")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that it's an F1R2 orientation bias artifact instead, because just saying over-representation could be interpreted to mean that you just look at F1R2 and F2R1 counts without learning priors. We don't want users to jump to bad conclusions about the model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -927,7 +927,7 @@ class HHMMClassAndCopyNumberBasicCaller: | |||
|
|||
- Given q(c_s), the emission probability of each copy number class (tau) is determined | |||
(see _get_update_log_class_emission_tk_theano_func). The class prior and transition probabilities | |||
are fixed hyperparameters. Therefore, q(tau) can be updated immediately using a single run | |||
are fixed artifactPrior. Therefore, q(tau) can be updated immediately using a single run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntelliJ was overzealous here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooops
public void test(final int depth, final double alleleFraction, final double altF1R2Fraction, | ||
final double expectedArtifactProb, final double epsilon, | ||
final Optional<ReadOrientation> expectedArtifactType) throws IOException { | ||
// Why is it that the annotations get larger window than the walker (CollectData?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this comment about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can safely be ignored. Removed
/** | ||
* Created by tsato on 3/14/18. | ||
*/ | ||
public class CollectF1R2CountsUnitTest extends CommandLineProgramTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntegrationTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed everything now.
docs/mutect/mutect.tex
Outdated
|
||
\section{Read Orientation Artifact Filter} | ||
|
||
Read orientation artifact, also known as orientation bias, arises due to a chemical change in the nucleotide during library prep that results in, for example, G base-paring with A. The artifact is context-specific (e.g. C to A occurring only when the surrounding reference context is CCG) and therefore single-stranded. Downstream, this artifact manifests as low allele fraction SNPs whose evidence for the alt allele consists almost entirely of one read orientation i.e. F1R2 or F2R1 with the similar signature repeated across the sam/bam/cram. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Only" is too strong. Perhaps "predominantly"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define F1R2 and F2R1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/mutect/mutect.tex
Outdated
\begin{equation*} | ||
z_i \in \{ \text{F1R2}_A, \text{F1R2}_G, \text{F1R2}_T, \text{F2R1}_A, \text{F2R1}_G, \text{F2R1}_T, \text{Hom Ref}, \text{Germline Het}, \text{Somatic Het}, \text{Hom Var} \} | ||
\end{equation*} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This space shouldn't be here because it indents the following line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/mutect/mutect.tex
Outdated
z_i \in \{ \text{F1R2}_A, \text{F1R2}_G, \text{F1R2}_T, \text{F2R1}_A, \text{F2R1}_G, \text{F2R1}_T, \text{Hom Ref}, \text{Germline Het}, \text{Somatic Het}, \text{Hom Var} \} | ||
\end{equation*} | ||
|
||
where $z_i = \rm{F1R2_A}$ denotes that at locus $i$ we have a context specific artifact in which the evidence for alt allele A consists entirely of reads in the F1R2 orientation. The rest of artifact states are defined analogously. Let $\vpi$ denote the prior probabilities of the $\vz_i$ under the reference context ACT. Then we have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Rest of" --> "other" or "remaining"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/mutect/mutect.tex
Outdated
P(\vz_i) = \prod_k \pi_{k}^{z_{ik}} | ||
\end{equation} | ||
|
||
The number of alt reads at a locus is dependent on the genotype $z_i$. Let $n_i$ and $m_i$ denote the total depth and alt depth at locus $i$, respectively. The conditional distribution of $m_i$ is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"is dependent on" --> "depends on"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/mutect/mutect.tex
Outdated
\begin{equation} | ||
P(m_i | z_i) = \rm{BetaBinomial}(m_i | n_i, \alpha_{z_i}, \beta_{z_i}) | ||
\end{equation} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, this space causes unwanted indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
ArtifactState.getF2R1ArtifactStates().forEach(s -> alleleFractionPseudoCounts.put(s, new Pair<>(altPseudocount, refPseudocount))); | ||
|
||
|
||
for (ArtifactState z : ArtifactState.getNonArtifactStates()){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be four put
s instead of a switch
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
private static Map<ArtifactState, Pair<Double, Double>> getPseudoCountsForAltF1R2Fraction(){ | ||
final Map<ArtifactState, Pair<Double, Double>> altF1R2FractionPseudoCounts = new HashMap<>(ArtifactState.values().length); | ||
final double pseudoCountOfLikelyOutcome = 100.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
final Map<ArtifactState, Pair<Double, Double>> altF1R2FractionPseudoCounts = new HashMap<>(ArtifactState.values().length); | ||
final double pseudoCountOfLikelyOutcome = 100.0; | ||
final double pseudoCountOfRareOutcome = 1.0; | ||
final double balancedPrior = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think this could be a lot sharper -- Beta(10, 10) has a standard deviation of about 0.1.
return altF1R2FractionPseudoCounts; | ||
} | ||
|
||
public double[] getEffectiveCounts(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unused according to IntelliJ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's now used in a unit test
private Triple<Integer, Nucleotide, ReadOrientation> createKey(final int depth, final Nucleotide altAllele, final ReadOrientation orientation){ | ||
return new ImmutableTriple<>(depth, altAllele, orientation); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@davidbenjamin back to you. Rebasing got a bit complicated so I'm going to put that off until tomorrow. |
8d8a0ad
to
882cd93
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more small things.
scripts/mutect2_wdl/mutect2.wdl
Outdated
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override} | ||
|
||
gatk --java-options "-Xmx${command_mem}m" CollectF1R2Counts \ | ||
-I ${tumor_bam} -O "${tumor_sample_name}-artifact-prior-table.tsv" -R ${ref_fasta} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent these lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
bootDiskSizeGb: 12 | ||
memory: machine_mem + " MB" | ||
disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD" | ||
preemptible: select_first([preemptible_attempts, 10]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my maxRetries
PR gets in first, you'll need to add that to the runtime block. When you rebase you'll see how the other tasks do it.
scripts/mutect2_wdl/mutect2.wdl
Outdated
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override} | ||
|
||
gatk --java-options "-Xmx${command_mem}m" LearnReadOrientationModel \ | ||
-R ${ref_fasta} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
scripts/mutect2_wdl/mutect2.wdl
Outdated
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override} | ||
|
||
gatk --java-options "-Xmx${command_mem}m" LearnReadOrientationModel \ | ||
-R ${ref_fasta} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the intervals get used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, removed
bootDiskSizeGb: 12 | ||
memory: machine_mem + " MB" | ||
disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD" | ||
preemptible: select_first([preemptible_attempts, 10]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same about maxRetries
|
||
// Throw away bases that is below the minimum quality | ||
if (asm.getGenomePosition() == vc.getStart() && read.getBaseQuality(readOffset) < minimumBaseQuality){ | ||
// Throw away bases that is below the desired minimum quality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
*/ | ||
for (final String refContext : F1R2FilterConstants.CANONICAL_KMERS){ | ||
final Optional<ArtifactPrior> ap = priors.stream().filter(a -> a.getReferenceContext().equals(refContext)).findAny(); | ||
Utils.validate(ap.isPresent(), "ArtifactPrior object isn't present for reference context " + refContext + "in file " + input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be better as a UserException.BadInput
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
private double beta; | ||
|
||
public BetaDistributionShape(final double alpha, final double beta){ | ||
Utils.validateArg(alpha > 0 , "alpha must be greater than 0 but got " + beta); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ParamUtils.isPositive
is more precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
private final static double REF_PSEUDOCOUNT = 9.0; | ||
|
||
// for home ref sites assume Q35, but maintaining some width | ||
private final static double PSEUDO_COUNT_OF_HOM_LIKELY = 10000.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PSEUDO_COUNT
--> PSEUDOCOUNT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Back to @takutosato. You're almost done. |
da41ad6
to
409a20c
Compare
@davidbenjamin I made the requested changes and submitted novaseq validation jobs. They haven't failed yet, but I'll monitor the jobs and make changes to the wdl as needed. Will let you know when they finish. |
@takutosato You can merge once tests pass. |
Codecov Report
@@ Coverage Diff @@
## master #4895 +/- ##
===============================================
+ Coverage 86.387% 86.614% +0.227%
- Complexity 28815 29833 +1018
===============================================
Files 1791 1813 +22
Lines 133696 137705 +4009
Branches 14896 15619 +723
===============================================
+ Hits 115496 119272 +3776
- Misses 12797 12889 +92
- Partials 5403 5544 +141
|
Woohoo!
…On Fri, Aug 3, 2018 at 12:44 PM, Takuto Sato ***@***.***> wrote:
Merged #4895 <#4895> into
master.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#4895 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AGRhdCSDz1_4UPPxztU6NZMIWkGFXdcfks5uNH3ZgaJpZM4Umhce>
.
--
Laura Doyle Gauthier, Ph.D.
Associate Director, Germline Methods
Data Sciences Platform
gauthier@broadinstitute.org
Broad Institute of MIT & Harvard
320 Charles St.
Cambridge MA 0214
|
…uplicated code in tests surrounding Mutect2.
…uplicated code in tests surrounding Mutect2.
…uplicated code in tests surrounding Mutect2. (#5123)
The WGS evaluation shows that the new filter is equivalent to or slightly better than the old filter for filtering the known Oxo-G and FFPE base artifacts. The new filter is strictly better at filtering artifacts we haven't encountered before, like that of the Novaseq data.
This PR also introduces the FDR-based two-pass filtering for FilterMutectCalls #4893
@davidbenjamin I still need to revise mutect2.tex, but the code is ready for review
@LeeTL1220 @madduran