Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Orientation Bias Filter #4895

Merged
merged 1 commit into from Aug 3, 2018
Merged

New Orientation Bias Filter #4895

merged 1 commit into from Aug 3, 2018

Conversation

takutosato
Copy link
Contributor

The WGS evaluation shows that the new filter is equivalent to or slightly better than the old filter for filtering the known Oxo-G and FFPE base artifacts. The new filter is strictly better at filtering artifacts we haven't encountered before, like that of the Novaseq data.

This PR also introduces the FDR-based two-pass filtering for FilterMutectCalls #4893

@davidbenjamin I still need to revise mutect2.tex, but the code is ready for review

@LeeTL1220 @madduran

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to review LearnReadOrientationModelEngine and the docs but here are comments on everything else to start with.


Let $z_i$ denote a 1-hot encoding of the genotype extended to include the possibility of read orientation-based artifact at locus $i$. For example, at a locus where the 3 base-pair reference context is ACT, we have

$$z \in \{ \text{F1R2}_A, \text{F1R2}_G, \text{F1R2}_T, \text{F2R1}_A, \text{F2R1}_G, \text{F2R1}_T, \text{Hom Ref}, \text{Germline Het}, \text{Somatic Het}, \text{Hom Var} \}$$
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use \begin{equation}


$z_i$ has an associated vector of prior probabilities $\pi$.

Assignment to $z_i$ dictates hyperparameters $\alpha, \beta$ for the conditional distribution over the number of alt reads, $m_i | z_i \sim \mathrm{BetaBinomial}(n, \alpha, \beta)$, where $n_i$ is the depth at site $i$. The number of F1R2 alt reads, denoted $x_i$, is also a beta-binomial distribution with parameters $\alpha'$ and $\beta'$ that depend on $z_i$, but it draws out of $m$ alt reads.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assignment to $z_i$ dictates -> $z_i$ determines

*
* Returns null if at the end of a contig we cannot expand the window
*/
public String getKmerAround(final int center, final int numBasesOnEachSide){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like some code that calls this method doesn't handle the possibility of returning null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (but not in tests)

* Created by tsato on 10/20/17.
*/
public class ReadOrientationArtifact extends GenotypeAnnotation implements StandardMutectAnnotation {
private File artifactPriorTable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be a member variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@Override
public List<VCFFormatHeaderLine> getDescriptions() {
return Arrays.asList(new VCFFormatHeaderLine(GATKVCFConstants.ROF_POSTERIOR_KEY, 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put these in GATKVCFHeaderLines so that you can use GATKVCFHeaderLines.getFormatLine(KEY) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @param n number of coin flips
* @return probability density function evaluated at k
*/
public static double log10BetaBinomialProbability(final int k, final int n, final double alpha, final double beta){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of the last comment, just checking that this is intentional code for this PR and not a rebasing artifact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup all is well I believe

@@ -60,6 +59,9 @@ private static void addFilterLine(final VCFFilterHeaderLine line) {
addFilterLine(new VCFFilterHeaderLine(READ_POSITION_FILTER_NAME, "median distance of alt variants from end of reads"));
addFilterLine(new VCFFilterHeaderLine(CONTAMINATION_FILTER_NAME, "contamination"));
addFilterLine(new VCFFilterHeaderLine(DUPLICATED_EVIDENCE_FILTER_NAME, "evidence for alt allele is overrepresented by apparent duplicates"));
addFilterLine(new VCFFilterHeaderLine(F1R2_ARTIFACT_FILTER_NAME, "over-representation of F1R2 reads among alt"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that it's an F1R2 orientation bias artifact instead, because just saying over-representation could be interpreted to mean that you just look at F1R2 and F2R1 counts without learning priors. We don't want users to jump to bad conclusions about the model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -927,7 +927,7 @@ class HHMMClassAndCopyNumberBasicCaller:

- Given q(c_s), the emission probability of each copy number class (tau) is determined
(see _get_update_log_class_emission_tk_theano_func). The class prior and transition probabilities
are fixed hyperparameters. Therefore, q(tau) can be updated immediately using a single run
are fixed artifactPrior. Therefore, q(tau) can be updated immediately using a single run
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntelliJ was overzealous here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooops

public void test(final int depth, final double alleleFraction, final double altF1R2Fraction,
final double expectedArtifactProb, final double epsilon,
final Optional<ReadOrientation> expectedArtifactType) throws IOException {
// Why is it that the annotations get larger window than the walker (CollectData?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this comment about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can safely be ignored. Removed

/**
* Created by tsato on 3/14/18.
*/
public class CollectF1R2CountsUnitTest extends CommandLineProgramTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntegrationTest

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed everything now.


\section{Read Orientation Artifact Filter}

Read orientation artifact, also known as orientation bias, arises due to a chemical change in the nucleotide during library prep that results in, for example, G base-paring with A. The artifact is context-specific (e.g. C to A occurring only when the surrounding reference context is CCG) and therefore single-stranded. Downstream, this artifact manifests as low allele fraction SNPs whose evidence for the alt allele consists almost entirely of one read orientation i.e. F1R2 or F2R1 with the similar signature repeated across the sam/bam/cram.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Only" is too strong. Perhaps "predominantly"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define F1R2 and F2R1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

\begin{equation*}
z_i \in \{ \text{F1R2}_A, \text{F1R2}_G, \text{F1R2}_T, \text{F2R1}_A, \text{F2R1}_G, \text{F2R1}_T, \text{Hom Ref}, \text{Germline Het}, \text{Somatic Het}, \text{Hom Var} \}
\end{equation*}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This space shouldn't be here because it indents the following line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

z_i \in \{ \text{F1R2}_A, \text{F1R2}_G, \text{F1R2}_T, \text{F2R1}_A, \text{F2R1}_G, \text{F2R1}_T, \text{Hom Ref}, \text{Germline Het}, \text{Somatic Het}, \text{Hom Var} \}
\end{equation*}

where $z_i = \rm{F1R2_A}$ denotes that at locus $i$ we have a context specific artifact in which the evidence for alt allele A consists entirely of reads in the F1R2 orientation. The rest of artifact states are defined analogously. Let $\vpi$ denote the prior probabilities of the $\vz_i$ under the reference context ACT. Then we have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Rest of" --> "other" or "remaining"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

P(\vz_i) = \prod_k \pi_{k}^{z_{ik}}
\end{equation}

The number of alt reads at a locus is dependent on the genotype $z_i$. Let $n_i$ and $m_i$ denote the total depth and alt depth at locus $i$, respectively. The conditional distribution of $m_i$ is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"is dependent on" --> "depends on"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

\begin{equation}
P(m_i | z_i) = \rm{BetaBinomial}(m_i | n_i, \alpha_{z_i}, \beta_{z_i})
\end{equation}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, this space causes unwanted indentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

ArtifactState.getF2R1ArtifactStates().forEach(s -> alleleFractionPseudoCounts.put(s, new Pair<>(altPseudocount, refPseudocount)));


for (ArtifactState z : ArtifactState.getNonArtifactStates()){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be four puts instead of a switch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


private static Map<ArtifactState, Pair<Double, Double>> getPseudoCountsForAltF1R2Fraction(){
final Map<ArtifactState, Pair<Double, Double>> altF1R2FractionPseudoCounts = new HashMap<>(ArtifactState.values().length);
final double pseudoCountOfLikelyOutcome = 100.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract constants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

final Map<ArtifactState, Pair<Double, Double>> altF1R2FractionPseudoCounts = new HashMap<>(ArtifactState.values().length);
final double pseudoCountOfLikelyOutcome = 100.0;
final double pseudoCountOfRareOutcome = 1.0;
final double balancedPrior = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think this could be a lot sharper -- Beta(10, 10) has a standard deviation of about 0.1.

return altF1R2FractionPseudoCounts;
}

public double[] getEffectiveCounts(){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unused according to IntelliJ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now used in a unit test

private Triple<Integer, Nucleotide, ReadOrientation> createKey(final int depth, final Nucleotide altAllele, final ReadOrientation orientation){
return new ImmutableTriple<>(depth, altAllele, orientation);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@takutosato
Copy link
Contributor Author

@davidbenjamin back to you. Rebasing got a bit complicated so I'm going to put that off until tomorrow.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more small things.

export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx${command_mem}m" CollectF1R2Counts \
-I ${tumor_bam} -O "${tumor_sample_name}-artifact-prior-table.tsv" -R ${ref_fasta} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent these lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

bootDiskSizeGb: 12
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 10])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my maxRetries PR gets in first, you'll need to add that to the runtime block. When you rebase you'll see how the other tasks do it.

export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx${command_mem}m" LearnReadOrientationModel \
-R ${ref_fasta} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx${command_mem}m" LearnReadOrientationModel \
-R ${ref_fasta} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the intervals get used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, removed

bootDiskSizeGb: 12
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 10])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same about maxRetries


// Throw away bases that is below the minimum quality
if (asm.getGenomePosition() == vc.getStart() && read.getBaseQuality(readOffset) < minimumBaseQuality){
// Throw away bases that is below the desired minimum quality
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that are

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*/
for (final String refContext : F1R2FilterConstants.CANONICAL_KMERS){
final Optional<ArtifactPrior> ap = priors.stream().filter(a -> a.getReferenceContext().equals(refContext)).findAny();
Utils.validate(ap.isPresent(), "ArtifactPrior object isn't present for reference context " + refContext + "in file " + input);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better as a UserException.BadInput

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private double beta;

public BetaDistributionShape(final double alpha, final double beta){
Utils.validateArg(alpha > 0 , "alpha must be greater than 0 but got " + beta);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParamUtils.isPositive is more precise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private final static double REF_PSEUDOCOUNT = 9.0;

// for home ref sites assume Q35, but maintaining some width
private final static double PSEUDO_COUNT_OF_HOM_LIKELY = 10000.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PSEUDO_COUNT --> PSEUDOCOUNT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@davidbenjamin
Copy link
Contributor

Back to @takutosato. You're almost done.

@takutosato takutosato force-pushed the ts_ob_2pass branch 2 times, most recently from da41ad6 to 409a20c Compare July 27, 2018 20:15
@takutosato
Copy link
Contributor Author

@davidbenjamin I made the requested changes and submitted novaseq validation jobs. They haven't failed yet, but I'll monitor the jobs and make changes to the wdl as needed. Will let you know when they finish.

@davidbenjamin
Copy link
Contributor

@takutosato You can merge once tests pass.

@codecov-io
Copy link

codecov-io commented Jul 31, 2018

Codecov Report

Merging #4895 into master will increase coverage by 0.227%.
The diff coverage is 93.158%.

@@               Coverage Diff               @@
##              master     #4895       +/-   ##
===============================================
+ Coverage     86.387%   86.614%   +0.227%     
- Complexity     28815     29833     +1018     
===============================================
  Files           1791      1813       +22     
  Lines         133696    137705     +4009     
  Branches       14896     15619      +723     
===============================================
+ Hits          115496    119272     +3776     
- Misses         12797     12889       +92     
- Partials        5403      5544      +141
Impacted Files Coverage Δ Complexity Δ
...nstitute/hellbender/engine/filters/ReadFilter.java 100% <ø> (ø) 10 <0> (ø) ⬇️
...ute/hellbender/utils/variant/GATKVCFConstants.java 80% <ø> (ø) 4 <0> (ø) ⬇️
...ion/basicshortmutpileup/PowerCalculationUtils.java 92.683% <0%> (-3.984%) 31 <0> (+13)
...hellbender/tools/walkers/mutect/Mutect2Engine.java 93.396% <0%> (+2.005%) 71 <1> (+16) ⬆️
...ientation/ReadOrientationModelIntegrationTest.java 100% <100%> (ø) 6 <6> (?)
...dinstitute/hellbender/utils/MathUtilsUnitTest.java 92.369% <100%> (+0.21%) 143 <4> (+3) ⬆️
...ools/walkers/annotator/OxoGReadCountsUnitTest.java 100% <100%> (ø) 15 <0> (ø) ⬇️
...e/hellbender/utils/variant/GATKVCFHeaderLines.java 99.329% <100%> (+0.028%) 10 <0> (ø) ⬇️
...walkers/readorientation/AltSiteRecordUnitTest.java 100% <100%> (ø) 10 <10> (?)
.../hellbender/tools/walkers/mutect/FilterResult.java 100% <100%> (ø) 7 <7> (?)
... and 94 more

@ldgauthier
Copy link
Contributor

ldgauthier commented Aug 3, 2018 via email

takutosato added a commit that referenced this pull request Aug 31, 2018
…uplicated code in tests surrounding Mutect2. (#5123)
@davidbenjamin davidbenjamin deleted the ts_ob_2pass branch February 24, 2021 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants