Assorted documentation fixes, enhancements and reorganization.

See issues referenced by the pull request for details.
broadgsa · Nov 21, 2015 · 2570cab
1 parent 6722ac8
commit 2570cab

Showing 27 changed files with 306 additions and 123 deletions.
diff --git a/...in/java/org/broadinstitute/gatk/engine/recalibration/RecalibrationArgumentCollection.java b/...in/java/org/broadinstitute/gatk/engine/recalibration/RecalibrationArgumentCollection.java
@@ -75,18 +75,16 @@ public class RecalibrationArgumentCollection implements Cloneable {
 
     /**
      * This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference,
-     * so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of RodBindings (VCF, Bed, etc.)
-     * for use as this database. For users wishing to exclude an interval list of known variation simply use -XL my.interval.list to skip over processing those sites.
-     * Please note however that the statistics reported by the tool will not accurately reflected those sites skipped by the -XL argument.
+     * so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.
      */
-    @Input(fullName = "knownSites", shortName = "knownSites", doc = "A database of known polymorphic sites to skip over in the recalibration algorithm", required = false)
+    @Input(fullName = "knownSites", shortName = "knownSites", doc = "A database of known polymorphic sites", required = false)
     public List<RodBinding<Feature>> knownSites = Collections.emptyList();
 
     /**
      * After the header, data records occur one per line until the end of the file. The first several items on a line are the
      * values of the individual covariates and will change depending on which covariates were specified at runtime. The last
      * three items are the data- that is, number of observations for this combination of covariates, number of reference mismatches,
-     * and the raw empirical quality score calculated by phred-scaling the mismatch rate.   Use '/dev/stdout' to print to standard out.
+     * and the raw empirical quality score calculated by phred-scaling the mismatch rate.
      */
     @Gather(BQSRGatherer.class)
     @Output(doc = "The output recalibration table file to create", required = true)
@@ -107,7 +105,7 @@ public class RecalibrationArgumentCollection implements Cloneable {
     @Argument(fullName = "covariate", shortName = "cov", doc = "One or more covariates to be used in the recalibration. Can be specified multiple times", required = false)
     public String[] COVARIATES = null;
 
-    /*
+    /**
      * The Cycle and Context covariates are standard and are included by default unless this argument is provided.
      * Note that the ReadGroup and QualityScore covariates are required and cannot be excluded.
      */
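
Note on the recalibration table Javadoc above: the "raw empirical quality score calculated by phred-scaling the mismatch rate" can be illustrated with a minimal sketch. The class and method names below are hypothetical and not part of GATK, and the sketch applies no smoothing or prior, which the real implementation may well do.

```java
/** Hypothetical helper, not part of GATK: Phred-scales an observed mismatch rate. */
final class PhredSketch {
    private PhredSketch() {}

    /**
     * Returns Q = -10 * log10(mismatches / observations).
     * Assumption: 0 < mismatches <= observations; no smoothing is applied.
     */
    static double empiricalQuality(long observations, long mismatches) {
        if (observations <= 0 || mismatches <= 0 || mismatches > observations) {
            throw new IllegalArgumentException("need 0 < mismatches <= observations");
        }
        double mismatchRate = (double) mismatches / (double) observations;
        return -10.0 * Math.log10(mismatchRate);
    }
}
```

Under these assumptions, one mismatch in a thousand observations corresponds to a quality of about 30, and one in ten to about 10.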

diff --git a/.../main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_BaseQualityRankSumTest.java b/.../main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_BaseQualityRankSumTest.java
@@ -64,15 +64,25 @@
 
 
 /**
- * Allele-specific rank Sum Test of REF versus each ALT base quality scores
+ * Allele-specific rank Sum Test of REF versus ALT base quality scores
  *
- * <p>This variant-level annotation tests compares the base qualities of the data supporting the reference allele with those supporting each alternate allele. The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the bases supporting the alternate allele have lower quality scores than those supporting the reference allele. Conversely, a positive value indicates that the bases supporting the alternate allele have higher quality scores than those supporting the reference allele. Finding a statistically significant difference either way suggests that the sequencing process may have been biased or affected by an artifact.</p>
+ * <p>This variant-level annotation compares the base qualities of the data supporting the reference allele with those supporting each alternate allele. To be clear, it does so separately for each alternate allele. </p>
+ *
+ * <p>The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the bases supporting the alternate allele have lower quality scores than those supporting the reference allele. Conversely, a positive value indicates that the bases supporting the alternate allele have higher quality scores than those supporting the reference allele. Finding a statistically significant difference either way suggests that the sequencing process may have been biased or affected by an artifact.</p>
  *
  * <h3>Statistical notes</h3>
  * <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for base qualities (bases supporting REF vs. bases supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
  *
- * <h3>Caveat</h3>
- * <p>Uninformative reads are not used in these calculations.</p>
+ * <h3>Caveats</h3>
+ * <ul>
+ * <li>Uninformative reads are not used in these calculations.</li>
+ * <li>The base quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
+ * </ul>
+ *
+ * <h3>Related annotations</h3>
+ * <ul>
+ * <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_BaseQualityRankSumTest.php">BaseQualityRankSumTest</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
+ * </ul>
  *
  */
 public class AS_BaseQualityRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation {
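
Note on the rank sum Javadoc above: the "u-based z-approximation" can be sketched as follows. This is an illustration under stated assumptions, not GATK's own code: the class name is hypothetical, ties count as half a win, no tie correction is applied to the variance, and the sign is chosen so that a negative value means the ALT values tend to be lower than the REF values, as the Javadoc describes.

```java
/** Hypothetical sketch, not part of GATK: normal approximation to the Mann-Whitney U statistic. */
final class RankSumSketch {
    private RankSumSketch() {}

    /**
     * Returns a z-score comparing ALT values against REF values.
     * Negative: ALT tends to be lower than REF. Positive: ALT tends to be higher.
     */
    static double zScore(double[] alt, double[] ref) {
        // The test is undefined unless both alleles have supporting reads.
        if (alt.length == 0 || ref.length == 0) {
            throw new IllegalArgumentException("need values for both REF and ALT");
        }
        // U counts the (alt, ref) pairs in which the ALT value wins; ties count as half.
        double u = 0.0;
        for (double a : alt) {
            for (double r : ref) {
                if (a > r) {
                    u += 1.0;
                } else if (a == r) {
                    u += 0.5;
                }
            }
        }
        double n1 = alt.length;
        double n2 = ref.length;
        double mean = n1 * n2 / 2.0;
        // Assumption: variance without tie correction.
        double sd = Math.sqrt(n1 * n2 * (n1 + n2 + 1.0) / 12.0);
        return (u - mean) / sd;
    }
}
```

The exception for an empty group mirrors the caveat that the test cannot be calculated without a mixture of reads showing both alleles.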

diff --git a/...tected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_FisherStrand.java b/...tected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_FisherStrand.java
@@ -65,7 +65,27 @@
 
 
 /**
- * Allele specific strand bias estimated using Fisher's Exact Test
+ * Allele-specific strand bias estimated using Fisher's Exact Test
+ *
+ * <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other.</p>
+ *
+ * <p>The AS_FisherStrand annotation is one of several methods that aim to evaluate whether there is strand bias in the data. It uses Fisher's Exact Test to determine if there is strand bias between forward and reverse strands for the reference or alternate allele, and does so separately for each alternate allele.</p>
+ * <p>The output is a Phred-scaled p-value. The higher the output value, the more likely there is to be bias. More bias is indicative of false positive calls.</p>
+ *
+ * <h3>Statistical notes</h3>
+ * <p>See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this application of Fisher's Exact Test.</p>
+ *
+ * <h3>Caveats</h3>
+ * <ul>
+ *     <li>The FisherStrand test may not be calculated for certain complex indel cases or for multi-allelic sites.</li>
+ *     <li>FisherStrand is best suited for low coverage situations. For testing strand bias in higher coverage situations, see the StrandOddsRatio annotation.</li>
+ * </ul>
+ * <h3>Related annotations</h3>
+ * <ul>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_FisherStrand.php">FisherStrand</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandBiasBySample.php">StrandBiasBySample</a></b> outputs counts of read depth per allele for each strand orientation.</li>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandOddsRatio.php">StrandOddsRatio</a></b> is an updated form of FisherStrand that uses a symmetric odds ratio calculation.</li>
+ * </ul>
  *
  */
 public class AS_FisherStrand extends AS_StrandBiasTest implements AS_StandardAnnotation {
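
Note on the FisherStrand Javadoc above: a Phred-scaled p-value from a two-sided Fisher's Exact Test on a 2x2 strand table can be sketched as below. The class and method names are hypothetical, and the sketch ignores any table normalization or p-value clamping that the GATK implementation may apply; it is meant only to show why a more skewed table gives a higher score.

```java
/** Hypothetical sketch, not part of GATK: two-sided Fisher's Exact Test on a 2x2 strand table. */
final class FisherStrandSketch {
    private FisherStrandSketch() {}

    /** ln(n!) by summing logs; adequate for small illustrative counts. */
    private static double logFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    private static double logChoose(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    /** Two-sided p-value for the table [[refFwd, refRev], [altFwd, altRev]]. */
    static double twoSidedPValue(int refFwd, int refRev, int altFwd, int altRev) {
        int refTotal = refFwd + refRev;
        int altTotal = altFwd + altRev;
        int fwdTotal = refFwd + altFwd;
        double logDenominator = logChoose(refTotal + altTotal, fwdTotal);
        double observed = Math.exp(
                logChoose(refTotal, refFwd) + logChoose(altTotal, altFwd) - logDenominator);
        // Enumerate every table with the same margins and sum those no more likely than the observed one.
        int lo = Math.max(0, fwdTotal - altTotal);
        int hi = Math.min(refTotal, fwdTotal);
        double p = 0.0;
        for (int x = lo; x <= hi; x++) {
            double prob = Math.exp(
                    logChoose(refTotal, x) + logChoose(altTotal, fwdTotal - x) - logDenominator);
            if (prob <= observed * (1.0 + 1e-7)) {
                p += prob;
            }
        }
        return Math.min(1.0, p);
    }

    /** Phred-scales a p-value: higher output means stronger evidence of bias. */
    static double phredScaled(double p) {
        return -10.0 * Math.log10(p);
    }
}
```

In this sketch a perfectly balanced table yields a p-value of 1 and a score of about 0, while a table with all REF reads on one strand and all ALT reads on the other yields a much larger score.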

diff --git a/...in/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_MappingQualityRankSumTest.java b/...in/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_MappingQualityRankSumTest.java
@@ -68,9 +68,26 @@
 
 
 /**
- * Allele specific Rank Sum Test for mapping qualities of REF versus each ALT reads
+ * Allele-specific Rank Sum Test for mapping qualities of REF versus ALT reads
  *
- * Currently this annotation duplicate the MappingQualityRankSumTest annotation
+ * <p>This variant-level annotation compares the mapping qualities of the reads supporting the reference allele with those supporting each alternate allele. To be clear, it does so separately for each alternate allele. </p>
+ *
+ * <p>The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the reads supporting the alternate allele have lower mapping quality scores than those supporting the reference allele. Conversely, a positive value indicates that the reads supporting the alternate allele have higher mapping quality scores than those supporting the reference allele.</p>
+ * <p>Finding a statistically significant difference in quality either way suggests that the sequencing and/or mapping process may have been biased or affected by an artifact. In practice, we only filter out low negative values when evaluating variant quality because the idea is to filter out variants for which the quality of the data supporting the alternate allele is comparatively low. The reverse case, where it is the quality of data supporting the reference allele that is lower (resulting in positive ranksum scores), is not really informative for filtering variants.</p>
+ *
+ * <h3>Statistical notes</h3>
+ * <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for mapping qualities (MAPQ of reads supporting REF vs. MAPQ of reads supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
+ *
+ * <h3>Caveats</h3>
+ * <ul><li>The mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
+ * <li>Uninformative reads are not used in these annotations.</li>
+ * </ul>
+ *
+ * <h3>Related annotations</h3>
+ * <ul>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityRankSumTest.php">MappingQualityRankSumTest</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_RMSMappingQuality.php">RMSMappingQuality</a></b> gives an estimation of the overall read mapping quality supporting a variant call.</li>
+ * </ul>
  *
  */
 public class AS_MappingQualityRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation {

diff --git a/...d/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSMappingQuality.java b/...d/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSMappingQuality.java
@@ -79,8 +79,13 @@
  * </ul>
  *
  * <h3>Caveat</h3>
- * <p>Uninformative reads are not used in these annotations.</p>
+ * <p>Uninformative reads are not used in this annotation.</p>
  *
+ * <h3>Related annotations</h3>
+ * <ul>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_RMSMappingQuality.php">RMSMappingQuality</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityRankSumTest.php">MappingQualityRankSumTest</a></b> compares the mapping quality of reads supporting the REF and ALT alleles.</li>
+ * </ul>
  */
 public class AS_RMSMappingQuality extends AS_RMSAnnotation implements AS_StandardAnnotation, ActiveRegionBasedAnnotation {
 

diff --git a/.../src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_ReadPosRankSumTest.java b/.../src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_ReadPosRankSumTest.java
@@ -63,9 +63,11 @@
 import java.util.List;
 
 /**
- * Allele-specific Rank Sum Test for relative positioning of REF versus each ALT allele within reads
+ * Allele-specific Rank Sum Test for relative positioning of REF versus ALT allele within reads
  *
- * <p>This variant-level annotation tests whether there is evidence of bias in the position of alleles within the reads that support them, between the reference and each alternate allele. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.</p>
+ * <p>This variant-level annotation tests whether there is evidence of bias in the position of alleles within the reads that support them, between the reference and each alternate allele. To be clear, it does so separately for each alternate allele.</p>
+ *
+ * <p>Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.</p>
  *
  * <p>The ideal result is a value close to zero, which indicates there is little to no difference in where the alleles are found relative to the ends of reads. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele. Conversely, a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele. </p>
  *
@@ -75,8 +77,15 @@
  * <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for site position within reads (position within reads supporting REF vs. position within reads supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
  *
  * <h3>Caveat</h3>
- * <p>Uninformative reads are not used in these annotations.</p>
+ * <ul>
+ * <li>The read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
+ * <li>Uninformative reads are not used in these annotations.</li>
+ * </ul>
  *
+ * <h3>Related annotations</h3>
+ * <ul>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_ReadPosRankSumTest.php">ReadPosRankSumTest</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
+ * </ul>
  *
  */
 public class AS_ReadPosRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation {

diff --git a/...ted/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandOddsRatio.java b/...ted/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandOddsRatio.java
@@ -65,6 +65,46 @@
 /**
  * Allele-specific strand bias estimated by the Symmetric Odds Ratio test
  *
+ * <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other. </p>
+ *
+ * <p>The AS_StrandOddsRatio annotation is one of several methods that aim to evaluate whether there is strand bias in the data. It is an updated form of the Fisher Strand Test that is better at taking into account large amounts of data in high coverage situations. It is used to determine if there is strand bias between forward and reverse strands for the reference or alternate allele. It does so separately for each allele. The reported value is ln-scaled.</p>
+ *
+ * <h3>Statistical notes</h3>
+ * <p> Odds Ratios in the 2x2 contingency table below are</p>
+ *
+ * $$ R = \frac{X[0][0] * X[1][1]}{X[0][1] * X[1][0]} $$
+ *
+ * <p>and its inverse:</p>
+ *
+ * <table>
+ *      <tr><td>&nbsp;</td><td>+ strand </td><td>- strand</td></tr>
+ *      <tr><td>REF;</td><td>X[0][0]</td><td>X[0][1]</td></tr>
+ *      <tr><td>ALT;</td><td>X[1][0]</td><td>X[1][1]</td></tr>
+ * </table>
+ *
+ * <p>The sum R + 1/R is used to detect a difference in strand bias for REF and for ALT (the sum makes it symmetric). A high value is indicative of large difference where one entry is very small compared to the others. A scale factor of refRatio/altRatio where</p>
+ *
+ * $$ refRatio = \frac{max(X[0][0], X[0][1])}{min(X[0][0], X[0][1])} $$
+ *
+ * <p>and </p>
+ *
+ * $$ altRatio = \frac{max(X[1][0], X[1][1])}{min(X[1][0], X[1][1])} $$
+ *
+ * <p>ensures that the annotation value is large only. </p>
+ *
+ * <p>See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this statistical test.</p>
+ *
+ * <h3>Caveat</h3>
+ * <p>
+ * The name AS_StrandOddsRatio is not entirely appropriate because the implementation was changed somewhere between the start of development and release of this annotation, so SOR is no longer really an odds ratio. The goal was to separate certain cases of data without penalizing variants that occur at the ends of exons, which tend to be covered only by reads in one direction (depending on which end of the exon they're on). For example, a variant with 10 ref reads in the + direction, 1 ref read in the - direction, 9 alt reads in the + direction and 2 alt reads in the - direction is not actually strand biased, but its FS score is pretty bad. The resulting implementation derived in part from empirically testing read count tables of various sizes and ratios and deciding from there.</p>
+ *
+ * <h3>Related annotations</h3>
+ * <ul>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandOddsRatio.php">StrandOddsRatio</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandBiasBySample.php">StrandBiasBySample</a></b> outputs counts of read depth per allele for each strand orientation.</li>
+ *     <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_FisherStrand.php">FisherStrand</a></b> uses Fisher's Exact Test to evaluate strand bias.</li>
+ * </ul>
+ *
  */
 public class AS_StrandOddsRatio extends AS_StrandBiasTest implements AS_StandardAnnotation {
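
Note on the statistical notes above: the symmetric odds ratio can be sketched directly from the formulas in the Javadoc, taking the ln of (R + 1/R) multiplied by refRatio/altRatio. The class and method names are hypothetical. Adding a pseudocount of 1 to every cell is an assumption made here to avoid division by zero, and the actual GATK code may handle empty cells differently.

```java
/** Hypothetical sketch, not part of GATK: ln-scaled symmetric odds ratio for a 2x2 strand table. */
final class SymmetricOddsRatioSketch {
    private SymmetricOddsRatioSketch() {}

    /** Table layout follows the Javadoc: X[0] = REF, X[1] = ALT; column 0 = + strand, column 1 = - strand. */
    static double compute(int refFwd, int refRev, int altFwd, int altRev) {
        // Assumption: pseudocount of 1 per cell so that no ratio divides by zero.
        double x00 = refFwd + 1.0;
        double x01 = refRev + 1.0;
        double x10 = altFwd + 1.0;
        double x11 = altRev + 1.0;

        // R and its inverse; their sum is symmetric under swapping the strands.
        double r = (x00 * x11) / (x01 * x10);
        double symmetricRatio = r + 1.0 / r;

        // Scale factor refRatio / altRatio as defined in the Javadoc.
        double refRatio = Math.max(x00, x01) / Math.min(x00, x01);
        double altRatio = Math.max(x10, x11) / Math.min(x10, x11);

        return Math.log(symmetricRatio * refRatio / altRatio);
    }
}
```

With these assumptions a perfectly balanced table gives ln(2), the minimum of R + 1/R, and a table in which REF and ALT sit on opposite strands gives a clearly larger value. Swapping the + and - columns leaves the result unchanged, which is the symmetry the Javadoc refers to.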