Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FOLD_80_BASE_PENALTY Metric from TargetMetricsCollector does not exclude 0 coverage regions. #1971

Open
Patrick-Schlaeger-Broad opened this issue Jul 30, 2024 · 0 comments

Comments

@Patrick-Schlaeger-Broad
Copy link

Patrick-Schlaeger-Broad commented Jul 30, 2024

Bug Report

Affected tool(s)

TargetMetricsCollector, HsMetricCollector, WgsMetrics

Affected version(s)

Latest public release version [3.20]
Latest development/master branch as of [7/30/24]

Description

There are two potential bugs that I have noted, both of which revolve around the FOLD_80_BASE_PENALTY metric.
The FOLD_80_BASE_PENALTY metric is defined as:
Screenshot 2024-07-29 at 2 47 30 PM
However the calculation for the FOLD_80_BASE_PENALTY does not actually filter out the zero-coverage targets from the histogram which it uses for its getPercentile() calculation. You can see this when you look at the code for the metric:

for (final Coverage c : this.highQualityCoverageByTarget.values()) {
if (!c.hasCoverage()) {
zeroCoverageTargets++;
highQualityDepthHistogram.increment(0, c.interval.length());
targetBases[0] += c.interval.length();
minDepth = 0;
continue;
}
for (final int depth : c.getDepths()) {
totalCoverage += depth;
highQualityDepthHistogram.increment(depth, 1);
minDepth = Math.min(minDepth, depth);
// Add to the "how many target bases at at-least X" calculations.
for (int i = 0; i < targetBasesDepth.length; i++) {
if (depth >= targetBasesDepth[i]) targetBases[i]++;
else break; // NB: assumes that targetBasesDepth is sorted in ascending order
}
}
}
if (targetBases[0] != highQualityCoverageByTarget.keySet().stream().mapToInt(Interval::length).sum()) {
throw new PicardException("the number of target bases with at least 0x coverage does not equal the number of target bases");
}
metrics.MEAN_TARGET_COVERAGE = (double) totalCoverage / metrics.TARGET_TERRITORY;
metrics.MEDIAN_TARGET_COVERAGE = highQualityDepthHistogram.getMedian();
metrics.MAX_TARGET_COVERAGE = hqMaxDepth;
// Use Math.min() to account for edge case where highQualityCoverageByTarget is empty (minDepth=Long.MAX_VALUE)
metrics.MIN_TARGET_COVERAGE = Math.min(minDepth, hqMaxDepth);
// compute the coverage value such that 80% of target bases have better coverage than it i.e. 20th percentile
// this roughly measures how much we must sequence extra such that 80% of target bases have coverage at least as deep as the current mean coverage
metrics.FOLD_80_BASE_PENALTY = metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2);

Issue 1 -- Base Coverage Percentile Function:

In line 731, if an interval is found to have 0 coverage across all its bases, the histogram which is later used to calculate the 20th percentile has bin '0' incremented by the length of the interval: highQualityDepthHistogram.increment(0, c.interval.length()). Thus, the highQualityDepthHistogram includes zero-coverage target bases. Once all bases in the given intervals have been counted towards the highQualityDepthHistogram, the histogram is then called in line 762 for the calculation of Fold80: metrics.FOLD_80_BASE_PENALTY = metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2). As a result, the FOLD_80_BASE_PENALTY metric is not calculating the fold over-coverage necessary to raise 80% of the non-zero coverage target bases to the mean coverage, but rather the fold over-coverage necessary to raise 80% of all target bases to the mean coverage.

Issue 2 -- Mean Coverage Calculation:

At line 762, FOLD_80_BASE_PENALTY is defined to be metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2). The metrics.MEAN_TARGET_COVERAGE, which for this calculation should be the mean coverage of the non-zero-coverage target bases, is calculated in line 754: metrics.MEAN_TARGET_COVERAGE = (double) totalCoverage / metrics.TARGET_TERRITORY. totalCoverage is the sum of all the depths at each target base, while metrics.TARGET_TERRITORY is defined at line 407: metrics.TARGET_TERRITORY = targetTerritory. targetTerritory, in turn, is defined in line 301 as: this.targetTerritory = Interval.countBases(uniqueTargets). In summary, the metrics.MEAN_TARGET_COVERAGE is calculating the mean coverage of all target bases, while the FOLD_80_BASE_PENALTY metric is defined as requiring the mean target coverage of non-zero-coverage target bases. Although the output MEAN_TARGET_COVERAGE metric should independently be calculated including zero-coverage target bases, it should not be the metric that is used for the calculation of the FOLD_80_BASE_PENALTY metric.

Steps to reproduce

Run CollectHsMetrics on any exome bam file that has any number of zero-coverage target bases. The output FOLD_80_BASE_PENALTY will not match what you get if you were to manually calculate the FOLD_80_BASE_PENALTY.

Minimal test case:

  1. Download a sample exome .bam file, open it on IGV, find a small interval where some of the bases are at 0 coverage (I would recommend ~30 bases) .

  2. Copy the sample's reference exome target and bait interval list files, delete all of their intervals and add in the ~30 base interval that you found.

  3. Download the respective Fasta/Fai reference files.

  4. Run the gatk 4.5.0.0 or 4.6.0.0 Docker Image

docker run -it broadinstitute/gatk:4.6.0.0
  1. Run CollectHsMetrics with given inputs:
gatk CollectHsMetrics -I <input/bam/path> -R <reference/fasta/path> -BI <modified/bait/intervals/path> -TI <modified/target/intervals/path> -O <output/file/path>
  1. Check the output FOLD_80_BASE_PENALTY value in the HsMetrics output file.
  • For the example inputs, it should be: FLOAT
  1. Calculate the FOLD_80_BASE_PENALTY by hand for the interval:

  2. Note that the two FOLD_80_BASE_PENALTY values are not equivalent.

Example for test case:

  1. Found the following 28 base interval:
    Using chr7:142,560,458-142,560,485.
    This is the depth profile over that region shown in IGV:
Screenshot 2024-07-30 at 12 06 37 PM
  1. Modifying the hg19 Target/Bait Interval Lists:
- Before Modification:
Screenshot 2024-07-30 at 2 44 51 PM
- After Modification:
Screenshot 2024-07-30 at 2 43 04 PM
  1. Running gatk CollectHsMetrics via terminal:
Screenshot 2024-07-30 at 2 13 37 PM
  1. CollectHsMetrics output FOLD_80_BASE_PENALTY value:
    Screenshot 2024-07-30 at 2 37 48 PM

    FOLD_80_BASE_PENALTY = undefined

  2. Calculated expected FOLD_80_BASE_PENALTY output value:

- Excluding Zero-Coverage-Target-Bases:
Screenshot 2024-07-30 at 3 32 06 PM
- Including Zero-Coverage-Target-Bases:
Screenshot 2024-07-30 at 3 29 26 PM
  1. Conclusion:
    Since the two FOLD_80_BASE_PENALTY output values(1.136, undefined) are not equal, there must be an error in the FOLD_80_BASE_PENALTY calculation regarding zero-coverage target bases.

Expected behavior

The metrics collector returns the FOLD_80_BASE_PENALTY value required to raise 80% of the bases in the non-zero coverage target region to the mean coverage of that non-zero coverage target region.

Actual behavior

The metrics collector returns the FOLD_80_BASE_PENALTY value required to raise 80% of the bases in the target region to the mean coverage of that target region.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant