Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LocusDepthtoBAF tool #7776

Merged
merged 4 commits into from Apr 28, 2022
Merged

LocusDepthtoBAF tool #7776

merged 4 commits into from Apr 28, 2022

Conversation

tedsharpe
Copy link
Contributor

No description provided.

@gatk-bot
Copy link

Travis reported job failures from build 38788
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 38788.12 logs

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tedsharpe this is going to be a major QOL improvement for users, who sometimes spend weeks wrangling SNP calls from various sources.

Mostly minor comments below. How much larger at the LD files after adding the ref / alt indices? These files can live in nearline storage for years, so minimizing their size is important, even if it means slowing down LD -> BAF conversion a bit. It probably compresses out well though.

On a related note, sometimes projects will mix/match samples. It may happen that someone would try to combine LD files generated with different site lists. This is not an issue with PE/SR which are collected genome-wide nor for counts where we enforce a check on the bin intervals. Is it possible to add something to the LD header to ensure consistency? Maybe a hash on the sites list would be sufficient? Or better yet an explicit check that the input LD sites are consistent across all samples? I believe all biallelic snp sites are emitted in CollectSVEvidence.

Edit: we also need to check this against the current pipeline, ie using HaplotypeCaller to call SNPs. I suspect the chi squared model will be good enough, but we should do a head-to-head on at least one sample before merging this.

@@ -24,6 +24,14 @@ public BafEvidence( final String sample, final String contig,
this.value = value;
}

// value-altering constructor
public BafEvidence( final BafEvidence that, final double value ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a quick unit test for this?

@@ -11,25 +12,36 @@ public final class LocusDepth implements SVFeature {
private final String contig;
private final int position;
private final String sample;
private final byte refCall; // index into nucleotideValues
// next three fields index "ACGT"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create an enum type for this?

public final static String BCI_VERSION = "1.0";

public LocusDepth( final Locatable loc, final String sample, final byte refCall ) {
public LocusDepth( final Locatable loc, final String sample, final int refIndex, final int altIndex ) {
Utils.validateArg(refIndex >= 0 && refIndex <= 3, "refIndex must be between 0 and 3");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about Utils.nonNull() as well for loc and sample? I tend to use them where I can in constructors for clarity. You can even combine it with the initialization like this.sample = Utils.nonNull(sample);


/**
* <p>Merges locus-sorted LocusDepth evidence files, and calculates the bi-allelic frequency for each
* sample and site, and writes these values as a BafEvidence output file.</p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a description of the filtering methodology

final double refDiff = locusDepth.getRefDepth() - expectRefAlt;
final double altDiff = locusDepth.getAltDepth() - expectRefAlt;
double chiSq = (refDiff * refDiff + altDiff * altDiff) / expectRefAlt;
double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment that this is a Pearson chi-squared test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);
final double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);

if ( fitProb < minHetLikelihood ) {
return 0.;
}
return (double)locusDepth.getAltDepth() / totalDepth;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (double)locusDepth.getAltDepth() / totalDepth;
return locusDepth.getAltDepth() / (double) totalDepth;

I can never remember if casts get applied before division

@@ -694,14 +695,16 @@ private boolean readNextLocus() {
return false;
}
VariantContext snp = snpSourceItr.next();
while ( !snp.isSNP() ) {
while ( !snp.isSNP() || !snp.isBiallelic() ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's future-proof this and allow multiallelic by default, but have a command line flag to restrict only to biallelic

Comment on lines 1 to 3
1 1000 smpl1 0 3 4 0 0 6
1 1000 smpl2 0 3 5 0 0 5
1 1000 smpl3 0 3 6 0 0 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a few more records at other loci, including some on another contig?

shortName = StandardArgumentDefinitions.FEATURE_SHORT_NAME )
private List<FeatureInput<LocusDepth>> locusDepthFiles;

@Argument(fullName = SAMPLE_NAMES_NAME, doc = "Sample names", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on what this does in the doc? Is it used for subsetting?

Comment on lines 93 to 96
@Argument(
doc = "Minimum likelihood of site being biallelic and heterozygous",
fullName = MIN_HET_LIKELIHOOD, optional = true )
private double minHetLikelihood = .5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a probability instead of likelihood

@tedsharpe
Copy link
Contributor Author

I believe that this is ready for a re-review. I think all comments were addressed.
In particular, we should check that the handling of multi-allelic sites is what was intended.

@codecov
Copy link

codecov bot commented Apr 25, 2022

Codecov Report

Merging #7776 (964de7a) into master (c6eb337) will increase coverage by 60.933%.
The diff coverage is 83.953%.

@@               Coverage Diff                @@
##              master     #7776        +/-   ##
================================================
+ Coverage     18.644%   79.577%   +60.933%     
- Complexity      4635     34845     +30210     
================================================
  Files           1261      2214       +953     
  Lines          73745    173540     +99795     
  Branches       11768     18736      +6968     
================================================
+ Hits           13749    138098    +124349     
+ Misses         57944     29026     -28918     
- Partials        2052      6416      +4364     
Impacted Files Coverage Δ
...ine/GATKPlugin/GATKAnnotationPluginDescriptor.java 83.851% <ø> (+46.584%) ⬆️
...ine/GATKPlugin/GATKReadFilterPluginDescriptor.java 84.615% <ø> (+32.308%) ⬆️
...nder/cmdline/PicardCommandLineProgramExecutor.java 63.636% <0.000%> (+63.636%) ⬆️
...e/argumentcollections/DbsnpArgumentCollection.java 100.000% <ø> (ø)
...llections/MultiVariantInputArgumentCollection.java 100.000% <ø> (+100.000%) ⬆️
...broadinstitute/hellbender/engine/FeatureCache.java 92.157% <ø> (+19.608%) ⬆️
...oadinstitute/hellbender/engine/FeatureContext.java 81.818% <ø> (+72.727%) ⬆️
...roadinstitute/hellbender/engine/ProgressMeter.java 93.814% <ø> (+7.899%) ⬆️
...g/broadinstitute/hellbender/engine/ReadWalker.java 100.000% <ø> (+7.407%) ⬆️
...tute/hellbender/engine/ReadlessAssemblyRegion.java 46.667% <ø> (+46.667%) ⬆️
... and 2324 more

Comment on lines 226 to 232
for ( final int altIndex : altIndices ) {
double baf = calcBAF(ld, refIndex, altIndex);
if ( baf != BafEvidence.MISSING_VALUE ) {
beList.add(new BafEvidence(ld.getSample(), ld.getContig(), ld.getStart(), baf));
break;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases this will lead to bias. I didn't think through some of the problems multi-allelics are going to cause us with compatibility here. Apologies for flip-flopping on this, but let's stick to bi-allelic for both LD collection and BAF generation. Regardless, it seems like this implementation would be problematic in some edge cases. I think we will need to either provide the sites list to this tool, or go back to storing the ref/alt alleles in LD (which would be fine, probably simpler and faster).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove the multi-allelic option, which kind of weirded me out anyway.
However, I'm confused by the last part of your comment: we do provide the sites list to this tool. That's where I'm getting the ref/alt call info, since it's no longer in the LD records. (See lines 210-223.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just realized I missed that, all good to just rip out the multi-allelics.

@tedsharpe tedsharpe merged commit 9ae1fd8 into master Apr 28, 2022
@tedsharpe tedsharpe deleted the tws_LDtoBAF branch April 28, 2022 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants