Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL scripts, unit tests, integration tests) #4396

Merged
merged 4 commits into from Mar 22, 2018

Conversation

mbabadi
Copy link
Contributor

@mbabadi mbabadi commented Feb 13, 2018

This PR adds segments VCF writing to PostprocessGermlineCNVCalls. Segmentation (Viterbi) and segment quality calculation are performed by gcnvkernel.

This PR introduces the following additional features:

  • Calls and model shards are not required to be provided in sorted order anymore
  • The user can specify the ref copy-number state for autosomal contigs, as well as allosomal contigs
  • For both intervals and segments VCF output: now we use either <DUP> or <DEL> alleles (in place of CN_x alleles), depending on whether the most likely copy-number call is below or above the
    contig baseline. The contig baseline state is whatever the user has specified for autosomal contigs, and the contig ploidy state on sex chromosomes (from the output of DetermineGermlineContigPloidy).
  • Fail-fast validations and better test coverage
  • Updated cohort and case WDL scripts and WDL tests

@mbabadi
Copy link
Contributor Author

mbabadi commented Feb 13, 2018

@sooheelee, could you please review the documentation of PostprocessGermlineCNVCalls?

@samuelklee
Copy link
Contributor

@asmirnov239 @sooheelee Let's hold off on reviewing until the other PR (#4335) upon which this is rebased is merged. At that point, @mbabadi can rebase (it might also be helpful to split test files into their own commit, since there are a lot that have been changed) and then we can review.

We can go ahead and start evaluations on this branch, though!

Copy link
Collaborator

@asmirnov239 asmirnov239 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with the review @mbabadi! Looks really good, thank you for finishing the postrocessing

@@ -57,6 +58,12 @@ def get_sample_name_from_txt_file(input_path: str) -> str:
return line.strip()


def write_sample_name_to_txt_file(output_path: str, sample_name: str):
"""Writes sample name to a text file."""
with open(os.path.join(output_path, io_consts.default_sample_name_txt_filename), 'w') as f:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you don't want to open it in append mode to not override the previously written lines in case in the future the first line will not be the sample name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used for writing sample_name.txt to the root of a sample calls shard. This text file is intended to a sample name one-liner.

# regular expression for matching sample name from header comment line
sample_name_header_regexp = "^@RG.*SM:(.*)[\t]*.*$"

# prefix for adding sample name as a header comment line
sample_name_header_prefix = "RG\tID:GATKCopyNumber\tSM:"
sample_name_sam_header_prefix = "RG\tID:GATKCopyNumber\tSM:"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to derive these tags from a vcf package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:D

def export_ndarray_tc_with_copy_number_header(sample_posterior_path: str,
ndarray_tc: np.ndarray,
output_file_name: str,
delimiter='\t',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you extract the delimiter and comment prefix to a constant from this method and methods below?

Copy link
Contributor Author

@mbabadi mbabadi Feb 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will move the values of all delimiter='\t' and comment='@' keyword args to io_consts.py.

default_class_log_posterior_tsv_filename = "log_q_tau_tk.tsv"
default_baseline_copy_number_tsv_filename = "baseline_copy_number_t.tsv"
default_copy_number_segments_tsv_filename = "copy_number_segments.tsv"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to eventually export these constants to some json file so that they can be shared between java and python codebases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. However, there's logistic problem: gcnvkernel is not entirely a java resource in the codebase (only the scripts). So, we have to include a duplicate copy of the JSON as a resource. Any thoughts?

@@ -170,24 +200,49 @@ def __init__(self,
self.denoising_model_approx = denoising_model_approx
self.input_calls_path = input_calls_path

@staticmethod
def import_ndarray_tc_with_copy_number_header(sample_posterior_path: str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does tc here stand for? Can you document it somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added documentation. Throughout gcnvkernel code, t refers to interval index, c refers to copy-number state index, and j refers to contig index.

@@ -17,36 +18,13 @@
* An allele representation of this copy number state (used for VCF creation)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this - I forgot to :(

*
* @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
*/
public class IntegerCopyNumberSegmentCollectionUnitTest extends GATKBaseTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

.boxed()
.collect(Collectors.toMap(IntegerCopyNumberState::new,
cn -> TEST_INVALID_LOG_POSTERIOR_VECTOR[cn])));
final LocatableCopyNumberPosteriorDistribution locatablePosteriorRecord =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception should already be thrown by now, no need for these 2 lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

*/
public class GermlineCNVSegmentVariantComposerUnitTest extends GATKBaseTest {
@Test(dataProvider = "variantCompositionSettings")
public void testVariantComposition(final int refAutosomalCopyNumber,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add a more universal test that tests that creates an actual segment VCF and checks its correctness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test for the correctness of the generated VariantContext. Testing an arbitrary VariantContext is written correctly to VCF is the responsibility of the author of VariantContextWriter.

String sample_directory = "SAMPLE_${sample_index}" #this is a hardcoded convention in gcnvkernel
String vcf_filename = "${entity_id}.vcf.gz"
String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz"
String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want to provide an option in WDL to not create segments VCF?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most users would actually care about the segments VCF file anyway (only more hardcore analysts might care about the intervals VCF).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just output both for now.

@asmirnov239
Copy link
Collaborator

@samuelklee ooops I didn't see your comment until now..

mbabadi added a commit that referenced this pull request Feb 22, 2018
improvement of segment quality calculation methods
incorporated small gcnvkernel changes from PR #4396
mbabadi added a commit that referenced this pull request Feb 22, 2018
stripped thermal stuff from Viterbi
refactoring of math utils
improvement of segment quality calculation methods
incorporated small gcnvkernel changes from PR #4396
doc update
mbabadi added a commit that referenced this pull request Feb 24, 2018
code improvement of theano forward-backward and viterbi
refactoring of math utils
improvement of segment quality calculation methods
incorporated small gcnvkernel changes from PR #4396
doc update
mbabadi added a commit that referenced this pull request Feb 24, 2018
code improvement of theano forward-backward and viterbi
refactoring of math utils
improvement of segment quality calculation methods
incorporated small gcnvkernel changes from PR #4396
doc update
@mbabadi mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch 2 times, most recently from 07a39cc to 5c4d167 Compare February 26, 2018 00:01
mbabadi added a commit that referenced this pull request Feb 26, 2018
… (python) CLI script (#4335)

* Viterbi segmentation and segment quality calculation in gcnvkernel

saving log emission posteriors to disk
put __init__ files back in ...
Viterbi decoder w/ theano.scan
doc updates for Viterbi
skeletons for HMM segmentation quality calculation
theano-based HMM log constrained probability calculation
fixed a notorious bug due to theano fancy indexing ...
left and right end point quality calculation for a segment
exact quality
viterbi API update
SAM sequence dictionary parsing
interval ordering using SAM sequence dictionary
lazy initialization of denoising workspace variables to make it efficient and re-usable for Viterbi segmentation
exporting baseline copy number for each sample (for java post-processing)
some I/O refactorings
loading configs from JSON
Viterbi segmentation engine
scattered model/calls assembly
some refactoring of denoising model
Viterbi segmentation and quality calculation finished
fixed numerical instability issues with segment quality calculation
Viterbi segmentation engine complete
Viterbi segmentation python script
some gcnvkernel refactoring
removed SAM sequence dictionary code
fixed the bug causing travis failure
single-sample segmentation as opposed to all in one shot
PR review:
- code improvement of theano forward-backward and viterbi
- refactoring of math utils
- improvement of segment quality calculation methods
- incorporated small gcnvkernel changes from PR #4396
- doc update
@mbabadi mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch from 5c4d167 to d050e27 Compare February 26, 2018 19:27
@codecov-io
Copy link

codecov-io commented Feb 26, 2018

Codecov Report

Merging #4396 into master will decrease coverage by 0.642%.
The diff coverage is 84.221%.

@@               Coverage Diff               @@
##              master     #4396       +/-   ##
===============================================
- Coverage     79.815%   79.173%   -0.642%     
- Complexity     16933     17160      +227     
===============================================
  Files           1058      1053        -5     
  Lines          61408     61790      +382     
  Branches        9967     10343      +376     
===============================================
- Hits           49013     48921       -92     
- Misses          8512      8981      +469     
- Partials        3883      3888        +5
Impacted Files Coverage Δ Complexity Δ
...ls/copynumber/gcnv/GermlineCNVNamingConstants.java 0% <ø> (ø) 0 <0> (ø) ⬇️
.../formats/collections/SimpleIntervalCollection.java 100% <ø> (ø) 5 <0> (ø) ⬇️
...ools/copynumber/DetermineGermlineContigPloidy.java 96.471% <ø> (ø) 14 <0> (ø) ⬇️
...hellbender/tools/copynumber/GermlineCNVCaller.java 86.364% <ø> (ø) 10 <0> (ø) ⬇️
...ollections/IntegerCopyNumberSegmentCollection.java 100% <100%> (ø) 5 <5> (?)
.../tools/copynumber/gcnv/IntegerCopyNumberState.java 58.333% <100%> (-13.889%) 6 <1> (-4)
...ls/copynumber/gcnv/GermlineCNVVariantComposer.java 100% <100%> (ø) 4 <4> (?)
...number/gcnv/GermlineCNVSegmentVariantComposer.java 100% <100%> (ø) 6 <6> (?)
...rmats/records/CopyNumberPosteriorDistribution.java 61.905% <100%> (+8.964%) 5 <1> (+1) ⬆️
...mats/records/IntervalCopyNumberGenotypingData.java 41.667% <41.667%> (ø) 6 <6> (?)
... and 142 more

@mbabadi
Copy link
Contributor Author

mbabadi commented Feb 27, 2018

@samuelklee I'll write unit tests for segment quality calculation in a separate PR (issue #4464).

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good for the most part! Added some suggestions for minor refactoring.

String sample_directory = "SAMPLE_${sample_index}" #this is a hardcoded convention in gcnvkernel
String vcf_filename = "${entity_id}.vcf.gz"
String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz"
String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just output both for now.

mkdir extracted-contig-ploidy-calls
tar xzf ${contig_ploidy_calls_tar} -C extracted-contig-ploidy-calls

allosomal_contigs_array=(${sep=" " allosomal_contigs})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the bash loop here, since we are not appending any indices to the array elements or running tar. You can just use --allosomal-contig ${sep=" --allosomal-contig " allosomal_contigs} in the command line below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -210,7 +210,7 @@
private RunMode runMode;

@Argument(
doc = "Input contig-ploidy calls directory (output of DetermlineGermlineContigPloidy).",
doc = "Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you address #4403 while you're here, for both GermlineCNVCaller and DetermineGermlineContigPloidy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice for -I and -O is arbitrary when there are multiple mandatory inputs and/or outputs, and this is the case for both tools. Any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion---this comment applies to a line above that is not in the diff.

We current use --input to denote the read counts for DetermineGermlineContigPloidy and GermlineCNVCaller (which seems natural, as they are arguably the "primary" input). We also use --output to denote the output directory, as we do for most CNV tools. However, in these two tools, we neglect to add the standard short names for these arguments. All that needs to be done to close that issue is to add the short names, so we can use -I/-O as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

fullName = DRY_RUN_LONG_NAME,
optional = true
)
private boolean dryRun = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a dry-run mode? I'd rather not have it if it's only used for debugging.

In general, the WDL task for a tool should expose every parameter. Since we want to only maintain a single version of the WDL, each task should just be a one-to-one wrapper around each tool. Otherwise we may get into scenarios where we or other developers need to go back and expose things that are not exposed in the "official" version of the WDL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

* according order of the intervals in corresponding chunks. The VCF will contain an ALT allele for every
* non-reference copy number state specified in the posterior files. The reference allele corresponds to copy number 2.
* </p>
* <p>Depending on the arguments, this tool either generates a single "intervals" VCF or additionally, performs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If most users will want the segmented VCF, then let's just always output both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

runToolForSingleSample(callShards, modelShards, sampleIndex,
actualIntervalsOutputVCF, actualSegmentsOutputVCF,
ALLOSOMAL_CONTIGS, AUTOSOMAL_REF_COPY_NUMBER, false);
IntegrationTestSpec.assertEqualTextFiles(actualIntervalsOutputVCF, expectedIntervalsOutputVCF);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting. I've used FileUtils.contentEquals in the past. I'd probably stick with that to avoid any interaction with IntegrationTestSpec.assertEqualTextFiles, but I leave it up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally ended up using IntegrationTestSpec.assertEqualTextFiles because it takes a comment prefix (helpful for changing VCF info field descriptions without needing to recreate the test resources).


/**
* Integration test for {@link PostprocessGermlineCNVCalls}
*
* @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
* @author Andrey Smirnov &lt;asmirnov@broadinstitute.org&gt;
*/
public class PostprocessGermlineCNVCallsIntegrationTest extends CommandLineProgramTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can make all test classes final.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
*/
public class IntegerCopyNumberSegmentCollectionUnitTest extends GATKBaseTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these tests! I have been a little lazy about testing these collections classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)


/* assert correctness of quality metrics */
Assert.assertEquals(
(int)(long)gen.getExtendedAttribute(GermlineCNVSegmentVariantComposer.QS),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just cast the first quantity to long? Also, white space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Assert.assertEquals(var.getEnd(), segment.getEnd());
Assert.assertEquals(var.getAlleles(), GermlineCNVSegmentVariantComposer.ALL_ALLELES);

final Genotype gen = var.getGenotype(IntegerCopyNumberSegmentCollectionUnitTest.EXPECTED_SAMPLE_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gen -> gt or genotype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@mbabadi mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch 4 times, most recently from bdc3c72 to ad46a1c Compare March 3, 2018 00:24
@mbabadi
Copy link
Contributor Author

mbabadi commented Mar 3, 2018

@samuelklee back to you -- the revision is on the longer side, though, no surprises.

@mbabadi mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch from ad46a1c to a4028f5 Compare March 3, 2018 00:30
Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one or two minor things, otherwise good to go!

@@ -256,6 +256,7 @@ task PostprocessGermlineCNVCalls {

String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz"
String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz"
Boolean allosomal_contigs_specified = defined(allosomal_contigs) && length(select_first([allosomal_contigs, []])) > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, does just defined(allosomal_contigs) && length(allosomal_contigs) > 0 work (i.e., are these evaluated left-to-right with short circuiting)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no short circuiting, therefore select_first([allosomal_contigs, []]) :)

@@ -409,46 +339,23 @@ private String getShardSampleName(final int shardIndex) {
}

/**
* Returns a list of {@link LocatableCopyNumberPosteriorDistribution} for {@link #sampleIndex} from a
* Returns a list of {@link IntervalCopyNumberGenotypingData} for {@link #sampleIndex} from a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't like conflating Posteriors and Data...I'm find with IntervalCopyNumberPosterior if PosteriorDistribution is too verbose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Is it because you store the baseline CN? I'll leave it up to you.

/**
* Extracts column names from a TSV file
*/
List<String> extractCopyNumberColumnsFromHeader(final File inputFile) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@mbabadi mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch from a4028f5 to 861f302 Compare March 21, 2018 22:22
@mbabadi mbabadi merged commit 1791fb4 into master Mar 22, 2018
@mbabadi mbabadi deleted the mb_gcnv_postprocess_cli_update branch March 30, 2018 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants