Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL scripts, unit tests, integration tests) #4396

mbabadi · 2018-02-13T00:47:01Z

This PR adds segments VCF writing to PostprocessGermlineCNVCalls. Segmentation (Viterbi) and segment quality calculation are performed by gcnvkernel.

This PR introduces the following additional features:

Calls and model shards are not required to be provided in sorted order anymore
The user can specify the reference copy-number state for autosomal contigs, as well as which contigs are allosomal
For both intervals and segments VCF output: we now use either <DEL> or <DUP> alleles (in place of CN_x alleles), depending on whether the most likely copy-number call is below or above the contig baseline, respectively. The contig baseline state is whatever the user has specified for autosomal contigs, and the contig ploidy state on sex chromosomes (from the output of DetermineGermlineContigPloidy).
Fail-fast validations and better test coverage
Updated cohort and case WDL scripts and WDL tests
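
The allele rule in the feature list can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this description, not the Java code in the PR; the function name and the choice to return None for a call equal to the baseline are assumptions.

```python
from typing import Optional


def choose_alt_allele(call_copy_number: int, baseline_copy_number: int) -> Optional[str]:
    """Return the symbolic ALT allele for a call, or None if the call equals the baseline.

    Hypothetical helper: the real logic lives in the Java variant composers.
    """
    if call_copy_number < baseline_copy_number:
        return "<DEL>"
    if call_copy_number > baseline_copy_number:
        return "<DUP>"
    return None


# Autosomal contig, user-specified baseline of 2: copy number 1 is a deletion.
autosomal_call = choose_alt_allele(1, 2)
# Sex chromosome with contig ploidy 1: copy number 2 is a duplication.
allosomal_call = choose_alt_allele(2, 1)
# Copy number 1 on a ploidy-1 contig matches the baseline.
baseline_call = choose_alt_allele(1, 1)
```

The point of comparing against a per-contig baseline is that the same copy number can be a deletion on an autosome and baseline on a sex chromosome.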

mbabadi · 2018-02-13T17:47:25Z

@sooheelee, could you please review the documentation of PostprocessGermlineCNVCalls?

samuelklee · 2018-02-14T20:05:00Z

@asmirnov239 @sooheelee Let's hold off on reviewing until the other PR (#4335) upon which this is rebased is merged. At that point, @mbabadi can rebase (it might also be helpful to split test files into their own commit, since there are a lot that have been changed) and then we can review.

We can go ahead and start evaluations on this branch, though!

asmirnov239

Done with the review @mbabadi! Looks really good, thank you for finishing the postprocessing

asmirnov239 · 2018-02-13T17:10:32Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_commons.py

@@ -57,6 +58,12 @@ def get_sample_name_from_txt_file(input_path: str) -> str:
            return line.strip()


+def write_sample_name_to_txt_file(output_path: str, sample_name: str):
+    """Writes sample name to a text file."""
+    with open(os.path.join(output_path, io_consts.default_sample_name_txt_filename), 'w') as f:


Are you sure you don't want to open it in append mode, so that previously written lines are not overwritten in case the first line is no longer the sample name in the future?

This is used for writing sample_name.txt to the root of a sample calls shard. This text file is intended to be a one-liner containing only the sample name.
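
A minimal sketch of why write mode fits a one-line file. The body of the writer and the reader function are assumptions, since only the open() call is visible in the diff; the filename is the one named in the reply.

```python
import os
import tempfile

# Filename taken from the reply above; in gcnvkernel it is a constant in io_consts.
SAMPLE_NAME_TXT_FILENAME = "sample_name.txt"


def write_sample_name_to_txt_file(output_path: str, sample_name: str) -> None:
    """Writes the sample name as the only line of the file (assumed body)."""
    # 'w' truncates the file, so a rerun never leaves a stale name behind.
    with open(os.path.join(output_path, SAMPLE_NAME_TXT_FILENAME), 'w') as f:
        f.write(sample_name + '\n')


def read_sample_name(output_path: str) -> str:
    """Hypothetical reader that returns the first line, stripped."""
    with open(os.path.join(output_path, SAMPLE_NAME_TXT_FILENAME), 'r') as f:
        return f.readline().strip()


with tempfile.TemporaryDirectory() as shard_root:
    write_sample_name_to_txt_file(shard_root, "SAMPLE_A")
    write_sample_name_to_txt_file(shard_root, "SAMPLE_B")  # overwrites SAMPLE_A
    final_name = read_sample_name(shard_root)
```

With append mode the second call would leave two lines in the file, and a reader that takes the first line would return the stale name.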

asmirnov239 · 2018-02-13T17:13:07Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_consts.py

 # regular expression for matching sample name from header comment line
 sample_name_header_regexp = "^@RG.*SM:(.*)[\t]*.*$"

 # prefix for adding sample name as a header comment line
-sample_name_header_prefix = "RG\tID:GATKCopyNumber\tSM:"
+sample_name_sam_header_prefix = "RG\tID:GATKCopyNumber\tSM:"


is it possible to derive these tags from a vcf package?
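
For context, a small sketch of how the prefix and the regular expression from this diff fit together. The two constants are copied from the hunk; the helper functions and the use of '@' as the comment character are assumptions for illustration.

```python
import re

# Copied from the diff above.
sample_name_header_regexp = "^@RG.*SM:(.*)[\t]*.*$"
sample_name_sam_header_prefix = "RG\tID:GATKCopyNumber\tSM:"

# Assumed comment character for header lines.
COMMENT_CHAR = "@"


def make_sample_name_header(sample_name: str) -> str:
    """Hypothetical helper: builds the @RG header comment line."""
    return COMMENT_CHAR + sample_name_sam_header_prefix + sample_name


def parse_sample_name_header(line: str):
    """Hypothetical helper: returns the sample name, or None if the line does not match."""
    match = re.match(sample_name_header_regexp, line)
    return match.group(1) if match else None


header_line = make_sample_name_header("NA12878")
round_tripped = parse_sample_name_header(header_line)
```

Note that the capture group is greedy, so any text after the sample name on the same line would be captured too.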

asmirnov239 · 2018-02-13T17:19:36Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_denoising_calling.py

+    def export_ndarray_tc_with_copy_number_header(sample_posterior_path: str,
+                                                  ndarray_tc: np.ndarray,
+                                                  output_file_name: str,
+                                                  delimiter='\t',


Can you extract the delimiter and comment prefix to a constant from this method and methods below?

I will move the values of all delimiter='\t' and comment='@' keyword args to io_consts.py.
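
A sketch of what that refactoring could look like. The constant names, the formatting function, and the column names are hypothetical; only the values '\t' and '@' come from the thread.

```python
# Would live in io_consts.py; names are assumptions.
DEFAULT_DELIMITER = '\t'
DEFAULT_COMMENT = '@'


def format_table(column_names, rows, header_comments=(),
                 delimiter=DEFAULT_DELIMITER, comment=DEFAULT_COMMENT) -> str:
    """Formats header comments, a column header, and rows as delimited text."""
    lines = [comment + text for text in header_comments]
    lines.append(delimiter.join(column_names))
    lines.extend(delimiter.join(str(value) for value in row) for row in rows)
    return "\n".join(lines) + "\n"


table = format_table(
    ["COPY_NUMBER_0", "COPY_NUMBER_1"],  # hypothetical column names
    [[0.1, 0.9]],
    header_comments=["RG\tID:GATKCopyNumber\tSM:sample_0"])
```

Every reader and writer then picks up the same defaults, and a change of delimiter is a one-line edit.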

asmirnov239 · 2018-02-13T17:39:43Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_consts.py

 default_class_log_posterior_tsv_filename = "log_q_tau_tk.tsv"
+default_baseline_copy_number_tsv_filename = "baseline_copy_number_t.tsv"
+default_copy_number_segments_tsv_filename = "copy_number_segments.tsv"


It would be nice to eventually export these constants to some json file so that they can be shared between java and python codebases

That's a good idea. However, there's a logistical problem: gcnvkernel is not entirely a java resource in the codebase (only the scripts are). So, we would have to include a duplicate copy of the JSON as a resource. Any thoughts?
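
As a rough sketch of the Python side of such a shared file: the JSON layout and key names below are hypothetical, and the JSON is inlined as a string only to keep the example self-contained. The filename values are the ones in the diff above.

```python
import json

# Stand-in for a shared resource file; the keys are hypothetical.
SHARED_CONSTANTS_JSON = """
{
    "default_class_log_posterior_tsv_filename": "log_q_tau_tk.tsv",
    "default_baseline_copy_number_tsv_filename": "baseline_copy_number_t.tsv",
    "default_copy_number_segments_tsv_filename": "copy_number_segments.tsv"
}
"""

shared_constants = json.loads(SHARED_CONSTANTS_JSON)
baseline_filename = shared_constants["default_baseline_copy_number_tsv_filename"]
```

The Java side would parse the same document, which is where the duplicate-resource question comes in.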

asmirnov239 · 2018-02-13T21:39:26Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_denoising_calling.py

@@ -170,24 +200,49 @@ def __init__(self,
        self.denoising_model_approx = denoising_model_approx
        self.input_calls_path = input_calls_path

+    @staticmethod
+    def import_ndarray_tc_with_copy_number_header(sample_posterior_path: str,


What does tc here stand for? Can you document it somehow?

Added documentation. Throughout gcnvkernel code, t refers to interval index, c refers to copy-number state index, and j refers to contig index.
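
To make the suffix convention concrete, here is a toy example with plain Python lists standing in for the ndarray. The values and the helper name are made up; only the meaning of t and c comes from the reply above.

```python
# log_q_tc[t][c]: log posterior of copy-number state c at interval t (toy values).
log_q_tc = [
    [-4.0, -2.5, -0.1],   # interval t=0: state c=2 is most likely
    [-0.2, -1.8, -5.0],   # interval t=1: state c=0 is most likely
    [-3.0, -0.3, -1.5],   # interval t=2: state c=1 is most likely
]


def most_likely_copy_number_t(ndarray_tc):
    """Hypothetical helper: argmax over c for each t, giving a vector indexed by t."""
    return [max(range(len(row_c)), key=row_c.__getitem__) for row_c in ndarray_tc]


copy_number_t = most_likely_copy_number_t(log_q_tc)
```

The suffix of each name lists its indices in order, so a reduction over c turns a _tc quantity into a _t quantity.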

asmirnov239 · 2018-02-15T21:55:54Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/gcnv/IntegerCopyNumberState.java

@@ -17,36 +18,13 @@
     * An allele representation of this copy number state (used for VCF creation)


Can you remove this - I forgot to :(

asmirnov239 · 2018-02-16T01:22:57Z

...lbender/tools/copynumber/formats/collections/IntegerCopyNumberSegmentCollectionUnitTest.java

+ *
+ * @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
+ */
+public class IntegerCopyNumberSegmentCollectionUnitTest extends  GATKBaseTest {


extra space here

asmirnov239 · 2018-02-16T02:19:33Z

...adinstitute/hellbender/tools/copynumber/gcnv/GermlineCNVIntervalVariantComposerUnitTest.java

+                        .boxed()
+                        .collect(Collectors.toMap(IntegerCopyNumberState::new,
+                                cn -> TEST_INVALID_LOG_POSTERIOR_VECTOR[cn])));
+        final LocatableCopyNumberPosteriorDistribution locatablePosteriorRecord =


The exception should already be thrown by now, no need for these 2 lines

asmirnov239 · 2018-02-16T02:42:59Z

...oadinstitute/hellbender/tools/copynumber/gcnv/GermlineCNVSegmentVariantComposerUnitTest.java

+ */
+public class GermlineCNVSegmentVariantComposerUnitTest extends GATKBaseTest {
+    @Test(dataProvider = "variantCompositionSettings")
+    public void testVariantComposition(final int refAutosomalCopyNumber,


I would also add a more universal test that creates an actual segments VCF and checks its correctness

This is a test for the correctness of the generated VariantContext. Testing that an arbitrary VariantContext is written correctly to VCF is the responsibility of the author of VariantContextWriter.

asmirnov239 · 2018-02-16T02:46:37Z

scripts/cnv_wdl/cnv_common_tasks.wdl

-    String sample_directory = "SAMPLE_${sample_index}"  #this is a hardcoded convention in gcnvkernel
-    String vcf_filename = "${entity_id}.vcf.gz"
+    String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz"
+    String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz"


Do we not want to provide an option in WDL to not create segments VCF?

I think most users would actually care about the segments VCF file anyway (only more hardcore analysts might care about the intervals VCF).

Let's just output both for now.

asmirnov239 · 2018-02-16T02:57:19Z

@samuelklee ooops I didn't see your comment until now..

… (python) CLI script (#4335) * Viterbi segmentation and segment quality calculation in gcnvkernel saving log emission posteriors to disk put __init__ files back in ... Viterbi decoder w/ theano.scan doc updates for Viterbi skeletons for HMM segmentation quality calculation theano-based HMM log constrained probability calculation fixed a notorious bug due to theano fancy indexing ... left and right end point quality calculation for a segment exact quality viterbi API update SAM sequence dictionary parsing interval ordering using SAM sequence dictionary lazy initialization of denoising workspace variables to make it efficient and re-usable for Viterbi segmentation exporting baseline copy number for each sample (for java post-processing) some I/O refactorings loading configs from JSON Viterbi segmentation engine scattered model/calls assembly some refactoring of denoising model Viterbi segmentation and quality calculation finished fixed numerical instability issues with segment quality calculation Viterbi segmentation engine complete Viterbi segmentation python script some gcnvkernel refactoring removed SAM sequence dictionary code fixed the bug causing travis failure single-sample segmentation as opposed to all in one shot PR review: - code improvement of theano forward-backward and viterbi - refactoring of math utils - improvement of segment quality calculation methods - incorporated small gcnvkernel changes from PR #4396 - doc update

codecov-io · 2018-02-26T20:23:27Z

Codecov Report

Merging #4396 into master will decrease coverage by 0.642%.
The diff coverage is 84.221%.

@@               Coverage Diff               @@
##              master     #4396       +/-   ##
===============================================
- Coverage     79.815%   79.173%   -0.642%     
- Complexity     16933     17160      +227     
===============================================
  Files           1058      1053        -5     
  Lines          61408     61790      +382     
  Branches        9967     10343      +376     
===============================================
- Hits           49013     48921       -92     
- Misses          8512      8981      +469     
- Partials        3883      3888        +5

Impacted Files	Coverage Δ	Complexity Δ
...ls/copynumber/gcnv/GermlineCNVNamingConstants.java	`0% <ø> (ø)`	`0 <0> (ø)`	⬇️
.../formats/collections/SimpleIntervalCollection.java	`100% <ø> (ø)`	`5 <0> (ø)`	⬇️
...ools/copynumber/DetermineGermlineContigPloidy.java	`96.471% <ø> (ø)`	`14 <0> (ø)`	⬇️
...hellbender/tools/copynumber/GermlineCNVCaller.java	`86.364% <ø> (ø)`	`10 <0> (ø)`	⬇️
...ollections/IntegerCopyNumberSegmentCollection.java	`100% <100%> (ø)`	`5 <5> (?)`
.../tools/copynumber/gcnv/IntegerCopyNumberState.java	`58.333% <100%> (-13.889%)`	`6 <1> (-4)`
...ls/copynumber/gcnv/GermlineCNVVariantComposer.java	`100% <100%> (ø)`	`4 <4> (?)`
...number/gcnv/GermlineCNVSegmentVariantComposer.java	`100% <100%> (ø)`	`6 <6> (?)`
...rmats/records/CopyNumberPosteriorDistribution.java	`61.905% <100%> (+8.964%)`	`5 <1> (+1)`	⬆️
...mats/records/IntervalCopyNumberGenotypingData.java	`41.667% <41.667%> (ø)`	`6 <6> (?)`
... and 142 more

mbabadi · 2018-02-27T20:29:12Z

@samuelklee I'll write unit tests for segment quality calculation in a separate PR (issue #4464).

samuelklee

Thanks, looks good for the most part! Added some suggestions for minor refactoring.

samuelklee · 2018-02-27T20:46:10Z

scripts/cnv_wdl/cnv_common_tasks.wdl

-    String sample_directory = "SAMPLE_${sample_index}"  #this is a hardcoded convention in gcnvkernel
-    String vcf_filename = "${entity_id}.vcf.gz"
+    String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz"
+    String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz"


Let's just output both for now.

samuelklee · 2018-02-27T20:54:40Z

scripts/cnv_wdl/cnv_common_tasks.wdl

+        mkdir extracted-contig-ploidy-calls
+        tar xzf ${contig_ploidy_calls_tar} -C extracted-contig-ploidy-calls
+
+        allosomal_contigs_array=(${sep=" " allosomal_contigs})


You don't need the bash loop here, since we are not appending any indices to the array elements or running tar. You can just use --allosomal-contig ${sep=" --allosomal-contig " allosomal_contigs} in the command line below.
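
A Python sketch of what the suggested sep expression expands to, assuming sep behaves like a plain string join over the array elements. The function name is hypothetical.

```python
def expand_allosomal_contig_args(allosomal_contigs):
    """Mimics: --allosomal-contig ${sep=" --allosomal-contig " allosomal_contigs}"""
    return "--allosomal-contig " + " --allosomal-contig ".join(allosomal_contigs)


two_contigs = expand_allosomal_contig_args(["X", "Y"])
no_contigs = expand_allosomal_contig_args([])
```

With an empty array the expansion leaves a dangling flag with no value, so the command needs a guard for that case.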

samuelklee · 2018-02-27T20:58:11Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/GermlineCNVCaller.java

@@ -210,7 +210,7 @@
    private RunMode runMode;

    @Argument(
-            doc = "Input contig-ploidy calls directory (output of DetermlineGermlineContigPloidy).",
+            doc = "Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).",


Can you address #4403 while you're here, for both GermlineCNVCaller and DetermineGermlineContigPloidy?

The choice for -I and -O is arbitrary when there are multiple mandatory inputs and/or outputs, and this is the case for both tools. Any thoughts?

Sorry for the confusion---this comment applies to a line above that is not in the diff.

We currently use --input to denote the read counts for DetermineGermlineContigPloidy and GermlineCNVCaller (which seems natural, as they are arguably the "primary" input). We also use --output to denote the output directory, as we do for most CNV tools. However, in these two tools, we neglect to add the standard short names for these arguments. All that needs to be done to close that issue is to add the short names, so we can use -I/-O as expected.

samuelklee · 2018-02-27T21:01:44Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

+            fullName = DRY_RUN_LONG_NAME,
+            optional = true
+    )
+    private boolean dryRun = false;


Do we need a dry-run mode? I'd rather not have it if it's only used for debugging.

In general, the WDL task for a tool should expose every parameter. Since we want to only maintain a single version of the WDL, each task should just be a one-to-one wrapper around each tool. Otherwise we may get into scenarios where we or other developers need to go back and expose things that are not exposed in the "official" version of the WDL.

samuelklee · 2018-02-27T21:03:26Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

- * according order of the intervals in corresponding chunks. The VCF will contain an ALT allele for every
- * non-reference copy number state specified in the posterior files. The reference allele corresponds to copy number 2.
- * </p>
+ * <p>Depending on the arguments, this tool either generates a single "intervals" VCF or additionally, performs


If most users will want the segmented VCF, then let's just always output both.

samuelklee · 2018-02-27T23:20:01Z

...g/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCallsIntegrationTest.java

+            runToolForSingleSample(callShards, modelShards, sampleIndex,
+                    actualIntervalsOutputVCF, actualSegmentsOutputVCF,
+                    ALLOSOMAL_CONTIGS, AUTOSOMAL_REF_COPY_NUMBER, false);
+            IntegrationTestSpec.assertEqualTextFiles(actualIntervalsOutputVCF, expectedIntervalsOutputVCF);


Hmm, interesting. I've used FileUtils.contentEquals in the past. I'd probably stick with that to avoid any interaction with IntegrationTestSpec.assertEqualTextFiles, but I leave it up to you.

I finally ended up using IntegrationTestSpec.assertEqualTextFiles because it takes a comment prefix (helpful for changing VCF info field descriptions without needing to recreate the test resources).
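
The idea of comparing files while skipping comment lines can be sketched as follows. This is not the IntegrationTestSpec implementation, just an illustration of the behaviour described; the function name and the "##" prefix are assumptions.

```python
def equal_ignoring_comments(actual_text: str, expected_text: str,
                            comment_prefix: str = "##") -> bool:
    """Compares two texts line by line, skipping lines that start with comment_prefix."""
    def content_lines(text):
        return [line for line in text.splitlines()
                if not line.startswith(comment_prefix)]
    return content_lines(actual_text) == content_lines(expected_text)


expected_vcf = '##INFO=<ID=END,Description="old text">\n#CHROM\tPOS\n1\t100\n'
actual_vcf = '##INFO=<ID=END,Description="new text">\n#CHROM\tPOS\n1\t100\n'
changed_record_vcf = '##INFO=<ID=END,Description="new text">\n#CHROM\tPOS\n1\t200\n'

same_records = equal_ignoring_comments(actual_vcf, expected_vcf)
different_records = equal_ignoring_comments(changed_record_vcf, expected_vcf)
```

A byte-for-byte comparison would fail on the first pair, even though only a header description changed.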

samuelklee · 2018-02-27T23:21:11Z

...g/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCallsIntegrationTest.java


 /**
 * Integration test for {@link PostprocessGermlineCNVCalls}
+ *
+ * @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
+ * @author Andrey Smirnov &lt;asmirnov@broadinstitute.org&gt;
 */
 public class PostprocessGermlineCNVCallsIntegrationTest extends CommandLineProgramTest {


Can make all test classes final.

samuelklee · 2018-02-27T23:22:11Z

...lbender/tools/copynumber/formats/collections/IntegerCopyNumberSegmentCollectionUnitTest.java

+ *
+ * @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
+ */
+public class IntegerCopyNumberSegmentCollectionUnitTest extends GATKBaseTest {


Thanks for adding these tests! I have been a little lazy about testing these collections classes.

samuelklee · 2018-02-27T23:26:14Z

...oadinstitute/hellbender/tools/copynumber/gcnv/GermlineCNVSegmentVariantComposerUnitTest.java

+
+            /* assert correctness of quality metrics */
+            Assert.assertEquals(
+                    (int)(long)gen.getExtendedAttribute(GermlineCNVSegmentVariantComposer.QS),


Why not just cast the first quantity to long? Also, white space.

samuelklee · 2018-02-27T23:26:52Z

...oadinstitute/hellbender/tools/copynumber/gcnv/GermlineCNVSegmentVariantComposerUnitTest.java

+            Assert.assertEquals(var.getEnd(), segment.getEnd());
+            Assert.assertEquals(var.getAlleles(), GermlineCNVSegmentVariantComposer.ALL_ALLELES);
+
+            final Genotype gen = var.getGenotype(IntegerCopyNumberSegmentCollectionUnitTest.EXPECTED_SAMPLE_NAME);


gen -> gt or genotype

mbabadi · 2018-03-03T00:25:58Z

@samuelklee back to you -- the revision is on the longer side, though, no surprises.

samuelklee

Just one or two minor things, otherwise good to go!

samuelklee · 2018-03-06T18:58:03Z

scripts/cnv_wdl/cnv_common_tasks.wdl

@@ -256,6 +256,7 @@ task PostprocessGermlineCNVCalls {

    String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz"
    String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz"
+    Boolean allosomal_contigs_specified = defined(allosomal_contigs) && length(select_first([allosomal_contigs, []])) > 0


Just curious, does just defined(allosomal_contigs) && length(allosomal_contigs) > 0 work (i.e., are these evaluated left-to-right with short circuiting)?

no short circuiting, therefore select_first([allosomal_contigs, []]) :)
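
Python's own `and` short-circuits, so the sketch below emulates eager evaluation by passing both operands to a function. The helpers mimic the WDL functions by name only and are assumptions about their behaviour, not the WDL engine's code.

```python
def defined(value) -> bool:
    return value is not None


def length(array) -> int:
    return len(array)  # raises TypeError on None, like taking the length of an unset value


def select_first(values):
    """Returns the first element that is not None."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("no defined value")


def eager_and(left: bool, right: bool) -> bool:
    # Both arguments are evaluated before the call, so there is no short-circuiting.
    return left and right


allosomal_contigs = None  # optional input left unset

try:
    eager_and(defined(allosomal_contigs), length(allosomal_contigs) > 0)
    naive_failed = False
except TypeError:
    naive_failed = True

allosomal_contigs_specified = eager_and(
    defined(allosomal_contigs),
    length(select_first([allosomal_contigs, []])) > 0)
```

Falling back to an empty array makes the right-hand operand safe to evaluate even when the input is unset.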

samuelklee · 2018-03-06T19:10:57Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

@@ -409,46 +339,23 @@ private String getShardSampleName(final int shardIndex) {
    }

    /**
-     * Returns a list of {@link LocatableCopyNumberPosteriorDistribution} for {@link #sampleIndex} from a
+     * Returns a list of {@link IntervalCopyNumberGenotypingData} for {@link #sampleIndex} from a


Hmm, I don't like conflating Posteriors and Data...I'm fine with IntervalCopyNumberPosterior if PosteriorDistribution is too verbose.

Ah, I see. Is it because you store the baseline CN? I'll leave it up to you.

samuelklee · 2018-03-06T19:15:57Z

...llbender/tools/copynumber/formats/collections/CopyNumberPosteriorDistributionCollection.java

+        /**
+         * Extracts column names from a TSV file
+         */
+        List<String> extractCopyNumberColumnsFromHeader(final File inputFile) {


Can this be private?

mbabadi requested review from asmirnov239 and samuelklee February 13, 2018 00:47

mbabadi added Copy Number tools Germline CNV labels Feb 13, 2018

This was referenced Feb 13, 2018

Add BETA tag to PostprocessGermlineCNVCalls. #4373

Closed

Add copy-number segments VCF output to PostprocessGermlineCNVCalls #4336

Closed

mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch 2 times, most recently from 8e33396 to 282352d Compare February 13, 2018 04:24

mbabadi requested a review from sooheelee February 13, 2018 17:46

asmirnov239 requested changes Feb 16, 2018

mbabadi added a commit that referenced this pull request Feb 22, 2018

refactoring of math utils

0052e19

improvement of segment quality calculation methods incorporated small gcnvkernel changes from PR #4396

mbabadi added a commit that referenced this pull request Feb 22, 2018

PR review

0def362

stripped thermal stuff from Viterbi refactoring of math utils improvement of segment quality calculation methods incorporated small gcnvkernel changes from PR #4396 doc update

mbabadi added a commit that referenced this pull request Feb 24, 2018

PR review

34cdf8b

code improvement of theano forward-backward and viterbi refactoring of math utils improvement of segment quality calculation methods incorporated small gcnvkernel changes from PR #4396 doc update

mbabadi added a commit that referenced this pull request Feb 24, 2018

PR review

912f994

code improvement of theano forward-backward and viterbi refactoring of math utils improvement of segment quality calculation methods incorporated small gcnvkernel changes from PR #4396 doc update

mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch 2 times, most recently from 07a39cc to 5c4d167 Compare February 26, 2018 00:01

mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch from 5c4d167 to d050e27 Compare February 26, 2018 19:27

samuelklee requested changes Feb 27, 2018

mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch 4 times, most recently from bdc3c72 to ad46a1c Compare March 3, 2018 00:24

mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch from ad46a1c to a4028f5 Compare March 3, 2018 00:30

samuelklee approved these changes Mar 6, 2018

samuelklee assigned mbabadi Mar 7, 2018

mbabadi added 4 commits March 21, 2018 18:21

Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL script…

f9757d9

…s, unit tests, integration tests)

Java-side updates after rebasing on the latest gcnvkernel (PR #4335 r…

f8dd7e8

…eview)

PR review (from Andrei)

fb4a3af

PR review (Sam)

861f302

mbabadi force-pushed the mb_gcnv_postprocess_cli_update branch from a4028f5 to 861f302 Compare March 21, 2018 22:22

mbabadi merged commit 1791fb4 into master Mar 22, 2018

mbabadi deleted the mb_gcnv_postprocess_cli_update branch March 30, 2018 19:34

samuelklee mentioned this pull request May 21, 2018

Add short names for standard arguments to gCNV tools. #4403

Closed
