Viterbi segmentation and segment quality calculation for gcnvkernel + (python) CLI script #4335
Conversation
hmm... travis python fails with a very uninformative log (tests pass locally)
Did you figure out why Travis is failing? Also, I think this PR needs at least one test---you can just add to the gCNV integration tests for now, but we should extract out the HMM code and its tests at some point.
@samuelklee I suspect it was because of a broken setup.py script (forgot to add
Great! I suspected it might've been something like that.
- saving log emission posteriors to disk
- put __init__ files back in
- ...
- Viterbi decoder w/ theano.scan
- doc updates for Viterbi
- skeletons for HMM segmentation quality calculation
- theano-based HMM log constrained probability calculation
- fixed a notorious bug due to theano fancy indexing
- ...
- left and right end point quality calculation for a segment
- exact quality
- viterbi API update
- SAM sequence dictionary parsing
- interval ordering using SAM sequence dictionary
- lazy initialization of denoising workspace variables to make it efficient and re-usable for Viterbi segmentation
- exporting baseline copy number for each sample (for java post-processing)
- some I/O refactorings
- loading configs from JSON
- Viterbi segmentation engine
- scattered model/calls assembly
- some refactoring of denoising model
- Viterbi segmentation and quality calculation finished
- fixed numerical instability issues with segment quality calculation
- Viterbi segmentation engine complete
- Viterbi segmentation python script
- some gcnvkernel refactoring
- removed SAM sequence dictionary code
- fixed the bug causing travis failure
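The central addition in these commits is a theano.scan-based Viterbi decoder. As background, the underlying algorithm can be sketched in plain NumPy; this is an illustrative sketch of Viterbi decoding in general, not the gcnvkernel implementation, and all names here are hypothetical:

```python
import numpy as np

def viterbi_decode(log_prior, log_trans, log_emission):
    """Most likely hidden-state path of an HMM.

    log_prior:    (S,)   log initial state probabilities
    log_trans:    (S, S) log transition matrix, rows = from-state
    log_emission: (T, S) per-position log emission probabilities
    """
    num_positions, num_states = log_emission.shape
    # best log-probability of any path ending in each state at position 0
    delta = log_prior + log_emission[0]
    backpointers = np.zeros((num_positions, num_states), dtype=int)
    for t in range(1, num_positions):
        scores = delta[:, np.newaxis] + log_trans  # (from-state, to-state)
        backpointers[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_emission[t]
    # backtrack from the best final state
    path = np.empty(num_positions, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(num_positions - 1, 0, -1):
        path[t - 1] = backpointers[t, path[t]]
    return path
```

The theano.scan version in the PR expresses the same forward max-product recursion symbolically so it can run on the same compute graph as the rest of the model.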
33e0e99 to e4a3c54
Codecov Report
@@ Coverage Diff @@
## master #4335 +/- ##
===============================================
+ Coverage 79.112% 79.407% +0.295%
- Complexity 16470 17141 +671
===============================================
Files 1047 1052 +5
Lines 59198 61386 +2188
Branches 9677 10096 +419
===============================================
+ Hits 46833 48745 +1912
- Misses 8597 8798 +201
- Partials 3768 3843 +75
@samuelklee this PR adds new features to
Some minor comments to address and warnings to fix. Otherwise, looks good!
@@ -21,18 +21,29 @@
ploidy_column_name = 'PLOIDY'
ploidy_gq_column_name = 'PLOIDY_GQ'

# column names for copy-number segments file
copy_number_call_column_name = 'COPY_NUMBER_CALL'
num_spanning_intervals_column_name = 'NUM_SPANNING_INTERVALS'
`NUM_POINTS` would be more consistent with the ModelSegments pipeline.
Done.
# column names for copy-number segments file
copy_number_call_column_name = 'COPY_NUMBER_CALL'
num_spanning_intervals_column_name = 'NUM_SPANNING_INTERVALS'
some_quality_column_name = 'SOME_QUALITY'
Do you feel strongly about keeping this name? Can we come up with something slightly more descriptive?
lol, YES, I LOVE SOME QUALITY :D
I challenge you to come up with a better name though hehe ;-)
time for another name contest
What about `QUALITY_SOME_CALLED` and `QUALITY_ALL_CALLED` instead of `SOME_QUALITY` and `EXACT_QUALITY`?
@@ -68,6 +68,29 @@ def load_interval_list_tsv_file(interval_list_tsv_file: str,
    return _convert_interval_list_pandas_to_gcnv_interval_list(interval_list_pd, interval_list_tsv_file)


def extract_sam_sequence_dictionary_from_file(input_file: str):
This will extract the entire SAM header (or any lines starting with `@`) and should be renamed to indicate this.
Done.
@@ -153,7 +176,8 @@ def _convert_interval_list_pandas_to_gcnv_interval_list(interval_list_pd: pd.Dat
    return interval_list


-def write_interval_list_to_tsv_file(output_file: str, interval_list: List[Interval]):
+def write_interval_list_to_tsv_file(output_file: str, interval_list: List[Interval],
Not in this PR, but there is a warning about this line above:
annotation: IntervalAnnotation = interval_annotations_dict[annotation_key](raw_value)
Running inspect code on the entire package gives 32 + 18 weak warnings as well, can you clean up as necessary?
There is nothing wrong with that line. IDEA fails to infer the type of `annotation` properly. I'll take a look at the code inspection though.
@@ -57,6 +58,12 @@ def get_sample_name_from_txt_file(input_path: str) -> str:
    return line.strip()


def write_sample_name_to_txt_file(output_path: str, sample_name: str):
Do we still need the sample name in a separate file? Is it not written with the `@RG` tag in some other file?
I still use the text file for quick sample name lookup and fail-fast validations (e.g. in the next PR). The sample name in the text file is assumed to be the tag of the calls directory, and every `@RG` tag found in the constituents must match it.
    This class is callable. Upon calling, all samples in the call-set will be processed sequentially.

    Note:
        It is assumed that the model and calls shards are provided in the ascending ordering according
"in the ascending ordering according" -> "in order according"
Done.
        # exact quality calculation is prone to numerical instability and may overflow
        # in that case, some quality will be reported
        if np.isinf(segment.exact_quality):
This seems non-ideal. Does this only occur for high quality segments, long segments, etc.? What causes the overflow?
I thought quite a bit about this but I can't find an easy solution. It only occurs for HQ segments where `log P(all intervals = call)` ~ round-off error such that `log P(some intervals != call)` is ~ -inf and unreliable. Unfortunately, it is not easy to calculate the latter directly. The same thing occurred for breakpoint quality calculation for HQ segments and that's why I wrote those additional `_direct` methods. In that case, my workaround was to calculate the complementary log P directly by summing over complementary events.
Perhaps we could neglect correlations while calculating the exact quality, i.e. `log P(all intervals = call) ~ \sum_j log P(interval_j = call)`. This is likely to be less prone to round-off error.
I implemented an adaptive method that switches between a robust upper-bound approximation and the (shaky) exact calculation, using the latter only in cases where it is expected to be robust.
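For background, the round-off problem discussed in this thread is the classic difficulty of evaluating `log(1 - exp(x))` when `x` is a tiny negative number (here, `x = log P(all intervals = call)` for an HQ segment). A standard numerically stable helper illustrates the idea; this is a generic sketch, not necessarily what gcnvkernel does:

```python
import math

def log1mexp(log_p: float) -> float:
    """Stable log(1 - exp(log_p)) for log_p < 0.

    Computed naively, 1 - exp(log_p) loses all precision as log_p -> 0-.
    The branch point at -log(2) is the standard log1mexp recipe:
    expm1 is accurate when exp(log_p) ~ 1, log1p when exp(log_p) ~ 0.
    """
    assert log_p < 0.0
    if log_p > -math.log(2.0):
        return math.log(-math.expm1(log_p))
    else:
        return math.log1p(-math.exp(log_p))
```

With such a helper, the complementary log-probability `log P(some intervals != call)` stays finite even when `log P(all intervals = call)` is within round-off error of zero, which is exactly the regime where the naive exact-quality calculation overflows to `-inf`.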
        for si in range(self.num_samples):
            self.export_copy_number_segments_for_single_sample(si)

    def viterbi_segments_generator_for_single_sample(self, sample_index: int)\
Does this method need to be exposed?
Oops, no.
        sample_names += (io_commons.get_sample_name_from_txt_file(sample_posteriors_path),)
        sample_index += 1
    if len(sample_names) == 0:
        raise Exception("Could not file any sample posterior calls in {0}.".format(calls_path))
file -> find
Done.
            io_consts.default_copy_number_log_emission_tsv_filename)

    @staticmethod
    def coalesce_seq_into_segments(seq: List[TypeVar('_T')]) -> List[Tuple[TypeVar('_T'), int, int]]:
I also don't think this method needs to be exposed. Does it work OK for lists of length 1 (I didn't check)?
I hid it. Also, it does work for length 1.
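A run-length coalescing method of this shape can be sketched as follows; this is an illustrative reimplementation with the same signature, not the PR's code. It collapses consecutive runs of equal values (e.g. a per-interval Viterbi copy-number sequence) into `(value, start_index, end_index)` segments with inclusive end indices, and handles empty and length-1 inputs:

```python
import itertools
from typing import List, Tuple, TypeVar

_T = TypeVar('_T')

def coalesce_seq_into_segments(seq: List[_T]) -> List[Tuple[_T, int, int]]:
    """Collapse runs of equal values into (value, start, end) tuples
    with inclusive end indices."""
    segments = []
    start = 0
    for value, run in itertools.groupby(seq):
        length = sum(1 for _ in run)  # consume the run to get its length
        segments.append((value, start, start + length - 1))
        start += length
    return segments
```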
0052e19 to 0def362
@samuelklee can you take a look at the changes? thanks!
9d02adf to 34cdf8b
fe5639e to 912f994
Changes look good! Thanks for adding the documentation as well. Let's go ahead and get this merged in so we can look at the Java side.