Viterbi segmentation and segment quality calculation for gcnvkernel + (python) CLI script #4335

Merged
mbabadi merged 3 commits into master from mb_theano_hmm_viterbi on Feb 26, 2018

@mbabadi mbabadi commented Feb 3, 2018

No description provided.

@mbabadi
Contributor Author

mbabadi commented Feb 3, 2018

Hmm... the Travis Python build fails with a very uninformative log (tests pass locally).

@samuelklee
Contributor

Did you figure out why Travis is failing? Also, I think this PR needs at least one test---you can just add to the gCNV integration tests for now, but we should extract out the HMM code and its tests at some point.

@mbabadi
Contributor Author

mbabadi commented Feb 5, 2018

@samuelklee I suspect it was because of a broken setup.py script (forgot to add gcnvkernel.postprocess). It still worked locally though... let's see if it fixes the travis issue.

@samuelklee
Contributor

Great! I suspected it might've been something like that.

saving log emission posteriors to disk
put __init__ files back in ...
Viterbi decoder w/ theano.scan
doc updates for Viterbi
skeletons for HMM segmentation quality calculation
theano-based HMM log constrained probability calculation
fixed a notorious bug due to theano fancy indexing ...
left and right end point quality calculation for a segment
exact quality
viterbi API update
SAM sequence dictionary parsing
interval ordering using SAM sequence dictionary
lazy initialization of denoising workspace variables to make it efficient and re-usable for Viterbi segmentation
exporting baseline copy number for each sample (for java post-processing)
some I/O refactorings
loading configs from JSON
Viterbi segmentation engine
scattered model/calls assembly
some refactoring of denoising model
Viterbi segmentation and quality calculation finished
fixed numerical instability issues with segment quality calculation
Viterbi segmentation engine complete
Viterbi segmentation python script
some gcnvkernel refactoring
removed SAM sequence dictionary code
fixed the bug causing travis failure
@codecov-io

codecov-io commented Feb 6, 2018

Codecov Report

Merging #4335 into master will increase coverage by 0.295%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #4335       +/-   ##
===============================================
+ Coverage     79.112%   79.407%   +0.295%     
- Complexity     16470     17141      +671     
===============================================
  Files           1047      1052        +5     
  Lines          59198     61386     +2188     
  Branches        9677     10096      +419     
===============================================
+ Hits           46833     48745     +1912     
- Misses          8597      8798      +201     
- Partials        3768      3843       +75

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...nder/cmdline/PicardCommandLineProgramExecutor.java | 60% <0%> (-10%) | 5% <0%> (+2%) | |
| ...er/tools/spark/sv/evidence/EvidenceTargetLink.java | 70.513% <0%> (-4.114%) | 18% <0%> (+2%) | |
| ...er/tools/copynumber/formats/records/CopyRatio.java | 74.359% <0%> (-1.641%) | 17% <0%> (+8%) | |
| ...broadinstitute/hellbender/tools/GetSampleName.java | 65.517% <0%> (-1.149%) | 12% <0%> (+5%) | |
| ...kers/variantutils/CalculateGenotypePosteriors.java | 91.398% <0%> (-0.91%) | 23% <0%> (+9%) | |
| ...lbender/utils/read/SAMRecordToGATKReadAdapter.java | 92.027% <0%> (-0.403%) | 233% <0%> (+98%) | |
| ...ls/walkers/mutect/M2FiltersArgumentCollection.java | 100% <0%> (ø) | 1% <0%> (ø) | ⬇️ |
| ...itute/hellbender/engine/spark/GATKRegistrator.java | 100% <0%> (ø) | 4% <0%> (+2%) | ⬆️ |
| ...ecaller/AssemblyBasedCallerArgumentCollection.java | 100% <0%> (ø) | 1% <0%> (ø) | ⬇️ |
| ...der/tools/walkers/mutect/M2ArgumentCollection.java | 100% <0%> (ø) | 1% <0%> (ø) | ⬇️ |

... and 34 more

@mbabadi
Contributor Author

mbabadi commented Feb 6, 2018

@samuelklee this PR adds new features to gcnvkernel (postprocessing) that are not currently invoked by any GATK tool. Ideally, we need python unit tests for such PRs, but right now, perhaps the tests can wait until I update PostprocessGermlineCNVCalls?

@samuelklee samuelklee left a comment

Some minor comments to address and warnings to fix. Otherwise, looks good!

@@ -21,18 +21,29 @@
ploidy_column_name = 'PLOIDY'
ploidy_gq_column_name = 'PLOIDY_GQ'

# column names for copy-number segments file
copy_number_call_column_name = 'COPY_NUMBER_CALL'
num_spanning_intervals_column_name = 'NUM_SPANNING_INTERVALS'
Contributor

NUM_POINTS would be more consistent with the ModelSegments pipeline.

Contributor Author

Done.

# column names for copy-number segments file
copy_number_call_column_name = 'COPY_NUMBER_CALL'
num_spanning_intervals_column_name = 'NUM_SPANNING_INTERVALS'
some_quality_column_name = 'SOME_QUALITY'
Contributor

Do you feel strongly about keeping this name? Can we come up with something slightly more descriptive?

Contributor Author

lol, YES, I LOVE SOME QUALITY :D

Contributor Author

I challenge you to come up with a better name though hehe ;-)

Contributor

time for another name contest

Contributor Author

What about QUALITY_SOME_CALLED and QUALITY_ALL_CALLED instead of SOME_QUALITY and EXACT_QUALITY?

@@ -68,6 +68,29 @@ def load_interval_list_tsv_file(interval_list_tsv_file: str,
    return _convert_interval_list_pandas_to_gcnv_interval_list(interval_list_pd, interval_list_tsv_file)


def extract_sam_sequence_dictionary_from_file(input_file: str):
Contributor

This will extract the entire SAM header (or any lines starting with @) and should be renamed to indicate this.

Contributor Author

Done.
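
For readers following the naming discussion, the behavior described above (collecting every header line that starts with `@`, of which the `@SQ` sequence-dictionary records are only a subset) can be sketched as follows. This is a minimal illustration and not the code in this PR: the helper name `extract_header_lines`, the assumption that header lines form a leading block, and the example contents are all hypothetical.

```python
from typing import Iterable, List


def extract_header_lines(lines: Iterable[str]) -> List[str]:
    """Collect the leading run of lines that start with '@' (hypothetical helper)."""
    header = []
    for line in lines:
        if not line.startswith('@'):
            break  # assumption: header lines come first; stop at the first data line
        header.append(line.rstrip('\n'))
    return header


# Made-up input: an @HD line, an @SQ line, an @RG line, then tabular data.
example = [
    "@HD\tVN:1.5\n",
    "@SQ\tSN:1\tLN:248956422\n",
    "@RG\tID:example\tSM:sample_0\n",
    "CONTIG\tSTART\tEND\n",
    "1\t1000\t2000\n",
]
header_lines = extract_header_lines(example)

# The sequence dictionary proper is only the @SQ subset of the header.
sequence_dictionary = [h for h in header_lines if h.startswith('@SQ')]
```

The distinction between `header_lines` and `sequence_dictionary` is the reason a name mentioning only the sequence dictionary would be misleading for a function that returns the whole header.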

@@ -153,7 +176,8 @@ def _convert_interval_list_pandas_to_gcnv_interval_list(interval_list_pd: pd.Dat
    return interval_list


-def write_interval_list_to_tsv_file(output_file: str, interval_list: List[Interval]):
+def write_interval_list_to_tsv_file(output_file: str, interval_list: List[Interval],
Contributor

Not in this PR, but there is a warning about this line above:

annotation: IntervalAnnotation = interval_annotations_dict[annotation_key](raw_value)

Contributor

Running inspect code on the entire package gives 32 + 18 weak warnings as well, can you clean up as necessary?

Contributor Author

There is nothing wrong with that line. IDEA fails to infer the type of annotation properly. I'll take a look at the code inspection though.

@@ -57,6 +58,12 @@ def get_sample_name_from_txt_file(input_path: str) -> str:
return line.strip()


def write_sample_name_to_txt_file(output_path: str, sample_name: str):
Contributor

Do we still need the sample name in a separate file? Is it not written with the @RG tag in some other file?

Contributor Author

I still use the text file for quick sample name lookup and fail-fast validations (e.g. in the next PR). The sample name in the text file is assumed to be the tag of the calls directory, and every @RG tag found in the constituents must match it.
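
A fail-fast check of the kind described here might look roughly like the sketch below. It assumes the sample name is carried in the `SM:` field of each `@RG` header line, which is the usual SAM-header convention but is not confirmed by this thread; the function names and the example header lines are hypothetical and are not part of gcnvkernel.

```python
from typing import Iterable, Set


def read_group_sample_names(header_lines: Iterable[str]) -> Set[str]:
    """Return the set of SM values found on @RG header lines (hypothetical helper)."""
    names = set()
    for line in header_lines:
        fields = line.rstrip('\n').split('\t')
        if fields[0] != '@RG':
            continue  # ignore @HD, @SQ and any other header record
        for field in fields[1:]:
            if field.startswith('SM:'):
                names.add(field[3:])
    return names


def assert_sample_name_matches(expected: str, header_lines: Iterable[str]) -> None:
    """Raise immediately if any @RG sample name disagrees with the expected one."""
    mismatched = read_group_sample_names(header_lines) - {expected}
    if mismatched:
        raise ValueError("Sample name mismatch: expected {0}, found {1}".format(
            expected, sorted(mismatched)))


# Example: the name read from the text file is checked against a constituent file's header.
assert_sample_name_matches("sample_0", ["@SQ\tSN:1\tLN:1000", "@RG\tID:a\tSM:sample_0"])
```

Keeping the name in a small text file means the expected value is available without parsing any of the larger per-sample outputs, so a mismatch can be reported before any heavy processing starts.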

This class is callable. Upon calling, all samples in the call-set will be processed sequentially.

Note:
It is assumed that the model and calls shards are provided in the ascending ordering according
Contributor

"in the ascending ordering according" -> "in order according"

Contributor Author

Done.


# exact quality calculation is prone to numerical instability and may overflow
# in that case, some quality will be reported
if np.isinf(segment.exact_quality):
Contributor

This seems non-ideal. Does this only occur for high quality segments, long segments, etc.? What causes the overflow?

Contributor Author

I thought quite a bit about this but I can't find an easy solution. It only occurs for HQ segments where log P(all intervals = call) ~ round-off error such that log P(some intervals != call) is ~ -inf and unreliable. Unfortunately, it is not easy to calculate the latter directly. The same thing occurred for breakpoint quality calculation for HQ segments and that's why I wrote those additional _direct methods. In that case, my workaround was to calculate the complementary log P directly by summing over complementary events.

Contributor Author

Perhaps we could neglect correlations while calculating the exact quality, i.e. log P(all intervals = call) ~ \sum_j log P (interval_j = call). This is likely to be less prone to round-off error.

Contributor Author

I implemented an adaptive method that switches between a robust upper-bound approximation and the (shaky) exact calculation, using the latter only in cases where it is also expected to be robust.
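
To make the round-off problem in this thread concrete, here is a small NumPy sketch. It is not the adaptive method implemented in the PR; it only illustrates (a) why forming log P(some intervals != call) as log(1 - exp(log P(all intervals = call))) breaks down when the latter is near zero, and (b) the idea floated above of neglecting correlations, i.e. log P(all intervals = call) ~ \sum_j log P(interval_j = call). The function names and the per-interval probabilities are made up for illustration.

```python
import numpy as np


def naive_log_complement(x):
    """log(1 - exp(x)) computed directly; loses all precision when x is close to 0."""
    with np.errstate(divide='ignore'):
        return np.log(1.0 - np.exp(x))


def log1mexp(x):
    """Numerically stable log(1 - exp(x)) for x <= 0."""
    x = np.asarray(x, dtype=float)
    with np.errstate(divide='ignore'):
        # near zero use expm1; far from zero use log1p
        return np.where(x > -np.log(2.0),
                        np.log(-np.expm1(x)),
                        np.log1p(-np.exp(x)))


def log_prob_some_differ_independent(log_q):
    """log P(some interval != call), treating intervals as independent.

    log_q[j] is log P(interval_j != call), assumed to be computed directly in log space.
    """
    log_p_each = log1mexp(log_q)     # log P(interval_j = call)
    log_p_all = np.sum(log_p_each)   # sum_j log P(interval_j = call)
    return log1mexp(log_p_all)


# A very high-quality segment: log P(all = call) is tiny in magnitude.
x = -1e-20
naive = naive_log_complement(x)      # exp(x) rounds to 1.0, so this is -inf
stable = log1mexp(x)                 # close to log(1e-20)

# Three intervals, each with a 1e-12 chance of differing from the call.
approx = log_prob_some_differ_independent(np.log([1e-12, 1e-12, 1e-12]))
```

Note that the stable complement helps only when its input is itself accurate. If log P(all intervals = call) is already dominated by round-off, as described above for HQ segments, no post-processing recovers the lost digits, which is why summing directly computed complementary or per-interval terms is attractive.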

for si in range(self.num_samples):
    self.export_copy_number_segments_for_single_sample(si)

def viterbi_segments_generator_for_single_sample(self, sample_index: int)\
Contributor

Does this method need to be exposed?

Contributor Author

Oops, no.

sample_names += (io_commons.get_sample_name_from_txt_file(sample_posteriors_path),)
sample_index += 1
if len(sample_names) == 0:
    raise Exception("Could not file any sample posterior calls in {0}.".format(calls_path))
Contributor

file -> find

Contributor Author

Done.

io_consts.default_copy_number_log_emission_tsv_filename)

@staticmethod
def coalesce_seq_into_segments(seq: List[TypeVar('_T')]) -> List[Tuple[TypeVar('_T'), int, int]]:
Contributor

I also don't think this method needs to be exposed. Does it work OK for lists of length 1 (I didn't check)?

Contributor Author

I hid it. Also, it does work for length 1.
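
For context, coalescing a per-interval sequence into segments amounts to run-length grouping. The sketch below is one possible way to write it, not the implementation in this PR: the function name is hypothetical, and the choice of 0-based indices with an inclusive end index is an assumption, since the thread does not state the convention. It also declares the `TypeVar` once at module level, which is the conventional form for generic annotations.

```python
from itertools import groupby
from typing import List, Tuple, TypeVar

T = TypeVar('T')


def coalesce_into_segments(seq: List[T]) -> List[Tuple[T, int, int]]:
    """Collapse runs of equal consecutive values into (value, start, end) triples.

    Indices are 0-based and `end` is inclusive; both conventions are assumptions.
    """
    segments = []
    start = 0
    for value, run in groupby(seq):
        length = sum(1 for _ in run)  # number of consecutive repeats of `value`
        segments.append((value, start, start + length - 1))
        start += length
    return segments


# A made-up copy-number path over six intervals, and the length-1 case asked about above.
path_segments = coalesce_into_segments([2, 2, 2, 1, 1, 2])
single = coalesce_into_segments([3])
```

With this formulation the length-1 case needs no special handling: a single element forms one run whose start and end indices coincide.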

@mbabadi
Contributor Author

mbabadi commented Feb 23, 2018

@samuelklee can you take a look at the changes? thanks!

code improvement of theano forward-backward and viterbi
refactoring of math utils
improvement of segment quality calculation methods
incorporated small gcnvkernel changes from PR #4396
doc update
@samuelklee
Contributor

Changes look good! Thanks for adding the documentation as well. Let's go ahead and get this merged in so we can look at the Java side.

@mbabadi mbabadi merged commit abce8b3 into master Feb 26, 2018
mbabadi added a commit that referenced this pull request Feb 26, 2018
mbabadi added a commit that referenced this pull request Mar 21, 2018
mbabadi added a commit that referenced this pull request Mar 22, 2018
…s, unit tests, integration tests) (#4396)

* Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL scripts, unit tests, integration tests)

* Java-side updates after rebasing on the latest gcnvkernel (PR #4335 review)

* PR review (from Andrei)

* PR review (Sam)
@mbabadi mbabadi deleted the mb_theano_hmm_viterbi branch March 30, 2018 19:34

3 participants