Viterbi segmentation and segment quality calculation for gcnvkernel + (python) CLI script #4335
Conversation
hmm... travis python fails with a very uninformative log (tests pass locally)
Did you figure out why Travis is failing? Also, I think this PR needs at least one test---you can just add to the gCNV integration tests for now, but we should extract out the HMM code and its tests at some point.
@samuelklee I suspect it was because of a broken setup.py script (forgot to add
Great! I suspected it might've been something like that.
- saving log emission posteriors to disk
- put __init__ files back in
- ...
- Viterbi decoder w/ theano.scan
- doc updates for Viterbi
- skeletons for HMM segmentation quality calculation
- theano-based HMM log constrained probability calculation
- fixed a notorious bug due to theano fancy indexing
- ...
- left and right end point quality calculation for a segment
- exact quality
- viterbi API update
- SAM sequence dictionary parsing
- interval ordering using SAM sequence dictionary
- lazy initialization of denoising workspace variables to make it efficient and re-usable for Viterbi segmentation
- exporting baseline copy number for each sample (for java post-processing)
- some I/O refactorings
- loading configs from JSON
- Viterbi segmentation engine
- scattered model/calls assembly
- some refactoring of denoising model
- Viterbi segmentation and quality calculation finished
- fixed numerical instability issues with segment quality calculation
- Viterbi segmentation engine complete
- Viterbi segmentation python script
- some gcnvkernel refactoring
- removed SAM sequence dictionary code
- fixed the bug causing travis failure
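The central addition in these commits is a theano.scan-based Viterbi decoder. As background, the underlying algorithm can be sketched in plain NumPy; this is an illustrative sketch of Viterbi decoding in general, not the gcnvkernel implementation, and all names here are hypothetical:

```python
import numpy as np

def viterbi_decode(log_prior, log_trans, log_emission):
    """Most likely hidden-state path of an HMM.

    log_prior:    (S,)   log initial state probabilities
    log_trans:    (S, S) log transition matrix, rows = from-state
    log_emission: (T, S) per-position log emission probabilities
    """
    num_positions, num_states = log_emission.shape
    # best log-probability of any path ending in each state at position 0
    delta = log_prior + log_emission[0]
    backpointers = np.zeros((num_positions, num_states), dtype=int)
    for t in range(1, num_positions):
        scores = delta[:, np.newaxis] + log_trans  # (from-state, to-state)
        backpointers[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_emission[t]
    # backtrack from the best final state
    path = np.empty(num_positions, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(num_positions - 1, 0, -1):
        path[t - 1] = backpointers[t, path[t]]
    return path
```

The theano.scan version in the PR expresses the same forward max-product recursion symbolically so it can run on the same compute graph as the rest of the model.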
33e0e99 to e4a3c54
Codecov Report
@@ Coverage Diff @@
## master #4335 +/- ##
===============================================
+ Coverage 79.112% 79.407% +0.295%
- Complexity 16470 17141 +671
===============================================
Files 1047 1052 +5
Lines 59198 61386 +2188
Branches 9677 10096 +419
===============================================
+ Hits 46833 48745 +1912
- Misses 8597 8798 +201
- Partials 3768 3843 +75
@samuelklee this PR adds new features to
Some minor comments to address and warnings to fix. Otherwise, looks good!
@@ -21,18 +21,29 @@
ploidy_column_name = 'PLOIDY'
ploidy_gq_column_name = 'PLOIDY_GQ'

# column names for copy-number segments file
copy_number_call_column_name = 'COPY_NUMBER_CALL'
num_spanning_intervals_column_name = 'NUM_SPANNING_INTERVALS'
`NUM_POINTS` would be more consistent with the ModelSegments pipeline.
Done.
# column names for copy-number segments file
copy_number_call_column_name = 'COPY_NUMBER_CALL'
num_spanning_intervals_column_name = 'NUM_SPANNING_INTERVALS'
some_quality_column_name = 'SOME_QUALITY'
Do you feel strongly about keeping this name? Can we come up with something slightly more descriptive?
lol, YES, I LOVE SOME QUALITY :D
I challenge you to come up with a better name though hehe ;-)
time for another name contest
What about `QUALITY_SOME_CALLED` and `QUALITY_ALL_CALLED` instead of `SOME_QUALITY` and `EXACT_QUALITY`?
@@ -68,6 +68,29 @@ def load_interval_list_tsv_file(interval_list_tsv_file: str,
    return _convert_interval_list_pandas_to_gcnv_interval_list(interval_list_pd, interval_list_tsv_file)


def extract_sam_sequence_dictionary_from_file(input_file: str):
This will extract the entire SAM header (or any lines starting with `@`) and should be renamed to indicate this.
Done.
@@ -153,7 +176,8 @@ def _convert_interval_list_pandas_to_gcnv_interval_list(interval_list_pd: pd.Dat
    return interval_list


-def write_interval_list_to_tsv_file(output_file: str, interval_list: List[Interval]):
+def write_interval_list_to_tsv_file(output_file: str, interval_list: List[Interval],
Not in this PR, but there is a warning about this line above:
annotation: IntervalAnnotation = interval_annotations_dict[annotation_key](raw_value)
Running inspect code on the entire package gives 32 + 18 weak warnings as well, can you clean up as necessary?
There is nothing wrong with that line. IDEA fails to infer the type of `annotation` properly. I'll take a look at the code inspection though.
@@ -57,6 +58,12 @@ def get_sample_name_from_txt_file(input_path: str) -> str:
    return line.strip()


def write_sample_name_to_txt_file(output_path: str, sample_name: str):
Do we still need the sample name in a separate file? Is it not written with the `@RG` tag in some other file?
I still use the text file for quick sample name lookup and fail-fast validations (e.g. in the next PR). The sample name in the text file is assumed to be the tag of the calls directory, and every `@RG` tag found in the constituents must match it.
    This class is callable. Upon calling, all samples in the call-set will be processed sequentially.

    Note:
        It is assumed that the model and calls shards are provided in the ascending ordering according
"in the ascending ordering according" -> "in order according"
Done.
        # exact quality calculation is prone to numerical instability and may overflow
        # in that case, some quality will be reported
        if np.isinf(segment.exact_quality):
This seems non-ideal. Does this only occur for high quality segments, long segments, etc.? What causes the overflow?
I thought quite a bit about this but I can't find an easy solution. It only occurs for HQ segments where `log P(all intervals = call)` ~ round-off error such that `log P(some intervals != call)` is ~ -inf and unreliable. Unfortunately, it is not easy to calculate the latter directly. The same thing occurred for breakpoint quality calculation for HQ segments and that's why I wrote those additional `_direct` methods. In that case, my workaround was to calculate the complementary log P directly by summing over complementary events.
Perhaps we could neglect correlations while calculating the exact quality, i.e. `log P(all intervals = call) ~ \sum_j log P(interval_j = call)`. This is likely to be less prone to round-off error.
I implemented an adaptive method that switches between a robust upper-bound approximation and the (shaky) exact calculation, using the latter only in cases where it is expected to be robust.
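For background, the round-off problem discussed in this thread is the classic difficulty of evaluating `log(1 - exp(x))` when `x` is a tiny negative number (here, `x = log P(all intervals = call)` for an HQ segment). A standard numerically stable helper illustrates the idea; this is a generic sketch, not necessarily what gcnvkernel does:

```python
import math

def log1mexp(log_p: float) -> float:
    """Stable log(1 - exp(log_p)) for log_p < 0.

    Computed naively, 1 - exp(log_p) loses all precision as log_p -> 0-.
    The branch point at -log(2) is the standard log1mexp recipe:
    expm1 is accurate when exp(log_p) ~ 1, log1p when exp(log_p) ~ 0.
    """
    assert log_p < 0.0
    if log_p > -math.log(2.0):
        return math.log(-math.expm1(log_p))
    else:
        return math.log1p(-math.exp(log_p))
```

With such a helper, the complementary log-probability `log P(some intervals != call)` stays finite even when `log P(all intervals = call)` is within round-off error of zero, which is exactly the regime where the naive exact-quality calculation overflows to `-inf`.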
        for si in range(self.num_samples):
            self.export_copy_number_segments_for_single_sample(si)

    def viterbi_segments_generator_for_single_sample(self, sample_index: int)\
Does this method need to be exposed?
Oops, no.
        sample_names += (io_commons.get_sample_name_from_txt_file(sample_posteriors_path),)
        sample_index += 1
    if len(sample_names) == 0:
        raise Exception("Could not file any sample posterior calls in {0}.".format(calls_path))
file -> find
Done.
            io_consts.default_copy_number_log_emission_tsv_filename)

    @staticmethod
    def coalesce_seq_into_segments(seq: List[TypeVar('_T')]) -> List[Tuple[TypeVar('_T'), int, int]]:
I also don't think this method needs to be exposed. Does it work OK for lists of length 1 (I didn't check)?
I hid it. Also, it does work for length 1.
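A run-length coalescing method of this shape can be sketched as follows; this is an illustrative reimplementation with the same signature, not the PR's code. It collapses consecutive runs of equal values (e.g. a per-interval Viterbi copy-number sequence) into `(value, start_index, end_index)` segments with inclusive end indices, and handles empty and length-1 inputs:

```python
import itertools
from typing import List, Tuple, TypeVar

_T = TypeVar('_T')

def coalesce_seq_into_segments(seq: List[_T]) -> List[Tuple[_T, int, int]]:
    """Collapse runs of equal values into (value, start, end) tuples
    with inclusive end indices."""
    segments = []
    start = 0
    for value, run in itertools.groupby(seq):
        length = sum(1 for _ in run)  # consume the run to get its length
        segments.append((value, start, start + length - 1))
        start += length
    return segments
```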
0052e19 to 0def362
@samuelklee can you take a look at the changes? thanks!
9d02adf to 34cdf8b
fe5639e to 912f994
Changes look good! Thanks for adding the documentation as well. Let's go ahead and get this merged in so we can look at the Java side.